Spark Streaming

How to write your first Spark application with Stream-Stream Joins with working code.

Have you been waiting to try Streaming but cannot take the plunge? In a single blog, we will teach you everything you need to understand about Streaming Joins, and we will give you working code you can use for your next Streaming Pipeline. The steps involved:

- Create a fake dataset at scale
- Set a baseline using traditional SQL
- Define Temporary Streaming Views
- Inner Joins with optional Watermarking
- Left Joins with Watermarking
- The cold start edge case: withEventTimeOrder
- Cleanup

What is a Stream-Stream Join?
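As a quick taste of what the post builds up to, here is a minimal PySpark sketch of a stream-stream inner join with watermarks. The rate sources, column names, and time bounds are illustrative assumptions, not code from the post itself:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()

# Rate sources stand in for real streams (e.g. Kafka) so the sketch is runnable.
impressions = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .selectExpr("value AS ad_id", "timestamp AS impression_time")
    .withWatermark("impression_time", "10 minutes")  # bounds state on this side
)

clicks = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .selectExpr("value AS click_ad_id", "timestamp AS click_time")
    .withWatermark("click_time", "20 minutes")       # bounds state on this side
)

# The time-range condition plus the watermarks let Spark clean up old join state.
joined = impressions.join(
    clicks,
    expr("""
        ad_id = click_ad_id
        AND click_time >= impression_time
        AND click_time <= impression_time + interval 1 hour
    """),
    "inner",  # "leftOuter" gives a left join; watermarks are then mandatory
)

query = joined.writeStream.format("console").option("truncate", "false").start()
```

Watermarks are optional for inner joins but strongly recommended: without them, Spark must keep every past row as join state indefinitely.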


Dive Deep into Spark Streaming Checkpoint

From Beginner to Pro: A Comprehensive Guide to Understanding the Spark Streaming Checkpoint. Spark is a distributed computing framework for processing large datasets in parallel across a cluster of computers. When running a Spark job, it is not uncommon to encounter failures caused by network or hardware faults, software bugs, or insufficient memory. One way to handle such a failure is to re-run the entire job from the beginning, which is time-consuming and inefficient.
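For context, here is a hedged sketch of how checkpointing avoids that full re-run (the checkpoint path and rate source below are illustrative assumptions, not taken from the guide). A streaming query given a checkpointLocation persists its offsets and state there, so restarting it with the same path resumes from the last committed batch instead of replaying everything:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

# A simple built-in source; in practice this would be Kafka, files, etc.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Spark persists offsets and operator state under checkpointLocation. After a
# failure, restarting the query with the same path resumes from the last
# committed batch rather than reprocessing the stream from the beginning.
query = (
    stream.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/rate_demo")  # illustrative path
    .start()
)
```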


ARC Uses a Lakehouse Architecture for Real-time Data Insights That Optimize Drilling Performance and Lower Carbon Emissions

ARC has deployed the Databricks Lakehouse Platform to enable our drilling engineers to monitor operational metrics in near real-time, so that we can proactively identify potential issues and take agile mitigation measures. In addition to improving drilling precision, this solution has helped us reduce drilling time in one of our fields. That time saving translates into less fuel used and therefore a smaller CO2 footprint from drilling operations.


How Audantic Uses Databricks Delta Live Tables to Increase Productivity for Real Estate Market Segments

To support our data-driven initiatives, we had ‘stitched’ together various services for ETL, orchestration, and ML on AWS with Airflow. We saw some success, but the system quickly became overly complex, taking nearly five times as long to develop as the new solution. Our team captured high-level metrics comparing our previous implementation and the current lakehouse solution. As the table below shows, we spent months developing our previous solution and had to write approximately three times as much code.
