Blog

Databricks Workspace Best Practices - A Checklist for Both Beginners and Advanced Users

Most good things in life come with a nuance. While learning Databricks a few years ago, I spent hours searching for best practices, so I put together a set of rules that should hold in almost all scenarios and help you start on the right foot. Here are some basic rules for using the Databricks Workspace: version control everything using Repos, and organize your notebooks and files into folders so they are easy to find and manage.

Continue reading

How to get the Job ID and Run ID for a Databricks Job

Sometimes there is a need to store or print system-generated values like job_id, run_id, start_time, etc. These entities are called "task parameter variables", and the full post links the list of supported variables. It is a simple two-step process: Step 1: pass the parameters when defining the job/task. Step 2: fetch and print the values, as sketched below: print(f""" job_id: {dbutils.
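For illustration, here is a minimal sketch of step 2, not the post's exact code. It assumes the task was defined with parameters that map Databricks task parameter variables such as {{job_id}} and {{run_id}} to widget names of the same name.

```python
# Minimal sketch: dbutils is available without an import inside a Databricks notebook.
# Assumes the task was defined with parameters mapping task parameter variables to
# widget names, e.g. {"job_id": "{{job_id}}", "run_id": "{{run_id}}", "start_time": "{{start_time}}"}.
job_id = dbutils.widgets.get("job_id")          # resolved from {{job_id}}
run_id = dbutils.widgets.get("run_id")          # resolved from {{run_id}}
start_time = dbutils.widgets.get("start_time")  # resolved from {{start_time}}

print(f"""
job_id:     {job_id}
run_id:     {run_id}
start_time: {start_time}
""")
```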

Continue reading

How to prepare yourself to be better at Data Interviews?

In this blog, let's talk about some specific actions you can take to perform better at data interviews. The advice below is based on my experience coaching 100+ candidates and my industry experience on both sides of the table. The most popular skill set as of 2023 is still SQL, Python, and big data fundamentals. Here is how to prepare for each of them.

Continue reading

How I wrote my first Spark Streaming Application with Joins?

When I started learning about Spark Streaming, I could not find enough code or material to kick-start my journey and build my confidence. I wrote this blog to fill that gap and help beginners understand how simple streaming is as they build their first application. I explain most things from first principles to build your understanding and confidence, and you walk away with the code for your first streaming application, along the lines of the sketch below.
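As a taste of what the post builds up to, here is a minimal stream-static join sketch. The rate source, column names, and join key are illustrative assumptions, not the post's actual application.

```python
# Minimal sketch of a first streaming query that joins a stream with a static lookup table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-streaming-join").getOrCreate()

# Static lookup table built in memory so the example is self-contained (assumed data).
dim_df = spark.createDataFrame(
    [(0, "bronze"), (1, "silver"), (2, "gold")],
    ["customer_id", "tier"],
)

# Streaming source: the built-in rate source emits (timestamp, value) rows.
stream_df = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .selectExpr("value % 3 AS customer_id", "timestamp")
)

# Stream-static join: every micro-batch is joined against the static DataFrame.
joined_df = stream_df.join(dim_df, on="customer_id", how="left")

query = (
    joined_df.writeStream.format("console")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/first_join")
    .start()
)
```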

Continue reading

How to upgrade your Spark Stream application with a new checkpoint!

Sometimes in life, we need to make breaking changes that require us to create a new checkpoint. Some example scenarios: a code/application change that alters the logic; a major Spark version upgrade from Spark 2.x to Spark 3.x; or a previous deployment was wrong and you want to reprocess from a certain point. There are plenty of scenarios where you want to control precisely which data (Kafka offsets) gets processed, as in the sketch below.
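As a rough sketch of the idea (with an assumed broker, topic, and paths rather than the post's actual configuration), the snippet below restarts a Kafka-backed stream under a brand-new checkpoint and pins the starting offsets explicitly.

```python
# Illustrative sketch: startingOffsets is honoured only when the checkpoint
# directory is new, which is exactly the point of creating a fresh checkpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restart-from-offsets").getOrCreate()

# Per-partition starting offsets as JSON: partition 0 from offset 1000,
# partition 1 from the earliest available offset (-2 means "earliest").
starting_offsets = '{"orders": {"0": 1000, "1": -2}}'

kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "orders")                       # assumed topic name
    .option("startingOffsets", starting_offsets)
    .load()
)

query = (
    kafka_df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .writeStream.format("delta")
    .option("checkpointLocation", "/checkpoints/orders_v2")  # new checkpoint location
    .start("/delta/orders")                                   # assumed target path
)
```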

Continue reading

How to parameterize Delta Live Tables and import reusable functions

This blog discusses passing custom parameters to a Delta Live Tables (DLT) pipeline, as well as importing functions defined in other files or locations. You can import files from the current directory or from a specified location using sys.path.append(). Update: as of December 2022, if reusable_functions.py lives in the same repository you can import it directly with a plain import statement, which is the preferred approach. A sketch of both ideas follows.
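Here is a minimal sketch with assumed configuration keys, paths, and a hypothetical helper (add_ingest_metadata) standing in for the post's actual reusable functions; only the reusable_functions.py name comes from the post.

```python
# Illustrative sketch of a parameterized DLT pipeline importing a reusable function.
import sys
import dlt

# Older approach: make an arbitrary location importable. Inside a Databricks Repo,
# a plain `import` works without this (the December 2022 update mentioned above).
sys.path.append("/Workspace/Repos/me/my_repo/utils")    # assumed location
from reusable_functions import add_ingest_metadata       # hypothetical helper

# Custom parameters are set in the DLT pipeline configuration,
# e.g. {"source_path": "/data/raw/events"}, and read back via spark.conf.
source_path = spark.conf.get("source_path")

@dlt.table(name="events_bronze")
def events_bronze():
    return add_ingest_metadata(
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(source_path)
    )
```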

Continue reading

Merge Multiple Spark Streams Into A Delta Table

This blog discusses how to read from multiple Spark streams and merge/upsert the data into a single Delta table; we also optimize/cluster the Delta table's data. Overall, the process works as follows: read data from each streaming source, then use the special foreachBatch function to call a user-defined function responsible for all the processing, as in the sketch below.
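As a rough illustration of that pattern (with assumed table names, paths, and join key rather than the post's actual code), two streams can share one foreachBatch upsert function.

```python
# Illustrative sketch: two streams upsert into one Delta table through a shared
# foreachBatch function. `spark` is the active session in a Databricks notebook,
# and the target table is assumed to exist already.
from delta.tables import DeltaTable

def upsert_to_target(micro_batch_df, batch_id):
    """Merge one micro-batch into the target table, keyed on `id` (assumed column)."""
    target = DeltaTable.forName(spark, "target_table")   # assumed table name
    (
        target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

stream_a = spark.readStream.format("delta").load("/data/source_a")  # assumed path
stream_b = spark.readStream.format("delta").load("/data/source_b")  # assumed path

for name, source_df in [("a", stream_a), ("b", stream_b)]:
    (
        source_df.writeStream
        .foreachBatch(upsert_to_target)
        .option("checkpointLocation", f"/checkpoints/target_from_{name}")
        .start()
    )
```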

Continue reading

Using Spark Streaming to merge/upsert data into a Delta Lake with working code

This blog discusses how to read from a Spark stream and merge/upsert the data into a Delta Lake table; we also optimize/cluster the Delta table's data. At the end, we show how to start a streaming pipeline with the previous target table as the source, as in the sketch below. Overall, the process works as follows: we read data from a streaming source and use the special foreachBatch function.
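Since the foreachBatch upsert mirrors the sketch in the previous post, here is a minimal sketch of the final step only: reading the merged target table as the source of a new stream. The table names and the ignoreChanges choice are assumptions, not the post's code.

```python
# Minimal sketch: read the merged target table as a streaming source for the next hop.
# ignoreChanges lets the stream tolerate the updates that MERGE produces upstream;
# `spark` is the active session in a Databricks notebook.
downstream_df = (
    spark.readStream
    .option("ignoreChanges", "true")
    .table("target_table")           # assumed: the table populated by the upserts
)

query = (
    downstream_df.writeStream
    .option("checkpointLocation", "/checkpoints/silver_from_target")
    .toTable("silver_table")         # assumed downstream table name
)
```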

Continue reading

ARC Uses a Lakehouse Architecture for Real-time Data Insights That Optimize Drilling Performance and Lower Carbon Emissions

ARC has deployed the Databricks Lakehouse Platform to enable its drilling engineers to monitor operational metrics in near real-time, so that we can proactively identify potential issues and enable agile mitigation measures. In addition to improving drilling precision, this solution has helped us reduce drilling time for one of our fields. The time savings translate to a reduction in fuel used and therefore in the CO2 footprint that results from drilling operations.

Continue reading

How Audantic Uses Databricks Delta Live Tables to Increase Productivity for Real Estate Market Segments

To support our data-driven initiatives, we had "stitched" together various services for ETL, orchestration, and ML, leveraging AWS and Airflow. We saw some success, but it quickly turned into an overly complex system that took nearly five times as long to develop as the new solution. Our team captured high-level metrics comparing our previous implementation with the current lakehouse solution. As the table in the full post shows, we spent months developing the previous solution and had to write approximately three times as much code.

Continue reading