Blog

A Productive Life: How to Parallelize Code Execution in Python

Asynchronous programming has become increasingly popular in recent years, especially in web development, where it is used to build high-performance, scalable applications. Python has built-in support for asynchronous programming through the asyncio module, which provides a powerful framework for writing asynchronous code. In this blog post, we will explore the asyncio module in Python 3.10 and learn how to run tasks in parallel using the new features introduced in this version.
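
As a rough illustration of the idea (not the post's code), the sketch below runs two made-up coroutines concurrently with asyncio.gather; the coroutine names and delays are invented, and it runs on Python 3.10+:

```python
# Minimal sketch: run two coroutines concurrently with asyncio.gather.
# Task names and sleep durations are illustrative stand-ins for real I/O work.
import asyncio
import time


async def fetch_data(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # pretend this is a network call
    return f"{name} finished after {delay}s"


async def main() -> None:
    start = time.perf_counter()
    results = await asyncio.gather(
        fetch_data("task-1", 1.0),
        fetch_data("task-2", 1.5),
    )
    print(results)
    # The coroutines overlap, so this prints roughly 1.5s rather than 2.5s.
    print(f"elapsed: {time.perf_counter() - start:.1f}s")


if __name__ == "__main__":
    asyncio.run(main())
```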

Continue reading

How to Cut Your Data Processing Costs by 30% with Graviton

What is AWS Graviton? AWS Graviton is a family of Arm-based processors designed by AWS to provide cost-effective, high-performance computing for cloud workloads. Graviton processors are built on 64-bit Arm cores, which are optimized for power efficiency and performance. They offer a more cost-effective alternative to traditional x86-based processors, making them a popular choice for running a variety of workloads on AWS.

Continue reading

Spark Streaming Best Practices - A bare minimum checklist for Beginners and Advanced Users

Most good things in life come with a nuance. While learning Streaming a few years ago, I spent hours searching for best practices, but the answers I found were too complicated to make sense to a beginner. Thus, I devised a set of best practices that should hold true in almost all scenarios. The checklist below is not ordered; aim to check off as many items as you can.

Continue reading

How to write your first Spark application with Stream-Stream Joins with working code.

Have you been waiting to try Streaming but cannot take the plunge? In a single blog, we will teach you everything you need to understand about Streaming Joins and give you working code that you can use for your next Streaming Pipeline. The steps involved: create a fake dataset at scale, set a baseline using traditional SQL, define temporary Streaming Views, inner joins with optional watermarking, left joins with watermarking, the cold-start edge case (withEventTimeOrder), and cleanup. What is a Stream-Stream Join?
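
To make the idea concrete before reading the post, here is a minimal sketch (not the post's code) of a stream-stream inner join with watermarks; the rate sources, column names, and the 15-minute join window are all illustrative assumptions:

```python
# Sketch of a stream-stream inner join with watermarking.
# Sources, column names, and time bounds are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()

# Two example streams built from the rate source (timestamp, value).
impressions = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .select(F.col("value").alias("ad_id"),
            F.col("timestamp").alias("impression_time"))
    .withWatermark("impression_time", "10 minutes")
)

clicks = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .select(F.col("value").alias("click_ad_id"),
            F.col("timestamp").alias("click_time"))
    .withWatermark("click_time", "20 minutes")
)

# A time-range condition bounds the state Spark must keep for the join.
joined = impressions.join(
    clicks,
    F.expr("""
        ad_id = click_ad_id AND
        click_time BETWEEN impression_time AND impression_time + INTERVAL 15 minutes
    """),
    "inner",
)

query = (
    joined.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```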

Continue reading

Dive Deep into Spark Streaming Checkpoint

Spark is a distributed computing framework that allows for processing large datasets in parallel across a cluster of computers. When running a Spark job, it is not uncommon to encounter failures due to various issues such as network or hardware failures, software bugs, or even insufficient memory. One way to address these issues is to re-run the entire job from the beginning, which can be time-consuming and inefficient.
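
As a rough sketch of the mechanism the post explains (not the post's code), a Structured Streaming query is made recoverable by pointing checkpointLocation at durable storage; the rate source and checkpoint path below are placeholders:

```python
# Minimal sketch: enable checkpointing for a Structured Streaming query.
# The checkpoint path is a placeholder; in practice it should live on
# durable storage such as DBFS, S3, or ADLS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("console")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/rate-demo")  # placeholder
    .start()
)
query.awaitTermination()
```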

Continue reading

Delta Live Tables Advanced Q & A

This is primarily written for folks who are trying to handle edge cases. Q1.) The DLT pipeline was deleted, but the Delta table exists. What to do now? What if the owner has left the org? Step 1.) Verify via the CLI whether the pipeline has been deleted: databricks --profile <your_env> pipelines list, then databricks --profile <your_env> pipelines get --pipeline-id <deleted_pipeline_id>. Step 2.) Create a new pipeline with the existing storage path.

Continue reading

Databricks Workspace Best Practices - A checklist for both Beginners and Advanced Users

Most good things in life come with a nuance. While learning Databricks a few years ago, I spent hours searching for best practices. Thus, I devised a set of rules that should hold in almost all scenarios. These will help you start on the right foot. Here are some basic rules for using a Databricks Workspace: Version control everything: use Repos. Organize your notebooks and folders: keep your notebooks and files in folders to make them easy to find and manage.

Continue reading

How to get the Job ID and Run ID for a Databricks Job

Sometimes there is a need to store or print system-generated values like job_id, run_id, start_time, etc. These entities are called ‘task parameter variables’, and a list of supported parameters is available here. This is a simple 2-step process: pass the parameters when defining the job/task, then fetch and print the values. Step 1: Pass the parameters. Step 2: Fetch and print the values: print(f""" job_id: {dbutils.
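
As a hedged sketch of the pattern described (not the post's exact code), the job or task would be configured to pass the built-in task parameter variables, and the notebook would read them back; the parameter names "job_id" and "run_id" are illustrative choices, and dbutils is only available inside a Databricks notebook:

```python
# Sketch only: the job/task is assumed to define parameters such as
#   {"job_id": "{{job_id}}", "run_id": "{{run_id}}"}
# so Databricks substitutes the real values at run time.
# dbutils is provided by the Databricks notebook runtime (no import needed);
# the widget names below are illustrative, not required by Databricks.
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")

print(f"""
job_id: {job_id}
run_id: {run_id}
""")
```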

Continue reading

How to prepare yourself to be better at Data Interviews?

In this blog, let’s talk about some specific actions you can take to perform better at Data Interviews. Below is general advice based on my experience coaching 100+ candidates and my industry experience on both sides of the table. The most popular skill set as of 2023 still seems to be SQL, Python, and Big Data fundamentals. Here is how to prepare for each of them.

Continue reading

How I wrote my first Spark Streaming Application with Joins?

When I started learning about Spark Streaming, I could not find enough code or material to kick-start my journey and build my confidence. I wrote this blog to fill that gap and help beginners understand how simple streaming is while building their first application. In this blog, I will explain most things from first principles to build your understanding and confidence, so you walk away with code for your first Streaming application.

Continue reading