Spark

Streaming Any File Type with Autoloader in Databricks: A Working Guide

Streaming Any File Type with Autoloader in Databricks: A Working Guide Spark Streaming has emerged as a dominant streaming framework, known for its scalable, high-throughput, and fault-tolerant handling of live data streams. While Spark Streaming and Databricks Autoloader natively support standard file formats like JSON, CSV, PARQUET, AVRO, TEXT, BINARYFILE, and ORC, their versatility extends far beyond these. This blog post delves into using Spark Streaming and Databricks Autoloader to process file types that are not natively supported.
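The full walkthrough is in the post itself; as a rough sketch of the general pattern (not necessarily the post's exact code), unsupported files can be ingested with Autoloader's binaryFile format and decoded with a UDF. The parse_custom_format helper and all paths below are hypothetical placeholders.

```python
# Sketch only: ingest unknown file types as raw bytes with Autoloader,
# then parse them with a UDF. The helper and all paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=StringType())
def parse_custom_format(content: bytes) -> str:
    # Replace with real parsing logic for your file type.
    return content.decode("utf-8", errors="ignore")

raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")                   # read bytes as-is
    .option("cloudFiles.schemaLocation", "/tmp/schemas/custom")  # hypothetical path
    .load("/mnt/landing/custom_files")                           # hypothetical path
)

query = (
    raw.withColumn("parsed", parse_custom_format("content"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/custom")     # hypothetical path
    .start("/mnt/bronze/custom_files")                           # hypothetical path
)
```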

Continue reading

Spark Stream Aggregation: A Beginner’s Journey with Practical Code Example

Spark Stream Aggregation: A Beginner’s Journey with Practical Code Example Welcome to the fascinating world of Spark Stream Aggregation! This blog is tailored for beginners, featuring hands-on code examples from a real-world scenario of processing vehicle data. I suggest reading the blog once without the code, then going back through the code alongside it. Setting up Spark Configuration, Imports & Parameters First, let’s understand our setup. We configure the Spark environment to use RocksDB for state management, which makes our stateful stream processing more efficient:
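The excerpt cuts off just before the configuration itself; a minimal sketch of the kind of RocksDB state-store setting it refers to, plus a hypothetical windowed aggregation over vehicle events (the source path and column names are assumptions, not the post's code), might look like:

```python
# Sketch: enable the RocksDB state store and run a simple windowed count.
# The source path and columns (event_time, vehicle_id) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.getOrCreate()

# Keep streaming aggregation state in RocksDB instead of the JVM heap.
# (On open-source Spark, use
#  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.)
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)

vehicle_stream = spark.readStream.format("delta").load("/mnt/bronze/vehicle_events")

counts = (
    vehicle_stream
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("vehicle_id"))
    .count()
)
```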

Continue reading

Solving Delta Table Concurrency Issues

Solving Delta Table Concurrency Issues Delta Lake is a powerful technology for bringing ACID transactions to your data lakes. It allows multiple operations to be performed on a dataset concurrently. However, concurrent operations can sometimes be tricky and may lead to issues such as ConcurrentAppendException, ConcurrentDeleteReadException, and ConcurrentDeleteDeleteException. In this blog post, we will explore why these issues occur, how to handle them effectively with a Python function, and how to avoid them through table design, isolation levels, and an understanding of write conflicts.
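The post's own handler is behind the link; a minimal retry sketch in the same spirit (the function name, retry count, and delay are illustrative assumptions, not the post's exact code) could look like:

```python
# Hypothetical retry wrapper for Delta write conflicts; names and retry
# policy are illustrative assumptions, not the post's exact function.
import time
from delta.exceptions import (
    ConcurrentAppendException,
    ConcurrentDeleteReadException,
    ConcurrentDeleteDeleteException,
)

def run_with_retry(operation, max_retries=3, wait_seconds=5):
    """Re-run a Delta operation when it hits a transient write conflict."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except (ConcurrentAppendException,
                ConcurrentDeleteReadException,
                ConcurrentDeleteDeleteException) as err:
            if attempt == max_retries:
                raise
            print(f"Conflict on attempt {attempt}: {err}; retrying...")
            time.sleep(wait_seconds)

# Example usage: retry a MERGE that may clash with concurrent writers.
# run_with_retry(lambda: spark.sql("MERGE INTO target USING updates ON ..."))
```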

Continue reading

Databricks SQL Dashboards Guide: Tips and Tricks to Master Them

Databricks SQL Dashboards Guide: Tips and Tricks to Master Them Welcome to the world of Databricks SQL Dashboards! You’re in the right place if you want to learn how to go beyond just building visualizations and add some tricks to your arsenal. This guide will walk you through creating, managing, and optimizing your Databricks SQL dashboards. 1. Getting Started with Viewing and Organizing Dashboards: Accessing Your Dashboards: Click “Workspace” in the sidebar to open the workspace browser.

Continue reading

Optimizing Databricks SQL: Achieving Blazing-Fast Query Speeds at Scale

Optimizing Databricks SQL: Achieving Blazing-Fast Query Speeds at Scale In this data-driven age, delivering a seamless user experience is paramount. While there are numerous ways to measure this experience, one metric stands tall when evaluating the responsiveness of applications and databases: P99 latency. Especially vital for SQL queries, this seemingly esoteric number is, in reality, a powerful gauge of the experience we provide to our customers. Why is it so crucial?
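For context, P99 latency is simply the 99th percentile of observed query durations: 99% of queries finish at or below it. A rough sketch of computing it with Spark SQL's approx_percentile (the query_history table and duration_ms column are hypothetical placeholders) might be:

```python
# Illustration only: compute P99 latency over a hypothetical log of query
# runtimes; approx_percentile is a standard Spark SQL aggregate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

p99 = spark.sql("""
    SELECT approx_percentile(duration_ms, 0.99) AS p99_latency_ms
    FROM query_history                              -- hypothetical table
    WHERE query_date >= date_sub(current_date(), 7)
""")
p99.show()
```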

Continue reading

Simplifying Real-time Data Processing with Spark Streaming’s foreachBatch with working code

Simplifying Real-time Data Processing with Spark Streaming’s foreachBatch with working code This is a comprehensive guide to implementing a fully operational streaming pipeline that can be tailored to your specific needs. In this working example, you will learn how to parameterize the foreachBatch function. Spark Streaming & foreachBatch Spark Streaming is a powerful tool for processing streaming data. It allows you to process data as it arrives, without having to wait for the entire dataset to be available.
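The post walks through its own parameterized pipeline; as a rough sketch of the general idea (not the post's exact code), extra parameters can be bound to the batch handler with functools.partial, since Spark only passes the DataFrame and batch ID. The table name and paths below are hypothetical.

```python
# Sketch of parameterizing a foreachBatch handler; the write logic, table
# name, and paths are illustrative assumptions, not the post's pipeline.
from functools import partial
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def write_batch(batch_df: DataFrame, batch_id: int, target_table: str) -> None:
    """Write one micro-batch to the given Delta table."""
    batch_df.write.format("delta").mode("append").saveAsTable(target_table)

source = spark.readStream.format("rate").load()   # toy source for illustration

query = (
    source.writeStream
    # Bind the extra argument; Spark still calls the handler with (df, batch_id).
    .foreachBatch(partial(write_batch, target_table="bronze.events"))
    .option("checkpointLocation", "/tmp/checkpoints/foreach_batch")  # hypothetical
    .start()
)
```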

Continue reading

How to Cut Your Data Processing Costs by 30% with Graviton

How to Cut Your Data Processing Costs by 30% with Graviton What is AWS Graviton? AWS Graviton is a family of Arm-based processors designed by AWS to provide cost-effective, high-performance computing for cloud workloads. Graviton processors are built on the 64-bit Arm architecture, which is optimized for power efficiency and performance. They offer a more cost-effective alternative to traditional x86-based processors, making them a popular choice for running a variety of workloads on AWS.

Continue reading