<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Canadian Data Guy Unfiltered]]></title><description><![CDATA[The engineer who writes documentation-grade deep dives with production code you can run today]]></description><link>https://www.canadiandataguy.com</link><image><url>https://substackcdn.com/image/fetch/$s_!n3Eg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cc7753-f8fb-4300-ac7f-1806e112a06a_1024x1024.png</url><title>Canadian Data Guy Unfiltered</title><link>https://www.canadiandataguy.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 03 Apr 2026 20:32:46 GMT</lastBuildDate><atom:link href="https://www.canadiandataguy.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Canadian Data Guy]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[canadiandataguy@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[canadiandataguy@substack.com]]></itunes:email><itunes:name><![CDATA[Canadian Data Guy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Canadian Data Guy]]></itunes:author><googleplay:owner><![CDATA[canadiandataguy@substack.com]]></googleplay:owner><googleplay:email><![CDATA[canadiandataguy@substack.com]]></googleplay:email><googleplay:author><![CDATA[Canadian Data Guy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Inside Delta Lake’s Idempotency Magic: The Secret to Exactly-Once Spark]]></title><description><![CDATA[Learn how txnAppId and epochId work together to create a bulletproof distributed two-phase commit. 
Achieve true exactly-once semantics for your production pipelines]]></description><link>https://www.canadiandataguy.com/p/inside-delta-lakes-idempotency-magic</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/inside-delta-lakes-idempotency-magic</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Tue, 27 Jan 2026 01:51:30 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/185914250/1dd56360f1e2700fb6a8d8bd66e89de4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>When a Spark Structured Streaming job fails mid-flight, how does it know where to resume? What prevents duplicate writes to your Delta tables? This article explores the elegant mechanisms that make Spark Structured Streaming fault-tolerant and exactly-once.</p><blockquote><p><strong>Key Insight:</strong></p><p>The checkpoint directory and Delta Lake&#8217;s transaction log work together to ensure correctness even when clusters die between writing data and recording completion.</p></blockquote><h2><strong>Checkpoint Directory Structure</strong></h2><p>When you start a streaming query, Spark creates a checkpoint directory with the following structure:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xbRm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F479b398d-6327-49d6-a5dd-31effd84cd09_1348x746.png" width="1348" height="746" alt=""></figure></div><blockquote><p><strong>Critical Timing:</strong></p><p><strong>offsets/N</strong> is written <em>before</em> processing batch N starts.<br><strong>commits/N</strong> is written <em>after</em> batch N completes successfully.</p></blockquote><h2><strong>The Normal Happy Path</strong></h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SVoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da910fd-5a2d-4d7c-b1af-eab648e182de_1338x946.png" width="1338" height="946" alt=""></figure></div><h2><strong>The Critical Failure Scenario</strong></h2><p>Here&#8217;s where things get interesting.</p>
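<p>The Critical Timing rule above is what drives recovery on restart. Here is a toy sketch of that decision logic in plain Python; this is a model for intuition only, not Spark&#8217;s actual code, and the function name and inputs are illustrative:</p>

```python
# Toy model of restart recovery: offsets/N is written BEFORE batch N runs,
# commits/N is written AFTER it succeeds, so comparing the two directories
# tells a restarted query exactly where to resume.
def next_action(offset_batches: set, commit_batches: set) -> str:
    if not offset_batches:
        return "start fresh from batch 0"
    latest_planned = max(offset_batches)  # last batch whose inputs were planned
    if latest_planned in commit_batches:
        return f"batch {latest_planned} finished; plan batch {latest_planned + 1}"
    # offsets/N exists but commits/N is missing: batch N may or may not have
    # reached the sink, so re-run it with exactly the same planned inputs.
    return f"re-run batch {latest_planned} from its recorded offsets"

print(next_action({0, 1}, {0, 1}))     # batch 1 finished; plan batch 2
print(next_action({0, 1, 2}, {0, 1}))  # re-run batch 2 from its recorded offsets
```

<p>The re-run case is only safe because the sink itself can detect the duplicate, which is the failure scenario this section walks through.</p>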
<p>What happens when the cluster dies <em>after</em> writing to Delta but <em>before</em> writing the commit file?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!U46n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf4281b-3f14-4a50-a1fc-6ad72c3fcaf0_1248x658.png" width="1248" height="658" alt=""></figure></div><blockquote><p><strong>The Problem</strong></p><p>Batch N+1 was successfully written to Delta Lake, but the commit file was never created. On restart, Spark will see:</p><ul><li><p><code>offsets/N+1</code> exists</p></li><li><p><code>commits/N+1</code> does not exist</p></li><li><p>Data is already in the Delta table</p></li></ul><p><strong>Question: Won&#8217;t re-running batch N+1 create duplicates? &#129300;</strong></p></blockquote><h2><strong>Delta Lake&#8217;s Idempotency Magic</strong></h2><p>This is where Delta Lake&#8217;s transaction log saves the day. Delta records two critical pieces of metadata with every streaming write:</p><blockquote><p><strong>Note:</strong> The terms <code>epochId</code> and <code>batchId</code> refer to essentially the same thing: the monotonically increasing micro-batch number.
 I&#8217;m still digging into the exact difference between them.</p></blockquote><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!LDSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ff817b-1a1a-46cc-8af9-edf86e1aa2ad_1072x874.png" width="1072" height="874" alt=""></figure></div><p><strong>txnAppId (Query ID)</strong> The unique streaming query identifier stored in the checkpoint&#8217;s <code>metadata</code> file. This ensures different queries don&#8217;t interfere with each other.</p><p><strong>txnVersion (Epoch ID)</strong> The micro-batch number (0, 1, 2, 3...).
Monotonically increasing; each batch gets its own ID.</p><h4><strong>The Solution</strong></h4><blockquote><p>When Spark retries batch N+1, Delta Lake checks its transaction log:</p><ul><li><p>Has transaction (queryId: &#8220;abc-def-123-456&#8221;, epochId: N+1) been committed?</p></li><li><p><strong>If YES</strong>: Skip the duplicate write, create <code>commits/N+1</code></p></li><li><p><strong>If NO</strong>: Proceed with the write, then create <code>commits/N+1</code></p></li></ul></blockquote><h2><strong>Complete Recovery Flow</strong></h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mvWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd478451-ed00-42c4-a022-2c6f5126351d_1348x1014.png" width="1348" height="1014" alt=""></figure></div><blockquote><p><strong>Key Insight:</strong> The checkpoint and Delta transaction log work together as a distributed two-phase commit. The checkpoint tracks <em>intent</em>, while Delta&#8217;s log tracks <em>completion</em>.
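<p>That handshake can be modelled in a few lines of Python. The class below is purely illustrative (not Delta&#8217;s real API): it keeps, per <code>txnAppId</code>, the highest committed <code>txnVersion</code>, and turns any replayed write into a no-op:</p>

```python
# Toy model of Delta's idempotent streaming writes: the transaction log
# stores (txnAppId -> highest committed txnVersion), and a retried
# micro-batch with an already-committed version is silently skipped.
class DeltaTableModel:
    def __init__(self):
        self.rows = []     # stands in for the table's data files
        self.txn_log = {}  # txnAppId -> highest committed txnVersion

    def write_batch(self, txn_app_id, txn_version, rows):
        committed = self.txn_log.get(txn_app_id, -1)
        if txn_version <= committed:
            return False   # duplicate retry: skip the write entirely
        # In real Delta, the data files and the txn record commit atomically.
        self.rows.extend(rows)
        self.txn_log[txn_app_id] = txn_version
        return True

table = DeltaTableModel()
table.write_batch("abc-def-123-456", 7, ["r1", "r2"])  # first attempt lands
table.write_batch("abc-def-123-456", 7, ["r1", "r2"])  # crash retry: skipped
print(len(table.rows))  # 2, not 4
```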
Both must agree for the system to move forward.</p></blockquote><h2><strong>Offset Semantics / ( inclusive, exclusive]</strong></h2><p>When reading from Kafka, understanding offset boundaries is crucial for reasoning about what each batch consumes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AAZS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AAZS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 424w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 848w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 1272w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AAZS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png" width="1092" height="358" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1092,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AAZS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 424w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 848w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 1272w, https://substackcdn.com/image/fetch/$s_!AAZS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3be2a7ec-9176-42d2-8ab1-8810759c86ed_1092x358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Start Offset: Inclusive.</strong> If <code>start = 100</code>, offset 100 <strong>is</strong> included in the batch.</p><p><strong>End Offset: Exclusive.</strong> If <code>end = 200</code>, offset 200 <strong>is not</strong> included in the batch.</p><p><strong>Next Batch:</strong> Batch N+1 would start at offset <code>200</code> (the previous end becomes the next start).</p><h2><strong>TL;DR: Recovery Rules</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nkib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!nkib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 424w, https://substackcdn.com/image/fetch/$s_!nkib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 848w, https://substackcdn.com/image/fetch/$s_!nkib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 1272w, https://substackcdn.com/image/fetch/$s_!nkib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nkib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png" width="1396" height="668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!nkib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 424w, https://substackcdn.com/image/fetch/$s_!nkib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 848w, https://substackcdn.com/image/fetch/$s_!nkib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 1272w, https://substackcdn.com/image/fetch/$s_!nkib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9531f0-c1aa-467c-9e1e-a069faa63222_1396x668.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Where to Find the IDs</strong></h3><p><strong>Streaming Query ID</strong></p><ul><li><p>&#128193;Checkpoint <code>metadata/</code> file</p></li><li><p>&#128421;&#65039;Spark Streaming UI (while query runs)</p></li><li><p>&#128211;Notebook outputs showing query status</p></li></ul><p>Batch/Epoch IDs</p><ul><li><p>&#128202;Tracked per micro-batch (0, 1, 2, ...)</p></li><li><p>&#127991;&#65039;Used by Delta to prevent duplicate commits (same as epochId)</p></li><li><p>&#128220;Visible in Delta transaction log</p></li></ul><h2><strong>When Things Go Wrong: A Real-World Accident</strong></h2><p>I encountered a scenario where duplicate records appeared in my Delta Lake table despite Structured Streaming&#8217;s exactly-once guarantees. The quickest way to identify the scope of the problem was using Delta&#8217;s <code>_metadata</code> column to pinpoint which Parquet files contained duplicates. By tracing these files back through the Delta transaction log versions, I discovered the root cause: <strong>the same epochId appeared in multiple transactions with different queryId values</strong>. This broke Delta Lake&#8217;s idempotency mechanism, which relies on the unique combination of <strong>(queryId, epochId)</strong> to detect and skip duplicate writes.</p><h3><strong>Root Cause</strong></h3><p>The checkpoint directory was accidentally overwritten or corrupted, causing Spark to reinitialize with a new queryId while replaying already-processed batches. 
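</p><p><em>The idempotency check itself can be approximated with a small in-memory simulation (this is an illustration, not Delta&#8217;s actual implementation, and the query ids below are made up): a write is applied only when its (queryId, epochId) pair has not already been committed, which is exactly why a brand-new queryId lets already-processed epochs through.</em></p>

```python
# Simulation of Delta's idempotent-write check. Per writer id, the highest
# committed epoch is recorded; a replayed epoch at or below it is skipped.
committed = {}  # query_id -> highest committed epoch_id

def try_commit(query_id, epoch_id):
    last = committed.get(query_id)
    if last is not None and epoch_id <= last:
        return "skipped (duplicate)"
    committed[query_id] = epoch_id
    return "committed"

# Healthy recovery: same queryId, so the replayed epoch is deduplicated.
assert try_commit("query-A", 0) == "committed"
assert try_commit("query-A", 0) == "skipped (duplicate)"

# The incident: a lost checkpoint yields a NEW queryId, so the same epoch
# looks like a fresh transaction and is committed again -> duplicate rows.
assert try_commit("query-B", 0) == "committed"
```

<p>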
Since Delta only saw new queryId values, it treated these as legitimate new transactions rather than duplicates, resulting in data duplication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qQ7X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qQ7X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 424w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 848w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qQ7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png" width="1456" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qQ7X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 424w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 848w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!qQ7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1b441b-ae9a-4b25-88c7-50bd86205564_2730x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Key Lessons:</strong></h3><p>This incident brought to light the critical importance of:</p><ol><li><p><strong>Enable comprehensive logging:</strong> Implement S3 Server Access Logging or AWS CloudTrail to audit all checkpoint locations and detect unauthorized changes.</p></li><li><p><strong>Implement strict access control:</strong> Restrict access to checkpoint directories, for example with UC Volumes, to prevent accidental modifications or deletions.</p></li><li><p><strong>Treat checkpoint directories as critical infrastructure:</strong> They require the same level of protection and operational discipline as your data itself.</p></li></ol><blockquote><p><strong>Critical Takeaway:</strong> While Spark and Delta Lake provide strong exactly-once semantics, the checkpoint directory is a critical piece of infrastructure that requires the same level of protection and monitoring as your data itself.</p></blockquote><h2><strong>Summary: The Complete Picture</strong></h2><ol><li><p><strong>Checkpoint 
structure:</strong> <code>offsets/</code>tracks what to process, <code>commits/</code> tracks what&#8217;s been completed</p></li><li><p><strong>Timing matters:</strong> Offsets written before processing, commits written after success</p></li><li><p><strong>Delta Lake&#8217;s role:</strong> Transaction log with (query ID, epoch ID) prevents duplicates</p></li><li><p><strong>Safe replay:</strong> If a batch is replayed, Delta checks for prior commit and skips if found</p></li><li><p><strong>Exactly-once guarantee:</strong> Together, checkpoint + Delta transaction log ensure no data loss or duplication</p></li></ol><h2><strong>Related Deep Dives:</strong></h2><p>If you found this troubleshooting walkthrough helpful, I have a couple of other related posts that dive deeper into Delta Lake forensics and data management:</p><ul><li><p><strong>Forensic Analysis:</strong> I&#8217;ve written a detailed guide on how to trace exactly which Delta version number, Parquet file, and commit produced specific records, including sample code for the investigation process. If there&#8217;s interest, I&#8217;m happy to write a dedicated breakdown on this methodology&#8212;just leave a comment below!</p></li><li><p><strong>Handling Data Deletion:</strong> When you need to delete data from Delta Lake tables, understanding the impact on downstream streaming consumers is critical. I&#8217;ve covered this scenario in depth, including patterns for safe deletion and stream recovery. Check it out here: <a href="https://www.databricksters.com/p/how-to-actually-delete-data-in-spark">How to Actually Delete Data in Spark</a></p></li><li><p><strong>&#128250; Deep Dive into Stateful Stream Processing in Structured Streaming</strong></p><p>This talk covers the internals of stateful stream processing, checkpoint mechanisms, and recovery patterns in production environments. 
It provides invaluable insights into how Spark manages state and handles failures at scale.</p></li></ul><p>Let me know in the comments if you&#8217;d like to see more operational war stories and troubleshooting techniques!</p>]]></content:encoded></item><item><title><![CDATA[How to Choose Between Liquid Clustering and Partitioning with Z-Order in Databricks]]></title><description><![CDATA[The views expressed in this blog are my own and do not represent official guidance from Databricks]]></description><link>https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 15 Jan 2026 19:01:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j_gb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>This is one of the most-read posts on the website, so we decided to give it a well-deserved 2026 update. 
Thank you to <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Geethu&quot;,&quot;id&quot;:309314985,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_nBv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff741d3a9-46f8-4807-a7b8-796f5656358e_144x144.png&quot;,&quot;uuid&quot;:&quot;e676bfe4-2610-48f0-8c1d-fd704b0978d3&quot;}" data-component-name="MentionToDOM"></span> for co-authoring on this revision and for raising the technical bar of the article.</em></p></blockquote><p><strong>Delta Lake</strong>, an open source storage format, offers two primary methods for organizing data: <strong>liquid clustering</strong> and <strong>partitioning with Z-order</strong>. This blog post will help you navigate the decision-making process between these two approaches. Clustering in Delta Lake enhances query performance by organizing data based on frequently accessed columns, similar to indexing in relational databases. The key difference is that clustering physically sorts the data within the table rather than creating separate index structures.</p><h2><strong>Understanding the Basics: Liquid Clustering vs. 
Partitioned Z-Order Tables</strong></h2><h4><strong>Liquid Clustering</strong></h4><p>Liquid clustering is a newer algorithm for Delta Lake tables, offering several advantages:</p><ul><li><p><strong>Flexibility</strong>: You can change clustering columns at any time.</p></li><li><p><strong>Optimization for Unpartitioned Tables</strong>: It works well without partitioning.</p></li><li><p><strong>Efficiency</strong>: It doesn&#8217;t re-cluster previously clustered files unless explicitly instructed.</p></li></ul><p>Liquid clustering relies on <strong>optimistic concurrency control (OCC)</strong> to handle conflicts when multiple writes occur to the same table.</p><h4><strong>Partitioned Z-Order Tables</strong></h4><p>Partitioning combined with Z-ordering is a traditional approach that:</p><ul><li><p><strong>Control</strong>: Allows greater control over data organization.</p></li><li><p><strong>Parallel Writes</strong>: Supports parallel writes more effectively.</p></li><li><p><strong>Fine-Grained Optimization</strong>: Enables optimization of specific partitions.</p></li></ul><p>However, data engineers must be aware of querying patterns upfront to choose an appropriate partition column.</p><h2>Decision Tree</h2><p><strong>Built in Jan 2026, this decision tree will be continuously updated as technology evolves. As new enhancements emerge, my understanding will grow, and this resource will be refined accordingly. 
This is a complex topic, but I will do my best to provide at least an intuitive grasp to help you develop a clearer understanding.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j_gb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j_gb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 424w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 848w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 1272w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j_gb!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:1156,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4437352,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/149768489?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j_gb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 424w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 848w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 1272w, https://substackcdn.com/image/fetch/$s_!j_gb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4435c9c7-4db6-47af-8ff9-b3d2767273d7_8192x6505.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Factors to Consider When Choosing</h2><h4>Table Size</h4><ul><li><p><strong>Small tables (&lt; 10 TB)</strong>: If you need fast lookups on exactly two columns, Liquid Clustering on those columns typically delivers comparable performance with simpler maintenance. If your workload involves highly selective lookups across three or more columns, Partition + Z-order may perform better, assuming the partition key has low cardinality. That said, Liquid Clustering can still work for multi-column lookups and is often worth benchmarking with tuned clustering keys.</p></li><li><p><strong>Medium tables (10 TB -500TB)</strong>: For medium-sized tables, the key decision factor is partition cardinality. 
If partitioning results in fewer than ~5,000 distinct values (for example, ~1,100 partitions for 3 years of daily data), Partition + Z-order can work well when queries include the partition column. If the number of distinct values exceeds ~5,000, Liquid Clustering is generally preferred to avoid over-partitioning. In practice, benchmark both approaches with representative queries to validate performance.</p></li><li><p><strong>Large tables (&gt; 500 TB)</strong>: You should reach out to your Databricks representative and have a discussion.</p></li></ul><h5><em><strong>Note: Liquid is being actively improved so the guidance could change </strong></em></h5><h4><strong>Data Ingestion Pattern</strong></h4><p>How data is written - batch or streaming - can influence which data organization strategy is most appropriate.</p><ul><li><p><strong>Batch Ingestion : </strong>For batch workloads, Liquid Clustering remains a strong default choice. Batch writes naturally organize data efficiently. In the latest Databricks Runtime versions, eager clustering can be enabled to make the data well-clustered as it is written, so queries see an optimized view right away.</p></li><li><p><strong>Streaming Ingestion : </strong>For streaming workloads, the choice depends on your main priority.</p><ul><li><p><strong>Low Latency:</strong> If getting data into the table quickly is most important, Liquid Clustering is preferred without eager clustering. This reduces shuffle overhead during ingestion. Data may not be fully optimized immediately, but query performance can improve later using Predictive I/O.</p></li><li><p><strong>Fast Downstream Lookups:</strong> If queries need to be fast as soon as data arrives, Liquid Clustering with eager clustering is recommended. 
This ensures data is well-clustered on write, and follow-up OPTIMIZE can further improve query performance.</p></li></ul></li></ul><h4>Query Patterns</h4><ul><li><p>If users consistently include the partition column in their queries, partitioning can be very effective.</p></li><li><p>Liquid clustering may be more suitable for more flexible query patterns where users may not always include the partition column.</p></li></ul><h4>Data Distribution</h4><ul><li><p>If you have uneven partition sizes, liquid clustering will generally handle the skew better.</p></li><li><p>Date-based data (e.g., clickstream data) often benefits from partitioning.</p></li><li><p>For data without a clear partitioning strategy, liquid clustering may be better.</p></li></ul><h4>Partition Column Selection</h4><p>When choosing a partition column:</p><ul><li><p>Select immutable columns (e.g., click date, sale date)</p></li><li><p>Avoid high-cardinality columns like timestamps</p></li><li><p>For timestamp data, create a derived date column for partitioning</p></li><li><p>Aim for fewer than 10,000 distinct partition values</p></li><li><p>Each partition should contain at least ~1-10 GB of data</p></li></ul><h2>Real-World Example: Amazon Clickstream Data</h2><p>Let's consider a real-world scenario using Amazon's clickstream data:</p><ul><li><p>The table stores 3 years of data for 10 countries</p></li><li><p>Partitioning by click date results in approximately 1,000 partitions (365 * 3, rounded)</p></li><li><p>10 countries * 1,000 date partitions = 10,000 total partitions</p></li></ul><p>This setup sits right at the recommended partition ceiling (~10,000) and provides good control over the data. 
Here's how we might structure this table:</p><ol><li><p>Partition by <code>click_date, country</code></p></li><li><p>Z-order by <code>merchant_id</code> and <code>advertiser_id</code></p></li></ol><h4>Optimizing the Partitioned Table</h4><p>To maintain optimal performance, you can run a daily optimization job on the newest partition:</p><pre><code><code>OPTIMIZE table_name
WHERE click_date = 'ANY_DATE' AND country = 'CANADA'
ZORDER BY (merchant_id, advertiser_id)
</code></code></pre><p>This approach ensures good performance for date-range queries and lookups on Z-ordered columns.</p><h2>Optimistic Concurrency Control</h2><p>Delta Lake uses optimistic concurrency control to manage parallel writes. Here's how it works:</p><ol><li><p>Writers check the current version of the Delta table (e.g., version 100).</p></li><li><p>They attempt to write a new JSON file (e.g., 101.json).</p></li><li><p>Only one writer can succeed in creating this file.</p></li><li><p>The "losing" writer checks if there are conflicts with what was previously written.</p></li><li><p>If no conflicts, it creates the next version (e.g., 102.json).</p></li></ol><p>This approach works well for appends but can be challenging for updates, especially when multiple writers are trying to modify the same files.</p><h2>Potential Pitfalls and Best Practices</h2><p>Here are some key considerations and common mistakes to avoid:</p><ul><li><p>Do not add <strong>correlated columns</strong> to liquid clustering: If two columns are highly correlated, you only need to include one of them as a clustering key. For example, if you have <code>click_date</code> and <code>click_timestamp</code>, cluster by <code>click_timestamp</code> only.</p></li><li><p><strong>Skip meaningless keys:</strong> When it comes to clustering, avoid meaningless keys such as UUIDs, which are random strings with no useful sort order. If possible, refrain from using them in both liquid and Z-order clustering. However, customers sometimes require quick lookups on these UUID columns; in those cases, you may include them.</p></li><li><p><strong>Over-Partitioning</strong>: A common mistake is creating too many partitions. While partitioning helps with performance, too many partitions can result in overhead. A good rule of thumb is to keep partition counts under 10,000. 
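</p><p><em>This rule of thumb is easy to sanity-check in code, since the partition count is simply the product of the partition columns&#8217; cardinalities (the figures below reuse the article&#8217;s rounded estimates):</em></p>

```python
from math import prod

def estimated_partition_count(*column_cardinalities):
    """Total partitions grow multiplicatively with each partition column."""
    return prod(column_cardinalities)

date_partitions = 1_000  # ~3 years of daily data (365 * 3, rounded)
countries = 10

total = estimated_partition_count(date_partitions, countries)
assert total == 10_000  # right at the ~10,000-partition guideline
# Above this ceiling, prefer Liquid Clustering over adding partition columns.
```

<p>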
For example, if you're storing three years of daily click data, partitioning by <code>click_date</code> would result in around 1,000 partitions for three years&#8212;well within the 10,000-partition guideline. Example: Avoid partitioning on high cardinality columns (e.g., timestamps). This would result in too many partitions, leading to performance degradation. Instead, partition on a date column and ensure it has enough data per partition.</p></li><li><p>Enable <strong>Predictive Optimization </strong> on your Databricks workspace to automatically manage maintenance for Unity Catalog&#8211;managed tables. PO identifies tables that can benefit from operations such as <code>OPTIMIZE</code>, <code>VACUUM</code>, and <code>ANALYZE</code>, and schedules these jobs using serverless compute. This eliminates the need to manually schedule <code>OPTIMIZE</code> for compaction or clustering, as the platform triggers operations based on usage patterns, table statistics, and overall table health.</p><ul><li><p>For partitioned tables, PO applies compaction and layout improvements within each partition. </p></li><li><p>For Liquid Clustered tables, PO integrates with <code>CLUSTER BY AUTO</code>, automatically selecting clustering keys and scheduling incremental clustering jobs. This reduces manual tuning and ensures that the table layout evolves with changing query patterns, keeping queries efficient without intervention.</p></li></ul></li><li><p><strong>Schedule Optimization (If Required) </strong>: With Predictive Optimization (PO) enabled, most maintenance tasks are handled automatically. 
You only need to manually run <code>OPTIMIZE</code> in the following cases:</p><ul><li><p>For Z-ordered tables: <code>OPTIMIZE</code> does not automatically apply Z-ordering, so manual <code>OPTIMIZE</code> runs are still required if Z-ordering is needed.</p></li><li><p>For Liquid Clustered tables, manual <code>OPTIMIZE</code> is only needed if queries require faster response times immediately after data arrival, or if additional optimization is necessary to improve query performance.</p></li></ul></li></ul><h2>Conclusion</h2><p>Choosing between liquid clustering and partitioned Z-order tables depends on various factors including table size, write patterns, and query requirements. Always consider your specific use case and be prepared to test both approaches to determine the best fit for your data and query patterns. The right choice will significantly impact your query performance and overall data management efficiency.</p><h1><strong>Keep This Post Discoverable: Your Engagement Counts!</strong></h1><p>Your engagement with this blog post is crucial! Without claps, comments, or shares, this valuable content might become lost in the vast sea of online information. Search engines like Google rely on user engagement to determine the relevance and importance of web pages. If you found this information helpful, please take a moment to clap, comment, or share. Your action not only helps others discover this content but also ensures that you&#8217;ll be able to find it again in the future when you need it. 
Don&#8217;t let this resource disappear from search results &#8212; show your support and help keep quality content accessible!</p><h3>References</h3><ul><li><p><a href="https://docs.databricks.com/aws/en/delta/clustering">https://docs.databricks.com/aws/en/delta/clustering</a></p><div id="youtube2-yZmrpXJg-G8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;yZmrpXJg-G8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/yZmrpXJg-G8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li><li><p><a href="https://www.databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html">https://www.databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html</a></p></li><li><p><a href="https://docs.databricks.com/en/delta/clustering.html">https://docs.databricks.com/en/delta/clustering.html</a></p></li><li><p><a href="https://www.databricks.com/blog/announcing-general-availability-liquid-clustering">https://www.databricks.com/blog/announcing-general-availability-liquid-clustering</a></p></li><li><div id="youtube2-tEP7Nb-8JRg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;tEP7Nb-8JRg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/tEP7Nb-8JRg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div id="youtube2-LgLf0xgsaes" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;LgLf0xgsaes&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/LgLf0xgsaes?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div id="youtube2-CwJeKANlSLo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;CwJeKANlSLo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/CwJeKANlSLo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div id="youtube2-A1aR1A8OwOU" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;A1aR1A8OwOU&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/A1aR1A8OwOU?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li></ul>
a whole month trying to write the &#8220;perfect&#8221; single blog on Spark Structured Streaming Real-Time Mode&#8230; and then I accepted reality: it&#8217;s too much to cram into one post without turning it into a textbook. So this is a <strong>series</strong>.</p><p>In this first post, I&#8217;m not building a crypto demo. I&#8217;m building a pattern you can reuse for things that actually move the needle: fraud detection, IoT sensor monitoring, real-time offers, security signals&#8212;<strong>anything where you need to respond to events ASAP.</strong></p><p>The goal is simple:</p><blockquote><p><strong>When an event looks suspicious or invalid, flag it immediately and route it differently.</strong></p></blockquote><p>That &#8220;suspicious or invalid&#8221; could be:</p><ul><li><p><strong>Fraud detection:</strong> a transaction looks off &#8594; trigger a downstream action</p></li><li><p><strong>IoT:</strong> a sensor reading is impossible &#8594; trigger an action</p></li><li><p><strong>Security:</strong> payload contains secrets/PII patterns &#8594; quarantine in real time</p></li><li><p><strong>Offers/personalization:</strong> respond to specific events instantly</p></li></ul><p>For the dataset, I&#8217;m using Ethereum blocks because they&#8217;re high volume and behave like real production traffic. But the point isn&#8217;t crypto. The point is the operational pattern: <strong>real-time guardrails</strong>.</p><p>Concretely, I&#8217;m doing two checks on every block event:</p><ul><li><p><strong>Payload hygiene:</strong> flag suspicious strings in <code>extra_data</code> (think accidental secrets/PII-style patterns)</p></li><li><p><strong>Data quality:</strong> <code>gas_used &gt; gas_limit</code> (this should not happen&#8212;if it does, something is wrong)</p></li></ul><p>If any check trips, the event gets tagged <strong>QUARANTINE</strong>. Otherwise <strong>ALLOW</strong>.
The output is just an enriched Kafka event that downstream consumers can act on immediately.</p><div class="pullquote"><p>Also, I used <strong><a href="https://www.redpanda.com/sign-up">Redpanda</a></strong> to run Kafka because they make it ridiculously easy to spin up a cluster, and new signups get <strong>$100 in credits for 14 days</strong>. Not sponsored.<br>Redpanda, if you&#8217;re reading this: give me more credits. I have too many experiments.</p></div><p>If you know me, you know I always test things at scale; if it does not scale, then I don&#8217;t write about it.  I uploaded the full Ethereum chain into Kafka&#8212;about <strong>95 GB</strong> into <strong>4 partitions</strong>&#8212;roughly <strong>23 million messages</strong>. If you want my notebook that dumps data into Redpanda, drop a comment and I&#8217;ll share it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZXzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZXzA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 424w, https://substackcdn.com/image/fetch/$s_!ZXzA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 848w, https://substackcdn.com/image/fetch/$s_!ZXzA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZXzA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZXzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png" width="1431" height="550" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1431,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79752,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/183844527?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZXzA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 424w, https://substackcdn.com/image/fetch/$s_!ZXzA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZXzA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 1272w, https://substackcdn.com/image/fetch/$s_!ZXzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93abf208-2868-4dd5-af9b-8f842f7c55f5_1431x550.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why am I doing this?</h2><p>Because I&#8217;m tired of hearing:</p><blockquote><p>&#8220;Spark isn&#8217;t fast enough like Flink, so 
we need a whole new stack for this one use case.&#8221;</p></blockquote><p>After 10+ years in data engineering, one lesson keeps paying rent: <strong>maintainability beats shiny tools</strong> more often than people want to admit. Even if the new tool is 20% faster and I have the energy to learn it, that does not automatically mean my whole team should learn it too.</p><p>I&#8217;ve benchmarked Spark streaming enough to be confident about this: <a href="https://towardsdev.com/need-for-speed-benchmarking-the-best-tools-for-kafka-to-delta-ingestion-e1969121ed2e">if you can tolerate ~1&#8211;2 seconds</a>, Spark micro-batch will happily land data into Delta all day. I should redo that benchmark&#8212;last time it cost me <strong>$1,300</strong> (Confluent waived it, bless them). I&#8217;m here for sub-second latency, not <strong>sub-second &#8220;your card has been charged&#8221; notifications</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TqeN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TqeN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 424w, https://substackcdn.com/image/fetch/$s_!TqeN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 848w,
https://substackcdn.com/image/fetch/$s_!TqeN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 1272w, https://substackcdn.com/image/fetch/$s_!TqeN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TqeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png" width="623" height="431" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:623,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43726,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/183844527?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TqeN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 424w, 
https://substackcdn.com/image/fetch/$s_!TqeN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 848w, https://substackcdn.com/image/fetch/$s_!TqeN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 1272w, https://substackcdn.com/image/fetch/$s_!TqeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36408a16-1015-4fc1-b004-8dd32e9669f5_623x431.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now Spark is stepping into the <strong>sub-second</strong> territory with <strong>Real-Time Mode</strong>&#8212;a new trigger type designed for operational workloads that need immediate response, with end-to-end latency advertised as low as <strong>5 ms</strong> (Public Preview, DBR 16.4 LTS+). <a href="https://docs.databricks.com/aws/en/structured-streaming/real-time">Databricks Documentation</a></p><p>I don&#8217;t buy marketing. So I tested it.</p><h2>What we&#8217;re building: Operational Guardrail Stream </h2><p>Every incoming event gets evaluated immediately and we emit an enriched event downstream with:</p><ul><li><p>a <strong>decision</strong>: <code>ALLOW</code> vs <code>QUARANTINE</code></p></li><li><p><strong>reasons</strong>: why we flagged it (data quality, payload hygiene, etc.)</p></li></ul><p>This is the operational pattern that shows up everywhere:</p><ul><li><p>&#8220;Do I quarantine it?&#8221;</p></li><li><p>&#8220;Do I enrich it so downstream can react instantly?&#8221;</p></li></ul><p>In my dataset, the &#8220;event&#8221; is an Ethereum block. 
In your world, it could be a transaction, sensor reading, auth log, API call&#8212;same idea.</p><h4>The dataset and assumptions</h4><p>Source topic: <code>ethereum-blocks-ordered-global</code><br>Target topic: <code>topic-with-4-partitions</code></p><p>Assumptions:</p><ul><li><p>Kafka <code>value</code> is JSON</p></li><li><p>We parse it into a hardcoded schema so we have typed columns like <code>gas_used</code>, <code>gas_limit</code>, <code>timestamp</code>, etc.</p></li><li><p>We also keep <code>kafka_ts</code> (Kafka append timestamp) because for operational monitoring, arrival time matters.</p></li></ul><h2>What makes an event &#8220;bad&#8221; in this post</h2><p>I&#8217;m keeping the rules intentionally simple and high-signal:</p><h4>Rule 1: Payload hygiene check</h4><p>Scan <code>extra_data</code> for obvious &#8220;this shouldn&#8217;t be here&#8221; patterns.</p><p>In the blog code, I show basic examples (email/JWT/AWS key shapes). Replace these with your real rules (PII patterns, internal IDs, API keys, etc.).</p><p>The point isn&#8217;t regex perfection. The point is: <strong>real-time guardrails belong in the pipeline, not in a postmortem.</strong></p><h4>Rule 2: Bad data check</h4><p><code>gas_used &gt; gas_limit</code></p><p>This should not happen. 
If it happens, either:</p><ul><li><p>the data is corrupted,</p></li><li><p>the producer is wrong,</p></li><li><p>you&#8217;re parsing incorrectly,</p></li><li><p>or something upstream is broken.</p></li></ul><p>Operationally, that&#8217;s exactly what we want: <em>flag it immediately.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lr-O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lr-O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lr-O!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png" width="1200" height="670.054945054945" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:6239929,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/183844527?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lr-O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lr-O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dfa35e8-d8a0-4708-87e2-b663dba387cc_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Real-Time Mode: the tiny bit you need to know</h2><p>Real-Time Mode is enabled by using the real-time trigger and runs under update mode. In PySpark you pass a checkpoint interval, e.g. <code>.trigger(realTime="5 minutes")</code>; the interval controls how often progress is checkpointed, not how fast individual events flow.</p><p>Two important &#8220;don&#8217;t skip this&#8221; notes:</p><ol><li><p><strong>Cluster config matters</strong> (Databricks documents the required job cluster settings and the RTM enablement flag).</p></li><li><p><strong>Output mode must be</strong> <code>update</code> with RTM triggers.</p></li></ol><p>That&#8217;s all I&#8217;m going to say here, because this post is about the operational pattern.
I&#8217;ll do a deeper &#8220;RTM setup checklist&#8221; in the next post.</p><h2>The code: Real-time guardrail (Kafka &#8594; Spark RTM &#8594; Kafka)</h2><p>This is a single-pass pipeline:</p><ul><li><p>Connect to Kafka</p></li><li><p>Parse the Kafka input</p></li><li><p>Compute <code>decision</code> and <code>reasons</code></p></li><li><p>Write JSON back to Kafka <strong>as strings</strong> (no binary needed)</p></li></ul><h4>Imports &amp; Configuration</h4><pre><code><code>import json
import re
import uuid
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, 
    DoubleType, TimestampType, DateType
)

# -------------------------------------------------------------------------
# 1. CONFIGURATION
# -------------------------------------------------------------------------

# --- Kafka Connection Details ---
# Ideally, fetch these from secrets (e.g., dbutils.secrets.get)
BOOTSTRAP_SERVERS = "d5deqhbrcoacstishppg.any.us-west-2.mpx.prd.cloud.redpanda.com:9092"
SASL_MECHANISM = "SCRAM-SHA-256"
RP_USERNAME = "redpanda"
RP_PASSWORD = ""
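
# A sketch of the secrets-based approach mentioned above, assuming a
# Databricks workspace; the scope/key names here are placeholders:
# RP_USERNAME = dbutils.secrets.get(scope="redpanda", key="username")
# RP_PASSWORD = dbutils.secrets.get(scope="redpanda", key="password")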

# --- Kafka Options ---
RP_KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": BOOTSTRAP_SERVERS,
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": SASL_MECHANISM,
    "kafka.sasl.jaas.config": (
        'kafkashaded.org.apache.kafka.common.security.scram.ScramLoginModule required '
        f'username="{RP_USERNAME}" password="{RP_PASSWORD}";'
    ),
    "kafka.ssl.endpoint.identification.algorithm": "https",
}

# --- Job Settings ---
INPUT_TOPIC = "ethereum-blocks-ordered-global"
OUTPUT_TOPIC = "topic-with-4-partitions"
# NOTE: a fresh UUID per run means no recovery across restarts (handy for
# demos); pin this path in production so the stream can resume.
CHECKPOINT_LOCATION = f"/tmp/chk_rtm_stateless_guardrail_{uuid.uuid4()}"

# Lower shuffle partitions (the default of 200 is too high for low-latency use cases)
spark.conf.set("spark.sql.shuffle.partitions", "8")</code></code></pre><h4>Connect to Kafka</h4><pre><code><code># --- Step A: Read from Kafka ---
df_raw = (
    spark.readStream
    .format("kafka")
    .options(**RP_KAFKA_OPTIONS)
    .option("subscribe", INPUT_TOPIC)
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load()
)

#display(df_raw)</code></code></pre><div class="pullquote"><p><strong>Special note:</strong> <code>display()</code> is a special function that <strong>initiates the streaming query for you</strong>, allowing you to preview the live output. It kicks off the stream so you can see rows flowing without wiring up a full sink; you don&#8217;t even need to specify a checkpoint just to preview results. It&#8217;s perfect for quick debugging&#8212;just don&#8217;t confuse it with a production pipeline.</p></div><h4>Parse JSON Payload</h4><pre><code><code># -------------------------------------------------------------------------
# 2. SCHEMA DEFINITION
# -------------------------------------------------------------------------

block_schema = StructType([
    StructField("hash", StringType(), True),
    StructField("miner", StringType(), True),
    StructField("nonce", StringType(), True),
    StructField("number", LongType(), True),
    StructField("size", LongType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("total_difficulty", DoubleType(), True),
    StructField("base_fee_per_gas", LongType(), True),
    StructField("gas_limit", LongType(), True),
    StructField("gas_used", LongType(), True),
    StructField("extra_data", StringType(), True),
    StructField("logs_bloom", StringType(), True),
    StructField("parent_hash", StringType(), True),
    StructField("state_root", StringType(), True),
    StructField("receipts_root", StringType(), True),
    StructField("transactions_root", StringType(), True),
    StructField("sha3_uncles", StringType(), True),
    StructField("transaction_count", LongType(), True),
    StructField("date", DateType(), True),
    StructField("last_modified", TimestampType(), True),
])

# --- Step B: Parse JSON Payload ---
# We cast the binary 'value' to string, parse it, and flatten the struct
df_parsed = (
    df_raw
    .select(
        F.col("timestamp").alias("kafka_ts"),
        F.col("key").cast("string").alias("kafka_key"),
        F.col("value").cast("string").alias("value_str")
    )
    .withColumn("parsed", F.from_json(F.col("value_str"), block_schema))
    .where(F.col("parsed").isNotNull())  # Filter out malformed JSON
    .select(
        "kafka_ts", 
        "kafka_key",
        F.col("parsed.*")
    )
)</code></code></pre><h4>Compute <code>decision</code> and <code>reasons</code> (Your Custom Logic / Rules)</h4><pre><code><code># -------------------------------------------------------------------------
# 3. UDF DEFINITIONS
# -------------------------------------------------------------------------

# Pre-compile regex patterns for efficiency
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE)
JWT_RE   = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
AWS_RE   = re.compile(r"AKIA[0-9A-Z]{16}")

@F.udf("string")
def extra_data_reason_udf(extra_data: str) -&gt; str:
    """
    Scans 'extra_data' field for sensitive or suspicious patterns.
    Returns a reason code if a match is found, otherwise None.
    """
    if extra_data is None:
        return None
    if EMAIL_RE.search(extra_data):
        return "EXTRA_DATA_EMAIL"
    if JWT_RE.search(extra_data):
        return "EXTRA_DATA_JWT"
    if AWS_RE.search(extra_data):
        return "EXTRA_DATA_AWS_KEY"
    return None
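
# Alternative sketch (an assumption on my part, not required by RTM): the
# same checks as pure column expressions via rlike(), which avoids Python
# UDF serialization overhead -- worth considering at sub-second latencies.
# Kept commented out so the UDF above remains the active path:
# col_extra_reason_native = (
#     F.when(F.col("extra_data").rlike(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
#            F.lit("EXTRA_DATA_EMAIL"))
#      .when(F.col("extra_data").rlike(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
#            F.lit("EXTRA_DATA_JWT"))
#      .when(F.col("extra_data").rlike(r"AKIA[0-9A-Z]{16}"),
#            F.lit("EXTRA_DATA_AWS_KEY"))
# )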


# --- Step C: Enrich &amp; Apply Business Rules ---

# Rule 1: Gas Used cannot exceed Gas Limit
condition_bad_gas = (
    F.col("gas_used").isNotNull() &amp;
    F.col("gas_limit").isNotNull() &amp;
    (F.col("gas_used") &gt; F.col("gas_limit"))
)

# Rule 2: Check for suspicious patterns in extra_data
col_extra_reason = extra_data_reason_udf(F.col("extra_data"))

df_enriched = (
    df_parsed
    .withColumn("bad_gas", condition_bad_gas)
    .withColumn("extra_reason", col_extra_reason)
    # Collect all failure reasons into an array
    .withColumn(
        "reasons",
        F.expr("""
            filter(
                array(
                    case when bad_gas then 'BAD_GAS_USED_GT_LIMIT' end,
                    extra_reason
                ),
                x -&gt; x is not null
            )
        """)
    )
    # Determine final decision: Quarantine if any reasons exist
    .withColumn("is_quarantined", F.size(F.col("reasons")) &gt; 0)
    .withColumn(
        "decision",
        F.when(F.col("is_quarantined"), F.lit("QUARANTINE"))
         .otherwise(F.lit("ALLOW"))
    )
    # Prepare final output structure for Kafka (Key, Value JSON)
    .select(
        F.col("kafka_key").cast("binary").alias("key"),
        F.to_json(F.struct(
            F.col("kafka_ts"),
            F.col("number"),
            F.col("hash"),
            F.col("miner"),
            F.col("timestamp").alias("block_ts"),
            F.col("gas_used"),
            F.col("gas_limit"),
            F.col("decision"),
            F.col("is_quarantined"),
            F.col("reasons"),
            F.col("extra_data")
        )).alias("value")
    )
)</code></code></pre><h4>Write Back To Kafka</h4><pre><code><code># --- Step D: Write to Kafka (Real-Time Mode) ---
# The key highlight here is trigger(realTime="...")
query = (
    df_enriched.writeStream
    .format("kafka")
    .options(**RP_KAFKA_OPTIONS)
    .option("topic", OUTPUT_TOPIC)
    .option("checkpointLocation", CHECKPOINT_LOCATION)
    .queryName(f"rtm-stateless-guardrail-{OUTPUT_TOPIC}")
    .outputMode("update")
    # -----------------------------------------------------------------
    # REAL TIME MODE: Asynchronous checkpointing for lower latency
    # -----------------------------------------------------------------
    .trigger(realTime="1 minutes")
    .start()
)</code></code></pre><h2><strong>The Best Part? It&#8217;s Just the Flip of a Switch</strong></h2><p>Perhaps the most surprising aspect of Real-Time Mode is its remarkable ease of adoption for developers already familiar with Structured Streaming. Enabling this powerful new capability does not require a complex migration or a rewrite of existing code.</p><p>Instead, users can unlock millisecond-level latency by simply changing the trigger configuration in their existing query.</p><p>This seamless user experience is a critical feature. It means teams can prototype and productionize operational workloads without the massive overhead of learning, deploying, and managing an entirely separate technology stack. This drastically accelerates innovation and reduces the risk associated with adopting new real-time use cases.</p><h1>References</h1><p><a href="https://www.youtube.com/watch?v=sgUIcWwE8aQ&amp;t=473s"> Real-Time Mode Technical Deep Dive: How We Built Sub-300 Millisecond Streaming Into Apache Spark&#8482;</a></p><p><a href="https://www.youtube.com/watch?v=zGJvbV80FdU&amp;t=1376s">Delivering Sub-Second Latency for Operational Workloads on Databricks</a></p><p><a href="https://people.eecs.berkeley.edu/~matei/papers/2018/sigmod_structured_streaming.pdf">Structured Streaming Paper</a></p>]]></content:encoded></item><item><title><![CDATA[I Knew the Answer. 
I Just Couldn’t Remember It.]]></description><link>https://www.canadiandataguy.com/p/i-knew-the-answer-i-just-couldnt</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/i-knew-the-answer-i-just-couldnt</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Sat, 10 Jan 2026 16:31:05 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/184054134/15d24b541e6184f863d912df45b397f8.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I don&#8217;t use ChatGPT or Google to answer most technical questions people ask me. Here&#8217;s why: most people think AI advantage comes from picking the right model. But models are becoming commodities. The real advantage is whether your AI has <em>your</em> knowledge.</p><blockquote><p><em>You don&#8217;t forget things. You forget where your brain stored them.</em></p></blockquote><p>This kept happening: someone would ask, &#8220;How did you speed up merges when writing to N tables at once?&#8221; And I&#8217;d think&#8212;I know this. I&#8217;ve seen this. I probably even wrote about it. But I couldn&#8217;t recall where it was, and then spent 10 minutes finding the right piece.</p><p>The problem wasn&#8217;t a lack of knowledge. The problem was <strong>fragmentation</strong>. My knowledge lived in bookmarks, blogs, notes, documents, and videos. The information existed. 
Retrieval was broken.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VRbt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VRbt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 424w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 848w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 1272w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VRbt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png" width="1456" height="703" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5153288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/184054134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VRbt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 424w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 848w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 1272w, https://substackcdn.com/image/fetch/$s_!VRbt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe168a2-5693-429e-8e92-bcf928adeff3_3266x1576.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>My Old Stack: Why It Failed</strong></h2><p>For years I was paying for Notion. I was paying for Notion AI. And I was paying for a separate diagramming tool because I build a lot of decision trees and system diagrams.</p><p>Then I realized something uncomfortable: I was paying just to store my own thinking. Everything lived in different platforms. And I still had the same problem&#8212;when someone asked a question, I knew I had the answer somewhere. I just didn&#8217;t know where.</p><p>That&#8217;s when I knew I didn&#8217;t need more tools. <strong>I needed a recall layer.</strong></p><h2><strong>Discovering Obsidian</strong></h2><p>Then a colleague showed me Obsidian. If you don&#8217;t know it, it&#8217;s a free, local, markdown-based note tool. 
Out of curiosity, I looked up who builds it.</p><p>This surprised me: Obsidian is built and maintained by a team of less than 20 people&#8212;not a massive tech company. Which is amazing, because the tool feels extremely polished.</p><h3><strong>What I loved immediately:</strong></h3><ul><li><p>&#9670;<strong>Local:</strong> My notes are just Markdown files on my computer</p></li><li><p>&#9670;<strong>Markdown:</strong> Plain text, universally readable, version-controllable</p></li><li><p>&#9670;<strong>Mine:</strong> No lock-in. I own my data completely.</p></li></ul><h2><strong>The Simple / No Code Solution </strong></h2><p>Around the same time, I already knew Cursor could index everything inside a workspace. Then the idea clicked:</p><p><strong>What if Obsidian and Cursor pointed to the same folder?</strong></p><p>Obsidian manages my notes. Cursor indexes them. Suddenly Cursor stopped being just an IDE and became my <strong>personal knowledge agent</strong>. No training. No fine-tuning. Just my corpus.</p><div class="pullquote"><p>Hack: You can use Cursor to do the dirty work of organizing and cleaning your notes. 
</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YN33!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YN33!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 424w, https://substackcdn.com/image/fetch/$s_!YN33!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 848w, https://substackcdn.com/image/fetch/$s_!YN33!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 1272w, https://substackcdn.com/image/fetch/$s_!YN33!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YN33!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png" width="1456" height="721" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:721,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3843637,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/184054134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YN33!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 424w, https://substackcdn.com/image/fetch/$s_!YN33!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 848w, https://substackcdn.com/image/fetch/$s_!YN33!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 1272w, https://substackcdn.com/image/fetch/$s_!YN33!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecdbc07-4c32-4bec-bebc-8b6df769222c_3516x1740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>How I Feed the Agent</strong></h2><p>Then I started feeding it properly. All my daily work notes go into Obsidian. All my blogs go into Obsidian. I set up RSS feeds for sources I trust&#8212;Databricks blogs, Canadian Data Guy, and a few others.</p><p>And one extra trick: I manually downloaded transcripts from YouTube videos I really liked and stored them as notes. 
Now, this vault is a living, evolving record of my thoughts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1K15!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1K15!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 424w, https://substackcdn.com/image/fetch/$s_!1K15!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 848w, https://substackcdn.com/image/fetch/$s_!1K15!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 1272w, https://substackcdn.com/image/fetch/$s_!1K15!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1K15!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png" width="1456" height="894" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4540359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/184054134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1K15!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 424w, https://substackcdn.com/image/fetch/$s_!1K15!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 848w, https://substackcdn.com/image/fetch/$s_!1K15!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 1272w, https://substackcdn.com/image/fetch/$s_!1K15!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ed9e441-ce51-4cea-99e8-afd19fd0679b_3028x1860.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3><strong>Recommended Obsidian Plugins</strong></h3><ul><li><p><code>RSS Reader</code>&#8212; Auto-import articles from trusted sources</p></li><li><p><code>Templater</code>&#8212; Standardize note structures</p></li><li><p><code>Daily Notes</code>&#8212; Automatic daily journal creation</p></li></ul><h2><strong>Creative Things I Use the Cursor For</strong></h2><p>Cursor isn&#8217;t just helping me retrieve notes. This is where it gets really interesting. Because now that it understands my entire vault, I don&#8217;t just ask it questions&#8212;I use it to <strong>work on my knowledge with me</strong>.</p><h3><strong>Chaos &#8594; Structure</strong></h3><p>Sometimes I have a single document that&#8217;s just messy&#8212;bullet points, half sentences, random links. 
I highlight the file and say: &#8220;Turn these incomplete notes into a clean, structured explanation.&#8221; Cursor organizes it into sections and connects it to things I&#8217;ve written before.</p><h3><strong>Multi-Note Synthesis</strong></h3><p>I select three or four old notes from different months and say: &#8220;These are related. Combine them into one coherent technical explanation.&#8221; Cursor doesn&#8217;t just rewrite&#8212;it synthesizes. It connects things I forgot were even related.</p><h3><strong>Example Prompts</strong></h3><pre><code>&gt; &#8220;Look at this document and turn these incomplete notes into a clean explanation&#8221;
&gt; &#8220;These 4 files are related. Combine them into one coherent technical doc&#8221;
&gt; &#8220;How have my approaches to streaming cost optimization evolved over time?&#8221;
&gt; &#8220;What would I say about Delta Lake checkpoints based on my past notes?&#8221;
&gt; &#8220;Reflect my thinking back to me - what blind spots do you see?&#8221;</code></pre><h2><strong>What It&#8217;s Good At (And Not)</strong></h2><p>This is not a research engine. If someone asks me something completely outside my domain, I don&#8217;t pretend my agent knows. That&#8217;s a Google or ChatGPT question.</p><p>This solves a different problem. About 80% of the questions I get are: <em>&#8220;I know you&#8217;ve seen this before&#8230;&#8221;</em> That&#8217;s exactly what this is built for.</p><h3><strong>&#9889; Key Insight</strong></h3><p>The answers don&#8217;t sound like the internet. They sound like <em>me</em>. Because they&#8217;re grounded in my notes. This is the difference between generic AI and a personal knowledge agent.</p><h2><strong>Why This Matters</strong></h2><p>Not all knowledge belongs in public models. Your real thinking is messy. Evolving. Sometimes confidential.</p><p>Your agent needs to evolve with you. My notes change every day. So my agent changes every day.</p><blockquote><p><em>This doesn&#8217;t make me smarter. 
It makes me harder to forget.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rhlz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rhlz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 424w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 848w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 1272w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rhlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png" width="1456" height="786" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4735747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/184054134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rhlz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 424w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 848w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 1272w, https://substackcdn.com/image/fetch/$s_!rhlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c1b23-b96a-48a9-9fee-c4a7eebcad47_3180x1716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Going Deeper: Production Agents</strong></h2><p>This setup solves most of my daily problems. It helps me every single day.</p><p>But if you want something much more concrete&#8212;something deeper, something you might even share with others&#8212;then you should look at building agents with <strong><a href="https://www.databricksters.com/p/your-low-code-shortcut-to-production?r=5ehbt&amp;utm_campaign=post&amp;utm_medium=web">AgentBricks on Databricks</a></strong><a href="https://www.databricksters.com/p/your-low-code-shortcut-to-production?r=5ehbt&amp;utm_campaign=post&amp;utm_medium=web">.</a></p><p>That gives you a much more powerful, production-grade way to build agents. This personal setup doesn&#8217;t replace that. This is the precursor.</p><h5>Start here. 
Build your own local recall engine.</h5><p>And when you&#8217;re ready to go deeper, AgentBricks helps you scale it to other humans and agents.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:179205898,&quot;url&quot;:&quot;https://www.databricksters.com/p/your-low-code-shortcut-to-production&quot;,&quot;publication_id&quot;:3757239,&quot;publication_name&quot;:&quot;Databricksters&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!zPJJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png&quot;,&quot;title&quot;:&quot;Your Low-Code Shortcut to Production-Grade Agent on Databricks&quot;,&quot;truncated_body_text&quot;:&quot;The code base is available at https://github.com/jiteshsoni/BrickBrain, but you likely don&#8217;t need it. You can simply use the Databricks UI to create your agent.&quot;,&quot;date&quot;:&quot;2025-11-18T16:02:42.715Z&quot;,&quot;like_count&quot;:3,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:9073721,&quot;name&quot;:&quot;Canadian Data Guy&quot;,&quot;handle&quot;:&quot;canadiandataguy&quot;,&quot;previous_name&quot;:&quot;Jitesh Soni&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1441c9b8-d40b-4ac7-b91f-4260f55db017_2586x2586.jpeg&quot;,&quot;bio&quot;:null,&quot;profile_set_up_at&quot;:&quot;2025-04-13T04:00:41.958Z&quot;,&quot;reader_installed_at&quot;:&quot;2025-04-13T08:05:17.626Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:2962604,&quot;user_id&quot;:9073721,&quot;publication_id&quot;:2913897,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:2913897,&quot;name&quot;:&quot;Canadian Data Guy 
Unfiltered&quot;,&quot;subdomain&quot;:&quot;canadiandataguy&quot;,&quot;custom_domain&quot;:&quot;www.canadiandataguy.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Simplifying complex data concepts for everyone, without the buzzwords&#8212;elevating your game in your data journey!&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30cc7753-f8fb-4300-ac7f-1806e112a06a_1024x1024.png&quot;,&quot;author_id&quot;:9073721,&quot;primary_user_id&quot;:9073721,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2024-08-20T19:42:52.628Z&quot;,&quot;email_from_name&quot;:&quot;Canadian Data Guy&quot;,&quot;copyright&quot;:&quot;Canadian Data Guy&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;magaziney&quot;,&quot;is_personal_mode&quot;:false}},{&quot;id&quot;:3830653,&quot;user_id&quot;:9073721,&quot;publication_id&quot;:3757239,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:3757239,&quot;name&quot;:&quot;Databricksters&quot;,&quot;subdomain&quot;:&quot;databricksters&quot;,&quot;custom_domain&quot;:&quot;www.databricksters.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Sharing field learnings so you don't have 
to&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png&quot;,&quot;author_id&quot;:9073721,&quot;primary_user_id&quot;:62428265,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-01-14T18:12:17.311Z&quot;,&quot;email_from_name&quot;:&quot;Databricksters&quot;,&quot;copyright&quot;:&quot;Soni&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;magaziney&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null,&quot;paidPublicationIds&quot;:[],&quot;subscriber&quot;:null}},{&quot;id&quot;:11708054,&quot;name&quot;:&quot;Veena&quot;,&quot;handle&quot;:&quot;veenaramesh&quot;,&quot;previous_name&quot;:&quot;jason funderburker&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e215c6a-7de4-4e56-afe8-42f61c24d204_322x322.png&quot;,&quot;bio&quot;:&quot;Veena Ramesh | machine learning and AI @ Databricks and amateur artist. &quot;,&quot;profile_set_up_at&quot;:&quot;2022-03-27T17:41:05.161Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-03-30T01:20:38.078Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:3346453,&quot;user_id&quot;:11708054,&quot;publication_id&quot;:3284996,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:3284996,&quot;name&quot;:&quot;welcome to our swamp. 
&quot;,&quot;subdomain&quot;:&quot;toadmind&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;TOADMIND is a substack dedicated to our fantasy world. &quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc19c30c-d655-49ad-ace8-b4278bfcf3e1_828x828.png&quot;,&quot;author_id&quot;:11708054,&quot;primary_user_id&quot;:11708054,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2024-11-04T19:26:06.370Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;jason funderburker&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}},{&quot;id&quot;:3834873,&quot;user_id&quot;:11708054,&quot;publication_id&quot;:3757239,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:3757239,&quot;name&quot;:&quot;Databricksters&quot;,&quot;subdomain&quot;:&quot;databricksters&quot;,&quot;custom_domain&quot;:&quot;www.databricksters.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Sharing field learnings so you don't have 
to&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png&quot;,&quot;author_id&quot;:9073721,&quot;primary_user_id&quot;:62428265,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-01-14T18:12:17.311Z&quot;,&quot;email_from_name&quot;:&quot;Databricksters&quot;,&quot;copyright&quot;:&quot;Soni&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;magaziney&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null,&quot;paidPublicationIds&quot;:[],&quot;subscriber&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;podcast&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://www.databricksters.com/p/your-low-code-shortcut-to-production?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!zPJJ!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png" loading="lazy"><span class="embedded-post-publication-name">Databricksters</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title-icon"><svg width="19" height="19" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
  <path d="M3 18V12C3 9.61305 3.94821 7.32387 5.63604 5.63604C7.32387 3.94821 9.61305 3 12 3C14.3869 3 16.6761 3.94821 18.364 5.63604C20.0518 7.32387 21 9.61305 21 12V18" stroke-linecap="round" stroke-linejoin="round"></path>
  <path d="M21 19C21 19.5304 20.7893 20.0391 20.4142 20.4142C20.0391 20.7893 19.5304 21 19 21H18C17.4696 21 16.9609 20.7893 16.5858 20.4142C16.2107 20.0391 16 19.5304 16 19V16C16 15.4696 16.2107 14.9609 16.5858 14.5858C16.9609 14.2107 17.4696 14 18 14H21V19ZM3 19C3 19.5304 3.21071 20.0391 3.58579 20.4142C3.96086 20.7893 4.46957 21 5 21H6C6.53043 21 7.03914 20.7893 7.41421 20.4142C7.78929 20.0391 8 19.5304 8 19V16C8 15.4696 7.78929 14.9609 7.41421 14.5858C7.03914 14.2107 6.53043 14 6 14H3V19Z" stroke-linecap="round" stroke-linejoin="round"></path>
</svg></div><div class="embedded-post-title">Your Low-Code Shortcut to Production-Grade Agent on Databricks</div></div><div class="embedded-post-body">The code base is available at https://github.com/jiteshsoni/BrickBrain, but you likely don&#8217;t need it. You can simply use the Databricks UI to create your agent&#8230;</div><div class="embedded-post-cta-wrapper"><div class="embedded-post-cta-icon"><svg width="32" height="32" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg">
  <path classname="inner-triangle" d="M10 8L16 12L10 16V8Z" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"></path>
</svg></div><span class="embedded-post-cta">Listen now</span></div><div class="embedded-post-meta">5 months ago &#183; 3 likes &#183; Canadian Data Guy and Veena</div></a></div>]]></content:encoded></item><item><title><![CDATA[4 Surprising Truths That Will Change How You Think About Spark Streaming]]></title><description><![CDATA[Spark gives you Real-Time without the complexity and pain]]></description><link>https://www.canadiandataguy.com/p/4-surprising-truths-that-will-change</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/4-surprising-truths-that-will-change</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Mon, 15 Dec 2025 15:13:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/snJs2DlzA0o" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR</h2><ul><li><p>Spark now competes with Flink on real&#8209;time: Real&#8209;Time mode achieves double&#8209;digit millisecond latency; think 20 ms </p></li><li><p>One engine, one API: Batch, near&#8209;real&#8209;time, and true real&#8209;time in the same Spark paradigm.</p></li><li><p>Simplicity at scale: Checkpointing, fault tolerance, exactly&#8209;once semantics built in.</p></li><li><p>Real&#8209;time without friction: No second system or new programming model required.</p></li></ul><h2>Four Counter&#8209;Intuitive Truths</h2><ol><li><p><strong>Real-time without a new paradigm</strong></p><p>You don&#8217;t need a separate engine + a separate mental model. Same Spark APIs, same ecosystem, same team skillset.</p></li><li><p><strong>Real-time to hourly, depending on the business need</strong></p><p>Streaming doesn&#8217;t mean 24/7.  Spark Streaming is incremental, not perpetual. Choose the schedule your business needs&#8212;continuous, 15&#8209;minute, hourly, or weekly. triggerAvailableNow kicks off, processes all new data in a single efficient run, and exits. 
You get batch&#8209;style cost control with streaming&#8209;grade correctness.</p></li><li><p><strong>Checkpointing changes the game operationally</strong></p><p>Build batch with the streaming paradigm. Design batch pipelines as streaming from day one to avoid rewrites when SLAs tighten. Going from daily to every four hours can be a small code change, not an architectural overhaul. You also drop brittle input parameters (e.g., process_date): Spark tracks progress, so engineers focus on logic, not bookkeeping. Streaming can even be cheaper than batch: late or frequently updated data punishes batch, whereas checkpointing persists a &#8220;bookmark&#8221; so Spark processes only net&#8209;new changes and skips what it has already seen.</p><p><em>&#8220;Checkpointing says: you don&#8217;t worry about what&#8217;s net new; I&#8217;ll identify what&#8217;s net new and only process that.&#8221;</em></p></li><li><p><strong>Latency is a business decision, not an ego metric</strong><br>The learning curve is flatter than you think: Spark&#8217;s unified API means you reuse the same DataFrame/Dataset logic for batch and streaming. The shift is incremental: same engine, same abstractions, different triggers and sinks. 
Teams extend what they know instead of adopting a second framework.</p></li></ol><h2>Conclusion </h2><p>If this blog has reshaped how you think about Spark Streaming, the YouTube session demonstrates <strong>how it actually works in practice</strong>.</p><p>In the video, I go beyond concepts and walk through:</p><ul><li><p>How Spark Structured Streaming achieves <strong>double-digit millisecond latency</strong> in real deployments</p></li><li><p>Kafka &#8594; Delta ingestion patterns that teams run in production</p></li><li><p>How checkpointing simplifies operations and reduces cost compared to batch reprocessing</p></li><li><p>When to use continuous mode vs triggerAvailableNow vs micro-batch &#8212; and why this is a <strong>business decision</strong>, not a technical flex</p></li></ul><p>This isn&#8217;t a theoretical take. It&#8217;s based on <strong>shipping streaming workloads to production week after week at Databricks</strong>, dealing with real SLAs, real failures, and real cost pressure.</p><p>Watch the full walkthrough here:<br>&#128073; </p><div id="youtube2-snJs2DlzA0o" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;snJs2DlzA0o&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/snJs2DlzA0o?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why I Materialize Delta History for Debugging]]></title><description><![CDATA[Just a Quick Tip]]></description><link>https://www.canadiandataguy.com/p/why-i-materialize-delta-history-for</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/why-i-materialize-delta-history-for</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 27 Nov 2025 22:36:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IxOb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I&#8217;m debugging a Delta table with millions of commits &#8212; especially tables with <strong>heavy ingestion</strong> and lots of parquet files &#8212; I often need to trace a specific record back to:</p><ul><li><p>which <strong>commit</strong> wrote it</p></li><li><p>which <strong>job wrote it (Job ID, Job Run ID)</strong></p></li><li><p>which <strong>operation</strong> triggered that write</p><p></p></li></ul><p><code>DESCRIBE HISTORY</code> gives you this metadata, but on large tables it can be slow, and running it repeatedly while investigating a bug quickly becomes painful.</p><p></p><p>The practical workaround is to <strong>dump the entire history once</strong> into a physical table.<br>From there, you can filter, join, and slice it instantly &#8212; without re-scanning the entire Delta log on every query.</p><h3><strong>One-Time Dump of 
Delta Table History</strong></h3><pre><code><code>CREATE TABLE IF NOT EXISTS databricks_support.default.describe_history__your_table_name AS
SELECT *
FROM (
    DESCRIBE HISTORY your_catalog_name.your_database_name.your_table_name
);
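
-- Hedged follow-up example: once the history is materialized, you can trace
-- writes without re-reading the Delta log. Column names below follow the
-- standard DESCRIBE HISTORY output schema (version, timestamp, operation,
-- job, operationMetrics); adjust to your platform version if they differ.
SELECT version, timestamp, operation, job, operationMetrics
FROM databricks_support.default.describe_history__your_table_name
WHERE operation IN ('WRITE', 'MERGE', 'STREAMING UPDATE')
ORDER BY version DESC;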
</code></code></pre><p>For deep debugging (record &#8594; parquet file &#8594; commit lineage), this table becomes a fast, queryable audit log.<br>In practice, this works best when run from a <strong>notebook</strong>, where long-running metadata operations are less fragile.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IxOb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IxOb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1441462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/180138751?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IxOb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I also have a script that can identify which row is written in which Parquet file by which commit; drop me a comment if you need it.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Stop Waiting for Connectors: Stream ANYTHING into Spark (It's 4 Functions)]]></title><description><![CDATA[Listen now | How to ingest data from any source into Apache Spark &#8212; demystified with real-world example of BlockChain Ingestion]]></description><link>https://www.canadiandataguy.com/p/stop-waiting-for-connectors-stream</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/stop-waiting-for-connectors-stream</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Mon, 03 Nov 2025 17:24:45 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/177861173/2ee733487ca1a5ca414a57c4cede2c92.mp3" 
length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h3>&#128161; What You&#8217;ll Learn</h3><p>By the end of this guide, you&#8217;ll understand that building a custom Spark streaming source isn&#8217;t rocket science. It&#8217;s actually a well-defined conversation between Spark and your code, with just <strong>5 key methods</strong> to implement (only 4 are required; the fifth, <code>commit()</code>, is optional). We&#8217;ll use a real Ethereum blockchain streaming example to show you exactly how it works.</p><h2>The Problem: You Have Data, Spark Wants It</h2><p>You&#8217;ve got data streaming in from somewhere unique &#8212; maybe it&#8217;s IoT sensors, a blockchain, a custom message queue, or an internal database. You want to process it with Spark&#8217;s powerful distributed engine, but there&#8217;s no pre-built connector. What do you do?</p><p>The good news: <strong>You can build your own custom source</strong>. The even better news: <strong>It&#8217;s simpler than you think</strong>.</p><div class="pullquote"><p><strong>Real-World Use Case:</strong> In this guide, we&#8217;ll walk through streaming Ethereum blockchain data into Spark. The same principles apply to any data source &#8212; from proprietary APIs to custom databases. 
The pattern is universal.</p></div><h2>The Secret: It&#8217;s Just a Conversation</h2><p>Think of building a custom Spark streaming source as a conversation between two specialists:</p><p><strong>The Two Characters in Our Story</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qNYK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qNYK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 424w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 848w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1272w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qNYK!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png" width="1200" height="561.2167300380228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:369,&quot;width&quot;:789,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:187528,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qNYK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 424w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 848w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1272w, https://substackcdn.com/image/fetch/$s_!qNYK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16ac2a20-a2f0-4a5a-9210-7c50b282584e_789x369.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Spark&#8217;s job</strong> (the Project Manager) is to handle all the complex distributed computing stuff: checkpointing, fault tolerance, distributing work across a cluster, and guaranteeing exactly-once processing semantics.</p><p><strong>Your code&#8217;s job</strong> (the Data Specialist) is much simpler: answer Spark&#8217;s questions about where your data is, how to access it, and how to break it into chunks that can be processed in parallel.</p><div class="pullquote"><p><strong>&#127919; Key Insight:</strong> You don&#8217;t need to understand distributed systems, fault tolerance algorithms, or checkpoint mechanisms. 
You just need to implement 5 simple methods that answer Spark&#8217;s questions about your data source.</p></div><h2>The 5 Questions Spark Will Ask You</h2><p>Spark&#8217;s conversation with your code follows a predictable pattern. It asks 5 questions, and you provide straightforward answers. Let&#8217;s look at each one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LND2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LND2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 424w, https://substackcdn.com/image/fetch/$s_!LND2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 848w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1272w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png" width="828" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130242,&quot;alt&quot;:&quot;1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks.&quot;,&quot;title&quot;:&quot;1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. 
Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks." 
title="1initialOffset() &#8212; &#8220;Where do we start?&#8221;Spark asks: &#8220;This is a brand new query. Where should I begin reading?&#8221;You answer: {&#8221;offset&#8221;: 1000} &#8212; &#8220;Start at block 1000&#8221;2latestOffset() &#8212; &#8220;What&#8217;s the newest data?&#8221;Spark asks: &#8220;What&#8217;s the most recent data available right now?&#8221;You answer: {&#8221;offset&#8221;: 1100} &#8212; &#8220;Latest block is 1100&#8221;3partitions() &#8212; &#8220;How do we split this work?&#8221;Spark asks: &#8220;We need to process blocks 1000-1100. Break this into parallel chunks&#8221;You answer: [Partition(1000-1025), Partition(1025-1050), ...] &#8212; &#8220;4 chunks of 25 blocks&#8221;4read() &#8212; &#8220;Fetch the data!&#8221;Spark tells each worker: &#8220;Here&#8217;s your chunk. Go get the actual data&#8221;You fetch: Loop through blocks 1000-1025, fetch each one, yield Row objects5commit() &#8212; &#8220;All done!&#8221; [optional]checkpoint/commit/{N} file created. Optional method for cleanup tasks." 
srcset="https://substackcdn.com/image/fetch/$s_!LND2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 424w, https://substackcdn.com/image/fetch/$s_!LND2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 848w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1272w, https://substackcdn.com/image/fetch/$s_!LND2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403de673-2c13-4cba-bd3a-5fabd03f3d14_828x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Let&#8217;s See Real Code: Streaming Ethereum Blocks</h2><p>Theory is great, but let&#8217;s look at actual implementation. Here&#8217;s how these 5 methods work in practice for streaming Ethereum blockchain data:</p><h3><strong>1.</strong> initialOffset() &#8212; Setting the Starting Point</h3><pre><code><code>def initialOffset(self) -&gt; dict:
    """
    Called ONCE when starting a brand new query.
    Return where to begin reading.
    """
    start_block = self.options.get("start_block", 0)
    return {"offset": int(start_block)}
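
```python
# A quick aside on what Spark does with this return value (a sketch, not the
# engine's actual code): the offset dict is JSON-serialized into the
# checkpoint's offsets/ directory, so it should contain only JSON-friendly
# values such as ints and strings.
import json

offset = {"offset": 1000}
serialized = json.dumps(offset)    # roughly what lands in checkpoint/offsets/{N}
restored = json.loads(serialized)  # roughly what Spark hands back after a restart
```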
</code></code></pre><p>That&#8217;s it! Just return a dictionary with your starting position. Spark saves this and uses it as the baseline for the entire query lifecycle.</p><h3><strong>2.</strong> latestOffset() &#8212; Checking What&#8217;s Available</h3><pre><code><code>def latestOffset(self) -&gt; dict:
    """
    Called at the START of every batch.
    Connect to your source and return the newest available data.
    """
    latest_block = self.w3.eth.block_number
    return {"offset": int(latest_block)}
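
```python
# Spark pairs the last committed offset with this latest offset to form the
# half-open batch range [start, end). A tiny sketch of that arithmetic with
# the numbers used in this article:
start = {"offset": 1000}  # where the previous batch ended
end = {"offset": 1100}    # what latestOffset() just returned
blocks_in_batch = end["offset"] - start["offset"]  # block 1100 itself is excluded
```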
</code></code></pre><p>This method connects to your data source (in this case, an Ethereum node) and asks &#8220;what&#8217;s the latest?&#8221; The answer defines the upper bound for the current batch.</p><blockquote><p><strong>&#9888;&#65039; Python API Limitation:</strong> In PySpark, <code>latestOffset()</code> must return the absolute latest data point. If you&#8217;re backfilling from very old data, your first batch could be huge. The Scala API offers more fine-grained control here, but for most real-time use cases, the Python API works perfectly.<br><br><strong>&#128221; Note:</strong> This limitation is actively being addressed - there&#8217;s currently a pull request in progress to fix this in Spark.</p></blockquote><h3><strong>3.</strong> partitions() &#8212; Dividing the Work</h3><pre><code><code>def partitions(self, start: dict, end: dict) -&gt; list:
    """
    Spark gives you a range (start &#8594; end).
    You break it into smaller chunks for parallel processing.
    """
    start_block = start["offset"]
    end_block = end["offset"]  # This is EXCLUSIVE (not included)
    
    num_partitions = self.spark.conf.get("spark.sql.shuffle.partitions", "4")
    blocks_per_partition = (end_block - start_block) // int(num_partitions)
    
    partitions = []
    for i in range(int(num_partitions)):
        partition_start = start_block + (i * blocks_per_partition)
        partition_end = partition_start + blocks_per_partition
        if i == int(num_partitions) - 1:  # Last partition gets any remainder
            partition_end = end_block
            
        partitions.append(BlockRangePartition(partition_start, partition_end))
    
    return partitions
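
```python
# BlockRangePartition is referenced above but never defined in the article.
# A minimal stand-in (assumption: in the real PySpark Python data source API
# it would subclass pyspark.sql.datasource.InputPartition and just needs to
# be picklable):
class BlockRangePartition:
    def __init__(self, start_block, end_block):
        self.start_block = start_block  # inclusive
        self.end_block = end_block      # exclusive

# Sanity-checking the splitting logic above as a pure function: four
# partitions over [1000, 1100) tile the range with no gaps and no overlaps.
def split_range(start_block, end_block, num_partitions=4):
    per = (end_block - start_block) // num_partitions
    parts = []
    for i in range(num_partitions):
        lo = start_block + i * per
        hi = end_block if i == num_partitions - 1 else lo + per
        parts.append(BlockRangePartition(lo, hi))
    return parts

parts = split_range(1000, 1100)
```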
</code></code></pre><p><strong>How Partitioning Works</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nC1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nC1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 424w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 848w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1272w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png" width="778" height="344" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0236f5-7080-4243-be23-b427bd18fd86_778x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:778,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43967,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nC1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 424w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 848w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1272w, https://substackcdn.com/image/fetch/$s_!nC1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0236f5-7080-4243-be23-b427bd18fd86_778x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>&#128273; Critical Detail:</strong> Notice that the end block (1100) is <strong>exclusive</strong>. This means partition ranges are [1000, 1025), [1025, 1050), etc. Block 1100 is NOT processed&#8212;it becomes the start of the next batch. This [start, end) pattern is how Spark guarantees no data is ever processed twice.</p></blockquote><h3><strong>4.</strong> read() &#8212; Actually Fetching the Data</h3><pre><code><code>def read(self, partition: BlockRangePartition):
    """
    This runs on EXECUTOR nodes (distributed across the cluster).
    Each executor gets one partition and must fetch its assigned data.
    
    Must be DETERMINISTIC - same input = same output, every time.
    This allows Spark to safely retry failed tasks.
    """
    for block_number in range(partition.start_block, partition.end_block):
        # Connect to Ethereum and fetch this specific block
        block = self.w3.eth.get_block(block_number, full_transactions=True)
        
        # Convert to Spark Row format
        yield Row(
            block_number=block.number,
            block_hash=block.hash.hex(),
            timestamp=block.timestamp,
            transaction_count=len(block.transactions),
            # ... more fields ...
        )
</code></code></pre><p>This is where the real work happens! Each executor in your cluster runs this method for its assigned partition, fetching the actual data.</p><div class="pullquote"><p><strong>&#128170; The Power of Parallelism:</strong> If you have 10 executors and create 100 partitions, all 10 executors work simultaneously. Each one processes its chunk, and as executors finish, Spark automatically assigns them new partitions. This is how Spark achieves massive throughput.</p></div><h3><strong>5.</strong> commit() &#8212; Cleanup (Usually Empty)</h3><pre><code><code>def commit(self, end: dict):
    """
    Called AFTER all partitions successfully complete.
    The checkpoint/commit/{N} file gets created at this point.
    This method is optional - mainly used for cleanup tasks.
    """
    pass  # Usually empty unless you need cleanup
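
```python
# Putting the five methods together: a self-contained sketch of the same
# lifecycle over a fake "counter" source (all names here are illustrative;
# a real reader would subclass pyspark.sql.datasource.DataSourceStreamReader
# and return InputPartition objects instead of tuples).
class CounterStreamReader:
    def __init__(self, step=10):
        self._latest = 0
        self._step = step

    def initialOffset(self):
        return {"offset": 0}                       # brand-new query starts at 0

    def latestOffset(self):
        self._latest += self._step                 # pretend new data arrived
        return {"offset": self._latest}

    def partitions(self, start, end):
        return [(start["offset"], end["offset"])]  # one chunk for the whole range

    def read(self, partition):
        lo, hi = partition
        for n in range(lo, hi):                    # hi is exclusive
            yield (n,)

    def commit(self, end):
        pass                                       # nothing to clean up
```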
</code></code></pre><p>In most cases, this method is empty. The checkpoint/commit/{N} file gets created automatically. You only need to implement this if you have cleanup tasks to perform after a batch completes.</p><h2>The Complete Flow: Visual Walkthrough</h2><p>Now let&#8217;s see how these methods work together in a complete streaming query:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rHAT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rHAT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 424w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 848w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 1272w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rHAT!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png" width="1200" height="901.3392857142857" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b14386f-dc2b-4334-8310-c8aab15f34c7_896x673.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:673,&quot;width&quot;:896,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:128772,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14386f-dc2b-4334-8310-c8aab15f34c7_896x673.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rHAT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 424w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 848w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 1272w, https://substackcdn.com/image/fetch/$s_!rHAT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faabc703d-d9e3-4c55-b900-a70ed91e0bb1_896x673.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why This Design Is Brilliant</h2><h4>&#128737;&#65039; Fault Tolerance</h4><p>If an executor fails while reading blocks 1025-1050, Spark simply restarts that task on another machine. Because <code>read()</code> is deterministic, it fetches exactly the same data again. The user never knows a failure occurred.</p><h4>&#9889; Exactly-Once Semantics</h4><p>The [start, end) exclusive range pattern means no block is ever processed twice. Block 1100 is the start of the next batch, not the end of the previous one. 
Combined with checkpointing, this guarantees exactly-once processing.</p><h4>&#128640; Massive Parallelism</h4><p>By implementing <code>partitions()</code>, you tell Spark how to break work into chunks. Spark handles distributing those chunks to hundreds or thousands of executors. You get massive scale &#8220;for free.&#8221;</p><h4>&#129513; Separation of Concerns</h4><p>You focus on <em>your data source&#8217;s logic</em>. Spark handles scheduling, distribution, checkpointing, fault recovery, and coordination. Clean boundaries make complex systems manageable.</p><h2>What About Edge Cases?</h2><h3>Handling Source Failures</h3><p>What if the Ethereum node goes down during <code>read()</code>?</p><pre><code><code>def read(self, partition: BlockRangePartition):
    max_retries = 3
    for block_number in range(partition.start_block, partition.end_block):
        for attempt in range(max_retries):
            try:
                block = self.w3.eth.get_block(block_number, full_transactions=True)
                yield Row(...)
                break  # Success!
            except Exception as e:
                if attempt == max_retries - 1:
                    raise  # Let Spark handle the failure
                time.sleep(2 ** attempt)  # Exponential backoff
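
```python
# The backoff schedule above, spelled out (assumes "import time" at the top
# of the module): after each failed attempt except the last, the reader
# sleeps 2 ** attempt seconds, so with max_retries = 3 the waits are:
max_retries = 3
waits = [2 ** attempt for attempt in range(max_retries - 1)]  # 1s, then 2s
```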
</code></code></pre><p>If retries don&#8217;t work, the exception bubbles up, Spark marks the task as failed, and restarts it on another executor. Eventually the source recovers and processing continues from the checkpoint.</p><h3>Dealing with Large Batches</h3><p>What if <code>latestOffset()</code> returns a huge number?</p><blockquote><p><strong>The Golden Rule:</strong> Your processing rate should be greater than your input rate. Ideally, aim for <strong>10x faster processing than data arrival</strong>. This is the key design principle.<br><br>If you&#8217;re processing data faster than it&#8217;s arriving, Spark will naturally catch up with any backfill over the next few batches. You don&#8217;t need to worry about temporarily large batch sizes.<br><br><strong>About spark.sql.shuffle.partitions:</strong> You can adjust this, but don&#8217;t set it to an extremely high number. A reasonable partition count is sufficient as long as your processing rate exceeds your input rate.</p></blockquote><h3>Ensuring Determinism in read()</h3><p>The golden rule: <strong>Same partition input must produce same output</strong>.</p><p>Bad (non-deterministic):</p><pre><code><code># &#10060; DON&#8217;T DO THIS
def read(self, partition):
    current_time = time.time()  # Different each time!
    yield Row(timestamp=current_time, ...)
</code></code></pre><p>Good (deterministic):</p><pre><code><code># &#9989; DO THIS
def read(self, partition):
    block = self.w3.eth.get_block(partition.block_number)
    yield Row(timestamp=block.timestamp, ...)  # Block timestamp is consistent
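
```python
# The retry-safety property in miniature: reading the same partition twice
# must return identical rows. A toy, deterministic fetch (hypothetical
# helper, not part of the article's source):
def fetch_rows(partition):
    lo, hi = partition
    return [(n, n * 2) for n in range(lo, hi)]

first = fetch_rows((0, 5))
second = fetch_rows((0, 5))  # a retried task sees exactly the same data
```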
</code></code></pre><h2>The Complete Picture: Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!73zt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!73zt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 424w, https://substackcdn.com/image/fetch/$s_!73zt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 848w, https://substackcdn.com/image/fetch/$s_!73zt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 1272w, https://substackcdn.com/image/fetch/$s_!73zt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!73zt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png" width="826" height="541" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104672,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!73zt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 424w, https://substackcdn.com/image/fetch/$s_!73zt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 848w, https://substackcdn.com/image/fetch/$s_!73zt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 1272w, https://substackcdn.com/image/fetch/$s_!73zt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813ddac6-37ae-4fc8-892e-b7935028e236_826x541.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#127919; You&#8217;re Ready to Build Your Own!</h2><blockquote><p>You now understand the complete lifecycle of a custom Spark streaming source. 
It&#8217;s not magic&#8212;it&#8217;s a well-designed conversation between Spark and your code.</p><p>Just implement 5 methods, and Spark handles the rest: fault tolerance, distribution, checkpointing, and exactly-once semantics.</p></blockquote><h2>Quick Reference: The 5 Methods</h2><p><strong>Your Implementation Checklist</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sR-g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sR-g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 424w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 848w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 1272w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sR-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png" width="869" height="575" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77932,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/177539932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sR-g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 424w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 848w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 1272w, https://substackcdn.com/image/fetch/$s_!sR-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea27c0b-cd3e-41f3-8c2b-e3bcb2c3f690_869x575.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Final Thoughts: Why This Matters</h2><p>The beauty of this architecture is its universality. Whether you&#8217;re streaming from Ethereum, MongoDB, a proprietary API, or carrier pigeons &#128038;, the pattern is the same:</p><ol><li><p><strong>Define where to start</strong> (<code>initialOffset</code>)</p></li><li><p><strong>Check what&#8217;s new</strong> (<code>latestOffset</code>)</p></li><li><p><strong>Break work into chunks</strong> (<code>partitions</code>)</p></li><li><p><strong>Fetch the data</strong> (<code>read</code>)</p></li><li><p><strong>Confirm completion</strong> (<code>commit</code>)</p></li></ol><p>Spark handles everything else&#8212;checkpointing, distribution, scheduling, fault recovery. 
You just focus on the specifics of your data source.</p><h3>&#128640; Take Action</h3><div class="pullquote"><p>The barrier to entry is lower than you thought. Pick a data source you&#8217;re working with, implement these 5 methods, and you&#8217;ll have a production-ready Spark streaming source in an afternoon.</p><p><strong>Start small:</strong> Get <code>initialOffset()</code> and <code>latestOffset()</code> working first. Then add <code>partitions()</code> and <code>read()</code>. Test with a single partition before scaling up. You&#8217;ve got this! &#128170;</p></div><p><strong>Now go build something amazing with Spark Streaming. The data world is your oyster. &#127754;</strong></p><h2><a href="https://github.com/jiteshsoni/ethereum-streaming-pipeline/blob/6e06cdea573780ba09a33a334f7f07539721b85e/ethereum_block_stream_chainstack.py">Download the code</a></h2>]]></content:encoded></item><item><title><![CDATA[How to write your first Spark application with Stream-Stream Joins with working code]]></title><description><![CDATA[A Practical, Hands-On Guide to Joining Real-Time Data Streams in Spark Structured Streaming]]></description><link>https://www.canadiandataguy.com/p/how-to-write-your-first-spark-application-c23</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-to-write-your-first-spark-application-c23</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Wed, 15 Oct 2025 17:39:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lVsP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you been waiting to try Streaming but cannot take the plunge?</p><p>In a single blog, we will teach you whatever needs to be understood about Streaming Joins. 
We will give you working code which you can use for your next Streaming Pipeline.</p><p>The steps involved:</p><ol><li><p>Create a fake dataset at scale</p></li><li><p>Set a baseline using traditional SQL</p></li><li><p>Define Temporary Streaming Views</p></li><li><p>Inner Joins with optional Watermarking</p></li><li><p>Left Joins with Watermarking</p></li><li><p>The cold start edge case: withEventTimeOrder</p></li><li><p>Cleanup</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVsP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lVsP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!lVsP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lVsP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lVsP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lVsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1505856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/176255602?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lVsP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!lVsP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lVsP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lVsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54172092-ba0b-49db-b7ae-033b9bef0640_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>What is Stream-Stream Join?</strong></h2><p>Stream-stream join is a widely used operation in stream processing where two or more data streams are joined based on some common attributes or keys. It is essential in several use cases, such as real-time analytics, fraud detection, and IoT data processing.</p><h3><strong>Concept of Stream-Stream Join</strong></h3><p>Stream-stream join combines two or more streams based on a common attribute or key. The join operation is performed on an ongoing basis, with each new data item from the stream triggering a join operation. In stream-stream join, each data item in the stream is treated as an event, and it is matched with the corresponding event from the other stream based on matching criteria. This matching criterion could be a common attribute or key in both streams.</p><p>When it comes to joining data streams, there are a few key challenges that must be addressed to ensure successful results. One of the biggest hurdles is the fact that, at any given moment, neither stream has a complete view of the dataset. This can make it difficult to find matches between inputs and generate accurate join results.</p><p>To overcome this challenge, it&#8217;s important to buffer past input as a streaming state for both input streams. This allows for every future input to be matched with past input, which can help to generate more accurate join results. 
Additionally, this buffering process can help to automatically handle late or out-of-order data, which can be common in streaming environments.</p><p>To further optimize the join process, it&#8217;s also important to use watermarks to limit the state. This can help to ensure that only the most relevant data is being used to generate join results, which can help to improve accuracy and reduce processing times.</p><h3><strong>Types of Stream-Stream Join</strong></h3><p>Depending on the nature of the join and the matching criteria, there are several types of stream-stream join operations. Some of the popular types of stream-stream join are:</p><p><strong>Inner Join</strong><br>In inner join, only those events are returned where there is a match in both the input streams. This type of join is useful when combining the data from two streams with a common key or attribute.</p><p><strong>Outer Join</strong><br>In outer join, all events from both the input streams are included in the joined stream, whether or not there is a match between them. This type of join is useful when we need to combine data from two streams, and there may be missing or incomplete data in either stream.</p><p><strong>Left Join</strong><br>In left join, all events from the left input stream are included in the joined stream, and only the matching events from the right input stream are included. This type of join is useful when we need to combine data from two streams and keep all the data from the left stream, even if there is no matching data in the right stream.</p><h2><strong>1. The Setup: Create a fake dataset at scale</strong></h2><p>Most people do not have 2 streams just hanging around for one to experiment with Stream Steam Joins. Thus I used Faker to mock 2 different streams which we will use for this example.</p><p>The name of the library being used is Faker and faker_vehicle to create Datasets.</p><pre><code>!pip install faker_vehicle
!pip install faker</code></pre><p>Imports</p><pre><code>from faker import Faker
from faker_vehicle import VehicleProvider
from pyspark.sql import functions as F
import uuid
</code></pre><p>Parameters</p><pre><code># define the schema name and where the table should be stored
schema_name = "test_streaming_joins"
schema_storage_location = "/tmp/CHOOSE_A_PERMANENT_LOCATION/"</code></pre><p><strong>Create the Target Schema/Database</strong><br>Create a Schema and set its location. This way, all tables would inherit the base location.</p><pre><code>create_schema_sql = f"""
 CREATE SCHEMA IF NOT EXISTS {schema_name}
 COMMENT 'This is {schema_name} schema'
 LOCATION '{schema_storage_location}'
 WITH DBPROPERTIES ( Owner='Jitesh');
 """
print(f"create_schema_sql: {create_schema_sql}")
spark.sql(create_schema_sql)</code></pre><p>Use Faker to define functions to help generate fake column values</p><pre><code>fake = Faker()
fake.add_provider(VehicleProvider)</code></pre><pre><code>event_id = F.udf(lambda: str(uuid.uuid4()))
vehicle_year_make_model = F.udf(fake.vehicle_year_make_model)
vehicle_year_make_model_cat = F.udf(fake.vehicle_year_make_model_cat)
vehicle_make_model = F.udf(fake.vehicle_make_model)
vehicle_make = F.udf(fake.vehicle_make)
vehicle_model = F.udf(fake.vehicle_model)
vehicle_year = F.udf(fake.vehicle_year)
vehicle_category = F.udf(fake.vehicle_category)
vehicle_object = F.udf(fake.vehicle_object)</code></pre><pre><code>latitude = F.udf(fake.latitude)
longitude = F.udf(fake.longitude)
location_on_land = F.udf(fake.location_on_land)
local_latlng = F.udf(fake.local_latlng)
zipcode = F.udf(fake.zipcode)</code></pre><p>Generate Streaming source data at your desired rate</p><pre><code>def generated_vehicle_and_geo_df(rowsPerSecond: int, numPartitions: int):
    return (
        spark.readStream.format("rate")
        .option("numPartitions", numPartitions)
        .option("rowsPerSecond", rowsPerSecond)
        .load()
        .withColumn("event_id", event_id())
        .withColumn("vehicle_year_make_model", vehicle_year_make_model())
        .withColumn("vehicle_year_make_model_cat", vehicle_year_make_model_cat())
        .withColumn("vehicle_make_model", vehicle_make_model())
        .withColumn("vehicle_make", vehicle_make())
        .withColumn("vehicle_year", vehicle_year())
        .withColumn("vehicle_category", vehicle_category())
        .withColumn("vehicle_object", vehicle_object())
        .withColumn("latitude", latitude())
        .withColumn("longitude", longitude())
        .withColumn("location_on_land", location_on_land())
        .withColumn("local_latlng", local_latlng())
        .withColumn("zipcode", zipcode())
    )
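```python
# Hedged aside, not from the original post: every generator column above wraps a
# zero-argument callable in F.udf, so Spark re-invokes it once per row and each
# row gets a fresh fake value. The same idea in plain Python:
import uuid as _uuid

def make_event_id() -> str:
    # invoked once per row by Spark; returns a new UUID string on every call
    return str(_uuid.uuid4())

assert make_event_id() != make_event_id()
```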

# You can uncomment the below display command to check if the code in this cell works
#display(generated_vehicle_and_geo_df(rowsPerSecond=10, numPartitions=1))</code></pre><p>Now let&#8217;s generate the base source table and let&#8217;s call it Vehicle_Geo</p><pre><code>def stream_write_to_vehicle_geo_table(rowsPerSecond: int = 1000, numPartitions: int = 10):
    table_name_vehicle_geo = "vehicle_geo"
    (
        generated_vehicle_and_geo_df(rowsPerSecond, numPartitions)
            .writeStream
            .queryName(f"write_to_delta_table: {table_name_vehicle_geo}")
            .option("checkpointLocation", f"{schema_storage_location}/{table_name_vehicle_geo}/_checkpoint")
            .format("delta")
            .toTable(f"{schema_name}.{table_name_vehicle_geo}")
    )
stream_write_to_vehicle_geo_table(rowsPerSecond = 1000, numPartitions = 10)</code></pre><p>Let the above code run for a few iterations, and you can play with rowsPerSecond and numPartitions to control how much data you would like to generate. Once you have generated enough data, kill the above stream and get a baseline for the row count.</p><pre><code># re-declare the table name at the notebook level, since the variable above is local to the function
table_name_vehicle_geo = "vehicle_geo"
spark.read.table(f"{schema_name}.{table_name_vehicle_geo}").count()</code></pre><pre><code>display(
    spark.sql(f"""
    SELECT *
    FROM {schema_name}.{table_name_vehicle_geo}
""")
)</code></pre><p>Let&#8217;s also get a min &amp; max of the timestamp column, as we will be leveraging it for watermarking.</p><pre><code>display(
    spark.sql(f"""
    SELECT 
         min(timestamp)
        ,max(timestamp)
        ,current_timestamp()
    FROM {schema_name}.{table_name_vehicle_geo}
""")
)</code></pre><h3><strong>Next, we will break this Delta table into 2 different tables</strong></h3><p>For stream-stream joins we need 2 different streams, so we will use Delta-to-Delta streaming to create these tables.</p><p><strong>a) Table: Vehicle</strong></p><pre><code>vehicle_df = (
        spark.readStream.format("delta").option("maxFilesPerTrigger", "100").table(f"{schema_name}.vehicle_geo")
        .selectExpr(
            "event_id"
            ,"timestamp as vehicle_timestamp"
            ,"vehicle_year_make_model"
            ,"vehicle_year_make_model_cat"
            ,"vehicle_make_model"
            ,"vehicle_make"
            ,"vehicle_year"
            ,"vehicle_category"
            ,"vehicle_object"
            )
    )
#display(vehicle_df)
def stream_write_to_vehicle_table():
    table_name_vehicle = "vehicle"
    (   vehicle_df
        .writeStream
        #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_vehicle}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_vehicle}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_vehicle}")
    )

stream_write_to_vehicle_table()</code></pre><p><strong>b) Table: Geo</strong></p><p>We have added a filter when we write to this table; this will be useful when we emulate the left join scenario. Filter: <code>where("value like '1%' ")</code></p><pre><code>geo_df = (
    spark.readStream.format("delta").option("maxFilesPerTrigger", "100").table(f"{schema_name}.vehicle_geo")
        .selectExpr(
            "event_id"
            ,"value"
            ,"timestamp as geo_timestamp"
            ,"latitude"
            ,"longitude"
            ,"location_on_land"
            ,"local_latlng"
            ,"cast( zipcode as integer) as zipcode"
        ).where("value like '1%' ")
    )
#geo_df.printSchema()
#display(geo_df)
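```python
# Hedged aside, not from the original post: where("value like '1%'") keeps only rows
# whose monotonically increasing rate-source `value` starts with the digit 1, so the
# geo table is a strict subset of vehicle_geo. A pure-Python sanity check of the filter:
kept = [v for v in range(200) if str(v).startswith("1")]
assert len(kept) == 111  # 1, 10-19, and 100-199 survive out of 0-199
```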

def stream_write_to_geo_table():
    table_name_geo = "geo"
    (   geo_df
        .writeStream
        #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_geo}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_geo}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_geo}")
    )

stream_write_to_geo_table()</code></pre><h2><strong>2. Set a baseline using traditional SQL</strong></h2><p>Before we do the actual streaming joins, let&#8217;s do a regular join and figure out the expected row count.</p><p><strong>Get row count from Inner Join</strong></p><pre><code># re-declare the table names at the notebook level, since the variables above are local to their functions
table_name_vehicle = "vehicle"
table_name_geo = "geo"
sql_query_batch_inner_join = f'''
        SELECT count(vehicle.event_id) as row_count_for_inner_join
        FROM {schema_name}.{table_name_vehicle} vehicle
        JOIN {schema_name}.{table_name_geo} geo
        ON vehicle.event_id = geo.event_id
    AND vehicle_timestamp &gt;= geo_timestamp  - INTERVAL 5 MINUTES        
        '''
print(f''' Run SQL Query: 
          {sql_query_batch_inner_join}       
       ''')
display( spark.sql(sql_query_batch_inner_join) )</code></pre><p><strong>Get row count from Left Join</strong></p><pre><code>sql_query_batch_left_join = f'''
        SELECT count(vehicle.event_id) as row_count_for_left_join
        FROM {schema_name}.{table_name_vehicle} vehicle
        LEFT JOIN {schema_name}.{table_name_geo} geo
        ON vehicle.event_id = geo.event_id
            -- Assume there is a business logic that timestamp cannot be more than 5 minutes off
    AND vehicle_timestamp &gt;= geo_timestamp  - INTERVAL 5 MINUTES
        '''
print(f''' Run SQL Query: 
          {sql_query_batch_left_join}       
       ''')
display( spark.sql(sql_query_batch_left_join) )</code></pre><h2><strong>Summary so far:</strong></h2><ol><li><p>We created a source Delta table: vehicle_geo</p></li><li><p>We took the previous table and divided its columns into two tables: Vehicle and Geo</p></li><li><p>Vehicle row count matches vehicle_geo, and it has a subset of those columns</p></li><li><p>The Geo row count is lower than Vehicle&#8217;s because we added a filter when we wrote to the Geo table</p></li><li><p>We ran 2 SQL queries to identify what the row count should be after the stream-stream join</p></li></ol><h2><strong>3. Define Temporary Streaming Views</strong></h2><p>Some people prefer to write the logic in SQL. Thus, we are creating streaming views which can be manipulated with SQL. The below code block will help create a view and set a watermark on the stream.</p><pre><code>def stream_from_delta_and_create_view(schema_name: str, table_name: str, column_to_watermark_on: str, how_late_can_the_data_be: str = "2 minutes", maxFilesPerTrigger: int = 100):
    view_name = f"_streaming_vw_{schema_name}_{table_name}"
    print(f"Table {schema_name}.{table_name} is now streaming under a temporary view called {view_name}")
    (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", f"{maxFilesPerTrigger}")
        .option("withEventTimeOrder", "true")
        .table(f"{schema_name}.{table_name}")
        .withWatermark(f"{column_to_watermark_on}", how_late_can_the_data_be)
        .createOrReplaceTempView(view_name)
    )
</code></pre><p><strong>3. a Create Vehicle Stream</strong></p><p>Let&#8217;s create a Vehicle Stream and set its watermark to 1 minute</p><pre><code>stream_from_delta_and_create_view(schema_name=schema_name, table_name="vehicle", column_to_watermark_on="vehicle_timestamp", how_late_can_the_data_be="1 minutes")</code></pre><p>Let&#8217;s visualize the stream.</p><pre><code>display(
    spark.sql(f'''
        SELECT *
        FROM _streaming_vw_test_streaming_joins_vehicle
    ''')
)</code></pre><p>You can also do an aggregation on the stream. It&#8217;s out of the scope of this blog, but I wanted to show you how you can do it.</p><pre><code>display(
    spark.sql(f'''
        SELECT 
            vehicle_make
            ,count(1) as row_count
        FROM _streaming_vw_test_streaming_joins_vehicle
        GROUP BY vehicle_make
        ORDER BY vehicle_make
    ''')
)</code></pre><p><strong>3. b Create Geo Stream</strong></p><p>Let&#8217;s create a Geo Stream and set its watermark to 2 minutes</p><pre><code>stream_from_delta_and_create_view(schema_name=schema_name, table_name="geo", column_to_watermark_on="geo_timestamp", how_late_can_the_data_be="2 minutes")</code></pre><p>Have a look at what the data looks like</p><pre><code>display(
    spark.sql(f'''
        SELECT *
        FROM _streaming_vw_test_streaming_joins_geo
    ''')
)</code></pre><h2><strong>4. Inner Joins with optional Watermarking</strong></h2><p>While inner joins on any kind of columns and with any kind of conditions are possible in streaming environments, it&#8217;s important to be aware of the potential for unbounded state growth. As new input arrives, it can potentially match with any input from the past, leading to a rapidly increasing streaming state size.</p><p>To avoid this issue, it&#8217;s essential to define additional join conditions that prevent indefinitely old inputs from matching with future inputs. By doing so, it&#8217;s possible to clear old inputs from the state, which can help to prevent unbounded state growth and ensure more efficient processing.</p><p>There are a variety of techniques that can be used to define these additional join conditions. For example, you might limit the scope of the join by only matching on a subset of columns, or you might set a time-based constraint that prevents old inputs from being considered after a certain period of time has elapsed.</p><p>Ultimately, the key to managing streaming state size and ensuring efficient join processing is to consider the unique requirements of your specific use case carefully and to leverage the right techniques and tools to optimize your join conditions accordingly. <strong>Although watermarking could be optional, I would highly recommend you set a watermark on both streams.</strong></p><pre><code>sql_for_stream_stream_inner_join = f"""
    SELECT 
        vehicle.*
        ,geo.latitude
        ,geo.longitude
        ,geo.zipcode
    FROM _streaming_vw_test_streaming_joins_vehicle vehicle
    JOIN _streaming_vw_test_streaming_joins_geo geo
    ON vehicle.event_id = geo.event_id
    -- Assume there is a business logic that timestamp cannot be more than X minutes off
    AND vehicle_timestamp BETWEEN geo_timestamp  - INTERVAL 5 MINUTES AND geo_timestamp
"""
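```python
# Hedged aside, not from the original post: the state retention for this inner join can
# be estimated as (time-range condition) + (the other stream's watermark delay).
# With the 5-minute BETWEEN condition above and the 2-minute geo watermark, buffered
# rows can be dropped roughly 7 minutes past the max observed event time.
def state_retention_minutes(time_range_minutes: int, other_watermark_minutes: int) -> int:
    # approximate how long buffered rows must be kept before the engine can drop them
    return time_range_minutes + other_watermark_minutes

assert state_retention_minutes(5, 2) == 7
```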
#display(spark.sql(sql_for_stream_stream_inner_join))</code></pre><pre><code>table_name_stream_stream_inner_join = 'stream_stream_inner_join'

(   spark.sql(sql_for_stream_stream_inner_join)
    .writeStream
    #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_stream_stream_inner_join}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_stream_stream_inner_join}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_stream_stream_inner_join}")
)</code></pre><p>If the stream has finished, the row count in the next step should match up with the regular batch SQL job.</p><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_inner_join}").count()</code></pre><h3><strong>How was the watermark computed in this scenario?</strong></h3><p>When we defined streaming views for Vehicle and Geo, we set their watermarks as 1 min and 2 min, respectively.</p><p>If you look at the join condition we mentioned:</p><pre><code>AND vehicle_timestamp &gt;= geo_timestamp - INTERVAL 5 minutes</code></pre><p>5 min + 2 min = 7 min.</p><p>Spark Structured Streaming automatically calculates this 7 min number, and the state is cleared after that.</p><h2><strong>5. Left Joins with Watermarking</strong></h2><p>While the watermark + event-time constraints are optional for inner joins, for outer joins they must be specified. This is because, for generating the NULL results in an outer join, the engine must know when an input row is not going to match with anything in the future. Hence, the watermark + event-time constraints must be specified for generating correct results.</p><h3><strong>5.a How Left Joins work differently than Inner Joins</strong></h3><p>One important factor is that the outer NULL results will be generated with a delay that depends on the specified watermark delay and the time range condition. This delay is necessary to ensure that there were no matches, and that there will be no matches in the future.</p><p>In the current implementation of the micro-batch engine, watermarks are advanced at the end of each micro-batch, and the next micro-batch uses the updated watermark to clean up the state and output outer results. However, this means that the generation of outer results may be delayed if there is no new data being received in the stream. 
If either of the two input streams being joined does not receive data for a while, the outer output (in both left and right cases) may be delayed.</p><pre><code>sql_for_stream_stream_left_join = f"""
    SELECT 
        vehicle.*
        ,geo.latitude
        ,geo.longitude
        ,geo.zipcode
    FROM _streaming_vw_test_streaming_joins_vehicle vehicle
    LEFT JOIN _streaming_vw_test_streaming_joins_geo geo
    ON vehicle.event_id = geo.event_id
        AND vehicle_timestamp BETWEEN geo_timestamp  - INTERVAL 5 MINUTES AND geo_timestamp
"""
#display(spark.sql(sql_for_stream_stream_left_join))

table_name_stream_stream_left_join = 'stream_stream_left_join'

(   spark.sql(sql_for_stream_stream_left_join)
    .writeStream
    #.trigger(availableNow=True)
        .queryName(f"write_to_delta_table: {table_name_stream_stream_left_join}")
        .option("checkpointLocation", f"{schema_storage_location}/{table_name_stream_stream_left_join}/_checkpoint")
        .format("delta")
        .toTable(f"{schema_name}.{table_name_stream_stream_left_join}")
)</code></pre><p>If the stream has finished, the row count in the next step should match up with the regular batch SQL job.</p><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_left_join}").count()</code></pre><blockquote><p><em><strong>You will find that some records that could not match are not being released, which is expected. </strong>The outer NULL results will be generated with a delay that depends on the specified watermark delay and the time range condition. This is because the engine has to wait for that long to ensure there were no matches and there will be no more matches in future.</em></p><p><em><strong>The watermark will advance once new data is pushed to it.</strong></em></p></blockquote><p>Thus let&#8217;s generate some more fake data for the base table: <strong>vehicle_geo. </strong>This time we are sending a much lower volume of 10 records per second. Let the below command run for at least one batch and then kill it.</p><pre><code>stream_write_to_vehicle_geo_table(rowsPerSecond = 10, numPartitions = 10)</code></pre><h3><strong>5. b What to observe:</strong></h3><ol><li><p>Soon you should see the watermark move ahead and the number of records in &#8216;Aggregation State&#8217; go down.</p></li><li><p>If you click on the running stream, open the raw data tab, and look for &#8220;watermark&#8221;, you will see it has advanced.</p></li><li><p>Once 0 records per second are being processed, your stream has caught up, and the row count should match up with the traditional SQL left join.</p></li></ol><pre><code>spark.read.table(f"{schema_name}.{table_name_stream_stream_left_join}").count()</code></pre><h2><strong>6. The cold start edge case: withEventTimeOrder</strong></h2><blockquote><p><em>&#8220;When using a Delta table as a stream source, the query first processes all of the data present in the table. 
The Delta table at this version is called the initial snapshot. By default, the Delta table&#8217;s data files are processed based on which file was last modified. However, the last modification time does not necessarily represent the record event time order.</em></p><p><em>In a stateful streaming query with a defined watermark, processing files by modification time can result in records being processed in the wrong order. This could lead to records dropping as late events by the watermark.</em></p><p><em>You can avoid the data drop issue by enabling the following option:</em></p><p><em>withEventTimeOrder: Whether the initial snapshot should be processed with event time order.</em></p></blockquote><p><strong>If you use startingVersion then withEventTimeOrder attribute is ignored.</strong></p><p>In our scenario, I pushed this inside Step 3 when we created the temporary streaming views.</p><pre><code>spark.readStream.format(&#8221;delta&#8221;)
        .option("maxFilesPerTrigger", f"{maxFilesPerTrigger}")
        .option("withEventTimeOrder", "true")
        .table(f"{schema_name}.{table_name}")</code></pre><h2><strong>7. Cleanup</strong></h2><p>Drop all tables in the database and delete all the checkpoints.</p><pre><code>spark.sql(
    f"""
    drop schema if exists {schema_name} CASCADE
    """
)


dbutils.fs.rm(schema_storage_location, True)</code></pre><p>If you have made it this far, you now have a working pipeline and a solid example that you can use going forward.</p><h2><strong>Download the code</strong></h2><p><a href="https://github.com/jiteshsoni/material_for_public_consumption/blob/main/notebooks/spark_stream_stream_join.py">https://github.com/jiteshsoni/material_for_public_consumption/blob/main/notebooks/spark_stream_stream_join.py</a></p><h3><strong>References:</strong></h3><ol><li><p><a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins">https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins</a></p></li></ol><div id="youtube2-hyZU_bw1-ow" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;hyZU_bw1-ow&quot;,&quot;startTime&quot;:&quot;1181&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/hyZU_bw1-ow?start=1181&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div id="youtube2-1cBDGsSbwRA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;1cBDGsSbwRA&quot;,&quot;startTime&quot;:&quot;1500s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/1cBDGsSbwRA?start=1500s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ol><li><p><a 
href="https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-streaming.html">https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-streaming.html</a></p></li><li><p><a href="https://docs.databricks.com/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped">https://docs.databricks.com/structured-streaming/delta-lake.html#process-initial-snapshot-without-data-being-dropped</a></p></li></ol><h2><strong>Footnote:</strong></h2><p>Thank you for taking the time to read this article. If you found it helpful or enjoyable, please consider clapping to show appreciation and help others discover it. Don&#8217;t forget to follow me for more insightful content, and visit my website <strong><a href="https://canadiandataguy.com/">CanadianDataGuy.com</a></strong> for additional resources and information. Your support and feedback are essential to me, and I appreciate your engagement with my work.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Build an Ethereum ETL Pipeline for Free Using Databricks Free Edition]]></title><description><![CDATA[Build a zero-infrastructure streaming pipeline: Step-by-step Ethereum data ingestion, schema evolution, and Delta storage]]></description><link>https://www.canadiandataguy.com/p/build-an-ethereum-etl-pipeline-for</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/build-an-ethereum-etl-pipeline-for</guid><dc:creator><![CDATA[Yogita Nesargi]]></dc:creator><pubDate>Tue, 23 Sep 2025 04:29:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jJwt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Ethereum blockchain generates one of the richest transactional datasets in the world. Yet analyzing this data directly from a node proves impractical &#8212; blocks and transactions arrive as deeply nested structures, making efficient querying nearly impossible without transformation.</p><p>Best of all? You can build this entire pipeline without spending a dollar. Databricks Free Edition is a no-cost version of Databricks designed for students, educators, hobbyists, and anyone interested in learning or experimenting with data and AI <a href="https://docs.databricks.com/aws/en/getting-started/free-edition">Databricks</a>. 
Simply <a href="https://www.databricks.com/learn/free-edition">sign up for Databricks Free Edition</a> &#8212; no credit card required &#8212; and you'll get a workspace with serverless compute ready to go.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>In this technical walkthrough, we'll build a streaming ETL pipeline in Databricks that:</p><ul><li><p>Ingests raw Ethereum blocks from AWS's public dataset</p></li><li><p>Extracts and flattens transaction data</p></li><li><p>Stores everything in queryable Delta Lake tables for analytics</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jJwt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jJwt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 424w, 
https://substackcdn.com/image/fetch/$s_!jJwt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 848w, https://substackcdn.com/image/fetch/$s_!jJwt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 1272w, https://substackcdn.com/image/fetch/$s_!jJwt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jJwt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif" width="606" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:444,&quot;width&quot;:606,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:1748110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/174119673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!jJwt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 424w, https://substackcdn.com/image/fetch/$s_!jJwt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 848w, https://substackcdn.com/image/fetch/$s_!jJwt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 1272w, https://substackcdn.com/image/fetch/$s_!jJwt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f6bc0f9-0955-4774-88a2-cd3cfd421008_606x444.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why Databricks + AutoLoader for Blockchain Data?</h2><p>Databricks Autoloader provides a Structured Streaming source called <code>cloudFiles</code> that automatically processes new files as they arrive, with the option of also processing existing files <a href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/">Microsoft Learn</a><a href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/">Databricks</a>. This makes it ideal for blockchain data because:</p><ul><li><p><strong>Scalable storage</strong>: Blockchain data grows continuously &#8212; Delta Lake handles petabyte-scale datasets effortlessly</p></li><li><p><strong>Schema enforcement</strong>: Flatten complex nested Ethereum data into clean, queryable tables with automatic schema evolution</p></li><li><p><strong>Streaming ingestion</strong>: Process blocks and transactions in near real-time as they arrive</p></li><li><p><strong>SQL + ML integration</strong>: Run ad-hoc queries or feed data directly into ML models for fraud detection, token analytics, or NFT tracking</p></li></ul><h2>Step 1: Load Historical Ethereum Data from AWS S3</h2><p>AWS provides free access to blockchain datasets through the aws-public-blockchain S3 bucket, with data optimized for analytics by being transformed into compressed Parquet files, partitioned by date for efficient querying <a href="https://aws.amazon.com/blogs/web3/access-bitcoin-and-ethereum-open-datasets-for-cross-chain-analytics/">AWS</a><a href="https://registry.opendata.aws/aws-public-blockchain/">AWS Open Data Registry</a>.</p><p>Instead of setting up an Ethereum node, 
we'll directly pull historical block data that's already preprocessed into Parquet format &#8212; completely free.</p><p>Here's what we'll accomplish:</p><ol><li><p>Connect anonymously to AWS S3</p></li><li><p>List available Ethereum block files (stored as Parquet)</p></li><li><p>Download selected files into a Unity Catalog Volume in Databricks</p></li><li><p>Verify successful data landing</p></li></ol><p>This foundational step prepares our workspace for efficient block processing using Spark and Delta Lake.</p><pre><code># Databricks notebook source
# === SIMPLE PARAMETERIZATION (ONLY NECESSARY VARIABLES) ===
# Parameterize only what needs to be variable for reusability
dbutils.widgets.text("catalog_name", "blockchain", "Catalog Name")
dbutils.widgets.text("schema_name", "ethereum", "Schema Name")
dbutils.widgets.text("num_files", "20", "Number of Files to Download")

# === CONFIGURATION ===
# Get widget values
CATALOG = dbutils.widgets.get("catalog_name")
SCHEMA = dbutils.widgets.get("schema_name")
NUM_FILES = int(dbutils.widgets.get("num_files"))

# Hard-coded values (no need to parameterize constants)
AWS_BUCKET = "aws-public-blockchain"
S3_PREFIX = "v1.0/eth/blocks/"

# Unity Catalog volume paths for data organization
DATA_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum"
CHECKPOINT_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum_checkpoints"
SCHEMA_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum_schemas"

print(f"&#128295; Using Catalog: {CATALOG}, Schema: {SCHEMA}")
print(f"&#128230; Downloading {NUM_FILES} files from s3://{AWS_BUCKET}/{S3_PREFIX}")

# === UNITY CATALOG SETUP ===
stmts = [
    f"CREATE CATALOG IF NOT EXISTS {CATALOG}",
    f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}",
    f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum",
    f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum_checkpoints",
    f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum_schemas",
]

for i, s in enumerate(stmts, 1):
    print(f"[{i}/{len(stmts)}] {s}")
    try:
        spark.sql(s)
        print("  &#9989; Success")
    except Exception as e:
        print(f"  &#10060; Error: {e}")

print(f"\nCreated/verified UC objects. Paths available:")
print(f"  Data: {DATA_VOLUME}")
print(f"  Checkpoints: {CHECKPOINT_VOLUME}")
print(f"  Schemas: {SCHEMA_VOLUME}")

# === DATA DOWNLOAD ===
import os
import boto3
from botocore import UNSIGNED
from botocore.client import Config

print(f"\n&#128229; Downloading to: {DATA_VOLUME}")
os.makedirs(DATA_VOLUME, exist_ok=True)

# Configure anonymous S3 client (no AWS credentials needed!)
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Collect parquet files from S3
keys = []
token = None

while len(keys) &lt; NUM_FILES:
    params = {
        "Bucket": AWS_BUCKET, 
        "Prefix": S3_PREFIX, 
        "MaxKeys": min(1000, NUM_FILES - len(keys))
    }
    if token:
        params["ContinuationToken"] = token
    
    resp = s3.list_objects_v2(**params)
    
    for obj in resp.get("Contents", []) or []:
        if obj["Key"].endswith(".parquet"):
            keys.append(obj["Key"])
            if len(keys) &gt;= NUM_FILES:
                break
    
    if not resp.get("IsTruncated"):
        break
    token = resp.get("NextContinuationToken")

if not keys:
    raise RuntimeError(f"No parquet files found under s3://{AWS_BUCKET}/{S3_PREFIX}")

# Download files with progress tracking
for i, key in enumerate(keys, 1):
    rel_path = key.replace("v1.0/eth/", "")
    dest_path = os.path.join(DATA_VOLUME, rel_path)
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    
    print(f"[{i}/{len(keys)}] {os.path.basename(key)} ...", end=" ", flush=True)
    s3.download_file(AWS_BUCKET, key, dest_path)
    print("&#10003;")

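# The notebook outline above promised a step 4, "Verify successful data landing";
# a minimal sketch of that check (the helper name is ours, not from the original):

```python
import os

def count_parquet_files(path: str) -> int:
    """Recursively count .parquet files under a directory tree."""
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += sum(1 for f in files if f.endswith(".parquet"))
    return total

# In this notebook you would point it at the volume populated above, e.g.:
# print(f"Verified {count_parquet_files(DATA_VOLUME)} parquet files")
```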
print("&#9989; Download complete!")</code></pre><div><hr></div><h2>Step 2: Stream Raw Blocks with Autoloader</h2><p>With Ethereum blockchain data now in our Unity Catalog volume, we can continuously process new blocks as they arrive. In production, you'd use web3.py to poll the Ethereum network and save new blocks as Parquet files. For now, we'll stream the historical Parquet files we downloaded.</p><p>Autoloader's cloudFiles source automatically processes new files as they arrive <a href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/">Microsoft</a>, making it perfect for blockchain data ingestion.</p><h3>How Autoloader Works</h3><p>Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they are in a separate namespace from other Structured Streaming source options <a href="https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options">Databricks</a>. Key features include:</p><ul><li><p><strong>Automatic file discovery</strong>: Watches folders continuously and picks up new files</p></li><li><p><strong>Schema evolution</strong>: Auto Loader can detect schema drifts, notify you when schema changes happen, and rescue data that would have been otherwise ignored or lost <a href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/">What is Auto Loader? - Azure Databricks | Microsoft Learn</a></p></li><li><p><strong>Exactly-once processing</strong>: Maintains state in checkpoint location to ensure no data loss or duplication</p></li><li><p>Schema Hints: Provides control if you want to specially handle a few column without handling each and every column which is tedious</p></li></ul><pre><code># === STREAMING READER ===
reader = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")  # Specify Parquet file format
        .option("cloudFiles.schemaLocation", SCHEMA_VOLUME)  # Schema tracking
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # Handle new fields
        .option("cloudFiles.schemaHints", "number BIGINT, baseFeePerGas BIGINT")  # Type hints
        .load(f"dbfs:{DATA_VOLUME}/blocks/")
)

# Display streaming data for monitoring
display(reader)</code></pre><h2>Step 3: Create Delta Tables for Blockchain Data</h2><p>Next, we extract or add fields from our streaming data and write them into a Delta table. This provides a foundation for efficient queries and downstream transformations.</p><pre><code># Extract and transform block-level fields if needed
blocks_df = reader.select(
   "*",
   "_metadata"
)

# Write to Delta using Structured Streaming
blocks_query = (
    blocks_df.writeStream
        .format("delta")  # Delta Lake sink for ACID transactions
        .outputMode("append")  # Append new blocks as they arrive
        .option("checkpointLocation", f"{CHECKPOINT_VOLUME}/blocks/")  # State management
        .trigger(availableNow=True)  # Process all available data
        .table(f"{CATALOG}.{SCHEMA}.blocks")  # Save as managed Delta table
)

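# Because the trigger is availableNow, this query behaves like an incremental
# batch job: it processes everything currently available and then stops on its
# own. A common follow-up (a sketch; blocks_query is the StreamingQuery
# returned by .table() above):

```python
# Block until the availableNow run finishes, then sanity-check the Delta table
blocks_query.awaitTermination()
print(spark.table(f"{CATALOG}.{SCHEMA}.blocks").count())
```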
</code></pre><h2>Trigger Types: Why <code>trigger(availableNow)</code> Matters</h2><p>The available now trigger option consumes all available records as an incremental batch with the ability to configure batch size with options such as maxBytesPerTrigger <a href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/triggers">Microsoft Learn</a><a href="https://docs.databricks.com/aws/en/structured-streaming/triggers">Databricks</a>. Understanding trigger options is crucial for optimizing your blockchain pipeline:</p><h4>Available Trigger Types</h4><p>Databricks Structured Streaming provides multiple triggers to control micro-batch execution:</p><h4><code>trigger(once=True)</code> (Deprecated)</h4><ul><li><p>Runs the query a single time, processing all currently available data, then stops</p></li><li><p>Ideal for one-time backfills or finite datasets</p></li><li><p>In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads <a href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/triggers">Microsoft Learn</a><a href="https://docs.databricks.com/aws/en/structured-streaming/triggers">Databricks</a></p></li></ul><h4><code>trigger(availableNow=True)</code> (Recommended for Blockchain Pipelines)</h4><ul><li><p>Processes all currently available data immediately</p></li><li><p>Stops automatically once that data is processed; schedule reruns to pick up newly arrived blocks</p></li><li><p>With Trigger.AvailableNow, file discovery happens asynchronously with data processing and data can be processed across multiple micro-batches with rate limiting <a href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/production">Configure Auto Loader for production workloads - Azure Databricks | Microsoft Learn</a></p></li><li><p>Ensures historical and new blocks are captured seamlessly across scheduled runs</p></li><li><p>Handles schema evolution safely 
without dropping fields</p></li><li><p>Prevents missing transactions that appear during ingestion</p></li></ul><h4><code>trigger(processingTime='10 seconds')</code></h4><ul><li><p>Processes data at fixed time intervals</p></li><li><p>Useful for controlling costs and reducing API calls</p></li><li><p>Best for scenarios without strict latency requirements</p></li></ul><h3>Why We Use <code>trigger(availableNow)</code></h3><p>For our Ethereum pipeline, <code>availableNow</code> provides the perfect balance:</p><ol><li><p><strong>Historical data ingestion</strong>: Process all existing blocks in one pass</p></li><li><p><strong>Schema resilience</strong>: Handles Ethereum protocol upgrades that add new fields</p></li><li><p><strong>Resource optimization</strong>: Auto Loader by default processes a maximum of 1000 files every micro-batch. You can configure cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger to configure how many files or how many bytes should be processed in a micro-batch</p></li></ol><p><strong>Configuration Example</strong></p><p>By combining <code>availableNow</code> with Delta tables and checkpoints, we achieve a robust, scalable streaming solution for blockchain data &#8212; while keeping the code reusable for any streaming data source.</p><pre><code># Fine-tune batch processing for optimal performance
# Rate limits (maxFilesPerTrigger / maxBytesPerTrigger) are Auto Loader *read*
# options prefixed with cloudFiles., so they go on the reader, not the writer
optimized_df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", SCHEMA_VOLUME)
        .option("cloudFiles.maxFilesPerTrigger", 100)  # At most 100 files per micro-batch
        .option("cloudFiles.maxBytesPerTrigger", "1GB")  # Soft byte limit per micro-batch
        .load(f"dbfs:{DATA_VOLUME}/blocks/")
)

optimized_query = (
    optimized_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", f"{CHECKPOINT_VOLUME}/optimized_blocks/")
        .trigger(availableNow=True)
        .table(f"{CATALOG}.{SCHEMA}.blocks")
)</code></pre><h3>Conclusion</h3><p>In this blog, we explored <strong>how to ingest Ethereum blockchain data into Databricks</strong> and store it in <strong>Delta Lake</strong>, creating a solid foundation for analysis:</p><ul><li><p>Raw Ethereum Parquet files are ingested into a <strong>Databricks Volume</strong>.</p></li><li><p><strong>Streaming ingestion</strong> is handled with Databricks Autoloader for reliable file detection.</p></li><li><p><strong>Schema evolution</strong>: Configured automatic handling of new fields as Ethereum evolves.</p></li><li><p>Data is <strong>queryable with Databricks SQL</strong>, enabling analytics on both historical and new blockchain data.</p></li></ul><p>This setup provides a foundation that can be extended with <strong>Medallion Architecture (Bronze &#8594; Silver &#8594; Gold)</strong> and enrichments such as:</p><ul><li><p>Total ETH transferred per day.</p></li><li><p>Active wallets and transaction counts per day.</p></li><li><p>Gas usage trends.</p></li><li><p>Token transfer analytics.</p></li></ul><p>&#128640; Blockchain data is massive and fast-moving &#8212; but with <strong>Databricks + Delta Lake</strong>, you now have a <strong>scalable and robust way to tame it</strong>.</p><div><hr></div><h3>Future Work</h3><p><strong>1. Add a Custom Streaming Reader</strong><br>Instead of first dumping Parquet to storage, implement a <strong>direct Spark Structured Streaming source</strong> that connects to Ethereum nodes (via WebSocket / JSON-RPC).</p><ul><li><p>This allows new blocks and transactions to flow <strong>directly into Spark DataFrames</strong>, reducing latency and storage overhead.</p></li><li><p>For example, a Python wrapper around <code>web3.py</code> could push blocks straight into Spark&#8217;s <code>DataStreamReader</code>.</p></li></ul><p><strong>2. 
Enrich On-Chain Data with Off-Chain Sources</strong></p><ul><li><p>Token metadata, DeFi protocol information, and NFT collections from APIs.</p></li><li><p>Join off-chain data with raw transactions for <strong>deeper analytics</strong>, like wallet behavior, token performance, and DeFi usage patterns.</p><p></p><p></p><p>In the <strong>next blog</strong>, we&#8217;ll dive into <strong>keeping this data up to date continuously using Spark Structured Streaming</strong>, so your Delta tables always reflect the <strong>latest blocks and transactions in real time</strong>.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How to ace and structure your Data Modelling Interview]]></title><description><![CDATA[Prescriptive guidance for conducting your Data Modelling Interview]]></description><link>https://www.canadiandataguy.com/p/how-to-ace-and-structure-your-data</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-to-ace-and-structure-your-data</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Wed, 18 Jun 2025 16:56:40 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!3TbD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>1. <strong>Understand the Requirements (Functional and Non-Functional)</strong></h3><ul><li><p><strong>Ask for Use Cases</strong>: Start by understanding the primary use cases for the data model. Ask questions like:</p><ul><li><p>What kind of questions or analyses will the data model need to answer?</p></li><li><p>Who will use the data (e.g., business users, analysts, data scientists)?</p></li></ul></li><li><p><strong>Clarify Non-Functional Requirements (NFRs)</strong>: Determine the expectations around performance, latency, and data freshness.</p><ul><li><p>Is the data needed in real-time, near real-time, or in batches?</p></li><li><p>What are the expected data volumes and retention periods?</p></li></ul></li></ul><h3>2. <strong>Define Entities and Relationships</strong></h3><ul><li><p>Identify the <strong>key entities</strong> (e.g., Customers, Transactions, Products) and their relationships.</p></li><li><p>Use <strong>Entity-Relationship (ER) diagrams</strong> or similar visual tools to illustrate the relationships.</p></li><li><p>Explain <strong>cardinality</strong> (e.g., one-to-many, many-to-one) between the entities.</p><p></p><p><strong>Tip</strong>: Start representing the relationship as you progress and validate it with the interviewer to see if it makes sense. 
This will help establish a common understanding between you, and if you have made any bad assumptions, the interviewer can correct you.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3TbD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3TbD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 424w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 848w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 1272w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3TbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png" width="1456" height="2588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2588,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9314598,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/149893506?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3TbD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 424w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 848w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 1272w, https://substackcdn.com/image/fetch/$s_!3TbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3fef7d9-25db-4735-ac1a-75ef3c3c3770_2592x4608.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>3. <strong>Design Fact and Dimension Tables</strong></h3><ul><li><p><strong>Fact Tables</strong>:</p><ul><li><p>Explain what events or transactions the fact table will capture.</p></li><li><p>Include details like granularity (e.g., transactions at a daily/hourly level).</p></li><li><p>Mention the primary key and any measures (e.g., sales amount, quantities).</p></li></ul></li><li><p><strong>Dimension Tables</strong>:</p><ul><li><p>Identify dimensions that provide context (e.g., time, products, customers).</p></li><li><p>Mention the primary key and the attributes of each dimension.</p></li><li><p>Discuss if surrogate keys are used for maintaining consistency.</p></li></ul></li></ul><h3>4. <strong>Normalization vs. 
Denormalization</strong></h3><ul><li><p>Explain your approach to normalization or denormalization, depending on the use case:</p><ul><li><p>Normalized tables are suitable for transactional systems to avoid data redundancy.</p></li><li><p>Denormalized tables are better for analytical systems to improve query performance.</p></li></ul></li><li><p>Justify your choice based on the expected <strong>query patterns</strong> and <strong>performance needs</strong>.</p></li></ul><h3>5. <strong>Design Aggregation Tables (if needed)</strong></h3><ul><li><p>For reporting purposes, you might need <strong>aggregate tables</strong> that summarize data.</p></li><li><p>Explain how you would create these tables and what metrics they will store.</p></li><li><p>Use <strong>naming conventions</strong> like <code>agg_</code>, <code>dim_</code>, and <code>fact_</code> for clarity.</p></li></ul><h3>6. <strong>Discuss Partitioning Strategy</strong></h3><ul><li><p>Choose partitioning columns based on <strong>query patterns</strong> and <strong>data distribution</strong>:</p><ul><li><p>For time-based queries, consider partitioning by date.</p></li><li><p>Explain the expected data volumes and how partitioning will improve performance.</p></li></ul></li><li><p>Include how you would handle <strong>archiving</strong> and <strong>data retention</strong> policies.</p></li></ul><h2>One Tap Could Make All the Difference</h2><p>One surefire way to never see this content again? Scroll past without engaging. Search algorithms heavily rely on signals like likes and comments to decide whether a piece of content deserves to surface again, for you or anyone else. If you found this valuable, even in a small way, do consider hitting the like button or dropping a quick comment. It not only supports the content but helps others discover it too.</p><h3>7. 
<strong>Demonstrate with Sample Queries</strong></h3><ul><li><p>Show how the model would work by writing or describing <strong>example queries</strong>:</p><ul><li><p>These queries should answer the business questions you gathered in Step 1.</p></li><li><p>Highlight how your model minimizes joins and ensures efficient querying.</p></li></ul></li><li><p>Aim for zero or minimal joins, especially in denormalized models.</p></li></ul><h3>8. <strong>Discuss Data Quality and Governance</strong></h3><ul><li><p>Explain how you would maintain <strong>data quality</strong> in your model:</p><ul><li><p>What kind of checks would you apply at different stages (e.g., data integrity, row count checks)?</p></li><li><p>Would you use tools like <strong>DBT</strong>, <strong>Airflow</strong>, or others for orchestration?</p></li></ul></li><li><p>Address <strong>data governance</strong> considerations like data lineage, privacy, and compliance (e.g., GDPR).</p></li></ul><h3>9. <strong>Explain How the Model Can Scale</strong></h3><ul><li><p>Discuss how the model will handle <strong>increasing data volumes</strong> and <strong>user queries</strong>:</p><ul><li><p>Consider <strong>indexing</strong> strategies, <strong>sharding</strong>, or using <strong>cloud-based data warehouses</strong> like Snowflake or BigQuery.</p></li></ul></li><li><p>Explain if and how you would <strong>optimize performance</strong> through techniques like partitioning, caching, or indexing.</p></li></ul><h3>10. 
<strong>Wrap Up with Recap and Iteration</strong></h3><ul><li><p>Summarize your approach and how it addresses the initial requirements.</p></li><li><p>Discuss potential areas for <strong>iteration</strong> or <strong>improvement</strong> if the requirements change.</p></li><li><p>Be open to feedback from the interviewer and discuss how you would adapt the design based on their inputs.</p></li></ul><h3>Template for Each Table Design</h3><p>For each table you design, mention:</p><ul><li><p><strong>Name</strong>: Use naming conventions (e.g., <code>dim_</code>, <code>fact_</code>, <code>agg_</code>).</p></li><li><p><strong>Primary Key</strong>: Specify the key.</p></li><li><p><strong>Columns</strong>: List the key attributes and explain their purpose.</p></li><li><p><strong>Partitioning</strong>: If applicable, explain why you chose a specific column for partitioning.</p></li><li><p><strong>Estimated Row Count</strong>: Provide an estimate based on expected data volumes.</p></li><li><p><strong>Sample Query</strong>: Show how to query the table to answer a business question.</p></li></ul><p>This structure helps convey both your technical skills and your ability to think critically and design solutions that meet business needs.</p><h2><strong>Leave something memorable</strong></h2><p>Before you wrap up, give your interviewer something they can revisit long after the conversation ends. Pair your polished ER diagram with crisp, layered documentation&#8212;entity definitions, key attributes, and the rationale behind every relationship. Treat it as a living blueprint: clear enough for newcomers, detailed enough for architects, and structured to mirror the depth of your own experience.</p><p>Humans forget roughly 80% of new information within a day, so the most reliable way to stay top-of-mind is to hand them a reference they can&#8217;t ignore. The sharper your vision and the richer your notes, the louder your expertise will echo when the hiring panel reviews candidates. 
Turn your model into <strong>a one-page memory hook</strong> that brings your name back to the top of their list.</p><p></p><h2></h2>]]></content:encoded></item><item><title><![CDATA[A Deep Dive into Skewed Joins, GroupBy Bottlenecks, and Smart Strategies to Keep Your Spark Jobs Flying]]></title><description><![CDATA[Unlock comprehensive, practical solutions to conquer data skew in Apache Spark&#8212;step-by-step from basics to advanced strategies for perfectly balanced workloads and optimized job performance.]]></description><link>https://www.canadiandataguy.com/p/a-deep-dive-into-skewed-joins-groupby</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/a-deep-dive-into-skewed-joins-groupby</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Fri, 06 Jun 2025 03:11:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data skew in Apache Spark refers to an <strong>uneven distribution of data across partitions</strong>, often manifesting during shuffle-intensive operations like joins or group-by aggregations. In a skewed scenario, one or a few partitions end up holding far more records for a particular key than others, leading to <strong>hotspots</strong> and <strong>straggler tasks</strong>. This imbalance causes <strong>performance bottlenecks</strong> (tasks processing heavy partitions take much longer) and <strong>inefficient resource usage</strong> (some executors sit idle). In extreme cases, heavily skewed partitions can even exhaust executor memory and cause job failures. Below, we delve into why skew occurs in joins and aggregations, and provide comprehensive strategies&#8212;ranging from Spark configuration tweaks to code-level patterns and architectural designs&#8212;to alleviate data skew. 
</p><h2>Why Data Skew Occurs in Joins and Aggregations</h2><p><strong>Join Operations:</strong> In Spark (excluding broadcast joins), joining two datasets on a key requires redistributing data so that records with the same key end up on the same partition (for a shuffle hash join or sort-merge join). If the key distribution is highly uneven (e.g. one key value appears in 90% of the records), the partition handling that key will be <strong>massive compared to others</strong>, causing skew. All records for that popular key funnel into one task, creating a severe load imbalance. For example, consider joining a large transactions table with a user table on <code>user_id</code> when a few &#8220;power users&#8221; have the vast majority of transactions. The join partition corresponding to those user_ids will handle hundreds of thousands of records, while other partitions process only a few &#8211; resulting in stragglers and possibly out-of-memory errors.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6y5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6y5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 424w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 848w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6y5D!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png" 
width="1200" height="1210.7142857142858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aa8254b-9360-4d6d-8772-87863babbf7b_3806x3840.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1090821,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa8254b-9360-4d6d-8772-87863babbf7b_3806x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6y5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 424w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 848w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!6y5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cdbf698-3d8d-4341-85aa-1fbc57a1b7d6_3806x3840.png 1456w" sizes="100vw" fetchpriority="high"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>GroupBy and Aggregations:</strong> Similarly, grouping or aggregating by a key brings all data for each key onto one executor. If some keys occur far more frequently than others, those keys&#8217; partitions become disproportionately large. 
For instance, a <code>groupBy("customer_id")</code> on an orders dataset where a handful of customers account for most orders will produce skew: the reducer for those popular customers must aggregate an extremely large list, while others handle trivial amounts<a href="https://www.linkedin.com/pulse/what-data-skewness-spark-how-handle-code-soutir-sen-xf6hf#:~:text=Skewness%20often%20arises%20during%20operations,For%20example">l</a>. Even though Spark performs map-side partial aggregation, a single reduce task will still have to combine all intermediate results for a heavy key, leading to one very slow task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-TIJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 424w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 848w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1272w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png" width="1200" height="1246.1538461538462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f70466ce-9b00-4cdd-9e70-f55d0cd1f468_3699x3840.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1512,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1239456,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70466ce-9b00-4cdd-9e70-f55d0cd1f468_3699x3840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-TIJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 424w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 848w, https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-TIJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7580bc-ed95-47cf-abd8-70d6e0c2e6eb_3699x3840.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Understanding these root causes guides us to solutions. 
Next, we address <strong>join skew</strong> and <strong>groupBy/aggregation skew</strong> separately, discussing targeted techniques for each.</p><h2>How do we know if we have a Skew Problem?</h2><p>To identify if there is a skew problem in Spark, several indicators and methods can be employed:</p><ol><li><p><strong>Task Duration Discrepancy</strong>:</p><ul><li><p>If all tasks in a shuffle stage finish except for a few that hang for a long time, this may indicate data skew.</p></li></ul></li><li><p><strong>Spark UI Analysis</strong>:</p><ul><li><p>Check the task summary metrics in the Spark UI. A significant difference between the minimum and maximum shuffle read sizes can suggest skewness.</p></li></ul></li><li><p><strong>Data Spills</strong>:</p><ul><li><p>If, despite tuning the number of shuffle partitions, there are numerous data spills, this might point to data skew.</p></li></ul></li><li><p><strong>Row Count Disparity</strong>:</p><ul><li><p>Counting rows grouped by join or aggregation columns can reveal skew. A significant difference in row counts for different groups indicates potential skew issues.</p></li></ul></li><li><p><strong>Compression Ratios</strong>:</p><ul><li><p>Highly compressed tables can affect the estimation of shuffle partitions, leading to spills. Monitoring this can help identify such cases.</p></li></ul></li></ol><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">Additionally, Spark SQL's Adaptive Query Execution (AQE) can help detect and sometimes resolve data skew dynamically by adjusting execution strategies as needed.</a></p><h2>Mitigating Skew in Join Operations</h2><p>When joining two datasets on a key, Spark must shuffle records so that identical keys end up on the same partition. If one key is heavily overrepresented, its partition can become a bottleneck. Below are strategies ordered from most to least recommended:</p><h3>1. 
Adaptive Query Execution (AQE) &#8211; Automatic Skew Handling</h3><p>Spark 3.0+ introduced <strong>Adaptive Query Execution (AQE)</strong>, which can dynamically detect and correct skewed partitions during runtime. When AQE is enabled, Spark measures the size of each shuffle partition after the initial shuffle. If it finds any partition that is both exceptionally large in absolute terms and multiple times larger than the median partition size, it automatically splits that partition into smaller sub-tasks and replicates the corresponding rows from the other side of the join so each sub-task can run independently.</p><h4>How It Works</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N7Vj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 424w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 848w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png" width="536" height="1159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/051f8937-4934-4d8e-929f-71c4bf2a6d48_536x1159.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1159,&quot;width&quot;:536,&quot;resizeWidth&quot;:536,&quot;bytes&quot;:110917,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051f8937-4934-4d8e-929f-71c4bf2a6d48_536x1159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N7Vj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 424w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 848w, https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1272w, 
https://substackcdn.com/image/fetch/$s_!N7Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe422a76d-a93f-4dc5-b0e4-1db069c12a33_536x1159.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol><li><p><strong>Collect Partition Statistics:</strong></p><ul><li><p>After the shuffle phase, Spark records the size (bytes) of every partition on both sides of the join.</p></li></ul></li><li><p><strong>Identify Skewed Partitions:</strong><br>A partition is marked as &#8220;skewed&#8221; only if it meets <strong>both</strong> 
criteria:</p><ul><li><p><strong>Absolute&#8208;Size Threshold: </strong><code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes </code>Default: <code>256MB</code></p></li><li><p><strong>Relative&#8208;Size Factor: </strong><code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code></p><p>(Default: <code>5.0</code>)</p></li></ul><p>If the median shuffle&#8208;partition size is 50 MB, a factor of 5.0 means any partition &gt; 250 MB qualifies&#8212;provided it also exceeds the 256 MB absolute threshold.</p></li><li><p><strong>Split &amp; Replicate:</strong></p><ul><li><p>Suppose partition #17 is 1 GB and the coalesced&#8208;partition target is 250 MB. Spark divides that 1 GB into four ~250 MB sub-partitions.</p></li><li><p>For a join, each of those sub-partitions must still see all matching rows from the opposite dataset. Spark duplicates those matching rows N times (once per sub-partition) so each sub-task can run a local join.</p></li></ul></li><li><p><strong>Run Subtasks in Parallel &amp; Merge Results:</strong></p><ul><li><p>Instead of a single, massive task pulling 1 GB, Spark launches N tasks (e.g., four tasks pulling ~250 MB each plus replicated rows).</p></li><li><p>When those sub-tasks finish, Spark concatenates their outputs to produce the final joined result.</p></li></ul></li></ol><p>Because this splitting and replication occur <strong>after</strong> the initial shuffle&#8212;when Spark has accurate sizes&#8212;no query rewriting or manual &#8220;hints&#8221; are required.</p><h4>Configuration</h4><pre><code># Enable AQE (on by default in Spark 3.2+)
spark.sql.adaptive.enabled=true

# Enable skew-join correction
spark.sql.adaptive.skewJoin.enabled=true

# Absolute-size threshold for skewed partitions
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB

# Relative-size factor: if a partition is &gt; factor &#215; median size, it's skewed
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5.0

# (Spark 3.3+) Force AQE to apply skew-join splitting even if it adds shuffle overhead
spark.sql.adaptive.forceOptimizeSkewedJoin=true
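
# (Optional, related AQE settings; defaults shown, tune per workload.)
# Skewed partitions are split toward the advisory size, and post-shuffle
# coalescing merges the resulting small partitions back to a sane size.
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB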
</code></pre><h4>Pros &amp; Cons</h4><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>Zero code changes</strong>: No query rewrites, no manual hints.</p></li><li><p><strong>Runtime intelligence</strong>: Works on any sort-merge or shuffle-hash join where skew is severe.</p></li><li><p>Eliminates straggler tasks without requiring you to identify skewed keys in advance.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>Applies only to <strong>shuffle joins</strong> (sort-merge and shuffle-hash). Broadcast joins never shuffle, so they aren&#8217;t &#8220;skewed.&#8221;</p></li><li><p>Splitting and replicating can introduce extra shuffle I/O; mild skew may not trigger the optimization, or may not be worth the cost of splitting.</p></li><li><p>You may need to tune the thresholds (<code>skewedPartitionThresholdInBytes</code> and <code>skewedPartitionFactor</code>) so borderline partitions are not split needlessly.</p></li></ul></li></ul><h2><strong>Keep This Post Discoverable: Your Engagement Counts!</strong></h2><p>Your engagement with this blog post is crucial! Without likes, comments, or shares, this valuable content might become lost in the vast sea of online information. Search engines like Google rely on user engagement to determine the relevance and importance of web pages. If you found this information helpful, please take a moment to like, comment, or share. Your action not only helps others discover this content but also ensures that you&#8217;ll be able to find it again in the future when you need it. 
Don&#8217;t let this resource disappear from search results &#8212; show your support and help keep quality content accessible!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share CanadianDataGuy&#8217;s No Fluff Newsletter&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.canadiandataguy.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share CanadianDataGuy&#8217;s No Fluff Newsletter</span></a></p><div><hr></div><h3>2. Broadcast Hash Join (Small&#8211;Large Optimization)</h3><p>If one side of a join is small enough to fit in memory on every executor, a <strong>broadcast hash join</strong> eliminates virtually all skew risk. By broadcasting the smaller dataset to every executor, Spark can join on the large side without shuffling it by key. Even a &#8220;hot&#8221; key on the large side is processed in parallel across many tasks, because each task already has the complete, in-memory copy of the smaller table.</p><h4>How It Works</h4><ol><li><p><strong>Spark Optimizer Picks It Automatically</strong> (if small side &#8804; 10 MB by default):</p><ul><li><p>Controlled by:</p><p><code>spark.sql.autoBroadcastJoinThreshold </code>(Default: <code>10MB</code>)</p></li><li><p>Raise this value to allow larger small tables but not more than 1 GB practically </p></li></ul></li><li><p><strong>Explicitly Force Broadcast in DataFrame Code:</strong></p></li></ol><pre><code>from pyspark.sql.functions import broadcast

result = largeDF.join(broadcast(smallDF), "joinKey")</code></pre><ol start="3"><li><p><strong>Spark SQL Hint:</strong></p></li></ol><pre><code>SELECT /*+ BROADCAST(s) */ *
FROM large l
JOIN small s
  ON l.joinKey = s.joinKey;</code></pre><p>Since the large dataset is not shuffled by key, no single reducer processes all rows for a heavy key. Instead, each task hashes the broadcasted small side in-memory, and streams its assigned partitions of the large side through that hash.</p><h3>Pros &amp; Cons</h3><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>No shuffle</strong> on large side&#8212;completely eliminates skew related to the small side.</p></li><li><p>Simple to implement via <code>broadcast()</code> hints or by tuning <code>spark.sql.autoBroadcastJoinThreshold</code>.</p></li><li><p>Dramatic speedups when one side is truly small and the other side has a hot key.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>The &#8220;small&#8221; table must <strong>fit comfortably</strong> in each executor&#8217;s memory. If it&#8217;s too large (hundreds of MB), broadcasting can create memory pressure or OOM.</p></li><li><p>Not applicable when <strong>both</strong> sides are large.</p></li><li><p>Total cluster memory usage for the small table = (# executors) &#215; (size of small table).</p></li></ul></li></ul><div><hr></div><h3>3. 
Handling Skewed Keys Separately (Divide &amp; Conquer)</h3><p>If you know exactly which key(s) are skewed, you can <strong>split your data into two subsets</strong>&#8212;the skewed-key subset and the &#8220;rest&#8221;&#8212;process them separately, then recombine</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IBuf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IBuf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 424w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 848w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IBuf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png" width="1200" height="1356.5217391304348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f95c396b-5af1-4412-867e-41e02383a572_1035x1170.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9ea8026-d737-41ec-bed7-fa8b1716e8b6_1035x1170.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1170,&quot;width&quot;:1035,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:129951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ea8026-d737-41ec-bed7-fa8b1716e8b6_1035x1170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IBuf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 424w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 848w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!IBuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff95c396b-5af1-4412-867e-41e02383a572_1035x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">caption...</figcaption></figure></div><h4>How It Works</h4><ol><li><p><strong>Split Each Dataset into &#8220;Skewed&#8221; vs. &#8220;Rest&#8221;:</strong></p></li></ol><pre><code>skewed_keys = ["USA"]

import pyspark.sql.functions as F

# Dataset A (large or small, doesn&#8217;t matter)
A_skew   = A.filter(F.col("country") == "USA")
A_rest   = A.filter(F.col("country") != "USA")

# Dataset B
B_skew   = B.filter(F.col("country") == "USA")
B_rest   = B.filter(F.col("country") != "USA")
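
# With several hot keys, the same split generalizes via isin()
# (a sketch, using the skewed_keys list defined above):
#   A_skew = A.filter(F.col("country").isin(skewed_keys))
#   A_rest = A.filter(~F.col("country").isin(skewed_keys))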
</code></pre><ol start="2"><li><p><strong>Join the &#8220;Rest&#8221; Subsets Normally:</strong></p></li></ol><pre><code>main_join = A_rest.join(B_rest, "country")</code></pre><p>Since &#8220;USA&#8221; is removed, these partitions will be balanced&#8212;assuming no other keys are extremely skewed.</p><ol start="3"><li><p><strong>Join the &#8220;Skewed&#8221; Subsets Separately with an Optimized Strategy:</strong></p></li></ol><ul><li><p>If <code>B_skew</code> is small enough, <strong>broadcast</strong> it:</p></li></ul><pre><code>skew_join = A_skew.join(broadcast(B_skew), "country")</code></pre><ul><li><p>Otherwise, you could <strong>salt</strong> only the &#8220;USA&#8221; key (as shown above) or use any other technique.</p></li></ul><ol start="4"><li><p><strong>Union the Two Results:</strong></p></li></ol><pre><code>final_result = main_join.unionByName(skew_join)</code></pre><h4>Pros &amp; Cons</h4><ul><li><p><strong>Pros:</strong></p><ul><li><p><strong>Simplicity</strong>: Process the skewed key in isolation; non-skewed data is untouched.</p></li><li><p>You choose exactly how to handle the problematic key (e.g., broadcast, salt, or extra resources).</p></li><li><p>No need to change logic for the majority of keys.</p></li></ul></li><li><p><strong>Cons:</strong></p><ul><li><p>Requires an extra read/scan (filter) on each dataset&#8212;though filter is usually cheap.</p></li><li><p>Increases job complexity: two join operations instead of one.</p></li><li><p>If more than one key is skewed, you must repeat this process for each key or group of keys&#8212;still subject to skew within that sub&#8208;subset.</p></li><li><p>Must identify skewed key(s) beforehand.</p></li></ul></li></ul><h3>4. 
<strong>Salting Every Key (Uniform Distribution Across N Buckets)</strong></h3><p>In real-world joins&#8212;especially at scale&#8212;any single key with an extremely large row count (for example, a superstar YouTuber like &#8220;mr_beast&#8221;) can overwhelm one partition, leading to severe performance bottlenecks. While you might compensate by detecting and salting just that one &#8220;hot&#8221; key, a more robust approach is to uniformly salt every <code>youtuber_id</code>, ensuring that even unexpected popularity spikes are handled gracefully. By applying a deterministic salt to all keys, each <code>youtuber_id</code> is augmented with a bucket index, distributing its rows across up to N partitions. Matching rows from both tables still join correctly because the salt is derived deterministically from the join key (and potentially another column like <code>video_id</code>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jIDC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jIDC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 424w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 848w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 
1272w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png" width="513" height="1187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82c2e15d-e050-4746-bdf0-9b36a62214f9_513x1187.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1187,&quot;width&quot;:513,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/165303261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82c2e15d-e050-4746-bdf0-9b36a62214f9_513x1187.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jIDC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 424w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 848w, 
https://substackcdn.com/image/fetch/$s_!jIDC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1272w, https://substackcdn.com/image/fetch/$s_!jIDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686bc8d-98f9-4765-a448-fe64b18f3080_513x1187.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h4>How It Works</h4><ol><li><p><strong>Choose a Salt Count (N)</strong></p><ul><li><p>Decide how many buckets to split <strong>every</strong> <code>youtuber_id</code> into (for example, <code>N = 10</code>).</p></li><li><p>Aim for each salted partition to be on the order of 100&#8211;300 MB (or your target). Use the Spark UI&#8217;s &#8220;Shuffle Read Size by Task&#8221; to gauge ideal bucket size.</p></li></ul></li><li><p><strong>Compute a Deterministic Salt for Each Row</strong></p><ul><li><p>For each row, compute:</p></li></ul></li></ol><pre><code>salt = abs(hash(concat(youtuber_id, video_id))) % N
salted_youtuber = CONCAT(youtuber_id, "_", salt)</code></pre><p>This ensures:</p><ul><li><p><strong>All rows belonging to the same (youtuber_id, video_id)</strong> produce the same <code>(salted_youtuber, video_id)</code> pair in both tables.</p></li><li><p><strong>Every youtuber_id is split</strong> across up to N buckets&#8212;popular keys will spread widely, less-popular keys may cluster in fewer buckets if they have fewer distinct <code>video_id</code> values.</p></li></ul><p><strong>3. Salt Both Tables in PySpark</strong></p><pre><code>import pyspark.sql.functions as F

N = 10

def saltAllExpr(yid_col, vid_col):
    """
    Deterministic salt for every (youtuber_id, video_id):
    salted_youtuber = youtuber_id + "_" + (abs(hash(youtuber_id || video_id)) % N)
    """
    return F.concat(
        yid_col,
        F.lit("_"),
        (F.abs(F.hash(F.concat(yid_col, vid_col))) % N).cast("string")
    )
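
# Sanity-check of the bucket arithmetic in plain Python (illustrative only:
# Python's hash() differs from Spark's hash(), so actual bucket numbers
# differ, but the invariants are the same):
def bucket(yid, vid, n=10):
    return abs(hash(yid + vid)) % n

assert bucket("mr_beast", "abc123") == bucket("mr_beast", "abc123")  # deterministic
assert all(0 <= bucket("mr_beast", str(v)) < 10 for v in range(100))  # in [0, N)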

# Salt the IMPRESSIONS table
salted_impressions = impressions.withColumn(
    "salted_youtuber",
    saltAllExpr(F.col("youtuber_id"), F.col("video_id"))
)

# Salt the CLICKS table
salted_clicks = clicks.withColumn(
    "salted_youtuber",
    saltAllExpr(F.col("youtuber_id"), F.col("video_id"))
)
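
# Optional sanity check (a sketch): the hot key should now spread
# across up to N salted buckets instead of a single partition:
#   salted_impressions.filter(F.col("youtuber_id") == "mr_beast") \
#       .groupBy("salted_youtuber").count().show()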
</code></pre><ul><li><p>Every <code>(youtuber_id, video_id)</code> pair gets a consistent bucket index in <code>[0..9]</code>.</p></li><li><p>For <code>"mr_beast"</code> with <code>video_id = "abc123"</code>, <code>salted_youtuber = "mr_beast_4"</code> (for example).</p></li><li><p>A different video <code>"xyz789"</code> might map to <code>"mr_beast_7"</code>.</p></li><li><p>A less-popular youtuber with only one or two videos may occupy only 1&#8211;2 buckets&#8212;but that&#8217;s fine.</p></li></ul><ol start="4"><li><p><strong>Perform the Salted Join</strong></p></li></ol><pre><code>joined = salted_impressions.alias("imp").join(
    salted_clicks.alias("clk"),
    on=[ "salted_youtuber", "video_id" ],
    how="inner"
)
</code></pre><ul><li><p><strong>Before</strong> salting: All <code>"mr_beast"</code> rows (across any <code>video_id</code>) would land in a single partition.</p></li><li><p><strong>After</strong> salting: Each distinct <code>(youtuber_id, video_id)</code> combination goes to a bucket <code>youtuber_id_&lt;0..9&gt;</code>, so <code>"mr_beast"</code> content spreads across up to 10 partitions&#8212;one per bucket index.</p></li><li><p>This eliminates a single &#8220;hot&#8221; partition for <code>"mr_beast"</code>.</p></li></ul><ol start="5"><li><p><strong>Spark SQL Equivalent</strong></p></li></ol><pre><code>WITH salted_impressions AS (
  SELECT
    *,
    CONCAT(
      youtuber_id,
      '_',
      CAST(ABS(hash(CONCAT(youtuber_id, video_id))) % 10 AS STRING)
    ) AS salted_youtuber
  FROM impressions
),
salted_clicks AS (
  SELECT
    *,
    CONCAT(
      youtuber_id,
      '_',
      CAST(ABS(hash(CONCAT(youtuber_id, video_id))) % 10 AS STRING)
    ) AS salted_youtuber
  FROM clicks
)
SELECT
  imp.*,
  clk.viewer_id,
  clk.timestamp AS click_timestamp
FROM salted_impressions imp
JOIN salted_clicks clk
  ON imp.salted_youtuber = clk.salted_youtuber
 AND imp.video_id       = clk.video_id;
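
-- Optional sanity check (a sketch; assumes the salted CTE is first
-- materialized as a table, here called salted_impressions_tbl):
--   SELECT salted_youtuber, COUNT(*) AS rows_per_bucket
--   FROM   salted_impressions_tbl
--   WHERE  youtuber_id = 'mr_beast'
--   GROUP BY salted_youtuber;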
</code></pre><ul><li><p>Each <code>(youtuber_id, video_id)</code> deterministically maps to one of 10 buckets.</p></li><li><p>Even if <code>"mr_beast"</code> has 100 videos, those 100 distinct <code>(youtuber_id, video_id)</code> pairs spread across up to 10 buckets.</p></li></ul><div><hr></div><h4>Pros &amp; Cons</h4><p><strong>Pros:</strong></p><ul><li><p><strong>Uniform Distribution for All Keys</strong><br>Any youtuber with many videos&#8212;like <code>"mr_beast"</code>&#8212;will spread its rows across N buckets.</p></li><li><p><strong>No Conditional Logic on &#8220;Hot&#8221; Keys</strong><br>You don&#8217;t need to first identify which youtuber is skewed; every key is salted uniformly.</p></li><li><p><strong>Deterministic</strong><br>Matching <code>(youtuber_id, video_id)</code> always end up in the same bucket on both sides, so joins remain correct.</p></li><li><p><strong>Works for Any Join</strong><br>Applies whether one or both tables are large&#8212;no reliance on broadcast.</p></li></ul><p><strong>Cons:</strong></p><ul><li><p><strong>Extra Shuffle Volume</strong><br>Every row in both tables carries an extra salted key, and all rows must shuffle by <code>(salted_youtuber, video_id)</code>.</p><ul><li><p>If a youtuber is lightly used, its rows may end up in only one or two buckets&#8212;but they still shuffle.</p></li><li><p>If data was quite balanced originally, salting &#8220;everything&#8221; may introduce more shuffle than strictly necessary.</p></li></ul></li><li><p><strong>Choosing the Right N Is Crucial</strong></p><ul><li><p>If N is too small, heavily skewed keys (like <code>"mr_beast"</code>) still concentrate too much data in one bucket.</p></li><li><p>If N is too large, you create many small partitions, which increases scheduler overhead.</p></li></ul></li><li><p><strong>Need to Drop </strong><code>salted_youtuber</code><strong> After the Join</strong><br>If you only care about the original key (<code>youtuber_id</code>), drop 
<code>salted_youtuber</code> once the join is done.</p></li></ul><div><hr></div><h4>When to Use &#8220;Salt Everything&#8221;</h4><p>Use this approach when:</p><ul><li><p><strong>You don&#8217;t know in advance</strong> which keys will be skewed (e.g., an Uber driver of the week suddenly goes viral, or any youtuber&#8217;s popularity spikes).</p></li><li><p><strong>Data volume is large and dynamic</strong>, and you want a one&#8208;size&#8208;fits&#8208;all solution rather than conditionally checking for hot keys.</p></li><li><p><strong>You want consistent distribution</strong> for all <code>(youtuber_id, video_id)</code> pairs without maintaining a list of skewed keys.</p></li></ul><div><hr></div><h2>Additional Considerations (ideally, you won&#8217;t need these)</h2><ul><li><p><strong>Tuning Shuffle Partitions</strong></p></li></ul><p>Adjust <code>spark.sql.shuffle.partitions</code> to a value higher than the default (200), ideally a few times your cluster&#8217;s total cores, so that partitions remain small. Too many partitions cause scheduler overhead; too few cause each partition to be large.</p><ul><li><p><strong>Speculative Execution:</strong> Enabling speculation (<code>spark.speculation=true</code>) can alleviate the impact of skew by attempting to re-run straggling tasks on another executor. This doesn&#8217;t fix the skew itself, but if a task is slow (whether due to skew or a slow node), Spark will launch a duplicate task elsewhere. Whichever finishes first wins. In a skew scenario, a speculated task is still doing the same heavy work, so it won&#8217;t magically complete faster unless the original executor was anomalously slow. However, speculation can sometimes help if, say, one executor was busy with garbage collection while another could do the work faster &#8211; it provides a safety net for stragglers. 
It&#8217;s generally good to enable in large clusters, but note it causes extra resource usage for those duplicate tasks.</p></li><li><p><strong>Monitoring with the Spark UI</strong></p><ul><li><p>In the <strong>Stages</strong> tab, expand a SQL stage and click <strong>Physical Plan</strong>.</p></li><li><p>Under <strong>Shuffle Read Size by Task</strong>, look for a single bar that towers over the others&#8212;that&#8217;s your skewed partition.</p></li><li><p>Use those insights to decide between AQE and manual salting.</p></li></ul></li><li><p><strong>Filtering Out Problematic Rows</strong></p><ul><li><p>If certain values (e.g., <code>NULL</code> or outliers) cause extreme skew but are not essential, you can drop them before the join. Only do this if you can accept losing those rows from the result.</p></li></ul></li></ul><pre><code>cleanedDF = originalDF.filter(F.col("country").isNotNull())</code></pre><ul><li><p><strong>Use Skew Hints</strong> (where your Spark distribution supports them)</p><ul><li><p>You can annotate specific keys as skewed in a Spark SQL query so that Spark generates a plan that avoids shuffling them into a single reducer.</p></li></ul></li><li><p><strong>Memory and Shuffle Tuning:</strong> While not fixing skew, you might need to adjust memory configs to handle it. For instance, if one partition is huge, increasing executor memory or shuffle buffer sizes (<code>spark.shuffle.spill.numElementsForceSpillThreshold</code>, <code>spark.shuffle.file.buffer</code>, etc.) won&#8217;t solve the skew but might prevent OOM crashes by allowing Spark to spill gracefully. Similarly, ensure <code>spark.memory.fraction</code> and <code>spark.sql.autoBroadcastJoinThreshold</code> are set such that the heavy data can be handled (e.g., give more memory to shuffle if needed). These are more about coping with skew than removing it.</p></li><li><p><strong>Adaptive Query Execution (AQE):</strong> As discussed, ensure <code>spark.sql.adaptive.enabled=true</code> (the default on modern Spark) and <code>spark.sql.adaptive.skewJoin.enabled=true</code>. 
You can adjust <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code> (default 5) and <code>spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</code> (default 256MB) to tune how aggressively Spark flags partitions as skewed. Lowering these values makes Spark split smaller skews, but setting them too low might cause unnecessary splitting. In Spark 3.3+, if you really want to force skew-join handling, <code>spark.sql.adaptive.forceOptimizeSkewedJoin=true</code> will apply the optimization even if it might add extra shuffle overhead.</p></li></ul><div><hr></div><h2>Tackling Skew in Spark Aggregations: From Simple Sums to Semi-Additive Metrics</h2><p>Aggregation operations like <code>groupBy().agg()</code> in Spark can become major performance bottlenecks when data is skewed. A small number of hot keys can result in uneven workload distribution, where one reducer is overloaded while others remain idle. While Spark&#8217;s map-side partial aggregation helps, it alone can&#8217;t prevent reducers from becoming overwhelmed when skewed keys funnel massive data into single tasks.</p><p>In this deep dive, we&#8217;ll explore practical patterns to mitigate skew during aggregations, especially focusing on semi-additive metrics like averages, distinct counts, and ratios&#8212;metrics that can&#8217;t always be merged as trivially as sums or counts.</p><div><hr></div><h3>1. Two-Stage Aggregation with Salting</h3><p>The most effective method for aggregation skew is a two-stage salted aggregation. In the first stage, you add a salt (random or deterministic) to the key, distributing rows across more groups. 
In the second stage, you aggregate these partials back to the original key.</p><h4>How It Works:</h4><ul><li><p>Add a new column (e.g., <code>salt = floor(rand() * N)</code>) to the grouping key</p></li><li><p>Group by <code>(key, salt)</code> and compute partial aggregates</p></li><li><p>Re-group by <code>key</code> to merge the partials</p></li></ul><h4>PySpark Example:</h4><pre><code><code>from pyspark.sql.functions import col, rand, floor, sum as _sum, count as _count

N = 10
salted_df = df.withColumn("salt", floor(rand() * N))

# First stage: partial aggregation
partial = salted_df.groupBy("key", "salt").agg(
    _sum("value").alias("partial_sum"),
    _count("value").alias("partial_count")
)
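
# Why merging partials works, traced in plain Python (illustrative only):
# salt 5 values of one hot key into 3 buckets, then merge the partials.
toy = [("hot", v) for v in (1, 2, 3, 4, 5)]
toy_partials = {}
for i, (k, v) in enumerate(toy):
    ps, pc = toy_partials.get((k, i % 3), (0, 0))
    toy_partials[(k, i % 3)] = (ps + v, pc + 1)   # stage 1: per-(key, salt)
toy_totals = {}
for (k, _), (ps, pc) in toy_partials.items():
    ts, tc = toy_totals.get(k, (0, 0))
    toy_totals[k] = (ts + ps, tc + pc)            # stage 2: merge per key
assert toy_totals["hot"] == (15, 5)               # same sum/count as un-salted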

# Second stage: final aggregation
final = partial.groupBy("key").agg(
    _sum("partial_sum").alias("total_sum"),
    _sum("partial_count").alias("total_count")
)</code></code></pre><p>This works well for <strong>semi-additive metrics</strong> like average:</p><pre><code><code>final.withColumn("avg", col("total_sum") / col("total_count"))</code></code></pre><h4>Pros:</h4><ul><li><p>Greatly reduces skew on hot keys</p></li><li><p>Flexible: works for sums, counts, averages, etc.</p></li></ul><h4>Cons:</h4><ul><li><p>Not directly applicable to non-associative metrics (like median, percentile)</p></li><li><p>Requires an extra stage of aggregation and data shuffle</p></li><li><p>You must choose N carefully</p></li></ul><div><hr></div><h3>2. Favor Combiner-Friendly DataFrame Operations</h3><p>In the DataFrame API, Spark automatically performs map-side combine for aggregation functions like <code>sum</code>, <code>count</code>, and <code>avg</code>. This significantly reduces data shuffled across the network.</p><h4>Best Practices:</h4><ul><li><p>Avoid collecting all values per key using <code>collect_list</code> or <code>collect_set</code> unless needed</p></li><li><p>Prefer built-in aggregation functions that support partial aggregation</p></li></ul><h4>Example:</h4><pre><code><code>df.groupBy("user_id").agg(
    _sum("impressions").alias("total_impressions"),
    _count("clicks").alias("click_count")
)</code></code></pre><p>This automatically benefits from map-side combine.</p><div><hr></div><h3>3. Hierarchical or Incremental Aggregation</h3><p>Instead of grouping by the final key directly, first group on a <strong>compound key</strong> (e.g., key + day), then roll up to the main key. This acts like salting but uses a meaningful secondary attribute.</p><p>Example: Group by <code>(customer_id, date)</code>, then group again by <code>customer_id</code>.</p><h4>Pros:</h4><ul><li><p>Uses natural structure in data</p></li><li><p>More interpretable than random salt</p></li></ul><h4>Cons:</h4><ul><li><p>Only works if meaningful secondary keys exist</p></li><li><p>Adds complexity to query logic</p></li></ul><div><hr></div><h3>4. Isolate Skewed Keys</h3><p>When just a few keys are skewed (e.g., "mr_beast" on YouTube), isolate them:</p><ul><li><p>Filter the skewed keys</p></li><li><p>Aggregate them separately</p></li><li><p>Aggregate the rest normally</p></li><li><p>Union results</p></li></ul><h4>Pros:</h4><ul><li><p>Simple logic for non-skewed keys</p></li><li><p>You can fine-tune treatment of skewed keys</p></li></ul><h4>Cons:</h4><ul><li><p>Manual, doesn&#8217;t scale to many skewed keys</p></li><li><p>Separate logic paths = more complexity</p></li></ul><div><hr></div><h3>Special Note: Semi-Additive Metrics</h3><p>For metrics like <strong>averages</strong>, <strong>ratios</strong>, or <strong>distinct counts</strong>, special care is needed:</p><ul><li><p><strong>Average:</strong> Use partial sums and counts, then divide</p></li><li><p><strong>Ratios:</strong> Keep numerator/denominator separate, aggregate both, then divide</p></li><li><p><strong>Count Distinct:</strong> Use <code>approx_count_distinct()</code> for scalable approximations</p></li></ul><p>Some metrics cannot be split and recombined (e.g., exact percentiles). 
In those cases, use isolation or rethink the need for exact aggregation.</p><div><hr></div><h3>Final Thoughts for Aggregates</h3><p>Aggregation skew is an invisible killer in Spark jobs. The best strategy is proactive design: salt heavy keys, use partial aggregation, and always choose APIs that favor combiners. With these patterns, even semi-additive or tricky metrics can be made scalable at massive volumes.</p><p>If you're dealing with skew, don't just throw resources at it. Design for it.</p><h2>Summary of Recommendations for Joins</h2><ul><li><p><strong>Adaptive Query Execution (AQE), the recommended first step</strong>: zero code changes; splits skewed partitions at runtime for any sort-merge or shuffle-hash join.</p></li><li><p><strong>Broadcast Hash Join</strong></p><ul><li><p><strong>When one side is small</strong> (10 MB by default, set by <code>spark.sql.autoBroadcastJoinThreshold</code>; often raised to as much as 1 GB). Hint in DataFrame or SQL.</p></li><li><p>Avoids all skew because there is no shuffle on the small side.</p></li></ul></li><li><p><strong>Salting the Key</strong></p><ul><li><p><strong>When neither side is small</strong>, but you know exactly which key(s) dominate.</p></li><li><p>Manual, but guaranteed to split a hot key across N partitions.</p></li></ul></li><li><p><strong>Handle Skewed Keys Separately</strong></p><ul><li><p><strong>When you can isolate a small number of skewed keys</strong>.</p></li><li><p>Split data into &#8220;skewed&#8221; vs. &#8220;rest&#8221;; optimize the skewed subset, then union.</p></li></ul></li></ul><p>By applying these strategies in order&#8212;starting with AQE&#8217;s automatic handling, then broadcasting small tables, and, if necessary, resorting to manual salting or custom partitioning&#8212;you can eliminate or dramatically reduce skew-related stragglers in your Spark join operations.
Choose the approach that best fits your cluster&#8217;s Spark version, data volume, and the complexity you&#8217;re willing to maintain.</p><p></p><p></p><h2>Architectural Patterns and Data Design to Reduce Skew</h2><p>Beyond individual Spark jobs, you can sometimes address skew at the <strong>data architecture level</strong> to prevent issues before they happen:</p><ul><li><p><strong>Skew-Aware Data Partitioning:</strong> As discussed, designing how data is partitioned or bucketed in storage can reduce skew. For example, if you frequently group or join by a key that&#8217;s skewed, consider storing the data partitioned by that key <em>and a secondary split</em>. A real-world practice: if one category of data is 90% of the dataset, you might partition that category&#8217;s data further by another field. Essentially, <strong>acknowledge the skewed key in your data model</strong> and subdivide it. This could mean separate tables or partitions for heavy categories. When you process the data, you then handle those partitions in parallel. The benefit is you're not repeatedly shuffling the entire dataset to discover the same skew; you&#8217;ve pre-divided it.</p></li><li><p><strong>Pre-Aggregation / Summaries:</strong> If your use-case allows, maintain rolling aggregates for skewed keys. For instance, if one user has a million events per day and you always compute their daily total, consider updating a running total for that user in a database or a separate file, rather than recomputing from scratch in each Spark job. By reducing the raw data volume for that key through prior aggregation, you avoid the huge shuffle for that key at query time. This is applicable in pipelines where data is appended incrementally (common in streaming or daily ETL). You trade off storage (keeping summary data) for performance.</p></li><li><p><strong>Alternate Algorithms:</strong> In some cases, you might choose a different approach entirely. 
For example, for a skewed distinct count, using an approximate algorithm (like HyperLogLog) per partition can avoid bringing all data together. Or using Bloom filters to reduce data before join (filter out records that won&#8217;t match). These are specific to certain problems but can mitigate skew by cutting down the data processed.</p></li><li><p><strong>Scaling Up Hot Data Separately:</strong> This is more of an infrastructure pattern &#8211; if one key&#8217;s data is massive, you could route that to a specialized system. For instance, maybe that one key corresponds to a particular customer &#8211; you could give them their own dedicated processing or database, and exclude those records from the general Spark workflow. It&#8217;s an extreme solution, but sometimes separating concerns (multi-tenancy isolation) helps if one tenant&#8217;s data skews the whole system.</p></li><li><p><strong>Monitoring and Iteration:</strong> A softer &#8220;pattern&#8221; is to continuously monitor your Spark job metrics (especially in Spark UI or via logs) to catch skew issues and then adjust. Over time, you may adapt your data ingestion or job logic to handle new skewed keys as data grows. For example, if a new user becomes a power user, you might add them to the &#8220;skewed key list&#8221; for salting. In practice, skew patterns can change, so an architecture that can adjust (or a code path that can automatically detect top N heavy keys and treat them differently) can be very useful.</p></li></ul><p>In essence, architectural approaches are all about <strong>not putting all eggs in one basket</strong> &#8211; distribute data smartly from the ground up, and treat the outliers with special care. 
This reduces the burden on any single Spark job to handle an immense skew on the fly.</p><h2>References</h2><ul><li><p><a href="https://docs.databricks.com/aws/en/optimizations/aqe#dynamically-handle-skew-join">https://docs.databricks.com/aws/en/optimizations/aqe#dynamically-handle-skew-join</a></p></li><li><p><a href="https://medium.com/@suffyan.asad1/handling-data-skew-in-apache-spark-techniques-tips-and-tricks-to-improve-performance-e2934b00b021">https://medium.com/@suffyan.asad1/handling-data-skew-in-apache-spark-techniques-tips-and-tricks-to-improve-performance-e2934b00b021</a></p></li><li><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">https://www.databricks.com/discover/pages/optimize-data-workloads-guide</a></p></li><li><p><a href="https://www.dataengi.com/post/2019/02/06/spark-data-skew-problem/#:~:text=We%20can%20reduce%20data%20skew,impact%20of%20data%20skew%20before">https://www.dataengi.com/post/2019/02/06/spark-data-skew-problem/#:~:text=We%20can%20reduce%20data%20skew,impact%20of%20data%20skew%20before</a></p></li><li><p><a href="https://spark.apache.org/docs/3.5.3/sql-performance-tuning.html#:~:text=,3.0.0">https://spark.apache.org/docs/3.5.3/sql-performance-tuning.html#:~:text=,3.0.0</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Decode the Join: A Spark Data Engineer’s Visual Handbook]]></title><description><![CDATA[Understand when and why to use Broadcast, Shuffle, or Sort-Merge Joins in Spark&#8212; with clear visuals, real-world use cases, and strategy tips tailored for data engineers.]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Fri, 09 May 2025 23:55:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ol6C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Ever stared at a Spark job and wondered which join strategy it picked&#8212;and why your cluster suddenly feels like it&#8217;s running through molasses?</strong> This visual handbook is here to help. Whether you're optimizing joins in production or just trying to wrap your head around what happens under the hood, this guide breaks down <strong>Broadcast, Shuffle, and Sort-Merge Joins</strong> using clear diagrams, code snippets, and real-world scenarios. 
Decode the logic, spot the trade-offs, and make smarter join decisions in your next big data pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ol6C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ol6C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 424w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 848w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 1272w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ol6C!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:6714,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4450582,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/163105837?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ol6C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 424w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 848w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 1272w, https://substackcdn.com/image/fetch/$s_!Ol6C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d23fec9-6bb9-4cd0-a35a-4535174f4e9b_2592x11952.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>A big thank you to </strong><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Canadian Data Guy&quot;,&quot;id&quot;:9073721,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1441c9b8-d40b-4ac7-b91f-4260f55db017_2586x2586.jpeg&quot;,&quot;uuid&quot;:&quot;913f0f9f-df77-4336-96c9-28154636a697&quot;}" data-component-name="MentionToDOM"></span> <strong>for the opportunity to contribute to this space.</strong> It&#8217;s always a pleasure to share insights with fellow data enthusiasts. 
If this visual guide helped demystify Spark joins for you, feel free to share your thoughts or questions in the comments&#8212;I&#8217;d love to hear from you!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why Your PySpark UDF Is Slowing Everything Down]]></title><description><![CDATA[An in-depth exploration of architecture, execution flow, bottlenecks, and optimization strategies for PySpark UDFs]]></description><link>https://www.canadiandataguy.com/p/why-your-pyspark-udf-is-slowing-everything</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/why-your-pyspark-udf-is-slowing-everything</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 24 Apr 2025 22:39:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!21cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Introduction</h2><p>PySpark&#8217;s User Defined Functions (UDFs) empower developers to inject custom Python logic into Spark DataFrames. 
They feel like a convenient escape hatch when built-in SQL functions don&#8217;t cut it. However, under the hood, each UDF invocation triggers a complex ballet of inter-process communication, serialization, and single-threaded Python loops. This blog peels back each layer of that architecture to reveal why PySpark UDFs can become a massive performance drain &#8212; and then walks through concrete alternatives and optimizations to keep your jobs blazing fast.</p><div><hr></div><h2>2. The Problem with PySpark UDFs</h2><p>When you sprinkle UDF calls across your Spark SQL or DataFrame pipeline, you&#8217;re effectively handing off portions of your query plan to a &#8220;black box&#8221; Python function. That comes at a steep cost:</p><h3>2.1 Catalyst Optimizer Becomes Blind</h3><ul><li><p><strong>No predicate pushdown:</strong> Spark&#8217;s Catalyst optimizer can&#8217;t inspect or reorder the logic inside your UDF, so it abandons optimizations like pushing filters down to data sources.</p></li><li><p><strong>No whole-stage code generation:</strong> The code-gen engine can&#8217;t fuse your UDF into JVM bytecode, so you lose out on compiler-level speed gains.</p></li></ul><h3>2.2 Serialization/Deserialization Overhead</h3><ul><li><p><strong>Row-by-row data shuffling:</strong> Each row must be marshalled from the JVM heap into a Python object, sent over a local socket, then converted back. After your Python code runs, the result takes the reverse path back into the JVM.</p></li><li><p><strong>Millions of crossings:</strong> With millions (or even billions) of rows, that boundary-crossing cost balloons.</p></li></ul><h3>2.3 Single-Threaded Python Execution</h3><ul><li><p><strong>Global Interpreter Lock (GIL):</strong> Your UDF runs in a standard CPython process under a single core. 
All per-row work happens sequentially; adding more executor cores does not parallelize the Python code inside the UDF.</p></li></ul><h3>2.4 Memory and Stability Risks</h3><ul><li><p><strong>Python OOMs:</strong> Unlike JVM operations, Spark doesn&#8217;t manage Python worker memory. Processing large batches can crash with out-of-memory errors.</p></li><li><p><strong>Uncaught exceptions:</strong> A bug in your UDF can fail an entire Spark task. Null handling, pickling errors, and non-serializable closures often catch teams by surprise.</p></li></ul><div><hr></div><h2>3. Under the Hood: PySpark&#8217;s Dual-Runtime Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21cQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 424w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 848w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1272w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!21cQ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png" width="1200" height="532.4175824175824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62cbed82-2c4b-48a4-b0c0-5d53d50e2737_3840x1705.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:282499,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162008971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cbed82-2c4b-48a4-b0c0-5d53d50e2737_3840x1705.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!21cQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 424w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 848w, https://substackcdn.com/image/fetch/$s_!21cQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1272w, 
https://substackcdn.com/image/fetch/$s_!21cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69a18ec1-e86f-4e35-9a40-4a0512545c0f_3840x1705.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Py4J is a communication bridge/library that lets Python and Java interoperate by exchanging objects over sockets. In Spark, it powers two key workflows: setting up the Python <code>SparkContext</code> and converting data types in PySpark SQL. 
When you start a PySpark session, Py4J opens a socket connection between your Python driver and the underlying Java driver. Later, whenever Spark SQL operations run, Py4J translates Python types into their Java equivalents (and back) so the Python API can seamlessly drive the JVM-based SQL engine. Under the hood, every Python UDF invocation follows this path:</p><pre><code>Python Driver &#8594; SparkContext &#8594; Py4J &#8594; JVM &#8594; JavaSparkContext  </code></pre><p>Because each UDF call must cross this socket boundary, it adds measurable latency to your job.</p><h3>3.1 Py4J: Bridging Python and the JVM</h3><p>At startup, PySpark uses <a href="https://www.py4j.org/">Py4J</a> to:</p><ol><li><p><strong>Connect the Python driver to the JVM driver.</strong></p></li><li><p><strong>Translate data types</strong> between Python and Java during SQL operations and UDF calls.</p></li></ol><p>Every call into Spark SQL or a UDF crosses this bridge &#8212; think of it as a high-latency tunnel for each record.</p><h3>3.2 Driver, Executors, and Python Workers</h3><ol><li><p><strong>Driver (Python process):</strong> You call <code>df.withColumn("foo", my_udf(col("bar")))</code>.</p></li><li><p><strong>JVM Driver:</strong> Receives the UDF registration, plans the query.</p></li><li><p><strong>Executor JVMs:</strong> Spin up separate Python subprocesses per task.</p></li><li><p><strong>Python Workers:</strong> Handle the actual UDF logic on deserialized batches.</p></li></ol><div><hr></div><h2>4. Lifecycle of a PySpark UDF Call</h2><h3>4.1 Registration &amp; Serialization of the Python Function</h3><pre><code>from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def uppercase(val):
    # Guard against None: Spark will pass null rows into your UDF as-is
    return val.upper() if val is not None else None

uppercase_udf = udf(uppercase, StringType())
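
# Illustrative usage (assumes a DataFrame `df` with a "name" column):
# each row is serialized to a Python worker, transformed, and serialized back.
df = df.withColumn("name_upper", uppercase_udf("name"))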
</code></pre><ul><li><p><code>_create_udf</code> wraps your Python function into a serializable form and tags it with return types.</p></li><li><p><strong>UDF object</strong> travels in the Spark plan to all executors.</p></li></ul><h3>4.2 Data Flow on Executors</h3><ol><li><p>Executor receives a task partition.</p></li><li><p>JVM serializes partition rows into Arrow or Pickle bytes.</p></li><li><p>Bytes stream over TCP to the Python worker.</p></li><li><p>Python worker deserializes, applies your function row-by-row.</p></li><li><p>Results are serialized back to JVM for further operators.</p></li></ol><h3>4.3 Detailed Serialization Cycle</h3><pre><code>JVM row object
  &#9492;&#9472;serialize&#9472;&#9654; Python bytes
      &#9492;&#9472;deserialize&#9472;&#9654; Python object
           &#9492;&#9472;apply UDF&#9472;&#9654; Python object
                &#9492;&#9472;serialize&#9472;&#9654; Python bytes
                     &#9492;&#9472;JVM bytes
                          &#9492;&#9472;deserialize&#9472;&#9654; JVM row</code></pre><p>Multiply that by every row, every partition, every stage &#8212; and you see why simple operations feel so sluggish.</p><div><hr></div><h2>5. Performance Implications</h2><h3>5.1 Quantifying the Overhead</h3><ul><li><p><strong>Catalyst loss:</strong> 10&#8211;30% longer query planning in UDF-heavy jobs.</p></li><li><p><strong>Serialization tax:</strong> 0.5&#8211;5 ms per row crossing (tested on medium-sized clusters).</p></li><li><p><strong>CPU utilization:</strong> &lt; 25% CPU usage across nodes despite heavy transforms.</p></li></ul><h3>5.2 Real-World Benchmark Example</h3><blockquote><p><strong>Scenario:</strong> Uppercasing a 100 million-row column.</p><ul><li><p><strong>Native Spark SQL:</strong></p></li></ul><pre><code>df.selectExpr("upper(name) as name")</code></pre><p>&#8594; 12 seconds end-to-end</p><ul><li><p><strong>Python UDF:</strong></p></li></ul><pre><code>df.withColumn("name", uppercase_udf("name"))</code></pre><p>&#8594; reorders, serialization, single-thread overhead &#8594; <strong>85 seconds</strong><br><em>7&#215; slower for a trivial transform.</em></p></blockquote><div><hr></div><h2>6. Strategies for Faster Custom Logic</h2><h3>6.1 Leverage Built-in Spark Functions</h3><p>Whenever possible, reach for Spark&#8217;s SQL functions (<code>upper</code>, <code>concat</code>, <code>regexp_replace</code>, etc.) &#8212; they run entirely in the JVM, enjoy whole-stage codegen, and scale across all cores.</p><h3>6.2 Pandas UDFs (Vectorized)</h3><p>Introduced in Spark 2.3, Pandas UDFs batch rows into <code>pandas.Series</code> and use <a href="https://arrow.apache.org/">Apache Arrow</a> for zero-copy transfer.</p><pre><code>from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def upper_series(s: pd.Series) -&gt; pd.Series:
    return s.str.upper()

df.withColumn("name", upper_series("name"))
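
# The Arrow transfer batch size is tunable (default 10,000 rows per batch):
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")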
</code></pre><ul><li><p><strong>Batch size:</strong> Typically 8 K&#8211;64 K rows per call</p></li><li><p><strong>Vectorized ops:</strong> Internal loops in C, parallelized across cores in Python worker</p></li><li><p><strong>Results:</strong> 5&#8211;10&#215; speed-up over row-UDFs</p></li></ul><h3>6.3 Scala/Java UDFs</h3><p>If you need custom logic beyond SQL but want JVM speed:</p><ol><li><p><strong>Write a Scala object</strong> implementing <code>UserDefinedFunction</code>.</p></li><li><p><strong>Register it</strong> via <code>spark.udf.registerJava(...)</code>.</p></li><li><p><strong>Invoke</strong> from PySpark as if it were a native function.</p></li></ol><ul><li><p><strong>No Python serialization</strong> needed.</p></li><li><p><strong>Runs inside the executor JVM</strong> with full multi-core utilization.</p></li></ul><h3>6.4 Threading &amp; Parallelism in Python UDFs</h3><p>If you absolutely must call an external API or library row-by-row:</p><ul><li><p><strong>Use multithreading</strong> inside your Python UDF to hide network latency.</p></li><li><p><strong>Batch HTTP calls</strong> where possible.</p></li><li><p><strong>Be cautious</strong>: GIL still applies for CPU-bound work, and thread pools can exhaust memory.</p></li></ul><div><hr></div><h2>7. Common Pitfalls &amp; Debugging Tips</h2><ul><li><p><strong>PicklingError:</strong> Ensure functions and closures reference only top-level functions and serializable objects.</p></li><li><p><strong>Null handling:</strong> Always guard inputs with <code>if v is None: return None</code>.</p></li><li><p><strong>Schema drift:</strong> Explicitly set return types; mismatches lead to confusing errors at shuffle boundaries.</p></li><li><p><strong>Memory leaks:</strong> Monitor Python worker logs for <code>MemoryError</code> and tune <code>spark.python.worker.memory</code>.</p></li></ul><div><hr></div><h2>8. 
Summary &amp; Best Practices</h2><blockquote><p><em>Our newsletter is 100% free and always will be, but without your claps, comments, or shares, search engines may bury this post forever. A quick <strong>clap</strong> not only tells us this content resonates but also makes sure you (and everyone else) can find it again when it matters most.</em></p></blockquote><ol><li><p><strong>Avoid plain Python UDFs</strong> whenever built-in Spark SQL functions suffice.</p></li><li><p><strong>Prefer Pandas UDFs</strong> for vectorized, batch transforms&#8212;they dramatically reduce boundary crossings via Apache Arrow. In fact, the vectorized nature and rapid Arrow improvements often make Pandas UDFs faster than even Scala/Java UDFs.</p></li><li><p><strong>Consider Scala/Java UDFs</strong> only when you need JVM-native logic that can&#8217;t be expressed in SQL or Pandas UDFs.</p></li><li><p><strong>Design for serializability</strong>: keep UDFs self-contained, stateless, and null-safe.</p></li><li><p><strong>Benchmark early</strong>: compare native vs. Pandas vs. Python vs. 
Scala/Java UDFs on representative data.</p></li><li><p><strong>Moving forward, hands down use native functions first, then Pandas UDFs in almost all cases.</strong></p></li><li><p><strong>When you must call external APIs inside a UDF loop</strong>, embed threading or async parallelism to help latency&#8212;see this <a href="https://www.youtube.com/watch?v=n9jodzYq1e4">video on parallelization within a loop</a> for an example.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!86fC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!86fC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 424w, https://substackcdn.com/image/fetch/$s_!86fC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 848w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1272w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!86fC!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png" width="1200" height="243.95604395604394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:296,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:75946,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162008971?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!86fC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 424w, https://substackcdn.com/image/fetch/$s_!86fC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 848w, https://substackcdn.com/image/fetch/$s_!86fC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1272w, 
https://substackcdn.com/image/fetch/$s_!86fC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd7d0814-b5de-4e1a-a4a7-fc74d5717ea9_1670x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>By understanding the multi-stage journey of data through the PySpark UDF pipeline &#8212; from JVM serialization, through Python&#8217;s single-threaded interpreter, back to the JVM &#8212; you can make informed choices that balance flexibility with performance. Next time you need custom logic, pause to ask: <em>&#8220;Can I batch or vectorize? &#8221;</em> Your cluster (and your users) will thank you.</p><p><a href="https://www.databricksters.com/p/everything-you-ever-wanted-to-know">To learn more about how to improve things, read our deep dive blog on Pandas UDF. </a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ybZ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>References</h3><ul><li><p>Ganesh, R. &#8220;Is really UDF hitting the performance in PySpark!&#8221; <em>Medium</em>, Jul 5, 2024. 
<a href="https://medium.com/%40rganesh0203/udf-is-hitting-the-performance-in-pysaprk-817b7e881dd2?utm_source=chatgpt.com">Medium</a></p></li><li><p></p><div id="youtube2-n9jodzYq1e4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;n9jodzYq1e4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/n9jodzYq1e4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li><li><p>AWS Documentation. &#8220;Optimize user-defined functions,&#8221; <em>Tuning AWS Glue for Apache Spark</em> (AWS Prescriptive Guidance). <a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/optimize-user-defined-functions.html?utm_source=chatgpt.com">AWS Documentation</a></p></li><li><p>Tang, T. &#8220;Spark functions vs UDF performance?&#8221; <em>Stack Overflow</em>, Mar 5, 2018. <a href="https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance?utm_source=chatgpt.com">Stack Overflow</a></p></li><li><p>Databricks. &#8220;Arrow-optimized Python UDFs in Apache Spark&#8482; 3.5,&#8221; <em>Databricks Blog</em>, Aug 26, 2024. <a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35?utm_source=chatgpt.com">Databricks</a></p></li><li><p>&#8220;Why You Should Avoid Using UDFs in PySpark,&#8221; <em>Det.Life Blog</em>, Jan 2024. <a href="https://blog.det.life/why-you-should-avoid-using-udf-in-pyspark-c57558af9d0a?utm_source=chatgpt.com">Data Engineer Things</a></p></li><li><p>Illustrious_Ad4259. &#8220;Are there any major disadvantages in performance for Spark when using PySpark?&#8221; <em>Reddit r/dataengineering</em>, Nov 2021. 
<a href="https://www.reddit.com/r/dataengineering/comments/qning9/are_there_any_major_disadvantages_in_performance/?utm_source=chatgpt.com">Reddit</a></p></li><li><p>Sen, Soutir. &#8220;PySpark UDFs (User-Defined Functions) &#8211; Complete Guide,&#8221; <em>LinkedIn Article</em>, Dec 2024. <a href="https://www.linkedin.com/pulse/pyspark-udfs-user-defined-functions-complete-guide-soutir-sen-jkd6f?utm_source=chatgpt.com">linkedin.com</a></p></li><li><p>Two Sigma. &#8220;Introducing Pandas UDFs for PySpark,&#8221; <em>Two Sigma Article</em>.</p><p></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[What a Netflix Senior Data Engineer Taught Us About Winning in Tech—And It’s Not What You Think]]></title><description><![CDATA[Spoiler: Tech is easy. Business is hard. 
And your ability to communicate might just be your biggest flex]]></description><link>https://www.canadiandataguy.com/p/what-a-netflix-senior-data-engineer</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/what-a-netflix-senior-data-engineer</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 17 Apr 2025 00:26:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p2jM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p2jM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p2jM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 424w, https://substackcdn.com/image/fetch/$s_!p2jM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 848w, https://substackcdn.com/image/fetch/$s_!p2jM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 1272w, https://substackcdn.com/image/fetch/$s_!p2jM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p2jM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png" width="1456" height="439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:800046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/161485517?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p2jM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 424w, https://substackcdn.com/image/fetch/$s_!p2jM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 848w, https://substackcdn.com/image/fetch/$s_!p2jM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 1272w, 
https://substackcdn.com/image/fetch/$s_!p2jM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54acfe3-9bf9-4ee0-aa49-1b1a79edad22_1586x478.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We recently had an enriching conversation with <strong><a href="https://www.linkedin.com/in/jarriett/">Jarriett, a Senior Data Engineer at Netflix</a></strong>. His success at Netflix isn't a mere coincidence&#8212;it's a powerful story of continuous learning, resilience, and compounded experience accumulated over many years. 
Engineers with 18 years of hands-on experience who have retained that level of curiosity are rare indeed. Here's a distilled summary of his key recommendations:</p><h3>"Use Custom GPT to Enhance Productivity"</h3><p>Jarriett shared practical tips on using AI tools, such as custom GPTs, to efficiently prepare for tasks like interviews. Leveraging these technologies effectively can save time, maintain focus, and elevate productivity.</p><h3>"Tell Your Project Story Like a Founder"</h3><p>When approaching behavioral interviews or discussions, Jarriett advises framing your projects as a founder would. Clearly articulate the problem, your process, the impact, and lessons learned. This narrative style demonstrates ownership and deep business understanding.</p><h3>"Tech is Easy&#8212;Business is Tough"</h3><p>One of the most impactful points Jarriett emphasized was that technical problems, while complex, are relatively straightforward compared to the nuanced, intricate challenges presented by business contexts. AI is increasingly managing technical tasks, highlighting the necessity for professionals to develop competencies AI cannot easily replicate, such as strategic business understanding and effective communication.</p><h3>"Document Clearly and Tailor to Your Audience"</h3><p>Effective documentation was another crucial lesson from Jarriett. He emphasizes tailoring your documentation to your audience: provide deep technical details when communicating with engineers and offer higher-level insights when engaging with business stakeholders. Documenting clearly ensures your contributions are recognized and remain influential long after your direct involvement. 
Additionally, he highlighted that thorough documentation makes your name appear frequently in internal searches and documents, thereby building recognition and visibility across your organization.</p><h3>"Step Out of the Tech Bubble and Be a Detective"</h3><p>Professionals should actively engage with the business side of projects, understanding stakeholders' real-world problems and motivations. Jarriett suggested adopting a detective-like mindset: asking probing questions, being curious, and identifying underlying business problems when talking to teams. This approach enhances your value in projects and reflects positively during performance reviews.</p><h3>"Stay Curious and Effective"</h3><p>Curiosity was repeatedly highlighted as a key trait. Being genuinely interested in exploring and solving problems fosters continuous growth and innovation. Jarriett exemplified effectiveness by integrating continuous learning into everyday tasks, such as listening to audio transcriptions of books during routine activities.</p><h3>"Write Understandable, Simple Code"</h3><p>He cautioned against overly clever or compact coding practices. Clear, understandable, and well-commented code is far more valuable than code that saves a few lines but is opaque to your colleagues. Simplicity in coding enhances maintainability and collaboration.</p><h3>"Resilience and Positive Mindset"</h3><p>Jarriett's journey to Netflix was marked by resilience. 
Even after initially facing setbacks in the interview process, his positive attitude, continued self-improvement, and unwavering resilience eventually led him to success.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"> Please hit a like or drop a  comment if you found it valuable</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h3>Recommended Reads for Personal Growth</h3><p>Finally, Jarriett generously shared several book recommendations that have significantly influenced his professional outlook:</p><ul><li><p><a href="https://www.amazon.com/Culture-Map-INTL-ED-Decoding/dp/1610392760?crid=OE7BPCYMFDR7&amp;dib=eyJ2IjoiMSJ9.daN77oCLHEgEPnI-8RDpmj4oFABjEJk1eCoYVqGgW2-JTmtgM_OuTcaVFZ4INl2qT2dbRWNMRMKqRRxxC3N_xw_mE6TWS9bzbTWOahKmrIhDkKTS-NlDgSe79nKXMoCjAhyHiUII2eURFC-63nmOy2dfACZj-K_w04keCduVdjtnBuQzgGQqj5Oo777zk_we1wyC7cCfyK8qQnwsLUZuCCEUMNoK5D7-pv0IEVyFl2E.hPZ-Y3zQP2Nd0TExKeZLCieKZ7BsZ8h4odZpfsPF3sI&amp;dib_tag=se&amp;keywords=The+Culture+Map&amp;qid=1744836601&amp;sprefix=the+culture+map%2Caps%2C129&amp;sr=8-2&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=9e88c0d65a423867c5ced48427ba9ec3&amp;language=en_US&amp;ref_=as_li_ss_tl">The Culture Map</a></p></li><li><p><a 
href="https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321?crid=2Y1VDCXZBUPTJ&amp;dib=eyJ2IjoiMSJ9.YcwA3XNSsL2_nKIjWj8V-JX0ax1QuQsHY1jekh_fsd5cM7Yo316rL62OpJmw-1QpsHHy8JY6a8hUHsrPRaHovS0P6T690YD_T5G499X4GlUU3dhH4C_TtUyyvEBSiQZ-bOyh8CcX2xBVWFWQsT9VfVhNuWWHL38Prr416fpfWN54qEDMNmEYeGg4pjs3fJ0e-wYsb7fsPpNDI9zD5Uoa1QD3DLpqRBJUbMT9JNUZl3k.uZGfUGbv9pRbqTrzgBTO8KpkjYkWFKjdpEF6iqP5wG8&amp;dib_tag=se&amp;keywords=Designing+Data+Intensive+Applications&amp;qid=1744836748&amp;sprefix=designing+data+intensive+applications%2Caps%2C123&amp;sr=8-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=76ef2ff538f8e236d34da9e1dd0c2a61&amp;language=en_US&amp;ref_=as_li_ss_tl">Designing Data Intensive Applications</a></p></li><li><p><a href="https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302?crid=SYQ4QBQVTQL1&amp;dib=eyJ2IjoiMSJ9.ca0rA-eLmjtu9d4hMSmZ3ol-NmdM7f20sjvCegDPDTqD-ed566tMQfXg9WZTFnx4Nh5UktOF4V11zYuwPhPtCiC3XstXaKGuJNYE3cCY4Uvq-mlBbcH0FXgtgzFqg07JaWaDlj7J6G6yhImdb8d9ZR38PqhkIYj3mDxr95584eHcaJ3z0voilT6KzYjcTdsZIwrRAYsNL6tPhEPrzzZ0Tgo190r8no89qPt5GmrOQjQ.YtzNgzaOjqhiWuwOsW_TZUAAOLAtPEvfFt_Sh0q4THY&amp;dib_tag=se&amp;keywords=Fundamentals+of+Data+Engineering&amp;qid=1744836796&amp;sprefix=fundamentals+of+data+engineering%2Caps%2C119&amp;sr=8-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=a5ccd5b33c4e9348f0de808645fbc7a9&amp;language=en_US&amp;ref_=as_li_ss_tl">Fundamentals of Data Engineering</a></p></li><li><p><a 
href="https://www.amazon.com/Good-to-Great-Jim-Collins-audiobook/dp/B003VXI5MS?crid=21RW41K58PJDZ&amp;dib=eyJ2IjoiMSJ9.zoq7ofISwxsBwvV3ss-VOk5uRw8FiNwQOrAh-VvWpvgq35Kk_4ITUdDQKFblghg9yK3iW8_1HmZpjo7P_Kp2yXw6aU5EMFZpD-a5z-VUBHtsyyvTHAbPGv1W2nC9X096sGww_h5NCAp_DLQfbRpSyFR8JJsztKz_GG5ve6u4eE11OSqSiOrnIzF0lRajMm971R5JhYjUnosxfkkia9wQQpu2oSKhb4Zb6aelLgk3N1s.o2y_5IU2gfrBp7lbgZOSaodmq2UEQiV-ZaxL0h4Mgwc&amp;dib_tag=se&amp;keywords=Good+to+Great&amp;qid=1744836820&amp;sprefix=good+to+great%2Caps%2C130&amp;sr=8-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=0b438eb60740ac92bcaf6d160d83c967&amp;language=en_US&amp;ref_=as_li_ss_tl">Good to Great</a></p></li><li><p><a href="https://www.amazon.com/Influence-New-Expanded-Psychology-Persuasion/dp/B08RLT11Q3?crid=19PMFJ9S7N357&amp;dib=eyJ2IjoiMSJ9.i1n6xlrrhQPxRJx9GvliUJXsvEyrJvd5YD18oCW53whro2VbV2sA3C5umG5TKYielZACY5ODT_xqRHv16okt4862ZDnmwawJguVbgXVNZJ-0GHqGAPQM-Sg-HBGzEC7c76no5mNZ_VcO1CA4ojy5BIy1mPv_x5dSGxnM6yls56FERWTd6XsrQ8wlvFTzkfQg2-zlY0vbwaHvxLKqFpZDTRKmVE3sWiTWH8ErnH39YNk.oMtTnTtp0EE3YIXQ6cvIo3x7QJDCpEs2hHZDM3XkqX8&amp;dib_tag=se&amp;keywords=Influence%3A+The+Psychology+of+Persuasion&amp;qid=1744836858&amp;s=audible&amp;sprefix=influence+the+psychology+of+persuasion%2Caudible%2C116&amp;sr=1-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=f68fd8401005f9f8890c37bc8fba2bf2&amp;language=en_US&amp;ref_=as_li_ss_tl">Influence: The Psychology of Persuasion</a></p></li><li><p><a 
href="https://www.amazon.com/Thinking-Fast-and-Slow-audiobook/dp/B005Z9GAJG?crid=3L8GFMSLM221X&amp;dib=eyJ2IjoiMSJ9.z_dGNSlUbLtLp6Gr08CXwnCOODwQZ7xa10pGyNdqA1vfbVRHURLQ5xk4OV31EzPBhVPQA2Ga-cmIXIVjjYUsYNRW99EoGfrLLmwim9WaZsQm5vpNtDtVjgGL-2r-1Lav9GkSZOGVdI2t8Pre38tTZxVhSrQvGhvq-Lo3fw3mcHmXiume5wxOcbGfx31xuaxe._1tqMtQIhFDm3UYyafiuoN6J1NdgtRKhg5OGxrDoOys&amp;dib_tag=se&amp;keywords=Thinking%2C+Fast+and+Slow&amp;qid=1744836891&amp;s=audible&amp;sprefix=thinking%2C+fast+and+slow%2Caudible%2C109&amp;sr=1-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=ee5be85ccce920b62ccecb1f2eeaf800&amp;language=en_US&amp;ref_=as_li_ss_tl">Thinking, Fast and Slow</a></p></li><li><p><a href="https://www.amazon.com/Staff-Engineer-Leadership-Beyond-Management/dp/B097CNXP89?crid=2SKRSHRERMIGJ&amp;dib=eyJ2IjoiMSJ9.TRxLEgZ6Ndq7Q5nLfVJJNw.jgZEy224UplNsJM3hCSI2gIMViCbF4Mad1uMUGg2LYI&amp;dib_tag=se&amp;keywords=Staff+Engineer%3A+Leadership+Beyond+the+Management+Track&amp;qid=1744836920&amp;s=audible&amp;sprefix=staff+engineer+leadership+beyond+the+management+track%2Caudible%2C114&amp;sr=1-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=4fa18e457a13e1949da76af838949304&amp;language=en_US&amp;ref_=as_li_ss_tl">Staff Engineer: Leadership Beyond the Management Track</a></p></li><li><p><a href="https://www.amazon.com/Antifragile-Nassim-Nicholas-Taleb-audiobook/dp/B00A2ZIZYQ?crid=S3DG1KANFF6A&amp;dib=eyJ2IjoiMSJ9.VDnkKbIijW6LwOFKA-POIpGm2oiB3TKDeU0kahusBmPGjHj071QN20LucGBJIEps.mvRW2qU_jrBagV9XiDqZ07CMkyLtuP87cCBoo8C433k&amp;dib_tag=se&amp;keywords=Antifragile%3A+Things+that+Gain+From+Disorder&amp;qid=1744836942&amp;s=audible&amp;sprefix=antifragile+things+that+gain+from+disorder%2Caudible%2C121&amp;sr=1-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=706f8324d7e7b04da637168e49ca146e&amp;language=en_US&amp;ref_=as_li_ss_tl">Antifragile: Things that Gain From Disorder</a></p></li><li><p><a 
href="https://www.amazon.com/Phoenix-Project-Graphic-Helping-Business/dp/1950508919?crid=175DOJYXAYZCB&amp;dib=eyJ2IjoiMSJ9.ekpqmsJ8TDcVLeSxEtff7gd1wPiB4IghLFTSicevg-jpwdZibbArScT4LF2vBI3S_Wz0IeWn4VRnow1vtV4ZATSoccusJIRs85UOvT0PST79dqO7yfLmzbHaLBQRPRCm93pq_8DJyCWoc4n45vEjVVhLFlOYvdxbTuKId8im-JnCbVIclO2eAS8AX8Sp3XfLWoas1q5Iyr0UiU03-ejhSgqP9qHOR2PFaafBxdEVgkY.X9vymYq2HYVuUa8vX9CxFoX0m2preM-AVsqTE2LR2XY&amp;dib_tag=se&amp;keywords=The+Phoenix+Project&amp;qid=1744836977&amp;sprefix=the+phoenix+project%2Caps%2C159&amp;sr=8-2&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=3d79aded796d45baa3ec718eb432eac4&amp;language=en_US&amp;ref_=as_li_ss_tl">The Phoenix Project</a></p></li><li><p><a href="https://www.amazon.com/Benjamin-Franklin-American-Walter-Isaacson/dp/074325807X?_encoding=UTF8&amp;dib_tag=se&amp;dib=eyJ2IjoiMSJ9.Zll0sQJOrndaWuE6ztrwK16ZPpQVClaQ4PTVkbr42Tn_Q0qa_Fuq73Fcetx8RzJVy3MjQlGfsfMxunbaMxxVdjZKeaf1IpHRwCCvmaQjC3V6K1uWxjwCiKmHW-_ETLfKKJW5LwMrNtoJNgpJ4BS_MHfBq1pogk4rvsH6kiuuofPkYnBeyN3hQkHjBO0p0tGGO1IHWLZuw77hr3Pl19SX7y9guW6_d0UTuTR2-7vxJWg.qQyyXyccMseUcgNYAAqExzGfCAi42qFgVg2JigbI3-4&amp;qid=1744836999&amp;sr=8-1&amp;linkCode=ll1&amp;tag=canadiandat06-20&amp;linkId=8cdc16ca7396cd421a2afec74a36ce9f&amp;language=en_US&amp;ref_=as_li_ss_tl">Ben Franklin: An American Life</a></p></li><li><p><a href="https://www.10minmba.com/">https://www.10minmba.com/</a></p></li></ul><p>In summary, Jarriett&#8217;s insights underscore the importance of balancing technical expertise with strategic business thinking, effective communication, continuous learning, and tailored documentation to excel as a data professional in an increasingly AI-driven world. We sincerely thank Jarriett for generously sharing his valuable time and insights. 
If you'd like to connect with him or explore more of his professional journey, you can find him on LinkedIn: <a href="https://www.linkedin.com/in/jarriett/">https://www.linkedin.com/in/jarriett/</a></p>]]></content:encoded></item><item><title><![CDATA[How Do I Think About Setting Spark Shuffle Partitions in 2025?]]></title><description><![CDATA[TLDR: A Quick Guide to Setting spark.sql.shuffle.partitions, No Deep Dive Required]]></description><link>https://www.canadiandataguy.com/p/how-do-i-think-about-setting-spark</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-do-i-think-about-setting-spark</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Tue, 15 Apr 2025 21:36:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1B0p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 2025, agonizing over Spark shuffle partitions has become largely unnecessary thanks to modern innovations in the Spark ecosystem. In earlier years&#8212;say, 2015 to 2019&#8212;the default setting of 200 partitions often proved either too high or too low, prompting manual tuning and much deliberation. However, with advances like Adaptive Query Execution (AQE), many of these decisions are now automatically managed, ensuring optimal performance without constant human intervention. 
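</p><p>For reference, AQE's partition coalescing is governed by a few standard Spark settings. A minimal sketch (the 128 MB advisory size below is an illustrative assumption, not a recommendation from this article):</p><pre><code># AQE is on by default since Spark 3.2; shown explicitly for clarity
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE coalesce many small shuffle partitions into fewer, right-sized ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Advisory target size of a post-shuffle partition (illustrative value)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")</code></pre><p>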
This guide provides a streamlined decision tree to help you quickly determine if any manual adjustment is needed, so you can focus on higher-value aspects of your data processing work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1B0p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1B0p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 424w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 848w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1B0p!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" width="1200" height="881.8681318681319" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c40b64c5-4592-4329-a649-b2103a6f93e4_3840x2822.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1070,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:714500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/161416272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40b64c5-4592-4329-a649-b2103a6f93e4_3840x2822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1B0p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 424w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 848w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>How to calculate in-memory data size</strong></h2><p>When assessing data size for partitioning in Spark, it's important to note that the on-disk size&#8212;such as data stored in S3&#8212;does not always reflect the in-memory size. This is because data formats like Parquet or Avro are highly compressed, and the actual memory footprint can be 2 to 8 times larger than the file size on disk. 
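As a rough, non-Spark illustration of this expansion, compressing a highly repetitive record set shows how far a serialized, compressed ("on-disk") size can fall below the raw in-memory representation. The 2 to 8 times figure depends entirely on the data and format; this toy example is only directional:

```python
import json
import zlib

# Toy records with low-cardinality columns, similar to typical dimension/fact data.
rows = [{"user_id": i % 100, "country": "CA", "status": "active"} for i in range(10_000)]

raw = json.dumps(rows).encode("utf-8")  # stand-in for the row-format, in-memory size
compressed = zlib.compress(raw)         # stand-in for a compressed on-disk format

expansion = len(raw) / len(compressed)
print(f"in-memory ~{len(raw):,} B, on-disk ~{len(compressed):,} B, expansion ~{expansion:.0f}x")
```

Real Parquet or Avro ratios vary widely with the data, which is exactly why measuring the actual shuffle read size is more trustworthy than any rule of thumb.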
Understanding the in-memory size is essential for properly tuning your shuffle partition settings.</p><p>To accurately gauge this in-memory size, you can run the following Spark commands to trigger a computation and then inspect the Spark UI (under the SQL/DataFrame tab) for the 'Shuffle read size':</p><pre><code># Read data (example: Parquet file)
df = spark.read.load("examples/src/main/resources/users.parquet")

# Write with the no-op format (writes nothing, but triggers the full computation)
df.write.format("noop").mode("overwrite").save()</code></pre><p>This approach helps ensure that you're basing your partitioning decisions on the actual memory requirements rather than the compressed on-disk sizes.</p><h2>References</h2><p><a href="https://www.databricks.com/notebooks/gallery/SparkAdaptiveQueryExecution.html">https://www.databricks.com/notebooks/gallery/SparkAdaptiveQueryExecution.html</a></p><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">https://www.databricks.com/discover/pages/optimize-data-workloads-guide</a></p><h2><strong>Keep This Post Discoverable: Your Engagement 
Counts!</strong></h2><p>Your engagement with this blog post is crucial! Without claps, comments, or shares, this valuable content might become lost in the vast sea of online information. Search engines like Google rely on user engagement to determine the relevance and importance of web pages. If you found this information helpful, please take a moment to clap, comment, or share. Your action not only helps others discover this content but also ensures that you&#8217;ll be able to find it again in the future when you need it. Don&#8217;t let this resource disappear from search results &#8212; show your support and help keep quality content accessible!</p>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Broadcast Hash Join]]></title><description><![CDATA[Everything You Need to Know About Broadcast Hash Join]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-broadcast</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-broadcast</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Mon, 14 Apr 2025 14:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark employs multiple join strategies to efficiently combine datasets in a distributed environment. This guide provides a <strong>zero-to-hero</strong> explanation of the three primary join strategies &#8211; <strong>Broadcast Hash Join (BHJ)</strong>, <strong>Shuffle Hash Join (SHJ)</strong>, and <strong>Sort-Merge Join (SMJ)</strong> &#8211; with a focus on Databricks. We will explore how each strategy works, their execution plans (DAG stages, partitioning, memory and shuffle behavior), and how to tune these joins on Databricks (including relevant configurations like AQE and join hints). 
A visual cheat sheet and further reading resources are provided at the end.</p><h2>Introduction to Spark Join Strategies</h2><p>In Spark SQL, a <em>join</em> combines two datasets by matching rows on a common key. The way Spark executes the join greatly impacts performance, especially with large data. Spark&#8217;s Catalyst optimizer will choose a join strategy based on data statistics (size of each side, join type, etc.), or you can influence it via hints and settings. The three main join strategies for equi-joins are:</p><ul><li><p><strong>Broadcast Hash Join (BHJ)</strong> &#8211; Broadcasts the entire smaller dataset to all executors, avoiding shuffles for that side. Very fast when one side is sufficiently small; analogous to a map-side join in Hadoop.</p></li><li><p><strong>Shuffle Hash Join (SHJ)</strong> &#8211; Shuffles both datasets on the join key, then builds a hash table from the smaller side of each partition and streams the larger side to find matches. Avoids the sort step of SMJ but requires enough memory per partition.</p></li><li><p><strong>Sort-Merge Join (SMJ)</strong> &#8211; Shuffles both datasets on the join key and sorts them, then merges sorted partitions to find matches. This is Spark&#8217;s default strategy for large data and supports all join types. It&#8217;s robust (can spill to disk if needed) but involves heavy network and CPU overhead for sorting.</p></li></ul><p>Each strategy has optimal use cases and pitfalls. In Databricks (which uses Spark under the hood), adaptive query execution (AQE) can dynamically optimize joins (e.g., switching strategies or handling skew) to improve performance. We&#8217;ll now dive into each strategy in detail.</p><h2>What is a Broadcast Hash Join (BHJ)?</h2><p>A <strong>Broadcast Hash Join</strong> is an efficient strategy used to join two datasets in Spark when one of them is significantly smaller than the other. 
Instead of moving data across the network (shuffling) for both sides of the join, Spark copies&#8212;or "broadcasts"&#8212;the entire small dataset to every worker node (executor). Then, each executor performs a local hash join between its partition of the larger dataset and the entire, locally cached, small dataset. This approach helps to avoid expensive network shuffling and the need for sorting on either side of the join.</p><div><hr></div><h3>The Broadcast Process in Detail</h3><p>The broadcast procedure involves:</p><ol><li><p><strong>Collecting the Data:</strong><br>The driver first gathers the entire small dataset and converts it into an efficient in-memory data structure (typically a hash map).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2dL-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2dL-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png" width="696" height="696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/796ba06f-0553-4312-834e-611a5f5615af_1024x1024.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:696,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2dL-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!2dL-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2dL-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adbe292-c6f7-49fd-b044-493aa6f8aa5f_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Distributing the Data:</strong><br>This hash map is then distributed (broadcast) to all executor nodes, usually via a network distribution algorithm akin to torrent distribution.</p></li><li><p><strong>Utilizing the Broadcast Data:</strong><br>Each executor then uses the broadcasted data to quickly look up matching join keys when processing its 
partition of the larger dataset.</p></li></ol><p>Understanding these steps is crucial because if any stage fails&#8212;whether due to memory limits on the driver, executor constraints, or even network issues&#8212;the entire query may fail.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LLu1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LLu1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 424w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 848w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LLu1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png" width="1200" height="948.6263736263736" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaab4992-f0b0-4132-8663-112361f4f830_2432x1922.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1151,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:549678,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/160914089?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaab4992-f0b0-4132-8663-112361f4f830_2432x1922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LLu1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 424w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 848w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1272w, https://substackcdn.com/image/fetch/$s_!LLu1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f11e026-1f7b-465a-9cda-78e18c7cb38d_2432x1922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>When Does Spark Use BHJ?</h3><p>Spark will automatically choose to perform a Broadcast Hash Join under these conditions:</p><ul><li><p><strong>Dataset Size:</strong> One side of the join is smaller than a pre-configured threshold, which is by default 10 MB in open-source Spark. 
In Databricks environments, this threshold is commonly increased (e.g., ~30 MB with adaptive execution), meaning Databricks can handle moderately larger tables.</p></li><li><p><strong>Join Type:</strong> The join condition is an equality condition (equi-join).</p></li></ul><p>The setting <code>spark.sql.autoBroadcastJoinThreshold</code> controls this threshold and can be adjusted based on available memory and expected performance benefits.</p><p>BHJ works well with these join types:</p><ul><li><p><strong>Supported:</strong> Inner joins, and left, semi, or anti joins (as long as the correct side is broadcast).</p></li><li><p><strong>Limitations:</strong> It is not supported for full outer joins. For right outer joins, only the left table can be broadcast; similarly, in left joins only the right table can be broadcast.</p></li></ul><p>If the join type is not supported by a BHJ, Spark may revert to another join strategy, such as a sort-merge join or a broadcast nested loop join when dealing with non-equi conditions.</p><div><hr></div><h4>Databricks and Adaptive Query Execution (AQE)</h4><p>In Databricks:</p><ul><li><p><strong>Adaptive Query Execution (AQE):</strong> AQE can dynamically convert a sort-merge join into a broadcast hash join if it determines at runtime that one side of the join is smaller than the broadcast threshold.</p></li><li><p><strong>Higher Thresholds:</strong> Databricks&#8217; default setting for auto-broadcast (often <code>spark.databricks.adaptive.autoBroadcastJoinThreshold</code>) may be set higher (e.g., 30 MB) to allow for broadcasting moderately larger tables.</p></li><li><p><strong>Forcing Broadcasts:</strong> Although AQE works automatically, you might sometimes use explicit hints (such as <code>/*+ BROADCAST(table) */</code> in SQL or wrapping a DataFrame with <code>broadcast(df)</code> in PySpark) to ensure the small dataset is broadcast immediately, thereby skipping unnecessary shuffles.</p><p></p></li></ul><div><hr></div><h2>Common 
Misconception: Order of Joins</h2><p>For optimal join order, perform joins from smallest to largest tables first to minimize data shuffling. <strong>However, do broadcast joins last</strong>, even though this seems counterintuitive. This is because:</p><ul><li><p>Broadcast joins don't require shuffles and can be executed efficiently even on large fact tables.</p></li><li><p>If broadcast joins are done first, the joined data needs to be shuffled again for later joins.</p></li><li><p>By doing broadcast joins last, we avoid having to shuffle that data again.</p></li><li><p>Group together joins that share the same ON clause to reduce shuffling, since the data is already arranged properly.</p></li></ul><h3>Memory and Shuffle Considerations</h3><p>Using BHJ provides tremendous speedups by eliminating the costly shuffle of the larger dataset. However, it comes with some significant memory considerations:</p><ul><li><p><strong>Driver Memory:</strong> The whole small dataset must be collected on the driver before it can be broadcast. The driver has a memory limit, defined by <code>spark.driver.maxResultSize</code>, and exceeding this limit will cause the job to fail.</p></li><li><p><strong>Executor Memory:</strong> Each executor must have enough memory to store the broadcasted dataset along with its own processing workload. The available memory on the node with the smallest capacity is the practical limit.</p></li><li><p><strong>Timeout and Overload Risks:</strong> If the dataset is even moderately large, broadcasting it might overwhelm the driver or network, leading to out-of-memory (OOM) errors or timeouts. 
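Since limits like spark.driver.maxResultSize are configured as size strings ("4g", "200m"), comparing them against an estimated broadcast size means first converting them to bytes. The helper below is a hypothetical convenience function written for this article, not a Spark API, and for brevity it only handles single-letter suffixes:

```python
# Hypothetical helper (not part of Spark): convert Spark-style size strings,
# as used by configs like spark.driver.maxResultSize, into bytes.
# Handles single-letter suffixes only (b, k, m, g, t) for brevity.
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def size_to_bytes(size: str) -> int:
    s = size.strip().lower()
    if s.isdigit():  # bare number: treated as bytes
        return int(s)
    return int(s[:-1]) * UNITS[s[-1]]

# e.g. sanity-check an estimated broadcast size against a driver-side limit
max_result = size_to_bytes("4g")
estimated_broadcast = size_to_bytes("200m")
print(estimated_broadcast < max_result)  # True: the broadcast fits under the limit
```

A guard like this can run before a job forces a broadcast, so the decision is made against numbers rather than eyeballed config strings.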
For example, while Databricks has even seen broadcasts for datasets up to a few GB in size, one must exercise extreme caution when attempting such operations.</p></li><li><p><strong>Compression Differences:</strong> Note that the on-disk size of data (like Parquet files in Delta tables) might be much smaller than the in-memory representation. Spark&#8217;s decisions are based on disk size, so actual in-memory data after decompression might far exceed the expected limits.</p></li></ul><p>To address these issues, you can either disable auto-broadcast by setting <code>spark.sql.autoBroadcastJoinThreshold</code> to -1 or lower the threshold to ensure no large table is inadvertently broadcasted. On Databricks with the Photon engine, <strong>executor-side broadcasts</strong> further alleviate pressure on the driver because the broadcast process does not rely solely on the driver's resources.</p><div><hr></div><h3>Performance Recommendations</h3><ul><li><p><strong>When to Use BHJ:</strong><br>Use Broadcast Hash Join when one dataset is much smaller than the other. This is commonly the case when joining large fact tables with much smaller dimension tables or when one table is the result of a selective filter.</p></li><li><p><strong>Why Forcing Broadcasts:</strong><br>While Spark&#8217;s optimizer may choose to broadcast small datasets automatically, in complex queries or skewed datasets the statistics might not be accurate. In those cases, manually forcing a broadcast using explicit hints ensures that the join operation skips the shuffle stage and executes as a broadcast join.</p></li><li><p><strong>Caution in Production:</strong><br>Forcing broadcasts in ad hoc queries or development is acceptable. However, in production workloads, it&#8217;s important to validate the dataset size at runtime. This can be done by checking record counts and partition sizes to avoid overloading any executor or the driver. 
Monitoring the Spark UI is critical to ensure broadcasts do not result in GC (garbage collection) pressure or other resource issues.</p></li></ul><div><hr></div><h3>Example SQL with Broadcast Hint</h3><p>To explicitly force a broadcast in SQL, you can include the following hint in your query:</p><pre><code>SELECT /*+ BROADCASTJOIN(table1) */
       table1.id, table1.col, table2.id, table2.int_col
FROM table1
JOIN table2
  ON table1.id = table2.id;</code></pre><p>In the physical plan, you will see a <code>BroadcastExchange</code> operator for the small table along with a <code>BroadcastHashJoin</code> operator, indicating that the join was executed without additional shuffling of the large table.</p><pre><code>SQL Query: 
select /*+ BROADCASTJOIN(table1)*/ table1.id,table1.col,table2.id,table2.int_col from table1 join table2 on table1.id = table2.id

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [id#271L], [id#286L], Inner, BuildLeft, false
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#955]
   :  +- Filter isnotnull(id#271L)
   :     +- Scan ExistingRDD[id#271L,col#272]
   +- Filter isnotnull(id#286L)
      +- Scan ExistingRDD[id#286L,int_col#287L]

Number of records processed: 799541
Querytime : 15.35717314 seconds</code></pre><div><hr></div><h3>Key Pitfalls and Best Practices</h3><ul><li><p><strong>Avoid Broadcasting Too Much Data:</strong><br>Never broadcast a table that is too large (generally over 1GB) as it can overwhelm the driver and executors. Spark has a hard limit (roughly 8GB) on what it can broadcast.</p></li><li><p><strong>Watch for Non-Equi Joins:</strong><br>BHJ only supports joins using equality conditions (equi-joins). When using non-equi join conditions (such as range conditions), BHJ cannot be applied.</p></li><li><p><strong>Force with Caution:</strong><br>When you force a broadcast using hints or functions like <code>broadcast(df)</code>, you bypass Spark&#8217;s adaptive query execution optimizations. This is useful if you are sure the data size is small, but can cause performance issues if the dataset unexpectedly grows.</p></li><li><p><strong>Plan for Memory Needs:</strong><br>Increase the broadcast thresholds only if your driver and executors have ample memory. For instance, a driver with 32GB+ memory might safely use higher thresholds (like 200MB). Be sure to also configure <code>spark.driver.maxResultSize</code> appropriately to avoid driver-level memory errors.</p></li></ul><div><hr></div><h2>Production Advice</h2><p>When deploying BHJ in production workloads, careful planning and ongoing monitoring are essential to ensure stable performance:</p><ul><li><p><strong>Validate Data Sizes:</strong> Always verify that the dataset chosen for broadcasting is truly small both on disk and in-memory. Measure the record count and partition sizes before forcing a broadcast. This helps prevent unexpected OOM (out-of-memory) failures, which can occur when the dataset size exceeds available memory on the driver or executors.</p></li><li><p><strong>Check Data Size and Record Count</strong></p><ul><li><p><strong>Count the Records:</strong> Before attempting a broadcast, run a simple <code>df.count()</code> on the small dataset. 
This confirms that the number of records is within an acceptable range.</p></li><li><p><strong>Estimate Data Size in Memory:</strong> Sometimes the dataset's on-disk size differs from its in-memory footprint. You can either use approximations from your data source&#8217;s statistics or compute a rough estimate using:</p></li></ul><pre><code># Example in PySpark
data_size_in_bytes = df.rdd.map(lambda row: len(str(row))).sum()
print("Approximate in-memory size (bytes):", data_size_in_bytes)</code></pre><ul><li><p>While this isn&#8217;t exact, it provides an estimate that can be compared against thresholds like <code>spark.sql.autoBroadcastJoinThreshold</code></p></li></ul></li><li><p><strong>Threshold Validation before Forcing a Broadcast</strong></p><ul><li><p><strong>Compare Against Broadcast Thresholds:</strong> Before performing an explicit broadcast, validate that the data size is below the configured threshold (e.g., 10MB, 30MB, or a custom value in your Spark configuration). This might involve:</p></li></ul><pre><code>broadcast_threshold = int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold").replace("b", ""))
# Assume approximate_size holds our computed or estimated size of the dataset in bytes.
if approximate_size &lt; broadcast_threshold:  # already an int; no second cast needed
    print("Proceed with broadcast")
    # Then use broadcast
    from pyspark.sql.functions import broadcast
    df_broadcasted = broadcast(df)
else:
    print("Data too large; do not broadcast")
</code></pre><ul><li><p>This validation helps avoid unintentionally broadcasting a dataset that is too big, potentially causing an OOM error.</p></li></ul></li><li><p><strong>Monitor Resource Usage:</strong> Leverage Spark&#8217;s UI and logging mechanisms to track metrics like GC (garbage collection) activity, memory usage, and broadcast sizes. The smallest available executor memory sets the limit, so ensure that the broadcast data comfortably fits on each node.</p></li><li><p><strong>Use Adaptive Query Execution (AQE) Carefully:</strong> While Spark&#8217;s AQE can convert joins to BHJ at runtime, explicitly broadcasting small datasets using hints or functions like <code>broadcast(df)</code> can bypass the overhead of shuffling. However, avoid hardcoding broadcast hints unless you are confident of the dataset's size, as data volumes may fluctuate in production workloads.</p></li><li><p><strong>Configure Thresholds Cautiously:</strong> Adjust configurations such as <code>spark.sql.autoBroadcastJoinThreshold</code> (and related thresholds in environments like Databricks) based on current cluster resources. For drivers with high memory (32GB+), thresholds can be increased, but setting these too high risks overwhelming your system if data volumes grow unexpectedly.</p></li><li><p><strong>Plan for Scalability and Edge Cases:</strong> Implement safeguards within your production pipelines. For instance, include runtime validations or logic to disable broadcasting dynamically when data sizes approach critical limits. This is especially important for pipelines handling dynamic or streaming data where bursts of data could otherwise lead to system instability.</p></li><li><p>If you&#8217;re running a driver with a lot of memory (32GB+), you can safely raise the broadcast thresholds to something like <strong>200MB</strong></p></li></ul><pre><code><code>set spark.sql.autoBroadcastJoinThreshold = 209715200;
set spark.databricks.adaptive.autoBroadcastJoinThreshold = 209715200;</code></code></pre><ul><li><p><strong>Why do we need to explicitly broadcast smaller tables if AQE can automatically broadcast smaller tables for us?</strong> The reason is that AQE optimizes queries while they are being executed.</p><ul><li><p>Spark must first shuffle the data on both sides; only then can AQE alter the physical plan based on the shuffle-stage statistics and convert the join to a broadcast join.</p></li><li><p>Therefore, if you explicitly broadcast smaller tables using hints, the shuffle is skipped altogether and your job will not need to wait for AQE&#8217;s intervention to optimize the plan.</p></li></ul></li><li><p><strong>Never broadcast a table bigger than 1GB</strong> because the broadcast happens via the driver, and a 1GB+ table will either cause an OOM on the driver or make the driver unresponsive due to large GC pauses.</p></li><li><p>Please note that the size of a table on disk and in memory will never be the same. Delta tables are backed by Parquet files, which can have varying levels of compression depending on the data. Spark might broadcast them based on their on-disk size; however, they may be much larger (even more than 8GB) in memory after decompression and conversion from columnar to row format. Spark has a hard limit of 8GB on the table size it can broadcast. As a result, your job may fail with an exception in this circumstance. 
In this case, the solution is to either disable broadcasting by setting <code>spark.sql.autoBroadcastJoinThreshold</code> to -1 and explicitly broadcast (via hints or the PySpark <code>broadcast</code> function) only the tables that are really small on disk as well as in memory, or set <code>spark.sql.autoBroadcastJoinThreshold</code> to a smaller value like 100MB or 50MB instead of -1.</p></li><li><p>The driver can only collect up to 1GB of data in memory at any given time by default, and anything more than that will trigger an error in the driver, causing the job to fail. However, since we want to broadcast tables larger than 10MB, we risk running into this problem. It can be solved by increasing the value of the following driver <a href="https://spark.apache.org/docs/latest/configuration.html#application-properties">configuration</a>.</p><ul><li><p>Please keep in mind that because this is a driver setting, it cannot be altered once the cluster is launched. Therefore, it should be set under the cluster&#8217;s advanced options as a Spark config. Setting this parameter to 8GB for a driver with &gt;32GB memory works fine in most circumstances. In certain cases where the broadcast hash join is going to broadcast a very large table, setting this value to 16GB would also make sense.</p></li><li><p>In <a href="https://learn.microsoft.com/en-us/azure/databricks/runtime/photon">Photon</a>, broadcasts are performed on the executor side. So, you don&#8217;t have to change the following driver configuration if you use a Databricks Runtime (DBR) with Photon.</p></li></ul></li></ul><pre><code>spark.driver.maxResultSize 16g</code></pre><div><hr></div><h3>Final Thoughts</h3><p>In summary, Broadcast Hash Join is a fast and efficient join strategy in Spark for unbalanced joins where one dataset is significantly smaller than the other. 
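<p>The broadcast-then-local-hash-join mechanics can be simulated in plain Python. The sketch below is illustrative only (toy rows, not Spark's actual implementation): it builds a hash table from the small broadcast side, then probes it with rows of the large side, mirroring the two phases of BHJ.</p>

```python
# Simulate the two phases of a Broadcast Hash Join on a single executor.
# The "broadcast" side is small; the "streamed" side is one partition of a large table.

def broadcast_hash_join(small_rows, large_partition, key):
    # Build phase: hash the broadcasted (small) dataset by the join key.
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)

    # Probe phase: stream the large partition and look up each key locally.
    # No shuffle is needed because the small side is already on this node.
    joined = []
    for row in large_partition:
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical dimension and fact rows for illustration.
dim = [{"id": 1, "country": "CA"}, {"id": 2, "country": "US"}]
fact = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 5}]

result = broadcast_hash_join(dim, fact, "id")
# Inner-join semantics: id=3 has no match in the build side and is dropped.
```

<p>In real Spark, the build happens once per executor from the broadcast variable, and the probe runs independently for every partition of the large table.</p>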
It avoids the expensive shuffling of the larger dataset by replicating the small data across all executors, enabling quick local hash lookups. However, its effectiveness depends heavily on the small dataset fitting in memory on the driver and executors. Forcing broadcasts should be done judiciously, with thorough validations in production to prevent resource exhaustion and associated failures.</p><p>By understanding the details of how BHJ operates and its configurations, you can better optimize your Spark jobs and manage performance, especially in environments like Databricks where adaptive query execution and executor-side optimizations further enhance its capabilities.</p><h4>How the Process Works</h4><p>BHJ operates in <strong>two main phases</strong>:</p><ol><li><p><strong>Broadcast Phase:</strong></p><ul><li><p><strong>Collection and Broadcast:</strong> The small table is first collected by the Spark driver. After collection, the data is broadcast to all the executors across the cluster.</p></li><li><p><strong>Local Caching:</strong> Once received on each node, the small dataset is cached in memory as a read-only broadcast variable. This ensures that the data is immediately available for the join process without any further data movement.</p></li></ul></li><li><p><strong>Hash Join Phase:</strong></p><ul><li><p><strong>Building a Hash Map:</strong> Each executor creates an in-memory hash map from the broadcasted dataset. The hash map is built using the join key.</p></li><li><p><strong>Local Join Operation:</strong> As the larger dataset is processed, every row in each partition is checked against the hash map for matching join keys. 
Because the small dataset is already available locally, this lookup is very fast and eliminates the need for shuffling data across the network.</p></li></ul></li></ol><p>Since no sort or extra merge steps are required, this one-pass in-memory lookup per partition makes the Broadcast Hash Join particularly quick, especially in common scenarios like joining large fact tables with much smaller dimension tables (a typical star schema pattern).</p><h2>Further Reading</h2><p>For more in-depth information and the latest updates on Spark join optimizations, the following resources are highly recommended:</p><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, etc. (See <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features like converting SMJ to BHJ/SHJ and skew join handling&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks&#8203;</p><p><a 
href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations&#8203;</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>&#8220;Top 5 Mistakes That Make Your Databricks Queries Slow&#8221; &#8211; Perficient Blog:</strong> Section 1 and 2 discuss data skew and suboptimal join strategies, with tips on salting and broadcast joins&#8203;</p><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/#:~:text=1">blogs.perficient.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a 
href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p>https://docs.databricks.com/en/sql/language-manual/hints.html</p></li><li><p>https://medium.com/@dezimaldata/how-databricks-optimizes-spark-sql-joins-aqe-cbo-and-more-5ac4c4d53091</p></li><li><p>https://www.perficient.com/insights/blog/2023/01/top-5-mistakes-that-make-your-databricks-queries-slow</p></li><li><p>https://www.databricks.com/resources/whitepapers/optimizing-apache-spark-on-databricks</p></li></ul><p>By consulting these materials, you can deepen your understanding of Spark join mechanisms and keep up to date with the evolving best practices on the Databricks platform.</p>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Shuffle Hash]]></title><description><![CDATA[Everything You Need to Know About Shuffle Hash Join]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-shuffle</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-shuffle</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 10 Apr 2025 14:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4QvA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Introduction</h2><p>Modern big data applications often require joining huge datasets efficiently. Choosing the right join strategy is critical to optimize performance and resource usage. Apache Spark offers several join methods, including broadcast joins, sort-merge joins, and shuffle hash joins. 
Shuffle hash join (SHJ) stands out as a middle-ground approach:</p><ul><li><p>It <strong>shuffles</strong> both tables like sort-merge joins to align data with the same key.</p></li><li><p>Instead of sorting, it builds an <strong>in-memory hash table</strong> for the smaller dataset per partition and probes it with rows from the larger dataset.</p></li></ul><p>This dual approach has the potential to improve execution time by reducing the sorting overhead but demands careful memory management.</p><div><hr></div><h2>2. Understanding Shuffle Hash Join</h2><p><strong>Shuffle Hash Join</strong> is best understood as a hybrid that borrows elements from two traditional join methods:</p><ul><li><p><strong>Sort Merge Join (SMJ)</strong></p><ul><li><p><strong>Mechanism:</strong> Both datasets are sorted by the join key and then merged.</p></li><li><p><strong>Pros:</strong> Reliable for large datasets.</p></li><li><p><strong>Cons:</strong> Sorting is CPU-intensive.</p></li></ul></li><li><p><strong>Broadcast Hash Join (BHJ)</strong></p><ul><li><p><strong>Mechanism:</strong> The smaller table is broadcast to all nodes, and each executor performs a local hash join.</p></li><li><p><strong>Pros:</strong> Eliminates shuffling.</p></li><li><p><strong>Cons:</strong> Limited by broadcast size; not suitable when the smaller table exceeds available memory on executors.</p></li></ul></li></ul><p><strong>How SHJ Differentiates Itself:</strong></p><ul><li><p><strong>Key Step:</strong> It shuffles both datasets based on the join key so that every partition contains matching keys.</p></li><li><p><strong>In-Partition Operation:</strong> Instead of sorting the data in each partition, Spark builds a hash table from the smaller dataset's partition and then probes that table with each row from the larger dataset.</p></li><li><p><strong>Memory Sensitivity:</strong> The approach assumes that each partition of the smaller side can be held in memory, which is crucial for performance and avoiding runtime 
errors.</p></li></ul><blockquote><p><strong>Key Concepts to Remember:</strong></p><ul><li><p><strong>No Sorting:</strong> Eliminates the costly sort phase.</p></li><li><p><strong>Memory Requirement:</strong> High dependency on the ability to fit the hashed partition in memory, risking OOM errors if miscalculated.</p></li></ul></blockquote><div><hr></div><h2>3. When to Use SHJ</h2><h3>Historical Perspective</h3><ul><li><p><strong>Pre-Spark 3.0:</strong><br>Spark defaulted to Sort Merge Join for equality-based joins due to the risk of OOM when building in-memory hash tables.</p></li><li><p><strong>Spark 3.x and Beyond:</strong><br>With enhancements like Adaptive Query Execution (AQE), Spark can dynamically decide to use SHJ when it detects that:</p><ul><li><p>The smaller dataset, after partitioning, is of manageable size.</p></li><li><p>Avoiding the expensive sorting operation is beneficial for performance.</p></li></ul></li></ul><h3>Practical Scenarios</h3><ul><li><p><strong>Moderately Small Datasets:</strong><br>When one dataset is small enough that its partitions are lightweight (e.g., 5 MB per partition out of 5 GB divided across 1000 partitions), yet not small enough for a broadcast join.</p></li><li><p><strong>High Sorting Overhead:</strong><br>When joining a massive fact table (e.g., 1 TB) with a dimension table that is too big to broadcast but small enough per partition, the cost of sorting the entire dataset (as in SMJ) may dominate and thus SHJ becomes more efficient.</p></li></ul><h3>Decision Factors</h3><ul><li><p><strong>Estimated Partition Size:</strong><br>Spark&#8217;s optimizer checks if the estimated per-partition size of the smaller table is below a threshold (set via <code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code>).</p></li><li><p><strong>Configuration and Hints:</strong><br>Users can guide Spark&#8217;s optimizer using hints like <code>/*+ SHUFFLE_HASH(tab) */</code> or disable sort-merge joins by toggling 
<code>spark.sql.join.preferSortMergeJoin</code>.</p><pre><code>spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")</code></pre></li></ul><div><hr></div><h2>4. How SHJ Works</h2><p>The execution of a Shuffle Hash Join can be understood through two primary phases, with some literature breaking it into a three-phase model for clarity.</p><h3>A. Shuffle Phase</h3><p><strong>Objective:</strong><br>Bring together all rows associated with a given join key within the same partition.</p><p><strong>Process:</strong></p><ul><li><p><strong>Repartitioning:</strong><br>Both datasets are re-distributed (shuffled) using the join key as the partitioning key. Note that <strong>both</strong> sides are shuffled &#8211; so network cost is still incurred for both datasets.</p></li><li><p><strong>Data Co-location:</strong><br>Post-shuffle, each partition will hold all the relevant rows for a specific set of join keys.</p></li><li><p><strong>Network I/O:</strong><br>While shuffling ensures correct join semantics, it incurs the cost of network communication for both datasets.</p></li></ul><p><strong>Example Scenario:</strong></p><p>Imagine two datasets, <code>Person</code> and <code>Address</code>, initially spread across different partitions. In the shuffle phase, rows with the same key (e.g., <code>A001</code>) are sent to the same partition. This guarantees that later join operations will have all matching keys available on the same executor.</p><h3>B. 
Hash Join Phase</h3><p>After the shuffle phase, the join is executed within each partition through these steps:</p><ol><li><p><strong>Hash Table Creation:</strong></p><ul><li><p><strong>Selection:</strong><br>Spark selects the smaller dataset based on statistics or join hints.</p></li><li><p><strong>Building the Hash Table:</strong><br>For every partition, Spark creates an in-memory hash table that maps join keys to the associated rows.</p></li></ul></li><li><p><strong>Probing the Hash Table:</strong></p><ul><li><p><strong>Streaming Data:</strong><br>The larger dataset&#8217;s rows are processed sequentially within the partition.</p></li><li><p><strong>Lookup and Join:</strong><br>For each row in the larger dataset, the hash table is queried using the join key. If a match exists, Spark produces the joined row as output.</p></li></ul></li></ol><blockquote><p><em>Because no sort is done, if the data per partition is large, the hash table may also be large. Spark assumes the build side will fit in memory. If it doesn&#8217;t, the task can spill partitions of the build side to disk (Spark has some support for spilling hash tables, but it is more complex than spilling a sort). In worst cases, an SHJ can run out of memory if the hash table grows too big, causing the executor to OOM. 
This is why Spark is conservative in using SHJ unless it&#8217;s confident the partitions are small enough&#8203;</em></p></blockquote><p><strong>Conceptual Diagram:</strong></p><p>Imagine a partition where:</p><ul><li><p>The smaller dataset&#8217;s partition (say, 5 MB worth of data) is fully loaded into a hash table.</p></li><li><p>The larger dataset streams through, and for each key, Spark quickly checks the in-memory hash table for corresponding rows.</p></li></ul><p>This operation is performed concurrently across all partitions on different worker nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4QvA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4QvA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 424w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 848w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1272w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png" width="744" height="918" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:744,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate" title="Phases of the Shuffle Hash join: Scan JSON read data, exchange, ShuffleHashJoin, and Hash Aggregate" srcset="https://substackcdn.com/image/fetch/$s_!4QvA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 424w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 848w, https://substackcdn.com/image/fetch/$s_!4QvA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4QvA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6546e1e4-d463-41d6-8aa1-6e9fbdc4f985_744x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Alternative Three-Phase View</h3><p>For some, a detailed three-phase breakdown clarifies the process:</p><ol><li><p><strong>Shuffle:</strong><br>Repartition both datasets so that all rows sharing the same join key are co-located.</p></li><li><p><strong>Hash Table Creation:</strong><br>For each partition, build the in-memory hash table using the smaller 
dataset.</p></li><li><p><strong>Hash Join:</strong><br>Join the larger dataset&#8217;s partition by probing the hash table.</p></li></ol><p>This view underlines the importance of parallel execution, where each worker node processes its partitions independently, which is key to Spark&#8217;s scalability.</p><div><hr></div><h2>5. Supported Join Types</h2><p>Shuffle Hash Join is designed to work primarily with <strong>equi-joins</strong>. In Apache Spark, it supports:</p><ul><li><p><strong>Inner Joins:</strong><br>Only matching rows are returned.</p></li><li><p><strong>Left, Right, Semi, and Anti Joins:</strong><br>These join types function well as long as the join condition is based on equality.</p></li></ul><p><strong>Additional Notes:</strong></p><ul><li><p><strong>Full Outer Join:</strong><br>SHJ did not support full outer joins in Spark 3.0; support was added in Spark 3.1+.</p></li><li><p><strong>Non-equi Joins and Cross Joins:</strong><br>SHJ does not naturally handle cross joins or non-equi conditions. In such cases, Spark falls back on other, more suitable join strategies.</p></li></ul><div><hr></div><h2>6. 
Performance Characteristics &amp; Trade-Offs</h2><p>Understanding the performance implications of SHJ is critical for designing robust, high-performance Spark jobs.</p><h3>Advantages</h3><ul><li><p><strong>No Sorting Required:</strong></p><ul><li><p>By eliminating the sort step used in SMJ, SHJ significantly reduces CPU overhead.</p></li></ul></li><li><p><strong>Efficient CPU Usage:</strong></p><ul><li><p>Hash functions and probing operations are generally less costly than sorting large datasets.</p></li></ul></li><li><p><strong>Parallel Execution:</strong></p><ul><li><p>The join is processed in parallel across partitions, making it scalable across large clusters.</p></li></ul></li></ul><h3>Considerations and Pitfalls</h3><ul><li><p><strong>Memory Sensitivity:</strong></p><ul><li><p><strong>Build Side Dependency:</strong><br>Every partition on the smaller side must fit in memory. If a partition exceeds available memory, it may cause disk spills or even OOM errors.</p></li><li><p><strong>Configuration Challenges:</strong><br>Incorrect estimations or misconfigured thresholds can lead to failures. Monitoring and adjusting Spark&#8217;s parameters is essential.</p></li></ul></li><li><p><strong>Data Skew:</strong></p><ul><li><p><strong>Uneven Distribution:</strong><br>A heavily skewed join key might result in one partition holding a disproportionate amount of data, dramatically increasing memory requirements for that partition.</p></li><li><p><strong>Mitigation Strategies:</strong><br>Use techniques like increasing the number of shuffle partitions (via <code>spark.sql.shuffle.partitions</code>) or applying custom salting techniques.</p></li></ul></li><li><p><strong>Network I/O:</strong></p><ul><li><p>While SHJ saves on CPU cycles, it does not reduce the network cost of shuffling. 
If your workload is network-bound, the benefits of SHJ may be limited.</p></li></ul></li><li><p><strong>Fallback and Spilling:</strong></p><ul><li><p>If the hash table grows too large, Spark may attempt to spill data to disk. However, disk spilling is less efficient and can severely impact performance.</p></li></ul></li></ul><div><hr></div><h2>7. SHJ Compared to Other Join Strategies</h2><p>A clear comparison can help decide when to use SHJ over other join methods:</p><table><thead><tr><th>Aspect</th><th>Broadcast Hash Join (BHJ)</th><th>Sort Merge Join (SMJ)</th><th>Shuffle Hash Join (SHJ)</th></tr></thead><tbody><tr><td><strong>When to Use</strong></td><td>Very small tables (typically &lt;10 MB by default)</td><td>Large tables where sorting is tolerable</td><td>Moderately small build side that cannot be broadcast; avoids sorting overhead</td></tr><tr><td><strong>Sorting Requirement</strong></td><td>No sorting; smaller dataset is broadcast</td><td>Sorting required across partitions</td><td>No sorting within partitions; uses in-memory hash table</td></tr><tr><td><strong>Memory Impact</strong></td><td>Minimal memory impact on executors</td><td>Uses more CPU for sorting</td><td>Requires sufficient memory per partition for hash tables</td></tr><tr><td><strong>Network Cost</strong></td><td>Minimal network I/O (broadcast eliminates shuffle)</td><td>High network I/O due to data shuffling</td><td>Same network cost as SMJ</td></tr></tbody></table><p><strong>Key Takeaways:</strong></p><ul><li><p><strong>BHJ</strong> is best when the smaller table is extremely small.</p></li><li><p><strong>SMJ</strong> is a general-purpose join that is robust for large datasets.</p></li><li><p><strong>SHJ</strong> strikes a balance by avoiding the heavy sorting cost when the per-partition memory size is manageable.</p></li></ul><h4><em>Shuffle hash join over sort-merge join</em></h4><p>In most cases, Spark chooses sort-merge join (SMJ) when it cannot broadcast a table. Sort-merge is typically the most expensive of these equi-join strategies because of its sort step. Shuffle hash join (SHJ) has been found to be faster in some circumstances (but not all) since it does not require that extra sorting step. 
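<p>To make the contrast concrete, here is a plain-Python sketch (illustrative only, not Spark internals) of the per-partition work after the shuffle: sort-merge must sort both sides before merging, while shuffle hash simply builds a dict from the build side and probes it, skipping the sort entirely.</p>

```python
# Toy partitions of (join_key, payload) pairs; keys are unique per side here.
left = [(3, "a"), (1, "b"), (2, "c")]
right = [(2, "x"), (3, "y"), (1, "z")]

# Sort-merge join: both sides must be sorted on the key before merging.
def sort_merge_join(l, r):
    l, r = sorted(l), sorted(r)          # the extra sort step that SHJ avoids
    out, i, j = [], 0, 0
    while i < len(l) and j < len(r):
        if l[i][0] == r[j][0]:
            out.append((l[i][0], l[i][1], r[j][1]))
            i, j = i + 1, j + 1
        elif l[i][0] < r[j][0]:
            i += 1
        else:
            j += 1
    return out

# Shuffle hash join: build a hash table on the build side, then probe. No sort.
def shuffle_hash_join(build, probe):
    table = {}
    for k, v in build:
        table.setdefault(k, []).append(v)
    return [(k, v, pv) for k, pv in probe for v in table.get(k, [])]

# Both strategies produce the same joined rows; only the work differs.
assert sorted(sort_merge_join(left, right)) == sorted(shuffle_hash_join(left, right))
```

<p>The dict built per partition is exactly what must fit in executor memory, which is why SHJ is sensitive to build-side partition size while SMJ is not.</p>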
There is a setting that allows you to advise Spark that you would prefer SHJ over SMJ, and with it Spark will try to use SHJ instead of SMJ wherever possible. Please note that this does not mean that Spark will always choose SHJ over SMJ; you are simply stating a preference.</p><pre><code>set spark.sql.join.preferSortMergeJoin = false</code></pre><p>The Databricks <a href="https://docs.databricks.com/runtime/photon.html">Photon</a> engine also replaces sort-merge join with shuffle hash join to boost query performance.</p><ul><li><p>Setting the <code>preferSortMergeJoin</code> config option to false for every job is not necessary. For the first execution of a given job, you can leave this value at its default (which is true).</p></li><li><p>If the job performs a lot of joins, involving a lot of data shuffling and making it difficult to meet the desired SLA, then you can use this option and change the <code>preferSortMergeJoin</code> value to false.</p></li></ul><div><hr></div><h2>8. Configuration and Tuning Best Practices</h2><p>Optimizing SHJ involves careful configuration and continuous monitoring. Below are some best practices.</p><h3>A. Adaptive Query Execution (AQE)</h3><p><strong>What is AQE?</strong><br>Adaptive Query Execution dynamically adapts the physical plan based on runtime statistics. With Spark 3.x, AQE can convert a sort-merge join to a shuffle hash join if it detects that partition sizes are favorable.</p><p><strong>Configuration Example:</strong></p><pre><code>// Use SHJ when every post-shuffle partition of the build side is below 64MB.
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")</code></pre><p>This dynamic adjustment helps balance CPU use and memory load without manual intervention.</p><h3>B. 
Join Hints and Configurations</h3><p><strong>Explicit Hints:</strong><br>When you know the data characteristics, you can direct Spark to use SHJ via hints:</p><pre><code>// Using a hint to explicitly request a Shuffle Hash Join
val dfJoined = factTable.join(dimensionTable.hint("SHUFFLE_HASH"), "joinKey")
dfJoined.explain() // The physical plan should show ShuffledHashJoin</code></pre><p><strong>Disabling SMJ Preference:</strong><br>For cases where SHJ is preferred over SMJ, you can adjust the setting as follows:</p><pre><code>// Tell Spark to favor hash-based join strategies over sort-merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")</code></pre><h3>C. Monitoring and Debugging</h3><p><strong>Using the Spark UI:</strong></p><ul><li><p><strong>Partition Metrics:</strong><br>Monitor the size and distribution of shuffle partitions to ensure they meet expected thresholds.</p></li><li><p><strong>Task Execution Details:</strong><br>Observe tasks&#8217; memory usage and CPU times. 
Unexpected OOM errors or high spill metrics may indicate misconfigured thresholds or skewed data.</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WDjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09f871dd-c444-4e2f-a392-5e0509d572ad_728x549.png" width="728" height="549" alt="Shuffle Hash Join Spark Stages" title="Shuffle Hash Join Spark Stages" loading="lazy"></figure></div><p><strong>Log Analysis:</strong></p><ul><li><p><strong>AQE Logs:</strong><br>When AQE is enabled, logs will show if the join strategy was dynamically switched.</p></li><li><p><strong>Executor Logs:</strong><br>Pay attention to memory allocation logs and warnings about data spills.</p></li></ul><div><hr></div><h2>9. Practical Example</h2><p>Let&#8217;s consider a real-world scenario to solidify our understanding. Suppose you are joining a large fact table with a moderately sized dimension table:</p><ul><li><p><strong>Fact Table:</strong> ~1 TB of transactional data.</p></li><li><p><strong>Dimension Table:</strong> ~5 GB of reference data.</p></li></ul><p><strong>Rationale:</strong><br>Broadcasting a 5 GB table is infeasible in this scenario, but if you partition the 5 GB table into 1000 slices, each partition is only about 5 MB. 
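</p><p>The partition-sizing arithmetic behind this rationale can be sketched in plain Scala (the figures simply mirror the illustrative scenario above):</p><pre><code>// Back-of-the-envelope sizing for the shuffle hash join candidate
val dimensionSizeBytes = 5L * 1024 * 1024 * 1024 // ~5 GB dimension table
val numPartitions = 1000                         // shuffle partition count

// Each join task only builds a hash table over its own slice
val mbPerPartition = dimensionSizeBytes / numPartitions / (1024 * 1024)
println(s"~$mbPerPartition MB per partition") // prints ~5 MB per partition</code></pre><p>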
This makes it an ideal candidate for a shuffle hash join.</p><p><strong>Implementation Example in Spark (Scala):</strong></p><pre><code>// Assuming factTable and dimensionTable are pre-defined DataFrames
val dfJoined = factTable.join(
  dimensionTable.hint("SHUFFLE_HASH"),
  Seq("joinKey") // Using column(s) that define the join condition
)

// Explain the plan to verify the join strategy
dfJoined.explain(true)

// Expected outcome:
// The physical plan should display an operator "ShuffledHashJoin"
// indicating that Spark is using SHJ for the join.</code></pre><p><strong>What to Look For:</strong></p><ul><li><p><strong>Physical Plan Inspection:</strong><br>Look for the <code>ShuffledHashJoin</code> operator in the explain plan output.</p></li><li><p><strong>Resource Usage:</strong><br>Monitor executor memory usage and check that each partition from the smaller dimension table fits within the allotted memory, avoiding spills or OOM errors.</p></li></ul><pre><code>ShuffledHashJoin [id1#3], [id2#8], Inner, BuildRight
:- Exchange hashpartitioning(id1#3, 200)
:  +- LocalTableScan [id1#3]
+- Exchange hashpartitioning(id2#8, 200)
   +- LocalTableScan [id2#8]</code></pre><h2><strong>10. Databricks platform specific insights</strong></h2><p>Databricks generally relies on BHJ and SMJ under the hood, and uses SHJ in a more limited, adaptive way. Under AQE, Databricks might start a join as a sort-merge join but then <em>convert it to a shuffled hash join</em> at runtime if it finds that each partition&#8217;s size is below a threshold (and thus can fit in memory)&#8203;.</p><p> This is an optimization: Spark saves the cost of sorting when it realizes it wasn&#8217;t needed. By default, this conversion is off (threshold = 0) on vanilla Spark 3.2, but Databricks may enable it or allow setting it. If using hints, you can explicitly ask for a SHJ: e.g., <code>.hint("SHUFFLE_HASH")</code> in DataFrame API or SQL hints. This can be useful if you <em>know</em> one side is moderately small but Spark&#8217;s stats are missing. Always ensure that the hint-targeted side will be small per partition; otherwise, you might get memory errors.</p><p>Databricks&#8217; strong skew mitigation helps SHJ as well &#8211; if one partition is skewed and would OOM an SHJ, AQE&#8217;s skew join handling could split that partition and even fall back to a sort-merge or a replicated join for that partition if necessary&#8203;. Also, note that <strong>Photon</strong> (Databricks&#8217; vectorized engine) has an improved hashed join implementation that can spill gracefully and use multiple threads per join, which makes SHJ more viable for large data in Photon. In standard Spark, SHJ is single-threaded per task for the join itself (just like SMJ merge is single-threaded per task).</p><h2>11. Conclusion</h2><p><strong>Shuffle Hash Join (SHJ)</strong> provides a balanced approach by eliminating the high cost of sorting that is present in Sort Merge Joins, while sidestepping the broadcast size limitations of Broadcast Hash Joins. 
By shuffling data to co-locate matching join keys and then using an in-memory hash table to perform the join, SHJ offers:</p><ul><li><p><strong>Improved CPU efficiency</strong> due to reduced sorting overhead.</p></li><li><p><strong>Scalability</strong> when the smaller dataset can be effectively partitioned.</p></li><li><p><strong>A flexible mechanism</strong> that can adapt to runtime data sizes through AQE.</p></li></ul><p>However, SHJ requires meticulous tuning and monitoring:</p><ul><li><p><strong>Memory Utilization:</strong><br>Ensure that each partition&#8217;s hash table fits in memory.</p></li><li><p><strong>Data Skew:</strong><br>Address uneven data distributions to prevent performance bottlenecks.</p></li><li><p><strong>Network Costs:</strong><br>Understand that while CPU usage may decrease, shuffling still incurs network overhead.</p></li></ul><p>By leveraging configuration settings, join hints, and adaptive query execution, data engineers can optimize their Spark workloads using SHJ. This detailed understanding equips you with the knowledge to carefully evaluate when SHJ is the right tool for your data joining needs, ensuring robust and efficient Spark application performance.</p><h2>Further Reading</h2><p>For more in-depth information and the latest updates on Spark join optimizations, the following resources are highly recommended:</p><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, etc. 
(See <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features like converting SMJ to BHJ/SHJ and skew join handling&#8203;</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks&#8203;</p><p><a href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations&#8203;</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution 
and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65">How Databricks Optimizes the Spark SQL Joins</a></p></li><li><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/">Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)</a></p><p></p><p>By consulting these materials, you can deepen your understanding of Spark join mechanisms and keep up to date with the evolving best practices on the Databricks platform.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Spark Join Strategies Explained: Sort Merge Join]]></title><description><![CDATA[Slow and Steady always wins the race]]></description><link>https://www.canadiandataguy.com/p/spark-join-strategies-explained-sort</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/spark-join-strategies-explained-sort</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 10 Apr 2025 05:22:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>What is  it:</strong> </h2><p>Sort-Merge Join is the <strong>default join strategy</strong> in Spark for large datasets that don&#8217;t qualify for a broadcast. 
It involves shuffling and sorting both sides of the join on the join key, then streaming through the sorted data to merge matching keys&#8203;. SMJ is robust and scalable: it can handle very large tables and all join types (inner, outer, etc.), at the cost of more network and CPU usage.</p><h2><strong>How it works</strong></h2><p>Spark will use a Sort-Merge Join when neither side is small enough to broadcast (or if the join type is not supported by BHJ). The execution has three main phases&#8203;</p><ol><li><p><strong>Shuffle Phase:</strong> In the shuffle phase, both input datasets are repartitioned (shuffled) across the cluster nodes based on the join keys. This operation ensures that matching keys from both datasets reside within the same partitions on executors. The shuffle is an expensive network operation involving data redistribution across nodes. Each executor receives and transmits data based on the key distribution. By default, Spark employs 200 partitions (<code>spark.sql.shuffle.partitions</code>). In the physical plan, this shows up as <code>Exchange hashpartitioning(...)</code> on each side of the join&#8203;</p></li><li><p><strong>Sort Phase:</strong> Within each partition, Spark sorts the records by the join key. Each side is sorted independently. The plan will have local <code>Sort</code> operators after the exchange on each side&#8203;. The output is that in partition <em>i</em>, both datasets are sorted by key. Sorting is an expensive step (<strong>O(n log n) per partition)</strong>. If the data is already partitioned and sorted (e.g. 
bucketing and sorting on the join key), Spark may skip the shuffle and/or sort &#8211; but this requires specific conditions (like both sides being bucketed by the join key with the same number of partitions).</p></li><li><p><strong>Merge Phase:</strong> Once each partition has sorted data from both sides, Spark performs a <strong>merge join</strong>: it iterates through the two sorted lists and finds matching keys, similar to how one would merge two sorted files. Because the data is sorted, Spark can do this efficiently by advancing a pointer through each list, giving linear time complexity per partition with no nested loops. The output of each task is the joined records for that partition&#8217;s key range.</p></li></ol><pre><code>== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#320L], [id#335L], Inner
   :- Sort [id#320L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#320L, 36), ENSURE_REQUIREMENTS, [id=#1018]
   :     +- Filter isnotnull(id#320L)
   :        +- Scan ExistingRDD[id#320L,col#321]
   +- Sort [id#335L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#335L, 36), ENSURE_REQUIREMENTS, [id=#1019]
         +- Filter isnotnull(id#335L)
            +- Scan ExistingRDD[id#335L,int_col#336L]</code></pre><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!XlqS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8c93bc-df28-4fc8-b1a2-2018a611bd59_464x490.png" width="464" height="490" alt=""></figure></div><h2><strong>Execution details</strong></h2><p>Sort-Merge join will span multiple stages in the Spark DAG. Typically, you&#8217;ll have one stage (or stages) to produce the shuffle partitions for side A, another for side B, and then a final stage where the actual join (merge) happens. 
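</p><p>A minimal way to reproduce such a plan for inspection is sketched below; this assumes a live <code>spark</code> session, and the synthetic DataFrames exist only for illustration:</p><pre><code>// Disable broadcast joins so Spark falls back to sort-merge join for the demo
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(0, 1000000).withColumnRenamed("id", "key")
val right = spark.range(0, 1000000).withColumnRenamed("id", "key")

// The physical plan should contain Exchange, Sort, and SortMergeJoin operators
left.join(right, Seq("key")).explain()</code></pre><p>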
In Spark UI&#8217;s DAG visualization, you might see something like: both tables read in earlier stages, then a stage where &#8220;Exchange -&gt; Sort -&gt; WholeStageCodegen -&gt; SortMergeJoin&#8221; occurs.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!viK8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aee5d69-a69c-4f0a-bfc3-2a0a375c5a0b_686x386.jpeg" width="686" height="386" alt="Spark SQL Bucketing at Facebook - Cheng Su (Facebook)" title="Spark SQL Bucketing at Facebook - Cheng Su (Facebook)" loading="lazy"></figure></div><p><em>A Spark DAG visualization of a Sort-Merge Join.</em> Both tables are read and then <strong>shuffled</strong> (Exchange) so that matching keys co-locate. Each partition then <strong>sorts</strong> its chunk of data on the join key and <strong>merges</strong> the two sorted streams to output joined rows. (Some upstream stages show as &#8220;skipped&#8221; because their output was cached for reuse in this example.)</p><p><strong>Supported join types:</strong> <strong>All join types</strong> are supported by SMJ for equality conditions &#8211; inner, left, right, full outer, semi, anti. It&#8217;s the fallback for any join that can&#8217;t use a more specialized strategy. 
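</p><p>For example, a full outer join between two large tables will typically land on SMJ; a sketch (the DataFrame names here are placeholders, not from a real pipeline):</p><pre><code>// Full outer joins on large inputs generally execute as sort-merge joins
val joined = ordersDf.join(customersDf, Seq("customerId"), "full_outer")
joined.explain() // expect SortMergeJoin ... FullOuter in the physical plan</code></pre><p>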
Even non-equi joins (like inequalities) can be executed with a sort-merge-like approach if one side is small (Spark might use a Broadcast NLJ for those), but typically equi-joins are where SMJ is used. If you have a full outer join or if both sides are huge, SMJ is usually the plan Spark will choose&#8203;. (Full outer join cannot be executed as a pure hash join in Spark 2.x, so SMJ was the only choice; Spark 3.1 introduced a shuffle hash algorithm for full outer, but SMJ is still often used.)</p><h2><em>Why is it the most stable join?</em></h2><p>Sort-Merge Join is <em>network and CPU intensive</em>. It performs a <strong>full shuffle of both datasets</strong> &#8211; which means network I/O proportional to the data size &#8211; and a sort of each partition. The memory usage during the sort phase can be high; Spark uses external sort which will spill to disk if a partition&#8217;s data doesn&#8217;t fit in memory. Unlike SHJ, SMJ is not all-or-nothing in memory: if a task has more data than RAM, it will write sorted runs to disk and merge them (graceful degradation)&#8203;.</p><p>This is why SMJ is considered <strong>stable for large data</strong> &#8211; it won&#8217;t crash for memory reasons, at worst it will spill and slow down. Still, you want to avoid excessive spilling by tuning partition sizes (Databricks often sets the default shuffle partitions to a high number or uses AQE to auto-tune partition counts).</p><p>Because both sides are shuffled, SMJ is symmetric &#8211; both large and small tables incur shuffle cost. The algorithm doesn&#8217;t build big hash tables, so it can handle very large inputs (even beyond memory) as long as you accept the sorting cost. 
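</p><p>One practical lever for keeping the sort phase within memory is raising the shuffle partition count so each sorted partition stays small; a sketch (the value shown is an assumption to adapt to your data volume, not a universal setting):</p><pre><code>// More shuffle partitions mean smaller per-task sorts and less spilling,
// at the cost of more task-scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", "1000")</code></pre><p>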
One positive aspect is that SMJ streaming merge has <strong>low overhead per record</strong> once sorted, and if data is somewhat presorted or partitioned, the cost might be less than worst-case.</p><h2><strong>Databricks-specific insights</strong></h2><p>Databricks Runtime by default enables <strong>Adaptive Query Execution (AQE)</strong>, which can optimize sort-merge joins in two major ways:</p><ol><li><p><strong>Dynamic partition coalescing</strong> &#8211; after shuffle, if many partitions are small, Databricks can coalesce them to reduce task overhead</p></li><li><p><strong>Skew handling</strong> &#8211; if some partitions are extremely large (skewed), Databricks can split those into multiple tasks to avoid stragglers&#8203;</p></li></ol><p>We will discuss skew handling separately, but it&#8217;s important that with AQE, SMJ is not as rigid as it once was. Databricks also collects detailed statistics to decide join strategies: if the optimizer has reliable size estimates (via cost-based optimization), it might avoid SMJ in favor of BHJ when appropriate&#8203;. However, when dealing with truly large tables where neither side is small, SMJ will be chosen because it&#8217;s the most general and robust approach.</p><h3>Advanced Performance Tuning Strategies</h3><p>While Spark handles the heavy lifting, you can tune SMJ performance by managing the shuffle and sort behavior:</p><ul><li><p><strong>Partition sizing:</strong> Adjust <code>spark.sql.shuffle.partitions</code> so that each partition after shuffle is a reasonable size (Databricks often aims for ~128 MB per partition as a balance between parallelism and overhead). Too few partitions (huge partitions) mean slow sorts and potential disk spills; too many (tiny partitions) mean excessive task scheduling overhead. 
AQE can auto-coalesce partitions that are smaller than <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code> (default 64MB)</p><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#:~:text=Converting%20sort,hash%20join">spark.apache.org</a></p></li><li><p><strong>Take advantage of sorting where possible:</strong> If your data is <strong>bucketed</strong> and sorted on the join keys (and both sides have the same number of buckets and join key bucketing), Spark can use a <strong>join without shuffle</strong> (it still sorts each bucket if not sorted, but avoids data movement). On Databricks, Delta Lake can maintain clustering (Z-order or sorting) on keys; while Spark does not automatically detect sort order for skipping the sort stage, having data clustered can improve CPU cache efficiency during the merge.</p></li><li><p><strong>Push down filters and projections:</strong> Reduce data size <em>before</em> the join. SMJ&#8217;s cost is superlinear in data volume (due to sorting). If you can filter out unnecessary rows or columns (thus less data to shuffle), do it first. The Catalyst optimizer should push filters, but be mindful when writing queries (e.g., filter as early as possible in the query plan). Also, dropping unused columns means less data is carried through the shuffle.</p></li><li><p><strong>Monitor for skew:</strong> SMJ is particularly vulnerable to skewed keys: if one key accounts for a huge fraction of data, one shuffle partition will be enormous and the merge task for that partition will be a straggler. We&#8217;ll discuss skew mitigation soon (Databricks can automatically split skewed partitions). If you suspect skew, the Spark UI&#8217;s stage detail can show if one task processed far more data than others.</p></li></ul><h2><strong>When to use SMJ</strong> </h2><p>Typically, you don&#8217;t <em>force</em> a sort-merge join; Spark will use it by default for large data. 
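</p><p>If you do want to pin the strategy explicitly, Spark's join hints support this (a sketch; the DataFrame names are placeholders):</p><pre><code>// Explicitly request a sort-merge join via the MERGE hint
val dfJoined = largeDfA.join(largeDfB.hint("MERGE"), Seq("joinKey"))
dfJoined.explain() // the physical plan should show SortMergeJoin</code></pre><p>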
But you might choose to use an SMJ (or let Spark use it) in cases where both datasets are large and similar in size, or when you&#8217;re doing a full outer join (which BHJ can&#8217;t handle). If one side can be broadcast but you choose not to (perhaps due to risk of OOM or because it&#8217;s just borderline size), SMJ will handle it gracefully. SMJ is also the strategy that can cope with <em>lack of statistics</em>: if Spark isn&#8217;t sure of sizes, it errs on the side of SMJ because it won&#8217;t blow up memory. On Databricks, if you disable adaptive execution or broadcasting, you are essentially forcing SMJ for all joins.</p><h2>Common Pitfalls</h2><ul><li><p>Inadequate shuffle partition tuning, leading to excessive disk spills or overhead from numerous tiny partitions.</p></li><li><p>Failure to minimize shuffle volume by removing unnecessary columns.</p></li><li><p>Ignoring or inadequately handling data skew.</p></li><li><p>Misjudging broadcast opportunities by incorrectly assessing dataset size (rely on in-memory exchange size, not disk size).</p></li></ul><blockquote><p><strong>Pitfalls:</strong> The major downside of SMJ is performance degradation if not tuned. Mistakes include not accounting for data skew (leading to very slow tasks) and leaving the default shuffle partitions at 200 regardless of data scale. For instance, joining two 1 TB tables with 200 partitions would create ~5 GB partitions, likely causing massive spills; increasing partitions (or using AQE) would be necessary. Another common pitfall is forgetting that <em>all columns</em> of both sides are shuffled by default. Projecting out unneeded columns can make a huge difference in shuffle volume. 
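</p><p>For example (column names illustrative):</p><pre><code>// Project only the columns the join and downstream logic need,
// so less data is shuffled and sorted.
val slimOrders = ordersDf.select("orderId", "customerId", "amount")
val joined = slimOrders.join(
  customersDf.select("customerId", "region"),
  Seq("customerId")
)</code></pre><p>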
Also, if you have multiple joins in a single query (like joining 3-4 tables), Spark might form a multi-way join plan &#8211; consider breaking a very large join into steps or using broadcasts for some legs to avoid an overly expensive single SMJ of many inputs.</p></blockquote><h3>Conclusion</h3><p>Sort-Merge Join remains a foundational element in Spark's join strategies. Understanding its detailed mechanics (the shuffle, sort, and merge phases) is essential for diagnosing and tuning join performance. With careful tuning and vigilant analysis, SMJ can transform demanding Spark workloads into highly optimized, reliable operations. On Databricks, always keep AQE enabled for SMJ &#8211; it will automatically <strong>optimize partition counts and handle skew</strong>, making SMJ perform much better in practice than the static execution plans of the past.</p><h4>Further Resources</h4><ul><li><p><strong>Apache Spark Official Documentation &#8211; SQL Performance Tuning:</strong> Covers join strategy hints, adaptive execution, etc. (See <em>&#8220;Join Strategy Hints&#8221;</em> and <em>&#8220;Adaptive Query Execution&#8221;</em> in the Spark docs)</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Join%20Strategy%20Hints%20for%20SQL,Queries">downloads.apache.org</a></p></li><li><p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/spark-tuning-glue-emr/using-join-hints-in-spark-sql.html#:~:text=,all%20executors%20within%20the%20cluster">Tuning Spark SQL queries for AWS Glue and Amazon EMR Spark jobs</a></p></li><li><p><strong>Apache Spark Official Documentation &#8211; Adaptive Query Execution (AQE):</strong> Detailed explanation of AQE features like converting SMJ to BHJ/SHJ and skew join handling</p><p><a href="https://downloads.apache.org/spark/docs/3.1.2/sql-performance-tuning.html#:~:text=Converting%20sort,join">downloads.apache.org</a></p></li><li><p><strong>Databricks Documentation &#8211; Join Hints &amp; Optimizations:</strong> 
Databricks-specific docs on join strategies, including the <code>SKEW</code> and <code>RANGE</code> hints, and how AQE is used on Databricks&#8203;</p><p><a href="https://docs.databricks.com/gcp/en/transform/join#:~:text=Join%20hints%20on%20Databricks">docs.databricks.com</a></p></li><li><p><strong>&#8220;How Databricks Optimizes Spark SQL Joins&#8221; &#8211; Medium (dezimaldata):</strong> A blog post (Aug 2023) summarizing Databricks&#8217; techniques like CBO, AQE, range join and skew join optimizations&#8203;</p><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65#:~:text=In%20this%20blog%20post%2C%20we,joins%20and%20subqueries%2C%20such%20as">dezimaldata.medium.com</a></p></li><li><p><strong>Spark Summit Talks on Joins and AQE:</strong> Videos like <em>&#8220;Optimizing Shuffle Heavy Workloads&#8221;</em> or <em>&#8220;AQE in Spark 3.0&#8221;</em> (by Databricks engineers) for a deeper understanding of the internals of join execution and tuning.</p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints">https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints</a></p></li><li><p><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution">https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution</a></p></li><li><p><a href="https://dezimaldata.medium.com/how-databricks-optimizes-the-spark-sql-joins-d01b95336d65">How Databricks Optimizes the Spark SQL Joins</a></p></li><li><p><a href="https://blogs.perficient.com/2025/03/28/top-5-mistakes-that-make-your-databricks-queries-slow-and-how-to-fix-them/">Top 5 Mistakes That Make Your Databricks Queries Slow (and How to Fix Them)</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Your Degree Isn't Enough: How to Actually Break Into Data]]></title><description><![CDATA[Practical Tips for Building Real 
Experience, Networking Authentically, and Winning Interviews]]></description><link>https://www.canadiandataguy.com/p/your-degree-isnt-enough-how-to-actually</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/your-degree-isnt-enough-how-to-actually</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 20 Mar 2025 14:03:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yl5x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Landing your first job in data can seem overwhelming, especially when everyone around you knows Python, SQL, and visualization tools like Power BI. Technical skills alone are no longer enough to differentiate yourself in today's competitive job market. Here's a strategic roadmap to landing your first role in data, going beyond the basics and showcasing how you can truly stand out.</p><h3>Accept the Market Reality</h3><p>In any room filled with aspiring data professionals, you'll find that nearly everyone knows Python, SQL, Power BI, or Tableau. These skills, while essential, are now commodities&#8212;they&#8217;re expected rather than exceptional. 
Even personal or college projects have become commonplace.</p><p>To land your first job, you need to set yourself apart.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yl5x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yl5x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 424w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 848w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 1272w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yl5x!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/215518ca-959c-49d6-bf96-dbd889fb1051_3840x2553.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:968,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:747306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/159459480?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215518ca-959c-49d6-bf96-dbd889fb1051_3840x2553.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yl5x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 424w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 848w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 1272w, https://substackcdn.com/image/fetch/$s_!yl5x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f05f084-c439-4478-b568-4193f59fe9c0_3840x2553.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Differentiate Yourself: Gain Real-world Experience</h3><p>How can you differentiate yourself, especially without prior work experience? One highly effective strategy is engaging directly with startup founders.</p><p><strong>Step-by-Step Approach:</strong></p><ol><li><p><strong>Identify Founders and Startups:</strong> Visit platforms like <a href="https://wellfound.com">Wellfound.com</a> to connect with early-stage startups&#8212;seed-funded or even pre-funded ideas.</p></li><li><p><strong>Offer Value First:</strong> Proactively offer your data skills for free initially. 
While the idea of unpaid work may seem counterintuitive, the true compensation comes in the form of valuable experience, mentorship, and exposure to product-driven thinking.</p></li><li><p><strong>Build Real Projects:</strong> Working with startups provides real-world projects, solving genuine problems. When interviewing, you&#8217;ll have concrete examples of your work to share, far superior to hypothetical or generic projects.</p></li><li><p><strong>Gain Product Thinking:</strong> Startups will help you learn product development, user needs, and business strategy&#8212;skills rarely developed through traditional academic projects.</p></li></ol><h3>Networking with the Right Mindset</h3><p>A common mistake people make at networking events is immediately asking for jobs or favors. Trust that the person you're speaking with already knows you're looking for opportunities. Instead, approach networking events with genuine curiosity. Seek to understand what products and challenges people are working on. If the conversation naturally leads to them asking about your interests, then introduce yourself, clearly express your intent, and genuinely offer to help on a project for free. Your goal is simply to create opportunities. Most good people won't let your contributions go unpaid indefinitely, but let them initiate compensation conversations.</p><h3>Get Certified: Beyond Degrees</h3><p>Degrees are common; certifications stand out. 
Specializing in high-demand platforms makes you more attractive to employers.</p><p><strong>Recommended Certifications:</strong></p><ul><li><p><strong>Databricks Certifications</strong></p></li><li><p><strong>Snowflake Certifications</strong></p></li><li><p><strong>dbt (Data Build Tool) Certification</strong></p></li></ul><p>Highlight certifications prominently on your resume (top-left corner) to immediately capture recruiters' attention.</p><h3>Target Your Applications Strategically</h3><p>Certifications aren't just decorations; they open doors to companies actively seeking specialized talent.</p><p><strong>Leveraging Certifications:</strong></p><ul><li><p>Consulting firms (Deloitte, Accenture, Slalom) partner with platforms like Databricks and Snowflake. Companies needing certified specialists approach these consultancies.</p></li><li><p>Research prominent Databricks or Snowflake consulting partners, identify open roles, and directly apply.</p></li><li><p>Go beyond applying&#8212;reach out personally to recruiters via LinkedIn, clearly showcasing your certifications and enthusiasm.</p></li></ul><h3>Prepare Effectively for Interviews: STAR Methodology and Hooks</h3><p>Due to the short attention span of interviewers, begin your answers by clearly stating the <strong>impact or result</strong> upfront within the first 30 seconds. 
Follow this hook with the Situation, Task, and Action details.</p><p><strong>Example Hook:</strong></p><p>"I saved a million dollars in licensing costs and significantly reduced CO2 emissions by building a real-time data system using Databricks that optimized drill bit movements during oil exploration."</p><p>Notice how this statement immediately highlights both financial and environmental impacts, making it compelling and memorable.</p><p><strong>STAR Methodology Recap:</strong></p><ul><li><p><strong>Result First (Hook):</strong> State your biggest impact immediately.</p></li><li><p><strong>Situation:</strong> Describe the scenario briefly.</p></li><li><p><strong>Task:</strong> Clarify your specific responsibility.</p></li><li><p><strong>Action:</strong> Detail the steps you took.</p></li></ul><p>Practicing this method ensures your answers are concise, impactful, and easy for interviewers to remember.</p><h3>Leverage Consulting Roles as a Stepping Stone</h3><p>Your initial goal is breaking into the industry. Consulting roles offer valuable exposure and practical experience. 
After building experience over 2-3 years, aim higher:</p><ul><li><p>Target product companies like Databricks or Snowflake directly.</p></li><li><p>These roles typically offer significantly better compensation, career growth, and specialized experience.</p></li></ul><h3>Key Takeaways</h3><ul><li><p><strong>Real-world experience</strong>: Prioritize working on actual startup projects over hypothetical college projects.</p></li><li><p><strong>Networking</strong>: Approach networking events genuinely, seeking understanding and creating opportunities rather than directly asking for jobs.</p></li><li><p><strong>Certifications</strong>: Invest in high-impact certifications that directly align with industry demands.</p></li><li><p><strong>Strategic networking</strong>: Proactively connect with recruiters, leveraging your specialized skills.</p></li><li><p><strong>Interview Preparation</strong>: Master the STAR methodology with impactful hooks to clearly demonstrate your value.</p></li><li><p><strong>Career trajectory</strong>: View consulting as an entry point, not the end goal.</p></li></ul><p>Landing your first data job requires strategic moves beyond just technical skills. By adopting this proactive, differentiated approach, you'll set yourself apart and significantly increase your chances of success.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.canadiandataguy.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading CanadianDataGuy&#8217;s No Fluff Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How to Generate 1TB of Synthetic Data Faster Than a Coffee Break]]></title><description><![CDATA[And Cheaper Than Your Starbucks Coffee]]></description><link>https://www.canadiandataguy.com/p/how-to-generate-1tb-of-synthetic</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-to-generate-1tb-of-synthetic</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Wed, 01 Jan 2025 15:01:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/UsOqtW3Nebw" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine creating a massive 1 Terabyte dataset of IoT data in less time than it takes to enjoy your coffee break. With synthetic data generation techniques and a modest amount of computing power, this becomes a reality. By leveraging a 4-core machine, we can process an astounding 1 million rows per second, with each row containing 1 KB of data. 
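These headline figures can be sanity-checked with a few lines of plain arithmetic:

```python
# Verify the throughput claims: 1 KB rows generated at 1M rows/sec
ROW_BYTES = 1_000                  # ~1 KB per row (decimal units, matching the article)
ROWS_PER_SECOND = 1_000_000        # measured rate on the 4-core machine
TARGET_BYTES = 1_000_000_000_000   # 1 TB

total_rows = TARGET_BYTES // ROW_BYTES    # 1 billion rows
seconds = total_rows / ROWS_PER_SECOND    # 1000 seconds
minutes = seconds / 60                    # ~16.7 minutes on 4 cores
# Assuming roughly linear scaling, 8 cores would halve this to ~8.3 minutes.
```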
Let's break down what this means:</p><ul><li><p>1 billion rows of 1 KB each equates to 1000 GB or 1 TB of data.</p></li><li><p>At a rate of 1 million rows per second, it takes approximately 1000 seconds (about 16.66 minutes) to generate 1 billion rows.</p><div id="youtube2-UsOqtW3Nebw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;UsOqtW3Nebw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/UsOqtW3Nebw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li></ul><p>This means you can create a full terabyte of synthetic IoT data in under 10 minutes on an 8-core machine; I used 4 cores in my example. Such rapid data generation opens up exciting possibilities for developers, data scientists, and researchers working on big data projects, IoT applications, or machine learning models that require extensive datasets for training and testing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n0Dp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n0Dp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 424w, 
https://substackcdn.com/image/fetch/$s_!n0Dp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 848w, https://substackcdn.com/image/fetch/$s_!n0Dp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 1272w, https://substackcdn.com/image/fetch/$s_!n0Dp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n0Dp!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png" width="1200" height="324.72527472527474" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:202158,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n0Dp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 424w, 
https://substackcdn.com/image/fetch/$s_!n0Dp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 848w, https://substackcdn.com/image/fetch/$s_!n0Dp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 1272w, https://substackcdn.com/image/fetch/$s_!n0Dp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710dbeb7-ac9c-4415-a82f-c6af8434d6ba_2784x754.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ScreenShot Of Spark Streaming UI</figcaption></figure></div><h2><br>Why create Synthetic datasets?</h2><ul><li><p><strong>Privacy and Compliance</strong>: Synthetic data allows developers to work with realistic data without risking exposure of sensitive information, helping to meet data protection regulations.</p></li><li><p><strong>Scalability and Control</strong>: You can generate virtually unlimited amounts of data with precise control over its characteristics, enabling thorough testing of systems at scale and creation of edge cases that might be rare or impossible to capture in real-world data.</p></li><li><p><strong>Development Acceleration:</strong> By removing dependency on upstream teams for data, developers can build end-to-end pipelines, set up DevOps processes, and address architectural concerns before actual data becomes available, significantly speeding up the development process.</p></li><li><p><strong>Cost-Effectiveness and Efficiency</strong>: Generating synthetic data is often faster and more economical than collecting and processing real-world data, especially for large-scale testing and development.</p></li></ul><h2>Hardware</h2><p>We used a single machine with 4 cores and 32 GB of memory</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dHo1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!dHo1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 424w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 848w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 1272w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dHo1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png" width="806" height="336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:806,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!dHo1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 424w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 848w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 1272w, https://substackcdn.com/image/fetch/$s_!dHo1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d73af22-0776-4cc0-ad01-d36e778ef405_806x336.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Let&#8217;s get into the code</h2><h4><strong>Install <a href="https://github.com/databrickslabs/dbldatagen">Databricks Data Generator</a></strong></h4><p>The <a href="https://github.com/databrickslabs/dbldatagen">dbldatagen</a> is a Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses.</p><p>It operates by defining a data generation specification in code that controls how the synthetic data is generated. The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion. You can use it from Scala, R or other languages by defining a view over the generated data.</p><pre><code><code>%pip install dbldatagen</code></code></pre><h4> Setup and Imports </h4><pre><code><code>import dbldatagen as dg
import uuid

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType, IntegerType
from pyspark.sql.functions import expr</code></code></pre><h4> Parameters </h4><pre><code><code># Parameterize partitions and rows per second
PARTITIONS = 4  # Match the number of cores on your cluster
ROWS_PER_SECOND = 1 * 1000 * 1000 # 1 Million rows per second</code></code></pre><h4><strong>Schema Definition</strong></h4><pre><code>
iot_data_schema = StructType([
    StructField("device_id", StringType(), False),
    StructField("event_timestamp", TimestampType(), False),
    StructField("temperature", DoubleType(), False),
    StructField("humidity", DoubleType(), False),
    StructField("pressure", DoubleType(), False),
    StructField("battery_level", IntegerType(), False),
    StructField("device_type", StringType(), False),
    StructField("error_code", IntegerType(), True),
    StructField("signal_strength", IntegerType(), False)
])</code></pre><p>Here, we define the schema for our IoT data. Each <code>StructField</code> represents a column in our dataset, specifying the name, data type, and whether it can contain null values. This schema mimics real-world IoT device data, including device identifiers, sensor readings, and status information.</p><h3><strong>Why use Databricks Data Generator- </strong><code>dbldatagen</code><strong>?</strong></h3><p>Using <code>dbldatagen</code> for synthetic data generation offers several significant benefits that align closely with the characteristics of your actual data. <strong>The ability to specify parameters like </strong><code>minValue</code><strong>, </strong><code>maxValue</code><strong>, </strong><code>random</code><strong>, and </strong><code>percentNulls</code><strong> allows you to create datasets that closely mimic real-world scenarios.</strong> This means you can generate realistic variations in your data, such as different temperature ranges or device IDs, while also controlling for missing values. By tailoring these specifications, you ensure that the synthetic data is not only large in volume but also rich in diversity, making it a valuable resource for testing and training machine learning models effectively.</p><pre><code>dataspec = (
    dg.DataGenerator(spark, name="iot_data", partitions=PARTITIONS)
    .withSchema(iot_data_schema)
    .withColumnSpec("device_id", percentNulls=0.1, minValue=1000, maxValue=9999, prefix="DEV_", random=True)
    .withColumnSpec("event_timestamp", begin="2023-01-01 00:00:00", end="2023-12-31 23:59:59", random=True)
    .withColumnSpec("temperature", minValue=-10.0, maxValue=40.0, random=True)
    .withColumnSpec("humidity", minValue=0.0, maxValue=100.0, random=True)
    .withColumnSpec("pressure", minValue=900.0, maxValue=1100.0, random=True)
    .withColumnSpec("battery_level", minValue=0, maxValue=100, random=True)
    .withColumnSpec("device_type", values=["Sensor", "Actuator", "Gateway", "Controller"], random=True)
    .withColumnSpec("error_code", minValue=0, maxValue=999, random=True, percentNulls=0.2)
    .withColumnSpec("signal_strength", minValue=-100, maxValue=0, random=True)
)
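# Sanity check (an illustrative addition, not part of the original post):

```python
# The percentNulls settings above translate into expected null counts at the
# 1-billion-row scale this article targets.
TOTAL_ROWS = 1_000_000_000

expected_null_device_ids = TOTAL_ROWS * 10 // 100   # percentNulls=0.1 on device_id
expected_null_error_codes = TOTAL_ROWS * 20 // 100  # percentNulls=0.2 on error_code
```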
</code></pre><p>This section creates a data generator specification using <code>dbldatagen</code>. For each column, we define the data generation rules, including value ranges, randomness, and special formatting (like the "DEV_" prefix for device IDs). This ensures our synthetic data closely resembles real IoT data patterns.</p><h2><strong>Streaming DataFrame Creation</strong></h2><pre><code>streaming_df = (
    dataspec.build(
        withStreaming=True,
        options={
            'rowsPerSecond': ROWS_PER_SECOND,
        }
    )
    .withColumn(
        "firmware_version",
        expr(
            "concat('v', cast(floor(rand() * 10) as string), '.', "
            "cast(floor(rand() * 10) as string), '.', "
            "cast(floor(rand() * 10) as string))"
        )
    )
    .withColumn(
        "location",
        expr(
            "concat(cast(rand() * 180 - 90 as decimal(8,6)), ',', "
            "cast(rand() * 360 - 180 as decimal(9,6)))"
        )
    )
    .withColumn(
        "data_payload",
        expr("repeat(uuid(), 22)")  # Add approx. 800 Bytes to construct 1 KB row
    )
)
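# Back-of-the-envelope check (an illustrative addition, not part of the original post):

```python
# A uuid4 string is 36 characters (32 hex digits plus 4 hyphens), so
# repeat(uuid(), 22) contributes 36 * 22 = 792 bytes per row, which together
# with the other columns lands each row near the 1 KB target.
import uuid

payload_bytes = len(str(uuid.uuid4())) * 22
print(payload_bytes)  # 792
```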
</code></pre><p>Here, we build the streaming DataFrame using our data specification. We enable streaming with <code>withStreaming=True</code> and set the rows per second. We also add three derived columns:</p><ul><li><p><code>firmware_version</code>: A randomly generated version number.</p></li><li><p><code>location</code>: Random latitude and longitude coordinates.</p></li><li><p><code>data_payload</code>: <strong>A large string that pads each row to roughly 1 KB.</strong></p></li></ul><h4><strong>Data Writing</strong></h4><pre><code>import uuid

(
    streaming_df.writeStream
    .queryName("iot_data_stream")
    .outputMode("append")
    .option("checkpointLocation", f"/tmp/dbldatagen/streamingDemo/checkpoint-{uuid.uuid4()}")
    .toTable("soni.default.iot_data_1kb_rows")
)</code></pre><p>Finally, we initiate the streaming process. The data is written to a Delta table named "iot_data_1kb_rows" in append mode. A checkpoint location is specified so the query can execute fault-tolerantly; note that because this example appends a fresh UUID to the checkpoint path, each run starts from scratch rather than resuming, so use a stable path when you need restart recovery.<br></p><h2>Is it really cheaper than Starbucks coffee?</h2><p>On a 4-core setup, generating a billion rows of synthetic IoT data takes approximately 17 minutes. Adding an extra 5 minutes as a buffer for instance setup brings the total time to 22 minutes.</p><p><strong>Cost Breakdown for Generating 1 Billion Rows of Synthetic IoT Data:</strong></p><ul><li><p><strong>Total time</strong>: 17 minutes (data generation) + 5 minutes (instance setup) = <strong>22 minutes</strong></p></li><li><p><strong>Time in hours</strong>: 22 &#247; 60 = 0.3667 hours</p></li><li><p><strong>Cost (EC2 + Databricks)</strong>: $0.228</p></li></ul><p>This means generating a terabyte of synthetic IoT data costs just $0.228&#8212;less than a cup of Starbucks coffee. 
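<p>The cost arithmetic above can be reproduced in a few lines of Python (an illustrative sketch; the implied hourly rate is back-calculated from the quoted $0.228 total, not a published price):</p>

```python
# Reproduce the cost breakdown: 1 billion rows at 1 million rows/second,
# plus a 5-minute buffer for instance startup.
ROWS = 1_000_000_000
ROWS_PER_SECOND = 1_000_000

generation_minutes = ROWS / ROWS_PER_SECOND / 60  # ~16.7 min, quoted as 17
total_hours = (17 + 5) / 60                       # ~0.3667 hours end to end

# Implied combined EC2 + Databricks rate (an inference from the quoted total)
implied_hourly_rate = 0.228 / total_hours         # ~$0.62/hour
```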
Such efficiency showcases the cost-effectiveness of synthetic data generation, enabling developers and data scientists to create large-scale datasets for testing and development at a fraction of the cost of traditional methods.</p><p>Furthermore, as illustrated in the graph below, the CPU utilization consistently exceeds 80%, highlighting the system's optimized performance and contributing to the remarkably low cost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jL9L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jL9L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 424w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 848w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 1272w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jL9L!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png" width="1200" height="669.2307692307693" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jL9L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 424w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 848w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 1272w, https://substackcdn.com/image/fetch/$s_!jL9L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83659c76-f9a4-4bb4-b452-ae8841274798_3710x2068.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Stay Connected &amp; Keep Learning: Join Our Community</h2><p>If you found this post helpful, please drop a like to keep me motivated! And feel free to leave a comment below if you have any questions or thoughts&#8212;I'd love to hear from you!</p><p>If this is your first time reading my content, <strong>Welcome</strong>! I write in-depth technical blogs on <strong>Spark</strong>, <strong>Databricks</strong>, and <strong>Spark Streaming</strong>. 
Beyond writing, I specialize in helping data professionals unlock their full potential and ace their next data interviews.</p><p>Here are a few ways you can continue to learn, connect, and grow:</p><ul><li><p><strong>Join Us on WhatsApp:</strong> Stay updated and engage with the community through our WhatsApp group. <a href="https://chat.whatsapp.com/HyAuSAQsMj776rbbvC3sxH">Join here</a>.</p></li><li><p><strong>Join Our Discord Community:</strong> Connect with past clients and other data enthusiasts on our Discord server. It&#8217;s a great place to network, pair up with peers, and accelerate each other&#8217;s journeys. <a href="https://discord.gg/fqVA4wm98s">Join here</a>.</p></li><li><p><strong>Visit My Website:</strong> My website is your go-to resource for content, including blogs, tutorials, and more. <a href="https://canadiandataguy.com/">Check it out here</a>.</p></li></ul>]]></content:encoded></item></channel></rss>