Delta Live Tables Advanced Q & A
This is primarily written for those trying to handle edge cases.

Q1.) How can a single/unified table be built with historical backfill and ongoing streaming Kafka data?

A streaming table built using DLT allows writes to the table from outside of DLT. Thus, you can build and run your DLT pipeline with Kafka as a source, generating the physical table with a name, and then do a streaming write to this table outside DLT.

What is the gotcha here? The data has lost its natural ordering, which is fine in most cases; it simply did not go into the Delta table in the same order it was generated. This is in contrast to an ideal world in which Kafka had infinite retention and a single DLT pipeline had ingested all the data. If, and only if, you are using the table as a streaming source with watermarking downstream, then while reading this data you have to instruct Spark Structured Streaming to sort the data while reading it. You can do this with the withEventTimeOrder option:

```python
spark.readStream.format("delta") \
    .option("maxFilesPerTrigger", f"{maxFilesPerTrigger}") \
    .option("withEventTimeOrder", "true") \
    .table(f"{schema_name}.{table_name}")
```

You can read more about this solution here: https://canadiandataguy.medium.com/how-to-write-your-first-spark-application-with-stream-stream-joins-with-working-code-dd9b0b39f814#d828

To reiterate, the gotcha only applies if you use this table as a streaming source along with watermarking.

Q2.) How do I handle deletes in a streaming table?

Let's take GDPR as an example of where we need to enforce retention on the Delta table. One can run a regular DELETE command on the table and then, in the DLT pipeline, make changes to the downstream consumers.

"By default, streaming tables require append-only sources. When a streaming table uses another streaming table as a source, and the source streaming table requires updates or deletes, for example, GDPR 'right to be forgotten' processing, the skipChangeCommits flag can be set on the target streaming table to ignore those changes. For more information about this flag, see Ignore updates and deletes."

```python
@dlt.table
def b():
    return spark.readStream.option("skipChangeCommits", "true").table("LIVE.A")
```

Q3.) How do I enable mergeSchema on a DLT table?

This is already handled in DLT. If you want to control it explicitly, you can pass the following Spark conf property at the DLT pipeline or table level:

```
spark.databricks.delta.schema.autoMerge.enabled true
```

If you are using Auto Loader, consider playing with the different schema evolution modes while reading data:

```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```

Q4.) How do I change the location where the table is stored?

Specify the new storage location through the path parameter in the table definition:

```python
@dlt.table(
    name="<name>",
    comment="<comment>",
    spark_conf={"<key>": "<value>", "<key>": "<value>"},
    table_properties={"<key>": "<value>", "<key>": "<value>"},
    path="<storage-location-path>",
    partition_cols=["<partition-column>", "<partition-column>"],
    schema="schema-definition",
    temporary=False)
```

3. In your DLT pipeline configuration, set the property pipelines.tableManagedByMultiplePipelinesCheck.enabled to false.

4. Now, we need to make sure that we do not read any duplicate data, because we cannot reuse our old checkpoint. We will solve this by using filters or by providing a starting configuration for the streaming source. For example, if your streaming source is:

4. a) Kafka: provide the starting offset information (a minimal sketch follows below). More information can be found here.
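For illustration, here is a minimal, hedged sketch of such a starting configuration for a Kafka source. The topic name, bootstrap servers, and offset values are placeholders, not values from the original pipeline.

```python
# Hedged sketch: topic, servers, and offsets are placeholders.
# startingOffsets pins where the Kafka source begins reading, so records that
# were already covered by the historical backfill are not ingested twice.
kafka_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
    .option("subscribe", "user_events_topic")
    # Per-partition starting offsets; -2 means earliest, -1 means latest
    .option("startingOffsets", """{"user_events_topic": {"0": 1000, "1": 1000}}""")
    .load()
)
```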
4. b) Delta: For example, suppose you have a table user_events. If you want to read changes since version 5, use:

```python
spark.readStream.format("delta") \
    .option("startingVersion", "5") \
    .load("/tmp/delta/user_events")
```

If you want to read changes since 2023-03-03, use:

```python
spark.readStream.format("delta") \
    .option("startingTimestamp", "2023-03-03") \
    .load("/tmp/delta/user_events")
```

More details can be found here.

5. To do step 4, you should parameterize your DLT pipeline, which can be done by following these instructions.

Q5.) Does DLT support identity columns?

Yes, more details here. However, identity columns are not supported with APPLY CHANGES tables.

Q6.) How to stream out of a table which was loaded using apply_changes?

This is generally not recommended. The target of the APPLY CHANGES INTO query or the apply_changes() function cannot be used as a source for a streaming live table. A table that reads from the target of an APPLY CHANGES INTO query or apply_changes() function must be a live table.

You can rely on enabling SCD and then use the audit columns (__START_AT and __END_AT) to identify the changes. However, the downstream would still have to do a batch read and filter on these audit columns to limit the information being read.

If you are adventurous and still want to do a streaming read of this source, you need to enable the Change Data Feed on the Delta table fact_sales:

```python
@dlt.table(
    name="fact_sales",
    comment="This is a fact table for sales",
    partition_cols=["order_date"],
    table_properties={
        "pipelines.autoOptimize.zOrderCols": "StoreId,ItemId",
        "delta.enableChangeDataFeed": "true",
    })
def fact_sales():
    # table definition goes here
    ...
```

Then you can decide to stream changes out of the __apply_changes_{table_name} table. Make sure to handle tombstones/deletes as part of your downstream pipeline.

Q7.) How to delete data using DLT?

Use the Change Data Capture functionality of DLT. The particular expression that will help you achieve this is apply_as_deletes. You can change the parameter to match your custom criteria. For example, if you had bad records originating in a specific time interval or file name, you can change the expression to meet your custom criteria.

```python
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def users():
    return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
    target = "target",
    source = "users",
    keys = ["userId"],
    sequence_by = col("sequenceNum"),
    apply_as_deletes = expr("operation = 'DELETE' or {any other custom logic}"),
    except_column_list = ["operation", "sequenceNum"],
    stored_as_scd_type = "2")
```

Q8.) How to avoid accidental overwrites in DLT?

Set this property so that tables cannot be overwritten:

```
pipelines.reset.allowed false
```

Q9.) The DLT pipeline was deleted, but the Delta table exists. What to do now? What if the owner has left the org and I need a new DLT pipeline to take care of the table?

Step 1.) Verify via the CLI that the pipeline has been deleted:

```
databricks --profile <your_env> pipelines list
databricks --profile <your_env> pipelines get --pipeline-id <deleted_pipeline_id>
```

Step 2.) Change the owner of the table:

```sql
ALTER TABLE <db>.<table> SET TBLPROPERTIES (pipelines.pipelineId = '<NEW_PIPELINE_ID>');
```

Note: In case you do not have a pipeline ID yet, you can use the parameter below once; run your pipeline to get the pipeline ID, and then remove the parameter.

```
pipelines.tableManagedByMultiplePipelinesCheck.enabled false
```
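For convenience, here is a minimal sketch of running Step 2 from a notebook. The database, table, and pipeline ID below are placeholders, and this is an illustration rather than part of the original answer.

```python
# Hedged sketch: db, table, and new_pipeline_id are placeholders.
# Re-points an existing Delta table at a new DLT pipeline by updating the
# pipelines.pipelineId table property.
db, table = "my_db", "my_table"
new_pipeline_id = "<NEW_PIPELINE_ID>"  # taken from the new pipeline's settings

# Inspect the current table properties, including any old pipeline id
spark.sql(f"SHOW TBLPROPERTIES {db}.{table}").show(truncate=False)

# Hand the table over to the new pipeline
spark.sql(
    f"ALTER TABLE {db}.{table} "
    f"SET TBLPROPERTIES ('pipelines.pipelineId' = '{new_pipeline_id}')"
)
```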
Q10.) How does sequence_by work in apply_changes()?

There are two types of data management strategies with apply_changes:

Type 1 involves keeping only the latest state of a record. This means that if an older record arrives out of order and we already have a newer record in the target, the older record will not update the target, because it is not the latest state.

Type 2 involves keeping a history of all records. This means that every change is kept as its own row, with the audit columns (__START_AT and __END_AT) derived from the sequence_by column to indicate when each version of the record was current.
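To make this concrete, here is a minimal sketch of apply_changes with sequence_by. The source, table, and column names are illustrative, not from the original article.

```python
# Hedged sketch: names are placeholders. sequence_by tells apply_changes how to
# order change events per key, so a late-arriving row with a lower sequenceNum
# never overwrites newer state.
import dlt
from pyspark.sql.functions import col

@dlt.view
def customer_updates():
    return spark.readStream.format("delta").table("cdc_data.customers")

dlt.create_streaming_live_table("customers_latest")

dlt.apply_changes(
    target = "customers_latest",
    source = "customer_updates",
    keys = ["customerId"],
    sequence_by = col("sequenceNum"),  # ordering column used to resolve out-of-order data
    stored_as_scd_type = "1")          # Type 1: keep only the latest state per key

# With stored_as_scd_type = "2", history is kept instead, and
# __START_AT / __END_AT are derived from sequenceNum.
```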