Tuesday, May 9, 2023

Databricks - DLT: some notes



When to use views, materialized views, and streaming tables

To ensure your pipelines are efficient and maintainable, choose the best dataset type when you implement your pipeline queries.

Consider using a view when:

  • You have a large or complex query that you want to break into easier-to-manage queries.

  • You want to validate intermediate results using expectations.

  • You want to reduce storage and compute costs and do not require the materialization of query results. Because tables are materialized, they require additional computation and storage resources.

Consider using a materialized view when:

  • Multiple downstream queries consume the table. Because views are computed on demand, a view would be re-computed every time it is queried.

  • The table is consumed by other pipelines, jobs, or queries. Because views are not materialized, you can only use them in the same pipeline.

  • You want to view the results of a query during development. Because tables are materialized and can be viewed and queried outside of the pipeline, using tables during development can help validate the correctness of computations. After validating, convert queries that do not require materialization into views.

Consider using a streaming table when:

  • A query is defined against a data source that is continuously or incrementally growing (see the sketch after this list).

  • Query results should be computed incrementally.

  • High throughput and low latency are desired for the pipeline.
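
As a rough sketch in DLT Python (the landing path and table name are hypothetical, and this assumes an Auto Loader source), a streaming table over a continuously growing source might look like:

import dlt

# Hedged sketch: a streaming table ingesting a continuously growing
# directory of JSON files via Auto Loader. The landing path is hypothetical.
@dlt.table
def raw_events():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")
    )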


By default, Delta Live Tables recomputes table results based on input data each time a pipeline is updated, so you must make sure deleted records aren't reloaded from the source data. Setting the pipelines.reset.allowed table property to false prevents refreshes to a table, but does not prevent incremental writes to the table or prevent new data from flowing into it.
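
A minimal sketch of setting this property through the @dlt.table decorator's table_properties argument (table and source names are hypothetical):

import dlt

# Hedged sketch: block full refreshes so deleted records are not
# recomputed back in from the source. Names are hypothetical.
@dlt.table(
    table_properties={"pipelines.reset.allowed": "false"}
)
def customers_silver():
    return dlt.read_stream("customers_bronze")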


By contrast, the final tables in a pipeline, commonly referred to as gold tables, often require complicated aggregations or read from sources that are the targets of an APPLY CHANGES INTO operation. Because these operations inherently create updates rather than appends, they are not supported as inputs to streaming tables; these transformations are better suited for materialized views. By mixing streaming tables and materialized views in a single pipeline, you can simplify the pipeline, avoid costly re-ingestion or re-processing of raw data, and retain the full power of SQL to compute complex aggregations over an efficiently encoded and filtered dataset.
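
As an illustration (all names are hypothetical), a gold materialized view can read the target of an APPLY CHANGES INTO operation as a batch source and aggregate it:

import dlt
from pyspark.sql.functions import count

# "customers" is assumed to be the target of an apply_changes /
# APPLY CHANGES INTO operation elsewhere in the pipeline. Because that
# target receives updates, it is read with dlt.read (a batch read),
# which makes this dataset a materialized view, not a streaming table.
@dlt.table
def customers_by_region():
    return (
        dlt.read("customers")
        .groupBy("region")
        .agg(count("*").alias("num_customers"))
    )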




Limitations

  • Metrics for the target table, such as number of output rows, are not available.

  • SCD type 2 updates will add a history row for every input row, even if no columns have changed.

  • The target of the APPLY CHANGES INTO query or apply_changes function cannot be used as a source for a streaming table. A table that reads from the target of an APPLY CHANGES INTO query or apply_changes function must be a live table.

  • Expectations are not supported in an APPLY CHANGES INTO query or apply_changes() function. To use expectations for the source or target dataset (see the sketch after this list):

    • Add expectations on source data by defining an intermediate table with the required expectations and use this dataset as the source for the target table.

    • Add expectations on target data with a downstream table that reads input data from the target table.
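
A hedged sketch of both workarounds in Python (all dataset, key, and column names are hypothetical, and this assumes the dlt.create_streaming_table / dlt.apply_changes API):

import dlt
from pyspark.sql.functions import col

# Expectations on the source: an intermediate dataset validates rows
# before they feed apply_changes.
@dlt.view
@dlt.expect_or_drop("valid_key", "user_id IS NOT NULL")
def users_cdc_clean():
    return dlt.read_stream("users_cdc_raw")

dlt.create_streaming_table("users")

dlt.apply_changes(
    target="users",
    source="users_cdc_clean",
    keys=["user_id"],
    sequence_by=col("sequence_num"),
    stored_as_scd_type=1,
)

# Expectations on the target: a downstream table reads from the
# APPLY CHANGES target (as a batch read, per the limitation above).
@dlt.table
@dlt.expect("valid_email", "email IS NOT NULL")
def users_validated():
    return dlt.read("users")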



In Python, Delta Live Tables determines whether to update a dataset as a materialized view or streaming table based on the defining query. The @table decorator is used to define both materialized views and streaming tables.

To define a materialized view in Python, apply @table to a query that performs a static read against a data source. To define a streaming table, apply @table to a query that performs a streaming read against a data source. Both dataset types have the same syntax.
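
For example (the source table name is hypothetical), the same decorator produces either dataset type depending on the read:

import dlt

# Materialized view: @dlt.table over a static (batch) read.
@dlt.table
def orders_summary():
    return spark.read.table("sales.orders")

# Streaming table: @dlt.table over a streaming read.
@dlt.table
def orders_stream():
    return spark.readStream.table("sales.orders")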


For tables less than 1 TB in size, Databricks recommends letting Delta Live Tables control data organization. Unless you expect your table to grow beyond a terabyte, you should generally not specify partition columns.
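
If a table really is expected to exceed that threshold, partition columns can be supplied via the partition_cols argument; a hedged sketch (table, source, and column names are hypothetical):

import dlt

# Only worth doing for tables expected to grow beyond ~1 TB; otherwise
# omit partition_cols and let Delta Live Tables manage organization.
@dlt.table(partition_cols=["event_date"])
def events_large():
    return dlt.read_stream("events_bronze")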



Configure a streaming table to ignore changes in a source streaming table

Note

  • To use the skipChangeCommits flag, you must select the Preview channel in your pipeline settings.

  • The skipChangeCommits flag works only with spark.readStream using the option() function. You cannot use this flag in a dlt.read_stream() function.


By default, streaming tables require append-only sources. When a streaming table uses another streaming table as a source, and the source streaming table requires updates or deletes, for example, GDPR “right to be forgotten” processing, the skipChangeCommits flag can be set on the target streaming table to ignore those changes.
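
A sketch of that setup (table names are hypothetical); note the flag goes through spark.readStream's option(), not dlt.read_stream():

import dlt

# Source streaming table; occasionally receives GDPR deletes.
@dlt.table
def user_events_raw():
    return spark.readStream.table("landing.user_events")

# Target streaming table ignores the source's update/delete commits.
@dlt.table
def user_events_clean():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("LIVE.user_events_raw")
    )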





Note

In Databricks Runtime 12.1 and above, skipChangeCommits deprecates the previous setting ignoreChanges. In Databricks Runtime 12.0 and lower, ignoreChanges is the only supported option.

The semantics for ignoreChanges differ greatly from skipChangeCommits. With ignoreChanges enabled, rewritten data files in the source table are re-emitted after a data-changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows are often emitted alongside new rows, so downstream consumers must be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes.

skipChangeCommits disregards file-changing operations entirely. Data files that are rewritten in the source table due to data-changing operations such as UPDATE, MERGE INTO, DELETE, and OVERWRITE are ignored entirely. To reflect changes in upstream source tables, you must implement separate logic to propagate those changes.
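
For comparison, a sketch of the older option on Databricks Runtime 12.0 and below (the table name is hypothetical); downstream logic has to tolerate the duplicate rows it re-emits:

# ignoreChanges: rewritten files are re-emitted, so unchanged rows may
# appear again downstream, and deletes are not propagated.
df = (
    spark.readStream
    .option("ignoreChanges", "true")
    .table("silver.user_events")
)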


