Incremental ingestion: stop reloading everything every night
Watermarking, change data capture and the patterns that cut cost and processing windows in ETL pipelines.
Reloading the whole table every night works until the table has 500 million rows and the nightly window is no longer long enough. Incremental ingestion is what keeps pipelines fast and cheap as volume grows.
Watermarking
The most common technique is the watermark: you store the highest value of a monotonic column (usually updated_at or a sequential ID) processed in the last run, and on the next run you only fetch what came after.
-- Fetch only the rows changed since the last run.
SELECT * FROM source
WHERE updated_at > '${last_watermark}';
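As a minimal sketch of the full cycle (read the stored watermark, fetch newer rows, advance the watermark), here is a version using Python's built-in sqlite3 module. The table layout, the `fetch_incremental` helper and the sample rows are assumptions made for illustration, and timestamps are stored as UTC ISO-8601 strings so that string comparison matches chronological order.

```python
import sqlite3

def fetch_incremental(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),  # bound parameter instead of string interpolation
    ).fetchall()
    # Advance only to the highest value actually seen;
    # if nothing changed, keep the previous watermark.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Tiny in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01T00:00:00Z"),
        (2, "b", "2024-01-02T00:00:00Z"),
        (3, "c", "2024-01-03T00:00:00Z"),
    ],
)

rows, watermark = fetch_incremental(conn, "2024-01-01T00:00:00Z")
```

In a real pipeline the new watermark would be persisted only after the load has committed, so a failed run is retried from the old value.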
Simple, but there are traps:
- Time zones and clocks. Use UTC end to end.
- Late-arriving records. If the source can insert data with updated_at in the past, consider an overlap window (reprocess the last X hours).
- Deletes. A watermark on updated_at won't capture physical deletes. For that, you need CDC or soft deletes.
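The first two traps can be sketched together: keep the watermark in UTC, subtract an overlap from it before querying, and write with an upsert so that rows read twice do not duplicate. This is an illustration under assumptions, not a prescribed implementation: the two-hour `OVERLAP`, the `load_with_overlap` helper and the table layout are hypothetical, and SQLite's `INSERT OR REPLACE` stands in for whatever upsert the target supports. It does nothing for physical deletes.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(hours=2)  # assumed value; size it to how late records can arrive

def to_utc_text(dt):
    """Format an aware datetime as a UTC ISO-8601 string that sorts chronologically."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def load_with_overlap(conn, last_watermark):
    """Re-read everything after (last_watermark - OVERLAP) and upsert it by key."""
    window_start = to_utc_text(last_watermark - OVERLAP)
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source WHERE updated_at > ?",
        (window_start,),
    ).fetchall()
    # Keyed on the primary key, so re-read rows overwrite instead of duplicating.
    conn.executemany("INSERT OR REPLACE INTO target VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", [
    (1, "a", "2024-01-01T09:00:00Z"),
    (2, "b", "2024-01-01T11:30:00Z"),  # arrived late, stamped before the watermark
    (3, "c", "2024-01-01T12:30:00Z"),
])
conn.execute("INSERT INTO target VALUES (1, 'a', '2024-01-01T09:00:00Z')")

last_watermark = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
reread = load_with_overlap(conn, last_watermark)
reread_again = load_with_overlap(conn, last_watermark)  # same rows, no duplicates
target_ids = [r[0] for r in conn.execute("SELECT id FROM target ORDER BY id")]
```

A plain `updated_at > last_watermark` filter would have skipped row 2; the overlap picks it up, at the cost of re-reading a few hours of data on every run.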
Change Data Capture (CDC)
When the source supports it, CDC reads the database transaction log and delivers inserts, updates and deletes in order. It's more robust than watermarking, but requires source support and more infrastructure.
On Azure, you can combine SQL Server CDC with Azure Data Factory; in the Lakehouse, Delta's MERGE applies those changes idempotently.
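To make "applies those changes idempotently" concrete, here is a plain-Python sketch of the idea. It is not the Delta or SQL Server API: `ChangeEvent`, its `seq` field (a stand-in for a log sequence number) and `apply_changes` are hypothetical names, and a dict plays the role of the target table. Events are applied in log order, and any event at or below the last sequence already applied for its key is skipped, so replaying a batch leaves the target unchanged.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChangeEvent:
    seq: int                       # position in the log (hypothetical LSN stand-in)
    op: str                        # "insert", "update" or "delete"
    key: int
    payload: Optional[str] = None  # None for deletes

def apply_changes(target, applied_seq, events):
    """Apply events in log order, skipping any already applied for that key."""
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.seq <= applied_seq.get(ev.key, -1):
            continue  # replayed event: already reflected in the target
        if ev.op == "delete":
            target.pop(ev.key, None)
        else:
            # insert and update are both handled as an upsert
            target[ev.key] = ev.payload
        applied_seq[ev.key] = ev.seq

events = [
    ChangeEvent(1, "insert", 10, "a"),
    ChangeEvent(2, "insert", 20, "b"),
    ChangeEvent(3, "update", 10, "a2"),
    ChangeEvent(4, "delete", 20),
]
target, applied_seq = {}, {}
apply_changes(target, applied_seq, events)
apply_changes(target, applied_seq, events)  # replaying the same batch changes nothing
```

Unlike the watermark approach, the delete of key 20 reaches the target here because it arrives as an explicit event.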
The practical rule
Start simple: a watermark with an overlap window solves most cases. Move to CDC when you need to capture deletes or when latency demands near-real-time. Resisting the temptation to "just reload everything, it's easier" is what separates a pipeline that scales from one that will wake you up at night a year from now.