ETL · Azure · Performance

Incremental ingestion: stop reloading everything every night

Watermarking, change data capture and the patterns that cut cost and processing windows in ETL pipelines.

Reloading the whole table every night works until the table has 500 million rows and the nightly window is no longer long enough. Incremental ingestion is what keeps pipelines fast and cheap as volume grows.

Watermarking

The most common technique is the watermark: you store the highest value of a monotonically increasing column (usually updated_at or a sequential ID) that the last run processed, and on the next run you fetch only the rows that come after it.

SELECT * FROM source
WHERE updated_at > '${last_watermark}';
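
As a rough illustration of the full loop (read the stored watermark, fetch, advance), here is a minimal Python sketch using an in-memory SQLite database. The `source` table and `updated_at` column come from the query above; the `id` and `payload` columns, the `fetch_incremental` function and the sample rows are assumptions made for the example. It passes the watermark as a bound parameter rather than interpolating it into the SQL string.

```python
import sqlite3

def fetch_incremental(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance only to the highest value actually read; keep the old
    # watermark when nothing new arrived.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo data. Timestamps are ISO-8601 UTC strings in one fixed format,
# so comparing them as text matches comparing them as times.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01T00:00:00Z"),
        (2, "b", "2024-01-02T00:00:00Z"),
        (3, "c", "2024-01-03T00:00:00Z"),
    ],
)

first_batch, wm = fetch_incremental(conn, "2024-01-01T00:00:00Z")
second_batch, wm2 = fetch_incremental(conn, wm)  # nothing new on the next run
```

In a real pipeline the watermark would be persisted (for example in a control table) only after the batch has been loaded successfully, so a failed run simply retries from the previous value.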

Simple, but there are traps:

  • Time zones and clocks. Use UTC end to end.
  • Late-arriving records. If the source can write rows whose updated_at is already in the past, add an overlap window: re-read the last X hours on every run, and load with an upsert so the rows you read twice don't turn into duplicates.
  • Deletes. A watermark on updated_at won't capture physical deletes. For that, you need CDC or soft deletes.
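
The overlap window can be sketched as follows, again with SQLite standing in for both source and destination. Everything here is illustrative: the two-hour `OVERLAP`, the `target` table, and the `ts` and `load_with_overlap` helpers are assumptions, and `INSERT OR REPLACE` is used as a stand-in for whatever upsert or MERGE the real destination offers.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(hours=2)  # assumed value; size it to how late rows can arrive

def ts(hour):
    """Demo helper: a UTC timestamp on a fixed day, as ISO-8601 text."""
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc).isoformat()

def load_with_overlap(conn, last_watermark):
    """Re-read from (watermark - OVERLAP) and upsert, so replays are harmless."""
    since = (datetime.fromisoformat(last_watermark) - OVERLAP).isoformat()
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source WHERE updated_at > ?",
        (since,),
    ).fetchall()
    # Keyed on the primary key: a row seen twice overwrites itself.
    conn.executemany("INSERT OR REPLACE INTO target VALUES (?, ?, ?)", rows)
    conn.commit()
    # The watermark never moves backwards.
    return max([last_watermark] + [r[2] for r in rows])

conn = sqlite3.connect(":memory:")
for table in ("source", "target"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, "a", ts(10)), (2, "b", ts(12))],
)

wm = load_with_overlap(conn, ts(9))  # first run loads rows 1 and 2

# A late row lands with a timestamp older than the current watermark.
conn.execute("INSERT INTO source VALUES (?, ?, ?)", (3, "late", ts(11)))
wm = load_with_overlap(conn, wm)     # the overlap still picks it up

target_ids = [r[0] for r in conn.execute("SELECT id FROM target ORDER BY id")]
```

A strict `updated_at > watermark` filter would have skipped row 3. The price of the overlap is re-reading row 2 on the second run, which the keyed upsert absorbs.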

Change Data Capture (CDC)

When the source supports it, CDC reads the database transaction log and delivers inserts, updates and deletes in order. It's more robust than watermarking, but requires source support and more infrastructure.

On Azure, you can combine SQL Server CDC with Azure Data Factory; in the Lakehouse, Delta Lake's MERGE can apply those changes idempotently when they are matched on the table's key.
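
To make the idea of applying changes idempotently concrete, here is a small pure-Python sketch. It is not the Delta Lake or SQL Server API: the `Change` record, its field names and the `apply_changes` function are hypothetical, and a plain dictionary stands in for the target table. It mirrors the general shape of a merge: keep the latest event per key, then upsert or delete.

```python
from typing import Dict, Iterable, NamedTuple, Optional

class Change(NamedTuple):
    seq: int              # position in the change log (hypothetical field)
    op: str               # "insert", "update" or "delete"
    key: int
    value: Optional[str]  # None for deletes

def apply_changes(target: Dict[int, str], changes: Iterable[Change]) -> Dict[int, str]:
    """Apply a batch of change events, keeping only the latest event per key."""
    latest: Dict[int, Change] = {}
    for change in sorted(changes, key=lambda c: c.seq):
        latest[change.key] = change  # later events overwrite earlier ones
    for change in latest.values():
        if change.op == "delete":
            target.pop(change.key, None)      # deleting a missing key is a no-op
        else:
            target[change.key] = change.value  # insert and update are both upserts
    return target

target = {1: "old"}
batch = [
    Change(1, "update", 1, "new"),
    Change(2, "insert", 2, "x"),
    Change(3, "delete", 2, None),
    Change(4, "insert", 3, "y"),
]

apply_changes(target, batch)
replayed = apply_changes(dict(target), batch)  # same batch applied a second time
```

Because inserts and updates both become upserts and deletes tolerate missing keys, replaying the same batch after a retry leaves the target unchanged. This is also where physical deletes, invisible to a watermark, finally show up.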

The practical rule

Start simple: a watermark with an overlap window solves most cases. Move to CDC when you need to capture deletes or when latency demands near-real-time. Resisting the temptation to "just reload everything, it's easier" is what separates a pipeline that scales from one that will wake you up at night a year from now.
