CDC Isn’t Streaming. It’s Correctness
I stopped asking “streaming or batch?” and started asking “can we prove correctness?”
A few years ago, extracting data with JDBC/ODBC directly from OLTP databases was the default move. It worked, until it didn’t. Every new table, every new consumer, every larger dataset meant heavier queries hitting the source system. As data grew, the database became the bottleneck, and “just run another extract” started to feel like a risky habit.
Then I learned about Change Data Capture (CDC), and it genuinely felt like a silver bullet. Instead of re-reading full tables, you could capture only what changed from the database logs, with lower impact on the source, and with a path to scale ingestion without constantly upgrading infrastructure.
But CDC also changed the question I was asking.
I stopped obsessing over data freshness in the lakehouse (“is this streaming or batch?”) and started thinking in terms of replayability: if the raw change events are safely stored in bronze, I can rebuild state whenever I need to. The real challenge isn’t how fast changes arrive. It’s whether I can apply them deterministically, handle edge cases, and trust the final table.
That’s where CDC stops being a tool choice and becomes a design choice, with consequences.
What is CDC?
Change Data Capture (CDC) is a way to replicate only what changed in a database, by reading its transaction log (redo/WAL/binlog) instead of re-querying full tables.
Instead of “a table snapshot”, CDC produces a sequence of change events: inserts, updates, and deletes, usually with metadata like the primary key, operation type, and an ordering marker (LSN/SCN/timestamp).
The magic isn’t the speed. It’s the position: CDC systems track where they are in the log, so they can resume after failures and, in theory, avoid re-reading everything. In practice, this is where correctness starts: you still need a checkpoint strategy, idempotent writes, and rules for duplicates and out-of-order events.
CDC doesn’t eliminate scaling. It just moves the bottleneck away from the source database. And it doesn’t guarantee correctness in your lakehouse. The merge/upsert logic does.
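Before looking at how those events land, here is roughly what one change event looks like on the wire, loosely modeled on a Debezium-style envelope. Field names and nesting vary by tool and connector, so treat this as an illustrative sketch (it uses the same customer row as the tables below):

# One change event describing a single row update.
event = {
    "op": "u",  # c = insert, u = update, d = delete
    "before": {"id": 1, "name": "João", "age": 25},
    "after": {"id": 1, "name": "João", "age": 26},
    "source": {"lsn": 1523789},  # ordering marker taken from the database log
    "ts_ms": 1770884394000,  # when the change was captured (epoch millis)
}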
What CDC looks like
CDC usually arrives as a stream of events (Kafka/Kinesis) or as micro-batch files in object storage (AWS S3, GCP Cloud Storage). Either way, it’s the same idea: a chronological list of changes.
Once you see CDC as events, you stop thinking in terms of “table refresh” and start asking “how do I apply these events deterministically?”
Insert Data
| ID | Name | Age | Op | Seq | EventTime |
|---|---|---|---|---|---|
| 1 | João | 25 | I | 0158975 | 2026-02-11 10:15:30 |
| 2 | Maria | 34 | I | 0589624 | 2026-02-11 10:18:00 |
| 3 | Carlos | 29 | I | 0635894 | 2026-02-11 10:19:10 |
Update Data
| ID | Name | Age | Op | Seq | EventTime |
|---|---|---|---|---|---|
| 1 | João | 26 | U | 1523789 | 2026-02-12 08:19:54 |
| 2 | Maria | 35 | U | 1658942 | 2026-02-12 10:48:10 |
Delete Data
| ID | Name | Age | Op | Seq | EventTime |
|---|---|---|---|---|---|
| 3 | Carlos | 29 | D | 2895635 | 2026-02-12 15:00:10 |
- Some CDC systems emit only the primary key on deletes; others include a ‘before image’ (like the example above).
Apply CDC events (what your Silver table becomes)
After ingesting CDC, the bronze layer keeps the events as-is (append-only). The silver layer is where you apply those events to build the latest state.
Given the events above:
- ID 1 was inserted (age 25) and later updated (age 26) → keep the latest version
- ID 2 was inserted (age 34) and later updated (age 35) → keep the latest version
- ID 3 was inserted and later deleted → remove it from the current state
| ID | Name | Age | UpdatedAt |
|---|---|---|---|
| 1 | João | 26 | 2026-02-12 08:19:54 |
| 2 | Maria | 35 | 2026-02-12 10:48:10 |
This is the part where “prove correctness” lives:
- bronze: immutable truth (events)
- silver: deterministic reconstruction (state).
Basically CDC gives you history. Your merge logic decides truth.
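Before the lakehouse version, a minimal pure-Python sketch of that reconstruction helps show what “deterministic” means here: sort by Seq, apply each event in order, and the final dict is the Silver table above.

# The example events from the tables above, replayed in Seq order.
events = [
    {"id": 1, "name": "João", "age": 25, "op": "I", "seq": 158975},
    {"id": 2, "name": "Maria", "age": 34, "op": "I", "seq": 589624},
    {"id": 3, "name": "Carlos", "age": 29, "op": "I", "seq": 635894},
    {"id": 1, "name": "João", "age": 26, "op": "U", "seq": 1523789},
    {"id": 2, "name": "Maria", "age": 35, "op": "U", "seq": 1658942},
    {"id": 3, "name": "Carlos", "age": 29, "op": "D", "seq": 2895635},
]

state = {}
for e in sorted(events, key=lambda e: e["seq"]):  # deterministic ordering by Seq
    if e["op"] == "D":
        state.pop(e["id"], None)  # a delete removes the key from current state
    else:
        state[e["id"]] = {"name": e["name"], "age": e["age"]}  # insert/update keeps the latest version

# state == {1: {"name": "João", "age": 26}, 2: {"name": "Maria", "age": 35}}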
How the merge/upsert works in a Lakehouse
In a lakehouse, CDC is just events. The hard part is turning those events into a stable table.
Generally, my framework is:
- Keep the raw CDC event files in bronze
- Build the current state in the silver layer by applying changes with three rules:
  - merge by primary keys
  - apply only the latest event per key (timestamp/LSN)
  - idempotent apply
In summary, a batch looks like:
- read new CDC events since the last checkpoint
  - if you’re on Databricks, I recommend Auto Loader (see the sketch after this list)
- deduplicate them per key
- upsert inserts/updates
- delete when the operation is D
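For the “since the last checkpoint” part, here is a minimal Auto Loader sketch (Databricks-specific; paths and table names are placeholders). It incrementally picks up only new CDC files and appends them to bronze, with the checkpoint tracking which files were already ingested:

# Incrementally land new CDC files into Bronze with Auto Loader.
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")  # or json/csv, matching the landed files
    .load("s3://bucket/landing/customer_cdc/")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "s3://bucket/checkpoints/customer_cdc_bronze/")
    .trigger(availableNow=True)  # process everything new, then stop (batch-style)
    .toTable("bronze.customer_cdc")  # append-only table of raw change events
)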
This is how CDC becomes reliable: not because it’s “streaming”, but because the state can always be rebuilt from bronze in a deterministic way.
Code Example
from pyspark.sql import functions as F, Window
from delta.tables import DeltaTable

# 1) Read raw CDC events already landed as files in Bronze
#    (shown as a one-off read; in production this is only the new files since the last checkpoint)
cdc = spark.read.format("parquet").load("s3://bucket/bronze/customer_cdc/")  # json/csv/parquet...

PK = "id"
ORD = "lsn"  # or "event_time" if you don't have lsn/scn
OP = "op"    # 'I', 'U', 'D'

# 2) Keep only the latest event per key (dedup + ordering)
w = Window.partitionBy(PK).orderBy(F.col(ORD).desc())
latest_cdc = (
    cdc.withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# 3) Apply into Silver with Delta MERGE
silver_table = DeltaTable.forName(spark, "silver.customer")
(
    silver_table.alias("t")
    .merge(latest_cdc.alias("s"), f"t.{PK} = s.{PK}")
    .whenMatchedDelete(
        condition=f"s.{OP} = 'D' AND s.{ORD} >= t.{ORD}"  # ignore stale deletes
    )
    .whenMatchedUpdate(
        condition=f"s.{ORD} >= t.{ORD}",  # never let an older event overwrite newer state
        set={c: f"s.{c}" for c in latest_cdc.columns if c != OP},
    )
    .whenNotMatchedInsert(
        condition=f"s.{OP} != 'D'",  # a delete for a key we never saw should not create a row
        values={c: f"s.{c}" for c in latest_cdc.columns if c != OP},
    )
    .execute()
)
The real-world traps of CDC
CDC looks clean on paper: changes go out, you apply them, life is good.
In production, CDC is messy for one simple reason: it’s not a “latest snapshot.” It’s a sequence of events. And events come with quirks (retries, ordering issues, deletes, and backfills) that can silently corrupt your “current state” if you don’t design for them.
Here are the traps I learned to treat as first-class requirements.
1. Duplicates (CDC is often at-least-once)
Many CDC tools prioritize reliability over elegance. If there’s a reconnect, a retry, or a consumer restart, you might see the same change more than once.
If your silver layer is not idempotent, duplicates become:
- inflated counts
- incorrect latest values
- “ghost updates” that are hard to debug later
My rule: assume duplicates will happen.
If reprocessing the same events changes the result, the pipeline is fragile.
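In code, that rule is two cheap guards layered on the example above (names reused from it): collapse exact redeliveries before picking the latest event per key, and let the >= condition in the MERGE make re-application a no-op.

# Exact redeliveries (same key, same ordering marker, same operation) collapse to one row
# before the "latest event per key" window from the code example.
deduped = cdc.dropDuplicates([PK, ORD, OP])
latest_cdc = (
    deduped.withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)
# Combined with the s.lsn >= t.lsn condition in the MERGE, re-running the same
# batch leaves Silver exactly as it was.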
2. Out-of-order events (time is not a guarantee)
A timestamp column feels like the obvious ordering key, but it’s not always trustworthy. Events can be split across files, delivered in micro-batches, or arrive late due to network and system behavior.
That’s how you end up with the worst kind of bug:
- an older update overwriting a newer state.
My rule: never let “older” win.
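A small illustration of that rule, using made-up rows in the shape of the earlier examples: within one batch, the “latest per key” window picks the highest LSN, regardless of arrival order.

from pyspark.sql import functions as F, Window

# Two versions of id=1 land in the same batch, newest first (i.e. "wrong" arrival order).
out_of_order = spark.createDataFrame(
    [
        (1, "João", 26, "U", 1523789),  # newer change, arrived first
        (1, "João", 25, "U", 1400001),  # older change, arrived last
    ],
    ["id", "name", "age", "op", "lsn"],
)

w = Window.partitionBy("id").orderBy(F.col("lsn").desc())
latest = (
    out_of_order.withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)
# latest keeps age = 26: the winner is the highest LSN, not whatever arrived last.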
3. Deletes (where CDC stops being a feature and becomes engineering)
Deletes are the moment CDC becomes real. In many systems, a delete event is not a full row; it may only carry the primary key and an operation flag.
Then you need to make a decision that is more product than technical:
- Hard delete: remove the row from silver
- Soft delete: keep it but mark is_deleted = true
Both are valid. The danger is being inconsistent.
My rule: define delete semantics early and document them.
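For the soft-delete option, the MERGE from the code example changes shape: 'D' events update a flag instead of removing the row. A sketch, reusing silver_table, latest_cdc, PK, ORD, and OP from above, and assuming Silver has an is_deleted boolean column:

# Soft-delete variant: a 'D' event marks the row instead of removing it.
(
    silver_table.alias("t")
    .merge(latest_cdc.alias("s"), f"t.{PK} = s.{PK}")
    .whenMatchedUpdate(  # delete events flip the flag but keep the row
        condition=f"s.{OP} = 'D' AND s.{ORD} >= t.{ORD}",
        set={"is_deleted": "true", ORD: f"s.{ORD}"},
    )
    .whenMatchedUpdate(  # normal updates also clear the flag (handles re-inserts)
        condition=f"s.{OP} != 'D' AND s.{ORD} >= t.{ORD}",
        set={**{c: f"s.{c}" for c in latest_cdc.columns if c != OP}, "is_deleted": "false"},
    )
    .whenNotMatchedInsert(
        condition=f"s.{OP} != 'D'",
        values={**{c: f"s.{c}" for c in latest_cdc.columns if c != OP}, "is_deleted": "false"},
    )
    .execute()
)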
4. Backfills and reprocessing (the true test of a CDC design)
Eventually you will need to:
- re-run a day
- fix a bug
- reprocess a range
- rebuild a table
If your lakehouse design can’t survive replay, you don’t have a pipeline; you have a “one-time success”.
My rule: bronze is immutable truth; silver is derived state.
That’s also why I stopped obsessing over streaming vs batch: if I can replay and get the same result, freshness becomes a product choice, not a fear.
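Concretely, “silver is derived state” means a rebuild is just the incremental logic run over all of bronze, written with overwrite. A sketch, reusing PK, ORD, and OP from the code example:

# Full rebuild: replay every event in Bronze and overwrite Silver.
from pyspark.sql import functions as F, Window

all_events = spark.read.format("parquet").load("s3://bucket/bronze/customer_cdc/")

w = Window.partitionBy(PK).orderBy(F.col(ORD).desc())
current = (
    all_events.withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
    .where(F.col(OP) != "D")  # keys whose latest event is a delete drop out of current state
    .drop(OP)
)

current.write.format("delta").mode("overwrite").saveAsTable("silver.customer")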
So my definition of “CDC done right” is simple:
- Bronze keeps the raw change events (immutable)
- Silver is rebuilt by rules: dedup by key, order by LSN/SCN (or event time), and merge idempotently
- Deletes are explicit. Backfills are expected. Reprocessing is safe
CDC is powerful, but the value isn’t “near real-time.” The value is traceability + replayability.
Because in the end, correctness is not something you hope for.
It’s something you design for.