Change Data Capture (CDC): Streaming Database Changes in Real Time
Change Data Capture (CDC) is a set of database and integration techniques that identify, record, and deliver every data modification — inserts, updates, and deletes — as a continuous event stream rather than a batch snapshot. CDC operates at the intersection of database replication, event-driven architecture, and real-time analytics, making it a foundational component in modern data infrastructure. Its significance lies in enabling downstream systems to remain synchronized with source databases without the latency, resource cost, or data fidelity loss of periodic full-table extracts.
Definition and scope
CDC is formally classified within the data integration discipline as a mechanism for tracking row-level changes in a source database and propagating those changes to one or more consumers in near-real time. The Object Management Group (OMG) addresses real-time data distribution patterns in its Data Distribution Service (DDS) standard, while the open-source ecosystem supplies most production CDC infrastructure: Apache Kafka Connect, part of the Apache Kafka project at the Apache Software Foundation, provides the connector framework, and Debezium, a Red Hat-sponsored project licensed under Apache 2.0, provides widely used log-based connectors.
The scope of CDC spans three distinct propagation models:
- Log-based CDC — reads the database engine's internal transaction log (the write-ahead log in PostgreSQL, the binlog in MySQL, or the redo log in Oracle) to extract change events without modifying the source schema or adding query overhead.
- Trigger-based CDC — uses database stored procedures and triggers that fire on DML operations and write change records to a staging table.
- Timestamp/query-based CDC — polls source tables for rows whose modification timestamp exceeds the last extraction watermark.
Log-based CDC is architecturally dominant in production environments because it imposes the lowest overhead on the source system and captures all changes, including hard deletes, which are invisible to timestamp polling. Log-based capture has accordingly become the default choice for latency-sensitive pipelines.
How it works
Log-based CDC follows a discrete sequence of operations that transforms raw database write activity into structured, consumable events:
- Transaction log interception — a CDC connector attaches to the database replication slot (PostgreSQL) or binlog stream (MySQL/MariaDB) and reads committed transaction records as they are written.
- Event parsing and schema resolution — the connector decodes binary log entries into structured change events, resolving column names, data types, and before/after row images against the current table schema.
- Event serialization — decoded events are serialized into a standard format (Apache Avro, JSON, or Protobuf) and published to a message broker, most commonly Apache Kafka.
- Schema registry coordination — a schema registry (such as the Confluent Schema Registry, governed by the Confluent Community License) maintains versioned schemas to ensure consumers can parse events produced under prior or current schema states.
- Consumer delivery — downstream services, data warehouses, search indices, or cache layers consume the event stream and apply changes in the order they were committed, preserving the source database's transaction boundaries and commit ordering.
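The consumer-delivery step can be sketched with a Debezium-style change event envelope, whose `op` field distinguishes creates (`c`), updates (`u`), snapshot reads (`r`), and deletes (`d`), with `before`/`after` row images. The payload values and the `users`-style row shape here are made up for illustration:

```python
import json

# A Debezium-style change event: "op" identifies the operation,
# "before"/"after" carry the row images. Values are hypothetical.
event_json = """
{
  "op": "u",
  "before": {"id": 42, "email": "old@example.com"},
  "after":  {"id": 42, "email": "new@example.com"}
}
"""

def apply_change(table: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, or snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # delete: key comes from the before image
        del table[event["before"]["id"]]

replica = {42: {"id": 42, "email": "old@example.com"}}
apply_change(replica, json.loads(event_json))
print(replica[42]["email"])        # new@example.com
```

A real consumer would read these envelopes from a Kafka topic in partition order; the apply logic is the same.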
Trigger-based CDC replaces the first two steps (log interception and event parsing) with database-native DML triggers that write to a shadow audit table. This approach works with databases that do not expose replication logs externally, but it adds write amplification, since every source row write produces a second write to the audit table, and it cannot capture schema changes automatically.
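A runnable sketch of the trigger-based pattern, using SQLite (via Python's standard library) as a stand-in for a production RDBMS; the `users`/`users_audit` table names are illustrative:

```python
import sqlite3

# Trigger-based CDC sketch: every write to `users` fires a trigger that
# appends a change record to the `users_audit` shadow table -- this second
# write is the write amplification noted above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE users_audit (
    op TEXT, id INTEGER, email TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO users_audit (op, id, email) VALUES ('I', NEW.id, NEW.email);
END;
CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO users_audit (op, id, email) VALUES ('D', OLD.id, OLD.email);
END;
""")
db.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
db.execute("DELETE FROM users WHERE id = 1")
rows = db.execute("SELECT op, id FROM users_audit ORDER BY rowid").fetchall()
print(rows)   # [('I', 1), ('D', 1)]
```

A downstream consumer would then poll `users_audit` and delete or mark rows it has processed.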
Timestamp-based CDC eliminates the replication log requirement entirely but introduces latency of up to one poll interval, typically 30 seconds to 5 minutes in common configurations. It also fails to capture hard deletes unless source tables use soft-delete patterns.
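One poll cycle of the timestamp approach can be sketched as follows, again using SQLite as a stand-in; the `orders` table and `updated_at` column are assumptions for illustration:

```python
import sqlite3

# Timestamp-based CDC sketch: fetch rows modified since the last
# extraction watermark, then advance the watermark.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped",  "2024-01-01T10:00:00"),
     (2, "pending",  "2024-01-01T12:30:00"),
     (3, "returned", "2024-01-01T12:45:00")],
)

def poll_changes(conn, watermark: str):
    """One poll cycle: return rows changed after the watermark and the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

changed, wm = poll_changes(db, "2024-01-01T12:00:00")
print([r[0] for r in changed], wm)   # [2, 3] 2024-01-01T12:45:00
```

Note the blind spot this section describes: if row 1 were hard-deleted between polls, no future query result would ever mention it.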
Common scenarios
CDC appears consistently across four categories of operational requirements:
Database-to-database synchronization — maintaining a read replica, a geographically distributed standby, or a heterogeneous target (for example, replicating a relational source into a NoSQL store) with sub-second lag rather than overnight batch windows.
Real-time analytics and data warehousing — streaming operational row changes into an OLAP warehouse so that analytical queries reflect transactions completed seconds ago rather than the previous day's extract. This use case is a primary driver behind the adoption of warehouse engines that support streaming ingest.
Cache invalidation — propagating row-level changes from a relational source into an in-memory cache such as Redis or Memcached, eliminating stale reads without requiring application-level cache invalidation logic.
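A minimal sketch of CDC-driven cache maintenance, with a plain dict standing in for Redis; the `user:<id>` key scheme and the event shapes are assumptions for illustration:

```python
# CDC-driven cache maintenance: a consumer evicts on delete and
# writes through on insert/update as change events arrive.
cache = {"user:7": {"id": 7, "name": "Ada"}}

def on_change_event(event: dict) -> None:
    if event["op"] == "d":
        # Evict: the primary key comes from the before image.
        cache.pop(f"user:{event['before']['id']}", None)
    else:
        # Write through: refresh the entry from the after image.
        row = event["after"]
        cache[f"user:{row['id']}"] = row

on_change_event({"op": "u", "before": {"id": 7}, "after": {"id": 7, "name": "Grace"}})
on_change_event({"op": "d", "before": {"id": 7}, "after": None})
print(cache)   # {}
```

Because the events originate from committed transactions, the cache never diverges from the source for longer than the pipeline's end-to-end lag.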
Audit and compliance pipelines — capturing a tamper-evident, ordered record of all row-level changes for database auditing and compliance obligations under frameworks such as HIPAA, PCI-DSS, and SOX. Log-based CDC produces an immutable event sequence tied directly to committed transactions, which satisfies audit trail requirements without relying on application-layer logging.
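The tamper-evidence property can be illustrated by hash-chaining change events, so that altering any historical record invalidates every later link. This is a generic sketch of the idea, not a format prescribed by any specific compliance framework:

```python
import hashlib
import json

# Hash-chain sketch: each record's hash covers the event plus the
# previous record's hash, so history cannot be edited undetected.
def chain(events):
    prev = "0" * 64
    out = []
    for e in events:
        digest = hashlib.sha256(
            (prev + json.dumps(e, sort_keys=True)).encode()
        ).hexdigest()
        out.append({"event": e, "hash": digest})
        prev = digest
    return out

log = chain([{"op": "c", "id": 1}, {"op": "u", "id": 1}])

# Tamper with the first event and re-chain: later hashes no longer match.
tampered = [dict(log[0], event={"op": "c", "id": 999}), log[1]]
recomputed = chain([rec["event"] for rec in tampered])
print(recomputed[1]["hash"] == log[1]["hash"])   # False
```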
Decision boundaries
Selecting a CDC method requires evaluating four structural constraints:
Log access availability — Log-based CDC requires the database engine to expose replication log access to external connectors. PostgreSQL exposes this via logical replication slots (introduced in PostgreSQL 9.4); MySQL exposes the binlog natively. Proprietary engines such as Oracle require licensed supplemental logging features. Where log access is unavailable or cost-prohibitive, trigger-based or timestamp-based methods are the fallback.
Latency requirements — Log-based CDC achieves sub-second delivery from commit to consumer. Timestamp polling introduces latency of up to the poll interval, making it unsuitable when events must reach consumers within a few seconds. Trigger-based CDC latency depends on how often downstream consumers poll the audit table.
Schema change handling — Log-based connectors such as Debezium track schema evolution automatically, propagating schema change events so consumers can handle added or dropped columns. Trigger-based CDC requires manual trigger redeployment after schema changes. Timestamp-based CDC requires query updates to include new columns.
Operational overhead — Log-based CDC requires management of replication slots, which accumulate unconsumed WAL segments if a consumer falls behind — a condition that risks disk exhaustion on the source server. Trigger-based CDC distributes overhead to the source write path. Timestamp-based CDC is the operationally simplest but the least capable. Database monitoring and observability practices should include replication slot lag as a first-class metric in any log-based CDC deployment.
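Slot lag itself is simple arithmetic over PostgreSQL's `high/low` hexadecimal LSN notation: the byte lag is the numeric difference between two WAL positions. In production the inputs would come from `pg_replication_slots` and `pg_current_wal_lsn()`; the LSN values below are made up for illustration:

```python
# PostgreSQL LSNs are 64-bit positions written as "high/low" in hex.
def lsn_to_bytes(lsn: str) -> int:
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

current_wal = "16/B374D848"    # hypothetical pg_current_wal_lsn() result
slot_restart = "16/B2000000"   # hypothetical slot restart_lsn

# Bytes of WAL the source must retain because this consumer is behind.
lag_bytes = lsn_to_bytes(current_wal) - lsn_to_bytes(slot_restart)
print(f"slot lag: {lag_bytes / 1024 / 1024:.1f} MiB")   # slot lag: 23.3 MiB
```

Alerting when this figure grows monotonically catches the stalled-consumer condition before WAL retention exhausts the source's disk.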
References
- Debezium Documentation — open-source log-based CDC platform sponsored by Red Hat and licensed under Apache 2.0
- PostgreSQL Logical Replication Documentation — official PostgreSQL project documentation covering logical replication slots used by log-based CDC
- Apache Kafka Connect Documentation — Apache Software Foundation documentation for the connector framework used to pipeline CDC events
- MySQL Binary Log Documentation — Oracle MySQL reference documentation for binlog configuration supporting CDC
- NIST SP 800-92: Guide to Computer Security Log Management — NIST guidance on log management practices applicable to audit-driven CDC pipelines
- OMG Data Distribution Service (DDS) Specification — Object Management Group standard covering real-time data distribution patterns