Change Data Capture Architecture: Zero-Impact Real-Time Replication

Technical Brief #1 Reading time: 10 minutes

Executive Technical Summary

Production teams increasingly need operational data to move in real time: from core databases into warehouses, lakes, feature stores, vector databases, and operational applications. The challenge is that the source databases are often the most sensitive systems in the enterprise. They may be running high transaction volumes, supporting customer-facing services, and operating under strict availability requirements.

For these environments, traditional extraction methods create unacceptable trade-offs. Polling queries add load and miss important change types. Triggers capture more detail but add work to every transaction. Direct application changes increase coupling and deployment risk.

Log-based Change Data Capture, or CDC, provides a cleaner architecture. Instead of querying tables or modifying application code, it reads committed changes from database transaction logs and turns those changes into a reliable downstream event stream.

For engineering teams, the design goal is not simply “move data faster.” The goal is to replicate committed changes with minimal source impact, predictable latency, recoverable state, schema awareness, and operational controls suitable for production workloads.

Key Takeaways

Polling-based CDC is simple, but not sufficient for high-value production workloads. It can miss deletes, create source load, and introduce latency between polling cycles.
Trigger-based CDC captures more change detail, but adds transaction-path overhead. For high-throughput systems, that overhead is often unacceptable.
Log-based CDC is the preferred architecture for production replication. It captures committed changes from database logs without table scans or application changes.
Exactly-once behavior requires architecture, not just a connector. Checkpoints, idempotent writes, ordering guarantees, and recovery logic all matter.
CDC must be operated as infrastructure. Monitoring, schema handling, backpressure, replay, and access controls are required for reliable production use.

The CDC Challenge

The technical requirement is straightforward to state:

Replicate every committed change from a production database to downstream systems in near real time, without slowing the source system or changing application code.

In practice, that requirement creates several constraints:

The production database may be under heavy transactional load.
The application team may not allow triggers, schema modifications, or custom write-path logic.
Expensive extraction queries against production tables are not acceptable.
Downstream systems need inserts, updates, deletes, and sometimes schema changes.
The pipeline must recover from failures without data loss or duplication.
Operators need visibility into lag, throughput, errors, and recovery state.

This is why CDC is not just a connector pattern. It is a distributed systems problem involving database internals, state management, delivery semantics, schema evolution, and operational reliability.

CDC Approaches: Architectural Comparison

1. Query-Based CDC, or Polling

A polling approach repeatedly queries source tables for records that changed after the last sync time.

SELECT *
FROM orders
WHERE updated_at > :last_sync_time;

This approach is easy to understand and can work for simple analytical extracts. It does not require deep database-specific log access, and it can be implemented against almost any relational database.

However, polling has major limitations in production replication:

Deletes are often missed because deleted rows no longer exist to be queried.
The source table must have a reliable timestamp or incrementing column.
Query load increases as the table grows or the polling interval shortens.
Latency is bounded by the polling interval.
It may be difficult to distinguish inserts from updates without additional metadata.
Clock drift, late commits, and batch updates can create correctness issues.

Best fit: Low-risk analytical synchronization where latency and delete capture are not critical. Not recommended for: High-throughput, regulated, or operational real-time replication.

2. Trigger-Based CDC

A trigger-based approach writes change records into an audit or staging table whenever a source table changes.

CREATE TRIGGER orders_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
EXECUTE FUNCTION capture_change();

Triggers can capture inserts, updates, and deletes with low latency. They are more complete than simple polling.

But the trade-off is significant: trigger logic executes inside the transaction path. Every insert, update, or delete must also execute trigger code. Under load, this can increase transaction latency, create lock contention, and complicate operational debugging.

Common risks include:

Additional overhead on every source transaction.
Required schema or database object changes.
Higher risk of deadlocks or transaction failures under pressure.
Difficult upgrades when application schemas evolve.
Complex rollback and recovery behavior.

Best fit: Controlled environments with modest throughput and clear ownership of source schema changes. Not recommended for: Core production systems where transaction-path performance and stability are critical.

3. Log-Based CDC

Log-based CDC reads committed changes from database transaction logs, such as Oracle redo logs, MySQL binlog, PostgreSQL WAL, or SQL Server transaction logs.

Instead of asking the database to re-read tables, the CDC engine observes the change stream already maintained by the database for recovery and replication.

Key benefits:

Minimal query load on production tables.
No triggers or application code changes.
Captures inserts, updates, deletes, and selected DDL events.
Lower latency because changes are captured as commits appear in the log.
Better recoverability through log positions and checkpoints.
Cleaner separation between operational applications and downstream consumers.

Operational considerations still matter. Log access must be configured carefully. Log retention must be sufficient. Schema changes must be handled safely. Downstream writes must be idempotent and observable.

Best fit: Production-grade real-time replication where source impact, correctness, and recoverability matter.

Reference Architecture

A production CDC pipeline typically has six layers:

Production Database
  -> Transaction Log
  -> CDC Log Reader
  -> Change Event Stream
  -> Transformation and Routing Layer
  -> Delivery Writer
  -> Target Systems

Core Components

Component	Responsibility	Production Considerations
Log Reader	Reads committed changes from database logs	Must maintain a durable log position and handle rotation, archival, and failover.
Checkpoint Store	Persists progress through the source log	Must be durable and coordinated with downstream delivery state.
Change Stream	Buffers and orders change events	Must support backpressure, retries, and large transactions.
Schema Manager	Tracks source schemas and change events	Must detect, classify, and safely propagate compatible schema changes.
Transformation Layer	Applies mappings, filters, enrichment, and routing	Must be versioned, observable, and testable.
Writer	Applies changes to downstream systems	Must support idempotency, batching, conflict handling, and recovery.
Monitoring Layer	Exposes pipeline health	Must report lag, throughput, errors, retries, schema changes, and delivery state.

How Log-Based CDC Works

Step 1: Establish a Consistent Starting Point

Before continuous capture begins, the CDC system needs a consistent baseline. This usually involves an initial snapshot or backfill of selected tables, combined with a precise source log position.

The key is coordination: the system must know where the snapshot ends and where log-based streaming begins. Without that boundary, records can be missed or duplicated.

Step 2: Read Committed Changes from Logs

The CDC engine reads the transaction log and parses committed changes into structured events. A typical event includes:

Source database and schema
Table name
Operation type: insert, update, delete, or DDL
Primary key or unique identifier
Before and after values, where available
Commit timestamp
Log position, such as SCN, LSN, GTID, or file offset
Transaction identifier

Step 3: Preserve Ordering and Transaction Boundaries

Ordering is critical. Some downstream systems need events in commit order. Others require table-level or key-level ordering. A production CDC system should define ordering guarantees explicitly rather than assume one global order fits every target.

Transaction boundaries also matter. If a source transaction updates multiple tables, downstream systems may need to consume those changes atomically or at least understand that they were part of the same transaction.

Step 4: Transform and Route Events

Raw database changes are often not the final consumption format. CDC pipelines may need to:

Convert source data types to target-compatible types.
Map schemas and field names.
Filter sensitive columns.
Mask or tokenize PII.
Route tables to different targets.
Enrich events with metadata for audit and lineage.

Transformations should be explicit and versioned. Hidden transformations create debugging and governance problems later.

Step 5: Deliver Reliably to Targets

Targets may include data warehouses, data lakes, search indexes, event streams, feature stores, vector databases, or operational databases.

Reliable delivery depends on four patterns:

Durable checkpoints so the pipeline can resume after failure.
Idempotent writes so retries do not create duplicate final state.
Backpressure handling so slow targets do not overwhelm the pipeline.
Replay controls so operators can recover or rebuild downstream state.

Delivery Semantics: What “Exactly Once” Really Means

In distributed data systems, “exactly once” should be interpreted carefully. A connector cannot magically guarantee that every downstream system will behave exactly once under every failure mode.

A more practical definition is:

Each committed source change is applied to the target final state once, even if the pipeline retries internally.

Achieving that outcome requires:

Checkpointing source log positions only after durable downstream acknowledgement.
Using deterministic event identifiers or primary keys.
Applying idempotent upserts, merges, or replace-by-key operations.
Handling deletes explicitly.
Coordinating retries with target write semantics.
Detecting and resolving partial batch failures.

For append-only targets, the strategy may involve deduplication keys. For mutable targets, it may involve merge semantics. For event streams, it may involve producer transactions, partitioning, and consumer-side idempotency.

Database-Specific Implementation Notes

The architecture is consistent across databases, but implementation details differ significantly.

Oracle CDC

Primary log source: Online and archived redo logs.

Common requirements:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

Operational notes:

Supplemental logging may be required so changes contain enough information for downstream reconstruction.
The CDC engine must track the System Change Number, or SCN.
Large transactions may span multiple redo logs.
Archive log retention must be sized for expected downtime and recovery windows.
Permissions should follow least-privilege principles and be reviewed with the DBA team.

MySQL CDC

Primary log source: Binary log, or binlog.

Common requirements:

SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'FULL';

Operational notes:

Row-based binlog is typically required for reliable change reconstruction.
The pipeline must handle GTID-based or file-and-position-based checkpointing.
Binlog retention must exceed the maximum expected pipeline outage.
Large BLOB or JSON fields may require memory and batch-size tuning.
Failover handling must account for replica topology and binlog continuity.

PostgreSQL CDC

Primary log source: Write-Ahead Log, or WAL, through logical decoding.

Common requirements:

ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
SELECT pg_create_logical_replication_slot('deltaplex_slot', 'pgoutput');

Operational notes:

Logical replication slots retain WAL until consumed, so slot lag must be monitored closely.
Long-running outages can increase disk usage.
WAL formats and plugins may differ across PostgreSQL versions.
LSN checkpoints define the recovery position.
Publication scope should be controlled to avoid unnecessary capture.

SQL Server CDC

Primary log source: SQL Server transaction log through CDC or related replication mechanisms.

Common requirements:

EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
  @source_schema = 'dbo',
  @source_name = 'orders',
  @role_name = NULL;

Operational notes:

CDC is enabled at the database and table level.
Cleanup jobs and retention windows must align with pipeline recovery requirements.
LSN checkpoints are used to track progress.
Permissions should limit access to the required CDC artifacts.
Version and edition support should be confirmed during design.

Handling Production Edge Cases

Large Transactions

A single transaction may update millions of rows. Loading the full transaction into memory can cause out-of-memory failures or long stalls.

Recommended handling:

Stream large transactions in bounded chunks.
Preserve transaction metadata for downstream consistency.
Apply backpressure when targets cannot keep up.
Use durable intermediate state when needed.
Resume from the last safe checkpoint after failure.

Schema Changes

Schema evolution is one of the most common causes of pipeline breakage.

A production CDC system should detect and classify schema events, including:

Added columns
Dropped columns
Renamed columns
Data type changes
Primary key changes
Table creation or deletion

Not all schema changes should be auto-applied. Additive changes may be safe. Destructive or incompatible changes may require approval, a migration plan, or a compatibility layer.

Deletes and Tombstones

Deletes must be treated as first-class events. In analytics targets, this may require tombstone records, soft-delete flags, or merge logic. In event streams, consumers need a clear contract for delete semantics.

Ignoring deletes creates downstream data drift and can become a governance issue when deletion requests or retention policies apply.

Log Retention and Pipeline Downtime

The CDC system can only resume if the required logs are still available. If logs expire before the pipeline catches up, a full resnapshot may be required.

Operators should define:

Maximum acceptable pipeline downtime.
Source log retention settings.
Alert thresholds for lag and remaining retention window.
Recovery procedures when log continuity is lost.

Source Failover

Production databases often run in high-availability configurations. CDC pipelines must understand how failover affects log positions, replication slots, and source endpoints.

A robust design should include:

Failover-aware source discovery.
Durable checkpoints independent of a single node.
Validation after failover before streaming resumes.
Alerts when log continuity cannot be guaranteed.

Slow Downstream Targets

Targets may throttle, fail, or become unavailable. Without backpressure, the CDC engine can accumulate unbounded memory or disk usage.

Production pipelines should support:

Bounded buffers.
Retry policies.
Dead-letter handling for malformed records.
Pause, resume, and replay controls.
Alerts for delivery lag and target error rates.

Security and Governance Requirements

CDC pipelines often carry sensitive operational data. They should be governed with the same discipline as the systems they connect.

Core controls include:

Least-privilege source access.
Encryption in transit and, where applicable, at rest.
Role-based access control for pipeline management.
Audit logs for configuration changes and data access.
Column-level masking or filtering for sensitive fields.
Lineage metadata from source to destination.
Environment separation for development, testing, and production.

For regulated industries, lineage and auditability are not secondary features. They are part of the production readiness criteria.

Implementation Checklist

Before deploying CDC for a production workload, engineering and data teams should validate the following:

Area	Questions to Answer
Source readiness	Are logs enabled, retained, and accessible with least-privilege permissions?
Snapshot strategy	How will the initial baseline be captured without affecting production?
Checkpointing	Where is progress stored, and when is it advanced?
Delivery semantics	How are retries, idempotency, and partial failures handled?
Schema evolution	Which schema changes are auto-applied, paused, or escalated?
Performance	What are the latency, throughput, and resource targets?
Monitoring	Are lag, throughput, errors, retries, and retention risk visible?
Security	Are encryption, access control, masking, and audit logs configured?
Recovery	Can operators pause, resume, replay, and resnapshot safely?
Ownership	Who responds to incidents, schema changes, and target failures?

How Deltaplex Supports Production CDC

Deltaplex is designed to operationalize log-based CDC for enterprise environments where source impact, correctness, governance, and reliability matter.

Key capabilities include:

Log-based CDC for major operational databases.
Low-impact capture from transaction logs rather than repeated table queries.
Continuous delivery to warehouses, lakes, event streams, and downstream data platforms.
Durable checkpointing and resumable replication.
Schema change detection and configurable handling policies.
Monitoring for pipeline health, lag, throughput, and delivery errors.
Operational controls such as pause, resume, replay, and recovery.
Deployment options across on-premises, VPC, and hybrid environments.
Governance support through metadata, audit logs, and lineage visibility.

The result is not just faster data movement. It is a more reliable operational data layer that can support analytics, AI, compliance, and real-time applications without turning production databases into extraction engines.

Conclusion

Real-time replication is easy to underestimate. Reading rows from one database and writing them to another looks simple until production realities appear: deletes, schema changes, large transactions, failover, slow targets, partial writes, audit requirements, and recovery after downtime.

Log-based CDC is the architecture best suited for these realities because it separates data capture from the application transaction path. But log-based CDC still requires careful engineering around checkpoints, ordering, schema evolution, delivery semantics, and operations.

For enterprises building real-time data infrastructure, the objective should be clear: capture committed changes with minimal source impact, deliver them reliably, and govern the full data flow from source to destination.

That is the foundation required for production-grade real-time analytics, operational reporting, AI systems, and event-driven applications.