Technical Brief #1 Reading time: 10 minutes

Executive Technical Summary

Production teams increasingly need operational data to move in real time: from core databases into warehouses, lakes, feature stores, vector databases, and operational applications. The challenge is that the source databases are often the most sensitive systems in the enterprise. They may be running high transaction volumes, supporting customer-facing services, and operating under strict availability requirements.

For these environments, traditional extraction methods create unacceptable trade-offs. Polling queries add load and miss important change types. Triggers capture more detail but add work to every transaction. Direct application changes increase coupling and deployment risk.

Log-based Change Data Capture, or CDC, provides a cleaner architecture. Instead of querying tables or modifying application code, it reads committed changes from database transaction logs and turns those changes into a reliable downstream event stream.

For engineering teams, the design goal is not simply “move data faster.” The goal is to replicate committed changes with minimal source impact, predictable latency, recoverable state, schema awareness, and operational controls suitable for production workloads.

Key Takeaways

The CDC Challenge

The technical requirement is straightforward to state:

Replicate every committed change from a production database to downstream systems in near real time, without slowing the source system or changing application code.

In practice, that requirement creates several constraints:

This is why CDC is not just a connector pattern. It is a distributed systems problem involving database internals, state management, delivery semantics, schema evolution, and operational reliability.

CDC Approaches: Architectural Comparison

1. Query-Based CDC, or Polling

A polling approach repeatedly queries source tables for records that changed after the last sync time.

SELECT *
FROM orders
WHERE updated_at > :last_sync_time;

This approach is easy to understand and can work for simple analytical extracts. It does not require deep database-specific log access, and it can be implemented against almost any relational database.

However, polling has major limitations in production replication:

Best fit: Low-risk analytical synchronization where latency and delete capture are not critical. Not recommended for: High-throughput, regulated, or operational real-time replication.

2. Trigger-Based CDC

A trigger-based approach writes change records into an audit or staging table whenever a source table changes.

CREATE TRIGGER orders_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
EXECUTE FUNCTION capture_change();

Triggers can capture inserts, updates, and deletes with low latency. They are more complete than simple polling.

But the trade-off is significant: trigger logic executes inside the transaction path. Every insert, update, or delete must also execute trigger code. Under load, this can increase transaction latency, create lock contention, and complicate operational debugging.

Common risks include:

Best fit: Controlled environments with modest throughput and clear ownership of source schema changes. Not recommended for: Core production systems where transaction-path performance and stability are critical.

3. Log-Based CDC

Log-based CDC reads committed changes from database transaction logs, such as Oracle redo logs, MySQL binlog, PostgreSQL WAL, or SQL Server transaction logs.

Instead of asking the database to re-read tables, the CDC engine observes the change stream already maintained by the database for recovery and replication.

Key benefits:

Operational considerations still matter. Log access must be configured carefully. Log retention must be sufficient. Schema changes must be handled safely. Downstream writes must be idempotent and observable.

Best fit: Production-grade real-time replication where source impact, correctness, and recoverability matter.

Reference Architecture

A production CDC pipeline typically has six layers:

Production Database
  -> Transaction Log
  -> CDC Log Reader
  -> Change Event Stream
  -> Transformation and Routing Layer
  -> Delivery Writer
  -> Target Systems

Core Components

ComponentResponsibilityProduction Considerations
Log ReaderReads committed changes from database logsMust maintain a durable log position and handle rotation, archival, and failover.
Checkpoint StorePersists progress through the source logMust be durable and coordinated with downstream delivery state.
Change StreamBuffers and orders change eventsMust support backpressure, retries, and large transactions.
Schema ManagerTracks source schemas and change eventsMust detect, classify, and safely propagate compatible schema changes.
Transformation LayerApplies mappings, filters, enrichment, and routingMust be versioned, observable, and testable.
WriterApplies changes to downstream systemsMust support idempotency, batching, conflict handling, and recovery.
Monitoring LayerExposes pipeline healthMust report lag, throughput, errors, retries, schema changes, and delivery state.

How Log-Based CDC Works

Step 1: Establish a Consistent Starting Point

Before continuous capture begins, the CDC system needs a consistent baseline. This usually involves an initial snapshot or backfill of selected tables, combined with a precise source log position.

The key is coordination: the system must know where the snapshot ends and where log-based streaming begins. Without that boundary, records can be missed or duplicated.

Step 2: Read Committed Changes from Logs

The CDC engine reads the transaction log and parses committed changes into structured events. A typical event includes:

Step 3: Preserve Ordering and Transaction Boundaries

Ordering is critical. Some downstream systems need events in commit order. Others require table-level or key-level ordering. A production CDC system should define ordering guarantees explicitly rather than assume one global order fits every target.

Transaction boundaries also matter. If a source transaction updates multiple tables, downstream systems may need to consume those changes atomically or at least understand that they were part of the same transaction.

Step 4: Transform and Route Events

Raw database changes are often not the final consumption format. CDC pipelines may need to:

Transformations should be explicit and versioned. Hidden transformations create debugging and governance problems later.

Step 5: Deliver Reliably to Targets

Targets may include data warehouses, data lakes, search indexes, event streams, feature stores, vector databases, or operational databases.

Reliable delivery depends on four patterns:

  1. Durable checkpoints so the pipeline can resume after failure.
  2. Idempotent writes so retries do not create duplicate final state.
  3. Backpressure handling so slow targets do not overwhelm the pipeline.
  4. Replay controls so operators can recover or rebuild downstream state.

Delivery Semantics: What “Exactly Once” Really Means

In distributed data systems, “exactly once” should be interpreted carefully. A connector cannot magically guarantee that every downstream system will behave exactly once under every failure mode.

A more practical definition is:

Each committed source change is applied to the target final state once, even if the pipeline retries internally.

Achieving that outcome requires:

For append-only targets, the strategy may involve deduplication keys. For mutable targets, it may involve merge semantics. For event streams, it may involve producer transactions, partitioning, and consumer-side idempotency.

Database-Specific Implementation Notes

The architecture is consistent across databases, but implementation details differ significantly.

Oracle CDC

Primary log source: Online and archived redo logs.

Common requirements:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

Operational notes:

MySQL CDC

Primary log source: Binary log, or binlog.

Common requirements:

SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'FULL';

Operational notes:

PostgreSQL CDC

Primary log source: Write-Ahead Log, or WAL, through logical decoding.

Common requirements:

ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
SELECT pg_create_logical_replication_slot('deltaplex_slot', 'pgoutput');

Operational notes:

SQL Server CDC

Primary log source: SQL Server transaction log through CDC or related replication mechanisms.

Common requirements:

EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
  @source_schema = 'dbo',
  @source_name = 'orders',
  @role_name = NULL;

Operational notes:

Handling Production Edge Cases

Large Transactions

A single transaction may update millions of rows. Loading the full transaction into memory can cause out-of-memory failures or long stalls.

Recommended handling:

Schema Changes

Schema evolution is one of the most common causes of pipeline breakage.

A production CDC system should detect and classify schema events, including:

Not all schema changes should be auto-applied. Additive changes may be safe. Destructive or incompatible changes may require approval, a migration plan, or a compatibility layer.

Deletes and Tombstones

Deletes must be treated as first-class events. In analytics targets, this may require tombstone records, soft-delete flags, or merge logic. In event streams, consumers need a clear contract for delete semantics.

Ignoring deletes creates downstream data drift and can become a governance issue when deletion requests or retention policies apply.

Log Retention and Pipeline Downtime

The CDC system can only resume if the required logs are still available. If logs expire before the pipeline catches up, a full resnapshot may be required.

Operators should define:

Source Failover

Production databases often run in high-availability configurations. CDC pipelines must understand how failover affects log positions, replication slots, and source endpoints.

A robust design should include:

Slow Downstream Targets

Targets may throttle, fail, or become unavailable. Without backpressure, the CDC engine can accumulate unbounded memory or disk usage.

Production pipelines should support:

Security and Governance Requirements

CDC pipelines often carry sensitive operational data. They should be governed with the same discipline as the systems they connect.

Core controls include:

For regulated industries, lineage and auditability are not secondary features. They are part of the production readiness criteria.

Implementation Checklist

Before deploying CDC for a production workload, engineering and data teams should validate the following:

AreaQuestions to Answer
Source readinessAre logs enabled, retained, and accessible with least-privilege permissions?
Snapshot strategyHow will the initial baseline be captured without affecting production?
CheckpointingWhere is progress stored, and when is it advanced?
Delivery semanticsHow are retries, idempotency, and partial failures handled?
Schema evolutionWhich schema changes are auto-applied, paused, or escalated?
PerformanceWhat are the latency, throughput, and resource targets?
MonitoringAre lag, throughput, errors, retries, and retention risk visible?
SecurityAre encryption, access control, masking, and audit logs configured?
RecoveryCan operators pause, resume, replay, and resnapshot safely?
OwnershipWho responds to incidents, schema changes, and target failures?

How Deltaplex Supports Production CDC

Deltaplex is designed to operationalize log-based CDC for enterprise environments where source impact, correctness, governance, and reliability matter.

Key capabilities include:

The result is not just faster data movement. It is a more reliable operational data layer that can support analytics, AI, compliance, and real-time applications without turning production databases into extraction engines.

Conclusion

Real-time replication is easy to underestimate. Reading rows from one database and writing them to another looks simple until production realities appear: deletes, schema changes, large transactions, failover, slow targets, partial writes, audit requirements, and recovery after downtime.

Log-based CDC is the architecture best suited for these realities because it separates data capture from the application transaction path. But log-based CDC still requires careful engineering around checkpoints, ordering, schema evolution, delivery semantics, and operations.

For enterprises building real-time data infrastructure, the objective should be clear: capture committed changes with minimal source impact, deliver them reliably, and govern the full data flow from source to destination.

That is the foundation required for production-grade real-time analytics, operational reporting, AI systems, and event-driven applications.