
Sync Modes

Bizon supports three sync modes that determine how data is extracted from sources. Choose the right mode based on your data freshness requirements and source capabilities.

| Mode         | Creates New Job | Resumes From          | Best For                           |
|--------------|-----------------|-----------------------|------------------------------------|
| full_refresh | Yes, every run  | Beginning             | Small datasets, complete snapshots |
| incremental  | Yes, every run  | Last sync timestamp   | Large datasets, append-only data   |
| stream       | No, reuses job  | Last committed offset | Real-time data, Kafka/queues       |

Full refresh mode syncs all data from the source on every run.

source:
  name: hubspot
  stream: contacts
  sync_mode: full_refresh

How it works:

  1. Creates a new StreamJob each run
  2. Extracts all records from the source
  3. Continues until pagination is exhausted
  4. Marks job as SUCCEEDED or FAILED
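
As a mental model, a full refresh run looks roughly like the sketch below. This is illustrative pseudocode, not bizon's internals: the source, destination, and backend objects and their method names are hypothetical stand-ins.

def run_full_refresh(source, destination, backend):
    # 1. A new StreamJob is created for every run (hypothetical backend API)
    job = backend.create_stream_job(sync_mode="full_refresh")
    pagination = None
    try:
        while True:
            # 2. Pull the next page of records from the source
            batch = source.get_records(pagination=pagination)
            destination.write(batch.records)
            # 3. Stop once pagination is exhausted
            if not batch.next_pagination:
                break
            pagination = batch.next_pagination
        # 4. Record the final job status
        backend.update_job_status(job, "SUCCEEDED")
    except Exception:
        backend.update_job_status(job, "FAILED")
        raise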

Use cases:

  • Small to medium datasets where full extraction is fast
  • Data that changes unpredictably (no reliable timestamp)
  • Initial loads before switching to incremental
  • Sources that don’t support incremental queries

Considerations:

  • Higher data transfer costs for large datasets
  • Longer sync times as data grows
  • Destination must handle duplicate/updated records
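
On the last point: if the destination has no merge or upsert semantics, you may need to deduplicate on load. A minimal sketch, assuming each record carries id and updated_at fields (both assumptions about your data):

def dedupe_latest(records):
    # Keep only the most recent version of each record, keyed on "id"
    latest = {}
    for rec in records:
        key = rec["id"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())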

Incremental mode syncs only the records that are new or updated since the last successful sync.

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental

How it works:

  1. Gets the start_date from the last successful StreamJob
  2. Creates a new StreamJob
  3. Queries source for records updated after start_date
  4. Syncs until pagination is exhausted
  5. Stores the new sync timestamp for next run
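
The flow above differs from full refresh only in the resume point and the checkpoint written at the end. A rough sketch, with hypothetical backend and source method names (get_records_after is the method named in the requirements below):

from datetime import datetime, timezone

def run_incremental(source, destination, backend):
    # 1. Resume point: timestamp stored by the last successful StreamJob
    start_date = backend.last_successful_timestamp()  # None on the first run
    # 2. Each run still gets its own StreamJob
    job = backend.create_stream_job(sync_mode="incremental")
    checkpoint = datetime.now(timezone.utc)
    pagination = None
    while True:
        # 3./4. Fetch only records updated after start_date, page by page
        batch = source.get_records_after(start_date, pagination=pagination)
        destination.write(batch.records)
        if not batch.next_pagination:
            break
        pagination = batch.next_pagination
    # 5. Persist the new checkpoint for the next run
    backend.update_job_status(job, "SUCCEEDED", checkpoint=checkpoint)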

Use cases:

  • Large datasets where full refresh is too slow
  • Append-only data (logs, events, transactions)
  • APIs with efficient date-filtered queries
  • Cost optimization for metered APIs

Requirements:

  • Source must implement the get_records_after() method (see the sketch after this list)
  • Source API must support filtering by modification date
  • Check stream support with bizon stream list <source>
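
If you are building a custom source, the shape of get_records_after() is roughly as follows. The class, constructor, signature, return shape, and endpoint here are illustrative assumptions, not bizon's actual source interface:

import requests

class ExampleSource:
    """Hypothetical date-filtered source client."""

    def __init__(self, base_url, token):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def get_records_after(self, updated_after, pagination=None):
        # Filter server-side on modification date and carry the cursor
        # returned by the previous page
        params = {"updated_after": updated_after.isoformat()}
        if pagination:
            params["after"] = pagination
        resp = self.session.get(f"{self.base_url}/contacts", params=params)
        resp.raise_for_status()
        data = resp.json()
        return {
            "records": data["results"],
            "next_pagination": data.get("paging", {}).get("next", {}).get("after"),
        }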

Configuration:

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental
  # Optional: force restart from scratch
  force_ignore_checkpoint: false

Considerations:

  • Relies on source’s timestamp accuracy
  • Deleted records may not be captured
  • First run behaves like full refresh

Stream mode enables continuous, real-time data ingestion with offset tracking.

source:
  name: kafka
  stream: topic
  sync_mode: stream

How it works:

  1. Uses a single, persistent RUNNING StreamJob
  2. Tracks offsets/positions in the source system
  3. Commits progress periodically to backend
  4. Runs continuously until stopped
  5. Resumes from last committed offset on restart
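
The offset pattern is the same one you would use with a manual-commit Kafka consumer. The sketch below uses confluent-kafka as an analogy only; bizon commits its progress to its own backend as described above, and write_to_destination is a stand-in:

from confluent_kafka import Consumer

def write_to_destination(payload):
    # Stand-in for a real destination writer (e.g., buffered warehouse loads)
    print(payload)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "my-consumer-group",
    "enable.auto.commit": False,   # commit only after records are safely written
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

try:
    while True:                    # runs continuously until stopped
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        write_to_destination(msg.value())
        consumer.commit(message=msg)   # resume point after a restart
finally:
    consumer.close()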

Use cases:

  • Real-time data pipelines
  • Kafka/Redpanda topic consumption
  • Event streaming architectures
  • Low-latency data delivery

Configuration:

source:
  name: kafka
  stream: topic
  sync_mode: stream
  bootstrap_servers: kafka:9092
  group_id: my-consumer-group
  topics:
    - name: events
      destination_id: project.dataset.events

engine:
  runner:
    type: stream  # Use stream runner for continuous operation

Considerations:

  • Requires stream-capable source (Kafka, RabbitMQ)
  • Use --runner stream for continuous operation
  • Backend must be persistent (postgres, bigquery) for production

Choosing a sync mode:

  • Is your data real-time (Kafka, message queue)?
    • Yes → use stream.
    • No → does the source support incremental queries?
      • No → use full_refresh.
      • Yes → is the dataset large (>1M records)?
        • Yes → use incremental.
        • No → use full_refresh.

To ignore existing checkpoints and start fresh:

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental
  force_ignore_checkpoint: true  # Start from scratch

This is useful when:

  • Schema changes require re-syncing all data
  • Source data was corrected/backfilled
  • Testing or debugging pipelines