
Sync Modes

Bizon supports three sync modes that determine how data is extracted from sources. Choose the right mode based on your data freshness requirements and source capabilities.

| Mode         | Creates New Job | Resumes From          | Best For                           |
|--------------|-----------------|-----------------------|------------------------------------|
| full_refresh | Yes, every run  | Beginning             | Small datasets, complete snapshots |
| incremental  | Yes, every run  | Last sync timestamp   | Large datasets, append-only data   |
| stream       | No, reuses job  | Last committed offset | Real-time data, Kafka/queues       |

Full refresh mode syncs all data from the source on every run.

source:
  name: hubspot
  stream: contacts
  sync_mode: full_refresh

How it works:

  1. Creates a new StreamJob each run
  2. Extracts all records from the source
  3. Continues until pagination is exhausted
  4. Marks job as SUCCEEDED or FAILED
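
As a mental model, a full refresh run looks roughly like the sketch below. This is illustrative pseudocode, not bizon's internals: the source, destination, and backend objects and their method names are hypothetical stand-ins.

def run_full_refresh(source, destination, backend):
    # 1. A new StreamJob is created for every run (hypothetical backend API)
    job = backend.create_stream_job(sync_mode="full_refresh")
    pagination = None
    try:
        while True:
            # 2. Pull the next page of records from the source
            batch = source.get_records(pagination=pagination)
            destination.write(batch.records)
            # 3. Stop once pagination is exhausted
            if not batch.next_pagination:
                break
            pagination = batch.next_pagination
        # 4. Record the final job status
        backend.update_job_status(job, "SUCCEEDED")
    except Exception:
        backend.update_job_status(job, "FAILED")
        raise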

Use cases:

  • Small to medium datasets where full extraction is fast
  • Data that changes unpredictably (no reliable timestamp)
  • Initial loads before switching to incremental
  • Sources that don’t support incremental queries

Considerations:

  • Higher data transfer costs for large datasets
  • Longer sync times as data grows
  • Destination must handle duplicate/updated records
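
On the last point: if the destination has no merge or upsert semantics, you may need to deduplicate on load. A minimal sketch, assuming each record carries id and updated_at fields (both assumptions about your data):

def dedupe_latest(records):
    # Keep only the most recent version of each record, keyed on "id"
    latest = {}
    for rec in records:
        key = rec["id"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())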

Incremental mode syncs only the records that are new or updated since the last successful sync.

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental

How it works:

  1. Gets the start_date from the last successful StreamJob
  2. Creates a new StreamJob
  3. Queries source for records updated after start_date
  4. Syncs until pagination is exhausted
  5. Stores the new sync timestamp for next run
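
The flow above differs from full refresh only in the resume point and the checkpoint written at the end. A rough sketch, with hypothetical backend and source method names (get_records_after is the method named in the requirements below):

from datetime import datetime, timezone

def run_incremental(source, destination, backend):
    # 1. Resume point: timestamp stored by the last successful StreamJob
    start_date = backend.last_successful_timestamp()  # None on the first run
    # 2. Each run still gets its own StreamJob
    job = backend.create_stream_job(sync_mode="incremental")
    checkpoint = datetime.now(timezone.utc)
    pagination = None
    while True:
        # 3./4. Fetch only records updated after start_date, page by page
        batch = source.get_records_after(start_date, pagination=pagination)
        destination.write(batch.records)
        if not batch.next_pagination:
            break
        pagination = batch.next_pagination
    # 5. Persist the new checkpoint for the next run
    backend.update_job_status(job, "SUCCEEDED", checkpoint=checkpoint)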

Use cases:

  • Large datasets where full refresh is too slow
  • Append-only data (logs, events, transactions)
  • APIs with efficient date-filtered queries
  • Cost optimization for metered APIs

Requirements:

  • Source must implement the get_records_after() method (see the sketch after this list)
  • Source API must support filtering by modification date
  • Check stream support with bizon stream list <source>
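
If you are building a custom source, the shape of get_records_after() is roughly as follows. The class, constructor, signature, return shape, and endpoint here are illustrative assumptions, not bizon's actual source interface:

import requests

class ExampleSource:
    """Hypothetical date-filtered source client."""

    def __init__(self, base_url, token):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def get_records_after(self, updated_after, pagination=None):
        # Filter server-side on modification date and carry the cursor
        # returned by the previous page
        params = {"updated_after": updated_after.isoformat()}
        if pagination:
            params["after"] = pagination
        resp = self.session.get(f"{self.base_url}/contacts", params=params)
        resp.raise_for_status()
        data = resp.json()
        return {
            "records": data["results"],
            "next_pagination": data.get("paging", {}).get("next", {}).get("after"),
        }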

Configuration:

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental
  # Optional: force restart from scratch
  force_ignore_checkpoint: false

Considerations:

  • Relies on source’s timestamp accuracy
  • Deleted records may not be captured
  • First run behaves like full refresh

Stream mode enables continuous, real-time data ingestion with offset tracking.

source:
  name: kafka
  stream: topic
  sync_mode: stream

How it works:

  1. Uses a single, persistent RUNNING StreamJob
  2. Tracks offsets/positions in the source system
  3. Commits progress periodically to backend
  4. Runs continuously until stopped
  5. Resumes from last committed offset on restart
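
The offset pattern is the same one you would use with a manual-commit Kafka consumer. The sketch below uses confluent-kafka as an analogy only; bizon commits its progress to its own backend as described above, and write_to_destination is a stand-in:

from confluent_kafka import Consumer

def write_to_destination(payload):
    # Stand-in for a real destination writer (e.g., buffered warehouse loads)
    print(payload)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "my-consumer-group",
    "enable.auto.commit": False,   # commit only after records are safely written
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

try:
    while True:                    # runs continuously until stopped
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        write_to_destination(msg.value())
        consumer.commit(message=msg)   # resume point after a restart
finally:
    consumer.close()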

Use cases:

  • Real-time data pipelines
  • Kafka/Redpanda topic consumption
  • Event streaming architectures
  • Low-latency data delivery

Configuration:

source:
  name: kafka
  stream: topic
  sync_mode: stream
  bootstrap_servers: kafka:9092
  group_id: my-consumer-group
  topics:
    - name: events
      destination_id: project.dataset.events

engine:
  runner:
    type: stream  # Use stream runner for continuous operation

Considerations:

  • Requires stream-capable source (Kafka, RabbitMQ)
  • Use --runner stream for continuous operation
  • Backend must be persistent (postgres, bigquery) for production

Choosing a sync mode:

  • Is your data real-time (Kafka, message queue)?
    • Yes → use stream.
    • No → does the source support incremental queries?
      • No → use full_refresh.
      • Yes → is the dataset large (>1M records)?
        • Yes → use incremental.
        • No → use full_refresh.

To ignore existing checkpoints and start fresh:

source:
  name: hubspot
  stream: contacts
  sync_mode: incremental
  force_ignore_checkpoint: true  # Start from scratch

This is useful when:

  • Schema changes require re-syncing all data
  • Source data was corrected/backfilled
  • Testing or debugging pipelines