BigQuery Destinations
Bizon offers three BigQuery destination variants optimized for different workloads.
Installation
```bash
pip install bizon[bigquery]
```

Destination Variants

| Variant | Method | Best For | Latency |
|---|---|---|---|
| bigquery | GCS Load Jobs | Large batch loads | Minutes |
| bigquery_streaming | Legacy Streaming API | Low-latency, small volume | Seconds |
| bigquery_streaming_v2 | Storage Write API | High-throughput streaming | Seconds |
BigQuery (Batch Load)
Uses GCS as a staging area, then loads via BigQuery load jobs. Best for large batch workloads.

```yaml
destination:
  name: bigquery
  config:
    project_id: my-project
    dataset_id: my_dataset
    dataset_location: US
    gcs_buffer_bucket: my-staging-bucket
    gcs_buffer_format: parquet
    buffer_size: 400
    time_partitioning: DAY
```

Configuration
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | string | Yes | - | GCP project ID |
| dataset_id | string | Yes | - | BigQuery dataset ID |
| dataset_location | string | No | US | Dataset location |
| gcs_buffer_bucket | string | Yes | - | GCS bucket for staging |
| gcs_buffer_format | enum | No | parquet | Staging format: parquet or csv |
| time_partitioning | enum | No | DAY | Partition granularity: DAY, HOUR, MONTH, YEAR |
| buffer_size | int | No | 400 | Buffer size in MB |
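
The optional fields above combine freely when the defaults do not fit a workload. A minimal sketch with purely illustrative values, staging as CSV and partitioning the target table by month:

```yaml
destination:
  name: bigquery
  config:
    project_id: my-project
    dataset_id: my_dataset
    gcs_buffer_bucket: my-staging-bucket
    gcs_buffer_format: csv     # stage as CSV instead of the default parquet
    buffer_size: 800           # larger buffer (MB) for fewer, bigger load jobs
    time_partitioning: MONTH   # coarser partitions for lower-volume tables
```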
BigQuery Streaming
Uses the legacy streaming API. Good for low-latency delivery at moderate volumes.

```yaml
destination:
  name: bigquery_streaming
  config:
    project_id: my-project
    dataset_id: my_dataset
    dataset_location: US
    buffer_size: 50
    buffer_flush_timeout: 300
```

Configuration
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | string | Yes | - | GCP project ID |
| dataset_id | string | Yes | - | BigQuery dataset ID |
| dataset_location | string | No | US | Dataset location |
| buffer_size | int | No | 50 | Buffer size in MB |
| buffer_flush_timeout | int | No | 600 | Max buffer age in seconds |
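
As a sketch of a latency-oriented tuning (the numbers are illustrative, not recommendations), shrinking the buffer and flush timeout makes rows reach BigQuery sooner at the cost of more API calls:

```yaml
destination:
  name: bigquery_streaming
  config:
    project_id: my-project
    dataset_id: my_dataset
    buffer_size: 10            # small buffer (MB) so batches flush quickly
    buffer_flush_timeout: 60   # flush at least once a minute, even under light traffic
```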
BigQuery Streaming V2 (Recommended)
Uses the Storage Write API for high-throughput streaming. Best for production real-time pipelines.

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: my_dataset
    dataset_location: US
    buffer_size: 100
    buffer_flush_timeout: 300
    bq_max_rows_per_request: 5000
```

Configuration
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | string | Yes | - | GCP project ID |
| dataset_id | string | Yes | - | BigQuery dataset ID |
| dataset_location | string | No | US | Dataset location |
| buffer_size | int | No | 50 | Buffer size in MB |
| buffer_flush_timeout | int | No | 600 | Max buffer age in seconds |
| bq_max_rows_per_request | int | No | 5000 | Max rows per API call (max 10000) |
| time_partitioning.type | enum | No | DAY | Partition type |
| time_partitioning.field | string | No | _bizon_loaded_at | Partition field |
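
The nested time_partitioning keys let the table be partitioned on your own timestamp column instead of _bizon_loaded_at. A minimal sketch, assuming records carry a created_at TIMESTAMP field (the field name is hypothetical):

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: my_dataset
    time_partitioning:
      type: DAY
      field: created_at   # hypothetical event timestamp; defaults to _bizon_loaded_at
```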
Authentication
Service Account Key

Provide a service account JSON key:

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    authentication:
      service_account_key: |
        {
          "type": "service_account",
          "project_id": "my-project",
          ...
        }
```

Or reference it from the environment:

```yaml
authentication:
  service_account_key: BIZON_ENV_GCP_SERVICE_ACCOUNT
```

Application Default Credentials
If running on GCP (GKE, Cloud Run, etc.) or with gcloud auth, omit the authentication section:

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    # No authentication section - uses ADC
```

Schema Configuration
Define schemas for structured loading with unnest: true:

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    unnest: true
    record_schemas:
      - destination_id: my-project.analytics.events
        record_schema:
          - name: event_id
            type: STRING
            mode: REQUIRED
          - name: event_type
            type: STRING
            mode: REQUIRED
          - name: user_id
            type: STRING
            mode: NULLABLE
          - name: payload
            type: JSON
            mode: NULLABLE
          - name: created_at
            type: TIMESTAMP
            mode: REQUIRED
        clustering_keys:
          - event_type
          - user_id
```

Column Types
| Type | Description |
|---|---|
| STRING | Text |
| INTEGER / INT64 | 64-bit integer |
| FLOAT / FLOAT64 | 64-bit float |
| BOOLEAN | True/false |
| TIMESTAMP | Date and time |
| DATE | Date only |
| DATETIME | Date and time (no timezone) |
| TIME | Time only |
| JSON | JSON object |
| BYTES | Binary data |
| NUMERIC / BIGNUMERIC | Exact numeric |
| GEOGRAPHY | Geographic data |
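
To show how these types look inside a record_schema, here is a hypothetical entry mixing several of them (the field names are invented for illustration):

```yaml
record_schema:
  - name: order_id
    type: STRING
    mode: REQUIRED
  - name: amount
    type: NUMERIC        # exact numeric, suited to monetary values
    mode: REQUIRED
  - name: shipped_on
    type: DATE
    mode: NULLABLE
  - name: warehouse_location
    type: GEOGRAPHY
    mode: NULLABLE
  - name: raw_payload
    type: BYTES
    mode: NULLABLE
```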
Column Modes
| Mode | Description |
|---|---|
| NULLABLE | Can be null (default) |
| REQUIRED | Cannot be null |
| REPEATED | Array of values |
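
For example, a REPEATED column stores an array of values per row. A short sketch with an invented tags field:

```yaml
record_schema:
  - name: event_id
    type: STRING
    mode: REQUIRED
  - name: tags
    type: STRING
    mode: REPEATED   # each row holds an array of STRING values
```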
Multi-Table Routing
Route records to different tables based on destination_id:

```yaml
source:
  name: kafka
  stream: topic
  topics:
    - name: user-events
      destination_id: my-project.analytics.user_events
    - name: order-events
      destination_id: my-project.analytics.order_events

destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    unnest: true
    record_schemas:
      - destination_id: my-project.analytics.user_events
        record_schema:
          - name: user_id
            type: STRING
            mode: REQUIRED
          - name: action
            type: STRING
            mode: REQUIRED
      - destination_id: my-project.analytics.order_events
        record_schema:
          - name: order_id
            type: STRING
            mode: REQUIRED
          - name: amount
            type: FLOAT64
            mode: REQUIRED
```

Time Partitioning
Partition tables by time for better query performance. For the batch destination:

```yaml
destination:
  name: bigquery
  config:
    project_id: my-project
    dataset_id: analytics
    gcs_buffer_bucket: my-bucket
    time_partitioning: DAY   # DAY, HOUR, MONTH, YEAR
```

For the streaming v2 destination:

```yaml
destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    time_partitioning:
      type: DAY
      field: _bizon_loaded_at   # Or your custom timestamp field
```

Clustering
Add clustering keys for better query performance:

```yaml
record_schemas:
  - destination_id: my-project.analytics.events
    record_schema:
      - name: event_type
        type: STRING
        mode: REQUIRED
      - name: user_id
        type: STRING
        mode: NULLABLE
    clustering_keys:
      - event_type
      - user_id
```

Choosing the Right Variant
- Is latency critical (< 1 minute required)?
  - No → use bigquery (GCS load jobs).
  - Yes → Is throughput high (> 1000 records/sec)?
    - Yes → use bigquery_streaming_v2 (Storage Write API).
    - No → use bigquery_streaming (legacy streaming API).

Production Configuration

Complete production example:
```yaml
name: kafka-to-bigquery

source:
  name: kafka
  stream: topic
  sync_mode: stream
  bootstrap_servers: kafka:9092
  topics:
    - name: events
      destination_id: my-project.analytics.events

destination:
  name: bigquery_streaming_v2
  config:
    project_id: my-project
    dataset_id: analytics
    dataset_location: US
    buffer_size: 100
    buffer_flush_timeout: 300
    bq_max_rows_per_request: 5000
    unnest: true
    time_partitioning:
      type: DAY
      field: _bizon_loaded_at
    record_schemas:
      - destination_id: my-project.analytics.events
        record_schema:
          - name: event_id
            type: STRING
            mode: REQUIRED
          - name: event_type
            type: STRING
            mode: REQUIRED
          - name: payload
            type: JSON
            mode: NULLABLE
          - name: timestamp
            type: TIMESTAMP
            mode: REQUIRED
        clustering_keys:
          - event_type
    authentication:
      service_account_key: BIZON_ENV_GCP_SERVICE_ACCOUNT

engine:
  backend:
    type: postgres
    config:
      host: db.example.com
      database: bizon
      schema: public

runner:
  type: stream
  log_level: INFO
```

Next Steps
- Kafka to BigQuery Tutorial - Complete streaming guide
- Kafka Source - Configure Kafka consumption
- Engine Configuration - Production backend setup