Airbyte
- why use airbyte - we do not have to reinvent the wheel to ingest data between multiple kinds of “sources” and “destinations”
- it gives us a unified view of our integrations, which helps with debugging issues, monitoring, etc
- airbyte is open source, so there are a lot of resources, connectors, etc
- two scenarios for using airbyte -
- “extract” and “load” part of elt. we extract data from e.g. google sheets and load it into e.g. snowflake
- “publish” - we can publish transformed data from snowflake into downstream saas applications used for reporting
- “connector” - it pulls data from the “source” / pushes it into the “sink”
- “stream” - group of related records, like records in a table
- “field” - specific attribute in a “stream”, like a column in a table
- “connection” - the entire pipeline from the source to the destination. we can define settings like “replication frequency”, whether we want to perform “full overwrites” or “incremental updates”, etc
- airbyte is not for “streaming” workloads. it ingests data in “batches”, so there might be some latency
- airbyte supports writing our own connectors as well
- the difference between the airbyte tiers (see the airbyte docs for a full comparison) -
- “self managed community” - open source
- “self managed enterprise” - run airbyte yourself but with support and enterprise features
- “airbyte cloud” - fully managed
- self managed community supports only one “workspace”, unlike the other two. a workspace is a grouping of users, sources, destinations, etc. i believe workspaces also help with sharing resources etc across them
- also, self managed community only supports 1 user
- i believe we will be responsible for handling upgrades - of the airbyte instance itself, of the versions of the different source and destination connectors, etc
- running airbyte using docker -
```sh
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte
bash run-ab-platform.sh
```
- go to localhost:8000, and enter username as “airbyte” and password as “password”
- under “settings”, we can see (and upgrade) the version of “source connectors” and “destination connectors”
- “webhook notifications” are supported for failures in syncs etc
- when creating a “source”, we can for e.g. choose between “airbyte connectors” and “marketplace” (community maintained?).
- ps - on office laptops, i was getting errors like these -

```
Caused by: io.temporal.failure.ApplicationFailure: message='Could not find image: airbyte/destination-snowflake:3.11.4', type='io.airbyte.workers.exception.WorkerException', nonRetryable=false
```
- the solution to such errors is to just run `docker image pull` manually, e.g. `docker image pull airbyte/destination-snowflake:3.11.4` for the error above
- first, we configure the “source” and “destination” separately. then, we can configure the “connection”
- in the connection, we can configure the following -
- “schedule type” - manual / schedule / cron
- “replication frequency” / “cron expression” appear if we select schedule / cron above
- “destination namespace” - this is where the data would be synced
- “source defined” - use the schema defined by the source
- “destination specified” - e.g. the default schema selected when creating the destination
- “custom format” - sync all streams to a unique new schema. i believe this is useful when we are syncing from multiple sources into the same destination
- optionally, we can select the “streams” we would like to sync
- for each stream, we can select the “sync mode”
- finally, for each stream, we can also select the columns to replicate
- inside the connection, there are tabs like “status” (status of each stream), “job history” (historic runs of jobs), etc
- after a successful sync, we see some additional airbyte metadata columns (prefixed with `_airbyte_`) inside every table that gets synced, and additional tables inside the schema / “namespace” `airbyte_internal`
- the `airbyte_internal` namespace has a table for each stream that we try to sync. this table has a row per record that airbyte tried to sync. however, all the data for the record is inside a column called `_airbyte_data`
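- as a quick sanity check, a sketch like the one below can be used to peek at those raw records. assumptions here - the `snowflake-connector-python` package, placeholder credentials, and the raw table name `public_raw__stream_users` (the actual name and casing depend on your destination schema and stream, so check inside `airbyte_internal` first)

```python
# hedged sketch - peek at the raw records airbyte lands in airbyte_internal.
# credentials and the raw table name are placeholders / assumptions.
import json

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="my_user",              # placeholder
    password="my_password",      # placeholder
    warehouse="my_warehouse",    # placeholder
    database="my_database",      # placeholder
    schema="airbyte_internal",   # where airbyte keeps its raw tables
)

cur = conn.cursor()
# assumed table name - verify the exact raw table name (and casing) in your schema
cur.execute("select _airbyte_data from public_raw__stream_users limit 5")
for (raw,) in cur.fetchall():
    record = json.loads(raw)     # the whole source record lives in this one json column
    print(record)

cur.close()
conn.close()
```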
- we have different “jobs” in airbyte like “verify” for verifying the connection, “schema” for checking / comparing schemas, “sync” to perform the actual synchronization, etc. think of jobs like workflows, which can have multiple tasks
- when running the sync job manually, i see three containers spin up - a “discovery” container, which i guess detects schema changes, plus one for reading from the source and one for writing to the destination
- if i use postgres as a source, i see the following options for “sync mode” -
- “full refresh overwrite” - “full refresh” - read everything, “overwrite” - delete everything and then rewrite everything in the destination
- “full refresh append” - “full refresh” - read everything, “append” - append everything to the data in the destination table, no overwrites. assume a sync workflow is run, a record is deleted, and a sync is run again. if we have x records after the first sync, we will have 2x-1 records after the second sync
- “incremental append” - “incremental” - the “cursor” value is used to track whether a record should be replicated, so we need to define a “cursor field” as well when selecting incremental. airbyte only checks for records whose cursor field is greater than the largest value it saw for it during the last sync, and loads only them. understand - this way, we are basically telling airbyte which subset of rows it needs to look at, and it can ignore the rest of the rows. e.g. a timestamp field like `updated_at` can be a good fit for this
- “incremental append + deduped” - issue with the above - recall how append will simply add all the changed records again. by adding “deduped”, we need to specify the primary key field, and instead of simply adding the rows whose cursor field has changed (aka rows which have been updated), airbyte will modify the rows in the destination where needed, else add new rows. this way, we do not end up with duplicate data (a small illustration follows after this list)
- notice how each sync mode name is divided into two parts - the first part describes how data is read from the source, the second part describes how data gets written to the destination
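- to make the “append” vs “append + deduped” difference concrete, here is a tiny illustrative sketch - plain python, not airbyte code, with made-up records, an `updated_at` cursor field and an `id` primary key

```python
# illustrative only - how "incremental append" vs "incremental append + deduped"
# would treat the same batch of changed records. data and field names are made up.
last_cursor = "2024-01-01T00:00:00"

# rows already present in the destination from the previous sync
destination = [
    {"id": 1, "name": "alice", "updated_at": "2023-12-31T09:00:00"},
    {"id": 2, "name": "bob",   "updated_at": "2023-12-31T09:00:00"},
]

# rows in the source whose cursor field moved past the last value airbyte saw
changed = [
    {"id": 1, "name": "alice-renamed", "updated_at": "2024-01-02T10:00:00"},  # update
    {"id": 3, "name": "carol",         "updated_at": "2024-01-02T11:00:00"},  # insert
]

# incremental append - just add the changed rows, so id 1 now appears twice
appended = destination + [r for r in changed if r["updated_at"] > last_cursor]

# incremental append + deduped - upsert on the primary key instead
by_id = {r["id"]: r for r in destination}
for r in changed:
    if r["updated_at"] > last_cursor:
        by_id[r["id"]] = r  # update the existing row or insert a new one
deduped = list(by_id.values())

print(len(appended))  # 4 rows - duplicate entries for id 1
print(len(deduped))   # 3 rows - id 1 updated in place
```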
- “schema changes” - airbyte checks for any schema changes once every 24 hours or so
- so remember - we might have to hit the “refresh source schema” button manually to see the changes in the ui immediately
- schema changes can be “non breaking” or “breaking”
- “non breaking changes” - new stream, delete stream, new column, delete column, column data type changes
- airbyte allows us to choose from the following behaviors (self explanatory) -
- propagate field changes only
- propagate all field and stream changes
- approve all changes - a manual approval would be needed first. we can see it in the connection tab
- stop future syncs
- “breaking changes” - change in primary key / cursor etc. the connection will always be paused in these cases, and we would have to manually review the changes first
- “cdc” or “change data capture” - capture inserts, updates, deletes etc from one data source into another
- databases like postgres, mysql, etc have “transaction logs” which contain the inserts, updates and deletes, which airbyte can read. also called “write ahead log” etc
- for this, we need to grant the replication permission to the user -

```sql
ALTER USER <user_name> REPLICATION;
```

- we might have to make other changes as well for this to work, depending on the type of database we use (a sketch for checking the postgres-side prerequisites is included after this cdc section)
- when using cdc with incremental mode, the “cursor field” is the one managed by airbyte / cdc, and we do not need to specify a column like `updated_at` that we manage ourselves
- note that the primary key is however needed from our end if we use incremental mode with cdc
- another difference (apart from the automatic cursor field above) between incremental + cdc vs incremental by itself - plain incremental cannot process deletes, while cdc can pick them up from the transaction log
- other limitations like unsupported schema changes might be there when using incremental with cdc. we might have to resync the entire data from scratch in such cases
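- as a quick way to check the postgres-side cdc prerequisites mentioned above, something like this can be run against the source database - an illustrative sketch assuming `psycopg2` and placeholder connection details

```python
# hedged sketch - check whether a postgres source looks ready for cdc.
# connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="mydb", user="airbyte", password="password"
)
cur = conn.cursor()

# logical decoding (what cdc readers consume from the wal) needs wal_level = 'logical'
cur.execute("SHOW wal_level;")
print("wal_level:", cur.fetchone()[0])

# the replication slot a cdc reader consumes from (has to exist beforehand)
cur.execute("SELECT slot_name, plugin, active FROM pg_replication_slots;")
print("replication slots:", cur.fetchall())

# the publication that decides which tables' changes get published
cur.execute("SELECT pubname FROM pg_publication;")
print("publications:", cur.fetchall())

cur.close()
conn.close()
```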
- “pyairbyte” - run airbyte using a python library. it does not need the different components / docker or kubernetes that airbyte needs to run
- so, for e.g., we can follow the steps below to use airbyte from airflow directly -
- install airbyte, airbyte-source-s3, airbyte-destination-snowflake inside airflow
- write simple python dag code to read from the “source” i.e. s3 and write to the “cache” i.e. snowflake (pyairbyte calls the “destination” a “cache” apparently)
- we do this by interacting with the pyairbyte api - a sketch follows below
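- a minimal sketch of what such a dag might look like - assuming the `airbyte` (pyairbyte) package is available in the airflow environment; the s3 / snowflake credentials, bucket, stream and config keys below are placeholders, so check them against the actual connector spec

```python
# hedged sketch - a tiny airflow dag that uses pyairbyte to read from s3 and
# load into a snowflake "cache". all names and credentials are placeholders.
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def s3_to_snowflake_pyairbyte():
    @task
    def sync():
        import airbyte as ab
        from airbyte.caches import SnowflakeCache

        source = ab.get_source(
            "source-s3",
            config={  # placeholder config - the real keys come from the connector's spec
                "bucket": "my-bucket",
                "streams": [{"name": "orders", "format": {"filetype": "csv"}}],
            },
            install_if_missing=True,
        )
        source.check()               # roughly the same idea as airbyte's "verify" job
        source.select_all_streams()  # or source.select_streams([...]) for a subset

        cache = SnowflakeCache(      # pyairbyte's name for the destination
            account="my_account",
            username="my_user",
            password="my_password",
            warehouse="my_warehouse",
            database="my_database",
            role="my_role",
            schema_name="airbyte_raw",
        )
        source.read(cache=cache)     # records land in snowflake under schema_name

    sync()


s3_to_snowflake_pyairbyte()
```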