English Vocabulary for Airbyte Data Integration

Learn the professional English vocabulary for Airbyte data integration — connectors, sync modes, schema discovery, normalization, and dbt in team discussions.

Airbyte is an open-source data integration platform used by data engineering teams to move data between sources and destinations. If you work in a data team that uses Airbyte — or you’re joining one — you need to communicate about connectors, sync strategies, and data pipeline health in precise English. This post covers the essential Airbyte vocabulary and how it sounds in real team conversations.

Key Vocabulary

Source connector A plugin that connects Airbyte to a data source — a database, a SaaS API, a file store, or any other system that holds data you want to extract. Airbyte maintains a catalog of hundreds of source connectors. Example: “We’re using the Postgres source connector to extract from our transactional database and load it into the data warehouse.”

Destination connector A plugin that connects Airbyte to the target system where data will be written — typically a data warehouse like BigQuery, Snowflake, or Redshift. Example: “Configure the destination connector to point at our Snowflake warehouse — use the raw_data schema for the initial load.”

Sync modes The strategy used to replicate data from source to destination. Airbyte supports several sync modes, each suited to different data characteristics and performance requirements. Example: “Which sync mode are you using for the orders table? Full Refresh will be too slow once we have millions of rows.”

Full Refresh A sync mode that re-reads the entire source on every sync. In its Overwrite variant, it replaces the destination dataset with a fresh copy; in its Append variant, it adds a full copy alongside the existing rows. Simple but expensive for large tables, because every sync reads and writes all data. Example: “We use Full Refresh for the product catalog. It’s only 5,000 rows, and it changes frequently in ways that are hard to track incrementally.”

Incremental sync A sync mode that only reads and writes data that has changed since the last sync, using a cursor field (typically a timestamp or auto-incrementing ID) to track progress. Much more efficient than Full Refresh for large tables. Example: “Switch the orders table to Incremental sync using updated_at as the cursor — Full Refresh is taking 45 minutes per run.”
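To make the idea of a cursor concrete, here is a minimal Python sketch of cursor-based incremental reading. It is not Airbyte's implementation. The function name `incremental_sync` and the sample `orders` rows are hypothetical, and it assumes `updated_at` is always populated and only moves forward.

```python
from datetime import datetime


def incremental_sync(source_rows, last_cursor, cursor_field="updated_at"):
    """Return the rows changed since last_cursor, plus the new cursor value.

    Hypothetical helper for illustration only; not part of Airbyte.
    """
    # On the first run there is no saved cursor, so every row is read.
    new_rows = [
        row for row in source_rows
        if last_cursor is None or row[cursor_field] > last_cursor
    ]
    # The new cursor is the highest value seen; keep the old one if nothing changed.
    new_cursor = max((row[cursor_field] for row in new_rows), default=last_cursor)
    return new_rows, new_cursor


orders = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

# First sync: no cursor saved yet, so all three rows are read.
first_batch, cursor = incremental_sync(orders, None)

# A new order arrives; the second sync reads only that row.
orders.append({"id": 4, "updated_at": datetime(2024, 1, 4)})
second_batch, cursor = incremental_sync(orders, cursor)
```

In a team discussion, you might describe this as: “The second sync only picked up the row whose updated_at is later than the saved cursor.”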

Dedupe (Incremental Append + Deduped) An advanced sync mode that maintains a deduplicated snapshot of the source data in the destination: each row appears exactly once, with the latest values. Requires a primary key to identify rows and a cursor field to decide which version is the latest. Older Airbyte versions label this mode “Incremental - Deduped History”. Example: “We’re using Dedupe for the user profile table so that each user has exactly one row in the warehouse, even if their profile updates multiple times per day.”
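The role of the primary key in deduplication can be sketched in a few lines of Python. This is an illustration of the concept only, not how Airbyte performs deduplication in the warehouse. The function `dedupe_latest` and the sample `profile_updates` records are hypothetical, and timestamps are assumed to be ISO 8601 strings that sort chronologically.

```python
def dedupe_latest(records, primary_key="id", cursor_field="updated_at"):
    """Keep one record per primary key: the one with the latest cursor value."""
    latest = {}
    for record in records:
        key = record[primary_key]
        # Replace the stored record if this one is at least as recent.
        if key not in latest or record[cursor_field] >= latest[key][cursor_field]:
            latest[key] = record
    return list(latest.values())


profile_updates = [
    {"id": 1, "plan": "free", "updated_at": "2024-01-01T09:00:00"},
    {"id": 2, "plan": "free", "updated_at": "2024-01-01T10:00:00"},
    {"id": 1, "plan": "pro", "updated_at": "2024-01-01T15:00:00"},
]

snapshot = dedupe_latest(profile_updates)
by_id = {row["id"]: row for row in snapshot}
```

Without a primary key there is no way to tell that the first and third records describe the same user, which is why the mode requires one.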

Connection In Airbyte, a “connection” is the configured pipeline between a specific source and a specific destination, including the sync mode, schedule, and selected streams. Example: “We have three connections — one for the transactional database, one for Salesforce, and one for our event tracking platform.”

Stream An individual data stream within a source — typically a database table, an API endpoint’s resource, or a topic. When you configure a connection, you select which streams to sync. Example: “I enabled three streams from the Stripe connector: customers, invoices, and payment_intents. We don’t need the subscriptions stream right now.”

Schema discovery The process by which Airbyte automatically inspects a source and identifies the available streams and their field types. Airbyte runs schema discovery when you set up or refresh a source connector. Example: “Run schema discovery again — the source team added three new columns to the transactions table last week and we need Airbyte to detect them.”
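One way to picture what a schema refresh reports is a comparison between the previously known schema and the newly discovered one. The Python sketch below assumes a simplified representation (a dict of stream names mapping to field names and types). The function `detect_new_fields`, the field names, and the types are all hypothetical, and Airbyte's real catalog format is richer than this.

```python
def detect_new_fields(previous, discovered):
    """Return {stream: [new field names]} for fields absent from the previous schema."""
    added = {}
    for stream, fields in discovered.items():
        known = previous.get(stream, {})
        new_names = sorted(name for name in fields if name not in known)
        if new_names:
            added[stream] = new_names
    return added


# Schema as it was at the last discovery (hypothetical example data).
previous_schema = {
    "transactions": {"id": "integer", "amount": "number"},
}

# Schema after the source team added three columns.
discovered_schema = {
    "transactions": {
        "id": "integer",
        "amount": "number",
        "currency": "string",
        "fee": "number",
        "settled_at": "string",
    },
}

new_fields = detect_new_fields(previous_schema, discovered_schema)
```

You could report the result as: “Schema discovery detected three new fields on the transactions stream: currency, fee, and settled_at.”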

Connector Builder Airbyte’s low-code tool for creating custom source connectors for APIs that don’t have an existing connector. It uses a YAML-based declarative format. Example: “There’s no existing connector for this vendor’s API, so I’m building a custom connector using the Connector Builder.”

Normalization Airbyte’s optional data transformation step that converts the raw JSON data landed in the destination into structured relational tables. When enabled, it runs dbt models under the hood. Example: “Enable normalization if you want Airbyte to automatically create clean, queryable tables from the raw JSON. Otherwise you’ll need to write your own dbt models.”
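To show what “raw JSON into relational tables” means in practice, here is a small conceptual sketch in Python. Real normalization runs as SQL in the warehouse, so this only illustrates the shape of the transformation. The function `normalize`, the `raw_json` column name, and the sample rows are hypothetical.

```python
import json


def normalize(raw_rows, columns):
    """Turn rows holding a JSON string into rows with one key per column."""
    table = []
    for raw in raw_rows:
        payload = json.loads(raw["raw_json"])
        # Fields missing from the payload become None (NULL in a warehouse).
        table.append({column: payload.get(column) for column in columns})
    return table


# Raw landing data: each record is a single JSON blob (hypothetical example).
raw_customers = [
    {"raw_json": '{"id": 1, "email": "ana@example.com", "country": "BR"}'},
    {"raw_json": '{"id": 2, "email": "li@example.com"}'},
]

customers = normalize(raw_customers, ["id", "email", "country"])
```

The second row shows why type and null handling come up so often in normalization discussions: a field that is absent from the raw JSON still needs a value in the structured table.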

Common Phrases and Collocations

“Configure the source connector” The standard action phrase for setting up a new data source in Airbyte. Example: “Can you configure the source connector for the new PostgreSQL read replica? The credentials are in the shared vault.”

“Set up an incremental sync” Choosing the incremental strategy for a specific connection or stream. Example: “Set up an incremental sync for the events table using event_timestamp as the cursor field — we’re syncing millions of rows and Full Refresh isn’t sustainable.”

“Discover the source schema” Triggering schema detection on a connector. Example: “Discover the source schema on the Shopify connector — I think they added new fields to the orders object in the last API update.”

“The sync failed on stream…” Standard incident language for identifying which part of a connection errored. Example: “The sync failed on the line_items stream — looks like a null value in a non-nullable field. I’ll add a transformation to handle it.”

“Promote to production” Moving a connection from a development or staging configuration to a production Airbyte instance. Example: “The connection is validated in dev. Ready to promote to production once you’ve confirmed the destination credentials are correct.”

Practical Sentences to Practice

  1. “We use the Salesforce source connector with incremental sync on the opportunity object — it runs every 15 minutes and uses the SystemModstamp field as the cursor.”
  2. “Schema discovery found four new columns in the source. I’ll update the connection to include them and verify the types are mapped correctly.”
  3. “The Dedupe sync mode requires a primary key — can you confirm what the primary key for the accounts table is before I configure it?”
  4. “We’re building a custom connector with the Connector Builder for the vendor’s REST API — the authentication is OAuth 2.0 with a client credentials flow.”
  5. “Normalization is turned off for the events table because we have custom dbt models that handle the transformation logic.”

Common Mistakes to Avoid

Confusing “Full Refresh” with “initial load” Full Refresh runs on every sync — it replaces all data every time. It is not just the first run. Using it on large tables will cause long sync times and high compute costs indefinitely. Instead of: “I’ll use Full Refresh for now and switch later.” Say: “I’ll set up Incremental now — Full Refresh is not appropriate for tables over 100,000 rows in our environment.”

Forgetting that Incremental sync requires a reliable cursor field If a cursor field (like updated_at) is not indexed, not populated for all rows, or not monotonically increasing, incremental syncs will miss records or be slow. Always verify: “Does updated_at get populated for soft deletes and updates, not just inserts? Incremental sync will miss changes that don’t update this field.”
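A quick way to back up the question above with evidence is to check a sample of rows for missing cursor values before choosing the field. The Python sketch below assumes the rows have already been fetched as dictionaries. The function `rows_missing_cursor` and the sample `accounts` data are hypothetical.

```python
def rows_missing_cursor(rows, cursor_field="updated_at"):
    """Return the rows whose cursor field is absent or null."""
    return [row for row in rows if row.get(cursor_field) is None]


# Sample rows pulled from the source table (hypothetical example data).
accounts = [
    {"id": 1, "updated_at": "2024-03-01T08:00:00"},
    {"id": 2, "updated_at": None},  # inserted, but the cursor was never set
    {"id": 3},                      # the column is missing entirely
]

problem_rows = rows_missing_cursor(accounts)
problem_ids = [row["id"] for row in problem_rows]
```

This check only catches missing values. It cannot tell you whether updates and soft deletes actually change the field, so that question still needs to go to the source team.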

Treating normalization as optional without a data transformation plan If you disable normalization, raw data lands as JSON blobs. You need dbt or another transformation layer to make it queryable. Teams sometimes disable normalization and then wonder why their tables are unusable. Clarify: “If we disable normalization, the data engineering team needs to own the dbt models that transform the raw landing zone into analytics-ready tables.”

Summary

Airbyte’s vocabulary — source and destination connectors, sync modes (Full Refresh, Incremental, Dedupe), streams, schema discovery, normalization — maps directly to the core decisions data engineers make when building integration pipelines. Understanding these terms in English lets you participate fully in pipeline design discussions, debug sync failures clearly, and document your data integration architecture in a way that international teammates can follow. Precise vocabulary is the foundation of reliable data engineering communication.