English for OpenAI Realtime Voice Agents

Learn the English vocabulary for building voice agents with the OpenAI Realtime API: turn detection, barge-in, audio streaming, and function calling, explained for developers.

Voice agents built on realtime speech APIs introduce a vocabulary that’s genuinely different from text-based chatbots — concepts like “barge-in” and “turn detection” don’t come up when you’re building a text interface. If you’re working on a voice product, precise terminology helps you describe latency issues, interruption bugs, and audio quality problems clearly to your team. This guide covers the essentials.

Key Vocabulary

Turn detection — the mechanism that determines when a user has finished speaking and it’s the agent’s turn to respond, typically based on silence duration or voice activity signals. “We tuned the turn detection threshold down because the agent was interrupting users who paused briefly mid-sentence.”

Barge-in — when a user starts speaking while the agent is still talking, requiring the agent to stop its own audio output and listen instead. “Barge-in handling was broken — the agent kept talking over the user instead of stopping when they interrupted.”

Voice activity detection (VAD) — the technology that distinguishes speech from silence or background noise in an audio stream, used as an input to turn detection. “VAD sensitivity is too high in noisy environments — background chatter is triggering false turn-starts.”

Audio streaming — sending and receiving audio data continuously in small chunks over a persistent connection, rather than as a single complete file, enabling low-latency, real-time conversation. “Because the audio is streamed rather than sent as a full file, the agent can start responding before the user has even finished their sentence.”

Function calling (in a voice context) — the agent’s ability to invoke a defined tool or API mid-conversation, such as looking up an order status, and then continue the spoken conversation with the result. “Add a function call for order lookups so the agent doesn’t have to guess or hallucinate a tracking number.”

Latency budget — the total acceptable delay between a user finishing speaking and the agent starting its response, typically a few hundred milliseconds for a natural-feeling conversation. “We’re over our latency budget by nearly a second, which is exactly why the conversation feels sluggish to testers.”

Session state — the ongoing context and configuration maintained for a single voice conversation, including conversation history, active tools, and voice settings. “The bug only happens on reconnect, because we’re not restoring session state properly after the WebSocket drops.”

Common Phrases

  • “Is this a turn detection issue, or is the model just slow to generate the response?”
  • “The barge-in didn’t register — check whether VAD picked up the interruption at all.”
  • “We need to trim the latency budget before this feels natural in a live call.”
  • “That’s a function call, not a hallucinated answer — the agent actually queried the order system.”
  • “Session state isn’t surviving the reconnect, which is why context resets mid-call.”

Example Sentences

Reporting a bug to the team: “Users are reporting that the agent talks over them when they try to interrupt. It looks like barge-in isn’t stopping the audio output in time — there’s a noticeable delay between the user starting to speak and the agent’s audio actually cutting off.”

Explaining a design trade-off: “We could lower the turn detection threshold to make the agent feel more responsive, but that risks cutting users off mid-thought if they pause. We landed on a middle ground based on testing with real conversations.”

Describing the architecture to a stakeholder: “The agent streams audio in both directions over a persistent connection, so it can respond as soon as it detects the user has finished speaking, and it can call out to our internal APIs mid-conversation when it needs real data, like an order status.”

Professional Tips

  • Distinguish turn detection (deciding when to respond) from barge-in handling (reacting to interruption) — they’re related but separately tunable, and conflating them slows down debugging.
  • Always quantify latency budget in milliseconds when discussing voice responsiveness; “it feels slow” is much less actionable than “we’re 400ms over budget.”
  • Use “function calling” specifically for agent-initiated tool use, not for scripted, hardcoded responses — the distinction matters when explaining why an answer was accurate.
  • When session bugs appear only after reconnects, mention session state explicitly — it immediately narrows the search space for reviewers.

Practice Exercise

  1. Explain, in two sentences, the difference between turn detection and barge-in.
  2. Write a one-sentence bug report describing a voice agent that doesn’t stop talking when interrupted.
  3. Describe, to a non-technical stakeholder, why voice agents have a latency budget and why it matters.