OpenAI Realtime API: English for Voice AI Engineers
Master the English vocabulary for the OpenAI Realtime API — sessions, turns, VAD, audio delta events, WebSocket connections, and voice AI pipelines.
The OpenAI Realtime API makes it possible to build voice AI applications with sub-second response latency by streaming audio over a persistent WebSocket connection, and it introduces a vocabulary that is distinct from standard REST-based LLM usage. Voice AI engineers who work on call-centre automation, voice assistants, or live transcription pipelines will encounter terms like VAD, audio delta events, and conversation turns in every design discussion and code review. This guide covers the English you need to communicate clearly about the Realtime API.
Key Vocabulary
Session — a persistent, stateful connection between a client and the Realtime API that maintains conversation history, model configuration, and audio context for its entire lifetime. “Each phone call maps to a single Realtime session — when the call ends, we close the WebSocket and the session is torn down along with it.”
Turn — a discrete unit of conversation, either a user speaking or the assistant responding, that advances the conversation state; turns alternate between user and assistant roles. “The assistant’s turn began as soon as VAD detected end-of-speech, and the audio response started streaming back before the transcription was even finalised.”
VAD (Voice Activity Detection) — the component that analyses the incoming audio stream to determine when a user starts speaking and, crucially, when they have finished so the model can begin generating a response. “We increased the VAD silence duration from 200 ms to 500 ms because users were being interrupted mid-sentence when they paused briefly to think.”
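The silence setting in that example sentence is usually configured through a `session.update` event. The sketch below only builds the JSON payload. The field names (`turn_detection`, `server_vad`, `threshold`, `silence_duration_ms`) follow the beta event shape as I understand it and should be treated as assumptions to check against the current API reference; `build_session_update` is a hypothetical helper.

```python
import json


def build_session_update(silence_duration_ms: int = 500, threshold: float = 0.5) -> str:
    """Serialise a session.update event that enables server-side VAD.

    Field names are assumptions based on the beta Realtime event shape.
    """
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Speech-probability level (0 to 1) that counts as voice activity.
                "threshold": threshold,
                # How long the user must stay silent before the turn ends.
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(build_session_update(silence_duration_ms=500))
```

A longer silence duration gives users room to pause, at the cost of adding that much time to every response.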
Audio delta event — a streaming event sent by the server that contains a small chunk of base64-encoded audio representing part of the assistant’s spoken response; the client accumulates and plays these chunks in sequence. “The playback buffer is consuming audio delta events as they arrive, so the user starts hearing the assistant’s voice within about 300 ms of the turn starting.”
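As a rough illustration of "accumulates and plays these chunks in sequence", the sketch below decodes each audio delta and appends it to an in-memory buffer. The event type names and the `delta` field are assumptions (the name differs between API versions, so both spellings known to me are accepted), and `PlaybackBuffer` is a hypothetical class, not part of any SDK.

```python
import base64
import json

# Assumed event names; check the API version you target.
AUDIO_DELTA_TYPES = {"response.audio.delta", "response.output_audio.delta"}


class PlaybackBuffer:
    """Collects decoded audio bytes in arrival order."""

    def __init__(self) -> None:
        self._pcm = bytearray()

    def handle_message(self, raw_message: str) -> bool:
        """Decode one server message; return True if it carried audio."""
        event = json.loads(raw_message)
        if event.get("type") not in AUDIO_DELTA_TYPES:
            return False
        # The chunk arrives base64-encoded in the "delta" field (assumption).
        self._pcm.extend(base64.b64decode(event["delta"]))
        return True

    def take_all(self) -> bytes:
        """Hand the buffered audio to the player and empty the buffer."""
        data = bytes(self._pcm)
        self._pcm.clear()
        return data
```

In a real client the bytes returned by `take_all` would go to an audio output device rather than staying in memory.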
Input audio buffer — the server-side buffer that accumulates audio sent by the client until it is committed to a conversation item, either automatically by server VAD or explicitly by the client. “If the client sends audio faster than real time — for example, replaying a pre-recorded file — the input audio buffer fills quickly, so you need to pace the sends to match actual playback speed.”
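Pacing the sends, as the example sentence describes, might look like the sketch below. It assumes 16-bit mono PCM at 24 kHz and an `input_audio_buffer.append` event with a base64 `audio` field; both are assumptions to verify against the API reference. `send_paced` is a hypothetical helper, and `send` stands in for whatever WebSocket send function the client uses.

```python
import base64
import json
import time
from typing import Callable, Iterator

BYTES_PER_SAMPLE = 2  # 16-bit PCM (assumption)


def iter_append_events(pcm: bytes, chunk_ms: int = 100,
                       sample_rate: int = 24_000) -> Iterator[str]:
    """Split PCM audio into chunk_ms slices wrapped as append events."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def send_paced(pcm: bytes, send: Callable[[str], None], chunk_ms: int = 100,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Send one chunk, then wait for that chunk's nominal duration."""
    for message in iter_append_events(pcm, chunk_ms):
        send(message)
        # A shorter final chunk still waits the full interval, so the
        # stream is never faster than real time.
        sleep(chunk_ms / 1000)
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting.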
Conversation item — a structured record in the session’s conversation history representing a completed user message, assistant message, or function call result. “After each user turn, we inspect the conversation item to extract the transcript and log it to our analytics pipeline alongside the session ID.”
Function calling (Realtime) — a mechanism that allows the model to emit a tool call instead of continuing to speak, wait while the client runs the tool, and resume speaking once the tool result has been submitted back to the session and a new response has been requested. “We wired up a function call to a live flight-status API so the assistant can say ‘let me check that for you’ and then read out the real-time departure information seamlessly.”
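Submitting a tool result back to the session can be sketched as two client events: one that adds the result to the conversation and one that asks the model to continue. The item type `function_call_output` and its `call_id` and `output` fields reflect my understanding of the event schema and are assumptions; `build_tool_result_events` and the flight data are hypothetical.

```python
import json
from typing import List


def build_tool_result_events(call_id: str, result: dict) -> List[str]:
    """Return the two messages sent after a tool has finished running."""
    item_event = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            # Must match the call_id the model sent with its function call.
            "call_id": call_id,
            # The output is passed as a JSON string, not a nested object.
            "output": json.dumps(result),
        },
    }
    # Without this second event the model would not start speaking again.
    continue_event = {"type": "response.create"}
    return [json.dumps(item_event), json.dumps(continue_event)]


if __name__ == "__main__":
    for message in build_tool_result_events("call_123", {"status": "on time"}):
        print(message)
```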
Interruption handling — the pattern of detecting and managing the case where a user speaks while the assistant is still producing audio, requiring the client to stop playback and cancel the in-progress response. “Interruption handling was the trickiest part to get right — we had to truncate the conversation item on the server to match exactly how much audio the client had actually played before the user spoke.”
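The truncation step in that example sentence can be sketched as follows. The client converts the number of bytes it actually played into milliseconds and sends that position to the server. The sketch assumes 16-bit mono PCM at 24 kHz and a `conversation.item.truncate` event with `item_id`, `content_index`, and `audio_end_ms` fields; treat these as assumptions, and the helper names as hypothetical.

```python
import json
from typing import List

BYTES_PER_SAMPLE = 2  # 16-bit PCM (assumption)


def played_ms(bytes_played: int, sample_rate: int = 24_000) -> int:
    """Convert a count of played PCM bytes into whole milliseconds."""
    return bytes_played * 1000 // (sample_rate * BYTES_PER_SAMPLE)


def build_interruption_events(item_id: str, bytes_played: int) -> List[str]:
    """Messages to send when the user starts talking over the assistant."""
    cancel = {"type": "response.cancel"}
    truncate = {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        # Everything after this point was never heard by the user.
        "audio_end_ms": played_ms(bytes_played),
    }
    return [json.dumps(cancel), json.dumps(truncate)]
```

The client should also clear its local playback buffer at the same moment; that part depends on the audio library and is not shown here.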
Useful Phrases
- “We open the WebSocket, send a `session.update` event to set the voice and turn detection mode, and then start streaming audio — the whole setup takes under 100 ms.”
- “VAD is running server-side, so you don’t need to do any speech detection on the client; just pipe the raw PCM audio and let the API decide when a turn ends.”
- “We’re batching the audio delta events into 50 ms chunks on the client side before handing them to the Web Audio API to avoid underruns during network jitter.”
- “The function call pauses the audio stream — the model waits for you to submit the `conversation.item.create` event with the tool result, followed by a `response.create` event, before it continues speaking.”
- “If the user interrupts, you cancel the current response with a `response.cancel` event and clear the playback buffer immediately to avoid the assistant talking over the user.”
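The batching phrase above describes regrouping audio chunks of arbitrary size into fixed-length frames before playback. A minimal sketch, assuming 16-bit mono PCM at 24 kHz and already-decoded audio bytes; `FrameBatcher` is a hypothetical class:

```python
from typing import List

BYTES_PER_SAMPLE = 2  # 16-bit PCM (assumption)


class FrameBatcher:
    """Regroups incoming PCM chunks into fixed-duration frames."""

    def __init__(self, frame_ms: int = 50, sample_rate: int = 24_000) -> None:
        self.frame_bytes = sample_rate * BYTES_PER_SAMPLE * frame_ms // 1000
        self._pending = bytearray()

    def push(self, pcm: bytes) -> List[bytes]:
        """Add decoded audio; return every complete frame now available."""
        self._pending.extend(pcm)
        frames = []
        while len(self._pending) >= self.frame_bytes:
            frames.append(bytes(self._pending[:self.frame_bytes]))
            del self._pending[:self.frame_bytes]
        return frames

    def flush(self) -> bytes:
        """Return the leftover partial frame at the end of a response."""
        tail = bytes(self._pending)
        self._pending.clear()
        return tail
```

Fixed-size frames make scheduling on the audio output predictable, while `flush` ensures the last few milliseconds of a response are not lost.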
Common Mistakes
Saying “close the session” when you mean “end the turn”. A session persists for an entire conversation and is closed by disconnecting the WebSocket; a turn ends when the user or assistant finishes a single utterance. Saying “close the session after each question” suggests tearing down and rebuilding the WebSocket connection every time, which is both expensive and incorrect. The right phrase for finishing a speaking exchange is “the turn ends” or “the turn completes.”
Describing audio delta events as “packets”. Non-native speakers with a networking background sometimes call audio delta events “audio packets,” which in English implies a lower-level network unit such as an IP packet or UDP datagram. The correct term in the Realtime API context is event or audio delta event. In a code review or architecture discussion, using “event” keeps the language consistent with the official documentation and avoids confusion with lower-level networking concepts.
Confusing “latency” and “delay” in voice AI conversations. Both words refer to elapsed time, but in voice AI engineering, latency is the precise, measurable time between an event (end of speech) and a response (first audio byte received). Delay is a more general or subjective term. When discussing performance in a technical meeting, use “end-to-end latency,” “time-to-first-byte,” or “response latency” rather than the vague “there’s a delay.”
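To make “response latency” concrete, the sketch below times the gap between the server reporting end of speech and the first audio delta arriving. The event names are assumptions based on my reading of the server events, and `ResponseLatencyTimer` is hypothetical. Because the timer starts when the client receives the end-of-speech event, the figure excludes the VAD silence window and is therefore shorter than the latency the user perceives.

```python
import time
from typing import Callable, Optional

# Assumed event names; check the API version you target.
SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
AUDIO_DELTA_TYPES = {"response.audio.delta", "response.output_audio.delta"}


class ResponseLatencyTimer:
    """Measures seconds from end of speech to the first audio delta."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._speech_end: Optional[float] = None

    def observe(self, event_type: str) -> Optional[float]:
        """Feed every server event type; returns a latency once per turn."""
        if event_type == SPEECH_STOPPED:
            self._speech_end = self._clock()
            return None
        if event_type in AUDIO_DELTA_TYPES and self._speech_end is not None:
            latency = self._clock() - self._speech_end
            self._speech_end = None  # only the first delta of a turn counts
            return latency
        return None
```

Reporting this number as “response latency: 350 ms” is more useful in a meeting than saying “there’s a delay.”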
Building voice AI applications well requires both engineering skill and clear communication — teams that share a precise vocabulary for sessions, turns, and events ship faster and debug problems more efficiently.