English Vocabulary for the OpenAI Realtime API

Learn the professional English vocabulary for OpenAI's Realtime API — WebSocket connections, audio buffers, turn detection, function call events, and how engineers talk about them in real projects.

The OpenAI Realtime API enables low-latency, bidirectional audio and text conversations with GPT-4o over a persistent WebSocket connection. It is used to build voice assistants, real-time transcription tools, and interactive AI applications where response time is measured in milliseconds. If you are building or integrating with this API, you need to communicate precisely about events, audio buffers, and session configuration with your team. This post covers the vocabulary that appears most in engineering discussions, design documents, and pull request reviews for Realtime API projects.

Key Vocabulary

WebSocket connection The persistent, bidirectional network connection used to communicate with the Realtime API. Unlike HTTP requests — which are stateless and one-directional — a WebSocket connection stays open for the duration of a session, allowing the server to push events to the client at any time. Example: “Open the WebSocket connection before you start the audio stream — if you send audio before the session is established you will lose the first few frames.”

input_audio_buffer.append A client-side event that sends a chunk of raw audio data to the server’s input buffer. Audio is streamed in small chunks rather than sent all at once, which enables real-time processing. The payload is base64-encoded PCM audio. Example: “We’re calling input_audio_buffer.append every 100 milliseconds with the microphone data — the buffer accumulates until the model detects end of speech.”

response.create A client event that explicitly instructs the model to generate a response, used in manual turn detection mode. When you send this event, the model processes the accumulated audio or text and begins streaming back a reply. Example: “After the user clicks the ‘send’ button, emit a response.create event — in manual mode the model waits for this signal rather than detecting silence automatically.”

turn_detection The mechanism by which the Realtime API determines when a speaker has finished talking and the model should begin responding. It can be set to server-side VAD (Voice Activity Detection) — where the API detects silence automatically — or manual, where the client controls when turns end. Example: “Switch turn detection to server-side VAD for the consumer app — users expect the assistant to respond naturally when they stop speaking, not when they press a button.”

function_call event A server-sent event indicating that the model wants to call a tool (function) that you have registered with the session. Your client must handle this event, execute the function locally, and send back a conversation.item.create event with the result. Example: “The function_call event fires when the assistant decides to look up the user’s account — handle it in the WebSocket message listener and send back the result within two seconds or the response will time out.”

conversation.item.create A client event used to inject a message — such as a tool result, a system message, or a user utterance — directly into the conversation context. It is the primary way to feed structured data back into an ongoing session. Example: “After your function executes, send conversation.item.create with the function output — the model will incorporate it into the next response automatically.”

session.update A client event that modifies the configuration of an active session — changing the system prompt, registered tools, voice, or turn detection settings — without disconnecting and reconnecting. Example: “Use session.update to swap the system prompt when the user changes context — reconnecting would reset the conversation history.”

response.audio.delta A server-sent event that carries a chunk of the model’s audio output as it is generated. Clients receive a stream of these deltas and play them back progressively, which is what produces the low-latency, streaming audio experience. Example: “Buffer the response.audio.delta chunks and start playback immediately — don’t wait for the full response or you lose the real-time feel.”

How to Use This Vocabulary

Realtime API discussions often center on latency and reliability. When your team talks about the audio pipeline, they describe the flow from microphone capture, through input_audio_buffer.append, to turn detection, to response.create, to response.audio.delta playback. Understanding this vocabulary lets you pinpoint exactly where a delay or error is occurring — for example, “the latency is between turn detection and the first audio delta, not in our playback code.”

Function calling with the Realtime API introduces a synchronous step into an otherwise streaming flow. Teams discuss how to handle function_call events without blocking audio playback, how to time out stale function results, and how to manage conversation state across multiple tool calls in a single turn.

Example Conversation

Jordan: The voice assistant is cutting off responses mid-sentence. Where’s the break? Riley: Check if response.cancel is being triggered prematurely — sometimes the VAD misreads background noise as silence. Jordan: Good point. Should we switch turn detection to manual for now? Riley: Try it in staging. Use session.update to change the mode without dropping the connection.

Practice

  1. Draw a sequence diagram of a single voice turn: microphone → input_audio_buffer.append → VAD → response.createresponse.audio.delta → playback. Describe each step in English using the vocabulary from this post.
  2. Write a short paragraph explaining to a non-technical product manager what “turn detection” means and why it matters for user experience. Avoid technical jargon — translate it into plain English.
  3. Explain in a code review comment why conversation.item.create must be sent before response.create when handling a function call result, and what will happen if the order is reversed.