The OpenAI Realtime API enables low-latency voice conversations over WebSockets. Master the key concepts — audio buffers, turn detection, and realtime function calls — with these scenario-based exercises.
1 / 5
At standup, a colleague asks what input_audio_buffer is used for in the OpenAI Realtime API. What is the correct answer?
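The input_audio_buffer is a server-side buffer to which your client appends base64-encoded audio chunks using input_audio_buffer.append events. With server VAD enabled, the server commits the buffer itself when it detects the end of speech and then generates a response. With turn detection disabled, the client must send input_audio_buffer.commit and then response.create to trigger a response. Committing turns the buffered audio into a user message item in the conversation; a separate transcript is produced only if input audio transcription is enabled. It is not a client queue or a rate-limit mechanism: the client pushes bytes in and the server owns the buffer state.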
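The manual flow (append, commit, then request a response) can be sketched as follows. This is a minimal illustration that only builds the JSON messages; it assumes the documented event shape in which the audio field carries base64 text. The helper names are hypothetical, and sending the messages over a socket is left out.

```python
import base64
import json


def append_event(pcm_chunk: bytes) -> str:
    """Build an input_audio_buffer.append event; audio travels as base64 text."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def commit_event() -> str:
    """Build an input_audio_buffer.commit event (needed when turn detection is off)."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def response_create_event() -> str:
    """Build a response.create event asking the server to start generating."""
    return json.dumps({"type": "response.create"})


# 10 ms of silence at 24 kHz, 16-bit mono: 240 samples * 2 bytes each.
silence = b"\x00\x00" * 240

# Order of messages a client would send in manual (no VAD) mode.
messages = [append_event(silence), commit_event(), response_create_event()]
```

With server VAD enabled, only the append events would normally be needed, since the server commits and responds on its own.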
2 / 5
During a PR review, a teammate asks what turn detection controls in a Realtime API session. Which answer is correct?
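Turn detection in the Realtime API refers to server-side Voice Activity Detection (VAD). When enabled, the server detects the start and end of speech, commits the audio buffer at end-of-speech, and by default starts a response without any explicit client action. You configure it via the turn_detection field of the session object, setting type: "server_vad" and tuning parameters such as threshold, prefix_padding_ms, and silence_duration_ms. Setting turn_detection to null disables it, which leaves the client responsible for committing the buffer and requesting responses.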
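As a rough sketch, the configuration might be sent in a session.update event like the one built here. The numeric defaults are illustrative assumptions rather than recommended values, and the helper names are hypothetical.

```python
import json


def server_vad_session_update(
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> str:
    """Build a session.update event that enables server-side VAD."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,  # illustrative value, tune per environment
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    })


def manual_turns_session_update() -> str:
    """Build a session.update event that turns automatic turn detection off."""
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},  # serialized as JSON null
    })
```

A longer silence_duration_ms makes the server wait longer before deciding the speaker has finished, which trades responsiveness for fewer premature cut-ins.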
3 / 5
In a design review, the team discusses how function calls work in a Realtime API session. What is accurate?
The Realtime API supports function calls using the same tool-use paradigm as Chat Completions. The server streams argument chunks via response.function_call_arguments.delta and signals completion with response.function_call_arguments.done. Your client then executes the function and submits the result by sending a conversation.item.create event whose item has type: "function_call_output" and the matching call_id, followed by response.create to resume generation.
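The round trip can be sketched as below. This is a simplified illustration that assumes each delta and done event carries a call_id, and that the done event may include the full arguments string. The class and function names are hypothetical, and actually executing the tool is left to the caller.

```python
import json
from collections import defaultdict


class FunctionCallAssembler:
    """Collects streamed argument chunks per call_id (hypothetical helper)."""

    def __init__(self) -> None:
        self._chunks = defaultdict(list)

    def on_delta(self, event: dict) -> None:
        # response.function_call_arguments.delta: append the partial JSON text.
        self._chunks[event["call_id"]].append(event["delta"])

    def on_done(self, event: dict) -> dict:
        # response.function_call_arguments.done: prefer the complete string if
        # the event carries one, otherwise join the chunks collected so far.
        call_id = event["call_id"]
        raw = event.get("arguments") or "".join(self._chunks[call_id])
        self._chunks.pop(call_id, None)
        return json.loads(raw)


def function_output_events(call_id: str, result: dict) -> list:
    """Build the two client events that return a tool result and resume generation."""
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),  # output is sent as a JSON string
            },
        }),
        json.dumps({"type": "response.create"}),
    ]
```

Keying the chunks by call_id keeps concurrent tool calls in the same response from interleaving their arguments.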
4 / 5
An incident report shows audio responses cutting off mid-sentence. A senior engineer asks what response.cancel is and whether it was triggered accidentally. What does it do?
response.cancel is a client event that stops the current in-progress response, halting generation of both text and audio output. It is useful for barge-in scenarios where the user starts speaking before the assistant has finished. If triggered accidentally, for example by barge-in logic reacting to background noise that an overly sensitive VAD threshold mistakes for speech, it will cut the response short, which matches the incident description.
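One way to reason about accidental cancels is to route every barge-in through a single guard that counts what it sends. The sketch below assumes the client handles barge-in itself and reacts to the server events response.created, response.done, and input_audio_buffer.speech_started. The BargeInGuard class is hypothetical, not part of any SDK.

```python
import json
from typing import Optional


class BargeInGuard:
    """Emits response.cancel only while a response is actually in progress."""

    def __init__(self) -> None:
        self.response_active = False
        self.cancels_sent = 0  # useful when auditing an incident

    def handle(self, server_event: dict) -> Optional[str]:
        kind = server_event.get("type")
        if kind == "response.created":
            self.response_active = True
        elif kind == "response.done":
            self.response_active = False
        elif kind == "input_audio_buffer.speech_started" and self.response_active:
            # The user spoke over the assistant: cancel the current response.
            self.response_active = False
            self.cancels_sent += 1
            return json.dumps({"type": "response.cancel"})
        return None
```

Comparing cancels_sent against the number of genuine interruptions in session logs is one way to check whether noise is triggering the cut-offs.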
5 / 5
During a code review, a senior engineer asks what audio format parameters must be set when connecting to the OpenAI Realtime API via WebSocket. What is correct?
The Realtime API supports three audio formats over WebSocket: pcm16 (raw 16-bit little-endian PCM, mono, at 24 kHz), g711_ulaw, and g711_alaw. You declare both input_audio_format (what you send) and output_audio_format (what you receive) in the session.update configuration event. MP3 and Opus are not supported: raw PCM and G.711 are the only options.
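A minimal sketch of declaring the formats and preparing pcm16 samples follows. It assumes the session field names given above; the helper names and the client-side validation set are hypothetical conveniences, not part of the API.

```python
import json
import struct

# The three format identifiers named above.
SUPPORTED_FORMATS = {"pcm16", "g711_ulaw", "g711_alaw"}


def audio_format_update(input_format: str = "pcm16", output_format: str = "pcm16") -> str:
    """Build a session.update event declaring input and output audio formats."""
    for fmt in (input_format, output_format):
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported audio format: {fmt}")
    return json.dumps({
        "type": "session.update",
        "session": {
            "input_audio_format": input_format,
            "output_audio_format": output_format,
        },
    })


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
```

Rejecting an unknown format on the client gives a clearer failure than waiting for the server to return an error event.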