Voice agents have lagged behind text models in reasoning capability for years. OpenAI changed that on Thursday. The company launched GPT-Realtime-2, its first voice model with what it calls "GPT-5-class reasoning," alongside two specialized speech models: GPT-Realtime-Translate for live translation and GPT-Realtime-Whisper for streaming transcription. The release closes a long-standing gap. Previous voice models could handle turn-taking and natural-sounding speech, but they lacked the reasoning depth of text-only counterparts.
OpenAI's earlier GPT-Realtime model debuted in summer 2025, with version 1.5 arriving in February. GPT-Realtime-2 delivers an 11% performance improvement over that release. The technical upgrades are significant. The context window quadrupled from 32,000 to 128,000 tokens, letting the model sustain longer, more complex conversations.
OpenAI benchmarked the model at its "high" reasoning setting, where it hit 96.6% accuracy on Big Bench Audio, up from 81.4% for GPT-Realtime-1.5. On Audio MultiChallenge, a multi-turn dialogue test, the "xhigh" setting scored 48.5% versus 34.7% for the earlier model.
Developers get five reasoning intensity levels: minimal, low, medium, high, and xhigh. The default is low to keep latency down for simple requests.
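The five levels above map naturally to a per-request setting. A minimal sketch of how a developer might select one, assuming a hypothetical session field name (`reasoning_effort` is my guess, not a documented parameter; only the level names and the "low" default come from the release):

```python
# Hypothetical sketch: choosing a reasoning intensity per session.
# The five level names are from OpenAI's release; the field name
# "reasoning_effort" is an assumption for illustration.

REASONING_LEVELS = ["minimal", "low", "medium", "high", "xhigh"]

def session_config(level: str = "low") -> dict:
    """Build a session payload with the requested reasoning intensity.

    Defaults to "low", matching the latency-friendly default described
    in the release.
    """
    if level not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning level: {level}")
    return {"model": "gpt-realtime-2", "reasoning_effort": level}

print(session_config())         # default keeps latency down
print(session_config("xhigh"))  # maximum reasoning for hard tasks
```

The point of the default is latency: a simple request shouldn't pay for deep reasoning it doesn't need, so escalation to "high" or "xhigh" is an explicit opt-in.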
Pricing holds steady from the previous generation. GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million output tokens.
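A quick back-of-the-envelope calculation at those rates, with illustrative token counts (the rates are from the article; the session sizes are made up):

```python
# Cost estimate for GPT-Realtime-2 audio tokens at the listed rates:
# $32 per million input tokens, $64 per million output tokens.

INPUT_RATE = 32 / 1_000_000   # dollars per audio input token
OUTPUT_RATE = 64 / 1_000_000  # dollars per audio output token

def audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Total dollar cost for one session's audio token usage."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a session consuming 50k input and 20k output audio tokens:
print(f"${audio_cost(50_000, 20_000):.2f}")  # → $2.88
```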
GPT-Realtime-Translate and GPT-Realtime-Whisper

The translation model handles more than 70 input languages and 13 output languages, priced at $0.034 per minute. OpenAI previously handled translation through general speech models; this is its first dedicated offering for the use case.
GPT-Realtime-Whisper targets live transcription for meetings, classrooms, and broadcasts at $0.017 per minute. Whisper has been one of the most popular open-weight speech-to-text models since its 2022 debut, though the open version has not seen a major update in years.
OpenAI outlined three interaction patterns for developers: Voice-to-Action (users describe a task out loud, the system reasons and executes), Systems-to-Voice (software converts context into spoken guidance), and Voice-to-Voice (live conversational AI across tasks and languages). Deutsche Telekom is already testing the Voice-to-Voice pattern for customer support. OpenAI said these features are coming soon to ChatGPT's audio mode. "Voice can truly become the primary interface now," the company stated.
All three models are available through the Realtime API and OpenAI's Playground. The API supports EU data residency and falls under OpenAI's enterprise privacy commitments.
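For developers connecting through the Realtime API, configuration changes are sent as JSON events over the session. A minimal sketch of a `session.update` event selecting the new model — the `session.update` event type is part of the existing Realtime API, but the model identifier string and the exact session fields shown here are illustrative assumptions, not confirmed names:

```python
import json

# Illustrative session.update event for the Realtime API.
# Field values ("gpt-realtime-2", the modalities list) are guesses
# for the sake of the example, not documented identifiers.

event = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",
        "modalities": ["audio", "text"],
    },
}

# The event would be serialized and sent over the API's WebSocket.
payload = json.dumps(event)
print(payload)
```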