OpenAI has launched a new set of audio models aimed at a part of the AI market that is starting to matter far beyond demos: voice interfaces that can actually be deployed in customer support, transcription and multilingual workflows. Reuters described the release on May 8 as a push into real-time voice tasks, while OpenAI said the models are now available to developers worldwide through its API (Reuters, May 8; OpenAI, May 8).
The launch includes three core models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. The first two are speech-to-text systems, while the third is a new text-to-speech model that OpenAI says can be steered not just on what to say, but how to say it, including prompts such as speaking like a sympathetic customer service agent or a narrator with a specific tone (OpenAI, May 8).
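To make the steering idea concrete, here is a minimal sketch of how such a request might be assembled. It assumes the OpenAI Python SDK's speech endpoint accepts an instructions field alongside the text; the voice name, instruction wording and helper function are illustrative, not taken from the announcement.

```python
# Sketch of a steerable text-to-speech request for gpt-4o-mini-tts.
# Assumption: delivery is steered via a separate "instructions" field,
# distinct from the "input" text the model reads aloud. The voice name
# and helper are hypothetical examples.
def build_tts_request(text, persona_instructions, voice="coral"):
    """Assemble keyword arguments for a speech-generation call."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,                      # what to say
        "instructions": persona_instructions,  # how to say it
    }

request = build_tts_request(
    "Your refund has been processed and should arrive in 3-5 days.",
    "Speak like a sympathetic customer service agent.",
)
# With an SDK client this would be passed along as, roughly:
# client.audio.speech.create(**request)
```

The point of the split is that the same script can be re-voiced (calm agent, upbeat narrator) without editing the text itself.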
What OpenAI actually shipped
The practical significance is less about novelty than about control. Voice AI has existed for years, but enterprise buyers have been waiting for systems that handle accents, noise and variable speaking speeds without collapsing in the real world. OpenAI says its new transcription models outperform Whisper v2 and Whisper v3 on word error rate across benchmark tests, including FLEURS, a multilingual speech benchmark spanning more than 100 languages (OpenAI, May 8).
That global language coverage matters. A voice assistant that works in a quiet U.S. demo is one thing; a system that can reliably handle mixed accents in India, Europe or Latin America is the commercially relevant version. OpenAI said the improvements came from reinforcement learning-heavy training and extensive midtraining on audio-focused datasets, a sign that competition in voice AI is shifting from headline demos to operational reliability (OpenAI, May 8).
The numbers behind the launch
The pricing is also aggressive enough to make deployment easier to model. OpenAI lists gpt-4o-transcribe at an estimated cost of about $0.006 per minute and gpt-4o-mini-transcribe at about $0.003 per minute, according to its API pricing page. For teams building higher-speed spoken systems, the company lists gpt-realtime-translate at $0.034 per minute and gpt-realtime-2 audio pricing at $32 per 1 million input audio tokens and $64 per 1 million output audio tokens (OpenAI pricing docs, accessed May 8).
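Those per-minute figures translate directly into deployment budgets. A back-of-the-envelope sketch, using the listed transcription prices; the workload volume is an illustrative assumption, not a figure from OpenAI:

```python
# Rough cost model from the listed per-minute transcription prices
# ($0.006/min for gpt-4o-transcribe, $0.003/min for the mini model).
# The 10,000-hour workload is a made-up example volume.
PRICE_PER_MIN = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}

def monthly_transcription_cost(model, hours_of_audio):
    """Estimated monthly spend in dollars for a given audio volume."""
    return hours_of_audio * 60 * PRICE_PER_MIN[model]

# A support desk transcribing 10,000 hours of calls per month:
full = monthly_transcription_cost("gpt-4o-transcribe", 10_000)       # $3,600
mini = monthly_transcription_cost("gpt-4o-mini-transcribe", 10_000)  # $1,800
```

At that scale the mini model halves the bill, which is the kind of arithmetic that turns a pilot into a line item.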
On the speech output side, OpenAI says the text-to-speech stack now offers 13 built-in voices. The company also documents stricter limits for custom voice creation: up to 20 voices per organization, with source recordings capped at 30 seconds each and explicit speaker consent required (OpenAI text-to-speech docs, accessed May 8). Those constraints show that the commercial rollout is happening alongside tighter safeguards rather than as an unrestricted cloning tool.
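The documented limits are simple enough to encode as a pre-flight check. This helper is hypothetical, not part of OpenAI's API; it only restates the constraints from the text-to-speech docs (20 voices per organization, 30-second source clips, explicit consent):

```python
# Hypothetical pre-flight check mirroring the documented custom-voice
# limits. The function and its return convention are illustrative;
# only the numeric limits come from the cited docs.
MAX_VOICES_PER_ORG = 20
MAX_CLIP_SECONDS = 30

def can_add_custom_voice(existing_voices, clip_seconds, consent_given):
    """Return (allowed, reason) for a proposed custom-voice upload."""
    if not consent_given:
        return False, "explicit speaker consent is required"
    if existing_voices >= MAX_VOICES_PER_ORG:
        return False, "organization already at the 20-voice limit"
    if clip_seconds > MAX_CLIP_SECONDS:
        return False, "source recording exceeds 30 seconds"
    return True, "ok"

allowed, reason = can_add_custom_voice(existing_voices=5,
                                       clip_seconds=25,
                                       consent_given=True)
```

Consent is checked first deliberately: it is the one constraint that is a policy requirement rather than a quota.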
Why the voice market is becoming more important
This is also a timing play. Since launching Whisper in 2022, OpenAI has kept most of the market's attention on text and multimodal assistants. But companies now want agents that can answer calls, summarize meetings, route support conversations and operate across languages without forcing users to type into a chat box. That makes audio less of a side feature and more of a front door.
The harder question is whether better models are enough. Businesses still have to solve telephony integration, latency budgets, compliance and disclosure requirements before these models become real products. OpenAI's own policy requires developers to clearly tell end users when a voice is AI-generated. Still, this launch suggests the next phase of the voice AI race will be decided less by whether a model can speak at all and more by whether it can do so accurately, cheaply and with enough control to be trusted in production.