The model does not wait for a speaker to finish. It streams audio input and incrementally generates translated output at the same time. Google describes this as staying "mere seconds behind each speaker," which avoids the awkward pauses that can derail a natural conversation.
Users do not need to select a source language manually, because the model detects the spoken language on the fly. This works even when multiple languages are mixed in the same conversation, which makes it suitable for dynamic, real-world settings.
Crucially for the user experience, the translated output is designed not to sound robotic. The model retains the original speaker's intonation, pacing, and pitch, producing a translated voice that sounds more like the original person and less like a text-to-speech engine.
With support for 70+ languages, the model covers thousands of bidirectional language pairs. It is designed for two-way conversations, in which each speaker hears the other's words translated fluidly into their own language.
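As a rough check on the "thousands of bidirectional pairs" figure, assume exactly 70 languages (the source says 70+) and that any language can be paired with any other. The number of unordered pairs is then C(70, 2), and counting each translation direction separately doubles it. The function name is purely illustrative.

```python
from math import comb


def language_pair_counts(n_languages: int) -> tuple[int, int]:
    """Return (unordered pairs, directed translation directions) for n languages.

    Assumes every language can be paired with every other one, which is an
    illustrative simplification rather than a documented property of the model.
    """
    unordered = comb(n_languages, 2)              # {A, B} counted once
    directed = n_languages * (n_languages - 1)    # A->B and B->A counted separately
    return unordered, directed


print(language_pair_counts(70))  # (2415, 4830)
```

With 70 languages this gives 2,415 unordered pairs, which is consistent with the "2,000+ language pairs" quoted for Google Meet, or 4,830 if each direction is counted on its own.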
For developers, the model is accessed via the Gemini Live API. It expects audio input in a specific format: raw, little-endian, 16-bit PCM at a 16 kHz sample rate. The translated audio output is also raw 16-bit PCM, but at a higher 24 kHz sample rate. The model accepts up to 128,000 input tokens and can produce up to 64,000 output tokens.
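The format requirements reduce to simple byte arithmetic: with 2-byte (16-bit) samples, one second of mono input at 16 kHz is 32,000 bytes, and one second of mono output at 24 kHz is 48,000 bytes. The following sketch uses only the Python standard library to show one way of preparing input and measuring output in that format. It assumes mono audio and float samples in the range [-1.0, 1.0]; the 100 ms chunk size and all function and constant names are hypothetical choices, not part of the Gemini Live API, and the code does not show how chunks are actually sent to the API.

```python
import struct

INPUT_RATE_HZ = 16_000   # input sample rate stated for the model
OUTPUT_RATE_HZ = 24_000  # output sample rate stated for the model
BYTES_PER_SAMPLE = 2     # 16-bit PCM; mono is assumed here, not stated in the source


def floats_to_pcm16le(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to raw little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]  # guard against overflow
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)           # "<" = little-endian, "h" = int16


def chunk_pcm(pcm: bytes, chunk_ms: int = 100, rate_hz: int = INPUT_RATE_HZ) -> list[bytes]:
    """Split PCM bytes into fixed-duration chunks (hypothetical 100 ms default) for streaming."""
    chunk_bytes = rate_hz * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


def pcm16le_duration_s(pcm: bytes, rate_hz: int) -> float:
    """Return the duration in seconds of mono 16-bit PCM at the given sample rate."""
    return len(pcm) / (BYTES_PER_SAMPLE * rate_hz)


# One second of silence at 16 kHz becomes 32,000 bytes, i.e. ten 3,200-byte chunks.
pcm_in = floats_to_pcm16le([0.0] * INPUT_RATE_HZ)
print(len(pcm_in), len(chunk_pcm(pcm_in)))
```

Because the output runs at 24 kHz rather than 16 kHz, any code that plays or measures the translated audio has to use the output rate: `pcm16le_duration_s(data, OUTPUT_RATE_HZ)` reports 1.0 second for 48,000 bytes, whereas the 16 kHz rate would overstate it by half.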
Google's path to this public launch was phased, with the Gemini 3.5 model family first announced at the Google I/O developer conference in May 2026.
Google released gemini-3.1-flash-live-preview on March 26, 2026, as part of this iterative development. The gemini-3.5-live-translate-preview model was officially released to developers via the Gemini Live API and Google AI Studio, and to consumers globally through updates to the Google Translate app on both Android and iOS.
The model is being made available across a wide range of Google's consumer, developer, and enterprise platforms, with varying levels of access.
For consumers, the Google Translate app is the simplest point of access, and the feature is rolling out there globally. Users can tap the "Live translate" button in the bottom-left corner of the app screen while wearing headphones. On Android, Google is also rolling out a hands-free "listening mode" that plays translations through the phone's earpiece, so you can hold the phone to your ear like a regular call.
For developers, the model is available in public preview. It can be integrated into third-party applications and services through the Gemini Live API using a dedicated translation configuration. Google AI Studio also provides a sandbox environment for prototyping and testing the model's capabilities.
Access for businesses is more restricted. Gemini 3.5 Live Translate for Google Meet is launching in private preview for select enterprise customers starting in June 2026. Once available, it will automatically detect each speaker's language and translate it into each participant's preferred language, supporting more than 70 languages and 2,000+ language pairs during meetings. A broader rollout is planned for later in 2026. The feature will be available to Google Workspace Business Standard and Plus, Enterprise Standard and Plus, Google AI Pro, and Google AI Ultra subscribers.
Real-time communication platforms such as Agora, Fishjam, LiveKit, Pipecat, and Vision Agents are already working to integrate the Gemini Live API and bring the translation model into their own media pipelines.
One of the most compelling real-world tests is with Grab, the Southeast Asian ride-hailing and delivery platform. Grab is piloting the technology to translate voice calls between drivers and riders in real time. The company handles more than 10 million voice calls per month, and the pilot tackles the challenge of a linguistically fragmented market head-on.
The move from turn-by-turn to streaming translation is a fundamental UX shift. By integrating the model deeply into ubiquitous products like Google Translate and Meet, and opening it up to a developer ecosystem, Google is pushing real-time speech translation from a niche feature toward a standard infrastructure layer for global communication. The Grab pilot illustrates this shift, positioning instant, natural-sounding translation as a utility rather than a novelty.
All AI-generated audio from the model is watermarked with Google's SynthID technology to ensure its origin is detectable and to mitigate potential misuse, a critical step as synthetic voice technology becomes more convincing and widespread.