Discover the Power of Gemini 3.1 Flash TTS Today

 

Google on Wednesday released Gemini 3.1 Flash TTS, a text-to-speech model that the company describes as its most expressive and controllable to date. The model is available in preview through the Gemini API, Google AI Studio, Vertex AI, and Google Vids for Workspace users.

Controllable Speech With 200+ Audio Tags

The new model introduces more than 200 audio tags that developers can embed directly into text input to steer vocal style, pacing, accent, and emotional expression at a granular level. Tags range from emotions like "determination" and "curiosity" to delivery cues such as "whispers" and "laughs," enabling what Google calls an "authorial" approach to audio generation.
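As a rough illustration of what embedding tags in the input might look like, the sketch below builds a tagged script as a plain string. It assumes tags are written inline in square brackets, such as `[whispers]`. That syntax is not stated above, and `with_tag` is a hypothetical helper, not part of any Google SDK.

```python
import re


def with_tag(cue: str, text: str) -> str:
    """Prefix `text` with an inline audio tag, e.g. "[whispers] ...".

    The square-bracket form is an assumption made for illustration.
    """
    cue = cue.strip().lower()
    # Accept only simple lowercase words so malformed tags fail early.
    if not re.fullmatch(r"[a-z][a-z ]*", cue):
        raise ValueError(f"unexpected tag name: {cue!r}")
    return f"[{cue}] {text}"


# Tag names below are taken from the examples quoted in the article.
script = " ".join([
    with_tag("curiosity", "What is behind that door?"),
    with_tag("whispers", "I think someone is inside."),
    with_tag("laughs", "It was only the cat."),
])
print(script)
```

The resulting string would then be sent as the text input of a speech request. Keeping the tags in the text itself, rather than in separate parameters, is what lets a writer change delivery from one sentence to the next.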

Gemini 3.1 Flash TTS supports over 70 languages, including Hindi, Japanese, and German, with 30 prebuilt voices available as starting points. The model also handles multi-speaker dialogue natively, maintaining natural conversational flow without requiring separate API calls for different voices. Google is aiming the feature at podcasts, dramatic scripts, and assistant interfaces.
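A minimal sketch of how a multi-speaker script could be prepared for a single request follows. It assumes the dialogue is sent as one transcript with `Speaker: line` prefixes alongside a speaker-to-voice mapping. The `Turn` class, the `build_dialogue` helper, and the voice names `VOICE_A` and `VOICE_B` are placeholders for illustration, not identifiers from the Gemini API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


def build_dialogue(turns, voices):
    """Return (transcript, voice_map) for one multi-speaker request.

    `voices` maps each speaker label to a voice name; every speaker
    that appears in `turns` must have an entry.
    """
    missing = {t.speaker for t in turns} - set(voices)
    if missing:
        raise ValueError(f"no voice assigned for: {sorted(missing)}")
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
    # Keep only the speakers actually used, in order of first appearance.
    speakers = dict.fromkeys(t.speaker for t in turns)
    return transcript, {name: voices[name] for name in speakers}


turns = [
    Turn("Host", "Welcome back to the show."),
    Turn("Guest", "Thanks for having me."),
    Turn("Host", "Let's get started."),
]
voices = {"Host": "VOICE_A", "Guest": "VOICE_B"}  # placeholder voice names
transcript, used = build_dialogue(turns, voices)
print(transcript)
```

Bundling every turn into one transcript mirrors the single-call behavior described above. The alternative, one request per speaker, would leave the caller to stitch the audio clips together.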

Google AI Studio reported that the model achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard. Artificial Analysis, for its part, noted that Gemini 3.1 Flash TTS ranked second on its Speech Arena Leaderboard, ahead of ElevenLabs' Eleven v3.

SynthID Watermarking and Developer Access

All audio generated by the model is watermarked with SynthID, Google's imperceptible watermarking technology designed to identify AI-generated content and help prevent misinformation. The watermark is embedded without degrading audio quality, according to Google.

The model is accessible through the gemini-3.1-flash-tts-preview model ID in the Gemini API, with an input token limit of 8,192 and an output token limit of 16,384. The launch follows the March 25 release of Gemini 3.1 Flash Live, Google's real-time dialogue model built for voice-first AI applications.
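For long scripts, the 8,192-token input limit suggests splitting text before sending it. The sketch below packs whole sentences into chunks that stay under a budget. It relies on a crude four-characters-per-token estimate, which is an assumption and not the model's real tokenizer. `estimate_tokens` and `split_for_tts` are hypothetical helpers.

```python
import re

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID quoted in the article
INPUT_TOKEN_LIMIT = 8_192                  # input limit quoted in the article


def estimate_tokens(text: str) -> int:
    """Rough guess of about 4 characters per token (an assumption)."""
    return max(1, -(-len(text) // 4))  # ceiling division


def split_for_tts(text, limit=INPUT_TOKEN_LIMIT, count=estimate_tokens):
    """Greedily pack whole sentences into chunks within `limit` tokens.

    A single sentence longer than `limit` is returned as its own
    oversized chunk; this sketch does not split inside sentences.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = "One two three. Four five six. Seven eight nine."
print(split_for_tts(demo, limit=8))
```

In practice the `count` argument would be swapped for an exact token counter, since a character-based estimate can be well off for languages such as Hindi or Japanese.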