Late May 2026 brought a flurry of benchmark results that, taken together, tell a single story: Alibaba's Qwen model family no longer trails the Western frontier; it sits inside the top tier. On voice, Fun-Realtime-TTS-Preview broke into the global top five and swept the three core speech tracks among Chinese models. On code, Qwen3.7-Max debuted as the highest-ranked model from any non-US lab on Code Arena. The broader context, captured by the Stanford 2026 AI Index, is that the performance gap between the best American and Chinese AI models has collapsed to roughly 2.7 percent as of March 2026, down from as much as 31.6 percentage points in May 2023.
On May 28, Alibaba's Tongyi Lab placed Fun-Realtime-TTS-Preview at No. 5 globally on the Artificial Analysis Speech Arena leaderboard with an Elo score of 1,190. It was the only Chinese-engineered voice system in the top five and ranked first among all Chinese models across all three core tracks: ASR (speech recognition), Chat (end-to-end conversational voice), and TTS (text-to-speech). The result was described as a "grand slam" in voice interaction.
These results follow a broader push from the Qwen voice lab. The earlier Fun-Realtime-ASR and Fun-Realtime-AudioChat models had already claimed top spots on the same platform, and Qwen2.5-Omni-7B leads the VoiceBench Avg leaderboard with a score of 0.741.
Alibaba's voice models have also beaten Western rivals including OpenAI and xAI on regional-accent and dialect benchmarks, with a particular edge in complex Chinese dialects.
Separately, Qwen3.5-Omni-Plus, released in March 2026, reported 215 state-of-the-art results across audio and audio-visual understanding tasks. On independent audio benchmarks, it outperformed Google's Gemini 3.1 Pro on general audio understanding, reasoning, and translation, though it only matched Gemini on comprehensive audio-visual comprehension. A measured technical review notes that the audio wins are genuine, with a 6.55% word error rate on the Fleurs ASR benchmark versus Gemini's 7.32%, but that the model trails Gemini by about 12 points on the OmniGAIA agentic benchmark.
Alibaba shipped Qwen3.7-Max on May 19, 2026, and within a week it appeared at No. 4 on Code Arena's WebDev leaderboard with an Elo of 1,541, one point behind Claude Opus 4.6 Thinking and ahead of every model from OpenAI and Google. On the React coding track, it rose to No. 3 with 1,536 Elo, trailing only two Claude Opus variants. Some sources report it briefly climbed to No. 2 on certain Code Arena sub-leaderboards.
Anthropic's Claude Opus 4.7/4.6 line occupied spots one through three on WebDev, meaning Alibaba was the only developer outside Anthropic, and the only non-US lab, to break into the coding top five. The model sits ahead of GPT-5.5, Gemini 3.5 Flash, and GLM-5.1 on agentic web development tasks that score real-world human preference on multi-step coding workflows.
Beyond Code Arena, Qwen models have been clocking competitive results on other coding and reasoning benchmarks:
The Stanford 2026 AI Index's Arena Elo snapshot as of March 2026 shows the top labs packed tight:
| Lab | Arena Elo |
|---|---|
| Anthropic | 1,503 |
| xAI | 1,495 |
| | 1,494 |
| OpenAI | 1,481 |
| Alibaba | 1,449 |
| DeepSeek | 1,424 |
Alibaba sits 5th overall, 54 points behind the leader, Anthropic. That is close enough that the report's authors describe competitive pressure as having shifted toward cost, reliability, and domain-specific performance rather than raw capability.
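To put the size of these gaps in perspective, the short Python sketch below converts an Elo difference into an expected head-to-head preference rate. It assumes the conventional Elo logistic curve (base 10, scale 400); the leaderboards cited here may compute ratings with a different variant, so the output is illustrative only. The function name `elo_expected_score` and the dictionary are hypothetical helpers, and the 1,494 row is left out because its lab name is missing from the table.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B.

    Assumes the conventional Elo logistic curve (base 10, scale 400);
    this is an assumption, not a formula confirmed by the source.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Arena Elo values copied from the table above (named rows only).
arena_elo = {
    "Anthropic": 1503,
    "xAI": 1495,
    "OpenAI": 1481,
    "Alibaba": 1449,
    "DeepSeek": 1424,
}

leader = max(arena_elo, key=arena_elo.get)

# Point gap between the leader and every lab.
gaps = {lab: arena_elo[leader] - elo for lab, elo in arena_elo.items()}

for lab, gap in gaps.items():
    if lab == leader:
        continue
    p = elo_expected_score(arena_elo[leader], arena_elo[lab])
    print(f"{leader} vs {lab}: gap {gap} points, expected preference {p:.1%}")
```

Under that assumption, a 54-point lead corresponds to the leader being preferred in a little under 58% of pairwise comparisons, which is an edge but far from a rout.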
The benchmark results land in a year when the performance gap between the best US and Chinese AI models has nearly vanished. Stanford's 2026 AI Index finds the gap collapsed from 17.5–31.6 percentage points in May 2023 to just 2.7% as of March 2026. The two countries are now "constantly trading places at the top of benchmarks," a sharp departure from the US-dominated era through 2024.
This happened despite the US outspending China roughly 23 to 1 on private AI investment ($285.9 billion versus $12.4 billion) in the most recent period tracked.
Analysts point to several forces behind the catch-up:
Other assessments see a wider gap. A 2026 Brookings analysis argues that American frontier models still lead Chinese ones by "several months or more" and that US labs retain an edge on compute scale and longer-horizon agentic tasks. Congressional testimony from the same period makes a similar point.
Even so, the practical upshot for enterprises and developers is clear: more competition, faster iteration, lower prices, and more viable options from both American and Chinese providers.
Studio Global AI