These results follow a broader push from the Qwen voice lab. Earlier Fun-Realtime-ASR and Fun-Realtime-AudioChat models had already claimed top spots on the same platform, and Qwen2.5-Omni-7B leads the VoiceBench Avg leaderboard with a score of 0.741.
Alibaba's voice models have also beaten Western rivals including OpenAI and xAI on regional-accent and dialect benchmarks, with a particular edge in complex Chinese dialects.
Separately, Qwen3.5-Omni-Plus, released in March 2026, reported 215 state-of-the-art results across audio and audio-visual understanding tasks. On independent audio benchmarks, it outperformed Google's Gemini 3.1 Pro on general audio understanding, reasoning, and translation, though it only matched Gemini on comprehensive audio-visual comprehension. A measured technical review notes that the audio wins are genuine (a 6.55% word error rate on the Fleurs ASR benchmark versus Gemini's 7.32%) but that the model trails Gemini by about 12 points on the OmniGAIA agentic benchmark.
Alibaba shipped Qwen3.7-Max on May 19, 2026, and within a week it appeared at No. 4 on Code Arena's WebDev leaderboard with an Elo of 1,541, one point behind Claude Opus 4.6 Thinking and ahead of every model from OpenAI and Google. On the React coding track, it rose to No. 3 with 1,536 Elo, trailing only two Claude Opus variants. Some sources report it briefly climbed to No. 2 on certain Code Arena sub-leaderboards.
Anthropic's Claude Opus 4.7/4.6 line occupied spots one through three on WebDev, meaning Alibaba was the only developer outside Anthropic, and the only non-US lab, to break into the coding top five. The model sits ahead of GPT-5.5, Gemini 3.5 Flash, and GLM-5.1 on agentic web development tasks, which are scored by real-world human preference on multi-step coding workflows.
Beyond Code Arena, Qwen models have been clocking competitive results on other coding and reasoning benchmarks:
| Lab | Arena Elo |
|---|---|
| Anthropic | 1,503 |
| xAI | 1,495 |
| | 1,494 |
| OpenAI | 1,481 |
| Alibaba | 1,449 |
| DeepSeek | 1,424 |
Alibaba sits 5th overall, 54 points behind the leader. That is close enough that the report's authors describe competitive pressure as having shifted toward cost, reliability, and domain-specific performance rather than raw capability.
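To give a rough sense of what a gap of this size means, the sketch below converts the Elo ratings in the table above into expected head-to-head scores. It assumes the standard Elo logistic curve (base 10, scale 400); the leaderboard's actual rating method is not described here and may differ, and the function and variable names are illustrative only. The row whose lab name is missing from the table is left out.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: base-10 logistic with a 400-point scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the table; the row with no lab name is omitted.
arena_elo = {
    "Anthropic": 1503,
    "xAI": 1495,
    "OpenAI": 1481,
    "Alibaba": 1449,
    "DeepSeek": 1424,
}

leader = max(arena_elo, key=arena_elo.get)

for lab, rating in arena_elo.items():
    gap = arena_elo[leader] - rating
    expected = elo_expected_score(rating, arena_elo[leader])
    print(f"{lab:10s} gap={gap:3d}  expected score vs {leader}: {expected:.3f}")
```

Under that assumption, a 54-point deficit corresponds to an expected score of about 0.42 against the leader, which is a modest disadvantage in pairwise comparisons rather than a decisive one.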
The benchmark results land in a year when the performance gap between the best US and Chinese AI models has nearly vanished. Stanford's 2026 AI Index finds the gap collapsed from 17.5–31.6 percentage points in May 2023 to just 2.7% as of March 2026. The two countries are now "constantly trading places at the top of benchmarks," a sharp departure from the US-dominated era through 2024.
This happened despite the US outspending China roughly 23 to 1 on private AI investment: $285.9 billion versus $12.4 billion in the most recent period tracked.
Analysts point to several forces behind the catch-up:
Other assessments see a wider gap. A 2026 Brookings analysis argues that American frontier models still lead Chinese ones by "several months or more" and that US labs retain an edge on compute scale and longer-horizon agentic tasks. Congressional testimony from the same period makes a similar point.