In a traditional VLA pipeline, the system works sequentially: the car sees the road, translates that visual perception into language-like tokens, and then reasons over those tokens to generate a driving action. Dr. Liu described this intermediate step as a critical weakness, stating bluntly that “language is poison” for real-time driving. His argument is that language tokens add latency and inject irrelevant semantic noise into a process that demands millisecond-level reactions.
The VLA 2.0 model eliminates this bottleneck entirely. It adopts what the company calls a “Vision-Implicit Token-Action” path, generating driving commands end to end from raw visual inputs without any intermediate language representation. The system can still accept language as an input, such as a driver’s navigation command or a spoken instruction, but it never produces language tokens of its own as an internal output while driving. XPeng showcased the system at its CVPR booth alongside a physical AI world model, with a related research paper, DrivePTS, accepted for publication at the conference.
XPeng’s leadership has not been shy about drawing direct comparisons to Tesla. Their claims over the spring and summer of 2026 represent a sharp escalation in confidence. Dr. Liu stated in his June interview that XPeng has already achieved parity with Tesla’s FSD v13 in China and that matching the performance of the newer FSD v14 is “within reach before the end of summer”.
These technical claims are backed by an unusually personal commitment from the top. In December 2025, CEO He Xiaopeng set a public “performance wager,” declaring that XPeng’s VLA system must match the on-road experience of Tesla’s FSD v14.2 in Silicon Valley by August 30, 2026. The stakes of this bet were made explicit: if the team failed, the person in charge would “run naked”.
To support its narrative, XPeng released a head-to-head video in May 2026 that brought two US-based Tesla enthusiasts to China. The staged comparison pitted an XPeng P7 running VLA 2.0 against a Tesla Model 3 with FSD on identical Beijing routes. According to XPeng’s own cut of the video, its vehicle required only 2 driver takeovers, compared to 7 for the Tesla. He Xiaopeng has reiterated at multiple events, including Auto China 2026, that the goal is to fully surpass Tesla’s FSD in the Chinese market by August, but independent reviews urge a measure of caution. An Electrek editor who tested VLA 2.0 in Beijing described its performance as “comparable” to FSD v14, but noted that both systems still require constant driver attention and are far from fully autonomous.
For now, the race remains a high-speed chase defined by bold architectural bets and even bolder claims. XPeng’s decision to design language out of its driving brain is a calculated gamble that the fastest path from vision to action is a straight line—even if that means throwing the dictionary out the window.