
AI Agent Benchmark Debut: How Did Nvidia's Blackwell Ultra Achieve a 20x Energy-Efficiency Lead?

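AA-AgentPerf, released by Artificial Analysis on June 12, 2026, is the first open-source, multi-vendor hardware benchmark for AI agent inference. Rather than measuring single-turn chat, it measures how many AI coding agents a system can support at the same time while meeting the speed requirements of a service-level objective (SLO).[4] The benchmark is built on real open-source coding-agent trajectories spanning more than 12 programming languages, and it includes multi-turn LLM calls, tool calls (with simulated CPU latency), and steadily growing context windows. Results are normalized per accelerator and per megawatt.[4] On the DeepSeek V4 Pro workload (a large mixture-of-experts model), Nvidia's GB300 NVL72 (Blackwell Ultra) platform delivered the highest performance; under the most relaxed SLO (20 tokens/s, 10 s TTFT), at rack scale...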


Nvidia's Blackwell Ultra architecture is purpose-built for the demanding multi-step reasoning of agentic AI workloads. Image: AI-generated.

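If you are still judging AI chips by how many tokens they generate per second, you may already be behind the times. On June 12, 2026, the AI performance analysis firm Artificial Analysis published the first results of AA-AgentPerf, the industry's first hardware benchmark built specifically for AI agent workloads. In those results, Nvidia's latest-generation Blackwell Ultra GB300 NVL72 platform finished first among all tested platforms by a wide margin, and on one specific energy-efficiency metric its advantage over the previous-generation Hopper architecture reached 20x.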

Does this mean that the rules of the compute arms race have been quietly rewritten for the coming era of agentic AI?

The New Benchmark: What Does It Measure? The Leap from "Chatting" to "Doing Work"

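Traditional AI inference benchmarks such as MLPerf mainly measure how quickly a model generates a reply to a single prompt. The scenario AA-AgentPerf simulates is far more complex: it probes how hardware performs in real-world agent applications.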

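Specifically, AgentPerf measures how many AI coding agents an inference system can support concurrently while meeting a given service-level objective (SLO), defined by output token speed and time to first token (TTFT). The agents' trajectories are not invented; they are real workflows extracted from public code repositories and span more than 12 programming languages. Each workflow involves multi-turn large language model (LLM) calls, tool calls with simulated CPU latency, and a continually expanding context window. Final results are normalized per accelerator and per megawatt of power so that both performance and energy efficiency are compared fairly.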

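Put simply, earlier benchmarks asked "who answers fastest," whereas AgentPerf asks "who can host the most AI workers that write code and get things done at the same time, without any one of them falling behind on productivity."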
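To make the metric concrete, here is a minimal Python sketch of the idea described above: sweep the number of concurrent agents, keep the largest load that still satisfies the SLO, then normalize per accelerator and per megawatt. It is not Artificial Analysis's implementation. The class and function names (`Slo`, `Measurement`, `max_agents_within_slo`, `normalize_result`), the sweep values, and the 125 kW power figure are all assumptions made up for illustration. Only the SLO thresholds (20 tokens/s, 10 s TTFT) come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    min_tokens_per_s: float  # per-agent output speed floor
    max_ttft_s: float        # time-to-first-token ceiling, in seconds


@dataclass(frozen=True)
class Measurement:
    concurrent_agents: int   # load level tested
    tokens_per_s: float      # per-agent output speed observed at that load
    ttft_s: float            # TTFT observed at that load, in seconds


def max_agents_within_slo(measurements, slo):
    """Return the largest tested concurrency that meets both SLO limits (0 if none)."""
    passing = [
        m.concurrent_agents
        for m in measurements
        if m.tokens_per_s >= slo.min_tokens_per_s and m.ttft_s <= slo.max_ttft_s
    ]
    return max(passing, default=0)


def normalize_result(agents, accelerators, power_kw):
    """Express a concurrency result per accelerator and per megawatt."""
    return {
        "agents_per_accelerator": agents / accelerators,
        "agents_per_megawatt": agents / (power_kw / 1000.0),
    }


if __name__ == "__main__":
    # SLO thresholds taken from the article's "most relaxed" setting.
    slo = Slo(min_tokens_per_s=20.0, max_ttft_s=10.0)

    # Hypothetical sweep: NOT benchmark data, purely illustrative numbers.
    sweep = [
        Measurement(64, 45.0, 2.1),
        Measurement(128, 31.0, 4.8),
        Measurement(256, 22.0, 8.9),
        Measurement(512, 14.0, 17.5),
    ]

    agents = max_agents_within_slo(sweep, slo)
    # 72 accelerators and 125 kW are assumed values for this example only.
    print(agents, normalize_result(agents, accelerators=72, power_kw=125.0))
```

In this made-up sweep, the 512-agent level fails both limits, so the reported capacity is the 256-agent level. A real harness would also have to replay the multi-turn trajectories, the simulated tool latency, and the growing context, all of which this sketch leaves out.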

Nvidia's "Report Card": A 20x Energy-Efficiency Feat

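In AgentPerf's first round of testing, every platform ran DeepSeek V4 Pro, a large mixture-of-experts (MoE) model that can fairly represent today's frontier agentic AI capability. Nvidia's GB300 NVL72 platform posted the following results: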
