| 32GB HBM |
| ~1.6 TB/s |
| Not specified |
| One TensorCore with two matrix‑multiply units; designed for large TPU pods of up to 256 chips. |
| Huawei Ascend 910 (original) | China / Huawei | Da Vinci architecture, ~7 nm | ~256 TFLOPS FP16 | HBM memory | ~1.2 TB/s bandwidth | ~350W | Introduced in 2019 as Huawei’s flagship AI accelerator. |
| Biren BR100 | China / Biren Technology | Dual‑die GPU, TSMC 7 nm CoWoS | 256 TFLOPS FP32 / ~2,048 TOPS INT8 | 64GB HBM2E | up to ~2.3 TB/s | ~550W | Chiplet‑style GPU with ~77B transistors targeting data‑center AI. |
| Biren BR104 | China / Biren Technology | Single‑die GPU variant | ~128 TFLOPS FP32 | 32GB HBM2E | ~819 GB/s | ~300W | Lower‑tier variant designed for PCIe accelerator cards. |
| Cambricon MLU370‑X8 | China / Cambricon | MLUarch03 architecture, 7 nm | 24 TFLOPS FP32 / 96 TFLOPS FP16 / 256 TOPS INT8 | 48GB LPDDR5 | ~614 GB/s | ~250W | Supports multi‑card clusters via MLU‑Link interconnect. |
US accelerators currently lead in documented raw compute throughput for large‑scale AI training. For example, AMD’s MI325X reaches roughly 1.3 PFLOPS of FP16 performance, while Google’s TPU v6e delivers 918 TFLOPS of BF16 per chip.
Chinese chips such as Huawei’s Ascend 910C aim to narrow this gap, with estimates of around 800 TFLOPS FP16 using a dual‑chiplet configuration derived from earlier Ascend chips.
Biren’s BR100 accelerator represents another Chinese attempt to compete at the high end, delivering up to 256 TFLOPS FP32 compute and around 2,048 TOPS INT8 performance in a multi‑die GPU design.
Cambricon’s MLU370‑X8 targets AI inference and training workloads with 256 TOPS INT8 and 96 TFLOPS FP16 compute capability.
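To put these peak figures side by side, the minimal sketch below normalizes them against the MI325X number. It uses only the values quoted above. The names `PEAK_TFLOPS` and `relative_to` are illustrative, and FP16 and BF16 peaks are mixed, so the ratios are a rough guide rather than a benchmark.

```python
# Peak throughput in TFLOPS, taken from the figures quoted in this article.
# These are vendor or estimated peak numbers, not measured training throughput.
PEAK_TFLOPS = {
    "AMD MI325X (FP16)": 1300.0,
    "Google TPU v6e (BF16)": 918.0,
    "Huawei Ascend 910C (FP16, estimate)": 800.0,
    "Cambricon MLU370-X8 (FP16)": 96.0,
}


def relative_to(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each chip's peak as a fraction of the baseline chip's peak."""
    base = table[baseline]
    return {name: value / base for name, value in table.items()}


ratios = relative_to("AMD MI325X (FP16)", PEAK_TFLOPS)

# Print chips from highest to lowest relative peak.
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {ratio:6.1%}")
```

Peak numbers like these say nothing about sustained utilization, which depends on memory, interconnect, and software.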
Modern AI models rely heavily on memory capacity and bandwidth, making HBM integration critical.
Higher memory bandwidth helps accelerate matrix operations and large‑model training, which often move massive tensors across memory and compute units.
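One common way to reason about this trade-off is a simple roofline estimate: a kernel is limited either by peak compute or by how fast memory can feed it. The sketch below applies that model to the peak and bandwidth figures from the table above. The `Chip` class and its method names are illustrative assumptions, and the model ignores caches, overlap, and real-world utilization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    peak_tflops: float    # peak compute, TFLOP/s (precision noted in the name)
    bandwidth_tbs: float  # memory bandwidth, TB/s

    def ridge_point(self) -> float:
        """FLOPs per byte a kernel needs before compute becomes the limit."""
        return self.peak_tflops / self.bandwidth_tbs

    def attainable_tflops(self, flops_per_byte: float) -> float:
        """Roofline bound: the lower of peak compute and bandwidth * intensity."""
        return min(self.peak_tflops, flops_per_byte * self.bandwidth_tbs)


# Figures from the comparison table (approximate vendor numbers).
CHIPS = [
    Chip("Biren BR100 (FP32)", 256.0, 2.3),
    Chip("Biren BR104 (FP32)", 128.0, 0.819),
    Chip("Cambricon MLU370-X8 (FP16)", 96.0, 0.614),
]

for chip in CHIPS:
    # A kernel at 10 FLOPs/byte is an arbitrary low-intensity example.
    print(
        f"{chip.name:30s} ridge={chip.ridge_point():6.1f} FLOP/B  "
        f"at 10 FLOP/B: {chip.attainable_tflops(10.0):6.2f} TFLOPS"
    )
```

Under this model, a low-intensity kernel on any of these chips is bound by memory bandwidth rather than peak compute, which is why bandwidth figures matter as much as TFLOPS.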
Large AI models are rarely trained on a single chip. Instead, hundreds or thousands of accelerators are connected into clusters.
Cluster architecture is increasingly as important as the individual chip’s raw compute performance.
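A rough feel for why the interconnect matters comes from a toy data-parallel model: compute time shrinks as chips are added, but the gradient all-reduce does not. The sketch below uses the standard bandwidth-only cost of a ring all-reduce and assumes no overlap between compute and communication. All input numbers (step time, gradient size, link speed) are hypothetical and chosen only for illustration.

```python
def ring_allreduce_seconds(payload_bytes: float, n_chips: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only ring all-reduce time; per-message latency is ignored."""
    if n_chips == 1:
        return 0.0
    return 2 * (n_chips - 1) / n_chips * payload_bytes / link_bytes_per_s


def scaling_efficiency(single_chip_seconds: float, payload_bytes: float,
                       n_chips: int, link_bytes_per_s: float) -> float:
    """Fraction of each step spent on useful compute (no overlap assumed)."""
    compute = single_chip_seconds / n_chips
    comm = ring_allreduce_seconds(payload_bytes, n_chips, link_bytes_per_s)
    return compute / (compute + comm)


# Hypothetical workload: 64 s per step on one chip, 2 GB of gradients,
# 100 GB/s per-chip interconnect bandwidth.
STEP_SECONDS = 64.0
GRADIENT_BYTES = 2e9
LINK_BYTES_PER_S = 100e9

for n in (1, 8, 64, 256):
    eff = scaling_efficiency(STEP_SECONDS, GRADIENT_BYTES, n, LINK_BYTES_PER_S)
    print(f"{n:4d} chips: efficiency {eff:.1%}")
```

In this model, efficiency falls as the cluster grows because the communication term stays roughly constant while per-chip compute shrinks, so a faster interconnect can matter more than a faster chip.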
Manufacturing technology strongly influences performance and energy efficiency.
Many Chinese AI chips rely on external fabrication technologies. For example, Biren’s BR100 was built using TSMC’s 7‑nm process and advanced CoWoS packaging.
Huawei’s newer Ascend chips draw on both silicon produced on SMIC’s 7‑nm‑class process and earlier wafers produced before US export restrictions took effect.
In contrast, US‑designed chips typically rely on advanced foundries and supply chains capable of producing cutting‑edge nodes and packaging technologies.
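Energy efficiency can be approximated from the table above by dividing peak compute by rated power. The sketch below does this for the three chips that list an FP32 figure. The helper name `tflops_per_watt` is illustrative, and the inputs are the approximate ("~") values from the table, so the results are indicative only.

```python
def tflops_per_watt(peak_tflops: float, power_watts: float) -> float:
    """Peak compute per watt of rated board power."""
    return peak_tflops / power_watts


# (peak FP32 TFLOPS, approximate power in W), as listed in the comparison table.
FP32_SPECS = {
    "Biren BR100": (256.0, 550.0),
    "Biren BR104": (128.0, 300.0),
    "Cambricon MLU370-X8": (24.0, 250.0),
}

efficiency = {
    name: tflops_per_watt(tflops, watts)
    for name, (tflops, watts) in FP32_SPECS.items()
}

# Rank chips from most to least efficient on this peak-based metric.
ranking = sorted(efficiency, key=efficiency.get, reverse=True)
for name in ranking:
    print(f"{name:22s} {efficiency[name]:.3f} TFLOPS/W (peak FP32)")
```

Peak-per-watt figures reflect process node and design choices together, so they cannot isolate the contribution of manufacturing technology on their own.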
Hardware performance alone does not determine success in AI computing.
Developer tooling, frameworks, and cloud integration often determine which hardware is adopted for real‑world AI workloads.
Several patterns emerge from the current generation of AI accelerators:
The AI chip race is therefore not just a contest of transistor counts or TFLOPS; it is also a competition between ecosystems, manufacturing capability, and the ability to scale thousands of accelerators into efficient AI infrastructure.
As generative AI models continue to grow, these differences in architecture, memory systems, and software ecosystems will likely determine which platforms dominate future AI computing.