Answer | Published 30 April 2026 | Last edited 6 May 2026 | 9 sources

DeepSeek V4 and the 98% memory claim: the real story is the KV cache



The claim that DeepSeek V4 uses "98% less memory" sounds dramatic, and that is exactly where the most caution is needed. What the public information supports well is that DeepSeek V4 changed its architecture to cut KV cache and attention cost in long-context inference. However, neither DeepSeek's official API news, nor the model card, nor NVIDIA's technical write-up states a 98% reduction in the VRAM requirement of the whole model deployment as an official specification ^[5]^[13]^[14].

In plain terms: the KV cache can be a major share of GPU VRAM, especially with long conversations, long documents, or agent workflows. But the KV cache is not the whole of VRAM.

The safest conclusion

The most balanced statement about DeepSeek V4 right now is:

DeepSeek V4 reduces KV cache pressure in long-context inference through designs such as Hybrid Attention, Compressed Sparse Attention (CSA), and Heavily Compressed Attention (HCA); but the available sources do not show that the total VRAM requirement falls by 98% ^[13]^[14].

The distinction is practical. A team planning GPU purchases, a cloud budget, or serving capacity needs to treat "less KV cache" and "less total GPU memory" as separate questions.

What the official material says

According to DeepSeek's API news page, DeepSeek-V4 Preview was released on 24 April 2026 ^[5]. The DeepSeek V4 model card states that the series comprises DeepSeek-V4-Pro and DeepSeek-V4-Flash. It is described as a Mixture-of-Experts (MoE) language model series that retains the DeepSeekMoE framework and the Multi-Token Prediction (MTP) strategy while adding new architectural changes such as the Hybrid Attention Architecture ^[14].

The point most directly tied to memory savings is the attention mechanism. According to NVIDIA's technical post, V4's Compressed Sparse Attention (CSA) compresses KV entries through dynamic sequence compression to shrink the KV cache memory footprint, and then applies DeepSeek Sparse Attention (DSA) to sparsify the attention matrices and reduce computational overhead. Heavily Compressed Attention (HCA) shrinks the KV cache further by merging the KV entries of several token sets into a single compressed entry ^[13].

What this establishes is that DeepSeek V4 has been engineered for efficiency in KV cache size and attention computation. It does not by itself establish that total VRAM, including model weights, activations, framework overhead, and the rest of the serving stack, shrinks by the same proportion.
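To make the idea of "merging the KV entries of several tokens into one compressed entry" concrete, here is a toy sketch. It is not DeepSeek's operator: mean pooling, the `group_size` parameter, and the function name `merge_kv_entries` are all assumptions chosen only to show how the number of cached entries falls when groups of tokens share one entry.

```python
from typing import List

Vector = List[float]


def merge_kv_entries(entries: List[Vector], group_size: int) -> List[Vector]:
    """Mean-pool each run of `group_size` consecutive entries into one entry.

    Toy stand-in for a learned compression step; the real CSA/HCA
    compression is not specified in the sources quoted here.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    merged: List[Vector] = []
    for start in range(0, len(entries), group_size):
        group = entries[start:start + group_size]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Eight cached tokens, each with a 2-dimensional toy key (or value) vector.
cache = [[float(t), float(2 * t)] for t in range(8)]
compressed = merge_kv_entries(cache, group_size=4)
print(f"{len(cache)} entries -> {len(compressed)} entries")
```

The sketch only shows the bookkeeping: the cache holds one entry per group instead of one per token. How much accuracy such a merge preserves depends on the actual compression method, which this example does not model.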

98%, 90%, and 9.5x: do not conflate these three claims

In the sources currently available, the 98% claim appears most clearly in the title of a user-generated LinkedIn article, which says that DeepSeek Sparse Attention shrinks KV memory by 98% in real-world serving ^[21]. Such material can serve as a talking point or a lead, but it should not be treated as an official DeepSeek specification.

A third-party number that can be read with more care is 10% KV cache. Wccftech wrote that DeepSeek V4 uses only 27% of the single-token inference FLOPs and 10% of the key-value (KV) cache of DeepSeek V3.2 ^[20]. Read literally, that is roughly a 90% reduction in KV cache. But the comparison is against DeepSeek V3.2; it is not a universal claim covering every context length, batch size, hardware setup, or total VRAM ^[20].

Another headline described DeepSeek V4 as having 9.5x lower memory requirements ^[3]. Arithmetically, 1/9.5 leaves about 10.5% of the original requirement, which is roughly an 89.5% reduction. That is not 98% either, and it still needs to be made clear whether the figure refers to the KV cache, to a specific long-context scenario, or to total deployment memory ^[3].

Claim	Evidence status	Better reading
Total VRAM down 98%	No confirmation in official sources	Risky to state as a procurement or marketing specification ^[5]^[14]^[21]
Large reduction in KV cache	Supported by technical material	CSA/HCA compress long-context KV entries ^[13]
10% KV cache	Claimed in a third-party report	Can be read as roughly a 90% KV cache reduction versus V3.2, not a total VRAM reduction ^[20]
9.5x lower memory	Third-party headline	Equivalent to roughly an 89.5% reduction, but the scope needs to be made clear ^[3]
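The three figures use different units (a remaining share, a reduction factor, and a percentage reduction), so it helps to put them on one scale. The sketch below does that conversion; the helper names are made up for illustration, and the inputs are simply the numbers quoted above.

```python
def reduction_from_remaining(remaining_fraction: float) -> float:
    """'Uses 10% of X' means a reduction of 1 - 0.10."""
    return 1.0 - remaining_fraction


def reduction_from_factor(factor: float) -> float:
    """'N times lower' means 1/N remains, so the reduction is 1 - 1/N."""
    return 1.0 - 1.0 / factor


def factor_from_reduction(reduction: float) -> float:
    """Inverse view: which 'N times lower' factor matches a given reduction."""
    return 1.0 / (1.0 - reduction)


claims = {
    "10% KV cache remaining [20]": reduction_from_remaining(0.10),
    "9.5x lower memory [3]": reduction_from_factor(9.5),
}
for label, reduction in claims.items():
    print(f"{label}: {reduction:.1%} reduction")

# For comparison: the factor a 98% reduction would correspond to.
print(f"98% reduction corresponds to about {factor_from_reduction(0.98):.0f}x")
```

Seen as factors, roughly 10x (the 10% and 9.5x claims) and roughly 50x (the 98% claim) are far apart, which is why the three numbers should not be treated as restatements of each other.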

Why the KV cache is not the same as total VRAM

The KV cache becomes very important in long-context inference. Hugging Face's DeepSeek V4 write-up explains that in agentic workloads tool results keep being appended to the context, and every new token then pays an attention cost against the long context that came before it. This is why both single-token inference FLOPs and KV cache size grow with sequence length ^[17]. The GitHub version of the Hugging Face post describes the common failure modes of long-running agent tasks in the same way: the trace blows past the context budget, the KV cache fills the GPU, or tool-call round trips slow the task down ^[22].
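The growth with sequence length can be sketched with the textbook estimate for an uncompressed KV cache in a standard multi-head or grouped-query attention model. This is not V4's CSA/HCA layout, and every model dimension below (`n_layers`, `n_kv_heads`, `head_dim`, 2-byte elements) is a hypothetical value picked for illustration.

```python
def kv_cache_bytes(
    seq_len: int,
    batch_size: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """Uncompressed KV cache size: one key and one value vector
    per token, per layer, per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical dense-attention model, not DeepSeek V4's real configuration.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for seq_len in (8_192, 131_072, 1_000_000):
    size_gib = kv_cache_bytes(seq_len, batch_size=1, **config) / 2**30
    print(f"{seq_len:>9} tokens: {size_gib:8.2f} GiB")
```

Under this formula the cache grows linearly with both context length and batch size, which is the pressure that compression schemes aim to relieve.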

But VRAM in model serving is not used for the KV cache alone. The same LinkedIn article in which the 98% claim appears breaks VRAM demand into separate parts: shared weights, expert weights, activations, KV cache, and framework overhead ^[21]. In capacity planning, then, each component has to be counted separately. Even if the KV cache shrinks a lot in a given long-context workload, it does not follow automatically that total serving memory has also dropped by 90% or 98%.
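A short calculation shows why a component-level saving does not carry over to the total. The component sizes below are copied from the truncated snippet of the LinkedIn article [21] and are used purely as an illustrative workload; the helper `total_reduction` is hypothetical, and the sketch sums the listed components itself rather than relying on the total quoted in that snippet.

```python
# Component sizes in GB as listed in the snippet of [21] (illustrative only).
components_gb = {
    "shared_weights": 16.00,
    "expert_weights": 500.00,
    "activations": 447.74,
    "kv_cache": 895.48,
    "framework_overhead": 6.16,
}


def total_reduction(components: dict, kv_reduction: float) -> float:
    """Fraction of total memory saved when only the KV cache shrinks."""
    total = sum(components.values())
    saved = components["kv_cache"] * kv_reduction
    return saved / total


kv_share = components_gb["kv_cache"] / sum(components_gb.values())
print(f"KV cache share of total: {kv_share:.1%}")
for kv_reduction in (0.90, 0.98):
    overall = total_reduction(components_gb, kv_reduction)
    print(f"KV cache -{kv_reduction:.0%} -> total -{overall:.1%}")
```

With these figures the KV cache is a little under half of the total, so even a 98% cut in the KV cache removes less than half of overall memory. The total saving can never exceed the KV cache's share of the total.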

CSA/HCA is effective engineering, not a magic percentage

DeepSeek V4's direction matters because it targets the expensive attention and KV cache problem in long inference scenarios such as million-token contexts. According to NVIDIA's description, CSA compresses KV entries and sparsifies the attention matrices, and HCA shrinks the KV cache by merging the KV entries of several token sets into a single compressed entry ^[13].

The DeepSeek V4 technical report also covers inference and training infrastructure optimizations, such as a single fused kernel for MoE modules that helps overlap computation, communication, and memory access ^[2]. All of these efficiency improvements matter. Even so, they are not direct evidence for the claim that total VRAM is 98% lower.

What to look at when evaluating DeepSeek V4

If your use case involves long documents, long chat histories, or agent workflows, DeepSeek V4's KV cache compression may be genuinely useful. But rather than relying on the headline 98% figure, it is better to identify the real bottleneck in your own workload.

The most practical approach is to run benchmarks on your own context length, batch size, concurrency, serving engine, and hardware configuration. If the bottleneck really is the KV cache, V4's CSA/HCA design may help. But if the memory pressure comes from model weights, activations, framework overhead, or the concurrency strategy, a KV cache reduction will not translate into total VRAM savings in the same proportion ^[13]^[21]^[22].
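Before running real benchmarks, a back-of-the-envelope planner can frame what to measure. The sketch below is a simplification under stated assumptions: memory is split into a fixed part (weights, activations, overhead) and a per-session KV cache, and all the gigabyte figures are hypothetical placeholders to be replaced with your own measurements.

```python
import math


def max_sessions(vram_budget_gb: float, fixed_gb: float, kv_per_session_gb: float) -> int:
    """How many concurrent sessions fit once the fixed memory is paid for."""
    free_gb = vram_budget_gb - fixed_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / kv_per_session_gb)


def total_vram_gb(fixed_gb: float, kv_per_session_gb: float, sessions: int) -> float:
    """Total memory used at a given concurrency."""
    return fixed_gb + kv_per_session_gb * sessions


# Hypothetical deployment: all numbers are placeholders, not measurements.
BUDGET_GB = 640.0
FIXED_GB = 520.0          # weights + activations + framework overhead
KV_BASELINE_GB = 30.0     # per-session KV cache at the target context length
KV_COMPRESSED_GB = 3.0    # same session with the KV cache at 10% of baseline

baseline_sessions = max_sessions(BUDGET_GB, FIXED_GB, KV_BASELINE_GB)
compressed_sessions = max_sessions(BUDGET_GB, FIXED_GB, KV_COMPRESSED_GB)
before = total_vram_gb(FIXED_GB, KV_BASELINE_GB, baseline_sessions)
after = total_vram_gb(FIXED_GB, KV_COMPRESSED_GB, baseline_sessions)

print(f"Sessions that fit: {baseline_sessions} -> {compressed_sessions}")
print(f"Total VRAM at {baseline_sessions} sessions: {before:.0f} GB -> {after:.0f} GB "
      f"({(before - after) / before:.1%} lower)")
```

In this made-up case the fixed part dominates, so a 90% smaller KV cache lowers total VRAM at the same concurrency only modestly, while the number of sessions that fit in the budget rises sharply. Which of the two effects matters depends on the workload, which is why the measurement has to come from your own setup.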


Key takeaways

DeepSeek's official API news, model card, and technical descriptions do not present a 98% reduction in total VRAM for V4 as a formal specification; the evidence concerns long-context inference: Hybrid Attention, CSA/HCA and KV...
The more verifiable third-party claim is that DeepSeek V4 uses 27% of the single-token inference FLOPs and 10% of the KV cache of V3.2, which is roughly a 90% reduction in KV cache; the 98% claim comes mainly from LinkedIn's...


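Research conversation

You, 30 April 2026

Research question

Search for more information on "DeepSeek V4's secret: 98% less memory".

Studio Global AI, 30 April 2026, 30 sources

Cited answer

The key finding is this: the claim that DeepSeek V4 uses "98% less memory" has not been found directly confirmed in DeepSeek's official documents. The more reliable material only supports the direction of "a large reduction in KV cache / long-context memory", while the 98% figure appears to come mainly from third-party or social-media analysis and is insufficiently supported.

DeepSeek's official API documentation shows that DeepSeek-V4 Preview was released on 2026-04-24, with the models V4-Pro and V4-Flash ^[10].
The DeepSeek V4 technical documentation / model card shows that V4 comprises DeepSeek-V4-Pro and DeepSeek-V4-Flash, with a release date of 2026-04-24 and model card publication / update dates also in late April 2026 ^[3].
The technical core is not "mystery memory" but long-context attention compression: the material shows that V4 uses hybrid attention, including Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), to compress KV entries and to reduce the KV cache memory footprint and attention compute overhead ^[2].
NVIDIA's technical article also notes that CSA dynamically compresses the KV entries in a sequence and then uses sparse attention to reduce computation over the attention matrices; HCA is the more aggressive compression variant ^[2].
The abstract of the DeepSeek V4 paper says that V4 designs hybrid attention for long-context efficiency and retains architectural elements such as DeepSeekMoE and Multi-Token Prediction ^[4].
On "98% less memory": the search results include a LinkedIn article whose title claims "DeepSeek Sparse Attention Shrinks KV Memory by 98 Percent", but this is user-generated content with low credibility and should not be treated as an official conclusion ^[25].
A more conservative third-party reading is "about 90% less KV cache at 1M tokens" rather than 98%; this material is still third-party analysis and less reliable than official technical documents ^[13].
Another news headline says "9.5x lower memory requirements", which converts to a reduction of about 89.5% and is also not 98%; it still depends on which model is the baseline, which context length, and which batch / serving settings are being compared ^[7].
So the most accurate statement is: DeepSeek V4 substantially reduces KV cache pressure in long-context inference through CSA/HCA, low-precision storage and computation, MoE, fused kernels, and similar methods; but "98% less memory" is for now insufficiently supported and should not be taken as official fact ^[2]^[4]^[25].

If you want to turn this into a short-video or post title, safer versions are:

"How does DeepSeek V4 compress the KV cache so much?"
"98% less memory? The truth about DeepSeek V4"
"DeepSeek V4's secret is not magic, it is CSA + HCA attention compression"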

Sources

[2] DeepSeek-V4: Towards Highly Efficient Million-Token Context ... (PDF). huggingface.co
To enable efficient training and inference for DeepSeek-V4 series as well as productive development, we introduce several infrastructure optimizations. First, we design and implement a single fused kernel for MoE modules that fully overlaps computation, co...
[3] DeepSeek Releases V4 Models With 9.5x Lower Memory Requirements and Huawei Ascend Support. gHacks Tech News, ghacks.net
[5] DeepSeek V4 Preview Release. api-docs.deepseek.com
News: DeepSeek-V4 Preview Release 2026/04/24; DeepSeek-V3.2 Release 2025/12/01; DeepSeek-V3.2-Exp Release 2025/09/29; DeepSeek V3.1 Update 2025/09/22; DeepSeek V3.1 Release 2025/08/21; DeepSeek-R1-0528 Release 2025/05/28; DeepSeek-V3-0324 Release 202...
[13] Build with DeepSeek V4 Using NVIDIA Blackwell and GPU ... developer.nvidia.com
Compressed Sparse Attention (CSA): Leverages dynamic sequence compression to compress KV entries to reduce the KV cache memory footprint and then applies DeepSeek Sparse Attention (DSA) to sparsify the attention matrices and reduce computational overhead. H...
[14] DeepSeek V4 Technical Documentation - Model Card (PDF). fe-static.deepseek.com
DeepSeek V4 Technical Documentation. Publication date: April 27, 2026. Updated date: April 24, 2026. DeepSeek V4 - Model Card. General Information. Model Provider: DeepSeek AI. Model name: DeepSeek V4, including DeepSeek-V4-Pro and DeepSeek-V4-Flash. Release da...
[17] DeepSeek-V4: a million-token context that agents can actually use. huggingface.co
... every tool result is appended to the context, and every s...
[20] DeepSeek Aims At Memory Shortage With Latest AI Model But Might ... wccftech.com
By Ramish Zafar. Chinese artificial intelligence lab DeepSeek claims to significantly reduce computing resources required for tok...
[21] DeepSeek Sparse Attention Shrinks KV Memory by 98 Percent in Real World Serving. linkedin.com
Shared weights: 16.00 GB. Expert weights (MoE): 500.00 GB. Activations: 447.74 GB. KV cache: 895.48 GB. Framework overhead: 6.16 GB. Total compounded VRAM demand for parallel 8 users: 1,849.39 GB. Now pair this with real hardware. Suppose you deploy 30× NVIDIA H1...
[22] blog/deepseekv4.md at main, huggingface/blog. github.com
Focusing on long running agentic workloads. Running a frontier open model as an agent today breaks in predictable ways. The model stops. You reprompt. The trace blows past the context budget, or the KV cache fills the GPU, or tool-call round trips degrade h...
