
Zyphra ZAYA1-8B-Diffusion-Preview: A Diffusion LLM That Generates 16 Tokens at Once

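Zyphra's ZAYA1-8B-Diffusion-Preview converts the existing ZAYA1-8B MoE model to diffusion-style decoding, generating 16 tokens at a time, with reported decoding speedups of 4.6x to 7.7x. Block-wise parallel generation eases the memory-bandwidth bottleneck of conventional autoregressive decoding, which produces tokens one after another, and makes better use of GPU parallel compute.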


[Figure: Conceptual illustration of a diffusion language model generating multiple tokens in parallel. Diffusion-style language models can draft multiple tokens simultaneously instead of generating them sequentially.]

LLM Decoding Reworked as Diffusion

AI startup Zyphra has released ZAYA1-8B-Diffusion-Preview, an experimental model that speeds up inference without major changes to the existing language model architecture. Its core idea is to replace traditional autoregressive text generation with diffusion-based generation.

A typical LLM generates text sequentially, one token at a time. ZAYA1-8B-Diffusion-Preview is instead designed to generate a block of 16 tokens in a single step. According to Zyphra, this speeds up decoding by **about 4.6x (lossless sampler)** to **as much as 7.7x (logit-mixing sampler)**.
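The relationship between the 16-token block size and the reported speedups can be explored with a little arithmetic. The sketch below is only an illustration: it assumes every forward pass costs the same as one autoregressive step, which the article does not state, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 16  # tokens per block, as reported in the article


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def block_passes(num_tokens: int, passes_per_block: float,
                 block_size: int = BLOCK_SIZE) -> float:
    """Forward passes needed when each block takes `passes_per_block` passes.

    `passes_per_block` is an assumed knob, not a figure from Zyphra.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * passes_per_block


def implied_passes_per_block(speedup: float,
                             block_size: int = BLOCK_SIZE) -> float:
    """Passes per block that would produce `speedup` under equal-cost passes."""
    return block_size / speedup


# Ideal case: one pass per block gives a 16x reduction in passes.
ideal_ratio = autoregressive_passes(1024) / block_passes(1024, 1)

# Under the equal-cost assumption, the reported speedups would correspond to
# a few passes per block rather than exactly one.
lossless = implied_passes_per_block(4.6)
logit_mixing = implied_passes_per_block(7.7)
```

Under this simplified cost model, a speedup below 16x is what one would expect whenever a block needs more than one pass or a pass over 16 positions costs more than a single-token step.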

The model is also notable because it is not a diffusion model trained from scratch: it is an existing autoregressive LLM checkpoint converted into a diffusion model.

Base Model: ZAYA1-8B

Diffusion-Preview is built on ZAYA1-8B, a model Zyphra had already released.

ZAYA1-8B uses a Mixture-of-Experts (MoE) architecture. It has a little over 8 billion parameters in total, but only about 760 million of them are active during inference.

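In an MoE architecture, only some of the specialized subnetworks ("experts") are activated for a given input. As a result, compute cost is much lower than for a dense model of similar size, while performance can be maintained.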
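The selective activation described above can be sketched with generic top-k routing. This is a minimal illustration, not Zyphra's actual router: the scalar "experts", the router logits, and the `moe_forward` helper are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float,
                router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]],
                top_k: int = 1) -> Tuple[float, List[int]]:
    """Run only the top_k experts and mix their outputs by router weight."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Four toy experts; only two of them run for this input.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
out, used = moe_forward(3.0, [0.1, 2.0, -1.0, 0.5], experts, top_k=2)

# Share of parameters active per token, using the article's rounded figures.
active_fraction = 760e6 / 8e9
```

With the article's rounded figures, roughly a tenth of the parameters take part in any single token's forward pass.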

Why Conventional LLM Decoding Is Slow

Most language models use autoregressive generation.

The process is simple:

1. Generate the next token.
2. Update the KV cache (key-value cache).
3. Generate the following token, and repeat.

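The problem is that each token depends on the tokens before it, so the computation is hard to parallelize. Generation becomes an inherently sequential job, and repeated access to the KV cache often makes memory bandwidth the bottleneck.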
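The sequential loop and its repeated cache access can be sketched with a toy decoder. Everything here is a stand-in: `toy_next_token` is a hypothetical replacement for a real forward pass, and the "cache" is a plain list with one entry per token rather than real key and value tensors.

```python
from typing import List, Tuple


def toy_next_token(cache: List[int], vocab_size: int = 50) -> int:
    """Hypothetical forward pass: it reads every cached entry once."""
    return (sum(cache) + len(cache)) % vocab_size


def autoregressive_decode(prompt: List[int],
                          num_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens one at a time, counting cache entries read."""
    cache = list(prompt)   # stand-in for the KV cache
    generated: List[int] = []
    cache_reads = 0
    for _ in range(num_new_tokens):
        cache_reads += len(cache)       # each step re-reads the whole cache
        token = toy_next_token(cache)   # step N needs the result of step N-1
        cache.append(token)             # update the cache
        generated.append(token)
    return generated, cache_reads


tokens, reads = autoregressive_decode([1, 2, 3], num_new_tokens=4)
```

In this toy, the number of cache entries read grows with every step, and no step can start before the previous one has finished, which mirrors the dependency described above.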
