Architecture: 1.75 trillion total parameters, 298 billion active per token; 128 layers; 96 attention heads; 256 MoE experts with top-4 routing.
Decoding: discrete diffusion (not autoregressive).
Inference: KV-cache-free, with FlashAttention-3.
Benchmarks: MMLU 92.3%, HumanEval 89.7%, GSM8K 94.2%, SWE-bench 73.5%.
Hardware: 24x NVIDIA H200 GPUs.
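Top-4 routing means each token activates only 4 of the 256 experts: a router scores every expert, keeps the 4 highest scores, and renormalizes their softmax weights so the selected experts' outputs can be combined. A minimal sketch of generic top-k gating follows; the function name, shapes, and random scores are illustrative assumptions, not this model's actual router.

```python
import numpy as np

def top_k_route(logits: np.ndarray, k: int = 4):
    """Pick the k highest-scoring experts for one token and
    renormalize their softmax weights to sum to 1."""
    top = np.argsort(logits)[-k:][::-1]          # indices of the k largest logits
    w = np.exp(logits[top] - logits[top].max())  # numerically stable softmax over the chosen k
    return top, w / w.sum()

# Illustrative router scores over 256 experts (random, not real model output)
rng = np.random.default_rng(0)
logits = rng.normal(size=256)
experts, weights = top_k_route(logits, k=4)
print(experts, weights.round(3))
```

With 298B of 1.75T parameters active per token, roughly one-sixth of the expert capacity is exercised on any forward pass, which is the usual efficiency argument for sparse MoE layers.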