Cassandra T1 Model Family — Diffusion Language Model Architecture

Something from nothing.

Cassandra T1 is a 1.3-billion-parameter masked diffusion language model created by SOPHIA XT. Unlike autoregressive models such as Gemma-style architectures, which generate one token at a time, left to right, Cassandra T1 denoises all token positions in parallel, producing a complete sequence in 8-16 refinement steps.
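The parallel-denoising idea can be sketched as a small loop: start from an all-masked sequence, predict every masked position at once, commit the most confident predictions, and re-mask the rest. This is a minimal illustration only; `predict_fn`, the confidence scores, and the keep-fraction schedule are assumptions for the sketch, not Cassandra T1's actual decoder.

```python
MASK = "<mask>"

def denoise_step(tokens, predict_fn, keep_frac):
    """One parallel denoising step: predict every masked position
    simultaneously, then keep only the most confident fraction,
    leaving the rest masked for later steps.
    `predict_fn` is a stand-in for the model's forward pass and must
    return [(position, token, confidence), ...] for masked positions."""
    preds = predict_fn(tokens)
    preds.sort(key=lambda p: -p[2])          # most confident first
    n_keep = max(1, int(len(preds) * keep_frac))
    out = list(tokens)
    for pos, tok, _conf in preds[:n_keep]:
        out[pos] = tok
    return out

def generate(length, predict_fn, num_steps=8):
    """Generate `length` tokens in at most `num_steps` forward passes,
    unmasking a growing fraction of positions at each step."""
    tokens = [MASK] * length                 # start fully masked
    for step in range(num_steps):
        if MASK not in tokens:
            break
        # Linear schedule: the last step commits everything that remains.
        tokens = denoise_step(tokens, predict_fn, 1.0 / (num_steps - step))
    return tokens
```

With a toy `predict_fn`, a 512-token sequence is fully realized in the configured number of steps rather than 512 sequential passes, which is the source of the speedup claimed below.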

Demo and Benchmark Board

The Cassandra model-family page includes animated architecture diagrams, a masked-denoising demo mockup, code integration examples, and internal benchmark targets against a Gemma 4 autoregressive reference. The headline target is roughly 98% of the reference model's quality while using far fewer forward passes, with the advantage growing for longer generations. Final public benchmark numbers should be replaced with measured release evaluations when weights are published.

Architecture

1.33B parameters. 28 transformer layers. Grouped Query Attention (16 heads, 4 KV heads). SwiGLU FFN. RMSNorm. RoPE (theta=500K, 128K context). Sliding window attention (4096) + global tokens (256). BPE vocabulary of 32,768 tokens with spatial coordinate token support.
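The specification above can be collected into a configuration object. Values are taken directly from the spec; the class and field names are illustrative, not Cassandra T1's actual config schema, and dimensions not stated above (such as the hidden size) are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CassandraT1Config:
    """Architecture constants from the Cassandra T1 spec sheet.
    Field names are illustrative; only stated values are included."""
    n_layers: int = 28             # transformer layers
    n_heads: int = 16              # query heads (Grouped Query Attention)
    n_kv_heads: int = 4            # shared key/value heads
    rope_theta: float = 500_000.0  # RoPE base frequency
    context_length: int = 131_072  # 128K context window
    sliding_window: int = 4_096    # local attention span
    n_global_tokens: int = 256     # always-attended global tokens
    vocab_size: int = 32_768       # BPE vocabulary

    @property
    def gqa_group_size(self) -> int:
        # Number of query heads that share each KV head.
        return self.n_heads // self.n_kv_heads
```

With 16 query heads over 4 KV heads, each KV head serves a group of 4 query heads, shrinking the KV cache fourfold relative to full multi-head attention.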

Novel Contributions

Generation Speed

For a 512-token generation, autoregressive models (GPT, LLaMA) need 512 forward passes, one per token. Cassandra T1 needs 8-16 forward passes, generating all token positions in parallel at each step. This enables sub-second inference on consumer GPUs and roughly 2-3 seconds on phone hardware.
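The comparison above reduces to simple arithmetic: autoregressive cost scales with output length, while the diffusion cost is a fixed number of denoising steps. A small sketch (function name illustrative) makes the fixed-cost property explicit.

```python
def forward_passes(n_tokens: int, mode: str, denoise_steps: int = 16) -> int:
    """Forward passes needed to emit `n_tokens` of output.
    Autoregressive decoding pays one pass per token; masked diffusion
    pays a fixed number of denoising steps regardless of length."""
    if mode == "autoregressive":
        return n_tokens
    if mode == "diffusion":
        return denoise_steps
    raise ValueError(f"unknown mode: {mode}")
```

At 512 tokens and 16 denoising steps this is a 32x reduction in forward passes, and the ratio grows linearly with generation length.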

Training

Trained on 1M+ curated instruction samples including conversations, mathematical reasoning, code, spatial annotations, science, medical knowledge, and knowledge distillation from a larger pre-trained model. Multi-objective loss with spatial token upweighting. Beta(2,2) mask ratio sampling for PDE-matched importance distribution.

Applications

Part of the SOPHIA XT Model Family. Research by SOPHIA XT.
