Qwen has unveiled Qwen3-Max, its largest and most capable model to date, and the headline numbers are eye-catching: roughly 1 trillion parameters trained on 36 trillion tokens, delivered in a Mixture-of-Experts (MoE) architecture that emphasizes both training stability and throughput. The team says the preview of Qwen3-Max-Instruct reached the top three on the Text Arena leaderboard, and that the official release further improves coding and agent performance. You can try Qwen3-Max-Instruct via the Alibaba Cloud API or in Qwen Chat, with a Thinking variant under active training.
Key takeaways
- Scale & data: ~1T parameters; 36T tokens of pretraining data.
- Stable training: The MoE design yielded a smooth, spike-free loss curve—no rollbacks or data distribution tweaks required.
- Throughput gains: With PAI-FlashMoE multi-level pipeline parallelism, Qwen3-Max-Base achieved ~30% higher MFU (Model FLOPs Utilization) vs Qwen2.5-Max-Base.
- Long-context training: The ChunkFlow strategy delivered ~3× the throughput of context parallelism and enabled training with a 1M-token context length. (Note: this figure describes the training setup.)
- Resilience at scale: Tooling like SanityCheck and EasyCheckpoint plus pipeline scheduling reduced hardware-failure time loss to ~1/5 of that observed during Qwen2.5-Max training.
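The ~30% MFU gain cited above can be made concrete with the standard way MFU is estimated. The sketch below uses the common 6·N FLOPs-per-token approximation for a forward+backward pass; the formula is the conventional one from the literature, and all the numbers in the example are illustrative assumptions, not figures from the Qwen report.

```python
def model_flops_utilization(tokens_per_second: float,
                            activated_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """Approximate MFU via the common 6*N FLOPs-per-token rule of thumb
    for a forward+backward pass over N parameters.

    For an MoE model like Qwen3-Max, N should be the parameters
    *activated* per token, not the total parameter count.
    """
    achieved_flops = 6 * activated_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Illustrative numbers only (assumed, not from the report):
# 3.2e5 tokens/s, 50B activated params, 1024 GPUs at 312 TFLOP/s (A100 BF16).
mfu = model_flops_utilization(3.2e5, 50e9, 1024, 312e12)
print(f"MFU ≈ {mfu:.1%}")  # → MFU ≈ 30.0%
```

A 30% relative MFU improvement, as reported for PAI-FlashMoE, means a correspondingly higher `tokens_per_second` on the same hardware.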
Qwen3-Max-Base: architecture & training
Qwen3-Max follows the Qwen3 design paradigm with an MoE backbone. The training report highlights consistent stability across the run—no loss spikes—and emphasizes efficiency improvements from PAI-FlashMoE. For long-context training, ChunkFlow substantially boosted throughput and supported 1M-token training context. Combined with fault-tolerance tooling and scheduling tweaks, these changes reduced cluster-level downtime during ultra-large-scale training.
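The announcement does not describe ChunkFlow's internals, but the general idea behind chunk-based long-context training is simple: split a very long sequence into fixed-size chunks so that per-step activation memory is bounded by the chunk length rather than the full 1M-token context. A minimal sketch of the chunking step (the chunk size of 32k is an arbitrary assumption for illustration):

```python
from typing import Iterator, List

def chunk_sequence(tokens: List[int], chunk_len: int) -> Iterator[List[int]]:
    """Split one long training sequence into fixed-size chunks.

    Processing a 1M-token sequence as a stream of chunks keeps per-step
    memory proportional to chunk_len instead of the full context length.
    This is a generic sketch of the idea, not ChunkFlow itself.
    """
    for start in range(0, len(tokens), chunk_len):
        yield tokens[start:start + chunk_len]

# A 1M-token sequence processed as 32k-token chunks.
sequence = list(range(1_000_000))
chunks = list(chunk_sequence(sequence, 32_768))
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # → 31 32768 16960
```

The throughput advantage over context parallelism presumably comes from avoiding the cross-device communication that splitting a single sequence across GPUs requires, though the report does not spell this out.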
Qwen3-Max-Instruct: coding & agents step up
The Instruct variant is positioned as a top-tier general model with specific strengths in coding and tool use:
- On SWE-Bench Verified (real-world coding fixes), Qwen3-Max-Instruct reports a score of 69.6.
- On Tau2-Bench (agent tool-calling proficiency), it reports 74.8, which the team notes surpasses Claude Opus 4 and DeepSeek V3.1 on that benchmark.
- The preview ranked top-3 on the Text Arena leaderboard; the official release further boosts coding and agent capabilities.
You can access Qwen3-Max-Instruct via the Alibaba Cloud API or try it directly in Qwen Chat.
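Alibaba Cloud's Model Studio exposes an OpenAI-compatible endpoint, so the standard `openai` client can typically be pointed at it. The sketch below is a hedged example: the base URL and the model identifier `"qwen3-max"` are assumptions to verify against the official documentation, and `DASHSCOPE_API_KEY` must be set in your environment.

```python
import os

def make_messages(prompt: str) -> list:
    """Build a minimal chat payload in the OpenAI message format."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user, ".strip(", ") or "user", "content": prompt},
    ]

if __name__ == "__main__":
    # Requires `pip install openai`; endpoint and model id are assumptions.
    from openai import OpenAI
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    resp = client.chat.completions.create(
        model="qwen3-max",  # assumed model id; check the console listing
        messages=make_messages("Write a Python function that reverses a string."),
    )
    print(resp.choices[0].message.content)
```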
Qwen3-Max-Thinking: pushing reasoning with test-time compute
A separate Thinking variant is still in training but already demonstrates standout reasoning when paired with tools:
- With a code interpreter and parallel test-time compute, the model reports 100% on the challenging math-reasoning benchmarks AIME 25 and HMMT.
The team says they plan to release the Thinking model publicly after continued training.
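"Parallel test-time compute" usually means drawing many independent solutions and aggregating them, most commonly by majority vote over final answers (self-consistency). The sketch below shows that aggregation pattern with a deterministic stand-in sampler; Qwen's actual pipeline (including how the code interpreter is invoked) is not described in the announcement.

```python
from collections import Counter
from itertools import cycle

def majority_vote(answers: list) -> str:
    """Return the most frequent final answer across parallel samples."""
    return Counter(answers).most_common(1)[0][0]

def solve_with_parallel_samples(sample_fn, n_samples: int = 16) -> str:
    """Draw n independent solutions and aggregate by majority vote.

    `sample_fn` stands in for one full model call (possibly running a
    code interpreter to check its own arithmetic).
    """
    return majority_vote([sample_fn() for _ in range(n_samples)])

# Deterministic stand-in sampler: the right answer two times out of three.
fake = cycle(["204", "203", "204"])
print(solve_with_parallel_samples(lambda: next(fake), n_samples=9))  # → 204
```

The key property is that aggregation tolerates individual wrong samples: as long as the correct answer is the modal one, the voted result is right even when many runs fail.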
Source: