Qwen has unveiled Qwen3-Max, its largest and most capable model to date, and the headline numbers are eye-catching: roughly 1 trillion parameters trained on 36 trillion tokens, delivered in a Mixture-of-Experts (MoE) architecture that emphasizes both training stability and throughput. The team says the preview of Qwen3-Max-Instruct reached the top three on the Text Arena leaderboard, and that the official release further improves coding and agent performance. You can try Qwen3-Max-Instruct via the Alibaba Cloud API or in Qwen Chat, with a Thinking variant under active training.
Key takeaways
- Scale & data: ~1T parameters; 36T tokens of pretraining data.
- Stable training: The MoE design yielded a smooth, spike-free loss curve—no rollbacks or data distribution tweaks required.
- Throughput gains: With PAI-FlashMoE multi-level pipeline parallelism, Qwen3-Max-Base achieved ~30% higher MFU (Model FLOPs Utilization) vs Qwen2.5-Max-Base.
- Long-context training: The ChunkFlow strategy delivered ~3× the throughput of context parallelism and enabled training with a 1M-token context length. (Note: this figure describes the training setup.)
- Resilience at scale: Tooling like SanityCheck and EasyCheckpoint plus pipeline scheduling reduced hardware-failure time loss to ~1/5 of that observed during Qwen2.5-Max training.
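The ~30% MFU gain cited above can be made concrete with the standard way MFU is estimated. The sketch below uses the common 6·N FLOPs-per-token approximation for a forward+backward pass; the formula is the conventional one from the literature, and all the numbers in the example are illustrative assumptions, not figures from the Qwen report.

```python
def model_flops_utilization(tokens_per_second: float,
                            activated_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """Approximate MFU via the common 6*N FLOPs-per-token rule of thumb
    for a forward+backward pass over N parameters.

    For an MoE model like Qwen3-Max, N should be the parameters
    *activated* per token, not the total parameter count.
    """
    achieved_flops = 6 * activated_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Illustrative numbers only (assumed, not from the report):
# 3.2e5 tokens/s, 50B activated params, 1024 GPUs at 312 TFLOP/s (A100 BF16).
mfu = model_flops_utilization(3.2e5, 50e9, 1024, 312e12)
print(f"MFU ≈ {mfu:.1%}")  # → MFU ≈ 30.0%
```

A 30% relative MFU improvement, as reported for PAI-FlashMoE, means a correspondingly higher `tokens_per_second` on the same hardware.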
Qwen3-Max-Base: architecture & training
Qwen3-Max follows the Qwen3 design paradigm with an MoE backbone. The training report highlights consistent stability across the run—no loss spikes—and emphasizes efficiency improvements from PAI-FlashMoE. For long-context training, ChunkFlow substantially boosted throughput and supported 1M-token training context. Combined with fault-tolerance tooling and scheduling tweaks, these changes reduced cluster-level downtime during ultra-large-scale training.
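The announcement does not describe ChunkFlow's internals, but the general idea behind chunk-based long-context training is simple: split a very long sequence into fixed-size chunks so that per-step activation memory is bounded by the chunk length rather than the full 1M-token context. A minimal sketch of the chunking step (the chunk size of 32k is an arbitrary assumption for illustration):

```python
from typing import Iterator, List

def chunk_sequence(tokens: List[int], chunk_len: int) -> Iterator[List[int]]:
    """Split one long training sequence into fixed-size chunks.

    Processing a 1M-token sequence as a stream of chunks keeps per-step
    memory proportional to chunk_len instead of the full context length.
    This is a generic sketch of the idea, not ChunkFlow itself.
    """
    for start in range(0, len(tokens), chunk_len):
        yield tokens[start:start + chunk_len]

# A 1M-token sequence processed as 32k-token chunks.
sequence = list(range(1_000_000))
chunks = list(chunk_sequence(sequence, 32_768))
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # → 31 32768 16960
```

The throughput advantage over context parallelism presumably comes from avoiding the cross-device communication that splitting a single sequence across GPUs requires, though the report does not spell this out.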
Qwen3-Max-Instruct: coding & agents step up
The Instruct variant is positioned as a top-tier general model with specific strengths in coding and tool use:
- On SWE-Bench Verified (real-world coding fixes), Qwen3-Max-Instruct reports a score of 69.6.
- On Tau2-Bench (agent tool-calling proficiency), it reports 74.8, which the team notes surpasses Claude Opus 4 and DeepSeek V3.1 on that benchmark.
- The preview ranked top-3 on the Text Arena leaderboard; the official release further boosts coding and agent capabilities.
You can access Qwen3-Max-Instruct via the Alibaba Cloud API or try it directly in Qwen Chat.
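Alibaba Cloud's Model Studio exposes an OpenAI-compatible endpoint, so the standard `openai` client can typically be pointed at it. The sketch below is a hedged example: the base URL and the model identifier `"qwen3-max"` are assumptions to verify against the official documentation, and `DASHSCOPE_API_KEY` must be set in your environment.

```python
import os

def make_messages(prompt: str) -> list:
    """Build a minimal chat payload in the OpenAI message format."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user, ".strip(", ") or "user", "content": prompt},
    ]

if __name__ == "__main__":
    # Requires `pip install openai`; endpoint and model id are assumptions.
    from openai import OpenAI
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    resp = client.chat.completions.create(
        model="qwen3-max",  # assumed model id; check the console listing
        messages=make_messages("Write a Python function that reverses a string."),
    )
    print(resp.choices[0].message.content)
```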
Qwen3-Max-Thinking: pushing reasoning with test-time compute
A separate Thinking variant is still in training but already demonstrates standout reasoning when paired with tools:
- With a code interpreter and parallel test-time compute, the model reports 100% on the challenging math-reasoning benchmarks AIME 25 and HMMT.
The team says they plan to release the Thinking model publicly after continued training.
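"Parallel test-time compute" usually means drawing many independent solutions and aggregating them, most commonly by majority vote over final answers (self-consistency). The sketch below shows that aggregation pattern with a deterministic stand-in sampler; Qwen's actual pipeline (including how the code interpreter is invoked) is not described in the announcement.

```python
from collections import Counter
from itertools import cycle

def majority_vote(answers: list) -> str:
    """Return the most frequent final answer across parallel samples."""
    return Counter(answers).most_common(1)[0][0]

def solve_with_parallel_samples(sample_fn, n_samples: int = 16) -> str:
    """Draw n independent solutions and aggregate by majority vote.

    `sample_fn` stands in for one full model call (possibly running a
    code interpreter to check its own arithmetic).
    """
    return majority_vote([sample_fn() for _ in range(n_samples)])

# Deterministic stand-in sampler: the right answer two times out of three.
fake = cycle(["204", "203", "204"])
print(solve_with_parallel_samples(lambda: next(fake), n_samples=9))  # → 204
```

The key property is that aggregation tolerates individual wrong samples: as long as the correct answer is the modal one, the voted result is right even when many runs fail.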
Source: