Native MTP speculative decoding for Apple Silicon. Over 2× the decode speed at default model temperatures, using the model's built-in MTP heads. No external drafter.
Wizard handles model, mode, and surface (browser or terminal) on first run. After that, one keypress.
# Install via Homebrew brew install youssofal/mtplx/mtplx # Start chatting — wizard picks model, mode, and surface mtplx start
Most fast-decode tools cheat at temperature by matching greedy argmaxes — that silently breaks the target distribution. MTPLX accepts via the Leviathan–Chen probability ratio with residual (p − q)+ correction. Verified bit-exact against single-token AR.
The drafter is the target's own MTP heads. No second model in memory. No distillation. No external drafter to maintain.
Probability-ratio acceptance with residual (p − q)+ correction. Verified max_diff = 0.0 against reference single-token AR.
MLX source fork plus custom Metal kernels tuned for the verify hot path. Real OpenAI- and Anthropic-compatible serving stack on top.
Per cycle, the MTP head drafts K tokens, the target verifies all K in one batched forward, and the math decides — per position, exactly. A bonus token falls out for free when all K accept.
K tokens drafted from the target's own built-in MTP heads, with proposal probabilities q.
Target evaluates all K positions in one forward via GraphBank-compiled verify shapes.
Per-position acceptance via Leviathan–Chen rejection sampling. fp32 ratio path because BF16 underflows.
On rejection, sample a clean replacement from (p − q)+. Rejected drafts never enter committed history.
Committed-history KV writeback. Bonus token at K+1 falls out for free when every position accepted.
MTPLX D4 acceptance on Qwen3.6-27B is higher per position than vLLM's CUDA MTP-5, on the same prompts.