MTPLX accelerates decoding with the model's own multi-token-prediction heads. No second model, no distillation, no quality tricks. This page is the deep dive; the app is the easy part.
The drafter is the target's own MTP heads. No second model in memory. No distillation. No external drafter to maintain.
Probability-ratio acceptance with residual (p − q)+ correction. Verified max_diff = 0.0 in logits against reference single-token autoregressive (AR) decoding.
MLX source fork plus custom Metal kernels tuned for the verify hot path. Real OpenAI- and Anthropic-compatible serving stack on top.
Per cycle, the MTP heads draft K tokens, the target verifies all K in one batched forward pass, and each position is accepted or rejected by an exact probability-ratio test. When all K are accepted, a bonus token falls out for free.
K tokens drafted from the target's own built-in MTP heads, with proposal probabilities q.
Target evaluates all K positions in one forward pass via GraphBank-compiled verify shapes.
Per-position acceptance via Leviathan–Chen rejection sampling. The ratio is computed in fp32 because BF16 underflows.
On rejection, sample a clean replacement from (p − q)+. Rejected drafts never enter committed history.
Committed-history KV writeback. Bonus token at K+1 falls out for free when every position is accepted.
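The five steps above can be sketched in a few lines of NumPy. This is a toy illustration, not MTPLX code: the function names `sample` and `speculative_cycle` are hypothetical, the distributions are passed in as fixed arrays (a real decoder conditions each row on the tokens committed so far), and the KV cache and the batched verify pass are left out.

```python
import numpy as np

def sample(weights, rng):
    """Draw one index from a non-negative, possibly unnormalized weight vector."""
    w = np.asarray(weights, dtype=np.float64)
    return int(rng.choice(len(w), p=w / w.sum()))

def speculative_cycle(p_rows, q_rows, rng):
    """Run one draft/verify cycle over toy distributions (hypothetical helper).

    p_rows: (K+1, V) target probabilities; row K is only used for the bonus token.
    q_rows: (K, V) drafter probabilities that the K drafts are sampled from.
    Returns the tokens committed in this cycle.
    """
    # Keep the ratio math in fp32, following the note in step 3.
    p = np.asarray(p_rows, dtype=np.float32)
    q = np.asarray(q_rows, dtype=np.float32)
    K = q.shape[0]
    committed = []
    for i in range(K):
        draft = sample(q[i], rng)                        # step 1: draft from q
        accept_prob = min(1.0, float(p[i, draft] / q[i, draft]))  # step 3: min(1, p/q)
        if rng.random() < accept_prob:
            committed.append(draft)                      # accepted draft is committed
            continue
        # Step 4: the rejected draft is dropped; resample from (p - q)+.
        residual = np.maximum(p[i] - q[i], 0.0)
        committed.append(sample(residual, rng))
        return committed                                 # later drafts are discarded
    # Step 5: every draft was accepted, so position K+1 yields a bonus token.
    committed.append(sample(p[K], rng))
    return committed

rng = np.random.default_rng(0)
uniform = np.full((4, 5), 0.2)            # drafter matches target exactly
print(speculative_cycle(uniform, uniform[:3], rng))   # K = 3 drafts plus one bonus token
```

When the drafter and the target agree, every ratio is 1 and the cycle commits K + 1 tokens. The first rejection ends the cycle with exactly one replacement token.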
MTPLX D4 acceptance on Qwen3.6-27B is higher per position than vLLM's CUDA MTP-5, on the same prompts.
Speculative decoding is only worth it if the output distribution is untouched. MTPLX accepts each drafted token with the Leviathan–Chen probability min(1, p/q) and, on rejection, resamples from the normalized residual (p − q)+ = max(0, p − q). This is the textbook construction that provably preserves the target distribution at any temperature. We verify it empirically too: logits_max_abs_diff = 0.0 against single-token autoregressive decoding, across turns, with the session cache on.
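The distribution-preservation claim can be checked exactly for a single position on a toy vocabulary. The sketch below is an illustration under assumptions, not MTPLX's verification harness: `output_distribution` is a hypothetical helper, and `p` and `q` are made-up three-token distributions. A token x is emitted either because it was drafted and accepted, with probability q(x) · min(1, p(x)/q(x)) = min(p(x), q(x)), or because some draft was rejected and x was drawn from the normalized residual.

```python
import numpy as np

def output_distribution(p, q):
    """Exact one-position output distribution of accept-or-resample sampling."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    accepted = np.minimum(p, q)           # q(x) * min(1, p(x) / q(x))
    reject_mass = 1.0 - accepted.sum()    # total probability that a draft is rejected
    residual = np.maximum(p - q, 0.0)     # (p - q)+ before normalization
    if residual.sum() > 0:
        residual = residual / residual.sum()
    return accepted + reject_mass * residual

p = np.array([0.5, 0.3, 0.2])   # toy target distribution (assumed values)
q = np.array([0.2, 0.2, 0.6])   # toy drafter distribution (assumed values)
out = output_distribution(p, q)
print(out, np.abs(out - p).max())
```

The two terms sum back to p because the rejection mass, 1 − Σ min(p, q), equals the normalizer of the residual, Σ max(0, p − q). That cancellation holds for any p and q, which is why the drafter's quality affects speed but not the output distribution.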