The technical story

A native-MTP runtime,
not a wrapper.

MTPLX accelerates decoding with the model's own multi-token-prediction heads. No second model, no distillation, no quality tricks. This page is the deep dive; the app is the easy part.

01 · Native MTP

Single checkpoint.

The drafter is the target's own MTP heads. No second model in memory. No distillation. No external drafter to maintain.

02 · Exact at T

Leviathan–Chen, not argmax.

Probability-ratio acceptance with residual (p − q)+ correction. Verified max_diff = 0.0 against reference single-token AR.

03 · MLX-native

Built for Apple Silicon.

MLX source fork plus custom Metal kernels tuned for the verify hot path. Real OpenAI- and Anthropic-compatible serving stack on top.

How the cycle runs

One forward.
K verified tokens.

Per cycle, the MTP head drafts K tokens, the target verifies all K in one batched forward, and the math decides per position, exactly. A bonus token falls out for free when all K accept.

01 · Draft

MTP head proposes

K tokens drafted from the target's own built-in MTP heads, with proposal probabilities q.

02 · Verify

Batched target forward

Target evaluates all K positions in one forward via GraphBank-compiled verify shapes.

03 · Accept

Probability ratio

Per-position acceptance via Leviathan–Chen rejection sampling. fp32 ratio path because BF16 underflows.

04 · Repair

Residual correction

On rejection, sample a clean replacement from (p − q)+. Rejected drafts never enter committed history.

05 · Commit

+ bonus token

Committed-history KV writeback. Bonus token at K+1 falls out for free when every position accepted.

What sits on top of MLX

Custom Metal kernels we own.

Clients
Browser chat · Open WebUI · Claude Code · Cline · Continue · openai-python · anthropic-python
Serving API
/v1/chat/completions · /v1/messages · /v1/models · /health · /metrics · OpenAI and Anthropic compatible, streaming SSE
Engine
Engine sessions · SessionBank warm-prefix exact-state reuse · logits_max_abs_diff = 0.0 across turns
MTPLX runtime
Native-MTP speculative cycle · committed-history KV contract verified vs vLLM CUDA reference at cosine > 0.9998 through D5
Custom Metal
linear-gdn-from-conv-tape fused GDN verify kernel · verify_qmv small-M qmv · GraphBank compiled verify shapes · draft-only 4/3-bit LM head
MLX source fork
mlx-mtplx-0.31.2-qmm · small-M qmv retuned BN16 · 4-simdgroup · unroll_count(4) for verify shapes M=3..6
vs vLLM CUDA

Higher acceptance
at every depth.

MTPLX D4 acceptance on Qwen3.6-27B is higher per position than vLLM's CUDA MTP-5, on the same prompts.

MTPLX · D4 · Apple Silicon

D197.62%
D295.24%
D388.10%
D475.61%

vLLM · MTP-5 · CUDA

P192.70%
P277.00%
P363.00%
P450.90%
P543.00%
What it adds up to

From 28 to 63 tokens per second.

Speedup0.00×
Without MTPbaseline TPS
With MTPLXnative MTP
Qwen3.6-27B · MacBook Pro M5 Max · MLX target temp 0.6 / top_p 0.95 / top_k 20

Exactness is the contract.

Speculative decoding is only worth it if the output distribution is untouched. MTPLX accepts drafted tokens via the Leviathan–Chen probability ratio min(1, p/q) and, on rejection, resamples from the residual (p − q)+, the textbook construction that provably preserves the target distribution at any temperature. We verify it empirically too: logits_max_abs_diff = 0.0 against single-token autoregressive decoding, across turns, with the session cache on.