How MTPLX works · MTP speculative decoding on Apple Silicon

01 · Native MTP

Single checkpoint.

The drafter is the target's own MTP heads. No second model in memory. No distillation. No external drafter to maintain.

02 · Exact at T

Leviathan–Chen, not argmax.

Probability-ratio acceptance with residual (p − q)+ correction. Verified max_diff = 0.0 against reference single-token AR.

03 · MLX-native

Built for Apple Silicon.

MLX source fork plus custom Metal kernels tuned for the verify hot path. Real OpenAI- and Anthropic-compatible serving stack on top.

How the cycle runs

One forward.
K verified tokens.

Per cycle, the MTP head drafts K tokens, the target verifies all K in one batched forward, and the math decides per position, exactly. A bonus token falls out for free when all K accept.

01 · Draft

MTP head proposes

K tokens drafted from the target's own built-in MTP heads, with proposal probabilities q.

02 · Verify

Batched target forward

Target evaluates all K positions in one forward via GraphBank-compiled verify shapes.

03 · Accept

Probability ratio

Per-position acceptance via Leviathan–Chen rejection sampling. fp32 ratio path because BF16 underflows.

04 · Repair

Residual correction

On rejection, sample a clean replacement from (p − q)+. Rejected drafts never enter committed history.

05 · Commit

+ bonus token

Committed-history KV writeback. Bonus token at K+1 falls out for free when every position accepted.

What sits on top of MLX

Custom Metal kernels we own.

Clients

Browser chat · Open WebUI · Claude Code · Cline · Continue · openai-python · anthropic-python

Serving API

/v1/chat/completions · /v1/messages · /v1/models · /health · /metrics · OpenAI and Anthropic compatible, streaming SSE

Engine

Engine sessions · SessionBank warm-prefix exact-state reuse · logits_max_abs_diff = 0.0 across turns

MTPLX runtime

Native-MTP speculative cycle · committed-history KV contract verified vs vLLM CUDA reference at cosine > 0.9998 through D5

Custom Metal

linear-gdn-from-conv-tape fused GDN verify kernel · verify_qmv small-M qmv · GraphBank compiled verify shapes · draft-only 4/3-bit LM head

MLX source fork

mlx-mtplx-0.31.2-qmm · small-M qmv retuned BN16 · 4-simdgroup · unroll_count(4) for verify shapes M=3..6

vs vLLM CUDA

Higher acceptance
at every depth.

MTPLX D4 acceptance on Qwen3.6-27B is higher per position than vLLM's CUDA MTP-5, on the same prompts.

MTPLX · D4 · Apple Silicon

D197.62%

D295.24%

D388.10%

D475.61%

vLLM · MTP-5 · CUDA

P192.70%

P277.00%

P363.00%

P450.90%

P543.00%

What it adds up to

From 28 to 63 tokens per second.

Speedup0.00×

0TPS

Without MTPbaseline TPS

With MTPLXnative MTP

Qwen3.6-27B · MacBook Pro M5 Max · MLX target temp 0.6 / top_p 0.95 / top_k 20

Exactness is the contract.

Speculative decoding is only worth it if the output distribution is untouched. MTPLX accepts drafted tokens via the Leviathan–Chen probability ratio min(1, p/q) and, on rejection, resamples from the residual (p − q)+, the textbook construction that provably preserves the target distribution at any temperature. We verify it empirically too: logits_max_abs_diff = 0.0 against single-token autoregressive decoding, across turns, with the session cache on.