Native MTP · MLX · v0.1.0-preview.1
MTPLX MTPLX MTPLX MTPLX MTPLX

Run local LLMs
twice as fast.

Native MTP speculative decoding for Apple Silicon. Over 2× the decode speed at default model temperatures, using the model's built-in MTP heads. No external drafter.

Install

One install.
One wizard.

Wizard handles model, mode, and surface (browser or terminal) on first run. After that, one keypress.

~/$ — terminal
# Install via Homebrew
brew install youssofal/mtplx/mtplx

# Start chatting — wizard picks model, mode, and surface
mtplx start
Same prompt. Same temperature.

Twice as fast.
Still exact.

Most fast-decode tools cheat at temperature by matching greedy argmaxes — that silently breaks the target distribution. MTPLX accepts via the Leviathan–Chen probability ratio with residual (p − q)+ correction. Verified bit-exact against single-token AR.

Speedup2.24×
Without MTPbaseline TPS
With MTPLXnative MTP
Qwen3.6-27B · MacBook Pro M5 Max · MLX target temp 0.6 / top_p 0.95 / top_k 20
What MTPLX is

A native-MTP runtime,
not a wrapper.

01 — Native MTP

Single checkpoint.

The drafter is the target's own MTP heads. No second model in memory. No distillation. No external drafter to maintain.

02 — Exact at T

Leviathan–Chen, not argmax.

Probability-ratio acceptance with residual (p − q)+ correction. Verified max_diff = 0.0 against reference single-token AR.

03 — MLX-native

Built for Apple Silicon.

MLX source fork plus custom Metal kernels tuned for the verify hot path. Real OpenAI- and Anthropic-compatible serving stack on top.

How the cycle runs

One forward.
K verified tokens.

Per cycle, the MTP head drafts K tokens, the target verifies all K in one batched forward, and the math decides — per position, exactly. A bonus token falls out for free when all K accept.

01 — Draft

MTP head proposes

K tokens drafted from the target's own built-in MTP heads, with proposal probabilities q.

02 — Verify

Batched target forward

Target evaluates all K positions in one forward via GraphBank-compiled verify shapes.

03 — Accept

Probability ratio

Per-position acceptance via Leviathan–Chen rejection sampling. fp32 ratio path because BF16 underflows.

04 — Repair

Residual correction

On rejection, sample a clean replacement from (p − q)+. Rejected drafts never enter committed history.

05 — Commit

+ bonus token

Committed-history KV writeback. Bonus token at K+1 falls out for free when every position accepted.

What sits on top of MLX

Custom Metal kernels we own.

Clients
Browser chat · Open WebUI · Claude Code · Cline · Continue · openai-python · anthropic-python
Serving API
/v1/chat/completions · /v1/messages · /v1/models · /health · /metrics — OpenAI- and Anthropic-compatible, streaming SSE
Engine
Engine sessions · SessionBank warm-prefix exact-state reuse · logits_max_abs_diff = 0.0 across turns
MTPLX runtime
Native-MTP speculative cycle · committed-history KV contract verified vs vLLM CUDA reference at cosine > 0.9998 through D5
Custom Metal
linear-gdn-from-conv-tape fused GDN verify kernel · verify_qmv small-M qmv · GraphBank compiled verify shapes · draft-only 4/3-bit LM head
MLX source fork
mlx-mtplx-0.31.2-qmm · small-M qmv retuned BN16 · 4-simdgroup · unroll_count(4) for verify shapes M=3..6
vs vLLM CUDA

Higher acceptance
at every depth.

MTPLX D4 acceptance on Qwen3.6-27B is higher per position than vLLM's CUDA MTP-5, on the same prompts.

MTPLX · D4 · Apple Silicon

D197.62%
D295.24%
D388.10%
D475.61%

vLLM · MTP-5 · CUDA

P192.70%
P277.00%
P363.00%
P450.90%
P543.00%