MTPLX 1.0.0

MTPLX 1.0.0 is the first full release: a native macOS app and the mtplx command line working as one product, built for Apple Silicon.

If you are new here: MTPLX runs local language models using their own built-in multi-token prediction heads as a speculative drafter, with exact rejection sampling. The output distribution is the same as normal decoding, and decoding measured 1.6x faster on a 16 GB M4 Mac mini and up to 2.24x faster on an M5 Max.
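
For readers who want to see what exact rejection sampling means in practice, the accept-or-resample step for one drafted token can be sketched as follows. This is the standard speculative sampling rule, not MTPLX's engine code; the function name and the list-based probabilities are assumptions made for illustration.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """One rejection-sampling step for a single drafted token.

    p_target and q_draft are probability lists over the same vocabulary
    (hypothetical inputs; a real engine would work on logits or arrays).
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (p equals q everywhere): sample from the target.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False
```

Because a rejected token is redrawn from the residual distribution, the emitted token follows the target model's distribution no matter how good the draft is. A better draft only raises the acceptance rate, which is where the speedup comes from.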

The Mac app

New in 1.0.0, and the reason this release exists. Download the DMG, drag MTPLX to Applications, and the app does the rest:

Guided onboarding: checks your hardware, recommends a model that fits your memory, downloads it, installs its own Python engine and fan control, puts mtplx on your PATH, and tunes decoding depth on your machine.
A live dashboard: decode speed gauge, acceptance rate by draft depth, the verify waterfall, context usage, cache state, and an activity feed that tells the truth about what the server is doing.
Native chat with streaming, thinking cards, file attachments, web search, and LaTeX rendering.
One-click launches for OpenCode, Pi, Hermes, and Open WebUI against your local server.
A built-in AIME benchmark runner, so you can score a model yourself instead of trusting a chart.
Automatic updates through Sparkle, with the engine refreshed after every update. No Homebrew required at any point: release builds bundle a pinned Python interpreter.

New models

The 0.3.7 engine ran one verified model. The 1.0.0 catalog covers a range of machines, in speed, balance, and quality builds:

Gemma 4. Runs as an assistant pair, where the tuned control is the draft block size rather than depth. Long-context runs were verified to show no performance cliff.
Qwen 3.6 MoE (35B-A3B). Mixture-of-experts support including prequantized expert sidecars, normalized expert layouts, and hard blocks on layouts that cannot run correctly.
Qwen 3.5 (4B, 9B). Smaller machines get first-class models instead of a cut-down experience.
Qwen 3.6 27B remains the flagship, now in speed, quality, and FP16 builds.

The catalog is shared by the app and the CLI, and the default is chosen for your machine: chip generation picks the precision, and machines under 32 GB route to the 9B model because the 27B default cannot load safely there.

KV cache reuse, in memory and on disk

Two layers, one goal: never pay for the same tokens twice.

In RAM: warm-prefix reuse across turns and requests. Multi-turn chats and agent workloads like OpenCode hit the cache instead of re-processing the conversation, which is the difference between an agent that flows and one that stalls before every reply.
On SSD: session state persists to disk with enforced size caps. Quit the server, restart your Mac, come back tomorrow: the session restores near-instantly instead of re-processing thousands of tokens.
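
The idea behind warm-prefix reuse can be sketched in a few lines: compare the cached token sequence with the new request and prefill only what comes after the shared prefix. This is an illustrative sketch, not MTPLX's cache code; both function names are hypothetical.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the new request shares with the cache."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_prefill(cached_tokens, new_tokens):
    """Tokens that still need processing once the shared prefix is reused."""
    keep = shared_prefix_len(cached_tokens, new_tokens)
    return new_tokens[keep:]


# A follow-up turn that extends the previous conversation by two tokens.
previous = [1, 2, 3, 4]
follow_up = [1, 2, 3, 4, 9, 10]
remaining = tokens_to_prefill(previous, follow_up)  # only [9, 10] is new
```

In a multi-turn chat the new prompt is usually the old conversation plus one message, so the shared prefix covers almost everything and the work per turn stays roughly proportional to the new message.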

Concurrency

1.0.0 adds continuous batching: the server can interleave multiple requests instead of serializing them. Batching presets, a scheduler mode, and explicit caps (--max-active-requests, --decode-batch-max, --batch-wait-ms) control the behavior. Agent workloads, which fire many short requests, benefit the most.
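
To make the difference from serialized serving concrete, here is a toy simulation of continuous batching. It is a sketch under simplifying assumptions (every decode step advances each active request by one token, and prefill is ignored), not the MTPLX scheduler. The max_active parameter only mirrors the spirit of --max-active-requests.

```python
from collections import deque


def run_continuous_batching(requests, max_active):
    """Simulate decode steps for (name, tokens_to_generate) requests.

    Returns a list of steps, each listing the requests decoded in that step.
    """
    waiting = deque(requests)
    active = {}  # request name -> tokens still to generate
    steps = []
    while waiting or active:
        # Admit waiting requests as soon as a slot is free.
        while waiting and len(active) < max_active:
            name, remaining = waiting.popleft()
            active[name] = remaining
        # One decode step advances every active request by one token.
        steps.append(sorted(active))
        for name in list(active):
            active[name] -= 1
            if active[name] <= 0:
                del active[name]
    return steps


steps = run_continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_active=2)
```

In this run request "c" takes the slot freed by "b" after the first step, so all three requests finish in 3 steps. Serving them one at a time would take 3 + 1 + 2 = 6 steps.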

Smart fan mode

Fan control is no longer all-or-nothing. Smart mode ramps the fans up while the model is working and restores them when it goes idle. It works across the app, the CLI, and the server API, and survives handing a session from the app to a terminal client. The crash-safe watchdog from earlier releases still stands behind all of it: if MTPLX dies for any reason, your fans return to automatic.

A server built for agents

Most of the serving work this cycle came from running real coding agents against MTPLX and fixing everything that broke:

OpenCode, Pi, and Hermes each have a hardened lane: correct tool contracts, trimmed read-only toolsets, and long-context depth policy that keeps speculation effective deep into a session.
OpenAI stop sequences are honored across chat, completions, and the Anthropic endpoint.
/v1/completions streams tokens as they are generated, with real finish reasons and usage.
Cancellation is honest: cancelling a request, streaming or not, actually stops decode on the server.
A live metrics stream (server-sent events) powers the app's dashboard and is available to your own tools, alongside snapshot, thermal, and prefill-history endpoints.
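
If you want to consume the metrics stream from your own tools, a minimal server-sent events parser looks roughly like this. The sketch assumes each event carries a JSON payload in its data field, which these notes do not confirm, and the decode_tps field in the example is hypothetical. Check the actual endpoint for its path and schema.

```python
import json


def parse_sse(lines):
    """Yield the decoded JSON payload of each server-sent event.

    `lines` is any iterable of text lines, for example a streamed HTTP body.
    Assumes JSON payloads; this is not confirmed for the MTPLX stream.
    """
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment or keep-alive line
        if line == "":
            # A blank line ends the current event.
            if data:
                yield json.loads("\n".join(data))
                data = []
            continue
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # the format strips one leading space
            data.append(value)


sample = [
    'data: {"decode_tps": 42.5}\n',
    "\n",
    ": keep-alive\n",
    'data: {"decode_tps": 40}\n',
    "\n",
]
events = list(parse_sse(sample))
```

The parser only handles the data field and ignores event, id, and retry, which is enough for a stream of plain metric snapshots.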

Forge

Forge turns the engine into a model factory. Point it at a Hugging Face repo and it converts the model to MLX from AWQ, compressed-tensors, NVFP4, or BF16 sources, calibrates and trains the MTP adapter, and verifies the result on your hardware with quality gates that reject speed wins that degrade output. If you choose, it publishes back to the Hub with full provenance. Vision towers are preserved through conversion. Forge is available in the app as a full visual workflow and in the CLI as mtplx forge.

The AIME benchmark

The app and mtplx bench aime run a live 30-problem AIME benchmark with fully disclosed, coaching-free prompts: the prompt carries only the answer-format contract, and every run records its exact prompts and rescue policy so results are reproducible.

One product, two surfaces

mtplx start now detects the app's running server and attaches to it instead of loading a second copy of the model. The app and CLI share the same catalog, recommendations, and settings, and mtplx stop knows the app's port.

New commands

mtplx stop stops the running server cleanly.
mtplx settings get/set reads or changes live server settings.
mtplx bench aime [--quick] runs the benchmark from the terminal.
mtplx forge builds, verifies, and publishes MTP models.

Reliability and distribution

Release builds bundle a pinned CPython, so a pristine Mac needs no Homebrew and no Python.
The engine installs into an app-owned environment that ignores whatever pip configuration is on the machine, and packages load on macOS 14 and 15, not just the newest macOS.
An old mtplx on your PATH gets updated automatically instead of shadowing the new one. A newer one is left alone.
Busy ports resolve themselves: the app moves to the next free port with a banner, and the CLI tells you who owns a port and how to stop it.
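
The "next free port" behavior can be approximated with a short probe loop. This is a sketch of the general technique, not the app's implementation; the function name, the loopback host, and the attempt limit are all assumptions.

```python
import socket


def next_free_port(start, host="127.0.0.1", attempts=50):
    """Return the first port at or after `start` that can be bound on `host`."""
    for port in range(start, min(start + attempts, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is busy or not permitted, try the next one
            # The probe socket is closed on exit, so another process could
            # still claim the port before the caller binds it.
            return port
    raise RuntimeError(f"no free port found starting at {start}")
```

A server using this approach should still handle a bind failure on the returned port, since the check and the real bind are not atomic.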

Downloads

Mac app: mtplx.com/download
All releases and checksums: mtplx.com/releases
CLI: brew install youssofal/mtplx/mtplx