MTPLX 1.0.0

MTPLX 1.0.0 is the first full release: a native macOS app and the mtplx command line working as one product, built for Apple Silicon.

If you are new here: MTPLX runs local language models using their own built-in multi-token prediction heads as a speculative drafter, with exact rejection sampling. The output distribution is the same as normal decoding, and measured decode speed is 1.6x faster on a 16 GB M4 Mac mini and up to 2.24x faster on an M5 Max.
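For readers who want to see why rejection sampling keeps the output distribution unchanged, here is a minimal Python sketch of the standard accept-or-resample step from speculative sampling. It is an illustration, not MTPLX's code: the function name accept_or_resample and the dict-based distributions are assumptions made for this example.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """Decide the fate of one drafted token (illustrative, not MTPLX code).

    p_target and q_draft map token -> probability under the target model
    and the drafter. Returns (token, accepted).
    """
    p = p_target.get(draft_token, 0.0)
    q = q_draft[draft_token]  # a drafted token always has q > 0
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    r = rng.random() * total
    acc = 0.0
    chosen = None
    for tok, w in residual.items():
        if w <= 0.0:
            continue
        chosen = tok  # remember the last positive-weight token as a fallback
        acc += w
        if r < acc:
            break
    return chosen, False
```

Accepted tokens and resampled tokens together follow p_target exactly, whatever the drafter proposes. A better drafter only raises the acceptance rate, which is where the speedup comes from.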

The Mac app

New in 1.0.0, and the reason this release exists. Download the DMG, drag it to Applications, and the app does the rest:

New models

The 0.3.7 engine ran one verified model. The 1.0.0 catalog covers a range of machines, in speed, balance, and quality builds:

The catalog is shared by the app and the CLI, and the default is chosen for your machine: chip generation picks the precision, and machines under 32 GB route to the 9B model because the 27B default cannot load safely there.
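The memory rule above can be written as a one-line check. This Python sketch is only an illustration: the function name pick_default_model and the "9B"/"27B" labels are stand-ins for the real catalog entries, and the chip-generation-to-precision mapping is left out because these notes do not spell it out.

```python
def pick_default_model(memory_gb):
    """Return the default model size for a machine (illustrative only).

    Machines under 32 GB route to the 9B model because the 27B default
    cannot load safely there.
    """
    return "9B" if memory_gb < 32 else "27B"


if __name__ == "__main__":
    for gb in (16, 24, 32, 64):
        print(gb, "GB ->", pick_default_model(gb))
```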

KV cache reuse, in memory and on disk

Two layers, one goal: never pay for the same tokens twice.
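As a rough picture of how two cache layers can cooperate, the Python sketch below looks up the longest cached token prefix in memory first and falls back to disk, promoting disk hits into memory. Everything here is assumed for illustration (the class name TwoLayerPrefixCache, the hashing scheme, the pickle files); MTPLX's real cache keys and storage format are not described in these notes.

```python
import hashlib
import pickle
from pathlib import Path


class TwoLayerPrefixCache:
    """Toy prefix cache: a dict in memory, one file per entry on disk."""

    def __init__(self, disk_dir):
        self.memory = {}
        self.disk_dir = Path(disk_dir)
        self.disk_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _key(tokens):
        # Hash the token prefix so it can serve as a dict key and a filename.
        return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

    def put(self, tokens, kv_state):
        key = self._key(tokens)
        self.memory[key] = kv_state
        (self.disk_dir / key).write_bytes(pickle.dumps(kv_state))

    def longest_prefix(self, tokens):
        """Return (prefix_length, kv_state, layer) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            key = self._key(tokens[:n])
            if key in self.memory:
                return n, self.memory[key], "memory"
            path = self.disk_dir / key
            if path.exists():
                state = pickle.loads(path.read_bytes())
                self.memory[key] = state  # promote the disk hit into memory
                return n, state, "disk"
        return 0, None, "miss"
```

Only the tokens after the returned prefix length need a fresh prefill. The disk layer is what lets a restarted process skip work it already did in an earlier session.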

Concurrency

1.0.0 adds continuous batching: the server can interleave multiple requests instead of serializing them. Batching presets, a scheduler mode, and explicit caps (--max-active-requests, --decode-batch-max, --batch-wait-ms) control the behavior. Agent workloads, which fire many short requests, benefit the most.
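To make the effect of the caps concrete, here is a small Python simulation of continuous batching. It is a sketch under assumptions, not the MTPLX scheduler: the parameters are named after --max-active-requests and --decode-batch-max, each request is reduced to a count of tokens still to generate (at least one), and the --batch-wait-ms timing window is left out.

```python
from collections import deque


def run_continuous_batching(requests, max_active_requests, decode_batch_max):
    """Simulate decode steps; returns the request ids decoded in each step.

    requests: list of (request_id, tokens_to_generate), each count >= 1.
    """
    waiting = deque(requests)
    active = deque()  # entries are [request_id, tokens_remaining]
    steps = []
    while waiting or active:
        # Admit new requests the moment a slot frees up.
        while waiting and len(active) < max_active_requests:
            rid, remaining = waiting.popleft()
            active.append([rid, remaining])
        # Decode one token for up to decode_batch_max active requests.
        batch = [active.popleft() for _ in range(min(decode_batch_max, len(active)))]
        steps.append([rid for rid, _ in batch])
        for entry in batch:
            entry[1] -= 1
            if entry[1] > 0:
                active.append(entry)  # unfinished: back of the queue
    return steps


if __name__ == "__main__":
    print(run_continuous_batching([("a", 3), ("b", 1), ("c", 2)], 2, 2))
```

In this toy run, request c joins the batch as soon as b finishes, so six tokens take three decode steps instead of the six a serializing server would need. Short agent requests benefit for the same reason: they no longer queue behind a long generation.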

Smart fan mode

Fan control is no longer all-or-nothing. Smart mode ramps the fans when the model is working and restores them when it goes idle, works across the app, the CLI, and the server API, and survives handing a session from the app to a terminal client. The crash-safe watchdog from earlier releases still stands behind all of it: if MTPLX dies for any reason, your fans return to automatic.

A server built for agents

Most of the serving work this cycle came from running real coding agents against MTPLX and fixing everything that broke:

Forge

Forge turns the engine into a model factory. Point it at a Hugging Face repo and it converts the model to MLX (AWQ, compressed-tensors, NVFP4, and BF16 sources), calibrates and trains the MTP adapter, verifies the result on your hardware with quality gates that reject speed wins that degrade output, and publishes back to the Hub with full provenance if you choose. Vision towers are preserved through conversion. Available in the app as a full visual workflow and as mtplx forge.

The AIME benchmark

The app and mtplx bench aime run a live 30-problem AIME benchmark with fully disclosed, coaching-free prompts: the prompt carries only the answer-format contract, and every run records its exact prompts and rescue policy so results are reproducible.

One product, two surfaces

mtplx start now detects the app's running server and attaches to it instead of loading a second copy of the model. The app and CLI share the same catalog, recommendations, and settings, and mtplx stop knows the app's port.
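One common way to implement this kind of attach-or-launch decision is a quick TCP probe of the port the server would use. The Python sketch below shows the idea only: the function names are hypothetical, the port is supplied by the caller, and the actual detection mechanism in mtplx start may work differently.

```python
import socket


def server_is_listening(port, host="127.0.0.1", timeout=0.25):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_or_attach(port):
    """Attach to an existing server if one answers, otherwise launch a new one."""
    return "attach" if server_is_listening(port) else "launch"
```

A production check would also confirm that the listener really is an MTPLX server, for example by querying a status endpoint, before attaching to it.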

New commands

Reliability and distribution

Downloads