# DeepSeek-V4-Flash-MLX-Q4Q8
A mixed-precision MLX quantization of deepseek-ai/DeepSeek-V4-Flash,
intended for Apple-Silicon inference via vMLX (or any MLX-aware runtime
that loads models through `mlx_lm.utils.load`).
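A minimal smoke test outside vMLX, using plain mlx_lm (a sketch; assumes `pip install mlx-lm` and a local copy of the bundle):

```python
from mlx_lm import load, generate

# Load the quantized weights + tokenizer straight from the bundle directory;
# this exercises the same mlx_lm loading path that vMLX uses.
model, tokenizer = load("/path/to/DeepSeek-V4-Flash-MLX-Q4Q8")
print(generate(model, tokenizer, prompt="What is 17+28?", max_tokens=32))
```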
- Architecture: DeepSeek-V4 — 289.9 B total parameters, 256 routed
  experts (top-6 per token), 1 shared expert, 43 layers, MLA attention
  with `head_dim=512` and grouped output projection, mHC
  (Manifold-Constrained Hyper-Connections, `hc_mult=4`), `sqrtsoftplus` +
  hash routing for the first 3 layers.
- Quantization: standard MLX `affine` mode (output of `mx.quantize`, not
  TurboQuant). Tensor naming `<module>.{weight, scales, biases}`. Group
  size 32. Layout in safetensors:
  - routed experts (`layers.N.ffn.experts.E.{w1,w2,w3}`): 4-bit
  - attention (`layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}`): 8-bit
  - shared expert, embed_tokens, lm_head: 8-bit
  - norms, router gate, mHC params: fp16 (passthrough)
- On-disk size: 173 GB across 159 safetensors shards.
- Context: 1,048,576 tokens (`sliding-window=128` short-prompt-safe).
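For orientation, mlx_lm-style bundles record these widths in `config.json`'s quantization block as per-module overrides. A hypothetical excerpt, rendered as a Python dict (the module paths and layer index are illustrative, not verbatim entries from this bundle):

```python
# Hypothetical shape of the quantization block in config.json.
quantization = {
    "group_size": 32,
    "bits": 8,  # top-level default: attention, shared expert, embed/lm_head
    # per-module overrides (post-sanitize paths) for the routed experts:
    "model.layers.4.mlp.switch_mlp.gate_proj": {"group_size": 32, "bits": 4},
    "model.layers.4.mlp.switch_mlp.up_proj":   {"group_size": 32, "bits": 4},
    "model.layers.4.mlp.switch_mlp.down_proj": {"group_size": 32, "bits": 4},
    # norms, router gate, and mHC params carry no entry: fp16 passthrough
}
```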
## Usage with vMLX
The bundle is a drop-in replacement for the upstream FP4/FP8 release in vMLX 1.3.97+. Two non-obvious considerations:
### 1. Runtime patch required (`jang_tools.load_jangtq`)
vMLX's bundled `jang_tools.load_jangtq._patch_quant_config_inplace`
(`/Applications/vMLX.app/.../jang_tools/load_jangtq.py`) infers
quantization overrides from raw safetensors keys
(`model.layers.N.ffn.experts.E.w1`). These never match the
post-`sanitize()` module paths the MLX `Model` exposes
(`model.layers.N.mlp.switch_mlp.gate_proj`), so it overwrites this
bundle's correct config with unmatchable disk-keyed entries. After the
overwrite, mlx_lm's `class_predicate` falls through to the top-level
`bits=8` and the routed experts get wrapped as 8-bit modules. The
4-bit-packed weights then silently fail to load (with `strict=False`)
and the model produces BOS-token loops at inference.
The fix is a 4-line guard at the top of `_patch_quant_config_inplace`
that returns early when the user's config already has post-sanitize
overrides:

```python
# The bundle already ships post-sanitize overrides (keys like
# "model.layers.N.mlp..."), so skip the disk-key inference entirely.
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}
```
The accompanying `build_mlx_q4q8.sh` script's `patch_loader` step
applies this idempotently. See `requantization-plan.md` for the full
diagnosis.
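A sketch of what such an idempotent patch step can look like (the insertion logic here is illustrative, not the actual `patch_loader` implementation; the loader path inside vMLX.app is abbreviated above):

```python
from pathlib import Path

GUARD = (
    '    if existing_overrides and any(k.startswith("model.") for k in existing_overrides):\n'
    '        return {"action": "user_provided", "existing_overrides": len(existing_overrides)}\n'
)

def patch_loader(loader: Path) -> None:
    src = loader.read_text()
    if GUARD in src:          # idempotent: a second run is a no-op
        return
    needle = "def _patch_quant_config_inplace("
    head, sep, tail = src.partition(needle)
    if not sep:
        raise SystemExit("loader layout changed; patch by hand")
    # naive placement: right after the (assumed single-line) def signature
    sig, body = tail.split("\n", 1)
    loader.write_text(head + sep + sig + "\n" + GUARD + body)
```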
### 2. SimpleEngine only
vMLX auto-disables `--continuous-batching` for DSV4 because the
batched generator is incompatible with the model's 4-D mHC residual
stream. All requests go through `SimpleEngine`. Throughput on a
Mac Studio M3 Ultra (256 GB unified memory): ~22 tok/s decode,
~75 tok/s prefill.
## Serving
```bash
/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
  -m vmlx_engine.cli serve \
  /path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
  --served-model-name deepseek-v4-flash-mlx-q4q8 \
  --host 127.0.0.1 --port 8010 \
  --max-tokens 4096 \
  --tool-call-parser deepseek \
  --enable-auto-tool-choice
```
Then hit it with the OpenAI-compatible chat-completion API:
```bash
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash-mlx-q4q8",
    "messages": [{"role": "user", "content": "What is 17+28?"}],
    "max_tokens": 120
  }'
```
The model is reasoning-capable: `<think>...</think>` blocks land in
`reasoning_content`, and the final answer lands in `content`.
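Equivalently, via the OpenAI Python client (a sketch; assumes `pip install openai` and the server started as above):

```python
from openai import OpenAI

# vMLX ignores the API key, but the client requires one.
client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "What is 17+28?"}],
    max_tokens=120,
)

msg = resp.choices[0].message
# In thinking mode the parsed-out reasoning lands on reasoning_content;
# the attribute may be absent in chat mode, hence getattr.
print(getattr(msg, "reasoning_content", None))
print(msg.content)  # the final answer, e.g. "45"
```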
## Hardware requirements
- Apple Silicon (M1 Max / M2 Ultra / M3 Ultra recommended).
- Unified memory: ≥ 192 GB strongly recommended. The bundle's 173 GB
  working set, plus KV cache, plus the 70 % wired-limit headroom
  (configured automatically by
  `jang_tools.load_jangtq._apply_wired_limit_safe_default`) all need
  comfortable room. It will technically load on 128 GB with a reduced
  `--max-tokens`, but expect SSD pressure.
- macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.
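For reference, the macOS knob behind that wired-limit default is the `iogpu.wired_limit_mb` sysctl. A rough sketch of the kind of calculation involved (reading the 70 % figure above as a fraction of unified memory is an assumption; the authoritative logic lives in `jang_tools`):

```python
import subprocess

# Total unified memory in bytes (standard macOS sysctl).
total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())

# Assumption: wire up to 70 % of unified memory for GPU-resident tensors.
wired_mb = int(total_bytes * 0.70 / 2**20)
print(f"sudo sysctl iogpu.wired_limit_mb={wired_mb}")
```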
## Tool calling & reasoning
The bundle ships with the DSML tool-call grammar
(`|DSML|` / `<|tool_calls|>` / `<|invoke|>`); pair it with vMLX's
`--tool-call-parser deepseek --enable-auto-tool-choice`. Reasoning
modes:

- `chat` (default): direct response, no `<think>` block.
- `thinking`: emits `<think>...</think>`-wrapped reasoning, parsed out
  into `reasoning_content` by `DeepSeekR1ReasoningParser`.
Both modes set the `<|latest_reminder|>` anchor automatically — vMLX
adds a default system prompt (`DSV4: injected default system prompt`
in the load log) to keep multi-turn chat from running away on
reasoning loops.
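A minimal tool-calling request through the same OpenAI-compatible endpoint (a sketch; the `get_weather` tool schema is illustrative, not part of the bundle):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="not-needed")

# Hypothetical tool definition, just to exercise the DSML grammar.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "Weather in Berlin?"}],
    tools=tools,  # emitted calls are parsed by --tool-call-parser deepseek
)
print(resp.choices[0].message.tool_calls)
```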
## Quantization details
This release is the output of:

1. Convert from upstream FP4 (routed experts) + FP8 (others) using
   `jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang`.
2. Re-quantize the routed expert tensors from the FP4 source through
   `mx.quantize(..., group_size=32, bits=4, mode="affine")`. The
   upstream converter direct-copies FP4 onto disk in MXFP4 form
   (uint8 E8M0 scales, no biases) regardless of `--format`; vMLX's
   MXFP4 dispatch is broken at 4-bit and produces gibberish. The
   re-quantization step rewrites `.weight` + `.scales` + `.biases` for
   each of the 33,024 routed expert tensors using MLX's actual affine
   formula (matching `mlx/include/mlx/backend/metal/kernels/quantized.h:2387`):

   ```
   scale = max((w_max - w_min) / 15, eps)
   side  = abs(w_min) > abs(w_max)
   scale = side ? scale : -scale
   edge  = side ? w_min : w_max
   q0    = round(edge / scale)
   scale = (q0 != 0) ? edge / q0 : scale
   bias  = (q0 != 0) ? edge : 0
   ```

3. Rebuild `model.safetensors.index.json` to include the
   newly-introduced `.biases` keys.
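A condensed sketch of step 2 (the shard I/O is elided; the real `requantize` step streams shard-by-shard):

```python
import mlx.core as mx

# Illustrative: re-quantize one dequantized routed-expert tensor into
# MLX's affine Q4 layout. mx.quantize (with the kwargs quoted above)
# returns the packed weights plus per-group scales and biases.
def requantize_expert(w_fp16: mx.array) -> dict:
    w_q, scales, biases = mx.quantize(w_fp16, group_size=32, bits=4, mode="affine")
    return {"weight": w_q, "scales": scales, "biases": biases}

# The pipeline applies this to all 33,024 layers.N.ffn.experts.E.{w1,w2,w3}
# tensors and writes the three outputs back into the safetensors shards.
```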
## Size vs. quality tradeoff
This bundle is 173 GB on disk vs. ~149 GB for the upstream FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead. The extra space comes from MLX's affine quantization scheme:
- `group_size = 32` (vs. upstream's 128×128 blocks): finer-grained
  scales mean less quantization error per group, but more scale/bias
  metadata per tensor.
- non-experts at Q8 affine (vs. upstream FP8 block): keeps attention,
  router, shared expert, embed/lm_head at 8-bit affine, which is
  quality-sensitive and small in total — cheap to spend bits on.
- experts at Q4 affine (vs. upstream MXFP4): same nominal width, but
  affine adds per-group `bias` tensors that MXFP4 doesn't carry.
The choice is deliberate and quality-leaning rather than size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from published llama.cpp / MLX quantization studies — not measured on V4-Flash specifically):
| Knob | Size saved | Quality cost |
|---|---|---|
| group_size 32 → 64 | ~6–8 GB | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3 | ~30–40 GB | +2–6 % PPL, real degradation |
The current config is essentially lossless (<1 % PPL increase).
A more space-balanced alternative for 192 GB Macs: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — this saves
~6–8 GB, and the quality loss is in the noise. Going below Q4 on the
experts is where MoE models fall off a cliff (each token only sees 6
of 256 experts, so quantization noise does not average out across the
population), and `group_size=128` starts to bite on 1M-token contexts
where small per-token errors compound.
Net: the 24 GB overhead is the price of (a) MLX compatibility — there is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and (b) a config that errs on the side of preserving quality over shaving space.
The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max - min) / 15, bias = min`,
which does not match MLX's affine convention. Bundles produced that
way load, but they compound quantization error across the 43
transformer layers (activations explode by layer ~20, NaN by
layer ~29) and emit BOS-loop gibberish. Do not use that script.
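One way to sanity-check against MLX's convention is a dequantize round-trip with MLX's own reference implementation (a sketch; the test matrix is synthetic):

```python
import mlx.core as mx

# Quantize a random test matrix with MLX's reference implementation,
# then dequantize and compare. Correctly built {weight, scales, biases}
# triplets round-trip with small error; naive (max-min)/15 + bias=min
# scales do not.
w = mx.random.normal((256, 256))
w_q, scales, biases = mx.quantize(w, group_size=32, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=32, bits=4)
print(mx.abs(w - w_hat).max())  # small quantization error, no blow-up
```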
## Files in this bundle
```
.
├── config.json                      # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json                 # vMLX chat / reasoning / tool-call schema
├── generation_config.json           # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json            # embedded chat_template + special tokens
├── encoding/                        # DSV4 encoding adapter
├── model-00001-of-00159.safetensors # 159 shards, total ~173 GB
│   ...
├── model.safetensors.index.json
├── LICENSE
├── README.md                        # this file
├── README.upstream.md               # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf                  # upstream tech report
```
## Building from source
The full pipeline (download → convert → re-quantize → finalize → patch
→ verify) is automated in `build_mlx_q4q8.sh` (companion script in the
project repo). Quick reference of the steps:
```bash
./build_mlx_q4q8.sh check         # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader  # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download      # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert       # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize    # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize      # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch         # EOS / chat_template fixes
./build_mlx_q4q8.sh verify        # check the bundle
./build_mlx_q4q8.sh serve         # launch vMLX
```
`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on an
M3 Ultra: ~75 minutes plus the initial download (160 GB at ~150 MB/s ≈
18 minutes on a fast link).
See `requantization-plan.md` for the diagnostic write-up of why the
requantize step is needed.
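After a build, one quick manual check is that the rebuilt index carries `.biases` entries for the routed experts (a sketch; the expected count of 33,024 comes from the quantization details above):

```python
import json

with open("model.safetensors.index.json") as f:
    index = json.load(f)

# Every re-quantized routed-expert tensor contributes a .biases key.
biases = [k for k in index["weight_map"] if k.endswith(".biases")]
experts = [k for k in biases if ".experts." in k]
print(len(experts))  # expect 33,024 per the requantize step
```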
## License & attribution
This bundle is licensed under MIT, matching the upstream DeepSeek-V4-Flash license.
The original model and tech report are credited to the DeepSeek-AI team. Please cite their work when using this model:
```bibtex
@misc{deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2025},
  url    = {https://github.com/deepseek-ai/DeepSeek-V4}
}
```
The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing substantive to the science — it is purely a packaging artifact for running the model on Apple-Silicon hardware.
## Acknowledgments
- DeepSeek-AI for the base model and the open-source release.
- The MLX team at Apple for the framework and the `mlx.core.quantize`
  reference implementation.
- The vMLX team for the `jang_tools` tooling and the `load_jangtq`
  loader (modulo the patch noted above).