DeepSeek-V4-Flash-MLX-Q4Q8

A mixed-precision MLX quantization of deepseek-ai/DeepSeek-V4-Flash intended for Apple-Silicon inference via vMLX (or any MLX-aware runtime that loads models through mlx_lm.utils.load).

  • Architecture: DeepSeek-V4 — 289.9 B total parameters, 256 routed experts (top-6 per token), 1 shared expert, 43 layers, MLA attention with head_dim=512 and grouped output projection, mHC (Manifold-Constrained Hyper-Connections, hc_mult=4), sqrtsoftplus + hash routing for the first 3 layers.
  • Quantization: standard MLX affine mode (the output of mx.quantize, not TurboQuant). Tensor naming: <module>.{weight, scales, biases}. Group size 32. Layout in safetensors (a usage sketch follows this list):
    • routed experts (layers.N.ffn.experts.E.{w1,w2,w3}): 4-bit
    • attention (layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}): 8-bit
    • shared expert, embed_tokens, lm_head: 8-bit
    • norms, router gate, mHC params: fp16 (passthrough)
  • On-disk size: 173 GB across 159 safetensors shards.
  • Context: 1,048,576 tokens (sliding-window=128 short-prompt-safe).
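
For reference, producing one tensor in this layout looks roughly like the following (a minimal sketch; the shape is illustrative, and mode="affine" is the default on MLX versions that accept the keyword):

import mlx.core as mx

# Illustrative expert projection; the real tensors come from the FP4 source shards.
w = mx.random.normal((2048, 7168)).astype(mx.float16)

# Affine quantization: packed integer words plus per-group scales and biases,
# stored on disk as <module>.weight / <module>.scales / <module>.biases.
w_q, scales, biases = mx.quantize(w, group_size=32, bits=4, mode="affine")

# Round trip; w_hat differs from w only by the per-group quantization error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=32, bits=4, mode="affine")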

Usage with vMLX

The bundle is a drop-in replacement for the upstream FP4/FP8 release in vMLX 1.3.97+. Two non-obvious considerations:

1. Runtime patch required (jang_tools.load_jangtq)

vMLX's bundled jang_tools.load_jangtq._patch_quant_config_inplace (/Applications/vMLX.app/.../jang_tools/load_jangtq.py) infers quantization overrides from raw safetensors keys (model.layers.N.ffn.experts.E.w1). These never match the post-sanitize() module paths the MLX Model exposes (model.layers.N.mlp.switch_mlp.gate_proj), so it overwrites this bundle's correct config with unmatchable disk-keyed entries. After the overwrite, mlx_lm's class_predicate falls through to the top-level bits=8 default and the routed experts get wrapped as 8-bit modules. The 4-bit-packed weights then silently fail to load (vMLX loads with strict=False), and the model produces BOS-token loops at inference.

The fix is a 4-line guard at the top of _patch_quant_config_inplace that returns early when the user's config already has post-sanitize overrides:

# inside _patch_quant_config_inplace, before any disk-key inference
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    # post-sanitize module paths are already present; trust the user's config
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}

The accompanying build_mlx_q4q8.sh script's patch_loader step applies this idempotently. See requantization-plan.md for the full diagnosis.
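
After patching, a quick sanity check that the routed experts actually come up 4-bit (a sketch, assuming the post-sanitize switch_mlp naming shown above):

from mlx_lm import load

model, tokenizer = load("/path/to/DeepSeek-V4-Flash-MLX-Q4Q8")
for name, module in model.named_modules():
    bits = getattr(module, "bits", None)  # set on quantized MLX modules
    if bits is not None and ".switch_mlp." in name:
        assert bits == 4, f"{name} loaded at {bits}-bit: loader patch not applied?"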

2. SimpleEngine only

vMLX auto-disables --continuous-batching for DSV4 because the batched generator is incompatible with the model's 4-D mHC residual stream. All requests go through SimpleEngine. Throughput on Mac Studio M3 Ultra (256 GB unified memory): ~22 tok/s decode, ~75 tok/s prefill.

Serving

/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
  -m vmlx_engine.cli serve \
  /path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
  --served-model-name deepseek-v4-flash-mlx-q4q8 \
  --host 127.0.0.1 --port 8010 \
  --max-tokens 4096 \
  --tool-call-parser deepseek \
  --enable-auto-tool-choice

Then hit it with the OpenAI-compatible chat-completion API:

curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash-mlx-q4q8",
    "messages": [{"role": "user", "content": "What is 17+28?"}],
    "max_tokens": 120
  }'

The model is reasoning-capable: <think>...</think> blocks land in reasoning_content, and the final answer lands in content. The Python example below reads both fields.
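
The same request from Python with the openai client (a sketch; reasoning_content is a server-side extension field, so it is read defensively):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="unused")
resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "What is 17+28?"}],
    max_tokens=120,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think>...</think> body, if any
print(msg.content)                              # the final answer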

Hardware requirements

  • Apple Silicon (M2 Ultra or M3 Ultra recommended; see the memory requirement below).
  • Unified memory: ≥ 192 GB strongly recommended. The bundle's 173 GB working set plus KV cache must fit with headroom under the ~70 % wired-memory limit configured automatically by jang_tools.load_jangtq._apply_wired_limit_safe_default (a manual equivalent is sketched after this list). It will technically load on 128 GB with reduced --max-tokens, but expect SSD pressure.
  • macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.
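
For reference, a manual equivalent of the wired-limit helper (a sketch, assuming the 70 % default; this uses macOS's iogpu sysctl, needs root, and resets on reboot):

import subprocess

total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
wired_mb = int(total_bytes / (1024 * 1024) * 0.70)  # 70 % of unified memory
subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={wired_mb}"], check=True)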

Tool calling & reasoning

The bundle ships with the DSML tool-call grammar (|DSML| / <|tool_calls|> / <|invoke|>); pair it with vMLX's --tool-call-parser deepseek --enable-auto-tool-choice. An example request follows at the end of this section. Reasoning modes:

  • chat (default): direct response, no <think> block.
  • thinking: emits <think>...</think> wrapped reasoning, parsed out into reasoning_content by DeepSeekR1ReasoningParser.

Both modes set the <|latest_reminder|> anchor automatically: vMLX injects a default system prompt (logged as "DSV4: injected default system prompt") to keep multi-turn chat from running away into reasoning loops.
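
A tool-call round trip is a standard OpenAI-style function-calling request (a sketch; the get_weather tool is illustrative, not part of the bundle):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="unused")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "Weather in Oslo?"}],
    tools=tools,  # the DSML grammar plus the deepseek parser surface these as tool_calls
)
print(resp.choices[0].message.tool_calls)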

Quantization details

This release is the output of:

  1. Convert from upstream FP4 (routed experts) + FP8 (others) using jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang.
  2. Re-quantize the routed expert tensors from the FP4 source through mx.quantize(..., group_size=32, bits=4, mode="affine"). The upstream converter direct-copies FP4 onto disk in MXFP4 form (uint8 E8M0 scales, no biases) regardless of --format; vMLX's MXFP4 dispatch is broken at 4-bit and produces gibberish. The re-quantization step rewrites .weight + .scales + .biases for each of the 33,024 routed expert tensors using MLX's actual affine formula:
    scale = max((w_max - w_min) / 15, eps)
    side  = abs(w_min) > abs(w_max)
    scale = side ? scale : -scale
    edge  = side ? w_min : w_max
    q0    = round(edge / scale)
    scale = (q0 != 0) ? edge / q0 : scale
    bias  = (q0 != 0) ? edge      : 0
    
    (matches mlx/include/mlx/backend/metal/kernels/quantized.h:2387; a NumPy transliteration follows this list).
  3. Rebuild model.safetensors.index.json to include the newly-introduced .biases keys.
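
For reference, a NumPy transliteration of the per-group formula above (a sketch of the convention, not the kernel itself):

import numpy as np

def affine_quantize_group(w: np.ndarray, bits: int = 4):
    """Quantize one group (here, 32 weights) with MLX's affine convention."""
    n_bins = (1 << bits) - 1                   # 15 for 4-bit
    w_min, w_max = float(w.min()), float(w.max())
    scale = max((w_max - w_min) / n_bins, 1e-7)
    side = abs(w_min) > abs(w_max)
    scale = scale if side else -scale          # anchor the larger-magnitude edge
    edge = w_min if side else w_max
    q0 = round(edge / scale)
    if q0 != 0:
        scale = edge / q0                      # exact 0.0 becomes representable (q = -q0)
        bias = edge
    else:
        bias = 0.0
    q = np.clip(np.round((w - bias) / scale), 0, n_bins).astype(np.uint8)
    return q, scale, bias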

Size vs. quality tradeoff

This bundle is 173 GB on disk vs. ~149 GB for the upstream FP8 (non-experts) + FP4 (experts) release, about 24 GB of overhead. The extra space comes from MLX's affine quantization scheme (a back-of-envelope sketch follows this list):

  • group_size = 32 (vs. upstream's 128×128 blocks): finer-grained scales mean less quantization error per group, but more scale/bias metadata per tensor.
  • non-experts at Q8 affine (vs. upstream FP8 block): keeps attention, router, shared expert, embed/lm_head at 8-bit affine, which is quality-sensitive and small in total — cheap to spend bits on.
  • experts at Q4 affine (vs. upstream MXFP4): same nominal width, but affine adds per-group bias tensors that MXFP4 doesn't carry.
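
A back-of-envelope view of where that metadata overhead comes from, assuming fp16 scales and biases (32 bits of metadata per group):

def bits_per_weight(bits: int, group_size: int) -> float:
    # payload bits per weight + the per-group scale/bias share
    return bits + 32 / group_size

print(bits_per_weight(4, 32))  # 5.0 bits/weight: routed experts in this bundle
print(bits_per_weight(8, 32))  # 9.0 bits/weight: attention, shared expert, embed/lm_head
print(bits_per_weight(4, 64))  # 4.5 bits/weight: the group_size=64 alternative discussed below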

The choice is deliberate and quality-leaning rather than size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from published llama.cpp / MLX quantization studies — not measured on V4-Flash specifically):

| Knob | Size saved | Quality cost |
|------|------------|--------------|
| group_size 32 → 64 | ~6–8 GB | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3 | ~30–40 GB | +2–6 % PPL, real degradation |

The current config is essentially lossless (<1 % PPL increase). A more space-balanced alternative for 192 GB Macs: keep Q8 non-experts + Q4 experts but bump to group_size=64 — saves ~6–8 GB, quality loss is in the noise. Going below Q4 on the experts is where MoE models fall off a cliff (each token only sees 6 of 256 experts, so quantization noise does not average out across the population), and gs=128 starts to bite on 1M-token contexts where small per-token errors compound.

Net: the 24 GB overhead is the price of (a) MLX compatibility — there is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and (b) a config that errs on the side of preserving quality over shaving space.

The community mxfp4_to_affine.py script that ships in some upstream DSV4 conversion guides uses scale = (max-min)/15, bias = min, which does not match MLX's affine convention. Bundles produced that way load but compound quantization error across the 43 transformer layers (activations explode by layer ~20, NaN by layer ~29) and emit BOS-loop gibberish. Do not use that script.

Files in this bundle

.
├── config.json                        # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json                   # vMLX chat / reasoning / tool-call schema
├── generation_config.json             # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json              # embedded chat_template + special tokens
├── encoding/                          # DSV4 encoding adapter
├── model-00001-of-00159.safetensors  # 159 shards, total ~173 GB
│   ...
├── model.safetensors.index.json
├── LICENSE
├── README.md                          # this file
├── README.upstream.md                 # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf                    # upstream tech report

Building from source

The full pipeline (download → convert → re-quantize → finalize → patch → verify) is automated in build_mlx_q4q8.sh (companion script in the project repo). Quick reference of the steps:

./build_mlx_q4q8.sh check         # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader  # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download      # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert       # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize    # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize      # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch         # EOS / chat_template fixes
./build_mlx_q4q8.sh verify        # check the bundle
./build_mlx_q4q8.sh serve         # launch vMLX

./build_mlx_q4q8.sh all runs everything in order. Total runtime on M3 Ultra: 75 minutes plus the initial download (160 GB at ~150 MB/s = ~18 minutes on a fast link).

See requantization-plan.md for the diagnostic write-up of why the requantize step is needed.

License & attribution

This bundle is licensed under MIT, matching the upstream DeepSeek-V4-Flash license.

The original model and tech report are credited to the DeepSeek-AI team. Please cite their work when using this model:

@misc{deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2025},
  url    = {https://github.com/deepseek-ai/DeepSeek-V4}
}

The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing substantive to the science — it is purely a packaging artifact for running the model on Apple-Silicon hardware.

Acknowledgments

  • DeepSeek-AI for the base model and the open-source release.
  • The MLX team at Apple for the framework and the mlx.core.quantize reference implementation.
  • The vMLX team for the jang_tools tooling and the load_jangtq loader (modulo the patch noted above).