MiniMax-M2.7 — PrismaQuant 3.20 bpp (vLLM)

Mixed-precision quantization with joint expert pruning. Fits a 228B-parameter MoE in 90 GB on a single DGX Spark, served natively by vLLM with no patches.

Source: MiniMaxAI/MiniMax-M2.7 · Quantizer: prismaquant (commit pinned in this artifact's mixed_native_manifest.json)

TL;DR

| Metric | Value |
|---|---|
| Disk size | 90 GB (-58 % vs 215 GB FP8 source; -80 % vs ~456 GB BF16) |
| Achieved bpp | 3.20 |
| Format mix | 30,780 NVFP4 + 2,204 FP8_SOURCE Linears |
| Experts kept | 10,912 of 15,872 (69 %); 4,960 dropped via REAP saliency |
| Per-MoE-layer kept | uniform 176 of 256 (top-k = 8) |
| Decode throughput on Spark | ~14 tok/s (single-stream, temperature 0, 32k context) |
| vLLM patches required | 0 |

How it was produced

prismaquant solves a multi-choice knapsack over per-Linear cost / memory choices to land at a target bit budget. Two distinct contributions, plus a pareto sweep, go into this artifact:

1. Closed-form Δloss proxy

For each (Linear, format) pair, the cost is

$$\Delta\mathrm{loss} \approx \tfrac{1}{2} \cdot H_{\text{trace}} \cdot \mathrm{MSE}_W$$

  • H_trace is the empirical Fisher diagonal trace, captured in one streaming forward+backward pass over the calibration set. It measures how curved the cross-entropy loss is at this Linear: high H_trace means a small weight perturbation moves the loss a lot.
  • MSE_W is the measured per-format weight round-trip error (NVFP4, FP8, BF16). Not an analytical formula — we run RTN on the actual weights and compute the error directly.

Multiplying gives the second-order Taylor estimate of how much the model's loss will rise if you replace the BF16 weight with the format's quantized version. The allocator picks per-Linear formats that minimize total Δloss subject to a total-bit budget.
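
For concreteness, here is a minimal sketch of the proxy. It uses a plain symmetric-integer RTN as a stand-in for the real NVFP4 / FP8 round-trips (prismaquant measures MSE_W with the actual formats), and the function names and toy Linear are illustrative, not prismaquant's API:

```python
import torch

def rtn_mse(weight: torch.Tensor, n_bits: int, group: int = 32) -> float:
    """Round-to-nearest quantize per group of `group` weights, return the round-trip MSE."""
    w = weight.detach().float().reshape(-1, group)
    scale = (w.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1) - 1)).clamp(min=1e-12)
    q = torch.round(w / scale).clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return torch.mean((q * scale - w) ** 2).item()

def fisher_trace(weight: torch.Tensor) -> float:
    """Empirical Fisher diagonal trace for this Linear: sum of squared gradients."""
    assert weight.grad is not None, "run a calibration forward+backward first"
    return weight.grad.detach().float().pow(2).sum().item()

def delta_loss_proxy(weight: torch.Tensor, n_bits: int) -> float:
    # Second-order Taylor estimate: 0.5 * H_trace * MSE_W
    return 0.5 * fisher_trace(weight) * rtn_mse(weight, n_bits)

# Toy example: price one Linear at 4-bit vs 8-bit RTN (stand-ins for NVFP4 / FP8).
lin = torch.nn.Linear(4096, 4096)
lin.weight.grad = torch.randn_like(lin.weight) * 1e-4   # stand-in for a real backward pass
for bits in (4, 8):
    print(bits, delta_loss_proxy(lin.weight, bits))
```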

2. Joint expert-prune + format choice

For MoE layers, prismaquant treats each MoE choice as a pair:

(quantization_format, dropped_expert_ids)

Both the format and the prune set are priced in the same knapsack via REAP-style saliency:

$$S_j = \frac{1}{T_{\text{cal}}} \sum_t g_j(t) \cdot \lVert f_j(t) \rVert_2^2$$

This is the dropout-loss estimate from the REAP family of MoE expert-importance scores: ||f_j(t)||² captures how much the layer's output norm drops when expert j is removed, and g_j(t) weights it by the gradient signal flowing through that expert on calibration token t. Averaging over the T_cal calibration tokens gives each (router, expert) pair a score in Δloss units, directly comparable to the quantization Δloss, so a candidate prune set is priced as the sum of its experts' S_j.
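
A minimal streaming accumulator for that score, under the simplifying assumption that the gate/gradient weights g_j(t) arrive as a dense [tokens × experts] matrix and f_j(t) as per-expert outputs; in a real MoE only the routed experts produce outputs and the rest contribute g_j = 0. `ExpertSaliency` is an illustrative name, not prismaquant's API:

```python
import torch

class ExpertSaliency:
    """Streaming S_j = (1 / T_cal) * sum_t g_j(t) * ||f_j(t)||_2^2 per expert."""
    def __init__(self, num_experts: int):
        self.s = torch.zeros(num_experts, dtype=torch.float64)
        self.tokens = 0

    def update(self, gate: torch.Tensor, expert_out: torch.Tensor) -> None:
        # gate: [T, E] routing/gradient weights; expert_out: [T, E, D] per-expert outputs.
        sq_norm = expert_out.double().pow(2).sum(dim=-1)       # ||f_j(t)||_2^2  -> [T, E]
        self.s += (gate.double() * sq_norm).sum(dim=0)         # accumulate sum_t g_j(t) * ||.||^2
        self.tokens += gate.shape[0]

    def scores(self) -> torch.Tensor:
        return self.s / max(self.tokens, 1)

# Toy run: 256 experts, sparse top-k-ish gates, mark the lowest-S 31.25 % as prune candidates.
acc = ExpertSaliency(num_experts=256)
for _ in range(4):
    gate = torch.rand(2048, 256) * (torch.rand(2048, 256) < 8 / 256)
    acc.update(gate, torch.randn(2048, 256, 8))                # tiny hidden dim for the demo
drop_candidates = torch.argsort(acc.scores())[: int(0.3125 * 256)]
print(drop_candidates.numel())                                 # 80 lowest-S experts
```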

At each candidate ratio R, a layer's prune set is its floor(R · num_experts) lowest-S experts; the DP picks (R, format) jointly. After the pareto sweep, prismaquant produces a uniform-kept prune manifest so vLLM's MoE kernel sees a single num_local_experts per layer (this artifact: 176 of 256 kept everywhere).
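
A toy sketch of the joint pricing as a multi-choice knapsack, solved with a small DP over a discretized bit budget. The `allocate` signature, option tuples, and bucket size are illustrative assumptions; the real allocator handles the full per-Linear item set and is not shown here:

```python
from typing import List, Tuple

Option = Tuple[float, float, str]   # (bits, predicted delta-loss, label)

def allocate(items: List[List[Option]], bit_budget: float, step: float):
    """Pick exactly one option per item, minimizing total delta-loss s.t. total bits <= budget."""
    n = int(bit_budget / step) + 1
    INF = float("inf")
    best = [0.0] + [INF] * (n - 1)              # best[b]: min loss using exactly b*step bits so far
    back = []                                   # per-item backpointers for reconstruction
    for options in items:
        new, ptr = [INF] * n, [None] * n
        for b in range(n):
            if best[b] == INF:
                continue
            for bits, dloss, label in options:
                nb = b + int(round(bits / step))
                if nb < n and best[b] + dloss < new[nb]:
                    new[nb], ptr[nb] = best[b] + dloss, (b, label)
        best, back = new, back + [ptr]
    end = min(range(n), key=lambda i: best[i])  # cheapest feasible endpoint under the budget
    assert best[end] != INF, "bit budget too small for any assignment"
    picks, b = [], end
    for ptr in reversed(back):                  # walk backpointers, last item first
        b, label = ptr[b]
        picks.append(label)
    return best[end], picks[::-1]

# Toy example: four identical MoE layers, each choosing a (format, prune ratio) pair as one option.
GiB_bits = 8 * 1024 ** 3
layer_options = [[(7.2 * GiB_bits, 1.0, "FP8, keep 256"),
                  (3.6 * GiB_bits, 5.0, "NVFP4, keep 256"),
                  (2.5 * GiB_bits, 9.0, "NVFP4, keep 176")]] * 4
total, picks = allocate(layer_options, bit_budget=12 * GiB_bits, step=0.1 * GiB_bits)
print(total, picks)   # 32.0: one layer keeps all 256 experts, the other three prune to 176
```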

3. Pareto sweep + kneedle pick

Before committing to a target bit budget, prismaquant computes the full pareto curve. Below is the actual sweep that produced this artifact:

| target bpp | achieved | size on disk | predicted Δloss | NVFP4 super-Linears | FP8_SOURCE super-Linears | expert Linears dropped |
|---|---|---|---|---|---|---|
| 3.10 | 3.10 | 88.4 GB | 5,518 | 227 | 83 | 17,856 |
| 3.16 | 3.16 | 90.1 GB | 3,775 ← kneedle | 279 | 31 | 14,880 |
| 3.20 | 3.20 | 91.2 GB | 3,734 | 272 | 38 | 14,880 ← shipped |
| 3.25 | 3.25 | 92.6 GB | 3,733 | 271 | 39 | 14,880 |
| 3.30 | 3.30 | 94.1 GB | 3,733 | 236 | 74 | 14,880 |
| 3.40 | 3.40 | 96.9 GB | 3,732 | 199 | 111 | 14,880 |
| 3.50 | 3.50 | 99.7 GB | 2,496 | 268 | 42 | 11,904 |
| 3.60 | 3.60 | 102.6 GB | 2,495 | 217 | 93 | 11,904 |

The Δloss plateau between 3.16 and 3.40 (3,732-3,775) shows that extra bits in that band buy almost no additional quality at the fixed prune ratio; the allocator is already squeezing most of the available signal there. The sharp drop at 3.50 (-33 %) comes from relaxing the prune ratio (14,880 → 11,904 pruned expert Linears, i.e. 4,960 → 3,968 dropped experts). The user-specified target was the 90-95 GB band; 3.20 was picked as the smallest practical size that captures essentially all the available quality in the band.
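
For reference, a minimal knee pick over the sweep above, using the max-distance-to-chord form of the Kneedle idea on min-max normalized axes (prismaquant's actual picker may differ; the points are copied from the table):

```python
points = [  # (target bpp, predicted delta-loss) from the sweep above
    (3.10, 5518), (3.16, 3775), (3.20, 3734), (3.25, 3733),
    (3.30, 3733), (3.40, 3732), (3.50, 2496), (3.60, 2495),
]

def knee(points):
    xs, ys = zip(*points)
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]          # normalize bpp to [0, 1]
    ny = [(y - min(ys)) / (max(ys) - min(ys)) for y in ys]     # normalize delta-loss to [0, 1]
    x0, y0, x1, y1 = nx[0], ny[0], nx[-1], ny[-1]
    def dist(x, y):                                            # distance to the first-last chord
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / \
               ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return max(range(len(points)), key=lambda i: dist(nx[i], ny[i]))

print(points[knee(points)])   # (3.16, 3775): the row marked "kneedle"; 3.20 was shipped for the size band
```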

Format mix on disk

NVFP4       : 30,780 Linears (93.3 %)  — experts + most attention/MLP projections
FP8_SOURCE  :  2,204 Linears  (6.7 %)  — passthrough of natively-FP8 source weights
BF16        :     62 routers          — output dim shrunk to kept-expert count
PRUNED      : 14,880 Linear slots     — 4,960 experts × 3 weights, dropped per REAP

Sample of per-layer assignments:

| Layer | Format mix |
|---|---|
| L00 (dense pre-MoE) | 532 FP8_SOURCE |
| L01 (dense pre-MoE) | 532 FP8_SOURCE |
| L02 (first MoE) | 532 NVFP4 |
| L30 (mid MoE) | 529 NVFP4 + 3 FP8_SOURCE |
| L61 (last layer) | 532 NVFP4 |

The allocator kept the early dense layers (which dominate the semantic embedding pathways) at FP8 for safety, then dropped the bulk of the MoE expert weights to NVFP4 once their 0.5 · H_trace · MSE_W cost showed that was safe. A few attention projections in mid layers stayed pinned at FP8_SOURCE where the per-Linear sensitivity flagged NVFP4 as too aggressive.
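
If you want to sanity-check the mix in the downloaded artifact, a small sketch that tallies tensor dtypes by parsing each shard's safetensors header (no tensor data is read). It counts tensors (packed weights, scales, routers), not Linears, so totals will differ from the table above; the glob assumes the standard HF sharded layout:

```python
import json, struct
from collections import Counter
from pathlib import Path

def dtype_histogram(checkpoint_dir: str) -> Counter:
    counts = Counter()
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]   # safetensors: 8-byte LE header length
            header = json.loads(f.read(header_len))          # then a JSON map: name -> dtype/shape/offsets
        for name, meta in header.items():
            if name != "__metadata__":
                counts[meta["dtype"]] += 1
    return counts

print(dtype_histogram("."))   # run inside the repo; expect entries like U8 (packed NVFP4), F8_E4M3, BF16, F32
```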

Calibration

Calibration data: cal-mix-v1, a multi-domain mix balancing agentic, math, and coding sequences:

  • Agentic: tool-call traces, multi-step reasoning chains, planning + execution dialogues
  • Math: word problems, step-by-step solutions, symbolic manipulation
  • Coding: Python / Rust / SQL / shell, both authoring and reading patterns

Volume: 32 chunks × 4 samples × 2048 seq-len ≈ 262 k tokens through the streaming probe. Each chunk runs a phase-1 forward pass (saliency capture) plus a phase-3 reverse sweep (per-Linear Fisher). The chunks share the same multi-domain composition, so every calibration token informs all three downstream regimes.
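
A stripped-down sketch of what one calibration pass looks like, assuming an HF-style causal LM small enough to backprop in one piece; the real probe shards the model, caches activations, and defers Fisher sync, none of which is shown. `chunks` is a hypothetical iterator yielding (input_ids, labels) batches from cal-mix-v1:

```python
import torch

def stream_fisher_trace(model, chunks, device="cuda"):
    """One streaming forward+backward per batch; accumulate per-Linear Fisher diag traces."""
    traces = {n: 0.0 for n, p in model.named_parameters() if p.ndim == 2}
    model.train()
    for input_ids, labels in chunks:
        model.zero_grad(set_to_none=True)
        loss = model(input_ids.to(device), labels=labels.to(device)).loss   # HF causal-LM loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.ndim == 2 and p.grad is not None:
                traces[n] += p.grad.float().pow(2).sum().item()             # running H_trace per Linear
    return traces

# traces = stream_fisher_trace(model, calibration_batches)   # feeds the delta-loss proxy above
```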

prismaquant's per-domain-saliency feature exists (allocator can use union / intersection / mean across domains), but for this release the calibration was domain-merged. Per-domain runs are a follow-up.

Quality

Spot-checked at temperature 0 across agentic / math / coding:

| Test | Result |
|---|---|
| Multi-segment train problem (math) | Step-by-step reasoning, exact answer 240 mi / 68.571 mph |
| Python `is_palindrome` | Clean, correct |
| Python `quicksort` | Clean, correct |
| Python `binary_search` | Clean, correct |
| Python `longest_substring_without_repeat` | Sliding-window, correct |
| Python `merge_two_lists` (linked list) | Clean, correct |
| Python `fibonacci` | Iterative, with worked example |
| Rust `Point::distance` | Uses `.hypot()` (numerically stable) |
| SQL top-5 customers by 2024 volume | Clean, proper date-range filter |
| Tool calling | Clean function-call JSON emission |
| Reasoning content via `<think>` | Captured by `--reasoning-parser minimax_m2` |

Formal benchmarks (MMLU, GSM8K, HumanEval) deferred. The artifact is positioned as fits-on-Spark + serves-coherently across the three calibration domains; rigorous benchmark numbers in a follow-up release.

Serving

vllm serve <this-repo> \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2

Recommended on UMA hardware: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to keep the CUDA allocator from hoarding freed blocks.
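
Once the server is up, tool calls and reasoning content come back through the OpenAI-compatible API. A minimal client sketch: the port, API key, and the get_weather tool are placeholders, and reasoning_content is the field vLLM's reasoning parser populates:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_id = client.models.list().data[0].id            # whatever name the server registered

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                         # placeholder tool for the demo
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=tools,
    temperature=0,
)

msg = resp.choices[0].message
print(msg.tool_calls)                                  # parsed by --tool-call-parser minimax_m2
print(getattr(msg, "reasoning_content", None))         # <think> content via --reasoning-parser minimax_m2
```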

Limitations / caveats

  • Calibration scale: 262 k tokens is moderate. Heavy reasoning-chain or long-context workloads may benefit from a re-export with more diverse calibration.
  • Domain-merged saliency: per-domain prune policies (union/intersection) supported by prismaquant but not exercised here. A re-export with domain-tagged calibration is a candidate next iteration.
  • No formal benchmarks yet: MMLU / GSM8K / HumanEval pending. Headline result is "fits + coherent across cal-mix".
  • No MTP head: MiniMax-M2 ships without an MTP head (unlike Qwen3.5/3.6), so there is no speculative-decoding accelerator.
  • Pruned experts are gone: 4,960 of 15,872 (31 %) dropped per REAP. Tasks heavily dependent on those specific experts could see degradation; empirical probes showed none on agentic/math/coding prompts.

Reproduction

This artifact was produced by:

# 1. Probe + cost (multi-chunk, adaptive sampling, deferred Fisher sync)
python -m prismaquant.multi_chunk_probe \
  --chunks-dir /work/chunks \
  --model <minimax-m2.7-snapshot> \
  --output /work/artifacts/probe.pkl \
  --activation-cache-dir /work/act \
  --work-dir /work/work \
  --layers-per-shard 4 --unified-sweep \
  --no-include-mtp --no-include-visual --no-include-lm-head \
  --prefetch-lookahead 4 --prefetch-workers 2 \
  --activation-rows-limit 256 \
  --calibration-modality text-only \
  --retain-cross-chunk-cache \
  --adaptive-sampling \
  --run-cost --cost-output /work/artifacts/cost.pkl \
  --cost-formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16

# 2. Allocator (target_bits=3.20 picks the kneedle within the 90-95 GB band)
python -m prismaquant.allocator \
  --probe /work/artifacts/probe.pkl \
  --costs /work/artifacts/cost.pkl \
  --formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16 \
  --target-bits 3.20 \
  --pareto-targets 3.10,3.16,3.20,3.25,3.30,3.40,3.50,3.60 \
  --enable-expert-prune \
  --prune-ratios 0.0,0.125,0.1875,0.25,0.3125,0.375 \
  --prune-alpha 0.15 \
  --layer-config /work/artifacts/layer_config_prune.json

# 3. Export (native compressed-tensors, GPTQ + scale-sweep activation-aware)
python -m prismaquant.export_native_compressed \
  --model <minimax-m2.7-snapshot> \
  --layer-config /work/artifacts/layer_config_prune.json \
  --prune-manifest /work/artifacts/layer_config_prune.json.prune.json \
  --output /work/exported \
  --activation-cache-dir /work/act \
  --device cuda

Full source + reproduction notes: https://github.com/RobTand/prismaquant

Acknowledgements

  • MiniMaxAI — source model.
  • vLLM — compressed-tensors serving stack with native NVFP4 + FP8 MoE kernels.
  • REAP-style per-expert dropout-loss saliency.
  • HAQ / HAWQ-V1/V2/V3 (Wang, Dong, Yao, et al.) — mixed-precision allocation foundations.
  • GPTQ (Frantar et al. 2022), AutoRound — per-Linear quantizer building blocks.

License

Inherits the MiniMax-M2.7 license from the source model. See base model card for terms.
