MiniMax-M2.7 — PrismaQuant 3.20 bpp (vLLM)

Mixed-precision quantization with joint expert pruning. Fits a 228B-parameter MoE in 90 GB on a single DGX Spark, served natively by vLLM with no patches.

Source: MiniMaxAI/MiniMax-M2.7 · Quantizer: prismaquant (commit pinned in this artifact's mixed_native_manifest.json)

TL;DR

| Metric | Value |
|---|---|
| Disk size | 90 GB (-58 % vs 215 GB FP8 source; -80 % vs ~456 GB BF16) |
| Achieved bpp | 3.20 |
| Format mix | 30,780 NVFP4 + 2,204 FP8_SOURCE Linears |
| Experts kept | 10,912 of 15,872 (69 %); 4,960 dropped via REAP saliency |
| Per-MoE-layer kept | uniform 176 of 256 (top-k = 8) |
| Decode throughput on Spark | ~14 tok/s (single-stream, temperature 0, 32k context) |
| vLLM patches required | 0 |

How it was produced

prismaquant solves a multi-choice knapsack over per-Linear cost / memory choices to land at a target bit budget. Two distinct contributions, plus a pareto sweep, go into this artifact:

1. Closed-form Δloss proxy

For each (Linear, format) pair, the cost is

$$\Delta\mathrm{loss} \approx \tfrac{1}{2} \cdot H_{\text{trace}} \cdot \mathrm{MSE}_W$$

  • H_trace is the empirical Fisher diagonal trace, captured in one streaming forward+backward pass over the calibration set. It measures how curved the cross-entropy loss is at this Linear: high H_trace means a small weight perturbation moves the loss a lot.
  • MSE_W is the measured per-format weight round-trip error (NVFP4, FP8, BF16). Not an analytical formula — we run RTN on the actual weights and compute the error directly.

Multiplying gives the second-order Taylor estimate of how much the model's loss will rise if you replace the BF16 weight with the format's quantized version. The allocator picks per-Linear formats that minimize total Δloss subject to a total-bit budget.
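
For concreteness, here is a minimal sketch of the proxy. It uses a plain symmetric-integer RTN as a stand-in for the real NVFP4 / FP8 round-trips (prismaquant measures MSE_W with the actual formats), and the function names and toy Linear are illustrative, not prismaquant's API:

```python
import torch

def rtn_mse(weight: torch.Tensor, n_bits: int, group: int = 32) -> float:
    """Round-to-nearest quantize per group of `group` weights, return the round-trip MSE."""
    w = weight.detach().float().reshape(-1, group)
    scale = (w.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1) - 1)).clamp(min=1e-12)
    q = torch.round(w / scale).clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return torch.mean((q * scale - w) ** 2).item()

def fisher_trace(weight: torch.Tensor) -> float:
    """Empirical Fisher diagonal trace for this Linear: sum of squared gradients."""
    assert weight.grad is not None, "run a calibration forward+backward first"
    return weight.grad.detach().float().pow(2).sum().item()

def delta_loss_proxy(weight: torch.Tensor, n_bits: int) -> float:
    # Second-order Taylor estimate: 0.5 * H_trace * MSE_W
    return 0.5 * fisher_trace(weight) * rtn_mse(weight, n_bits)

# Toy example: price one Linear at 4-bit vs 8-bit RTN (stand-ins for NVFP4 / FP8).
lin = torch.nn.Linear(4096, 4096)
lin.weight.grad = torch.randn_like(lin.weight) * 1e-4   # stand-in for a real backward pass
for bits in (4, 8):
    print(bits, delta_loss_proxy(lin.weight, bits))
```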

2. Joint expert-prune + format choice

For MoE layers, prismaquant treats each MoE choice as a pair:

(quantization_format, dropped_expert_ids)

Both the format and the prune set are priced in the same knapsack via REAP-style saliency:

$$S_j = \frac{1}{T_{\text{cal}}} \sum_t g_j(t) \cdot \lVert f_j(t) \rVert_2^2$$

This is the dropout-loss estimate from the REAP family of MoE expert-importance scores: ||f_j(t)||² captures how much the layer's output norm drops when expert j is removed, and g_j(t) weights it by the gradient signal flowing through that expert on calibration token t. Averaging over the T_cal calibration tokens gives each (router, expert) pair a score in Δloss units, directly comparable to the quantization Δloss, so a candidate prune set is priced as the sum of its experts' S_j.
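
A minimal streaming accumulator for that score, under the simplifying assumption that the gate/gradient weights g_j(t) arrive as a dense [tokens × experts] matrix and f_j(t) as per-expert outputs; in a real MoE only the routed experts produce outputs and the rest contribute g_j = 0. `ExpertSaliency` is an illustrative name, not prismaquant's API:

```python
import torch

class ExpertSaliency:
    """Streaming S_j = (1 / T_cal) * sum_t g_j(t) * ||f_j(t)||_2^2 per expert."""
    def __init__(self, num_experts: int):
        self.s = torch.zeros(num_experts, dtype=torch.float64)
        self.tokens = 0

    def update(self, gate: torch.Tensor, expert_out: torch.Tensor) -> None:
        # gate: [T, E] routing/gradient weights; expert_out: [T, E, D] per-expert outputs.
        sq_norm = expert_out.double().pow(2).sum(dim=-1)       # ||f_j(t)||_2^2  -> [T, E]
        self.s += (gate.double() * sq_norm).sum(dim=0)         # accumulate sum_t g_j(t) * ||.||^2
        self.tokens += gate.shape[0]

    def scores(self) -> torch.Tensor:
        return self.s / max(self.tokens, 1)

# Toy run: 256 experts, sparse top-k-ish gates, mark the lowest-S 31.25 % as prune candidates.
acc = ExpertSaliency(num_experts=256)
for _ in range(4):
    gate = torch.rand(2048, 256) * (torch.rand(2048, 256) < 8 / 256)
    acc.update(gate, torch.randn(2048, 256, 8))                # tiny hidden dim for the demo
drop_candidates = torch.argsort(acc.scores())[: int(0.3125 * 256)]
print(drop_candidates.numel())                                 # 80 lowest-S experts
```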

At each candidate ratio R, a layer's prune set is its floor(R · num_experts) lowest-S experts; the DP picks (R, format) jointly. After the pareto sweep, prismaquant produces a uniform-kept prune manifest so vLLM's MoE kernel sees a single num_local_experts per layer (this artifact: 176 of 256 kept everywhere).
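
A toy sketch of the joint pricing as a multi-choice knapsack, solved with a small DP over a discretized bit budget. The `allocate` signature, option tuples, and bucket size are illustrative assumptions; the real allocator handles the full per-Linear item set and is not shown here:

```python
from typing import List, Tuple

Option = Tuple[float, float, str]   # (bits, predicted delta-loss, label)

def allocate(items: List[List[Option]], bit_budget: float, step: float):
    """Pick exactly one option per item, minimizing total delta-loss s.t. total bits <= budget."""
    n = int(bit_budget / step) + 1
    INF = float("inf")
    best = [0.0] + [INF] * (n - 1)              # best[b]: min loss using exactly b*step bits so far
    back = []                                   # per-item backpointers for reconstruction
    for options in items:
        new, ptr = [INF] * n, [None] * n
        for b in range(n):
            if best[b] == INF:
                continue
            for bits, dloss, label in options:
                nb = b + int(round(bits / step))
                if nb < n and best[b] + dloss < new[nb]:
                    new[nb], ptr[nb] = best[b] + dloss, (b, label)
        best, back = new, back + [ptr]
    end = min(range(n), key=lambda i: best[i])  # cheapest feasible endpoint under the budget
    assert best[end] != INF, "bit budget too small for any assignment"
    picks, b = [], end
    for ptr in reversed(back):                  # walk backpointers, last item first
        b, label = ptr[b]
        picks.append(label)
    return best[end], picks[::-1]

# Toy example: four identical MoE layers, each choosing a (format, prune ratio) pair as one option.
GiB_bits = 8 * 1024 ** 3
layer_options = [[(7.2 * GiB_bits, 1.0, "FP8, keep 256"),
                  (3.6 * GiB_bits, 5.0, "NVFP4, keep 256"),
                  (2.5 * GiB_bits, 9.0, "NVFP4, keep 176")]] * 4
total, picks = allocate(layer_options, bit_budget=12 * GiB_bits, step=0.1 * GiB_bits)
print(total, picks)   # 32.0: one layer keeps all 256 experts, the other three prune to 176
```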

3. Pareto sweep + kneedle pick

Before committing to a target bit budget, prismaquant computes the full pareto curve. Below is the actual sweep that produced this artifact:

| target bpp | achieved | size on disk | predicted Δloss | NVFP4 super-Linears | FP8_SOURCE super-Linears | expert Linears dropped |
|---|---|---|---|---|---|---|
| 3.10 | 3.10 | 88.4 GB | 5,518 | 227 | 83 | 17,856 |
| 3.16 | 3.16 | 90.1 GB | 3,775 ← kneedle | 279 | 31 | 14,880 |
| 3.20 | 3.20 | 91.2 GB | 3,734 | 272 | 38 | 14,880 ← shipped |
| 3.25 | 3.25 | 92.6 GB | 3,733 | 271 | 39 | 14,880 |
| 3.30 | 3.30 | 94.1 GB | 3,733 | 236 | 74 | 14,880 |
| 3.40 | 3.40 | 96.9 GB | 3,732 | 199 | 111 | 14,880 |
| 3.50 | 3.50 | 99.7 GB | 2,496 | 268 | 42 | 11,904 |
| 3.60 | 3.60 | 102.6 GB | 2,495 | 217 | 93 | 11,904 |

The Δloss plateau between 3.16 and 3.40 (3,732-3,775) shows that extra bits in that band buy almost no additional quality at the fixed prune ratio; the allocator is already squeezing most of the available signal there. The sharp drop at 3.50 (-33 %) comes from relaxing the prune ratio (14,880 → 11,904 pruned expert Linears, i.e. 4,960 → 3,968 dropped experts). The user-specified target was the 90-95 GB band; 3.20 was picked as the smallest practical size that captures essentially all the available quality in the band.
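
For reference, a minimal knee pick over the sweep above, using the max-distance-to-chord form of the Kneedle idea on min-max normalized axes (prismaquant's actual picker may differ; the points are copied from the table):

```python
points = [  # (target bpp, predicted delta-loss) from the sweep above
    (3.10, 5518), (3.16, 3775), (3.20, 3734), (3.25, 3733),
    (3.30, 3733), (3.40, 3732), (3.50, 2496), (3.60, 2495),
]

def knee(points):
    xs, ys = zip(*points)
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]          # normalize bpp to [0, 1]
    ny = [(y - min(ys)) / (max(ys) - min(ys)) for y in ys]     # normalize delta-loss to [0, 1]
    x0, y0, x1, y1 = nx[0], ny[0], nx[-1], ny[-1]
    def dist(x, y):                                            # distance to the first-last chord
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / \
               ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return max(range(len(points)), key=lambda i: dist(nx[i], ny[i]))

print(points[knee(points)])   # (3.16, 3775): the row marked "kneedle"; 3.20 was shipped for the size band
```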

Format mix on disk

NVFP4       : 30,780 Linears (93.3 %)  — experts + most attention/MLP projections
FP8_SOURCE  :  2,204 Linears  (6.7 %)  — passthrough of natively-FP8 source weights
BF16        :     62 routers          — output dim shrunk to kept-expert count
PRUNED      : 14,880 Linear slots     — 4,960 experts × 3 weights, dropped per REAP

Sample of per-layer assignments:

| Layer | Format mix |
|---|---|
| L00 (dense pre-MoE) | 532 FP8_SOURCE |
| L01 (dense pre-MoE) | 532 FP8_SOURCE |
| L02 (first MoE) | 532 NVFP4 |
| L30 (mid MoE) | 529 NVFP4 + 3 FP8_SOURCE |
| L61 (last layer) | 532 NVFP4 |

The allocator kept the early dense layers (which dominate the semantic embedding pathways) at FP8 for safety, then dropped the bulk of the MoE expert weights to NVFP4 once their 0.5 · H_trace · MSE_W cost showed that was safe. A few attention projections in mid layers stayed pinned at FP8_SOURCE where the per-Linear sensitivity flagged NVFP4 as too aggressive.
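
If you want to sanity-check the mix in the downloaded artifact, a small sketch that tallies tensor dtypes by parsing each shard's safetensors header (no tensor data is read). It counts tensors (packed weights, scales, routers), not Linears, so totals will differ from the table above; the glob assumes the standard HF sharded layout:

```python
import json, struct
from collections import Counter
from pathlib import Path

def dtype_histogram(checkpoint_dir: str) -> Counter:
    counts = Counter()
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]   # safetensors: 8-byte LE header length
            header = json.loads(f.read(header_len))          # then a JSON map: name -> dtype/shape/offsets
        for name, meta in header.items():
            if name != "__metadata__":
                counts[meta["dtype"]] += 1
    return counts

print(dtype_histogram("."))   # run inside the repo; expect entries like U8 (packed NVFP4), F8_E4M3, BF16, F32
```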

Calibration

Calibration data: cal-mix-v1, a multi-domain mix balancing agentic, math, and coding sequences:

  • Agentic: tool-call traces, multi-step reasoning chains, planning + execution dialogues
  • Math: word problems, step-by-step solutions, symbolic manipulation
  • Coding: Python / Rust / SQL / shell, both authoring and reading patterns

Volume: 32 chunks × 4 samples × 2048 seq-len ≈ 262 k tokens through the streaming probe. Each chunk runs a phase-1 forward pass (saliency capture) plus a phase-3 reverse sweep (per-Linear Fisher). The chunks share the same multi-domain composition, so every calibration token informs all three downstream regimes.
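
A stripped-down sketch of what one calibration pass looks like, assuming an HF-style causal LM small enough to backprop in one piece; the real probe shards the model, caches activations, and defers Fisher sync, none of which is shown. `chunks` is a hypothetical iterator yielding (input_ids, labels) batches from cal-mix-v1:

```python
import torch

def stream_fisher_trace(model, chunks, device="cuda"):
    """One streaming forward+backward per batch; accumulate per-Linear Fisher diag traces."""
    traces = {n: 0.0 for n, p in model.named_parameters() if p.ndim == 2}
    model.train()
    for input_ids, labels in chunks:
        model.zero_grad(set_to_none=True)
        loss = model(input_ids.to(device), labels=labels.to(device)).loss   # HF causal-LM loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.ndim == 2 and p.grad is not None:
                traces[n] += p.grad.float().pow(2).sum().item()             # running H_trace per Linear
    return traces

# traces = stream_fisher_trace(model, calibration_batches)   # feeds the delta-loss proxy above
```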

prismaquant's per-domain-saliency feature exists (allocator can use union / intersection / mean across domains), but for this release the calibration was domain-merged. Per-domain runs are a follow-up.

Quality

Spot-checked at temperature 0 across agentic / math / coding:

| Test | Result |
|---|---|
| Multi-segment train problem (math) | Step-by-step reasoning, exact answer 240 mi / 68.571 mph |
| Python `is_palindrome` | Clean, correct |
| Python `quicksort` | Clean, correct |
| Python `binary_search` | Clean, correct |
| Python `longest_substring_without_repeat` | Sliding-window, correct |
| Python `merge_two_lists` (linked list) | Clean, correct |
| Python `fibonacci` | Iterative, with worked example |
| Rust `Point::distance` | Uses `.hypot()` (numerically stable) |
| SQL top-5 customers by 2024 volume | Clean, proper date-range filter |
| Tool calling | Clean function-call JSON emission |
| Reasoning content via `<think>` | Captured by `--reasoning-parser minimax_m2` |

Formal benchmarks (MMLU, GSM8K, HumanEval) deferred. The artifact is positioned as fits-on-Spark + serves-coherently across the three calibration domains; rigorous benchmark numbers in a follow-up release.

Serving

vllm serve <this-repo> \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2

Recommended on UMA hardware: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to keep the CUDA allocator from hoarding freed blocks.
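
Once the server is up, tool calls and reasoning content come back through the OpenAI-compatible API. A minimal client sketch: the port, API key, and the get_weather tool are placeholders, and reasoning_content is the field vLLM's reasoning parser populates:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_id = client.models.list().data[0].id            # whatever name the server registered

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                         # placeholder tool for the demo
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=tools,
    temperature=0,
)

msg = resp.choices[0].message
print(msg.tool_calls)                                  # parsed by --tool-call-parser minimax_m2
print(getattr(msg, "reasoning_content", None))         # <think> content via --reasoning-parser minimax_m2
```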

Limitations / caveats

  • Calibration scale: 262 k tokens is moderate. Heavy reasoning-chain or long-context workloads may benefit from a re-export with more diverse calibration.
  • Domain-merged saliency: per-domain prune policies (union/intersection) supported by prismaquant but not exercised here. A re-export with domain-tagged calibration is a candidate next iteration.
  • No formal benchmarks yet: MMLU / GSM8K / HumanEval pending. Headline result is "fits + coherent across cal-mix".
  • No MTP head: MiniMax-M2 ships without an MTP head (unlike Qwen3.5/3.6), so there is no speculative-decoding accelerator.
  • Pruned experts are gone: 4,960 of 15,872 (31 %) dropped per REAP. Tasks heavily dependent on those specific experts could see degradation; empirical probes showed none on agentic/math/coding prompts.

Reproduction

This artifact was produced by:

# 1. Probe + cost (multi-chunk, adaptive sampling, deferred Fisher sync)
python -m prismaquant.multi_chunk_probe \
  --chunks-dir /work/chunks \
  --model <minimax-m2.7-snapshot> \
  --output /work/artifacts/probe.pkl \
  --activation-cache-dir /work/act \
  --work-dir /work/work \
  --layers-per-shard 4 --unified-sweep \
  --no-include-mtp --no-include-visual --no-include-lm-head \
  --prefetch-lookahead 4 --prefetch-workers 2 \
  --activation-rows-limit 256 \
  --calibration-modality text-only \
  --retain-cross-chunk-cache \
  --adaptive-sampling \
  --run-cost --cost-output /work/artifacts/cost.pkl \
  --cost-formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16

# 2. Allocator (target_bits=3.20 picks the kneedle within the 90-95 GB band)
python -m prismaquant.allocator \
  --probe /work/artifacts/probe.pkl \
  --costs /work/artifacts/cost.pkl \
  --formats NVFP4,MXFP8_E4M3,FP8_SOURCE,BF16 \
  --target-bits 3.20 \
  --pareto-targets 3.10,3.16,3.20,3.25,3.30,3.40,3.50,3.60 \
  --enable-expert-prune \
  --prune-ratios 0.0,0.125,0.1875,0.25,0.3125,0.375 \
  --prune-alpha 0.15 \
  --layer-config /work/artifacts/layer_config_prune.json

# 3. Export (native compressed-tensors, GPTQ + scale-sweep activation-aware)
python -m prismaquant.export_native_compressed \
  --model <minimax-m2.7-snapshot> \
  --layer-config /work/artifacts/layer_config_prune.json \
  --prune-manifest /work/artifacts/layer_config_prune.json.prune.json \
  --output /work/exported \
  --activation-cache-dir /work/act \
  --device cuda

Full source + reproduction notes: https://github.com/RobTand/prismaquant

Acknowledgements

  • MiniMaxAI — source model.
  • vLLM — compressed-tensors serving stack with native NVFP4 + FP8 MoE kernels.
  • REAP-style per-expert dropout-loss saliency.
  • HAQ / HAWQ-V1/V2/V3 (Wang, Dong, Yao, et al.) — mixed-precision allocation foundations.
  • GPTQ (Frantar et al. 2022), AutoRound — per-Linear quantizer building blocks.

License

Inherits the MiniMax-M2.7 license from the source model. See base model card for terms.
