Qwen3.6-27B-MTP - Q4_K_M GGUF

You currently need to run this with the ikawrakow/ik_llama.cpp fork (main branch, which includes the PR adding Qwen MTP support).

This was quantized from a Q8_0 file rather than directly from fp16, so some accuracy may have been lost in the process.

This is a Q4_K_M GGUF quantization of Qwen3.6-27B that preserves the MTP (Multi-Token Prediction) layers, allowing for significantly faster text generation via speculative decoding.

Standard GGUF conversions often strip out MTP tensors to save a small amount of space. This model was carefully requantized from Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF using ik_llama.cpp to retain the MTP head while shrinking the VRAM requirements down to a highly efficient Q4 footprint.

Performance & Benchmarks

Retaining the MTP layers provides a nearly "free" ~20% generation speedup with no quality degradation, since the speculatively drafted tokens are still verified by the main model before being accepted.

Speed Benchmarks:

  • MTP Off: 26.32 tokens / second
  • MTP On (--draft-max 1): 31.69 tokens / second (~20% speedup). (Note: setting --draft-max higher than 1 on this specific model decreases t/s for me.)

Perplexity:

  • Dataset: wiki.test.raw
  • Context: 512
  • Result: 7.0291 +/- 0.04648 (over 580 chunks)
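
If you want to reproduce the perplexity figure above, the perplexity tool from the same ik_llama.cpp build can be used. The command below is only a sketch: it assumes the standard llama-perplexity binary name and that wiki.test.raw (WikiText-2 test split) is available locally.

./llama-perplexity \
  -m ./Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -f ./wiki.test.raw \
  -c 512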

How to Use

When running your server or CLI, you must pass the MTP flags: -mtp --draft-max 1 --draft-p-min 0.0.

Example Server Command:

./llama-server \
  -m ./Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -c 32768 \
  -mtp --draft-max 1 --draft-p-min 0.0
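
Once the server is up, any OpenAI-compatible client can talk to it. The curl call below is an illustrative sketch, assuming the default host/port (127.0.0.1:8080) and the /v1/chat/completions endpoint exposed by llama-server:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    "max_tokens": 256
  }'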

Creation Details

This file was created using llama-quantize built from ik_llama.cpp using the following command:

./llama-quantize --allow-requantize /path/to/Qwen3.6-27B-MTP-Q8_0.gguf /path/to/Qwen3.6-27B-MTP-Q4_K_M.gguf Q4_K_M
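
To confirm that the MTP tensors survived requantization, the gguf-dump script from the gguf Python package can list every tensor in the file. The grep pattern below is only a guess, since the exact MTP tensor names depend on the ik_llama.cpp implementation:

pip install gguf
gguf-dump ./Qwen3.6-27B-MTP-Q4_K_M.gguf | grep -iE "mtp|nextn"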