Qwen3.6-27B-MTP - Q4_K_M GGUF
You currently need to run this with the ikawrakow/ik_llama.cpp fork (main branch, which includes the PR that adds Qwen MTP support).
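If you need to build the fork yourself, a minimal CMake build along the lines of upstream llama.cpp should work; this is only a sketch, and backend-specific flags (CUDA, etc.) are omitted and may be needed for GPU offload:

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

The resulting binaries (llama-server, llama-quantize, and so on) should end up under build/bin.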
This was quantized from a Q8_0 GGUF rather than directly from fp16, so some accuracy may have been lost in the process.
This is a Q4_K_M GGUF quantization of Qwen3.6-27B that preserves the MTP (Multi-Token Prediction) layers, allowing for significantly faster text generation via speculative decoding.
Standard GGUF conversions often strip out the MTP tensors to save a small amount of space. This model was carefully requantized from Radamanthys11/Qwen3.6-27B-MTP-Q8_0-GGUF using ik_llama.cpp to retain the MTP head while shrinking the VRAM requirements down to a highly efficient Q4 footprint.
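To sanity-check that the MTP tensors survived the download or a requantization, you can list the tensor names with the gguf-dump tool that ships with the gguf Python package; the grep pattern below is only an assumption about how the MTP/NextN tensors are named, so adjust it to whatever appears in the full dump:

pip install gguf
gguf-dump ./Qwen3.6-27B-MTP-Q4_K_M.gguf | grep -i -E "mtp|nextn"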
Performance & Benchmarks
Retaining the MTP layers provides a nearly "free" ~20% generation speedup with zero quality degradation.
Speed Benchmarks:
- MTP Off: 26.32 tokens / second
- MTP On (--draft-max 1): 31.69 tokens / second (~20% speedup). (Note: a draft max higher than 1 decreases t/s for me on this specific model.)
Perplexity:
- Dataset: wiki.test.raw
- Context: 512
- Result: 7.0291 +/- 0.04648 (over 580 chunks)
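A comparable measurement can be run with the llama-perplexity binary from the same ik_llama.cpp build; this is a sketch rather than the exact command used for the numbers above (wiki.test.raw is the WikiText-2 test split, and chunk counts depend on the tokenizer and context size):

./llama-perplexity \
  -m ./Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -f wiki.test.raw \
  -c 512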
How to Use
When running your server or CLI, you must pass the MTP flags: -mtp --draft-max 1 --draft-p-min 0.0.
Example Server Command:
./llama-server \
-m ./Qwen3.6-27B-MTP-Q4_K_M.gguf \
-c 32768 \
-mtp --draft-max 1 --draft-p-min 0.0
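Once the server is up, you can exercise it through the OpenAI-compatible chat endpoint it exposes; the example below assumes the default host and port (localhost:8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    "max_tokens": 256
  }'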
Creation Details
This file was created with llama-quantize, built from ik_llama.cpp, using the following command:
./llama-quantize --allow-requantize /path/to/Qwen3.6-27B-MTP-Q8_0.gguf /path/to/Qwen3.6-27B-MTP-Q4_K_M.gguf Q4_K_M