# Ppaso-TTS - Korean lightweight TTS (RK3576 NPU friendly)
Korean version: README_ko.md

Ppaso (빠소) is short for ppareunsori (빠른소리, "fast voice" in Korean): real-time Korean TTS for edge NPU devices.

## What it is for

When pairing an LLM with TTS on an edge device (e.g. NanoPi R76S) for a Korean voice assistant, TTS inference often becomes the latency bottleneck. Ppaso-TTS is a lightweight TTS designed to remove that bottleneck by running on the NPU.
Designed for speed and small footprint over naturalness. Single voice, Korean only.

## Highlights

- Korean only, single female voice. Speed and small footprint are prioritized over quality (robotic timbre included).
- ~20× real-time on NPU: ~305 ms for a 5.94 s utterance on RK3576 (RTF 0.052).
- 21 MB on-device model (RKNN): fits comfortably on RAM-constrained edge devices.
- Two backends: RKNN NPU (RK3576 / RK3588) and ONNX CPU (anywhere).
- Any-length input: built-in chunking handles long paragraphs without truncation.
- Streaming mode: feed LLM tokens as they arrive; emit wav as soon as a sentence completes (voice assistant friendly).
- Custom vocabulary focus: the training corpus is self-synthesized via a CosyVoice teacher rather than drawn from an external dataset. Naturalness is sacrificed, but the data distribution for domain-specific vocabulary (industrial safety, technical terms, etc.) can be controlled.

## How it works

Korean text → IPA phones → mel → wav (22.05 kHz mono)

Internally: a G2P front end (mecab-ko + MFA lexicon), four NN modules (RKNN/ONNX), and an iSTFT stage (CPU). For users it is a single function call.
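
The G2P front end is internal to `ppaso_tts`, but its first stage, morpheme segmentation with mecab-ko, can be previewed on its own via the python-mecab-ko package from the install step. A minimal sketch; the IPA lexicon lookup that follows this stage is not exposed and is not shown:

```python
from mecab import MeCab  # python-mecab-ko

m = MeCab()
# Morpheme segmentation, the first G2P stage. Ppaso-TTS then maps each
# morpheme to IPA phones via the bundled MFA lexicon before the NN modules run.
print(m.pos("안녕하세요. 저는 인공지능 비서입니다."))  # -> list of (surface, POS) pairs
```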

## Performance (RK3576, 5.94 s single utterance)

End-to-end measurements (warmup, then median; single chunk; full pipeline with G2P, NN inference, and iSTFT all included):

| Backend | Latency | RTF | Real-time speed |
|---|---|---|---|
| RKNN NPU (FP16) | 305 ms | 0.052 | 19.3× |
| ONNX CPU (4 threads) | 477 ms | 0.081 | 12.4× |

The NPU is ~1.6× faster than the CPU on the same device.
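
A measurement like the above can be reproduced with only the `PpasoTTS` API from the Quick Start below. A minimal sketch; the warmup-then-median protocol mirrors the table, while the loop count of 10 is an arbitrary choice:

```python
import statistics
import time

from ppaso_tts import PpasoTTS

tts = PpasoTTS('./', backend='onnx')  # or backend='rknn' on RK3576 / RK3588
text = "안녕하세요. 저는 인공지능 비서입니다."

tts.synthesize(text)  # warmup: the first call pays one-time init costs
laps = []
for _ in range(10):
    t0 = time.perf_counter()
    wav = tts.synthesize(text)
    laps.append(time.perf_counter() - t0)

latency = statistics.median(laps)
rtf = latency / (len(wav) / 22050)  # RTF = processing time / audio duration
print(f"median {latency * 1000:.0f} ms | RTF {rtf:.3f} | {1 / rtf:.1f}x real-time")
```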

## Audio Samples

`samples/` (RKNN NPU output):

| File | Duration | Text |
|---|---|---|
| 01_greeting.wav | 3.66 s | 안녕하세요. 저는 인공지능 비서입니다. (Hello. I am your AI assistant.) |
| 02_datetime.wav | 5.65 s | 오늘은 2026년 4월 26일이고, 오후 3시 30분입니다. (It's April 26, 2026, 3:30 PM.) |
| 03_question.wav | 2.35 s | 정말이에요? 진짜 그래요? (Really? Is that true?) |
| 04_casual.wav | 3.45 s | 오늘 점심 뭐 먹을지 정했어? 같이 갈래? (Decided what to have for lunch? Want to go together?) |
| 05_long.wav | 13.15 s | 디자인은 인간과 인간, 인간과 사회와의 관계 속에서 그 역할을 찾아내게 되었던 것이고, 인간들에서 본질에 집중하면서 더 나은 삶을 실현하는 매개체이자 필수가 되었다. (long-form, auto-chunked) |
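
To audition the samples locally, any audio player works; here is a short sketch using soundfile plus sounddevice (an assumed extra dependency, not required by ppaso_tts):

```python
import soundfile as sf
import sounddevice as sd  # assumed extra: pip install sounddevice

wav, sr = sf.read('samples/01_greeting.wav')  # 22.05 kHz mono
sd.play(wav, sr)
sd.wait()  # block until playback finishes
```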

## Quick Start

### Install

```bash
pip install numpy soundfile python-mecab-ko onnxruntime

# For RK3576 / RK3588 NPU:
pip install rknn-toolkit-lite2
```

### Hello World

```python
from ppaso_tts import PpasoTTS
import soundfile as sf

tts = PpasoTTS('./', backend='onnx')    # ONNX (CPU/CUDA, anywhere)
# tts = PpasoTTS('./', backend='rknn')  # RKNN (RK3576 / RK3588 NPU)

wav = tts.synthesize("안녕하세요. 자기소개 부탁드립니다.")
sf.write('hello.wav', wav, 22050)
```

Long input is auto-chunked:

```python
wav = tts.synthesize(
    "안녕하세요. 오늘 날씨가 정말 좋네요. 산책 어떠세요? "
    "긴 문장도 대부분 잘림 없이 안정적으로 합성됩니다."
)
sf.write('long.wav', wav, 22050)
```

### Custom pronunciations (user dict)

Pass a `user_dict` mapping any input form to a Korean spelling. Replacement is applied automatically before synthesis:

```python
tts = PpasoTTS('./', backend='onnx', user_dict={
    "pok3r": "포커",
    "GPT": "지피티",
    "ChatGPT": "챗지피티",
})
wav = tts.synthesize("pok3r and GPT are both spoken naturally as Korean.")
```

### Streaming (voice assistant)

LLM token stream in, wav emitted on each sentence completion:

```python
from ppaso_tts import PpasoTTS, StreamingTTS

tts = PpasoTTS('./', backend='rknn')
stream = StreamingTTS(tts)

for token in llm_token_stream():    # hook up your LLM token stream
    for wav in stream.feed(token):  # emits on strong punctuation (.!?)
        play(wav)                   # play immediately

for wav in stream.flush():          # end of dialog: drain the buffer
    play(wav)
```
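
`StreamingTTS` handles the buffering internally. As a rough illustration of the idea only (not the actual implementation), sentence-boundary buffering amounts to accumulating tokens and cutting on strong punctuation; each complete sentence would then become one `synthesize()` call:

```python
import re

class SentenceBuffer:
    """Illustrative sketch of StreamingTTS-style buffering. It yields text,
    whereas the real StreamingTTS yields synthesized wav arrays."""

    def __init__(self):
        self.buf = ""

    def feed(self, token: str):
        self.buf += token
        # Cut off every complete sentence ending in strong punctuation.
        while (m := re.search(r"[.!?]", self.buf)):
            sentence, self.buf = self.buf[:m.end()], self.buf[m.end():]
            yield sentence.strip()

    def flush(self):
        # End of dialog: emit whatever remains, punctuated or not.
        if self.buf.strip():
            yield self.buf.strip()
        self.buf = ""
```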

## More examples

Five examples in `example/`:

| File | Use case |
|---|---|
| 01_simple_onnx.py | ONNX backend (CPU/CUDA) |
| 02_rknn_npu.py | RKNN backend (RK3576 / RK3588 NPU) + latency measurement |
| 03_chunked.py | Long-input chunking demo |
| 04_streaming.py | StreamingTTS demo |
| 05_vocos_quality.py | High-quality mode: Vocos pretrained vocoder drop-in |

## Limitations

- Korean only: other languages will be mispronounced.
- Single speaker: no voice cloning.
- A mild robotic timbre persists (capacity/speed trade-off).
- Loanwords and English transliterations may sound unnatural.
- Long-input chunking: arbitrary length is handled by built-in chunking, but prosody (intonation/pace) consistency across chunk boundaries is somewhat weak.
- Rare pronunciations: words missing from the bundled lexicon (~25 K entries) fall back to character-level phones, which can sound off; the `user_dict` workaround sketched after this list helps.
- No personal identity: the voice does not represent any real person; do not use it for impersonation.
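
For the rare-pronunciation case, the `user_dict` mechanism from the Quick Start doubles as a workaround: pin the problem word to a Korean spelling the lexicon does cover. The term and mapping below are purely illustrative:

```python
from ppaso_tts import PpasoTTS

# Hypothetical mapping: a domain acronym the ~25 K-entry lexicon likely misses,
# pinned to an explicit Korean spelling so it never hits the character-level fallback.
tts = PpasoTTS('./', backend='onnx', user_dict={
    "LOTO": "로토",  # lockout-tagout; term and spelling are illustrative only
})
wav = tts.synthesize("LOTO 절차를 시작합니다.")  # "Starting the LOTO procedure."
```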

## Intended / Out-of-Scope Use

Intended:

- Edge AI voice assistants (Korean speakers)
- IoT voice notifications (announcements, alarms)
- Accessibility tooling (screen readers)

Prohibited:

- Voice impersonation of specific individuals
- Deepfake / fraudulent audio generation
- Standalone use in critical systems (medical, legal, emergency) without human review
- Spam / fraud automation

## Repository Layout

```
.
├── README.md / README_ko.md   # This file (English / Korean)
├── LICENSE                    # Apache 2.0
├── config.json                # Model metadata
│
├── onnx/                      # CPU/CUDA inference (37 MB)
├── rknn/                      # RK3576 / RK3588 NPU only (FP16, 21 MB)
├── runtime/                   # G2P + lexicon
├── example/                   # 5 example scripts + ppaso_tts.py class
└── samples/                   # Example synthesized waveforms
```

## License

Apache License 2.0: free for commercial use, includes a patent grant, and requires preservation of NOTICE / attribution.

## Attribution

External resources used in training this model:

- Fun-CosyVoice 3.0 (Apache 2.0): used as a teacher to synthesize the Korean training corpus. The teacher weights are not redistributed in this repo.
- Vocos (MIT): the ConvNeXt-1D + iSTFT vocoder design is inspired by Vocos. Our vocoder is trained from scratch and does not include any Vocos weights.
- MFA korean_mfa.dict (MIT, Eleanor Chodroff)
- mecab-ko (BSD)

## Acknowledgments

- This project was developed with support from i-nx.com.
- FunAudioLLM team, for Fun-CosyVoice 3.0 and its permissive license.
- Vocos authors, for the ConvNeXt-1D vocoder architecture.
- Rockchip, for the RK3576 NPU documentation and RKNN toolkit.
- Anthropic Claude, for architecture exploration, training pipeline, and model card documentation.

## Citation

```bibtex
@misc{ppaso-tts,
  author = {akamotaco},
  title  = {Ppaso-TTS: Korean edge-optimized TTS for RK3576 NPU},
  year   = {2026},
  url    = {https://huggingface.co/akamotaco/ppaso-tts-v1},
  note   = {Developed with support from i-nx.com, and AI assistance from Anthropic Claude.}
}
```

Teacher model:

```bibtex
@article{du2025cosyvoice,
  title   = {CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author  = {Du, Zhihao and others},
  journal = {arXiv preprint arXiv:2505.17589},
  year    = {2025}
}
```

Vocoder design:

```bibtex
@inproceedings{siuzdak2024vocos,
  title     = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author    = {Siuzdak, Hubert},
  booktitle = {ICLR},
  year      = {2024}
}
```