# Ppaso-TTS - Korean lightweight TTS (RK3576 NPU friendly)
Korean version: README_ko.md

Ppaso (빠소) is short for ppareunsori (빠른소리, "fast voice" in Korean): real-time Korean TTS for edge NPU devices.

## What it is for

When pairing an LLM with TTS on an edge device (e.g. NanoPi R76S) for a Korean voice assistant, TTS inference often becomes the latency bottleneck. Ppaso-TTS is a lightweight TTS designed to remove that bottleneck by running on the NPU.
Designed for speed and small footprint over naturalness. Single voice, Korean only.

## Highlights

- Korean only, single female voice. Speed and small footprint are prioritized over quality (robotic timbre included).
- ~20× real-time on NPU: ~305 ms for a 5.94 s utterance on RK3576 (RTF 0.052).
- 21 MB on-device model (RKNN): fits comfortably on RAM-constrained edge devices.
- Two backends: RKNN NPU (RK3576 / RK3588) and ONNX CPU (anywhere).
- Any-length input: built-in chunking handles long paragraphs without truncation.
- Streaming mode: feed LLM tokens as they arrive; emit wav as soon as a sentence completes (voice assistant friendly).
- Custom vocabulary focus: the training corpus is self-synthesized via a CosyVoice teacher rather than drawn from an external dataset. Naturalness is sacrificed, but the data distribution for domain-specific vocabulary (industrial safety, technical terms, etc.) can be controlled.

## How it works

Korean text → IPA phones → mel → wav (22.05 kHz mono)

Internally: a G2P front end (mecab-ko + MFA lexicon), four NN modules (RKNN/ONNX), and an iSTFT stage (CPU). For users it is a single function call.
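
The G2P front end is internal to `ppaso_tts`, but its first stage, morpheme segmentation with mecab-ko, can be previewed on its own via the python-mecab-ko package from the install step. A minimal sketch; the IPA lexicon lookup that follows this stage is not exposed and is not shown:

```python
from mecab import MeCab  # python-mecab-ko

m = MeCab()
# Morpheme segmentation, the first G2P stage. Ppaso-TTS then maps each
# morpheme to IPA phones via the bundled MFA lexicon before the NN modules run.
print(m.pos("안녕하세요. 저는 인공지능 비서입니다."))  # -> list of (surface, POS) pairs
```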

## Performance (RK3576, 5.94 s single utterance)

End-to-end measurements (warmup, then median; single chunk; full pipeline with G2P, NN inference, and iSTFT all included):

| Backend | Latency | RTF | Real-time speed |
|---|---|---|---|
| RKNN NPU (FP16) | 305 ms | 0.052 | 19.3× |
| ONNX CPU (4 threads) | 477 ms | 0.081 | 12.4× |

The NPU is ~1.6× faster than the CPU on the same device.
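
A measurement like the above can be reproduced with only the `PpasoTTS` API from the Quick Start below. A minimal sketch; the warmup-then-median protocol mirrors the table, while the loop count of 10 is an arbitrary choice:

```python
import statistics
import time

from ppaso_tts import PpasoTTS

tts = PpasoTTS('./', backend='onnx')  # or backend='rknn' on RK3576 / RK3588
text = "안녕하세요. 저는 인공지능 비서입니다."

tts.synthesize(text)  # warmup: the first call pays one-time init costs
laps = []
for _ in range(10):
    t0 = time.perf_counter()
    wav = tts.synthesize(text)
    laps.append(time.perf_counter() - t0)

latency = statistics.median(laps)
rtf = latency / (len(wav) / 22050)  # RTF = processing time / audio duration
print(f"median {latency * 1000:.0f} ms | RTF {rtf:.3f} | {1 / rtf:.1f}x real-time")
```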

## Audio Samples

`samples/` (RKNN NPU output):

| File | Duration | Text |
|---|---|---|
| 01_greeting.wav | 3.66 s | 안녕하세요. 저는 인공지능 비서입니다. (Hello. I am your AI assistant.) |
| 02_datetime.wav | 5.65 s | 오늘은 2026년 4월 26일이고, 오후 3시 30분입니다. (It's April 26, 2026, 3:30 PM.) |
| 03_question.wav | 2.35 s | 정말이에요? 진짜 그래요? (Really? Is that true?) |
| 04_casual.wav | 3.45 s | 오늘 점심 뭐 먹을지 정했어? 같이 갈래? (Decided what to have for lunch? Want to go together?) |
| 05_long.wav | 13.15 s | 디자인은 인간과 인간, 인간과 사회와의 관계 속에서 그 역할을 찾아내게 되었던 것이고, 인간들에서 본질에 집중하면서 더 나은 삶을 실현하는 매개체이자 필수가 되었다. (long-form, auto-chunked) |
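
To audition the samples locally, any audio player works; here is a short sketch using soundfile plus sounddevice (an assumed extra dependency, not required by ppaso_tts):

```python
import soundfile as sf
import sounddevice as sd  # assumed extra: pip install sounddevice

wav, sr = sf.read('samples/01_greeting.wav')  # 22.05 kHz mono
sd.play(wav, sr)
sd.wait()  # block until playback finishes
```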

## Quick Start

### Install

```bash
pip install numpy soundfile python-mecab-ko onnxruntime

# For RK3576 / RK3588 NPU:
pip install rknn-toolkit-lite2
```

### Hello World

```python
from ppaso_tts import PpasoTTS
import soundfile as sf

tts = PpasoTTS('./', backend='onnx')    # ONNX (CPU/CUDA, anywhere)
# tts = PpasoTTS('./', backend='rknn')  # RKNN (RK3576 / RK3588 NPU)

wav = tts.synthesize("안녕하세요. 자기소개 부탁드립니다.")
sf.write('hello.wav', wav, 22050)
```

Long input is auto-chunked:

```python
wav = tts.synthesize(
    "안녕하세요. 오늘 날씨가 정말 좋네요. 산책 어떠세요? "
    "긴 문장도 대부분 잘림 없이 안정적으로 합성됩니다."
)
sf.write('long.wav', wav, 22050)
```

### Custom pronunciations (user dict)

Pass a `user_dict` mapping any input form to a Korean spelling. Replacement is applied automatically before synthesis:

```python
tts = PpasoTTS('./', backend='onnx', user_dict={
    "pok3r": "포커",
    "GPT": "지피티",
    "ChatGPT": "챗지피티",
})
wav = tts.synthesize("pok3r and GPT are both spoken naturally as Korean.")
```

### Streaming (voice assistant)

LLM token stream in, wav emitted on each sentence completion:

```python
from ppaso_tts import PpasoTTS, StreamingTTS

tts = PpasoTTS('./', backend='rknn')
stream = StreamingTTS(tts)

for token in llm_token_stream():    # hook up your LLM token stream
    for wav in stream.feed(token):  # emits on strong punctuation (.!?)
        play(wav)                   # play immediately

for wav in stream.flush():          # end of dialog: drain the buffer
    play(wav)
```
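
`StreamingTTS` handles the buffering internally. As a rough illustration of the idea only (not the actual implementation), sentence-boundary buffering amounts to accumulating tokens and cutting on strong punctuation; each complete sentence would then become one `synthesize()` call:

```python
import re

class SentenceBuffer:
    """Illustrative sketch of StreamingTTS-style buffering. It yields text,
    whereas the real StreamingTTS yields synthesized wav arrays."""

    def __init__(self):
        self.buf = ""

    def feed(self, token: str):
        self.buf += token
        # Cut off every complete sentence ending in strong punctuation.
        while (m := re.search(r"[.!?]", self.buf)):
            sentence, self.buf = self.buf[:m.end()], self.buf[m.end():]
            yield sentence.strip()

    def flush(self):
        # End of dialog: emit whatever remains, punctuated or not.
        if self.buf.strip():
            yield self.buf.strip()
        self.buf = ""
```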

## More examples

Five examples in `example/`:

| File | Use case |
|---|---|
| 01_simple_onnx.py | ONNX backend (CPU/CUDA) |
| 02_rknn_npu.py | RKNN backend (RK3576 / RK3588 NPU) + latency measurement |
| 03_chunked.py | Long-input chunking demo |
| 04_streaming.py | StreamingTTS demo |
| 05_vocos_quality.py | High-quality mode: Vocos pretrained vocoder drop-in |

## Limitations

- Korean only: other languages will be mispronounced.
- Single speaker: no voice cloning.
- A mild robotic timbre persists (capacity/speed trade-off).
- Loanwords and English transliterations may sound unnatural.
- Long-input chunking: arbitrary length is handled by built-in chunking, but prosody (intonation/pace) consistency across chunk boundaries is somewhat weak.
- Rare pronunciations: words missing from the bundled lexicon (~25 K entries) fall back to character-level phones, which can sound off; the `user_dict` workaround sketched after this list helps.
- No personal identity: the voice does not represent any real person; do not use it for impersonation.
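
For the rare-pronunciation case, the `user_dict` mechanism from the Quick Start doubles as a workaround: pin the problem word to a Korean spelling the lexicon does cover. The term and mapping below are purely illustrative:

```python
from ppaso_tts import PpasoTTS

# Hypothetical mapping: a domain acronym the ~25 K-entry lexicon likely misses,
# pinned to an explicit Korean spelling so it never hits the character-level fallback.
tts = PpasoTTS('./', backend='onnx', user_dict={
    "LOTO": "로토",  # lockout-tagout; term and spelling are illustrative only
})
wav = tts.synthesize("LOTO 절차를 시작합니다.")  # "Starting the LOTO procedure."
```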

## Intended / Out-of-Scope Use

Intended:

- Edge AI voice assistants (Korean speakers)
- IoT voice notifications (announcements, alarms)
- Accessibility tooling (screen readers)

Prohibited:

- Voice impersonation of specific individuals
- Deepfake / fraudulent audio generation
- Standalone use in critical systems (medical, legal, emergency) without human review
- Spam / fraud automation

## Repository Layout

```
.
├── README.md / README_ko.md   # This file (English / Korean)
├── LICENSE                    # Apache 2.0
├── config.json                # Model metadata
│
├── onnx/                      # CPU/CUDA inference (37 MB)
├── rknn/                      # RK3576 / RK3588 NPU only (FP16, 21 MB)
├── runtime/                   # G2P + lexicon
├── example/                   # 5 example scripts + ppaso_tts.py class
└── samples/                   # Example synthesized waveforms
```

## License

Apache License 2.0: free for commercial use, includes a patent grant, and requires preservation of NOTICE / attribution.

## Attribution

External resources used in training this model:

- Fun-CosyVoice 3.0 (Apache 2.0): used as a teacher to synthesize the Korean training corpus. The teacher weights are not redistributed in this repo.
- Vocos (MIT): the ConvNeXt-1D + iSTFT vocoder design is inspired by Vocos. Our vocoder is trained from scratch and does not include any Vocos weights.
- MFA korean_mfa.dict (MIT, Eleanor Chodroff)
- mecab-ko (BSD)

## Acknowledgments

- This project was developed with support from i-nx.com.
- FunAudioLLM team, for Fun-CosyVoice 3.0 and its permissive license.
- Vocos authors, for the ConvNeXt-1D vocoder architecture.
- Rockchip, for the RK3576 NPU documentation and RKNN toolkit.
- Anthropic Claude, for architecture exploration, training pipeline, and model card documentation.

## Citation

```bibtex
@misc{ppaso-tts,
  author = {akamotaco},
  title  = {Ppaso-TTS: Korean edge-optimized TTS for RK3576 NPU},
  year   = {2026},
  url    = {https://huggingface.co/akamotaco/ppaso-tts-v1},
  note   = {Developed with support from i-nx.com, and AI assistance from Anthropic Claude.}
}
```

Teacher model:

```bibtex
@article{du2025cosyvoice,
  title   = {CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author  = {Du, Zhihao and others},
  journal = {arXiv preprint arXiv:2505.17589},
  year    = {2025}
}
```

Vocoder design:

```bibtex
@inproceedings{siuzdak2024vocos,
  title     = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author    = {Siuzdak, Hubert},
  booktitle = {ICLR},
  year      = {2024}
}
```