Ppaso-TTS — Korean lightweight TTS (RK3576 NPU friendly)

Korean: README_ko.md

Ppaso (λΉ μ†Œ) is short for ppareunsori (λΉ λ₯Έμ†Œλ¦¬, "fast voice" in Korean): real-time Korean TTS for edge NPU devices.

What it is for

When pairing an LLM with TTS on an edge device (e.g. NanoPi R76S) for a Korean voice assistant, TTS inference often becomes the latency bottleneck. Ppaso-TTS is a lightweight TTS designed to remove that bottleneck by running on the NPU.

Designed for speed and a small footprint over naturalness. Single voice, Korean only.

Highlights

  • Korean only, single female voice. Speed and a small footprint are prioritized over quality (expect a somewhat robotic timbre).
  • ~20× real-time on NPU — ~305 ms for a 5.94-second utterance on RK3576 (RTF 0.052).
  • 21 MB on-device model (RKNN) — fits comfortably on RAM-constrained edge devices.
  • Two backends — RKNN NPU (RK3576 / RK3588) and ONNX CPU (anywhere).
  • Any-length input — built-in chunking handles long paragraphs without truncation.
  • Streaming mode — feed LLM tokens as they arrive; a wav is emitted as soon as a sentence completes (voice-assistant friendly).
  • Custom vocabulary focus — the training corpus is self-synthesized via a CosyVoice teacher rather than taken from an external dataset. Naturalness is sacrificed, but the data distribution for domain-specific vocabulary (industrial safety, technical terms, etc.) can be controlled.

How it works

Korean text  →  IPA phones  →  mel  →  wav (22.05 kHz mono)

Internally this is a G2P stage (mecab-ko + MFA lexicon), 4 NN modules (RKNN/ONNX), and an iSTFT stage (CPU). For users it's a single function call.
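
From the user's side the whole pipeline collapses into one call. A quick way to inspect the output format (a minimal sketch; it assumes synthesize returns a flat sample array, which is what the sf.write usage in the Quick Start implies):

from ppaso_tts import PpasoTTS

tts = PpasoTTS('./', backend='onnx')
wav = tts.synthesize("μ•ˆλ…•ν•˜μ„Έμš”.")   # full pipeline: g2p -> NN modules -> iSTFT
print(len(wav) / 22050, "seconds of 22.05 kHz mono audio")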

Performance (RK3576, 5.94 s single utterance)

End-to-end measurements (median after warmup; single chunk; full pipeline — g2p, NN inference, and iSTFT all included):

Backend               Latency  RTF    Real-time
RKNN NPU (FP16)       305 ms   0.052  19.3×
ONNX CPU (4 threads)  477 ms   0.081  12.4×

NPU is ~1.6× faster than CPU on the same device.
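
The measurement method is easy to replicate: one warmup call, then the median over a few timed runs. A minimal sketch (any utterance works; 02_rknn_npu.py under example/ covers the RKNN-side measurement):

import time
import numpy as np
from ppaso_tts import PpasoTTS

tts = PpasoTTS('./', backend='onnx')   # or backend='rknn' on RK3576 / RK3588
text = "μ˜€λŠ˜μ€ 2026λ…„ 4μ›” 26일이고, μ˜€ν›„ 3μ‹œ 30λΆ„μž…λ‹ˆλ‹€."

tts.synthesize(text)                   # warmup (graph load / first-run caches)
runs = []
for _ in range(5):
    t0 = time.perf_counter()
    wav = tts.synthesize(text)
    runs.append(time.perf_counter() - t0)

latency = float(np.median(runs))
audio_s = len(wav) / 22050             # output sample rate is 22.05 kHz
print(f"{latency * 1000:.0f} ms | RTF {latency / audio_s:.3f} | {audio_s / latency:.1f}x real-time")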

Audio Samples

samples/ (RKNN NPU output):

File             Duration  Text
01_greeting.wav  3.66 s    μ•ˆλ…•ν•˜μ„Έμš”. μ €λŠ” 인곡지λŠ₯ λΉ„μ„œμž…λ‹ˆλ‹€. (Hello. I am your AI assistant.)
02_datetime.wav  5.65 s    μ˜€λŠ˜μ€ 2026λ…„ 4μ›” 26일이고, μ˜€ν›„ 3μ‹œ 30λΆ„μž…λ‹ˆλ‹€. (It's April 26, 2026, 3:30 PM.)
03_question.wav  2.35 s    μ •λ§μ΄μ—μš”? μ§„μ§œ κ·Έλž˜μš”? (Really? Is that true?)
04_casual.wav    3.45 s    였늘 점심 뭐 먹을지 μ •ν–ˆμ–΄? 같이 갈래? (Decided what to have for lunch? Want to go together?)
05_long.wav      13.15 s   λ””μžμΈμ€ 인간과 인간, 인간과 μ‚¬νšŒμ™€μ˜ 관계 μ†μ—μ„œ κ·Έ 역할을 μ°Ύμ•„λ‚΄κ²Œ λ˜μ—ˆλ˜ 것이고, μΈκ°„λ‹€μ›€μ˜ λ³Έμ§ˆμ— μ§‘μ€‘ν•˜λ©΄μ„œ 더 λ‚˜μ€ 삢을 μ‹€ν˜„ν•˜λŠ” 맀개체이자 ν–‰μœ„κ°€ λ˜μ—ˆλ‹€. (Design came to find its role in the relationships between people, and between people and society; focusing on the essence of being human, it became a medium and an act for realizing a better life. Long-form, auto-chunked.)

Quick Start

Install

pip install numpy soundfile python-mecab-ko onnxruntime
# For RK3576 / RK3588 NPU:
pip install rknn-toolkit-lite2

Hello World

from ppaso_tts import PpasoTTS
import soundfile as sf

tts = PpasoTTS('./', backend='onnx')         # ONNX (CPU/CUDA, anywhere)
# tts = PpasoTTS('./', backend='rknn')       # RKNN (RK3576 / RK3588 NPU)

wav = tts.synthesize("μ•ˆλ…•ν•˜μ„Έμš”. μžκΈ°μ†Œκ°œ λΆ€νƒλ“œλ¦½λ‹ˆλ‹€.")
sf.write('hello.wav', wav, 22050)

Long input — auto-chunked:

wav = tts.synthesize(
    "μ•ˆλ…•ν•˜μ„Έμš”. 였늘 날씨가 정말 μ’‹λ„€μš”. μ‚°μ±… μ–΄λ– μ„Έμš”? "
    "κΈ΄ λ¬Έμž₯도 끝뢀뢄 잘림 없이 μžλ™μ μœΌλ‘œ ν•©μ„±λ©λ‹ˆλ‹€."
)   # "Hello. The weather is really nice today. How about a walk?
    #  Long sentences are synthesized automatically with no truncation at the end."
sf.write('long.wav', wav, 22050)
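
Conceptually the chunker splits the text at sentence boundaries, synthesizes each piece, and concatenates the audio. The sketch below is an assumption about the internals (the real splitter ships inside the library and may choose boundaries differently), but it shows why prosody can drift slightly across chunk borders (see Limitations):

import re
import numpy as np

def synthesize_chunked_sketch(tts, text):
    # Split on sentence-final punctuation, synthesize each sentence
    # independently, then join the waveforms back to back.
    chunks = [c for c in re.split(r'(?<=[.!?])\s+', text.strip()) if c]
    return np.concatenate([tts.synthesize(c) for c in chunks])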

Custom pronunciations (user dict)

Pass a user_dict mapping any input form → Korean spelling. Replacement is applied automatically before synthesis:

tts = PpasoTTS('./', backend='onnx', user_dict={
    "pok3r":   "포컀",
    "GPT":     "μ§€ν”Όν‹°",
    "ChatGPT": "μ±— μ§€ν”Όν‹°",
})
wav = tts.synthesize("pok3r and GPT are both spoken naturally as Korean.")

Streaming (voice assistant)

LLM token stream → wav emitted on each sentence completion:

from ppaso_tts import PpasoTTS, StreamingTTS

tts = PpasoTTS('./', backend='rknn')
stream = StreamingTTS(tts)

for token in llm_token_stream():        # hook your LLM token stream
    for wav in stream.feed(token):      # emit on strong punct (.!?)
        play(wav)                       # play immediately
for wav in stream.flush():              # end of dialog — drain buffer
    play(wav)
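
To try the loop without a live LLM or an audio device, both hooks can be stubbed. A sketch that replays a fixed reply token by token and writes one wav per completed sentence (llm_token_stream and play above are placeholders for your own stack):

import soundfile as sf
from ppaso_tts import PpasoTTS, StreamingTTS

tts = PpasoTTS('./', backend='onnx')
stream = StreamingTTS(tts)

def llm_token_stream():
    # Stand-in LLM stream: "Hello. How can I help? I'll read out today's schedule."
    reply = "μ•ˆλ…•ν•˜μ„Έμš”. 무엇을 λ„μ™€λ“œλ¦΄κΉŒμš”? 였늘 일정을 μ½μ–΄λ“œλ¦΄κ²Œμš”."
    for word in reply.split(' '):
        yield word + ' '

sentences = []
for token in llm_token_stream():
    sentences.extend(stream.feed(token))   # wavs appear as sentences complete
sentences.extend(stream.flush())           # drain whatever remains

for i, wav in enumerate(sentences):
    sf.write(f'sentence_{i}.wav', wav, 22050)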

More examples

Five examples in example/:

File                 Use case
01_simple_onnx.py    ONNX backend (CPU/CUDA)
02_rknn_npu.py       RKNN backend (RK3576 / RK3588 NPU) + latency measurement
03_chunked.py        Long-input chunking demo
04_streaming.py      StreamingTTS demo
05_vocos_quality.py  High-quality mode — vocos pretrained vocoder drop-in

Limitations

  • Korean only — other languages will be mispronounced.
  • Single speaker — no voice cloning.
  • Mild robotic timbre persists (capacity / speed trade-off).
  • Loanwords / English transliteration may sound unnatural.
  • Long-input chunking — arbitrary length is handled by built-in chunking, but prosody (intonation / pace) is not fully consistent across chunk boundaries.
  • Rare pronunciations — words missing from the bundled lexicon (~25 K entries) fall back to character-level phones, which can sound off; see the workaround sketched after this list.
  • No personal identity — the voice does not represent any real person; do not use it for impersonation.
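
For the rare-pronunciation case, the user_dict mechanism from the Quick Start doubles as a pronunciation patch: spell the problem word out in Hangul so it never reaches the character-level fallback. A sketch (the entry is a hypothetical example, not a known lexicon gap):

from ppaso_tts import PpasoTTS

tts = PpasoTTS('./', backend='onnx', user_dict={
    "TPU": "티피유",   # hypothetical out-of-lexicon term, spelled out in Hangul
})
wav = tts.synthesize("TPU μ„±λŠ₯을 ν™•μΈν•©λ‹ˆλ‹€.")   # "Checking TPU performance."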

Intended / Out-of-Scope Use

Intended:

  • Edge AI voice assistants (Korean speakers)
  • IoT voice notifications (announcements, alarms)
  • Accessibility tooling (screen readers)

Prohibited:

  • ❌ Voice impersonation of specific individuals
  • ❌ Deepfake / fraudulent audio generation
  • ❌ Standalone use in critical systems (medical, legal, emergency) without human review
  • ❌ Spam / fraud automation

Repository Layout

.
├── README.md / README_ko.md     # This file (English / Korean)
├── LICENSE                      # Apache 2.0
├── config.json                  # Model metadata
│
├── onnx/                        # CPU/CUDA inference (37 MB)
├── rknn/                        # RK3576 / RK3588 NPU only (FP16, 21 MB)
├── runtime/                     # G2P + lexicon
├── example/                     # 5 example scripts + ppaso_tts.py class
└── samples/                     # Example synthesized waveforms

License

Apache License 2.0 — free for commercial use, includes a patent grant, and requires preservation of NOTICE / attribution.

Attribution

External resources used in training this model:

  • Fun-CosyVoice 3.0 (Apache 2.0) — used as a teacher to synthesize the Korean training corpus. The teacher weights are not redistributed in this repo.
  • Vocos (MIT) — the ConvNeXt-1D + iSTFT vocoder design is inspired by Vocos. Our vocoder is trained from scratch and does not include any Vocos weights.
  • MFA korean_mfa.dict (MIT, Eleanor Chodroff)
  • mecab-ko (BSD)

Acknowledgments

  • This project was developed with support from i-nx.com.
  • FunAudioLLM team — Fun-CosyVoice 3.0 and its permissive license.
  • Vocos authors — ConvNeXt-1D vocoder architecture.
  • Rockchip — RK3576 NPU documentation and RKNN toolkit.
  • Anthropic Claude — architecture exploration, training pipeline, and model card documentation.

Citation

@misc{ppaso-tts,
  author = {akamotaco},
  title  = {Ppaso-TTS: Korean edge-optimized TTS for RK3576 NPU},
  year   = {2026},
  url    = {https://huggingface.co/akamotaco/ppaso-tts-v1},
  note   = {Developed with support from i-nx.com, and AI assistance from Anthropic Claude.}
}

Teacher model:

@article{du2025cosyvoice,
  title   = {CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author  = {Du, Zhihao and others},
  journal = {arXiv preprint arXiv:2505.17589},
  year    = {2025}
}

Vocoder design:

@inproceedings{siuzdak2024vocos,
  title     = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author    = {Siuzdak, Hubert},
  booktitle = {ICLR},
  year      = {2024}
}
