Riva FastPitch (en-US)

Speech / TTS

NVIDIA · Riva TTS · v1.18 · released 2024-04-01

About

FastPitch is a parallel, transformer-based mel-spectrogram generator with explicit control over the pitch and duration of speech. It is packaged in NVIDIA NeMo / Riva as the front-end of the en-US neural TTS pipeline, with HiFi-GAN as the back-end vocoder. The model was trained on the LJSpeech corpus.

Intended use: Real-time text-to-speech for conversational AI, accessibility, and voice interfaces. Pair with HiFi-GAN for full waveform synthesis.
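The pairing described above can be sketched with the NeMo Python toolkit. This is a minimal sketch rather than the Riva deployment path: the checkpoint names `tts_en_fastpitch` and `tts_en_hifigan` are assumed here and may not match the IPA variant listed under Provenance, and the `synthesize` helper and its parameters are hypothetical names introduced for illustration.

```python
# Two-stage synthesis sketch: FastPitch (text -> mel) then HiFi-GAN (mel -> audio).
# Assumes the `nemo_toolkit[tts]` and `soundfile` packages are installed.

SAMPLE_RATE_HZ = 22050  # LJSpeech recordings are sampled at 22.05 kHz


def synthesize(text, out_path="speech.wav",
               acoustic_name="tts_en_fastpitch",   # assumed checkpoint name
               vocoder_name="tts_en_hifigan"):     # assumed checkpoint name
    """Hypothetical helper: write `text` as speech to `out_path`."""
    # Imports are deferred so this module loads even without NeMo installed.
    import soundfile as sf
    import torch
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    spec_generator = FastPitchModel.from_pretrained(acoustic_name).eval()
    vocoder = HifiGanModel.from_pretrained(vocoder_name).eval()

    with torch.no_grad():
        tokens = spec_generator.parse(text)                              # text -> token ids
        spectrogram = spec_generator.generate_spectrogram(tokens=tokens)  # tokens -> mel
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)    # mel -> waveform

    # `audio` has a leading batch dimension; write the first (only) item.
    sf.write(out_path, audio.detach().to("cpu").numpy()[0], SAMPLE_RATE_HZ)
    return out_path
```

Calling `synthesize("Hello from FastPitch.")` would download both checkpoints on first use and write a mono WAV file. Keeping the acoustic model and vocoder as separate objects mirrors the pipeline split, so either stage can be swapped independently.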

Architecture

Type: encoder-decoder
Parameters: 45M
Layers: 6
Hidden dim: 384

Mel-spectrogram acoustic model: a transformer text encoder, duration and pitch predictors, and a transformer decoder that predicts 80-band mel-spectrogram frames. Designed to pair with a separate vocoder (HiFi-GAN) that converts mel-spectrograms to waveforms. Non-autoregressive: all frames are predicted in parallel, which enables faster-than-real-time inference.
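The role of the duration predictor can be illustrated with a small sketch of length regulation, the step that lets the decoder work on all frames at once: each encoded token is repeated for its predicted number of frames, so the output length is known before decoding begins. The function name, the toy two-dimensional states, and the durations below are illustrative assumptions, not values or code from the model.

```python
from typing import List, Sequence


def length_regulate(token_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each token's hidden state `durations[i]` times (frame-level upsampling)."""
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames: List[List[float]] = []
    for state, n_frames in zip(token_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state so each frame is an independent list.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy example: 3 tokens with hidden dim 2 (the real model uses 384).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 0, 3]  # frames per token; 0 drops the token

frames = length_regulate(states, predicted_durations)
print(len(frames))  # 5 frames in total (2 + 0 + 3)
```

Because the total frame count is simply the sum of the durations, the decoder does not need to generate frames one at a time to discover where the utterance ends.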

Memory

Weights (BF16): 0.09 GB
Activation estimate: 0.05 GB
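The weights figure follows from the parameter count and the storage width per parameter. As a quick check, assuming decimal gigabytes (1 GB = 10^9 bytes) and 2 bytes per BF16 value; the helper name is illustrative:

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 45_000_000  # 45M parameters, from the Architecture section

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # BF16: 2 bytes per parameter
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # FP32: 4 bytes per parameter

print(f"BF16: {bf16_gb:.2f} GB")  # BF16: 0.09 GB
print(f"FP32: {fp32_gb:.2f} GB")  # FP32: 0.18 GB
```

The FP32 line is included only for comparison; the card itself reports the BF16 figure.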

Pricing

Free — open weights

Self-host on your own GPU. The calculator surfaces GPU-hours cost on the hardware page instead of an API price.

Provenance

Source: catalog.ngc.nvidia.com
License: cc-by-4.0
Hugging Face: nvidia/tts_en_fastpitch_ipa
Last verified: 2026-06-25

tts · speech-synthesis · mel-spectrogram · open-weight · english · real-time