Riva FastPitch (en-US)
Speech / TTS · NVIDIA · Riva TTS · v1.18 · released
About
FastPitch is a parallel, transformer-based mel-spectrogram generator with explicit control over the pitch and duration of speech. It is packaged in NVIDIA NeMo / Riva as the front-end of the en-US neural TTS pipeline; the back-end vocoder is HiFi-GAN. The model was trained on the LJSpeech corpus.
Intended use: Real-time text-to-speech for conversational AI, accessibility, and voice interfaces. Pair with HiFi-GAN for full waveform synthesis.
Architecture
- Type: encoder-decoder
- Parameters: 45M
- Layers: 6
- Hidden dim: 384
Mel-spectrogram acoustic model: a transformer text encoder, duration and pitch predictors, and a transformer decoder that predicts 80-band mel-spectrogram frames. Designed to pair with a separate vocoder (HiFi-GAN) that converts mel-spectrograms to a waveform. The model is non-autoregressive: it predicts all frames in parallel, so synthesis runs faster than real time.
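To illustrate how predicted durations make parallel decoding possible, here is a minimal sketch of duration-based length regulation: each encoded input symbol is repeated for its predicted number of frames, so the decoder receives the whole frame sequence at once. This is a pure-Python illustration with made-up values. `length_regulate` is a hypothetical helper, not a NeMo or Riva API; the real model works on batched tensors, and pitch conditioning is omitted here.

```python
from typing import List, Sequence


def length_regulate(
    encodings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each symbol encoding durations[i] times, one copy per frame."""
    if len(encodings) != len(durations):
        raise ValueError("need one duration per encoded symbol")
    frames: List[List[float]] = []
    for vec, n_frames in zip(encodings, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 drops the symbol from the frame sequence.
        frames.extend(list(vec) for _ in range(n_frames))
    return frames


if __name__ == "__main__":
    # Three toy 2-dimensional encodings (the card lists a hidden dim of 384).
    enc = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    dur = [2, 0, 3]  # illustrative predicted frame counts per symbol
    out = length_regulate(enc, dur)
    print(len(out))  # total frames equals the sum of the durations
```

Because the output length is fixed by the durations before decoding starts, no frame depends on a previously generated frame, which is what allows all frames to be predicted in one pass.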
Memory
- Weights (BF16): 0.09 GB
- Activation estimate: 0.05 GB
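The weights figure is consistent with the parameter count: 45M parameters at 2 bytes each (BF16) come to about 0.09 GB. A minimal sketch of that arithmetic, assuming decimal gigabytes (10^9 bytes); `weight_memory_gb` is an illustrative name, not part of any NVIDIA tool, and the activation estimate is not derived here because it depends on batch size and sequence length.

```python
# Back-of-the-envelope estimate of weight memory.
# Assumptions: 1 GB = 1e9 bytes (decimal); BF16 stores each parameter in 2 bytes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "bf16") -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 45M parameters in BF16, as listed on this card.
    print(f"{weight_memory_gb(45_000_000):.2f} GB")
```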
Pricing
Free — open weights
Self-host on your own GPU. The calculator surfaces GPU-hours cost on the hardware page instead of an API price.
Provenance
- Source: catalog.ngc.nvidia.com
- License: cc-by-4.0
- Hugging Face: nvidia/tts_en_fastpitch_ipa
- Last verified: 2026-06-25