What if you could generate human-quality speech, clone any voice from a 3-second sample, and run it all locally without cloud APIs, internet access, or sending a single byte of audio data off your machine?
Neuphonic made that possible with NeuTTS Air.
This open-source model packs 748 million parameters into a compact architecture that runs in real-time on everything from an RTX 4090 to a Raspberry Pi. It produces speech that's virtually indistinguishable from human recordings, with natural tone, pacing, and emotion. All processing happens locally on your hardware.
This guide walks you through NeuTTS Air's architecture, how it compares to other open-source TTS models, and how to deploy it on Spheron for production-grade voice synthesis.
What Makes NeuTTS Air Different
Most open-source TTS models force a tradeoff. High quality requires cloud GPUs, while on-device models sound robotic. NeuTTS Air eliminates that compromise.
Built on a Qwen2-based language model backbone paired with NeuCodec (a custom neural audio codec), the model generates studio-grade speech at real-time speeds on consumer hardware. It's the first open-source TTS system that delivers all three: human-level quality, instant voice cloning, and fully local inference.
Technical Specifications
| Specification | Value |
|---|---|
| Total Parameters | 748M (552M Embedding + Active) |
| Active Parameters | ~360M |
| Architecture | Qwen2-based LM + NeuCodec |
| Context Window | 2,048 tokens (~30s audio) |
| Audio Codec | NeuCodec (50 Hz, single codebook) |
| Codec Bitrate | 0.8 kbps |
| Output Sample Rate | 24 kHz |
| Quantization Formats | GGUF Q4, GGUF Q8 |
| Voice Cloning | 3 to 15 seconds reference audio |
| Languages | English, Spanish, German, French |
| License | Apache 2.0 |
NeuTTS Air also ships with a smaller sibling, NeuTTS Nano, at 229M total parameters (120M active) for even more constrained edge deployments.
Architecture Deep Dive
The Language Model Backbone
NeuTTS Air uses a Qwen2-based language model as its backbone, optimized for text-to-speech generation rather than general text completion. The model takes text input, processes it through a phoneme encoder (powered by espeak-ng), and generates a sequence of acoustic tokens that represent the target speech.
The 2,048-token context window accommodates approximately 30 seconds of audio, including the reference prompt. This is sufficient for generating complete sentences and paragraphs of natural speech, though longer passages need to be chunked.
The model ships in GGUF format with Q4 and Q8 quantizations. The Q4 variant uses 400–600 MB of RAM, making it deployable on devices with as little as 2 GB of available memory. The Q8 variant uses approximately 800 MB for higher fidelity.
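Those footprints line up with back-of-the-envelope math (a rough sketch; real GGUF files add metadata, and some tensors are kept at higher precision, which is why the observed range runs somewhat higher):

```python
# Back-of-the-envelope memory for 748M parameters at 4-bit and 8-bit
PARAMS = 748e6
bytes_q4 = PARAMS * 0.5   # ~4 bits (half a byte) per weight
bytes_q8 = PARAMS * 1.0   # ~8 bits (one byte) per weight
print(f"Q4 ≈ {bytes_q4 / 1e6:.0f} MB, Q8 ≈ {bytes_q8 / 1e6:.0f} MB")
# → Q4 ≈ 374 MB, Q8 ≈ 748 MB
```

The raw weight math sits just below the quoted 400–600 MB (Q4) and ~800 MB (Q8) figures, with the gap accounted for by runtime buffers and file overhead.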
NeuCodec: The Neural Audio Codec
NeuCodec is the engine behind NeuTTS Air's audio quality. Unlike traditional vocoders that reconstruct audio from mel spectrograms, NeuCodec operates as a single-codebook neural codec running at 50 Hz with finite scalar quantization (FSQ).
Here's how it works.
During inference, the language model generates acoustic tokens at 50 tokens per second. NeuCodec then decodes these tokens into raw 24 kHz audio waveforms, expanding each token into 480 audio samples (24,000 samples per second divided by 50 tokens per second). The result is high-fidelity speech from an extremely compressed representation at just 0.8 kbps.
This codec-based approach is what enables NeuTTS Air to run on CPUs. Traditional neural TTS systems like Tacotron plus WaveNet generate audio sample-by-sample at 24,000 samples per second, which demands GPU-level parallelism. NeuCodec only needs to generate 50 tokens per second, then efficiently upsample. This is a much lighter computational load.
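That load difference is simple arithmetic (a quick sketch using the 50 Hz token rate and 24 kHz output rate from the specs above):

```python
# Token rate vs. sample rate: why codec-based TTS is cheap on CPU
TOKEN_RATE_HZ = 50        # acoustic tokens generated per second of speech
SAMPLE_RATE_HZ = 24_000   # output audio samples per second

# Each acoustic token expands into this many raw audio samples
samples_per_token = SAMPLE_RATE_HZ // TOKEN_RATE_HZ
print(samples_per_token)  # → 480

# A sample-by-sample vocoder runs its network 24,000 times per second
# of audio; a codec-based model runs its LM only 50 times, then does a
# comparatively cheap parallel decode.
print(f"{SAMPLE_RATE_HZ // TOKEN_RATE_HZ}x fewer autoregressive steps")
```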
Voice Cloning Pipeline
NeuTTS Air's voice cloning works through reference encoding rather than fine-tuning. You provide a short audio clip (3 to 15 seconds) and its transcript, and the model extracts speaker characteristics (timbre, pitch range, speaking rhythm) into an embedding that conditions all subsequent generation.
With a 3-second reference, speaker similarity reaches 85-90% (measured by cosine distance on speaker embeddings). With 15 seconds of clean audio, similarity improves to 95% or higher. No training, no dataset collection, no GPU hours spent fine-tuning. Just provide a sample and start generating.
Reference audio requirements are straightforward: mono channel, 16–44 kHz sample rate, WAV format, with clean audio and minimal background noise.
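Those requirements are easy to pre-check before handing a clip to the model. Below is a minimal sketch using only the standard library; the thresholds mirror the requirements above, and `validate_reference` is an illustrative helper, not part of the NeuTTS Air API:

```python
import math
import struct
import wave

def validate_reference(path):
    """Pre-check a clip: mono WAV, 16-44 kHz sample rate, 3-15 seconds."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not 16_000 <= rate <= 44_000:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        problems.append(f"duration {duration:.1f}s outside 3-15s")
    return problems

# Demo: write a 5-second mono 24 kHz sine tone and validate it
with wave.open("demo_ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24_000)
    w.writeframes(b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * t / 24_000)))
        for t in range(24_000 * 5)
    ))
print(validate_reference("demo_ref.wav"))  # → []
```

Running a check like this up front catches bad reference clips (stereo files, clips that are too short) before they degrade cloning quality.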
Benchmark Performance
GPU Throughput
On high-end GPUs, NeuTTS Air generates speech far faster than real-time. The following benchmarks use Q4_0 quantization:
| Device | NeuTTS Air (tok/s) | NeuTTS Nano (tok/s) |
|---|---|---|
| NVIDIA RTX 4090 | 16,194 | 19,268 |
| AMD Ryzen 9 HX 370 (CPU) | 119 | 221 |
| Apple iMac M4 16 GB (CPU) | 111 | 195 |
| Samsung Galaxy A25 5G (CPU) | 20 | 45 |
At 16,194 tokens per second on an RTX 4090, NeuTTS Air generates speech over 320x faster than real-time (real-time playback consumes just 50 tokens per second). This means a single GPU can serve hundreds of concurrent TTS streams.
Even on CPU, the model comfortably exceeds real-time. The AMD Ryzen 9 achieves 119 tok/s (over 2x real-time), making NeuTTS Air viable for CPU-only deployments in cost-sensitive environments.
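The real-time factors quoted above follow directly from dividing measured throughput by the codec's 50 tok/s playback rate (a quick check using the NeuTTS Air Q4_0 numbers from the table):

```python
PLAYBACK_RATE = 50  # tokens consumed per second of audio playback

benchmarks = {  # NeuTTS Air tok/s, Q4_0, from the table above
    "RTX 4090": 16_194,
    "Ryzen 9 HX 370": 119,
    "iMac M4": 111,
    "Galaxy A25": 20,
}
for device, tps in benchmarks.items():
    print(f"{device}: {tps / PLAYBACK_RATE:.1f}x real-time")
# The 4090 works out to ~324x real-time; in principle that token
# budget covers ~300+ concurrent real-time streams before batching
# and scheduling overheads.
```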
How It Compares to Other Open-Source TTS Models
| Model | Parameters | Voice Cloning | On-Device | Languages | Real-Time on CPU |
|---|---|---|---|---|---|
| NeuTTS Air | 748M | Yes (3s) | Yes | 4 | Yes |
| XTTS-v2 | ~1.5B | Yes (6s) | Limited | 20+ | No |
| Bark | ~1B | No | No | 13+ | No |
| Piper | 20–80M | No | Yes | 30+ | Yes |
| Kokoro-82M | 82M | No | Yes | 4 | Yes |
NeuTTS Air occupies a unique position: it's the only model that combines voice cloning capability with genuine on-device performance. XTTS-v2 offers broader language support and voice cloning but requires GPU inference for acceptable latency. Bark generates expressive, emotional speech but cannot clone voices and runs slowly. Piper is the fastest on-device option but lacks voice cloning and produces lower-fidelity output.
Core Capabilities
Studio-Grade Speech Synthesis
NeuTTS Air generates speech with natural prosody, appropriate pausing, emotional inflection, and realistic breathing patterns. The output is perceptually indistinguishable from human recordings in controlled listening tests, particularly for English.
The Qwen2 transformer backbone combined with NeuCodec's high-fidelity reconstruction produces speech that sounds natural even at paragraph length, avoiding the monotone drift that plagues many TTS models on longer passages.
Instant Voice Cloning
The zero-shot voice cloning pipeline is NeuTTS Air's standout feature. From a single 3-second audio sample plus transcript, the model captures the following:
- Vocal timbre and tone
- Speaking pace and rhythm
- Pitch range and inflection patterns
- Accent characteristics
This enables use cases that previously required hours of recording and days of fine-tuning: personalized voice assistants, audiobook narration in a specific voice, multilingual dubbing, and accessibility tools that speak in a familiar voice.
Privacy-First Architecture
Every byte of audio stays on your hardware. There are no API calls, no cloud processing, no data leaving your network. For industries with strict compliance requirements (healthcare via HIPAA, finance via SOX, government via FedRAMP), this is not just a feature; it's a requirement.
All generated audio includes a Perth (Perceptual Threshold) watermark for provenance tracking, enabling responsible use while maintaining privacy.
Multi-Format Deployment
NeuTTS Air supports multiple deployment paths:
- GGUF plus llama-cpp-python: Lightweight CPU/GPU inference with quantization
- ONNX decoder: Optimized inference path for production deployments
- Native PyTorch: Full-precision inference with Hugging Face integration
- Gradio UI: Browser-based demo interface for testing and prototyping
Hardware Requirements
NeuTTS Air is designed to run on virtually any hardware, but performance scales with compute. Here are recommended configurations for different deployment scenarios:
Recommended Spheron GPU Configurations
| Configuration | VRAM | Best For |
|---|---|---|
| 1x NVIDIA RTX 4090 | 24 GB | High-throughput production (320x real-time) |
| 1x NVIDIA A6000 | 48 GB | Production with headroom for batch processing |
| 1x NVIDIA A100 | 80 GB | Multi-model serving (TTS + ASR + LLM) |
| 1x NVIDIA L4 | 24 GB | Cost-effective inference |
Storage: 20 GB minimum for model weights and dependencies.
RAM: 16 GB minimum. The Q4 GGUF variant uses 400-600 MB of model memory, but the inference pipeline (espeak-ng phonemizer, NeuCodec decoder, audio buffering) requires additional headroom.
Unlike trillion-parameter LLMs, NeuTTS Air needs only a single GPU (and often no GPU at all). The decision to use GPU infrastructure is about throughput, not capability. A single RTX 4090 serving at 16,194 tok/s can handle the concurrent load that would require dozens of CPU instances.
Deploy NeuTTS Air on Spheron
Step 1: Launch Your GPU Instance
- Log into your Spheron dashboard
- Navigate to the GPU marketplace and select your configuration:
- RTX 4090 for cost-effective, high-throughput TTS serving
- A6000 for production workloads with batch processing headroom
- Set storage to 50 GB (model weights + audio output buffer)
- Choose Ubuntu 22.04 or Ubuntu 24.04 as your base image
Step 2: Add the Startup Script
In the deployment configuration, add the following startup script. This automatically installs all dependencies, downloads the model, and starts the Gradio inference server.
#!/bin/bash
# Exit on error
set -e
echo "--- Setting Up NeuTTS Air Environment ---"
# 1. Update system and install dependencies
sudo apt-get update -y
sudo apt-get install -y git espeak-ng python3-venv wget
# 2. Install Miniconda for clean environment management
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda3
export PATH="/opt/miniconda3/bin:$PATH"
/opt/miniconda3/bin/conda init bash
source ~/.bashrc
# 3. Create and activate Python environment
conda create -n tts python=3.11 -y
source activate tts
# 4. Clone NeuTTS Air repository
git clone https://github.com/neuphonic/neutts-air.git /opt/neutts-air
cd /opt/neutts-air
# 5. Install Python dependencies
pip install -r requirements.txt
pip install gradio
# 6. Create the inference application
cat > /opt/neutts-air/app.py << 'PYEOF'
import os
import sys
sys.path.append("/opt/neutts-air")
from neuttsair.neutts import NeuTTSAir
import numpy as np
import gradio as gr
SAMPLES_PATH = os.path.join("/opt/neutts-air", "samples")
DEFAULT_REF_TEXT = "So I'm live on radio. And I say, well, my dear friend James here clearly, and the whole room just froze."
DEFAULT_REF_PATH = os.path.join(SAMPLES_PATH, "dave.wav")
DEFAULT_GEN_TEXT = "My name is Dave, and um, I'm from London."
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)

def infer(ref_text: str, ref_audio_path: str, gen_text: str) -> tuple[int, np.ndarray]:
    gr.Info("Starting inference request!")
    gr.Info("Encoding reference...")
    ref_codes = tts.encode_reference(ref_audio_path)
    gr.Info(f"Generating audio for input text: {gen_text}")
    wav = tts.infer(gen_text, ref_codes, ref_text)
    return (24_000, wav)

demo = gr.Interface(
    fn=infer,
    inputs=[
        gr.Textbox(label="Reference Text", value=DEFAULT_REF_TEXT),
        gr.Audio(type="filepath", label="Reference Audio", value=DEFAULT_REF_PATH),
        gr.Textbox(label="Text to Generate", value=DEFAULT_GEN_TEXT),
    ],
    outputs=gr.Audio(type="numpy", label="Generated Speech"),
    title="NeuTTS Air on Spheron",
    description="Upload a reference audio, provide reference text, and enter new text to synthesize speech with voice cloning."
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860, share=True)
PYEOF
# 7. Launch the application
echo "--- Launching NeuTTS Air Server ---"
nohup python /opt/neutts-air/app.py > /var/log/neutts.log 2>&1 &
# 8. Wait for server to initialize
echo "--- Waiting for NeuTTS Air to initialize ---"
for i in $(seq 1 300); do
    if curl -s "http://localhost:7860/" > /dev/null; then
        echo "NeuTTS Air is ready!"
        break
    fi
    if [ $((i % 10)) -eq 0 ]; then
        echo "Still loading model... ($i/300)"
    fi
    sleep 2
done
if ! curl -s "http://localhost:7860/" > /dev/null; then
    echo "ERROR: Server took longer than 10 minutes to load."
    echo "Check /var/log/neutts.log for details."
    exit 1
fi

Step 3: Deploy and Monitor
- Click Deploy to launch your instance
- Once the instance is running, SSH into it:
ssh root@<your-instance-ip>

- Monitor the startup progress:
tail -f /var/log/neutts.log

Model download and initialization takes 3 to 5 minutes depending on network speed. Once the Gradio server starts, you'll see the public share URL in the logs.
Step 4: Verify the Deployment
Open the Gradio share URL in your browser. You should see the NeuTTS Air interface with three inputs: reference text, reference audio, and text to generate.
Click Submit with the default values to test. You'll hear speech generated in the voice of the included sample, with natural pacing, emotion, and realistic delivery.
Using NeuTTS Air in Production
Python API Integration
For production applications, bypass the Gradio UI and call the model directly:
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
# Initialize the model
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)
# Encode a reference voice (do this once per speaker)
ref_codes = tts.encode_reference("reference_audio.wav")
# Generate speech
text = "Welcome to our platform. Your account has been created."
ref_text = "Transcript of the reference audio goes here."
wav = tts.infer(text, ref_codes, ref_text)
# Save to file
sf.write("output.wav", wav, 24000)

Batch Processing
For processing large volumes of text (audiobook chapters, notification queues, or content libraries), pre-encode your reference voice and loop through inputs:
import os
import soundfile as sf
from neuttsair.neutts import NeuTTSAir
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)
# Pre-encode reference (one-time cost)
ref_codes = tts.encode_reference("speaker_voice.wav")
ref_text = "Transcript of the reference audio."
# Process text segments
segments = [
"Chapter one. The morning began like any other.",
"She opened the door and stepped into the sunlight.",
"The city stretched out before her, alive with possibility.",
]
output_dir = "generated_audio"
os.makedirs(output_dir, exist_ok=True)
for i, text in enumerate(segments):
    wav = tts.infer(text, ref_codes, ref_text)
    sf.write(f"{output_dir}/segment_{i:04d}.wav", wav, 24000)
    print(f"Generated segment {i+1}/{len(segments)}")

REST API Wrapper
For microservice deployments, wrap the model in a FastAPI server:
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import StreamingResponse
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
import numpy as np
import io
import hashlib
import os
import tempfile

app = FastAPI()
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)

# Cache of pre-encoded reference voices, keyed by audio content hash
voice_cache = {}

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    ref_text: str = Form(...),
    ref_audio: UploadFile = File(...)
):
    audio_bytes = await ref_audio.read()
    cache_key = hashlib.sha256(audio_bytes).hexdigest()
    if cache_key not in voice_cache:
        # Save the uploaded reference audio to a temp file for encoding
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name
        voice_cache[cache_key] = tts.encode_reference(tmp_path)
        os.unlink(tmp_path)
    ref_codes = voice_cache[cache_key]
    wav = tts.infer(text, ref_codes, ref_text)
    # Return WAV audio
    buffer = io.BytesIO()
    sf.write(buffer, wav, 24000, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

Performance Tuning
Use GGUF Quantization for CPU Deployments
For CPU-only inference, use the pre-quantized GGUF models with llama-cpp-python:
pip install llama-cpp-python

The Q4 variant delivers real-time performance on modern CPUs (119 tok/s on AMD Ryzen 9) while using only 400-600 MB of RAM. The Q8 variant offers slightly better audio quality at roughly double the memory cost.
Optimize Reference Encoding
Reference encoding is the most expensive step in the pipeline. In production, pre-encode all reference voices at startup and cache the resulting embeddings. This reduces per-request latency to just the inference step.
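One way to sketch that caching is a generic wrapper around the encoder. Here `make_reference_cache` is an illustrative helper (not part of the library); in practice you would pass `tts.encode_reference` as the encoder:

```python
import os

def make_reference_cache(encode):
    """Wrap an encoder (e.g. tts.encode_reference) with a per-file
    cache keyed on (path, mtime), so an updated clip re-encodes."""
    cache = {}
    def cached(path):
        key = (path, os.path.getmtime(path))
        if key not in cache:
            cache[key] = encode(path)
        return cache[key]
    return cached

# Demo with a stub encoder that counts how often it is invoked
calls = []
def stub_encode(path):
    calls.append(path)
    return f"codes-for-{path}"

cached_encode = make_reference_cache(stub_encode)
with open("ref.wav", "wb") as f:
    f.write(b"fake")
cached_encode("ref.wav")
cached_encode("ref.wav")
print(len(calls))  # → 1  (second call served from cache)
```

Keying on the file's modification time means a replaced reference clip is transparently re-encoded, while repeated requests for the same speaker pay the encoding cost only once.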
GPU Memory Management
NeuTTS Air's 748M parameters require minimal GPU memory: well under 2 GB even at full precision. This leaves substantial headroom on any modern GPU for:
- Running multiple model instances for higher concurrency
- Co-locating TTS with other models (ASR, LLM) on the same GPU
- Increasing batch sizes for throughput-critical applications
Context Window Chunking
The 2,048-token context window handles approximately 30 seconds of audio. For longer content, chunk your input text at natural sentence boundaries and generate segments sequentially; then concatenate the output WAV files for seamless playback.
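A simple chunker along those lines (a sketch; the helper is illustrative, and it approximates the ~30-second budget with a character limit, since exact token counts depend on the phonemizer):

```python
import re

def chunk_text(text, max_chars=400):
    """Split text at sentence boundaries so each chunk stays within a
    rough budget (~30s of speech per chunk at typical pacing)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = "The morning began like any other. " * 20
chunks = chunk_text(paragraph)
print(all(len(c) <= 400 for c in chunks))  # → True
```

Each chunk can then be passed to `tts.infer` in turn and the resulting WAV segments concatenated, as in the batch-processing example above.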
Troubleshooting
espeak-ng Not Found
If you see errors related to phonemization:
sudo apt-get install espeak-ng

Ensure version 1.52.0 or later is installed for optimal language support.
CUDA Out of Memory
NeuTTS Air uses minimal GPU memory. If you see OOM errors, the issue is likely another process consuming VRAM:
nvidia-smi

Check for other GPU processes and terminate them if needed.
Audio Quality Issues
If generated speech sounds distorted or unnatural:
- Verify reference audio is mono channel (not stereo)
- Check that reference audio is between 3–15 seconds
- Ensure reference transcript accurately matches the reference audio
- Use clean audio with minimal background noise
Model Download Failures
If the Hugging Face download stalls or fails:
# Check available disk space
df -h
# Retry with explicit cache directory
export HF_HOME=/opt/hf_cache
python app.py

Why Deploy on Spheron
Running NeuTTS Air locally on a laptop works for prototyping. Production voice AI needs GPU infrastructure. Spheron provides the compute and simplicity to deploy it without managing physical hardware.
Full VM Access: Root control over your environment. Install custom CUDA versions, configure networking, and run profiling tools.
Bare-Metal Performance: No virtualization overhead. Your workloads run directly on the GPU without noisy-neighbor effects or unpredictable throttling.
Cost Efficiency: Pay for GPU time without hidden egress fees, idle charges, or warm-up costs. A single RTX 4090 instance on Spheron can serve hundreds of concurrent TTS streams at a fraction of cloud API pricing.
Privacy Guarantee: Your audio data never leaves your GPU instance. Unlike cloud TTS APIs that process audio on shared infrastructure, Spheron gives you dedicated, isolated compute.
Conclusion
NeuTTS Air brings closed-source voice quality to the open-source ecosystem. With 748M parameters, instant 3-second voice cloning, and real-time CPU inference, it eliminates the tradeoff between speech quality and deployment flexibility.
Deploying it on Spheron takes the infrastructure complexity out of the equation. Choose your GPU, add the startup script, and you have a production-ready voice AI system running in under 10 minutes.
For teams building voice assistants, audiobook platforms, accessibility tools, or any application that needs human-quality speech synthesis without cloud API dependencies, NeuTTS Air on Spheron delivers the performance and privacy you need.
Explore GPU options on Spheron →
Frequently Asked Questions
How realistic is NeuTTS Air's speech output?
NeuTTS Air generates speech that is perceptually indistinguishable from human recordings in controlled tests. The NeuCodec audio codec reconstructs high-fidelity 24 kHz audio from compressed acoustic tokens, producing natural prosody, breathing patterns, and emotional inflection. Quality is highest in English and strong across all four supported languages.
How does voice cloning work without fine-tuning?
NeuTTS Air uses reference encoding (not model fine-tuning) for voice cloning. You provide a 3-15 second audio sample plus its transcript, and the model extracts speaker characteristics into an embedding. This embedding conditions all subsequent generation. With a 3-second reference, speaker similarity reaches 85-90%. With 15 seconds of clean audio, it exceeds 95%.
Can NeuTTS Air run without a GPU?
Yes. The GGUF Q4 quantized model runs in real-time on modern CPUs. An AMD Ryzen 9 achieves 119 tokens per second; over 2x real-time for the 50 tok/s codec rate. Even a Samsung Galaxy A25 smartphone reaches 20 tok/s, which approaches real-time generation. GPUs are only needed for high-throughput production serving.
How does NeuTTS Air compare to cloud TTS APIs?
Cloud APIs like ElevenLabs and Google Cloud TTS offer broader language support and may have slightly higher ceiling quality in some scenarios. However, NeuTTS Air provides comparable English speech quality with zero per-request cost, zero latency from network round-trips, full data privacy, and no rate limits. For teams generating thousands of audio clips, the cost savings are substantial.
What languages does NeuTTS Air support?
NeuTTS Air currently supports English, Spanish, German, and French. English has the highest quality, with the other languages performing well for most use cases. The Neuphonic team has indicated that additional languages are planned for future releases.
Is there a watermark on generated audio?
Yes. All audio generated by NeuTTS Air includes a Perth (Perceptual Threshold) watermark for provenance tracking. This invisible watermark supports responsible use by enabling detection of AI-generated speech without affecting audio quality or listening experience.