What if you could generate human-quality speech, clone any voice from a 3-second sample, and run it all locally without cloud APIs, internet access, or sending a single byte of audio data off your machine?
Neuphonic made that possible with NeuTTS Air.
This open-source model packs 748 million parameters into a compact architecture that runs in real-time on everything from an RTX 4090 to a Raspberry Pi. It produces speech that's virtually indistinguishable from human recordings, with natural tone, pacing, and emotion. All processing happens locally on your hardware.
This guide walks you through NeuTTS Air's architecture, how it compares to other open-source TTS models, and how to deploy it on Spheron for production-grade voice synthesis.
What Makes NeuTTS Air Different
Most open-source TTS models force a tradeoff. High quality requires cloud GPUs, while on-device models sound robotic. NeuTTS Air eliminates that compromise.
Built on a Qwen2-based language model backbone paired with NeuCodec (a custom neural audio codec), the model generates studio-grade speech at real-time speeds on consumer hardware. It's the first open-source TTS system that delivers all three: human-level quality, instant voice cloning, and fully local inference.
Technical Specifications
| Specification | Value |
|---|---|
| Total Parameters | 748M (552M Embedding + Active) |
| Active Parameters | ~360M |
| Architecture | Qwen2-based LM + NeuCodec |
| Context Window | 2,048 tokens (~30s audio) |
| Audio Codec | NeuCodec (50 Hz, single codebook) |
| Codec Bitrate | 0.8 kbps |
| Output Sample Rate | 24 kHz |
| Quantization Formats | GGUF Q4, GGUF Q8 |
| Voice Cloning | 3 to 15 seconds reference audio |
| Languages | English, Spanish, German, French |
| License | Apache 2.0 |
NeuTTS Air also ships with a smaller sibling, NeuTTS Nano, at 229M total parameters (120M active) for even more constrained edge deployments.
Architecture Deep Dive
The Language Model Backbone
NeuTTS Air uses a Qwen2-based language model as its backbone, optimized for text-to-speech generation rather than general text completion. The model takes text input, processes it through a phoneme encoder (powered by espeak-ng), and generates a sequence of acoustic tokens that represent the target speech.
The 2,048-token context window accommodates approximately 30 seconds of audio, including the reference prompt. This is sufficient for generating complete sentences and paragraphs of natural speech, though longer passages need to be chunked.
The model ships in GGUF format with Q4 and Q8 quantizations. The Q4 variant uses 400–600 MB of RAM, making it deployable on devices with as little as 2 GB of available memory. The Q8 variant uses approximately 800 MB for higher fidelity.
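Those footprints line up with back-of-the-envelope math (a rough sketch; real GGUF files add metadata, and some tensors are kept at higher precision, which is why the observed range runs somewhat higher):

```python
# Back-of-the-envelope memory for 748M parameters at 4-bit and 8-bit
PARAMS = 748e6
bytes_q4 = PARAMS * 0.5   # ~4 bits (half a byte) per weight
bytes_q8 = PARAMS * 1.0   # ~8 bits (one byte) per weight
print(f"Q4 ≈ {bytes_q4 / 1e6:.0f} MB, Q8 ≈ {bytes_q8 / 1e6:.0f} MB")
# → Q4 ≈ 374 MB, Q8 ≈ 748 MB
```

The raw weight math sits just below the quoted 400–600 MB (Q4) and ~800 MB (Q8) figures, with the gap accounted for by runtime buffers and file overhead.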
NeuCodec: The Neural Audio Codec
NeuCodec is the engine behind NeuTTS Air's audio quality. Unlike traditional vocoders that reconstruct audio from mel spectrograms, NeuCodec operates as a single-codebook neural codec running at 50 Hz with finite scalar quantization (FSQ).
Here's how it works.
During inference, the language model generates acoustic tokens at 50 tokens per second. NeuCodec then decodes these tokens into raw 24 kHz audio waveforms, expanding each token into 480 audio samples (24,000 samples per second divided by 50 tokens per second). The result is high-fidelity speech from an extremely compressed representation at just 0.8 kbps.
This codec-based approach is what enables NeuTTS Air to run on CPUs. Traditional neural TTS systems like Tacotron plus WaveNet generate audio sample-by-sample at 24,000 samples per second, which demands GPU-level parallelism. NeuCodec only needs to generate 50 tokens per second, then efficiently upsample. This is a much lighter computational load.
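That load difference is simple arithmetic (a quick sketch using the 50 Hz token rate and 24 kHz output rate from the specs above):

```python
# Token rate vs. sample rate: why codec-based TTS is cheap on CPU
TOKEN_RATE_HZ = 50        # acoustic tokens generated per second of speech
SAMPLE_RATE_HZ = 24_000   # output audio samples per second

# Each acoustic token expands into this many raw audio samples
samples_per_token = SAMPLE_RATE_HZ // TOKEN_RATE_HZ
print(samples_per_token)  # → 480

# A sample-by-sample vocoder runs its network 24,000 times per second
# of audio; a codec-based model runs its LM only 50 times, then does a
# comparatively cheap parallel decode.
print(f"{SAMPLE_RATE_HZ // TOKEN_RATE_HZ}x fewer autoregressive steps")
```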
Voice Cloning Pipeline
NeuTTS Air's voice cloning works through reference encoding rather than fine-tuning. You provide a short audio clip (3 to 15 seconds) and its transcript, and the model extracts speaker characteristics (timbre, pitch range, speaking rhythm) into an embedding that conditions all subsequent generation.
With a 3-second reference, speaker similarity reaches 85-90% (measured by cosine distance on speaker embeddings). With 15 seconds of clean audio, similarity improves to 95% or higher. No training, no dataset collection, no GPU hours spent fine-tuning. Just provide a sample and start generating.
Reference audio requirements are straightforward: mono channel, 16–44 kHz sample rate, WAV format, with clean audio and minimal background noise.
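Those requirements are easy to pre-check before handing a clip to the model. Below is a minimal sketch using only the standard library; the thresholds mirror the requirements above, and `validate_reference` is an illustrative helper, not part of the NeuTTS Air API:

```python
import math
import struct
import wave

def validate_reference(path):
    """Pre-check a clip: mono WAV, 16-44 kHz sample rate, 3-15 seconds."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not 16_000 <= rate <= 44_000:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        problems.append(f"duration {duration:.1f}s outside 3-15s")
    return problems

# Demo: write a 5-second mono 24 kHz sine tone and validate it
with wave.open("demo_ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24_000)
    w.writeframes(b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * t / 24_000)))
        for t in range(24_000 * 5)
    ))
print(validate_reference("demo_ref.wav"))  # → []
```

Running a check like this up front catches bad reference clips (stereo files, clips that are too short) before they degrade cloning quality.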
Benchmark Performance
GPU Throughput
On high-end GPUs, NeuTTS Air generates speech far faster than real-time. The following benchmarks use Q4_0 quantization:
| Device | NeuTTS Air (tok/s) | NeuTTS Nano (tok/s) |
|---|---|---|
| NVIDIA RTX 4090 | 16,194 | 19,268 |
| AMD Ryzen 9 HX 370 (CPU) | 119 | 221 |
| Apple iMac M4 16 GB (CPU) | 111 | 195 |
| Samsung Galaxy A25 5G (CPU) | 20 | 45 |
At 16,194 tokens per second on an RTX 4090, NeuTTS Air generates speech over 320x faster than real-time (real-time playback consumes just 50 tokens per second). This means a single GPU can serve hundreds of concurrent TTS streams.
Even on CPU, the model comfortably exceeds real-time. The AMD Ryzen 9 achieves 119 tok/s (over 2x real-time), making NeuTTS Air viable for CPU-only deployments in cost-sensitive environments.
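The real-time factors quoted above follow directly from dividing measured throughput by the codec's 50 tok/s playback rate (a quick check using the NeuTTS Air Q4_0 numbers from the table):

```python
PLAYBACK_RATE = 50  # tokens consumed per second of audio playback

benchmarks = {  # NeuTTS Air tok/s, Q4_0, from the table above
    "RTX 4090": 16_194,
    "Ryzen 9 HX 370": 119,
    "iMac M4": 111,
    "Galaxy A25": 20,
}
for device, tps in benchmarks.items():
    print(f"{device}: {tps / PLAYBACK_RATE:.1f}x real-time")
# The 4090 works out to ~324x real-time; in principle that token
# budget covers ~300+ concurrent real-time streams before batching
# and scheduling overheads.
```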
How It Compares to Other Open-Source TTS Models
| Model | Parameters | Voice Cloning | On-Device | Languages | Real-Time on CPU |
|---|---|---|---|---|---|
| NeuTTS Air | 748M | Yes (3s) | Yes | 4 | Yes |
| XTTS-v2 | ~1.5B | Yes (6s) | Limited | 20+ | No |
| Bark | ~1B | No | No | 13+ | No |
| Piper | 20–80M | No | Yes | 30+ | Yes |
| Kokoro-82M | 82M | No | Yes | 4 | Yes |
NeuTTS Air occupies a unique position: it's the only model that combines voice cloning capability with genuine on-device performance. XTTS-v2 offers broader language support and voice cloning but requires GPU inference for acceptable latency. Bark generates expressive, emotional speech but cannot clone voices and runs slowly. Piper is the fastest on-device option but lacks voice cloning and produces lower-fidelity output.
Core Capabilities
Studio-Grade Speech Synthesis
NeuTTS Air generates speech with natural prosody, appropriate pausing, emotional inflection, and realistic breathing patterns. The output is perceptually indistinguishable from human recordings in controlled listening tests, particularly for English.
The Qwen2 transformer backbone combined with NeuCodec's high-fidelity reconstruction produces speech that sounds natural even at paragraph length, avoiding the monotone drift that plagues many TTS models on longer passages.
Instant Voice Cloning
The zero-shot voice cloning pipeline is NeuTTS Air's standout feature. From a single 3-second audio sample plus transcript, the model captures the following:
- Vocal timbre and tone
- Speaking pace and rhythm
- Pitch range and inflection patterns
- Accent characteristics
This enables use cases that previously required hours of recording and days of fine-tuning: personalized voice assistants, audiobook narration in a specific voice, multilingual dubbing, and accessibility tools that speak in a familiar voice.
Privacy-First Architecture
Every byte of audio stays on your hardware. There are no API calls, no cloud processing, no data leaving your network. For industries with strict compliance requirements (healthcare via HIPAA, finance via SOX, government via FedRAMP), this is not just a feature; it's a requirement.
All generated audio includes a Perth (Perceptual Threshold) watermark for provenance tracking, enabling responsible use while maintaining privacy.
Multi-Format Deployment
NeuTTS Air supports multiple deployment paths:
- GGUF plus llama-cpp-python: Lightweight CPU/GPU inference with quantization
- ONNX decoder: Optimized inference path for production deployments
- Native PyTorch: Full-precision inference with Hugging Face integration
- Gradio UI: Browser-based demo interface for testing and prototyping
Hardware Requirements
NeuTTS Air is designed to run on virtually any hardware, but performance scales with compute. Here are recommended configurations for different deployment scenarios:
Recommended Spheron GPU Configurations
| Configuration | VRAM | Best For |
|---|---|---|
| 1x NVIDIA RTX 4090 | 24 GB | High-throughput production (320x real-time) |
| 1x NVIDIA A6000 | 48 GB | Production with headroom for batch processing |
| 1x NVIDIA A100 | 80 GB | Multi-model serving (TTS + ASR + LLM) |
| 1x NVIDIA L4 | 24 GB | Cost-effective inference |
Storage: 20 GB minimum for model weights and dependencies.
RAM: 16 GB minimum. The Q4 GGUF variant uses 400-600 MB of model memory, but the inference pipeline (espeak-ng phonemizer, NeuCodec decoder, audio buffering) requires additional headroom.
Unlike trillion-parameter LLMs, NeuTTS Air needs only a single GPU (and often no GPU at all). The decision to use GPU infrastructure is about throughput, not capability. A single RTX 4090 serving at 16,194 tok/s can handle the concurrent load that would require dozens of CPU instances.
Deploy NeuTTS Air on Spheron
Step 1: Launch Your GPU Instance
- Log into your Spheron dashboard
- Navigate to the GPU marketplace and select your configuration:
- RTX 4090 for cost-effective, high-throughput TTS serving
- A6000 for production workloads with batch processing headroom
- Set storage to 50 GB (model weights + audio output buffer)
- Choose Ubuntu 22.04 or Ubuntu 24.04 as your base image
Step 2: Add the Startup Script
In the deployment configuration, add the following startup script. This automatically installs all dependencies, downloads the model, and starts the Gradio inference server.
#!/bin/bash
# Exit on error
set -e
echo "--- Setting Up NeuTTS Air Environment ---"
# 1. Update system and install dependencies
sudo apt-get update -y
sudo apt-get install -y git espeak-ng python3-venv wget
# 2. Install Miniconda for clean environment management
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda3
export PATH="/opt/miniconda3/bin:$PATH"
/opt/miniconda3/bin/conda init bash
source ~/.bashrc
# 3. Create and activate Python environment
conda create -n tts python=3.11 -y
source activate tts
# 4. Clone NeuTTS Air repository
git clone https://github.com/neuphonic/neutts-air.git /opt/neutts-air
cd /opt/neutts-air
# 5. Install Python dependencies
pip install -r requirements.txt
pip install gradio
# 6. Create the inference application
cat > /opt/neutts-air/app.py << 'PYEOF'
import os
import sys
sys.path.append("/opt/neutts-air")
from neuttsair.neutts import NeuTTSAir
import numpy as np
import gradio as gr
SAMPLES_PATH = os.path.join("/opt/neutts-air", "samples")
DEFAULT_REF_TEXT = "So I'm live on radio. And I say, well, my dear friend James here clearly, and the whole room just froze."
DEFAULT_REF_PATH = os.path.join(SAMPLES_PATH, "dave.wav")
DEFAULT_GEN_TEXT = "My name is Dave, and um, I'm from London."
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)

def infer(ref_text: str, ref_audio_path: str, gen_text: str) -> tuple[int, np.ndarray]:
    gr.Info("Starting inference request!")
    gr.Info("Encoding reference...")
    ref_codes = tts.encode_reference(ref_audio_path)
    gr.Info(f"Generating audio for input text: {gen_text}")
    wav = tts.infer(gen_text, ref_codes, ref_text)
    return (24_000, wav)

demo = gr.Interface(
    fn=infer,
    inputs=[
        gr.Textbox(label="Reference Text", value=DEFAULT_REF_TEXT),
        gr.Audio(type="filepath", label="Reference Audio", value=DEFAULT_REF_PATH),
        gr.Textbox(label="Text to Generate", value=DEFAULT_GEN_TEXT),
    ],
    outputs=gr.Audio(type="numpy", label="Generated Speech"),
    title="NeuTTS Air on Spheron",
    description="Upload a reference audio, provide reference text, and enter new text to synthesize speech with voice cloning."
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860, share=True)
PYEOF
# 7. Launch the application
echo "--- Launching NeuTTS Air Server ---"
nohup python /opt/neutts-air/app.py > /var/log/neutts.log 2>&1 &
# 8. Wait for server to initialize
echo "--- Waiting for NeuTTS Air to initialize ---"
for i in $(seq 1 300); do
    if curl -s "http://localhost:7860/" > /dev/null; then
        echo "NeuTTS Air is ready!"
        break
    fi
    if [ $((i % 10)) -eq 0 ]; then
        echo "Still loading model... ($i/300)"
    fi
    sleep 2
done
if ! curl -s "http://localhost:7860/" > /dev/null; then
    echo "ERROR: Server took longer than 10 minutes to load."
    echo "Check /var/log/neutts.log for details."
    exit 1
fi

Step 3: Deploy and Monitor
- Click Deploy to launch your instance
- Once the instance is running, SSH into it:
ssh root@<your-instance-ip>

- Monitor the startup progress:
tail -f /var/log/neutts.log

Model download and initialization takes 3 to 5 minutes depending on network speed. Once the Gradio server starts, you'll see the public share URL in the logs.
Step 4: Verify the Deployment
Open the Gradio share URL in your browser. You should see the NeuTTS Air interface with three inputs: reference text, reference audio, and text to generate.
Click Submit with the default values to test. You'll hear speech generated in the voice of the included sample, with natural pacing, emotion, and realistic delivery.
Using NeuTTS Air in Production
Python API Integration
For production applications, bypass the Gradio UI and call the model directly:
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
# Initialize the model
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)
# Encode a reference voice (do this once per speaker)
ref_codes = tts.encode_reference("reference_audio.wav")
# Generate speech
text = "Welcome to our platform. Your account has been created."
ref_text = "Transcript of the reference audio goes here."
wav = tts.infer(text, ref_codes, ref_text)
# Save to file
sf.write("output.wav", wav, 24000)

Batch Processing
For processing large volumes of text (audiobook chapters, notification queues, or content libraries), pre-encode your reference voice and loop through inputs:
import os
import soundfile as sf
from neuttsair.neutts import NeuTTSAir
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)
# Pre-encode reference (one-time cost)
ref_codes = tts.encode_reference("speaker_voice.wav")
ref_text = "Transcript of the reference audio."
# Process text segments
segments = [
"Chapter one. The morning began like any other.",
"She opened the door and stepped into the sunlight.",
"The city stretched out before her, alive with possibility.",
]
output_dir = "generated_audio"
os.makedirs(output_dir, exist_ok=True)
for i, text in enumerate(segments):
    wav = tts.infer(text, ref_codes, ref_text)
    sf.write(f"{output_dir}/segment_{i:04d}.wav", wav, 24000)
    print(f"Generated segment {i+1}/{len(segments)}")

REST API Wrapper
For microservice deployments, wrap the model in a FastAPI server:
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import StreamingResponse
from neuttsair.neutts import NeuTTSAir
import soundfile as sf
import numpy as np
import io
import hashlib
import os
import tempfile

app = FastAPI()
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",
    backbone_device="cuda",
    codec_repo="neuphonic/neucodec",
    codec_device="cuda"
)

# Cache of pre-encoded reference voices, keyed by audio content hash
voice_cache = {}

@app.post("/synthesize")
async def synthesize(
    text: str = Form(...),
    ref_text: str = Form(...),
    ref_audio: UploadFile = File(...)
):
    audio_bytes = await ref_audio.read()
    cache_key = hashlib.sha256(audio_bytes).hexdigest()
    if cache_key not in voice_cache:
        # Save the uploaded reference audio to a temp file for encoding
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name
        voice_cache[cache_key] = tts.encode_reference(tmp_path)
        os.unlink(tmp_path)
    ref_codes = voice_cache[cache_key]
    wav = tts.infer(text, ref_codes, ref_text)
    # Return WAV audio
    buffer = io.BytesIO()
    sf.write(buffer, wav, 24000, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

Performance Tuning
Use GGUF Quantization for CPU Deployments
For CPU-only inference, use the pre-quantized GGUF models with llama-cpp-python:
pip install llama-cpp-python

The Q4 variant delivers real-time performance on modern CPUs (119 tok/s on AMD Ryzen 9) while using only 400-600 MB of RAM. The Q8 variant offers slightly better audio quality at roughly double the memory cost.
Optimize Reference Encoding
Reference encoding is the most expensive step in the pipeline. In production, pre-encode all reference voices at startup and cache the resulting embeddings. This reduces per-request latency to just the inference step.
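One way to sketch that caching is a generic wrapper around the encoder. Here `make_reference_cache` is an illustrative helper (not part of the library); in practice you would pass `tts.encode_reference` as the encoder:

```python
import os

def make_reference_cache(encode):
    """Wrap an encoder (e.g. tts.encode_reference) with a per-file
    cache keyed on (path, mtime), so an updated clip re-encodes."""
    cache = {}
    def cached(path):
        key = (path, os.path.getmtime(path))
        if key not in cache:
            cache[key] = encode(path)
        return cache[key]
    return cached

# Demo with a stub encoder that counts how often it is invoked
calls = []
def stub_encode(path):
    calls.append(path)
    return f"codes-for-{path}"

cached_encode = make_reference_cache(stub_encode)
with open("ref.wav", "wb") as f:
    f.write(b"fake")
cached_encode("ref.wav")
cached_encode("ref.wav")
print(len(calls))  # → 1  (second call served from cache)
```

Keying on the file's modification time means a replaced reference clip is transparently re-encoded, while repeated requests for the same speaker pay the encoding cost only once.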
GPU Memory Management
NeuTTS Air's 748M parameters require minimal GPU memory: well under 2 GB even at full precision. This leaves substantial headroom on any modern GPU for:
- Running multiple model instances for higher concurrency
- Co-locating TTS with other models (ASR, LLM) on the same GPU
- Increasing batch sizes for throughput-critical applications
Context Window Chunking
The 2,048-token context window handles approximately 30 seconds of audio. For longer content, chunk your input text at natural sentence boundaries and generate segments sequentially; then concatenate the output WAV files for seamless playback.
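A simple chunker along those lines (a sketch; the helper is illustrative, and it approximates the ~30-second budget with a character limit, since exact token counts depend on the phonemizer):

```python
import re

def chunk_text(text, max_chars=400):
    """Split text at sentence boundaries so each chunk stays within a
    rough budget (~30s of speech per chunk at typical pacing)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = "The morning began like any other. " * 20
chunks = chunk_text(paragraph)
print(all(len(c) <= 400 for c in chunks))  # → True
```

Each chunk can then be passed to `tts.infer` in turn and the resulting WAV segments concatenated, as in the batch-processing example above.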
Troubleshooting
espeak-ng Not Found
If you see errors related to phonemization:
sudo apt-get install espeak-ng

Ensure version 1.52.0 or later is installed for optimal language support.
CUDA Out of Memory
NeuTTS Air uses minimal GPU memory. If you see OOM errors, the issue is likely another process consuming VRAM:
nvidia-smi

Check for other GPU processes and terminate them if needed.
Audio Quality Issues
If generated speech sounds distorted or unnatural:
- Verify reference audio is mono channel (not stereo)
- Check that reference audio is between 3–15 seconds
- Ensure reference transcript accurately matches the reference audio
- Use clean audio with minimal background noise
Model Download Failures
If the Hugging Face download stalls or fails:
# Check available disk space
df -h
# Retry with explicit cache directory
export HF_HOME=/opt/hf_cache
python app.py

Why Deploy on Spheron
Running NeuTTS Air locally on a laptop works for prototyping. Production voice AI needs GPU infrastructure. Spheron provides the compute and simplicity to deploy it without managing physical hardware.
Full VM Access: Root control over your environment. Install custom CUDA versions, configure networking, and run profiling tools.
Bare-Metal Performance: No virtualization overhead. Your workloads run directly on the GPU without noisy-neighbor effects or unpredictable throttling.
Cost Efficiency: Pay for GPU time without hidden egress fees, idle charges, or warm-up costs. A single RTX 4090 instance on Spheron can serve hundreds of concurrent TTS streams at a fraction of cloud API pricing.
Privacy Guarantee: Your audio data never leaves your GPU instance. Unlike cloud TTS APIs that process audio on shared infrastructure, Spheron gives you dedicated, isolated compute.
Conclusion
NeuTTS Air brings closed-source voice quality to the open-source ecosystem. With 748M parameters, instant 3-second voice cloning, and real-time CPU inference, it eliminates the tradeoff between speech quality and deployment flexibility.
Deploying it on Spheron takes the infrastructure complexity out of the equation. Choose your GPU, add the startup script, and you have a production-ready voice AI system running in under 10 minutes.
For teams building voice assistants, audiobook platforms, accessibility tools, or any application that needs human-quality speech synthesis without cloud API dependencies, NeuTTS Air on Spheron delivers the performance and privacy you need.
Explore GPU options on Spheron →
Frequently Asked Questions
How realistic is NeuTTS Air's speech output?
NeuTTS Air generates speech that is perceptually indistinguishable from human recordings in controlled tests. The NeuCodec audio codec reconstructs high-fidelity 24 kHz audio from compressed acoustic tokens, producing natural prosody, breathing patterns, and emotional inflection. Quality is highest in English and strong across all four supported languages.
How does voice cloning work without fine-tuning?
NeuTTS Air uses reference encoding (not model fine-tuning) for voice cloning. You provide a 3-15 second audio sample plus its transcript, and the model extracts speaker characteristics into an embedding. This embedding conditions all subsequent generation. With a 3-second reference, speaker similarity reaches 85-90%. With 15 seconds of clean audio, it exceeds 95%.
Can NeuTTS Air run without a GPU?
Yes. The GGUF Q4 quantized model runs in real-time on modern CPUs. An AMD Ryzen 9 achieves 119 tokens per second; over 2x real-time for the 50 tok/s codec rate. Even a Samsung Galaxy A25 smartphone reaches 20 tok/s, which approaches real-time generation. GPUs are only needed for high-throughput production serving.
How does NeuTTS Air compare to cloud TTS APIs?
Cloud APIs like ElevenLabs and Google Cloud TTS offer broader language support and may have slightly higher ceiling quality in some scenarios. However, NeuTTS Air provides comparable English speech quality with zero per-request cost, zero latency from network round-trips, full data privacy, and no rate limits. For teams generating thousands of audio clips, the cost savings are substantial.
What languages does NeuTTS Air support?
NeuTTS Air currently supports English, Spanish, German, and French. English has the highest quality, with the other languages performing well for most use cases. The Neuphonic team has indicated that additional languages are planned for future releases.
Is there a watermark on generated audio?
Yes. All audio generated by NeuTTS Air includes a Perth (Perceptual Threshold) watermark for provenance tracking. This invisible watermark supports responsible use by enabling detection of AI-generated speech without affecting audio quality or listening experience.