I built an AI assistant called Atlas. It manages my email, monitors my calendar, drafts content, and keeps my chaotic life from falling apart. But it sounded like every other AI. That generic, clean, slightly uncanny TTS voice that screams "a computer is talking to you."
So I fixed it. With ElevenLabs, FFmpeg, and a technique I'm calling harmonic voice layering.
The Problem With Default AI Voices
Every text-to-speech engine sounds fine. That's the problem. Fine isn't memorable. Fine doesn't feel like your assistant. When Atlas sends me a voice note on Telegram, I want to know it's Atlas before my brain even processes the words. The same way you recognise a friend's voice in a crowd.
ElevenLabs gets you 80% there. Their voices are genuinely good. But they still sound like a single human reading text aloud. I wanted something different. Something that sits in the gap between human and synthetic. Recognisably artificial, but warm. Like a voice you'd trust, but couldn't quite place.
The Technique: Harmonic Voice Layering
The idea is simple: take one good AI voice and turn it into a chord.
Here's the pipeline:
Step 1: Generate the base voice. I use ElevenLabs' API with a voice called "Archer" (Matt Snowden). Calm, conversational, British, thirty-something. Good starting material. The settings matter: stability at 0.5 (some natural variation), similarity boost at 0.75, style at 0.3, and speed bumped to 1.1. Model is eleven_multilingual_v2 because the quality is noticeably better than the flash models.
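A sketch of that request as an ElevenLabs API call. The voice ID and API key are placeholders (look yours up in the ElevenLabs voice library); the settings mirror the ones above, and the exact shape of `voice_settings` may vary by API version.

```python
# Minimal sketch of the Step 1 TTS request. VOICE_ID and API_KEY are
# placeholders, not real values; check the ElevenLabs API docs for the
# current voice_settings schema before relying on this.
import json
import urllib.request

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # assumption: supply your own key
VOICE_ID = "YOUR_VOICE_ID"            # placeholder for the "Archer" voice ID

def build_tts_request(text: str) -> urllib.request.Request:
    payload = {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.5,          # some natural variation
            "similarity_boost": 0.75,
            "style": 0.3,
            "speed": 1.1,
        },
    }
    return urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        data=json.dumps(payload).encode(),
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(...)` and write the response bytes to an MP3 file; that file becomes the input to the FFmpeg steps below.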
Step 2: Pitch it up. The raw voice is male. I wanted something more androgynous, so FFmpeg's rubberband filter pitches it up by a factor of 1.28. The sweet spot for "androgynous but not chipmunk" is somewhere between 1.25 and 1.35. You'll know when you hit it.
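The pitch-up step is a single FFmpeg filter. A sketch of the invocation, built as an argument list (filenames are placeholders; requires an FFmpeg build with librubberband):

```python
# Step 2 as an FFmpeg command. The rubberband filter shifts pitch
# without changing tempo; 1.28 sits in the 1.25-1.35 sweet spot.
def pitch_up_cmd(infile: str, outfile: str, factor: float = 1.28) -> list[str]:
    return [
        "ffmpeg", "-y", "-i", infile,
        "-af", f"rubberband=pitch={factor}",  # pitch shift, tempo preserved
        outfile,
    ]
```

Run it with `subprocess.run(pitch_up_cmd("raw.mp3", "pitched.mp3"))`, then re-run with different factors while hunting for the sweet spot.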
Step 3: Split and layer. This is where it gets interesting. FFmpeg's asplit filter duplicates the audio stream, and each copy gets pitch-shifted to a different musical interval:
- Root (1.0x) stays as the main voice
- Perfect fifth (1.498x) adds natural harmony
- Lower fourth (0.75x) at low volume for depth
- High shimmer (2.0x), barely audible, adds an ethereal quality
- Tight detune (1.02x and 0.98x) creates that digital chorus feel
Think of it like building a synth patch, but with a human voice as the oscillator.
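The split-and-layer graph above can be written as an FFmpeg filtergraph string. A sketch, assuming the six ratios from the list; the labels and structure are one way to express it, not the author's exact command:

```python
# Step 3 as a filtergraph: duplicate the stream with asplit, then
# pitch-shift each copy to its interval with rubberband.
LAYERS = {
    "root": 1.0,       # main voice
    "fifth": 1.498,    # perfect fifth up
    "low": 0.75,       # low layer for depth
    "shimmer": 2.0,    # barely-audible high layer
    "det_up": 1.02,    # tight detune pair for the
    "det_dn": 0.98,    # digital chorus feel
}

def layer_filtergraph() -> str:
    n = len(LAYERS)
    split = f"asplit={n}" + "".join(f"[s{i}]" for i in range(n))
    shifts = ";".join(
        f"[s{i}]rubberband=pitch={ratio}[l{i}]"
        for i, ratio in enumerate(LAYERS.values())
    )
    return f"{split};{shifts}"
```

The graph leaves six labeled streams (`[l0]` through `[l5]`) ready for the mix stage.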
Step 4: Mix. Each layer gets a different volume in the amix filter. The root voice dominates. The harmonics sit underneath, more felt than heard. Get the balance wrong and it sounds like a cathedral choir. Get it right and it sounds like one voice with unusual depth.
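The mix stage as a filter expression. The weights here are illustrative, not the author's tuned numbers; the point is the shape: root near full volume, harmonics well underneath.

```python
# Step 4: amix with per-input weights (root dominant, harmonics felt
# more than heard). Weight values are placeholders to tune by ear.
def mix_filter(weights=(1.0, 0.25, 0.2, 0.1, 0.15, 0.15)) -> str:
    n = len(weights)
    inputs = "".join(f"[l{i}]" for i in range(n))
    w = " ".join(str(x) for x in weights)
    return f"{inputs}amix=inputs={n}:weights='{w}'[mix]"
```

Push the harmonic weights up and you get the cathedral choir; pull them down toward these values and the layers fuse into one voice.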
Step 5: Effects. A few finishing touches:
- Double echo (tuned like cathedral reverb) gives it spatial presence
- Highpass and lowpass filters clean up rumble and harshness
- An EQ boost at 2.5 kHz adds clarity and "presence" in the vocal range
- A light chorus reinforces the synthetic texture
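Those finishing touches chain together as comma-separated FFmpeg filters. Every delay, frequency, and gain value below is an illustrative placeholder; the originals were tuned by ear:

```python
# Step 5: the effects chain. Values are assumptions in the spirit of
# the text, not the author's exact settings.
def effects_chain() -> str:
    return ",".join([
        "aecho=0.8:0.7:40|80:0.3|0.2",    # double echo, cathedral-ish
        "highpass=f=120",                  # remove low rumble
        "lowpass=f=10000",                 # tame high-end harshness
        "equalizer=f=2500:t=q:w=1:g=4",    # presence boost around 2.5 kHz
        "chorus=0.6:0.9:50|60:0.4|0.3:0.25|0.4:2|2.3",  # light chorus
    ])
```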
Twelve Versions in One Night
I didn't land on the final voice on the first try. Versions 1 through 4 were too obviously male. Versions 5 through 8 explored the androgynous pitch range. Version 7b is what stuck: "Androgynous Cathedral (Grounded)." It has the harmonic layering and reverb but doesn't float away into ethereal territory. It sounds planted. Confident. Like an AI that knows what it is and isn't trying to pretend otherwise.
The iteration was fast because the whole pipeline is just an FFmpeg command. Change a number, run it again, listen. No GUI, no rendering queue. Fifteen seconds per attempt.
Why This Matters
Voice is identity. If you're building an AI assistant that people interact with daily, the default TTS voice creates a subconscious association: this is generic, this is temporary, this is just another tool.
A distinctive voice says: this is mine. This has personality. This is worth paying attention to.
The technique is reproducible by anyone with an ElevenLabs account and FFmpeg installed. The only real cost is ElevenLabs' per-character pricing; their Starter plan at $5/month comfortably covers an assistant's daily output.
The Practical Bit
If you want to try this yourself, the core FFmpeg filter chain looks roughly like:
1. `rubberband=pitch=1.28` on the input
2. `asplit=5` to create your layers
3. `rubberband` on each split with your chosen intervals
4. `amix=inputs=5` with volume weights
5. Chain through `aecho`, `highpass`, `lowpass`, and `equalizer`
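Assembled into one invocation, the chain looks roughly like this. Ratios, weights, and effect values are placeholders in the spirit of the steps above, not the author's exact command:

```python
# The full pipeline as a single ffmpeg -filter_complex call. All numeric
# values are illustrative; tune them by ear.
RATIOS = [1.0, 1.498, 0.75, 2.0, 1.02, 0.98]
WEIGHTS = [1.0, 0.25, 0.2, 0.1, 0.15, 0.15]

def full_pipeline_cmd(infile: str, outfile: str) -> list[str]:
    n = len(RATIOS)
    graph = (
        "rubberband=pitch=1.28,"                                   # pitch up
        + f"asplit={n}" + "".join(f"[s{i}]" for i in range(n)) + ";"  # split
        + ";".join(f"[s{i}]rubberband=pitch={r}[l{i}]"
                   for i, r in enumerate(RATIOS))                  # layer
        + ";" + "".join(f"[l{i}]" for i in range(n))
        + f"amix=inputs={n}:weights='{' '.join(map(str, WEIGHTS))}',"  # mix
        + "aecho=0.8:0.7:40|80:0.3|0.2,"                           # effects
        + "highpass=f=120,lowpass=f=10000,"
        + "equalizer=f=2500:t=q:w=1:g=4"
    )
    return ["ffmpeg", "-y", "-i", infile, "-filter_complex", graph, outfile]
```

Each tweak-and-listen iteration is just rerunning this one command with a number changed.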
Play with the pitch intervals. Musical ones (thirds, fifths, octaves) sound natural. Random intervals sound alien, which might be what you want.
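If you want to stay musical, equal temperament gives you the ratio for any interval: each semitone multiplies pitch by the twelfth root of two.

```python
# Equal-tempered interval ratios: 7 semitones is a perfect fifth
# (~1.498, the ratio used above), 12 is an octave.
def interval_ratio(semitones: int) -> float:
    return 2 ** (semitones / 12)
```

So `interval_ratio(4)` gives a major third, `interval_ratio(7)` the fifth from Step 3, and negative values give intervals below the root.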
The whole thing runs in under a second on any modern machine. No GPU needed. No cloud processing. Just FFmpeg doing what FFmpeg does best.
This is part of an ongoing project to build Atlas, an AI assistant that runs on OpenClaw. More on that here.
Rees Calder runs Levity, a lead generation agency, and serves as CPO of Thingiverse. He builds AI-powered products despite questionable coding skills and is currently teaching his assistant to sound less like Siri and more like someone you'd actually want to hear from.