
Build Your Own Voice Assistant

You've been typing to AI. What if you could talk to it?

In 30 minutes you'll build a voice assistant from microphone to speaker.

For absolute beginners. No assumed knowledge.

Chapter 1: Sound is Numbers

What is audio? Your voice is just a list of numbers: 44,100 of them every second at the standard CD sample rate.

Record and Visualize

Click the red button to record 3 seconds of audio:

Ready to record

When you speak, your vocal cords vibrate air molecules. A microphone converts those pressure waves into electrical signals. An analog-to-digital converter samples that signal 44,100 times per second.

Each sample is a number between -1 and 1. Your entire voice — every word, every tone, every breath — is just a sequence of numbers.
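
Want proof? Run the numbers the other way. Here's a minimal sketch using the browser's Web Audio API: fill an array with 44,100 numbers tracing a sine wave and your speaker turns them back into sound. The playTone name and the 440 Hz tone are just for illustration.

// Sound from numbers: a one-second 440 Hz tone built by hand
function playTone() {
  const audioContext = new AudioContext();
  const sampleRate = audioContext.sampleRate;                         // usually 44,100 or 48,000
  const buffer = audioContext.createBuffer(1, sampleRate, sampleRate); // 1 channel, 1 second
  const samples = buffer.getChannelData(0);                           // Float32Array, values in [-1, 1]
  
  for (let i = 0; i < samples.length; i++) {
    samples[i] = 0.2 * Math.sin(2 * Math.PI * 440 * (i / sampleRate));
  }
  
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  source.start(); // Call this from a click handler; browsers block audio without a user gesture
}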

The Raw Data

// Audio recording in JavaScript
async function recordAudio() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);
  const audioContext = new AudioContext();
  const chunks = [];
  
  mediaRecorder.ondataavailable = (event) => {
    chunks.push(event.data);
  };
  
  mediaRecorder.onstop = async () => {
    // MediaRecorder produces compressed audio (usually webm/opus), so label the
    // blob with the recorder's real MIME type rather than 'audio/wav'
    const blob = new Blob(chunks, { type: mediaRecorder.mimeType });
    const arrayBuffer = await blob.arrayBuffer();
    const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
    
    // This is your voice as numbers!
    const samples = audioBuffer.getChannelData(0); // Float32Array
    console.log(`Recorded ${samples.length} samples`);
    console.log(`First 10 samples: ${Array.from(samples.slice(0, 10))}`);
    // [0.0234, -0.0456, 0.0123, -0.0789, ...]
    
    stream.getTracks().forEach(track => track.stop()); // Release the microphone
  };
  
  mediaRecorder.start();
  setTimeout(() => mediaRecorder.stop(), 3000); // Record for 3 seconds
}

📚 Go Deeper

Digital Audio on Wikipedia — How sound becomes numbers

Chapter 2: Speech to Text

OpenAI's Whisper turns speech into text in more than 50 languages. The architecture: audio spectrogram → encoder → decoder → text.

The Whisper Pipeline

Audio Waveform
Mel Spectrogram
Transformer Encoder
Transformer Decoder
Text Output

Test Speech Recognition

Try it yourself — speak into your microphone:

Click to start listening
No API key? We'll use your browser's built-in SpeechRecognition API as a fallback. It's not as good as Whisper, but it works!
🔄 Spaced Repetition: The attention mechanism you built in Build Your Own LLM? Whisper uses the same thing. The encoder attends to audio features; the decoder attends to what it's already transcribed. Same math, different input. → Build Your Own LLM

The Code

// Speech-to-text with OpenAI Whisper API
async function transcribeAudio(audioBlob) {
  const formData = new FormData();
  // The filename extension should match the blob's real format ('audio.webm' for MediaRecorder output)
  formData.append('file', audioBlob, 'audio.webm');
  formData.append('model', 'whisper-1');
  formData.append('response_format', 'verbose_json');
  formData.append('timestamp_granularities[]', 'word'); // Ask for per-word timestamps
  
  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    },
    body: formData
  });
  
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status}`);
  }
  
  const result = await response.json();
  
  return {
    text: result.text,
    words: result.words, // [{word: "hello", start: 0.5, end: 0.9}, ...]
    language: result.language
  };
}

// Fallback: Browser SpeechRecognition API
function browserSpeechRecognition() {
  const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
  recognition.continuous = false;
  recognition.interimResults = false;
  
  return new Promise((resolve, reject) => {
    recognition.onresult = (event) => {
      const transcript = event.results[0][0].transcript;
      resolve({ text: transcript, confidence: event.results[0][0].confidence });
    };
    
    recognition.onerror = reject;
    recognition.start();
  });
}
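
Putting the two together: a small usage sketch (assuming you have a recorded audioBlob and the OPENAI_API_KEY constant used above) that prefers Whisper when a key is available and drops back to the browser otherwise.

// Prefer Whisper, fall back to the built-in recognizer
async function speechToText(audioBlob) {
  if (typeof OPENAI_API_KEY !== 'undefined' && OPENAI_API_KEY) {
    return transcribeAudio(audioBlob);        // Whisper API
  }
  // Note: the browser recognizer listens live from the mic; it ignores the blob
  return browserSpeechRecognition();
}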

📚 Go Deeper

Whisper Paper — "Robust Speech Recognition via Large-Scale Weak Supervision"

OpenAI Whisper API — Complete documentation

Chapter 3: The Brain in the Middle

You have text from speech. Send it to an LLM. Get a text response. The LLM doesn't know the words were spoken aloud; it just sees text.

Voice → Text → LLM → Text

What the Human Says

🎤 "What's the weather like today?"

What the LLM Sees

"What's the weather like today?"

This is identical to any text-based AI agent. Tools, memory, reasoning — everything works the same. The only difference is how the input arrived and how the output will be delivered.

🔗 Connection: This is just the Agent course in disguise. Speech comes in, gets converted to text, hits the same LLM + tools + memory stack, and the response gets converted back to speech. The 'brain' is everything else in this series.
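
To make that concrete, here's a sketch of what "tools work the same" looks like: the transcribed text goes into the exact same Chat Completions request you'd use for typed input, tools included. The get_weather tool is made up for illustration, and OPENAI_API_KEY is assumed.

// Same endpoint, same tool-calling format; the model never knows the words were spoken
async function askWithTools(transcribedText) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: transcribedText }],
      tools: [{
        type: 'function',
        function: {
          name: 'get_weather',                    // hypothetical tool for this sketch
          description: 'Get the current weather for a city',
          parameters: {
            type: 'object',
            properties: { city: { type: 'string' } },
            required: ['city']
          }
        }
      }]
    })
  });
  
  const data = await response.json();
  // The reply is either plain text or a tool call in message.tool_calls,
  // exactly as it would be for typed input
  return data.choices[0].message;
}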

Test Voice + AI

Click and hold the button to talk to your AI assistant

The Code

// Complete voice → AI → voice pipeline
class VoiceAssistant {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.isRecording = false;
    this.mediaRecorder = null;
  }
  
  async startVoiceChat() {
    if (this.isRecording) return;
    
    try {
      // 1. Start recording (recordAudio() from Chapter 1, adapted so it resolves with the recorded Blob)
      const audioBlob = await this.recordAudio();
      
      // 2. Transcribe speech to text (transcribeAudio() from Chapter 2, attached as a method)
      const transcription = await this.transcribeAudio(audioBlob);
      console.log('User said:', transcription.text);
      
      // 3. Send to LLM
      const aiResponse = await this.callLLM(transcription.text);
      console.log('AI responded:', aiResponse);
      
      // 4. Convert the response to speech (speakText() comes in the next chapter!)
      await this.speakText(aiResponse);
      
    } catch (error) {
      console.error('Voice chat error:', error);
    }
  }
  
  async callLLM(userText) {
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: userText
        }]
      })
    });
    
    const data = await response.json();
    return data.choices[0].message.content;
  }
}

const assistant = new VoiceAssistant(OPENAI_API_KEY);

📚 Go Deeper

OpenAI Chat API — Same API you'd use for text chat

Chapter 4: Text to Speech

Turn the response back into audio. Text → phonemes → mel spectrogram → waveform.

Choose Your Voice

  • 🤖 Alloy: Neutral, clear
  • 🎭 Echo: Expressive, dynamic
  • Nova: Warm, engaging
  • 🎤 Onyx: Deep, authoritative

TTS Pipeline

Text: "Hello, world!"
Phonemes: /həˈloʊ wɜːrld/
Mel Spectrogram
Audio Waveform

Different approaches to TTS:

  • OpenAI TTS: Neural TTS with 6 voices
  • ElevenLabs: Voice cloning and custom voices
  • Browser SpeechSynthesis: Built-in, works offline (see the sketch after this list)
  • Google Cloud TTS: 300+ voices, many languages
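
Curious what the built-in option gives you? Here's a small sketch that prints the SpeechSynthesis voices installed on your machine. Voices load asynchronously in some browsers, hence the voiceschanged listener.

// List the browser's built-in voices
function listVoices() {
  const print = () => {
    speechSynthesis.getVoices().forEach((voice) => {
      console.log(`${voice.name} (${voice.lang})${voice.localService ? ' [offline]' : ''}`);
    });
  };
  
  if (speechSynthesis.getVoices().length > 0) {
    print();
  } else {
    speechSynthesis.addEventListener('voiceschanged', print, { once: true });
  }
}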

The Code

// Text-to-speech with OpenAI TTS API
async function textToSpeech(text, voice = 'onyx') {
  const response = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'tts-1',
      input: text,
      voice: voice, // alloy, echo, fable, onyx, nova, shimmer
      response_format: 'mp3'
    })
  });
  
  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }
  
  const audioBlob = await response.blob();
  return audioBlob;
}

// Fallback: Browser SpeechSynthesis API
function browserTextToSpeech(text) {
  return new Promise((resolve) => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.onend = resolve;
    
    // Get available voices (the list can be empty until the browser has loaded them)
    const voices = speechSynthesis.getVoices();
    if (voices.length > 0) {
      utterance.voice = voices.find(voice => voice.name.includes('Google')) || voices[0];
    }
    
    speechSynthesis.speak(utterance);
  });
}

// Play audio blob
function playAudio(audioBlob) {
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
  
  audio.onended = () => {
    URL.revokeObjectURL(audioUrl);
  };
}
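
And a quick usage sketch tying the helpers together (assumes OPENAI_API_KEY is set; falls back to the browser voice if the API call fails):

// Speak a sentence with whichever TTS is available
async function sayIt(text) {
  try {
    playAudio(await textToSpeech(text, 'nova'));
  } catch (error) {
    await browserTextToSpeech(text);   // offline fallback
  }
}

sayIt('Hello! I can talk now.');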

📚 Go Deeper

OpenAI TTS Guide — Complete TTS documentation

ElevenLabs API — Advanced voice cloning and synthesis

Chapter 5: The Full Loop

Mic → Whisper → LLM → TTS → Speaker. All connected. You just built Alexa. Siri. Every voice assistant works exactly like this.

Complete Voice Assistant

👋 Hi! I'm your voice assistant. Hold the button below to talk to me.
Ready for voice conversation

The Architecture You Built

🎤 Microphone
🎧 Whisper (Speech→Text)
🧠 LLM (Think & Respond)
🔊 TTS (Text→Speech)
📢 Speaker
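
Compressed into code, the whole architecture is five awaits in a row. This is a sketch only: recordAudioBlob() is a hypothetical helper that resolves with the recorded Blob, assistant is the VoiceAssistant from Chapter 3, and the rest are the Chapter 2 and 4 functions.

// One trip around the loop
async function voiceLoop() {
  const audioBlob = await recordAudioBlob();           // 🎤 Microphone
  const { text } = await transcribeAudio(audioBlob);   // 🎧 Whisper: speech → text
  const reply = await assistant.callLLM(text);         // 🧠 LLM: think & respond
  const speech = await textToSpeech(reply, 'nova');    // 🔊 TTS: text → speech
  playAudio(speech);                                   // 📢 Speaker
}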

This is identical to every voice assistant:

  • Amazon Alexa: Same pipeline, different models
  • Apple Siri: Same pipeline, different models
  • Google Assistant: Same pipeline, different models
  • Your assistant: Same pipeline, your choice of models!

Complete Implementation

// Complete voice assistant implementation.
// updateUI(state, text) and addMessage(role, text) are small UI helpers you supply:
// one sets the status label, the other appends a line to the on-screen conversation.
class FullVoiceAssistant {
  constructor() {
    this.isListening = false;
    this.mediaRecorder = null;
    this.audioChunks = [];
    this.conversation = []; // Full chat history, so the assistant remembers context
  }
  
  async startListening() {
    if (this.isListening) return;
    this.isListening = true;
    
    try {
      // Get microphone access
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      this.mediaRecorder = new MediaRecorder(stream);
      this.audioChunks = [];
      
      this.mediaRecorder.ondataavailable = (event) => {
        this.audioChunks.push(event.data);
      };
      
      this.mediaRecorder.start();
      this.updateUI('recording', '🎤 Listening...');
      
    } catch (error) {
      console.error('Failed to start recording:', error);
      this.updateUI('error', 'Microphone access denied');
    }
  }
  
  async stopListening() {
    if (!this.isListening) return;
    this.isListening = false;
    
    return new Promise((resolve) => {
      this.mediaRecorder.onstop = async () => {
        try {
          // 1. Package the recorded chunks (MediaRecorder output is usually webm/opus, not wav)
          const audioBlob = new Blob(this.audioChunks, { type: this.mediaRecorder.mimeType });
          this.updateUI('processing', '🎧 Understanding...');
          
          // 2. Speech to text
          const transcription = await this.transcribe(audioBlob);
          this.addMessage('user', transcription.text);
          
          // 3. Get AI response
          this.updateUI('thinking', '🧠 Thinking...');
          const response = await this.getAIResponse(transcription.text);
          this.addMessage('assistant', response);
          
          // 4. Text to speech
          this.updateUI('speaking', '🗣️ Speaking...');
          await this.speak(response);
          
          this.updateUI('ready', 'Hold to talk');
          resolve();
          
        } catch (error) {
          console.error('Processing error:', error);
          this.updateUI('error', 'Something went wrong');
        }
      };
      
      this.mediaRecorder.stop();
      // Stop all audio tracks
      this.mediaRecorder.stream.getTracks().forEach(track => track.stop());
    });
  }
  
  async transcribe(audioBlob) {
    // Try OpenAI Whisper API first, fall back to the browser
    try {
      const formData = new FormData();
      // Give the file an extension that matches the recorded format (MediaRecorder gives webm, not wav)
      formData.append('file', audioBlob, 'audio.webm');
      formData.append('model', 'whisper-1');
      
      const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${OPENAI_API_KEY}` },
        body: formData
      });
      
      // Throw on HTTP errors (e.g. a missing key) so the fallback below actually runs
      if (!response.ok) {
        throw new Error(`Transcription failed: ${response.status}`);
      }
      
      const result = await response.json();
      return { text: result.text };
      
    } catch (error) {
      // Fallback: browserSpeechRecognition() from Chapter 2, attached as a method
      return await this.browserSpeechRecognition();
    }
  }
  
  async getAIResponse(userText) {
    this.conversation.push({ role: 'user', content: userText });
    
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${OPENAI_API_KEY}`
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [
          {role: 'system', content: 'You are a helpful voice assistant. Keep responses concise and conversational.'},
          ...this.conversation
        ]
      })
    });
    
    const data = await response.json();
    const aiResponse = data.choices[0].message.content;
    this.conversation.push({ role: 'assistant', content: aiResponse });
    
    return aiResponse;
  }
  
  async speak(text) {
    try {
      // Try OpenAI TTS API first
      const response = await fetch('https://api.openai.com/v1/audio/speech', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${OPENAI_API_KEY}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          model: 'tts-1',
          input: text,
          voice: 'nova'
        })
      });
      
      // Throw on HTTP errors so the browser fallback below kicks in
      if (!response.ok) {
        throw new Error(`TTS request failed: ${response.status}`);
      }
      
      const audioBlob = await response.blob();
      const audioUrl = URL.createObjectURL(audioBlob);
      const audio = new Audio(audioUrl);
      
      return new Promise((resolve) => {
        audio.onended = () => {
          URL.revokeObjectURL(audioUrl);
          resolve();
        };
        audio.play();
      });
      
    } catch (error) {
      // Fallback to browser speech synthesis
      return new Promise((resolve) => {
        const utterance = new SpeechSynthesisUtterance(text);
        utterance.onend = resolve;
        speechSynthesis.speak(utterance);
      });
    }
  }
}
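
To wire it up, something like this push-to-talk hookup works. Sketch only: '#talk-button' is a hypothetical element id, and you supply the updateUI/addMessage helpers mentioned above.

// Hold-to-talk wiring
const voiceAssistant = new FullVoiceAssistant();
const talkButton = document.querySelector('#talk-button');

talkButton.addEventListener('mousedown', () => voiceAssistant.startListening());
talkButton.addEventListener('mouseup', () => voiceAssistant.stopListening());

// Touch screens fire touch events instead of mouse events
talkButton.addEventListener('touchstart', () => voiceAssistant.startListening());
talkButton.addEventListener('touchend', () => voiceAssistant.stopListening());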

📚 Go Deeper

Web Speech API — Browser built-in speech capabilities

Amazon Polly — Enterprise TTS service

Your Agent Has a Voice

You just built the voice interface behind every smart speaker and virtual assistant:

  • Audio capture — Converting sound waves to numbers
  • Speech recognition — Whisper turning speech into text
  • AI reasoning — LLM processing and responding
  • Text-to-speech — Converting responses to audio
  • Full conversation loop — Natural back-and-forth dialogue

This isn't a prototype. This is how Alexa works. How Siri works. How every voice assistant you've ever used processes your voice.

The difference between typing and talking to AI? Just the input and output methods. The reasoning is identical.

What You Can Build Now

With voice, your AI becomes truly conversational.

  • 🏠 Smart Home Hub: "Turn off the lights and set the temperature to 72"
  • 📚 Learning Companion: Practice languages through natural conversation
  • ♿ Accessibility Tool: Voice control for hands-free computing
  • 📞 Phone Agent: Handle calls, take messages, schedule meetings
  • 🎮 Game Character: NPCs that actually understand and respond naturally
  • 🧑‍⚕️ Health Assistant: Voice-activated symptom checker and medication reminders

Voice interfaces make AI feel magical.

What conversations will you enable?

🧠 Final Recall

Test yourself. No peeking. These questions cover everything you just learned.

1. How is audio represented digitally?

2. What does OpenAI's Whisper model do in the voice assistant pipeline?

3. What does the LLM see when processing voice input?

4. What is the typical pipeline for text-to-speech conversion?

5. What is the complete voice assistant architecture?