You've been typing to AI. What if you could talk to it?
In 30 minutes you'll build a voice assistant from microphone to speaker.
For absolute beginners. No assumed knowledge.
What is audio? Your voice is just a list of numbers, 44,100 per second.
Click the red button to record 3 seconds of audio and see those numbers for yourself:
When you speak, your vocal cords vibrate air molecules. A microphone converts those pressure waves into an electrical signal. An analog-to-digital converter then samples that signal 44,100 times per second (some hardware uses 48,000 instead).
Each sample is a number between -1 and 1. Your entire voice — every word, every tone, every breath — is just a sequence of numbers. Record for 3 seconds and you have 3 × 44,100 = 132,300 of them.
// Audio recording in JavaScript
async function recordAudio() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);
  const chunks = [];
  mediaRecorder.ondataavailable = (event) => {
    chunks.push(event.data);
  };
  mediaRecorder.onstop = async () => {
    // MediaRecorder produces compressed audio (usually webm/opus), so label the blob with its real type
    const blob = new Blob(chunks, { type: mediaRecorder.mimeType });
    const arrayBuffer = await blob.arrayBuffer();
    // Decode the compressed audio back into raw samples
    const audioContext = new AudioContext();
    const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
    // This is your voice as numbers!
    const samples = audioBuffer.getChannelData(0); // Float32Array
    console.log(`Recorded ${samples.length} samples at ${audioBuffer.sampleRate} Hz`);
    console.log(`First 10 samples: ${Array.from(samples.slice(0, 10))}`);
    // [0.0234, -0.0456, 0.0123, -0.0789, ...]
    stream.getTracks().forEach(track => track.stop()); // release the microphone
  };
  mediaRecorder.start();
  setTimeout(() => mediaRecorder.stop(), 3000); // Record for 3 seconds
}
Digital Audio on Wikipedia — How sound becomes numbers
OpenAI's Whisper turns those numbers into text, and it does so in nearly 100 languages. The Whisper architecture: audio spectrogram → encoder → decoder → text.
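The first step of that pipeline, turning raw samples into a spectrogram (a picture of which frequencies are present at each moment), can be approximated right in the browser. The sketch below is not Whisper's actual log-mel front end; it just uses the Web Audio AnalyserNode to read frequency bins from your microphone so you can see the idea. The fftSize and polling interval are arbitrary choices.
// Rough "audio → spectrogram" illustration (not Whisper's real log-mel front end)
async function showFrequencies() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  await audioContext.resume(); // some browsers require a user gesture before audio starts
  const analyser = audioContext.createAnalyser();
  analyser.fftSize = 2048; // arbitrary; yields 1024 frequency bins
  audioContext.createMediaStreamSource(stream).connect(analyser);
  const bins = new Uint8Array(analyser.frequencyBinCount);
  setInterval(() => {
    analyser.getByteFrequencyData(bins); // energy per frequency band, 0-255
    // Each snapshot is one column of a spectrogram; stacked over time, these
    // columns form the image that Whisper's encoder reads.
    console.log('Loudest frequency bin:', bins.indexOf(Math.max(...bins)));
  }, 100);
}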
Try it yourself — speak into your microphone:
// Speech-to-text with OpenAI Whisper API
async function transcribeAudio(audioBlob) {
  const formData = new FormData();
  // MediaRecorder usually produces webm/opus, which Whisper accepts; match the filename to the format
  formData.append('file', audioBlob, 'audio.webm');
  formData.append('model', 'whisper-1');
  formData.append('response_format', 'verbose_json');
  formData.append('timestamp_granularities[]', 'word'); // ask for word-level timestamps
  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    },
    body: formData
  });
  const result = await response.json();
  return {
    text: result.text,
    words: result.words, // [{word: "hello", start: 0.5, end: 0.9}, ...]
    language: result.language
  };
}
// Fallback: Browser SpeechRecognition API
function browserSpeechRecognition() {
  const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
  recognition.continuous = false;
  recognition.interimResults = false;
  return new Promise((resolve, reject) => {
    recognition.onresult = (event) => {
      const transcript = event.results[0][0].transcript;
      resolve({ text: transcript, confidence: event.results[0][0].confidence });
    };
    recognition.onerror = reject;
    recognition.start();
  });
}
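The transcribe functions above expect the recording as a Blob, while the recorder from the first step only logged raw samples. Here is a minimal promise-based variant that resolves with a Blob you can hand straight to transcribeAudio; the 3-second default duration is an arbitrary choice.
// Record the microphone for a few seconds and resolve with an audio Blob
async function recordAudioBlob(durationMs = 3000) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);
  const chunks = [];
  mediaRecorder.ondataavailable = (event) => chunks.push(event.data);
  return new Promise((resolve, reject) => {
    mediaRecorder.onstop = () => {
      stream.getTracks().forEach(track => track.stop()); // release the microphone
      resolve(new Blob(chunks, { type: mediaRecorder.mimeType }));
    };
    mediaRecorder.onerror = (event) => reject(event.error);
    mediaRecorder.start();
    setTimeout(() => mediaRecorder.stop(), durationMs);
  });
}
// Example: record, then transcribe
// const { text } = await transcribeAudio(await recordAudioBlob());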
Whisper Paper — "Robust Speech Recognition via Large-Scale Weak Supervision"
OpenAI Whisper API — Complete documentation
You have text from speech. Send it to an LLM. Get a text response. The LLM doesn't know it's talking to a human. It just sees text.
This is identical to any text-based AI agent. Tools, memory, reasoning — everything works the same. The only difference is how the input arrived and how the output will be delivered.
// Complete voice → AI → voice pipeline
class VoiceAssistant {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.isRecording = false;
    this.mediaRecorder = null;
  }
  // recordAudio(), transcribeAudio(), and speakText() are assumed to wrap the helper
  // functions built in the other steps of this chapter.
  async startVoiceChat() {
    if (this.isRecording) return;
    try {
      // 1. Start recording
      const audioBlob = await this.recordAudio();
      // 2. Transcribe speech to text
      const transcription = await this.transcribeAudio(audioBlob);
      console.log('User said:', transcription.text);
      // 3. Send to LLM
      const aiResponse = await this.callLLM(transcription.text);
      console.log('AI responded:', aiResponse);
      // 4. Convert response to speech (next chapter!)
      await this.speakText(aiResponse);
    } catch (error) {
      console.error('Voice chat error:', error);
    }
  }
  async callLLM(userText) {
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: userText
        }]
      })
    });
    const data = await response.json();
    return data.choices[0].message.content;
  }
}
const assistant = new VoiceAssistant(OPENAI_API_KEY);
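Wiring it to the page takes one event listener. The #talk-button id below is hypothetical; use whatever record control your page has.
// Hypothetical wiring: run one voice exchange when the user clicks a talk button
document.querySelector('#talk-button')?.addEventListener('click', () => {
  assistant.startVoiceChat();
});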
OpenAI Chat API — Same API you'd use for text chat
Turn the response back into audio. Text → phonemes → mel spectrogram → waveform.
There are two easy ways to do it here: a hosted API like OpenAI's TTS, or the browser's built-in speech synthesis. Both are shown below:
// Text-to-speech with OpenAI TTS API
async function textToSpeech(text, voice = 'onyx') {
  const response = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'tts-1',
      input: text,
      voice: voice, // alloy, echo, fable, onyx, nova, shimmer
      response_format: 'mp3'
    })
  });
  if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
  const audioBlob = await response.blob();
  return audioBlob;
}
// Fallback: Browser SpeechSynthesis API
function browserTextToSpeech(text) {
  return new Promise((resolve) => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.onend = resolve;
    // getVoices() can return an empty array until the 'voiceschanged' event fires;
    // in that case we simply use the browser's default voice
    const voices = speechSynthesis.getVoices();
    const preferred = voices.find(voice => voice.name.includes('Google'));
    if (preferred) utterance.voice = preferred;
    speechSynthesis.speak(utterance);
  });
}
// Play audio blob
function playAudio(audioBlob) {
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.onended = () => {
    URL.revokeObjectURL(audioUrl);
  };
  audio.play();
}
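Chaining the two together gives you a speakText() helper, the same step the VoiceAssistant pipeline above calls after the LLM responds. This is a minimal sketch: it relies on textToSpeech() throwing on failure so the browser fallback can take over.
// Speak a reply: API voice first, browser voice as fallback
async function speakText(text) {
  try {
    playAudio(await textToSpeech(text));
  } catch (error) {
    await browserTextToSpeech(text);
  }
}
// speakText('Hello! I can talk now.');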
OpenAI TTS Guide — Complete TTS documentation
ElevenLabs API — Advanced voice cloning and synthesis
Mic → Whisper → LLM → TTS → Speaker. All connected. You just built the same pipeline that powers Alexa, Siri, and every other voice assistant.
Here it is as one complete class:
// Complete voice assistant implementation
class FullVoiceAssistant {
  constructor() {
    this.isListening = false;
    this.mediaRecorder = null;
    this.audioChunks = [];
    this.conversation = [];
  }
  // updateUI() and addMessage() are assumed to be page-specific helpers that update the
  // status indicator and the chat transcript; they're not shown here.
  async startListening() {
    if (this.isListening) return;
    this.isListening = true;
    try {
      // Get microphone access
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      this.mediaRecorder = new MediaRecorder(stream);
      this.audioChunks = [];
      this.mediaRecorder.ondataavailable = (event) => {
        this.audioChunks.push(event.data);
      };
      this.mediaRecorder.start();
      this.updateUI('recording', '🎤 Listening...');
    } catch (error) {
      console.error('Failed to start recording:', error);
      this.isListening = false;
      this.updateUI('error', 'Microphone access denied');
    }
  }
  async stopListening() {
    if (!this.isListening) return;
    this.isListening = false;
    return new Promise((resolve) => {
      this.mediaRecorder.onstop = async () => {
        try {
          // 1. Process audio
          const audioBlob = new Blob(this.audioChunks, { type: this.mediaRecorder.mimeType });
          this.updateUI('processing', '🎧 Understanding...');
          // 2. Speech to text
          const transcription = await this.transcribe(audioBlob);
          this.addMessage('user', transcription.text);
          // 3. Get AI response
          this.updateUI('thinking', '🧠 Thinking...');
          const response = await this.getAIResponse(transcription.text);
          this.addMessage('assistant', response);
          // 4. Text to speech
          this.updateUI('speaking', '🗣️ Speaking...');
          await this.speak(response);
          this.updateUI('ready', 'Hold to talk');
        } catch (error) {
          console.error('Processing error:', error);
          this.updateUI('error', 'Something went wrong');
        }
        resolve(); // settle the promise whether the pipeline succeeded or failed
      };
      this.mediaRecorder.stop();
      // Stop all audio tracks to release the microphone
      this.mediaRecorder.stream.getTracks().forEach(track => track.stop());
    });
  }
  async transcribe(audioBlob) {
    // Try the OpenAI Whisper API first, fall back to the browser
    try {
      const formData = new FormData();
      formData.append('file', audioBlob, 'audio.webm'); // MediaRecorder output is typically webm/opus
      formData.append('model', 'whisper-1');
      const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${OPENAI_API_KEY}` },
        body: formData
      });
      if (!response.ok) throw new Error(`Whisper request failed: ${response.status}`);
      const result = await response.json();
      return { text: result.text };
    } catch (error) {
      // Fallback to browser speech recognition (the browserSpeechRecognition helper from earlier)
      return await this.browserSpeechRecognition();
    }
  }
  async getAIResponse(userText) {
    this.conversation.push({ role: 'user', content: userText });
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${OPENAI_API_KEY}`
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'You are a helpful voice assistant. Keep responses concise and conversational.' },
          ...this.conversation
        ]
      })
    });
    const data = await response.json();
    const aiResponse = data.choices[0].message.content;
    this.conversation.push({ role: 'assistant', content: aiResponse });
    return aiResponse;
  }
  async speak(text) {
    try {
      // Try OpenAI TTS API first
      const response = await fetch('https://api.openai.com/v1/audio/speech', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${OPENAI_API_KEY}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          model: 'tts-1',
          input: text,
          voice: 'nova'
        })
      });
      if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
      const audioBlob = await response.blob();
      const audioUrl = URL.createObjectURL(audioBlob);
      const audio = new Audio(audioUrl);
      return new Promise((resolve) => {
        audio.onended = () => {
          URL.revokeObjectURL(audioUrl);
          resolve();
        };
        audio.play();
      });
    } catch (error) {
      // Fallback to browser speech synthesis
      return new Promise((resolve) => {
        const utterance = new SpeechSynthesisUtterance(text);
        utterance.onend = resolve;
        speechSynthesis.speak(utterance);
      });
    }
  }
}
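The simplest way to drive it is a hold-to-talk button: press to start recording, release to stop and run the pipeline. The #talk-button element is hypothetical, standing in for whatever control your page uses.
// Hypothetical hold-to-talk wiring (assumes a #talk-button element exists)
const voiceAssistant = new FullVoiceAssistant();
const talkButton = document.querySelector('#talk-button');
talkButton.addEventListener('pointerdown', () => voiceAssistant.startListening());
talkButton.addEventListener('pointerup', () => voiceAssistant.stopListening());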
Web Speech API — Browser built-in speech capabilities
Amazon Polly — Enterprise TTS service
You just built the voice interface behind every smart speaker and virtual assistant.
This isn't a toy demo. At its core, this is how Alexa works. How Siri works. How every voice assistant you've ever used actually processes your voice.
The difference between typing and talking to AI? Just the input and output methods. The reasoning is identical.
With voice, your AI becomes truly conversational.
Voice interfaces make AI feel magical.
What conversations will you enable?
Test yourself. No peeking. These questions cover everything you just learned.
1. How is audio represented digitally?
2. What does OpenAI's Whisper model do in the voice assistant pipeline?
3. What does the LLM see when processing voice input?
4. What is the typical pipeline for text-to-speech conversion?
5. What is the complete voice assistant architecture?