AI can read text. But the world is visual.
In 30 minutes you'll teach a model to see — and understand how every AI vision system works.
For absolute beginners. No assumed knowledge.
Your AI agent can write poetry, solve equations, and debate philosophy. But show it a picture of your cat, and it sees... nothing.
The fundamental problem: AI models are text processors. They take words in and put words out. Images are pixels, not words.
So how do you feed a picture into something that only understands words? You make the picture look like words.
CLIP: Connecting Text and Images — OpenAI's blog post introducing the breakthrough model
Vision Transformers literally treat image patches as words. Slice an image into 16×16-pixel patches, and each patch becomes a token: a 224×224 image turns into a sequence of 196 patch tokens.
Input: "A red cat sits on a mat"
Input: 224×224 image of a cat
The breakthrough insight: Both text and images are just sequences of tokens to a Transformer. The architecture doesn't care if token #1 is the word "cat" or a 16×16 patch of pixels.
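To make that concrete, here is a small sketch of the patch-slicing step (my own illustration, not code from the ViT paper), assuming a square grayscale image stored as a flat array of pixel values:

// Sketch: turn an image into a sequence of patch "tokens"
// Assumes a square grayscale image stored as a flat, row-major array of length width*width
function imageToPatchTokens(pixels, width, patchSize = 16) {
  const patchesPerSide = width / patchSize; // e.g. 224 / 16 = 14
  const tokens = [];
  for (let py = 0; py < patchesPerSide; py++) {
    for (let px = 0; px < patchesPerSide; px++) {
      const patch = [];
      for (let y = 0; y < patchSize; y++) {
        for (let x = 0; x < patchSize; x++) {
          const row = py * patchSize + y;
          const col = px * patchSize + x;
          patch.push(pixels[row * width + col]);
        }
      }
      tokens.push(patch); // one flattened patch = one "token" (16×16 = 256 numbers)
    }
  }
  return tokens; // a 224×224 image → 196 tokens, just as a sentence → word tokens
}

A real ViT then multiplies each flattened patch by a learned projection matrix to get a fixed-size embedding, exactly like looking up a word embedding.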
ViT Paper: "An Image is Worth 16×16 Words" — The original Vision Transformer
Yannic Kilcher's ViT Explanation — Great visual walkthrough
CLIP was trained on 400 million image-text pairs. Images and text end up in the same vector space. "King - man + woman = queen", but for images.
Magic happens here: Images and their descriptions get mapped to nearby points in a high-dimensional space.
This is how every AI image search works. Google Photos, Pinterest, Midjourney — they're all measuring distances in CLIP space.
// Simplified CLIP implementation in JavaScript
class SimpleCLIP {
  constructor() {
    this.textEmbeddings = new Map();
    this.imageEmbeddings = new Map();
    // Pre-computed embeddings (in real CLIP, these come from neural networks)
    this.textEmbeddings.set('cat', [0.8, -0.2, 0.5, 0.1]);
    this.textEmbeddings.set('dog', [0.7, -0.1, 0.4, 0.2]);
    this.textEmbeddings.set('car', [-0.3, 0.9, -0.2, 0.6]);
    this.textEmbeddings.set('sunset', [0.1, 0.3, 0.8, -0.4]);
  }

  // Calculate cosine similarity between two vectors
  similarity(vec1, vec2) {
    const dot = vec1.reduce((sum, a, i) => sum + a * vec2[i], 0);
    const mag1 = Math.sqrt(vec1.reduce((sum, a) => sum + a * a, 0));
    const mag2 = Math.sqrt(vec2.reduce((sum, a) => sum + a * a, 0));
    return dot / (mag1 * mag2);
  }

  // Search images by text: rank every stored image by similarity to the query
  searchImages(textQuery) {
    const queryEmbedding = this.textEmbeddings.get(textQuery.toLowerCase());
    if (!queryEmbedding) return [];
    const results = [];
    for (const [image, embedding] of this.imageEmbeddings) {
      const score = this.similarity(queryEmbedding, embedding);
      results.push({ image, score });
    }
    return results.sort((a, b) => b.score - a.score);
  }
}
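A quick usage sketch: the image embeddings below are invented numbers standing in for what CLIP's image encoder would actually produce.

// Usage (hypothetical image embeddings for illustration)
const clip = new SimpleCLIP();
clip.imageEmbeddings.set('cat_photo.jpg', [0.75, -0.15, 0.45, 0.12]);
clip.imageEmbeddings.set('car_photo.jpg', [-0.25, 0.85, -0.15, 0.55]);

console.log(clip.searchImages('cat'));
// → [{ image: 'cat_photo.jpg', score: ~0.999 }, { image: 'car_photo.jpg', score: ~-0.38 }]

Text-to-image search is nothing more than "find the image embeddings closest to the text embedding."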
CLIP Paper — "Learning Transferable Visual Models From Natural Language Supervision"
Welch Labs: CLIP Explained — Excellent visual explanation
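How do images and captions end up in the same space in the first place? CLIP trains both encoders with a contrastive objective: within each batch of image-caption pairs, the matching pairs are pushed toward high similarity and every mismatched pair toward low similarity. Here is a minimal sketch of that objective, a conceptual illustration rather than the real implementation; it shows only the image-to-text direction, while CLIP averages both directions.

// Conceptual sketch of CLIP's contrastive objective (image → text direction only)
// imageEmbs[i] and textEmbs[i] come from the same image-caption pair
function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const magB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (magA * magB);
}

function contrastiveLoss(imageEmbs, textEmbs, temperature = 0.07) {
  const n = imageEmbs.length;
  let loss = 0;
  for (let i = 0; i < n; i++) {
    // Similarity of image i to every caption in the batch
    const logits = textEmbs.map(t => cosine(imageEmbs[i], t) / temperature);
    // Softmax: probability that image i is paired with its own caption (index i)
    const exps = logits.map(Math.exp);
    const total = exps.reduce((sum, e) => sum + e, 0);
    loss += -Math.log(exps[i] / total); // cross-entropy against the correct pairing
  }
  return loss / n; // training lowers this by pulling true pairs together in the shared space
}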
Take a vision encoder (ViT) + a language model (GPT). Connect them with a projection layer. Every multimodal AI is just a vision encoder plugged into an LLM.
Build your own vision-language model by connecting the pieces:
Different vision encoders, different LLMs, same architecture.
// Simplified vision-language model
class VisionLanguageModel {
  constructor() {
    this.visionEncoder = new VisionTransformer();
    this.projectionLayer = new LinearProjection(768, 4096); // Vision dims → text dims
    this.languageModel = new GPT();
  }

  async generate(image, textPrompt) {
    // 1. Encode the image
    const imagePatches = this.preprocessImage(image); // 224×224 image → 8×8 grid of patches (simplified)
    const imageEmbeddings = this.visionEncoder.encode(imagePatches); // [64, 768]

    // 2. Project to language model dimensions
    const projectedEmbeddings = this.projectionLayer.forward(imageEmbeddings); // [64, 4096]

    // 3. Tokenize the text prompt
    const textTokens = this.tokenize(textPrompt); // "What do you see?" → [1234, 567, 890]
    const textEmbeddings = this.languageModel.embed(textTokens); // [3, 4096]

    // 4. Concatenate image + text embeddings
    const combinedInput = [...projectedEmbeddings, ...textEmbeddings]; // [67, 4096]

    // 5. Generate a response conditioned on both
    const response = this.languageModel.generate(combinedInput);
    return this.decode(response);
  }
}
// Usage
const model = new VisionLanguageModel();
const response = await model.generate(catImage, "What do you see?");
console.log(response); // "I see an orange cat sitting on a blue mat..."
LLaVA Paper — Large Language and Vision Assistant
GPT-4V System Card — How OpenAI built multimodal GPT-4
Let's wire up a real vision model. Upload a photo, get a description, ask questions about it. Your agent can now see.
// Complete vision pipeline
async function analyzeImage(imageFile, question = "What do you see?") {
  // 1. Convert the image to base64
  const base64Image = await fileToBase64(imageFile);

  // 2. Call a vision API (OpenAI GPT-4V example; newer vision-capable models
  //    such as gpt-4o accept the same image_url message format)
  // Note: in production, route this call through your own server so the API key
  // is never exposed in the browser.
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-4-vision-preview',
      messages: [{
        role: 'user',
        content: [
          { type: 'text', text: question },
          {
            type: 'image_url',
            image_url: { url: `data:image/jpeg;base64,${base64Image}` }
          }
        ]
      }],
      max_tokens: 500
    })
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
// Helper: read a File and resolve with its base64 contents (data-URL prefix stripped)
function fileToBase64(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.readAsDataURL(file);
    reader.onload = () => {
      const base64 = reader.result.split(',')[1];
      resolve(base64);
    };
    reader.onerror = error => reject(error);
  });
}
// Usage
const imageFile = document.getElementById('image-input').files[0];
const description = await analyzeImage(imageFile, "Describe this image in detail");
console.log(description);
OpenAI Vision Guide — How to use GPT-4 with images
HuggingFace Vision Models — Free alternatives to try
You just built the visual intelligence behind every modern AI system: patch tokenization, a shared image-text embedding space, and a vision encoder wired into an LLM.
This isn't theory. This is how GPT-4V sees your photos. How Claude analyzes your screenshots. How every "multimodal AI" actually works under the hood.
The magic isn't the model. It's the tokenization. Turn anything into tokens, and transformers can process it.
With vision, your AI can understand any visual input.
Vision + Language = Limitless Possibilities
What will you teach your agent to see?
Test yourself. No peeking. These questions cover everything you just learned.
1. Why can't AI models see images by default?
2. How do Vision Transformers (ViT) process images?
3. What makes CLIP's approach to vision-language learning special?
4. What are the core components of a vision-language model?
5. How do you send both an image and text to a multimodal API?