
Build Your Own 3D World

Spatial AI for Sports Broadcasting

Cameras see flat images. The real world is 3D.

In 30 minutes you'll learn how AI reconstructs the world from video — and how this changes sports broadcasting forever.

Chapter 1: "Flat to 3D"

How computers reconstruct depth from images

Imagine watching disc golf on ESPN. A single camera follows a player as they wind up, release, and watch their disc sail across the course. That flat video contains a hidden 3D world — if you know how to extract it.

The Problem

A photo has no depth. It's a 2D projection of a 3D world: when light hits your camera sensor, the depth information is collapsed away. A mountain 10 miles away and a tree 10 feet away can land on the exact same pixel.

But your brain reconstructs 3D from two flat images — your eyes. Computers can do the same thing, and much more.

Depth from Single Images

🖼️ Monocular Depth Estimation

Modern AI can predict depth from a single image. Models like Depth Anything and MiDaS learned from millions of images to pick up the visual cues humans use: relative size, shadows, occlusion, perspective.

Red = close, Blue = far

The catch: Single-image depth is relative, not metric. You know the tree is closer than the mountain, but not by exactly how many meters. For broadcast sports, that's often enough.
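To try it yourself, the Hugging Face transformers depth-estimation pipeline wraps these models in a few lines. A minimal sketch (the model id is one of the Depth Anything checkpoints on the Hub; any depth model works):

```python
from transformers import pipeline
from PIL import Image

# Load a monocular depth model from the Hugging Face Hub.
depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

frame = Image.open("broadcast_frame.jpg")      # any still from the broadcast
result = depth_estimator(frame)

depth_map = result["depth"]                    # relative depth as a PIL image
depth_map.save("broadcast_frame_depth.png")    # ordering only, no metric scale
```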

Structure from Motion (SfM)

🎯 Parallax Explorer

Click matching points in both images. With two views of the same scene, we can triangulate the real 3D position of any point. This is how COLMAP and other SfM systems work.
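The triangulation step itself is a couple of OpenCV calls. Here's a minimal sketch with made-up camera matrices standing in for what calibration or SfM would give you:

```python
import numpy as np
import cv2

# Toy intrinsics and a second camera shifted 0.5 m to the side. Real
# projection matrices come from calibration or SfM (e.g. COLMAP).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)
t = np.array([[-0.5], [0.0], [0.0]])                 # baseline between the two views

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])    # first camera at the origin
P2 = K @ np.hstack([R, t])                           # second camera

# One matched point, as pixel coordinates in each image, shape (2, N)
pts1 = np.array([[700.0], [400.0]])
pts2 = np.array([[650.0], [400.0]])

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)      # homogeneous (4, N)
X = (X_h[:3] / X_h[3]).T                             # (N, 3) points in 3D
print(X)                                             # here: a point ~10 m away
```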

The broadcast problem: Sports cameras follow the action — they don't orbit the scene. This gives you very few viewpoints. COLMAP might register only 4 frames out of 70. That's why broadcast 3D reconstruction is so hard.

You now understand the fundamental challenge. 2D cameras capture 3D worlds, but the depth is lost. AI can estimate it, geometry can triangulate it, but broadcast video gives us limited views.

This limitation drove the computer vision revolution of 2020-2024. Let's see what emerged.

Chapter 2: "Gaussian Splats"

The revolution in 3D reconstruction

NeRFs were first (2020) — train a neural network to represent a 3D scene. Beautiful results, but painfully slow to train and render.

Then came 3D Gaussian Splatting (2023). Same quality, 100x faster. It changed everything.

What is a Gaussian Splat?

A fuzzy 3D blob. Each "splat" has a position in 3D space, a size, a rotation, a color, and an opacity. Imagine a semi-transparent colored ellipsoid floating in space.

Now imagine millions of them. Together, they can approximate any 3D scene.
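In code, a single splat is just a handful of parameters. A simplified sketch (real 3DGS stores color as spherical harmonics so it can change with viewing angle):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One fuzzy blob; a trained scene holds millions of these."""
    position: np.ndarray   # (3,) center in 3D space
    scale: np.ndarray      # (3,) ellipsoid radii along its own axes
    rotation: np.ndarray   # (4,) quaternion orienting the ellipsoid
    color: np.ndarray      # (3,) RGB (real 3DGS uses spherical harmonics)
    opacity: float         # 0 = invisible, 1 = fully opaque
```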

🎨 2D Splat Playground

Click to add colored gaussian blobs. Drag to move them. This is what 3D Gaussian Splatting does, but in 3D with millions of splats.
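Here's roughly what the playground does, in NumPy: each splat is a gaussian falloff around its center, alpha-composited onto the canvas. The numbers are arbitrary:

```python
import numpy as np

def render_splats_2d(splats, height=256, width=256):
    """Alpha-composite isotropic 2D gaussian blobs onto a white canvas."""
    ys, xs = np.mgrid[0:height, 0:width]
    canvas = np.ones((height, width, 3))
    for cx, cy, sigma, color, opacity in splats:
        # Gaussian falloff from the splat center, weighted by its opacity
        alpha = opacity * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        canvas = alpha[..., None] * np.array(color) + (1 - alpha[..., None]) * canvas
    return canvas

# (center_x, center_y, sigma, RGB color, opacity) -- values chosen arbitrarily
image = render_splats_2d([
    (80, 100, 20, (1.0, 0.2, 0.2), 0.8),
    (160, 140, 35, (0.2, 0.4, 1.0), 0.6),
])
```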

How Training Works

Start with a point cloud from SfM (remember Chapter 1). Each point becomes a gaussian splat. Then optimize:

  1. Render the splats to generate an image
  2. Compare with the actual photo from that viewpoint
  3. Adjust splat parameters to minimize the difference
  4. Repeat for all camera views

The loss function is simple: "Does my rendered image look like the real photo?" That's it. The magic is in the differentiable rendering — you can backpropagate through the entire splatting process.
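Here's the shape of that loop in PyTorch. The renderer below is a deliberate toy standing in for the real differentiable rasterizer; the point is that the loss backpropagates straight into the splat parameters:

```python
import torch

# Toy differentiable "renderer": blends splat colors by a softmax over
# opacities. The real 3DGS rasterizer projects and sorts ellipsoids, but the
# optimization loop around it has exactly this shape.
def render_splats(colors, opacities, height=32, width=32):
    weights = torch.softmax(opacities, dim=0)            # (N,)
    blended = (weights[:, None] * colors).sum(dim=0)     # (3,)
    return blended.expand(height, width, 3)

# Learnable splat parameters, e.g. initialized from an SfM point cloud
colors = torch.nn.Parameter(torch.rand(100, 3))
opacities = torch.nn.Parameter(torch.zeros(100))
target = torch.full((32, 32, 3), 0.3)                    # stand-in for a real photo

optimizer = torch.optim.Adam([colors, opacities], lr=1e-2)
for step in range(200):
    rendered = render_splats(colors, opacities)          # 1. render the splats
    loss = (rendered - target).abs().mean()              # 2. compare with the photo
    optimizer.zero_grad()
    loss.backward()                                      # 3. backprop through rendering
    optimizer.step()                                     # 4. adjust splat parameters
```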

The Broadcast Challenge

3D Gaussian Splatting needs many views of the same scene. Research papers use 100-200 photos taken from different angles around an object.

Broadcast video gives you maybe 5-10 useful views as the camera pans. This is an active research problem.

  • 🔬 Single-view 3DGS: research frontier
  • Multi-view 3DGS: production ready
  • 🔬 Dynamic scenes: 4D Gaussian Splatting
  • Real-time training: VGGT, 3.8 seconds

The latest breakthrough: VGGT (Meta, CVPR 2025 Best Paper). 64 frames → 54,000 3D points in 3.8 seconds. Real-time 3D reconstruction is almost here.

For disc golf, imagine this: Live 3D course reconstruction during the broadcast. Viewers could switch to any angle, see the disc's flight path in 3D, understand the course layout instantly.

Chapter 3: "Segmentation"

Cutting objects out of images and video

Before you can track a disc golf player across the course, you need to separate them from the background. Segmentation is how AI cuts objects out of images — and it's the foundation of everything else.

The Types of Segmentation

  • Semantic segmentation — Label every pixel (sky, grass, person, disc, basket)
  • Instance segmentation — Distinguish between objects of the same class (player 1 vs player 2)
  • Panoptic segmentation — Combine both: every pixel gets a class AND instance ID

For sports broadcasting, instance segmentation is key. You need to track individual players, not just "human-shaped regions."

SAM: The Game Changer

✂️ Segment Anything Demo

Click on any object to segment it. SAM (Segment Anything Model) can isolate any object with a single click — no training required.

Meta's SAM changed everything in 2023. Before SAM, you needed to train a custom model for each type of object. Want to segment disc golf players? Train a player segmentation model. Want to segment discs? Train a disc segmentation model.

SAM segments ANYTHING. Zero-shot. One model, any object, perfect masks.
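Using it really is that direct. A minimal sketch with the official segment-anything package (checkpoint name from the repo; the click coordinates are illustrative):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# ViT-H weights from the segment-anything repo
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("broadcast_frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click on the player
masks, scores, _ = predictor.predict(
    point_coords=np.array([[960, 540]]),
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,
)
player_mask = masks[int(np.argmax(scores))]   # boolean H x W mask
```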

🔗 Connection: SAM uses a Vision Transformer — the same ViT architecture from Build Your Own Vision. Patches become tokens, attention finds objects. The difference is the output: instead of a caption, SAM outputs a pixel-level mask.

SAM 2: Video Segmentation

SAM was for static images. SAM 2 extends to video. Click an object in frame 1, and SAM 2 tracks that exact object through the entire video sequence.
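A rough sketch of that workflow with the sam2 package; treat the config and checkpoint paths as assumptions about your install:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names follow the sam2 repo's releases (assumed paths)
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "checkpoints/sam2.1_hiera_large.pt",
)

with torch.inference_mode():
    # A directory of JPEG frames extracted from the broadcast clip
    state = predictor.init_state(video_path="hole7_frames/")

    # One click on the disc in frame 0; SAM 2 carries that object forward
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[640, 360]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        disc_mask = (mask_logits[0] > 0.0).cpu().numpy()   # mask for obj_id 1
```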

For disc golf broadcasting:

  • Segment the player → track them across every frame → know their position always
  • Segment the disc → track its flight path → reconstruct the throw trajectory
  • Segment the basket → use as a landmark for camera calibration

This solves the "what" problem. Now we know where every object is in every frame. Next question: where are they in the real world?

YOLO: Real-Time Detection

SAM is incredible but slow. For real-time sports broadcasting, you need YOLO (You Only Look Once). The nano variants run on phones, and YOLOv8 does detection + segmentation in milliseconds per frame.

  • Player Detection: YOLO + DeepSORT
  • Object Tracking: real-time, production ready
  • 🔬 Disc Tracking: small, fast objects are hard
  • Video Segmentation: SAM 2

The pipeline: YOLO for real-time detection, SAM 2 for precision segmentation, DeepSORT for multi-object tracking. This combination works today.
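A minimal sketch of the detection + tracking half with the ultralytics package, which ships ByteTrack and BoT-SORT trackers out of the box (DeepSORT would slot in the same way, as a separate stage on top of the detections):

```python
from ultralytics import YOLO

# A small off-the-shelf model; a real deployment would fine-tune on disc golf
# footage (players, discs, baskets) for better recall on small objects.
model = YOLO("yolov8n.pt")

# Detection + tracking on a video file, streamed frame by frame
results = model.track(source="broadcast_clip.mp4", stream=True, tracker="bytetrack.yaml")

for frame_result in results:
    for box in frame_result.boxes:
        track_id = int(box.id) if box.id is not None else -1
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"track {track_id}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```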

Chapter 4: "Pixels to GPS"

Geo-localization from broadcast video

The killer application: You have broadcast video of disc golf hole 7. A player throws from the tee. You want to know the exact GPS coordinates where the disc lands. How?

Homography: The Magic Transform

A homography is a projective transform that maps points on one plane to points on another: here, image pixels to positions on the ground plane of a map. If you know where 4+ ground-level points are in both the image AND on the map, you can transform any pixel on that plane to GPS coordinates.
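In OpenCV that's two calls: findHomography on the matched landmarks, then perspectiveTransform for any pixel you care about. A minimal sketch with made-up coordinates; real ones come from the ground-truth sources below:

```python
import numpy as np
import cv2

# Four landmarks visible in both the broadcast frame (pixels) and on a map.
# These numbers are invented; real ones come from UDisc, satellite imagery, or
# KML files, converted to a local metric frame (e.g. UTM) before use.
pixel_pts = np.array([[412, 610], [1310, 588], [980, 240], [215, 300]], dtype=np.float32)
map_pts   = np.array([[0.0, 0.0], [35.0, 2.0], [38.0, 88.0], [-3.0, 90.0]], dtype=np.float32)

# With more than 4 matches, pass method=cv2.RANSAC to reject bad clicks
H, _ = cv2.findHomography(pixel_pts, map_pts)

# Map any pixel (say, where the disc landed) into map coordinates
landing_pixel = np.array([[[870.0, 415.0]]], dtype=np.float32)   # shape (1, 1, 2)
landing_map = cv2.perspectiveTransform(landing_pixel, H)
print(landing_map)   # meters in the local map frame; convert to lat/lon last
```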

📍 Pin the Map

Broadcast Frame

Satellite Map

Click matching landmarks in both images. Tee pad, basket, course markers — any fixed point that appears in both views. After 4+ matches, click anywhere in the broadcast frame to see its GPS coordinate.

Getting Ground Truth Data

The secret sauce: you need to know where things actually are. For disc golf courses, this data exists:

  • UDisc course maps — GPS coordinates of tee pads and baskets
  • Satellite imagery — Google Maps, OpenStreetMap
  • Course designers' KML files — Often publicly available
  • Tournament data — PDGA course layouts with GPS

The workflow: Identify landmarks in the broadcast frame → match to known GPS coordinates → compute homography → transform any pixel to real-world position.

The Challenges

  • Camera Movement: pan/zoom breaks the homography
  • Parallax Error: tall objects aren't on the ground plane
  • 🔬 Auto-Calibration: still manual setup per hole
  • Static Accuracy: ±3 m with good landmarks

Camera zoom changes the homography. Every time the camera operator adjusts zoom or pan, you need to recompute the transform. This is why auto-calibration from video is such an active research area.

Real-World Application

We built this for the USDGC (US Disc Golf Championship). Hours of manual calibration per hole, but the results were incredible: real-time player positions on a course map, throw distances calculated automatically, landing zones highlighted for commentary.

The holy grail: Automatic camera calibration that works with any broadcast feed. No manual setup, no course-specific training. Just point the system at any disc golf video and get GPS coordinates out.

The computer vision community is getting close. The first team that cracks this builds something no sports broadcaster can ignore.

Chapter 5: "The Full Stack"

Putting it all together for sports broadcasting

The vision: A single broadcast camera feed goes in. Out comes a complete spatial understanding of the game.

The Complete Pipeline

🔧 Broadcast Pipeline Builder

Drag components into the pipeline:
📹 Video Input
🔍 Object Detection
✂️ Segmentation
📏 Depth Estimation
📍 Geo-Localization
🎯 Object Tracking
🌐 3D Reconstruction
📊 Analytics
Drop pipeline components here

The optimal pipeline: Video Input → Object Detection → Segmentation → Geo-Localization → Object Tracking → Analytics. Each component feeds the next.
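One way the orchestration might look: a single state object per frame, passed through the stages in order. The stage bodies here are placeholders; each would wrap one of the models from earlier chapters:

```python
from dataclasses import dataclass, field

@dataclass
class FrameState:
    """Everything the pipeline knows about one broadcast frame."""
    frame: object
    detections: list = field(default_factory=list)      # boxes from the detector
    masks: list = field(default_factory=list)           # masks from segmentation
    map_positions: list = field(default_factory=list)   # map/GPS coords via homography
    tracks: list = field(default_factory=list)          # persistent track IDs

# Placeholder stages; in a real system each wraps one model from the earlier
# chapters (YOLO, SAM 2, the homography transform, a tracker, analytics).
def detect_objects(s):    s.detections = ["player_box", "disc_box"]; return s
def segment_objects(s):   s.masks = [f"mask_for_{d}" for d in s.detections]; return s
def geo_localize(s):      s.map_positions = [(0.0, 0.0)] * len(s.detections); return s
def update_tracks(s):     s.tracks = list(range(len(s.detections))); return s
def compute_analytics(s): return s

PIPELINE = [detect_objects, segment_objects, geo_localize, update_tracks, compute_analytics]

def process_frame(frame):
    state = FrameState(frame=frame)
    for stage in PIPELINE:
        state = stage(state)             # each component feeds the next
    return state

print(process_frame(frame="frame_0001").map_positions)
```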

What Comes Out

  1. Live player positions on course map (homography + tracking)
  2. 3D course reconstruction (gaussian splats or depth estimation)
  3. Disc flight tracking (segmentation + trajectory physics)
  4. Automatic highlight detection (motion pattern analysis)
  5. AR broadcast overlays (distance to basket, flight speed, wind effects)

Imagine watching disc golf with this system: Real-time distance measurements, 3D replays from any angle, automatic ace detection, wind visualization, course difficulty analysis.

What Works Today vs What's Research

  • Player Detection & Tracking: YOLO + DeepSORT, production ready
  • Video Segmentation: SAM 2, works beautifully
  • Depth Estimation: relative depth, good enough
  • Geo-Localization: works, needs manual calibration
  • 🔬 3D from Broadcast: limited views, active research
  • 🔬 Real-Time Disc Tracking: small, fast object, very hard

The Competition

Established sports have this already:

  • Hawk-Eye (tennis, cricket) — Ball tracking, line calls, 3D replays
  • Second Spectrum (NBA) — Player tracking, shot probability, defensive analysis
  • MLB Statcast — Exit velocity, launch angle, catch probability

Disc golf has NONE of this. No real-time tracking. No 3D reconstruction. No automatic analytics. No AR overlays. The opportunity is massive.

The Future is Now

The pieces exist. The research is converging.

Computer vision models can segment anything, estimate depth, track objects, and reconstruct 3D scenes from video. The hardware can run it in real-time. The broadcast infrastructure is ready.

The first team that assembles this pipeline for disc golf — or any sport — builds something no broadcaster can ignore.

And now you understand every piece.
