Spatial AI for Sports Broadcasting
Cameras see flat images. The real world is 3D.
In 30 minutes you'll learn how AI reconstructs the world from video — and how this changes sports broadcasting forever.
Imagine watching disc golf on ESPN. A single camera follows a player as they wind up, release, and watch their disc sail across the course. That flat video contains a hidden 3D world — if you know how to extract it.
A photo has no depth. It's a 2D projection of a 3D world. When light hits your camera sensor, all depth information is lost. A mountain 10 miles away and a tree 10 feet away can land on the exact same pixel.
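Here's a minimal numerical sketch of that loss under a pinhole camera model. The focal length, principal point, and the two 3D points are made-up values, chosen so both points project to the same pixel.

```python
import numpy as np

# Pinhole projection: a 3D point (X, Y, Z) in camera coordinates lands at
# pixel (f*X/Z + cx, f*Y/Z + cy). Depth Z divides out, which is exactly
# where the information disappears.
f, cx, cy = 1000.0, 640.0, 360.0                  # made-up intrinsics (pixels)

def project(point_3d):
    X, Y, Z = point_3d
    return np.array([f * X / Z + cx, f * Y / Z + cy])

tree     = np.array([1.0, 0.5, 3.0])              # ~3 m from the camera
mountain = np.array([5000.0, 2500.0, 15000.0])    # ~15 km away, same direction

print(project(tree))       # [973.33 526.67]
print(project(mountain))   # [973.33 526.67] -> identical pixel, wildly different depth
```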
But your brain reconstructs 3D from two flat images — your eyes. Computers can do the same thing, and much more.
Modern AI can predict depth from a single image. Models like Depth Anything and MiDaS learned from millions of images to read visual cues: relative size, shadows, occlusion, perspective.
Red = close, Blue = far
The catch: Single-image depth is relative, not metric. You know the tree is closer than the mountain, but not by exactly how many meters. For broadcast sports, that's often enough.
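A minimal sketch of producing such a relative depth map, using the small MiDaS model via torch.hub. The model and transform names follow the intel-isl/MiDaS hub entry; the image path is a placeholder, and the output is unitless inverse-relative depth (larger values mean closer).

```python
import cv2
import torch

# Load the small MiDaS model and its matching preprocessing transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

# Read one broadcast frame (placeholder path) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("broadcast_frame.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Upsample the prediction back to the original frame resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()

# `depth` is relative, not metric: good for "closer/farther", not for metres.
```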
Click matching points in both images. With two views of the same scene, we can triangulate the real 3D position of any point. This is how COLMAP and other Structure-from-Motion (SfM) systems work.
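A sketch of that triangulation with OpenCV. The intrinsics, the two camera poses, and the clicked pixel pair are illustrative values; in a real SfM system the poses themselves also have to be estimated.

```python
import numpy as np
import cv2

# Shared intrinsics (illustrative) and two known camera poses one metre apart.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera 1 at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # camera 2, 1 m to the right

# One clicked correspondence: the same physical point seen in both frames.
pt1 = np.array([[973.33], [526.67]])   # pixel in view 1
pt2 = np.array([[640.00], [526.67]])   # pixel in view 2

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)   # homogeneous 4-vector
X = (X_h[:3] / X_h[3]).ravel()
print(X)   # ≈ [1.0, 0.5, 3.0] -> metres, in camera-1 coordinates
```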
The broadcast problem: Sports cameras follow the action — they don't orbit the scene. This gives you very few viewpoints. COLMAP might register only 4 frames out of 70. That's why broadcast 3D reconstruction is so hard.
You now understand the fundamental challenge. 2D cameras capture 3D worlds, but the depth is lost. AI can estimate it, geometry can triangulate it, but broadcast video gives us limited views.
This limitation drove the computer vision revolution of 2020-2024. Let's see what emerged.
NeRFs (Neural Radiance Fields) came first in 2020: train a neural network to represent a 3D scene. Beautiful results, but painfully slow to train and render.
Then came 3D Gaussian Splatting (2023). Same quality, 100x faster. It changed everything.
A gaussian splat is a fuzzy 3D blob. Each splat has a position in 3D space, a size, a rotation, a color, and an opacity. Imagine a semi-transparent colored ellipsoid floating in space.
Now imagine millions of them. Together, they can approximate any 3D scene.
Click to add colored gaussian blobs. Drag to move them. This is what 3D Gaussian Splatting does, but in 3D with millions of splats.
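In code, a scene is nothing more than big tensors of those parameters, one row per splat. This layout is illustrative; real implementations such as the original 3DGS codebase store colors as spherical harmonics and initialize positions from the SfM point cloud.

```python
import torch

num_splats = 1_000_000
means     = torch.randn(num_splats, 3)   # position in 3D space
scales    = torch.rand(num_splats, 3)    # ellipsoid size along each axis
rotations = torch.randn(num_splats, 4)   # orientation as a quaternion
colors    = torch.rand(num_splats, 3)    # RGB (real systems: spherical harmonics)
opacities = torch.rand(num_splats, 1)    # how see-through the blob is

splat_params = [means, scales, rotations, colors, opacities]
for p in splat_params:
    p.requires_grad_(True)               # every number is trainable
```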
Start with a point cloud from SfM (remember Chapter 1). Each point becomes a gaussian splat. Then optimize every splat's position, shape, color, and opacity until the rendered views match the training photos.
The loss function is simple: "Does my rendered image look like the real photo?" That's it. The magic is in the differentiable rendering — you can backpropagate through the entire splatting process.
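To make that loop concrete without a CUDA rasterizer, here is a self-contained 2D toy: the splats are flat gaussians blended additively onto a tiny image, and the target "photo" is random data. Real 3DGS projects the 3D tensors above and alpha-composites them with a fast rasterizer, but the photometric loss and the training loop have exactly this shape.

```python
import torch

H, W, N = 64, 64, 200
ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")

means     = (torch.rand(N, 2) * torch.tensor([H, W], dtype=torch.float32)).requires_grad_(True)
scales    = torch.full((N,), 4.0, requires_grad=True)   # splat radius in pixels
colors    = torch.rand(N, 3, requires_grad=True)
opacities = torch.rand(N, requires_grad=True)

def render():
    # Squared distance from every pixel to every splat centre: shape (N, H, W).
    d2 = (ys[None] - means[:, 0].reshape(-1, 1, 1)) ** 2 \
       + (xs[None] - means[:, 1].reshape(-1, 1, 1)) ** 2
    w = opacities.reshape(-1, 1, 1) * torch.exp(-d2 / (2 * scales.reshape(-1, 1, 1) ** 2))
    return torch.einsum("nhw,nc->hwc", w, colors).clamp(0, 1)   # additive colour blend

target = torch.rand(H, W, 3)                       # stand-in for one real training photo
optimizer = torch.optim.Adam([means, scales, colors, opacities], lr=0.05)

for step in range(500):
    loss = (render() - target).abs().mean()        # "does my render look like the photo?"
    loss.backward()                                # backprop through the whole renderer
    optimizer.step()
    optimizer.zero_grad()
```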
3D Gaussian Splatting needs many views of the same scene. Research papers use 100-200 photos taken from different angles around an object.
Broadcast video gives you maybe 5-10 useful views as the camera pans. This is an active research problem.
The latest breakthrough: VGGT (Meta, CVPR 2025 Best Paper). 64 frames → 54,000 3D points in 3.8 seconds. Real-time 3D reconstruction is almost here.
For disc golf, imagine this: Live 3D course reconstruction during the broadcast. Viewers could switch to any angle, see the disc's flight path in 3D, understand the course layout instantly.
Before you can track a disc golf player across the course, you need to separate them from the background. Segmentation is how AI cuts objects out of images — and it's the foundation of everything else.
For sports broadcasting, instance segmentation is key. You need to track individual players, not just "human-shaped regions."
Click on any object to segment it. SAM (Segment Anything Model) can isolate any object with a single click — no training required.
Meta's SAM changed everything in 2023. Before SAM, you needed to train a custom model for each type of object. Want to segment disc golf players? Train a player segmentation model. Want to segment discs? Train a disc segmentation model.
SAM segments ANYTHING. Zero-shot. One model, any object, high-quality masks.
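A sketch of that single-click workflow with the segment-anything package. The checkpoint filename matches Meta's released ViT-B weights; the image path and the click coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor   # pip install segment-anything

# Load SAM (ViT-B variant) and embed one broadcast frame.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
frame = cv2.cvtColor(cv2.imread("broadcast_frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# One click on the player is the entire "prompt".
click = np.array([[812, 430]])
masks, scores, _ = predictor.predict(
    point_coords=click,
    point_labels=np.array([1]),     # 1 = "this point is inside the object"
    multimask_output=True,
)
player_mask = masks[np.argmax(scores)]   # boolean HxW mask of the clicked object
```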
SAM was for static images. SAM 2 extends to video. Click an object in frame 1, and SAM 2 tracks that exact object through the entire video sequence.
For disc golf broadcasting: click the player once on the tee and SAM 2 carries that mask through the wind-up, release, and follow-through; click the disc and it follows the flight frame by frame.
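A sketch of that video workflow with the facebookresearch/sam2 package. The config and checkpoint names follow the repo's examples (the "small" model here), init_state expects a directory of extracted frames, and the paths and click coordinate are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_s.yaml",
                                       "checkpoints/sam2.1_hiera_small.pt")

# Load the clip (a directory of extracted frames), click the player once in frame 0.
state = predictor.init_state(video_path="hole7_frames/")
predictor.add_new_points_or_box(
    state, frame_idx=0, obj_id=1,
    points=np.array([[812, 430]], dtype=np.float32),   # one click on the player
    labels=np.array([1], dtype=np.int32),               # 1 = foreground click
)

# Propagate that single click through the whole video.
player_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    player_masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()   # mask for obj_id 1
```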
This solves the "what" problem. Now we know where every object is in every frame. Next question: where are they in the real world?
SAM is incredible but slow. For real-time sports broadcasting, you need YOLO (You Only Look Once). Its nano-sized models run on phones, and YOLOv8 does detection + segmentation in milliseconds.
The pipeline: YOLO for real-time detection, SAM 2 for precision segmentation, DeepSORT for multi-object tracking. This combination works today.
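A sketch of the real-time leg of that pipeline with the Ultralytics package. One detail differs from the text: Ultralytics ships ByteTrack and BoT-SORT rather than DeepSORT, so ByteTrack stands in for the tracker here; the video path is a placeholder.

```python
from ultralytics import YOLO   # pip install ultralytics

# Nano-sized detection + segmentation model, fast enough for live video.
model = YOLO("yolov8n-seg.pt")

for result in model.track(source="hole7_broadcast.mp4",
                          stream=True, tracker="bytetrack.yaml"):
    if result.boxes.id is None:                  # no tracked objects in this frame
        continue
    track_ids = result.boxes.id.int().tolist()   # stable ID per player across frames
    boxes = result.boxes.xyxy.tolist()           # pixel bounding boxes
    # result.masks holds this frame's instance masks; hand the player's box/mask
    # to SAM 2 for refinement or straight to the geo-localization stage.
```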
The killer application: You have broadcast video of disc golf hole 7. A player throws from the tee. You want to know the exact GPS coordinates where the disc lands. How?
Homography is a mathematical transform that maps pixel coordinates to real-world coordinates, provided the points lie on a common plane (for a course, the ground). If you know where 4+ points are in both the image AND on a map, you can transform any pixel to GPS.
Broadcast Frame
Satellite Map
Click matching landmarks in both images. Tee pad, basket, course markers — any fixed point that appears in both views. After 4+ matches, click anywhere in the broadcast frame to see its GPS coordinate.
The secret sauce: you need to know where things actually are. For disc golf courses, this data exists: tee pads and baskets sit at fixed positions you can survey with GPS, course maps document every hole, and satellite imagery covers the entire property.
The workflow: Identify landmarks in the broadcast frame → match to known GPS coordinates → compute homography → transform any pixel to real-world position.
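A sketch of that workflow with OpenCV. The pixel positions, the local course coordinates (in metres), and the query pixel are all made-up values; converting local metres to GPS latitude/longitude is a final fixed offset and rotation that depends on the course survey.

```python
import cv2
import numpy as np

# 4+ landmarks clicked in the broadcast frame, with their known course positions.
pixels = np.array([[412, 655], [1510, 688], [980, 210], [260, 300]], dtype=np.float32)
course_xy = np.array([[0.0, 0.0],       # tee pad (local course coords, metres)
                      [8.0, 2.0],       # edge of the tee pad
                      [95.0, 110.0],    # basket
                      [-12.0, 60.0]],   # a fixed course marker
                     dtype=np.float32)

# With exactly 4 clean matches a direct fit works; add cv2.RANSAC for more, noisier matches.
H, _ = cv2.findHomography(pixels, course_xy)

def pixel_to_course(u, v):
    """Map any broadcast pixel to course coordinates (assumes a flat ground plane)."""
    pt = np.array([[[u, v]]], dtype=np.float32)
    return cv2.perspectiveTransform(pt, H)[0, 0]

print(pixel_to_course(1200, 480))   # e.g. where the disc landed, in metres from the tee
```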
Camera zoom changes the homography. Every time the camera operator adjusts zoom or pan, you need to recompute the transform. This is why auto-calibration from video is such an active research area.
We built this for the USDGC (US Disc Golf Championship). Hours of manual calibration per hole, but the results were incredible: real-time player positions on a course map, throw distances calculated automatically, landing zones highlighted for commentary.
The holy grail: Automatic camera calibration that works with any broadcast feed. No manual setup, no course-specific training. Just point the system at any disc golf video and get GPS coordinates out.
The computer vision community is getting close. Whoever cracks it first will own a capability every sports broadcaster wants.
The vision: A single broadcast camera feed goes in. Out comes a complete spatial understanding of the game.
The full pipeline: Video Input → Object Detection → Segmentation → Geo-Localization → Object Tracking → Analytics. Each component feeds the next.
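Wired together, the per-frame loop looks roughly like this. Each stage is injected as a callable so the components covered above (YOLO detection, SAM 2 masks, the homography mapping, a multi-object tracker) can slot in; the lambdas at the bottom are trivial stand-ins just so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpatialPipeline:
    detect: Callable      # frame -> detections
    segment: Callable     # (frame, detections) -> masks
    localize: Callable    # mask or box -> real-world course coordinates
    track: Callable       # (detections, positions) -> tracks with stable IDs

    def process_frame(self, frame):
        detections = self.detect(frame)                  # where are the objects?
        masks = self.segment(frame, detections)          # exact outlines
        positions = [self.localize(m) for m in masks]    # pixels -> course metres
        return self.track(detections, positions)         # stable IDs over time

# Trivial stand-ins so the skeleton executes.
pipeline = SpatialPipeline(
    detect=lambda frame: ["player", "disc"],
    segment=lambda frame, dets: dets,
    localize=lambda mask: (0.0, 0.0),
    track=lambda dets, pos: list(zip(dets, pos)),
)
print(pipeline.process_frame(frame=None))   # [('player', (0.0, 0.0)), ('disc', (0.0, 0.0))]
```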
Imagine watching disc golf with this system: Real-time distance measurements, 3D replays from any angle, automatic ace detection, wind visualization, course difficulty analysis.
Established sports have this already: Hawk-Eye ball tracking in tennis and cricket, Statcast in baseball, player-tracking systems in the NFL and NBA.
Disc golf has NONE of this. No real-time tracking. No 3D reconstruction. No automatic analytics. No AR overlays. The opportunity is massive.
The pieces exist. The research is converging.
Computer vision models can segment anything, estimate depth, track objects, and reconstruct 3D scenes from video. The hardware can run it in real-time. The broadcast infrastructure is ready.
The first team that assembles this pipeline for disc golf — or any sport — builds something no broadcaster can ignore.
And now you understand every piece.