Spatial AI for Sports Broadcasting
Cameras see flat images. The real world is 3D.
In 30 minutes you'll learn how AI reconstructs the world from video — and how this changes sports broadcasting forever.
Imagine watching disc golf on ESPN. A single camera follows a player as they wind up, release, and watch their disc sail across the course. That flat video contains a hidden 3D world — if you know how to extract it.
A photo has no depth. It's a 2D projection of a 3D world. When light hits your camera sensor, all depth information is lost. A mountain 10 miles away and a tree 10 feet away can land on the exact same pixel.
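Here's a minimal numerical sketch of that loss under a pinhole camera model. The focal length, principal point, and the two 3D points are made-up values, chosen so both points project to the same pixel.

```python
import numpy as np

# Pinhole projection: a 3D point (X, Y, Z) in camera coordinates lands at
# pixel (f*X/Z + cx, f*Y/Z + cy). Depth Z divides out, which is exactly
# where the information disappears.
f, cx, cy = 1000.0, 640.0, 360.0                  # made-up intrinsics (pixels)

def project(point_3d):
    X, Y, Z = point_3d
    return np.array([f * X / Z + cx, f * Y / Z + cy])

tree     = np.array([1.0, 0.5, 3.0])              # ~3 m from the camera
mountain = np.array([5000.0, 2500.0, 15000.0])    # ~15 km away, same direction

print(project(tree))       # [973.33 526.67]
print(project(mountain))   # [973.33 526.67] -> identical pixel, wildly different depth
```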
But your brain reconstructs 3D from two flat images — your eyes. Computers can do the same thing, and much more.
Modern AI can predict depth from a single image. Models like Depth Anything and MiDaS learned from millions of images to read visual cues: relative size, shadows, occlusion, perspective.
Red = close, Blue = far
The catch: Single-image depth is relative, not metric. You know the tree is closer than the mountain, but not by exactly how many meters. For broadcast sports, that's often enough.
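A minimal sketch of producing such a relative depth map, using the small MiDaS model via torch.hub. The model and transform names follow the intel-isl/MiDaS hub entry; the image path is a placeholder, and the output is unitless inverse-relative depth (larger values mean closer).

```python
import cv2
import torch

# Load the small MiDaS model and its matching preprocessing transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

# Read one broadcast frame (placeholder path) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("broadcast_frame.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Upsample the prediction back to the original frame resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()

# `depth` is relative, not metric: good for "closer/farther", not for metres.
```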
Click matching points in both images. With two views of the same scene, we can triangulate the real 3D position of any point. This is how COLMAP and other Structure-from-Motion (SfM) systems work.
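A sketch of that triangulation with OpenCV. The intrinsics, the two camera poses, and the clicked pixel pair are illustrative values; in a real SfM system the poses themselves also have to be estimated.

```python
import numpy as np
import cv2

# Shared intrinsics (illustrative) and two known camera poses one metre apart.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera 1 at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # camera 2, 1 m to the right

# One clicked correspondence: the same physical point seen in both frames.
pt1 = np.array([[973.33], [526.67]])   # pixel in view 1
pt2 = np.array([[640.00], [526.67]])   # pixel in view 2

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)   # homogeneous 4-vector
X = (X_h[:3] / X_h[3]).ravel()
print(X)   # ≈ [1.0, 0.5, 3.0] -> metres, in camera-1 coordinates
```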
The broadcast problem: Sports cameras follow the action — they don't orbit the scene. This gives you very few viewpoints. COLMAP might register only 4 frames out of 70. That's why broadcast 3D reconstruction is so hard.
You now understand the fundamental challenge. 2D cameras capture 3D worlds, but the depth is lost. AI can estimate it, geometry can triangulate it, but broadcast video gives us limited views.
This limitation drove the computer vision revolution of 2020-2024. Let's see what emerged.
NeRFs (Neural Radiance Fields) came first in 2020: train a neural network to represent a 3D scene. Beautiful results, but painfully slow to train and render.
Then came 3D Gaussian Splatting (2023). Same quality, 100x faster. It changed everything.
A gaussian splat is a fuzzy 3D blob. Each splat has a position in 3D space, a size, a rotation, a color, and an opacity. Imagine a semi-transparent colored ellipsoid floating in space.
Now imagine millions of them. Together, they can approximate any 3D scene.
Click to add colored gaussian blobs. Drag to move them. This is what 3D Gaussian Splatting does, but in 3D with millions of splats.
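In code, a scene is nothing more than big tensors of those parameters, one row per splat. This layout is illustrative; real implementations such as the original 3DGS codebase store colors as spherical harmonics and initialize positions from the SfM point cloud.

```python
import torch

num_splats = 1_000_000
means     = torch.randn(num_splats, 3)   # position in 3D space
scales    = torch.rand(num_splats, 3)    # ellipsoid size along each axis
rotations = torch.randn(num_splats, 4)   # orientation as a quaternion
colors    = torch.rand(num_splats, 3)    # RGB (real systems: spherical harmonics)
opacities = torch.rand(num_splats, 1)    # how see-through the blob is

splat_params = [means, scales, rotations, colors, opacities]
for p in splat_params:
    p.requires_grad_(True)               # every number is trainable
```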
Start with a point cloud from SfM (remember Chapter 1). Each point becomes a gaussian splat. Then optimize every splat's position, shape, color, and opacity until the rendered views match the training photos.
The loss function is simple: "Does my rendered image look like the real photo?" That's it. The magic is in the differentiable rendering — you can backpropagate through the entire splatting process.
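To make that loop concrete without a CUDA rasterizer, here is a self-contained 2D toy: the splats are flat gaussians blended additively onto a tiny image, and the target "photo" is random data. Real 3DGS projects the 3D tensors above and alpha-composites them with a fast rasterizer, but the photometric loss and the training loop have exactly this shape.

```python
import torch

H, W, N = 64, 64, 200
ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")

means     = (torch.rand(N, 2) * torch.tensor([H, W], dtype=torch.float32)).requires_grad_(True)
scales    = torch.full((N,), 4.0, requires_grad=True)   # splat radius in pixels
colors    = torch.rand(N, 3, requires_grad=True)
opacities = torch.rand(N, requires_grad=True)

def render():
    # Squared distance from every pixel to every splat centre: shape (N, H, W).
    d2 = (ys[None] - means[:, 0].reshape(-1, 1, 1)) ** 2 \
       + (xs[None] - means[:, 1].reshape(-1, 1, 1)) ** 2
    w = opacities.reshape(-1, 1, 1) * torch.exp(-d2 / (2 * scales.reshape(-1, 1, 1) ** 2))
    return torch.einsum("nhw,nc->hwc", w, colors).clamp(0, 1)   # additive colour blend

target = torch.rand(H, W, 3)                       # stand-in for one real training photo
optimizer = torch.optim.Adam([means, scales, colors, opacities], lr=0.05)

for step in range(500):
    loss = (render() - target).abs().mean()        # "does my render look like the photo?"
    loss.backward()                                # backprop through the whole renderer
    optimizer.step()
    optimizer.zero_grad()
```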
3D Gaussian Splatting needs many views of the same scene. Research papers use 100-200 photos taken from different angles around an object.
Broadcast video gives you maybe 5-10 useful views as the camera pans. This is an active research problem.
The latest breakthrough: VGGT (Meta, CVPR 2025 Best Paper). 64 frames → 54,000 3D points in 3.8 seconds. Real-time 3D reconstruction is almost here.
For disc golf, imagine this: Live 3D course reconstruction during the broadcast. Viewers could switch to any angle, see the disc's flight path in 3D, understand the course layout instantly.
Before you can track a disc golf player across the course, you need to separate them from the background. Segmentation is how AI cuts objects out of images — and it's the foundation of everything else.
For sports broadcasting, instance segmentation is key. You need to track individual players, not just "human-shaped regions."
Click on any object to segment it. SAM (Segment Anything Model) can isolate any object with a single click — no training required.
Meta's SAM changed everything in 2023. Before SAM, you needed to train a custom model for each type of object. Want to segment disc golf players? Train a player segmentation model. Want to segment discs? Train a disc segmentation model.
SAM segments ANYTHING. Zero-shot. One model, any object, high-quality masks.
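A sketch of that single-click workflow with the segment-anything package. The checkpoint filename matches Meta's released ViT-B weights; the image path and the click coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor   # pip install segment-anything

# Load SAM (ViT-B variant) and embed one broadcast frame.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
frame = cv2.cvtColor(cv2.imread("broadcast_frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# One click on the player is the entire "prompt".
click = np.array([[812, 430]])
masks, scores, _ = predictor.predict(
    point_coords=click,
    point_labels=np.array([1]),     # 1 = "this point is inside the object"
    multimask_output=True,
)
player_mask = masks[np.argmax(scores)]   # boolean HxW mask of the clicked object
```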
SAM was for static images. SAM 2 extends to video. Click an object in frame 1, and SAM 2 tracks that exact object through the entire video sequence.
For disc golf broadcasting: click the player once on the tee and SAM 2 carries that mask through the wind-up, release, and follow-through; click the disc and it follows the flight frame by frame.
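A sketch of that video workflow with the facebookresearch/sam2 package. The config and checkpoint names follow the repo's examples (the "small" model here), init_state expects a directory of extracted frames, and the paths and click coordinate are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_s.yaml",
                                       "checkpoints/sam2.1_hiera_small.pt")

# Load the clip (a directory of extracted frames), click the player once in frame 0.
state = predictor.init_state(video_path="hole7_frames/")
predictor.add_new_points_or_box(
    state, frame_idx=0, obj_id=1,
    points=np.array([[812, 430]], dtype=np.float32),   # one click on the player
    labels=np.array([1], dtype=np.int32),               # 1 = foreground click
)

# Propagate that single click through the whole video.
player_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    player_masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()   # mask for obj_id 1
```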
This solves the "what" problem. Now we know where every object is in every frame. Next question: where are they in the real world?
SAM is incredible but slow. For real-time sports broadcasting, you need YOLO (You Only Look Once). Its nano-sized models run on phones, and YOLOv8 does detection + segmentation in milliseconds.
The pipeline: YOLO for real-time detection, SAM 2 for precision segmentation, DeepSORT for multi-object tracking. This combination works today.
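A sketch of the real-time leg of that pipeline with the Ultralytics package. One detail differs from the text: Ultralytics ships ByteTrack and BoT-SORT rather than DeepSORT, so ByteTrack stands in for the tracker here; the video path is a placeholder.

```python
from ultralytics import YOLO   # pip install ultralytics

# Nano-sized detection + segmentation model, fast enough for live video.
model = YOLO("yolov8n-seg.pt")

for result in model.track(source="hole7_broadcast.mp4",
                          stream=True, tracker="bytetrack.yaml"):
    if result.boxes.id is None:                  # no tracked objects in this frame
        continue
    track_ids = result.boxes.id.int().tolist()   # stable ID per player across frames
    boxes = result.boxes.xyxy.tolist()           # pixel bounding boxes
    # result.masks holds this frame's instance masks; hand the player's box/mask
    # to SAM 2 for refinement or straight to the geo-localization stage.
```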
The killer application: You have broadcast video of disc golf hole 7. A player throws from the tee. You want to know the exact GPS coordinates where the disc lands. How?
Homography is a mathematical transform that maps pixel coordinates to real-world coordinates, provided the points lie on a common plane (for a course, the ground). If you know where 4+ points are in both the image AND on a map, you can transform any pixel to GPS.
Broadcast Frame
Satellite Map
Click matching landmarks in both images. Tee pad, basket, course markers — any fixed point that appears in both views. After 4+ matches, click anywhere in the broadcast frame to see its GPS coordinate.
The secret sauce: you need to know where things actually are. For disc golf courses, this data exists: tee pads and baskets sit at fixed positions you can survey with GPS, course maps document every hole, and satellite imagery covers the entire property.
The workflow: Identify landmarks in the broadcast frame → match to known GPS coordinates → compute homography → transform any pixel to real-world position.
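A sketch of that workflow with OpenCV. The pixel positions, the local course coordinates (in metres), and the query pixel are all made-up values; converting local metres to GPS latitude/longitude is a final fixed offset and rotation that depends on the course survey.

```python
import cv2
import numpy as np

# 4+ landmarks clicked in the broadcast frame, with their known course positions.
pixels = np.array([[412, 655], [1510, 688], [980, 210], [260, 300]], dtype=np.float32)
course_xy = np.array([[0.0, 0.0],       # tee pad (local course coords, metres)
                      [8.0, 2.0],       # edge of the tee pad
                      [95.0, 110.0],    # basket
                      [-12.0, 60.0]],   # a fixed course marker
                     dtype=np.float32)

# With exactly 4 clean matches a direct fit works; add cv2.RANSAC for more, noisier matches.
H, _ = cv2.findHomography(pixels, course_xy)

def pixel_to_course(u, v):
    """Map any broadcast pixel to course coordinates (assumes a flat ground plane)."""
    pt = np.array([[[u, v]]], dtype=np.float32)
    return cv2.perspectiveTransform(pt, H)[0, 0]

print(pixel_to_course(1200, 480))   # e.g. where the disc landed, in metres from the tee
```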
Camera zoom changes the homography. Every time the camera operator adjusts zoom or pan, you need to recompute the transform. This is why auto-calibration from video is such an active research area.
We built this for the USDGC (US Disc Golf Championship). Hours of manual calibration per hole, but the results were incredible: real-time player positions on a course map, throw distances calculated automatically, landing zones highlighted for commentary.
The holy grail: Automatic camera calibration that works with any broadcast feed. No manual setup, no course-specific training. Just point the system at any disc golf video and get GPS coordinates out.
The computer vision community is getting close. Whoever cracks it first will own a capability every sports broadcaster wants.
The vision: A single broadcast camera feed goes in. Out comes a complete spatial understanding of the game.
The full pipeline: Video Input → Object Detection → Segmentation → Geo-Localization → Object Tracking → Analytics. Each component feeds the next.
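Wired together, the per-frame loop looks roughly like this. Each stage is injected as a callable so the components covered above (YOLO detection, SAM 2 masks, the homography mapping, a multi-object tracker) can slot in; the lambdas at the bottom are trivial stand-ins just so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpatialPipeline:
    detect: Callable      # frame -> detections
    segment: Callable     # (frame, detections) -> masks
    localize: Callable    # mask or box -> real-world course coordinates
    track: Callable       # (detections, positions) -> tracks with stable IDs

    def process_frame(self, frame):
        detections = self.detect(frame)                  # where are the objects?
        masks = self.segment(frame, detections)          # exact outlines
        positions = [self.localize(m) for m in masks]    # pixels -> course metres
        return self.track(detections, positions)         # stable IDs over time

# Trivial stand-ins so the skeleton executes.
pipeline = SpatialPipeline(
    detect=lambda frame: ["player", "disc"],
    segment=lambda frame, dets: dets,
    localize=lambda mask: (0.0, 0.0),
    track=lambda dets, pos: list(zip(dets, pos)),
)
print(pipeline.process_frame(frame=None))   # [('player', (0.0, 0.0)), ('disc', (0.0, 0.0))]
```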
Imagine watching disc golf with this system: Real-time distance measurements, 3D replays from any angle, automatic ace detection, wind visualization, course difficulty analysis.
Established sports have this already: Hawk-Eye ball tracking in tennis and cricket, Statcast in baseball, player-tracking systems in the NFL and NBA.
Disc golf has NONE of this. No real-time tracking. No 3D reconstruction. No automatic analytics. No AR overlays. The opportunity is massive.
The pieces exist. The research is converging.
Computer vision models can segment anything, estimate depth, track objects, and reconstruct 3D scenes from video. The hardware can run it in real-time. The broadcast infrastructure is ready.
The first team that assembles this pipeline for disc golf — or any sport — builds something no broadcaster can ignore.
And now you understand every piece.