Real-Time AI Audio Description for Live Video
A project built for SWMxGemini Hackathon - Track 2: Real-Time Multimodal AI
The digital world has shifted from static, pre-recorded media to live broadcasting, but accessibility has completely failed to keep up.
- 2.2+ billion people worldwide live with a near or distance vision impairment.
- Traditional Audio Description (AD) relies entirely on human writers and voice actors. It is slow, expensive, and limited to pre-recorded media (e.g., Netflix).
- The Gap: Live streams, breaking news, user-generated content, and interactive broadcasts currently offer zero accessibility. The visually impaired community is locked out of the modern streaming era.
Third Ear is a fully automated, real-time audio description engine that requires zero human input. It watches live streams and instantly generates spoken descriptions of on-screen action.
Killer Feature: "Smart Injection"

If an AI talks constantly, it ruins the experience by talking over the native dialogue. Third Ear doesn't just watch; it listens. Our custom pipeline continuously monitors the stream's native audio track. Using neural Voice Activity Detection (VAD), the system waits for natural silences and pauses in the dialogue and seamlessly injects the AI-generated spoken description only during these gaps.
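To make the gap-hunting concrete, here is a minimal sketch of that idea, assuming the open-source Silero VAD model loaded through `torch.hub`; the actual model, thresholds, and helper names in the project may differ, and `find_silence_gaps` is purely illustrative:

```python
# Sketch: locate dialogue-free gaps in an audio chunk with a neural VAD.
# Assumes Silero VAD; the project's actual VAD model/thresholds may differ.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

SAMPLE_RATE = 16_000  # Silero VAD expects 8 or 16 kHz mono audio


def find_silence_gaps(audio: torch.Tensor, min_gap_s: float = 1.5):
    """Return (start_s, end_s) spans with no detected speech, long enough for a spoken description."""
    speech = get_speech_timestamps(audio, model, sampling_rate=SAMPLE_RATE)
    gaps, cursor = [], 0
    for seg in speech:
        if (seg["start"] - cursor) / SAMPLE_RATE >= min_gap_s:
            gaps.append((cursor / SAMPLE_RATE, seg["start"] / SAMPLE_RATE))
        cursor = seg["end"]
    if (len(audio) - cursor) / SAMPLE_RATE >= min_gap_s:
        gaps.append((cursor / SAMPLE_RATE, len(audio) / SAMPLE_RATE))
    return gaps
```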
Our low-latency pipeline is built on industry-standard real-time media processing tools and state-of-the-art multimodal AI.
- Fishjam: Handles the robust, low-latency real-time video and audio streaming infrastructure.
- Gemini Live API: Acts as our "visual brain," continuously analyzing incoming video frames to generate context-aware descriptions of the on-screen action.
- PyAV / FFmpeg (via PyAV): Intercepts the live stream to demux, transcode, and mix media packets directly in Python memory (see the sketch after this list).
- PyTorch & Torchaudio: Power the high-speed audio transformations in the processing backend.
- VAD: Analyzes the stream's audio track within milliseconds to identify silence gaps for the description overlay.
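As a rough illustration of the PyAV piece, the sketch below decodes a stream's audio into in-memory mono 16 kHz samples ready for VAD. The source URL and resampling parameters are placeholders, not the project's actual configuration, and it assumes a recent PyAV where `AudioResampler.resample` returns a list of frames:

```python
# Sketch: pull a stream's audio into memory with PyAV for the VAD stage.
# The URL and target format are placeholders chosen for illustration.
import av
import numpy as np
import torch

container = av.open("https://example.com/live/stream.m3u8")  # hypothetical source
resampler = av.AudioResampler(format="s16", layout="mono", rate=16_000)

buffer = []
for frame in container.decode(audio=0):          # decode only the audio stream
    for resampled in resampler.resample(frame):  # mono 16 kHz, as the VAD expects
        samples = resampled.to_ndarray().astype(np.float32) / 32768.0
        buffer.append(torch.from_numpy(samples.reshape(-1)))
# torch.cat(buffer) would then feed a rolling VAD window
```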
- Ingest: Fishjam receives the live broadcast stream.
- Fork: The stream is split. Video frames are sent to the Gemini Live API; audio is routed to our local PyAV processor.
- Analyze: Gemini generates text descriptions of visual events. Simultaneously, Torchaudio/VAD analyzes the audio for human speech.
- Inject: When Gemini outputs a description AND the VAD detects a pause in dialogue, the system triggers Text-to-Speech (TTS) (see the sketch after this list).
- Broadcast: PyAV mixes the generated AI voice back into the audio track and sends it to Fishjam to stream to the end user.
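To ground the Inject step, here is a hedged sketch of that decision: speak only when a description is pending and a detected gap is long enough to hold it. The Cloud Text-to-Speech calls are the library's standard API; `maybe_inject` and the per-word duration estimate are illustrative, not the project's actual logic:

```python
# Sketch of the Inject step: pair a pending Gemini description with a silence
# gap and synthesize it with Cloud Text-to-Speech. Helper names are illustrative.
from google.cloud import texttospeech

tts = texttospeech.TextToSpeechClient()


def synthesize(description: str) -> bytes:
    """Render a description as 16 kHz LINEAR16 audio."""
    response = tts.synthesize_speech(
        input=texttospeech.SynthesisInput(text=description),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=16_000,
        ),
    )
    return response.audio_content


def maybe_inject(description: str | None, gaps, s_per_word: float = 0.4):
    """Return (gap, audio) if the narration fits an upcoming gap, else None."""
    if not description or not gaps:
        return None
    est_len_s = len(description.split()) * s_per_word  # crude duration estimate
    for start_s, end_s in gaps:
        if end_s - start_s >= est_len_s:
            return (start_s, end_s), synthesize(description)
    return None
```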
- Python 3.11 (exact version — see `.python-version`)
- uv package manager (recommended) or pip
- FFmpeg installed and available on `PATH` (required by PyAV / pydub)
- Node.js ≥ 18 (for the frontend)
- Google Cloud credentials with access to:
  - Vertex AI (Gemini API)
  - Cloud Text-to-Speech API
```bash
# Clone the repo
git clone https://github.com/RETIOM/hackathon-SWMxGemini.git
cd hackathon-SWMxGemini

# Create venv & install Python deps
uv sync

# Install frontend deps
cd frontend && npm install && cd ..
```

| Variable | Description |
|---|---|
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to your GCP service account JSON key |
| `GOOGLE_CLOUD_PROJECT` | Your GCP project ID (defaults to `swmxgemini`) |
The FastAPI server accepts video uploads and streams back described segments via Server-Sent Events (SSE):
```bash
# Start the dev server on port 8000 (with hot reload)
uv run uvicorn src.server:app --reload --port 8000
```

The server exposes:

- `POST /api/process` — upload an MP4 file; returns an SSE stream of base64-encoded described video segments with AI-generated narration text.
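For a quick smoke test without the frontend, a minimal Python client might look like the sketch below; the upload form-field name and the exact SSE payload layout are assumptions based on the description above, so adjust them to the actual server code:

```python
# Sketch: upload an MP4 and read the SSE response line by line.
# The "file" form-field name and the event payload layout are assumptions.
import requests

with open("input_video.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/process",
        files={"file": ("input_video.mp4", f, "video/mp4")},
        stream=True,
    )
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("data:"):
        payload = line[len("data:"):].strip()
        print(payload[:80], "...")  # base64 segment / narration text, per the API above
```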
```bash
cd frontend
npm run dev
```

Open http://localhost:5173, drag & drop an MP4 file, and watch the AI-described version stream back in real time.
Process a video file directly from the command line without the web UI:
```bash
# Basic usage — outputs segments to ./pipeline_output/
uv run python src/pipeline.py input_video.mp4

# Custom options
uv run python src/pipeline.py input_video.mp4 \
    --project-id my-gcp-project \
    --output-dir ./output \
    --output-mode single_file \
    --chunk-duration 10.0 \
    --realtime
```

| Flag | Default | Description |
|---|---|---|
| `--project-id` | `swmxgemini` | GCP project ID for Vertex AI and TTS |
| `--output-dir` | `pipeline_output` | Directory for output MP4 files |
| `--output-mode` | `segmented` | `segmented` (one file per chunk) or `single_file` (one concatenated output) |
| `--chunk-duration` | `10.0` | Duration of each processing chunk in seconds |
| `--realtime` | off | Simulate real-time ingestion speed |
```bash
# Run all tests (42 tests, no API keys needed)
uv run pytest tests/ -v

# Run only the sanity/unit tests
uv run pytest tests/test_sanity.py -v

# Run narrator unit tests (mocked API clients)
uv run pytest src/narrator/test_narrator.py -v
```

This project was built with Track 2's "Multi-user scenarios" in mind. Our next steps include:
- Multi-User Personalization: Utilizing Fishjam's routing to send personalized audio streams to different users in the same room (e.g., basic action cues vs. highly detailed cinematic descriptions).
- Global Reach: Real-time translation of generated audio descriptions into multiple languages on the fly.
- Broad Application: Expanding beyond entertainment to live educational webinars, e-sports, and breaking news broadcasts.