Making Opinions Count Again - RevVision, a case study

I might own the next company that buys your current favorite company or I might work for your fav company. Time will tell. PS: I wrote the above before the advent of AI 😂

It is the not-so-distant future, in a not-too-unfamiliar world. There is an app for everything and you own cool gadgets: a pendant detects your mood and fine-tunes your playlist while you slurp on a slushie in your self-driving Tesla. It is extra cold outside but you don't feel it; the Eight Sleep custom chair in the car has adjusted to the optimal temperature for you to sleep. But today you just cannot sleep. You are a bit annoyed about something, and you would rather go to the nearest cinema; maybe a movie will take your mind off things.

You get into the cinema lobby. There are movie posters everywhere. You pause... "what to watch, what to waaaatch"... The next logical thing is to check the internet: IMDB, Rotten Tomatoes, TMDB, somewhere you can find what movie aficionados think about the movie you are getting keen on. Your Meta Ray-Ban glasses light up, you get to IMDB via a web search, scroll to the reviews section and start to read: long missives of opinions, some positive, some from people who found the movie meh. It is a lot to take in, you are not exactly in the mood for this, and a line is starting to form... if only this process was better...

RevVision

In comes RevVision, a prototype AI agent that shows a glimpse of what the future holds for opinions. The idea is that someday you will be able to use your high-tech glasses to quickly get a feel for what people think before you make a decision.

Right now it is a web app: you point your camera at a movie poster or share your screen while watching, and an AI agent identifies the movie in real time, reads out reviews, and shows ratings from IMDB, with the possibility of other sources in the future. It is built with Stream's Vision Agents SDK and the Google Gemini Live API.

If you followed the story, you can quickly identify the problem: there is simply no easy way to skim through reviews. RevApp already has the functionality; this AI agent just connects to it and brings a spark to reviews again.

How it works

The Pipeline: From Video Frame to Spoken Review

There are four systems working together in real-time:

  1. Your browser captures video (camera or screen share) and sends it via WebRTC

  2. Stream's Edge Network relays the video to our agent at ultra-low latency

  3. Google Gemini watches the video at 5 frames per second and identifies movies

  4. When it spots a movie, it calls our lookup_movie() tool which hits the RevApp API

Here's the flow in detail:

STEP 1: Video Capture (Browser → Stream Edge)

The frontend is a React app using the Stream Video SDK. When you click "Start RevVision", the app creates a WebRTC video call. Your camera feed or screen share is transmitted through Stream's edge network, which is the same infrastructure that powers video calls for millions of users. This gives us sub-200ms latency.

STEP 2: Video Analysis (Stream Edge → Gemini)

On the server, we run a Vision Agent, a Python process using Stream's Vision Agents SDK. This agent joins the same video call as the user. It receives video frames at 5 fps and pipes them directly into Google Gemini's Live API (bidiGenerateContent). Gemini processes these frames in real time, watching for anything that looks like a movie, e.g., posters, DVD covers, streaming thumbnails, title cards.

STEP 3: Tool Call (Gemini → RevApp API)

When Gemini recognises a movie, it calls our registered lookup_movie() function. This is a real function call that Gemini executes as a tool, not just text generation. The function runs a five-step process:

a. Search TMDB for the movie title

b. Obtain external IDs, particularly the IMDB ID

c. Retrieve IMDB reviews using the RevApp API

d. Create an AI-generated review summary

e. Acquire the IMDB rating
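
The five steps above can be sketched as a small orchestration function. This is only an illustration of the control flow, not the actual lookup_movie() from main.py: the helper callables (search_tmdb, get_imdb_id, fetch_reviews, summarize, get_rating) and the MovieResult type are hypothetical names, and the real function talks to the RevApp API rather than to injected functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MovieResult:
    """Hypothetical container for what the tool returns."""
    title: str
    imdb_id: str
    rating: Optional[float]
    summary: str
    reviews: list[str]


def lookup_movie_sketch(
    title: str,
    search_tmdb: Callable[[str], Optional[int]],    # a. title -> TMDB id
    get_imdb_id: Callable[[int], Optional[str]],    # b. TMDB id -> IMDB id
    fetch_reviews: Callable[[str], list[str]],      # c. IMDB id -> review texts
    summarize: Callable[[list[str]], str],          # d. reviews -> AI summary
    get_rating: Callable[[str], Optional[float]],   # e. IMDB id -> rating
) -> Optional[MovieResult]:
    """Run the five steps in order; return None if the movie cannot be resolved."""
    tmdb_id = search_tmdb(title)
    if tmdb_id is None:
        return None
    imdb_id = get_imdb_id(tmdb_id)
    if imdb_id is None:
        return None
    reviews = fetch_reviews(imdb_id)
    # Skip the summary call when there is nothing to summarise.
    summary = summarize(reviews) if reviews else "No reviews found."
    return MovieResult(title, imdb_id, get_rating(imdb_id), summary, reviews)
```

Injecting the helpers keeps the sketch runnable without network access; stubbing them with lambdas is enough to exercise both the success path and the "movie not found" path.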

STEP 4: Result Delivery (Agent → Gemini and Frontend)

The results are sent to two places at once: back to Gemini as text, so it can read the review summary aloud using ElevenLabs TTS, and to the frontend as a custom event. Through Stream's event system, the agent delivers structured movie data, including the poster, rating, overview, reviews, and AI summary. The React app displays this as a card in the sidebar.

The Agent (main.py)

The entire agent is roughly 250 lines of Python. The core is simple:

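# Import paths follow the Vision Agents SDK examples; adjust to your installed version.
from vision_agents.core import Agent, User
from vision_agents.plugins import elevenlabs, gemini, getstream

# Realtime Gemini session that samples the call's video at 5 frames per second.
llm = gemini.Realtime(model="gemini-2.5-flash-native-audio-preview-12-2025", fps=5)
# Expose lookup_movie (defined elsewhere in main.py) to Gemini as a callable tool.
llm.function_registry.register()(lookup_movie)
# Stream's edge network carries the WebRTC call.
edge = getstream.Edge()

agent = Agent(
    edge=edge,
    agent_user=User(name="RevVision", id="revvision-agent"),
    instructions=AGENT_INSTRUCTIONS,  # system prompt, defined elsewhere in main.py
    llm=llm,
    tts=elevenlabs.TTS(),
)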

That's it. The Vision Agents SDK handles all the WebRTC plumbing, frame extraction, and Gemini session management. We just define WHAT the agent should do (watch for movies, call lookup_movie) and the SDK handles the HOW.

The Frontend (App.jsx)

The React frontend does three things:

  1. Manages the WebRTC video call via Stream Video SDK

  2. Listens for custom events from the agent (movie_detected, movie_searching)

  3. Renders MovieCard components in a sidebar as movies are identified

When the agent sends a movie_detected event, the card immediately appears with the poster (from TMDB), the IMDB rating, a plot summary, the AI review summary, and a sentiment badge (positive/negative/mixed).

CHALLENGES AND SOLUTIONS

Challenge 1: The Gemini Model Nightmare

This was the hardest problem. Not every Gemini model works with the Live API, and even fewer support tool calling. We tested five models:

Model: gemini-2.5-flash-native-audio-latest
Problem: Crashes with 1008 WebSocket Policy Violation the moment it tries to call a tool. This is a known bug: the model cannot handle function calling in bidiGenerateContent mode. We wasted hours on this before finding the GitHub issue.

Model: gemini-2.0-flash-exp
Problem: Returns 1011 Service Unavailable. The model was deprecated while we were building.

Model: gemini-2.5-flash
Problem: Does not support bidiGenerateContent at all. It is a text/vision model only, not a Live API model.

Model: gemini-2.5-flash-native-audio-preview-09-2025
Problem: The most deceptive one. It joins calls, receives video, and speaks responses, but NEVER actually calls tools. Instead of executing lookup_movie(), it DESCRIBES what it would do: "I can see The Dark Knight! It's a thriller directed by Christopher Nolan..." All from its own training data. It never once triggered the function call. Even with extremely forceful prompt instructions ("NEVER talk about tools, just call them"), the Sep 2025 model just... wouldn't.

Model: gemini-2.5-flash-native-audio-preview-12-2025 ✅
Finally found the one that works. The Dec 2025 native audio preview executes tool calls AND speaks the results. This was the breakthrough: as soon as we switched, it started calling lookup_movie() and returning real data from RevApp.

Lesson: if you're building with Gemini's Live API and need function calling, the model version matters enormously. Of the five models we tested, the Dec 2025 preview was the only one that reliably executed tools.

Challenge 2: WebRTC Track Handshake Timeouts

After solving the model issue, we hit another wall: the agent could not receive the user's video or audio. The server logs showed:

TimeoutError: Timeout waiting for pending track: AUDIO from user guest-39a51a34...
Waited 10.0s but WebRTC track_added with matching kind was never received.

The root cause: the Vision Agents SDK maintains a "track map", a registry of which WebRTC tracks belong to which participants. When a user disconnects and reconnects, the track map keeps entries from the old session. The new user publishes tracks, but the SDK is looking for different track IDs from the stale map. So it waits 10 seconds and times out.

The result: Gemini NEVER receives the user's video. It responds from its own knowledge, which is why it talked about movies without actually seeing them.

Our fix: restart the agent before each new user session. When you click "Start RevVision", the frontend:

  1. POSTs to /api/restart-agent (which restarts the Docker container)

  2. Waits for a new call-config.json with a fresh call ID

  3. Joins the new, clean call

This guarantees a fresh WebRTC track map every time. The button shows "Preparing agent..." while this happens.
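
The "wait for a fresh call ID" step boils down to polling until the published ID changes or a deadline passes. The real frontend is a React app, so the Python below is only a sketch of the logic: wait_for_new_call_id and its parameters are hypothetical names, and because the post does not show the layout of call-config.json, reading it is left to an injected read_call_id callable.

```python
import time
from typing import Callable, Optional


def wait_for_new_call_id(
    read_call_id: Callable[[], Optional[str]],   # e.g. fetch and parse call-config.json
    previous_call_id: Optional[str],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> str:
    """Poll until the agent publishes a call ID different from the previous one."""
    deadline = time.monotonic() + timeout_s
    while True:
        call_id = read_call_id()
        # A missing or unchanged ID means the restarted agent is not ready yet.
        if call_id and call_id != previous_call_id:
            return call_id
        if time.monotonic() >= deadline:
            raise TimeoutError("agent did not publish a new call ID in time")
        sleep(interval_s)
```

Comparing against the previous ID, rather than just checking that the file exists, is what prevents the client from rejoining the stale call while the container is still restarting.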

Challenge 3: Getting Gemini to Actually Call Tools (Prompt Engineering)

Even with the right model, getting Gemini to USE tools instead of talking about them was difficult.

The system prompt was rewritten to be extremely explicit:

"CRITICAL RULE: When you identify a movie, you MUST execute the lookup_movie function call.
 Do NOT just talk about it.
 Do NOT say 'I can look that up.'
 NEVER talk about what a tool does. Just call it.
 NEVER say 'let me use my tool' or 'I can look that up.' Just DO IT."

With the Dec 2025 model, this forceful prompt works. With the Sep 2025 model, even this didn't help, which points to a model-level limitation rather than a prompting issue.

Challenge 4: Agent Crash Recovery

Gemini's Live API sessions crash. A lot. Reasons include:

  • 1008 Policy Violation on tool calls (wrong model)

  • 1011 Service Unavailable (API outages)

  • WebSocket disconnections (network issues)

  • Session timeouts (idle too long)

When the agent crashes, the user has no idea. The video feed keeps showing, the UI looks normal, but the AI is dead.

We built a crash detection and recovery system:

HEARTBEAT: The agent writes to status.json every 10 seconds via a background asyncio task.
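
A minimal sketch of such a heartbeat, assuming status.json holds a JSON object with a timestamp field. The field names and the run_with_heartbeat helper are illustrative and not taken from main.py.

```python
import asyncio
import contextlib
import json
import os
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def heartbeat(path: str, interval_s: float = 10.0) -> None:
    """Rewrite `path` with the current time until the task is cancelled."""
    while True:
        tmp = f"{path}.tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"status": "online", "timestamp": time.time()}, f)
        # Atomic rename so a reader never sees a half-written file.
        os.replace(tmp, path)
        await asyncio.sleep(interval_s)


async def run_with_heartbeat(
    path: str, work: Callable[[], Awaitable[T]], interval_s: float = 10.0
) -> T:
    """Run `work()` while a background task keeps the heartbeat file fresh."""
    task = asyncio.create_task(heartbeat(path, interval_s))
    try:
        return await work()
    finally:
        # Stop the heartbeat when the agent exits or crashes, so the file goes stale.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Because the heartbeat stops as soon as the wrapped coroutine returns or raises, a crashed session leaves an ageing timestamp behind, which is what any staleness check relies on.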

CRASH DETECTION: The frontend polls status.json every 5 seconds. If the timestamp is more than 30 seconds stale, the UI shows a red "Agent Offline" indicator.
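
The staleness rule itself is a one-line comparison. The frontend does this in the browser; the Python sketch below shows the same check under the assumption that status.json carries a Unix timestamp in a timestamp field (agent_is_offline is a hypothetical name).

```python
import json
import time
from typing import Optional

STALE_AFTER_S = 30.0  # threshold quoted above


def agent_is_offline(
    status_text: str,
    now: Optional[float] = None,
    stale_after_s: float = STALE_AFTER_S,
) -> bool:
    """Return True when the heartbeat is missing, unreadable, or too old."""
    now = time.time() if now is None else now
    try:
        timestamp = float(json.loads(status_text)["timestamp"])
    except (ValueError, KeyError, TypeError):
        # A heartbeat we cannot parse is treated the same as no heartbeat.
        return True
    return now - timestamp > stale_after_s
```

With a 10-second heartbeat, a 30-second threshold, and a 5-second poll, a crash is flagged roughly 20 to 35 seconds after it happens, ignoring network delay and clock differences between browser and server.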

RESTART API: A separate sidecar Docker container (restart-api.py) runs a tiny HTTP server that talks to the Docker Engine API via the unix socket. When the frontend POSTs to /api/restart-agent, the sidecar calls the Docker API to restart the agent container.
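
For readers who have not called the Docker Engine API directly, here is a rough sketch of the restart call using only the standard library. It is not the contents of restart-api.py: the class and function names are made up for illustration, and the socket path is the usual default, which may differ on your host. The endpoint itself, POST /containers/{id}/restart, is part of the Docker Engine API and answers 204 on success.

```python
import http.client
import socket
from urllib.parse import quote


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 30.0):
        # The host name is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def restart_path(container: str, grace_s: int = 5) -> str:
    """Build the Engine API path; `t` is the seconds to wait before killing."""
    return f"/containers/{quote(container, safe='')}/restart?t={grace_s}"


def restart_container(container: str, socket_path: str = "/var/run/docker.sock") -> bool:
    """Ask the Docker daemon to restart `container`; True on HTTP 204."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("POST", restart_path(container))
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status == 204
    finally:
        conn.close()
```

A sidecar built this way would call restart_container("revvision-agent") from its HTTP handler. Mounting the Docker socket into a container grants it broad control over the host, so the endpoint that triggers it should not be left open to arbitrary callers.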

RESTART BUTTON: When the agent is detected as offline, a "Restart Agent" button appears in the status panel. Click it, wait 15 seconds, and the agent comes back on a fresh call. The page auto-reloads when it detects the new call ID.

Challenge 5: Stream Custom Event Size Limit

Stream's custom events have a ~5KB payload limit. Our full review data (including 20+ review texts) easily exceeds this. We solved it by slicing the review list to max 5 items:

"reviews": {"reviews": review_list[:5]}

The AI summary and sentiment are always sent in full since they are compact. The 5 reviews that do get sent are enough for the frontend review badge ("5 reviews").
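
A slightly more defensive variant of that slice checks the encoded size as well, since five long reviews can still exceed the limit. This is a sketch under assumptions: only the nested "reviews" key mirrors the snippet above, while the other field names, the exact 5 KB budget, and the build_movie_event helper are hypothetical.

```python
import json

MAX_EVENT_BYTES = 5 * 1024  # approximate limit quoted above
MAX_REVIEWS = 5


def build_movie_event(
    movie: dict,
    reviews: list[str],
    summary: str,
    sentiment: str,
    max_reviews: int = MAX_REVIEWS,
    max_bytes: int = MAX_EVENT_BYTES,
) -> dict:
    """Cap the review count, then drop reviews until the JSON payload fits."""
    kept = list(reviews[:max_reviews])
    while True:
        event = {
            "type": "movie_detected",
            "movie": movie,
            "reviews": {"reviews": kept, "summary": summary, "sentiment": sentiment},
        }
        size = len(json.dumps(event, ensure_ascii=False).encode("utf-8"))
        # If nothing is left to drop, return the event as-is; the caller
        # would then need to shorten the movie data or summary instead.
        if size <= max_bytes or not kept:
            return event
        kept.pop()
```

Measuring the UTF-8 byte length rather than the character count matters for reviews with non-ASCII text, where one character can take several bytes.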

DEPLOYMENT

The whole system runs on a single DigitalOcean droplet:

  • Docker container: revvision-agent (the AI agent)

  • Docker container: revvision-restart-api (crash recovery sidecar)

  • Nginx: serves the React frontend + proxies the restart API

  • SSL: Let's Encrypt via Certbot

The frontend is a static Vite build served by nginx at revvision.revappai.com. The agent restarts automatically via Docker's "unless-stopped" policy, plus the manual restart button for immediate recovery.

TECH STACK SUMMARY

Agent: Python + Vision Agents SDK + Gemini 2.5 Flash (Dec 2025 Native Audio) + ElevenLabs TTS

API: RevApp (Node.js/Express) for TMDB search, IMDB reviews, AI summaries

Frontend: React + Stream Video SDK + Vite

Deployment: Docker Compose + Nginx + DigitalOcean + Let's Encrypt

How to Demo (step by step):

  1. Go to revvision.revappai.com

  2. Allow camera & microphone permissions when prompted

  3. Click 🎬 Start RevVision and wait for "Preparing agent..." to finish (~15s)

  4. Point your camera at a movie poster or share your screen showing a movie

  5. The AI identifies the movie, speaks the review aloud, and a MovieCard appears in the sidebar with poster, IMDB rating, and AI review summary

  6. Try multiple movies; each one gets its own card

  7. If the agent goes offline, click the 🔄 Restart Agent button in the status panel
