We produced 2,000 videos from one streamer's VODs. Here's what we learned.

What we did

We ran approximately 25 VODs from a single GTA RP streamer through our full pipeline. The system produced 2,062 videos: highlights, compilations, and Shorts. The 145 that reached final review were each judged by our 7-agent review panel.

The numbers

Metric                               Value
Source VODs processed                ~25
Total videos produced                2,062
Passed final review                  103
Killed by review panel               34
Sent back for improvement            8
Kill rate (of reviewed content)      23%
Total API cost                       $52
Cost per video                       $0.025
Review agents per video              7

The 7-agent review panel: Brand Guardian, First Impression, Audio Quality, Pacing, Title Match, Completion Predictor, and Distinctiveness. Every video gets scored by all seven before it can pass or fail.
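
Here's a minimal sketch of how that gate works. The agent names are real; the 1-10 scale, the thresholds, and the verdict logic below are illustrative stand-ins, not our production code.

```python
from dataclasses import dataclass

AGENTS = [
    "Brand Guardian", "First Impression", "Audio Quality", "Pacing",
    "Title Match", "Completion Predictor", "Distinctiveness",
]

@dataclass
class Review:
    agent: str
    score: int   # illustrative scale: 1 (kill it) .. 10 (ship it)
    reason: str

def panel_verdict(reviews: list[Review]) -> str:
    """Aggregate all seven agent scores into passed / sent_back / killed."""
    if any(r.score <= 2 for r in reviews):
        return "killed"      # one hard failure is enough
    if any(r.score <= 5 for r in reviews):
        return "sent_back"   # fixable: regenerate and re-review
    return "passed"

# Mirrors the Brand Guardian kill quoted below: one 1/10 ends the video.
print(panel_verdict([Review("Brand Guardian", 1, "completely misrepresentative")]))
# -> "killed"
```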

What surprised us

Our AI killed 23% of its own output

Nearly a quarter of the videos that made it to review got rejected. Not by us. By the review agents. The most common verdicts from Brand Guardian were blunt: "Brand Guardian rates this 1/10 -- completely misrepresentative." Another frequent kill reason: "Off-brand generic stream chatter." The system produced content, then decided that content wasn't good enough to ship.

This is the part that convinced us the review layer is non-negotiable. Without it, all 145 of those reviewed videos would have gone to YouTube. A 23% error rate on a public channel is not survivable.

The AI hallucinated a title about a drug bust

This was the worst failure we caught. The pipeline generated a title about a dramatic drug dealer bust -- complete with police chase language and arrest references. The actual video content was the streamer chatting about Walmart.

Not close. Not a stretch. Completely fabricated.

The Title Match agent flagged it. That agent exists specifically because LLMs will confidently generate plausible-sounding titles that have nothing to do with the actual content. In GTA RP, this is especially dangerous because the game's subject matter (crime, police, drugs) gives the AI a rich vocabulary of dramatic events to hallucinate.
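
The check itself is conceptually simple: score the title against the transcript and fail anything below a cutoff. Here's a sketch assuming an OpenAI-style chat client; the prompt wording, model choice, and the 7/10 threshold are illustrative assumptions, not our exact agent.

```python
from openai import OpenAI

client = OpenAI()

def title_matches(title: str, transcript: str) -> bool:
    """Ask a model to score title accuracy against what was actually said."""
    prompt = (
        "Rate from 1 to 10 how accurately this title describes the "
        "transcript. Reply with only the number.\n\n"
        f"Title: {title}\n\nTranscript:\n{transcript[:8000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip()) >= 7  # assumed cutoff

# The Walmart video fails this check: a drug-bust title scores near 1
# against a transcript of casual shopping chatter.
```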

We ended up fixing 43 of 49 uploaded titles because the "SEO-optimized" versions the AI generated were worse than the originals. The AI consistently over-dramatized, injected keywords that didn't match the content, or invented narrative arcs that never happened. We stopped letting the AI optimize titles and started letting it describe what actually happened. Results improved immediately.
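
The fix amounted to a prompt change. These two prompts are illustrative reconstructions of the before and after, not our exact production wording:

```python
# Before: asks the model to optimize, which invites invention.
SEO_PROMPT = "Write a viral, SEO-optimized YouTube title for this clip."

# After: asks the model to describe, constrained to the transcript.
DESCRIPTIVE_PROMPT = (
    "Write a title stating what literally happens in this clip. "
    "Mention nothing that is not in the transcript. No invented drama."
)
```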

Shorts flooded the review queue

We initially configured Shorts extraction at 25 clips per VOD. The math seemed reasonable. It was not. From ~25 VODs, the system produced 1,131 Shorts candidates. The review queue became unusable. Every meaningful highlight was buried under hundreds of mediocre 30-second clips of the streamer walking between buildings or sitting in a car saying nothing.

We cut the extraction limit to 8 per VOD. The quality of the candidates went up because the system was forced to be selective. The review queue became manageable. More clips is not better clips.
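
In practice this was a one-number change plus a ranking step. A sketch with illustrative names; the candidate structure and score field are assumptions:

```python
MAX_SHORTS_PER_VOD = 8  # was 25; the cap forces the extractor to be selective

def select_shorts(candidates: list[dict], limit: int = MAX_SHORTS_PER_VOD) -> list[dict]:
    """Rank candidates and keep only the best, not everything above a floor."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:limit]
```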

What we'd do differently

Start with fewer Shorts. We should have started at 5 per VOD and worked up, not started at 25 and worked down. The flood created hours of wasted review time and masked real quality signals under noise.

Never trust AI-generated titles without validation. We knew LLMs hallucinate. We built a Title Match agent specifically for this. But we initially made it optional. That was a mistake born from optimism. Title hallucination in RP content is not an edge case. It is the default behavior. The AI sees GTA RP and immediately starts writing crime drama titles regardless of what's actually happening in the video.

Kill rate is a feature, not a bug. Our first instinct when we saw 23% rejection was to worry about waste. We were wrong. A 23% kill rate means the quality gate is working. If everything passes, the bar is too low. We'd rather produce 2,000 candidates and ship 103 good ones than produce 200 candidates and ship 180 mediocre ones.

Cost is not the bottleneck. $52 for 2,062 videos is $0.025 per video. Two and a half cents. The bottleneck is review quality, not API cost. We could run the pipeline ten times over and still spend less than a single freelance editor charges for one video. The money conversation is over. The quality conversation is where all the hard problems live.

The total cost breakdown: $52 across all API calls -- generation, transcription, review, title matching, and metadata. That covers the full lifecycle of 2,062 videos from raw VOD to reviewed output. At scale, the API cost is rounding error. The real cost is the engineering time to build a review layer that actually catches hallucinations before they reach the public.

The takeaway

Generating content with AI is the easy part. Any pipeline can produce volume. The hard part is building a system that knows when its own output is bad, and has the authority to kill it.

We spent more engineering time on the review panel than on the content generation pipeline. That ratio was not an accident. It was the lesson from processing 2,062 videos: the generation layer is a commodity. The quality gate is the product.

See what your VODs look like after 7 agents review them.
