Twelve Labs Raises $100M for Video AI Search

For all the noise around AI that generates video — the Soras and Veos churning out clips from a text prompt — a quieter, arguably more consequential problem has gone underfunded: how do you actually understand the video that already exists? The world’s footage is piling up faster than anyone can watch it, let alone catalogue it. A new financing round suggests investors are finally treating that as a category of its own.

Twelve Labs, a startup building foundation models that read and search video, has raised fresh capital to push that idea forward — and the identity of its backers says as much as the number itself.

The raise

Twelve Labs raised $100 million from Amazon, NEA and Naver Ventures to advance AI that searches and analyses large video libraries, turning extensive footage into queryable data, according to The Neuron, citing Bloomberg (July 1 2026).

The cap table is the interesting part. This isn’t a syndicate of generalist funds chasing an AI logo. You have a hyperscaler in Amazon, whose cloud infrastructure and media ambitions (Prime Video, Twitch, and a sprawling advertising business) all run on video that needs to be indexed, moderated and monetised. You have NEA, one of the deeper-pocketed venture institutions with the patience for infrastructure bets. And you have Naver Ventures — the investment arm tied to the Korean tech giant behind a search-and-content ecosystem that spans web, media and increasingly AI.

Read together, the backers describe the market Twelve Labs wants to own: enterprise-grade video understanding, sold to the platforms and clouds that sit on the largest footage archives in the world. Strategic money from a hyperscaler and a search-native conglomerate is a signal that video understanding is being treated as a platform layer, not a feature.

Why video understanding matters

The round reflects rising demand for video understanding — not just generation — as video becomes the dominant online format, with applications in search, moderation, analytics and highlights, per The Neuron (July 2 2026).

Consider the asymmetry. We have become extraordinarily good at producing video and remarkably bad at finding anything inside it. A broadcaster with decades of archive footage can’t easily answer “show me every clip where a specific product appears on screen.” A social platform can’t reliably tell whether a livestream is violating policy in real time. A brand can’t measure how many seconds its logo was visible across a season of sponsored content. The video is there; the meaning is locked inside it.

Understanding models attack this differently from generation models. Instead of turning text into pixels, they turn pixels — plus audio, on-screen text, motion and context — into structured, searchable data. That unlocks a cluster of use cases that all share one requirement: the machine has to actually comprehend what’s happening on screen.

Search: natural-language queries over hours of footage (“find the moment the CEO mentions guidance”).
Moderation: flagging harmful, infringing or non-compliant content at scale, ideally before it spreads.
Highlights: auto-clipping the goals, the punchlines, the key moments for sports, entertainment and short-form repurposing.
Analytics: quantifying brand exposure, scene composition, audience-relevant events and content patterns.
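
To make "structured, searchable data" concrete, here is a minimal sketch of what a time-coded index might look like once a model has processed the footage. Everything in it is an assumption for illustration: the Segment record, the search and exposure_seconds helpers and the sample entries are hypothetical, not Twelve Labs' API, and a plain keyword match stands in for the semantic retrieval a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time-coded slice of a video, as an understanding model might describe it."""
    video_id: str
    start: float  # seconds from the start of the video
    end: float
    transcript: str = ""
    labels: set = field(default_factory=set)  # objects or logos seen on screen


def search(segments, term):
    """Return segments whose transcript mentions the term (case-insensitive)."""
    needle = term.lower()
    return [s for s in segments if needle in s.transcript.lower()]


def exposure_seconds(segments, label):
    """Total seconds during which the given label is visible on screen."""
    return sum(s.end - s.start for s in segments if label in s.labels)


# Hypothetical index entries; a real system would derive these from the footage.
index = [
    Segment("earnings_call", 4.0, 10.0, "Welcome to the call", {"acme_logo"}),
    Segment("earnings_call", 10.0, 23.5, "We are raising full-year guidance"),
    Segment("earnings_call", 23.5, 30.0, "Questions from analysts", {"acme_logo"}),
]

hits = search(index, "guidance")
print([(s.start, s.end) for s in hits])      # [(10.0, 23.5)]
print(exposure_seconds(index, "acme_logo"))  # 12.5
```

Once footage is reduced to records like these, the search and analytics questions above become ordinary queries. The hard part, which this sketch skips entirely, is producing accurate records in the first place.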

Generation gets the headlines because it’s visually spectacular. Understanding gets the budgets because it maps directly to revenue and risk — the two things enterprises pay to control.

The competitive picture

Twelve Labs does not have this space to itself, and it won’t. The large multimodal foundation models are steadily absorbing video capabilities: the frontier labs now advertise the ability to reason over video inputs, and the hyperscalers are packaging video intelligence into their cloud AI stacks. The specialist’s perennial risk is that the platform giants make “good enough” video understanding a checkbox in a broader product.

Two things work in favour of a focused player. First, demand is concentrated in high-stakes verticals — enterprise media libraries, sports, advertising, and security and surveillance — where buyers care less about a demo and more about whether the system holds up on their specific, messy footage. Second, the durable moats in this business are unglamorous: the quality and breadth of training data, and accuracy at retrieval. Getting the right clip back, reliably, across noisy real-world video is genuinely hard, and small differences in precision compound into large differences in trust.
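
One way to see why small precision gaps matter is to put numbers on them. The sketch below is illustrative only: precision_at_k is the standard retrieval metric, while chance_all_correct rests on a simplifying assumption (each result is right or wrong independently) and the clip IDs and precision figures are made up for the example.

```python
def precision_at_k(returned, relevant, k):
    """Fraction of the top-k returned clip IDs that are actually relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = returned[:k]
    return sum(1 for clip in top if clip in relevant) / k


def chance_all_correct(per_result_precision, k):
    """Probability that all k results are correct, ASSUMING independent errors."""
    return per_result_precision ** k


# Hypothetical query: four clips returned, two of them genuinely relevant.
returned = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3"}
print(precision_at_k(returned, relevant, 4))  # 0.5

# A five-point gap per result widens sharply over a page of ten results.
print(round(chance_all_correct(0.95, 10), 3))  # 0.599
print(round(chance_all_correct(0.90, 10), 3))  # 0.349
```

Under that independence assumption, a system that is right 95% of the time returns a fully clean page of ten results about 60% of the time, against roughly 35% for one that is right 90% of the time. That is the kind of gap a buyer notices.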

That is the differentiation battle to watch. A general model that understands video adequately across everything competes on breadth; a specialist that understands your video precisely competes on depth and defensibility. The strategic backing from Amazon and Naver suggests Twelve Labs intends to compete on depth plus distribution rather than try to out-scale the frontier labs on raw model size.

The India read

Here is where the bet gets locally interesting. India’s internet is arguably the most video-first large market on earth. Between Reels, YouTube Shorts, a crowded OTT landscape, live commerce, creator content in a dozen languages, and short-form vernacular platforms, the country produces and consumes staggering volumes of video every day. If the global problem is that video is piling up faster than anyone can understand it, India is where the pile grows fastest and in the most linguistically complex form.

That creates three concrete openings.

Content operations: media houses, OTT platforms and creator networks need to tag, clip, repurpose and re-version content across languages and formats. Understanding models can turn a two-hour show into hundreds of searchable, clip-ready moments — a direct cost and speed win for lean content teams.
Moderation: India’s scale and its regulatory attention on platform safety make automated video moderation a necessity, not a luxury. Real-time comprehension of livestreams and uploads — across regional languages and cultural context — is a demand curve that only steepens.
Discovery: as vernacular video explodes, text metadata fails. Search and recommendation that actually understand what’s in a clip — spoken words, on-screen text, objects, sentiment — could reshape how audiences find content in languages that English-first systems handle poorly.

The opportunity for India’s video-AI builders isn’t necessarily to out-raise Twelve Labs. It’s to win on the ground the global models handle worst: Indian languages, code-switching, regional context, and price points that fit domestic content economics. A moderation or discovery layer tuned for Hinglish livestreams or Tamil short-form is a defensible niche that a US-centric foundation model may never prioritise.

The broader takeaway is a useful corrective to the generation hype. As video eats the internet, the scarce, valuable capability is comprehension — and the $100M flowing into Twelve Labs is a reminder that the money follows the harder, less flashy problem. For Indian founders sitting on the world’s most active video ecosystem, understanding is the layer worth building.

Jack Turner

The Signal — one email, every Tuesday.