
Major Investments Shaping Web Data Infrastructure for AI

This article breaks down the major investments and acquisitions reshaping web data infrastructure for AI. Learn why scraping, public web data, and access pipelines are becoming core assets for training and agentic AI systems.
Author Jake Nulty

Web data and AI models are converging fast. In 2026, we’re seeing a continued wave of agentic AI flowing into companies all around the world. Companies are realizing very quickly that web data access is central to building functional AI agents. In the last year, we’ve seen hundreds of millions of dollars poured into web data infrastructure for AI.

In this piece, we’re going to map where this money is coming from and where it’s going. By the time you’ve finished reading, you’ll be able to answer the following questions.

  • How are public web data providers changing?
  • Why are AI companies buying and building web data pipelines?
  • Where is the money going and what does this say about the future?

The scraping layer is consolidating

The web scraping industry is booming. There’s no other way to describe it. Thousands of years ago, irrigation helped us grow crops during the agricultural revolution. Today, we’re using web data to grow AI systems. Not long ago, companies like Oxylabs and Bright Data were known primarily as proxy providers — companies bought bandwidth from them and built scrapers. At the time, this was a sort of niche market in tech. Today, public web data and automated access are the primary feeds for both training data and retrieval-augmented generation (RAG) pipelines.

The market for public web data is exploding

Recently, Oxylabs purchased ScrapingBee in what seemed like a perplexing move. In 2019, the small team behind PricingBot and ShopToList decided to build a web scraping API. In 2020, ScrapingBee received funding from TinySeed. In the last year, they were purchased for an undisclosed 8-figure sum after the company reportedly achieved “triple digit annual growth” according to FOX44 news. Enterprise giants don’t buy tools like this unless they see the potential for continued exponential growth. You can read more about the acquisition on ScrapingBee’s blog.

Bright Data has also seen massive growth. According to this article, the company has been growing by more than 50% year over year and is reportedly on track for $400 million in revenue in 2026. The same report notes that Bright Data now operates the third-largest cached web page repository, behind only Google and the Internet Archive. In the Data for AI Report, Bright Data noted a 132% increase in real-time data usage alone.

Firecrawl has also seen similar growth. In August 2025, this company completed their Series A funding round and received $14.5 million from Nexus Venture Partners, Tobias Lütke (CEO of Shopify), Abhinav Asthana (CEO of Postman) and Matt McClure (Founder of Mux). Y Combinator also increased their investment. When Firecrawl published this report, their company was only 16 months old and had shown “15x” (1500%) growth.

Web scraping is showing unprecedented growth as AI demand increases. A few years ago, web scraping was mainly about proxy integration. Today’s world needs real time web access, search features and CAPTCHA solving to keep AI agents running. Companies and investors that once had no business in web scraping are now throwing everything at public web access.

Funding: the AI-native data access layer

Alongside traditional scraping, we’re seeing entirely new access paradigms. AI-native tools like semantic search and agent browsers are helping power the next generation of AI systems.

Exa recently raised $85 million in Series B funding. This capital is expected to fund stage 2 of their roadmap. Exa plans to scale up its indexing to “gather the vast majority of the world’s information.” To support this, they need a GPU cluster five times their current size. Exa also plans on expanding their team to “build, sell and lead Exa into its next stage.” This next stage is incredibly ambitious. According to this same report, Exa believes we’ll be seeing upwards of 28 billion AI-driven web searches per day by 2030. Today, humans make approximately 14 billion searches per day. Exa’s report on this funding is available here.

You.com raised $100 million in Series C funding in September 2025. According to their report, they’ve been valued at $1.5 billion. They cite Cox Enterprises, Georgian, Salesforce Ventures and Norwest as investors. You.com began building AI search APIs in December of 2022 to help address the knowledge cutoff problem that comes with pretrained models. As their search products gained traction, demand continued to climb, and You.com now considers itself an “AI infrastructure company” rather than a simple API provider.

In November of 2025, Parallel Web Systems announced a $100 million Series A with a valuation of $750 million. Parallel offers both a search API and an extract API built specifically for AI models. They also provide agent tools. In their report, they mention funding from Kleiner Perkins, Index Ventures, Spark Capital, Khosla Ventures, First Round Capital and Terrain. Their report cites Clay, Sourcegraph, Owner, Starbridge, Actively, Genpact and “leading Fortune 500 companies” among their userbase.

Another emerging provider, Nimble, announced a $47 million Series B in February of 2026. They provide AI search as well as structured data feeds. In their report, they announced a partnership with Databricks and also cited “Fortune 500 banks, top-tier consulting firms and leading retail and CPG companies” as part of their userbase.

Nebius also announced in February 2026 that they are acquiring Tavily. Yahoo Finance reports that the acquisition values Tavily at $275 million, although the exact terms of the deal have not been publicly disclosed. Nebius calls itself the “AI cloud company.” They offer AI cloud infrastructure such as compute, storage and Kubernetes. Nebius also offers inference, image generation and post-training services via their Token Factory. Tavily’s AI-native search features bring Nebius closer to offering an end-to-end suite for AI products.

AI companies are buying data infrastructure

Infrastructure companies aren’t just investing in AI. AI companies are investing directly in web data infrastructure. The first generation of LLMs taught us a valuable lesson: To properly handle context windows and knowledge cutoffs, AI agents need memory and real-time web data.

In June 2025, Meta invested in Scale AI and valued the company at over $29 billion, according to a post by Scale. Forbes reports that the deal included a $14.3 billion investment for 49% ownership of Scale. The 49% is telling. Meta isn’t after complete control. They want maximum exposure while allowing Scale to remain its own entity.

In May of 2025, Salesforce announced plans to buy Informatica for $8 billion. In November 2025, Salesforce announced that they had completed the acquisition. Salesforce cited improvements to Data 360, MuleSoft, Agentforce 360 and Tableau as benefits of the acquisition. Informatica’s mission is to provide teams with a unification platform for AI data. They will continue to serve their own customers while making improvements to Salesforce as well.

Confluent is a real-time data streaming provider. In December 2025, IBM announced plans to acquire Confluent at a cost of $31 per share, for a total valuation of $11 billion. IBM has since announced that the acquisition is complete. IBM acknowledged that there is currently a gap in enterprise data architecture. The Confluent deal will help them address this and integrate real-time data pipelines into AI systems.

What the capital is saying

The first generation of AI tools exposed a critical gap in the industry. In 2023, models were already powerful, but their use cases were relegated mostly to chat features. Since the industry first made the leap from chatbots to agents, external data has been the main bottleneck. A pretrained model only knows about its training data. Context windows were addressed with memory systems and vector databases.

However, AI agents need more than just memory and task continuity. To perform research, models need search tools. To execute complex tasks, models often need full-fledged browsers. Data and AI companies have seen the gap and they are addressing it rapidly. AI-native search has been the most common investment of this last year, but real-time data pipelines are the actual endpoint. In the last year alone, the figures in this article show that over $33 billion has flowed into web data infrastructure.
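For readers who want to check that total, here is a minimal Python sketch that adds up the dollar figures quoted in this article. The selection is an assumption about how the total might be built, not a definitive accounting: it leaves out the ScrapingBee deal because the amount was undisclosed, and it counts Tavily at its reported valuation because the deal terms were not made public. The labels and the `total_usd_billions` helper are illustrative names only.

```python
# Dollar figures as quoted in this article, in millions of USD.
# Assumption: which deals to count is our own choice for illustration.
# ScrapingBee is excluded because its price was undisclosed.
DEALS_USD_MILLIONS = {
    "Firecrawl Series A": 14.5,
    "Exa Series B": 85,
    "You.com Series C": 100,
    "Parallel Web Systems Series A": 100,
    "Nimble Series B": 47,
    "Tavily acquisition (reported valuation)": 275,
    "Meta investment in Scale AI": 14_300,
    "Salesforce acquisition of Informatica": 8_000,
    "IBM acquisition of Confluent": 11_000,
}


def total_usd_billions(deals):
    """Sum deal sizes given in millions of USD and return billions."""
    return sum(deals.values()) / 1_000


if __name__ == "__main__":
    # List the deals from largest to smallest, then print the total.
    for name, amount in sorted(DEALS_USD_MILLIONS.items(), key=lambda kv: -kv[1]):
        print(f"{name:<42} ${amount:>10,.1f}M")
    print(f"Total: ${total_usd_billions(DEALS_USD_MILLIONS):.2f} billion")
```

Under these assumptions the sum comes to roughly $33.9 billion. The three large deals (Scale AI, Informatica and Confluent) account for $33.3 billion of that, so the AI-native search rounds add only about $0.6 billion on top.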

AI agents need web data. AI companies are handing out blank checks to purchase it.

Written by

Jake Nulty

Software Developer & Writer at Independent

Jacob is a software developer and technical writer with a focus on web data infrastructure, systems design and ethical computing.
