Have you ever used a tool that constantly provided incorrect results? When searching for answers, it provided links that were out of date or results not pertinent to your search? If you find that you cannot trust the results of the tool, you’ll probably stop using it and find one that better meets your needs with fresh and accurate data.
AI agents are quickly becoming the go-to tool for learning and searching the internet for information, as users can refine or expand their searches in a conversational format. But users often find that the data provided by the AI is outdated or is not useful. Users want the data or a link to the data — not advice on how to find the data.
As the example above shows, the data that powers your tool cannot be static; it must be continuously fed fresh content. AI data pipelines that ingest data rarely (or just once!) will quickly fall out of sync with the real world, leaving the agent to fall back on the generic responses of a base model like Llama 3.
When building an AI agent, you must include frameworks that continually ensure data freshness. This means that new data is continuously ingested into the agent, and the data already in the tool is monitored for freshness.
Data pipelines for AI agents can come from thousands of different locations (websites, databases, data streams, logging information, etc.). While all of the inputs that contribute data into AI agents are important, this post will focus on techniques to pull data from websites in real time.
Techniques to get real-time data into an AI agent
In the examples above, the AI agent with real-time Apple stock data delivers a better user experience. So, how does one build an AI agent with up-to-the-minute information?
While AI agents often converse like Large Language Models (LLMs), LLMs take a long time to train. For agents that require real-time data, LLMs alone cannot provide the content that is required. So, how do developers pair real-time data with the conversational response of an LLM?
Retrieval Augmented Generation (RAG) is an AI architecture that augments an LLM with data stored in an external source, typically a vector database like Weaviate, Qdrant or Pinecone. When a query is run, the LLM combines fresh data retrieved from the database with its own generated text to produce a conversational, up-to-date response.
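To make the retrieval step concrete, here is a minimal sketch in Python. A toy in-memory list stands in for a vector database like Weaviate or Qdrant, and the LLM call is reduced to prompt assembly; the function and field names are illustrative, not any particular product's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """Return the k documents whose embeddings are most similar to the query.
    In production, the vector database performs this search itself."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, docs):
    """Ground the LLM's answer in the freshly retrieved documents."""
    context = "\n".join(d["text"] for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The point is the flow: embed the query, retrieve the closest fresh documents, and hand them to the LLM as context so its response reflects the latest data rather than its training cutoff.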
When teams build pipelines to continuously update the RAG vector database with fresh data, AI agents using RAG architectures stay up-to-date while still providing the detailed LLM-style responses that users have come to expect.
By leveraging real-time data pipelines within a RAG architecture, AI agents and chatbots draw on the most up-to-date data through continuous updates. With fresh data, the AI agent builds trust, and users can be confident that its responses are accurate.
For many data sources, real-time streaming from production databases can be wired into the vector database to keep the content up-to-date. Product inventories, train schedules and other items with regular updates can be fed via streaming services into the vector database.
Data published to the web is different. To aggregate web content, teams build a web scraping pipeline that accesses the sites, parses the content and inserts it into the vector database. When building a web scraping pipeline for Agentic AI, there are several decisions to make: how often to collect the data, how to trigger the scrapes, and how to process and insert the data into the RAG database.
Data collection frequency
Web scraping is the process of having a script or a headless browser read the content on a website. The content can then be injected into a pipeline and inserted into the database. How frequently should web scraping tools be run? If run too frequently, the website might detect the scraping tool and restrict access to the content. If the scraping occurs too infrequently, the data provided by the Agentic AI may be out of date. So how do we choose the frequency at which data is scraped from the web?
In general, the processes used to scrape website data can be grouped into two buckets: Real-time ingestion and batched ingestion.
Real-Time Ingestion
Real-time data ingestion from websites is often performed through an API call. For example, it’s not feasible for a RAG database to ingest real-time weather or traffic data for every location in the world; the volume of raw data to refresh every few minutes would be unmanageable. Instead, the RAG can use real-time APIs in conjunction with the database. With connections to weather and traffic services, the APIs can respond quickly with the required real-time data and provide it to the LLM for the response.
Real-time ingestion is the preferred approach for dynamic content like stock prices, sports scores, breaking news, etc. With fresh data straight from the API, the AI agent can provide up-to-the-minute content for users.
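As a sketch, real-time lookup can be as simple as an API call whose result is spliced into the LLM's context. The endpoint URL and JSON field names below are hypothetical, and the `fetch` parameter exists so the network call can be stubbed out in tests.

```python
import json
from urllib.request import urlopen

def fetch_live_context(url, fetch=None):
    """Fetch real-time JSON (e.g. a weather API) and format it as prompt context.
    `fetch` can be overridden for testing; the default performs an HTTP GET.
    The 'updated' and 'summary' fields are hypothetical, not a real API's schema."""
    fetch = fetch or (lambda u: urlopen(u, timeout=5).read())
    data = json.loads(fetch(url))
    return f"Live data as of {data['updated']}: {data['summary']}"
```

The returned string would be appended to the retrieved documents in the prompt, so the LLM answers with the API's up-to-the-minute values instead of stale database content.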
For websites controlled by the same organization, the web team can build webhooks that alert an AI pipeline whenever a new update has been published. That way, every website update can be scraped and parsed within minutes of launch.
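A minimal webhook receiver for this pattern can be built with only the Python standard library. The payload shape (`{"url": ...}`) and the port are assumptions; a real CMS defines its own webhook contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPE_QUEUE = []  # URLs waiting to be scraped and re-ingested

def handle_update(payload):
    """Queue the changed URL for immediate scraping (hypothetical payload shape)."""
    SCRAPE_QUEUE.append(payload["url"])
    return payload["url"]

class WebhookHandler(BaseHTTPRequestHandler):
    """Receives a POST from the web team's CMS whenever a page is published."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        handle_update(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self.end_headers()

def run(port=8080):
    """Start listening for publish notifications."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

A worker process would then drain `SCRAPE_QUEUE`, scrape each URL, and push the parsed content into the ingestion pipeline within minutes of publication.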
Batched Data
Not all webpages that are scraped for your Agentic AI will be under your organization’s control. These pages will need to be scraped by a tool at a regular interval to maintain data freshness. Knowledge bases, blog posts and other websites change regularly with useful knowledge for AI agents, but the changes can occur on random timescales: Hours, days or even weeks.
These sites can be updated using batched scripts. These scripts run at regular intervals to scrape the pages for data. Once new data is scraped, it can be compared to the existing data in the RAG and updated as needed.
For do-it-yourselfers, open source scraping tools like Beautiful Soup, Selenium or Puppeteer can be scheduled to run on a regular cadence to load sites and return the data. Batched groups of URLs might be kicked off with a cron job on a local machine, where the cron task runs a process to scrape a list of pages. Once the data is scraped, an AI pipeline converts it and inserts it into the vector database.
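A do-it-yourself batch scraper might look like the sketch below, which uses the standard library's `html.parser` in place of Beautiful Soup so it stays dependency-free. A cron entry such as `*/30 * * * * python scrape.py` (illustrative) would run it on a cadence.

```python
import html.parser
import urllib.request

class TextExtractor(html.parser.HTMLParser):
    """Collect visible page text, skipping script/style blocks.
    A stdlib stand-in for what Beautiful Soup's get_text() provides."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def scrape(url, html_text=None):
    """Scrape one URL and return its visible text.
    `html_text` lets tests bypass the network entirely."""
    if html_text is None:
        html_text = urllib.request.urlopen(url, timeout=10).read().decode()
    parser = TextExtractor()
    parser.feed(html_text)
    return " ".join(parser.parts)
```

Looping `scrape()` over a list of URLs and handing each result to the embedding step is the whole batch job; everything beyond that (retries, scheduling, backoff) is what workflow tools like Airflow add.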
As the web scraping project begins to scale and grow, workflow automation tools like Airflow can simplify the process. One advantage to using automation tools is that retry mechanisms are easier to automate, and the workflow can also handle the process of feeding the data into the RAG database.
Another important consideration is that many website providers implement measures such as Cloudflare and CAPTCHA systems to block malicious or unwanted automated traffic. Various commercial tools address these challenges by employing proxies and simulating human-like behavior to improve scraping success rates.
Scheduling and Triggering Web Scrapes
When building a web scraping pipeline, you must strike the right balance between scraping a page too often (where no change has occurred), versus waiting too long and missing a critical piece of information for your RAG. Typically, the schedule for collecting and processing data from websites is based on how often they are updated: Hourly, daily, weekly or monthly. If building pipelines, your workflows might automatically shift sites from one bucket to another, as publishing frequency changes.
When it comes to scheduling the scraping of pages, there are a few tricks that can save loads of time and effort:
- Sitemaps: Many websites publish a sitemap, an XML listing of all the pages on the site, including a timestamp of when they were last modified. The snippet below also shows that WordPress sites include a changefreq term that can be used as a suggestion for collection frequency when building scheduling tooling. Sitemap pages can often be found at /sitemap.xml on a website.
<lastmod>2025-04-15T00:14:46+00:00</lastmod>
<changefreq>monthly</changefreq>
- llms.txt: Some sites have introduced an llms.txt file. It is similar to the sitemap but designed specifically for LLM scraping. The URLs included in the llms.txt point to markdown versions of each page, which are easier to parse and extract into vectors, simplifying the scraping process.
- RSS feeds: Many blogs have RSS feeds. An RSS feed is an XML listing of content added to a site in order of freshness. While the sitemap has URLs pointing to the content, the RSS feed contains the full content of each page, making further scraping unnecessary. RSS pages are often found at /feed, /rss or /rss.xml.
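These scheduling hints can be read programmatically. A sketch that pulls the `lastmod` and `changefreq` fields out of a sitemap with the standard library (the XML namespace is the standard sitemap protocol namespace):

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap.xml documents
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Yield (url, lastmod, changefreq) tuples from a sitemap document.
    lastmod/changefreq are optional in the protocol, so either may be None."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        changefreq = url.findtext("sm:changefreq", namespaces=NS)
        yield loc, lastmod, changefreq
```

A scheduler can then bucket each URL by its `changefreq` hint, and skip any page whose `lastmod` has not advanced since the last scrape.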
Detecting and Handling Content Changes
Once a page has been scraped, the AI pipeline will process data: Converting the scraped text into vectors for insertion into the database. The LLM used in the RAG will generally have an API endpoint to convert the content into a vector. These vectors can be stored in the RAG database and can be used to compare the relevancy and similarity of the content to queries and other content in the database.
Before inserting the data, the ingestion workflow must identify if the content from the page already exists in the RAG database. One common approach is to include data freshness metadata for each entry in your RAG database (example below). Every time data updates from a page are scraped, the AI pipeline compares the last modified date and the hash of the updated data to what is stored in the vector database. If they are the same, the page’s information has already been added to the RAG database, and no further action is needed. A difference indicates that the new content is fresher than the content currently in the database, and that the data in the RAG should be updated.
{
  "url": "https://example.com/news/article123",
  "source": "example.com",
  "timestamp": "2025-06-20T13:00:00Z",
  "last_scanned": "2025-06-20T13:00:00Z",
  "timescale": "Month",
  "hash": "a1b2c3…", // optional, for versioning
  "category": "news"
}
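Using metadata like the record above, the freshness check reduces to a hash comparison. A minimal sketch, with field names following the example record:

```python
import hashlib

def needs_update(scraped_text, stored_meta):
    """Return True when the scraped page content differs from what the
    RAG database already holds, signaling that the vectors must be refreshed."""
    new_hash = hashlib.sha256(scraped_text.encode("utf-8")).hexdigest()
    return stored_meta.get("hash") != new_hash
```

If the function returns False, the page is unchanged and no further action is needed; if True, the pipeline proceeds to replace the page's vectors.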
When inserting updated data from a webpage, the previous version of the page must be removed to preserve AI data freshness. For example, if a competitor’s website indicates that their pricing has changed from $49/month to $75/month, the fresh data must be inserted and the old data removed to ensure the AI agent is up to date. The process generally follows these steps:
- Data cleaning: Remove all existing vectors from the database sourced from the page in question.
- Chunking: Break the new content into chunks and create vector embeddings.
- Uploading Vectors: Insert the new vectors into the DB.
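The three steps can be sketched against a hypothetical vector database client. The `delete`/`upsert` method names and the fixed-size character chunking are illustrative stand-ins for whatever your database and chunking strategy actually provide:

```python
def refresh_page(db, url, new_text, embed, chunk_size=500):
    """Replace all vectors for `url` with freshly embedded content.
    `db` is a hypothetical vector database client; `embed` turns text into a vector."""
    # 1. Data cleaning: remove every existing vector sourced from this page
    db.delete(filter={"url": url})
    # 2. Chunking: split the new content and create vector embeddings
    chunks = [new_text[i:i + chunk_size] for i in range(0, len(new_text), chunk_size)]
    vectors = [{"url": url, "text": c, "vector": embed(c)} for c in chunks]
    # 3. Uploading vectors: insert the new vectors into the DB
    db.upsert(vectors)
```

Real chunkers usually split on sentence or section boundaries rather than raw character counts, but the delete-then-upsert shape is the same.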
Creating pipelines to ingest data
When building a web scraping pipeline to ingest fresh data from the web, the process might consist of a few Python scripts and API calls that are kicked off by CRON jobs. As collecting and processing data becomes more complex, the process can quickly become unsustainable. Scripts are prone to failure, sites may become aggressive against web scraping tools and keeping the whole process up and running involves tons of tweaking and editing of your scripts. Any downtime means that the AI agent’s data freshness suffers.
Building workflows and AI pipelines can make scraping and ingesting website data into your RAG and AI agent faster and easier to maintain. Purpose-built scraping tools simplify the scraping process and improve reliability and access to web data, while data pipeline workflow tools simplify the real-time processing: converting content into vectors and inserting the fresh data into the RAG database backend.
Measuring AI Data Freshness
When designing your RAG database, ensure that all data collected has metadata describing the time and frequency at which the data was collected. This makes checking the data for updates much easier. In the example metadata presented in the last section, the last_scanned and timescale values show that the file was scanned on June 20, and will next be scanned on July 20, 2025 (one month from the last scan).
Prometheus is a tool that can monitor the timestamps and dates in the database to ensure data freshness. By comparing last_scanned + timescale, it can flag stale entries so those sites are added to the pool for the next scraping batch. If desired, the data from Prometheus can be piped into charting tools like Grafana to visually indicate the freshness of the content being used by your AI agents.
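The staleness test that such monitoring alerts on can be sketched directly. Here "Month" is approximated as 30 days (an assumption, not a standard), and the field names follow the metadata example shown earlier:

```python
from datetime import datetime, timedelta

# Illustrative mapping; "Month" is approximated as 30 days
TIMESCALES = {
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
    "Month": timedelta(days=30),
}

def is_stale(meta, now=None):
    """Return True when last_scanned + timescale has elapsed,
    meaning the page should be re-queued for scraping."""
    now = now or datetime.utcnow()
    last = datetime.strptime(meta["last_scanned"], "%Y-%m-%dT%H:%M:%SZ")
    return now >= last + TIMESCALES[meta["timescale"]]
```

A monitoring job would export the result of this check as a metric per source, and an alert on it would feed the stale URLs back into the scraping batch.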
Summary
Real-time AI agents are the future of interacting with and querying knowledge stores. When building AI pipelines for your AI agent, it is critical to ensure that the agent is constantly fed fresh data. If your RAG is not updated frequently, the responses from your AI agent become less accurate, and customers will trust the results less and less. In this post, we examined strategies for building AI pipelines to scrape the web and insert or replace the scraped content in the AI agent’s RAG database.