LangChain and LlamaIndex are powerful tools for AI integration. When you combine these tools with an LLM, you can transform raw web data into meaningful context that your AI models can use. Follow along and level up your web data infrastructure.
When you’re finished with this tutorial, you’ll be able to answer the following questions:
- What is seamless data ingestion?
- What is LlamaIndex?
- What is LangChain?
- How do you design a data flow with these tools?
What is seamless data ingestion?
In this industry, we’ve all heard of data pipelines. Seamless data ingestion takes the concept of a data pipeline even further. Like a pipeline, we harvest raw data from the web. Then, we clean it. A normal data pipeline pushes the clean data into a spreadsheet or database — the pipeline ends there. With seamless ingestion, your data flows into a larger system — and it’s usually AI-based.
This sort of pipeline allows for truly real-time data feeds. Integrate it with AI and you have a smart system that’s always up to date with the latest information. With a robust ingestion flow, your AI receives a continuous supply of useful data — automatically.
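To make the distinction concrete, here is a minimal sketch of seamless ingestion in plain Python. A classic pipeline stops once the cleaned data is stored; here, the cleaned data flows straight into a downstream consumer. All names (`fetch`, `clean`, `sink`) are hypothetical placeholders, not part of any real library.

```python
def fetch(source: str) -> str:
    # stand-in for a real HTTP fetch
    return f"<html><body>raw data from {source}</body></html>"

def clean(raw: str) -> str:
    # stand-in for real HTML cleaning
    return raw.replace("<html><body>", "").replace("</body></html>", "")

def ingest(source: str, sink) -> None:
    # the "seamless" part: cleaned data flows straight into a consumer,
    # rather than ending its life in a spreadsheet or database
    sink(clean(fetch(source)))

received = []
ingest("example.com", received.append)
print(received)  # the consumer now holds the cleaned data
```

In a real system, `sink` would be your AI layer, such as a vector index update.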
What is LangChain?

LangChain provides a robust framework for connecting LLMs to tools and functions. It works by letting an LLM chain together inputs and outputs to control different tools.

The diagram above shows how LangChain works in practice. Our LLM decides on an action. LangChain routes the action to the tool. The tool sends the result back to LangChain. Then, LangChain converts the result into an LLM-friendly output — the model can read the output to help with its next decision.
This is useful for all types of automation. The applications are practically limitless: if you can write a function for it, you can convert it into a LangChain tool. Here are just a few things LangChain can do for your LLM:
- Searching a vector database.
- Surfing and scraping the web.
- Saving data to a storage medium.
- Performing calculations.
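The routing cycle described above can be mimicked in plain Python: a registry maps tool names to functions, and each result is converted back into text the model can read before its next decision. This is a toy sketch of the pattern, not LangChain’s actual internals, and both tool names are hypothetical.

```python
# toy tool registry: maps tool names to plain functions
tools = {
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # hypothetical calculator tool
    "save": lambda text: f"saved {len(text)} chars",              # hypothetical storage tool
}

def route(action: str, payload: str) -> str:
    # route the model's chosen action to a tool, then return an
    # LLM-friendly string the model can use for its next decision
    result = tools[action](payload)
    return f"Tool '{action}' returned: {result}"

print(route("calculator", "6 * 7"))  # Tool 'calculator' returned: 42
```

LangChain’s real agents add schemas, error handling, and prompt plumbing on top of this basic dispatch idea.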
What is LlamaIndex?

LlamaIndex is just as revolutionary as LangChain. LlamaIndex allows us to convert raw unstructured data into a vector database for our LLM to reference. Our scraped data has no real structure — LlamaIndex gives it a searchable format that the LLM can read.

The chart above shows our LLM’s connections to both LangChain and LlamaIndex. LlamaIndex does for an LLM what SQL does for traditional webapps — it stores the data in a lightweight, machine-readable format. When the user inputs a URL, the LLM can first check the vector store. If the information is missing or out of date, the LLM can then use the scraping tool to find the relevant information and update LlamaIndex.
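To build intuition for what a vector store does under the hood, here is a toy illustration of vector search. Real stores like LlamaIndex use learned embedding models; in this sketch the “embeddings” are just bag-of-words counts compared by cosine similarity.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # toy "embedding": a bag-of-words count (real stores use learned vectors)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two sparse word-count vectors
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = ["python web scraping guide", "cooking pasta at home"]
vectors = [embed(d) for d in docs]

query = embed("how to scrape the web with python")
best = max(range(len(docs)), key=lambda i: cosine(query, vectors[i]))
print(docs[best])  # the scraping doc is the closest match
```

The principle is the same at scale: embed the query, compare it against stored document vectors, and return the nearest matches as context for the LLM.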
Designing our data flow
Before we start writing code, we need to understand our data flow from one end to the other. With our LLM hooked into both the scraping tool via LangChain and a database using LlamaIndex, we’ll need a basic runtime. The user can input a URL. Then, the model checks its internal data and performs a new scrape if need be. Finally, the summary gets sent to the user.

- User inputs a URL.
- The LLM checks the vector store (LlamaIndex). If the stored data is current, skip to the final step.
- If the data is old or nonexistent, use the LangChain scraping tool.
- Update LlamaIndex with the newly extracted data.
- Output the summary to the user.
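The five steps above can be sketched as a single function with stubbed-out scrape and summarize layers. These stand-ins are hypothetical placeholders for the LangChain tool and LlamaIndex query we build later in this tutorial.

```python
# hypothetical in-memory stand-in for the vector store
vector_store: dict[str, str] = {}

def scrape(url: str) -> str:
    return f"cleaned text from {url}"  # placeholder for the real LangChain tool

def summarize(text: str) -> str:
    return text[:60]  # placeholder for the real LlamaIndex query

def handle_url(url: str) -> str:
    # step 2: check the store; steps 3-4: scrape and update if missing
    if url not in vector_store:
        vector_store[url] = scrape(url)
    # step 5: summarize from the store
    return summarize(vector_store[url])

print(handle_url("https://example.com"))  # first call scrapes
print(handle_url("https://example.com"))  # second call hits the cache
```

The rest of the tutorial replaces each placeholder with the real thing while keeping this exact control flow.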
Building the infrastructure
Next, we’ll get started actually building. We’ll start by installing dependencies and adding our import statements to a Python file.
First, we’ll install Requests. We use this to fetch the page.
pip install requests
Next, we’ll install BeautifulSoup for extracting the page data.
pip install beautifulsoup4
Now, we install LangChain and its OpenAI package.
pip install langchain langchain-openai
Finally, we install LlamaIndex and its OpenAI tooling as well.
pip install llama-index llama-index-embeddings-openai llama-index-llms-openai
Now, we’ll add our imports and API key.
import os
import requests
from bs4 import BeautifulSoup
from pathlib import Path
from langchain.tools import StructuredTool
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings
openai_api_key = "your-openai-api-key"
Writing the scraper
Next, we’ll write a basic scraping function. We use Requests to GET the page and BeautifulSoup to extract its text. After defining the function, we create the scrape_tool variable so our agent can access it.
# basic scrape function -- fetch a site and extract its text
def scrape_and_clean(url: str) -> str:
    """Fetch & clean a web page using requests + BeautifulSoup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

# convert the function into a tool that LangChain can call
scrape_tool = StructuredTool.from_function(scrape_and_clean)
Creating the actual agent
We use GPT-4o for our agent, but you can use any OpenAI model you like. We give our agent a system prompt so it understands its purpose. We also create a data folder to hold the scraped text for the connection to LlamaIndex.
#llm for the agent -- use any openai model you want
llm = ChatOpenAI(model="gpt-4o", openai_api_key=openai_api_key)
#system prompt for the agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a smart agent that can use tools to fetch and clean websites."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])
#create an agent from our tooling
agent = create_openai_functions_agent(llm, tools=[scrape_tool], prompt=prompt)
executor = AgentExecutor(agent=agent, tools=[scrape_tool], verbose=True)
#ensure we have a data folder
Path("data").mkdir(exist_ok=True)
The runtime loop
The code below holds our runtime loop. First, the user inputs a URL. If the file exists within the data folder, we skip the scrape. If the file isn’t present in our storage, the agent uses our scraping tool to extract the text from the URL and update LlamaIndex. The model then summarizes the site to the user.
# runtime loop for realtime chat
while True:
    url = input("\nEnter a URL to summarize (or 'exit' to quit): ").strip()
    if url.lower() == "exit":
        break
    # format the filename
    filename = url.replace("https://", "").replace("http://", "").replace("/", "_") + ".txt"
    filepath = f"data/{filename}"
    if os.path.exists(filepath):
        print("✅ Found cached version. Skipping scrape.")
    else:
        print("🔍 Not found. Asking agent to fetch...")
        agent_result = executor.invoke({
            "input": f"Use the scrape tool to get the cleaned text of {url} and show the result only."
        })
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(agent_result["output"])
        print("✅ New content saved.")
    # build or update index
    documents = SimpleDirectoryReader(input_dir="data").load_data()
    embed_model = OpenAIEmbedding(api_key=openai_api_key)
    llm_for_index = OpenAI(model="gpt-4o", api_key=openai_api_key)
    Settings.embed_model = embed_model
    Settings.llm = llm_for_index
    index = VectorStoreIndex.from_documents(documents)
    # query the index for summary
    query_engine = index.as_query_engine()
    response = query_engine.query(f"What is {url} about?")
    print("\n=== LlamaIndex RESPONSE ===")
    print(response)
Putting it all together
Here’s our full code example.
import os
import requests
from bs4 import BeautifulSoup
from pathlib import Path
from langchain.tools import StructuredTool
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings
openai_api_key = "your-openai-api-key"
# basic scrape function -- fetch a site and extract its text
def scrape_and_clean(url: str) -> str:
    """Fetch & clean a web page using requests + BeautifulSoup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

# convert the function into a tool that LangChain can call
scrape_tool = StructuredTool.from_function(scrape_and_clean)
#llm for the agent -- use any openai model you want
llm = ChatOpenAI(model="gpt-4o", openai_api_key=openai_api_key)
#system prompt for the agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a smart agent that can use tools to fetch and clean websites."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])
#create an agent from our tooling
agent = create_openai_functions_agent(llm, tools=[scrape_tool], prompt=prompt)
executor = AgentExecutor(agent=agent, tools=[scrape_tool], verbose=True)
#ensure we have a data folder
Path("data").mkdir(exist_ok=True)
# runtime loop for realtime chat
while True:
    url = input("\nEnter a URL to summarize (or 'exit' to quit): ").strip()
    if url.lower() == "exit":
        break
    # format the filename
    filename = url.replace("https://", "").replace("http://", "").replace("/", "_") + ".txt"
    filepath = f"data/{filename}"
    if os.path.exists(filepath):
        print("✅ Found cached version. Skipping scrape.")
    else:
        print("🔍 Not found. Asking agent to fetch...")
        agent_result = executor.invoke({
            "input": f"Use the scrape tool to get the cleaned text of {url} and show the result only."
        })
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(agent_result["output"])
        print("✅ New content saved.")
    # build or update index
    documents = SimpleDirectoryReader(input_dir="data").load_data()
    embed_model = OpenAIEmbedding(api_key=openai_api_key)
    llm_for_index = OpenAI(model="gpt-4o", api_key=openai_api_key)
    Settings.embed_model = embed_model
    Settings.llm = llm_for_index
    index = VectorStoreIndex.from_documents(documents)
    # query the index for summary
    query_engine = index.as_query_engine()
    response = query_engine.query(f"What is {url} about?")
    print("\n=== LlamaIndex RESPONSE ===")
    print(response)
If you input a URL, the LLM will get the site summary from LlamaIndex or perform a new scrape for the site — and then update LlamaIndex.
In the output below we asked it for an example site and then asked it for the same site again. The first time, it fetched the page. The second time, it found our data in storage.
Enter a URL to summarize (or 'exit' to quit): https://example.com
🔍 Not found. Asking agent to fetch...
> Entering new AgentExecutor chain...
Invoking: `scrape_and_clean` with `{'url': 'https://example.com'}`
Example Domain
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...**Example Domain**
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...
> Finished chain.
✅ New content saved.
=== LlamaIndex RESPONSE ===
The domain https://example.com is used for illustrative purposes in documents. It can be utilized in literature without needing prior coordination or permission.
Enter a URL to summarize (or 'exit' to quit): https://example.com
✅ Found cached version. Skipping scrape.
=== LlamaIndex RESPONSE ===
The domain is intended for use in illustrative examples in documents. It can be used in literature without needing prior coordination or permission.
Enter a URL to summarize (or 'exit' to quit):
Seamless data ingestion powers LLM workflows
With seamless data ingestion, your LLM can decide what data it needs, when it needs it, and where to store it for future use. With LangChain’s tooling and the vector database from LlamaIndex, we can create a real scraping agent that only runs new extractions when it needs to.
The loop you made here today is the same basic backbone that holds up most modern chatbots. When you ask ChatGPT or Gemini to summarize a website, they use this same basic workflow under the hood.