AI systems must constantly integrate new data to stay up to date. When an AI tool reports old, out-of-date information, customers lose trust in the results. But where does fresh data come from? How is it collected, and how is it added to the LLM’s “knowledge”? In this post, we’ll walk through commonly used techniques for building document processing pipelines, specifically looking at how systems can work with unstructured data. We will build a workflow that collects the files to be added to our AI tooling, extracts the unstructured data from those files, and formats and structures it. Finally, the workflow will add this newly structured data into the AI tooling.
What is unstructured data?
Data that is used by an LLM for inference must be highly structured (think of cells in a row of a spreadsheet or a database). Unstructured data refers to formats that don’t follow a predictable schema, such as freeform text, images, scanned documents or PDFs.
Despite its unstructured state, this data holds information that is critical for adding intelligence to AI systems. To structure the data for AI systems, we will process it using tools like OCR, parsing and normalization. To understand how best to structure the data, let’s look at how AI tools use and incorporate structured data.
Using structured data in AI
Once the data is structured, AI systems make it available through retrieval augmented generation (RAG). To understand the format the structured data takes in an AI tool, we need to understand how RAG works. A RAG system combines an LLM for processing and reasoning with a vector database containing structured information.
When a user poses a query to the agent, the RAG performs a similarity search on the structured data in the vector database to find relevant content. The content retrieved from the database and the initial query are then fed to the LLM. The response from the LLM is based on both the initial query and the structured data, allowing for a more contextually accurate result.
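That query flow can be sketched in a few lines. This is a minimal illustration, not a specific framework’s API: the embed_query, search_vectors and call_llm helpers are hypothetical stand-ins for the embedding model, vector database and LLM.

```python
def answer_with_rag(query, embed_query, search_vectors, call_llm, top_k=3):
    """Sketch of a RAG query: retrieve relevant chunks, then prompt the LLM."""
    query_vector = embed_query(query)               # embed the user's query
    chunks = search_vectors(query_vector, k=top_k)  # similarity search in the vector DB
    context = "\n".join(chunks)                     # retrieved structured data
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)                         # response grounded in query + data
```

The key design point is that the LLM never sees the whole database: only the few chunks the similarity search judged relevant are placed in the prompt.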
How is data structured?
Structuring the data means breaking it into small, tokenized chunks. Each token has a cost when sent to the LLM, so the size or length of the structured data is important: we don’t want to send the full whitepaper, just a relevant portion of it. However, if the created chunks are too small, context is lost, and the similarity search does not work very well.
To compare chunking strategies, let’s look at the second paragraph of Lincoln’s Gettysburg Address and get an idea of how the data should be structured:
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
Recall that a RAG similarity search looks for entries in the database that have appropriate context. So what are the options for structuring the data?
- Chunk each word:
- Advantage: Small chunks, which cost less to process through the LLM.
- Disadvantage: All context is lost: “Now,” “we” and “are” do not explain the context of the entry.
- Chunk sentences: text.split(".")
- Advantage: Each sentence holds context.
- Disadvantage: Single sentences may not convey the context of the paragraph.
- Overlap Sentences:
- For example, three chunks: sentences one and two, sentences two and three, sentences three and four.
- Advantage: More context than single sentence.
- Disadvantage: Entries are larger, and there are more entries.
- Entire paragraph:
- Advantage: Full context of the content is present.
- Disadvantage: Long paragraphs could lead to excessive token usage in the LLM.
Any workflow that is structuring data must have discrete rules for breaking the data into meaningful chunks. For simplicity in this post, we will chunk all content into paragraphs.
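The sentence and overlapping-sentence strategies above can be sketched in a few lines of Python. This is a simplified illustration; production pipelines would use a more robust sentence splitter than a bare split on periods.

```python
def chunk_sentences(text):
    """Split text into single-sentence chunks."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def chunk_overlapping(text, window=2):
    """Slide a window over the sentences so adjacent chunks share context."""
    sentences = chunk_sentences(text)
    return [" ".join(sentences[i:i + window])
            for i in range(len(sentences) - window + 1)]

paragraph = ("Now we are engaged in a great civil war. "
             "We are met on a great battle-field of that war.")
print(chunk_sentences(paragraph))    # two single-sentence chunks
print(chunk_overlapping(paragraph))  # one chunk covering both sentences
```

The overlap variant trades storage for context: each sentence appears in up to `window` chunks, so the database grows, but a query matching the middle of a thought still retrieves its neighbors.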
Overview of the data extraction process
Now that we have an idea of the structure the data needs, we can begin the extraction process. This process will depend on:
Where is the data stored?
- Files that are stored on a drive or cloud storage.
- Invoices
- Receipts
- Contracts
- Whitepapers
- Scanned documents
- Webpages
- Text
- Images
What type of data is to be extracted?
- Metadata
- OCR
- Text parsing
What is the desired output?
- JSON
- CSV
- Vectors
For our document processing pipeline, we will scrape a webpage for images and PDFs. Each file will be downloaded locally, and the text extracted. Once the text is extracted, it must be broken into tokens and properly formatted and structured. Once the data has been structured, it can be ingested into a vector database, and the database re-indexed for the RAG.
A typical data ingestion pipeline will have multiple entry points, depending on where the files/data are to be accessed. Once the data has been accessed and converted, the rest of the pipeline can be run in a similar fashion.
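That shared shape can be sketched as a small driver function. The collect, extract, structure and ingest stage functions here are hypothetical placeholders for the concrete steps built in the rest of this post; only the entry point (collect) varies by data source.

```python
def run_pipeline(collect, extract, structure, ingest):
    """Run the shared pipeline stages over files from any entry point."""
    files = collect()                          # drive copy, API download or web scrape
    records = [extract(path) for path in files]  # unstructured file -> raw text
    chunks = structure(records)                # chunk, embed and enrich the text
    ingest(chunks)                             # insert into the vector database
```

Keeping the stages as swappable functions means a new data source only needs a new collect step; extraction, structuring and ingestion stay unchanged.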
Data collection
The location and form of the data to be extracted will play a large role in the collection process. For invoices or other financial documents, there may be a team share drive.
The following code copies all of the files that were added to /invoices/2025 from July 1, 2025. This is the first step in collecting the files for a data extraction pipeline.
find /invoices/2025 -type f -newermt "2025-07-01" -exec cp --parents {} /ingest \;
Other files might be downloaded by API calls. Yet other content may just exist on the web, and web scraping a predefined list of pages can be used to collect the unstructured data.
One common use case for web content is to extract text from images. In this example code, we will scrape a webpage for images and download them. The code for this can be found in a Jupyter Notebook on GitHub.
The following code uses Bright Data to access a popular meme site and find the URLs for the top 20 images on the site.
#make an API call to Bright Data to begin the collection of image URLs
import os
import requests
from dotenv import load_dotenv

load_dotenv()
BRIGHTDATA_API_KEY = os.getenv('brightdata')
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {BRIGHTDATA_API_KEY}"
}
data = [{"url": "https://www.memedroid.com/memes/top/day?page=1"}]
url = "https://api.brightdata.com/dca/trigger?collector=c_mdmeul7w21y46152n8&queue_next=1"
# Make the POST request
response = requests.post(url, headers=headers, json=data)
# Print the response status
print("Status code:", response.status_code)
#print("Response JSON:", response.json())
response = response.json()
collectionId = response['collection_id']
print(collectionId)
This kicks off a web scraping process. After a short period of time, the collection agent will have completed the data scrape, and a JSON file with the results will be available. In the following code, we request the JSON file, parse out the image URLs and then download the images locally.
#use the collectionId to get the JSON with URLs
url = f"https://api.brightdata.com/dca/dataset?id={collectionId}"
response = requests.get(url, headers=headers)
responseJson = response.json()

#download all the images from the scraping results
#directory for saving the files
destination_dir = "documents"
#meme list
memes = responseJson[0]['memes']
for meme in memes:
    url = meme['image_url']
    #get the image and save it locally
    if url is not None:
        filename = url.split("/")[-1]  # Get filename from URL
        destination_path = os.path.join(destination_dir, filename)
        # Download the file
        response = requests.get(url)
        # Save the file if the request was successful
        if response.status_code == 200:
            with open(destination_path, "wb") as f:
                f.write(response.content)
            print(f"Downloaded to: {destination_path}")
        else:
            print(f"Failed to download. Status code: {response.status_code}")
The images are now downloaded locally and are ready to have text extracted from them.
Data extraction
The next step in the document processing workflow is automated data extraction. Data extraction can take on many different forms, all depending on the needs of the organization. What is most critical is that the tools accurately extract data.
There are purpose-built tools that extract structured data from specific document types. For example, Mindee offers APIs designed for formats like invoices, receipts and identification cards, often returning pre-labeled fields.
When files are stored locally, there are several open source libraries that can extract text from files. When extracting text from PDF files, PyPDF2 or pdfplumber are commonly used. pdfplumber generally handles layouts better than PyPDF2, but in this example layout is less important, and we use PyPDF2:
import PyPDF2

def extract_text_from_pdf(file_path):
    #Extract text from a PDF file.
    pdf_file_obj = open(file_path, 'rb')
    pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
    num_pages = len(pdf_reader.pages)
    text = ''
    for page in range(num_pages):
        page_obj = pdf_reader.pages[page]
        text += page_obj.extract_text()
    pdf_file_obj.close()
    return text
Running ext = extract_text_from_pdf("/folder/herman-melville-moby-dick.pdf") extracts the full text of Moby Dick for data parsing.
For extracting text from images, this code uses Pillow to load the image and Pytesseract to extract the text:
from PIL import Image
import pytesseract

def extract_text_from_images(file_path):
    #Extract text from an image file.
    image = Image.open(file_path)
    text = pytesseract.image_to_string(image)
    return text

img = extract_text_from_images("/directorys/TR.jpg")
print(img)
Note: We could also have used PaddleOCR for the text extraction.
This prints the extracted data: “Speak softly and carry a big stick; you will go far.” Theodore Roosevelt
Structuring the data: Part 1
The code presented so far has downloaded the image files and extracted the text from them. While the extracted text alone can power the RAG tooling, additional metadata makes the data more useful.
The following code classifies documents into types, extracts the text from both file types (PDFs and images, using the functions shown above) and also identifies attributes of the file itself: the name, location, type and creation date.
import os
import datetime

def extract_text_from_files(directory_path):
    #Extract text from PDF files and images in a directory.
    #this can be extended and modified as more file types are added to the ingestion engine
    #will hold texts and associated metadata
    texts = []
    filetype = ""
    #counters are helpful in debugging
    img_count = 0
    pdf_count = 0
    #loop through all the files in the directory, determine the filetype and process accordingly
    for filename in os.listdir(directory_path):
        #create the full file path
        path = os.path.join(directory_path, filename)
        #file creation time, converted to a datetime object
        #(st_birthtime is available on macOS/BSD; on Linux, use st_mtime instead)
        stat = os.stat(path)
        birth_time = stat.st_birthtime
        created_at = datetime.datetime.fromtimestamp(birth_time)
        if filename.endswith(".pdf"):
            filetype = "pdf"
            txt = extract_text_from_pdf(path)
            pdf_count += 1
        elif filename.endswith(".jpg") or filename.endswith(".jpeg") or filename.endswith(".png"):
            filetype = "image"
            txt = extract_text_from_images(path)
            img_count += 1
        else:
            #skip unsupported file types
            continue
        #create an object with the text and the metadata, insert it into the texts array
        temp = {'type': filetype, 'filename': filename, 'created_at': created_at, 'text': txt}
        texts.append(temp)
    #all text parsed. Print the stats and return the data
    print(f"parsed {img_count} image(s) and {pdf_count} pdf(s).")
    return texts
When this command is run:
extracted_text = []
extracted_text = extract_text_from_files(directory_path)
The extracted_text array contains dictionaries holding the extracted data:
{'type': 'image', 'filename': 'TR.jpg', 'created_at': datetime.datetime(2025, 7, 21, 21, 24, 38, 382369), 'text': '"Speak softly and\ncarry a big stick;\n\nyou will go far."\n\nTheodore Roosevelt\n'}
Running this code on the directory of the 20 downloaded images extracts the text from each image.
Structuring the data: Part 2
With the data extracted, the text now needs to be broken into tokens for parsing and insertion into the vector database. For text pulled from image memes, the text length is generally very short, so the entire text should be included. For longer files, like PDFs, the text will be broken into paragraphs. For Moby Dick, there are ~14,000 paragraphs, leading to that many tokens being created.
At this point, the tokens can be embedded into a vector. This is the last step before inserting the data into the vector database. In this example, we will use the sentence-transformers Python library. It is a good general-purpose embedding library, but if a specific LLM is going to be used, consider using the embedding model recommended for that LLM.
The embedding code looks like:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embedding = embedder.encode(text)
In addition to creating the embedding, there are some other pieces of metadata that could be created and added to each token. In the example code, in addition to the embedding, we also create:
- A summary of the text.
- Sentiment analysis of the text (positive, negative, etc.).
- Entities: This code extracts names and places from the text.
Storing this additional metadata will improve the results from the RAG. The code to pull this metadata and create the embedding is in the enrich_chunk function in the Jupyter Notebook. (Code omitted from the post for space considerations.) Each of these additional steps will affect the operational efficiency of the pipeline, so testing each additional process is essential.
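To give a feel for the shape of such a function, here is a rough stand-in. This is not the notebook’s actual enrich_chunk: the summary, entity and sentiment heuristics below are deliberately naive toys, where a real pipeline would call a summarization model, an NER model and a sentiment model, and the word lists are invented for illustration.

```python
# Invented toy word lists, standing in for a real sentiment model
POSITIVE = {"good", "great", "far", "proper"}
NEGATIVE = {"bad", "fail", "war"}

def enrich_chunk_sketch(token, embed=None):
    """Toy enrichment: summary, naive entities/sentiment, optional embedding."""
    if not token.strip():
        return None  # nothing to enrich
    words = token.split()
    # Summary: first sentence, a crude stand-in for a summarization model
    summary = token.split(".")[0].strip()
    # Entities: capitalized words, a crude stand-in for an NER model
    entities = [w.strip('.,;"') for w in words if w[:1].isupper()]
    # Sentiment: word-list lookup, a crude stand-in for a sentiment model
    lowered = {w.lower().strip('.,;"') for w in words}
    score = len(lowered & POSITIVE) - len(lowered & NEGATIVE)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    # Embedding: delegate to a real embedder (e.g. SentenceTransformer) when provided
    embedding = embed(token) if embed else None
    return {"summary": summary, "entities": entities,
            "sentiment": sentiment, "embedding": embedding}
```

Returning None for empty input mirrors the notebook’s pattern of only adding an entry when enrichment succeeds.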
All of the document processing described in this section is compiled in the following cell of the Jupyter Notebook:
- Text is tokenized differently for PDFs and images.
- For each token, create the embedding and enriched data.
- Create an object with the token, all of the metadata and the embedding.
- Create a pandas dataframe with the data from each token.
import re
import pandas as pd

data = []
for text in extracted_text:
    # Split the text into individual tokens
    # this depends on the type of file: images - all text in one chunk
    # PDFs - split by newlines
    filetype = text['type']
    if filetype == "image":
        #full text
        tokens = [text['text']]
    else:
        #pdf - split on every newline
        tokens = re.split(r'[\n|\ue002]', text['text'])
    # Remove empty tokens
    tokens = [t for t in tokens if t.strip()]
    for token in tokens:
        #create the embedding and
        #get entities (named people),
        #get sentiment analysis,
        #set a summary
        enriched = enrich_chunk(token)
        if enriched is not None:
            #only add an entry if everything is successfully extracted
            summary = enriched['summary']
            entities = enriched['entities']
            sentiment = enriched['sentiment']
            embedding = enriched['embedding']
            # Create a new entry for each token
            #keep all the metadata from extraction
            entry = {
                'filetype': filetype,
                'filename': text['filename'],
                'created_at': text['created_at'],
                'tokens': token,
                'summary': summary,
                'entities': entities,
                'sentiment': sentiment,
                'embedding': embedding
            }
            data.append(entry)

# Create a data frame from the data list
df = pd.DataFrame(data)
Inserting the data into a vector database
There are many vector databases that can be used as part of a RAG architecture. For the sake of this demo application, we will use FAISS, an in-memory vector index.
Note: FAISS doesn’t support metadata natively, so we store metadata in parallel Python dicts. Production-grade vector databases like Weaviate or Qdrant support metadata filtering out of the box.
import numpy as np
import faiss

# Convert embeddings to a NumPy array
embedding_matrix = np.array(df['embedding'].tolist()).astype('float32')

#Build a FAISS index
dimension = embedding_matrix.shape[1]  # number of features in each vector
index = faiss.IndexFlatL2(dimension)
index.add(embedding_matrix)

#Metadata as an in-memory store
metadata = df.to_dict(orient='records')
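To close the loop, a query against this index embeds the user’s question and retrieves the nearest chunks plus their metadata. The function below is a pure-NumPy sketch of the exact L2 search that index.search performs, shown without FAISS so the mechanics are visible; the toy vectors are invented for illustration.

```python
import numpy as np

def l2_search(embedding_matrix, query_vector, k=3):
    """Exact L2 nearest-neighbor search, as faiss.IndexFlatL2 does internally."""
    # Squared L2 distance from the query to every stored embedding
    distances = np.sum((embedding_matrix - query_vector) ** 2, axis=1)
    # Indices of the k closest entries, nearest first
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy example: three stored embeddings, with the query closest to the second
vectors = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.9]], dtype="float32")
query = np.array([1.0, 0.1], dtype="float32")
indices, dists = l2_search(vectors, query, k=2)
```

The returned indices line up with the parallel metadata list, so the matching records would be looked up as [metadata[i] for i in indices] before being passed to the LLM.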
And there we have it. The Jupyter Notebook with the code is a working pipeline to download images, extract text from each image, tokenize and structure the text and finally add it to a RAG database. We also created an in-memory metadata store that includes pertinent metadata for each entry to the vector database.
Creating production-ready AI data pipelines
AI Agents must be constantly fed a stream of up-to-date information in order to remain relevant. In this post, we discussed techniques and tools that are commonly used to create workflows that automatically pull, parse and ingest content.
If you and your team are interested in trying out content ingestion workflows, try the Jupyter Notebook to see how easy it is to create document extraction pipelines for AI Agents.