Why financial AI needs better data
AI agents are the way of the future. They ‘think’ faster than us. They’re more precise than us. They can read hundreds of lines of data in the time it takes us to read one. In this guide, we’ll go over how to extract pricing and news data from financial markets and even build a small pipeline using Google Finance.
Without up-to-date information, even the most powerful AI models are doomed to fail. When a model buys assets without knowing their prices, it’s gambling at best. It’s often much worse — like driving with virtual reality (VR) goggles on. Our data pipeline will feed news headlines and ticker information into CSV files you can integrate directly into your AI/ML environment.
Financial data types relevant to AI
We need to understand the key differences between our two main data types — structured and unstructured. Structured data is easier for models to parse. Ticker objects are structured. News articles are unstructured. We need to give them structure that the AI model can understand.
Structured
Take a look at the “most active” section from Google Finance. Each asset has the following: price, change in dollar value, and change in percent.

This data already fits neatly into a table that both humans and AI models can review easily. Laid out as a CSV, our data appears as follows, mirroring the structure on the page.
| Asset | Price | Change ($) | Change (%) |
|---|---|---|---|
| OPEN | $4.66 | +$0.12 | 2.53% |
| NIO | $6.69 | +$0.59 | 9.77% |
| NVDA | $182.27 | +$2.44 | 1.35% |
| LCID | $2.11 | +$0.027 | 1.30% |
Unstructured
Unstructured data is a bit more difficult. Take a look at the article below from Investing.com. There’s really no structure for an AI model to follow. It’s all free text nested inside an HTML document. HTML is meant for rendering, not for interpretation and inference.

To prepare our data for AI, we need to give it structure. Fortunately, Google Finance has already done some of that work for us in its news feed. Take a look at the news section below.

As you can see, Google has converted a list of articles into a structured feed. Each article has the following properties: a title, a URL (embedded in the link we click), and a source. This table is something we can work with.
| Title | URL | Source |
|---|---|---|
| Energy Fuels stock soars after rare earth supply deal with Vulcan Elements | https://www.investing.com/news/stock-market-news/energy-fuels-stock-soars-after-rare-earth-supply-deal-with-vulcan-elements-93CH-4211151 | Investing.com |
The title gives an instant feel for market sentiment. The source can be checked against a list of trusted sources. AI models can even give some sources better weight than others. With the URL, you can fetch the article entirely so your AI agent can read it.
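As a sketch of that last step, here is one way to reduce a fetched article to readable text for an AI agent. The HTML below is a stand-in (in practice you would pass `requests.get(url).text`), and the assumption that body copy lives in p tags holds for most, but not all, news sites:

```python
from bs4 import BeautifulSoup

def article_to_text(html: str) -> str:
    """Strip an HTML article down to readable paragraph text."""
    soup = BeautifulSoup(html, "html.parser")
    # drop script/style noise before extracting text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # most news sites keep body copy in <p> tags -- an assumption
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)

# stand-in HTML; real article pages vary widely
sample = (
    "<html><body><script>var x=1;</script>"
    "<p>Energy Fuels stock soars.</p>"
    "<p>Shares rose after the deal.</p></body></html>"
)
print(article_to_text(sample))
```

The plain text that comes back is what you would actually hand to the model — the surrounding HTML is rendering noise, not signal.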
Extracting ticker and news data
Now that we understand these structured and unstructured objects conceptually, let’s actually write code for extracting them. In each block, we create an array to hold our scraped objects. Once our logic has finished executing, we can then pass that array into a function that writes it to a Comma-Separated Value (CSV) file.
Structured data
We start by finding the table on the page. Google Finance actually hides this within a ul (unordered list) object instead of the standard table element. Then, we find all the li elements — these are all of the actual table rows. We then extract text elements from each row to create our asset object.
```python
scraped_data = []
# find the pricing table -- Google Finance renders it as a ul, not a table
table = soup.find("ul")
# get the list elements (the actual table rows)
list_elements = table.find_all("li")
# iterate through the list elements and extract their text
for list_element in list_elements:
    divs = list_element.find_all("div")
    asset = {
        "ticker": divs[3].text,
        "name": divs[6].text,
        "price": divs[8].text,
        "change": divs[11].text
    }
    scraped_data.append(asset)
```
Unstructured data
Now, we’ll pull our unstructured data, the news articles. Since Google Finance already gave them structure, we can reuse that structure as a basic skeleton to work from.
We begin with a list to hold our scraped data. Then, we find all div elements with the attribute data-article-source-name. This selector is unique to each article in the table and also gives us the article’s source. Next, we remove all the span elements from the article object. This strips out ticker data that Google Finance sometimes embeds within the table — we don’t need it, since we already scraped our tickers. Google also adds a loose timestamp to each article, such as “2 hours ago”. By splitting at the word “ago”, we can separate this timestamp from the actual article title. Once the spans and timestamp are removed, we pull the title, source and URL of each article.
```python
articles = []
# find the articles on the page
article_boxes = soup.select("div[data-article-source-name]")
# iterate through the articles
for article in article_boxes:
    # remove the span elements -- tickers are in the pricing report
    spans = article.find_all("span")
    for span in spans:
        span.decompose()
    # split the title from its timestamp -- split only once so a
    # title that happens to contain "ago" isn't mangled
    title = article.text.split("ago", 1)[-1].strip()
    # format our field items
    source = article.get("data-article-source-name")
    link_element = article.find("a")
    url = link_element.get("href")
    # save them in a dict object
    article_object = {
        "title": title,
        "source": source,
        "url": url,
    }
    articles.append(article_object)
```
How it all fits together
In the full code below, we also add a write_to_csv() function to write our extracted data to a CSV file. Before running it, make sure you’ve got Requests and BeautifulSoup installed.
Install Requests:

```
pip install requests
```

Install BeautifulSoup:

```
pip install beautifulsoup4
```
```python
import requests
from bs4 import BeautifulSoup
import csv
from pathlib import Path


def write_to_csv(data, filename):
    # normalize a single dict into a list of rows
    if not isinstance(data, list):
        data = [data]
    print("Writing to CSV...")
    filename = f"google-finance-{filename}.csv"
    # append if the file already exists, otherwise write fresh
    mode = "a" if Path(filename).exists() else "w"
    with open(filename, mode, newline="") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        if mode == "w":
            writer.writeheader()
        writer.writerows(data)
    print(f"Successfully wrote {filename}...")


def scrape_page():
    response = requests.get("https://google.com/finance/markets/most-active")
    soup = BeautifulSoup(response.text, "html.parser")

    scraped_data = []
    # find the pricing table
    table = soup.find("ul")
    # get the list elements
    list_elements = table.find_all("li")
    # iterate through the list elements and extract their text
    for list_element in list_elements:
        divs = list_element.find_all("div")
        asset = {
            "ticker": divs[3].text,
            "name": divs[6].text,
            "price": divs[8].text,
            "change": divs[11].text
        }
        scraped_data.append(asset)
    # save the pricing data
    write_to_csv(scraped_data, "most-active")

    articles = []
    # find the articles on the page
    article_boxes = soup.select("div[data-article-source-name]")
    # iterate through the articles
    for article in article_boxes:
        # remove the span elements -- tickers are in the pricing report
        spans = article.find_all("span")
        for span in spans:
            span.decompose()
        # split the title from its timestamp, once, in case "ago"
        # appears inside the title itself
        title = article.text.split("ago", 1)[-1].strip()
        # format our field items
        source = article.get("data-article-source-name")
        link_element = article.find("a")
        url = link_element.get("href")
        # save them in a dict object
        article_object = {
            "title": title,
            "source": source,
            "url": url,
        }
        articles.append(article_object)
    # save the headlines
    write_to_csv(articles, "top-headlines")


if __name__ == "__main__":
    scrape_page()
```
Here are the CSV files containing our extracted data.


Our articles and tickers are now sitting inside a nice, tabular data file. This is easy to work with and easy for AI models to review. From here, if you want to enrich the data, simply add a column. In your new columns, you can add metadata, sentiment scores — anything that might help your model notice the patterns you want highlighted.
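As a minimal sketch of that enrichment step, here is one way to append a column to CSV data you already have. The `add_column` helper and the numeric sentiment scores are illustrative — in practice the values would come from your model or annotation pipeline:

```python
import csv
import io

def add_column(csv_text: str, column: str, values: list) -> str:
    """Append a new column to existing CSV text (a minimal sketch)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row, value in zip(rows, values):
        row[column] = value
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

original = "ticker,price\nOPEN,4.61\nNIO,6.69\n"
# placeholder sentiment scores -- a real pipeline would compute these
enriched = add_column(original, "sentiment", [0.4, 0.7])
print(enriched)
```

The same pattern works for any metadata you want to bolt on: one new field name, one value per row.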
Data acquisition methods
So far, we’ve seen how to scrape this data using a hands-on approach. Web scraping isn’t the only way to feed this data into your AI pipeline. In fact, there are a variety of APIs, data providers and crawlers you can use.
APIs
With an Application Programming Interface (API), you can skip the scraping phase entirely. This can lead to faster development, but there are some tradeoffs you need to be aware of.
Pros
- No scraping required
- Data is semi-structured, usually JSON
Cons
- Restricted to whatever the provider offers
- Data often needs to be restructured
- Can be expensive
- Often require setting up an account just to try the product
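That restructuring point is worth a quick sketch. The payload shape below is hypothetical — every provider names its fields differently — but the flattening step looks much the same regardless:

```python
# a hypothetical JSON payload -- real providers use their own field names
api_response = {
    "quotes": [
        {"symbol": "OPEN", "last": 4.66, "chg_pct": 2.53},
        {"symbol": "NVDA", "last": 182.27, "chg_pct": 1.35},
    ]
}

# flatten the provider's shape into the row format our CSV writer expects
rows = [
    {"ticker": q["symbol"], "price": q["last"], "change_pct": q["chg_pct"]}
    for q in api_response["quotes"]
]
print(rows)
```

Once the rows match your own schema, they can go through the same write_to_csv() function as the scraped data.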
Data archives
Data archives can provide you with a wealth of historical data. Purchase your data, enrich it and then drop it into your AI pipeline. These are often the most complete datasets but they’re not as up-to-date as other options.
Pros
- No scraping required
- Data comes fully structured and ready for your environment
- Easily enriched
Cons
- Not as up-to-date as other sources
- Often more expensive than other options
On-demand crawlers
Many data providers also offer prebuilt crawlers that you can run on demand. These are often the same crawlers used to generate their data archives. If you’re looking for an easy solution with up-to-date data, this is the option for you.
Pros
- Save time on development
- Extract data on demand
- Easily enriched
Cons
- Data still needs to be cleaned
- Expensive to run
Providers
- Yahoo! Finance API: Access comprehensive Yahoo! Finance market data through their API.
- Bright Data: Use collection APIs, prebuilt datasets and on-demand scrapers whenever you want.
- Alpha Vantage: Gain access to dashboards and APIs serving financial market data.
Data cleaning and enrichment
We’ve identified our source, but we still haven’t finished curating our data. Our structured data needs to be cleaned so our model has an easier time using it. Our unstructured data needs to be enriched so AI models can better understand it.
Cleaning structured data
We’ve already got data flowing to a CSV file. With a little cleaning, we can make it even friendlier for an AI model to read. Take a look at the top row from our CSV file.
| ticker | name | price | change |
|---|---|---|---|
| OPEN | Opendoor Technologies Inc | $4.61 | +$0.066 |
If we want to perform calculations, we need to do some more cleaning. Notice that our price and change columns are strings. For proper math, they need to be converted to numbers. Properly formatted, the row looks more like the example below.
| ticker | name | price | change |
|---|---|---|---|
| OPEN | Opendoor Technologies Inc | 4.61 | 0.066 |
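A small helper like the one below handles that conversion. The function name is our own; it simply strips the currency symbols and separators Google Finance uses before parsing the float:

```python
def to_number(value: str) -> float:
    """Convert strings like '$4.61' or '+$0.066' to floats."""
    # strip the dollar sign, thousands separators, and leading plus;
    # a leading minus survives, so negative changes parse correctly
    cleaned = value.replace("$", "").replace(",", "").replace("+", "")
    return float(cleaned)

print(to_number("$4.61"))      # 4.61
print(to_number("+$0.066"))    # 0.066
print(to_number("-$1,234.50")) # -1234.5
```

Run each scraped price and change field through this before writing the CSV, and the numeric columns are ready for arithmetic.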
Enriching unstructured data
Now, let’s look at our unstructured data. To improve this data, we don’t really need to clean it. We need to classify it.
| title | source | url |
|---|---|---|
| Tesla (TSLA) Stock: Goldman Sachs Reiterates Neutral on Robotaxi Launch | Yahoo Finance | https://finance.yahoo.com/news/tesla-tsla-stock-goldman-sachs-160720156.html |
By adding company and sentiment columns, our AI agent can make better inferences.
| title | source | url | company | sentiment |
|---|---|---|---|---|
| Tesla (TSLA) Stock: Goldman Sachs Reiterates Neutral on Robotaxi Launch | Yahoo Finance | https://finance.yahoo.com/news/tesla-tsla-stock-goldman-sachs-160720156.html | TSLA | Neutral/Negative |
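To show where that classification step fits, here is a deliberately naive keyword-based labeler. The word lists are placeholders — a real pipeline would use an NLP sentiment model — but the interface (headline in, label out) is the same:

```python
# toy keyword lists -- a real pipeline would use a sentiment model
POSITIVE = {"soars", "beats", "surges", "upgrades"}
NEGATIVE = {"falls", "misses", "plunges", "downgrades"}

def label_sentiment(title: str) -> str:
    """Label a headline by matching words against the toy lists."""
    words = {w.strip(".,:()").lower() for w in title.split()}
    if words & POSITIVE:
        return "Positive"
    if words & NEGATIVE:
        return "Negative"
    return "Neutral"

print(label_sentiment("Energy Fuels stock soars after rare earth supply deal"))
```

Whatever produces the labels, the output slots straight into the sentiment column above.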
Conclusion
AI-powered finance isn’t magic. Like almost all web development, it’s a matter of plumbing. First, you need to identify your data source. Then, you build a pipeline to tap that source and transport the data to your AI environment. Once the pipeline exists, you just need to worry about enhancements like cleaning structured data and enriching unstructured data. The concept is almost identical to creating a drinking water system.
After enrichment, you can take the next step. Perhaps you’re training an LLM, building a RAG pipeline or even creating a prediction model. Regardless of your use case, the foundation is the same — you need a fresh and reliable data pipeline. Once it’s there, your financial AI system has the tools it needs for the fast, intelligent decisions you’re looking for.