As machine learning engineers solve increasingly complex problems, the AI models that they rely on are themselves becoming more complex. Many AI models perform better (and behave in a more humanlike way) when they work with multiple data types (such as text, images, video and audio) that represent a particular concept.
Just as humans can solve problems better when they use information from all their different senses, AI systems benefit from using multimodal data to solve problems more effectively. This multimodal data is typically used to build or fine-tune AI models.
Today, multimodal AI is widely used in industries such as e-commerce, healthcare, gaming, education and autonomous vehicles. Wherever AI is being used to tackle complex problems, multimodal AI is helping provide better results.
What is multimodal AI?
Multimodal AI refers to AI models or systems that can process and reason about multiple types of data simultaneously (such as text, images, video, or audio). These systems take multimodal data (data of different modalities), fuse it together and reason across all of it as a whole, which improves their understanding of complex problems.
For example, a multimodal AI model designed to act as a tennis coach could take different types of data about a tennis match as inputs, including a video of the match, an image of a player’s serve posture and audio of the coach’s verbal feedback. By processing this data together, the system is better able to answer questions like “How does the player’s form compare to that of professional players?”

Caption: An example of a multimodal AI system taking multimodal data as its inputs
Semantically aligned vs. semantically diverse multimodal data
Multimodal data is often semantically aligned — that is, the different data types all reference the same concept (like tennis, in the example above). However, semantically diverse pieces of data can also count as multimodal data: the definition only requires data of different modalities that will be used by a multimodal AI system. This usually happens when a multimodal AI is specifically tasked with tying together disparate concepts — for example, in a creative, open-ended generation task.

Caption: Diagram comparing semantically similar (all about Hendrix at Woodstock) vs. diverse multimodal data (Hendrix text, Santa image, cat video).
Many multimodal AI models process semantically similar multimodal data — for example, an image paired with a caption or a video paired with a transcript. However, a creative task might require semantically diverse multimodal data — for example, a model tasked with inventing a joke that ties together several disparate concepts:
Caption: ChatGPT combining semantically diverse multimodal data
How multimodal AI models are trained
Multimodal AI models are typically trained on large multimodal datasets — for example, images paired with captions, or video paired with transcripts. That way, the models can learn how patterns in one modality map to patterns in another. These models typically require large amounts of well-aligned, good-quality data in order to work effectively.
The types (and combinations) of multimodal data that will be needed depend on the type of task that the AI model is trying to solve: A model that needs to answer a question about an image will need to be trained on images paired with text about those images, while more creative or open-ended generation tasks can involve more loosely related or varied types of data.
Many of the biggest AI systems have already become multimodal. For example, OpenAI’s GPT-4 with vision is now the default model within ChatGPT, and Google Gemini is also multimodal. You can now send images to these models and ask questions like “What does the error in this screenshot mean?” or “Can you explain this chart or graph?” Multimodal AI doesn’t always have to mean text, image, video and audio — it can include many other data types. For example, autonomous vehicle systems might use video, LIDAR and GPS data.
While the availability of open multimodal datasets is improving, finding the exact kind of task-specific data that you need (especially at scale) is still a big challenge when developing multimodal AI systems.
Why do we need multimodal AI?
Multimodal AI helps us solve more complex, real-world problems that can’t be solved using just one single type of data. Before multimodal AI models existed, single-modality models were stitched together into a multimodal AI system using complex pipelines (for example, one model for image recognition and another for text classification). This did allow more complex problems to be tackled, but the resulting systems were brittle, breaking when inputs changed even slightly, so they required a lot of manual intervention to keep running.
Now that transformers have become the dominant AI architecture, it’s much easier to design models that accept many types of data as inputs and process them all in a unified way, as any data type can be transformed (via appropriate encoders) into the embeddings or tokens that all these models accept.
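To picture this unified processing, here is a toy sketch (not a real model — all functions and values are illustrative) of how two modalities can be reduced to the same currency: fixed-size vectors. Text becomes one vector per word, an image becomes one vector per patch, and the combined sequence is what a single transformer would attend over.

```python
# Toy sketch: convert different modalities into sequences of same-size
# vectors ("tokens") that one model could process together.
# The embedding schemes here are placeholders, not real encoders.

def embed_text(words, dim=4):
    """Map each word to a deterministic pseudo-embedding vector."""
    return [[(hash(w) >> shift) % 10 / 10 for shift in range(dim)] for w in words]

def embed_image_patches(image, patch=2, dim=4):
    """Split a tiny grayscale 'image' (a 2D list) into patches and
    flatten each patch into a vector of the shared dimension."""
    tokens = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            vec = [image[r + dr][c + dc]
                   for dr in range(patch) for dc in range(patch)]
            tokens.append((vec + [0.0] * dim)[:dim])
    return tokens

text_tokens = embed_text(["a", "dog", "chasing", "a", "ball"])
image = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.7, 0.8],
         [0.9, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7]]
image_tokens = embed_image_patches(image)

# One unified sequence: the model sees both modalities the same way.
sequence = text_tokens + image_tokens
print(len(sequence))  # 5 word tokens + 4 image patches = 9
```

Real systems use learned encoders (for example, a vision transformer's patch embedding) rather than these toy functions, but the principle is the same: once everything is a token sequence, one architecture can handle it all.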
There’s growing demand for AI systems that can solve problems in a more humanlike way: by integrating information received from multiple sensory inputs. Training (or fine-tuning) these systems requires huge amounts of high-quality, appropriate multimodal data, often from a large variety of data sources.
This is because multimodal models must not only learn how to understand each individual modality (for example, recognizing objects in images or parsing natural language), but also how the different modalities relate to each other. For example, a model trained on image-caption pairs must learn that the sentence “a dog chasing a ball” corresponds to specific visual patterns in an image. This is known as cross-modal alignment, and if you don’t use high-quality, diverse data that represents a variety of real-world situations, your model will be less robust.
For example, not including enough images or videos in different lighting levels, or enough audio files in a variety of accents will cause your model to underperform once it’s let loose on the real world and forced to generalize. The consequences of this range from customers losing confidence in your product, to allegations of bias, to real-world harm in high-stakes applications like medical AI (where clinical images are paired with annotations) or self-driving vehicles (which use video and LIDAR).
When should you use multimodal AI?
Multimodal AI is best suited to complex problems where your model needs to understand different types of data and reason about them together. Some typical use cases are:

Caption: A multimodal AI model with two inputs: an image and some text with a question about the image
- Your model needs access to multiple modes of data to get enough context: When an AI model needs to detect emotion or a person’s intent, written language alone is often not sufficient. For example, to understand whether a speaker is being sarcastic, it’s necessary to hear their tone of voice rather than just read a transcript.
- You need more human-like results: Humans solve problems using multiple types of sensory information together, and when an AI system does the same, it tends to produce more human-like results. For example, you could train a model to assist users of an e-commerce system. If a customer uploads a photo of a red jacket and types “Show me something like this but more waterproof and in green,” the AI system will need to be multimodal.
- You need to solve real-world, sensory-rich problems: Some examples of common real-world industries with these issues are healthcare, robotics and autonomous vehicles. In healthcare, multimodal AI is being used to help decide the best course of treatment for patients. An example of this is the ArteraAI Prostate Cancer Test, a test developed for those who have already received a cancer diagnosis. It combines biopsy images with the patient’s health records to work out the most effective personalized treatment plan for the patient.
When not to use multimodal AI
If your problem can be solved using just one type of data, there’s no need to add more complexity and cost by adding multimodal data. In such cases, it’s best to keep things simple so your model is easier to maintain, cheaper to train and faster to deploy.
Where to find multimodal data
ML engineers need large amounts of high-quality multimodal data to train their models, so it’s important to find good, reliable sources. Below we share some of the best places to get multimodal data on the web.
Public multimodal dataset repositories and indexes
Hugging Face: Hugging Face is the largest public repository of AI datasets, including some of the largest multimodal datasets, such as LAION-400M, CC12M and WIT — all of which pair images with text. These datasets have been used to train massive foundational models like CLIP and Stable Diffusion. To filter for multimodal datasets only, click Main and then select the Modalities you’re interested in.
Papers with Code: Papers with Code has an index of datasets that can be filtered by modality or task, including large multimodal datasets like HowTo100M, a massive video, audio and text dataset built from YouTube videos, and VQA v2.0 (Visual Question Answering), an image-and-text multimodal dataset.
Google Dataset Search: Google Dataset Search doesn’t host any datasets itself; it’s a general-purpose dataset search engine that links out to places like Hugging Face, Papers with Code, or any other dataset repositories or indexes.
Kaggle: This platform hosts machine learning competitions where users upload a dataset and announce a problem that needs solving. Participants then compete to solve the problem. Kaggle hosts many thousands of datasets that can be used for basic training or fine-tuning; however, they’re typically not large enough to train foundational models.
Web archives
Another source of large amounts of data is anywhere that has archived large numbers of web pages. The data is less structured than datasets specifically hosted for AI use, but you can use web scraping or AI-ready data extraction tools to retrieve the data in a useful format.
- Wikipedia dumps: Wikipedia backs up its data and makes it publicly available under a Creative Commons license. You can download a Wikipedia dump of the whole English-language site and use it for free. It’s text-only data in XML, so it’s not fully multimodal, but it contains links to images hosted on Wikimedia Commons, so it’s fairly easy to combine the two via the Wikimedia Commons API to create a large text-and-image multimodal dataset.
- Common Crawl: Common Crawl is an open repository of web crawl data that’s free to use as long as you follow its terms of use. However, the data is raw HTML stored in WARC archive files, so you’ll need to parse it and convert it into an easier format like JSON or CSV.
- Bright Data Web Archive: This enterprise-grade product continuously crawls the web and converts data into useful formats like JSON. It adds over 2.5 PB of fresh image, text and audio data daily, and the archive keeps growing, making it possible to find valuable data you didn’t even know existed. The platform also offers annotation services and an API that makes it easy to access this data in a ready-to-use format for training or fine-tuning your AI models.
- Internet Archive: This hosts the Wayback Machine, an archive of historical web pages going all the way back to 1996. It also offers a variety of free datasets, but multimodal data is a little harder to find there.
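To show how the Wikipedia route works, here is a minimal sketch that parses a tiny inline sample shaped like a dump and pairs each page’s text with the Wikimedia Commons image files it links to. Real dumps use the full MediaWiki export schema (with XML namespaces and many more fields), and the image bytes would be fetched separately from Commons; the sample content below is invented.

```python
import re
import xml.etree.ElementTree as ET

# Simplified stand-in for a Wikipedia XML dump (real dumps use the
# MediaWiki export schema with namespaces and far more fields).
sample_dump = """
<mediawiki>
  <page>
    <title>Jimi Hendrix</title>
    <revision>
      <text>Hendrix at Woodstock. [[File:Jimi_Hendrix_1967.jpg|thumb]]</text>
    </revision>
  </page>
</mediawiki>
"""

root = ET.fromstring(sample_dump)
pairs = []
for page in root.iter("page"):
    title = page.findtext("title")
    wikitext = page.findtext("revision/text") or ""
    # [[File:...]] links point at media hosted on Wikimedia Commons;
    # the actual image files would be downloaded via the Commons API.
    for image in re.findall(r"\[\[File:([^|\]]+)", wikitext):
        pairs.append({"title": title, "image": image, "text": wikitext})

print(pairs[0]["image"])  # Jimi_Hendrix_1967.jpg
```

Scaled up over a full dump, this text-plus-image pairing is exactly the kind of aligned multimodal dataset described earlier.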
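For Common Crawl-style data, the parse-and-convert step might look like the sketch below: raw HTML (as you would find inside a WARC record, here replaced by an inline sample) is reduced to a clean JSON record using only the standard library. Real pipelines typically read the WARC files with a dedicated library such as `warcio`.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the <title> and visible body text, skipping scripts/styles."""
    def __init__(self):
        super().__init__()
        self.title_parts, self.text_parts = [], []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style contents
        (self.title_parts if self._in_title else self.text_parts).append(data.strip())

# Inline sample standing in for HTML pulled from a WARC record.
html = ("<html><head><title>Red Jacket</title><script>x=1</script></head>"
        "<body><p>Waterproof shell.</p></body></html>")
parser = TextExtractor()
parser.feed(html)
record = json.dumps({
    "title": "".join(parser.title_parts),
    "text": " ".join(t for t in parser.text_parts if t),
})
print(record)
```

Each page becomes one JSON line, ready to be joined with downloaded media to form a multimodal training record.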
Limitations of using open data sources for multimodal data
Although some of the larger open datasets have been used to great success when training foundational AI models, most business use cases require specialized data that’s unlikely to be publicly available.
For example, multimodal AI can help you conduct market research on your competitors’ pricing strategies, and they’re not likely to publish this data as a public dataset. But if you can get hold of this data, you can analyze product images alongside product descriptions, which may reveal hidden information (like a 20%-off banner embedded in an image, or the fact that two products with similar descriptions are actually different products).
You might want to use multimodal AI to protect your brand’s reputation by analyzing product reviews and discussion forums. While all of this content is technically publicly available, the information can be scattered across different websites in different formats.
A more flexible solution: Scraping multimodal data from the web
Sometimes, scraping the web yourself is the only way to get the specialized datasets you need. There are tools that can help you scrape data from a website, but the basic steps involved are:
- Inspect and analyze the website or webpage: Understand the structure of the HTML and where the data that you’re interested in is located within this.
- Fetch HTML: Use code to fetch the HTML data of a page.
- Extract the desired text data: Parse the HTML and pull out only the data you’re interested in.
- Download multimedia: Extract image, video and audio data URLs from the HTML, and then download the data from the URLs.
There are many libraries that can help with this. You’ll need a tool that can parse HTML, such as BeautifulSoup for Python, Cheerio for JavaScript, or XPath, which works with many languages and can be used for extracting complex nested data. If you’re crawling static websites, these tools should be enough. But dynamic sites that load content in response to JavaScript actions (including any modern site that uses infinite scroll to load posts) won’t expose all their data in the initially fetched HTML. Instead, you need a browser automation tool like Playwright, Puppeteer or Selenium, which renders the page in a real browser before you extract from it.
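For static pages, the four steps above can be sketched with Python’s standard library alone. The page URL, HTML and tag structure below are hypothetical, and the fetch step is shown in a comment so the sketch runs offline; swapping in a live `urllib.request` fetch (or BeautifulSoup for parsing) is straightforward.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Step 2 (fetch) would be something like:
#   html = urllib.request.urlopen(BASE_URL).read().decode()
# An inline sample stands in for the fetched page here.
BASE_URL = "https://example.com/products/jacket"  # hypothetical page
html = """
<article>
  <h1>Red Rain Jacket</h1>
  <p>A waterproof shell for wet commutes.</p>
  <img src="/images/jacket-front.jpg">
  <video src="/media/jacket-demo.mp4"></video>
</article>
"""

class MediaScraper(HTMLParser):
    """Steps 3 and 4: collect text plus absolute URLs of media files."""
    def __init__(self, base):
        super().__init__()
        self.base, self.text, self.media = base, [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "video", "audio", "source"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.media.append(urljoin(self.base, src))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

scraper = MediaScraper(BASE_URL)
scraper.feed(html)
# The media URLs would then be downloaded (e.g. urllib.request.urlretrieve)
# so each file can be paired with the extracted text.
print(scraper.text)
print(scraper.media)
```

The output pairs the page’s text with its image and video URLs, which is the raw material for a text-plus-media multimodal record.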
Challenges when web scraping for multimodal AI
Web scraping can present a variety of technical, legal and ethical challenges:
- Copyrighted content: Public web data may include content protected by copyright. Responsible data collection practices should include safeguards to prevent unintended reproduction or leaks of copyrighted material.
- Rate limits: Websites enforce rate limits to ensure fair usage, maintain stable service for all users and keep malicious traffic away from their servers. Scrapers that exceed these limits risk being throttled or blocked.
- Large file sizes: Scraping multimodal data can be slow to download and expensive to store — especially high-resolution images or videos.
- Regional access: Some content is localized. To build globally representative datasets, teams may need to account for regional variations in content availability.
Best practices when web scraping multimodal data
To conduct responsible web scraping, make sure you follow these rules:
- Respect robots.txt: Use tools that provide configuration options to tailor automation activities to your compliance and risk standards — for example, by enabling robots.txt compliance or disabling CAPTCHA handling as needed. Responsible users may choose to incorporate these signals into their scraping workflows.
- Rate limiting and delays: Implement your own request throttling and randomized delays. This prevents degradation of source websites and helps maintain sustainable access.
- Use APIs over web scraping: Official APIs provide structured data with built-in rate limits, so it’s better to use these if they’re offered.
- IP rotation: Rotating IP addresses can help distribute traffic and reduce the likelihood of triggering automated access restrictions.
- Use proxies: Proxies can help manage access to content across regions and reduce the risk of IP-based blocks when collecting data at scale.
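The robots.txt and throttling practices above can be combined in a short sketch using Python’s standard library. The robots.txt content is inlined here so the example runs offline (normally you’d point `RobotFileParser` at `https://<site>/robots.txt` with `set_url()` and `read()`), and the URLs and user-agent name are hypothetical.

```python
import random
import time
import urllib.robotparser

# Inline robots.txt sample; in practice you'd fetch the real file first.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Filter the crawl queue down to URLs robots.txt permits.
allowed = [
    url for url in (
        "https://example.com/products/jacket",
        "https://example.com/private/admin",
    )
    if rp.can_fetch("multimodal-dataset-bot", url)
]

# Honor the site's Crawl-delay, adding jitter so requests aren't uniform.
delay = rp.crawl_delay("multimodal-dataset-bot") or 1
for url in allowed:
    # fetch(url) would go here
    time.sleep(delay + random.uniform(0, 0.5))

print(allowed)  # only the non-/private/ URL survives
```

This keeps the scraper within the site’s published rules and spreads its requests out, which is the core of responsible collection.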
Using a web scraping platform for multimodal AI will save you time and money
Public datasets can be a great starting point when looking for data to power your multimodal AI models, but these datasets are typically not specialized enough for enterprise and business purposes. To get the data you really need, you’ll need a custom solution, most likely involving web scraping.
Rather than wasting development time on building and maintaining your own web scrapers and handling the infrastructure to manage this at scale, it’s simpler to use a web scraping platform that does the heavy lifting for you, adapting to site structure changes, extracting text and media at scale and managing the infrastructure while supporting responsible data collection practices.
Many of these tools offer flexible integration options, such as developer-friendly APIs or visual interfaces, and use advanced techniques like IP rotation and rate limiting to enable reliable, scalable data collection from a wide range of sources.