Knowledge graphs store interlinked information, representing data points as entities (nodes or vertices) and capturing the relationships (edges) between them, as shown below:
Knowledge graphs organize complex information and provide richer semantic understanding to AI systems across diverse use cases, including question-answering (QA), recommendations and semantic search.
However, a knowledge graph is only as powerful as the data it contains. While proprietary datasets and large language model (LLM) training data offer a foundation, they often lack freshness and breadth. Web data can fill this gap, enriching knowledge graphs with real-time information that improves context, accuracy and explainability. Integrating web information into knowledge graphs makes them a factual repository that can complement retrieval augmented generation (RAG) and LLM applications.
In this guide, we discuss:
- Tools and techniques for pulling web data into knowledge graphs
- A practical demonstration of the web-to-graph pipeline
- How to use Diffbot to shorten time-to-graph
- Best practices for maintaining a knowledge graph
Data teams can use this guide to construct web-powered graphs and derive insights that proprietary or LLM training data alone might not provide.
How to integrate web data into a knowledge graph
Constructing a web-powered knowledge graph is a multi-step process. It involves retrieving the data, processing it with natural language processing (NLP) techniques and modeling the graph with the acquired knowledge. To illustrate the workflow, we will design a company info knowledge graph that models the who and where of different companies.
Below are the key steps involved in bringing a web-powered knowledge graph to life, including the different tools and strategies you can leverage.
Step 1: Define the use case and schema
Before building, you need to identify the graph’s purpose and schema (classification of entities and relationships) to guide your data selection process. A clear understanding of the problem the knowledge graph will solve or the questions it will answer helps you prioritize relevant data. For this sample graph, the primary use case is a question-answering RAG system. Users ask questions about a company, and an LLM provides contextual answers using the graph data.
Knowledge graphs also rely on schemas for data modeling, consistency and relevance as they define what data points to store and how to structure them. Our demo graph will support these core entities and relationships:
| Entity type | Attributes or properties | Relationships |
| --- | --- | --- |
| Person | Name, role | Person → OWNS → Company |
| Company or organization | Name | Company → LOCATED_IN → Location |
| Location | Name | Location ← LOCATED_IN ← Company |
As your data needs increase, you can introduce ontologies (richer hierarchies) and expand the graph to accommodate more complex relationships. Once you know the graph’s scope, you need data that aligns with it.
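One lightweight way to pin the schema down before any scraping starts is to encode it as a plain data structure that later steps can validate against. This is an illustrative sketch (the labels mirror the Neo4j node labels used in Step 6; nothing here is required by any library):

```python
# A minimal, code-level encoding of the demo schema: entity labels,
# their expected properties, and the allowed relationship triples.
SCHEMA = {
    "entities": {
        "CEO": ["name", "role"],
        "Company": ["name"],
        "Location": ["name"],
    },
    "relationships": [
        ("CEO", "OWNS", "Company"),
        ("Company", "LOCATED_IN", "Location"),
    ],
}

def is_valid_relationship(src: str, rel: str, dst: str) -> bool:
    """Check a candidate triple against the schema before ingesting it."""
    return (src, rel, dst) in SCHEMA["relationships"]
```

Validating candidate triples against a structure like this catches extraction noise early, before it ever reaches the database.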
Step 2: Extract web data
The next step is to scrape the data that will populate the graph. We are using the Wikipedia API for this demo because it returns JSON-formatted summaries of the lead section of Wikipedia articles. These articles already contain the company info we need for this graph.
Using purpose-built APIs, such as the Wikipedia API, is one way to obtain web data, especially when you need information from a specific site or for a particular use case. For full-scale extraction, you can also adopt web scraping tools like Bright Data, Firecrawl and ScraperAPI. Here’s an overview of tasks they are well-suited for and the types of outputs they provide:
| Web scraping tool | Well-suited for | Outputs |
| --- | --- | --- |
| Bright Data | Enterprise-grade scraping of dynamic websites | JSON, CSV |
| Firecrawl | AI-driven crawling and extracting data from all accessible subpages in a website | Markdown, JSON |
| ScraperAPI | Scraping product data from e-commerce marketplaces such as Amazon and Walmart | Text, Markdown, CSV, JSON |
To extract data from Wikipedia:
- Install the requests library.

```shell
pip install requests
```
- Define a list of CEO names so the Wikipedia API knows which pages to fetch.

```python
ceo_names = [
    "Elon Musk", "Sundar Pichai", "Tim Cook", "Satya Nadella",
    "Mark Zuckerberg", "Andy Jassy", "Jensen Huang", "Ginni Rometty",
    "Larry Page", "Susan Wojcicki", "Shantanu Narayen", "Reed Hastings",
    "Michael Dell", "Daniel Ek", "Evan Spiegel", "Marc Benioff",
    "Lisa Su", "Dara Khosrowshahi", "Patrick Collison", "Brian Chesky",
]
```
We provided specific names because the Wikipedia API needs a search term or page title to fetch summaries.
- Define a function that retrieves the Wikipedia summaries of the listed CEOs.

```python
import requests

def get_wikipedia_summary(name):
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{name.replace(' ', '_')}"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json().get("extract", "")
        else:
            return ""
    except Exception:
        return ""
```
Now that we have the data, the next step is to ensure that it is graph-ready.
Step 3: Preprocess the data
The quality of any knowledge graph depends heavily on the quality of its input. The Wikipedia API outputs structured data, which removes the need for manual preprocessing. So, we will skip this step. But if you’re using a custom scraping tool, you might need to clean, normalize and extract structure from the web data to prepare it for graph ingestion.
The pandas Python library is well suited to this data manipulation task. It offers flexible data structures for cleaning and preparing raw data for text analysis. Use pandas to handle missing values, duplicates and inconsistent formatting in your data.
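As a hedged sketch of what that cleanup can look like (the column names and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical raw scrape output; column names are illustrative only.
raw = pd.DataFrame({
    "name": ["Apple Inc. ", "apple inc.", "Microsoft", None],
    "summary": ["Tech company", "Tech company", "Software maker", "n/a"],
})

# Normalize whitespace and casing so duplicates become detectable.
raw["name"] = raw["name"].str.strip().str.title()

# Drop rows with missing names, then exact duplicates on the name column.
clean = raw.dropna(subset=["name"]).drop_duplicates(subset=["name"])
print(clean["name"].tolist())  # → ['Apple Inc.', 'Microsoft']
```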
Step 4: Recognize entities
The goal at this stage is to find the key entities within the data and define their properties using named entity recognition (NER). NER classifies entities into predefined categories that capture their essential characteristics, as shown below:
We will use Hugging Face Transformers, an open-source library offering pretrained models optimized for NLP tasks such as entity recognition. Its dslim/bert-base-NER model has been fine-tuned to recognize person (PER), location (LOC) and organization (ORG) entities in text data, which we need for this graph.
Here’s how to perform NER using Hugging Face Transformers:
- Install the Transformers library.
```shell
pip install transformers
```
- Initialize the NER pipeline.
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)
```
ner_pipeline is Hugging Face’s pretrained NER pipeline for token classification. It takes raw text input, detects entity spans, classifies them and outputs the recognized named entities. The aggregation_strategy="simple" parameter groups consecutive tokens that belong to the same named entity into one entity span for cleaner results.
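To see what that aggregation does, here is a toy re-implementation of the idea (not Hugging Face’s actual code): consecutive tokens sharing an entity group are merged into one span.

```python
def aggregate_simple(token_results):
    """Merge consecutive tokens with the same entity group into spans.

    token_results: list of dicts like {"word": ..., "entity_group": ...},
    mimicking (in simplified form) per-token NER output.
    """
    spans = []
    for tok in token_results:
        if spans and spans[-1]["entity_group"] == tok["entity_group"]:
            spans[-1]["word"] += " " + tok["word"]
        else:
            spans.append(dict(tok))
    return spans

tokens = [
    {"word": "Tim", "entity_group": "PER"},
    {"word": "Cook", "entity_group": "PER"},
    {"word": "Apple", "entity_group": "ORG"},
]
print(aggregate_simple(tokens))
# → [{'word': 'Tim Cook', 'entity_group': 'PER'}, {'word': 'Apple', 'entity_group': 'ORG'}]
```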
- Extract organization (ORG) entities from each CEO summary.

```python
def extract_organizations(text):
    ner_results = ner_pipeline(text)
    orgs = set()
    for entity in ner_results:
        if entity["entity_group"] == "ORG":
            orgs.add(entity["word"])
    return list(orgs)
```
- Extract location (LOC) entities.

```python
def extract_locations(text):
    ner_results = ner_pipeline(text)
    locations = set()
    for entity in ner_results:
        # Check if the entity is labeled as a location
        if entity["entity_group"] == "LOC":
            locations.add(entity["word"])
    return list(locations)
```
We have successfully identified and classified the entities within the Wikipedia summaries. If you’re replicating this pipeline with custom models or want to define more domain-specific entities, SpaCy is another open-source NLP library you can use. It provides tools for training custom models on NER and manually defining new entities.
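Before reaching for spaCy, it can help to see the core idea behind manually defined, domain-specific entities. The following dependency-free sketch uses invented regex patterns (TICKER and FUNDING_ROUND are hypothetical labels) to illustrate the gazetteer-style matching that spaCy’s EntityRuler formalizes:

```python
import re

# Hypothetical domain-specific entity patterns; invented for illustration.
PATTERNS = {
    "TICKER": re.compile(r"\b[A-Z]{2,5}\b(?=\s+stock)"),
    "FUNDING_ROUND": re.compile(r"\bSeries [A-E]\b"),
}

def extract_custom_entities(text):
    """Return (matched text, label) pairs for every pattern hit."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), label))
    return entities

print(extract_custom_entities("NVDA stock rose after the Series B round."))
# → [('NVDA', 'TICKER'), ('Series B', 'FUNDING_ROUND')]
```

A statistical model generalizes beyond fixed patterns, but for narrow, well-defined vocabularies a rule-based pass like this is cheap, deterministic and easy to audit.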
Step 5: Map relationships
Next, we map the relationships between the extracted named entities to give the graph structure and context.
- Map each CEO to their company.
```python
ceo_to_companies = {}
for ceo in ceo_names:
    summary = get_wikipedia_summary(ceo)
    organizations = extract_organizations(summary)
    ceo_to_companies[ceo] = organizations

ceo_to_companies
```
The result will look like this:
```
{
    'Sundar Pichai': ['Alphabet Inc', 'Google'],
    'Tim Cook': ['Apple', 'Apple Inc'],
    'Satya Nadella': ['Microsoft'],
    'Mark Zuckerberg': ['Meta Platforms', 'Facebook'],
}
```
- Aggregate all unique company entities, removing duplicates with the set .update() method.

```python
all_companies = set()
for company_list in ceo_to_companies.values():
    all_companies.update(company_list)

all_companies = list(all_companies)
print("Unique companies found:", all_companies)
```
It will return a list of all unique companies present in the Wikipedia summaries.
- Map the relationship between location and company.
```python
company_to_locations = {}
for company in all_companies:
    summary = get_wikipedia_summary(company)
    locations = extract_locations(summary)
    company_to_locations[company] = locations

company_to_locations
```
This will return a dictionary of the companies and their corresponding locations, as shown below:
```
{
    'YouTube': ['California', 'San Bruno'],
    'IBM': ['Armonk', 'U. S.', 'New York'],
    'Nvidia': ['Santa Clara', 'California'],
}
```
Note: You may notice that the results stored in ceo_to_companies or company_to_locations can vary between runs. This happens because the dslim/bert-base-NER model is probabilistic, meaning its predictions can fluctuate. But it won’t affect the overall process.
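One way to reduce that run-to-run noise is to canonicalize entity names before mapping relationships, so surface variants like “Apple” and “Apple Inc” collapse into one key. This sketch uses a deliberately short, illustrative suffix list:

```python
# Common corporate suffixes to strip; illustrative, not exhaustive.
SUFFIXES = (" inc.", " inc", " corp.", " corp", " llc")

def canonicalize(name: str) -> str:
    """Map surface variants of a company name to one canonical key."""
    key = name.strip().lower()
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return key

assert canonicalize("Apple Inc") == canonicalize("Apple")  # both → "apple"
```

Deduplicating on the canonical key keeps “Apple” and “Apple Inc” from becoming two disconnected Company nodes in the graph.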
Step 6: Store the data in a graph database
After entity recognition, we store the structured dictionaries in a persistent and queryable format using a graph database (DB). There are two primary types of databases for managing knowledge graphs: property graph and resource description framework (RDF) databases.
Property graph databases organize information as nodes, relationships and properties (descriptive details). The nodes are associated with labels and attributes (key-value pairs) that define their role in the network, as shown in this image below:
The most widely adopted property graph database is Neo4j, which uses the Cypher query language to query and manipulate graph structures. We are building on this database because it lets you adapt your schema as your needs evolve without major refactoring, and it integrates smoothly into RAG workflows.
Here’s how to store the processed data in Neo4j:
- Set up an instance in Neo4j AuraDB and save the connection URI, username (typically, Neo4j) and password as environment variables using a .env file.
```python
import os

import dotenv
from neo4j import GraphDatabase

dotenv.load_dotenv(".env", override=True)

uri = os.environ["NEO4J_URI"]
user = os.environ["NEO4J_USERNAME"]
password = os.environ["NEO4J_PASSWORD"]

driver = GraphDatabase.driver(uri, auth=(user, password))
```
- Populate Neo4j with the company info.
```python
from datetime import datetime

def create_node_with_provenance(session, label, name, timestamp):
    session.run(f"""
        MERGE (n:{label} {{name: $name}})
        WITH n
        MATCH (p:Provenance {{run_time: datetime($time)}})
        MERGE (n)-[:EXTRACTED_FROM]->(p)
    """, name=name, time=timestamp)

def create_knowledge_graph(ceo_to_companies, company_to_locations):
    timestamp = datetime.utcnow().isoformat()
    source = "https://en.wikipedia.org/api/rest_v1/page/summary"

    with driver.session() as session:
        # Provenance node for the entire data load
        session.run("""
            MERGE (p:Provenance {run_time: datetime($time)})
            SET p.source = $source
        """, time=timestamp, source=source)

        for ceo, companies in ceo_to_companies.items():
            create_node_with_provenance(session, "CEO", ceo, timestamp)
            for company in companies:
                create_node_with_provenance(session, "Company", company, timestamp)
                session.run("""
                    MATCH (c:CEO {name: $ceo}), (comp:Company {name: $company})
                    MERGE (c)-[:OWNS]->(comp)
                """, ceo=ceo, company=company)
                for location in company_to_locations.get(company, []):
                    create_node_with_provenance(session, "Location", location, timestamp)
                    session.run("""
                        MATCH (comp:Company {name: $company}), (l:Location {name: $location})
                        MERGE (comp)-[:LOCATED_IN]->(l)
                    """, company=company, location=location)

create_knowledge_graph(ceo_to_companies, company_to_locations)
```
We created a provenance entity which has source and run_time properties, and connects to other existing entities. Provenance is crucial for the reproducibility and traceability of knowledge graphs because it provides metadata about the data origin, retrieval time and extracted entities.
To view this entity:
```python
def view_provenance():
    with driver.session() as session:
        query = """
            MATCH (p:Provenance)<-[:EXTRACTED_FROM]-(n)
            RETURN p.source AS source,
                   p.run_time AS timestamp,
                   collect(n.name) AS linked_entities
        """
        result = session.run(query)
        for record in result:
            print(f"Source: {record['source']}")
            print(f"Timestamp: {record['timestamp']}")
            print("Entities:", record['linked_entities'])

view_provenance()
```
The result:
In Neo4j AuraDB, you can download this information as a CSV or JSON file.
- Define a function that returns knowledge from the graph.
```python
def get_graph_data():
    query = """
        MATCH (ceo:CEO)-[:OWNS]->(company:Company)
        OPTIONAL MATCH (company)-[:LOCATED_IN]->(loc:Location)
        RETURN ceo.name AS ceo, company.name AS company, loc.name AS location
    """
    with driver.session() as session:
        results = session.run(query)
        return [record.data() for record in results]
```
The Cypher query will retrieve the CEO name, company name and location (if it exists).
- Install Pyvis, a Python library for graph visualization.
```shell
pip install pyvis
```
- Visualize the graph.
```python
from pyvis.network import Network

def visualize_graph(data):
    net = Network(height='600px', width='100%', notebook=True)
    for item in data:
        ceo = item['ceo']
        company = item['company']
        location = item['location']

        net.add_node(ceo, label=ceo, color='orange', shape='dot')
        net.add_node(company, label=company, color='lightblue', shape='box')
        net.add_edge(ceo, company, label='OWNS')

        if location:
            net.add_node(location, label=location, color='lightgreen', shape='ellipse')
            net.add_edge(company, location, label='LOCATED_IN')

    net.show('ceo_graph.html')

data = get_graph_data()
visualize_graph(data)
```
Here’s a zoomed-in diagram of the graph:
You can also use RDF databases (triple stores) for graph storage. They capture data as subject-predicate-object triples: the subject represents a resource (node), the predicate represents an edge and the object is another node or a property value. RDF stores are queried with SPARQL. Apache Jena and Amazon Neptune are two examples. Apache Jena is an open-source, RDF-native framework, while Amazon Neptune is a fully managed database service that supports both RDF (queried with SPARQL) and property graph (queried with openCypher or Gremlin) data models, which makes it a practical option if you need interoperability between the two.
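The triple model is easy to picture in plain Python. As an illustrative sketch (not real RDF tooling), this stores the demo’s company facts as subject-predicate-object tuples and answers a simple pattern query, with None acting as a wildcard the way variables do in SPARQL:

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Satya Nadella", "OWNS", "Microsoft"),
    ("Microsoft", "LOCATED_IN", "Redmond"),
    ("Tim Cook", "OWNS", "Apple"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

print(match(p="OWNS"))        # every ownership fact
print(match(s="Microsoft"))   # everything known about Microsoft
```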
Step 7: Query and validate the graph
Our sample project relies on the Neo4j database, so we will use Cypher, supplemented with an LLM, to query the graph. LLMs and knowledge graphs have a complementary relationship: LLMs assist in generating Cypher for graph querying, and knowledge graphs ground LLM outputs and provide explanations for them. We chose OpenAI’s GPT-4.1 because of its reasoning strength and context retention.
Here’s how to query the graph:
- Install the openai package and initialize the client with your OpenAI API key.
```python
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
```
- Generate Cypher query with GPT-4.1 to query the Neo4j database.
```python
def generate_cypher_query(user_input):
    # Prompt that guides GPT-4.1 to translate natural language into Cypher
    prompt = f"""
    You are an assistant that translates natural language into Cypher queries for a Neo4j graph.

    The graph has the following entities:
    - CEO (with `name` property)
    - Company (with `name` property)
    - Location (with `name` property)

    Relationships:
    - (ceo:CEO)-[:OWNS]->(company:Company)
    - (company:Company)-[:LOCATED_IN]->(location:Location)

    Now write a Cypher query for the following request:
    "{user_input}"

    Only return the query, without explanations.
    """
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```
This function uses GPT-4.1 to turn natural language questions into Cypher queries for the Neo4j database, so users can query the graph without writing Cypher syntax manually.
- Execute the Cypher query.
```python
def execute_cypher_query(cypher_query):
    with driver.session() as session:
        result = session.run(cypher_query)
        return [record.data() for record in result]
```
When we run a given Cypher query against the Neo4j database, it returns the results as a list of dictionaries, where each dictionary represents a record from the database.
```python
def ask_graph_with_natural_output(user_input):
    print("Step 1: Generating Cypher query...")
    cypher_query = generate_cypher_query(user_input)
    print("Cypher:", cypher_query)

    print("Step 2: Executing on Neo4j...")
    results = execute_cypher_query(cypher_query)
    print("Raw results:", results)

    print("Step 3: Generating human-friendly response...")
    response_prompt = f"""
    You are an assistant that explains the result of a database query to users in plain English.

    User Question:
    "{user_input}"

    Query Results:
    {results}

    Write a clear and concise answer to the user's question based on the results.
    """
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": response_prompt}],
        temperature=0.3,
    )
    answer = response.choices[0].message.content.strip()
    return answer
```
When a user poses a natural language question to GPT-4.1, it generates a Cypher query to retrieve the relevant information from the graph. Neo4j executes the Cypher query and returns the results to GPT-4.1. Then the LLM interprets the results and generates a natural language response for the user.
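Because the generated Cypher is executed verbatim, it is worth adding a guardrail so only read queries reach the database. Here is a minimal keyword-based check (a sketch, not a full Cypher parser; a production system would also restrict the database user’s permissions):

```python
import re

# Write clauses that a read-only QA system should never execute.
FORBIDDEN = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b",
    re.IGNORECASE,
)

def is_read_only(cypher_query: str) -> bool:
    """Reject queries containing write clauses before they hit Neo4j."""
    return not FORBIDDEN.search(cypher_query)

assert is_read_only('MATCH (c:CEO {name: "Satya Nadella"})-[:OWNS]->(co:Company) RETURN co.name')
assert not is_read_only("MATCH (n) DETACH DELETE n")
```

Calling is_read_only on each generated query before execute_cypher_query adds one cheap line of defense against prompt injection or a misbehaving model.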
Let’s validate that the graph can serve the use case defined in Step 1.
```python
print(ask_graph_with_natural_output("Which company does Satya Nadella own?"))
```
Here’s the result of the query:
```
Step 1: Generating Cypher query...
Cypher: MATCH (ceo:CEO {name: "Satya Nadella"})-[:OWNS]->(company:Company) RETURN company.name
Step 2: Executing on Neo4j...
Raw results: [{'company.name': 'Microsoft'}]
Step 3: Generating human-friendly response...
Satya Nadella is the owner of Microsoft.
```
The company info knowledge graph is now ready to add context to a question-answering RAG system. This web-to-graph demo:
- Extracts Wikipedia summaries using the Wikipedia API
- Performs NER and relationship mapping using the Hugging Face Transformers library
- Stores the data in a Neo4j database
- Uses Cypher and GPT-4.1 to query the graph
You can extend this sample implementation to other domains such as e-commerce, finance and academia. If your knowledge graph doesn’t deliver meaningful or expected results, you may need to adjust its schema, assess whether the data truly represents its domain or confirm that the NER model recognized the right entities.
Step 8 (optional): Automate data ingestion to keep the graph fresh
Web content changes frequently, and your knowledge graph needs to stay up-to-date. Automating data ingestion improves the graph’s accuracy and maintains consistency across its entities and relationships. To achieve this automation, we use Prefect, a workflow orchestration framework that provides a Pythonic API for defining workflows (called flows) as a series of tasks. Prefect can run on either a self-hosted server or in Prefect Cloud. For this sample company info graph, Prefect orchestrates three stages of the pipeline:
- Data extraction
- NER processing
- Data loading into Neo4j
Once deployed, the automation runs daily at 6:00am UTC using a CRON schedule. Each run pulls the latest Wikipedia summaries, processes them and refreshes the graph with up-to-date entities and relationships. Here’s how to do this:
- Install Prefect.
```shell
pip install prefect
```
- Refactor the existing code with Prefect. The complete code on how to achieve this is available in the web-to-graph-pipeline GitHub repository. Once there, scroll to the “Automate data ingestion using Prefect” section in the notebook to follow along.
The result should look like this:
```
10:30:34.899 | INFO | prefect - Starting temporary server on http://127.0.0.1:8566
10:30:38.929 | INFO | Flow run 'benign-lemur' - Beginning flow run 'benign-lemur' for flow 'test-flow'
Starting pipeline...
10:30:38.933 | INFO | Flow run 'benign-lemur' - Knowledge Graph updated successfully at 2025-07-29 10:30:38.933849+00:00
10:30:38.958 | INFO | Flow run 'benign-lemur' - Finished in state Completed()
```
- Save the script as a .py file and run it locally in your terminal to test.

```shell
# Provide the file path (the file was saved as kg_pipeline.py)
python kg_pipeline.py
```
If everything is working as expected, you should get a result similar to the one above.
- Create a Prefect deployment with a CRON schedule.
```shell
prefect work-pool create "Knowledge graph"
```
This creates a process-based work pool, meaning Prefect will run your flow on your local machine.
```shell
# Sets the schedule to 6am UTC daily
prefect deploy --name "kg_pipeline" --cron "0 6 * * *" --pool "Knowledge graph" kg_pipeline.py:knowledge_graph_pipeline
```
We set the schedule for data ingestion into Neo4j at 6am UTC daily. Here’s the result:
The deployment configuration will be saved as a .yaml file. You can make changes to the configuration by modifying the .yaml file.
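For reference, a deployment entry in the generated file looks roughly like the following (an approximation based on the commands above, not the file’s exact contents; field names can vary across Prefect versions):

```yaml
deployments:
- name: kg_pipeline
  entrypoint: kg_pipeline.py:knowledge_graph_pipeline
  work_pool:
    name: Knowledge graph
  schedules:
  - cron: "0 6 * * *"
    timezone: UTC
```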
- Verify the CRON schedule.
```shell
prefect deployment inspect knowledge-graph-pipeline/kg_pipeline
```
The response looks like this:
- Launch a local Prefect server.
```shell
prefect server start
```
Keep this server running in a separate terminal window. It acts as the backend where deployments are registered, schedules are stored and Prefect Workers (the component responsible for running the flows) connect to fetch tasks. If successful, you will see this:
PREFECT_API_URL=http://127.0.0.1:4200/api
- Start the Prefect Worker.
```shell
prefect worker start -p "Knowledge graph"
```
It should connect successfully to your local server. You can confirm the connection using:
```shell
prefect work-pool inspect "Knowledge graph"
```
Scroll to the end of the command output to see the status of the work pool.
Prefect is now fully set up and ready to schedule the flow on your local server. As long as your local server and Worker are running, Prefect will trigger the deployment and add it to the default queue at the scheduled time. The Worker will pick it up from the queue and the data ingestion will run automatically.
Here’s a summary table of all the highlighted tools in this guide and the roles they play in the web-to-graph pipeline:
| Pipeline stage | Suggested tools | Roles they play |
| --- | --- | --- |
| Data extraction | Wikipedia API, Bright Data, Firecrawl, ScraperAPI | Collects and returns output in a structured format |
| Data preprocessing | Pandas | Cleans, structures and manipulates raw data into a machine-readable format for further analysis |
| Entity recognition and relationship mapping | Hugging Face Transformers, SpaCy | Extracts entities from text and maps their relationships |
| Data storage | Neo4j, Amazon Neptune, Apache Jena | Stores entities and relationships in a graph database for efficient querying |
| Graph querying | Cypher + GPT-4.1 | Retrieves insights from the graph using natural language queries |
| Data automation | Prefect | Automates and schedules data extraction |
When you want to eliminate some of these pipeline stages (such as entity recognition, data preprocessing and relationship mapping), there’s a tool that can shorten the process from web extraction to graph creation.
Using Diffbot for automated web-to-graph extraction
Diffbot is an AI-driven service that crawls the web, transforms unstructured data into a graph-ready format and provides a rich knowledge graph where you can search through already-extracted data. It offers four core features that accelerate the web-to-graph pipeline:
- Extract: Diffbot REST APIs extract articles, products, social media feeds or images from the web in structured JSON or CSV formats using computer vision and a machine learning (ML) model. The computer vision algorithm classifies the web page into one of 20 predefined types, and the ML model interprets the results to identify key attributes on the page.
- Crawlbot: This web spider crawls links and pages on a website starting with one or more seed URLs, with customization options like skipping pages with a differing canonical URL or limiting crawl depth. To use the Crawlbot, you need to choose between Diffbot automatic (AI-enabled APIs that pull relevant information from a provided page type) or custom (for specific page types) extraction APIs.
- Knowledge Graph: Diffbot’s Knowledge Graph lets you search for people, companies, articles and products. You can specify the type of nodes and relationships you want to extract so that the Knowledge Graph can return a clean and structured dataset that fits your requirements. The screenshots below are search results for TechCrunch articles that Diffbot’s Knowledge Graph returned.
- Natural Language: Using its Natural Language API, Diffbot extracts nodes, relationships and semantic context from text data, which you can then transform into a graph. It also provides options for including your preferred set of entities and relationships.
Here’s an example of how to build a company info graph using Diffbot’s NL API, LangChain and Neo4j database:
- Install the required packages.
```shell
pip install --upgrade --quiet langchain langchain-experimental langchain-openai langchain-neo4j neo4j wikipedia
```
- Get your API token from Diffbot. Note that you have to sign up with a work email.
```python
import os

from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

diffbot_api_key = os.getenv("DIFFBOT_API_KEY")
diffbot_nlp = DiffbotGraphTransformer(diffbot_api_key=diffbot_api_key)
```
The DiffbotGraphTransformer class extracts the entities and relationships and returns graph-ready data. Thus, you don’t need dedicated tools for data preprocessing or NER implementation.
- Fetch data from Wikipedia.
```python
from langchain_community.document_loaders import WikipediaLoader

search_term = "Elon Musk"
wiki_data = WikipediaLoader(query=search_term).load()

try:
    graph_docs = diffbot_nlp.convert_to_graph_documents(wiki_data)
    print("Graph Documents:", graph_docs)
except Exception as e:
    print(f"An error occurred while converting to graph documents: {e}")
```
Here’s a snippet of the result:
- Start an instance on Neo4j AuraDB and save your credentials in a .env file.
```python
from langchain_neo4j import Neo4jGraph

uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")

graph = Neo4jGraph(uri=uri, username=username, password=password)
graph.add_graph_documents(graph_docs)
```
The add_graph_documents method loads the graph documents into Neo4j as shown below:
- Query the graph using GraphCypherQAChain and GPT-4.1.
```python
from langchain_neo4j import GraphCypherQAChain
from langchain_openai import ChatOpenAI

openai_api_key = os.getenv("OPENAI_API_KEY")

chain = GraphCypherQAChain.from_llm(
    cypher_llm=ChatOpenAI(temperature=0, model_name="gpt-4.1", api_key=openai_api_key),
    qa_llm=ChatOpenAI(temperature=0, model_name="gpt-4.1", api_key=openai_api_key),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)

result = chain.run("What organizations is Elon Musk a member of?")
print(result)
```
This is the result:
```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person {name: "Elon Musk"})<-[:FOUNDED_BY]-(org1:Organization)-[:ACQUIRED_BY]->(org2:Organization)
RETURN org2.name AS AcquiredCompany
Full Context:
[{'AcquiredCompany': 'Compaq'}, {'AcquiredCompany': 'eBay'}, {'AcquiredCompany': 'eBay'}, {'AcquiredCompany': 'SpaceX'}, {'AcquiredCompany': 'x.ai'}, {'AcquiredCompany': 'X Corp.'}]
> Finished chain.
Compaq, eBay, SpaceX, x.ai, X Corp. were acquired by Elon Musk.
```
Diffbot’s platform provides functionalities that automate most of the lifecycle of creating web-infused knowledge graphs. But building a knowledge graph is just one part. You need to maintain it to enhance its dependability and integrity.
Best practices for maintaining a web-powered knowledge graph
To realize the full potential of AI knowledge graphs, you need to manage them continuously, as web data updates frequently and business needs evolve. Below are some best practices for optimizing web-based knowledge graphs.
- Track provenance – Provenance serves as an audit trail that captures where the web data stems from and when it was extracted. In step 6 of the demo pipeline, we introduced provenance while loading the data into Neo4j. Proper provenance management validates your data (especially when using scraping tools), allows reproducibility and increases trust in the knowledge that the graph provides.
- Incrementally update the graph – Instead of full graph rebuilds every time new data becomes available, consider continually adding or modifying new information while preserving existing knowledge, entities and relationships. Repeatedly reprocessing existing graphs from scratch resets everything the graph already knows and can be computationally expensive and inefficient.
- Monitor performance – The performance of downstream applications such as question-answering and information retrieval systems can be affected by the quality of the graph feeding them knowledge. Define a schema early in the pipeline and scrape data that aligns with the graph’s domain so you can better identify what to measure and where to optimize.
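The incremental-update practice above can be sketched in a few lines: fold each new extraction run into the accumulated CEO-to-companies mapping instead of replacing it (the Cypher MERGE clauses in Step 6 apply the same matched-not-recreated idea at the database level):

```python
def merge_run(existing: dict, new_run: dict) -> dict:
    """Fold a new extraction run into the accumulated CEO->companies map,
    keeping previously seen companies and adding newly discovered ones."""
    for ceo, companies in new_run.items():
        merged = set(existing.get(ceo, [])) | set(companies)
        existing[ceo] = sorted(merged)
    return existing

# Accumulated state from earlier runs, plus one new run's output.
graph_state = {"Tim Cook": ["Apple"]}
merge_run(graph_state, {"Tim Cook": ["Apple Inc"], "Lisa Su": ["AMD"]})
print(graph_state)
# → {'Tim Cook': ['Apple', 'Apple Inc'], 'Lisa Su': ['AMD']}
```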
By following these best practices, data teams can minimize data quality issues, catch any decline in information accuracy promptly and build knowledge graphs that scale.
Making web data work for your knowledge graph
Incorporating web data into a knowledge graph is a practical way to improve its accuracy, consistency, completeness and timeliness. This guide helps you build a dynamic knowledge graph that fuels real-time RAG and enables AI applications to respond more intelligently.
There’s no one-size-fits-all tool. You can adopt the options used in this guide or swap them for your preferred stack. You will find the complete step-by-step code in this GitHub repository. Run the cells as you go, test the flow and experiment with the pipeline in real time.