Knowledge Graph
A structured knowledge representation that encodes entities and their relationships as a directed labelled graph, enabling machines to reason over interconnected facts across diverse domains.
A knowledge graph (KG) is a data structure that represents knowledge as a collection of entities (real-world objects, concepts, or events) and the typed relationships that connect them. It is formalised as a directed labelled graph in which nodes represent entities and edges represent relations, with each edge carrying a label that specifies the type of relationship. The fundamental unit is the triple (subject, predicate, object), for example (Kuala Lumpur, capital_of, Malaysia) or (Maybank, founded_in, 1960). In the second example the object is a literal value rather than another entity.
Knowledge graphs combine the flexibility of a graph data model with formal semantic definitions drawn from ontologies, enabling machines to perform structured reasoning, inference, and complex query answering over large collections of interconnected facts.
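The triple representation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the names Triple, outgoing, and objects_of are hypothetical, and the only data used are the two example facts from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # an entity name or a literal value such as "1960"


# The two example facts from the text, stored as (subject, predicate, object).
triples = [
    Triple("Kuala Lumpur", "capital_of", "Malaysia"),
    Triple("Maybank", "founded_in", "1960"),
]

# Adjacency view of the directed labelled graph:
# each subject maps to a list of (edge label, target) pairs.
outgoing = defaultdict(list)
for t in triples:
    outgoing[t.subject].append((t.predicate, t.obj))


def objects_of(subject, predicate):
    """Return every object linked to `subject` by an edge labelled `predicate`."""
    return [o for p, o in outgoing.get(subject, []) if p == predicate]


print(objects_of("Kuala Lumpur", "capital_of"))
```

Storing the graph as a flat list of triples and deriving the adjacency view from it keeps the triple as the single source of truth, which mirrors how the definition above treats it as the fundamental unit.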
Historical Context
The intellectual roots of knowledge graphs lie in symbolic AI and the Semantic Web initiative. Tim Berners-Lee's vision of a machine-readable web of linked data, formalised through the Resource Description Framework (RDF) and the Web Ontology Language (OWL) standards, established the triple-store paradigm in the early 2000s. The term "knowledge graph" was popularised by Google in 2012 when the company announced its Knowledge Graph feature, which enriched search results with structured information about entities sourced from Wikipedia, Freebase, and other curated databases.
Large-scale public knowledge graphs include Wikidata (a collaboratively edited knowledge base hosted by the Wikimedia Foundation, with over 100 million entities as of 2025), DBpedia (extracted from Wikipedia), and YAGO. Proprietary knowledge graphs maintained by technology companies are reported to be substantially larger: Microsoft's Satori, Google's Knowledge Graph, and LinkedIn's Economic Graph are each described as containing billions of entities and triples.
Data Model
Knowledge graphs typically use one of two complementary data models. The RDF model, standardised by the World Wide Web Consortium (W3C), represents every fact as a subject-predicate-object triple stored in triple stores queryable via SPARQL. RDF graphs are well-suited to linked open data and interoperability across systems.
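SPARQL queries are built from triple patterns in which some positions are variables. The sketch below imitates the matching of a single triple pattern in plain Python. It is not SPARQL and does not use any triple-store API: the function match, the "?"-prefixed variable convention, and the headquartered_in fact are assumptions added for illustration.

```python
# A tiny in-memory triple set. The headquartered_in fact is an
# illustrative addition to the two examples given in the text.
TRIPLES = {
    ("Kuala Lumpur", "capital_of", "Malaysia"),
    ("Maybank", "founded_in", "1960"),
    ("Maybank", "headquartered_in", "Kuala Lumpur"),
}


def match(pattern, triples=TRIPLES):
    """Yield one dict of variable bindings per triple that fits the pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # Reached only when no term mismatched.
            yield bindings


# Roughly analogous to: SELECT ?s ?o WHERE { ?s :headquartered_in ?o }
results = list(match(("?s", "headquartered_in", "?o")))
print(results)
```

A real SPARQL engine joins many such patterns and uses indexes instead of a linear scan; the sketch only shows the binding step that those joins are built on.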
Property graphs, supported by databases such as Neo4j, Amazon Neptune (which also supports RDF), and TigerGraph, allow both nodes and edges to carry arbitrary key-value properties, providing a more expressive and developer-friendly model for application-level knowledge management. Cypher (which originated with Neo4j) and Gremlin (from Apache TinkerPop) are widely used query languages for property graphs.
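The difference from the triple model is easiest to see in code. The following is a minimal in-memory sketch of a property graph, not the data model of any particular database: the classes Node, Edge, and PropertyGraph, the HEADQUARTERED_IN relationship, and the asserted_by property are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Nodes and edges that both carry arbitrary key-value properties."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)

    def add_edge(self, source, target, rel_type, **properties):
        # Reject edges whose endpoints have not been added yet.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, rel_type, properties))

    def neighbours(self, node_id, rel_type=None):
        """Targets of outgoing edges, optionally filtered by relationship type."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (rel_type is None or e.rel_type == rel_type)
        ]


g = PropertyGraph()
# The founding year is a node property here, not a separate triple.
g.add_node("maybank", labels=["Company"], name="Maybank", founded=1960)
g.add_node("kl", labels=["City"], name="Kuala Lumpur")
g.add_edge("maybank", "kl", "HEADQUARTERED_IN", asserted_by="example")
print(g.neighbours("maybank"))
```

Note how the founding year, which needed its own triple in the RDF-style examples, becomes a property on the node, and how the edge itself can carry metadata.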
Knowledge Graph Completion
Real-world knowledge graphs are inevitably incomplete: Wikidata, for example, lacks millions of facts that could be inferred from the information it already contains. Knowledge graph completion (KGC) is the task of predicting missing links or entity attributes. Embedding-based methods such as TransE, DistMult, ComplEx, and RotatE learn low-dimensional vector representations of entities and relations such that the geometry of the embedding space reflects the graph's relational structure, so that a candidate triple can be scored by a geometric operation. TransE, for instance, models a relation as a translation, so that head + relation ≈ tail for true triples (Bordes et al., 2013).
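As a concrete case of scoring by a geometric operation, the sketch below applies the TransE idea, in which a plausible triple has its head vector plus its relation vector close to its tail vector. The embeddings are two-dimensional toy values set by hand purely for illustration; a real model would learn them from the graph, and the names entity_vec, relation_vec, transe_score, and rank_tails are hypothetical.

```python
import math

# Toy 2-dimensional embeddings chosen by hand, NOT learned from data.
entity_vec = {
    "Kuala Lumpur": (1.0, 0.0),
    "Malaysia": (1.0, 1.0),
    "Maybank": (0.0, 3.0),
}
relation_vec = {"capital_of": (0.0, 1.0)}


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    Scores closer to zero mean the triple is more plausible.
    """
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, best first."""
    candidates = [e for e in entity_vec if e != head]
    return sorted(
        candidates,
        key=lambda tail: transe_score(head, relation, tail),
        reverse=True,
    )


print(rank_tails("Kuala Lumpur", "capital_of"))
```

Link prediction then amounts to ranking candidate tails (or heads) for a query such as (Kuala Lumpur, capital_of, ?) and taking the best-scoring ones as proposed facts.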
More recent approaches combine graph neural networks with knowledge graph embeddings, or use large language models to perform KGC by framing it as a text generation or ranking task.
Integration with Large Language Models
A significant development in 2024 and 2025 was the integration of knowledge graphs with large language models (LLMs) to mitigate hallucination. LLMs trained on text alone may generate plausible but factually incorrect statements, and grounding their outputs in a structured knowledge graph provides a verifiable factual backbone. In GraphRAG (Edge et al., 2024, Microsoft Research), a knowledge graph is constructed from a document corpus and used to augment retrieval-augmented generation, with the aim of giving more accurate responses to questions that span many documents or require traversing multiple relationships.
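The general idea of grounding a prompt in graph facts can be sketched as follows. This is not the GraphRAG pipeline, which builds its graph with an LLM and adds further processing steps; it is a simplified illustration that assumes a ready-made triple list, naive substring matching to find entities in the question, and traversal along outgoing edges only. The functions retrieve_facts and build_prompt are hypothetical names, and no LLM is called.

```python
# Illustrative facts; headquartered_in is an addition to the text's examples.
TRIPLES = [
    ("Maybank", "headquartered_in", "Kuala Lumpur"),
    ("Kuala Lumpur", "capital_of", "Malaysia"),
    ("Maybank", "founded_in", "1960"),
]


def retrieve_facts(question, triples, hops=2):
    """Collect triples reachable within `hops` outgoing edges of any
    subject entity whose name appears in the question."""
    q = question.lower()
    frontier = {s for s, _, _ in triples if s.lower() in q}
    seen = set(frontier)
    facts = []
    for _ in range(hops):
        next_frontier = set()
        for triple in triples:
            s, _, o = triple
            if s in frontier and triple not in facts:
                facts.append(triple)
                if o not in seen:
                    next_frontier.add(o)
        seen |= next_frontier
        frontier = next_frontier
    return facts


def build_prompt(question, facts):
    """Prepend the retrieved triples to the question as plain text."""
    lines = [f"({s}, {p}, {o})" for s, p, o in facts]
    return "Known facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


question = "In which country is Maybank headquartered?"
facts = retrieve_facts(question, TRIPLES)
prompt = build_prompt(question, facts)
print(prompt)
```

With two hops the retrieved context includes both the headquarters fact and the capital fact, which is the chain a model needs to answer the question; with a single hop the second fact would be missing.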
Knowledge graphs also improve the explainability of AI outputs: because each fact is traceable to a named source triple, systems can cite specific graph paths as justification for their answers. A vendor article (PingCAP, 2025) reports accuracy improvements of up to 300 percent on complex multi-hop queries when knowledge graph integration was applied.
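Citing a graph path as justification can be illustrated with a breadth-first search that returns the chain of triples linking two entities. This is a minimal sketch under stated assumptions: the function explain is a hypothetical name, the headquartered_in fact is an illustrative addition, and edges are followed only in their stated direction.

```python
from collections import deque

# Illustrative facts; headquartered_in is an addition to the text's examples.
TRIPLES = [
    ("Maybank", "headquartered_in", "Kuala Lumpur"),
    ("Kuala Lumpur", "capital_of", "Malaysia"),
    ("Maybank", "founded_in", "1960"),
]


def explain(start, goal, triples):
    """Return the shortest chain of triples leading from start to goal,
    or None when no directed path exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for s, p, o in triples:
            if s == entity and o not in visited:
                visited.add(o)
                queue.append((o, path + [(s, p, o)]))
    return None


path = explain("Maybank", "Malaysia", TRIPLES)
for s, p, o in path:
    print(f"{s} --{p}--> {o}")
```

Each triple on the returned path is a fact that can be shown to the user, so an answer such as "Maybank is based in Malaysia" comes with the two supporting edges rather than an unexplained assertion.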
Applications
Knowledge graphs power the entity panels shown in Google's search results. In healthcare, resources such as the Human Disease Ontology and DrugBank link symptoms, diagnoses, genes, proteins, and pharmaceutical compounds, enabling hypothesis generation and the detection of adverse drug interactions. In finance, knowledge graphs model corporate ownership structures, supply chain relationships, and transaction networks for risk and compliance. In e-commerce, product knowledge graphs connect items, attributes, brands, and user preferences to improve search and recommendation. In manufacturing, KGs encode bill-of-materials hierarchies, supplier relationships, and process parameters to support root-cause analysis.
References
- Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1-37.
- Bordes, A., et al. (2013). Translating embeddings for modeling multi-relational data. NeurIPS 2013.
- Edge, D., et al. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130. Microsoft Research.
- W3C. (2004). Resource description framework (RDF): Concepts and abstract syntax. World Wide Web Consortium.
- PingCAP. (2025). How knowledge graphs transform machine learning in 2025. pingcap.com.