CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration
- CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration
- The CROssBARv2 Web Platform
- Download / Usage
- Cite
- License
CROssBARv2 is a comprehensive computational framework that integrates large-scale biological and biomedical data into a unified knowledge graph (KG). The framework includes:
- 🔄 Automated Data Integration: Systematically processes and integrates data from 34 data sources, covering 14 node types and 51 edge types.
- 🔍 Vector Search: Includes precomputed vector embeddings from biological representation learning models as node embeddings for sophisticated semantic similarity searches.
- 💬 LLM-Powered Chat: Integrates Large Language Models (CROssBAR-LLM) for answering scientific questions using natural language.
- 🌐 User-Friendly Platform: Provides an intuitive online interface with fast queries, interactive visualizations, and an API for easy access, even for non-programmers.
- 📊 Validation & Applications: Demonstrated utility through biomedical use-cases, literature validation, benchmarking of LLMs vs. CROssBAR-LLM for scientific Q&A, and deep learning experiments on protein function prediction.
Biomedical discovery is hindered by fragmented, modality-specific repositories and uneven metadata, limiting integrative analysis, accessibility, and reproducibility. We present CROssBARv2, a provenance-rich biomedical data-and-knowledge integration platform that unifies heterogeneous sources into a maintainable, queryable system. By consolidating diverse data sources into an extensive knowledge graph enriched with ontologies, metadata, and deep learning-based vector embeddings, the system eliminates the need for researchers to navigate multiple, siloed databases and enables users to uncover novel insights. CROssBARv2 provides programmatic access, interactive exploration, embedding-based semantic search, and an intuitive natural language interface powered by large language models (LLMs). We assess CROssBARv2 through (i) multiple use-case analyses to test biological coherence; (ii) knowledge-augmented biomedical question-answering benchmarks comparing CROssBAR-LLM with up-to-date generalist LLMs; and (iii) a deep-learning–based predictive-modelling validation experiment for protein-function prediction using heterogeneous CROssBARv2 data. Collectively, CROssBARv2 offers a scalable and user-friendly foundation that facilitates hypothesis generation, knowledge discovery, and translational research.
Figure 1. Overview of the CROssBARv2 workflow. (a) Integration of data from 34 well-established sources covering various biomedical domains. (b) Automatic retrieval, standardisation, and integration of source data using modular adapter scripts. (c) CROssBARv2 KG schema, comprising 14 node types and 51 edge types. (d) Storage of the KG in a Neo4j graph database, along with rich metadata and node embeddings computed using deep learning-based methods. (e) Execution of the CROssBAR-LLM workflow, which translates natural language queries into Cypher, executes the queries on KG, and synthesizes structured results into natural language responses; also supports vector-based similarity search. (f) Exploration of the KG through three interfaces: natural language querying via LLM interface, programmatic access via GraphQL API, and interactive visual exploration via the Neo4j browser.
The easiest way to experience CROssBARv2 is through our interactive web platform. It provides multiple interfaces tailored for different use cases, from natural language chat to interactive, visual exploration. Check out our website here.
Interact with the CROssBARv2 database using natural language. Through the Graph Explorer, you can navigate direct relationships between entities, retrieving structured facts and connections from the graph. The Semantic Search feature, powered by embeddings, enables the discovery of biologically meaningful patterns by identifying similarities between entities, that go beyond direct graph links.
Access the CROssBARv2 database programmatically using a flexible GraphQL interface. You can build custom, nested queries to retrieve precisely the data you need. It's ideal for integrating CROssBARv2 into your analytical workflows or applications, and supports seamless development with tools like Apollo Studio.
Explore the API in Apollo Studio
Explore the CROssBARv2 KG visually through the Neo4j Browser interface. This interactive tool lets you run Cypher queries, visualize nodes and relationships, and investigate the structure of the graph in detail.
To clone this repository:
git clone --recurse-submodules https://github.com/HUBioDataLab/CROssBARv2This repository provides adapter scripts for collecting, processing, and harmonizing biological data from a wide range of biomedical and biological sources. The adapters transform heterogeneous datasets into a standardized format that can be seamlessly imported into a Neo4j graph database.
Neo4j-importable CSV files required to reconstruct the KG are publicly available at here.
Check the relevant README.
This repository contains the unified docker compose file, of the services below.
Check the relevant README.
This repository contain frontend and backend scripts for CROssBARv2 LLM.
Check the relevant README.
This repository contains the GraphQL API interface of CROssBARv2 KG, which is useful for programmatic data retrieval.
Check the relevant README.
You can browse our KG with Neo4j.
This repository contains the necessary TLS utils, to host Neo4j with Traefik.
Check the relevant README.
@article{Sen_2025_CROssBARv2,
author = {Bünyamin Şen and Erva Ulusoy and Melih Darcan and Mert Ergün and Sebastian Lobentanzer and Ahmet S. Rifaioglu and Dénes Türei and Julio Saez-Rodriguez and Tunca Doğan},
title = {CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration},
journal = {Journal Name},
year = {2025},
volume = {XX},
number = {X},
pages = {XXX--XXX},
doi = {10.XXXX/journal.XXXXXXX},
note = {Project website available at \url{https://crossbarv2.hubiodatalab.com}}
}
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
The AGPL-3.0 is a strong copyleft license that ensures that derivatives of the work, including those modified and offered as a service over a network, remain free and open-source. If you use this software in a network service, you are required to make the source code of your modified version available to the users of that service.
For the full license text, please see the LICENSE file included in this repository.



