This repository contains a multi-agent system that combines large language models with multimodal analysis to extract key data on nanomaterials from research articles.
The system processes scientific documents end-to-end, using a YOLO model for visual data extraction and GPT-4o for linking textual and visual information. At the core of the architecture is a ReAct agent that orchestrates several specialized agents, ensuring comprehensive and accurate data extraction. We demonstrate the system's efficacy through a case study in nanozyme research.
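At a high level, the ReAct agent exposes each specialized component (PDF text extraction, YOLO-based figure analysis, GPT-4o text-image linking) as a tool and runs a reason-act loop over them. The sketch below illustrates only that loop; the class and function names are hypothetical and do not mirror this repository's actual code.

```python
# Illustrative sketch of a ReAct-style orchestration loop.
# All names here are hypothetical, not this repository's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # tool input -> observation

def react_loop(question: str, tools: dict[str, Tool],
               llm: Callable[[str], str], max_steps: int = 10) -> str:
    """The LLM alternates reasoning and tool calls until it emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The LLM replies either "Action: <tool> <input>" or "Final: <answer>".
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        tool_name, _, tool_input = step.removeprefix("Action:").strip().partition(" ")
        observation = tools[tool_name].run(tool_input)
        transcript += f"Observation: {observation}\n"
    return "No answer produced within the step budget."
```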
- Upload and process PDF files of scientific articles and supplementary information
- Extract text from PDF files
- Utilize an AI agent to answer questions about the uploaded articles
- Handle multiple file uploads, including separate article and supplement files
- Clone this repository:
```bash
git clone https://github.com/ai-chem/LLM-Pipeline-for-Automated-Extraction-of-Nanozyme-Data.git
cd LLM-Pipeline-for-Automated-Extraction-of-Nanozyme-Data
```
- Install the required packages:
```bash
poetry install
```
This command installs all dependencies listed in `pyproject.toml` using the Poetry package manager.
- Set up your OpenAI API key in a `.env` file:
```
OPENAI_API_KEY=your_api_key_here
```
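The app reads this key from the environment at startup. A minimal sketch of the usual pattern, assuming the python-dotenv package (a common choice; the exact loading code in this repository may differ):

```python
# Minimal sketch, assuming python-dotenv; the repo's actual loading code may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # copies variables from .env into the process environment
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
```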
- Run the Streamlit app:
```bash
poetry run streamlit run agent_app.py
```
- Open the provided URL in your web browser.
- Upload a PDF file of a scientific article (and, optionally, a supplementary information file).
- Once the files are processed, you can start asking questions about the article in the chat interface.
The `auto_extraction.py` script batch-processes PDF files to extract detailed information about nanozyme experiments. It uses a multi-agent system to analyze scientific articles and supplementary information files, extracting named entities and other relevant data.
- To run the script, use the following command:
```bash
python auto_extraction.py <pdf_articles_dir> <pdf_supplements_dir> <ner_json_dir> <results_dir>
```
Where:
- `pdf_articles_dir`: Directory containing the PDF articles.
- `pdf_supplements_dir`: Directory containing the supplementary PDF files.
- `ner_json_dir`: Directory containing JSON files with named-entity recognition (NER) data.
- `results_dir`: Directory where the results will be saved.
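For example (the directory names below are illustrative, not fixed by the script):

```bash
python auto_extraction.py ./data/articles ./data/supplements ./data/ner_json ./results
```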
The `structured_output.ipynb` Jupyter notebook converts the ReAct agent's markdown answers into structured tabular data. It also computes the Jaccard index on the full dataset.
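For reference, the Jaccard index between two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of the computation (the example values are illustrative, not from the dataset):

```python
def jaccard_index(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Comparing extracted values against a reference annotation (illustrative data)
extracted = {"Fe3O4", "peroxidase-like", "pH 4.0"}
reference = {"Fe3O4", "peroxidase-like", "pH 3.5"}
print(jaccard_index(extracted, reference))  # 0.5
```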
- `agent_app.py`: Main Streamlit application file
- `auto_extraction.py`: Script for automated extraction of information from multiple PDF files
- `structured_output.ipynb`: Jupyter notebook for structured-output postprocessing and calculation of the Jaccard index
- `pdf2txt.py`: Module for extracting text from PDF files
- `utils.py`: Utility functions
- `logger.py`: Logging configuration
- `image_processing/`: Directory containing image-processing modules
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.