# MetaMine

MetaMine is a novel approach for extracting structured dataset metadata from scientific research papers. It uses a multi-stage chain-of-thought prompting strategy together with knowledge distillation to train a compact model that accurately identifies and extracts dataset metadata according to the DCAT vocabulary standard.
- Overview
- Pipeline Structure
- Key Features
- Directory Structure
- Installation and Dependencies
- Usage
- Results
- License
## Overview

Scientific datasets are valuable knowledge assets often hidden within research papers, limiting their discovery and reuse. MetaMine addresses this challenge by:
- Using a multi-stage chain-of-thought prompting strategy to guide large teacher models (GPT) in dataset identification and metadata extraction
- Employing knowledge distillation to transfer these capabilities to a smaller student model (Llama-3.2-3B-Instruct)
- Preserving the reasoning process during distillation for improved extraction accuracy
- Aligning extracted metadata with the DCAT vocabulary for semantic web integration
- Converting the structured output into RDF triples for knowledge graph creation
The distilled model processes papers in 35 seconds compared to 120 seconds for larger models, making it practical for processing large scientific corpora while maintaining high-quality extraction.
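The multi-stage prompting strategy can be sketched roughly as follows; the two-stage split and the prompt wording are illustrative assumptions, not the project's actual prompts:

```python
# Illustrative two-stage chain-of-thought prompting flow.
# Stage 1 identifies candidate datasets; stage 2 extracts metadata
# for one dataset at a time. Prompt text is hypothetical.
STAGE_1 = ("List every dataset mentioned in the paper below. "
           "Think step by step before answering.\n\n{paper_text}")
STAGE_2 = ("For the dataset '{dataset}', extract DCAT-style metadata "
           "(title, description, creator, landing page) as JSON. "
           "Reason step by step, then give the JSON.\n\n{paper_text}")

def build_messages(paper_text, dataset=None):
    """Build chat messages for one stage: stage 1 if no dataset is given,
    stage 2 for a specific dataset found in stage 1."""
    prompt = STAGE_1 if dataset is None else STAGE_2
    return [{"role": "user",
             "content": prompt.format(paper_text=paper_text, dataset=dataset)}]
```

Each stage-2 call targets a single dataset, so one paper yields one identification call plus one extraction call per dataset found.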
## Pipeline Structure

The MetaMine pipeline consists of four main phases:
- Data Collection and Processing: Papers are collected from sources like Papers With Code and processed through OCR to extract text content.
- Data Annotation: A teacher model (GPT-o4-mini) annotates papers using a multi-stage prompting strategy, and a subset is verified by human annotators.
- Knowledge Distillation: The extraction capabilities and reasoning process are transferred to a smaller student model (Llama-3.2-3B-Instruct) through fine-tuning.
- Knowledge Graph Creation: Extracted metadata is converted to RDF triples for integration with the semantic web.
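Phase 4 can be sketched without any RDF library by emitting N-Triples directly; the subject IRI scheme and the record fields below are illustrative assumptions, not the project's actual schema:

```python
# Map one extracted metadata record to DCAT-style N-Triples.
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def to_ntriples(subject_iri, metadata):
    """Serialize a metadata dict as N-Triples (illustrative DCAT subset)."""
    s = f"<{subject_iri}>"
    lines = [
        f"{s} <{RDF_NS}type> <{DCAT}Dataset> .",
        f'{s} <{DCTERMS}title> "{metadata["name"]}" .',
        f'{s} <{DCTERMS}description> "{metadata["description"]}" .',
    ]
    for creator in metadata.get("creators", []):
        lines.append(f'{s} <{DCTERMS}creator> "{creator}" .')
    return "\n".join(lines)

record = {"name": "SQuAD",
          "description": "A reading-comprehension benchmark.",
          "creators": ["Pranav Rajpurkar"]}
print(to_ntriples("https://example.org/dataset/squad", record))
```

In practice a library such as rdflib would handle escaping and serialization formats; this sketch only shows how extracted fields map onto DCAT terms.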
## Key Features

- Efficient metadata extraction from scientific papers
- Multi-stage chain-of-thought prompting for accurate annotation
- Knowledge distillation for model compression
- Preservation of reasoning process during distillation
- DCAT vocabulary alignment for semantic web integration
- RDF triple generation for knowledge graph creation
- 3.4x faster processing than larger models
## Directory Structure

```
├── data/             # Contains all data files
│   ├── aws/          # Amazon Mechanical Turk annotation files
│   ├── gs/           # Gold standard datasets
│   ├── llama/        # Generated output from the base Llama model
│   ├── llama_tuned/  # Generated output from the fine-tuned Llama model
│   └── qwen/         # Generated output from the DeepSeek Qwen model
├── fine_tuning/      # Scripts for fine-tuning the student model
├── inference/        # Scripts for generating dataset metadata using the fine-tuned model
└── results/          # Evaluation results for different models
```
## Installation and Dependencies

The project requires the following dependencies:
- Python 3.8+
- PyTorch
- Transformers
- PEFT (Parameter-Efficient Fine-Tuning)
- DeepSpeed
- Pandas
- Matplotlib
- pdfminer.six
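These map onto a `requirements.txt` along the following lines (standard PyPI package names assumed, versions unpinned):

```
torch
transformers
peft
deepspeed
pandas
matplotlib
pdfminer.six
```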
## Usage

The MetaMine pipeline is run through the following scripts, executed in order:
- First, download the paper metadata from the Papers With Code repository: https://github.com/paperswithcode/paperswithcode-data
- `1_choose_papers_randomly.py`: Selects papers randomly from the Papers With Code repository
- `2_download_papers.py`: Downloads selected papers in PDF format
- `3_pdf2txt.py`: Converts PDF files to text
- `4_process_papers_fine_tune.py`: Processes papers with the teacher model to generate training data
- `5_combine_datasets.py`: Merges the output of different models into a single dataset file
- `6_combine_csv.py`: Processes Amazon Mechanical Turk annotations
- `7_annotation_accuracy.py`: Analyzes annotation accuracy and generates figures
- `8_order_columns.py`: Reorders columns in annotation files for better readability
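The merge in step 5 might look roughly like this; the record schema and the deduplication key are assumptions for illustration, not the script's actual logic:

```python
import json

def combine_datasets(json_texts):
    """Merge per-model annotation lists into one list, deduplicated
    by (paper_id, dataset name). Schema is illustrative."""
    merged = {}
    for text in json_texts:
        for record in json.loads(text):
            key = (record["paper_id"], record["name"])
            # Keep the first occurrence of each (paper, dataset) pair.
            merged.setdefault(key, record)
    return list(merged.values())

a = json.dumps([{"paper_id": "p1", "name": "SQuAD"}])
b = json.dumps([{"paper_id": "p1", "name": "SQuAD"},
                {"paper_id": "p2", "name": "GLUE"}])
print(len(combine_datasets([a, b])))  # 2 unique records
```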
To fine-tune the model, navigate to the `fine_tuning` directory and run:

```bash
bash run_fine_tune.sh
```
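The distillation data pairs each paper with the teacher's reasoning and its final structured answer. One training example might be assembled as follows; the prompt wording, the `<think>` delimiter, and the field names are assumptions for illustration, not the project's actual format:

```python
import json

def make_training_example(paper_text, teacher_reasoning, teacher_json):
    """Pack one distillation example: the student learns to reproduce the
    teacher's chain of thought followed by the structured JSON answer."""
    return {
        "prompt": f"Extract dataset metadata from this paper:\n{paper_text}",
        "completion": (f"<think>{teacher_reasoning}</think>\n"
                       + json.dumps(teacher_json)),
    }

ex = make_training_example("paper text...",
                           "The paper introduces one new dataset...",
                           {"name": "SQuAD"})
print(ex["completion"])
```

Keeping the reasoning in the completion is what preserves the teacher's chain of thought during distillation, as described in the Overview.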
To extract dataset metadata from new papers:

```bash
cd inference
bash inference.sh
```
## Results

The distilled model (Llama-3.2-3B-Instruct) achieves an F1 score of 0.74 for dataset identification, outperforming its pre-distillation baseline (0.65) and rivaling much larger models like DeepSeek-R1-Distill-Qwen-32B (0.73) despite being 10× smaller. The model particularly excels at challenging metadata fields like dataset creator identification.
Running the `7_annotation_accuracy.py` script generates a figure showing annotation accuracy for each metadata field, which highlights the fields the model finds hardest to extract correctly:

```bash
python 7_annotation_accuracy.py
```
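The per-field accuracy the script reports can be computed along these lines; the field names and the exact-match criterion are illustrative choices, not necessarily the script's:

```python
from collections import defaultdict

def field_accuracy(predictions, gold):
    """Fraction of examples where each predicted metadata field exactly
    matches the gold annotation (exact match is an illustrative choice)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref in zip(predictions, gold):
        for field, value in ref.items():
            total[field] += 1
            correct[field] += int(pred.get(field) == value)
    return {f: correct[f] / total[f] for f in total}

preds = [{"title": "SQuAD", "creator": "Rajpurkar"}]
gold = [{"title": "SQuAD", "creator": "Rajpurkar et al."}]
print(field_accuracy(preds, gold))  # title matches; creator does not
```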
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Acknowledgments

We thank the following resources and communities:
- Papers With Code for providing access to research papers
- Hugging Face for Transformers and PEFT libraries
- Microsoft for DeepSpeed optimization
- Meta AI for the Llama model
- The semantic web community for DCAT vocabulary standards