# MetaMine

MetaMine is a novel approach for extracting structured dataset metadata from scientific research papers. It uses a multi-stage chain-of-thought prompting strategy together with knowledge distillation to train a compact model that accurately identifies and extracts dataset metadata according to the DCAT vocabulary standard.
- Overview
- Pipeline Structure
- Key Features
- Directory Structure
- Installation and Dependencies
- Usage
- Results
- License
## Overview

Scientific datasets are valuable knowledge assets often hidden within research papers, limiting their discovery and reuse. MetaMine addresses this challenge by:
- Using a multi-stage chain-of-thought prompting strategy to guide large teacher models (GPT) in dataset identification and metadata extraction
- Employing knowledge distillation to transfer these capabilities to a smaller student model (Llama-3.2-3B-Instruct)
- Preserving the reasoning process during distillation for improved extraction accuracy
- Aligning extracted metadata with the DCAT vocabulary for semantic web integration
- Converting the structured output into RDF triples for knowledge graph creation
The distilled model processes papers in 35 seconds compared to 120 seconds for larger models, making it practical for processing large scientific corpora while maintaining high-quality extraction.
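The multi-stage prompting strategy can be sketched roughly as follows; the two-stage split and the prompt wording are illustrative assumptions, not the project's actual prompts:

```python
# Illustrative two-stage chain-of-thought prompting flow.
# Stage 1 identifies candidate datasets; stage 2 extracts metadata
# for one dataset at a time. Prompt text is hypothetical.
STAGE_1 = ("List every dataset mentioned in the paper below. "
           "Think step by step before answering.\n\n{paper_text}")
STAGE_2 = ("For the dataset '{dataset}', extract DCAT-style metadata "
           "(title, description, creator, landing page) as JSON. "
           "Reason step by step, then give the JSON.\n\n{paper_text}")

def build_messages(paper_text, dataset=None):
    """Build chat messages for one stage: stage 1 if no dataset is given,
    stage 2 for a specific dataset found in stage 1."""
    prompt = STAGE_1 if dataset is None else STAGE_2
    return [{"role": "user",
             "content": prompt.format(paper_text=paper_text, dataset=dataset)}]
```

Each stage-2 call targets a single dataset, so one paper yields one identification call plus one extraction call per dataset found.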
## Pipeline Structure

The MetaMine pipeline consists of four main phases:
- Data Collection and Processing: Papers are collected from sources like Papers With Code and processed through OCR to extract text content.
- Data Annotation: A teacher model (GPT-o4-mini) annotates papers using a multi-stage prompting strategy, and a subset is verified by human annotators.
- Knowledge Distillation: The extraction capabilities and reasoning process are transferred to a smaller student model (Llama-3.2-3B-Instruct) through fine-tuning.
- Knowledge Graph Creation: Extracted metadata is converted to RDF triples for integration with the semantic web.
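Phase 4 can be sketched without any RDF library by emitting N-Triples directly; the subject IRI scheme and the record fields below are illustrative assumptions, not the project's actual schema:

```python
# Map one extracted metadata record to DCAT-style N-Triples.
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def to_ntriples(subject_iri, metadata):
    """Serialize a metadata dict as N-Triples (illustrative DCAT subset)."""
    s = f"<{subject_iri}>"
    lines = [
        f"{s} <{RDF_NS}type> <{DCAT}Dataset> .",
        f'{s} <{DCTERMS}title> "{metadata["name"]}" .',
        f'{s} <{DCTERMS}description> "{metadata["description"]}" .',
    ]
    for creator in metadata.get("creators", []):
        lines.append(f'{s} <{DCTERMS}creator> "{creator}" .')
    return "\n".join(lines)

record = {"name": "SQuAD",
          "description": "A reading-comprehension benchmark.",
          "creators": ["Pranav Rajpurkar"]}
print(to_ntriples("https://example.org/dataset/squad", record))
```

In practice a library such as rdflib would handle escaping and serialization formats; this sketch only shows how extracted fields map onto DCAT terms.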
## Key Features

- Efficient metadata extraction from scientific papers
- Multi-stage chain-of-thought prompting for accurate annotation
- Knowledge distillation for model compression
- Preservation of reasoning process during distillation
- DCAT vocabulary alignment for semantic web integration
- RDF triple generation for knowledge graph creation
- 3.4x faster processing than larger models
## Directory Structure

```
├── data/             # Contains all data files
│   ├── aws/          # Amazon Mechanical Turk annotation files
│   ├── gs/           # Gold standard datasets
│   ├── llama/        # Generated output from the base Llama model
│   ├── llama_tuned/  # Generated output from the fine-tuned Llama model
│   └── qwen/         # Generated output from the DeepSeek Qwen model
├── fine_tuning/      # Scripts for fine-tuning the student model
├── inference/        # Scripts for generating dataset metadata using the fine-tuned model
└── results/          # Evaluation results for different models
```
## Installation and Dependencies

The project requires the following dependencies:
- Python 3.8+
- PyTorch
- Transformers
- PEFT (Parameter-Efficient Fine-Tuning)
- DeepSpeed
- Pandas
- Matplotlib
- pdfminer.six
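These map onto a `requirements.txt` along the following lines (standard PyPI package names assumed, versions unpinned):

```
torch
transformers
peft
deepspeed
pandas
matplotlib
pdfminer.six
```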
## Usage

The MetaMine pipeline is run through the following scripts, executed in order:
- First, download the paper metadata from the Papers With Code repository: https://github.com/paperswithcode/paperswithcode-data
- `1_choose_papers_randomly.py`: Selects papers randomly from the Papers With Code repository
- `2_download_papers.py`: Downloads selected papers in PDF format
- `3_pdf2txt.py`: Converts PDF files to text
- `4_process_papers_fine_tune.py`: Processes papers with the teacher model to generate training data
- `5_combine_datasets.py`: Merges the output of different models into a single dataset file
- `6_combine_csv.py`: Processes Amazon Mechanical Turk annotations
- `7_annotation_accuracy.py`: Analyzes annotation accuracy and generates figures
- `8_order_columns.py`: Reorders columns in annotation files for better readability
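The merge in step 5 might look roughly like this; the record schema and the deduplication key are assumptions for illustration, not the script's actual logic:

```python
import json

def combine_datasets(json_texts):
    """Merge per-model annotation lists into one list, deduplicated
    by (paper_id, dataset name). Schema is illustrative."""
    merged = {}
    for text in json_texts:
        for record in json.loads(text):
            key = (record["paper_id"], record["name"])
            # Keep the first occurrence of each (paper, dataset) pair.
            merged.setdefault(key, record)
    return list(merged.values())

a = json.dumps([{"paper_id": "p1", "name": "SQuAD"}])
b = json.dumps([{"paper_id": "p1", "name": "SQuAD"},
                {"paper_id": "p2", "name": "GLUE"}])
print(len(combine_datasets([a, b])))  # 2 unique records
```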
To fine-tune the model, navigate to the `fine_tuning` directory and run:

```bash
bash run_fine_tune.sh
```
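The distillation data pairs each paper with the teacher's reasoning and its final structured answer. One training example might be assembled as follows; the prompt wording, the `<think>` delimiter, and the field names are assumptions for illustration, not the project's actual format:

```python
import json

def make_training_example(paper_text, teacher_reasoning, teacher_json):
    """Pack one distillation example: the student learns to reproduce the
    teacher's chain of thought followed by the structured JSON answer."""
    return {
        "prompt": f"Extract dataset metadata from this paper:\n{paper_text}",
        "completion": (f"<think>{teacher_reasoning}</think>\n"
                       + json.dumps(teacher_json)),
    }

ex = make_training_example("paper text...",
                           "The paper introduces one new dataset...",
                           {"name": "SQuAD"})
print(ex["completion"])
```

Keeping the reasoning in the completion is what preserves the teacher's chain of thought during distillation, as described in the Overview.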
To extract dataset metadata from new papers:

```bash
cd inference
bash inference.sh
```
## Results

The distilled model (Llama-3.2-3B-Instruct) achieves an F1 score of 0.74 for dataset identification, outperforming its pre-distillation baseline (0.65) and rivaling much larger models like DeepSeek-R1-Distill-Qwen-32B (0.73) despite being 10× smaller. The model particularly excels at challenging metadata fields like dataset creator identification.
Running the `7_annotation_accuracy.py` script generates a figure showing annotation accuracy for each metadata field, which highlights the fields the model finds hardest to extract correctly:

```bash
python 7_annotation_accuracy.py
```
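The per-field accuracy the script reports can be computed along these lines; the field names and the exact-match criterion are illustrative choices, not necessarily the script's:

```python
from collections import defaultdict

def field_accuracy(predictions, gold):
    """Fraction of examples where each predicted metadata field exactly
    matches the gold annotation (exact match is an illustrative choice)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref in zip(predictions, gold):
        for field, value in ref.items():
            total[field] += 1
            correct[field] += int(pred.get(field) == value)
    return {f: correct[f] / total[f] for f in total}

preds = [{"title": "SQuAD", "creator": "Rajpurkar"}]
gold = [{"title": "SQuAD", "creator": "Rajpurkar et al."}]
print(field_accuracy(preds, gold))  # title matches; creator does not
```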
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Acknowledgments

We thank the following resources and communities:
- Papers With Code for providing access to research papers
- Hugging Face for Transformers and PEFT libraries
- Microsoft for DeepSpeed optimization
- Meta AI for the Llama model
- The semantic web community for DCAT vocabulary standards