A Python-based tool that crawls arXiv papers to extract author email addresses. This tool is useful for researchers and academics who need to compile contact information for authors in specific research areas.
This tool helps collect email addresses from publicly available academic papers. Although the data is public, keep the following in mind:
- The collected data (emails, database) is not included in this repository
- Users should respect privacy and data protection regulations when using this tool
- Consider the ethical implications and use the tool responsibly
Key features:
- Search arXiv papers using custom queries (the search step is sketched below)
- Download and process PDFs automatically
- Extract email addresses from paper PDFs
- Store results in a SQLite database
- Export results to CSV and text files
- Rate-limiting compliance with arXiv's API guidelines
- Robust error handling and logging
- Multiple platform support (Local, Google Colab, Kaggle)
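Under the hood, the search step queries arXiv's public Atom API and parses the feed with feedparser. Below is a minimal sketch of that step, assuming the standard API endpoint; the function name and result format are illustrative, not the exact code in the generated notebooks:

```python
import urllib.parse

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"


def search_arxiv(query, start=0, max_results=10):
    """Return (arxiv_id, title, pdf_url) tuples for one search query."""
    params = urllib.parse.urlencode({
        "search_query": query,
        "start": start,
        "max_results": max_results,
    })
    feed = feedparser.parse(f"{ARXIV_API}?{params}")
    results = []
    for entry in feed.entries:
        # entry.id looks like http://arxiv.org/abs/2301.00001v1
        arxiv_id = entry.id.rsplit("/", 1)[-1]
        pdf_url = entry.id.replace("/abs/", "/pdf/")
        results.append((arxiv_id, entry.title, pdf_url))
    return results


if __name__ == "__main__":
    for paper in search_arxiv("cat:cs.MA", max_results=5):
        print(paper)
```

Each result can then be recorded in the SQLite database before its PDF is fetched and scanned.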
Project structure:
```
arxiv_parser/
├── main.py # Main script that generates notebooks
├── process_remaining.py # Script for processing remaining papers
├── notebooks/ # Generated notebook versions
│ ├── arxiv_email_crawler.ipynb # Local Jupyter version
│ ├── arxiv_email_crawler_colab.ipynb # Google Colab version
│ └── arxiv_email_crawler_kaggle.ipynb # Kaggle version
├── data/ # Directory for database and output files (not tracked in git)
│ ├── papers.db # SQLite database (generated)
│ ├── papers_with_emails.csv # Exported results (generated)
│ └── unique_emails.txt # List of unique emails (generated)
└── requirements.txt # Python dependencies
```
Requirements:
- Python 3.7+
- Dependencies (automatically installed):
  - feedparser==6.0.10
  - requests==2.31.0
  - pdfplumber==0.10.3
  - jupyter==1.0.0 (for local notebook usage)
Installation:
- Clone this repository:
```bash
git clone https://github.com/yourusername/arxiv-email-crawler.git
cd arxiv-email-crawler
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Generate notebooks:
```bash
python main.py
```
- Start collecting data, using either the local Jupyter notebook or the processing script (the extraction step is sketched after these commands):
```bash
# Option 1: Use the Jupyter notebook
jupyter notebook notebooks/arxiv_email_crawler.ipynb

# Option 2: Use the processing script directly
python process_remaining.py
```
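The extraction step boils down to downloading each paper's PDF and scanning its text for email-like strings. Here is a rough sketch under those assumptions, using pdfplumber for text extraction and a simple regex; the function name and pattern are illustrative, not the exact logic in process_remaining.py:

```python
import io
import re

import pdfplumber
import requests

# Simple pattern for email-like strings.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_from_pdf(pdf_url):
    """Download one PDF and return the set of email addresses found in it."""
    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()
    emails = set()
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # Author emails almost always appear on the first couple of pages.
        for page in pdf.pages[:2]:
            text = page.extract_text() or ""
            emails.update(EMAIL_RE.findall(text))
    return emails
```

Obfuscated addresses (e.g. "name at example dot org") will slip past a pattern like this, which is one reason results should be spot-checked.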
To run on Google Colab:
- Upload notebooks/arxiv_email_crawler_colab.ipynb to Google Drive
- Open it with Google Colab
- Mount your Google Drive when prompted
- Run the cells sequentially
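Drive is mounted presumably so that the database and exported files persist between Colab sessions. The mount itself looks like this; the data path below is a hypothetical example, not necessarily the one the notebook uses:

```python
from google.colab import drive

drive.mount("/content/drive")

# Hypothetical location for the database and exports on your Drive.
DATA_DIR = "/content/drive/MyDrive/arxiv_parser/data"
```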
To run on Kaggle:
- Upload notebooks/arxiv_email_crawler_kaggle.ipynb to Kaggle
- Create a new notebook from this file
- Run the cells sequentially
The crawler can be configured by modifying the search queries in the notebook:
```python
search_queries = [
    "all:AI AND all:agent",
    "cat:cs.AI AND all:agent",
    "cat:cs.MA",                 # Multi-agent systems
    "cat:cs.AI AND all:LLM",
]
```
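Each entry uses standard arXiv API query syntax: all: matches any field, cat: restricts results to a subject category, and AND combines terms. Because queries can overlap, the same paper may be returned more than once; below is a minimal sketch of crawling the query list with in-memory de-duplication, reusing the hypothetical search_arxiv helper sketched earlier (the actual notebooks may instead rely on the database's primary key):

```python
seen = set()
papers = []
for query in search_queries:
    for arxiv_id, title, pdf_url in search_arxiv(query, max_results=100):
        if arxiv_id not in seen:  # the same paper can match several queries
            seen.add(arxiv_id)
            papers.append((arxiv_id, title, pdf_url))
```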
The tool generates several files in the data/ directory (not tracked in git); a sketch of the assumed database layout follows this list:
- papers.db: SQLite database containing:
  - Paper metadata
  - Processing status
  - Extracted emails
  - Retry information
- papers_with_emails.csv: CSV export containing:
  - Paper details
  - Associated email addresses
  - Publication information
- unique_emails.txt: Simple text file with unique email addresses
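The exact schema lives in the generated notebooks; the fields described above map onto a table roughly like the one below (the table name, column names, and comma-separated email format are assumptions):

```python
import sqlite3

conn = sqlite3.connect("data/papers.db")

# Hypothetical schema matching the fields described above.
conn.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        arxiv_id    TEXT PRIMARY KEY,   -- paper metadata
        title       TEXT,
        published   TEXT,
        pdf_url     TEXT,
        processed   INTEGER DEFAULT 0,  -- processing status
        retry_count INTEGER DEFAULT 0,  -- retry information
        emails      TEXT                -- extracted emails, e.g. comma-separated
    )
""")
conn.commit()

# unique_emails.txt is then just the de-duplicated union of the emails column.
rows = conn.execute("SELECT emails FROM papers WHERE emails IS NOT NULL").fetchall()
unique = sorted({e.strip() for (field,) in rows for e in field.split(",") if e.strip()})
with open("data/unique_emails.txt", "w") as f:
    f.write("\n".join(unique))
conn.close()
```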
The tool implements rate limiting to comply with arXiv's API guidelines (a simplified sketch follows this list):
- Adaptive delays based on success/failure
- Automatic retry system for failed downloads
- Smart backoff for newer papers
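In practice this kind of pacing amounts to a fixed delay after each successful request, exponential backoff when a download fails, and a cap on retries. A simplified sketch with illustrative constants (not the tool's actual settings):

```python
import time
from typing import Optional

import requests

BASE_DELAY = 3.0   # seconds between successful requests
MAX_RETRIES = 5


def polite_get(url) -> Optional[requests.Response]:
    """Fetch a URL with steady pacing and exponential backoff on failure."""
    delay = BASE_DELAY
    for _ in range(MAX_RETRIES):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            time.sleep(BASE_DELAY)   # steady pacing after a success
            return response
        except requests.RequestException:
            time.sleep(delay)        # wait longer after each failure
            delay *= 2
    return None                      # caller can schedule a later retry
```

Downloads that still fail after the cap can be recorded in the database's retry fields and revisited later, which is presumably what process_remaining.py is for.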
When using this tool, please:
- Respect rate limits and terms of service
- Handle collected email addresses responsibly
- Consider privacy implications
- Follow applicable data protection regulations
- Use the data only for legitimate academic/research purposes
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Please use this tool responsibly and in accordance with arXiv's terms of service and any applicable privacy laws and regulations. The tool is provided "as is" without warranty of any kind.