A Python-based tool that crawls arXiv papers to extract author email addresses. This tool is useful for researchers and academics who need to compile contact information for authors in specific research areas.
This tool helps collect email addresses from publicly available academic papers. Although the data is public, keep the following in mind:
- The collected data (emails, database) is not included in this repository
- Users should respect privacy and data protection regulations when using this tool
- Consider the ethical implications and use the tool responsibly
Key features:
- Search arXiv papers using custom queries (the search step is sketched below)
- Download and process PDFs automatically
- Extract email addresses from paper PDFs
- Store results in a SQLite database
- Export results to CSV and text files
- Rate-limiting compliance with arXiv's API guidelines
- Robust error handling and logging
- Multiple platform support (Local, Google Colab, Kaggle)
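Under the hood, the search step queries arXiv's public Atom API and parses the feed with feedparser. Below is a minimal sketch of that step, assuming the standard API endpoint; the function name and result format are illustrative, not the exact code in the generated notebooks:

```python
import urllib.parse

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"


def search_arxiv(query, start=0, max_results=10):
    """Return (arxiv_id, title, pdf_url) tuples for one search query."""
    params = urllib.parse.urlencode({
        "search_query": query,
        "start": start,
        "max_results": max_results,
    })
    feed = feedparser.parse(f"{ARXIV_API}?{params}")
    results = []
    for entry in feed.entries:
        # entry.id looks like http://arxiv.org/abs/2301.00001v1
        arxiv_id = entry.id.rsplit("/", 1)[-1]
        pdf_url = entry.id.replace("/abs/", "/pdf/")
        results.append((arxiv_id, entry.title, pdf_url))
    return results


if __name__ == "__main__":
    for paper in search_arxiv("cat:cs.MA", max_results=5):
        print(paper)
```

Each result can then be recorded in the SQLite database before its PDF is fetched and scanned.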
Project structure:
```
arxiv_parser/
├── main.py # Main script that generates notebooks
├── process_remaining.py # Script for processing remaining papers
├── notebooks/ # Generated notebook versions
│ ├── arxiv_email_crawler.ipynb # Local Jupyter version
│ ├── arxiv_email_crawler_colab.ipynb # Google Colab version
│ └── arxiv_email_crawler_kaggle.ipynb # Kaggle version
├── data/ # Directory for database and output files (not tracked in git)
│ ├── papers.db # SQLite database (generated)
│ ├── papers_with_emails.csv # Exported results (generated)
│ └── unique_emails.txt # List of unique emails (generated)
└── requirements.txt # Python dependencies
```
Requirements:
- Python 3.7+
- Dependencies (automatically installed):
  - feedparser==6.0.10
  - requests==2.31.0
  - pdfplumber==0.10.3
  - jupyter==1.0.0 (for local notebook usage)
Installation:
- Clone this repository:
```bash
git clone https://github.com/yourusername/arxiv-email-crawler.git
cd arxiv-email-crawler
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Generate notebooks:
```bash
python main.py
```
- Start collecting data, using either the local Jupyter notebook or the processing script (the extraction step is sketched after these commands):
```bash
# Option 1: Use the Jupyter notebook
jupyter notebook notebooks/arxiv_email_crawler.ipynb

# Option 2: Use the processing script directly
python process_remaining.py
```
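The extraction step boils down to downloading each paper's PDF and scanning its text for email-like strings. Here is a rough sketch under those assumptions, using pdfplumber for text extraction and a simple regex; the function name and pattern are illustrative, not the exact logic in process_remaining.py:

```python
import io
import re

import pdfplumber
import requests

# Simple pattern for email-like strings.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_from_pdf(pdf_url):
    """Download one PDF and return the set of email addresses found in it."""
    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()
    emails = set()
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # Author emails almost always appear on the first couple of pages.
        for page in pdf.pages[:2]:
            text = page.extract_text() or ""
            emails.update(EMAIL_RE.findall(text))
    return emails
```

Obfuscated addresses (e.g. "name at example dot org") will slip past a pattern like this, which is one reason results should be spot-checked.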
To run on Google Colab:
- Upload notebooks/arxiv_email_crawler_colab.ipynb to Google Drive
- Open it with Google Colab
- Mount your Google Drive when prompted
- Run the cells sequentially
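Drive is mounted presumably so that the database and exported files persist between Colab sessions. The mount itself looks like this; the data path below is a hypothetical example, not necessarily the one the notebook uses:

```python
from google.colab import drive

drive.mount("/content/drive")

# Hypothetical location for the database and exports on your Drive.
DATA_DIR = "/content/drive/MyDrive/arxiv_parser/data"
```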
To run on Kaggle:
- Upload notebooks/arxiv_email_crawler_kaggle.ipynb to Kaggle
- Create a new notebook from this file
- Run the cells sequentially
The crawler can be configured by modifying the search queries in the notebook:
```python
search_queries = [
    "all:AI AND all:agent",
    "cat:cs.AI AND all:agent",
    "cat:cs.MA",                 # Multi-agent systems
    "cat:cs.AI AND all:LLM",
]
```
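Each entry uses standard arXiv API query syntax: all: matches any field, cat: restricts results to a subject category, and AND combines terms. Because queries can overlap, the same paper may be returned more than once; below is a minimal sketch of crawling the query list with in-memory de-duplication, reusing the hypothetical search_arxiv helper sketched earlier (the actual notebooks may instead rely on the database's primary key):

```python
seen = set()
papers = []
for query in search_queries:
    for arxiv_id, title, pdf_url in search_arxiv(query, max_results=100):
        if arxiv_id not in seen:  # the same paper can match several queries
            seen.add(arxiv_id)
            papers.append((arxiv_id, title, pdf_url))
```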
The tool generates several files in the data/ directory (not tracked in git); a sketch of the assumed database layout follows this list:
- papers.db: SQLite database containing:
  - Paper metadata
  - Processing status
  - Extracted emails
  - Retry information
- papers_with_emails.csv: CSV export containing:
  - Paper details
  - Associated email addresses
  - Publication information
- unique_emails.txt: Simple text file with unique email addresses
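The exact schema lives in the generated notebooks; the fields described above map onto a table roughly like the one below (the table name, column names, and comma-separated email format are assumptions):

```python
import sqlite3

conn = sqlite3.connect("data/papers.db")

# Hypothetical schema matching the fields described above.
conn.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        arxiv_id    TEXT PRIMARY KEY,   -- paper metadata
        title       TEXT,
        published   TEXT,
        pdf_url     TEXT,
        processed   INTEGER DEFAULT 0,  -- processing status
        retry_count INTEGER DEFAULT 0,  -- retry information
        emails      TEXT                -- extracted emails, e.g. comma-separated
    )
""")
conn.commit()

# unique_emails.txt is then just the de-duplicated union of the emails column.
rows = conn.execute("SELECT emails FROM papers WHERE emails IS NOT NULL").fetchall()
unique = sorted({e.strip() for (field,) in rows for e in field.split(",") if e.strip()})
with open("data/unique_emails.txt", "w") as f:
    f.write("\n".join(unique))
conn.close()
```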
The tool implements rate limiting to comply with arXiv's API guidelines (a simplified sketch follows this list):
- Adaptive delays based on success/failure
- Automatic retry system for failed downloads
- Smart backoff for newer papers
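In practice this kind of pacing amounts to a fixed delay after each successful request, exponential backoff when a download fails, and a cap on retries. A simplified sketch with illustrative constants (not the tool's actual settings):

```python
import time
from typing import Optional

import requests

BASE_DELAY = 3.0   # seconds between successful requests
MAX_RETRIES = 5


def polite_get(url) -> Optional[requests.Response]:
    """Fetch a URL with steady pacing and exponential backoff on failure."""
    delay = BASE_DELAY
    for _ in range(MAX_RETRIES):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            time.sleep(BASE_DELAY)   # steady pacing after a success
            return response
        except requests.RequestException:
            time.sleep(delay)        # wait longer after each failure
            delay *= 2
    return None                      # caller can schedule a later retry
```

Downloads that still fail after the cap can be recorded in the database's retry fields and revisited later, which is presumably what process_remaining.py is for.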
When using this tool, please:
- Respect rate limits and terms of service
- Handle collected email addresses responsibly
- Consider privacy implications
- Follow applicable data protection regulations
- Use the data only for legitimate academic/research purposes
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Please use this tool responsibly and in accordance with arXiv's terms of service and any applicable privacy laws and regulations. The tool is provided "as is" without warranty of any kind.