arXiv Email Crawler

A Python-based tool that crawls arXiv papers to extract author email addresses. This tool is useful for researchers and academics who need to compile contact information for authors in specific research areas.

Important Note on Data Privacy

This tool helps collect email addresses from public academic papers. While the data is publicly available:

  • The collected data (emails, database) is not included in this repository
  • Users should respect privacy and data protection regulations when using this tool
  • Consider the ethical implications and use the tool responsibly

Features

  • Search arXiv papers using custom queries
  • Download and process PDFs automatically
  • Extract email addresses from paper PDFs
  • Store results in a SQLite database
  • Export results to CSV and text files
  • Rate-limiting compliance with arXiv's API guidelines
  • Robust error handling and logging
  • Runs on multiple platforms (local Jupyter, Google Colab, Kaggle)
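The email-extraction feature can be sketched as follows. This is an illustrative example, not the project's actual code: the regex and the extract_emails helper are assumptions, and real author lines (e.g. grouped forms like "{alice,bob}@example.edu") need extra handling beyond this simple pattern.

```python
import re

# Simple pattern for plain addresses like "name@example.edu".
# Illustrative only; the project's actual pattern may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[str]:
    """Return the unique email addresses found in extracted PDF text."""
    return sorted(set(EMAIL_RE.findall(text)))
```

In the full pipeline, `text` would come from a PDF-text library such as pdfplumber (listed in the requirements below).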

Project Structure

arxiv_parser/
├── main.py              # Main script that generates notebooks
├── process_remaining.py # Script for processing remaining papers
├── notebooks/          # Generated notebook versions
│   ├── arxiv_email_crawler.ipynb        # Local Jupyter version
│   ├── arxiv_email_crawler_colab.ipynb  # Google Colab version
│   └── arxiv_email_crawler_kaggle.ipynb # Kaggle version
├── data/              # Directory for database and output files (not tracked in git)
│   ├── papers.db     # SQLite database (generated)
│   ├── papers_with_emails.csv  # Exported results (generated)
│   └── unique_emails.txt       # List of unique emails (generated)
└── requirements.txt   # Python dependencies

Requirements

  • Python 3.7+
  • Dependencies (automatically installed):
    • feedparser==6.0.10
    • requests==2.31.0
    • pdfplumber==0.10.3
    • jupyter==1.0.0 (for local notebook usage)

Installation & Usage

Local Usage

  1. Clone this repository:
git clone https://github.com/yourusername/arxiv-email-crawler.git
cd arxiv-email-crawler
  2. Install dependencies:
pip install -r requirements.txt
  3. Generate notebooks:
python main.py
  4. Start collecting data:
# Option 1: Use the Jupyter notebook
jupyter notebook notebooks/arxiv_email_crawler.ipynb

# Option 2: Use the processing script directly
python process_remaining.py

Google Colab Usage

  1. Upload notebooks/arxiv_email_crawler_colab.ipynb to Google Drive
  2. Open with Google Colab
  3. Mount your Google Drive when prompted
  4. Run the cells sequentially

Kaggle Usage

  1. Upload notebooks/arxiv_email_crawler_kaggle.ipynb to Kaggle
  2. Create a new notebook from this file
  3. Run the cells sequentially

Configuration

The crawler can be configured by modifying the search queries in the notebook:

search_queries = [
    "all:AI AND all:agent",
    "cat:cs.AI AND all:agent",
    "cat:cs.MA",  # Multi-agent systems
    "cat:cs.AI AND all:LLM"
]

Generated Files

The tool will generate several files in the data/ directory (not tracked in git):

  1. papers.db: SQLite database containing:

    • Paper metadata
    • Processing status
    • Extracted emails
    • Retry information
  2. papers_with_emails.csv: CSV export containing:

    • Paper details
    • Associated email addresses
    • Publication information
  3. unique_emails.txt: Simple text file with unique email addresses
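A plausible minimal layout for papers.db, covering the fields listed above. This schema is an assumption for illustration; the database generated by the tool may be organized differently.

```python
import sqlite3

# Hypothetical minimal schema; the real papers.db layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id TEXT PRIMARY KEY,
    title    TEXT,
    status   TEXT DEFAULT 'pending',   -- processing status
    retries  INTEGER DEFAULT 0,        -- retry information
    emails   TEXT                      -- extracted emails, comma-separated
);
"""

def open_db(path: str = "data/papers.db") -> sqlite3.Connection:
    """Open (and if needed initialize) the papers database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```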

Rate Limiting

The tool implements appropriate rate limiting to comply with arXiv's API guidelines:

  • Adaptive delays based on success/failure
  • Automatic retry system for failed downloads
  • Smart backoff for newer papers
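The adaptive-delay idea can be illustrated with exponential backoff, a common way to satisfy it; backoff_delay is a hypothetical helper for illustration, not the project's actual implementation.

```python
def backoff_delay(base: float, failures: int, cap: float = 60.0) -> float:
    """Double the delay for each consecutive failure, up to a cap (seconds)."""
    return min(base * (2 ** failures), cap)

# After each failed download, sleep backoff_delay(base, failures) and retry;
# reset the failure count to zero on the next success.
```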

Data Privacy & Ethics

When using this tool, please:

  1. Respect rate limits and terms of service
  2. Handle collected email addresses responsibly
  3. Consider privacy implications
  4. Follow applicable data protection regulations
  5. Use the data only for legitimate academic/research purposes

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

Please use this tool responsibly and in accordance with arXiv's terms of service and any applicable privacy laws and regulations. The tool is provided "as is" without warranty of any kind.
