
Product Requirements Document (PRD)

Title: arXiv Email Crawler for AgentDomain Initiative

Version: 1.1
Date: [Insert Date]

1. Introduction

The arXiv Email Crawler is a system to harvest email addresses from AI and agent-related research papers on arXiv. Its purpose is to invite researchers to join the AgentDomain.xyz initiative and promote the .agent TLD. The MVP will be a local Python application running in a Jupyter notebook with a SQLite database. Long-term, the plan is to store data in Supabase for cloud-based scalability (not part of the MVP), with potential expansion into a hosted, open-source service.

2. Functional Requirements

2.1 arXiv API Interaction

  • Query the arXiv API with user-defined search terms (e.g., "AI", "agent").
  • Extract metadata from the API response, including:
    • Title
    • Authors (as a list)
    • Published date
    • DOI (if available)
    • PDF URL
    • Abstract
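
A minimal sketch of this query step, assuming the public arXiv Atom API and the feedparser package; the function name fetch_metadata and its parameters are illustrative, not a fixed interface:

```python
# Sketch only: query the arXiv Atom API and map each entry to the metadata
# fields listed above. Assumes `pip install feedparser`.
import urllib.parse

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_metadata(search_terms: str, max_results: int = 25) -> list[dict]:
    """Return one metadata dict per arXiv entry matching the search terms."""
    query = urllib.parse.urlencode({
        "search_query": f"all:{search_terms}",
        "start": 0,
        "max_results": max_results,
    })
    feed = feedparser.parse(f"{ARXIV_API}?{query}")
    papers = []
    for entry in feed.entries:
        # The PDF URL is the entry link whose title attribute is "pdf".
        pdf_link = next(
            (link.href for link in entry.links if link.get("title") == "pdf"), None
        )
        papers.append({
            "arxiv_id": entry.id.rsplit("/abs/", 1)[-1],
            "title": entry.title,
            "authors": [author.name for author in entry.authors],
            "published_date": entry.published,
            "pdf_link": pdf_link,
            "doi": entry.get("arxiv_doi"),  # only present when a DOI is registered
            "abstract": entry.summary,
        })
    return papers
```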

2.2 PDF Downloading and Parsing

  • Download PDFs from the extracted PDF URLs.
  • Extract text from PDFs using pdfplumber.
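
A minimal sketch of this step, assuming the requests and pdfplumber packages; the in-memory handling and timeout value are illustrative choices:

```python
# Sketch only: fetch a PDF and return its concatenated page text.
# Assumes `pip install requests pdfplumber`.
import io

import pdfplumber
import requests

def download_and_extract_text(pdf_url: str, timeout: int = 60) -> str:
    """Download the PDF at pdf_url and extract text from every page."""
    response = requests.get(pdf_url, timeout=timeout)
    response.raise_for_status()  # let the caller handle failed downloads
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # extract_text() can return None for image-only pages.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```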

2.3 Email Extraction (MVP)

  • Use regular expressions (regex) to identify email addresses in the PDF text.
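
A minimal regex-based sketch of the MVP extractor; the pattern is a pragmatic approximation and does not handle obfuscated addresses (e.g., "name at domain dot edu"):

```python
# Sketch only: find plain-text email addresses with a simple regex.
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[str]:
    """Return the unique email addresses in text, preserving first-seen order."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))
```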

2.4 Database Management (MVP)

  • Store metadata and emails in a local SQLite database (papers.db).
  • Track processed papers to avoid duplication.

Database schema:

  • arxiv_id (TEXT, PRIMARY KEY)
  • title (TEXT)
  • authors (TEXT, e.g., comma-separated or JSON)
  • published_date (TEXT)
  • pdf_link (TEXT)
  • doi (TEXT, nullable)
  • abstract (TEXT)
  • emails (TEXT, e.g., comma-separated or JSON)
  • processed (INTEGER, 0 = unprocessed, 1 = processed)
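
One possible translation of this schema into SQLite DDL, using the stdlib sqlite3 module and the data/papers.db path from section 4; the table name papers is an assumption:

```python
# Sketch only: create the papers table if it does not already exist.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id       TEXT PRIMARY KEY,
    title          TEXT,
    authors        TEXT,               -- comma-separated or JSON-encoded list
    published_date TEXT,
    pdf_link       TEXT,
    doi            TEXT,               -- nullable
    abstract       TEXT,
    emails         TEXT,               -- comma-separated or JSON-encoded list
    processed      INTEGER DEFAULT 0   -- 0 = unprocessed, 1 = processed
)
"""

def init_db(path: str = "data/papers.db") -> sqlite3.Connection:
    """Open (or create) the database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```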

2.5 Main Logic

Run in a Jupyter notebook to:

  • Query the arXiv API.
  • Add new papers to the database.
  • Download and parse PDFs for unprocessed papers.
  • Extract emails and update the database.
  • Mark papers as processed.
  • Enforce a 20-second delay between PDF downloads to comply with arXiv's rate limits.
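
A minimal sketch of this loop, wiring together the illustrative helpers from the sketches above (fetch_metadata, download_and_extract_text, extract_emails, init_db); error handling and the 20-second delay are included as described:

```python
# Sketch only: the notebook's end-to-end loop. Assumes the helper functions
# sketched in sections 2.1-2.4 are importable from the utils/ modules.
import json
import time

def run(search_terms: str, db_path: str = "data/papers.db") -> None:
    conn = init_db(db_path)

    # Add new papers; INSERT OR IGNORE skips arxiv_ids already in the database.
    for paper in fetch_metadata(search_terms):
        conn.execute(
            "INSERT OR IGNORE INTO papers "
            "(arxiv_id, title, authors, published_date, pdf_link, doi, abstract) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (paper["arxiv_id"], paper["title"], json.dumps(paper["authors"]),
             paper["published_date"], paper["pdf_link"], paper["doi"],
             paper["abstract"]),
        )
    conn.commit()

    # Download, parse, and extract emails for papers not yet processed.
    rows = conn.execute(
        "SELECT arxiv_id, pdf_link FROM papers WHERE processed = 0"
    ).fetchall()
    for arxiv_id, pdf_link in rows:
        try:
            emails = extract_emails(download_and_extract_text(pdf_link))
            conn.execute(
                "UPDATE papers SET emails = ?, processed = 1 WHERE arxiv_id = ?",
                (json.dumps(emails), arxiv_id),
            )
            conn.commit()
            print(f"{arxiv_id}: found {len(emails)} email(s)")
        except Exception as exc:  # e.g., a failed download; keep the loop alive
            print(f"Skipping {arxiv_id}: {exc}")
        time.sleep(20)  # respect arXiv's crawl-delay policy between downloads
```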

3. Non-Functional Requirements

3.1 Performance

  • Process papers efficiently on a local machine while respecting arXiv's crawl-delay policy.
  • Handle errors (e.g., failed downloads) without crashing.

3.2 Usability

  • Provide a single, easy-to-run Jupyter notebook with clear outputs for progress tracking.
  • Require minimal setup (e.g., install dependencies via pip).

3.3 Scalability

  • Design the database schema and code to support a future transition to Supabase.
  • Keep the structure modular for easy enhancements.

4. System Architecture (MVP)

  • main.ipynb: Orchestrates the workflow.
  • utils/ (directory for modules):
    • arxiv_api.py: Queries arXiv API and parses metadata.
    • pdf_handler.py: Downloads PDFs and extracts text.
    • email_extractor.py: Extracts emails with regex.
    • db_manager.py: Manages SQLite database.
  • data/ (directory):
    • papers.db: SQLite database file.

5. Data Flow (MVP)

  1. User inputs search terms in main.ipynb.
  2. System queries arXiv API and retrieves metadata.
  3. Checks SQLite for unprocessed papers.
  4. For each unprocessed paper:
    • Downloads the PDF.
    • Extracts text.
    • Finds emails.
    • Updates the database.
    • Marks as processed.
    • Waits 20 seconds between PDF downloads.

6. Future Enhancements

6.1 Database Migration

  • Move from SQLite to Supabase for cloud storage (post-MVP).

6.2 AI Enhancements

  • Use an AI model (e.g., Google Gemini Flash2) for better email extraction, affiliations, or paper summaries.

6.3 Hosted Service

  • Deploy as a continuous, cloud-based service (e.g., via Docker).

6.4 Open-Source

  • Share on GitHub with documentation for community contributions.

6.5 Additional Features

  • Store author affiliations.
  • Deduplicate emails across papers (see the sketch after this list).
  • Add email-sending capabilities.
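
For the deduplication item above, a possible post-MVP query, assuming the emails column stores a JSON-encoded list as in the schema sketch from section 2.4:

```python
# Sketch only: collect the distinct email addresses stored across all papers.
import json
import sqlite3

def unique_emails(db_path: str = "data/papers.db") -> set[str]:
    """Return the set of distinct emails found across every processed paper."""
    conn = sqlite3.connect(db_path)
    emails: set[str] = set()
    for (raw,) in conn.execute("SELECT emails FROM papers WHERE emails IS NOT NULL"):
        emails.update(json.loads(raw))
    return emails
```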

7. User Stories

  • As a user, I want to run the system locally to collect emails from arXiv papers.
  • As a user, I want processed papers tracked to avoid redundant work.
  • As a user, I want a foundation that supports moving to Supabase later.

Cursor Rules for AI Assistance

These rules guide AI behavior in Cursor Builder for coding the arXiv Email Crawler. They're tailored to the MVP's structure and stored in .cursor/rules/.

Project Structure

  • main.ipynb: Main logic.
  • utils/:
    • arxiv_api.py: API queries.
    • pdf_handler.py: PDF handling.
    • email_extractor.py: Email extraction.
    • db_manager.py: Database operations.
  • data/:
    • papers.db: SQLite database.