
Product Requirements Document (PRD)

Title: arXiv Email Crawler for AgentDomain Initiative

Version: 1.1
Date: [Insert Date]

1. Introduction

The arXiv Email Crawler is a system to harvest email addresses from AI and agent-related research papers on arXiv. Its purpose is to invite researchers to join the AgentDomain.xyz initiative and promote the .agent TLD. The MVP will be a local Python application running in a Jupyter notebook with a SQLite database. Long-term, the plan is to store data in Supabase for cloud-based scalability (not part of the MVP), with potential expansion into a hosted, open-source service.

2. Functional Requirements

2.1 arXiv API Interaction

  • Query the arXiv API with user-defined search terms (e.g., "AI", "agent").
  • Extract metadata from the API response, including:
    • Title
    • Authors (as a list)
    • Published date
    • DOI (if available)
    • PDF URL
    • Abstract
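
A minimal sketch of this query step, assuming the public arXiv Atom API and the feedparser package; the function name fetch_metadata and its parameters are illustrative, not a fixed interface:

```python
# Sketch only: query the arXiv Atom API and map each entry to the metadata
# fields listed above. Assumes `pip install feedparser`.
import urllib.parse

import feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_metadata(search_terms: str, max_results: int = 25) -> list[dict]:
    """Return one metadata dict per arXiv entry matching the search terms."""
    query = urllib.parse.urlencode({
        "search_query": f"all:{search_terms}",
        "start": 0,
        "max_results": max_results,
    })
    feed = feedparser.parse(f"{ARXIV_API}?{query}")
    papers = []
    for entry in feed.entries:
        # The PDF URL is the entry link whose title attribute is "pdf".
        pdf_link = next(
            (link.href for link in entry.links if link.get("title") == "pdf"), None
        )
        papers.append({
            "arxiv_id": entry.id.rsplit("/abs/", 1)[-1],
            "title": entry.title,
            "authors": [author.name for author in entry.authors],
            "published_date": entry.published,
            "pdf_link": pdf_link,
            "doi": entry.get("arxiv_doi"),  # only present when a DOI is registered
            "abstract": entry.summary,
        })
    return papers
```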

2.2 PDF Downloading and Parsing

  • Download PDFs from the extracted PDF URLs.
  • Extract text from PDFs using pdfplumber.
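
A minimal sketch of this step, assuming the requests and pdfplumber packages; the in-memory handling and timeout value are illustrative choices:

```python
# Sketch only: fetch a PDF and return its concatenated page text.
# Assumes `pip install requests pdfplumber`.
import io

import pdfplumber
import requests

def download_and_extract_text(pdf_url: str, timeout: int = 60) -> str:
    """Download the PDF at pdf_url and extract text from every page."""
    response = requests.get(pdf_url, timeout=timeout)
    response.raise_for_status()  # let the caller handle failed downloads
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # extract_text() can return None for image-only pages.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```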

2.3 Email Extraction (MVP)

  • Use regular expressions (regex) to identify email addresses in the PDF text.
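
A minimal regex-based sketch of the MVP extractor; the pattern is a pragmatic approximation and does not handle obfuscated addresses (e.g., "name at domain dot edu"):

```python
# Sketch only: find plain-text email addresses with a simple regex.
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[str]:
    """Return the unique email addresses in text, preserving first-seen order."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))
```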

2.4 Database Management (MVP)

  • Store metadata and emails in a local SQLite database (papers.db).
  • Track processed papers to avoid duplication.

Database schema:

  • arxiv_id (TEXT, PRIMARY KEY)
  • title (TEXT)
  • authors (TEXT, e.g., comma-separated or JSON)
  • published_date (TEXT)
  • pdf_link (TEXT)
  • doi (TEXT, nullable)
  • abstract (TEXT)
  • emails (TEXT, e.g., comma-separated or JSON)
  • processed (INTEGER, 0 = unprocessed, 1 = processed)
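
One possible translation of this schema into SQLite DDL, using the stdlib sqlite3 module and the data/papers.db path from section 4; the table name papers is an assumption:

```python
# Sketch only: create the papers table if it does not already exist.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id       TEXT PRIMARY KEY,
    title          TEXT,
    authors        TEXT,               -- comma-separated or JSON-encoded list
    published_date TEXT,
    pdf_link       TEXT,
    doi            TEXT,               -- nullable
    abstract       TEXT,
    emails         TEXT,               -- comma-separated or JSON-encoded list
    processed      INTEGER DEFAULT 0   -- 0 = unprocessed, 1 = processed
)
"""

def init_db(path: str = "data/papers.db") -> sqlite3.Connection:
    """Open (or create) the database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```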

2.5 Main Logic

Run in a Jupyter notebook to:

  • Query the arXiv API.
  • Add new papers to the database.
  • Download and parse PDFs for unprocessed papers.
  • Extract emails and update the database.
  • Mark papers as processed.
  • Enforce a 20-second delay between PDF downloads to comply with arXiv's rate limits.
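
A minimal sketch of this loop, wiring together the illustrative helpers from the sketches above (fetch_metadata, download_and_extract_text, extract_emails, init_db); error handling and the 20-second delay are included as described:

```python
# Sketch only: the notebook's end-to-end loop. Assumes the helper functions
# sketched in sections 2.1-2.4 are importable from the utils/ modules.
import json
import time

def run(search_terms: str, db_path: str = "data/papers.db") -> None:
    conn = init_db(db_path)

    # Add new papers; INSERT OR IGNORE skips arxiv_ids already in the database.
    for paper in fetch_metadata(search_terms):
        conn.execute(
            "INSERT OR IGNORE INTO papers "
            "(arxiv_id, title, authors, published_date, pdf_link, doi, abstract) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (paper["arxiv_id"], paper["title"], json.dumps(paper["authors"]),
             paper["published_date"], paper["pdf_link"], paper["doi"],
             paper["abstract"]),
        )
    conn.commit()

    # Download, parse, and extract emails for papers not yet processed.
    rows = conn.execute(
        "SELECT arxiv_id, pdf_link FROM papers WHERE processed = 0"
    ).fetchall()
    for arxiv_id, pdf_link in rows:
        try:
            emails = extract_emails(download_and_extract_text(pdf_link))
            conn.execute(
                "UPDATE papers SET emails = ?, processed = 1 WHERE arxiv_id = ?",
                (json.dumps(emails), arxiv_id),
            )
            conn.commit()
            print(f"{arxiv_id}: found {len(emails)} email(s)")
        except Exception as exc:  # e.g., a failed download; keep the loop alive
            print(f"Skipping {arxiv_id}: {exc}")
        time.sleep(20)  # respect arXiv's crawl-delay policy between downloads
```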

3. Non-Functional Requirements

3.1 Performance

  • Process papers efficiently on a local machine while respecting arXiv's crawl-delay policy.
  • Handle errors (e.g., failed downloads) without crashing.

3.2 Usability

  • Provide a single, easy-to-run Jupyter notebook with clear outputs for progress tracking.
  • Require minimal setup (e.g., install dependencies via pip).

3.3 Scalability

  • Design the database schema and code to support a future transition to Supabase.
  • Keep the structure modular for easy enhancements.

4. System Architecture (MVP)

  • main.ipynb: Orchestrates the workflow.
  • utils/ (directory for modules):
    • arxiv_api.py: Queries arXiv API and parses metadata.
    • pdf_handler.py: Downloads PDFs and extracts text.
    • email_extractor.py: Extracts emails with regex.
    • db_manager.py: Manages SQLite database.
  • data/ (directory):
    • papers.db: SQLite database file.

5. Data Flow (MVP)

  1. User inputs search terms in main.ipynb.
  2. System queries arXiv API and retrieves metadata.
  3. Checks SQLite for unprocessed papers.
  4. For each unprocessed paper:
    • Downloads the PDF.
    • Extracts text.
    • Finds emails.
    • Updates the database.
    • Marks as processed.
    • Waits 20 seconds between PDF downloads.

6. Future Enhancements

6.1 Database Migration

  • Move from SQLite to Supabase for cloud storage (post-MVP).

6.2 AI Enhancements

  • Use an AI model (e.g., Google Gemini Flash2) for better email extraction, affiliations, or paper summaries.

6.3 Hosted Service

  • Deploy as a continuous, cloud-based service (e.g., via Docker).

6.4 Open-Source

  • Share on GitHub with documentation for community contributions.

6.5 Additional Features

  • Store author affiliations.
  • Deduplicate emails across papers (see the sketch after this list).
  • Add email-sending capabilities.
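
For the deduplication item above, a possible post-MVP query, assuming the emails column stores a JSON-encoded list as in the schema sketch from section 2.4:

```python
# Sketch only: collect the distinct email addresses stored across all papers.
import json
import sqlite3

def unique_emails(db_path: str = "data/papers.db") -> set[str]:
    """Return the set of distinct emails found across every processed paper."""
    conn = sqlite3.connect(db_path)
    emails: set[str] = set()
    for (raw,) in conn.execute("SELECT emails FROM papers WHERE emails IS NOT NULL"):
        emails.update(json.loads(raw))
    return emails
```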

7. User Stories

  • As a user, I want to run the system locally to collect emails from arXiv papers.
  • As a user, I want processed papers tracked to avoid redundant work.
  • As a user, I want a foundation that supports moving to Supabase later.

Cursor Rules for AI Assistance

These rules guide AI behavior in Cursor Builder for coding the arXiv Email Crawler. They're tailored to the MVP's structure and stored in .cursor/rules/.

Project Structure

  • main.ipynb: Main logic.
  • utils/:
    • arxiv_api.py: API queries.
    • pdf_handler.py: PDF handling.
    • email_extractor.py: Email extraction.
    • db_manager.py: Database operations.
  • data/:
    • papers.db: SQLite database.