Commit 960692d

user authored and user committed
Remove data files from git tracking for privacy reasons and update documentation
1 parent 4a099c1 commit 960692d

8 files changed: +248 -205 lines changed
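This commit untracks the `data/` directory while leaving the files on disk. A minimal sketch of the usual workflow, using only paths visible in this diff (the exact commands run are not recorded anywhere in the commit):

```bash
# Drop the data files from the index but keep them in the working tree
git rm -r --cached data/

# With the new .gitignore (added below) in place, commit the removal
git commit -m "Remove data files from git tracking for privacy reasons and update documentation"
```

Note that `git rm --cached` only affects future commits; the files remain in earlier history unless it is rewritten (e.g. with `git filter-repo`).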

.gitignore (new file, +46)
@@ -0,0 +1,46 @@
+# Data files
+data/
+*.db
+*.csv
+*.txt
+!requirements.txt
+!README.txt
+
+# OS files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Jupyter Notebook
+.ipynb_checkpoints
+*.ipynb
+
+# Logs
+*.log
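The `!requirements.txt` and `!README.txt` entries re-include files that the blanket `*.txt` rule would otherwise ignore. A quick way to confirm the rules resolve as intended (`notes.txt` is a hypothetical path):

```bash
# -v prints the .gitignore pattern that decides each path; a pattern
# beginning with "!" means the path is re-included rather than ignored.
git check-ignore -v data/papers.db notes.txt requirements.txt
```

Here `data/papers.db` and `notes.txt` should match the `data/` and `*.txt` rules, while `requirements.txt` should match the negated `!requirements.txt` rule and so remain trackable.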

README.md (+46, -13)
@@ -2,6 +2,13 @@
 
 A Python-based tool that crawls arXiv papers to extract author email addresses. This tool is useful for researchers and academics who need to compile contact information for authors in specific research areas.
 
+## Important Note on Data Privacy
+
+This tool helps collect email addresses from public academic papers. While the data is publicly available:
+- The collected data (emails, database) is not included in this repository
+- Users should respect privacy and data protection regulations when using this tool
+- Consider the ethical implications and use the tool responsibly
+
 ## Features
 
 - Search arXiv papers using custom queries
@@ -18,15 +25,16 @@ A Python-based tool that crawls arXiv papers to extract author email addresses.
 ```
 arxiv_parser/
 ├── main.py                # Main script that generates notebooks
-├── notebooks/             # Generated notebook versions
+├── process_remaining.py   # Script for processing remaining papers
+├── notebooks/             # Generated notebook versions
 │   ├── arxiv_email_crawler.ipynb        # Local Jupyter version
 │   ├── arxiv_email_crawler_colab.ipynb  # Google Colab version
 │   └── arxiv_email_crawler_kaggle.ipynb # Kaggle version
-├── data/                  # Directory for database and output files
-│   ├── papers.db          # SQLite database
-│   ├── papers_with_emails.csv  # Exported results
-│   └── unique_emails.txt  # List of unique emails
-└── requirements.txt       # Python dependencies
+├── data/                  # Directory for database and output files (not tracked in git)
+│   ├── papers.db          # SQLite database (generated)
+│   ├── papers_with_emails.csv  # Exported results (generated)
+│   └── unique_emails.txt  # List of unique emails (generated)
+└── requirements.txt       # Python dependencies
 ```
 
 ## Requirements
@@ -58,9 +66,13 @@ pip install -r requirements.txt
 python main.py
 ```
 
-4. Run the local Jupyter notebook:
+4. Start collecting data:
 ```bash
+# Option 1: Use Jupyter notebook
 jupyter notebook notebooks/arxiv_email_crawler.ipynb
+
+# Option 2: Use the processing script directly
+python process_remaining.py
 ```
 
 ### Google Colab Usage
@@ -89,17 +101,38 @@ search_queries = [
 ]
 ```
 
-## Output Files
+## Generated Files
+
+The tool will generate several files in the `data/` directory (not tracked in git):
+
+1. `papers.db`: SQLite database containing:
+   - Paper metadata
+   - Processing status
+   - Extracted emails
+   - Retry information
+
+2. `papers_with_emails.csv`: CSV export containing:
+   - Paper details
+   - Associated email addresses
+   - Publication information
 
-1. `data/papers.db`: SQLite database containing all paper metadata and extracted emails
-2. `data/papers_with_emails.csv`: CSV file containing papers and their associated emails
-3. `data/unique_emails.txt`: Text file containing all unique email addresses
+3. `unique_emails.txt`: Simple text file with unique email addresses
 
 ## Rate Limiting
 
 The tool implements appropriate rate limiting to comply with arXiv's API guidelines:
-- 3-second delay between API queries
-- 20-second delay between PDF downloads
+- Adaptive delays based on success/failure
+- Automatic retry system for failed downloads
+- Smart backoff for newer papers
+
+## Data Privacy & Ethics
+
+When using this tool, please:
+1. Respect rate limits and terms of service
+2. Handle collected email addresses responsibly
+3. Consider privacy implications
+4. Follow applicable data protection regulations
+5. Use the data only for legitimate academic/research purposes
 
 ## Contributing
 
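The revised Rate Limiting bullets replace the old fixed 3-second and 20-second delays with adaptive ones. The Python implementation is not part of this commit; purely as an illustration of the backoff idea (the query URL and retry count are placeholders):

```bash
# Illustrative exponential backoff, not the tool's actual code:
# start from a polite base delay and double it after each failure.
delay=3
for attempt in 1 2 3 4 5; do
  if curl -fsS "http://export.arxiv.org/api/query?search_query=all:electron" -o result.xml; then
    break                 # success: stop retrying
  fi
  echo "attempt $attempt failed; sleeping ${delay}s" >&2
  sleep "$delay"
  delay=$((delay * 2))    # failure: back off
done
```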
data/.DS_Store (-6 KB)
Binary file not shown.

data/papers.db (-19.6 MB)
Binary file not shown.
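Since `papers.db` is no longer shipped, users regenerate it locally by running the crawler. A quick way to inspect a freshly generated database; the table and column names here are assumptions, since the schema is not shown in this commit:

```bash
# List the tables, then count rows with extracted emails
# ("papers" and "emails" are hypothetical names).
sqlite3 data/papers.db ".tables"
sqlite3 data/papers.db "SELECT COUNT(*) FROM papers WHERE emails IS NOT NULL;"
```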

data/papers_with_emails.csv (-64 lines)
This file was deleted.

data/unique_emails.txt (-107 lines)
This file was deleted.
