Commit 960692d

user authored and user committed
Remove data files from git tracking for privacy reasons and update documentation
1 parent 4a099c1 commit 960692d

8 files changed: +248 -205 lines changed
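This commit untracks the `data/` directory while leaving the files on disk. A minimal sketch of the usual workflow, using only paths visible in this diff (the exact commands run are not recorded anywhere in the commit):

```bash
# Drop the data files from the index but keep them in the working tree
git rm -r --cached data/

# With the new .gitignore (added below) in place, commit the removal
git commit -m "Remove data files from git tracking for privacy reasons and update documentation"
```

Note that `git rm --cached` only affects future commits; the files remain in earlier history unless it is rewritten (e.g. with `git filter-repo`).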

.gitignore (new file, +46)
@@ -0,0 +1,46 @@
+# Data files
+data/
+*.db
+*.csv
+*.txt
+!requirements.txt
+!README.txt
+
+# OS files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Jupyter Notebook
+.ipynb_checkpoints
+*.ipynb
+
+# Logs
+*.log
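The `!requirements.txt` and `!README.txt` entries re-include files that the blanket `*.txt` rule would otherwise ignore. A quick way to confirm the rules resolve as intended (`notes.txt` is a hypothetical path):

```bash
# -v prints the .gitignore pattern that decides each path; a pattern
# beginning with "!" means the path is re-included rather than ignored.
git check-ignore -v data/papers.db notes.txt requirements.txt
```

Here `data/papers.db` and `notes.txt` should match the `data/` and `*.txt` rules, while `requirements.txt` should match the negated `!requirements.txt` rule and so remain trackable.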

README.md (+46, -13)
@@ -2,6 +2,13 @@
 
 A Python-based tool that crawls arXiv papers to extract author email addresses. This tool is useful for researchers and academics who need to compile contact information for authors in specific research areas.
 
+## Important Note on Data Privacy
+
+This tool helps collect email addresses from public academic papers. While the data is publicly available:
+- The collected data (emails, database) is not included in this repository
+- Users should respect privacy and data protection regulations when using this tool
+- Consider the ethical implications and use the tool responsibly
+
 ## Features
 
 - Search arXiv papers using custom queries
@@ -18,15 +25,16 @@ A Python-based tool that crawls arXiv papers to extract author email addresses.
 ```
 arxiv_parser/
 ├── main.py                # Main script that generates notebooks
-├── notebooks/             # Generated notebook versions
+├── process_remaining.py   # Script for processing remaining papers
+├── notebooks/             # Generated notebook versions
 │   ├── arxiv_email_crawler.ipynb        # Local Jupyter version
 │   ├── arxiv_email_crawler_colab.ipynb  # Google Colab version
 │   └── arxiv_email_crawler_kaggle.ipynb # Kaggle version
-├── data/                  # Directory for database and output files
-│   ├── papers.db          # SQLite database
-│   ├── papers_with_emails.csv  # Exported results
-│   └── unique_emails.txt  # List of unique emails
-└── requirements.txt       # Python dependencies
+├── data/                  # Directory for database and output files (not tracked in git)
+│   ├── papers.db          # SQLite database (generated)
+│   ├── papers_with_emails.csv  # Exported results (generated)
+│   └── unique_emails.txt  # List of unique emails (generated)
+└── requirements.txt       # Python dependencies
 ```
 
 ## Requirements
@@ -58,9 +66,13 @@ pip install -r requirements.txt
 python main.py
 ```
 
-4. Run the local Jupyter notebook:
+4. Start collecting data:
 ```bash
+# Option 1: Use Jupyter notebook
 jupyter notebook notebooks/arxiv_email_crawler.ipynb
+
+# Option 2: Use the processing script directly
+python process_remaining.py
 ```
 
 ### Google Colab Usage
@@ -89,17 +101,38 @@ search_queries = [
 ]
 ```
 
-## Output Files
+## Generated Files
+
+The tool will generate several files in the `data/` directory (not tracked in git):
+
+1. `papers.db`: SQLite database containing:
+   - Paper metadata
+   - Processing status
+   - Extracted emails
+   - Retry information
+
+2. `papers_with_emails.csv`: CSV export containing:
+   - Paper details
+   - Associated email addresses
+   - Publication information
 
-1. `data/papers.db`: SQLite database containing all paper metadata and extracted emails
-2. `data/papers_with_emails.csv`: CSV file containing papers and their associated emails
-3. `data/unique_emails.txt`: Text file containing all unique email addresses
+3. `unique_emails.txt`: Simple text file with unique email addresses
 
 ## Rate Limiting
 
 The tool implements appropriate rate limiting to comply with arXiv's API guidelines:
-- 3-second delay between API queries
-- 20-second delay between PDF downloads
+- Adaptive delays based on success/failure
+- Automatic retry system for failed downloads
+- Smart backoff for newer papers
+
+## Data Privacy & Ethics
+
+When using this tool, please:
+1. Respect rate limits and terms of service
+2. Handle collected email addresses responsibly
+3. Consider privacy implications
+4. Follow applicable data protection regulations
+5. Use the data only for legitimate academic/research purposes
 
 ## Contributing
 
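The revised Rate Limiting bullets replace the old fixed 3-second and 20-second delays with adaptive ones. The Python implementation is not part of this commit; purely as an illustration of the backoff idea (the query URL and retry count are placeholders):

```bash
# Illustrative exponential backoff, not the tool's actual code:
# start from a polite base delay and double it after each failure.
delay=3
for attempt in 1 2 3 4 5; do
  if curl -fsS "http://export.arxiv.org/api/query?search_query=all:electron" -o result.xml; then
    break                 # success: stop retrying
  fi
  echo "attempt $attempt failed; sleeping ${delay}s" >&2
  sleep "$delay"
  delay=$((delay * 2))    # failure: back off
done
```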
data/.DS_Store (-6 KB)
Binary file not shown.

data/papers.db (-19.6 MB)
Binary file not shown.
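Since `papers.db` is no longer shipped, users regenerate it locally by running the crawler. A quick way to inspect a freshly generated database; the table and column names here are assumptions, since the schema is not shown in this commit:

```bash
# List the tables, then count rows with extracted emails
# ("papers" and "emails" are hypothetical names).
sqlite3 data/papers.db ".tables"
sqlite3 data/papers.db "SELECT COUNT(*) FROM papers WHERE emails IS NOT NULL;"
```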

data/papers_with_emails.csv (-64 lines)
This file was deleted.

data/unique_emails.txt (-107 lines)
This file was deleted.
