Commit c7b05db

Author: Dirk Breeuwer (committed)

Refactor code into separate modules, add --target-string argument, and support larger HTML file processing

1 parent 8f4cf62 commit c7b05db

5 files changed: +191 -87 lines changed


README.md

Lines changed: 42 additions & 65 deletions
@@ -1,93 +1,70 @@
-# GPT-based Automated Web Scraper
+# AI Web Scraper

-![GPT based automated webscrapper](https://cdn.discordapp.com/attachments/984632500875821066/1104363425439698944/analyticsynthetic_Small_cute_mining_robot_with_large_eyes_5501ffb9-ea08-4dfc-b04d-9623f7c4481a.png "GPT based automated webscrapper")
+This project is an AI-powered web scraper that allows you to extract information from HTML sources based on user-defined requirements. It generates scraping code and executes it to retrieve the desired data.

-This GPT-based Universal Web Scraper is a project that allows users to automate web scraping effortlessly by leveraging GPT models to analyze website structures, user requirements, and network traffic, streamlining the data extraction process.
+## Prerequisites

-**Note**: The GPT prompt for analyzing API calls is still in progress and may not return accurate results at this time. We are working on improving the prompt to provide better analysis results.
+Before running the AI Web Scraper, ensure you have the following prerequisites installed:

-## Documentation
-
-Detailed information about the project can be found in the following documents:
-
-- [Technical Design Document (TDD)](tdd.md): The TDD provides a comprehensive overview of the system architecture, component design, and implementation details.
-- [Product Requirements Document (PRD)](prd.md): The PRD outlines the features, functionality, and requirements of the GPT-based Universal Web Scraper.
-
-## Main Components
-
-1. `gpt_interaction`: Handles communication with the GPT model and manages user interaction to gather scraping requirements.
-2. `scraper_generation`: Generates scraper code based on the results of the website structure analysis and user requirements.
-3. `url_preprocessing`: Handles URL validation, normalization, and cleaning tasks.
-4. `website_analysis`: Analyzes website DOM, identifies relevant elements, and detects APIs through network traffic analysis for data extraction.
-5. `data_extraction`: Executes the generated scraper and extracts data from the target website.
+- Python 3.x
+- The required Python packages specified in the `requirements.txt` file
+- An API key for OpenAI GPT-4

 ## Installation

-To install the project dependencies, run the following command:
+1. Clone the project repository:

-```
-pip install -r requirements.txt
-```
+```shell
+git clone https://github.com/dirkjbreeuwer/gpt-automated-web-scraper
+```

-Next, copy the `config.json.example` file to `config.json` and enter your GPT-4 API key in the `gpt4` section:
+2. Navigate to the project directory:

-```json
-{
-    "gpt4": {
-        "api_key": "your-api-key-here"
-    }
-}
-```
+```shell
+cd gpt-automated-web-scraper
+```

-## Usage
-
-You can analyze the network traffic of websites using the NetworkAnalyzer class provided in the `./website_analysis/network_analysis.py` file. Here's an example of how to use the class:
-
-```python
+3. Install the required Python packages:

-from website_analysis.network_analysis import NetworkAnalyzer
+```shell
+pip install -r requirements.txt
+```

-# URL of the website to analyze
-url = "https://www.example.com"
+4. Set up the OpenAI GPT-4 API key:
+
+   - Obtain an API key from OpenAI by following their documentation.
+   - Rename the `.env.example` file in the project directory to `.env`.
+   - Add the following line to the `.env` file, replacing `YOUR_API_KEY` with your actual API key:

-# User requirements for the data extraction (currently not used)
-user_requirements = {}
+```plaintext
+OPENAI_API_KEY=YOUR_API_KEY
+```

-# Create a NetworkAnalyzer instance
-analyzer = NetworkAnalyzer(url, user_requirements)
-
-# Analyze the website
-analysis_results = analyzer.analyze_website()
+## Usage

-# Print the analysis results
-print(analysis_results)
-```
+To use the AI Web Scraper, run the `gpt-scraper.py` script with the desired command-line arguments.

-You can also analyze multiple websites at once using the `analyze_websites` function provided in the same file. Just pass a list of website URLs as an argument:
+### Command-line Arguments

-```python
+The following command-line arguments are available:

-from website_analysis.network_analysis import analyze_websites
+- `--source`: The URL or local path to the HTML source to scrape.
+- `--source-type`: Type of the source. Specify either `"url"` or `"file"`.
+- `--requirements`: User-defined requirements for scraping.
+- `--target-string`: Due to the maximum token limit of GPT-4 (4k tokens), the AI model processes a smaller subset of the HTML where the desired data is located. The target string should be an example string that can be found within the website you want to scrape.

-# List of website URLs to analyze
-websites = [
-    "https://www.example1.com",
-    "https://www.example2.com",
-    "https://www.example3.com"
-]
+### Example Usage

-# Analyze the websites
-results = analyze_websites(websites)
+Here are some example commands for using the AI Web Scraper:

-# Print the analysis results
-print(results)
+```shell
+python3 gpt-scraper.py --source-type "url" --source "https://www.scrapethissite.com/pages/forms/" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"
 ```

+Replace the values for `--source`, `--requirements`, and `--target-string` with your specific values.

-## Testing

-Currently the project is still under development. This section will be updated once the project is ready for use.
+## License

-## Contributing
+This project is licensed under the [MIT License](LICENSE). Feel free to modify and use it according to your needs.

-We welcome contributions to improve the GPT-based Universal Web Scraper. Please feel free to submit issues, feature requests, and pull requests on the repository.
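
The `--target-string` option documented above narrows the HTML handed to the model, but the README does not show how the trimming works. Below is a minimal sketch of the idea, assuming a hypothetical `narrow_html` helper and an arbitrary character window; the commit's actual logic lives in `HtmlManager`, which is not shown in this diff.

```python
# Hypothetical illustration of the --target-string idea: keep only a window of
# HTML around the first occurrence of the target string so the prompt fits
# within GPT-4's context limit. The function name and window size are assumptions.
def narrow_html(html: str, target_string: str, window: int = 4000) -> str:
    index = html.find(target_string)
    if index == -1:
        # Target not found: fall back to the beginning of the document.
        return html[: 2 * window]
    start = max(0, index - window)
    end = min(len(html), index + len(target_string) + window)
    return html[start:end]
```

Any trimming strategy that keeps the prompt under the model's token limit would serve the same purpose; a character window is simply the easiest to illustrate.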

gpt-scraper.py

Lines changed: 11 additions & 10 deletions
@@ -2,33 +2,34 @@
 from langchain import PromptTemplate
 from langchain.llms import OpenAI
 import argparse
-from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader
+from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader, HtmlManager
 from scraper_generation.scraper_generator import ScrapingCodeGenerator, CodeWriter
 from data_extraction.data_extractor import CodeExecutor


 def main():
+    # Receive and parse arguments
     parser = argparse.ArgumentParser(description='AI Web Scraper')
     parser.add_argument('--source', type=str, help='The URL or local path to HTML to scrape')
     parser.add_argument('--source-type', type=str, choices=['url', 'file'], help='Type of the source: url or file')
     parser.add_argument('--requirements', type=str, help='The user requirements for scraping')
+    parser.add_argument('--target-string', type=str, help='An example string to guide the scraper')
     args = parser.parse_args()

     source = args.source
     source_type = args.source_type
     USER_REQUIREMENTS = args.requirements
+    target_string = args.target_string
+

-    # Create HtmlLoader or UrlHtmlLoader based on the source type
-    def create_html_loader(source, source_type):
-        if source_type == 'url':
-            return UrlHtmlLoader(source)
-        else: # source_type == 'file'
-            return HtmlLoader(source)
+    # Instantiate the HTML manager
+    manager = HtmlManager(source, source_type, target_string)

-    html_loader = create_html_loader(source, source_type)
+    # Load Processed HTML
+    processed_html = manager.process_html()

-    # Instantiate ScrapingCodeGenerator with the html_loader
-    code_generator = ScrapingCodeGenerator(html_loader, source=source, source_type=source_type)
+    # Instantiate ScrapingCodeGenerator with the processed_html
+    code_generator = ScrapingCodeGenerator(processed_html, source=source, source_type=source_type)

     # Generate scraping code
     scraping_code = code_generator.generate_scraping_code(USER_REQUIREMENTS)
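
The diff shows only how `HtmlManager` is called: it is constructed from the source, the source type, and the target string, and `process_html()` returns the trimmed HTML that is handed to `ScrapingCodeGenerator`. The class body is not part of this commit view; the following is a rough sketch consistent with those call sites, in which every method body (including the assumed `.load()` calls on the existing loaders) is an assumption rather than the project's actual implementation.

```python
# Hypothetical reconstruction of HtmlManager based only on the call sites in
# the diff above; the .load() calls and the trimming heuristic are assumptions.
from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader


class HtmlManager:
    def __init__(self, source, source_type, target_string):
        self.source = source
        self.source_type = source_type
        self.target_string = target_string

    def _load_html(self):
        # Reuse the loaders already imported by gpt-scraper.py.
        if self.source_type == 'url':
            return UrlHtmlLoader(self.source).load()
        return HtmlLoader(self.source).load()

    def process_html(self, window=4000):
        # Trim the document to a window around the target string so the
        # prompt stays within GPT-4's token limit (see --target-string).
        html = self._load_html()
        if self.target_string and self.target_string in html:
            index = html.index(self.target_string)
            return html[max(0, index - window):index + len(self.target_string) + window]
        return html
```

Whatever the real trimming looks like, the point of the refactor is that `process_html()` hands `ScrapingCodeGenerator` a string small enough for the model, which is what enables the larger HTML files mentioned in the commit message.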

requirements.txt

Lines changed: 80 additions & 0 deletions
@@ -1,38 +1,72 @@
 aiohttp==3.8.4
 aiosignal==1.3.1
+altair==4.2.2
 anyio==3.6.2
 argilla==1.7.0
+asttokens==2.2.1
 async-timeout==4.0.2
 attrs==23.1.0
+Automat==22.10.0
+backcall==0.2.0
 backoff==2.2.1
+backports.zoneinfo==0.2.1
+beautifulsoup4==4.12.2
+blinker==1.6.2
+cachetools==5.3.0
 certifi==2023.5.7
 cffi==1.15.1
 charset-normalizer==3.1.0
 chromadb==0.3.22
 click==8.1.3
 clickhouse-connect==0.5.24
 cmake==3.26.3
+comm==0.1.3
 commonmark==0.9.1
+constantly==15.1.0
+contourpy==1.0.7
 cryptography==40.0.2
+cssselect==1.2.0
+cycler==0.11.0
 dataclasses-json==0.5.7
+debugpy==1.6.7
+decorator==5.1.1
 Deprecated==1.2.13
 duckdb==0.7.1
+entrypoints==0.4
 et-xmlfile==1.1.0
+executing==1.2.0
+faiss-cpu==1.7.4
 fastapi==0.95.1
 filelock==3.12.0
+fonttools==4.39.4
 frozenlist==1.3.3
 fsspec==2023.5.0
+gitdb==4.0.10
+GitPython==3.1.31
 greenlet==2.0.2
 h11==0.14.0
 hnswlib==0.7.0
 httpcore==0.16.3
 httptools==0.5.0
 httpx==0.23.3
 huggingface-hub==0.14.1
+hyperlink==21.0.0
 idna==3.4
 importlib-metadata==6.6.0
+importlib-resources==5.12.0
+incremental==22.10.0
+ipykernel==6.23.1
+ipython==8.12.2
+itemadapter==0.8.0
+itemloaders==1.1.0
+jedi==0.18.2
 Jinja2==3.1.2
+jmespath==1.0.1
 joblib==1.2.0
+jsonschema==4.17.3
+jupyter_client==8.2.0
+jupyter_core==5.3.0
+kiwisolver==1.4.4
 langchain==0.0.170
 lit==16.0.3
 lxml==4.9.2
@@ -41,11 +75,14 @@ Markdown==3.4.3
 MarkupSafe==2.1.2
 marshmallow==3.19.0
 marshmallow-enum==1.5.1
+matplotlib==3.7.1
+matplotlib-inline==0.1.6
 monotonic==1.6
 mpmath==1.3.0
 msg-parser==1.2.0
 multidict==6.0.4
 mypy-extensions==1.0.0
+nest-asyncio==1.5.6
 networkx==3.1
 nltk==3.8.1
 numexpr==2.8.4
@@ -67,55 +104,98 @@ openapi-schema-pydantic==1.2.4
 openpyxl==3.1.2
 packaging==23.1
 pandas==2.0.1
+parsel==1.8.1
+parso==0.8.3
 pdfminer.six==20221105
+pexpect==4.8.0
+pickleshare==0.7.5
 Pillow==9.5.0
 pkg_resources==0.0.0
+pkgutil_resolve_name==1.3.10
+platformdirs==3.5.1
 posthog==3.0.1
+prompt-toolkit==3.0.38
+Protego==0.2.1
+protobuf==3.20.3
+psutil==5.9.5
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pyarrow==12.0.0
+pyasn1==0.5.0
+pyasn1-modules==0.3.0
 pycparser==2.21
 pydantic==1.10.7
+pydeck==0.8.1b0
+PyDispatcher==2.0.7
 Pygments==2.15.1
+Pympler==1.0.1
+pyOpenSSL==23.1.1
 pypandoc==1.11
+pyparsing==3.0.9
+pyrsistent==0.19.3
 python-dateutil==2.8.2
 python-docx==0.8.11
 python-dotenv==1.0.0
 python-magic==0.4.27
 python-pptx==0.6.21
 pytz==2023.3
 PyYAML==6.0
+pyzmq==25.0.2
+queuelib==1.6.2
 regex==2023.5.5
 requests==2.30.0
+requests-file==1.5.1
 rfc3986==1.5.0
 rich==13.0.1
 scikit-learn==1.2.2
 scipy==1.10.1
+Scrapy==2.9.0
 sentence-transformers==2.2.2
 sentencepiece==0.1.99
+service-identity==21.1.0
 six==1.16.0
+smmap==5.0.0
 sniffio==1.3.0
+soupsieve==2.4.1
 SQLAlchemy==2.0.13
+stack-data==0.6.2
 starlette==0.26.1
+streamlit==1.22.0
 sympy==1.12
 tenacity==8.2.2
 threadpoolctl==3.1.0
 tiktoken==0.4.0
+tldextract==3.4.3
 tokenizers==0.13.3
+toml==0.10.2
+toolz==0.12.0
 torch==2.0.1
 torchvision==0.15.2
+tornado==6.3.2
 tqdm==4.65.0
+traitlets==5.9.0
 transformers==4.29.1
 triton==2.0.0
+Twisted==22.10.0
 typer==0.9.0
 typing-inspect==0.8.0
 typing_extensions==4.5.0
 tzdata==2023.3
+tzlocal==5.0.1
 unstructured==0.6.6
 urllib3==2.0.2
 uvicorn==0.22.0
 uvloop==0.17.0
+validators==0.20.0
+w3lib==2.1.1
+watchdog==3.0.0
 watchfiles==0.19.0
+wcwidth==0.2.6
 websockets==11.0.3
+wikipedia==1.4.0
 wrapt==1.14.1
 XlsxWriter==3.1.0
 yarl==1.9.2
 zipp==3.15.0
+zope.interface==6.0
 zstandard==0.21.0

0 commit comments
