Commit c7b05db

Author: Dirk Breeuwer (committed)

Refactor code into separate modules, add --target-string argument, and support larger HTML file processing

1 parent 8f4cf62 commit c7b05db

5 files changed: +191 -87 lines changed


README.md

Lines changed: 42 additions & 65 deletions
@@ -1,93 +1,70 @@
-# GPT-based Automated Web Scraper
+# AI Web Scraper

-![GPT based automated webscrapper](https://cdn.discordapp.com/attachments/984632500875821066/1104363425439698944/analyticsynthetic_Small_cute_mining_robot_with_large_eyes_5501ffb9-ea08-4dfc-b04d-9623f7c4481a.png "GPT based automated webscrapper")
+This project is an AI-powered web scraper that allows you to extract information from HTML sources based on user-defined requirements. It generates scraping code and executes it to retrieve the desired data.

-This GPT-based Universal Web Scraper is a project that allows users to automate web scraping effortlessly by leveraging GPT models to analyze website structures, user requirements, and network traffic, streamlining the data extraction process.
+## Prerequisites

-**Note**: The GPT prompt for analyzing API calls is still in progress and may not return accurate results at this time. We are working on improving the prompt to provide better analysis results.
+Before running the AI Web Scraper, ensure you have the following prerequisites installed:

-## Documentation
-
-Detailed information about the project can be found in the following documents:
-
-- [Technical Design Document (TDD)](tdd.md): The TDD provides a comprehensive overview of the system architecture, component design, and implementation details.
-- [Product Requirements Document (PRD)](prd.md): The PRD outlines the features, functionality, and requirements of the GPT-based Universal Web Scraper.
-
-## Main Components
-
-1. `gpt_interaction`: Handles communication with the GPT model and manages user interaction to gather scraping requirements.
-2. `scraper_generation`: Generates scraper code based on the results of the website structure analysis and user requirements.
-3. `url_preprocessing`: Handles URL validation, normalization, and cleaning tasks.
-4. `website_analysis`: Analyzes website DOM, identifies relevant elements, and detects APIs through network traffic analysis for data extraction.
-5. `data_extraction`: Executes the generated scraper and extracts data from the target website.
+- Python 3.x
+- The required Python packages specified in the `requirements.txt` file
+- An API key for OpenAI GPT-4

 ## Installation

-To install the project dependencies, run the following command:
+1. Clone the project repository:

-```
-pip install -r requirements.txt
-```
+```shell
+git clone https://github.com/dirkjbreeuwer/gpt-automated-web-scraper
+```

-Next, copy the `config.json.example` file to `config.json` and enter your GPT-4 API key in the `gpt4` section:
+2. Navigate to the project directory:

-```json
-{
-    "gpt4": {
-        "api_key": "your-api-key-here"
-    }
-}
-```
+```shell
+cd gpt-automated-web-scraper
+```

-## Usage
-
-You can analyze the network traffic of websites using the NetworkAnalyzer class provided in the `./website_analysis/network_analysis.py` file. Here's an example of how to use the class:
-
-```python
+3. Install the required Python packages:

-from website_analysis.network_analysis import NetworkAnalyzer
+```shell
+pip install -r requirements.txt
+```

-# URL of the website to analyze
-url = "https://www.example.com"
+4. Set up the OpenAI GPT-4 API key:
+
+   - Obtain an API key from OpenAI by following their documentation.
+   - Rename the `.env.example` file in the project directory to `.env`.
+   - Add the following line to the `.env` file, replacing `YOUR_API_KEY` with your actual API key:

-# User requirements for the data extraction (currently not used)
-user_requirements = {}
+```plaintext
+OPENAI_API_KEY=YOUR_API_KEY
+```

-# Create a NetworkAnalyzer instance
-analyzer = NetworkAnalyzer(url, user_requirements)
-
-# Analyze the website
-analysis_results = analyzer.analyze_website()
+## Usage

-# Print the analysis results
-print(analysis_results)
-```
+To use the AI Web Scraper, run the `gpt-scraper.py` script with the desired command-line arguments.

-You can also analyze multiple websites at once using the `analyze_websites` function provided in the same file. Just pass a list of website URLs as an argument:
+### Command-line Arguments

-```python
+The following command-line arguments are available:

-from website_analysis.network_analysis import analyze_websites
+- `--source`: The URL or local path to the HTML source to scrape.
+- `--source-type`: Type of the source. Specify either `"url"` or `"file"`.
+- `--requirements`: User-defined requirements for scraping.
+- `--target-string`: Due to the maximum token limit of GPT-4 (4k tokens), the AI model processes a smaller subset of the HTML where the desired data is located. The target string should be an example string that can be found within the website you want to scrape.

-# List of website URLs to analyze
-websites = [
-    "https://www.example1.com",
-    "https://www.example2.com",
-    "https://www.example3.com"
-]
+### Example Usage

-# Analyze the websites
-results = analyze_websites(websites)
+Here are some example commands for using the AI Web Scraper:

-# Print the analysis results
-print(results)
+```shell
+python3 gpt-scraper.py --source-type "url" --source "https://www.scrapethissite.com/pages/forms/" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"
 ```

+Replace the values for `--source`, `--requirements`, and `--target-string` with your specific values.

-## Testing

-Currently the project is still under development. This section will be updated once the project is ready for use.
+## License

-## Contributing
+This project is licensed under the [MIT License](LICENSE). Feel free to modify and use it according to your needs.

-We welcome contributions to improve the GPT-based Universal Web Scraper. Please feel free to submit issues, feature requests, and pull requests on the repository.
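
The `--target-string` option documented above narrows the HTML handed to the model, but the README does not show how the trimming works. Below is a minimal sketch of the idea, assuming a hypothetical `narrow_html` helper and an arbitrary character window; the commit's actual logic lives in `HtmlManager`, which is not shown in this diff.

```python
# Hypothetical illustration of the --target-string idea: keep only a window of
# HTML around the first occurrence of the target string so the prompt fits
# within GPT-4's context limit. The function name and window size are assumptions.
def narrow_html(html: str, target_string: str, window: int = 4000) -> str:
    index = html.find(target_string)
    if index == -1:
        # Target not found: fall back to the beginning of the document.
        return html[: 2 * window]
    start = max(0, index - window)
    end = min(len(html), index + len(target_string) + window)
    return html[start:end]
```

Any trimming strategy that keeps the prompt under the model's token limit would serve the same purpose; a character window is simply the easiest to illustrate.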

gpt-scraper.py

Lines changed: 11 additions & 10 deletions
@@ -2,33 +2,34 @@
 from langchain import PromptTemplate
 from langchain.llms import OpenAI
 import argparse
-from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader
+from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader, HtmlManager
 from scraper_generation.scraper_generator import ScrapingCodeGenerator, CodeWriter
 from data_extraction.data_extractor import CodeExecutor


 def main():
+    # Receive and parse arguments
     parser = argparse.ArgumentParser(description='AI Web Scraper')
     parser.add_argument('--source', type=str, help='The URL or local path to HTML to scrape')
     parser.add_argument('--source-type', type=str, choices=['url', 'file'], help='Type of the source: url or file')
     parser.add_argument('--requirements', type=str, help='The user requirements for scraping')
+    parser.add_argument('--target-string', type=str, help='An example string to guide the scraper')
     args = parser.parse_args()

     source = args.source
     source_type = args.source_type
     USER_REQUIREMENTS = args.requirements
+    target_string = args.target_string
+

-    # Create HtmlLoader or UrlHtmlLoader based on the source type
-    def create_html_loader(source, source_type):
-        if source_type == 'url':
-            return UrlHtmlLoader(source)
-        else: # source_type == 'file'
-            return HtmlLoader(source)
+    # Instantiate the HTML manager
+    manager = HtmlManager(source, source_type, target_string)

-    html_loader = create_html_loader(source, source_type)
+    # Load Processed HTML
+    processed_html = manager.process_html()

-    # Instantiate ScrapingCodeGenerator with the html_loader
-    code_generator = ScrapingCodeGenerator(html_loader, source=source, source_type=source_type)
+    # Instantiate ScrapingCodeGenerator with the processed_html
+    code_generator = ScrapingCodeGenerator(processed_html, source=source, source_type=source_type)

     # Generate scraping code
     scraping_code = code_generator.generate_scraping_code(USER_REQUIREMENTS)
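
The diff shows only how `HtmlManager` is called: it is constructed from the source, the source type, and the target string, and `process_html()` returns the trimmed HTML that is handed to `ScrapingCodeGenerator`. The class body is not part of this commit view; the following is a rough sketch consistent with those call sites, in which every method body (including the assumed `.load()` calls on the existing loaders) is an assumption rather than the project's actual implementation.

```python
# Hypothetical reconstruction of HtmlManager based only on the call sites in
# the diff above; the .load() calls and the trimming heuristic are assumptions.
from website_analysis.dom_analysis import HtmlLoader, UrlHtmlLoader


class HtmlManager:
    def __init__(self, source, source_type, target_string):
        self.source = source
        self.source_type = source_type
        self.target_string = target_string

    def _load_html(self):
        # Reuse the loaders already imported by gpt-scraper.py.
        if self.source_type == 'url':
            return UrlHtmlLoader(self.source).load()
        return HtmlLoader(self.source).load()

    def process_html(self, window=4000):
        # Trim the document to a window around the target string so the
        # prompt stays within GPT-4's token limit (see --target-string).
        html = self._load_html()
        if self.target_string and self.target_string in html:
            index = html.index(self.target_string)
            return html[max(0, index - window):index + len(self.target_string) + window]
        return html
```

Whatever the real trimming looks like, the point of the refactor is that `process_html()` hands `ScrapingCodeGenerator` a string small enough for the model, which is what enables the larger HTML files mentioned in the commit message.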

requirements.txt

Lines changed: 80 additions & 0 deletions
@@ -1,38 +1,72 @@
 aiohttp==3.8.4
 aiosignal==1.3.1
+altair==4.2.2
 anyio==3.6.2
 argilla==1.7.0
+asttokens==2.2.1
 async-timeout==4.0.2
 attrs==23.1.0
+Automat==22.10.0
+backcall==0.2.0
 backoff==2.2.1
+backports.zoneinfo==0.2.1
+beautifulsoup4==4.12.2
+blinker==1.6.2
+cachetools==5.3.0
 certifi==2023.5.7
 cffi==1.15.1
 charset-normalizer==3.1.0
 chromadb==0.3.22
 click==8.1.3
 clickhouse-connect==0.5.24
 cmake==3.26.3
+comm==0.1.3
 commonmark==0.9.1
+constantly==15.1.0
+contourpy==1.0.7
 cryptography==40.0.2
+cssselect==1.2.0
+cycler==0.11.0
 dataclasses-json==0.5.7
+debugpy==1.6.7
+decorator==5.1.1
 Deprecated==1.2.13
 duckdb==0.7.1
+entrypoints==0.4
 et-xmlfile==1.1.0
+executing==1.2.0
+faiss-cpu==1.7.4
 fastapi==0.95.1
 filelock==3.12.0
+fonttools==4.39.4
 frozenlist==1.3.3
 fsspec==2023.5.0
+gitdb==4.0.10
+GitPython==3.1.31
 greenlet==2.0.2
 h11==0.14.0
 hnswlib==0.7.0
 httpcore==0.16.3
 httptools==0.5.0
 httpx==0.23.3
 huggingface-hub==0.14.1
+hyperlink==21.0.0
 idna==3.4
 importlib-metadata==6.6.0
+importlib-resources==5.12.0
+incremental==22.10.0
+ipykernel==6.23.1
+ipython==8.12.2
+itemadapter==0.8.0
+itemloaders==1.1.0
+jedi==0.18.2
 Jinja2==3.1.2
+jmespath==1.0.1
 joblib==1.2.0
+jsonschema==4.17.3
+jupyter_client==8.2.0
+jupyter_core==5.3.0
+kiwisolver==1.4.4
 langchain==0.0.170
 lit==16.0.3
 lxml==4.9.2
@@ -41,11 +75,14 @@ Markdown==3.4.3
 MarkupSafe==2.1.2
 marshmallow==3.19.0
 marshmallow-enum==1.5.1
+matplotlib==3.7.1
+matplotlib-inline==0.1.6
 monotonic==1.6
 mpmath==1.3.0
 msg-parser==1.2.0
 multidict==6.0.4
 mypy-extensions==1.0.0
+nest-asyncio==1.5.6
 networkx==3.1
 nltk==3.8.1
 numexpr==2.8.4
@@ -67,55 +104,98 @@ openapi-schema-pydantic==1.2.4
 openpyxl==3.1.2
 packaging==23.1
 pandas==2.0.1
+parsel==1.8.1
+parso==0.8.3
 pdfminer.six==20221105
+pexpect==4.8.0
+pickleshare==0.7.5
 Pillow==9.5.0
 pkg_resources==0.0.0
+pkgutil_resolve_name==1.3.10
+platformdirs==3.5.1
 posthog==3.0.1
+prompt-toolkit==3.0.38
+Protego==0.2.1
+protobuf==3.20.3
+psutil==5.9.5
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pyarrow==12.0.0
+pyasn1==0.5.0
+pyasn1-modules==0.3.0
 pycparser==2.21
 pydantic==1.10.7
+pydeck==0.8.1b0
+PyDispatcher==2.0.7
 Pygments==2.15.1
+Pympler==1.0.1
+pyOpenSSL==23.1.1
 pypandoc==1.11
+pyparsing==3.0.9
+pyrsistent==0.19.3
 python-dateutil==2.8.2
 python-docx==0.8.11
 python-dotenv==1.0.0
 python-magic==0.4.27
 python-pptx==0.6.21
 pytz==2023.3
 PyYAML==6.0
+pyzmq==25.0.2
+queuelib==1.6.2
 regex==2023.5.5
 requests==2.30.0
+requests-file==1.5.1
 rfc3986==1.5.0
 rich==13.0.1
 scikit-learn==1.2.2
 scipy==1.10.1
+Scrapy==2.9.0
 sentence-transformers==2.2.2
 sentencepiece==0.1.99
+service-identity==21.1.0
 six==1.16.0
+smmap==5.0.0
 sniffio==1.3.0
+soupsieve==2.4.1
 SQLAlchemy==2.0.13
+stack-data==0.6.2
 starlette==0.26.1
+streamlit==1.22.0
 sympy==1.12
 tenacity==8.2.2
 threadpoolctl==3.1.0
 tiktoken==0.4.0
+tldextract==3.4.3
 tokenizers==0.13.3
+toml==0.10.2
+toolz==0.12.0
 torch==2.0.1
 torchvision==0.15.2
+tornado==6.3.2
 tqdm==4.65.0
+traitlets==5.9.0
 transformers==4.29.1
 triton==2.0.0
+Twisted==22.10.0
 typer==0.9.0
 typing-inspect==0.8.0
 typing_extensions==4.5.0
 tzdata==2023.3
+tzlocal==5.0.1
 unstructured==0.6.6
 urllib3==2.0.2
 uvicorn==0.22.0
 uvloop==0.17.0
+validators==0.20.0
+w3lib==2.1.1
+watchdog==3.0.0
 watchfiles==0.19.0
+wcwidth==0.2.6
 websockets==11.0.3
+wikipedia==1.4.0
 wrapt==1.14.1
 XlsxWriter==3.1.0
 yarl==1.9.2
 zipp==3.15.0
+zope.interface==6.0
 zstandard==0.21.0

0 commit comments
