This project provides a complete workflow for scraping detailed vessel information from Baltic Shipping's website. It consists of two Python scripts:

- `extract.py`: Extracts all vessel URLs from the site's XML sitemaps and consolidates them into a text file.
- `scraper.py`: Scrapes detailed data from the extracted URLs and exports the results to an Excel file.
- **Automated Sitemap Parsing:** `extract.py` processes multiple XML sitemaps to fetch vessel URLs.
- **Web Interaction:** `scraper.py` uses Selenium to interact with webpages and load hidden content dynamically.
- **Data Consolidation:** Extracted data is saved in a structured Excel file for easy analysis.
- **Error Handling:** Invalid URLs are skipped, and timeouts or missing elements are handled gracefully.
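The sitemap-parsing step can be sketched as below. This is a minimal illustration, not the actual code in `extract.py`: it assumes standard sitemap XML (`<urlset>`/`<loc>` elements) and that vessel pages contain `/vessel/` in their path.

```python
# Hypothetical sketch of sitemap parsing; the "/vessel/" filter is an assumption.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return every <loc> URL found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def vessel_urls(urls):
    """Keep only URLs that look like vessel pages."""
    return [u for u in urls if "/vessel/" in u]
```

In practice `extract.py` would run `parse_sitemap` over each sitemap listed by the site and write the filtered result to `urls.txt`.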
- Python 3.7+
- Google Chrome (or another supported browser)
- ChromeDriver (or the corresponding WebDriver for your browser)
1. Clone this repository or download the scripts.
2. Install the required Python libraries using pip:

       pip install -r requirements.txt
Run the `extract.py` script to fetch vessel URLs from the Baltic Shipping sitemaps.
1. Ensure `extract.py` is in the working directory.
2. Run the script:

       python extract.py

3. Output: a file named `urls.txt` containing all the extracted vessel URLs, one per line.
Run the `scraper.py` script to scrape detailed information for each vessel URL in `urls.txt`.
1. Ensure `scraper.py` and `urls.txt` are in the same directory.
2. Run the script:

       python scraper.py

3. Output: an Excel file named `output.xlsx` containing the scraped vessel data.
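The core of the scraping step can be sketched as follows. The table/cell selectors are illustrative assumptions, not Baltic Shipping's real page structure, and the Selenium imports are kept local to `scrape_vessel` so the record helper works without a browser installed.

```python
def scrape_vessel(driver, url, timeout=10):
    """Load one vessel page and return its detail table as (label, value) pairs.

    The generic table/td selectors are assumptions for this sketch.
    """
    # Local imports so the rest of the sketch runs without Selenium present.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )
    pairs = []
    for tr in driver.find_elements(By.CSS_SELECTOR, "table tr"):
        cells = tr.find_elements(By.TAG_NAME, "td")
        if len(cells) == 2:
            pairs.append((cells[0].text, cells[1].text))
    return pairs

def rows_to_record(url, pairs):
    """Flatten scraped (label, value) pairs into one dict per vessel."""
    record = {"URL": url}
    for label, value in pairs:
        record[label.strip().rstrip(":")] = value.strip()
    return record
```

One such record per URL, collected into a list, is what would be written out as the rows of `output.xlsx`.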
- `urls.txt`: A plain text file containing one vessel URL per line.
- `output.xlsx`: An Excel spreadsheet with detailed vessel information. Each row corresponds to a vessel, and each column represents an attribute or detail.
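Because `urls.txt` is plain text with one URL per line, downstream tools can consume it with a few lines of Python; a minimal reader (the blank-line filtering is a defensive assumption):

```python
def load_urls(path):
    """Read a urls.txt-style file, returning one URL per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```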
- **Timeouts and Failures:** The scripts handle missing elements and timeouts, skipping problematic entries and continuing with the rest.
- **Performance:** For large datasets, processing time varies with the number of URLs and the website's response speed.
- **Customizability:** Adjust `scraper.py` for specific webpage structures or additional data points.
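The skip-and-continue error handling described above can be sketched as a driver loop. Here `scrape_one` is a hypothetical callable, and the broad `Exception` catch stands in for Selenium's `TimeoutException`/`NoSuchElementException`:

```python
def scrape_all(urls, scrape_one):
    """Scrape every URL, collecting failures instead of aborting the run."""
    results, skipped = [], []
    for url in urls:
        try:
            results.append(scrape_one(url))
        except Exception as exc:  # scraper.py would catch TimeoutException etc.
            skipped.append((url, repr(exc)))
    return results, skipped
```

Keeping the `skipped` list makes it easy to log or retry the problematic URLs after the main run finishes.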
- **Driver Errors:** Ensure the ChromeDriver version matches your installed version of Google Chrome. Update or replace the driver if necessary.
- **Missing Data:** Check that the webpage's structure (e.g., button classes, table classes) has not changed.
This project is licensed under the MIT License. See the LICENSE file for details.