
Numbeo Scraper

Table of Contents
  1. About The Project
  2. Installation
  3. Features
  4. Examples
  5. Running Tests
  6. Roadmap
  7. Contributing
  8. License
  9. Contact
  10. Acknowledgments

About The Project

Numbeo is the largest crowdsourced global database of quality-of-life data, covering housing indicators, perceived crime rates, healthcare quality, transportation quality, and other statistics. To save you time when searching for information about the quality of life in a particular country or city, this project uses web scraping (in this case, the BeautifulSoup4 library) to extract data from Numbeo's website.

(back to top)

Installation

To install this package, first clone the repository to the directory of your choice using the following command:

git clone https://github.com/rafaelgreca/numbeo-scraper.git

Using Virtual Environment

Create a virtual environment (ideally using conda) and install the requirements using the following commands:

conda create --name numbeo-scraper python=3.10.16 
conda activate numbeo-scraper
pip install -r requirements.txt

Using Docker

Build the Docker image using the following command:

sudo docker build -f Dockerfile -t numbeo-scraper . --no-cache

Run the Docker container using the following command:

sudo docker run -it --name numbeo-scraper-container numbeo-scraper

Finally, run the following command inside the container:

python3 -m <YOUR_PYTHON_FILE_LOCATION>

Example (this is the same command used with the virtual environment approach):

python3 -m examples.by_country.get_quality_of_life_data

(back to top)

Features

Examples

You can pass the variables used to collect the desired data by creating a YAML file (such as the config.yaml file located in the root folder) and loading it from a Python script like the one below:

from pathlib import Path

from src.core.utils import read_yaml_credentials_file
from src.schema.input import Input
from src.core.scraper import NumbeoScraper

if __name__ == "__main__":
    # reading the YAML file
    config = Input(
        **read_yaml_credentials_file(
            file_path=Path(__file__).resolve().parents[1], # the folder where the config file is located
            file_name="config.yaml", # the configuration file name
        )
    )

    scraper = NumbeoScraper(
        config=config,
    )
    dataframes = scraper.scrap()  # returns a list of tuples (each category is saved separately),
                                  # where the first element is the dataframe's name
                                  # and the second is the collected data

    dataframe_name, data = dataframes[0]  # the name is used to identify the data

    print(f"\nDataframe '{dataframe_name}' has a shape of {data.shape}.")
    print(f"The first five rows of the dataset:\n{data.head(5)}\n")

Or you can pass the values directly, like this:

from src.schema.input import Input
from src.core.scraper import NumbeoScraper

if __name__ == "__main__":
    config = Input(
        categories="historical-data",
        years=2021,
        mode="country",
        currency="EUR",
        historical_items=[
          '1 Pair of Jeans (Levis 501 Or Similar)',
          'Banana (1kg)'
        ],
        countries=[
          'China',
          'France',
          'United States'
        ],
    )

    scraper = NumbeoScraper(
        config=config,
    )
    dataframes = scraper.scrap()  # returns a list of tuples (each category is saved separately),
                                  # where the first element is the dataframe's name
                                  # and the second is the collected data

    dataframe_name, data = dataframes[0]  # the name is used to identify the data

    print(f"\nDataframe '{dataframe_name}' has a shape of {data.shape}.")
    print(f"The first five rows of the dataset:\n{data.head(5)}\n")

Available parameters that can/must be used:

  • categories (can be a list of strings or just a string, mandatory): Which type of data will be collected. You can see the available categories here.

  • years (can be a list of integers or just an integer, mandatory): Which years the data will be extracted from. You can see the available years here.

  • mode (a string, mandatory): Whether the data will be collected by country or by city. You can see the available modes here.

  • currency (a string, optional): Which currency the values will be displayed in. You can see the available currencies here. This parameter is optional; however, it must be used when the chosen category is historical-data with mode country, or when the category is cost-of-living or property-investment with mode city.

  • historical_items (can be a list of strings or just a string, optional): Which items the historical data will be extracted from. You can see the available items here. This parameter is optional; however, it must be used when the chosen category is historical-data with mode country.

  • countries (can be a list of strings or just a string, optional): Which countries the data will be extracted from. You can see the available countries here.

  • cities (can be a list of strings or just a string, mandatory when the mode is city): Which cities the data will be extracted from (see the sketch after this list).
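For instance, here is a hypothetical city-mode configuration combining the rules above (cost-of-living with mode city requires currency; the city names are only illustrative):

from src.schema.input import Input
from src.core.scraper import NumbeoScraper

if __name__ == "__main__":
    # cost-of-living data by city must also set the currency parameter
    config = Input(
        categories="cost-of-living",
        years=2021,
        mode="city",
        currency="EUR",
        cities=[
            "Berlin",
            "Paris",
        ],
    )

    scraper = NumbeoScraper(config=config)
    dataframes = scraper.scrap()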

Check the examples folder to see more examples of how to use this library.

(back to top)

Running Tests

Run the following command from the root folder:

python3 -m unittest discover -p 'test_*.py'

(back to top)

Roadmap

  • Add a feature to get the food prices by country or by city.

  • Fix logging (currently it's not saving the logs into a file, but rather showing them directly in the terminal).

  • Improve test cases, especially to validate parameter values and types.

  • Test the code using Docker.

(back to top)

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Rafael Greca Vieira - GitHub - LinkedIn - [email protected]

(back to top)

Acknowledgments

Numbeo, the world's largest crowdsourced cost-of-living database, helps people from all over the world plan their travels and find a new place to call home. I want to express my profound gratitude to everyone who works behind the scenes to make it possible.
