Numbeo is the largest crowdsourced global database of quality-of-life data, including housing indicators, perceived crime rates, healthcare quality, transportation quality, and other statistics. This project's goal is to save time when searching for quality-of-life information about a particular country or city by using a web-scraping framework (in this case, the BeautifulSoup4 library) to extract the data directly from Numbeo's website.
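Under the hood, scraping a page like Numbeo's boils down to fetching the HTML and walking its tables with BeautifulSoup4. The sketch below illustrates the technique on a hypothetical inline HTML snippet (no live request is made, and the table structure is invented for illustration; it is not Numbeo's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the shape of a ranking table
html = """
<table id="t2">
  <tr><th>City</th><th>Quality of Life Index</th></tr>
  <tr><td>Vienna</td><td>185.2</td></tr>
  <tr><td>Zurich</td><td>183.9</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# skip the header row, then read each cell's text
for tr in soup.find("table", id="t2").find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append((cells[0], float(cells[1])))

print(rows)  # [('Vienna', 185.2), ('Zurich', 183.9)]
```

The real scraper follows the same pattern, but fetches live pages and maps each table into a pandas dataframe.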
To install this package, first clone the repository to the directory of your choice using the following command:
git clone https://github.com/rafaelgreca/numbeo-scraper.git
Create a virtual environment (ideally using conda) and install the requirements using the following command:
conda create --name numbeo-scraper python=3.10.16
conda activate numbeo-scraper
pip install -r requirements.txt
Build the Docker image using the following command:
sudo docker build -f Dockerfile -t numbeo-scraper . --no-cache
Run the Docker container using the following command:
sudo docker run -it --name numbeo-scraper-container numbeo-scraper
Finally, run the following command inside the container:
python3 -m <YOUR_PYTHON_FILE_LOCATION>
Example (this is the same command as used with the virtual environment approach):
python3 -m examples.by_country.get_quality_of_life_data
- Cost of living index by country (check an example here) or by city (check an example here).
- Property price/investment index by country (check an example here) or by city (check an example here).
- Quality of life index by country (check an example here) or by city (check an example here).
- Crime index by country (check an example here) or by city (check an example here).
- Health care index by country (check an example here) or by city (check an example here).
- Pollution index by country (check an example here) or by city (check an example here).
- Traffic index by country (check an example here) or by city (check an example here).
- Historical data in a country (check an example here).
You can pass the variables that will be used to collect the desired data by creating a YAML file (such as the config.yaml file located in the root folder) and writing a piece of code like the one below (saved in a Python file):
from pathlib import Path

from src.core.utils import read_yaml_credentials_file
from src.schema.input import Input
from src.core.scraper import NumbeoScraper

if __name__ == "__main__":
    # reading the YAML file
    config = Input(
        **read_yaml_credentials_file(
            file_path=Path(__file__).resolve().parents[1],  # the folder where the config file is located
            file_name="config.yaml",  # the configuration file name
        )
    )

    scraper = NumbeoScraper(
        config=config,
    )

    # scrap() returns a list of tuples (each category is saved separately),
    # where the first element is the name of the dataframe
    # and the second one is the collected data
    dataframes = scraper.scrap()
    dataframe_name, data = dataframes[0]  # the name is used to identify the data

    print(f"\nDataframe '{dataframe_name}' has a shape of {data.shape}.")
    print(f"The first five rows of the dataset:\n{data.head(5)}\n")
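For reference, a config.yaml matching this code might look like the sketch below. The keys are inferred from the parameters documented later in this section; the actual accepted schema is defined by the library's Input class:

```yaml
categories: historical-data
years: 2021
mode: country
currency: EUR
historical_items:
  - 1 Pair of Jeans (Levis 501 Or Similar)
  - Banana (1kg)
countries:
  - China
  - France
  - United States
```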
Or you can pass the values directly, like this:
from src.schema.input import Input
from src.core.scraper import NumbeoScraper

if __name__ == "__main__":
    config = Input(
        categories="historical-data",
        years=2021,
        mode="country",
        currency="EUR",
        historical_items=[
            '1 Pair of Jeans (Levis 501 Or Similar)',
            'Banana (1kg)',
        ],
        countries=[
            'China',
            'France',
            'United States',
        ],
    )

    scraper = NumbeoScraper(
        config=config,
    )

    # scrap() returns a list of tuples (each category is saved separately),
    # where the first element is the name of the dataframe
    # and the second one is the collected data
    dataframes = scraper.scrap()
    dataframe_name, data = dataframes[0]  # the name is used to identify the data

    print(f"\nDataframe '{dataframe_name}' has a shape of {data.shape}.")
    print(f"The first five rows of the dataset:\n{data.head(5)}\n")
Available parameters that can/must be used:
- categories (a list of strings or a single string, mandatory): which type of data will be collected. You can see the available categories here.
- years (a list of integers or a single integer, mandatory): which years the data will be extracted from. You can see the available years here.
- mode (a string, mandatory): whether the data will be collected by country or by city. You can see the available modes here.
- currency (a string, optional): which currency the values will be displayed in. You can see the available currencies here. This parameter is optional; however, it must be used when the chosen category is historical-data with mode country, or when it is cost-of-living or property-investment with mode city.
- historical_items (a list of strings or a single string, optional): which items the historical data will be extracted from. You can see the available items here. This parameter is optional; however, it must be used when the chosen category is historical-data with mode country.
- countries (a list of strings or a single string, optional): which countries the data will be extracted from. You can see the available countries here.
- cities (a list of strings or a single string, mandatory when mode city is chosen): which cities the data will be extracted from.
Check the examples folder to see more examples of how to use this library.
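The currency, historical_items, and cities requirements above can be summarized in a small validation sketch. This is plain Python, independent of the library's actual Input class, and the grouping of the currency rule is my reading of the description above:

```python
def validate(categories, years, mode, currency=None,
             historical_items=None, countries=None, cities=None):
    """Sketch of the documented parameter rules (not the library's validator)."""
    cats = categories if isinstance(categories, list) else [categories]

    if mode not in ("country", "city"):
        raise ValueError("mode must be 'country' or 'city'")

    # currency is required for historical-data by country, and for
    # cost-of-living or property-investment by city
    needs_currency = ("historical-data" in cats and mode == "country") or (
        mode == "city"
        and not {"cost-of-living", "property-investment"}.isdisjoint(cats)
    )
    if needs_currency and currency is None:
        raise ValueError("currency is required for this category/mode combination")

    # historical_items is required for historical-data by country
    if "historical-data" in cats and mode == "country" and historical_items is None:
        raise ValueError("historical_items is required for historical-data by country")

    # cities is mandatory whenever mode is 'city'
    if mode == "city" and cities is None:
        raise ValueError("cities is mandatory when mode is 'city'")

    return True


print(validate("crime", 2021, "country", countries=["France"]))  # True
```

Passing an invalid combination, such as historical-data by country without a currency, raises a ValueError in this sketch, mirroring the constraints listed above.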
Run the following command on the root folder:
python3 -m unittest discover -p 'test_*.py'
- Add a feature to get the food prices by country or by city.
- Fix logging (currently the logs are shown directly in the terminal rather than saved to a file).
- Improve test cases, especially to validate parameter values and typing.
- Test the code using Docker.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Distributed under the MIT License. See LICENSE for more information.
Rafael Greca Vieira - GitHub - LinkedIn - [email protected]
Numbeo is the world's largest crowdsourced cost-of-living database, helping people from all over the world plan their travels and find a new place to call home. I want to express my profound gratitude to everyone who works behind the scenes to make it possible.