This project can be used to load data into a cachedb database (see the demo description and the Solution diagram and components description). It loads data generated by the artificial-data-generator project and prepared by the Jupyter notebooks in training-with-artaficial-data.
Contents:
- data folder -- holds a copy of the data to be loaded
- create_db.py -- Python script that creates the cache database tables
- load_data.py -- Python script that loads the data into the database
- requirements.txt -- Python requirements for the scripts
- Dockerfile -- used to build a Docker image that can run the scripts
The Python scripts use the following environment variables:
- DATA_PATH -- path to the data folder
- POSTGRES_USR -- cachedb user name
- POSTGRES_PW -- cachedb password
- POSTGRES_DB -- cachedb database name
- POSTGRES_HST -- Postgres host
- RECREATE_DATABASE -- if set to true, the database is dropped and recreated before the tables are created (default: false)
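To illustrate how these variables are consumed, here is a minimal sketch of building a database connection from them. It assumes the scripts use psycopg2 (the actual driver used by create_db.py and load_data.py may differ); the defaults mirror the list above.

import os
import psycopg2  # assumption: the scripts may use a different Postgres driver

# Read the configuration described above; RECREATE_DATABASE defaults to false.
data_path = os.environ["DATA_PATH"]
recreate_database = os.environ.get("RECREATE_DATABASE", "false").lower() == "true"

# Connect to the cachedb Postgres instance.
conn = psycopg2.connect(
    host=os.environ["POSTGRES_HST"],
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USR"],
    password=os.environ["POSTGRES_PW"],
)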
Run Postgres in a Docker container:
export POSTGRES_USR=cacheUser
export POSTGRES_PW=cachePass
export POSTGRES_DB=cacheDb
export POSTGRES_HST=localhost
docker run --name postgres -p 5432:5432 -e POSTGRES_USER=$POSTGRES_USR -e POSTGRES_PASSWORD=$POSTGRES_PW -d postgres
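As an optional sanity check that the database is up and accepting connections (pg_isready ships with the official postgres image):
docker exec -it postgres pg_isready -U $POSTGRES_USR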
Set the environment variables:
export DATA_PATH=$(pwd)/data
export POSTGRES_USR=cacheUser
export POSTGRES_PW=cachePass
export POSTGRES_DB=cacheDb
export POSTGRES_HST=localhost
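Install the Python dependencies for the scripts (a standard pip install of the listed requirements):
pip install -r requirements.txt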
Now you can run the scripts:
python create_db.py
python load_data.py
Build the image:
docker build -t cachedb-load-data .
Then you can run the scripts by providing the environment variables, for example:
docker run -it --net=host \
--env DATA_PATH=/loader/data \
--env POSTGRES_HST=127.0.0.1 \
--env POSTGRES_DB=cacheDb \
--env POSTGRES_USR=cacheUser \
--env POSTGRES_PW=cachePass \
cachedb-load-data
You can also mount the data folder into the container at runtime (for example, with different CSV files than the ones baked into the image):
docker run -it --net=host \
-v $(pwd)/data:/loader/data \
--env DATA_PATH=/loader/data \
--env POSTGRES_HST=localhost \
--env POSTGRES_DB=cacheDb \
--env POSTGRES_USR=cacheUser \
--env POSTGRES_PW=cachePass \
cachedb-load-data
The scripts can be run in Kubernetes/OpenShift as a job. An example job definition is in cachedb-load-data-job.yaml.
Update the environment variables in the job definition to match your environment, including the Postgres service name, username, password and database name.
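For orientation, a minimal sketch of what such a job definition typically looks like is shown below; the image name, service name and values are placeholders, so use the real cachedb-load-data-job.yaml from this repository:

apiVersion: batch/v1
kind: Job
metadata:
  name: cachedb-load-data
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cachedb-load-data
          image: cachedb-load-data:latest   # placeholder: use your registry/tag
          env:
            - name: DATA_PATH
              value: /loader/data
            - name: POSTGRES_HST
              value: postgres               # placeholder: your Postgres service name
            - name: POSTGRES_DB
              value: cacheDb
            - name: POSTGRES_USR
              value: cacheUser
            - name: POSTGRES_PW
              value: cachePass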
Then you can run the job:
kubectl apply -f cachedb-load-data-job.yaml
Currently the scripts use static data embedded in the image. In the future, a script could be added that downloads the data from a remote location, for example from a central config server.
For OpenShift usage, the Helm charts could be updated to include the scripts and the data (for example, as a post-install job), and the image could be built with a BuildConfig.