A Go web crawler with MySQL persistence
A simple Go web crawler that finds the 10 most valuable companies according to the Fundamentus website. It searches for the stock name, company name, the stock's average daily rate, and the company's market value, storing them in a MySQL database.
You can either run it locally via a shell script or in containers via Docker Compose. In both cases, you need to provide a .env file with the database credentials; an example is shown in .env.example.
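For reference, a .env file along these lines should work (a sketch only: DB_HOST and DB_PORT appear later in this README, while DB_USER, DB_PASSWORD, and DB_NAME are assumed names for illustration; check .env.example for the exact keys):

```
# "db" when running via Docker Compose; localhost when running locally
DB_HOST=localhost
# assumed default MySQL port
DB_PORT=3306
# assumed credentials; the docker exec example later in this README uses root/root
DB_USER=root
DB_PASSWORD=root
# hypothetical database name
DB_NAME=crawler
```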
To run it locally, do the following:
go get github.com/psatler/go-web-crawler
cd go-web-crawler
sh init-db.sh
The script might ask for your sudo password to make sure the MySQL service is up. It also opens the MySQL database as the root user.
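For a rough idea, a script like init-db.sh boils down to something like the sketch below (hypothetical: the real file in the repo is authoritative, and setup.sql is an illustrative name):

```sh
#!/bin/sh
# Hypothetical sketch of init-db.sh -- see the repo for the real script.
sudo service mysql start       # may prompt for your sudo password
mysql -u root -p < setup.sql   # opens MySQL as root to create the database/tables
go run main.go                 # start the crawler
```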
The project also comes with a Docker Compose file that creates a container for each service (the Go application and MySQL). So, to run it that way, do:
git clone https://github.com/psatler/go-web-crawler.git
cd go-web-crawler
sudo docker-compose up
NOTE: If a local MySQL service is already running, Docker Compose might raise port conflicts/errors, so you might have to run sudo service mysql stop first.
- GoQuery: implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document (see the sketch after this list).
- Go-sql-driver: the mysql package provides a MySQL driver for Go's database/sql package.
- strconv: implements conversions to and from string representations of basic data types.
- sort: provides primitives for sorting slices and user-defined collections.
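For instance, GoQuery's chainable API looks roughly like this (a minimal sketch, not the project's actual scraping code; the selector is a placeholder):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch the page and parse it into a queryable document.
	res, err := http.Get("https://www.fundamentus.com.br/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// jQuery-like chainable syntax: visit every table cell and print its text.
	doc.Find("table td").Each(func(i int, s *goquery.Selection) {
		fmt.Println(s.Text())
	})
}
```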
This project is licensed under the terms of the MIT License © Pablo Satler 2018
The app first collects a list of links to be queried afterwards. Then it pulls information from each of those links, such as the stock price and market value. This second pass (fetching the details of each link) is the step that takes the longest.
The first implementation, without any concurrency, took about 9min30s to 10min to complete. Another approach was to divide the slice among goroutines, where each goroutine takes care of a part of the slice, appending its results to a final slice of structs. With that approach, the time spent dropped to roughly 4 minutes.
A sync.WaitGroup was used for this. A WaitGroup waits for a collection of goroutines to finish: the main goroutine calls Add to set the number of goroutines to wait for, each goroutine calls Done when it finishes, and Wait blocks until all of them have finished.
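A minimal sketch of that pattern (illustrative only; the Stock struct, chunk size, and fake links stand in for the project's actual types and URLs):

```go
package main

import (
	"fmt"
	"sync"
)

// Stock stands in for the struct scraped from each link.
type Stock struct {
	Name string
}

func main() {
	links := []string{"a", "b", "c", "d", "e", "f"}

	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards appends to the shared result slice
		results []Stock
	)

	const chunk = 2 // each goroutine takes care of a part of the slice
	for start := 0; start < len(links); start += chunk {
		end := start + chunk
		if end > len(links) {
			end = len(links)
		}

		wg.Add(1) // one more goroutine to wait for
		go func(part []string) {
			defer wg.Done() // signal completion
			for _, link := range part {
				// ...fetch and parse the link here...
				mu.Lock()
				results = append(results, Stock{Name: link})
				mu.Unlock()
			}
		}(links[start:end])
	}

	wg.Wait() // block until every goroutine has called Done
	fmt.Println(len(results), "stocks collected")
}
```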
show databases;
use <DBName>;
show tables;
describe <TableName>;
drop database <DBName>;
drop table <TableName>;
select * from <TableName>;
The command below was used (as the root user), where db-mysql-container is the container name defined in the Docker Compose file.
sudo docker exec -it db-mysql-container mysql -uroot -proot
This is done via volumes, as shown in the docker-compose.yaml file in this project:
db:
  volumes:
    - /path-to-sql-files-on-your-host:/docker-entrypoint-initdb.d
Then run docker-compose down -v to destroy the containers and volumes, and docker-compose up to recreate them.
To reach a service running in another container, take this Docker tutorial as a reference.
For example, in this project the DB_HOST env var is defined as db, the name given to the MySQL service, and DB_PORT is set to the same port exposed under ports in the mysql service definition.
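Putting those together, the connection can be opened roughly like this (a sketch assuming the env var names above and the go-sql-driver DSN format; not the project's exact code):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// Inside the app container, DB_HOST ("db") resolves to the MySQL service.
	dsn := fmt.Sprintf("%s:%s@tcp(%s:%s)/%s",
		os.Getenv("DB_USER"),
		os.Getenv("DB_PASSWORD"),
		os.Getenv("DB_HOST"),
		os.Getenv("DB_PORT"),
		os.Getenv("DB_NAME"),
	)

	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// sql.Open doesn't connect eagerly; Ping verifies the connection works.
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected to MySQL")
}
```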