Skip to content

AFP-Medialab/twint-wrapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Twint Wrapper

Twint Wrapper is a java Spring Boot application that expose operations in front of Twint scraping tool.

This tool supported following operations:

  • collect => ask for a twitter scraping with search criteria
  • status => check the scraping process
  • collect-history => get request history
  • collect-update => re-process an existing scraping

All operations are defined through swagger page /twint-wrapper/swagger-ui.html

Target build is a Docker image that combine Twint docker image and the java Spring-Boot application.

Requirements

  • ElasticSearch to index twitter scraping results
  • MySQL to manage scraping sessions
  • Docker
  • Twint
  • FusionAuth to protect operations to authorized users

architecture

Twint Docker build

Twint is the only scraper that is supported. It must be build as a docker image as a prerequisite to any new development. Project pom.xml defined the build of twint bases on a commit 81f6c2c516a231136fbd821bd6a53d7959965fee of twint project. Dockerfile that build twint image is located src/main/docker/twint

To build twint image with maven run:

mvn docker:build -P twint-docker

Test twint image:

docker run --rm -it twint:2.1.4 "twint -s '#pactemondialsurlesmigrations' --since '2018-12-01 00:00:00' --until '2018-12-15 00:00:00' -l fr --count"

How to develop

src/scripts/dev folder gives docker-compose file to start MySQL, ElasticSearch and FusionAuth.

$cd src/scripts/dev
$docker-compose up -d

By default, the application will connect to the different containers

An other script add Kibana to monitor ElasticSearch Index.

$ cd src/scripts/dev
$ docker-compose -f docker-compose-kibana.yml up -d

Some linux system need to increase it virtual memory cf https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html. If elasticsearch fails at startup:

sudo sysctl -w vm.max_map_count=262144

Builds

Default build build Spring-Boot application as jar file. Default profile is dev

mvn clean package

Build docker image with Spring-Boot application and twint application

mvn clean package -P prod

Run service

Twitter-gateway can be run with different profiles (Default, dev & prod).

default:

java -jar twint-wrapper.jar

dev:

java -Dspring.profiles.active=dev -jar twint-wrapper.jar

prod

java -Dspring.profiles.active=prod -jar twint-wrapper.jar

Full application with all components run with docker-compose

$src/main/script/prod
$docker-compose up -d

FusionAuth configuration

FusionAuth executed with docker-compose use a setup sql script that embedded minimum configuration to add authorized users. This inialisation will be set at first docker-compose startup. Supported version is 1.15.* (tested with 1.15.2 & 1.15.5)

Setup to grant users are:

  • Register user
  • Approved user through fusionAuth UI
  • User ask a register code with his email address
  • User ask a token with his received register code to use scraping operations.

FusionAuth is accessible locally:

Email configuration with google account

FusionAuth send registration code by email. To enable this feature, a SMTP server need to be setup in fusionAuth.

Example with a Gmail Account:

Tenants > Weverify (Edit) > Email (SMTP settings)

  • Host: smtp.gmail.com
  • Port: 587
  • Username:
  • Change password: (password is stored encrypted locally)
  • Security: TLS

To use your google account as a SMTP Gateway, you must turn on 2-step Verification in your account

  • Account
  • Security
  • Turn on 2-step verification

Add an access to your local FusionAuth by creating an application password

  • Account
  • Security
  • Application password
  • Other -> Give a Name -> Generated
  • Copy paste the generated password to FusionAuth Email configuration.

Slack notification

It is possible to get notify to your slack account when a user registered.

To add this feature just set the environment variable $SLACK_URL in the run time with your own slack URL.