Twint Wrapper is a Java Spring Boot application that exposes operations in front of the Twint scraping tool.
It supports the following operations:
- collect => request a Twitter scrape with search criteria
- status => check the status of a scraping process
- collect-history => get the request history
- collect-update => re-process an existing scrape
All operations are documented on the Swagger page at /twint-wrapper/swagger-ui.html
The target build is a Docker image that combines the Twint Docker image and the Java Spring Boot application.
- ElasticSearch to index Twitter scraping results
- MySQL to manage scraping sessions
- Docker
- Twint
- FusionAuth to restrict operations to authorized users
TwintPlus is the supported scraper. It must be built as a Docker image before any new development.
Test the Twint image:
docker run --rm -it twintplus "tplus -cq %23deepfake --since 2020-05-01 --until 2020-05-19 -ee elasticsearch:9200 -es --index-name tsna --limit 10000"
There are several profiles that raise or lower the scraping limits. The following values can be overridden with environment variables:
- TWINT_LIMIT_DEFAULT (15k)
- TWINT_LIMIT_MIN (15k)
- TWINT_LIMIT_DEFAULT (30k)
- TWINT_LIMIT_MAX (60k)
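As a rough illustration of how such an override could be resolved, here is a minimal Java sketch. It is not the application's actual code: the class `TwintLimits` and the method `resolve` are hypothetical names, and reading "15k" and "60k" as 15000 and 60000 is an assumption.

```java
import java.util.Map;

// Hypothetical helper, not part of twint-wrapper.
class TwintLimits {

    // Returns the integer value of the named variable, or the fallback
    // when the variable is unset, blank, or not a valid integer.
    public static int resolve(Map<String, String> env, String name, int fallback) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // Assumed defaults: 15k read as 15000 and 60k read as 60000.
        int min = resolve(env, "TWINT_LIMIT_MIN", 15_000);
        int max = resolve(env, "TWINT_LIMIT_MAX", 60_000);
        System.out.println("min=" + min + " max=" + max);
    }
}
```

Passing the environment as a map keeps the lookup easy to exercise without touching the real process environment.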
The src/scripts/dev folder provides a docker-compose file that starts MySQL, ElasticSearch and FusionAuth.
$ cd src/scripts/dev
$ docker-compose up -d
By default, the application connects to these containers.
Another script adds Kibana to monitor the ElasticSearch index.
$ cd src/scripts/dev
$ docker-compose -f docker-compose-kibana.yml up -d
Some Linux systems need a higher virtual memory limit (see https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html). If ElasticSearch fails at startup, run:
sudo sysctl -w vm.max_map_count=262144
The default build packages the Spring Boot application as a jar file. The default profile is dev.
mvn clean package
To build the Docker image containing the Spring Boot application and the Twint application:
mvn clean package -P prod
Twitter-gateway can be run with different profiles (default, dev and prod).
default:
java -jar twint-wrapper.jar
dev:
java -Dspring.profiles.active=dev -jar twint-wrapper.jar
prod:
java -Dspring.profiles.active=prod -jar twint-wrapper.jar
The full application, with all its components, runs with docker-compose:
$ cd src/main/script/prod
$ docker-compose up -d
When run with docker-compose, FusionAuth uses a setup SQL script that embeds the minimum configuration needed to add authorized users. This initialisation is applied at the first docker-compose startup. The supported version is 1.15.* (tested with 1.15.2 and 1.15.5).
The steps to grant a user access are:
- The user registers
- The user is approved through the FusionAuth UI
- The user requests a registration code with their email address
- The user requests a token with the received registration code, which allows them to use the scraping operations.
FusionAuth is accessible locally:
- http://localhost:9011
- login: [email protected]
- password: Weverify$
FusionAuth sends the registration code by email. To enable this feature, an SMTP server needs to be set up in FusionAuth.
Example with a Gmail Account:
Tenants > Weverify (Edit) > Email (SMTP settings)
- Host: smtp.gmail.com
- Port: 587
- Username:
- Change password: (password is stored encrypted locally)
- Security: TLS
To use your Google account as an SMTP gateway, you must turn on 2-step verification in your account:
- Account
- Security
- Turn on 2-step verification
Give your local FusionAuth access by creating an application password:
- Account
- Security
- Application password
- Other -> Give a name -> Generate
- Copy and paste the generated password into the FusionAuth email configuration.
You can be notified in Slack when a user registers.
To enable this feature, set the environment variable $SLACK_URL at run time to your own Slack URL.
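How the application posts to Slack is not described here, so the following is only a sketch. It assumes $SLACK_URL is a Slack incoming-webhook URL, which accepts a JSON body with a `text` field. The class `SlackNotifier`, its methods, and the message wording are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of twint-wrapper.
class SlackNotifier {

    // Builds the JSON body; backslashes and double quotes are escaped
    // so the email cannot break out of the JSON string.
    public static String payload(String email) {
        String safe = email.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"text\":\"New user registered: " + safe + "\"}";
    }

    // Prepares a POST request to the webhook URL (assumed webhook format).
    public static HttpRequest buildRequest(String slackUrl, String email) {
        return HttpRequest.newBuilder(URI.create(slackUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload(email)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String slackUrl = System.getenv("SLACK_URL");
        if (slackUrl == null || slackUrl.isBlank()) {
            System.out.println("SLACK_URL is not set; notification skipped");
            return;
        }
        // user@example.org is a placeholder address.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                buildRequest(slackUrl, "user@example.org"),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Slack responded with HTTP " + response.statusCode());
    }
}
```

When the variable is unset the sketch simply skips the notification, which matches the optional nature of the feature.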
There are two Twitie versions, and the application supports both. The two versions differ in request format and endpoint; the response format is the same.
Legacy Twitie version, which supports only English:
- URL: /gate/process?annotations=:Person,:UserID,:Location,:Organization
- Body format: String
- Content-Type: text/plain
spaCy-based Twitie version, which supports the following languages: de, el, en, es, fr, it, pt (the default is English):
- URL: /process
- Body format: json
- Content-Type: application/json
Example request:
{
  "type": "text",
  "content": "The Wall Street Journal always “forgets” to mention that the ratings for the White House Press Briefings are “through the roof” (Monday Night Football, Bachelor Finale, according to @nytimes) & is only way for me to escape the Fake News & get my views across. WSJ is Fake News!",
  "features": {
    "lang": "en"
  }
}
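The difference between the two request formats can be sketched in Java as follows. The paths, content types and JSON fields come from the descriptions above; the class `TwitieRequest`, its factory methods and the hand-written `escapeJson` helper are hypothetical and only illustrate the shape of each request.

```java
// Hypothetical value class, not part of twint-wrapper.
class TwitieRequest {
    public final String path;
    public final String contentType;
    public final String body;

    TwitieRequest(String path, String contentType, String body) {
        this.path = path;
        this.contentType = contentType;
        this.body = body;
    }

    // Legacy version: the raw tweet text is sent as text/plain.
    public static TwitieRequest legacy(String tweet) {
        return new TwitieRequest(
                "/gate/process?annotations=:Person,:UserID,:Location,:Organization",
                "text/plain",
                tweet);
    }

    // spaCy-based version: the tweet is wrapped in a JSON document with a language feature.
    public static TwitieRequest spacy(String tweet, String lang) {
        String body = "{\"type\":\"text\",\"content\":\"" + escapeJson(tweet)
                + "\",\"features\":{\"lang\":\"" + escapeJson(lang) + "\"}}";
        return new TwitieRequest("/process", "application/json", body);
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TwitieRequest request = spacy("WSJ is \"Fake News\"!", "en");
        System.out.println(request.path + " " + request.contentType);
        System.out.println(request.body);
    }
}
```

In a real project a JSON library would normally handle the serialization; the manual escaping here only keeps the sketch free of dependencies.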
- TWITIE_IS_SPACY (true): selects which version of Twitie is used
- TWITIE_URL (http://localhost:8081/process): Twitie service endpoint URL
- TWITIE_THREADS (8): number of threads used to process tweets with Twitie
- TWITIE_ES_PAGESIZE (100): ElasticSearch request page size for Twitie processing
- TWITIE_ES_BULKSIZE (5000): ElasticSearch bulk update size after Twitie processing
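To make the roles of the thread count and the bulk size more concrete, here is a minimal Java sketch. It does not reproduce the application's pipeline: the class `TwitieBatching`, both methods, and the stand-in annotation step are hypothetical. It only shows tweets being processed on a fixed-size thread pool and results being grouped into bulks of a given size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of twint-wrapper.
class TwitieBatching {

    // Splits a list into consecutive batches of at most `size` elements.
    public static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Runs a stand-in annotation step on a fixed pool of `threads` workers
    // and returns the results in input order.
    public static List<String> annotateAll(List<String> tweets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String tweet : tweets) {
                // Placeholder for a call to the Twitie service.
                futures.add(pool.submit(() -> "annotated:" + tweet));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("annotation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> annotated = annotateAll(List.of("a", "b", "c"), 8);
        // With a bulk size of 2, three results are written in two bulks.
        System.out.println(partition(annotated, 2).size() + " bulks");
    }
}
```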