Requirements:
- Python > 3.8 environment
- Install required packages: `pip install -r requirements.txt`
- Firefox and its WebDriver (geckodriver): https://github.com/mozilla/geckodriver/releases
- Put the file `db_config.json` under `src/backend/` for DB credentials
Run Scraping:
- Run `python scraping_main.py`
NLP analysis:
- Run `NLP Analysis.ipynb`
- Fetch a site's HTML content (in priority order): requests, then selenium (Firefox driver)
- Parse the HTML: BeautifulSoup
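The priority order above (requests first, selenium as a fallback) could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names `fetch_with_requests`, `fetch_with_selenium`, and `fetch_html` are hypothetical, and it assumes geckodriver is reachable by selenium.

```python
from typing import Callable, Sequence


def fetch_with_requests(url: str, timeout: float = 10.0) -> str:
    """Fetch a page with plain HTTP (fast, but no JavaScript rendering)."""
    import requests  # imported lazily so this module loads without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_with_selenium(url: str) -> str:
    """Fetch a page through a real Firefox instance (slower, renders JavaScript)."""
    from selenium import webdriver  # imported lazily, see above

    driver = webdriver.Firefox()  # assumes geckodriver can be located
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def fetch_html(
    url: str,
    fetchers: Sequence[Callable[[str], str]] = (fetch_with_requests, fetch_with_selenium),
) -> str:
    """Try each fetcher in priority order and return the first non-empty result."""
    errors = []
    for fetch in fetchers:
        try:
            html = fetch(url)
            if html and html.strip():
                return html
            errors.append(f"{fetch.__name__}: empty response")
        except Exception as exc:  # any failure means: fall through to the next fetcher
            errors.append(f"{fetch.__name__}: {exc}")
    raise RuntimeError(f"Could not fetch {url}: " + "; ".join(errors))
```

Passing the fetchers in as a parameter keeps the fallback logic independent of the network, so it can be exercised with stub functions.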
Main site -> Category -> Article
- Start from the main site and scrape the URLs of all categories (e.g., world, business, ...) and sub-categories (e.g., Asia, Australia, Tech, ...).
- Go to each category (or sub-category) and scrape the URLs of all articles shown (use selenium and wait for the page to finish loading). Validate the article URLs, since some point to other sites or contain only video, and keep only the valid ones.
- Go to each article and scrape its title, metadata (author, publish date), and text.
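The URL validation step described above might look something like the sketch below. The concrete rules are assumptions for illustration only (same host as the main site, no `video`/`videos` path segment), and `is_valid_article_url` and `collect_article_urls` are hypothetical names, not functions from this repository.

```python
from typing import Iterable, List, Sequence
from urllib.parse import urljoin, urlparse


def is_valid_article_url(
    url: str,
    site_host: str,
    blocked_segments: Sequence[str] = ("video", "videos"),  # assumed rule
) -> bool:
    """Return True for http(s) URLs on the site itself that are not video pages."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.netloc.lower()
    # Accept the site host and its sub-domains, reject everything else.
    if host != site_host and not host.endswith("." + site_host):
        return False
    segments = [s for s in parsed.path.lower().split("/") if s]
    return not any(s in blocked_segments for s in segments)


def collect_article_urls(hrefs: Iterable[str], page_url: str, site_host: str) -> List[str]:
    """Resolve relative links, drop duplicates, and keep only valid article URLs."""
    seen = set()
    valid = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen and is_valid_article_url(absolute, site_host):
            seen.add(absolute)
            valid.append(absolute)
    return valid
```

Here `hrefs` would be the raw `href` values collected from a category page once it has finished loading.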
Generalize to scraping from more news sites.
- Tags and class names of certain components are parameterized and defined in `src/backend/input/data.yml`.
- Template for one field is: `<current_page>_<scraping_target>_<parameter>: <value>`
  - `main_category_tag: a`: in the main site, the tag of a category is `a`
  - `main_category_class: sc-fjdhpX sc-chPdSV hnOkcW`: in the main site, the class of a category is `sc-fjdhpX sc-chPdSV hnOkcW`
  - `article_title_tag: h1`: in the article site, the tag of the title is `h1`
Note: make sure the tag and class name values uniquely identify the target component on the page.
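To illustrate how such flat keys could be consumed, here is a minimal sketch that groups them by page and target and turns a tag/class pair into a CSS selector. It assumes the YAML file has already been loaded into a plain dict (for example with a YAML parser), that the first token of a key is the page, and that the last token is the parameter. `group_selectors` and `to_css_selector` are hypothetical helpers, not part of this repository.

```python
from typing import Dict


def group_selectors(flat_config: Dict[str, str]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Turn {'main_category_tag': 'a'} into {'main': {'category': {'tag': 'a'}}}."""
    grouped: Dict[str, Dict[str, Dict[str, str]]] = {}
    for key, value in flat_config.items():
        parts = key.split("_")
        if len(parts) < 3:
            raise ValueError(f"Key does not match <page>_<target>_<parameter>: {key!r}")
        # First token is the page, last token the parameter, the rest is the target.
        page, target, parameter = parts[0], "_".join(parts[1:-1]), parts[-1]
        grouped.setdefault(page, {}).setdefault(target, {})[parameter] = str(value)
    return grouped


def to_css_selector(params: Dict[str, str]) -> str:
    """Combine a tag and a space-separated class list into one CSS selector."""
    tag = params.get("tag", "")
    classes = params.get("class", "").split()
    return tag + "".join("." + name for name in classes)


# Example using the fields shown above.
config = {
    "main_category_tag": "a",
    "main_category_class": "sc-fjdhpX sc-chPdSV hnOkcW",
    "article_title_tag": "h1",
}
selectors = group_selectors(config)
category_selector = to_css_selector(selectors["main"]["category"])
title_selector = to_css_selector(selectors["article"]["title"])
```

With BeautifulSoup, a selector built this way could then be passed to `soup.select(...)` to find the matching elements.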
- Type: PostgreSQL
- Host: AWS RDS db.t3.micro
- Credentials: put the file `db_config.json` under `src/backend/`.
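As a rough illustration of how the credentials file might be read, the sketch below loads `db_config.json` and checks that the expected fields are present. The key names (`host`, `port`, `dbname`, `user`, `password`) and the use of `psycopg2` are assumptions, since the actual file layout and driver are not described here. `load_db_config` and `connect` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Assumed key names; adjust to whatever db_config.json actually contains.
REQUIRED_KEYS = ("host", "port", "dbname", "user", "password")


def load_db_config(path: str = "src/backend/db_config.json") -> Dict[str, Any]:
    """Read the credentials file and fail early if a field is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing keys: {missing}")
    return {key: config[key] for key in REQUIRED_KEYS}


def connect(config: Dict[str, Any]):
    """Open a PostgreSQL connection (assumes the psycopg2 driver is installed)."""
    import psycopg2  # imported lazily so the loader works without the driver

    return psycopg2.connect(**config)
```

Keeping the file out of version control and validating it at startup makes a missing or incomplete credentials file fail with a clear message instead of a connection error.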
- `src/backend/input/`: input/setting files.
- `src/backend/lib/`: code for main features: scraping, database, ...
- `webdriver/`: download and put the webdriver for Firefox here
- `scrapper.ipynb`: notebook for scraping
- `NLP Analysis.ipynb`: notebook for NLP analysis
- Create PostgreSQL database in AWS
- Code to scrape articles from CNN
- Define DB structure
- Logging
- Clean, map and push scraped data to DB
- Analyse data
  - Sentiment
  - Statistical
  - Relationship
  - Named Entity Recognition
  - Topic clustering
- Generalize scraping strategy
- Create install file
- Functions to update analysis labels in DB