News Scraper

1. How to run

Requirements:
- Python > 3.8 environment
- Install required packages: pip install -r requirements.txt
- Firefox + webdriver for Firefox (geckodriver) https://github.com/mozilla/geckodriver/releases
- Put the file db_config.json under src/backend/ for DB credentials
Run Scraping:
- Run python scraping_main.py
NLP analysis:
- NLP Analysis.ipynb

2. Scraper

2.1. Tools

Fetching a site's HTML content (in priority order): requests, then selenium (Firefox driver)
Parsing the HTML: BeautifulSoup
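
The priority order above can be sketched as a simple fallback chain. This is a minimal illustration, not the project's actual code: the function names (fetch_with_requests, fetch_with_selenium, get_html) are hypothetical, and it assumes Selenium 4 with geckodriver discoverable on the system.

```python
from typing import Callable, Optional, Sequence


def fetch_with_requests(url: str, timeout: float = 10.0) -> Optional[str]:
    """Try a plain HTTP GET; return None on any request failure."""
    import requests  # imported lazily so the module loads without it

    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text


def fetch_with_selenium(url: str) -> Optional[str]:
    """Render the page in headless Firefox and return the resulting HTML."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def get_html(
    url: str,
    fetchers: Sequence[Callable[[str], Optional[str]]] = (
        fetch_with_requests,
        fetch_with_selenium,
    ),
) -> Optional[str]:
    """Return HTML from the first fetcher that yields non-empty content."""
    for fetch in fetchers:
        html = fetch(url)
        if html:
            return html
    return None
```

The returned string can then be handed to BeautifulSoup, e.g. BeautifulSoup(html, "html.parser"). Passing the fetchers as a parameter keeps the fallback order easy to change or to test with stand-in functions.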

2.2. Strategy

Main site -> Category -> Article

Start from the main site and scrape the URLs of all categories (e.g., World, Business, ...) and sub-categories (e.g., Asia, Australia, Tech, ...).
Go to each category (or sub-category) and scrape the URLs of all the articles shown (use selenium and wait for the page to finish loading). Validate the article URLs and keep only the valid ones (some URLs point to other sites or to video-only pages).
Go to each article and scrape the title, metadata (author, publish date), and text.
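
The URL validation step could look roughly like the sketch below. It uses only the standard library and takes raw href values as input (in practice these would come from BeautifulSoup on the selenium-rendered category page). The function names and the rule that video-only pages have a "video" or "videos" path segment are assumptions for illustration, not the project's actual rules.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def is_valid_article_url(url: str, site_host: str) -> bool:
    """Keep http(s) links on the same host that are not video-only pages."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.netloc != site_host:
        return False  # link leads to another site
    # Assumption: video-only pages have a "video" or "videos" path segment.
    segments = parsed.path.lower().split("/")
    return "video" not in segments and "videos" not in segments


def collect_article_urls(category_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against the category page, validate, and de-duplicate."""
    site_host = urlparse(category_url).netloc
    valid: List[str] = []
    for href in hrefs:
        absolute = urljoin(category_url, href)
        if is_valid_article_url(absolute, site_host) and absolute not in valid:
            valid.append(absolute)
    return valid


urls = collect_article_urls(
    "https://news.example.com/world",
    [
        "/world/story-1",
        "https://other.example.org/x",
        "/video/clip-2",
        "/world/story-1",
        "mailto:a@b.c",
    ],
)
```

Only the first link survives here: the second is on another host, the third is treated as video-only, the fourth is a duplicate, and the last is not an http(s) URL.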

2.3. Generalization

To generalize the scraper to more news sites, the tags and class names of certain components are parameterized and defined in src/backend/input/data.yml.

The template for one field is: <current_page>_<scraping_target>_<parameter>: <value>

Example:

main_category_tag: a (on the main site, the tag of a category is a)
main_category_class: sc-fjdhpX sc-chPdSV hnOkcW (on the main site, the class of a category is sc-fjdhpX sc-chPdSV hnOkcW)
article_title_tag: h1 (on an article page, the tag of the title is h1)

Note: make sure the values for tag and class name are unique.
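
As a rough illustration of how the field template can be consumed, the sketch below groups the flat field names into a nested lookup by page and scraping target. The function name group_selectors is hypothetical, the dictionary stands in for the parsed contents of data.yml (e.g. loaded with PyYAML's yaml.safe_load), and it assumes that none of the three name parts contains an underscore.

```python
from typing import Dict


def group_selectors(config: Dict[str, str]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Turn flat '<page>_<target>_<parameter>' keys into nested dicts."""
    selectors: Dict[str, Dict[str, Dict[str, str]]] = {}
    for key, value in config.items():
        parts = key.split("_")
        if len(parts) != 3:
            # Assumption: page, target and parameter contain no underscores.
            raise ValueError(f"unexpected field name: {key!r}")
        page, target, parameter = parts
        selectors.setdefault(page, {}).setdefault(target, {})[parameter] = value
    return selectors


# Stand-in for the parsed data.yml, using the example fields above.
config = {
    "main_category_tag": "a",
    "main_category_class": "sc-fjdhpX sc-chPdSV hnOkcW",
    "article_title_tag": "h1",
}
selectors = group_selectors(config)
category = selectors["main"]["category"]
```

With BeautifulSoup, the grouped values could then be used along the lines of soup.find_all(category["tag"], class_=category.get("class")), so that supporting a new site only requires new values in the YAML file.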

3. Database

Type: PostgreSQL
Host: AWS RDS db.t3.micro
Credentials: put the file db_config.json under src/backend/.
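
Loading the credentials file might look like the sketch below. The key names (host, port, dbname, user, password) and the use of psycopg2 as the driver are assumptions; the actual layout of db_config.json is not documented here, so adjust both to match the real file.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Hypothetical key names; change them to match the actual db_config.json.
REQUIRED_KEYS = ("host", "port", "dbname", "user", "password")


def load_db_config(path: str = "src/backend/db_config.json") -> Dict[str, Any]:
    """Read the connection settings and check that all expected keys exist."""
    with Path(path).open(encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"db config is missing keys: {missing}")
    return {key: config[key] for key in REQUIRED_KEYS}


def connect(path: str = "src/backend/db_config.json"):
    """Open a PostgreSQL connection (assumes the psycopg2 driver is installed)."""
    import psycopg2  # imported lazily so the config loader works without it

    return psycopg2.connect(**load_db_config(path))
```

Failing early on missing keys gives a clearer error than a connection failure later on.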

4. Project Structure

src/backend/input/: input and settings files.
src/backend/lib/: code for the main features: scraping, database, ...
webdriver/: download the webdriver for Firefox (geckodriver) and put it here
scraper.ipynb: notebook for scraping
NLP Analysis.ipynb: notebook for NLP analysis

habom2310/news-scraper
