A Flask-based web scraper that fetches the latest news headlines from The Atlantic and displays them on a webpage. The application integrates Google OAuth and reCAPTCHA for security.


News Scraper Application

This Flask-based application scrapes the latest news headlines and descriptions from The Atlantic and stores the data before rendering it on a webpage. Additionally, it integrates with the Pokémon API to fetch and manage Pokémon data.

(Screenshot: the application's front page)

Requirements

The following packages are required to run the application:

  • Flask: Web framework for Python.
  • requests: HTTP library for sending GET requests to fetch the news data.
  • beautifulsoup4: Library for parsing HTML and scraping data.
  • google-auth: Library for authenticating with Google services.
  • google-auth-oauthlib: Library for OAuth 2.0 authentication with Google.
  • Svelte: A modern JavaScript framework for building user interfaces.
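A minimal requirements.txt covering the Python packages above might look like the following (versions are left unpinned here; pin them as needed). Note that Svelte is installed with npm in the Frontend directory, not with pip:

```text
Flask
requests
beautifulsoup4
google-auth
google-auth-oauthlib
```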

Setup Instructions

Follow these steps to get the application up and running:

1. Clone the Repository

Clone this repository to your local machine:

git clone https://github.com/mai-repo/RG-Knowledge-Check-1.git

2. Create a Virtual Environment

python -m venv .venv

Activate it before installing dependencies:

source .venv/bin/activate   # Windows: .venv\Scripts\activate

3. Create a .gitignore

Add .venv in the .gitignore file to prevent committing the virtual environment folder:

.venv/

4. Install Dependencies

Install all the required Python packages:

pip3 install -r requirements.txt

5. Set Up Google Applications and Keys

Follow the instructions to set up Google applications and obtain the necessary keys for authentication.

Google Application Setup and .env Configuration

1. Create a Google Cloud Project

  • Go to the Google Cloud Console.
  • Create a new project and enable the following APIs:
    • Google Identity Services API
    • reCAPTCHA API

2. Create OAuth 2.0 Credentials

  • Go to Credentials in the Google Cloud Console.
  • Create OAuth 2.0 Client ID for a web app.
  • Add authorized origins (e.g., http://127.0.0.1:5000/).
  • Download the JSON file with client secrets.
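Once the client ID is configured, the backend can verify Google ID tokens server-side. This is a sketch assuming the google-auth package; the `verifier` parameter is a hypothetical hook added here so the logic can be exercised without a network call:

```python
def verify_google_token(token, client_id, verifier=None):
    """Return the token's claims if it verifies, or None if it is invalid.

    By default this calls google-auth's verify_oauth2_token; a custom
    `verifier` can be injected for testing.
    """
    if verifier is None:
        from google.oauth2 import id_token
        from google.auth.transport import requests as google_requests

        def verifier(t):
            return id_token.verify_oauth2_token(t, google_requests.Request(), client_id)
    try:
        return verifier(token)
    except ValueError:  # google-auth raises ValueError on an invalid or expired token
        return None
```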

3. Set Up reCAPTCHA

  • Go to the reCAPTCHA Admin Console and register your site.
  • Add your domain (e.g., 127.0.0.1 for local development).
  • Note the site key (used by the frontend) and the secret key (used by the backend).
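Server-side reCAPTCHA verification can be sketched as below. The siteverify URL is Google's documented endpoint; the `post` parameter is a hypothetical injection point added here for testing, not part of any library API:

```python
import json


def verify_recaptcha(response_token, secret, post=None):
    """Return True if Google's siteverify endpoint accepts the token."""
    if post is None:
        from urllib import parse, request

        def post(url, data):
            body = parse.urlencode(data).encode()
            with request.urlopen(url, body) as resp:
                return json.loads(resp.read())
    result = post(
        "https://www.google.com/recaptcha/api/siteverify",
        {"secret": secret, "response": response_token},
    )
    return bool(result.get("success"))
```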

4. Create .env File

  • Create a .env file in the root directory of your project.
  • Add the following variables:
GOOGLE_CLIENT_SECRET=your-google-client-secret
RECAPTCHA_SECRET_KEY=your-recaptcha-secret-key
DATABASE_URL=your-render-database
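At startup the backend can read these variables and fail fast when one is missing. A stdlib-only sketch (the function name is an assumption, not taken from the repo; the app presumably loads .env first, e.g. with python-dotenv):

```python
import os

REQUIRED_VARS = ("GOOGLE_CLIENT_SECRET", "RECAPTCHA_SECRET_KEY", "DATABASE_URL")


def load_config(env=None):
    """Collect required settings from the environment, raising if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```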

6. Data Schema

News Data Schema

News

The news data is stored in a PostgreSQL database with the following schema:

CREATE TABLE IF NOT EXISTS news (
    id SERIAL PRIMARY KEY,
    headline TEXT NOT NULL,
    summary TEXT NOT NULL,
    link TEXT NOT NULL
);

Pokémon Data Schema

Pokemon

The Pokémon data is stored in a PostgreSQL database with the following schema:

CREATE TABLE IF NOT EXISTS pokemon (
    id SERIAL PRIMARY KEY,
    username TEXT NOT NULL,
    pokemonName TEXT NOT NULL,
    image TEXT NOT NULL
);

Favorite Articles Data Schema

FavArt

The favorite articles data is stored in a PostgreSQL database with the following schema:

CREATE TABLE IF NOT EXISTS favArt (
    id SERIAL PRIMARY KEY,
    username TEXT NOT NULL,
    news_id INT NOT NULL,
    FOREIGN KEY (news_id) REFERENCES news(id) ON DELETE CASCADE
);

Table for Full-Text Search

News_fts

The table for full-text search is created using the following schema:

CREATE TABLE IF NOT EXISTS news_fts (
      id INT PRIMARY KEY,
      headline TEXT,
      summary TEXT,
      link TEXT
);
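The ON DELETE CASCADE relationship between news and favArt can be exercised in isolation. Production uses PostgreSQL; the sketch below uses the stdlib sqlite3 module purely to illustrate the behavior (SQLite needs foreign-key enforcement switched on explicitly):

```python
import sqlite3


def demo_cascade():
    """Show that deleting a news row removes its dependent favArt rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
    conn.execute(
        "CREATE TABLE news (id INTEGER PRIMARY KEY, headline TEXT NOT NULL, "
        "summary TEXT NOT NULL, link TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE favArt (id INTEGER PRIMARY KEY, username TEXT NOT NULL, "
        "news_id INT NOT NULL, "
        "FOREIGN KEY (news_id) REFERENCES news(id) ON DELETE CASCADE)"
    )
    conn.execute("INSERT INTO news VALUES (1, 'Headline', 'Summary', 'https://example.com')")
    conn.execute("INSERT INTO favArt VALUES (1, 'alice', 1)")
    conn.execute("DELETE FROM news WHERE id = 1")  # cascade removes alice's favorite
    remaining = conn.execute("SELECT COUNT(*) FROM favArt").fetchone()[0]
    conn.close()
    return remaining
```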

7. Download Frontend Dependencies

Navigate to the frontend directory:

cd Frontend

Install dependencies:

npm install

8. Run the Flask Application

export FLASK_APP=Backend.main
flask run

8.1 Run the Frontend Application

Navigate to the frontend directory:

cd Frontend

Start the development server:

npm run dev

This will start the frontend application, and you can access it in your web browser at http://localhost:9000.

9. Open Your Web Browser

The webpage presents a button; clicking it triggers the scraper, which fetches the latest headlines from The Atlantic and returns them as JSON.
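The scraping step amounts to pulling headline text out of the fetched HTML. The app uses BeautifulSoup; the sketch below uses the stdlib html.parser instead so it runs with no dependencies, and the "headline" class name is a stand-in, not The Atlantic's actual markup:

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of elements whose class attribute includes 'headline'."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if "headline" in classes.split():
            self._capturing = True

    def handle_data(self, data):
        text = data.strip()
        if self._capturing and text:
            self.headlines.append(text)
            self._capturing = False
```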

10. Testing Instructions

Follow these steps to run the tests for the application:

1. Set Up the Testing Environment

Ensure that you have installed all the required dependencies as mentioned in the Setup Instructions.

  • Using unittest: run the tests with the following command:

python -m unittest discover -s Backend/tests -p "test_*.py"
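A test module under Backend/tests follows the usual unittest shape picked up by the discover command above; `normalize_article` here is a hypothetical helper, shown only to illustrate the pattern:

```python
import unittest


def normalize_article(raw):
    """Trim whitespace and drop empty fields from a scraped article dict."""
    return {k: v.strip() for k, v in raw.items() if v and v.strip()}


class TestNormalizeArticle(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(normalize_article({"headline": " Hi "}), {"headline": "Hi"})

    def test_drops_empty_fields(self):
        self.assertEqual(
            normalize_article({"headline": "Hi", "summary": "  "}),
            {"headline": "Hi"},
        )


if __name__ == "__main__":
    unittest.main()
```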

2. View Test Coverage Report

If you want to generate a test coverage report, you can use pytest-cov.

  • Install pytest-cov using the following command:

pip install pytest-cov

  • Run the tests with coverage using the following command:

pytest --cov=Backend --cov-report=html Backend/tests

This will generate a coverage report in the htmlcov directory. You can view the report by opening the index.html file in a web browser:

open htmlcov/index.html

11. Deployment

1. Create a Dockerfile

Create a Dockerfile in the root directory of your project:

> **Note:** The following `Dockerfile` is a template. You may need to adjust it according to your specific project requirements.

```dockerfile
# Use the official Python image from the Docker Hub
FROM python:3.9-slim

# Set the working directory
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY . .

# Set environment variables
ENV FLASK_APP=Backend.main

# Expose the port the app runs on
EXPOSE 5000

# Run the application
CMD ["flask", "run", "--host=0.0.0.0"]
```
2. Create a Render YAML File

Create a render.yaml file in the root directory of your project to define the Render service configuration. A template is included with the repo:

```yaml
services:
  - type: web
    name: news-scraper
    env: docker
    dockerfilePath: ./Dockerfile
    envVars:
      - key: FLASK_ENV
        value: production
      - key: FLASK_APP
        value: Backend/main.py
      - key: GOOGLE_CLIENT_KEY
        sync: false  # set the value in the Render dashboard
      - key: BACKEND_KEY
        sync: false  # set the value in the Render dashboard
      - key: DATABASE_URL
        fromDatabase:
          name: your-render-database
          property: connectionString
    startCommand: gunicorn -w 4 -b 0.0.0.0:8080 Backend.main:app
```

12. Environment Configuration

1. Create .env.local and .env.production in the Frontend directory

Create .env.local for local development:

VITE_API_BASE_URL=https://your-local-url.com

Create .env.production for production:

VITE_API_BASE_URL=https://your-production-url.com

13. Deploy on Render

Follow these steps to deploy your application on Render:

  1. Create a new web service on Render.
  2. Connect your GitHub repository.
  3. Set the build and start commands:
  • Build Command: pip install -r Backend/requirements.txt
  • Start Command: gunicorn -w 4 -b 0.0.0.0:8080 Backend.main:app
  4. Add environment variables in the Render dashboard:
  • GOOGLE_CLIENT_SECRET=your-google-client-secret
  • RECAPTCHA_SECRET_KEY=your-recaptcha-secret-key
  • DATABASE_URL=your-database-url
  5. Deploy the application.

14. Deploy on Vercel

Follow these steps to deploy your frontend application on Vercel:

  1. Log in to your Vercel account.
  2. Connect your GitHub repository to Vercel.
  3. Set the environment variables in the Vercel dashboard:
  • VITE_API_BASE_URL=https://your-backend-url.com
  4. Deploy the application.

15. Database on Render

To set up the database on Render:

  1. Create a new PostgreSQL database on Render.
  2. Note the database URL provided by Render.
  3. Add the database URL to your environment variables in the Render dashboard:
  • DATABASE_URL=your-database-url
  4. Update your application to use the Render database URL for database connections.
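If the database driver needs individual connection fields rather than a URL, the Render-provided DATABASE_URL can be split with the stdlib. A sketch (the function name is an assumption):

```python
from urllib.parse import urlparse


def parse_database_url(url):
    """Split a postgres:// URL into the pieces a DB driver typically needs."""
    p = urlparse(url)
    return {
        "user": p.username,
        "password": p.password,
        "host": p.hostname,
        "port": p.port or 5432,  # default PostgreSQL port if none given
        "dbname": p.path.lstrip("/"),
    }
```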

Advanced Search

The advanced search feature allows users to search for news articles based on specific keywords. This feature enhances the user experience by providing more relevant search results.

How to Use Advanced Search

  1. Navigate to the Search Page: Go to the search page in the application.
  2. Enter Keywords: Enter the keywords you want to search for in the search bar.
  3. View Results: The application will display the news articles that match the entered keywords.

Example

  • To search for articles related to "economy" or "Trump", enter both keywords in the search bar and press Enter. The application will display every article that contains either keyword.
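The OR semantics described above can be sketched as an in-memory filter. The real app queries the news_fts table, so this is only an illustration of the matching behavior:

```python
def search_articles(articles, keywords):
    """Return articles whose headline or summary contains any keyword (case-insensitive)."""
    kws = [k.lower() for k in keywords]
    hits = []
    for article in articles:
        text = (article.get("headline", "") + " " + article.get("summary", "")).lower()
        if any(k in text for k in kws):
            hits.append(article)
    return hits
```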

Stretch Goals

  • Allow users to choose from a variety of news sites

Learning Experience

Building this News Scraper Application was a challenging yet rewarding experience that taught me how to build a full-stack application from scratch, integrating both backend and frontend technologies. I learned how to develop a full-stack application using Flask for the backend and Svelte for the frontend, and how to implement web scraping using BeautifulSoup to fetch news data. Deploying the application on Render and Vercel required careful configuration and taught me how to manage deployment pipelines and ensure the application runs smoothly in a production environment.
