In today's information age, the spread of fake news has become a significant concern. Fake news, defined as false or misleading information presented as factual news, poses a serious threat to society. Recognizing the gravity of this issue, our project endeavors to combat fake news and identify trends by harnessing the power of big data technologies.
Our project utilizes the GDELT Project, a vast repository of global news data, to track and identify trending news stories. Additionally, we aim to tackle the challenge of fake news by applying machine learning techniques to the news articles dataset. By training machine learning models on the content and metadata of the articles, we can detect patterns and indicators that distinguish authentic news from potentially fake news. This enables users to verify the authenticity of the news articles they encounter.
PySpark, the Python API for Apache Spark, was used to leverage Spark's big data processing capabilities. This enabled distributed processing and analysis of large-scale datasets.
The project made use of PySpark's machine learning library, MLlib, which provides a wide range of algorithms and tools for building and training machine learning models.
Streamlit, a Python library, was employed for developing the user interface (UI) of the platform. Streamlit simplifies the creation of interactive and customizable web applications.
Airflow was utilized as an orchestration tool for managing workflows and data pipelines. It allowed for the automation of data processing tasks, ensuring a smooth and efficient flow of data throughout the project.
The project comprised comprehensive data collection, effective preprocessing, the development of machine learning models for fake news detection, and in-depth exploratory data analysis.
Real-time news articles were collected from the GDELT Project dataset and the FakeNewsCorpus V1.0. Each source provided unique advantages and contributed to the overall richness of the dataset.
Several preprocessing steps were carried out to clean and prepare the data for analysis and fake news detection, including removing null values, dropping irrelevant columns, and detecting and cleaning anomalies.
Logistic regression and random forest models were developed to classify news articles as either authentic or potentially fake based on their content and metadata.
A user interface (UI) or dashboard was developed to provide users with a seamless experience while accessing verified and authentic trending news.
The project yielded several noteworthy results and outcomes, including the development of a platform for accessing verified authentic trending news, potential applications in journalism and media studies, and contribution to combating the spread of fake news.
We would like to express our sincere appreciation to the GDELT Project and FakeNewsCorpus for providing access to their extensive datasets. This project was developed as a term project for NYU's Big Data course. Special thanks to all the team members for the implementation and analysis.
Thanks for exploring this repository! Feel free to reach out for more information.