In today's information age, the spread of fake news has become a significant concern. Fake news, defined as false or misleading information presented as factual news, poses a serious threat to society. Recognizing the gravity of this issue, our project endeavors to combat fake news and identify trends by harnessing the power of big data technologies.
Our project utilizes the GDELT Project, a vast repository of global news data, to track and identify trending news stories. Additionally, we aim to tackle the challenge of fake news by applying machine learning techniques to the news articles dataset. By training machine learning models on the content and metadata of the articles, we can detect patterns and indicators that distinguish authentic news from potentially fake news. This enables users to verify the authenticity of the news articles they encounter.
PySpark, the Python API for Apache Spark, was used to leverage Spark's big data processing capabilities. This enabled distributed processing and analysis of large-scale datasets.
The project made use of PySpark's machine learning library, MLlib, which provides a wide range of algorithms and tools for building and training machine learning models.
Streamlit, a Python library, was employed for developing the user interface (UI) of the platform. Streamlit simplifies the creation of interactive and customizable web applications.
Airflow was utilized as an orchestration tool for managing workflows and data pipelines. It allowed for the automation of data processing tasks, ensuring a smooth and efficient flow of data throughout the project.
The project comprised comprehensive data collection, effective preprocessing, the development of machine learning models for fake news detection, and in-depth exploratory data analysis.
Real-time news articles were collected from the GDELT Project dataset and the FakeNewsCorpus V1.0. Each source provided unique advantages and contributed to the overall richness of the dataset.
Several preprocessing steps were carried out to clean and prepare the data for analysis and fake news detection, including removing null values, dropping irrelevant columns, and detecting and cleaning anomalies.
Logistic regression and random forest models were developed to classify news articles as either authentic or potentially fake based on their content and metadata.
A user interface (UI) or dashboard was developed to provide users with a seamless experience while accessing verified and authentic trending news.
The project yielded several noteworthy results and outcomes, including the development of a platform for accessing verified authentic trending news, potential applications in journalism and media studies, and contribution to combating the spread of fake news.
We would like to express our sincere appreciation to the GDELT Project and FakeNewsCorpus for providing access to their extensive datasets. This project was developed as a term project for NYU's Big Data course. Special thanks to all the team members for the implementation and analysis.
Thanks for exploring this repository! Feel free to reach out for more information.