From 2e57784326c1898154adbd7f44332f4d225096b8 Mon Sep 17 00:00:00 2001 From: Dead Teddy Date: Sun, 25 Aug 2019 23:32:15 +0530 Subject: [PATCH] Headline Module Commit --- CODE_OF_CONDUCT.md => Code of Conduct.md | 2 +- Idea/0-Algorithm_Proposal.md | 61 ++++++++++++++ {idea => Idea}/1. Headline.md | 2 +- LICENSE => License | 2 +- README.md => Read Me.md | 0 functions/summary/config.py | 4 - functions/summary/readme.md | 19 ----- functions/summary/requirements.txt | 3 - functions/summary/summary.py | 95 --------------------- functions/url/config.py | 5 -- functions/url/readme.md | 23 ------ functions/url/requirements.txt | 3 - functions/url/url.py | 55 ------------ headline.py | 55 ++++++++++++ main.py | 5 -- training_data.csv | 101 +++++++++++++++++++++++ 16 files changed, 220 insertions(+), 215 deletions(-) rename CODE_OF_CONDUCT.md => Code of Conduct.md (98%) create mode 100644 Idea/0-Algorithm_Proposal.md rename {idea => Idea}/1. Headline.md (81%) rename LICENSE => License (96%) rename README.md => Read Me.md (100%) delete mode 100644 functions/summary/config.py delete mode 100644 functions/summary/readme.md delete mode 100644 functions/summary/requirements.txt delete mode 100644 functions/summary/summary.py delete mode 100644 functions/url/config.py delete mode 100644 functions/url/readme.md delete mode 100644 functions/url/requirements.txt delete mode 100644 functions/url/url.py create mode 100644 headline.py delete mode 100644 main.py create mode 100644 training_data.csv diff --git a/CODE_OF_CONDUCT.md b/Code of Conduct.md similarity index 98% rename from CODE_OF_CONDUCT.md rename to Code of Conduct.md index 6971336..3f48d54 100644 --- a/CODE_OF_CONDUCT.md +++ b/Code of Conduct.md @@ -78,4 +78,4 @@ For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq -Copyright (c) 2019, John Steinhable +Copyright (c) 2019, The UnTruth team diff --git a/Idea/0-Algorithm_Proposal.md b/Idea/0-Algorithm_Proposal.md new file mode 
100644 index 0000000..f07e59a --- /dev/null +++ b/Idea/0-Algorithm_Proposal.md @@ -0,0 +1,61 @@ +# Proposing Algorithm + +The critical thinking model that people apply to news can also be applied by a program, giving an algorithm that detects fake news. The program can be written in several parts so that each module carries out only a single step from the list below. + +##### Critical Thinking Model: + +1. Read the headline. +2. Read the entire article. +3. Don’t believe a word of anything you read until you check facts and check sources. +4. Are the sources and facts credible? Why or why not? +5. Do a quick search engine scan to see who else has covered the story. +6. Do you see two sides (or more) to the article? +7. Are you being spun? Do you feel manipulated? +8. Are other credible news outlets covering the story? +9. Is this story a potential fake news story? + + +### Implementation + + +#### Read the headline +The headline gives the program a rough idea of the story. The program may reverse-search the headline on the top search engines and gather the data from similar headlines into a heap. It will also look up data on the source website to estimate the legitness_score of that source. + + + +#### Read the entire article +The next step involves scanning through the whole article word by word and finding relevant patterns that may help classify the article as fake or legit. 
Further, the motive of the article may be compared with the headline to predict whether misleading_title is True or False. + + + +#### Don’t believe a word of anything you read until you check facts and check sources +The overall trust_score of the article remains -1 until all the scores are calculated, i.e. the program treats the news as fake until it has completely processed it. No preference is given to BBC.com over FakeNews.com; both are considered fake initially. + + + +#### Are the sources and facts credible? Why or why not? +The source of the current article, the author and the images in the article are reverse-searched to assess the credibility of the source. The history of posts from the same author is checked, as is whether the images in the article are original or carried over from other sources and articles. + + + +#### Do a quick search engine scan to see who else has covered the story. * + + +#### Do you see two sides (or more) to the article? +This step may involve checking whether the article compares one entity with another, for example political parties. The job of the program here is to determine what is being talked about and what it is compared with, e.g. an article constantly comparing males and females. + + + +#### Are you being spun? Do you feel manipulated? +The next part helps determine whether the article is biased towards one side more than the other. In the above example, if the article is about males and females, the program checks whether the comparison favours one over the other and calculates the bias_score. When the article favours females, the bias_score is shown as +1 for females and -1 for males. An unbiased article is reflected by bias_scores totalling 0. + + + +#### Are other credible news outlets covering the story? + + +#### Is this story a potential fake news story? 
+Finally, after everything is taken into consideration, the parameters are used to label the article as fake or legit. + + + diff --git a/idea/1. Headline.md b/Idea/1. Headline.md similarity index 81% rename from idea/1. Headline.md rename to Idea/1. Headline.md index 8633c69..329e2f6 100644 --- a/idea/1. Headline.md +++ b/Idea/1. Headline.md @@ -1,6 +1,6 @@ # 1. Headline -### I deas for executing a headline rating +### Ideas for executing a headline rating #### 1.1 Trigger word list Check the words contained in the headline against a pre determined list of words. diff --git a/LICENSE b/License similarity index 96% rename from LICENSE rename to License index e0749ab..8b94d12 100644 --- a/LICENSE +++ b/License @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2019 John Steinhable +Copyright (c) 2019 The UnTruth team Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/Read Me.md similarity index 100% rename from README.md rename to Read Me.md diff --git a/functions/summary/config.py b/functions/summary/config.py deleted file mode 100644 index 17697d8..0000000 --- a/functions/summary/config.py +++ /dev/null @@ -1,4 +0,0 @@ -url = '' # url you want to evaluate -language = '' # language prefix the article is in - -lines = # number of sentences of the summary (integer) diff --git a/functions/summary/readme.md b/functions/summary/readme.md deleted file mode 100644 index 099d2e5..0000000 --- a/functions/summary/readme.md +++ /dev/null @@ -1,19 +0,0 @@ -# summary - -Returns summary from an article's main content. - -## Setup - -1. `cd functions/summary` -2. `virtualenv env` -3. `source env/bin/activate` -4. 
`pip install -r requirements.txt` - -## Parameters - -Inside `config.py` file the following parameters are **necessary** and **customizable**: -- `url` -- `language` -- `lines` - -***Do not change the value of other parameters*** diff --git a/functions/summary/requirements.txt b/functions/summary/requirements.txt deleted file mode 100644 index ced1672..0000000 --- a/functions/summary/requirements.txt +++ /dev/null @@ -1,3 +0,0 @@ -networkx==2.1 -newspaper3k==0.2.8 -nltk==3.4 diff --git a/functions/summary/summary.py b/functions/summary/summary.py deleted file mode 100644 index 5d7561e..0000000 --- a/functions/summary/summary.py +++ /dev/null @@ -1,95 +0,0 @@ -# rewritten from https://github.com/edubey/text-summarizer/blob/master/text-summarizer.py - -from nltk.cluster.util import cosine_distance -from nltk.corpus import stopwords -import numpy as np -import newspaper -import config -import networkx as nx - - -class summarizer: - # define variables - def __init__(self): - self.a = newspaper.build_article(config.url) - self.a.download() - self.a.parse() - self.a.nlp() - - self.hot = newspaper.hot() - - - # get sentences from article main - def text_to_sentences(self): - sentences = list() - - article = self.a.text - article = article.replace('\n\n', '. ') - article_sentences = article.split(r'. 
') - - for sentence in article_sentences: - sentences.append(sentence.replace('[^a-zA-Z]', '').split(' ')) - - return sentences - - - # determines sentence similarity - def sentence_similarity(self, sent1, sent2, stopwords=None): - if stopwords is None: - stopwords = [] - - sent1 = [w.lower() for w in sent1] - sent2 = [w.lower() for w in sent2] - - all_words = list(set(sent1 + sent2)) - - vector1 = [0] * len(all_words) - vector2 = [0] * len(all_words) - - for w in sent1: - if w in stopwords: - continue - vector1[all_words.index(w)] += 1 - - for w in sent2: - if w in stopwords: - continue - vector2[all_words.index(w)] += 1 - - return 1 - cosine_distance(vector1, vector2) - - - # takes article content, returns key words - def build_similarity_matrix(self, content, stop_words, sentences): - similarity_matrix = np.zeros((len(sentences), len(sentences))) - - for idx1 in range(len(sentences)): - for idx2 in range(len(sentences)): - if idx1 == idx2: - continue - similarity_matrix[idx1][idx2] = self.sentence_similarity(sentences[idx1], sentences[idx2], stop_words) - - return similarity_matrix - - - # main of function - def main(self): - summarize_text = list() - - sentences = self.text_to_sentences() - - sentence_similarity_matrix = self.build_similarity_matrix(self.a.text, stopwords.words('english'), sentences) - - sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix) - scores = nx.pagerank(sentence_similarity_graph) - - ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True) - - for i in range(config.lines): - summarize_text.append(" ".join(ranked_sentence[i][1])) - - return summarize_text - - -if __name__ == '__main__': - print("Summarized Text: \n", ". 
".join(summarizer().main())) diff --git a/functions/url/config.py b/functions/url/config.py deleted file mode 100644 index dca92ac..0000000 --- a/functions/url/config.py +++ /dev/null @@ -1,5 +0,0 @@ -url = '' # url you want to evaluate -language = '' # language prefix the article is in - -api_key = '' # google api key -cse_id = '' # custom search engine ID diff --git a/functions/url/readme.md b/functions/url/readme.md deleted file mode 100644 index 7fe46bd..0000000 --- a/functions/url/readme.md +++ /dev/null @@ -1,23 +0,0 @@ -# url - -Takes title from article, searches nouns and searches if subjects are currently talked about. - -## Setup - -1. `cd functions/url` -2. `virtualenv env` -3. `source env/bin/activate` -4. `pip install -r requirements.txt` - -## Parameters -You can get an API key by visiting [google console](https://code.google.com/apis/console) and clicking "API Access". You will then need to switch on the custom search API on the "Services" tab. - -Inside `config.py` file the following parameters are **necessary** -- `api_key` -- `cse_id` - -The following parameters are **customizable**: -- `url` -- `language` - -***Do not change the value of other parameters*** diff --git a/functions/url/requirements.txt b/functions/url/requirements.txt deleted file mode 100644 index 54e3a66..0000000 --- a/functions/url/requirements.txt +++ /dev/null @@ -1,3 +0,0 @@ -newspaper3k==0.2.8 -nltk==3.4 -google-api-python-client==1.7.11 \ No newline at end of file diff --git a/functions/url/url.py b/functions/url/url.py deleted file mode 100644 index e522ba2..0000000 --- a/functions/url/url.py +++ /dev/null @@ -1,55 +0,0 @@ -from googleapiclient.discovery import build # news api requires js, can do that later -import newspaper -import pprint -import config -import nltk -import time -import sys - - -class url_evaluator: - # define variables and prepare article instance - def __init__(self): - self.a = newspaper.build_article(config.url) - self.a.download() - self.a.parse() - 
self.a.nlp() - - self.hot = newspaper.hot() - - - # get nouns from title - def get_nouns_nltk(self): - is_noun = lambda pos: pos[:2] == 'NN' - tokenized = nltk.word_tokenize(self.a.title) - nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] - - return nouns - - - # takes a search term, does a google search on it - def google_search(self, search_term, **kwargs): - try: - service = build('customsearch', 'v1', developerKey=config.api_key) - res = service.cse().list(q=search_term, cx=config.cse_id, **kwargs).execute() - - return res['items'] - - except Exception as e: - print('Google API returned error', e) - sys.exit() - - - # main of function - def main(self): - nouns = self.get_nouns_nltk() - - for i in nouns: - results = self.google_search('i', num=10) - - for result in results: - pprint.pprint(result) - - -if __name__ == '__main__': - url_evaluator().main() diff --git a/headline.py b/headline.py new file mode 100644 index 0000000..8cbe76d --- /dev/null +++ b/headline.py @@ -0,0 +1,55 @@ +import urllib.request +from bs4 import BeautifulSoup +from textblob.classifiers import NaiveBayesClassifier +from textblob import TextBlob + +class title: + + #Initialisations + def __init__(self): + self.news_url="https://edition.cnn.com/2019/08/25/politics/trump-g7-boris-johnson-emmanuel-macron/index.html" + + + def extract_headline(self): + self.net_con=True #Expecting Internet Connection to be working initially + try: + news_page=urllib.request.urlopen(self.news_url) + soup = BeautifulSoup(news_page,'html.parser') + headline_in_html=soup.find('h1') + headline=headline_in_html.text.strip() + return headline + + except urllib.error.URLError: + print("\nCONNECTION ERROR: There may be a connection problem. 
Please check if the device is connected to the Internet") + self.net_con=False #Value update if the program is unable to connect + + + #Adding Training Data + def train_data(self, headline): + try: + with open('training_data.csv','r') as td: + cl=NaiveBayesClassifier(td,format='csv') + sentiment=cl.classify(headline) + return sentiment + + except Exception: + if not self.net_con: + pass + else: + print("\n\nProgram Error") + + + def headline_category(self,headline,sentiment): + + analyse_headline=TextBlob(headline) + print("\n"+"Headline:",headline,"\n") + print("Headline Sentiment:",sentiment,"\n\n") + + def main(self): + hdln=self.extract_headline() + sntmnt=self.train_data(hdln) + self.headline_category(hdln,sntmnt) + +if __name__=='__main__': + do_ya_thing=title() + do_ya_thing.main() diff --git a/main.py b/main.py deleted file mode 100644 index 4af7d58..0000000 --- a/main.py +++ /dev/null @@ -1,5 +0,0 @@ -#!/usr/bin/env python3 - -if __name__ == '__main__': - print("UnTruth!") - raise SystemExit diff --git a/training_data.csv b/training_data.csv new file mode 100644 index 0000000..8c53f71 --- /dev/null +++ b/training_data.csv @@ -0,0 +1,101 @@ +aba decides against community broadcasting licence,Negative +act fire witnesses must be aware of defamation,Negative +ag calls for infrastructure protection summit,Positive +air nz staff in aust strike for pay rise,Negative +air nz strike to affect australian travellers,Negative +ambitious olsson wins triple jump,Positive +antic delighted with record breaking barca,Positive +aussie qualifier stosur wastes four memphis match,Negative +aust addresses un security council over iraq,Positive +australia is locked into war timetable opp,Negative +australia to contribute 10 million in aid to iraq,Positive +barca take record as robson celebrates birthday in,Positive +bathhouse plans move ahead,Positive +big hopes for launceston cycling championship,Positive +big plan to boost paroo water supplies,Positive +blizzard buries united 
states in bills,Negative +brigadier dismisses reports troops harassed in,Negative +british combat troops arriving daily in kuwait,Negative +bryant leads lakers to double overtime win,Negative +bushfire victims urged to see centrelink,Negative +businesses should prepare for terrorist attacks,Negative +calleri avenges final defeat to eliminate massu,Neutral +call for ethanol blend fuel to go ahead,Neutral +carews freak goal leaves roma in ruins,Negative +cemeteries miss out on funds,Negative +code of conduct toughens organ donation regulations,Neutral +commonwealth bank cuts fixed home loan rates,Positive +community urged to help homeless youth,Positive +council chief executive fails to secure position,Negative +councillor to contest wollongong as independent,Positive +council moves to protect tas heritage garden,Positive +council welcomes ambulance levy decision,Positive +council welcomes insurance breakthrough,Positive +crean tells alp leadership critics to shut up,Negative +dargo fire threat expected to rise,Negative +death toll continues to climb in south korean subway,Negative +dems hold plebiscite over iraqi conflict,Positive +dent downs philippoussis in tie break thriller,Neutral +de villiers to learn fate on march 5,Neutral +digital tv will become commonplace summit,Neutral +direct anger at govt not soldiers crean urges,Negative +dispute over at smithton vegetable processing plant,Negative +dog mauls 18 month old toddler in nsw,Neutral +dying korean subway passengers phoned for help,Negative +england change three for wales match,Neutral +epa still trying to recover chemical clean up costs,Positive +expressions of interest sought to build livestock,Neutral +fed opp to re introduce national insurance,Neutral +firefighters contain acid spill,Negative +four injured in head on highway crash,Negative +freedom records net profit for third successive,Positive +funds allocated for domestic violence victims,Positive +funds allocated for youth at risk,Positive +funds 
announced for bridge work,Positive +funds to go to cadell upgrade,Positive +funds to help restore cossack,Positive +german court to give verdict on sept 11 accused,Neutral +gilchrist backs rest policy,Negative +girl injured in head on highway crash,Negative +gold coast to hear about bilby project,Neutral +golf club feeling smoking ban impact,Positive +govt is to blame for ethanols unpopularity opp,Negative +greens offer police station alternative,Positive +griffiths under fire over project knock back,Negative +group to meet in north west wa over rock art,Neutral +hacker gains access to eight million credit cards,Negative +hanson is grossly naive over nsw issues costa,Positive +hanson should go back where she came from nsw mp,Negative +harrington raring to go after break,Negative +health minister backs organ and tissue storage,Positive +heavy metal deposits survey nearing end,Positive +injured rios pulls out of buenos aires open,Neutral +inquest finds mans death accidental,Neutral +investigations underway into death toll of korean,Positive +investigation underway into elster creek spill,Negative +iraqs neighbours plead for continued un inspections,Negative +iraq to pay for own rebuilding white house,Positive +irish man arrested over omagh bombing,Positive +irrigators vote over river management,Positive +israeli forces push into gaza strip,Negative +jury to consider verdict in murder case,Positive +juvenile sex offenders unlikely to reoffend as,Positive +kelly disgusted at alleged bp ethanol scare,Neutral +kelly not surprised ethanol confidence low,Negative +korean subway fire 314 still missing,Negative +last minute call hands alinghi big lead,Positive +low demand forces air service cuts,Negative +man arrested after central qld hijack attempt,Positive +man charged over cooma murder,Positive +man fined after aboriginal tent embassy raid,Positive +man jailed over keno fraud,Positive +man with knife hijacks light plane,Negative +martin to lobby against losing nt seat in 
fed,Neutral +massive drug crop discovered in western nsw,Positive +mayor warns landfill protesters,Positive +meeting to consider tick clearance costs,Positive +meeting to focus on broken hill water woes,Positive +moderate lift in wages growth,Positive +more than 40 pc of young men drink alcohol at,Negative +more water restrictions predicted for northern tas,Negative +Petrol bombs and water cannons mark violent escalation in Hong Kong protests,Negative \ No newline at end of file
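
Reviewer sketch, not part of the patch: `headline.py` hands `training_data.csv` straight to the classifier, so a label that differs only in case or surrounding whitespace would presumably be learned as a separate class. One minimal way to validate the file before training is shown below. It assumes the intended labels are exactly `Positive`, `Negative` and `Neutral`; `load_rows` and `ALLOWED` are hypothetical names used only for illustration, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Assumed label set; the patch never states it explicitly.
ALLOWED = {"Positive", "Negative", "Neutral"}


def load_rows(handle):
    """Read (headline, label) rows, trimming whitespace and normalising label case."""
    rows = []
    for record in csv.reader(handle):
        if len(record) != 2:
            raise ValueError(f"expected 2 columns, got {len(record)}: {record!r}")
        headline = record[0].strip()
        label = record[1].strip().capitalize()
        if label not in ALLOWED:
            raise ValueError(f"unknown label {label!r} for {headline!r}")
        rows.append((headline, label))
    return rows


if __name__ == "__main__":
    # Illustrative rows only, not taken from training_data.csv.
    sample = io.StringIO(
        "example headline one,Positive\n"
        "example headline two, negative\n"
    )
    rows = load_rows(sample)
    # Count how many rows carry each normalised label.
    print(Counter(label for _, label in rows))
```

The cleaned `(headline, label)` tuples could then be passed to the classifier in place of the raw file handle, which keeps the data check separate from the training step.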