Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 5649052, commit 75b3079
Showing 19 changed files with 10,920 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
OPENAI_API_KEY="<redacted>" |
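A key stored in a tracked file is readable by anyone with access to the repository, so the usual alternative is to keep the file out of version control and read the value from the environment at runtime. A minimal sketch, assuming the variable name `OPENAI_API_KEY` from the file in this diff; the helper `load_api_key` is hypothetical and not part of this commit:

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, failing loudly if it is absent.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    # Accept an explicit mapping so the function can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it or load it from an untracked .env file"
        )
    return key
```

Calling `load_api_key()` with no arguments reads from `os.environ`; passing a plain dictionary is only meant for testing.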
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
Tweet | ||
"@tedcruz And, #HandOverTheServer she wiped clean + 30k deleted emails, explains dereliction of duty/lies re #Benghazi,etc #tcot" | ||
Hillary is our best choice if we truly want to continue being a progressive nation. #Ohio | ||
"@TheView I think our country is ready for a female pres, it can't ever be Hillary" | ||
I just gave an unhealthy amount of my hard-earned money away to the big gov't & untrustworthy IRS. #WhyImNotVotingForHillary | ||
@PortiaABoulger Thank you for adding me to your list | ||
Hillary can not win. Here's hoping the Dems offer a real candidate like Warren. #Warren2016 | ||
"Respect FOR the law and respect BY the law Yes, needed desperately. #BaltimoreRiots" | ||
I don't want to be appointed to an Ambassador post. | ||
"#StopHillary2016 @HillaryClinton if there was a woman with integrity and honesty I would vote for such as woman president, NO" | ||
@HillaryClinton End lawless #ClintonFoundation. Jail Butcher of #Benghazi. #Arrest rapist #BillClinton. #HillaryClinton | ||
"Use your brain, keep Hillary out of the White House.Clinton2016" | ||
@HillaryClinton Hillary pandering with her logo. #ClintonFoundationscandal #ClintonCash | ||
"@readyforHRC @HillaryClinton #HillaryClinton, the US presidency is a testament to the success of #women their role in the world" | ||
@CiaraAntaya cuz you know I'm such a feminist | ||
2 million bogus followers on Twitter @HillaryClinton #WhyImNotVotingForHillary | ||
@lindasuhler : My name is Rebecca and my grandmother immigrated to Sunnybrook Farm. @twitchyteam | ||
Where's the campaign store is the real question? I am ready to buy some Hillary gear | ||
"It's a miracle, suddenly #Democrats don't mind having someone who voted for war." | ||
@smileitsalicia @greekgummybear2 now i can live in peace | ||
Hillary doesn't want to put anyone in prison anymore. Obviously worried about her own future. | ||
The only way I support Hillary was if Elizabeth Warren ran or Karl Marx was running #2016 #Clinton2016 | ||
@HomeOfUncleSam @ScotsFyre @RWNutjob1 @SA_Hartdegen She's too old to understand the internet...that she can be fact checked. | ||
Because Communist Breadlines are not my thing! #NoHillary #WhyImNotVotingForHillary | ||
"@HillaryClinton bad wife, bad role model for women, bad lawyer, bad First Lady, bad Senator, horrible Secretary of State." | ||
"Everything Hillary touches ends up being a scam, a lie, a cover-up, or a failure. William L. Just who we want as president." | ||
Yes HRC subject 2 dbl standard Smh Come on @billclinton @HillaryClinton U Knew @ClintonFdn Donations Would b Scrutinized; Spun! | ||
#Hillary to stop for #pizza today to garner the #Italian vote. #MSM is worthless. #libertynothillary #HillNo | ||
I want America to great again #WhyImNotVotingForHillary | ||
"March 8, 2016 Ohio is holding our Primaries! The date is subject to change. #Ohio #OurChampion" | ||
@RIGHTZONE @WethePeoplePets Let's hope the VOTERS remember! #HillNo | ||
Hillary Clinton has not driven a car since 1996. #clintonfakerealityshow | ||
"@NaughtyBeyotch @TheRealMadman23 Don't care for #Fiorina, but it seems she taking the #sexist gut punches by the #media. #MSM" | ||
"@FoxNews @marthamaccallum @BillHemmer whose the opportunist now, #NoHillary2016" | ||
@HillaryClinton @WomenintheWorld we need to re-establish a #global system dominated by love and affection have #moral_humane RT | ||
#Hillary is as transparent as a brick wall #LibertyNotHillary | ||
@WSJ . Clinton Foundation to keep accepting bribes from foreign governments #WhyImNotVotingForHillary | ||
"@josephbenning I agree, these are better than what you had before, like a severe cold is better than pneumonia. Good luck." | ||
"What are you afraid of @HillaryClinton? If you can't answer questions from the press, why do we want you as POTUS" | ||
"Sorry, Hillary's new normal folk image doesn't take away from Behgnazi & her 0 foreign policy successes as Secretary of State." | ||
CEO pay the target for 2016 election. From someone who makes more than most CEO's but you drank the Kool-Aid. |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
import pandas as pd | ||
| ||
| ||
def write_tweets_to_txt(file): | ||
    """Write the "Tweet" column of a CSV file to a text file, one tweet per line.""" | ||
    df = pd.read_csv(file, encoding='ISO-8859-1') | ||
    column_data = df["Tweet"] | ||
| ||
    with open('atheism_without_none.txt', 'w', encoding='ISO-8859-1') as f: | ||
        # Emit a separator line before every block of 43 tweets. | ||
        for row_count, value in enumerate(column_data): | ||
            if row_count % 43 == 0: | ||
                f.write('********************************************' + '\n') | ||
            f.write(str(value) + '\n') | ||
| ||
| ||
if __name__ == '__main__': | ||
    write_tweets_to_txt('atheism_without_none.csv') |
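The script above writes one tweet per line and emits a separator line before every 43rd row. The same grouping can be sketched as a pure generator, which is easier to test than a loop that writes straight to a file. The name `lines_with_separators`, its parameters, and the default separator width are assumptions for illustration, not part of the commit:

```python
def lines_with_separators(values, group_size=43, separator="*" * 40):
    """Yield each value as a string, with a separator before every group.

    Hypothetical helper: a separator is emitted before items 0, group_size,
    2 * group_size, and so on.
    """
    for index, value in enumerate(values):
        if index % group_size == 0:
            yield separator
        yield str(value)
```

A caller could then write the result with `f.write('\n'.join(lines_with_separators(column_data)) + '\n')`, keeping file handling separate from the grouping logic.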
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
import string | ||
from functools import lru_cache | ||
| ||
import numpy as np | ||
import pandas as pd | ||
from nltk.tokenize import word_tokenize | ||
from sklearn.feature_extraction.text import TfidfVectorizer | ||
| ||
# Stemming is disabled for now; restore these lines together with the one in tokenize_tweet. | ||
# from snowballstemmer import TurkishStemmer | ||
# stemmer = TurkishStemmer() | ||
| ||
| ||
def read_turkish_tweets(file): | ||
    """Return the "Tweet" and "Target" columns of a windows-1254 encoded CSV as lists.""" | ||
    df = pd.read_csv(file, encoding='windows-1254') | ||
    tweets = df["Tweet"].tolist() | ||
    targets = df["Target"].tolist() | ||
    return tweets, targets | ||
| ||
| ||
@lru_cache(maxsize=1) | ||
def detect_stopwords(): | ||
    """Load the stop-word list once and reuse it on every later call.""" | ||
    # 'turkish' is expected to be a local file with one stop word per line. | ||
    stopwords_df = pd.read_csv('turkish', header=None) | ||
    stop_words = stopwords_df[0].tolist() | ||
    #stop_words = stopwords.words('turkish') | ||
    stop_words.extend(string.punctuation) | ||
    stop_words.extend(["vs.", "vb.", "a", "i", "e", "rt", "#semst", "semst"]) | ||
    return frozenset(stop_words) | ||
| ||
| ||
def tokenize_tweet(tweet): | ||
    """Tokenize a tweet, lowercase it, and drop stop words and URL tokens.""" | ||
    # word_tokenize needs the NLTK Punkt tokenizer data to be installed. | ||
    tokens = word_tokenize(tweet) | ||
    stop_words = detect_stopwords() | ||
    normalized_tokens = [token.lower() for token in tokens] | ||
    filtered_tokens = [ | ||
        token for token in normalized_tokens | ||
        if token not in stop_words and not token.startswith("http") | ||
    ] | ||
    #stemmed_tokens = [stemmer.stemWord(token) for token in filtered_tokens] | ||
    return filtered_tokens | ||
| ||
| ||
def extract_features_tfidf(tweets): | ||
    """Build a dense matrix of word-level and character-level TF-IDF features.""" | ||
    # Tokenize and preprocess, then rejoin so TfidfVectorizer can consume plain strings. | ||
    documents = [' '.join(tokenize_tweet(tweet)) for tweet in tweets] | ||
| ||
    # Feature extraction: word n-grams (unigrams to trigrams) | ||
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='word') | ||
    tfidf_features = tfidf_vectorizer.fit_transform(documents) | ||
| ||
    # Feature extraction: character n-grams (2 to 5 characters) | ||
    char_tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 5), analyzer='char') | ||
    char_tfidf_features = char_tfidf_vectorizer.fit_transform(documents) | ||
| ||
    # Other features (sentiment lexicon, target presence/absence, POS tags, encodings) | ||
    # are not computed in this function. | ||
| ||
    # Combine both blocks column-wise; toarray() densifies, which can be memory-hungry. | ||
    all_features = np.concatenate((tfidf_features.toarray(), char_tfidf_features.toarray()), axis=1) | ||
    np.savetxt('feature_matrix.csv', all_features, delimiter=',') | ||
    return all_features | ||
| ||
| ||
if __name__ == '__main__': | ||
    train_tweets, train_targets = read_turkish_tweets('translated_train_without_none.csv') | ||
    print(extract_features_tfidf(train_tweets)) | ||
    #print(tokenize_tweet(train_tweets[0])) |
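`extract_features_tfidf` relies on two n-gram views of each tweet: word n-grams of length 1 to 3 and character n-grams of length 2 to 5. The pure-Python sketch below enumerates both for a single input to make those ranges concrete. It approximates scikit-learn's analyzers rather than reproducing them (the library also applies its own token pattern and whitespace normalization), and the function names are hypothetical:

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return space-joined word n-grams for every n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A sequence of k tokens contains k - n + 1 contiguous n-grams.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def char_ngrams(text, min_n=2, max_n=5):
    """Return character n-grams (spaces included) for every n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams
```

Because every n-gram size adds its own set of columns, the combined feature matrix grows quickly with vocabulary size, which is worth keeping in mind before converting it to a dense array.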