
Can't get the IDs #23

Open
shenyizy opened this issue Feb 4, 2020 · 26 comments

Comments

@shenyizy

shenyizy commented Feb 4, 2020

When I run scrape.py, the final JSON file it creates is blank, without any IDs in it. It was working last week. Does anyone know how to solve it?

@jaackland

I also have this problem; it used to work and now doesn't. My suspicion is that the problem is with the CSS selector. Maybe Twitter recently changed the markup the tweet IDs are scraped from? I also don't really know what I'm talking about, because I'm pretty new to Python. If you figure it out, please let me know!

@jaackland

jaackland commented Feb 11, 2020

@shenyizy I think I've fixed it, but I'm not entirely confident it's free of logic errors. It's a bit messy, but the trick is to use a new, less precise CSS selector. I've noticed three problems so far, but I've been able to work around them:

  1. The new selector also selects the hyperlinks on the names of users being replied to, so to work around that I remove all the list items that aren't fully numeric. But if your user was replying to someone with a fully numeric handle, that data point would slip through. There might be a better way to fix this.
  2. It also tends to duplicate a lot of tweet IDs, but this really doesn't matter because duplicates are removed at the end of the script.
  3. The JSON file doesn't get wiped at any point, so if you run for two users in a row, the second user will inherit all of the first user's tweets. My solution is to manually delete all_ids.json between runs, which is clunky but functional.
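That numeric filter from point 1 can be sketched in isolation (the sample IDs and handle below are made up):

```python
# made-up sample: a mix of numeric tweet IDs and a non-numeric handle
ids = ['1224567890123', 'someuser', '9876543210']

# keep only fully numeric entries; tweet IDs are numeric, handles usually aren't
finalids = [tweet_id for tweet_id in ids if tweet_id.isdigit()]

print(finalids)  # ['1224567890123', '9876543210']
```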

New selector:

twitter_ids_filename = 'all_ids.json'
days = (end - start).days + 1
tweet_selector = 'article > div > div > div > div > div > div > a'
user = user.lower()
ids = []

New loop:

for day in range(days):
    d1 = format_day(increment_day(start, 0))
    d2 = format_day(increment_day(start, 1))
    url = form_url(d1, d2)
    print(url)
    print(d1)
    driver.get(url)
    sleep(delay)
    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        all_tweets = found_tweets[:]
        increment = 0
        
        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            all_tweets += found_tweets[:]
        
            print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
            increment += 10   
        
        for tweet in all_tweets:
            try:
                id = tweet.get_attribute('href').split('/')[-1]
                ids.append(id)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)
        
        print(ids)

    except NoSuchElementException:
        print('no tweets on this day')
    start = increment_day(start, 1)

finalids = [tweetid for tweetid in ids if tweetid.isdigit()]

New writetofile:

try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))
except FileNotFoundError:
    with open(twitter_ids_filename, 'w') as f:
        all_ids = finalids[-]
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

Hope that works for you too!
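A minimal, self-contained sketch of that write-to-file merge (a temp path and made-up IDs stand in for real data, and `finalids[:]` is assumed to be the intended copy on the first run):

```python
import json
import os
import tempfile

# placeholder path standing in for all_ids.json
twitter_ids_filename = os.path.join(tempfile.mkdtemp(), 'all_ids.json')
finalids = ['111', '222', '222']  # made-up IDs from "this scrape"

try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)   # merge with previous runs
except FileNotFoundError:
    all_ids = finalids[:]                   # first run: nothing on disk to merge

data_to_write = sorted(set(all_ids))        # dedupe (sorted for stable output)

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

print(data_to_write)  # ['111', '222']
```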

@ghost

ghost commented Feb 16, 2020

@jaackland Thanks for sharing a solution. However, when I run this code, it doesn't get out of the "while len(found_tweets) >= increment:" loop. The problem seems to come from the all_tweets variable: nothing is added to it, so the loop never exits. Any alternative solution?

@jaackland

@Ahsancode Sorry, I should have made clearer that that isn't a full script. Are you substituting it into the original scrape.py?

@ghost

ghost commented Feb 16, 2020

@jaackland No worries, I'm aware that this isn't the full script. The original version was also working for me until now, and I made the same adjustments as you did. It's the tweet_selector that's giving me problems at the moment.

@ghost

ghost commented Feb 16, 2020

@jaackland My mistake, I had an indentation problem. The code is running fine now. There are still a couple of issues:
1. I tried to retrieve all tweets since 1 Jan 2020 and noticed that after 20 days the code stops retrieving any new IDs, so I had to rerun it multiple times.
2. Once all IDs are retrieved, a significant number of tweets is still unaccounted for.

@jaackland

@Ahsancode Yes, unfortunately I think Twitter has managed to rate-limit Selenium now (the original post implies this wasn't always the case). If you increase the delay variable it will scrape more tweets (but take longer, obviously). I went up to 5 and got all the tweets I needed, but you might be able to get away with less.

Glad it was just an indent problem because as far as I can tell that tweet_selector is universal (if a bit sloppy).

@shenyizy

shenyizy commented Feb 18, 2020

@jaackland Thanks so much for sharing the code. However, when I substituted your code into the original, I got a syntax error in the New writetofile part, as shown below.

all_ids = finalids[-]
                    ^

SyntaxError: invalid syntax

I'm also pretty new to Python, so sorry if this is a stupid question.

@rougetimelord

I found a selector which seems to select only the posted-time link, which links to the full tweet's page: tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'

@abotmaker

I found a selector which seems to only select the time posted link, which links to the full tweet's page. tweet_selector = article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a

Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks

@lebanj12

Twitter changed its CSS styling, so in the current code you need to change id_selector and tweet_selector to:

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'

@rougetimelord

rougetimelord commented Mar 31, 2020

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'

I would combine the article part with the rest of the selector, for code neatness. Your selector seems to grab a few more tweets for whatever reason.

Hi, could you please tell me how did you get this CSS selector for tweet_selector? Thanks

I used Chrome DevTools to generate a selector and stripped out class names, etc.

@jakobberndt

I've tried the suggested changes to id_selector and tweet_selector; however, I'm not getting the IDs with them.

I've changed the line collecting the ID (line 65) to this:
id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-3]

This gives me some IDs, but not even close to the number of tweets I'm finding. Any suggestions on what the problem might be?
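For reference, the index arithmetic on a tweet permalink works like this (the URL below is made up):

```python
# a tweet permalink has the form https://twitter.com/<user>/status/<id>
href = "https://twitter.com/someuser/status/1234567890"
parts = href.split('/')

tweet_id = parts[-1]  # last segment: the numeric tweet ID
user = parts[-3]      # third from the end: the handle

print(tweet_id, user)  # 1234567890 someuser
```

So [-3] picks up the handle, not the ID; [-1] is the piece the script needs.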

@namuan

namuan commented May 10, 2020

I've started writing some scripts for a project I'm working on.
https://github.com/namuan/twitter-utils

Currently tweets_between.py generates a text file, but I'll see if I can generate JSON so that the get_metadata.py script can be used without any changes.

@rougetimelord

rougetimelord commented May 18, 2020

I'm back with a new selector! article > div > div > div > div > div > div > div.r-1d09ksm > a. The class name on the last div may change depending on platform; I've only tested it on Chrome 81. Removing it will collect some extra links to profiles, but it should be easy to filter those out.

I have also run into a new Twitter search page which will not work with this selector. Simple fix: just restart the script; I think they're A/B testing it.
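Filtering out those extra profile links could look like this (the sample links are made up):

```python
# made-up links of the two kinds the looser selector can return
links = [
    "https://twitter.com/someuser",                    # profile link (unwanted)
    "https://twitter.com/someuser/status/123456789",   # tweet permalink (wanted)
]

# keep only permalinks: profile links have no /status/ segment
tweet_links = [link for link in links if "/status/" in link]

print(tweet_links)
```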

@abiaus

abiaus commented Jun 22, 2020

@rougetimelord that selector worked... kind of. It found the tweets, but I don't know why it didn't save them.
[screenshot]

@abiaus

abiaus commented Jun 22, 2020

I was able to get a list of IDs... the thing is that now I can't transform it into a JSON file. I always get the error "unhashable type: 'list'".

I tried transforming it into a dict or a tuple, but that didn't work.

My try block looks like this now:

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment = 10

        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            increment += 10

        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

        for tweet in found_tweets:
            try:
                id_user = tweet.find_element_by_css_selector(id_selector).get_attribute('href')
                id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
                ids.append(id)
                print(id_user)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)

These are my selectors:

id_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div.css-1dbjc4n.r-1iusvr4.r-16y2uox.r-1777fci.r-1mi0q7o ' \
                 '> div:nth-child(1) > div > div > div.css-1dbjc4n.r-1d09ksm.r-18u37iz.r-1wbh5a2 > a'

tweet_selector = 'div > div > div > main > div > div > div > div > div > div > div > div > section > div > div > div > div > div > div > div > div > article'
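The "unhashable type: 'list'" error comes from the list comprehension above: it iterates over the characters of id_user and so appends a list (one entry per character) to ids instead of a plain string, and set() can't hash lists. A minimal reproduction, with made-up inputs:

```python
# made-up inputs mirroring the variables in the loop above
user = "someuser"
id_user = "https://twitter.com/someuser/status/123456789"
ids = []

# the comprehension iterates over the *characters* of id_user, so it
# builds a list (one entry per character), not a single ID string
bad_id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
ids.append(bad_id)  # ids is now a list of lists

try:
    set(ids)        # set() needs hashable elements; lists are not hashable
    msg = None
except TypeError as e:
    msg = str(e)    # "unhashable type: 'list'"

# appending the plain string instead avoids the error entirely
good_id = id_user.split('/')[-1]
```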

@abiaus

abiaus commented Jun 23, 2020

I finally got it to work! But I hit Twitter's rate limit every time. I started at 2 seconds of sleep and did about 4 months, then changed it to 4 and did a couple more months... it's going to be a looooong journey.

I will try to upload my version.

@camilogiga

> (quoting @abiaus's try block and selectors from the comment above)

Hi! Were you able to fix this error? I tried what you recommended and it also shows the 'unhashable type: list' error.

I'd appreciate any help you could give me with this. I'm trying to retrieve tweets of specific accounts from November and December 2019.

@abiaus

abiaus commented Jun 24, 2020

I submitted a Pull Request with my working version

@aaronamerica

Hi, I used the updated .py file and it still doesn't work.
Is there anything else I have to change?

@Joseph-D-Bradshaw

Joseph-D-Bradshaw commented Mar 1, 2021

@rougetimelord Just wanted to let you know that your CSS selector can be condensed into something much smaller:
article div.r-1d09ksm > a

@Joseph-D-Bradshaw

I have fixed this in my own version. Since PRs are not being sorted out here, I'm not sure if I should open one to fix it.
If others are interested in me opening a PR, just let me know. There are quite a few improvements, from making sure the CSV writer uses UTF-8 encoding to using Selenium's ability to execute JavaScript to fetch the ID information directly from a tweet.

@orhanozbek

I submitted a Pull Request with my working version

bro, I tried your version and I'm getting results like this:

0 tweets found, 0 total
https://twitter.com/search?f=tweets&vertical=default&q=from%3Aelonmusk%20since%3A2014-04-07%20until%3A2014-04-08include%3Aretweets&src=typd
2014-04-07

@wendeljuliao

Why don't I get all tweets from a user? For example, for @lulaoficial I only scrape 8k out of 22k.

@nandezgarcia

nandezgarcia commented Jul 20, 2022

The Twitter webpage has been modified in the years since this code was written. The modifications needed are as follows: id_selector and tweet_selector are not required now; you only need to change the try-except code around the for loop:

try:
    found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
    for i in found_tweets:
        print("Founded: ", i.get_attribute('href'))


    increment = 10

    while len(found_tweets) >= increment:
        print('scrolling down to load more tweets')
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(delay)
        found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
        increment += 10

    print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

    for tweet in found_tweets:
        try:
            id = tweet.get_attribute('href').split('/')[-1]
            ids.append(id)
        except StaleElementReferenceException as e:
            print('lost element reference', tweet)

except NoSuchElementException:
    print('no tweets on this day')

start = increment_day(start, 1)
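The href-based approach above can be sketched end to end without a browser (the handle and permalink below are made up):

```python
user = "elonmusk"  # made-up example handle

# the selector built above: any anchor whose href contains /<user>/status/
selector = 'a[href*="/' + user + '/status/"]'

# a matching href would look like this made-up permalink,
# and the ID is its last path segment
href = "https://twitter.com/elonmusk/status/1234567890"
tweet_id = href.split('/')[-1]

print(selector, tweet_id)
```

Matching on the href itself sidesteps the fragile div-chain selectors entirely, which is why it survived the markup changes.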
