
Can't get the IDs #23

Open
shenyizy opened this issue Feb 4, 2020 · 26 comments

Comments

@shenyizy

shenyizy commented Feb 4, 2020

When I run scrape.py, the final JSON file it creates is blank, without any IDs in it. It was working last week. Does anyone know how to solve it?

@jaackland

I also have this problem; it used to work and now doesn't. My suspicion is that the problem is with the CSS selector. Maybe Twitter recently changed the markup the tweet IDs are scraped from? I also don't really know what I'm talking about, because I'm pretty new to Python. If you figure it out, please let me know!

@jaackland

jaackland commented Feb 11, 2020

@shenyizy I think I've fixed it, but I'm not entirely confident it's free of logic errors. It's a bit messy, but the trick is to use a new, less precise CSS selector. I've noticed three problems so far, but I've been able to work around them:

  1. The new selector also selects the hyperlinks on the names of users being replied to, so to work around that I remove all the list items that aren't fully numeric. But if your user was replying to someone with a fully numeric handle, that data point would slip through. There might be a better way to fix this.
  2. It also tends to duplicate a lot of tweet IDs, but this really doesn't matter because duplicates are removed at the end of the script.
  3. The JSON file doesn't get wiped at any point, so if you run for two users in a row, the second user will inherit all of the first user's tweets. My solution is to manually delete all_ids.json between runs, which is clunky but functional.
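That numeric filter from point 1 can be sketched in isolation (the sample IDs and handle below are made up):

```python
# made-up sample: a mix of numeric tweet IDs and a non-numeric handle
ids = ['1224567890123', 'someuser', '9876543210']

# keep only fully numeric entries; tweet IDs are numeric, handles usually aren't
finalids = [tweet_id for tweet_id in ids if tweet_id.isdigit()]

print(finalids)  # ['1224567890123', '9876543210']
```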

New selector:

twitter_ids_filename = 'all_ids.json'
days = (end - start).days + 1
tweet_selector = 'article > div > div > div > div > div > div > a'
user = user.lower()
ids = []

New loop:

for day in range(days):
    d1 = format_day(increment_day(start, 0))
    d2 = format_day(increment_day(start, 1))
    url = form_url(d1, d2)
    print(url)
    print(d1)
    driver.get(url)
    sleep(delay)
    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        all_tweets = found_tweets[:]
        increment = 0
        
        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            all_tweets += found_tweets[:]
        
            print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
            increment += 10   
        
        for tweet in all_tweets:
            try:
                id = tweet.get_attribute('href').split('/')[-1]
                ids.append(id)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)
        
        print(ids)

    except NoSuchElementException:
        print('no tweets on this day')
    start = increment_day(start, 1)

finalids = [tweetid for tweetid in ids if tweetid.isdigit()]

New writetofile:

try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))
except FileNotFoundError:
    with open(twitter_ids_filename, 'w') as f:
        all_ids = finalids[-]
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

Hope that works for you too!
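A minimal, self-contained sketch of that write-to-file merge (a temp path and made-up IDs stand in for real data, and `finalids[:]` is assumed to be the intended copy on the first run):

```python
import json
import os
import tempfile

# placeholder path standing in for all_ids.json
twitter_ids_filename = os.path.join(tempfile.mkdtemp(), 'all_ids.json')
finalids = ['111', '222', '222']  # made-up IDs from "this scrape"

try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)   # merge with previous runs
except FileNotFoundError:
    all_ids = finalids[:]                   # first run: nothing on disk to merge

data_to_write = sorted(set(all_ids))        # dedupe (sorted for stable output)

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

print(data_to_write)  # ['111', '222']
```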

@ghost

ghost commented Feb 16, 2020

@jaackland Thanks for sharing a solution. However, when I run this code, it doesn't get out of the "while len(found_tweets) >= increment:" loop. The problem seems to come from the all_tweets variable: nothing is added to it, so the loop never exits. Any alternative solution?

@jaackland

@Ahsancode Sorry, I should have made clearer that that isn't a full script. Are you substituting it into the original scrape.py?

@ghost

ghost commented Feb 16, 2020

@jaackland No worries, I'm aware that this isn't the full script. The original version was also working for me until now, and I made the same adjustments as you did. It's the tweet_selector that's giving me problems at the moment.

@ghost

ghost commented Feb 16, 2020

@jaackland My mistake, I had an indentation problem. The code is running fine now. There are still a couple of issues:
1. I tried to retrieve all tweets since 1 Jan 2020 and noticed that after 20 days the code stops retrieving any new IDs, so I had to rerun it multiple times.
2. Once all IDs are retrieved, a significant number of tweets is still unaccounted for.

@jaackland

@Ahsancode Yes, unfortunately I think Twitter has managed to rate-limit Selenium now (the original post implies this wasn't always the case). If you increase the delay variable it will scrape more tweets (but take longer, obviously). I went up to 5 and got all the tweets I needed, but you might be able to get away with less.

Glad it was just an indent problem because as far as I can tell that tweet_selector is universal (if a bit sloppy).

@shenyizy

shenyizy commented Feb 18, 2020

@jaackland Thanks so much for sharing the code. However, when I substituted your code into the original, I got a syntax error in the New writetofile part, as shown below.

all_ids = finalids[-]
                    ^

SyntaxError: invalid syntax

I'm also pretty new to Python, so sorry if this is a stupid question.

@rougetimelord

I found a selector which seems to select only the posted-time link, which links to the full tweet's page: tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'

@abotmaker

I found a selector which seems to only select the time posted link, which links to the full tweet's page. tweet_selector = article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a

Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks

@lebanj12

Twitter changed its CSS styling, so in the current code you need to change id_selector and tweet_selector to:

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'

@rougetimelord

rougetimelord commented Mar 31, 2020

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'

I would combine the article part with the rest of the selector, for code neatness. Your selector seems to grab a few more tweets for whatever reason.

Hi, could you please tell me how did you get this CSS selector for tweet_selector? Thanks

I used Chrome DevTools to generate a selector and stripped out class names, etc.

@jakobberndt

I've tried the suggested changes to id_selector and tweet_selector; however, I'm not getting the IDs with them.

I've changed the line collecting the ID (line 65) to this:
id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-3]

This gives me some IDs, but not even close to the number of tweets I'm finding. Any suggestions on what the problem might be?
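For reference, the index arithmetic on a tweet permalink works like this (the URL below is made up):

```python
# a tweet permalink has the form https://twitter.com/<user>/status/<id>
href = "https://twitter.com/someuser/status/1234567890"
parts = href.split('/')

tweet_id = parts[-1]  # last segment: the numeric tweet ID
user = parts[-3]      # third from the end: the handle

print(tweet_id, user)  # 1234567890 someuser
```

So [-3] picks up the handle, not the ID; [-1] is the piece the script needs.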

@namuan

namuan commented May 10, 2020

I've started writing some scripts for a project I'm working on.
https://github.com/namuan/twitter-utils

Currently tweets_between.py generates a text file, but I'll see if I can generate JSON so that the get_metadata.py script can be used without any changes.

@rougetimelord

rougetimelord commented May 18, 2020

I'm back with a new selector! article > div > div > div > div > div > div > div.r-1d09ksm > a. The class name on the last div may change depending on platform; I've only tested it on Chrome 81. Removing it will collect some extra links to profiles, but it should be easy to filter those out.

I have also run into a new Twitter search page which will not work with this selector. Simple fix: just restart the script; I think they're A/B testing it.
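Filtering out those extra profile links could look like this (the sample links are made up):

```python
# made-up links of the two kinds the looser selector can return
links = [
    "https://twitter.com/someuser",                    # profile link (unwanted)
    "https://twitter.com/someuser/status/123456789",   # tweet permalink (wanted)
]

# keep only permalinks: profile links have no /status/ segment
tweet_links = [link for link in links if "/status/" in link]

print(tweet_links)
```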

@abiaus

abiaus commented Jun 22, 2020

@rougetimelord that selector worked... kind of. It found the tweets, but I don't know why it didn't save them.
[screenshot]

@abiaus

abiaus commented Jun 22, 2020

I was able to get a list of IDs... the thing is that now I can't transform it into a JSON file. I always get the error "unhashable type: 'list'".

I tried transforming it into a dict or a tuple, but that didn't work.

My try block looks like this now:

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment = 10

        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            increment += 10

        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

        for tweet in found_tweets:
            try:
                id_user = tweet.find_element_by_css_selector(id_selector).get_attribute('href')
                id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
                ids.append(id)
                print(id_user)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)

These are my selectors:

id_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div.css-1dbjc4n.r-1iusvr4.r-16y2uox.r-1777fci.r-1mi0q7o ' \
                 '> div:nth-child(1) > div > div > div.css-1dbjc4n.r-1d09ksm.r-18u37iz.r-1wbh5a2 > a'

tweet_selector = 'div > div > div > main > div > div > div > div > div > div > div > div > section > div > div > div > div > div > div > div > div > article'
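The "unhashable type: 'list'" error comes from the list comprehension above: it iterates over the characters of id_user and so appends a list (one entry per character) to ids instead of a plain string, and set() can't hash lists. A minimal reproduction, with made-up inputs:

```python
# made-up inputs mirroring the variables in the loop above
user = "someuser"
id_user = "https://twitter.com/someuser/status/123456789"
ids = []

# the comprehension iterates over the *characters* of id_user, so it
# builds a list (one entry per character), not a single ID string
bad_id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
ids.append(bad_id)  # ids is now a list of lists

try:
    set(ids)        # set() needs hashable elements; lists are not hashable
    msg = None
except TypeError as e:
    msg = str(e)    # "unhashable type: 'list'"

# appending the plain string instead avoids the error entirely
good_id = id_user.split('/')[-1]
```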

@abiaus

abiaus commented Jun 23, 2020

I finally got it to work! But I hit Twitter's rate limit every time. I started at 2 seconds of sleep and did about 4 months, then changed it to 4 and did a couple more months... it's going to be a looooong journey.

I will try to upload my version.

@camilogiga

> (quoting @abiaus's try block and selectors from the comment above)

Hi! Were you able to fix this error? I tried what you recommended and it also shows the 'unhashable type: list' error.

I'd appreciate any help you could give me with this. I'm trying to retrieve tweets of specific accounts from November and December 2019.

@abiaus

abiaus commented Jun 24, 2020

I submitted a Pull Request with my working version

@aaronamerica

Hi, I used the updated .py file and it still doesn't work.
Is there anything else I have to change?

@Joseph-D-Bradshaw

Joseph-D-Bradshaw commented Mar 1, 2021

@rougetimelord Just wanted to let you know that your CSS selector can be condensed into something much smaller:
article div.r-1d09ksm > a

@Joseph-D-Bradshaw

I have fixed this in my own version. Since PRs are not being sorted out here, I'm not sure if I should open one to fix it.
If others are interested in me opening a PR, just let me know. There are quite a few improvements, from making sure the CSV writer uses UTF-8 encoding to using Selenium's ability to execute JavaScript to fetch the ID information directly from a tweet.

@orhanozbek

I submitted a Pull Request with my working version

bro, I tried your version and I'm getting results like this:

0 tweets found, 0 total
https://twitter.com/search?f=tweets&vertical=default&q=from%3Aelonmusk%20since%3A2014-04-07%20until%3A2014-04-08include%3Aretweets&src=typd
2014-04-07

@wendeljuliao

Why don't I get all tweets from a user? For example, for @lulaoficial I only scrape 8k out of 22k.

@nandezgarcia

nandezgarcia commented Jul 20, 2022

The Twitter webpage has been modified in the years since this code was written. The modifications needed are as follows: id_selector and tweet_selector are not required now; you only need to change the try-except code around the for loop:

try:
    found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
    for i in found_tweets:
        print("Founded: ", i.get_attribute('href'))


    increment = 10

    while len(found_tweets) >= increment:
        print('scrolling down to load more tweets')
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(delay)
        found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
        increment += 10

    print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

    for tweet in found_tweets:
        try:
            id = tweet.get_attribute('href').split('/')[-1]
            ids.append(id)
        except StaleElementReferenceException as e:
            print('lost element reference', tweet)

except NoSuchElementException:
    print('no tweets on this day')

start = increment_day(start, 1)
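The href-based approach above can be sketched end to end without a browser (the handle and permalink below are made up):

```python
user = "elonmusk"  # made-up example handle

# the selector built above: any anchor whose href contains /<user>/status/
selector = 'a[href*="/' + user + '/status/"]'

# a matching href would look like this made-up permalink,
# and the ID is its last path segment
href = "https://twitter.com/elonmusk/status/1234567890"
tweet_id = href.split('/')[-1]

print(selector, tweet_id)
```

Matching on the href itself sidesteps the fragile div-chain selectors entirely, which is why it survived the markup changes.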
