diff --git a/datasources/twitterv2/DESCRIPTION.md b/datasources/twitterv2/DESCRIPTION.md index 57f1f7a5..d138e675 100644 --- a/datasources/twitterv2/DESCRIPTION.md +++ b/datasources/twitterv2/DESCRIPTION.md @@ -1,93 +1,88 @@ -Twitter data is gathered through the official [Twitter v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT -allows access to both the Standard and the Academic track. The Standard track is free for anyone to use, but only -allows to retrieve tweets up to seven days old. The Academic track allows a full-archive search of up to ten million -tweets per month (as of March 2022). For the Academic track, you need a valid Bearer token. You can request one -[here](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you). +X/Twitter data is gathered through the official [X v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT can interface with X's Research API (sometimes +branded as the 'DSA API', referencing the EU's Digital Services Act). To retrieve posts via this API with 4CAT, you need +a valid Bearer token. Read more about this mode of access [here](https://developer.x.com/en/use-cases/do-research/academic-research). -Tweets are captured in batches at a speed of approximately 100,000 tweets per hour. 4CAT will warn you if your dataset +Posts are captured in batches at a speed of approximately 100,000 posts per hour. 4CAT will warn you if your dataset is expected to take more than 30 minutes to collect. It is often a good idea to start small (with very specific queries or narrow date ranges) and then only create a larger dataset if you are confident that it will be manageable and useful for your analysis. -If you hit your Twitter API quota while creating a dataset, the dataset will be finished with the tweets that have been +If you hit your X API quota while creating a dataset, the dataset will be finished with the posts that have been collected so far and a warning will be logged. ### Query syntax -Check the [API documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) +Check the [API documentation](https://developer.x.com/en/docs/x-api/tweets/search/integrate/build-a-query) for available query syntax and operators. This information is crucial to what data you collect. Important operators for -instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted tweets and retweets. Query syntax -is roughly the same as for Twitter's search interface, so you can try out most queries by entering them in the Twitter -app or website's search field and looking at the results. You can also test queries with -Twitter's [Query Builder](https://developer.twitter.com/apitools/query?query=). +instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted posts and reposts. Query syntax +is roughly the same as for X's search interface, so you can try out most queries by entering them in the X app or +website's search field and looking at the results. You can also test queries with +X's [Query Builder](https://developer.twitter.com/apitools/query?query=). ### Date ranges -By default, Twitter returns tweets posted within the past 30 days. If you want to go back further, you need to -explicitly set a date range. Note that Twitter does not like date ranges that end in the future, or start before -Twitter existed. If you want to capture tweets "until now", it is often best to use yesterday as an end date. 
+By default, X returns posts from the past 30 days. If you want to go back further, you need to
+explicitly set a date range. Note that X does not like date ranges that end in the future, or start before
+Twitter existed. If you want to capture posts "until now", it is often best to use yesterday as an end date. Also note
+that API access may come with certain limitations on how far a query may extend into history.
 
 ### Geo parameters
 
-Twitter offers a number of ways
-to [query by location/geo data](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location)
-such as `has:geo`, `place:Amsterdam`, or `place:Amsterdam`. This feature is only available for the Academic level;
-you will receive a 400 error if using queries filtering by geographic information.
+X offers a number of ways
+to [query by location/geo data](https://developer.x.com/en/docs/tutorials/filtering-tweets-by-location)
+such as `has:geo` or `place:Amsterdam`.
 
 ### Retweets
 
-A retweet from Twitter API v2 contains at maximum 140 characters from the original tweet. 4CAT therefore
-gathers both the retweet and the original tweet and reformats the retweet text so it resembles a user's experience.
+A repost from X API v2 contains at most the first 140 characters of the original post. 4CAT therefore
+gathers both the repost and the original post and reformats the repost text so it resembles a user's experience.
 This also affects mentions, hashtags, and other data as only those contained in the first 140 characters are provided
-by Twitter API v2 with the retweet. Additional hashtags, mentions, etc. are taken from the original tweet and added
-to the retweet for 4CAT analysis methods. *4CAT stores the data from Twitter API v2 as similar as possible to the format
+by X API v2 with the repost. Additional hashtags, mentions, etc. are taken from the original post and added
+to the repost for 4CAT analysis methods. *4CAT stores the data from X API v2 as closely as possible to the format
 in which it was received which you can obtain by downloading the ndjson file.*
 
 *Example 1*
 
-[This retweet](https://twitter.com/tonino1630/status/1554618034299568128) returns the following data:
+[This repost](https://x.com/tonino1630/status/1554618034299568128) returns the following data:
 
 - *author:* `tonino1630`
-- *
-  text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
+- *text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
 - *mentions:* `ChuckyFrao`
 - *hashtags:*
-While the original tweet will return (as a reference tweet) this data: +While the original post will return (as a reference post) this data: - *author:* `ChuckyFrao` -- * - text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0` +- *text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0` - *mentions:* `POTUS, usembassyve, StateSPEHA, StateDept, SecBlinken` - *hashtags:* `FreeAlexSaab, BringAlexHome, IntegridadTerritorial`
-As you can see, only the author of the original tweet is listed as a mention in the retweet. +As you can see, only the author of the original post is listed as a mention in the repost. *Example 2* -[This retweet](https://twitter.com/Macsmart31/status/1554618041459445760) returns the following: +[This repost](https://x.com/Macsmart31/status/1554618041459445760) returns the following: - *author:* `Macsmart31` -- * - text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…` +- *text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…` - *mentions:* `mickyd123us, tribelaw, HonorDecency`
-Compared with the original tweet referenced below: +Compared with the original post referenced below: - *author:* `mickyd123us` -- * - text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr` +- *text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr` - *mentions:* `tribelaw, HonorDecency`
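+
+In the underlying ndjson, the link between a repost and its original is the `referenced_tweets` field. For the second
+example, the relevant fragment of the repost object looks roughly like this (abridged and indicative only; field names
+as in the v2 API reference, with the referenced post ID elided):
+
+```json
+{
+  "text": "RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to \"Take A Ride\" with the de…",
+  "referenced_tweets": [{"type": "retweeted", "id": "..."}],
+  "entities": {"mentions": [{"username": "mickyd123us"}, {"username": "tribelaw"}, {"username": "HonorDecency"}]}
+}
+```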
-Because the mentioned users are in the first 140 characters of the original tweet, they are also listed as mentions in the retweet.
-
-The key difference here is that example one the retweet contains none of the hashtags or mentions from the original
-tweet (they are beyond the first 140 characters) while the second retweet example does return mentions from the original
-tweet. *Due to this discrepancy, for retweets all mentions and hashtags of the original tweet are considered as mentions
-and hashtags of the retweet.* A user on Twitter will see all mentions and hashtags when viewing a retweet and the
-retweet would be a part of any network around those mentions and hashtags.
+Because the mentioned users are in the first 140 characters of the original post, they are also listed as mentions in
+the repost.
+
+The key difference here is that in the first example the repost contains none of the hashtags or mentions from the
+original post (they are beyond the first 140 characters), while the second repost example does return mentions from the
+original post. *Due to this discrepancy, for reposts all mentions and hashtags of the original post are considered as
+mentions and hashtags of the repost.* The repost in the first example is thus credited with the mentions `ChuckyFrao,
+POTUS, usembassyve, StateSPEHA, StateDept, SecBlinken` and the hashtags `FreeAlexSaab, BringAlexHome,
+IntegridadTerritorial`. A user on X will see all mentions and hashtags when viewing a repost, and the repost would be
+a part of any network around those mentions and hashtags.
diff --git a/datasources/twitterv2/__init__.py b/datasources/twitterv2/__init__.py
index 3335bc7c..6aa80c7b 100644
--- a/datasources/twitterv2/__init__.py
+++ b/datasources/twitterv2/__init__.py
@@ -9,4 +9,4 @@
 # Internal identifier for this data source
 DATASOURCE = "twitterv2"
 
-NAME = "Twitter API (v2) Search"
\ No newline at end of file
+NAME = "X/Twitter API (v2) Search"
\ No newline at end of file
diff --git a/datasources/twitterv2/search_twitter.py b/datasources/twitterv2/search_twitter.py
index 999680b6..8b91d1eb 100644
--- a/datasources/twitterv2/search_twitter.py
+++ b/datasources/twitterv2/search_twitter.py
@@ -1,5 +1,5 @@
 """
-Twitter keyword search via the Twitter API v2
+X/Twitter keyword search via the X API v2
 """
 import requests
 import datetime
@@ -17,13 +17,10 @@
 class SearchWithTwitterAPIv2(Search):
     """
-    Get Tweets via the Twitter API
-
-    This only allows for historical search - use f.ex. TCAT for more advanced
-    queries.
+    Get posts via the X API
     """
     type = "twitterv2-search"  # job ID
-    title = "Twitter API (v2)"
+    title = "X/Twitter API (v2)"
     extension = "ndjson"
     is_local = False  # Whether this datasource is locally scraped
     is_static = False  # Whether this datasource is still updated
@@ -32,15 +29,15 @@
     import_issues = True
     references = [
-        "[Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api)"
+        "[X/Twitter API documentation](https://developer.x.com/en/docs/x-api)"
     ]
 
     config = {
         "twitterv2-search.academic_api_key": {
             "type": UserInput.OPTION_TEXT,
             "default": "",
-            "help": "Academic API Key",
-            "tooltip": "An API key for the Twitter v2 Academic API. If "
+            "help": "Research API Key",
+            "tooltip": "An API key for the X/Twitter v2 Research API. If "
                        "provided, the user will not need to enter their own "
                        "key to retrieve tweets. Note that this API key should "
                        "have access to the Full Archive Search endpoint."
         },
         "twitterv2-search.max_tweets": {
             "type": UserInput.OPTION_TEXT,
             "default": 0,
             "min": 0,
             "max": 10_000_000,
-            "help": "Max tweets per dataset",
+            "help": "Max posts per dataset",
             "tooltip": "4CAT will never retrieve more than this amount of "
-                       "tweets per dataset. Enter '0' for unlimited tweets."
+                       "posts per dataset. Enter '0' for unlimited posts."
         },
         "twitterv2-search.id_lookup": {
             "type": UserInput.OPTION_TOGGLE,
             "default": False,
             "help": "Allow lookup by ID",
-            "tooltip": "If enabled, allow users to enter a list of tweet IDs "
+            "tooltip": "If enabled, allow users to enter a list of post IDs "
                        "to retrieve. This is disabled by default because it "
                        "can be confusing to novice users."
         }
     }
@@ -110,7 +107,7 @@ def get_items(self, query):
         }
 
         if self.parameters.get("query_type", "query") == "id_lookup" and self.config.get("twitterv2-search.id_lookup"):
-            endpoint = "https://api.twitter.com/2/tweets"
+            endpoint = "https://api.x.com/2/tweets"
 
             tweet_ids = self.parameters.get("query", []).split(',')
 
@@ -126,7 +123,7 @@
         else:
             # Query to all or search
-            endpoint = "https://api.twitter.com/2/tweets/search/" + api_type
+            endpoint = "https://api.x.com/2/tweets/search/" + api_type
 
             queries = [self.parameters.get("query", "")]
 
@@ -158,7 +155,7 @@
         while True:
             if self.interrupted:
-                raise ProcessorInterruptedException("Interrupted while getting tweets from the Twitter API")
+                raise ProcessorInterruptedException("Interrupted while getting posts from the X API")
 
             # there is a limit of one request per second, so stay on the safe side of this
             while self.previous_request == int(time.time()):
@@ -188,18 +185,18 @@
                 try:
                     structured_response = api_response.json()
                     if structured_response.get("title") == "UsageCapExceeded":
-                        self.dataset.update_status("Hit the monthly tweet cap. You cannot capture more tweets "
-                                                   "until your API quota resets. Dataset completed with tweets "
+                        self.dataset.update_status("Hit the monthly post cap. You cannot capture more posts "
+                                                   "until your API quota resets. Dataset completed with posts "
                                                    "collected so far.", is_final=True)
                         return
                 except (json.JSONDecodeError, ValueError):
-                    self.dataset.update_status("Hit Twitter rate limit, but could not figure out why. Halting "
-                                               "tweet collection.", is_final=True)
+                    self.dataset.update_status("Hit X's rate limit, but could not figure out why. Halting "
+                                               "post collection.", is_final=True)
                     return
 
                 resume_at = convert_to_int(api_response.headers["x-rate-limit-reset"]) + 1
                 resume_at_str = datetime.datetime.fromtimestamp(int(resume_at)).strftime("%c")
-                self.dataset.update_status("Hit Twitter rate limit - waiting until %s to continue." % resume_at_str)
+                self.dataset.update_status("Hit X's rate limit - waiting until %s to continue." % resume_at_str)
                 while time.time() <= resume_at:
                     if self.interrupted:
                         raise ProcessorInterruptedException("Interrupted while waiting for rate limit to reset")
@@ -211,10 +208,10 @@
             elif api_response.status_code == 403:
                 try:
                     structured_response = api_response.json()
-                    self.dataset.update_status("'Forbidden' error from the Twitter API. Could not connect to Twitter API "
+                    self.dataset.update_status("'Forbidden' error from the X API. Could not connect to the X API "
                                                "with this API key. %s" % structured_response.get("detail", ""), is_final=True)
                 except (json.JSONDecodeError, ValueError):
-                    self.dataset.update_status("'Forbidden' error from the Twitter API. Your key may not have access to "
+                    self.dataset.update_status("'Forbidden' error from the X API. 
Your key may not have access to " "the full-archive search endpoint.", is_final=True) finally: return @@ -224,7 +221,7 @@ def get_items(self, query): elif api_response.status_code in (502, 503, 504): resume_at = time.time() + 60 resume_at_str = datetime.datetime.fromtimestamp(int(resume_at)).strftime("%c") - self.dataset.update_status("Twitter unavailable (status %i) - waiting until %s to continue." % ( + self.dataset.update_status("X unavailable (status %i) - waiting until %s to continue." % ( api_response.status_code, resume_at_str)) while time.time() <= resume_at: time.sleep(0.5) @@ -233,7 +230,7 @@ def get_items(self, query): # this usually means the query is too long or otherwise contains # a syntax error elif api_response.status_code == 400: - msg = "Response %i from the Twitter API; " % api_response.status_code + msg = "Response %i from the X API; " % api_response.status_code try: api_response = api_response.json() msg += api_response.get("title", "") @@ -247,19 +244,19 @@ def get_items(self, query): # invalid API key elif api_response.status_code == 401: - self.dataset.update_status("Invalid API key - could not connect to Twitter API", is_final=True) + self.dataset.update_status("Invalid API key - could not connect to X API", is_final=True) return # haven't seen one yet, but they probably exist elif api_response.status_code != 200: self.dataset.update_status( "Unexpected HTTP status %i. Halting tweet collection." % api_response.status_code, is_final=True) - self.log.warning("Twitter API v2 responded with status code %i. Response body: %s" % ( + self.log.warning("X API v2 responded with status code %i. Response body: %s" % ( api_response.status_code, api_response.text)) return elif not api_response: - self.dataset.update_status("Could not connect to Twitter. Cancelling.", is_final=True) + self.dataset.update_status("Could not connect to X. Cancelling.", is_final=True) return api_response = api_response.json() @@ -291,13 +288,13 @@ def get_items(self, query): if num_missing_objects > 50: # Large amount of missing objects; possible error with Twitter API self.import_issues = False - error_report.append('%i missing objects received following tweet number %i. Possible issue with Twitter API.' % (num_missing_objects, tweets)) + error_report.append('%i missing objects received following post number %i. Possible issue with X API.' 
% (num_missing_objects, tweets))
                     error_report.append('Missing objects collected: ' + ', '.join(['%s: %s' % (k, len(v)) for k, v in missing_objects.items()]))
 
                 # Warn if new missing object is recorded (for developers to handle)
                 expected_error_types = ['user', 'media', 'poll', 'tweet', 'place']
                 if any(key not in expected_error_types for key in missing_objects.keys()):
-                    self.log.warning("Twitter API v2 returned unknown error types: %s" % str([key for key in missing_objects.keys() if key not in expected_error_types]))
+                    self.log.warning("X API v2 returned unknown error types: %s" % str([key for key in missing_objects.keys() if key not in expected_error_types]))
 
             # Loop through and collect tweets
             for tweet in api_response.get("data", []):
@@ -312,7 +309,7 @@
                 tweets += 1
 
                 if tweets % 500 == 0:
-                    self.dataset.update_status("Received %s of ~%s tweets from the Twitter API" % ("{:,}".format(tweets), expected_tweets))
+                    self.dataset.update_status("Received %s of ~%s posts from the X API" % ("{:,}".format(tweets), expected_tweets))
                 if num_expected_tweets is not None:
                     self.dataset.update_progress(tweets / num_expected_tweets)
 
@@ -474,21 +471,19 @@ def get_options(cls, parent_dataset=None, user=None):
         max_tweets = config.get("twitterv2-search.max_tweets", user=user)
 
         if have_api_key:
-            intro_text = ("This data source uses the full-archive search endpoint of the Twitter API (v2) to retrieve "
+            intro_text = ("This data source uses the full-archive search endpoint of the X API (v2) to retrieve "
                           "historic tweets that match a given query.")
         else:
-            intro_text = ("This data source uses either the Standard 7-day historical Search endpoint or the "
-                          "full-archive search endpoint of the Twitter API, v2. To use the latter, you must have "
-                          "access to the Academic Research track of the Twitter API. In either case, you will need to "
-                          "provide a valid [bearer "
-                          "token](https://developer.twitter.com/en/docs/authentication/oauth-2-0). The bearer token "
-                          "**will be sent to the 4CAT server**, where it will be deleted after data collection has "
-                          "started. Note that any tweets retrieved with 4CAT will count towards your monthly Tweet "
-                          "retrieval cap.")
-
-        intro_text += ("\n\nPlease refer to the [Twitter API documentation]("
-                       "https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) "
+            intro_text = ("This data source uses the full-archive search endpoint of the X/Twitter API, v2. To use "
+                          "it, you must have access to the Research track of the X API. You will need to provide a "
+                          "valid [bearer token](https://developer.x.com/en/docs/authentication/oauth-2-0). The "
+                          "bearer token **will be sent to the 4CAT server**, where it will be deleted after data "
+                          "collection has started. Note that any posts retrieved with 4CAT will count towards your "
+                          "monthly post retrieval cap.")
+
+        intro_text += ("\n\nPlease refer to the [X API]("
+                       "https://developer.x.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) "
                        "documentation for more information about this API endpoint and the syntax you can use in your "
                        "search query. 
Retweets are included by default; add `-is:retweet` to exclude them.")
 
@@ -500,16 +495,18 @@
         }
 
         if not have_api_key:
+            # options.update({
+            #     "api_type": {
+            #         "type": UserInput.OPTION_CHOICE,
+            #         "help": "API track",
+            #         "options": {
+            #             "all": "Research API: Full-archive search",
+            #             "recent": "Standard: Recent search (Tweets published in last 7 days)",
+            #         },
+            #         "default": "all"
+            #     }
+            # })
             options.update({
-                "api_type": {
-                    "type": UserInput.OPTION_CHOICE,
-                    "help": "API track",
-                    "options": {
-                        "all": "Academic: Full-archive search",
-                        "recent": "Standard: Recent search (Tweets published in last 7 days)",
-                    },
-                    "default": "all"
-                },
                 "api_bearer_token": {
                     "type": UserInput.OPTION_TEXT,
                     "sensitive": True,
@@ -523,10 +520,10 @@
                 "query_type": {
                     "type": UserInput.OPTION_CHOICE,
                     "help": "Query type",
-                    "tooltip": "Note: Num of Tweets and Date fields ignored with 'Tweets by ID' lookup",
+                    "tooltip": "Note: number of posts and date fields are ignored with 'Posts by ID' lookup",
                     "options": {
                         "query": "Search query",
-                        "id_lookup": "Tweets by ID (list IDs seperated by commas or one per line)",
+                        "id_lookup": "Posts by ID (list IDs separated by commas or one per line)",
                     },
                     "default": "query"
                 }
@@ -539,7 +536,7 @@
             },
             "amount": {
                 "type": UserInput.OPTION_TEXT,
-                "help": "Tweets to retrieve",
+                "help": "Posts to retrieve",
                 "tooltip": "0 = unlimited (be careful!)" if not max_tweets else ("0 = maximum (%s)" % str(max_tweets)),
                 "min": 0,
                 "max": max_tweets if max_tweets else 10_000_000,
@@ -550,7 +547,7 @@
             },
             "daterange-info": {
                 "type": UserInput.OPTION_INFO,
-                "help": "By default, Twitter returns tweets up til 30 days ago. If you want to go back further, you "
+                "help": "By default, X returns posts from up to 30 days ago. If you want to go back further, you "
                         "need to explicitly set a date range."
             },
             "daterange": {
@@ -591,7 +588,7 @@ def validate_query(query, request, user):
             raise QueryParametersException("Please provide a valid bearer token.")
 
         if len(query.get("query")) > 1024 and query.get("query_type", "query") != "id_lookup":
-            raise QueryParametersException("Twitter API queries cannot be longer than 1024 characters.")
+            raise QueryParametersException("X API queries cannot be longer than 1024 characters.")
 
         if query.get("query_type", "query") == "id_lookup" and config.get("twitterv2-search.id_lookup", user=user):
             # reformat queries to be a comma-separated list with no wrapping
@@ -630,7 +627,7 @@ def validate_query(query, request, user):
         # to dissuade users from running huge queries that will take forever
         # to process
         if params["query_type"] == "query" and (params.get("api_type") == "all" or have_api_key):
-            count_url = "https://api.twitter.com/2/tweets/counts/all"
+            count_url = "https://api.x.com/2/tweets/counts/all"
             count_params = {
                 "granularity": "day",
                 "query": params["query"],
@@ -668,7 +665,7 @@ def validate_query(query, request, user):
             elif response.status_code == 401:
                 raise QueryParametersException("Your bearer token seems to be invalid. Please make sure it is valid "
-                                               "for the Academic Track of the Twitter API.")
+                                               "for the Research track of the X API.")
 
             elif response.status_code == 400:
                 raise QueryParametersException("Your query is invalid. 
Please make sure the date range does not " @@ -791,7 +788,7 @@ def map_item(item): "thread_id": item.get("conversation_id", item["id"]), "timestamp": tweet_time.strftime("%Y-%m-%d %H:%M:%S"), "unix_timestamp": int(tweet_time.timestamp()), - 'link': "https://twitter.com/%s/status/%s" % (author_username, item.get('id')), + 'link': "https://x.com/%s/status/%s" % (author_username, item.get('id')), "subject": "", "body": item["text"], "author": author_username,
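
For hands-on verification of the migrated `api.x.com` endpoints outside 4CAT, below is a minimal sketch of the paginated request loop this datasource implements. It assumes only the `requests` package; `BEARER_TOKEN` and the example query are placeholders, and the granular 429/403/400/5xx handling lives in `get_items()` above, not here.

```python
import time

import requests

BEARER_TOKEN = "..."  # placeholder; requires Research/full-archive access
endpoint = "https://api.x.com/2/tweets/search/all"
headers = {"Authorization": "Bearer %s" % BEARER_TOKEN}
params = {"query": "(4cat OR #4cat) -is:retweet", "max_results": 100}

while True:
    response = requests.get(endpoint, headers=headers, params=params)
    if response.status_code != 200:
        # 4CAT waits or retries per status code; this sketch simply stops
        print("HTTP %i: %s" % (response.status_code, response.text))
        break

    page = response.json()
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"])

    # v2 search paginates via meta.next_token; absence means the last page
    next_token = page.get("meta", {}).get("next_token")
    if not next_token:
        break
    params["next_token"] = next_token
    time.sleep(1)  # full-archive search allows one request per second
```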