diff --git a/datasources/twitterv2/DESCRIPTION.md b/datasources/twitterv2/DESCRIPTION.md
index 57f1f7a5..d138e675 100644
--- a/datasources/twitterv2/DESCRIPTION.md
+++ b/datasources/twitterv2/DESCRIPTION.md
@@ -1,93 +1,88 @@
-Twitter data is gathered through the official [Twitter v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT
-allows access to both the Standard and the Academic track. The Standard track is free for anyone to use, but only
-allows to retrieve tweets up to seven days old. The Academic track allows a full-archive search of up to ten million
-tweets per month (as of March 2022). For the Academic track, you need a valid Bearer token. You can request one
-[here](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you).
+X/Twitter data is gathered through the official [X v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT can interface with X's Research API (sometimes
+branded as the 'DSA API', referencing the EU's Digital Services Act). To retrieve posts via this API with 4CAT, you need
+a valid Bearer token. Read more about this mode of access [here](https://developer.x.com/en/use-cases/do-research/academic-research).
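+
+As a minimal sketch of this mode of access, a single request to the full-archive search
+endpoint could look as follows (the token value is a placeholder; `query` and
+`max_results` are standard v2 search parameters):
+
+```python
+import requests
+
+# Placeholder token: substitute your own Research API Bearer token
+headers = {"Authorization": "Bearer YOUR_BEARER_TOKEN"}
+
+response = requests.get(
+    "https://api.x.com/2/tweets/search/all",  # full-archive search endpoint
+    headers=headers,
+    params={"query": "4cat -is:retweet", "max_results": 10},
+)
+print(response.json())
+```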
-Tweets are captured in batches at a speed of approximately 100,000 tweets per hour. 4CAT will warn you if your dataset
+Posts are captured in batches at a speed of approximately 100,000 posts per hour. 4CAT will warn you if your dataset
is expected to take more than 30 minutes to collect. It is often a good idea to start small (with very specific
queries or narrow date ranges) and then only create a larger dataset if you are confident that it will be manageable and
useful for your analysis.
-If you hit your Twitter API quota while creating a dataset, the dataset will be finished with the tweets that have been
+If you hit your X API quota while creating a dataset, the dataset will be finished with the posts that have been
collected so far and a warning will be logged.
### Query syntax
-Check the [API documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query)
+Check the [API documentation](https://developer.x.com/en/docs/x-api/tweets/search/integrate/build-a-query)
for available query syntax and operators. This information is crucial to what data you collect. Important operators for
-instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted tweets and retweets. Query syntax
-is roughly the same as for Twitter's search interface, so you can try out most queries by entering them in the Twitter
-app or website's search field and looking at the results. You can also test queries with
-Twitter's [Query Builder](https://developer.twitter.com/apitools/query?query=).
+instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted posts and reposts. Query syntax
+is roughly the same as for X's search interface, so you can try out most queries by entering them in the X app or
+website's search field and looking at the results. You can also test queries with
+X's [Query Builder](https://developer.twitter.com/apitools/query?query=).
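+
+For example, the following query combines these operators to match posts containing the
+word 'climate' while excluding reposts and promoted posts:
+
+```
+climate -is:retweet -is:nullcast
+```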
### Date ranges
-By default, Twitter returns tweets posted within the past 30 days. If you want to go back further, you need to
-explicitly set a date range. Note that Twitter does not like date ranges that end in the future, or start before
-Twitter existed. If you want to capture tweets "until now", it is often best to use yesterday as an end date.
+By default, X returns posts published within the past 30 days. If you want to go back further, you need to
+explicitly set a date range. Note that X does not like date ranges that end in the future, or start before
+Twitter existed. If you want to capture posts "until now", it is often best to use yesterday as an end date. Also note
+that API access may come with certain limitations on how far a query may extend into history.
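+
+When talking to the API directly, such a range maps onto the v2 search endpoint's ISO 8601
+`start_time` and `end_time` parameters. A small sketch of building an "until yesterday"
+range (the query and start date are placeholders):
+
+```python
+import datetime
+
+# End the range at the start of today (UTC), i.e. with yesterday as the last full
+# day, so that the range never ends in the future
+today = datetime.datetime.now(datetime.timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
+params = {
+    "query": "climate -is:retweet",
+    "start_time": "2022-01-01T00:00:00Z",
+    "end_time": today.strftime("%Y-%m-%dT%H:%M:%SZ"),
+}
+```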
### Geo parameters
-Twitter offers a number of ways
-to [query by location/geo data](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location)
-such as `has:geo`, `place:Amsterdam`, or `place:Amsterdam`. This feature is only available for the Academic level;
-you will receive a 400 error if using queries filtering by geographic information.
+X offers a number of ways
+to [query by location/geo data](https://developer.x.com/en/docs/tutorials/filtering-tweets-by-location)
+such as `has:geo` or `place:Amsterdam`.
### Retweets
-A retweet from Twitter API v2 contains at maximum 140 characters from the original tweet. 4CAT therefore
-gathers both the retweet and the original tweet and reformats the retweet text so it resembles a user's experience.
+A repost from the X API v2 contains at most 140 characters of the original post. 4CAT therefore
+gathers both the repost and the original post and reformats the repost text so that it resembles what a user sees.
This also affects mentions, hashtags, and other data as only those contained in the first 140 characters are provided
-by Twitter API v2 with the retweet. Additional hashtags, mentions, etc. are taken from the original tweet and added
-to the retweet for 4CAT analysis methods. *4CAT stores the data from Twitter API v2 as similar as possible to the format
+by X API v2 with the repost. Additional hashtags, mentions, etc. are taken from the original post and added
+to the repost for 4CAT analysis methods. *4CAT stores the data from X API v2 as close as possible to the format
in which it was received which you can obtain by downloading the ndjson file.*
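+
+The ndjson file contains one JSON object per line, so it can be read with a few lines of
+Python (the file name is illustrative; `id` and `text` are standard fields of a v2 post
+object):
+
+```python
+import json
+
+# Read a 4CAT ndjson export: one post object per line
+with open("dataset.ndjson") as infile:
+    for line in infile:
+        post = json.loads(line)
+        print(post["id"], post["text"])
+```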
*Example 1*
-[This retweet](https://twitter.com/tonino1630/status/1554618034299568128) returns the following data:
+[This repost](https://x.com/tonino1630/status/1554618034299568128) returns the following data:
- *author:* `tonino1630`
-- *
- text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
+- *text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
- *mentions:* `ChuckyFrao`
- *hashags:*
-While the original tweet will return (as a reference tweet) this data:
+While the original post will return (as a reference post) this data:
- *author:* `ChuckyFrao`
-- *
- text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0`
+- *text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0`
- *mentions:* `POTUS, usembassyve, StateSPEHA, StateDept, SecBlinken`
- *hashtags:* `FreeAlexSaab, BringAlexHome, IntegridadTerritorial`
-As you can see, only the author of the original tweet is listed as a mention in the retweet.
+As you can see, only the author of the original post is listed as a mention in the repost.
*Example 2*
-[This retweet](https://twitter.com/Macsmart31/status/1554618041459445760) returns the following:
+[This repost](https://x.com/Macsmart31/status/1554618041459445760) returns the following:
- *author:* `Macsmart31`
-- *
- text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…`
+- *text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…`
- *mentions:* `mickyd123us, tribelaw, HonorDecency`
-Compared with the original tweet referenced below:
+Compared with the original post referenced below:
- *author:* `mickyd123us`
-- *
- text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr`
+- *text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr`
- *mentions:* `tribelaw, HonorDecency`
-Because the mentioned users are in the first 140 characters of the original tweet, they are also listed as mentions in the retweet.
-
-The key difference here is that example one the retweet contains none of the hashtags or mentions from the original
-tweet (they are beyond the first 140 characters) while the second retweet example does return mentions from the original
-tweet. *Due to this discrepancy, for retweets all mentions and hashtags of the original tweet are considered as mentions
-and hashtags of the retweet.* A user on Twitter will see all mentions and hashtags when viewing a retweet and the
-retweet would be a part of any network around those mentions and hashtags.
+Because the mentioned users are in the first 140 characters of the original post, they are also listed as mentions in
+the repost.
+
+The key difference here is that in example one the repost contains none of the hashtags or mentions from the original
+post (they are beyond the first 140 characters) while the second repost example does return mentions from the original
+post. *Due to this discrepancy, for reposts all mentions and hashtags of the original post are considered as mentions
+and hashtags of the repost.* A user on X will see all mentions and hashtags when viewing a repost, and the
+repost would therefore be part of any network around those mentions and hashtags.
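+
+In other words, the entities of a repost are treated as the union of its own entities and
+those of the referenced post. A minimal sketch of that merging logic (a hypothetical
+helper; the field names follow the v2 payload's `entities` structure):
+
+```python
+def merge_mentions(repost, original):
+    """Union of @-mentions from a repost and its referenced original post."""
+    def mentions(post):
+        return {m["username"] for m in post.get("entities", {}).get("mentions", [])}
+    return sorted(mentions(repost) | mentions(original))
+```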
diff --git a/datasources/twitterv2/__init__.py b/datasources/twitterv2/__init__.py
index 3335bc7c..6aa80c7b 100644
--- a/datasources/twitterv2/__init__.py
+++ b/datasources/twitterv2/__init__.py
@@ -9,4 +9,4 @@
# Internal identifier for this data source
DATASOURCE = "twitterv2"
-NAME = "Twitter API (v2) Search"
\ No newline at end of file
+NAME = "X/Twitter API (v2) Search"
\ No newline at end of file
diff --git a/datasources/twitterv2/search_twitter.py b/datasources/twitterv2/search_twitter.py
index 999680b6..8b91d1eb 100644
--- a/datasources/twitterv2/search_twitter.py
+++ b/datasources/twitterv2/search_twitter.py
@@ -1,5 +1,5 @@
"""
-Twitter keyword search via the Twitter API v2
+X/Twitter keyword search via the X API v2
"""
import requests
import datetime
@@ -17,13 +17,10 @@
class SearchWithTwitterAPIv2(Search):
"""
- Get Tweets via the Twitter API
-
- This only allows for historical search - use f.ex. TCAT for more advanced
- queries.
+    Get posts via the X API
"""
type = "twitterv2-search" # job ID
- title = "Twitter API (v2)"
+ title = "X/Twitter API (v2)"
extension = "ndjson"
is_local = False # Whether this datasource is locally scraped
is_static = False # Whether this datasource is still updated
@@ -32,15 +29,15 @@ class SearchWithTwitterAPIv2(Search):
import_issues = True
references = [
- "[Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api)"
+ "[X/Twitter API documentation](https://developer.x.com/en/docs/x-api)"
]
config = {
"twitterv2-search.academic_api_key": {
"type": UserInput.OPTION_TEXT,
"default": "",
- "help": "Academic API Key",
- "tooltip": "An API key for the Twitter v2 Academic API. If "
+ "help": "Research API Key",
+ "tooltip": "An API key for the X/Twitter v2 Research API. If "
"provided, the user will not need to enter their own "
"key to retrieve tweets. Note that this API key should "
"have access to the Full Archive Search endpoint."
@@ -50,15 +47,15 @@ class SearchWithTwitterAPIv2(Search):
"default": 0,
"min": 0,
"max": 10_000_000,
- "help": "Max tweets per dataset",
+ "help": "Max posts per dataset",
"tooltip": "4CAT will never retrieve more than this amount of "
- "tweets per dataset. Enter '0' for unlimited tweets."
+ "posts per dataset. Enter '0' for unlimited posts."
},
"twitterv2-search.id_lookup": {
"type": UserInput.OPTION_TOGGLE,
"default": False,
"help": "Allow lookup by ID",
- "tooltip": "If enabled, allow users to enter a list of tweet IDs "
+ "tooltip": "If enabled, allow users to enter a list of post IDs "
"to retrieve. This is disabled by default because it "
"can be confusing to novice users."
}
@@ -110,7 +107,7 @@ def get_items(self, query):
}
if self.parameters.get("query_type", "query") == "id_lookup" and self.config.get("twitterv2-search.id_lookup"):
- endpoint = "https://api.twitter.com/2/tweets"
+ endpoint = "https://api.x.com/2/tweets"
tweet_ids = self.parameters.get("query", []).split(',')
@@ -126,7 +123,7 @@ def get_items(self, query):
else:
# Query to all or search
- endpoint = "https://api.twitter.com/2/tweets/search/" + api_type
+ endpoint = "https://api.x.com/2/tweets/search/" + api_type
queries = [self.parameters.get("query", "")]
@@ -158,7 +155,7 @@ def get_items(self, query):
while True:
if self.interrupted:
- raise ProcessorInterruptedException("Interrupted while getting tweets from the Twitter API")
+                raise ProcessorInterruptedException("Interrupted while getting posts from the X API")
# there is a limit of one request per second, so stay on the safe side of this
while self.previous_request == int(time.time()):
@@ -188,18 +185,18 @@ def get_items(self, query):
try:
structured_response = api_response.json()
if structured_response.get("title") == "UsageCapExceeded":
- self.dataset.update_status("Hit the monthly tweet cap. You cannot capture more tweets "
- "until your API quota resets. Dataset completed with tweets "
+ self.dataset.update_status("Hit the monthly post cap. You cannot capture more posts "
+ "until your API quota resets. Dataset completed with posts "
"collected so far.", is_final=True)
return
except (json.JSONDecodeError, ValueError):
- self.dataset.update_status("Hit Twitter rate limit, but could not figure out why. Halting "
- "tweet collection.", is_final=True)
+ self.dataset.update_status("Hit X's rate limit, but could not figure out why. Halting "
+ "post collection.", is_final=True)
return
resume_at = convert_to_int(api_response.headers["x-rate-limit-reset"]) + 1
resume_at_str = datetime.datetime.fromtimestamp(int(resume_at)).strftime("%c")
- self.dataset.update_status("Hit Twitter rate limit - waiting until %s to continue." % resume_at_str)
+ self.dataset.update_status("Hit X's rate limit - waiting until %s to continue." % resume_at_str)
while time.time() <= resume_at:
if self.interrupted:
raise ProcessorInterruptedException("Interrupted while waiting for rate limit to reset")
@@ -211,10 +208,10 @@ def get_items(self, query):
elif api_response.status_code == 403:
try:
structured_response = api_response.json()
- self.dataset.update_status("'Forbidden' error from the Twitter API. Could not connect to Twitter API "
+ self.dataset.update_status("'Forbidden' error from the X API. Could not connect to X API "
"with this API key. %s" % structured_response.get("detail", ""), is_final=True)
except (json.JSONDecodeError, ValueError):
- self.dataset.update_status("'Forbidden' error from the Twitter API. Your key may not have access to "
+ self.dataset.update_status("'Forbidden' error from the X API. Your key may not have access to "
"the full-archive search endpoint.", is_final=True)
finally:
return
@@ -224,7 +221,7 @@ def get_items(self, query):
elif api_response.status_code in (502, 503, 504):
resume_at = time.time() + 60
resume_at_str = datetime.datetime.fromtimestamp(int(resume_at)).strftime("%c")
- self.dataset.update_status("Twitter unavailable (status %i) - waiting until %s to continue." % (
+ self.dataset.update_status("X unavailable (status %i) - waiting until %s to continue." % (
api_response.status_code, resume_at_str))
while time.time() <= resume_at:
time.sleep(0.5)
@@ -233,7 +230,7 @@ def get_items(self, query):
# this usually means the query is too long or otherwise contains
# a syntax error
elif api_response.status_code == 400:
- msg = "Response %i from the Twitter API; " % api_response.status_code
+ msg = "Response %i from the X API; " % api_response.status_code
try:
api_response = api_response.json()
msg += api_response.get("title", "")
@@ -247,19 +244,19 @@ def get_items(self, query):
# invalid API key
elif api_response.status_code == 401:
- self.dataset.update_status("Invalid API key - could not connect to Twitter API", is_final=True)
+ self.dataset.update_status("Invalid API key - could not connect to X API", is_final=True)
return
# haven't seen one yet, but they probably exist
elif api_response.status_code != 200:
self.dataset.update_status(
"Unexpected HTTP status %i. Halting tweet collection." % api_response.status_code, is_final=True)
- self.log.warning("Twitter API v2 responded with status code %i. Response body: %s" % (
+ self.log.warning("X API v2 responded with status code %i. Response body: %s" % (
api_response.status_code, api_response.text))
return
elif not api_response:
- self.dataset.update_status("Could not connect to Twitter. Cancelling.", is_final=True)
+ self.dataset.update_status("Could not connect to X. Cancelling.", is_final=True)
return
api_response = api_response.json()
@@ -291,13 +288,13 @@ def get_items(self, query):
if num_missing_objects > 50:
# Large amount of missing objects; possible error with Twitter API
self.import_issues = False
- error_report.append('%i missing objects received following tweet number %i. Possible issue with Twitter API.' % (num_missing_objects, tweets))
+ error_report.append('%i missing objects received following post number %i. Possible issue with X API.' % (num_missing_objects, tweets))
error_report.append('Missing objects collected: ' + ', '.join(['%s: %s' % (k, len(v)) for k, v in missing_objects.items()]))
# Warn if new missing object is recorded (for developers to handle)
expected_error_types = ['user', 'media', 'poll', 'tweet', 'place']
if any(key not in expected_error_types for key in missing_objects.keys()):
- self.log.warning("Twitter API v2 returned unknown error types: %s" % str([key for key in missing_objects.keys() if key not in expected_error_types]))
+ self.log.warning("X API v2 returned unknown error types: %s" % str([key for key in missing_objects.keys() if key not in expected_error_types]))
# Loop through and collect tweets
for tweet in api_response.get("data", []):
@@ -312,7 +309,7 @@ def get_items(self, query):
tweets += 1
if tweets % 500 == 0:
- self.dataset.update_status("Received %s of ~%s tweets from the Twitter API" % ("{:,}".format(tweets), expected_tweets))
+ self.dataset.update_status("Received %s of ~%s tweets from the X API" % ("{:,}".format(tweets), expected_tweets))
if num_expected_tweets is not None:
self.dataset.update_progress(tweets / num_expected_tweets)
@@ -474,21 +471,19 @@ def get_options(cls, parent_dataset=None, user=None):
max_tweets = config.get("twitterv2-search.max_tweets", user=user)
if have_api_key:
- intro_text = ("This data source uses the full-archive search endpoint of the Twitter API (v2) to retrieve "
+ intro_text = ("This data source uses the full-archive search endpoint of the X API (v2) to retrieve "
"historic tweets that match a given query.")
else:
- intro_text = ("This data source uses either the Standard 7-day historical Search endpoint or the "
- "full-archive search endpoint of the Twitter API, v2. To use the latter, you must have "
- "access to the Academic Research track of the Twitter API. In either case, you will need to "
- "provide a valid [bearer "
- "token](https://developer.twitter.com/en/docs/authentication/oauth-2-0). The bearer token "
- "**will be sent to the 4CAT server**, where it will be deleted after data collection has "
- "started. Note that any tweets retrieved with 4CAT will count towards your monthly Tweet "
- "retrieval cap.")
-
- intro_text += ("\n\nPlease refer to the [Twitter API documentation]("
- "https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) "
+ intro_text = ("This data source uses the full-archive search endpoint of the X/Twitter API, v2. To use the "
+ "it, you must have access to the Research track of the X API. You will need to provide a "
+ "valid [bearer token](https://developer.x.com/en/docs/authentication/oauth-2-0). The "
+ "bearer token **will be sent to the 4CAT server**, where it will be deleted after data "
+ "collection has started. Note that any posts retrieved with 4CAT will count towards your "
+ "monthly post retrieval cap.")
+
+ intro_text += ("\n\nPlease refer to the [X API documentation]("
+ "https://developer.x.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) "
"documentation for more information about this API endpoint and the syntax you can use in your "
"search query. Retweets are included by default; add `-is:retweet` to exclude them.")
@@ -500,16 +495,18 @@ def get_options(cls, parent_dataset=None, user=None):
}
if not have_api_key:
+ # options.update({
+ # "api_type": {
+ # "type": UserInput.OPTION_CHOICE,
+ # "help": "API track",
+ # "options": {
+ # "all": "Research API: Full-archive search",
+ # "recent": "Standard: Recent search (Tweets published in last 7 days)",
+ # },
+ # "default": "all"
+ # }
+ # })
options.update({
- "api_type": {
- "type": UserInput.OPTION_CHOICE,
- "help": "API track",
- "options": {
- "all": "Academic: Full-archive search",
- "recent": "Standard: Recent search (Tweets published in last 7 days)",
- },
- "default": "all"
- },
"api_bearer_token": {
"type": UserInput.OPTION_TEXT,
"sensitive": True,
@@ -523,10 +520,10 @@ def get_options(cls, parent_dataset=None, user=None):
"query_type": {
"type": UserInput.OPTION_CHOICE,
"help": "Query type",
- "tooltip": "Note: Num of Tweets and Date fields ignored with 'Tweets by ID' lookup",
+ "tooltip": "Note: Num of posts and date fields are ignored with 'Posts by ID' lookup",
"options": {
"query": "Search query",
- "id_lookup": "Tweets by ID (list IDs seperated by commas or one per line)",
+ "id_lookup": "Posts by ID (list IDs seperated by commas or one per line)",
},
"default": "query"
}
@@ -539,7 +536,7 @@ def get_options(cls, parent_dataset=None, user=None):
},
"amount": {
"type": UserInput.OPTION_TEXT,
- "help": "Tweets to retrieve",
+ "help": "Posts to retrieve",
"tooltip": "0 = unlimited (be careful!)" if not max_tweets else ("0 = maximum (%s)" % str(max_tweets)),
"min": 0,
"max": max_tweets if max_tweets else 10_000_000,
@@ -550,7 +547,7 @@ def get_options(cls, parent_dataset=None, user=None):
},
"daterange-info": {
"type": UserInput.OPTION_INFO,
- "help": "By default, Twitter returns tweets up til 30 days ago. If you want to go back further, you "
+ "help": "By default, X returns posts up til 30 days ago. If you want to go back further, you "
"need to explicitly set a date range."
},
"daterange": {
@@ -591,7 +588,7 @@ def validate_query(query, request, user):
raise QueryParametersException("Please provide a valid bearer token.")
if len(query.get("query")) > 1024 and query.get("query_type", "query") != "id_lookup":
- raise QueryParametersException("Twitter API queries cannot be longer than 1024 characters.")
+ raise QueryParametersException("X API queries cannot be longer than 1024 characters.")
if query.get("query_type", "query") == "id_lookup" and config.get("twitterv2-search.id_lookup", user=user):
# reformat queries to be a comma-separated list with no wrapping
@@ -630,7 +627,7 @@ def validate_query(query, request, user):
# to dissuade users from running huge queries that will take forever
# to process
if params["query_type"] == "query" and (params.get("api_type") == "all" or have_api_key):
- count_url = "https://api.twitter.com/2/tweets/counts/all"
+ count_url = "https://api.x.com/2/tweets/counts/all"
count_params = {
"granularity": "day",
"query": params["query"],
@@ -668,7 +665,7 @@ def validate_query(query, request, user):
elif response.status_code == 401:
raise QueryParametersException("Your bearer token seems to be invalid. Please make sure it is valid "
- "for the Academic Track of the Twitter API.")
+ "for the Research track of the X API.")
elif response.status_code == 400:
raise QueryParametersException("Your query is invalid. Please make sure the date range does not "
@@ -791,7 +788,7 @@ def map_item(item):
"thread_id": item.get("conversation_id", item["id"]),
"timestamp": tweet_time.strftime("%Y-%m-%d %H:%M:%S"),
"unix_timestamp": int(tweet_time.timestamp()),
- 'link': "https://twitter.com/%s/status/%s" % (author_username, item.get('id')),
+ 'link': "https://x.com/%s/status/%s" % (author_username, item.get('id')),
"subject": "",
"body": item["text"],
"author": author_username,