| 1 | +The Bluesky data source searches for posts matching your queries and collects them. You can optionally restrict |
| 2 | +collection to posts within a given date range. The number of posts you can collect per query is limited (this limit |
| 3 | +is set by your 4CAT administrators per user type). |
| 4 | + |
| 5 | +### Search queries |
| 6 | +Bluesky has published tips and tricks for using its search engine, which you can find [here](https://bsky.social/about/blog/05-31-2024-search). |
| 7 | + |
| 8 | +4CAT uses the Bluesky API via `atproto` to collect posts, which requires a user to be logged in. It is therefore possible |
| 9 | +that your search results are tailored to your user profile, so you may wish to create a new user profile for your |
| 10 | +research. |
| 11 | + |
| 12 | +You can currently search by the following: |
| 13 | + |
| 14 | +- Keywords (use quotes for exact matches) |
| 15 | +- From or mentioning a user |
| 16 | + - Use `from:username` to search for posts from a user |
| 17 | + - Use `to:username` or `mentions:username` to search for posts mentioning a user |
| 18 | +  - Use `"@username"` to search for posts including the text "@username", whether or not the user was tagged |
| 19 | +- URL |
| 20 | +  - Use `domain:example.com` to search for posts linking to the domain "example.com" |
| 21 | +- Language |
| 22 | +  - Use `lang:en` to search for posts in English, for example |
| 23 | +- Date range is best set in the 4CAT interface, but can also be set per query using `since:YYYY-MM-DD` and `until:YYYY-MM-DD`. |
| 24 | + |
| 25 | +Note: 4CAT uses commas (`,`) to separate queries, so you should not use them within a query. If you need to search |
| 26 | +for text that contains commas, please contact your 4CAT administrator. |
| 27 | + |
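As an illustration of how the operators above combine, here is a minimal Python sketch that assembles a query string and splits a comma-separated list of queries. The helper names `build_query` and `split_queries` are hypothetical and are not part of 4CAT or `atproto`; how 4CAT itself parses the query field may differ.

```python
def build_query(keywords="", from_user=None, mentions=None, domain=None,
                lang=None, since=None, until=None):
    """Assemble one search query from the operators listed above (hypothetical helper)."""
    parts = [keywords] if keywords else []
    operators = (
        ("from", from_user),
        ("mentions", mentions),
        ("domain", domain),
        ("lang", lang),
        ("since", since),   # expected as YYYY-MM-DD
        ("until", until),   # expected as YYYY-MM-DD
    )
    for operator, value in operators:
        if value:
            parts.append(f"{operator}:{value}")
    query = " ".join(parts)
    # Commas separate queries in 4CAT, so a single query must not contain one.
    if "," in query:
        raise ValueError("A single query must not contain a comma")
    return query


def split_queries(raw):
    """Split a comma-separated string into individual queries, dropping empty ones."""
    return [query.strip() for query in raw.split(",") if query.strip()]
```

For example, `build_query('"climate change"', from_user="example.bsky.social", lang="en")` yields one query, and several such queries can be joined with commas before being entered in the 4CAT query field.
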
| 28 | +### Technical details and caveats |
| 29 | +Bluesky data is collected from Bluesky's [official API](https://docs.bsky.app/docs/get-started) using the |
| 30 | +[AT Protocol](https://atproto.blue/en/latest/). This is done with the [AT Protocol library](https://pypi.org/project/atproto/) |
| 31 | +for Python. You can always view the latest code used by 4CAT to collect Bluesky data [here](https://github.com/digitalmethodsinitiative/4cat/blob/master/datasources/bsky/search_bsky.py); |
| 32 | +each dataset contains a unique `commit` identifier which you can use to find the exact code used to collect your data. |
| 33 | + |
| 34 | +### Data format |
| 35 | +Posts are saved as JSON objects, combined in one [NDJSON](http://ndjson.org/) file. For each post, the object |
| 36 | +collected with `atproto` is mapped to JSON. A lot of information is included per post, more than can be explained |
| 37 | +here. You can read more about the post data structure in [this |
| 38 | +documentation](https://docs.bsky.app/docs/advanced-guides/posts). Most |
| 39 | +metadata you may be interested in is included or can be derived from the included data. |
| 40 | + |
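Since each line of an NDJSON file is a separate JSON object, a downloaded dataset can be read one post at a time. The following is a minimal sketch using only the Python standard library; the function name `read_ndjson` and the file name in the usage note are placeholders, and no particular field names are assumed.

```python
import json


def read_ndjson(path):
    """Yield one parsed JSON object per non-empty line of an NDJSON file."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end of the file
                yield json.loads(line)
```

To inspect which top-level attributes a post contains, you could run `for post in read_ndjson("bluesky-dataset.ndjson"): print(sorted(post.keys()))` on your own export.
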
| 41 | +NDJSON files can also be downloaded as a CSV file in 4CAT. In this case, only the most important attributes |
| 42 | +of a post (such as text body, timestamp, author name, and whether it was forwarded) are included in the CSV file. |
| 43 | +The CSV structure will appear the same as in the Preview view in 4CAT. |
| 44 | + |
| 45 | +#### Note on user handles and post URLs |
| 46 | +In Bluesky, user handles can change. They are also used in forming the URL of a post. It is thus |
| 47 | +possible for a post's user handle and URL to change over time. Both the post and the user have unique IDs which will not |
| 48 | +change and can be used to identify them and look up the new post URL or user handle. |
| 49 | + |
| 50 | +In addition to the post, 4CAT therefore also collects the user handles of mentions, replies, quoted posts, etc. during |
| 51 | +the collection process and uses these handles to create the URLs seen in the 4CAT preview and CSV export. If a handle |
| 52 | +cannot be looked up (e.g., the user has deleted their account, blocked certain users, or was suspended), 4CAT will use the |
| 53 | +author ID in place of the user handle. The raw data is stored in the JSON file as it is received from Bluesky and can |
| 54 | +thus be used to update the handles/URLs in the future if needed. |
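
The fallback described above can be sketched as follows. This is an illustration, not 4CAT's own code: it assumes a post is identified by an AT URI of the form `at://<author ID>/app.bsky.feed.post/<record key>` and that the public web URL follows the pattern `https://bsky.app/profile/<handle or author ID>/post/<record key>`. The function name `post_url` is hypothetical.

```python
def post_url(at_uri, handle=None):
    """Build a web URL for a post, using the author ID when no handle is known.

    Assumes at_uri looks like: at://did:plc:abc123/app.bsky.feed.post/3kxyz
    """
    prefix = "at://"
    if not at_uri.startswith(prefix):
        raise ValueError(f"Not an AT URI: {at_uri}")
    parts = at_uri[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError(f"Unexpected AT URI structure: {at_uri}")
    author_id, _collection, record_key = parts
    # Prefer the (changeable) handle; fall back to the stable author ID.
    profile = handle or author_id
    return f"https://bsky.app/profile/{profile}/post/{record_key}"
```

Because the author ID and record key do not change, a URL built this way from the raw JSON can be regenerated later with an up-to-date handle.
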