
Commit 115a3c1

committed
bsky datasource
1 parent 836a235 commit 115a3c1

4 files changed: +764 -0 lines changed


datasources/bsky/DESCRIPTION.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
The Bluesky data source searches for posts and collects them based on your queries. You can optionally restrict
collection to posts within a given date range. The number of posts you can collect per query is limited; this limit is
set per user type by your 4CAT administrators.

### Search queries
Bluesky has published tips for using its search engine, which you can find [here](https://bsky.social/about/blog/05-31-2024-search).

4CAT uses the Bluesky API via `atproto` to collect posts, which requires a user to be logged in. It is therefore possible
that your search results are tailored to your user profile. You may wish to create a separate user profile for your
research.

You can currently search by the following:

- Keywords (use quotes for exact matches)
- From or mentioning a user
  - Use `from:username` to search for posts from a user
  - Use `to:username` or `mentions:username` to search for posts mentioning a user
  - Use `"@username"` to search for posts including the text "@username", whether the user was tagged or not
- URL
  - Use `domain:example.com` to search for posts with the domain "example.com".
- Language
  - Use `lang:en`, for example, to search for posts in English.
- Date range is best set in the 4CAT interface, but can also be set per query using `since:YYYY-MM-DD` and `until:YYYY-MM-DD`.

Note: commas (`,`) are used by 4CAT to separate queries, so you cannot use them within a query. If you need to search
for text that contains a comma, please contact your 4CAT administrator.
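
As a rough illustration of the comma rule above, the sketch below splits one input string into separate queries. The helper `split_queries` and the example handle are hypothetical and only approximate the behaviour described here; this is not 4CAT's actual parsing code.

```python
def split_queries(raw: str) -> list:
    """Split a comma-separated input string into individual search queries.

    Hypothetical helper for illustration only; 4CAT's own parsing may differ.
    """
    # Trim whitespace around each part and drop empty parts such as "a,,b".
    return [part.strip() for part in raw.split(",") if part.strip()]


# Two queries in one input: the comma separates them, spaces combine operators.
raw_input = 'from:example.bsky.social lang:en, "open data" since:2024-01-01 until:2024-06-30'
queries = split_queries(raw_input)

for query in queries:
    print(query)
```

Because every comma is treated as a separator, a phrase such as `"yes, please"` would be split into two queries, which is why commas cannot be used inside a query.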

### Technical details and caveats
Bluesky data is collected from Bluesky's [official API](https://docs.bsky.app/docs/get-started) over the
[AT Protocol](https://atproto.blue/en/latest/), using the [AT Protocol library](https://pypi.org/project/atproto/)
for Python. You can always view the latest code used by 4CAT to collect Bluesky data [here](https://github.com/digitalmethodsinitiative/4cat/blob/master/datasources/bsky/search_bsky.py);
each dataset contains a unique `commit` identifier which you can use to find the exact code used to collect your data.
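
For orientation, a paginated search with the `atproto` library might look roughly like the sketch below. It is not the code 4CAT runs (see the link above for that). The function names, the page size, and the `max_posts` cap are assumptions made for illustration, and the `search_posts` call and the `posts`/`cursor` response attributes are assumed to follow the `atproto` documentation.

```python
from typing import Optional


def build_search_params(query: str, limit: int = 25, cursor: Optional[str] = None) -> dict:
    """Build the parameters for one page of search results (illustrative helper)."""
    params = {"q": query, "limit": limit}
    if cursor:
        # The cursor returned with the previous page requests the next page.
        params["cursor"] = cursor
    return params


def search_posts(handle: str, password: str, query: str, max_posts: int = 100) -> list:
    """Collect up to max_posts posts for a query while logged in as the given user."""
    from atproto import Client  # third-party package: pip install atproto

    client = Client()
    client.login(handle, password)  # searching requires an authenticated session

    posts, cursor = [], None
    while len(posts) < max_posts:
        response = client.app.bsky.feed.search_posts(build_search_params(query, cursor=cursor))
        posts.extend(response.posts)
        cursor = response.cursor
        if not cursor or not response.posts:
            break  # no further pages
    return posts[:max_posts]
```

Calling `search_posts("your-handle", "your-app-password", "lang:en 4cat")` needs valid Bluesky credentials and network access; `build_search_params` can be tried on its own.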

### Data format
Posts are saved as JSON objects, combined in one [NDJSON](http://ndjson.org/) file. For each post, the object
collected with `atproto` is mapped to JSON. Each post includes more information than can be explained
here; you can read more about the post data structure in [this
documentation](https://docs.bsky.app/docs/advanced-guides/posts). Most
metadata you may be interested in is included or can be derived from the included data.

NDJSON files can also be downloaded as a CSV file in 4CAT. In this case, only the most important attributes
of a post (such as text body, timestamp, author name, and whether it was forwarded) are included in the CSV file.
The CSV structure matches the Preview view in 4CAT.
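
An NDJSON file holds one JSON object per line, so it can be read with the standard library alone. The sketch below uses made-up sample records; the `uri` and `record.text` fields are assumptions based on the Bluesky post documentation linked above, so check an actual export before relying on them.

```python
import io
import json
from typing import Iterator, TextIO


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Made-up records standing in for an exported file; real posts carry many more fields.
sample = io.StringIO(
    '{"uri": "at://did:plc:abc123/app.bsky.feed.post/3kexample", "record": {"text": "hello"}}\n'
    "\n"
    '{"uri": "at://did:plc:def456/app.bsky.feed.post/3kother", "record": {"text": "world"}}\n'
)

posts = list(read_ndjson(sample))
# Use .get() so that posts lacking a field do not raise a KeyError.
texts = [post.get("record", {}).get("text", "") for post in posts]
```

For a downloaded dataset, replace `sample` with `open("your-export.ndjson", encoding="utf-8")`.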

#### Note on user handles and post URLs
In Bluesky, user handles can change. They are also used to form the URL of a post. It is thus
possible for a post's user handle and URL to change over time. Both the post and the user have unique IDs which will not
change and can be used to identify them and look up the new post URL or user handle.

In addition to the post, 4CAT therefore also collects the user handles of mentions, replies, quoted posts, etc. during
the collection process and uses these handles to create the URLs seen in the 4CAT preview and CSV export. If a handle
cannot be looked up (e.g., the user has deleted their account, has blocked certain users, or was suspended), 4CAT will use the
author ID in place of the user handle. The raw data is stored in the JSON file as it is received from Bluesky and can
thus be used to update the handles/URLs in the future if needed.
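
The handle fallback described above can be sketched as follows. The function `post_url` is hypothetical and not taken from 4CAT. It assumes a post is identified by an AT URI of the form `at://<author ID>/app.bsky.feed.post/<post key>` and that web URLs follow the `https://bsky.app/profile/<handle or author ID>/post/<post key>` pattern.

```python
from typing import Optional


def post_url(at_uri: str, handle: Optional[str] = None) -> str:
    """Build a web URL for a post, using the author ID when no handle is known.

    Hypothetical helper for illustration; 4CAT's own implementation may differ.
    """
    prefix = "at://"
    if not at_uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {at_uri!r}")
    # Expected layout after the prefix: <author ID>/<collection>/<post key>
    author_id, _collection, post_key = at_uri[len(prefix):].split("/", 2)
    return f"https://bsky.app/profile/{handle or author_id}/post/{post_key}"


uri = "at://did:plc:abc123/app.bsky.feed.post/3kexample"
with_handle = post_url(uri, "alice.example.com")
without_handle = post_url(uri)  # handle could not be looked up
```

Since the author ID and post key never change, a URL built this way can be regenerated later with the current handle.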

datasources/bsky/__init__.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
"""
Initialize Bluesky data source
"""

# An init_datasource function is expected to be available to initialize this
# data source. A default function that does this is available from the
# backend helpers library.
from common.lib.helpers import init_datasource

# Internal identifier for this data source
DATASOURCE = "bsky"
NAME = "Bluesky"
