This update introduces three new data sources: two for data imported via Zeeschuimer, from Pinterest and RedNote/Xiaohongshu, and a third that allows for direct data capture from Bluesky when provided with a Bluesky login.
There are also several new processors, focused on image analysis and on the use of LLMs. In the latter category, a new processor can prompt the OpenAI API for text generation based on a dataset's content, enabling, for example, LLM-based coding or categorisation of a dataset collected with 4CAT.
Additionally, this release includes many bug fixes to processors, data sources and the 4CAT web interface.
We also recommend reading the instructions above if you are running 4CAT via Docker and have deviated from the default set-up in any way, or if you see error messages in the log file when upgrading via the web interface.
Otherwise, you can upgrade 4CAT via the 'Restart or upgrade' button in the Control Panel. This release of 4CAT incorporates the following fixes and improvements:
New data sources and processors
- Added data source: Bluesky, allowing for the capture of Bluesky posts for a given query; requires a Bluesky login - see the capture sketch after this list (115a3c1)
- Added data source: Pinterest, for importing data collected from the Pinterest website with Zeeschuimer (#478)
- Added data source: Xiaohongshu/RedNote, for importing data collected from the RedNote website with Zeeschuimer (34b8409)
- Added processor: Deduplicate images, filtering an image dataset for duplicates using a range of comparison methods - one such method is sketched after this list (aad7d57)
- Added processor: Bipartite image-item network, which can be used with e.g. Gephi's "Image Preview" plugin to create visual networks (f98addc)
- Added processor: Vectorise by category, which creates vectors of tokens grouped by a chosen column from the parent dataset (aeb01f7)
- Added processor: OpenAI prompting, to interface with the OpenAI API and generate text for each item based on the combination of a prompt and a value from the parent dataset; requires an OpenAI API key - see the sketch after this list (c405213)
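For illustration, the sketch below shows the kind of authenticated capture the new Bluesky data source performs, talking to the public AT Protocol XRPC endpoints directly with the requests library. It is a minimal sketch only; the helper function, the field selection and the use of requests are assumptions for illustration, not how 4CAT itself implements the data source.

```python
# Minimal sketch (not 4CAT's implementation): log in to Bluesky and capture
# posts matching a query via the public AT Protocol XRPC endpoints.
import requests

API = "https://bsky.social/xrpc"

def search_bluesky_posts(handle, app_password, query, limit=25):
    # Exchange the login for an access token; an app password is recommended
    session = requests.post(
        f"{API}/com.atproto.server.createSession",
        json={"identifier": handle, "password": app_password},
    ).json()

    # Query the post search endpoint with the obtained token
    response = requests.get(
        f"{API}/app.bsky.feed.searchPosts",
        headers={"Authorization": f"Bearer {session['accessJwt']}"},
        params={"q": query, "limit": limit},
    ).json()

    for post in response.get("posts", []):
        yield {
            "author": post["author"]["handle"],
            "text": post["record"].get("text", ""),
            "created_at": post["record"].get("createdAt", ""),
        }
```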
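The 'Deduplicate images' processor offers several comparison methods; as an illustration of one common approach, the sketch below filters near-duplicates with a perceptual hash from the imagehash library. The library choice and distance threshold are assumptions for illustration, not necessarily what the processor uses.

```python
# Illustrative sketch: keep only images whose perceptual hash is not within
# max_distance of an already-kept image (imagehash is an assumed dependency).
from pathlib import Path
from PIL import Image
import imagehash

def deduplicate_images(folder, max_distance=0):
    seen = []  # perceptual hashes of images kept so far
    for path in sorted(Path(folder).glob("*")):
        try:
            image_hash = imagehash.phash(Image.open(path))
        except Exception:
            continue  # skip files that cannot be opened as images
        if any(image_hash - kept <= max_distance for kept in seen):
            continue  # duplicate (or near-duplicate) of an earlier image
        seen.append(image_hash)
        yield path
```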
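Conceptually, the OpenAI prompting processor fills a prompt with a value from each dataset item and asks the API to complete it. The sketch below shows this pattern with the official openai Python package; the model name, the {value} placeholder and the 'id'/'body' item fields are illustrative assumptions rather than the processor's actual options.

```python
# Illustrative sketch of per-item prompting against the OpenAI API; field
# names and the prompt template are assumptions, not 4CAT's actual settings.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def prompt_per_item(items, prompt_template, model="gpt-4o-mini"):
    for item in items:
        prompt = prompt_template.format(value=item["body"])
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        yield item["id"], response.choices[0].message.content

# Example use: assign one of a fixed set of labels to every post in a dataset
# labels = prompt_per_item(dataset, "Label this post as 'news', 'opinion' or 'other': {value}")
```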
Updates to interface, data sources, and processors
- Updated the 'import CSV' data source to better handle files whose CSV dialect cannot be detected automatically (46b2805)
- Updated the 'Media upload' data source to warn a user when trying to upload SVG files (which most processors will not handle) (d119225, 2987fd4)
- Updated TikTok dataset import to include is_sensitive and is_photosensitive columns in the CSV mapping (2f42113)
- Updated TikTok image downloader processor to allow downloading author avatars (6881cba)
- Updated co-tag network processor to allow ignoring certain tags (e53b73f)
- Updated the video downloader processor to better handle download errors and rate limits (d1d9347, d4c43a7)
- Updated the 'Count values' and 'Thread metadata' processors to better report their progress while processing large datasets (59a1546)
- Updated the 4CAT back-end to log a message when a dataset cannot be deleted due to file permission issues (638413a)
- Updated the Instagram import data source to no longer consider a lack of geo-tags 'missing data' (3c62f37)
- Updated the Instagram import data source with a new column 'likes_hidden' that indicates whether the number of likes is hidden by the post author; the 'num_likes' column will be empty in that case (79cb297)
- Updated the 'Image wall' processor to use the 'fit height' sizing option by default, instead of 'square' (a43c9aa)
- Updated the dataset status message after importing data from Zeeschuimer to provide clearer information about data fields missing from the imported file (711c8b4)
- Updated the front-end to hide some processors from the list for a dataset if they are technically compatible but do not make sense to run in the given context (#472)
- Updated datasets to keep track of when they finish being created; existing datasets take their 'finished at' date from the dataset log's last update (#462)
- Updated the default 4CAT configuration to enable the new data sources (4376b33)
- Updated Twitter-related processors and data sources to reflect the platform's name change to X (1871019)
- Updated the Bluesky widget on the 'About' page to show smaller link previews (da8328e)
- Updated the interface footer to only show the 4CAT version when logged in (0792ef4)
- Updated the look of the CSV preview of datasets to be more readable and to indicate missing data (cb2ef69, 8da18b3, dd2ab72)
- Updated the dataset overview page to show empty/unfinished datasets by default (8261b25)
- Updated the list of available processors when creating a follow-up dataset to always show the processor description (68db315)
Docker-related changes
- Updated the Docker version of 4CAT to use Python 3.11 (2600e55)
Removals and deprecations
- Removed the 'FAQ' page from the web interface (d5c873a)
- Removed the 'Convert to Excel-compatible CSV file' processor - use Excel's CSV import wizard instead (6367500)
Bug fixes
- Fix a crash when importing NDJSON files with invalid entries; 4CAT will now skip the item and warn about it instead (6aa7177)
- Fix an issue where data sources that could be imported via Zeeschuimer would show as available even when disabled (e09e875)
- Fix an issue with the Telegram image downloader processor that would stall when hitting a rate limit (5d5a0e3)
- Fix an issue with the Telegram image downloader where it would crash on a 'bad request' response (3df74c9)
- Fix an issue with the Telegram image downloader where it could end up in an infinite loop when encountering a deleted image (ac543cc)
- Fix an issue where image downloader processors would download all available images when allowed to do so, even if the user asked for fewer (99e8fd0, 0638ec2)
- Fix an issue with the TikTok image downloader processor where it would crash when encountering unexpected errors (b60e8cf)
- Fix an issue where 4CAT log messages would be logged twice in some cases (ded8d3d)
- Fix an issue with the video scene detection processor where it would crash when a video in the parent dataset had not been downloaded (9453b76)
- Fix an issue with the video frame extraction processor where it would crash when no frames could be extracted (176905a)
- Fix an issue in the TikTok comments data source where comments without information on whether they had been pinned would be skipped (bfe3075)
- Fix TikTok data import to properly map the post author thumbnail URL (8e660a4)
- Fix an issue with the word trees processor where it would crash when trying to make word trees of numeric data (5021e85)
- Fix an issue with the 'group by sentence' setting of the tokeniser where it would crash when choosing certain languages (3f06845)
- Fix an issue in the video downloading processor where it would crash when the connection broke before the downloading finished (1765e80)
- Fix an issue when trying to export unfinished or incomplete datasets (a296ff0)
- Fix an issue when trying to work with tokenised datasets from older 4CAT versions or with duplicate source data (4906887)
- Fix an issue where importing the same dataset into 4CAT twice would lead to strange side-effects (ffd5c46)
- Fix an issue where the front-end interface would crash when trying to display datasets made with processors that were removed from 4CAT (817b4ee)
- Fix the BLIP2 image captioning processor to better handle images with no metadata (e.g. when uploaded via the 'Media upload' data source) (d69a0c3)
- Fix the image classification processors to skip SVG files instead of crashing on them (033b716, 4912ef4)
- Fix an issue with the image categorisation processor where it would not properly skip empty categories (9465cc2)
- Fix an issue with the Classify using LLMs processor where it would not properly read an uploaded few-shot examples file (14c9fae)
- Fix an issue with the 'Media upload' data source where only the first uploaded file would be validated properly when uploading multiple (75ae4b2)
- Fix an issue with the 'Count values' processor when trying to count numeric data (977d887)
- Fix an issue with the Telegram data source where the name of the source of a forwarded message was not mapped properly (66d60e9)
- Fix the 'Open with Gephi Lite' link in the front-end network preview to conform to Gephi Lite's new URL scheme (855d34e)
- Fix the 'Consolidate URLs' processor to properly skip data that isn't actually a URL (ccaf114)
- Fix an issue in the front-end where tooltips would sometimes be positioned (partially) outside the viewport (a3e4f77)
- Fix an issue with the Gab data source where data would be imported incompletely when collected from a certain set of Gab pages (1716c4b)
- Fix an issue where the 'can manipulate dataset' privilege would not take effect when set on a per-user level (#481)
- Fix an issue where jobs could get stuck in the job queue even if the dataset they belong to had been deleted (#468)