
Commit

Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! (#452)

* export processor

* start of importer

* finish off importing ZIP 4CAT datasets

* ensure cleanup on failure

had some weird lost datasets when debugging this

* auto-expire export zips

* nltk again

* Squashed commit of the following:

commit 3f2a62a
Author: Carsten Schnober <[email protected]>
Date:   Wed Sep 18 18:18:29 2024 +0200

    Update Gensim to >=4.3.3, <4.4.0 (#450)

    * Update Gensim to >=4.3.3, <4.4.0

    * update nltk as well

    ---------

    Co-authored-by: Dale Wahl <[email protected]>
    Co-authored-by: Sal Hagen <[email protected]>

commit fee2c8c
Merge: 3d94b66 f8e93ed
Author: sal-phd-desktop <[email protected]>
Date:   Wed Sep 18 18:11:19 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 3d94b66
Author: sal-phd-desktop <[email protected]>
Date:   Wed Sep 18 18:11:04 2024 +0200

    FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page)

commit f8e93ed
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 15:11:21 2024 +0200

    Simple extensions page in Control Panel

commit b5be128
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:08:13 2024 +0200

    Remove 'docs' directory

commit 1e2010a
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:07:38 2024 +0200

    Forgot TikTok and Douyin

commit c757dd5
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:01:31 2024 +0200

    Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions

commit ee7f434
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:00:40 2024 +0200

    RIP Parler data source

commit 11300f2
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 11:21:37 2024 +0200

    Tuplestring

commit 5472652
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 11:15:29 2024 +0200

    Pass user obj instead of str to ConfigWrapper in Processor

commit b21866d
Author: Stijn Peeters <[email protected]>
Date:   Tue Sep 17 17:45:01 2024 +0200

    Ensure request-aware config reader in user object when using config wrapper

commit bbe79e4
Author: Sal Hagen <[email protected]>
Date:   Tue Sep 17 15:12:46 2024 +0200

    Fix extension path walk for Windows

commit d6064be
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 16 14:50:45 2024 +0200

    Allow tags that have no users

    Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy

commit b542ded
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 16 14:13:14 2024 +0200

    Trailing slash in query results list

commit a4bddae
Author: Dale Wahl <[email protected]>
Date:   Mon Sep 16 13:57:23 2024 +0200

    4CAT Extension - easy(ier) adding of new datasources/processors that can be maintained separately from 4CAT base code (#451)

    * domain only

    * fix reference

    * try and collect links with selenium

    * update column_filter to find multiple matches

    * fix up the normal url_scraper datasource

    * ensure all selenium links are strings for join

    * change output of url_scraper to ndjson with map_items

    * missed key/index change

    * update web archive to use json and map to 4CAT

    * fix no text found

    * and none on scraped_links

    * check key first

    * fix up web_archive error reporting

    * handle None type for error

    * record web archive "bad request"

    * add wait after redirect movement

    * increase waittime for redirects

    * add processor for trackers

    * dict to list for addition

    * allow both newline and comma separated links

    * attempt to scrape iframes as separate pages

    * Fixes for selenium scraper to work with config database

    * installation of packages, geckodriver, and firefox if selenium enabled

    * update install instructions

    * fix merge error

    * fix dropped function

    * have to be kidding me

    * add note; setup requires docker... need to think about IF this will ever
    be installed without Docker

    * separate selenium class into wrapper and Search class so wrapper can be used in processors!

    * add screenshots; add firefox extension support

    * update selenium definitions

    * regex for extracting urls from strings

    * screenshots processor; extract urls from text and takes screenshots

    * Allow producing zip files from data sources

    * import time

    * pick better default

    * test screenshot datasource

    * validate all params

    * fix enable extension

    * haha break out of while loop

    * count my items

    * whoops, len() is important here

    * must be getting tired...

    * remove redundant logging

    * Eager loading for screenshots, viewport options, etc

    * Woops, wrong folder

    * Fix label shortening

    * Just 'queue' instead of 'search queue'

    * Yeah, make it headless

    * README -> DESCRIPTION

    * h1 -> h2

    * Actually just have no header

    * Use proper filename for downloaded files

    * Configure whether to offer pseudonymisation etc

    * Tweak descriptions

    * fix log missing data

    * add columns to post_topic_matrix

    * fix breadcrumb bug

    * Add top topics column

    * Fix selenium config install parameter (Docker uses this/manual would
    need to run install_selenium, well, manually)

    * this processor is slow; I thought it was broken long before it updated!

    * refactor detect_trackers as conversion processor not filter

    * add geckodriver executable to docker install

    * Auto-configure webdrivers if available in PATH

    * update screenshots to act as image-downloader and benefit from processors

    * fix is_compatible_with

    * Delete helper-scripts/migrate/migrate-1.30-1.31.py

    * fix embeddings is_compatible_with

    * fix up UI options for hashing and private

    * abstract was moved to lib

    * various fixes to selenium based datasources

    * processors not compatible with image datasets

    * update firefox extension handling

    * screenshots datasource fix get_options

    * rename screenshots processor to be detected as image dataset

    * add monthly and weekly frequencies to wayback machine datasource

    * wayback ds: fix failure when all attempts fail to yield results; add frequency options to options; add daily

    * add scroll down page to allow lazy loading for entire page screenshots

    * screenshots: adjust pause time so it can be used to force a wait for images to load

    I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine

    * hash URLs to create filenames

    * remove log

    * add setting to toggle display advanced options

    * add progress bars

    * web archive fix query validation

    * count subpages in progress

    * remove overwritten function

    * move http response to own column

    * special filenames

    * add timestamps to all screenshots

    * restart selenium on failure

    * new build has selenium

    * process urls after start (keep original query parameters)

    * undo default firefox

    * quick max

    * rename SeleniumScraper to SeleniumSearch

    todo: build SeleniumProcessor!

    * max number screenshots configurable

    * method to get url with error handling

    * use get_with_error_handling

    * d'oh, screenshot processor needs to quit selenium

    * update log to contain URL

    * Update scrolling to use Page down key if necessary

    * improve logs

    * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way.

    Also, could I get categories from the metadata? That's... ugh.

    * no category, no processor

    * str errors

    * screenshots: dismiss alerts when checking ready state is complete

    * set screenshot timeout to 30 seconds

    * update gensim package

    * screenshots: move processor interrupt into attempts loop

    * if alert disappears before we can dismiss it...

    * selenium specific logger

    * do not switch window when no alert found on dismiss

    * extract wait for page to load to selenium class

    * improve descriptions of screenshot options

    * remove unused line

    * treat timeouts differently from other errors

    these are more likely due to an issue with the website in question

    * debug if requested

    * increase pause time

    * restart browser w/ PID

    * increase max_workers for selenium

    this is by individual worker class not for all selenium classes... so you can really crank them out if desired

    * quick fix restart by pid

    * avoid bad urls

    * missing bracket & attempt to fix-missing dependencies in Docker install

    * Allow dynamic form options in processors

    * Allow 'requires' on data source options as well

    * Handle list values with requires

    * basic processor for apple store; setup checks for additional requirements

    * fix is_4cat_class

    * show preview when no map_item

    * add google store datasource

    * Docker setup.py use extensions

    * Wider support for file upload in processors

    * Log file uploads in DMI service manager

    * add map_item methods and record more data per item

    need additional item data as map_item is staticmethod

    * update from master; merge conflicts

    * fix docker build context (ignore data files)

    * fix option requirements

    * apple store fix: list still tries to get query

    * apple & google stores fix up item mapping

    * missed merge error

    * minor fix

    * remove unused import

    * fix datasources w/ files frontend error

    * fix error w/ datasources having file option

    * better way to name docker volumes

    * update two other docker compose files

    * fix docker-compose ymls

    * minor bug: fix and add warning; fix no results fail

    * update apple field names to better match interface

    * update google store fieldnames and order

    * sneak in jinja logger if needed

    * fix fourcat.js handling checkboxes for dynamic settings

    * add new endpoint for app details to apple store

    * apple_store map new beta app data

    * add default lang/country

    * not all apps have advisories

    * revert so button works

    * add chart positions to beta map items

    * basic scheduler

    To-do
    - fix up and add options to scheduler view (e.g. delete/change)
    - add scheduler view to navigator
    - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view)
    - more testing...

    * update scheduler view, add functions to update job interval

    * revert .env

    * working scheduler!

    * basic scheduler view w/ datasets

    * fix postgres tag

    * update job status in scheduled_jobs table

    * fix timestamp; end_date needed for last run check; add dataset label

    * improve scheduler view

    * remove dataset from scheduled_jobs table on delete

    * scheduler view order by last creation

    * scheduler views: separate scheduler list from scheduled dataset list

    * additional update from master fixes

    * apple_store map_items fix missing locales

    * add back depth for pagination

    * correct route

    * modify pagination to accept args

    * pagination fun

    * pagination: I hate testing on live servers...

    * ok ok need the pagination route

    * pagination: add route_args

    * fix up scheduler header

    * improve app store descriptions

    * add azure store

    * fix azure links

    * azure_store: add category search

    * azure fix type of config update timestamp

    OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly

    * basic aws store

    * check if selenium available; get correct app_id

    * aws: implement pagination

    * add logging; wait for elements to load after next page; attempts to rework filter option collection

    * apple_store: handle invalid param error

    * fix filter_options

    * aws: fix filter option collection!

    * more merge

    * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations

    * migrate.py to run extension "fourcat_install.py" files

    * formatting

    * remove extensions; add gitignore

    * excise scheduler merge

    * some additional cleanup from app_studies branch

    * allow nested datasources folders; ignore files in extensions main folder

    * allow extension install scripts to run pip if migrate.py has not

    * Remove unused URL functions we could use ural for

    * Take care of git commit hash tracking for extension processors

    * Get rid of unused path.versionfile config setting

    * Add extensions README

    * Squashed commit of the following:

    commit cd356f7
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:36:18 2024 +0200

        UI setting for 4CAT install ad in login

    commit 0945d8c
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:32:55 2024 +0200

        UI setting for anonymisation controls

        Todo: make per-datasource

    commit 1a2562c
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:27 2024 +0200

        Debug panel for HTTP headers in control panel

    commit 203314e
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:17 2024 +0200

        Preview for HTML datasets

    commit 48c20c2
    Author: Desktop Sal <[email protected]>
    Date:   Wed Sep 11 13:54:23 2024 +0200

        Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

    commit 657ffd7
    Author: Dale Wahl <[email protected]>
    Date:   Fri Sep 6 16:29:19 2024 +0200

        fix nltk where it matters

    commit 2ef5c80
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Sep 3 12:05:14 2024 +0200

        Actually check progress in text annotator

    commit 693960f
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Sep 2 18:03:18 2024 +0200

        Add processor for stormtrooper DMI service

    commit 6ae964a
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Aug 30 17:31:37 2024 +0200

        Fix reference to old stopwords list in neologisms preset

    * Fix Github links for extensions

    * Fix commit detection in extensions

    * Fix extension detection in module loader

    * Follow symlinks when loading extensions

    Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir

    * Make queue message on create page more generic

    * Markdown in datasource option tooltips

    * Remove Spacy model from requirements

    * Add software_source to database SQL

    ---------

    Co-authored-by: Stijn Peeters <[email protected]>
    Co-authored-by: Stijn Peeters <[email protected]>

commit cd356f7
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:36:18 2024 +0200

    UI setting for 4CAT install ad in login

commit 0945d8c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:32:55 2024 +0200

    UI setting for anonymisation controls

    Todo: make per-datasource

commit 1a2562c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:27 2024 +0200

    Debug panel for HTTP headers in control panel

commit 203314e
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:17 2024 +0200

    Preview for HTML datasets

commit 48c20c2
Author: Desktop Sal <[email protected]>
Date:   Wed Sep 11 13:54:23 2024 +0200

    Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

commit 657ffd7
Author: Dale Wahl <[email protected]>
Date:   Fri Sep 6 16:29:19 2024 +0200

    fix nltk where it matters

* merge docker files

* fix merge issues

* more modules passing fixes

* disappearing import

not sure PyCharm's merge is super awesome...

* fix import 4cat datasource with modules changes

---------

Co-authored-by: Stijn Peeters <[email protected]>
dale-wahl and stijn-uva authored Oct 1, 2024
1 parent 8f2193c commit a224dd9
Showing 5 changed files with 500 additions and 136 deletions.
16 changes: 15 additions & 1 deletion common/lib/dataset.py
@@ -15,7 +15,7 @@
 from common.config_manager import config
 from common.lib.job import Job, JobNotFoundException
 from common.lib.module_loader import ModuleCollector
-from common.lib.helpers import get_software_commit, NullAwareTextIOWrapper, convert_to_int
+from common.lib.helpers import get_software_commit, NullAwareTextIOWrapper, convert_to_int, get_software_version
 from common.lib.item_mapping import MappedItem, MissingMappedField, DatasetItem
 from common.lib.fourcat_module import FourcatModule
 from common.lib.exceptions import (ProcessorInterruptedException, DataSetException, DataSetNotFoundException,
@@ -1586,6 +1586,20 @@ def get_media_type(self):
         # Default to text
         return self.parameters.get("media_type", "text")
 
+    def get_metadata(self):
+        """
+        Get dataset metadata
+
+        This consists of all the data stored in the database for this dataset, plus the current 4CAT version (appended
+        as 'current_4CAT_version'). This is useful for exporting datasets, as it can be used by another 4CAT instance to
+        update its database (and ensure compatibility with the exporting version of 4CAT).
+        """
+        metadata = self.db.fetchone("SELECT * FROM datasets WHERE key = %s", (self.key,))
+
+        # get 4CAT version (presumably to ensure export is compatible with import)
+        metadata["current_4CAT_version"] = get_software_version()
+        return metadata
+
     def get_result_url(self):
         """
         Gets the 4CAT frontend URL of a dataset file.
(Diffs for the remaining 4 changed files are not shown here.)
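Below is a minimal usage sketch (not part of this commit's diff) of the new get_metadata() method: one function bundles the metadata into an export ZIP, another checks the recorded version on import. The function names, the "metadata.json" filename, and the ZIP layout are illustrative assumptions; the actual export/import processors in this commit may be structured differently.

    import json
    import zipfile

    from common.lib.helpers import get_software_version

    def write_export_zip(dataset, zip_path):
        # get_metadata() returns the dataset's database row plus 'current_4CAT_version'
        metadata = dataset.get_metadata()
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zip_file:
            zip_file.writestr("metadata.json", json.dumps(metadata, default=str))
            # the dataset's result file(s) would be added alongside the metadata here

    def exported_by_same_version(zip_path):
        # compare the exporting instance's 4CAT version with this instance's version
        with zipfile.ZipFile(zip_path) as zip_file:
            metadata = json.loads(zip_file.read("metadata.json"))
        return metadata.get("current_4CAT_version") == get_software_version()

An actual importer would presumably use the full metadata row to recreate the dataset record rather than only comparing versions.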
