Commit
* Updated migrate.py and docker-entrypoint.sh to be compatible with Docker Compose (input() causes an error). Also updated docker-config.py. NOTE: Docker will overwrite docker-config with config.py if it already exists. This may be desired behavior, but it can cause failures.
* Removed personal info.
* Changed to use config.py-example instead of docker-config.py. Also changed apt update && install per https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run. Noticed that the database variables are hardcoded throughout entrypoint.sh, Dockerfile, config.py, and docker-compose.yml. Unsure about using sed to update the config file. An alternative could be a Docker .env file, but that would need multiple updates and would still require updating config.py. The configparser Python package is possibly a better solution to the config.py file in general, but that would require extensive updates to 4CAT. Could possibly create a separate config file handled by configparser and import that into config.py (a rough sketch of this approach follows this list).
* Updates to the Docker config modifications. Moved Docker variables to docker_config.ini, created docker_setup.py to make better use of configparser and avoid any accidental sed changes, and modified config.py (actually config.py-example) to use docker_config.ini if so specified in docker_config.ini. Also moved setup to the Dockerfile instead of docker-entrypoint.sh so that it does not unnecessarily run every time the Docker containers are started.
* 1. Added paths to docker.ini file; 2. handled paths in docker_setup.py (create directories if needed); 3. updated config.py-example to use Docker paths; 4. trap SIGTERM in docker-entrypoint.sh for the 4cat-daemon backend; 5. updated docker-compose to separate backend and frontend and use a shared volume for data; 6. added Dockerfile_frontend to set up the frontend (could use paring down); 7. rearranged the Dockerfile so it doesn't rebuild Python packages and download/install Chrome every time the config file is updated.
* Docker org updates to allow for rebuilding images/updating Docker files. Shared admin password somewhere noticeable.
* .gitignore changes.
* Updated .gitignore (ignore venv & Jupyter notebooks). Added port 4444 to Docker.
* Added port for Telegram. Made the API port configurable in the config for Docker.
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Added dynamic localhost.
* Dynamic API host.
* Temporary logging.
* Added test status button.
* Update README.md
* Added sessions path to config files; allow it to be shared by Docker containers.
* Removed logging. Changed [email protected] to admin.
* Updates to Docker: user database information is set in a .env file, which is used by docker-compose.yml and passed to the Dockerfiles, docker-entrypoint.sh, and docker_setup.py, which in turn updates the 4CAT config files.
* TCAT to 4CAT. Because apparently I don't even know where I am anymore!
* Dynamic database port for Docker.
* Added arg to Dockerfile.
* Docker fix: change the EXTERNAL port, not the internal one.
* Pseudo fix (#146): comprehensive search and replace for ndjson; used the CheckCache object in the items_to_csv function as well; updated to pseudonymize all values nested in a matching group. Currently only looks for keys containing the word "author".
* Deleted some test validations.
* Clarified notes/function descriptions.
* Docker updates: allow the user to modify the public port (default changed to 80); moved docker_setup to run on startup and modify docker_config if any variables were changed; updated Gunicorn default settings to improve performance and take advantage of CPU cores.
* Docker: allow updating the external API and Telegram ports.
* Notify the user of the 4CAT address and port.
* Reverted README changes.
* Minor cleanup of Docker config values.
* Moved the Docker login.txt file to a persistent Docker volume.
* Easy update of server name.
* Moved the server_name variable to the Docker .env file.
* Removed merge markers.
* Not needed with the Docker changes.
* Fixed link for byline (new datasources failed).
* Created a date filter as a processor and added daterange as a possible processor option.
* Changed the datetime template_filter to the "09 Aug 2021" format. The date range used month-day-year, which gets confusing when day-month-year is the norm elsewhere. Personally I prefer year-month-day for sorting. And because no sane person would use year-day-month!
* Cleaned up code and fixed frontend display of processor options.
* Use the create_standalone function in other filters.
* Fixes #127
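As a rough sketch of the configparser idea mentioned in the list above: docker_setup.py could read docker_config.ini and hand its values to config.py. The file name and the idea of sections for database settings and paths come from the commit notes, but the specific sections, keys, and defaults below are assumptions for illustration, not the actual 4CAT code:

# docker_setup.py-style sketch (hypothetical keys, not the real 4CAT configuration)
import configparser
from pathlib import Path

docker_config = configparser.ConfigParser()
docker_config.read("docker_config.ini")

if docker_config.getboolean("DOCKER", "use_docker_config", fallback=False):
    # database settings that config.py would otherwise hardcode
    db_host = docker_config.get("DATABASE", "host", fallback="localhost")
    db_port = docker_config.getint("DATABASE", "port", fallback=5432)

    # create any data/session paths listed in the .ini if they do not exist yet
    for _name, path in docker_config.items("PATHS"):
        Path(path).mkdir(parents=True, exist_ok=True)

config.py (i.e. config.py-example) could then import such values instead of being rewritten with sed, which is the failure mode the docker_config.ini approach described above is meant to avoid.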
Showing 11 changed files with 212 additions and 131 deletions.
@@ -0,0 +1,146 @@
"""
Filter posts by date
"""
import csv
import dateutil.parser
from datetime import datetime

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

__author__ = "Dale Wahl"
__credits__ = ["Dale Wahl"]
__maintainer__ = "Dale Wahl"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class DateFilter(BasicProcessor):
    """
    Retain only posts between specific dates
    """
    type = "date-filter"  # job type ID
    category = "Filtering"  # category
    title = "Filter by date"  # title displayed in UI
    description = "Copies the dataset, retaining only posts between the given dates. This creates a new, separate " \
                  "dataset you can run analyses on."
    extension = "csv"  # extension of result file, used internally and in UI

    options = {
        "daterange": {
            "type": UserInput.OPTION_DATERANGE,
            "help": "Date range:",
        },
        "parse_error": {
            "type": UserInput.OPTION_CHOICE,
            "help": "Invalid date formats:",
            "options": {
                "return": "Keep invalid dates for new dataset",
                "reject": "Remove invalid dates for new dataset",
            },
            "default": "return"
        },
    }

    @classmethod
    def is_compatible_with(cls, module=None):
        """
        Allow processor on CSV files

        :param module: Dataset or processor to determine compatibility with
        """
        return module.is_top_dataset() and module.get_extension() == "csv"

    def process(self):
        """
        Reads a CSV file, filtering items that match in the required way, and
        creates a new dataset containing the matching values
        """
        # Column to match
        # 'timestamp' should be a required field in all datasources
        date_column_name = 'timestamp'

        # Process inputs from user
        min_date, max_date = self.parameters.get("daterange")
        # Convert to datetime for easy comparison
        min_date = datetime.fromtimestamp(min_date).date()
        max_date = datetime.fromtimestamp(max_date).date()
        # Decide how to handle invalid dates
        if self.parameters.get("parse_error") == 'return':
            keep_errors = True
        elif self.parameters.get("parse_error") == 'reject':
            keep_errors = False
        else:
            raise ValueError("Unexpected value for the 'parse_error' option")

        # Track progress
        processed_items = 0
        invalid_dates = 0
        matching_items = 0

        # Start writer
        with self.dataset.get_results_path().open("w", encoding="utf-8") as outfile:
            writer = None

            # Loop through items
            for item in self.iterate_items(self.source_file):
                if not writer:
                    # First iteration, check if column actually exists
                    if date_column_name not in item.keys():
                        self.dataset.update_status("'%s' column not found in dataset" % date_column_name, is_final=True)
                        self.dataset.finish(0)
                        return

                    # initialise csv writer - we do this explicitly rather than
                    # using self.write_items_and_finish() because else we have
                    # to store a potentially very large amount of items in
                    # memory which is not a good idea
                    writer = csv.DictWriter(outfile, fieldnames=item.keys())
                    writer.writeheader()

                # Update 4CAT and user on status
                processed_items += 1
                if processed_items % 500 == 0:
                    self.dataset.update_status("Processed %i items (%i matching, %i invalid dates)" %
                                               (processed_items, matching_items, invalid_dates))

                # Attempt to parse timestamp
                try:
                    item_date = dateutil.parser.parse(item.get(date_column_name))
                except dateutil.parser.ParserError:
                    invalid_dates += 1
                    if keep_errors:
                        # Keep item despite the invalid date
                        writer.writerow(item)
                    # Either way, the item cannot be compared against the date range
                    continue

                # Only use date for comparison (not time)
                item_date = item_date.date()

                # Reject dates outside the range
                if min_date and item_date < min_date:
                    continue
                if max_date and item_date > max_date:
                    continue

                # Must be a good date!
                writer.writerow(item)
                matching_items += 1

        # Any matches?
        if matching_items == 0:
            self.dataset.update_status("No items matched your criteria", is_final=True)

        self.dataset.finish(matching_items)

    def after_process(self):
        super().after_process()

        # Request standalone
        self.create_standalone()
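For reference, the parse-and-reject logic in process() above relies on python-dateutil's permissive parser. A minimal standalone illustration of that pattern (the sample strings are made up, not taken from any dataset):

import dateutil.parser

for raw in ("2021-08-09", "09 Aug 2021 13:37", "not a date"):
    try:
        parsed = dateutil.parser.parse(raw)
        # the filter compares only the date part, not the time
        print(raw, "->", parsed.date())
    except dateutil.parser.ParserError:
        # unparseable values count as invalid dates and are kept or dropped
        # depending on the "parse_error" option
        print(raw, "-> invalid date")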
@@ -2,21 +2,19 @@
Filter by unique posts
"""
import hashlib
import os
import csv

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

import config

__author__ = "Sal Hagen"
__credits__ = ["Sal Hagen"]
__maintainer__ = "Sal Hagen"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class UniqueFilter(BasicProcessor):
    """
    Retain only posts matching a given lexicon
@@ -31,11 +29,11 @@ class UniqueFilter(BasicProcessor):
    # interface.
    options = {
        "case_sensitive": {
-            "type": UserInput.OPTION_TOGGLE,
-            "help": "Case sentitive",
-            "default": False,
-            "tooltip": "Check to consider posts with different capitals as different."
-        }
+            "type": UserInput.OPTION_TOGGLE,
+            "help": "Case sensitive",
+            "default": False,
+            "tooltip": "Check to consider posts with different capitals as different."
+        }
    }

    def process(self):
@@ -91,28 +89,5 @@ def process(self):
    def after_process(self):
        super().after_process()

-        # copy this dataset - the filtered version - and make that copy standalone
-        # this has the benefit of allowing for all analyses that can be run on
-        # full datasets on the new, filtered copy as well
-        top_parent = self.source_dataset
-
-        standalone = self.dataset.copy(shallow=False)
-        standalone.body_match = "(Filtered) " + top_parent.query
-        standalone.datasource = top_parent.parameters.get("datasource", "custom")
-
-        try:
-            standalone.board = top_parent.board
-        except KeyError:
-            standalone.board = self.type
-
-        standalone.type = "search"
-
-        standalone.detach()
-        standalone.delete_parameter("key_parent")
-
-        self.dataset.copied_to = standalone.key
-
-        # we don't need this file anymore - it has been copied to the new
-        # standalone dataset, and this one is not accessible via the interface
-        # except as a link to the copied standalone dataset
-        os.unlink(self.dataset.get_results_path())
+        # Request standalone
+        self.create_standalone()
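The block deleted above appears to be what the shared create_standalone() helper now encapsulates for all filter processors. The following is a hedged reconstruction based solely on the removed lines; the actual helper in 4CAT's BasicProcessor may be organised differently:

# sketch of a create_standalone() method on BasicProcessor (assumed location),
# reconstructed from the removed lines above; requires "import os" at module level
def create_standalone(self):
    # copy this dataset - the filtered version - and make that copy standalone,
    # so every analysis available for full datasets can also run on the filtered copy
    top_parent = self.source_dataset

    standalone = self.dataset.copy(shallow=False)
    standalone.body_match = "(Filtered) " + top_parent.query
    standalone.datasource = top_parent.parameters.get("datasource", "custom")

    try:
        standalone.board = top_parent.board
    except KeyError:
        standalone.board = self.type

    standalone.type = "search"
    standalone.detach()
    standalone.delete_parameter("key_parent")

    # record the link so the original dataset points at its standalone copy
    self.dataset.copied_to = standalone.key

    # the result file has been copied to the standalone dataset; the original
    # is only reachable as a link to the copy, so it can be removed
    os.unlink(self.dataset.get_results_path())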