Add daterange filter (#168)
* Updated migrate.py and docker-entrypoint.sh to be compatible with docker
compose (input() causes an error there). Also updated docker-config.py.

NOTE: Docker will overwrite docker-config with config.py if it already
exists. This may be desired behavior; however, it can cause failures.

* Removing personal info.

* Changed to use config.py-example over docker-config.py.
Also changed apt update && install per
"https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run".

Noticed that the database variables are hardcoded throughout
entrypoint.sh, Dockerfile, config.py, and docker-compose.yml.
Unsure if I like using sed to update the config file.

Alternatives could be using a Docker .env file; that would need multiple
updates and would still require updating config.py. The configparser
Python package is possibly a better solution to the config.py file in
general, but that would require extensive updates to 4CAT. COULD
possibly create a separate config file handled by configparser and
import that into config.py.
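A minimal sketch of that configparser idea, as it might look inside config.py. The docker_config.ini file name matches the later commits, but the section and key names below are assumptions for illustration, not 4CAT's actual layout:

```
# Hypothetical sketch: config.py falls back to values from docker_config.ini
# when that file marks the install as running inside Docker. Section and key
# names are illustrative assumptions.
import configparser
from pathlib import Path

DB_HOST = "localhost"
DB_PORT = 5432

ini_path = Path(__file__).parent / "docker_config.ini"
if ini_path.exists():
    docker_config = configparser.ConfigParser()
    docker_config.read(ini_path)
    if docker_config.getboolean("DOCKER", "use_docker_config", fallback=False):
        DB_HOST = docker_config.get("DATABASE", "db_host", fallback=DB_HOST)
        DB_PORT = docker_config.getint("DATABASE", "db_port", fallback=DB_PORT)
```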

* Updates to the Docker config handling. Moved the Docker variables to
docker_config.ini, created docker_setup.py to better utilize
configparser and avoid any accidental "sed" changes, and modified
config.py (actually config.py-example) to use docker_config.ini if so
specified in docker_config.ini.

Also moved setup to Dockerfile instead of docker-entrypoint.sh so that
it does not unnecessarily run every time docker containers are started.

* 1. added paths to docker.ini file
2. handled paths in docker_setup.py (create directories if needed; see
the sketch after this list)
3. update config.py-example to use docker paths
4. trap SIGTERM in docker-entrypoint.sh for 4cat-daemon backend
5. update docker-compose to separate backend and frontend and use
shared volume for data
6. add Dockerfile_frontend to set up frontend (could use paring down)
7. rearranged Dockerfile so it doesn't rebuild python packages and
download/install chrome every time I update the config file
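A minimal sketch of the directory handling in item 2, assuming docker_config.ini lists the shared paths under a [PATHS] section (the section and key names are assumptions):

```
# Hypothetical sketch: create the data directories listed in docker_config.ini
# if they do not exist yet. Section/key names are assumptions for illustration.
import configparser
from pathlib import Path

config = configparser.ConfigParser()
config.read("docker_config.ini")

for key in ("path_data", "path_logs", "path_sessions"):
    path = Path(config.get("PATHS", key, fallback=key.replace("path_", "")))
    path.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists
```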

* Docker org updates to allow for rebuilding images/updating Docker files.
Shared the admin password someplace noticeable.

* gitignore changes

* update .gitignore (ignore venv & jupyter notebooks).
Add port 4444 to Docker.

* add port for telegram.
add a modifiable API port to the config for Docker.

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Adding dynamic localhost

* Dynamic API host

* temp logging

* add test status button

* Update README.md

* Add sessions path to config files; allow it to be shared by docker 
containers

* Removed logging. Changed [email protected] to admin.

* Updates to Docker: user database information is set in the .env file,
which is then used by docker-compose.yml and passed to the Dockerfiles,
docker-entrypoint.sh, and docker_setup.py, which updates the 4CAT config
files.
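Roughly, that flow could look like this inside docker_setup.py: docker-compose exports the .env values as environment variables, and the setup script copies them into docker_config.ini for config.py to pick up. The variable and section names below are assumptions, not the exact ones 4CAT uses:

```
# Hypothetical sketch of the .env -> docker_config.ini flow described above.
import configparser
import os

config = configparser.ConfigParser()
config.read("docker_config.ini")
if "DATABASE" not in config:
    config["DATABASE"] = {}

config["DATABASE"]["db_host"] = os.environ.get("POSTGRES_HOST", "db")
config["DATABASE"]["db_name"] = os.environ.get("POSTGRES_DB", "fourcat")
config["DATABASE"]["db_user"] = os.environ.get("POSTGRES_USER", "fourcat")
config["DATABASE"]["db_password"] = os.environ.get("POSTGRES_PASSWORD", "")

with open("docker_config.ini", "w") as configfile:
    config.write(configfile)
```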

* TCAT to 4CAT. Cause apparently I don't even know where I am anymore!

* dynamic database port to docker

* added arg to dockerfile

* docker fix: change EXTERNAL port not internal

* Pseudo fix (#146)

* comprehensive search and replace for ndjson

* used the CheckCache object in the items_to_csv function as well.

* updated to pseudonymize all values nested in a matching group.
Currently it only looks for keys containing the word "author".
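A rough sketch of that nested pseudonymisation, assuming items are dicts parsed from ndjson and that any key containing "author" marks a group whose nested values should all be hashed; the hashing scheme here is illustrative, not the one 4CAT actually uses:

```
# Hypothetical sketch: hash every leaf value nested under a key that contains
# "author". Not 4CAT's actual implementation.
import hashlib

def hash_value(value, salt=""):
    return hashlib.blake2b((salt + str(value)).encode("utf-8")).hexdigest()

def pseudonymise(item, salt="", needle="author", force=False):
    if isinstance(item, dict):
        return {
            key: pseudonymise(value, salt, needle, force or needle in key.lower())
            for key, value in item.items()
        }
    if isinstance(item, list):
        return [pseudonymise(value, salt, needle, force) for value in item]
    return hash_value(item, salt) if force else item
```

For example, pseudonymise({"author": {"name": "x", "id": 123}}) hashes both nested values while leaving non-author fields untouched.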

* deleting some test validations

* clarified notes/function description

* Docker updates:
- Allow user to modify public port (default changed to 80)
- Moved docker_setup to run on startup and modify docker_config if any 
changes to variables were made
- Updated Gunicorn default settings to improve performance and take 
advantage of CPU cores
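The CPU-aware Gunicorn settings mentioned above could look roughly like this in a gunicorn.conf.py; the file name and exact values are assumptions, not the settings shipped in this commit:

```
# Hypothetical gunicorn.conf.py sketch: size the worker pool from the CPU
# count. Values are illustrative.
import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
threads = 2
timeout = 120
```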

* Docker: allow update external API and Telegram ports

* Notify user of 4CAT address and port

* revert readme changes

* Minor cleanup of docker config values

* move docker login.txt file to persistent docker volume

* easy update server name

* move server_name variable to docker .env file

* Remove merge markers

* Not needed with the Docker changes

* Fixed link for byline (new datasources failed)

* Created date filter as processor and 
added daterange as possible processor option

* Changed datetime template_filter to "09 Aug 2021" format.
Date range uses month-day-year, which gets confusing when day-month-year
is the norm elsewhere. Personally I prefer year-month-day for sorting. And
because no sane person would use year-day-month!
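That "09 Aug 2021" format corresponds to strftime's "%d %b %Y" pattern; a minimal sketch of such a template filter (the function name and registration line are assumptions):

```
# Hypothetical sketch of a datetime template filter producing "09 Aug 2021".
from datetime import datetime

def datetime_filter(timestamp, fmt="%d %b %Y"):
    # Render a UNIX timestamp as e.g. "09 Aug 2021"
    return datetime.fromtimestamp(int(timestamp)).strftime(fmt)

# e.g. app.jinja_env.filters["datetime"] = datetime_filter  (registration assumed)
```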

* clean up code and fix frontend display of processor options

* Use create_standalone function in other filters

* Fixes #127
dale-wahl authored Sep 2, 2021
1 parent 08e54b0 commit 2919893
Showing 11 changed files with 212 additions and 131 deletions.
2 changes: 1 addition & 1 deletion .env
@@ -8,5 +8,5 @@ SERVER_NAME=localhost
PUBLIC_PORT=80
PUBLIC_API_PORT=4444

# Telegram aparently needs its own port
# Telegram apparently needs its own port
TELEGRAM_PORT=443
10 changes: 5 additions & 5 deletions README.md
@@ -7,10 +7,10 @@

<p align="center"><img alt="A screenshot of 4CAT, displaying its 'Create Dataset' interface" src="common/assets/screenshot1.png"><img alt="A screenshot of 4CAT, displaying a network visualisation of a dataset" src="common/assets/screenshot2.png"></p>

4CAT is a research tool that can be used to analyse and process data from
online social platforms. Its goal is to make the capture and analysis of data
from these platforms accessible to people through a web interface, without
requiring any programming or web scraping skills. Our target audience is
4CAT is a research tool that can be used to analyse and process data from
online social platforms. Its goal is to make the capture and analysis of data
from these platforms accessible to people through a web interface, without
requiring any programming or web scraping skills. Our target audience is
researchers, students and journalists interested using Digital Methods in their
work.

@@ -53,7 +53,7 @@ You can install 4CAT locally or on a server via Docker or manually. The usual
docker-compose up
```

will work, but detailed and alternative installation
will work, but detailed and alternative installation
instructions are available [in our
wiki](https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT).
Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.
28 changes: 28 additions & 0 deletions backend/abstract/processor.py
@@ -8,6 +8,7 @@
import json
import abc
import csv
import os

from pathlib import Path, PurePath

@@ -512,6 +513,33 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI

self.dataset.finish(num_items)

def create_standalone(self):
# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())

@classmethod
def is_filter(cls):
"""
28 changes: 2 additions & 26 deletions processors/filtering/column_filter.py
@@ -1,7 +1,6 @@
"""
Filter posts by a given column
"""
import os
import re
import csv
import datetime
@@ -200,28 +199,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()
146 changes: 146 additions & 0 deletions processors/filtering/date_filter.py
@@ -0,0 +1,146 @@
"""
Filter posts by dates
"""
import csv
import dateutil.parser
from datetime import datetime

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

__author__ = "Dale Wahl"
__credits__ = ["Dale Wahl"]
__maintainer__ = "Dale Wahl"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class DateFilter(BasicProcessor):
"""
Retain only posts between specific dates
"""
type = "date-filter" # job type ID
category = "Filtering" # category
title = "Filter by date" # title displayed in UI
description = "Copies the dataset, retaining only posts between the given dates. This creates a new, separate \
dataset you can run analyses on."
extension = "csv" # extension of result file, used internally and in UI

options = {
"daterange": {
"type": UserInput.OPTION_DATERANGE,
"help": "Date range:",
},
"parse_error": {
"type": UserInput.OPTION_CHOICE,
"help": "Invalid date formats:",
"options": {
"return": "Keep invalid dates for new dataset",
"reject": "Remove invalid dates for new dataset",
},
"default": "return"
},
}

@classmethod
def is_compatible_with(cls, module=None):
"""
Allow processor on CSV files
:param module: Dataset or processor to determine compatibility with
"""
return module.is_top_dataset() and module.get_extension() == "csv"

def process(self):
"""
Reads a CSV file, filtering items that match in the required way, and
creates a new dataset containing the matching values
"""
# Column to match
# 'timestamp' should be a required field in all datasources
date_column_name = 'timestamp'

# Process inputs from user
min_date, max_date = self.parameters.get("daterange")
# Convert to datetime for easy comparison
min_date = datetime.fromtimestamp(min_date).date()
max_date = datetime.fromtimestamp(max_date).date()
# Decide how to handle invalid dates
if self.parameters.get("parse_error") == 'return':
keep_errors = True
elif self.parameters.get("parse_error") == 'reject':
keep_errors = False
else:
raise "Error with parse_error types"

# Track progress
processed_items = 0
invalid_dates = 0
matching_items = 0

# Start writer
with self.dataset.get_results_path().open("w", encoding="utf-8") as outfile:
writer = None

# Loop through items
for item in self.iterate_items(self.source_file):
if not writer:
# First iteration, check if column actually exists
if date_column_name not in item.keys():
self.dataset.update_status("'%s' column not found in dataset" % date_column_name, is_final=True)
self.dataset.finish(0)
return

# initialise csv writer - we do this explicitly rather than
# using self.write_items_and_finish() because else we have
# to store a potentially very large amount of items in
# memory which is not a good idea
writer = csv.DictWriter(outfile, fieldnames=item.keys())
writer.writeheader()

# Update 4CAT and user on status
processed_items += 1
if processed_items % 500 == 0:
self.dataset.update_status("Processed %i items (%i matching, %i invalid dates)" % (processed_items,
matching_items,
invalid_dates))

# Attempt to parse timestamp
try:
item_date = dateutil.parser.parse(item.get(date_column_name))
except dateutil.parser.ParserError:
if keep_errors:
# Keep item
invalid_dates += 1
writer.writerow(item)
continue
else:
# Reject item
invalid_dates += 1
continue

# Only use date for comparison (not time)
item_date = item_date.date()

# Reject dates
if min_date and item_date < min_date:
continue
if max_date and item_date > max_date:
continue

# Must be a good date!
writer.writerow(item)
matching_items += 1

# Any matches?
if matching_items == 0:
self.dataset.update_status("No items matched your criteria", is_final=True)

self.dataset.finish(matching_items)

def after_process(self):
super().after_process()

# Request standalone
self.create_standalone()
33 changes: 4 additions & 29 deletions processors/filtering/lexical_filter.py
@@ -1,10 +1,7 @@
"""
Filter posts by lexicon
"""
import pickle
import re
import os

import csv
from pathlib import Path

@@ -20,6 +17,7 @@

csv.field_size_limit(1024 * 1024 * 1024)


class LexicalFilter(BasicProcessor):
"""
Retain only posts matching a given lexicon
@@ -57,7 +55,7 @@ class LexicalFilter(BasicProcessor):

def process(self):
"""
Reads a CSV file, counts occurences of chosen values over all posts,
Reads a CSV file, counts occurrences of chosen values over all posts,
and aggregates the results per chosen time frame
"""

@@ -155,28 +153,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()
41 changes: 8 additions & 33 deletions processors/filtering/unique_filter.py
@@ -2,21 +2,19 @@
Filter by unique posts
"""
import hashlib
import os
import csv

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

import config

__author__ = "Sal Hagen"
__credits__ = ["Sal Hagen"]
__maintainer__ = "Sal Hagen"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class UniqueFilter(BasicProcessor):
"""
Retain only posts matching a given lexicon
@@ -31,11 +29,11 @@ class UniqueFilter(BasicProcessor):
# interface.
options = {
"case_sensitive": {
"type": UserInput.OPTION_TOGGLE,
"help": "Case sentitive",
"default": False,
"tooltip": "Check to consider posts with different capitals as different."
}
"type": UserInput.OPTION_TOGGLE,
"help": "Case sensitive",
"default": False,
"tooltip": "Check to consider posts with different capitals as different."
}
}

def process(self):
@@ -91,28 +89,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()