Add daterange filter (#168)
* Updated migrate.py and docker-entrypoint.sh to be compatible with docker
compose (input() causes an error there). Also updated docker-config.py.

NOTE: Docker will overwrite docker-config with config.py if it already
exists. This may be desired behavior; however, it can cause failures.

* Removing personal info.

* Changed to use config.py-example over docker-config.py.
Also changed apt update && install per
"https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run".

Noticed that the database variables are hardcoded throughout
entrypoint.sh, Dockerfile, config.py, and docker-compose.yml.
Unsure if I like using sed to update the config file.

Alternatives could be using a Docker .env file; that would need multiple
updates and would still require updating config.py. The configparser
Python package is possibly a better solution to the config.py file in
general, but that would require extensive updates to 4CAT. COULD
possibly create a separate config file handled by configparser and
import that into config.py.
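A minimal sketch of that configparser idea, as it might look inside config.py. The docker_config.ini file name matches the later commits, but the section and key names below are assumptions for illustration, not 4CAT's actual layout:

```
# Hypothetical sketch: config.py falls back to values from docker_config.ini
# when that file marks the install as running inside Docker. Section and key
# names are illustrative assumptions.
import configparser
from pathlib import Path

DB_HOST = "localhost"
DB_PORT = 5432

ini_path = Path(__file__).parent / "docker_config.ini"
if ini_path.exists():
    docker_config = configparser.ConfigParser()
    docker_config.read(ini_path)
    if docker_config.getboolean("DOCKER", "use_docker_config", fallback=False):
        DB_HOST = docker_config.get("DATABASE", "db_host", fallback=DB_HOST)
        DB_PORT = docker_config.getint("DATABASE", "db_port", fallback=DB_PORT)
```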

* Updates to the Docker config handling. Moved the Docker variables to
docker_config.ini, created docker_setup.py to better utilize
configparser and avoid any accidental "sed" changes, and modified
config.py (actually config.py-example) to use docker_config.ini if so
specified in docker_config.ini.

Also moved setup to Dockerfile instead of docker-entrypoint.sh so that
it does not unnecessarily run every time docker containers are started.

* 1. added paths to docker.ini file
2. handled paths in docker_setup.py (create directories if needed; see
the sketch after this list)
3. update config.py-example to use docker paths
4. trap SIGTERM in docker-entrypoint.sh for 4cat-daemon backend
5. update docker-compose to separate backend and frontend and use
shared volume for data
6. add Dockerfile_frontend to set up frontend (could use paring down)
7. rearranged Dockerfile so it doesn't rebuild python packages and
download/install chrome every time I update the config file
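A minimal sketch of the directory handling in item 2, assuming docker_config.ini lists the shared paths under a [PATHS] section (the section and key names are assumptions):

```
# Hypothetical sketch: create the data directories listed in docker_config.ini
# if they do not exist yet. Section/key names are assumptions for illustration.
import configparser
from pathlib import Path

config = configparser.ConfigParser()
config.read("docker_config.ini")

for key in ("path_data", "path_logs", "path_sessions"):
    path = Path(config.get("PATHS", key, fallback=key.replace("path_", "")))
    path.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists
```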

* Docker org updates to allow for rebuilding images/updating Docker files.
Shared the admin password someplace noticeable.

* gitignore changes

* update .gitignore (ignore venv & jupyter notebooks).
Add port 4444 to Docker.

* add port for telegram.
add a modifiable API port to the config for Docker.

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Adding dynamic localhost

* Dynamic API host

* temp logging

* add test status button

* Update README.md

* Add sessions path to config files; allow it to be shared by docker 
containers

* Removed logging. Changed [email protected] to admin.

* Updates to Docker: user database information is set in the .env file,
which is then used by docker-compose.yml and passed to the Dockerfiles,
docker-entrypoint.sh, and docker_setup.py, which updates the 4CAT config
files.
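Roughly, that flow could look like this inside docker_setup.py: docker-compose exports the .env values as environment variables, and the setup script copies them into docker_config.ini for config.py to pick up. The variable and section names below are assumptions, not the exact ones 4CAT uses:

```
# Hypothetical sketch of the .env -> docker_config.ini flow described above.
import configparser
import os

config = configparser.ConfigParser()
config.read("docker_config.ini")
if "DATABASE" not in config:
    config["DATABASE"] = {}

config["DATABASE"]["db_host"] = os.environ.get("POSTGRES_HOST", "db")
config["DATABASE"]["db_name"] = os.environ.get("POSTGRES_DB", "fourcat")
config["DATABASE"]["db_user"] = os.environ.get("POSTGRES_USER", "fourcat")
config["DATABASE"]["db_password"] = os.environ.get("POSTGRES_PASSWORD", "")

with open("docker_config.ini", "w") as configfile:
    config.write(configfile)
```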

* TCAT to 4CAT. Cause apparently I don't even know where I am anymore!

* dynamic database port to docker

* added arg to dockerfile

* docker fix: change EXTERNAL port not internal

* Pseudo fix (#146)

* comprehensive search and replace for ndjson

* used the CheckCache object in the items_to_csv function as well.

* updated to pseudonymize all values nested in a matching group.
Currently it only looks for keys containing the word "author".
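A rough sketch of that nested pseudonymisation, assuming items are dicts parsed from ndjson and that any key containing "author" marks a group whose nested values should all be hashed; the hashing scheme here is illustrative, not the one 4CAT actually uses:

```
# Hypothetical sketch: hash every leaf value nested under a key that contains
# "author". Not 4CAT's actual implementation.
import hashlib

def hash_value(value, salt=""):
    return hashlib.blake2b((salt + str(value)).encode("utf-8")).hexdigest()

def pseudonymise(item, salt="", needle="author", force=False):
    if isinstance(item, dict):
        return {
            key: pseudonymise(value, salt, needle, force or needle in key.lower())
            for key, value in item.items()
        }
    if isinstance(item, list):
        return [pseudonymise(value, salt, needle, force) for value in item]
    return hash_value(item, salt) if force else item
```

For example, pseudonymise({"author": {"name": "x", "id": 123}}) hashes both nested values while leaving non-author fields untouched.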

* deleting some test validations

* clarified notes/function description

* Docker updates:
- Allow user to modify public port (default changed to 80)
- Moved docker_setup to run on startup and modify docker_config if any 
changes to variables were made
- Updated Gunicorn default settings to improve performance and take 
advantage of CPU cores
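The CPU-aware Gunicorn settings mentioned above could look roughly like this in a gunicorn.conf.py; the file name and exact values are assumptions, not the settings shipped in this commit:

```
# Hypothetical gunicorn.conf.py sketch: size the worker pool from the CPU
# count. Values are illustrative.
import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
threads = 2
timeout = 120
```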

* Docker: allow update external API and Telegram ports

* Notify user of 4CAT address and port

* revert readme changes

* Minor cleanup of docker config values

* move docker login.txt file to persistent docker volume

* easy update server name

* move server_name variable to docker .env file

* Remove merge markers

* Not needed with the Docker changes

* Fixed link for byline (new datasources failed)

* Created date filter as processor and 
added daterange as possible processor option

* Changed datetime template_filter to "09 Aug 2021" format.
Date range uses month-day-year, which gets confusing when day-month-year
is the norm elsewhere. Personally I prefer year-month-day for sorting. And
because no sane person would use year-day-month!
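That "09 Aug 2021" format corresponds to strftime's "%d %b %Y" pattern; a minimal sketch of such a template filter (the function name and registration line are assumptions):

```
# Hypothetical sketch of a datetime template filter producing "09 Aug 2021".
from datetime import datetime

def datetime_filter(timestamp, fmt="%d %b %Y"):
    # Render a UNIX timestamp as e.g. "09 Aug 2021"
    return datetime.fromtimestamp(int(timestamp)).strftime(fmt)

# e.g. app.jinja_env.filters["datetime"] = datetime_filter  (registration assumed)
```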

* clean up code and fix frontend display of processor options

* Use create_standalone function in other filters

* Fixes #127
dale-wahl authored Sep 2, 2021
1 parent 08e54b0 commit 2919893
Showing 11 changed files with 212 additions and 131 deletions.
2 changes: 1 addition & 1 deletion .env
@@ -8,5 +8,5 @@ SERVER_NAME=localhost
PUBLIC_PORT=80
PUBLIC_API_PORT=4444

# Telegram aparently needs its own port
# Telegram apparently needs its own port
TELEGRAM_PORT=443
10 changes: 5 additions & 5 deletions README.md
@@ -7,10 +7,10 @@

<p align="center"><img alt="A screenshot of 4CAT, displaying its 'Create Dataset' interface" src="common/assets/screenshot1.png"><img alt="A screenshot of 4CAT, displaying a network visualisation of a dataset" src="common/assets/screenshot2.png"></p>

4CAT is a research tool that can be used to analyse and process data from
online social platforms. Its goal is to make the capture and analysis of data
from these platforms accessible to people through a web interface, without
requiring any programming or web scraping skills. Our target audience is
4CAT is a research tool that can be used to analyse and process data from
online social platforms. Its goal is to make the capture and analysis of data
from these platforms accessible to people through a web interface, without
requiring any programming or web scraping skills. Our target audience is
researchers, students and journalists interested using Digital Methods in their
work.

@@ -53,7 +53,7 @@ You can install 4CAT locally or on a server via Docker or manually. The usual
docker-compose up
```

will work, but detailed and alternative installation
will work, but detailed and alternative installation
instructions are available [in our
wiki](https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT).
Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.
28 changes: 28 additions & 0 deletions backend/abstract/processor.py
@@ -8,6 +8,7 @@
import json
import abc
import csv
import os

from pathlib import Path, PurePath

@@ -512,6 +513,33 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI

self.dataset.finish(num_items)

def create_standalone(self):
# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())

@classmethod
def is_filter(cls):
"""
28 changes: 2 additions & 26 deletions processors/filtering/column_filter.py
@@ -1,7 +1,6 @@
"""
Filter posts by a given column
"""
import os
import re
import csv
import datetime
@@ -200,28 +199,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()
146 changes: 146 additions & 0 deletions processors/filtering/date_filter.py
@@ -0,0 +1,146 @@
"""
Filter posts by dates
"""
import csv
import dateutil.parser
from datetime import datetime

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

__author__ = "Dale Wahl"
__credits__ = ["Dale Wahl"]
__maintainer__ = "Dale Wahl"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class DateFilter(BasicProcessor):
"""
Retain only posts between specific dates
"""
type = "date-filter" # job type ID
category = "Filtering" # category
title = "Filter by date" # title displayed in UI
description = "Copies the dataset, retaining only posts between the given dates. This creates a new, separate \
dataset you can run analyses on."
extension = "csv" # extension of result file, used internally and in UI

options = {
"daterange": {
"type": UserInput.OPTION_DATERANGE,
"help": "Date range:",
},
"parse_error": {
"type": UserInput.OPTION_CHOICE,
"help": "Invalid date formats:",
"options": {
"return": "Keep invalid dates for new dataset",
"reject": "Remove invalid dates for new dataset",
},
"default": "return"
},
}

@classmethod
def is_compatible_with(cls, module=None):
"""
Allow processor on CSV files
:param module: Dataset or processor to determine compatibility with
"""
return module.is_top_dataset() and module.get_extension() == "csv"

def process(self):
"""
Reads a CSV file, filtering items that match in the required way, and
creates a new dataset containing the matching values
"""
# Column to match
# 'timestamp' should be a required field in all datasources
date_column_name = 'timestamp'

# Process inputs from user
min_date, max_date = self.parameters.get("daterange")
# Convert to datetime for easy comparison
min_date = datetime.fromtimestamp(min_date).date()
max_date = datetime.fromtimestamp(max_date).date()
# Decide how to handle invalid dates
if self.parameters.get("parse_error") == 'return':
keep_errors = True
elif self.parameters.get("parse_error") == 'reject':
keep_errors = False
else:
raise "Error with parse_error types"

# Track progress
processed_items = 0
invalid_dates = 0
matching_items = 0

# Start writer
with self.dataset.get_results_path().open("w", encoding="utf-8") as outfile:
writer = None

# Loop through items
for item in self.iterate_items(self.source_file):
if not writer:
# First iteration, check if column actually exists
if date_column_name not in item.keys():
self.dataset.update_status("'%s' column not found in dataset" % date_column_name, is_final=True)
self.dataset.finish(0)
return

# initialise csv writer - we do this explicitly rather than
# using self.write_items_and_finish() because else we have
# to store a potentially very large amount of items in
# memory which is not a good idea
writer = csv.DictWriter(outfile, fieldnames=item.keys())
writer.writeheader()

# Update 4CAT and user on status
processed_items += 1
if processed_items % 500 == 0:
self.dataset.update_status("Processed %i items (%i matching, %i invalid dates)" % (processed_items,
matching_items,
invalid_dates))

# Attempt to parse timestamp
try:
item_date = dateutil.parser.parse(item.get(date_column_name))
except dateutil.parser.ParserError:
if keep_errors:
# Keep item
invalid_dates += 1
writer.writerow(item)
continue
else:
# Reject item
invalid_dates += 1
continue

# Only use date for comparison (not time)
item_date = item_date.date()

# Reject dates
if min_date and item_date < min_date:
continue
if max_date and item_date > max_date:
continue

# Must be a good date!
writer.writerow(item)
matching_items += 1

# Any matches?
if matching_items == 0:
self.dataset.update_status("No items matched your criteria", is_final=True)

self.dataset.finish(matching_items)

def after_process(self):
super().after_process()

# Request standalone
self.create_standalone()
33 changes: 4 additions & 29 deletions processors/filtering/lexical_filter.py
@@ -1,10 +1,7 @@
"""
Filter posts by lexicon
"""
import pickle
import re
import os

import csv
from pathlib import Path

@@ -20,6 +17,7 @@

csv.field_size_limit(1024 * 1024 * 1024)


class LexicalFilter(BasicProcessor):
"""
Retain only posts matching a given lexicon
@@ -57,7 +55,7 @@ class LexicalFilter(BasicProcessor):

def process(self):
"""
Reads a CSV file, counts occurences of chosen values over all posts,
Reads a CSV file, counts occurrences of chosen values over all posts,
and aggregates the results per chosen time frame
"""

@@ -155,28 +153,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()
41 changes: 8 additions & 33 deletions processors/filtering/unique_filter.py
@@ -2,21 +2,19 @@
Filter by unique posts
"""
import hashlib
import os
import csv

from backend.abstract.processor import BasicProcessor
from common.lib.helpers import UserInput

import config

__author__ = "Sal Hagen"
__credits__ = ["Sal Hagen"]
__maintainer__ = "Sal Hagen"
__email__ = "[email protected]"

csv.field_size_limit(1024 * 1024 * 1024)


class UniqueFilter(BasicProcessor):
"""
Retain only posts matching a given lexicon
@@ -31,11 +29,11 @@ class UniqueFilter(BasicProcessor):
# interface.
options = {
"case_sensitive": {
"type": UserInput.OPTION_TOGGLE,
"help": "Case sentitive",
"default": False,
"tooltip": "Check to consider posts with different capitals as different."
}
"type": UserInput.OPTION_TOGGLE,
"help": "Case sensitive",
"default": False,
"tooltip": "Check to consider posts with different capitals as different."
}
}

def process(self):
@@ -91,28 +89,5 @@ def process(self):
def after_process(self):
super().after_process()

# copy this dataset - the filtered version - and make that copy standalone
# this has the benefit of allowing for all analyses that can be run on
# full datasets on the new, filtered copy as well
top_parent = self.source_dataset

standalone = self.dataset.copy(shallow=False)
standalone.body_match = "(Filtered) " + top_parent.query
standalone.datasource = top_parent.parameters.get("datasource", "custom")

try:
standalone.board = top_parent.board
except KeyError:
standalone.board = self.type

standalone.type = "search"

standalone.detach()
standalone.delete_parameter("key_parent")

self.dataset.copied_to = standalone.key

# we don't need this file anymore - it has been copied to the new
# standalone dataset, and this one is not accessible via the interface
# except as a link to the copied standalone dataset
os.unlink(self.dataset.get_results_path())
# Request standalone
self.create_standalone()