2.7 Timeseries padder: variable vs. constant
The QAQC module needs data from before and after the time window of interest, so the timeseries padder pulls in that extra data. For example, suppose the pipeline is processing only 2020-01-04 00:00:00 through 23:59:59. The padder draws in data from the days bracketing 2020-01-04 (e.g. 2020-01-03 and 2020-01-05), so that QAQC scripts that need 'edge' data outside of 2020-01-04 have it available.
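The bracketing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the padder's actual code; the function name padded_days and its days_each_side parameter are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def padded_days(processing_day: date, days_each_side: int) -> List[date]:
    """Return the processing day plus the days bracketing it, in order.

    Hypothetical helper: the real padder works on files in the repo,
    not on a list of dates.
    """
    return [
        processing_day + timedelta(days=offset)
        for offset in range(-days_each_side, days_each_side + 1)
    ]


days = padded_days(date(2020, 1, 4), 1)
print([d.isoformat() for d in days])
# ['2020-01-03', '2020-01-04', '2020-01-05']
```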
The [SHORT-NAME]_timeseries_padder.yaml file may call a constant or variable timeseries padder python module.
Use a constant timeseries padder when you want to specify a fixed window (WINDOW_SIZE) that pads a set amount of data on each side of the processing day.
Use a variable timeseries padder when you want the module to determine automatically how many days to pad on either side. This option is preferred because it adjusts automatically when changes in threshold parameters or data rate call for a larger or smaller window of data to perform QAQC. The module uses the data rate in the location file for each named location, together with the threshold parameters, to determine the window size. The data rate must therefore be populated in the location files. If it is missing, see the Wiki page 1.3 Populating properties of named locations in Pachyderm for how to populate it.
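To see why the window depends on both the data rate and the thresholds, consider the rough sketch below. It is not the module's actual calculation, which this page does not describe. It assumes, purely for illustration, that a threshold can be expressed as a number of consecutive readings (window_points) and that the data rate is given in Hz; both names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400


def pad_days_needed(window_points: int, data_rate_hz: float) -> int:
    """Whole days of padding needed to cover `window_points` readings.

    Illustrative assumption only: the threshold is a count of
    consecutive readings and the data rate is in Hz.
    """
    if data_rate_hz <= 0:
        # A missing or zero data rate makes the window undefined, which is
        # why the data rate must be populated in the location files.
        raise ValueError("data rate must be positive")
    window_seconds = window_points / data_rate_hz
    return math.ceil(window_seconds / SECONDS_PER_DAY)


# A fast sensor needs little padding; a slow one needs more days
# to cover the same number of readings.
print(pad_days_needed(window_points=3600, data_rate_hz=1.0))
print(pad_days_needed(window_points=200, data_rate_hz=0.001))
```

Under these assumptions, lowering the data rate or raising the number of readings a test needs increases the padding, which is the adjustment a constant WINDOW_SIZE cannot make on its own.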
The constant timeseries padder python module timeseries_padder.timeseries_padder.constant_pad_main takes its arguments from variables set under env: (e.g. OUT_PATH, WINDOW_SIZE, YEAR_INDEX, etc.). The following example shows how env: is set for the constant timeseries padder:
transform:
  image_pull_secrets:
  - battelleecology-quay-read-all-pull-secret
  image: quay.io/battelleecology/timeseries_padder:26
  cmd:
  - "/bin/bash"
  stdin:
  - "#!/bin/bash"
  - python3 -m timeseries_padder.timeseries_padder.constant_pad_main
  env:
    OUT_PATH: /pfs/out
    WINDOW_SIZE: '1'
    LOG_LEVEL: INFO
    RELATIVE_PATH_INDEX: '3'
    YEAR_INDEX: '4'
    MONTH_INDEX: '5'
    DAY_INDEX: '6'
    LOCATION_INDEX: '7'
    DATA_TYPE_INDEX: '8'
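The *_INDEX values above identify which component of an input path holds the year, month, day, location, and data type. The sketch below shows one plausible reading of them. It assumes the indices count path components with the root "/" as index 0, which may not match how the module counts, and the example path and the function name parse_datum_path are hypothetical.

```python
from pathlib import PurePosixPath

# Values as they would arrive from env: (always strings).
env = {
    "YEAR_INDEX": "4",
    "MONTH_INDEX": "5",
    "DAY_INDEX": "6",
    "LOCATION_INDEX": "7",
    "DATA_TYPE_INDEX": "8",
}


def parse_datum_path(path: str, env: dict) -> dict:
    """Pick path components by index (assumes '/' is component 0)."""
    parts = PurePosixPath(path).parts
    return {
        "year": parts[int(env["YEAR_INDEX"])],
        "month": parts[int(env["MONTH_INDEX"])],
        "day": parts[int(env["DAY_INDEX"])],
        "location": parts[int(env["LOCATION_INDEX"])],
        "data_type": parts[int(env["DATA_TYPE_INDEX"])],
    }


# Hypothetical input path, laid out so the indices above line up.
example = "/pfs/INPUT_REPO/sensor/2020/01/04/LOCATION_NAME/data/file.ext"
fields = parse_datum_path(example, env)
print(fields)
```

If the input repo layout changes (for example, an extra directory level is added), every index after that level shifts by one and the env: values must be updated to match.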
The variable timeseries padder python module does not read its path indices from env: in the yaml file. Instead, they are passed as arguments on the python command line and parsed with the argparse python package. This same approach is also used in the [SHORT-NAME]_egress.yaml. The following example shows the corresponding variable timeseries padder in the [SHORT-NAME]_timeseries_padder.yaml. Note that timeseries_padder.timeseries_padder.variable_pad_main is now called, followed by the arguments that are parsed in lieu of being specified in env:. OUT_PATH and LOG_LEVEL are still set under env:.
transform:
  image_pull_secrets:
  - battelleecology-quay-read-all-pull-secret
  image: quay.io/battelleecology/timeseries_padder:31
  cmd:
  - "/bin/bash"
  stdin:
  - "#!/bin/bash"
  - python3 -m timeseries_padder.timeseries_padder.variable_pad_main --yearindex 4 --monthindex 5 --dayindex 6 --locindex 7 --subdirindex 8
  env:
    OUT_PATH: /pfs/out
    LOG_LEVEL: INFO
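For readers unfamiliar with argparse, the sketch below shows how flags like those in the command above can be parsed. It is not the source of variable_pad_main; only the flag names are taken from the example, and the choice to make them required integers is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the index flags shown in the pipeline spec."""
    parser = argparse.ArgumentParser(description="Illustrative index-flag parser")
    for flag in ("--yearindex", "--monthindex", "--dayindex",
                 "--locindex", "--subdirindex"):
        # Assumption: each flag is a required integer path index.
        parser.add_argument(flag, type=int, required=True)
    return parser


# Parse the same argument string used in the stdin: command.
command_args = "--yearindex 4 --monthindex 5 --dayindex 6 --locindex 7 --subdirindex 8"
args = build_parser().parse_args(command_args.split())
print(args.yearindex, args.monthindex, args.dayindex, args.locindex, args.subdirindex)
# 4 5 6 7 8
```

Unlike values read from env:, which always arrive as strings, argparse converts each value to an int at parse time and reports an error if a flag is missing or not a number.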