2.7 Timeseries padder: variable vs. constant

Cove Sturtevant edited this page Feb 11, 2021 · 10 revisions

The QAQC module needs data from before and after the time window of interest, so the timeseries padder gathers that extra data. For example, suppose the pipeline is processing data for 2020-01-04 00:00:00 through 23:59:59. The padder draws in data from the days bracketing 2020-01-04 (e.g. 2020-01-03 and 2020-01-05), giving QAQC scripts the 'edge' data they need from outside 2020-01-04.
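
The bracketing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the padder's actual code: the function name `padded_days` and the `pad_days` argument are hypothetical, and the sketch assumes padding is counted in whole days on each side of the processing day.

```python
from datetime import date, timedelta


def padded_days(processing_day: date, pad_days: int) -> list[date]:
    """Return the processing day plus pad_days whole days on each side.

    Hypothetical helper for illustration; not part of timeseries_padder.
    """
    if pad_days < 0:
        raise ValueError("pad_days must be non-negative")
    # Offsets run from -pad_days to +pad_days, so the processing day is in the middle.
    return [processing_day + timedelta(days=offset)
            for offset in range(-pad_days, pad_days + 1)]


days = padded_days(date(2020, 1, 4), 1)
print([d.isoformat() for d in days])
# prints ['2020-01-03', '2020-01-04', '2020-01-05']
```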

The [SHORT-NAME]_timeseries_padder.yaml file may call either the constant or the variable timeseries padder Python module.

Use the constant timeseries padder when you want to specify a fixed window (WINDOW_SIZE) that pads a set amount of data on each side of the processing day.

Use the variable timeseries padder when you want the module to determine automatically how many days to pad on either side. This option is preferred because it adjusts automatically when changes in threshold parameters or data rate call for a larger or smaller window of data to perform QAQC. The module uses the data rate included in the location file for each named location, along with the threshold parameters, to determine the window size. The data rate must therefore be populated in the location files. If it is missing, see the Wiki page 1.3 Populating properties of named locations in Pachyderm for how to populate it.
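
To make the dependence on data rate and thresholds concrete, here is a rough sketch of how a window size could follow from those two inputs. It is not the module's actual calculation, which this page does not describe. The function name `pad_days_needed`, the `window_points` threshold, and the assumption that the data rate is expressed in Hz are all hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def pad_days_needed(data_rate_hz: float, window_points: int) -> int:
    """Whole days of padding needed to cover window_points samples.

    Hypothetical illustration: assumes a QAQC test needs `window_points`
    samples of context and that data arrive at `data_rate_hz` samples per second.
    """
    if data_rate_hz <= 0:
        raise ValueError("data_rate_hz must be positive")
    if window_points < 0:
        raise ValueError("window_points must be non-negative")
    seconds_needed = window_points / data_rate_hz
    # Round up, because the padder works in whole days.
    return math.ceil(seconds_needed / SECONDS_PER_DAY)


# 1 Hz data needing one hour of context fits within a single padded day.
print(pad_days_needed(1.0, 3600))
# One sample per minute needing 2000 samples spans more than one day.
print(pad_days_needed(1 / 60, 2000))
```

Under these assumptions, a slower data rate or a larger threshold widens the pad without any change to the pipeline specification, which is the behavior the variable padder is preferred for.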

Constant pad

The constant timeseries padder Python module timeseries_padder.timeseries_padder.constant_pad_main reads its arguments from the variables set under env: (e.g. OUT_PATH, WINDOW_SIZE, YEAR_INDEX, etc.). See an example of how env: is set for the constant timeseries padder below:

transform:
  image_pull_secrets:
  - battelleecology-quay-read-all-pull-secret
  image: quay.io/battelleecology/timeseries_padder:26
  cmd:
  - "/bin/bash"
  stdin:
  - "#!/bin/bash"
  - python3 -m timeseries_padder.timeseries_padder.constant_pad_main
  env:
    OUT_PATH: /pfs/out
    WINDOW_SIZE: '1'
    LOG_LEVEL: INFO
    RELATIVE_PATH_INDEX: '3'
    YEAR_INDEX: '4'
    MONTH_INDEX: '5'
    DAY_INDEX: '6'
    LOCATION_INDEX: '7'
    DATA_TYPE_INDEX: '8'
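
For readers unfamiliar with this pattern, the sketch below shows one way a Python entry point could pick up the values set under env:. It is an assumption about the general approach, not the actual contents of constant_pad_main; the function name `read_padder_env` is hypothetical. Note that the values are quoted strings in the yaml, so they need converting to integers.

```python
import os
from typing import Mapping

# Index variables taken from the env: block in the example specification.
INDEX_VARS = (
    "RELATIVE_PATH_INDEX",
    "YEAR_INDEX",
    "MONTH_INDEX",
    "DAY_INDEX",
    "LOCATION_INDEX",
    "DATA_TYPE_INDEX",
)


def read_padder_env(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect padder settings from environment variables (hypothetical helper)."""
    config = {
        "OUT_PATH": environ["OUT_PATH"],
        "LOG_LEVEL": environ.get("LOG_LEVEL", "INFO"),
        # Environment values are always strings, so convert explicitly.
        "WINDOW_SIZE": int(environ["WINDOW_SIZE"]),
    }
    for name in INDEX_VARS:
        config[name] = int(environ[name])
    return config


# Example using the values from the specification instead of the real environment.
example_env = {
    "OUT_PATH": "/pfs/out", "WINDOW_SIZE": "1", "LOG_LEVEL": "INFO",
    "RELATIVE_PATH_INDEX": "3", "YEAR_INDEX": "4", "MONTH_INDEX": "5",
    "DAY_INDEX": "6", "LOCATION_INDEX": "7", "DATA_TYPE_INDEX": "8",
}
config = read_padder_env(example_env)
print(config["WINDOW_SIZE"], config["YEAR_INDEX"])
# prints 1 4
```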

Variable pad

The variable timeseries padder Python module does not take its index arguments from env: in the yaml file. Instead, they are passed on the python command line and parsed with the argparse Python package (OUT_PATH and LOG_LEVEL are still set under env:). The same approach is used in the [SHORT-NAME]_egress.yaml. The following example shows the variable timeseries padder as used in the [SHORT-NAME]_timeseries_padder.yaml. Note that timeseries_padder.timeseries_padder.variable_pad_main is now called, followed by the arguments that are parsed in lieu of being specified in env:.

transform:
  image_pull_secrets:
  - battelleecology-quay-read-all-pull-secret
  image: quay.io/battelleecology/timeseries_padder:31
  cmd:
  - "/bin/bash"
  stdin:
  - "#!/bin/bash"
  - python3 -m timeseries_padder.timeseries_padder.variable_pad_main --yearindex 4 --monthindex 5 --dayindex 6 --locindex 7 --subdirindex 8
  env:
    OUT_PATH: /pfs/out
    LOG_LEVEL: INFO
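
The sketch below illustrates how the flags in the command above could be parsed with argparse and then applied to an input path. Only the flag names come from the specification; the parser setup, the example path, and the interpretation of each index as a position in the path's components are assumptions for illustration, not the actual variable_pad_main code.

```python
import argparse
from pathlib import PurePosixPath

# Flag names as they appear in the example specification.
INDEX_FLAGS = ("--yearindex", "--monthindex", "--dayindex", "--locindex", "--subdirindex")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the index flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Variable timeseries padder (sketch)")
    for flag in INDEX_FLAGS:
        parser.add_argument(flag, type=int, required=True)
    return parser


# Parse the same arguments used in the specification.
args = build_parser().parse_args(
    "--yearindex 4 --monthindex 5 --dayindex 6 --locindex 7 --subdirindex 8".split()
)

# Hypothetical input path; index 0 is the root "/".
parts = PurePosixPath("/pfs/DATA/prt/2020/01/04/CFGLOC1/data").parts
print(parts[args.yearindex], parts[args.monthindex], parts[args.dayindex],
      parts[args.locindex], parts[args.subdirindex])
# prints 2020 01 04 CFGLOC1 data
```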