* Add `--sample` flag to `run` command

* Remove no longer needed `if` statement around EventTimeFilter creation for microbatch models

  In the initial implementation of microbatch models, the `start` for a batch was _optional_. However, in c3d87b8 it became guaranteed. Thus the `if` statement guarding against a missing `start/end` for microbatch models was no longer doing anything, and it was safe to remove.

* Get sample mode working with `--event-time-start/end`

  This is temporary as a POC. In the end, sample mode can't depend on the arguments `--event-time-start/end`; it will need its own CLI args / project config, something like `--sample-window`. The issue with using `--event-time-start/end` is that if people set those in the project configs, then their microbatch models would _always_ run with those values, even outside of sample mode. Despite that, this is a useful checkpoint even though it will go away.

* Begin using `--sample-window` for sample mode instead of `--event-time-start/end`

  Using `--event-time-start/end` for sample mode was conflicting with microbatch models when _not_ running in sample mode. We will have to do _slightly_ more work to plumb this new way of specifying sample time to microbatch models.

* Move `SampleWindow` class to `sample_window.py` in `event_time` submodule

  This is mostly symbolic. We are going to be adding some utilities for "event_time" type things, which will all live in the `event_time` submodule. Additionally, we plan to refactor `/incremental/materializations/microbatch.py` into the submodule as well.

* Create an `offset_timestamp` separate from MicrobatchBuilder

  The `MicrobatchBuilder.offset_timestamp` _truncates_ the timestamp before offsetting it. We don't want to do that; we want to offset the "raw" timestamp. We could have renamed the microbatch builder function to `truncate_and_offset_timestamp` and split the offset logic into a separate abstract function. However, the offset logic in the MicrobatchBuilder context depends on the truncation. We might later be able to refactor the MicrobatchBuilder function to truncate _after_ offsetting instead of before. But that is out of scope for this initial work, and we should revisit it later.

* Add `types-python-dateutil` to dev requirements

  The previous commit began using a submodule of the third-party `dateutil` Python library. We weren't previously using this library, and thus didn't need the type stubs for it. Now that we do use it, we need the type stubs during development.

* Begin supporting microbatch models in sample mode

* Move parsing logic of `SampleWindowType` to `SampleWindow`

* Allow for specification of "specific" sample windows

  In most cases people will want to set "relative" sample windows, i.e. "3 days" to sample the last three days. However, there are some cases where people will want to set "specific" sample windows for some chunk of historic time, i.e. `{'start': '2024-01-01', 'end': '2024-01-31'}`.

* Fix tests of `BaseResolver.resolve_event_time_filter` for sample mode changes

* Add `--no-sample` as it's necessary for retry

* Add guards to accessing of `sample` and `sample_window`

  This was necessary because these aren't _always_ available. I had expected to need to do this after putting the `sample` flag behind an environment variable (which I haven't done yet). However, we needed to add the guards sooner because the `render` logic is called multiple times throughout the dbt process, and earlier on the flags aren't available.

* Gate sample mode functionality via env var `DBT_EXPERIMENTAL_SAMPLE_MODE`

  At this point sample mode is _alpha_ and should not be depended upon. To make this crystal clear we've gated the functionality behind an environment variable. We'll likely remove this gate in the coming month.

* Add sample mode tests for incremental models

* Add changie doc for sample mode initial implementation

* Fixup sample mode functional tests

  I had updated the `later_input_model.sql` to be easier to test with. However, I didn't correspondingly update the initial `input_model.sql` to match.

* Ensure microbatch creates correct number of batches when sample mode env var isn't present

  Previously microbatch was creating the _right_ number of batches when:

  1. sample mode _wasn't_ being used
  2. sample mode _was_ being used AND the env var was present

  Unfortunately microbatch _wasn't_ creating the right number of batches when:

  3. sample mode _was_ being used AND the env var _wasn't_ present.

  In case (3) sample mode shouldn't be run. Unfortunately we weren't gating sample mode by the environment variable during batch creation. This led to a situation where batch creation was using sample mode but the rendering of refs _wasn't_, which left the run in an in-between state. This commit fixes that issue. Also of note: we currently have duplicate sample mode gating logic in the batch creation as well as in the rendering of refs. We should probably consolidate this logic into a single importable function, so that any future change to how sample mode is gated is easier to implement.

* Correct comment in SampleWindow post serialization method

* Hide CLI sample mode options

  We are doing this _temporarily_ while sample mode as a feature is in alpha/beta and locked behind an environment variable. When we remove the environment variable we should also unhide these.
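The commit message notes that the gating logic is duplicated between batch creation and ref rendering, and suggests consolidating it into one importable function. A minimal sketch of what such a helper might look like is below. Only the environment variable name `DBT_EXPERIMENTAL_SAMPLE_MODE` comes from the commit message; the function name `sample_mode_enabled`, its signature, and the set of values treated as "on" are assumptions made for illustration and are not dbt's actual implementation.

```python
import os
from typing import Mapping, Optional

# Name taken from the commit message.
SAMPLE_MODE_ENV_VAR = "DBT_EXPERIMENTAL_SAMPLE_MODE"

# ASSUMPTION: which strings count as "enabled" is not stated in the commit.
_TRUTHY = {"1", "true", "t", "yes", "y"}


def sample_mode_enabled(
    sample_flag: Optional[bool],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical single gate for sample mode.

    Returns True only when the `--sample` flag is set AND the experimental
    environment variable is present with a truthy value.
    """
    if env is None:
        env = os.environ
    if not sample_flag:
        # Covers both `--no-sample` and the flag not being available yet.
        return False
    return env.get(SAMPLE_MODE_ENV_VAR, "").strip().lower() in _TRUTHY


# The three cases described in the commit message:
print(sample_mode_enabled(False, {SAMPLE_MODE_ENV_VAR: "1"}))  # (1) flag off
print(sample_mode_enabled(True, {SAMPLE_MODE_ENV_VAR: "1"}))   # (2) flag on, env var set
print(sample_mode_enabled(True, {}))                           # (3) flag on, env var missing
```

With a single helper like this, batch creation and ref rendering would consult the same function and could not disagree about case (3).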
Showing 15 changed files with 1,040 additions and 22 deletions.
    @@ -0,0 +1,6 @@
    kind: Features
    body: Initial implementation of sample mode
    time: 2025-02-02T14:00:54.074209-06:00
    custom:
      Author: QMalcolm
      Issue: 11227 11230 11231 11248 11252 11254 11258
    @@ -0,0 +1,40 @@
    from datetime import datetime

    from dateutil.relativedelta import relativedelta

    from dbt.artifacts.resources.types import BatchSize
    from dbt_common.exceptions import DbtRuntimeError


    def offset_timestamp(timestamp: datetime, batch_size: BatchSize, offset: int) -> datetime:
        """Offsets the passed in timestamp based on the batch_size and offset.

        Note: THIS IS DIFFERENT FROM MicrobatchBuilder.offset_timestamp. That function first
        `truncates` the timestamp, and then does delta addition/subtraction from there. This
        function _doesn't_ truncate the timestamp and uses `relativedelta` for specific edge
        case handling (months, years), which may produce different results than the delta math
        done in `MicrobatchBuilder.offset_timestamp`

        Examples
        2024-09-17 16:06:00 + BatchSize.hour -1 -> 2024-09-17 15:06:00
        2024-09-17 16:06:00 + BatchSize.hour +1 -> 2024-09-17 17:06:00
        2024-09-17 16:06:00 + BatchSize.day -1 -> 2024-09-16 16:06:00
        2024-09-17 16:06:00 + BatchSize.day +1 -> 2024-09-18 16:06:00
        2024-09-17 16:06:00 + BatchSize.month -1 -> 2024-08-17 16:06:00
        2024-09-17 16:06:00 + BatchSize.month +1 -> 2024-10-17 16:06:00
        2024-09-17 16:06:00 + BatchSize.year -1 -> 2023-09-17 16:06:00
        2024-09-17 16:06:00 + BatchSize.year +1 -> 2025-09-17 16:06:00
        2024-01-31 16:06:00 + BatchSize.month +1 -> 2024-02-29 16:06:00
        2024-02-29 16:06:00 + BatchSize.year +1 -> 2025-02-28 16:06:00
        """

        if batch_size == BatchSize.hour:
            return timestamp + relativedelta(hours=offset)
        elif batch_size == BatchSize.day:
            return timestamp + relativedelta(days=offset)
        elif batch_size == BatchSize.month:
            return timestamp + relativedelta(months=offset)
        elif batch_size == BatchSize.year:
            return timestamp + relativedelta(years=offset)
        else:
            raise DbtRuntimeError(f"Unhandled batch_size '{batch_size}'")
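To make the docstring's distinction concrete, here is a small standard-library sketch of why offsetting the "raw" timestamp can differ from truncating first. `add_months_clipped` is a stand-in written for this example that mimics how `relativedelta(months=...)` clips to the last valid day of the target month; `truncate_to_month` is a hypothetical truncation step and is only assumed to resemble what `MicrobatchBuilder.offset_timestamp` does before offsetting.

```python
import calendar
from datetime import datetime


def add_months_clipped(ts: datetime, months: int) -> datetime:
    """Stdlib stand-in for `ts + relativedelta(months=months)`.

    Shifts the month and, if the day does not exist in the target month,
    clips it to that month's last day.
    """
    month_index = ts.year * 12 + (ts.month - 1) + months
    year, zero_based_month = divmod(month_index, 12)
    month = zero_based_month + 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))


def truncate_to_month(ts: datetime) -> datetime:
    """HYPOTHETICAL truncation step: snap to the start of the month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


ts = datetime(2024, 1, 31, 16, 6)

# Offsetting the raw timestamp keeps the time of day and clips the day.
raw = add_months_clipped(ts, 1)
print(raw)  # 2024-02-29 16:06:00

# Truncating first discards the day and time before the offset is applied.
truncated_first = add_months_clipped(truncate_to_month(ts), 1)
print(truncated_first)  # 2024-02-01 00:00:00
```

For a sample window such as "1 month", the raw offset is presumably the behaviour wanted here, since the window should end at "now" rather than at a batch boundary.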
    @@ -0,0 +1,60 @@
    from __future__ import annotations

    from datetime import datetime

    import pytz
    from attr import dataclass

    from dbt.artifacts.resources.types import BatchSize
    from dbt.event_time.event_time import offset_timestamp
    from dbt_common.dataclass_schema import dbtClassMixin
    from dbt_common.exceptions import DbtRuntimeError


    @dataclass
    class SampleWindow(dbtClassMixin):
        start: datetime
        end: datetime

        def __post_serialize__(self, data, context):
            # This is insane, but necessary, I apologize. Mashumaro handles the
            # dictification of this class via a compile time generated `to_dict`
            # method based off of the _typing_ of the class. By default `datetime`
            # types are converted to strings. We don't want that, we want them to
            # stay datetimes.
            # Note: This is safe because the `SampleWindow` isn't part of the artifact
            # and thus doesn't get written out.
            new_data = super().__post_serialize__(data, context)
            new_data["start"] = self.start
            new_data["end"] = self.end
            return new_data

        @classmethod
        def from_relative_string(cls, relative_string: str) -> SampleWindow:
            end = datetime.now(tz=pytz.UTC)

            relative_window = relative_string.split(" ")
            if len(relative_window) != 2:
                raise DbtRuntimeError(
                    f"Cannot load SAMPLE_WINDOW from '{relative_string}'. Must be of form 'DAYS_INT GRAIN_SIZE'."
                )

            try:
                lookback = int(relative_window[0])
            except Exception:
                raise DbtRuntimeError(f"Unable to convert '{relative_window[0]}' to an integer.")

            try:
                batch_size_string = relative_window[1].lower().rstrip("s")
                batch_size = BatchSize[batch_size_string]
            except Exception:
                grains = [size.value for size in BatchSize]
                grain_plurals = [BatchSize.plural(size) for size in BatchSize]
                valid_grains = grains + grain_plurals
                raise DbtRuntimeError(
                    f"Invalid grain size '{relative_window[1]}'. Must be one of {valid_grains}."
                )

            start = offset_timestamp(timestamp=end, batch_size=batch_size, offset=-1 * lookback)

            return cls(start=start, end=end)
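As a rough illustration of the parsing steps in `from_relative_string`, the sketch below reimplements them with the standard library only. `parse_relative_window` and `GRAIN_TO_DELTA` are hypothetical names introduced for this example; the sketch handles only hour and day grains, raises `ValueError` instead of `DbtRuntimeError`, and takes `now` as a parameter so the result is reproducible. None of that is claimed to match dbt's implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# ASSUMPTION: only fixed-length grains, to keep the sketch stdlib-only.
GRAIN_TO_DELTA = {"hour": timedelta(hours=1), "day": timedelta(days=1)}


def parse_relative_window(relative_string: str, now: datetime) -> Tuple[datetime, datetime]:
    """Turn a string like '3 days' into a (start, end) pair ending at `now`."""
    parts = relative_string.split(" ")
    if len(parts) != 2:
        raise ValueError(f"Expected '<int> <grain>', got '{relative_string}'")

    try:
        lookback = int(parts[0])
    except ValueError:
        raise ValueError(f"Unable to convert '{parts[0]}' to an integer.") from None

    # Same normalisation as the diff: lowercase, then drop a trailing plural "s".
    grain = parts[1].lower().rstrip("s")
    if grain not in GRAIN_TO_DELTA:
        raise ValueError(f"Invalid grain size '{parts[1]}'.")

    return now - lookback * GRAIN_TO_DELTA[grain], now


now = datetime(2024, 9, 17, 16, 6, tzinfo=timezone.utc)
start, end = parse_relative_window("3 days", now)
print(start, end)  # 2024-09-14 16:06:00+00:00 2024-09-17 16:06:00+00:00
```

Passing `now` in explicitly is a design choice for the sketch only; the method in the diff reads the current UTC time itself.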