Issue by skrawcz
Tuesday Aug 02, 2022 at 17:00 GMT
Originally opened as stitchfix/hamilton#165
Is your feature request related to a problem? Please describe.
Data profiling is a way to help bootstrap the creation of data quality checks.
Data profiling is also a way to facilitate data exploration, by providing summary statistics over data.
Describe the solution you'd like
A user should be able to profile their DAG, or a set of nodes, and get out some summary statistics.
Those statistics could then be used to bootstrap data quality checks, i.e. check_output() decorators, but the profiling output should also be usable on its own.
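To make the request concrete, here is a rough sketch of the profile-then-bootstrap flow, using only the standard library. Everything in it is an assumption for illustration: `profile_values`, `profile_nodes`, and `suggest_check` are hypothetical names, not Hamilton APIs, and the `range` / `allow_nans` keys are only meant to resemble the kind of settings a data quality check might take. Node outputs are modelled as plain lists keyed by node name.

```python
import math
import statistics
from typing import Any, Dict, Mapping, Optional, Sequence


def profile_values(values: Sequence[Optional[float]]) -> Dict[str, Any]:
    """Compute a few summary statistics for one node's numeric output."""
    # Treat both None and NaN as missing values.
    present = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": statistics.fmean(present) if present else None,
    }


def profile_nodes(
    outputs: Mapping[str, Sequence[Optional[float]]]
) -> Dict[str, Dict[str, Any]]:
    """Profile every node output; the result is usable on its own."""
    return {name: profile_values(vals) for name, vals in outputs.items()}


def suggest_check(profile: Mapping[str, Any]) -> Dict[str, Any]:
    """Turn one profile into candidate check settings (hypothetical keys)."""
    suggestion: Dict[str, Any] = {"allow_nans": profile["null_count"] > 0}
    if profile["min"] is not None:
        # Observed bounds are only a starting point for a human to review.
        suggestion["range"] = (profile["min"], profile["max"])
    return suggestion


# Example: outputs of two nodes, as if collected from a DAG run.
outputs = {
    "spend": [10.0, 20.0, float("nan"), 40.0],
    "signups": [1, 5, 3, 2],
}
profiles = profile_nodes(outputs)
suggestions = {name: suggest_check(p) for name, p in profiles.items()}
```

The profiles stay a plain dictionary, so they can be inspected or stored without any data quality machinery; the suggestions are a separate, optional step that a user would review before turning them into actual checks.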
Describe alternatives you've considered
I haven't considered many options, but there are a few libraries that already do data profiling.
Additional context
Systems like whylogs and Great Expectations use profiling to improve the user experience.
Standalone libraries like https://github.com/capitalone/DataProfiler also exist.
stitchfix/hamilton#149 does a little to prototype in this area too.