Ability to data profile node outputs for creating data quality checks #40

@HamiltonRepoMigrationBot

Issue by skrawcz
Tuesday Aug 02, 2022 at 17:00 GMT
Originally opened as stitchfix/hamilton#165


Is your feature request related to a problem? Please describe.
Data profiling is a way to help bootstrap creating data quality checks.
Data profiling is also a way to facilitate data exploration, by providing summary statistics over data.

Describe the solution you'd like
A user should be able to profile their DAG, or a set of nodes, and get out some summary statistics.
Those statistics could then be used to bootstrap data quality checks, e.g. check_output() decorators, but the profiling output should also be usable on its own.
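
To make the request concrete, here is a rough sketch of what "profile some node outputs, then derive candidate checks" could look like. It is not a proposed Hamilton API: profile_values, profile_outputs, suggest_check, and the keys in the suggestion dict (allow_missing, range) are all hypothetical names, and the node outputs are passed in as a plain dict rather than taken from a real DAG run. It uses only the standard library.

```python
import math
import statistics
from typing import Any, Dict, Sequence


def _is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def profile_values(values: Sequence[Any]) -> Dict[str, Any]:
    """Compute simple summary statistics for one node's (numeric) output."""
    present = [v for v in values if not _is_missing(v)]
    profile: Dict[str, Any] = {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": None,
        "max": None,
        "mean": None,
    }
    if present:
        profile["min"] = min(present)
        profile["max"] = max(present)
        profile["mean"] = statistics.fmean(present)
    return profile


def profile_outputs(outputs: Dict[str, Sequence[Any]]) -> Dict[str, Dict[str, Any]]:
    """Profile every node output, keyed by node name (hypothetical helper)."""
    return {name: profile_values(vals) for name, vals in outputs.items()}


def suggest_check(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a profile into candidate check parameters.

    The keys here are made up for illustration; a user would review them
    and translate them into real data quality checks by hand.
    """
    suggestion: Dict[str, Any] = {"allow_missing": profile["missing"] > 0}
    if profile["min"] is not None:
        suggestion["range"] = (profile["min"], profile["max"])
    return suggestion


# Stand-in for outputs collected from a DAG run (made-up data).
node_outputs = {
    "spend": [10.0, 20.0, None, 40.0],
    "signups": [1, 10, 50, 100],
}

profiles = profile_outputs(node_outputs)
checks = {name: suggest_check(p) for name, p in profiles.items()}
```

The profiles dict is the standalone output; the checks dict is only a starting point that someone would still need to review before turning it into actual checks.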

Describe alternatives you've considered
Haven't considered many options, but a few libraries already do data profiling.

Additional context
Systems like whylogs and Great Expectations use profiling to improve the user experience.
Standalone libraries like https://github.com/capitalone/DataProfiler also exist.

stitchfix/hamilton#149 also does a little prototyping in this area.
