Get formatted schema and anomalies to visualize #146

@wakanapo

Description

I'm trying to run a TFDV process in a Kubeflow Pipeline and visualize the results in the pipeline UI.

For statistics, I can easily visualize them using get_statistics_html.
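As a side note, here is a minimal sketch (not from the issue, and assuming the Kubeflow Pipelines v1 inline "web-app" artifact type) of how the HTML string returned by get_statistics_html could be surfaced in the pipeline UI; `stats_html` is a placeholder for the real TFDV output:

```python
import json

# `stats_html` stands in for the string returned by
# tfdv.utils.display_util.get_statistics_html(stats).
stats_html = "<html><body>statistics visualization</body></html>"

# KFP v1 UI metadata with the HTML embedded as an inline web-app artifact.
metadata = {
    "outputs": [
        {
            "type": "web-app",
            "storage": "inline",
            "source": stats_html,
        }
    ]
}

# KFP v1 expects this file at /mlpipeline-ui-metadata.json inside the
# container; /tmp is used here only so the sketch runs anywhere.
with open("/tmp/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```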

However, for schema and anomalies, I struggled. We have the display_schema and display_anomalies functions, but they transform the data and call IPython display internally, so there is no way to get formatted data we can visualize ourselves.
Eventually, I more or less copied the display functions and changed them to return a DataFrame.

FYI, the code looks like this:
import json
import logging
from typing import Optional

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> Optional[pd.DataFrame]:
    anomaly_rows = []
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )

    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None
    else:
        logging.warning(f"{len(anomaly_rows)} anomalies found.")
        anomalies_df = pd.DataFrame(
            anomaly_rows,
            columns=[
                "Feature name",
                "Anomaly short description",
                "Anomaly long description",
            ],
        )
        return anomalies_df


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)

    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)
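For reference, the inline-table metadata the script writes can be sketched and round-tripped with only the standard library: the header field carries the column names, while the CSV source carries data rows only, mirroring to_csv(header=False, index=False). The feature name and descriptions below are placeholders, not real TFDV output:

```python
import csv
import io

# Placeholder anomaly rows standing in for the DataFrame contents.
header = ["Feature name", "Anomaly short description", "Anomaly long description"]
rows = [["'feature_a'", "Example short description", "Example long description"]]

# Build the CSV body without a header row, as to_csv(header=False,
# index=False) does in the script above.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

metadata = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": header,          # column names live here, not in the CSV
            "source": buf.getvalue(),  # data rows only
        }
    ]
}
```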

Does anyone know a better way?
What do you think about splitting the display functions into separate transform and visualization functions, like the functions for statistics?
