Get formatted schema and anomalies to visualize #146

@wakanapo

Description

I'm trying to run a TFDV process in a Kubeflow Pipeline and visualize the results in the pipeline UI.

For statistics, I can easily visualize them using get_statistics_html.
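As a side note, here is a minimal sketch (not from the issue, and assuming the Kubeflow Pipelines v1 inline "web-app" artifact type) of how the HTML string returned by get_statistics_html could be surfaced in the pipeline UI; `stats_html` is a placeholder for the real TFDV output:

```python
import json

# `stats_html` stands in for the string returned by
# tfdv.utils.display_util.get_statistics_html(stats).
stats_html = "<html><body>statistics visualization</body></html>"

# KFP v1 UI metadata with the HTML embedded as an inline web-app artifact.
metadata = {
    "outputs": [
        {
            "type": "web-app",
            "storage": "inline",
            "source": stats_html,
        }
    ]
}

# KFP v1 expects this file at /mlpipeline-ui-metadata.json inside the
# container; /tmp is used here only so the sketch runs anywhere.
with open("/tmp/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```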

However, for schema and anomalies, I struggled. We have the display_schema and display_anomalies functions, but they transform the data and call IPython display internally, so there is no way to get formatted data we can visualize ourselves.
Eventually, I more or less copied the display functions and changed them to return a DataFrame.

FYI, the code looks like this:
import json
import logging
from typing import Optional

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> Optional[pd.DataFrame]:
    anomaly_rows = []
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )

    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None
    else:
        logging.warning(f"{len(anomaly_rows)} anomalies found.")
        anomalies_df = pd.DataFrame(
            anomaly_rows,
            columns=[
                "Feature name",
                "Anomaly short description",
                "Anomaly long description",
            ],
        )
        return anomalies_df


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)

    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)
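For reference, the inline-table metadata the script writes can be sketched and round-tripped with only the standard library: the header field carries the column names, while the CSV source carries data rows only, mirroring to_csv(header=False, index=False). The feature name and descriptions below are placeholders, not real TFDV output:

```python
import csv
import io

# Placeholder anomaly rows standing in for the DataFrame contents.
header = ["Feature name", "Anomaly short description", "Anomaly long description"]
rows = [["'feature_a'", "Example short description", "Example long description"]]

# Build the CSV body without a header row, as to_csv(header=False,
# index=False) does in the script above.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

metadata = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": header,          # column names live here, not in the CSV
            "source": buf.getvalue(),  # data rows only
        }
    ]
}
```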

Does anyone know a better way?
What do you think about splitting the display functions into separate transform and visualization functions, like the functions for statistics?
