Using tfdv to validate text based data #215

Open
@Capsar

Hi,

I searched online to find out whether tfdv can be used to validate data that contains text, for instance a dataset of sentences that have to be mapped to labels. I could not find any really useful tutorials: the ones I did find only cover numerical features of a dataset, such as height and weight.

After looking around in the data-validation package, I found a couple of files that seem to be related to this:
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py
and
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py

Furthermore, in the TensorFlow documentation for the StatsOptions class, I found the following:
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions

| Arguments | Description |
| --- | --- |
| `enable_semantic_domain_stats` | If True statistics for semantic domains are generated (e.g: image, text domains). |
| `semantic_domain_stats_sample_rate` | An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample. |
| `vocab_paths` | An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files. |

These arguments and files suggest that tfdv can be used to analyze and validate data for NLP / text classification problems.

However, it is unclear to me how one would go about using these features to validate text-based data.
I have enabled the enable_semantic_domain_stats argument, and this does give information such as sequence length.
But how would one extend this to validate vocabularies, for example by checking the ratio of known to unknown words?
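
To make the known/unknown word ratio concrete, here is a minimal sketch in plain Python of the kind of check I have in mind. It does not use any tfdv API: the function name `oov_ratio`, the naive whitespace tokenization, the toy vocabulary, and the 20% threshold are all assumptions made up for illustration.

```python
from typing import Iterable, Set


def oov_ratio(sentences: Iterable[str], vocab: Set[str]) -> float:
    """Return the fraction of tokens that are not found in `vocab`.

    Tokenization is a naive lowercase whitespace split (an assumption;
    a real pipeline would reuse the model's own tokenizer).
    """
    total = 0
    unknown = 0
    for sentence in sentences:
        for token in sentence.lower().split():
            total += 1
            if token not in vocab:
                unknown += 1
    # By convention, an empty dataset has no unknown tokens.
    return unknown / total if total else 0.0


# Toy vocabulary and data, purely illustrative.
vocab = {"the", "cat", "sat", "on", "mat"}
sentences = ["The cat sat", "the dog sat on the mat"]

ratio = oov_ratio(sentences, vocab)

# Hypothetical threshold: flag the dataset when more than 20% of tokens are unknown.
is_anomalous = ratio > 0.2
print(f"unknown-token ratio: {ratio:.3f}, anomalous: {is_anomalous}")
```

Ideally tfdv would compute a statistic like this per feature and compare it against a constraint in the schema, rather than having to hand-roll it as above.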

Any tips or thoughts are highly appreciated!
Kind Regards,
Caspar
