Home
Joan Giner edited this page Sep 23, 2022 · 1 revision
(Version 0.0.5)
DescribeML is a VSCode language plugin to describe machine-learning datasets.
Full examples of the language can be found in the public open repository here
- Title: STRING. The public title of the dataset
- Unique-identifier: ID. Machine-readable unique identifier of the dataset
- Version: ID. The version of the dataset
- Dates:
  - Created: DATE. The date when the dataset was initially created
  - Modified: DATE. The date when the dataset was last modified
  - Published: DATE. The publication date of the dataset

  Example:

      Dates:
          Release Date: 10-08-20
          Modified Date: 10-08-20
          Published Date: 10-08-20

- Citation: The citation of the dataset. Choose between a raw citation and a structured format
  - Raw Citation: STRING. Raw citation of the dataset as text, BibTeX, or an equivalent format
  - OR:
    - Title: STRING. The title of the dataset
    - Authors: STRING. The authors of the dataset
    - Year: DATE. The year of the dataset
    - Journal/Conference: STRING. The journal or conference where the dataset was published
    - Publisher: STRING. The publisher of the dataset
    - URL: URL. The URL of the dataset
    - DOI: ID. The DOI of the dataset
    - ISBN: ID. The ISBN of the dataset

  Example:

      Citation:
          Title: "SIIM-ISIC 2020 Challenge Dataset. International Skin Imaging Collaboration"
          Year: 2020
          Publisher: "International Skin Imaging Collaboration"
          DOI: "doi.org/10.34970/2020-ds01"
          Url: "https://www.kaggle.com/c/siim-isic-melanoma-classification"

- Description: The description of the dataset
  - Description: STRING. Textual description of the dataset
  - OR:
    - Purposes: STRING. For what purposes was the dataset created?
    - Tasks: TASKS ENUMERATE. List of ML tasks the dataset is intended for. The autocomplete feature will guide you through the options
    - Gaps: STRING. Which gaps does the dataset aim to fill?
  - Areas: ID. A list of areas of the dataset
  - Tags: ID, ... A list of tags of the dataset

  Example:

      Description:
          Purposes:
              Purposes: "The 2020 SIIM-ISIC Melanoma"
              Tasks: [classification]
              Gaps: "As the leading healthcare organization for informatics in medical imaging..."
          Areas: HealthCare
          Tags: Images Melanoma diagnosis SkinImage

- Applications: Summarize the applications of the dataset
  - Past Uses: STRING. Summarize the past uses of the dataset
  - Recommended uses: STRING. Summarize the recommended uses of the dataset
  - Non-recommended uses: STRING. Summarize the non-recommended uses of the dataset
  - Benchmarking: Benchmarking of the dataset
    - Task: TASKS ENUMERATE. Task to benchmark. The autocomplete feature will guide you through the options
    - Metric: Metric to benchmark
      - F1: NUMBER. F1 score
      - Accuracy: NUMBER. Accuracy score
      - Precision: NUMBER. Precision score
      - Recall: NUMBER. Recall score
    - Reference: STRING. Source of the benchmark

  Example:

      Applications:
          Past Uses: "The 2020 SIIM-ISIC Melanoma Classification... "
          Recommended: "Identify melanoma in lesion images." "Predict incidence of melanoma in a population."
          Non-recommended: "Due to low population prevalence and challenges with access."
          Benchmarking:
              Task: Language-model [
                  Model: "ModelExample"
                  Metrics:[
                      F1: 81
                      Accuracy: 81
                      Precision: 81
                      Recall: 81
                  ]
                  Reference: "https://www.kaggle.com/c/siim-isic-melanoma-classification/leaderboard"
              ]

- Distribution: Summarize the distribution of the dataset
  - Is public?: BOOL. Indicate whether the dataset is publicly available
  - Licenses: LICENCES ENUMERATE. List of standard licenses; use others if none fits your case: The Montreal Data License, Creative Commons, CC0: Public Domain, ...
  - Rights(stand-alone): ENUMERATE. Montreal Data License stand-alone rights: Access | Tagging | Distribute | Re-Represent
  - Rights(with models): ENUMERATE. Montreal Data License model-related rights: Benchmark | Research | Publish | Internal Use | Output Commercialization | Model Commercialization
  - Credits/Attribution Notice: STRING. Who needs to be credited when using the dataset
  - Designated Third Parties: STRING. Third parties in charge of licensing and distribution issues
  - Additional Conditions: STRING. Other issues specified by the authors

  Example:

      Distribution:
          Licences: CC BY 3.0 (Attribution 3.0 Unported)
          Rights(stand-alone): Access
          Rights(with models): Benchmark
          Additional Conditions "In addition to the CC-BY-NC license, the dataset is governed by the ISIC Terms of Use ... "

- Authoring: Authoring of the dataset
  - Authors: Authors of the dataset
    - Name: STRING. Name of the author
    - Email: EMAIL. Email of the author
  - Funders: Funders of the dataset
    - Name: STRING. Name of the funder
    - Type: ENUMERATE. Type of the funder: private | public | mixed
    - Grantor: STRING. Grantor of the dataset
    - Grant ID: ID. Machine-readable identifier of the grant
  - Maintainers: Maintainers of the dataset
    - Name: STRING. Name of the maintainer
    - Email: EMAIL. Email of the maintainer
  - Erratum?: STRING. Is there any erratum?
  - Data retention: STRING. Indicate any data retention policy
  - Version lifecycle: STRING. Describe the planned version lifecycle
  - Contribution guidelines: STRING. Is there any contribution guideline?

  Example:

      Authoring:
          Authors:
              Name Skin_Imaging_Collaboration_ISIC email emailo@emailo.com
              [...]
          Funders:
              Name The_University_of_Queensland type mixed grantor "National Health and Medical Research Council (NHMRC) – Centre of Research Excellence Scheme" grantId: APP1099021
              [...]
          Erratum?: "There is no erratum known"
          Contribution guidelines: "No contribution guidelines provided"

- Rationale: STRING. Provide a composition rationale
- Total Size: NUMBER. Total number of tuples in the dataset
- Instances: A composition description of each instance of the dataset
  - Instance: ID. Machine-readable name of the instance
  - Size: NUMBER. Size of the instance
  - Description: STRING. Description of the instance
  - Type: ENUMERATE. Type of the instance: Record-Data | Time-Series | Ordered | Graph | Other
  - Attribute Number: NUMBER. Number of attributes
  - Attributes: Description of each attribute of the instance
    - attribute: ID. Machine-readable name of the attribute
    - Description: STRING. Description of the attribute
    - Associated label: Labels. Reference to a label declared in a labeling process (complete the provenance part first)
    - unique values: NUMBER. Number of unique values of the attribute
    - ofType: ENUMERATE. Type of the attribute: Categorical | Numerical

      If ofType is Categorical:
      - Statistics: Statistics of the attribute
        - Unique: NUMBER. Unique tuples (without duplicates)
        - Unique Percentage: NUMBER. Percentage of unique tuples
        - Missing Values: NUMBER. Number of missing values
        - Completeness: NUMBER. Completeness of the attribute
        - Mode: STRING. Mode of the attribute
        - First Rows: [0: ROW1, ...]. First rows of the attribute
        - Min-leght: NUMBER. Minimum length of the attribute values
        - Max-lenght: NUMBER. Maximum length of the attribute values
        - Median-lenght: NUMBER. Median length of the attribute values
        - Lenght-histogram: STRING. Histogram of the lengths of the attribute values
        - Chi-Squared: Chi-squared test of the attribute
          - statistic: Statistic of the chi-square analysis
          - p-value: p-value of the chi-square analysis
        - Binary attribute: BOOL. Is it a binary attribute?
          - Symmetry: ENUMERATE. Symmetryc | Asymmetryc
          - Attribute Sparsity: NUMBER. How sparse is the binary attribute?
        - Categoric Distribution: ["CATEGORY": "NUMBER"%, ...]. Categoric distribution of the attribute

        Example:

            attribute: beningnant_malignant
                description: 'Type of the melanoma'
                label: skinLabel
                count: 33126
                ofType: Categorical
                Statistics:
                    Missing Values: 0
                    Completeness: 100
                    Chi-Squared:
                        p-value: 0
                    Categoric Distribution: [ "beningnant": 80%, "malignant": 20% ]

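The categorical statistics listed above can be derived from the raw column before writing them into the description. The following is only a minimal sketch in plain Python, not part of DescribeML: the function name `categorical_statistics`, the use of `None` for missing values, and the choice to express "Unique Percentage" relative to all rows are assumptions made for illustration.

```python
from collections import Counter


def categorical_statistics(values):
    """Summarize a non-empty categorical column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        # Number of distinct non-missing values (assumed meaning of "Unique").
        "Unique": len(counts),
        "Unique Percentage": 100 * len(counts) / total,
        "Missing Values": total - len(present),
        # Share of rows that carry a value, as a percentage.
        "Completeness": 100 * len(present) / total,
        "Mode": counts.most_common(1)[0][0],
        # Percentage of non-missing rows per category.
        "Categoric Distribution": {
            category: 100 * n / len(present) for category, n in counts.items()
        },
    }


# Hypothetical column used only to exercise the function.
column = ["benign", "benign", "malignant", None, "benign"]
stats = categorical_statistics(column)
print(stats)
```

The resulting values can then be copied into the `Statistics:` block of the attribute by hand.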
      If ofType is Numerical:
      - Statistics: Statistics of the attribute
        - Mean: NUMBER. Mean of the attribute
        - Median: NUMBER. Median of the attribute
        - Mode: NUMBER. Mode of the attribute
        - Minimmum: NUMBER. Minimum of the attribute
        - Maximmum: NUMBER. Maximum of the attribute
        - Quartiles: [Q1:NUMBER, ...]. Quartiles of the attribute
        - IQR: NUMBER. Interquartile range of the attribute

        Example:

            attribute: acidity
                description: 'wine acidity measure'
                count: 33126
                ofType: Numerical
                Statistics:
                    Mean: 4
                    Median: 4.1
                    Standard Desviation: 0.2
                    Minimmum: 5
                    Maximmum: 87
                    Quartiles: Q1:17 Q2:27 Q3:30 Q4:30
                    IQR: 1.2

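For numerical attributes, the values above can be obtained with Python's standard `statistics` module. This is a hedged sketch rather than anything the plugin provides: `numerical_statistics` and the sample data are hypothetical, the dictionary keys simply mirror the field names used on this page, and the "inclusive" quantile method is one of several possible conventions. Note that `statistics.quantiles` returns three cut points (Q1 to Q3).

```python
import statistics


def numerical_statistics(values):
    """Summarize a numerical column that has no missing values."""
    # Three cut points that split the sorted data into four equal groups.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        # Keys spelled as in the field list on this page.
        "Minimmum": min(values),
        "Maximmum": max(values),
        "Quartiles": {"Q1": q1, "Q2": q2, "Q3": q3},
        # Interquartile range: spread of the middle half of the data.
        "IQR": q3 - q1,
    }


# Hypothetical measurements used only to exercise the function.
acidity = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 6.0]
stats = numerical_statistics(acidity)
print(stats)
```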
  - Statistics: (instance) Statistics of the instance
    - Correlations: Correlation of the instance. Choose one calculation type
      - Pearson: [INDEX:"NUMBER", ...]. Pearson correlation of the instance
      - Spearman: [INDEX:"NUMBER", ...]. Spearman correlation of the instance
      - Kendall: [INDEX:"NUMBER", ...]. Kendall correlation of the instance
      - Cramers: [INDEX:"NUMBER", ...]. Cramers correlation of the instance
      - Phi-k: [INDEX:"NUMBER", ...]. Phi-k correlation of the instance
    - Pair Correlation: Between [ATTRIBUTE], and [ATTRIBUTE]. Points out a relevant pair correlation between two declared attributes
    - Quality Metrics: General quality metrics of the instance
      - Sparsity: NUMBER. Sparsity of the instance
      - Completeness: NUMBER. Completeness of the instance
      - Class balance: STRING. Class balance of the instance
      - Noisy labels: STRING. Noisy labels of the instance

    Example:

        Statistics:
            Correlations:
                Spearman: ['1': 0.2, '2':0.3, '3':0.4, '4':0.5, '5':0.6, '6':0.7, '7':0.8, '8':0.9]
            Pair Correlation:
                between ImageId and diagnosis
                between age and external source
                    From: "National statistical office"
                    Rationale: "The age average is similar to the Nevada state age average due to national statistical office average of 2022 of Nevada"
            Quality Metrics:
                Completeness: 100

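To make the difference between two of the correlation types concrete, here is a small dependency-free sketch of Pearson and Spearman correlation between two attributes. It is illustrative only and is not how DescribeML computes or checks these values; the helper names and the sample columns `age` and `lesion_size` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical attributes: monotonic but not linear.
age = [20, 30, 40, 50, 60]
lesion_size = [1.0, 2.0, 4.0, 8.0, 16.0]
print(pearson(age, lesion_size), spearman(age, lesion_size))
```

Because the relationship in the sample is monotonic but not linear, the Spearman value is higher than the Pearson value, which is one reason to pick the calculation type deliberately.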
  - Consistency Rules: Set the consistency rules of your dataset
    - Rule: OCLExpression. OCL expression of the rule

    Example:

        Consistency rules:
            inv: skinImages : (age >= 0)

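An invariant such as `age >= 0` states a condition that every record of the instance is expected to satisfy. The sketch below shows what checking such a rule against the data could look like in plain Python; it does not reflect how the plugin handles OCL expressions, and the function `violations` and the sample rows are hypothetical.

```python
def violations(rows, predicate):
    """Return the rows for which the invariant does not hold."""
    return [row for row in rows if not predicate(row)]


# Hypothetical records of the skinImages instance.
skin_images = [
    {"ImageId": "a", "age": 45},
    {"ImageId": "b", "age": -1},
    {"ImageId": "c", "age": 0},
]

# Python counterpart of the invariant (age >= 0).
bad_rows = violations(skin_images, lambda row: row["age"] >= 0)
print(bad_rows)
```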
- Dependencies: Dependencies of the rule
  - Description: STRING. Description of the dependencies
  - Links: URL. Link to the dependency artifact
- Instances relation: Relation: ID attribute: [ATTRIBUTE] is related to [INSTANCE]. Relation between instances
- Curation Rationale: STRING. Provide a provenance rationale
- Gathering Processes:
  - Process: ID. Machine-readable name of the process
  - Description: STRING. Description of the process
  - When data was collected: STRING. Date when the process was performed
  - How data was collected: STRING. How data was collected
  - Is language data: Set the speech situation
    - Language: STRING. Language of the data
    - Time and place: STRING
    - Modality: ENUMERATE. Modality of the speech: spoken/signed | written
    - Type: ENUMERATE. Type of the speech: scripted/edited | spontaneous
    - Syncrony: ENUMERATE. Synchrony of the speech: synchronous | asynchronous
    - Inteded Audience: STRING. Intended audience of the speech
  - Social Issues: [SOCIAL ISSUES]. Relation of the gathering process with an already declared social issue instance
  - Source: Source of the data
    - Source: ID. Machine-readable name of the source
    - Description: STRING. Description of the source
    - Noise: STRING. Description of the source's noise
    - Links: URL. Link to the source artifact
  - Process Demographics:
    - Age: NUMBER. Median age of the participants
    - Gender: STRING. Gender distribution of the participants
    - Country/Region: STRING. Country/Region of the participants
    - Race/Ethnicity: STRING. Race or ethnicity of the participants
    - Native Langugage: STRING. Native language of the participants
    - Socioeconomic status: STRING. Socioeconomic status of the participants
    - Number of speakers represented: NUMBER. Number of participants
    - Precense of disorders in speech: STRING. Presence of speech disorders among the speakers
    - Training in linguistics/other relevant disciplines: STRING. Explain the training of the participants
  - Gathering Team: Team in charge of gathering the data
    - Who collects the data: STRING. Who collects the data
    - Type: ENUMERATE. Internal | External | Contractors | Crowdsourcing
    - Demographics: Demographics of the gathering team
      - Age: NUMBER. Median age of the participants
      - Gender: STRING. Gender distribution of the participants
      - Country/Region: STRING. Country/Region of the participants
      - Race/Ethnicity: STRING. Race or ethnicity of the participants
      - Native Langugage: STRING. Native language of the participants
      - Socioeconomic status: STRING. Socioeconomic status of the participants
      - Training in linguistics/other relevant disciplines: STRING. Explain the training of the participants
  - Gathering Requirements: Requirement: STRING, ...

  Example:

      Data Provenance:
          Curation Rationale: "The curation process has been conducted by several health institutions... "
          Gathering Processes:
              Process: GatheringProcess1
                  Description: "The sources are: the Melanoma Institute Australia and the ..."
                  Source: GeneralHospital1
                      Description: 'Source Description'
                      Noise: "Inconsistent lighting in images may alter skin type" "Duplicates:..."
                  Related Instances: skinImages
                  How data is collected: Manual Human Curator
                  When data was collected: Range: 1998 - 2019
                  Process Demographics:
                      Country/Region: 'Australia'
                      [...]
                  Gathering Team:
                      Who collects the data: "A team of dermatologists and pathologists"
                      Type Internal
                  Gather Requirements:
                      Requirement: "We queried clinical imaging databases across the six centers to generate a ..."

- LabelingProcesses:
  - Labeling process: ID. Machine-readable name of the labeling process
  - Description: STRING. Description of the labeling process
  - Type: ENUMERATE. 'Bounding boxes' | 'Lines and splines' | 'Semantinc Segmentation' | '3D cuboids' | 'Polygonal segmentation' | 'Landmark and key-point' | 'Image and video annotations' | 'Entity annotation' | 'Content and textual categorization'
  - Labels: Labels of the labeling process
    - Label: ID. Machine-readable name of the label
    - Description: STRING. Description of the label
    - Mapping: [ATTRIBUTE,...]. Relates a label to attributes already declared in the documentation
  - Labeling Team:
    - Who collects the data: STRING. Who collects the data
    - Type: ENUMERATE. Internal | External | Contractors | Crowdsourcing
    - Demographics: Demographics of the labeling team
      - Age: NUMBER. Median age of the participants
      - Gender: STRING. Gender distribution of the participants
      - Country/Region: STRING. Country/Region of the participants
      - Race/Ethnicity: STRING. Race or ethnicity of the participants
      - Native Langugage: STRING. Native language of the participants
      - Socioeconomic status: STRING. Socioeconomic status of the participants
      - Number of speakers represented: NUMBER. Number of participants
      - Precense of disorders in speech: STRING. Presence of speech disorders among the speakers
      - Training in linguistics/other relevant disciplines: STRING. Explain the training of the participants
  - Infrastructure: Infrastructure used to annotate the data
    - Tool: STRING. Tool used to annotate the data
    - Platform: STRING. Platform where the tool works
    - Version: STRING. Version of the tool and platform
    - Language: STRING. Language of the tool
    - Comments: STRING. Provide comments about the tool
  - Validation: Validation methods to ensure annotation quality
    - Validation Methods: STRING. Validation method used
    - Validation Dates: STRING. Dates when the annotations were validated
    - Golden Questions: Golden questions given to the annotators
      - Question: STRING. Textual question
      - Inter-annotation agreement: NUMBER. Inter-annotation agreement for each question. Low values mean low confidence in the annotation
    - Validation Requirements: Requirement: STRING, ... Provide comments about the validation tool
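The page does not say which measure is expected for the inter-annotation agreement. One common choice for two annotators is Cohen's kappa, which corrects the raw agreement for the agreement expected by chance. The sketch below is an assumption for illustration only: `cohen_kappa` and the two label lists are hypothetical, and other measures (for example Fleiss' kappa for more annotators) may suit a given dataset better.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both annotators always use the same single label.
    """
    n = len(labels_a)
    # Fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled at random
    # while keeping their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical answers of two annotators to the same six items.
annotator_1 = ["benign", "benign", "malignant", "benign", "malignant", "benign"]
annotator_2 = ["benign", "malignant", "malignant", "benign", "malignant", "benign"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```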
  - Labeling Requirements: Requirement: STRING, ...

  Example:

      LabelingProcesses:
          Labeling process: skinLabeling
              Description: "Medical staff looking at the data and images and annotating the diagnosis"
              Type: Image and video annotations
              Labels:
                  Label: skinLabel
                      Description: "marked as beningnant or malignant"
                      Mapping: beningnant_malignant
              Labeling Team:
                  Who collects the data: "Internal Medical staff"
                  Type Internal
                  Country/Region: "Australia"
              Label Requirements:
                  Requirement: "1) Images containing any potentially identifying features, such as jewelry

- Preprocesses: Preprocessing steps applied to the data
  - Preprocess: ID. Machine-readable name of the preprocess
  - Type: ENUMERATE. Type of preprocess applied: 'Missing Values' | 'Data Augmentation' | 'Outlier Filtering' | 'Remove Duplicates' | 'Data reduction' | 'Sampling' | 'Data Normalization' | 'Others'
  - Description: STRING. Description of the preprocess
  - Social Issues: [SOCIAL ISSUES]. Relation of the preprocess with an already declared social issue instance
- Social Concerns:
  - Rationale: STRING. Rationale of the social concerns of the dataset
  - Social Issues: Social issues identified from the data
    - Social Issue: ID. Machine-readable name of the social issue
    - IssueType: ENUMERATE. Type of social concern: 'Privacy' | 'Bias' | 'Sensitive Data' | 'Social Impact'
    - Description: STRING. Description of the social issue
    - Related Attributes: attribute: [ATTRIBUTE]. Attributes related to the social issue
    - Instace belong to people:
      - Have sensitive attributes?: [Attribute], ... List of sensitive attributes
      - Are there protected groups?: ENUMERATE. (Yes, No, Unknown)
      - Might be offensive?: STRING. Is there offensive content in the dataset?

  Example:

      Social Concerns:
          Rationale: 'Dataset may not be representative of the real world data, and the convenience sample is not representative of general incidence of melanoma'
          Social Issue: raceRepresentative
              IssueType: Bias
              Description: "Dataset is not representative with respect to darker skin types"
              Related Attributes: attribute: ImageId

For any related question, please contact the authors at: [email protected]