This repository has been archived by the owner on Aug 3, 2022. It is now read-only.

Add data quality indicators to the UI #873

Open
pwalsh opened this issue Jan 4, 2017 · 10 comments

@pwalsh
Member

pwalsh commented Jan 4, 2017

As a {Product Owner}, I want to integrate quality assessments in the pages of a given dataset, as applicable, so I can start to highlight quality as an important dimension of analysis for the future.

  • Example: For a tabular dataset, run GoodTables over the dataset and generate a report that shows the quality of the dataset, even if this quality is not yet part of the ranking mechanism.
  • Example: for any dataset, ping the URLs at scheduled intervals (monthly?) to check the data is still available.
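The scheduled availability check in the second example could be sketched roughly as follows. This is only an illustration using the Python standard library: `is_due`, `check_url`, the 30-day interval, and the shape of the returned record are all assumptions made for this sketch, not part of the index codebase or of GoodTables.

```python
from datetime import datetime, timedelta, timezone
from urllib.error import URLError
from urllib.request import Request, urlopen

# Assumption: "monthly" is approximated as a fixed 30-day interval.
CHECK_INTERVAL = timedelta(days=30)


def is_due(last_checked, now, interval=CHECK_INTERVAL):
    """Return True if a URL was never checked or its last check is stale."""
    return last_checked is None or now - last_checked >= interval


def check_url(url, opener=urlopen, timeout=10):
    """Request `url` and record whether it answered and when we looked.

    `opener` is injectable so the function can be exercised without network
    access. A HEAD request is used to avoid downloading the data; some
    servers reject HEAD, so a real checker may need to fall back to GET.
    """
    checked_at = datetime.now(timezone.utc)
    try:
        with opener(Request(url, method="HEAD"), timeout=timeout) as response:
            available = 200 <= response.status < 400
    except (URLError, OSError, ValueError):
        # Covers HTTP errors, network failures, and malformed URLs.
        available = False
    return {"url": url, "available": available, "checked_at": checked_at}
```

A scheduler (cron, for instance) would call `check_url` for every source URL where `is_due` returns true, and the stored `checked_at` of the last successful check could feed a "last seen" date in the UI.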
@morchickit morchickit modified the milestones: Review Sprint, Index 2016 results website Jan 6, 2017
@morchickit
Contributor

We should have a call with @pwalsh and @brew about it

@georgiana-b

Depending on how you guys define quality in this context and how you decide to display it, you might find Data Quality CLI useful, so I'll drop a few words about it here.
Data Quality CLI uses GoodTables to assess the quality of a data package and gives back a quality score (effectively a percentage, because it ranges from 0 to 100) based on structure errors, schema errors and timeliness. As you will see in the README, it's very configurable and it already has integration for CKAN instances.
The point of Data Quality CLI was to generate the data for Data Quality Dashboard which displays the results of the quality analysis (the scores). You can see here the quality dashboard for Northern Ireland CKAN. The dashboard has an /embed route so you can include only the relevant bits in your page.
The main issue with the Data Quality Duo is that they haven't been brought up to date with the latest GoodTables API, so they would really benefit from contributions if you choose to use them.
If you have further questions about this, feel free to ask me. Good luck! ✌️

@georgiana-b georgiana-b removed their assignment Apr 6, 2017
@brew brew added the design label Apr 7, 2017
@brew
Collaborator

brew commented Apr 7, 2017

@morchickit We might also want @smth to look at this. The open question is how to associate each source location with a data quality 'badge' (or whatever is used to represent data quality).

@morchickit
Contributor

@georgiana-b Great idea. My only problem is that not all the links to the datasets lead directly to the files. How should we deal with that?

@georgiana-b

georgiana-b commented Apr 18, 2017

@morchickit Can you give me an example? I'm trying to understand what "dataset" means in this context.

@smth
Contributor

smth commented Apr 19, 2017

Given that this was added to the mockups last year (http://okfnlabs.org/index-mockup/entry/), what, if anything, is required of me here?

@dannylammerhirt
Contributor

We thought a tooltip is good to explain what GoodTables does/means. We were wondering if a link to the GoodTables website, or a short 1-2 sentence description would be helpful to explain what things like "GoodTables: Valid" or "Last seen: DATE" mean.

At the moment I have the impression this is not self-explanatory. What do you think, @smth?

@smth
Contributor

smth commented Apr 19, 2017

I would agree they are not self-explanatory (though nothing here really is). I think these badges should be clickable, and link to some sort of (external) GoodTables page.

@morchickit
Contributor

@georgiana-b See this example:
http://professionnels.ign.fr/geofla is the page that was linked from the index and describes the dataset, but
the data is actually accessed from this URL: http://professionnels.ign.fr/geofla#tab-3

Or this page from Israel, which links to other pages that in turn link to the dataset: https://foi.gov.il/he/search/site/?f%5B0%5D=im_field_mmdsubjects%3A367

@georgiana-b

georgiana-b commented Apr 19, 2017

So, after seeing the dataset examples and the mockup for this I have the following observations:

You have to discuss the conditions necessary to get that "Valid" badge. Since a dataset is made of several data files, it's probable that some of its files will be valid and some will not. How valid a dataset is depends on how valid each of its constituent files is.
For example, in the Data Quality Dashboard for UK spend, we consider a file valid only if it is 100% correct, so there are 0 valid files even though many come close, and the average correctness score is 46%.
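To make the difference between "number of valid files" and "average score" concrete, here is a minimal sketch of how per-file scores might be rolled up into a dataset-level summary. The function name, the result keys, and the sample scores are all made up for illustration; this is not how Data Quality CLI computes its numbers, only one plausible way to aggregate them.

```python
from statistics import mean


def summarize_dataset(file_scores, valid_threshold=100):
    """Roll up per-file quality scores (0-100) into a dataset summary.

    A file counts as valid only if its score reaches `valid_threshold`;
    the default of 100 mirrors the strict "100% correct" rule above.
    """
    if not file_scores:
        raise ValueError("a dataset needs at least one scored file")
    valid_files = sum(1 for score in file_scores if score >= valid_threshold)
    return {
        "files": len(file_scores),
        "valid_files": valid_files,
        "average_score": mean(file_scores),
        "all_valid": valid_files == len(file_scores),
    }


# Invented scores: no file is fully valid, yet the average is well above zero.
summary = summarize_dataset([95, 40, 3])
```

With the strict threshold this dataset would not get a "Valid" badge, while a lower threshold (or a badge based on `average_score`) would tell a different story, which is exactly the decision that has to be made.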

Whether you use Data Quality CLI or GoodTables directly, you have to transform those links into a standardized version of a dataset i.e. a DataPackage.
If you just send http://professionnels.ign.fr/geofla#tab-3 or http://professionnels.ign.fr/geofla to GoodTables, it will interpret it as an HTML page and thus an invalid file.
To get this automatic quality analysis, somebody will have to make a datapackage for each dataset, possibly using DQ-CLI's init & generate commands. GoodTables can assess datapackages if you use the datapackage preset.
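As a rough idea of what "making a datapackage" involves when done by hand, the sketch below builds a minimal Data Package descriptor from a list of direct file URLs using only the Python standard library. `make_descriptor` is a hypothetical helper written for this comment, not DQ-CLI's init/generate commands, and the file URL is a placeholder, since the landing pages above don't give direct links to the data files.

```python
import json
import posixpath
import re
from urllib.parse import urlparse


def make_descriptor(name, file_urls):
    """Build a minimal Data Package descriptor from direct data file URLs."""
    resources = []
    for url in file_urls:
        filename = posixpath.basename(urlparse(url).path)
        stem, extension = posixpath.splitext(filename)
        # Resource names are restricted to lowercase letters, digits,
        # ".", "_" and "-", so anything else is replaced with "-".
        slug = re.sub(r"[^a-z0-9._-]+", "-", stem.lower()).strip("-")
        resources.append({
            "name": slug or "resource",
            "path": url,
            "format": extension.lstrip(".").lower(),
        })
    return {"name": name, "resources": resources}


# Placeholder URL: it must point at the data file itself, not an HTML page.
descriptor = make_descriptor(
    "geofla", ["http://example.org/files/GEOFLA_Communes.csv"]
)
datapackage_json = json.dumps(descriptor, indent=2)
```

The resulting JSON would be saved as `datapackage.json` and handed to GoodTables with the datapackage preset. The hard part remains manual: someone still has to find the direct file URLs behind each landing page.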
