More edits.
arokem committed May 16, 2024
1 parent 34bae6b commit 2568099
Showing 2 changed files with 69 additions and 45 deletions.
99 changes: 54 additions & 45 deletions sections/01-introduction.qmd
@@ -9,54 +9,63 @@ machine learning techniques, these datasets can help us understand everything
from the cellular operations of the human body, through business transactions
on the internet, to the structure and history of the universe. However, the
development of new machine learning methods, and data-intensive discovery more
generally, rely heavily on the availability and usability of these large
datasets. Data can be openly available but still not useful if it cannot be
properly understood. In current conditions in which almost all of the relevant
data is stored in digital formats, and many relevant datasets can be found
through the communication networks of the world wide web, Findability,
Accessibility, Interoperability and Reusability (FAIR) principles for data
management and stewardship become critically important
\cite{Wilkinson2016FAIR}.
generally, rely heavily on Findability, Accessibility, Interoperability and
Reusability (FAIR) of data [@Wilkinson2016FAIR].

One of the main mechanisms through which these principles are promoted is the
development of \emph{standards} for data and metadata. Standards can vary in
the level of detail and scope, and encompass such things as \emph{file formats}
for the storing of certain data types, \emph{schemas} for databases that store
a range of data types, \emph{ontologies} to describe and organize metadata in a
One of the main mechanisms through which the FAIR principles are promoted is the
development of *standards* for data and metadata. Standards can vary in
the level of detail and scope, and encompass such things as *file formats*
for the storing of certain data types, *schemas* for databases that store
a range of data types, *ontologies* to describe and organize metadata in a
manner that connects it to field-specific meaning, as well as mechanisms to
describe \emph{provenance} of different data derivatives. The importance of
standards was underscored in a recent report by the Subcommittee on Open
Science of the National Science and Technology Council on "Desirable
characteristics of data repositories for federally funded research"
\cite{nstc2022desirable}. The report explicitly called out the importance of
"allow[ing] datasets and metadata to be accessed, downloaded, or exported from
the repository in widely used, preferably non-proprietary, formats consistent
with standards used in the disciplines the repository serves." This highlights
the need for data and metadata standards across a variety of different kinds of
data. In addition, a report from the National Institute of Standards and
Technology on "U.S. Leadership in AI: A Plan for Federal Engagement in
Developing Technical Standards and Related Tools" emphasized that --
specifically for the case of AI -- "U.S. government agencies should prioritize
AI standards efforts that are [...] Consensus-based, [...] Inclusive and
accessible, [...] Multi-path, [...] Open and transparent, [...] and [that]
Result in globally relevant and non-discriminatory standards..."
\cite{NIST2019}. The converging characteristics of standards that arise from
these reports suggest that considerable thought needs to be given to the manner
in which standards arise, so that these goals are achieved.
describe *provenance* of analysis products.

Standards for a specific domain can come about in various ways, but very
broadly speaking two kinds of mechanisms can generate a standard for a specific
type of data: (i) top-down: in this case a (usually) small group of people
develop the standard and disseminate it to the communities of interest with
very little input from these communities. An example of this mode of standards
development can occur when an instrument is developed by a manufacturer and
users of this instrument receive the data in a particular format that was
developed in tandem with the instrument; and (ii) bottom-up: in this case,
standards are developed by a larger group of people that convene and reach
consensus about the details of the standard in an attempt to cover a large
range of use-cases. Most standards are developed through an interplay between
these two modes, and understanding how to make the best of these modes is
critical in advancing the development of data and metadata standards.
The importance of standards stems not only from discussions within research
fields about how research can best be conducted to take advantage of existing
and growing datasets, but also arises from an ongoing series of policy
discussions that address the interactions between research communities and the
general public. In the United States, these discussions are reflected in memos
issued in 2013 and 2022 by the directors of the White House Office of Science
and Technology Policy (OSTP), John Holdren (2013) and Alondra Nelson (2022).
While these memos focused primarily on making peer-reviewed publications funded
by the US Federal government available to the general public, they also laid an
increasingly detailed path towards the publication and general availability of
the data that is collected as part of the research that is funded by the US
government.

The general guidance and overall spirit of these memos dovetail with more
specific policy discussions that add concrete detail to that guidance.
The importance of data and metadata standards, for example, was underscored in
a recent report by the Subcommittee on Open Science of the National Science and
Technology Council on the "Desirable characteristics of data repositories for
federally funded research" [@nstc2022desirable]. The report explicitly called
out the importance of "allow[ing] datasets and metadata to be accessed,
downloaded, or exported from the repository in widely used, preferably
non-proprietary, formats consistent with standards used in the disciplines the
repository serves." This highlights the need for data and metadata standards
across a variety of different kinds of data. In addition, a report from the
National Institute of Standards and Technology on "U.S. Leadership in AI: A
Plan for Federal Engagement in Developing Technical Standards and Related
Tools" emphasized that -- specifically for the case of AI -- "U.S. government
agencies should prioritize AI standards efforts that are [...] Consensus-based,
[...] Inclusive and accessible, [...] Multi-path, [...] Open and transparent,
[...] and [that] Result in globally relevant and non-discriminatory
standards..." [@NIST2019]. The converging characteristics of standards that
arise from these reports suggest that considerable thought needs to be given to
the manner in which standards arise, so that these goals are achieved.

Standards for a specific domain can come about in various ways. Broadly
speaking, two kinds of mechanisms can generate a standard for a specific type of
data: (i) top-down: in this case a (usually) small group of people develop the
standard and disseminate it to the communities of interest with very little
input from these communities. An example of this mode of standards development
can occur when an instrument is developed by a manufacturer and users of this
instrument receive the data in a particular format that was developed in tandem
with the instrument; and (ii) bottom-up: in this case, standards are developed
by a larger group of people that convene and reach consensus about the details
of the standard in an attempt to cover a large range of use-cases. Most
standards are developed through an interplay between these two modes, and
understanding how to make the best of these modes is critical in advancing the
development of data and metadata standards.

One source of inspiration for bottom-up development of robust, adaptable and
useful standards comes from open-source software (OSS). OSS has a long history
15 changes: 15 additions & 0 deletions sections/03-recommendations.qmd
@@ -1,5 +1,6 @@


<<<<<<< HEAD
## Funding or Grantmaking entities:

### Fund Data Standards Development
@@ -57,5 +58,19 @@ Development of standards should be coupled with development of associated softwa
Additionally, standards evolution should maintain software compatibility, and ability to translate and migrate between standards.


=======
1. Training for data stewards and career paths that encourage this role.
2. Development of meta-standards or standards-of-standards. These are descriptions of cross-cutting best practices. These can be used as a basis of the analysis or assessment of an existing standard, or as guidelines to develop new standards.
3. Recommend pathways or lifecycles for successful data standards. Include process, creators, affiliations, grants, and adoption journeys. Make this documentation step integral to the work of standards creators and granting agencies.
4. Retroactively document #3 for standards such as CF (climate science), NASA GeneLab (space omics), OpenGIS (geospatial), DICOM (medical imaging), GA4GH (genomics), FITS (astronomy), Zarr (domain agnostic n-dimensional arrays)... ?
5. Create ontology for standards process such as top down vs bottom up, minimum number of datasets, and community size. Examine schema.org (w3c), PEP (Python), CDISC (FDA).
6. Amplify formalization/guidelines on how to create standards (example metadata schema specifications using https://linkml.io).
7. Make data standards machine readable, and software creation an integral part of establishing a standard's schema e.g. identifiers for a person using CFF in citations. cffconvert software makes the CFF standard usable and useful.
8. Survey and document failures of current standards for a specific dataset / domain before establishing a new one. Use resources such as FAIRsharing.org or the Digital Curation Centre https://www.dcc.ac.uk/guidance/standards.
9. Funding agencies and science communities need to establish governance for standards creation and adoption (cite https://www.theopensourceway.org/the_open_source_way-guidebook-2.0.html#_project_and_community_governance).
10. Cross-sector alliances, such as those between industry and academia, need closer coordination and alignment of pace through strong program management (for instance via OSPO efforts).
11. Multi company partnerships should include strategic initiatives for standard establishment (example https://www.pistoiaalliance.org/news/press-release-pistoia-alliance-launches-idmp-1-0/).
12. Stakeholder organizations should invest in training grants to establish curriculum for data and metadata standards education.
>>>>>>> 8cb3f6b (More edits.)
