Commit 53e6a7c

thclark authored and gitbook-bot committed
GitBook: [#10] No subject
1 parent 4890eaf commit 53e6a7c

11 files changed

+165
-6
lines changed

.gitbook/assets/example pipeline with jsonschema.svg

+16

.gitbook/assets/preprocessing-pipeline.svg

+16

SUMMARY.md

+2
Original file line numberDiff line numberDiff line change
@@ -12,13 +12,15 @@
1212
* [FAIR principles](terms/fair-principles.md)
1313
* [Metadata](terms/live-edit-and-locked-edits.md)
1414
* [Ontology](terms/ontology.md)
15+
* [Pipeline](terms/pipeline.md)
1516
* [Pragmatics](terms/pragmatics.md)
1617
* [Property](terms/property.md)
1718
* [Schema](terms/schema.md)
1819
* [Schema Language](terms/schema-language.md)
1920
* [Semantics](terms/collections.md)
2021
* [Syntax](terms/the-gitbook-editor.md)
2122
* [Taxonomy](terms/taxonomy.md)
23+
* [Transformation](terms/transformation.md)
2224
* [Uniform Resource Identifier](terms/uniform-resource-identifier.md)
2325
* [Vocabulary](terms/change-requests.md)
2426

terms/live-edit-and-locked-edits.md

+1-2
@@ -33,6 +33,5 @@ The descriptive information about data.
3333
Information about how the data is structured.
3434

3535
{% hint style="info" %}
36-
**Example:** Format, size, file type, properties, reference to a schema
36+
**Example:** Format, size, file type, properties, reference to a [schema](schema.md)
3737
{% endhint %}
38-

terms/pipeline.md

+3
@@ -0,0 +1,3 @@
1+
# Pipeline
2+
3+
\<placeholder>

terms/property.md

+2-2
@@ -1,7 +1,7 @@
11
# Property
22

3-
An aspect, feature, characteristic or parameter associated with a class or an object of a given class.
3+
An aspect, feature, characteristic or parameter associated with a [class](class.md) or an object of a given class.
44

55
{% hint style="warning" %}
6-
In some programming languages, there is a significant difference between an attribute and a property (an attribute in python being a formal class member, for example), but for the the majority of the Task 43 work, they are broadly synonymous, and which is used depends on the context (for example, when writing a JSON schema you'd use the "property" terminology).&#x20;
6+
In some programming languages, there is a significant difference between an attribute and a property (an attribute in Python being a formal class member, for example), but for the majority of the Task 43 work, they are broadly synonymous, and which is used depends on the context (for example, when writing a JSON schema you'd use the "property" terminology).
77
{% endhint %}

terms/schema.md

+80-2
@@ -1,7 +1,85 @@
11
# Schema
22

3-
## TODO - copy from google docs
3+
A schema is a "blueprint" of what data looks like. More formally, it's an expression of descriptive and structural [metadata](live-edit-and-locked-edits.md) with defined [semantics](collections.md). A schema is a powerful communication tool, as it provides a clear and well-encapsulated expression of what data you have (or need).
4+
5+
By "defined semantics" we mean that it is expressed in a particular [schema language](schema-language.md), the choice of which can be highly nuanced depending on your application.
6+
7+
{% hint style="success" %}
8+
**Purpose**
9+
10+
Schema allow you to:
11+
12+
* de-risk projects involving data,
13+
* build robust data [transformations](transformation.md) and [pipelines](pipeline.md), and
14+
* communicate with other stakeholders about the data you require (or provide).
15+
{% endhint %}
16+
17+
{% hint style="info" %}
18+
**Tip**
19+
20+
The **pure act of writing down what's in the data** is far more important than the selection of [schema language](schema-language.md).
21+
22+
Schemas can (in general) be translated between languages relatively easily; **the real value lies** in getting the data structure and description written down and communicated in the first place.
23+
{% endhint %}
424

525
{% hint style="warning" %}
6-
In philosophy, a schema is a representation of a plan or theory in the form of an outline or model: "a schema of scientific reasoning". This is of course a much broader definition than we use for our purposes here.
26+
**Disambiguation**
27+
28+
In philosophy, a schema is a representation of a plan or theory in the form of an outline or model: "a schema of scientific reasoning". This is a much broader definition than we use for our purposes here.
729
{% endhint %}
30+
31+
## Example uses for schema
32+
33+
{% tabs %}
34+
{% tab title="Project management (!!!)" %}
35+
_Yes, you read this tab label correctly. Schema are an incredibly powerful de-risking tool for project management!_
36+
37+
In any data-driven project with several stakeholders, many months can be spent in communication on how teams are going to work together. Example chunks of data are sent back and forth in CSV or Excel files, and it's frequently unclear what expectations are and where boundaries lie between teams. The meanings of particular columns are queried by email, just when people go out on holiday. Things need fixing when it turns out the real data is a bit different. And so on, and so on, as **the critical path grows ever longer...**
38+
39+
****
40+
41+
<figure><img src="../.gitbook/assets/Screenshot 2022-12-15 at 11.38.42.png" alt=""><figcaption><p>Gantt charts have a habit of shifting ever-right in digitalisation projects. A perfect way of getting dependencies under control, and preventing teams from blocking one another, is to workshop a set of schema at each of the boundaries at the beginning of the project.</p></figcaption></figure>
42+
43+
At the beginning of such projects, defining a set of schema at the boundaries where data is exchanged between teams:
44+
45+
* Clarifies initial expectations
46+
* Encourages disciplined, effective communication between the teams as the project and its data evolve
47+
48+
This works well even if you don't yet have a clear understanding of your data (when schema are little more than a wild guess!), because it introduces a framework for communication at the start.
49+
{% endtab %}
50+
51+
{% tab title="Data validation" %}
52+
For many kinds of [data transformation](transformation.md), it's imperative that the input data has particular characteristics. For example, if you run a wind resource analysis, your input data must contain at least some information about the wind at a site!
53+
54+
> But what if it doesn't? Or it's not in the right form?
55+
56+
This is where schema come in. If you have a schema then you can use it to:
57+
58+
* Check that the incoming data is valid
59+
* Issue coherent and useful error messages if it isn't.
60+
61+
This is in contrast to accepting any kind of data, then behaving in undefined ways depending on what's in the data. Hence, it allows you to build robust data transformations more rapidly.
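
As a minimal sketch of the two steps above, the following hand-rolled checker validates a record against a small schema and collects readable error messages. Everything named here is an assumption made for illustration: the schema layout (a dict with `required` and `properties` keys, loosely modelled on JSON Schema), the `timestamp` and `wind_speed` properties, and the `validate` function. A real project would more likely rely on an established validator for its chosen schema language.

```python
# Hypothetical schema for one wind measurement, loosely modelled on JSON Schema.
WIND_RECORD_SCHEMA = {
    "required": ["timestamp", "wind_speed"],
    "properties": {
        "timestamp": {"type": str},
        "wind_speed": {"type": float, "minimum": 0.0},
    },
}


def validate(record, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # Check that every required property is present.
    for name in schema["required"]:
        if name not in record:
            errors.append(f"missing required property '{name}'")
    # Check the type (and optional minimum) of each property that is present.
    for name, rules in schema["properties"].items():
        if name not in record:
            continue
        value = record[name]
        expected = rules["type"]
        if not isinstance(value, expected):
            errors.append(
                f"property '{name}' should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(
                f"property '{name}' must be >= {rules['minimum']}, got {value}"
            )
    return errors


good = {"timestamp": "2022-12-15T11:00:00Z", "wind_speed": 7.2}
bad = {"wind_speed": "fast"}

print(validate(good, WIND_RECORD_SCHEMA))  # an empty list: nothing to report
for problem in validate(bad, WIND_RECORD_SCHEMA):
    print(problem)
```

Returning a list of messages, rather than raising on the first failure, lets the caller report every problem with the incoming data in one go.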
62+
63+
<figure><img src="../.gitbook/assets/preprocessing-pipeline.svg" alt=""><figcaption><p>An example architecture diagram, in which a schema is used to validate data and return errors to the user before accepting data for processing, launching a <a href="pipeline.md">pipeline</a> for <a href="transformation.md">transformation</a> with dataflow, then persisting the output to a database.</p></figcaption></figure>
64+
{% endtab %}
65+
66+
{% tab title="Database" %}
67+
A schema for a relational database (such as [PostgreSQL](https://www.postgresql.org/)) describes how data is stored, by specifying tables, columns, column types and relations.
68+
69+
Database schema are generally required to be in a language specific to the type of database.
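
A small sketch of what such a schema can look like is given below. It uses Python's built-in `sqlite3` module with an in-memory SQLite database rather than PostgreSQL, purely so that it runs without a server. The `site` and `measurement` tables and their columns are hypothetical.

```python
import sqlite3

# Hypothetical schema: two tables, typed columns, and one relation between them.
SCHEMA_SQL = """
CREATE TABLE site (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE measurement (
    id         INTEGER PRIMARY KEY,
    site_id    INTEGER NOT NULL REFERENCES site (id),
    wind_speed REAL NOT NULL
);
"""

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
connection.executescript(SCHEMA_SQL)

connection.execute("INSERT INTO site (id, name) VALUES (1, 'Example site')")
connection.execute("INSERT INTO measurement (site_id, wind_speed) VALUES (1, 7.2)")

try:
    # Site 99 does not exist, so the relation declared in the schema is violated.
    connection.execute(
        "INSERT INTO measurement (site_id, wind_speed) VALUES (99, 3.1)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The database itself enforces the schema here: the second measurement is refused because it refers to a site that is not in the `site` table.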
70+
71+
<figure><img src="../.gitbook/assets/database schema example.png" alt=""><figcaption><p>Graphical representation of a simple SQL database schema, from <a href="https://www.codecademy.com/learn/how-do-i-make-and-populate-my-own-database/modules/designing-a-database-schema/cheatsheet">Codecademy</a>.</p></figcaption></figure>
72+
{% endtab %}
73+
{% endtabs %}
74+
75+
76+
77+
## Standardisation and evolution
78+
79+
Industrial standards frequently emerge specifying data contents, often for stable industrial systems whose parameters and data outputs are well known.
80+
81+
82+
83+
84+
85+
\

terms/transformation.md

+45
@@ -0,0 +1,45 @@
1+
# Transformation
2+
3+
An operation in which data is transformed from one form to another. The resulting form is entirely general but may be (for example):
4+
5+
* A re-expression of the same fundamental data (eg rewriting with a different data structure or file format)
6+
* A combination of data sources (eg add wind direction to wind speed, to form a combined time series)
7+
* A result of the application of an algorithm (eg getting mean wind speed by averaging a time series)
8+
* An expression or visualisation (eg an HTML file to display a chart of wind speed time series and a table of monthly averages)
9+
10+
Data transformation as a term is often used with reference to [pipelines](pipeline.md), which can be used to execute an arbitrary number and sequence of transformations.
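
To make the idea concrete, here is a rough sketch in which each transformation is a plain function and a pipeline is an ordered list of them. The function names (`combine`, `extract_speeds`, `run_pipeline`) and the sample values are invented for illustration. Real pipeline tooling typically adds scheduling, error handling and persistence on top of this basic shape.

```python
from statistics import mean


def combine(speeds, directions):
    """Combination: pair each wind speed with the direction at the same index."""
    return [{"speed": s, "direction": d} for s, d in zip(speeds, directions)]


def extract_speeds(series):
    """Re-expression: pull the speed values back out of the combined series."""
    return [row["speed"] for row in series]


def run_pipeline(data, steps):
    """Apply each transformation in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data


speeds = [4.0, 6.0, 8.0]      # m/s, made-up values
directions = [180, 190, 200]  # degrees, made-up values

series = combine(speeds, directions)
# Two chained transformations: extract the speeds, then average them.
mean_speed = run_pipeline(series, [extract_speeds, mean])
print(mean_speed)
```

Because every step has the same shape (data in, data out), steps can be reordered, replaced or tested in isolation.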
11+
12+
{% hint style="warning" %}
13+
**Disambiguation**
14+
15+
The term **"Transformation"** could be perceived as meaning **"Translation"** of data (eg a re-expression of the same fundamental data in a different way). Translation, however, represents only one of the many kinds of operation that could be a transformation [(see below)](transformation.md#kinds-of-transformation).
16+
17+
18+
{% endhint %}
19+
20+
## Kinds of transformation
21+
22+
In the discussion that led to this entry, a range of alternative terms to "transformation" was considered, with some participants viewing other terms as more helpful. "Data Analysis" and "Data Translation" in particular were suggested.
23+
24+
However, such terms **have a connotation regarding the purpose of the transformation**, which can result in confusion when attempting to think clearly about the data engineering required to implement a transformation, particularly since the output of a given transformation might have multiple uses.
25+
26+
In general though, it may be helpful to have a shorthand to discuss different specific kinds of transformation. Corresponding with the above examples:
27+
28+
* A **translation** might constitute a re-expression of the same fundamental data in a different form
29+
* A **view** might constitute the creation of a bridge between two data tables, combining two data sources to appear as one
30+
* An **analysis** might constitute application of some scientific algorithm to extract results or insight (which themselves constitute the output data of the transformation)
31+
* A **render** might constitute the expression or visualisation of data in a way which may not be well suited for ongoing use in an automated pipeline, but is good for human consumption (like an automated report).
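
The difference between an **analysis** and a **render** can be sketched as two separate functions, one producing structured output data and the other producing something for people to read. The names `analyse` and `render`, the summary keys and the sample values are assumptions made for this example only.

```python
from html import escape
from statistics import mean


def analyse(speeds):
    """Analysis: reduce a time series to summary values (structured output data)."""
    return {"mean_speed": mean(speeds), "max_speed": max(speeds)}


def render(summary):
    """Render: express the summary as an HTML table for human consumption."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{value:.1f}</td></tr>"
        for name, value in summary.items()
    )
    return f"<table>{rows}</table>"


summary = analyse([4.0, 6.0, 8.0])  # made-up wind speeds in m/s
report = render(summary)
print(report)
```

The `summary` dict remains usable by further automated steps, whereas the HTML `report` is intended only for display.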
32+
33+
{% hint style="danger" %}
34+
**Why use the word "might"?**
35+
36+
The boundaries blur between these items, particularly because it can't be known _a priori_ how the output of any given transformation will be used.
37+
38+
* For example, using a [Jupyter notebook](https://jupyter.org/) to average some data and plot the result would combine algorithm implementation with visualisation. This would be colloquially called a "data analysis", but when moving to a production scenario, averaging of the data and the visualisation of the result would likely be in entirely different domains.
39+
{% endhint %}
40+
41+
{% hint style="info" %}
42+
**Tip**
43+
44+
Thinking simply in terms of transformations (as opposed to the different kinds and their purposes) will help you think more objectively about the separation of concerns between, and the role of, each step in a pipeline.
45+
{% endhint %}
