- Overview
- Delta Sharing Specification
- Profile File Format
Delta Sharing is an open protocol for secure real-time exchange of large datasets, which enables secure data sharing across products for the first time. It is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer large datasets.
With Delta Sharing, the user accessing shared data can directly connect to it through Pandas, Tableau, or dozens of other systems that implement the open protocol, without having to deploy a specific platform first. This reduces their access time from months to minutes, and makes life dramatically simpler for data providers who want to reach as many users as possible.
This document is a specification for the Delta Sharing Protocol, which defines the REST APIs and the formats of messages used by any clients and servers to exchange data.
- Share: A share is a logical grouping to share with recipients. A share can be shared with one or multiple recipients. A recipient can access all resources in a share. A share may contain multiple schemas.
- Schema: A schema is a logical grouping of tables. A schema may contain multiple tables.
- Table: A table is a Delta Lake table or a view on top of a Delta Lake table.
- Recipient: A principal that has a bearer token to access shared tables.
- Sharing Server: A server that implements this protocol.
Here are the list of APIs used by Delta Sharing Protocol. All of the REST APIs use bearer tokens for authorization. The {prefix}
of each API is configurable, and servers hosted by different providers may pick up different prefixes.
This is the API to list shares accessible to a recipient.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares |
Query Parameters | maxResults (type: Int, optional): The maximum number of results per page that should be returned. If the number of available results is larger than maxResults , the response will provide a nextPageToken that can be used to get the next page of results in subsequent list requests. The server may return fewer than maxResults items even if there are more available. The client should check nextPageToken in the response to determine if there are more available.pageToken (type: String, optional): Specifies a page token to use. Set pageToken to the nextPageToken returned by a previous list request to get the next page of results. nextPageToken will not be returned in a response if there are no more results available. |
Response Header | Content-Type: application/json; charset=utf-8 |
Response Body | {Note: 1. The items field may be an empty array or missing when no results are found. The client must handle both cases. 2. The id field is optional. If id is populated for a Share, its value should be unique across the sharing server and immutable through the Share's lifecycle. |
Example:
GET {prefix}/shares?maxResults=10&pageToken=...
{
"items" : [
{
"name" : "vaccine_share",
"id" : "edacc4a7-6600-4fbb-85f3-a62a5ce6761f"
},
{
"name" : "sales_share",
"id" : "3e979c79-6399-4dac-bcf8-54e268f48515"
}
],
"nextPageToken" : "..."
}
This is the API to get the metadata of a Share.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share} |
URL Parameters | {share}: The share name to query. It's case-insensitive. |
Response Header | Content-Type: application/json; charset=utf-8 |
Response Body | {Note: The id field is optional. If id is populated for a Share, its value should be unique across the sharing server and immutable through the Share's lifecycle. |
Example:
GET {prefix}/shares/{share}
{
"share": {
"name" : "vaccine_share",
"id" : "edacc4a7-6600-4fbb-85f3-a62a5ce6761f"
}
}
This is the API to list schemas in a share.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/schemas |
URL Parameters | {share}: The share name to query. It's case-insensitive. |
Query Parameters | maxResults (type: Int, optional): The maximum number of results per page that should be returned. If the number of available results is larger than maxResults , the response will provide a nextPageToken that can be used to get the next page of results in subsequent list requests. The server may return fewer than maxResults items even if there are more available. The client should check nextPageToken in the response to determine if there are more available.pageToken (type: String, optional): Specifies a page token to use. Set pageToken to the nextPageToken returned by a previous list request to get the next page of results. nextPageToken will not be returned in a response if there are no more results available. |
Response Header | Content-Type: application/json; charset=utf-8 |
Response Body | {Note: the items field may be an empty array or missing when no results are found. The client must handle both cases. |
Example:
GET {prefix}/shares/vaccine_share/schemas?maxResults=10&pageToken=...
{
"items" : [
{
"name" : "acme_vaccine_data",
"share" : "vaccine_share"
}
],
"nextPageToken" : "..."
}
This is the API to list tables in a schema.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/schemas/{schema}/tables |
URL Parameters | {share}: The share name to query. It's case-insensitive. {schema}: The schema name to query. It's case-insensitive. |
Query Parameters | maxResults (type: Int, optional): The maximum number of results per page that should be returned. If the number of available results is larger than maxResults , the response will provide a nextPageToken that can be used to get the next page of results in subsequent list requests. The server may return fewer than maxResults items even if there are more available. The client should check nextPageToken in the response to determine if there are more available.pageToken (type: String, optional): Specifies a page token to use. Set pageToken to the nextPageToken returned by a previous list request to get the next page of results. nextPageToken will not be returned in a response if there are no more results available. |
Response Header | Content-Type: application/json; charset=utf-8 |
Response Body | {Note: 1. the items field may be an empty array or missing when no results are found. The client must handle both cases. 2. The id field is optional. If id is populated for a table, its value should be unique within the share it belongs to and immutable through the table's lifecycle. 3. The shareId field is optional. If shareId is populated for a table, its value should be unique across the sharing server and immutable through the table's lifecycle. |
Example:
GET {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables?maxResults=10&pageToken=...
{
"items" : [
{
"share" : "vaccine_share",
"schema" : "acme_vaccine_data",
"name" : "vaccine_ingredients",
"shareId": "edacc4a7-6600-4fbb-85f3-a62a5ce6761f",
"id": "dcb1e680-7da4-4041-9be8-88aff508d001"
},
{
"share" : "vaccine_share",
"schema" : "acme_vaccine_data",
"name" : "vaccine_patients",
"shareId": "edacc4a7-6600-4fbb-85f3-a62a5ce6761f",
"id": "c48f3e19-2c29-4ea3-b6f7-3899e53338fa"
}
],
"nextPageToken" : "..."
}
This is the API to list all the tables under all schemas in a share.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/all-tables |
URL Parameters | {share}: The share name to query. It's case-insensitive. |
Query Parameters | maxResults (type: Int, optional): The maximum number of results per page that should be returned. If the number of available results is larger than maxResults , the response will provide a nextPageToken that can be used to get the next page of results in subsequent list requests. The server may return fewer than maxResults items even if there are more available. The client should check nextPageToken in the response to determine if there are more available.pageToken (type: String, optional): Specifies a page token to use. Set pageToken to the nextPageToken returned by a previous list request to get the next page of results. nextPageToken will not be returned in a response if there are no more results available. |
Response Header | Content-Type: application/json; charset=utf-8 |
Response Body | {Note: 1. the items field may be an empty array or missing when no results are found. The client must handle both cases.2. The id field is optional. If id is populated for a table, its value should be unique within the share it belongs to and immutable through the table's lifecycle. 3. The shareId field is optional. If shareId is populated for a table, its value should be unique across the sharing server and immutable through the table's lifecycle. |
Example:
GET {prefix}/shares/vaccine_share/all-tables
{
"items" : [
{
"share" : "vaccine_share",
"schema" : "acme_vaccine_ingredient_data",
"name" : "vaccine_ingredients",
"shareId": "edacc4a7-6600-4fbb-85f3-a62a5ce6761f",
"id": "2f9729e9-6fcf-4d34-96df-bf72b26dfbe9"
},
{
"share" : "vaccine_share",
"schema" : "acme_vaccine_patient_data",
"name" : "vaccine_patients",
"shareId": "edacc4a7-6600-4fbb-85f3-a62a5ce6761f",
"id": "74be6365-0fc8-4a2f-8720-0de125bb5832"
}
],
"nextPageToken" : "..."
}
This is the API for clients to get a table version without any other extra information. The server usually can implement this API effectively. If a client caches information about a shared table locally, it can store the table version and use this cheap API to quickly check whether their cache is stale and they should re-fetch the data.
HTTP Parameters | Value |
---|---|
HTTP Method | HEAD |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/schemas/{schema}/tables/{table} |
URL Parameters | {share}: The share name to query. It's case-insensitive. {schema}: The schema name to query. It's case-insensitive. {table}: The table name to query. It's case-insensitive. |
Response Header | Delta-Table-Version: {version} {version} is a long value which represents the current table version. |
Response Body | Empty |
Example:
HEAD {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables/vaccine_patients
HTTP/2 200
delta-table-version: 123
This is the API for clients to query the table schema and other metadata.
HTTP Parameters | Value |
---|---|
HTTP Method | GET |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/schemas/{schema}/tables/{table}/metadata |
URL Parameters | {share}: The share name to query. It's case-insensitive. {schema}: The schema name to query. It's case-insensitive. {table}: The table name to query. It's case-insensitive. |
Response Header | Content-Type: application/x-ndjson Delta-Table-Version: {version} {version} is a long value which represents the current table version. |
Response Body | A sequence of JSON strings delimited by newline. Each line is a JSON object defined in Table Metadata Format. The response contains two lines: - The first line is a JSON wrapper object containing the table Protocol object. - The second line is a JSON wrapper object containing the table Metadata object. |
Example (See Table Metadata Format for more details about the format):
GET {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables/vaccine_patients/metadata
HTTP/2 200
content-type: application/x-ndjson; charset=utf-8
delta-table-version: 123
{"protocol":{"minReaderVersion":1}}
{"metaData":{"id":"f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2","format":{"provider":"parquet"},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":["date"]}}
This is the API for clients to read data from a table.
HTTP Parameters | Value |
---|---|
HTTP Method | POST |
Header | Authorization: Bearer {token} |
URL | {prefix}/shares/{share}/schemas/{schema}/tables/{table}/query |
URL Parameters | {share}: The share name to query. It's case-insensitive. {schema}: The schema name to query. It's case-insensitive. {table}: The table name to query. It's case-insensitive. |
Request Header | Content-Type: application/json; charset=utf-8 |
Request Body | { See below for more details. |
Response Header | Content-Type: application/x-ndjson Delta-Table-Version: {version} {version} is a long value which represents the current table version. |
Response Body | A sequence of JSON strings delimited by newline. Each line is a JSON object defined in Table Metadata Format. The response contains multiple lines: - The first line is a JSON wrapper object containing the table Protocol object. - The second line is a JSON wrapper object containing the table Metadata object. - The rest of the lines are JSON wrapper objects for files in the table. |
The response body should be a JSON string containing the following two optional fields:
-
predicateHints (type: String, optional): a list of SQL boolean expressions using a restricted subset of SQL, in a JSON array.
- When it’s present, the server will use the provided predicates as a hint to apply the SQL predicates on the returned files.
- Filtering files based on the SQL predicates is BEST EFFORT. The server may return files that don’t satisfy the predicates.
- If the server fails to parse one of the SQL predicates, or fails to evaluate it, the server may skip it.
- Predicate expressions are conjunctive (AND-ed together).
- When it’s absent, the server will return all of files in the table.
- When it’s present, the server will use the provided predicates as a hint to apply the SQL predicates on the returned files.
-
limitHint (type: Int, optional): an optional limit number. It’s a hint from the client to tell the server how many rows in the table the client plans to read. The server can use this hint to return only some files by using the file stats. For example, when running
SELECT * FROM table LIMIT 1000
, the client can setlimitHint
to1000
.- Applying
limitHint
is BEST EFFORT. The server may return files containing more rows than the client requests.
- Applying
When predicateHints
and limitHint
are both present, the server should apply predicateHints
first then limitHint
. As these two parameters are hints rather than enforcement, the client must always apply predicateHints
and limitHint
on the response returned by the server if it wishes to filter and limit the returned data. An empty JSON object ({}
) should be provided when these two parameters are missing.
Example (See Table Metadata Format for more details about the format):
POST {prefix}/shares/vaccine_share/schemas/acme_vaccine_data/tables/vaccine_patients/query
{
"predicateHints" : [
"date >= '2021-01-01'",
"date <= '2021-01-31'"
],
"limitHint": 1000
}
HTTP/2 200
content-type: application/x-ndjson; charset=utf-8
delta-table-version: 123
{"protocol":{"minReaderVersion":1}}
{"metaData":{"id":"f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2","format":{"provider":"parquet"},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":["date"]}}
{"file":{"url":"https://<s3-bucket-name>.s3.us-west-2.amazonaws.com/delta-exchange-test/table2/date%3D2021-04-28/part-00000-8b0086f2-7b27-4935-ac5a-8ed6215a6640.c000.snappy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210501T010516Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=AKIAISZRDL4Q4Q7AIONA%2F20210501%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=97b6762cfd8e4d7e94b9d707eff3faf266974f6e7030095c1d4a66350cfd892e","id":"8b0086f2-7b27-4935-ac5a-8ed6215a6640","partitionValues":{"date":"2021-04-28"},"size":573,"stats":"{\"numRecords\":1,\"minValues\":{\"eventTime\":\"2021-04-28T23:33:57.955Z\"},\"maxValues\":{\"eventTime\":\"2021-04-28T23:33:57.955Z\"},\"nullCount\":{\"eventTime\":0}}"}}
{"file":{"url":"https://<s3-bucket-name>.s3.us-west-2.amazonaws.com/delta-exchange-test/table2/date%3D2021-04-28/part-00000-591723a8-6a27-4240-a90e-57426f4736d2.c000.snappy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210501T010516Z&X-Amz-SignedHeaders=host&X-Amz-Expires=899&X-Amz-Credential=AKIAISZRDL4Q4Q7AIONA%2F20210501%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=0f7acecba5df7652457164533a58004936586186c56425d9d53c52db574f6b62","id":"591723a8-6a27-4240-a90e-57426f4736d2","partitionValues":{"date":"2021-04-28"},"size":573,"stats":"{\"numRecords\":1,\"minValues\":{\"eventTime\":\"2021-04-28T23:33:48.719Z\"},\"maxValues\":{\"eventTime\":\"2021-04-28T23:33:48.719Z\"},\"nullCount\":{\"eventTime\":0}}"}}
This section discusses the table metadata format returned by the server.
The JSON object in each line is a wrapper object that may contain the following fields:
Field Name | Data Type | Description | Optional/Required |
---|---|---|---|
protocol | The Protocol JSON object. | Defines the versioning information about the table metadata format. | Optional |
metaData | The Metadata JSON object. | The table metadata including schema, partitionColumns, etc. | Optional |
file | The File JSON object. | An individual data file in the table. | Optional |
It must contain only ONE of the above fields.
Protocol versioning will allow servers to exclude older clients that are missing features required to correctly interpret their response if the Delta Sharing Protocol evolves in the future. The protocol version will be increased whenever non-forward-compatible changes are made to the protocol. When a client is running an unsupported protocol version, it should show an error message instructing the user to upgrade to a newer version of their client.
Since only breaking changes must be accompanied by an increase in the protocol version recorded in a table, clients can assume that any unrecognized fields in a response that supports their reader version are never required in order to correctly interpret the response.
Field Name | Data Type | Description | Optional/Required |
---|---|---|---|
minReaderVersion | Int | The minimum version of the protocol that a client must implement in order to correctly read a Delta Lake table. Currently it’s always 1 . It will be changed in future when we introduce non-forward-compatible changes that require clients to implement. |
Required |
Example (for illustration purposes; each JSON object must be a single line in the response):
{
"protocol" : {
"minReaderVersion" : 1
}
}
Field Name | Data Type | Description | Optional/Required |
---|---|---|---|
id | String | Unique identifier for this table | Required |
name | String | User-provided identifier for this table | Optional |
description | String | User-provided description for this table | Optional |
format | Format Object | Specification of the encoding for the files stored in the table. | Required |
schemaString | String | Schema of the table. This is a serialized JSON string which can be deserialized to a Schema Object. | Required |
partitionColumns | Array | An array containing the names of columns by which the data should be partitioned. When a table doesn’t have partition columns, this will be an empty array. | Required |
Example (for illustration purposes; each JSON object must be a single line in the response):
{
"metaData" : {
"partitionColumns" : [
"date"
],
"format" : {
"provider" : "parquet"
},
"schemaString" : "{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}",
"id" : "f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2"
}
}
Field Name | Data Type | Description | Optional/Required |
---|---|---|---|
url | String | A https url that a client can use to read the file directly. The same file in different responses may have different urls. | Required |
id | String | A unique string for the file in a table. The same file is guaranteed to have the same id across multiple requests. A client may cache the file content and use this id as a key to decide whether to use the cached file content. | Required |
partitionValues | Map<String, String> | A map from partition column to value for this file. See Partition Value Serialization for how to parse the partition values. When the table doesn’t have partition columns, this will be an empty map. | Required |
size | Long | The size of this file in bytes. | Required |
stats | String | Contains statistics (e.g., count, min/max values for columns) about the data in this file. This field may be missing. A file may or may not have stats. This is a serialized JSON string which can be deserialized to a Statistics Struct. A client can decide whether to use stats or drop it. | Optional |
Example (for illustration purposes; each JSON object must be a single line in the response):
{
"file" : {
"url" : "https://<s3-bucket-name>.s3.us-west-2.amazonaws.com/delta-exchange-test/table2/date%3D2021-04-28/part-00000-591723a8-6a27-4240-a90e-57426f4736d2.c000.snappy.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210501T010655Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=AKIAISZRDL4Q4Q7AIONA%2F20210501%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=dd5d3ba1a179dc7e239d257feed046dccc95000d1aa0479ea6ff36d10d90ec94",
"id" : "591723a8-6a27-4240-a90e-57426f4736d2",
"size" : 573,
"partitionValues" : {
"date" : "2021-04-28"
},
"stats" : "{\"numRecords\":1,\"minValues\":{\"eventTime\":\"2021-04-28T23:33:48.719Z\"},\"maxValues\":{\"eventTime\":\"2021-04-28T23:33:48.719Z\"},\"nullCount\":{\"eventTime\":0}}"
}
}
Field Name | Data Type | Description | Optional/Required |
---|---|---|---|
provider | String | Name of the encoding for data files in this table. Currently the only valid value is parquet . This is to enable possible support for other formats in the future. |
Required |
We use a subset of Spark SQL's JSON Schema representation to represent the schema of a table. A reference implementation can be found in the Catalyst package of the Apache Spark repository.
A struct is used to represent both the top-level schema of the table as well as struct columns that contain nested columns. A struct is encoded as a JSON object with the following fields:
Field Name | Description |
---|---|
type | Always the string struct |
fields | An array of fields |
A struct field represents a top-level or nested column.
Field Name | Description |
---|---|
name | Name of this (possibly nested) column |
type | String containing the name of a primitive type, a struct definition, an array definition or a map definition |
nullable | Boolean denoting whether this field can be null |
metadata | A JSON map containing information about this column. For example, the comment key means its value is the column comment. |
Field Name | Description |
---|---|
string | UTF-8 encoded string of characters |
long | 8-byte signed integer. Range: -9223372036854775808 to 9223372036854775807 |
integer | 4-byte signed integer. Range: -2147483648 to 2147483647 |
short | 2-byte signed integer numbers. Range: -32768 to 32767 |
byte | 1-byte signed integer number. Range: -128 to 127 |
float | 4-byte single-precision floating-point numbers |
double | 8-byte double-precision floating-point numbers |
boolean | true or false |
binary | A sequence of binary data |
date | A calendar date, represented as a year-month-day triple without a timezone |
timestamp | Microsecond precision timestamp without a timezone |
An array stores a variable length collection of items of some type.
Field Name | Description |
---|---|
type | Always the string array |
elementType | The type of element stored in this array represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
containsNull | Boolean denoting whether this array can contain one or more null values |
A map stores an arbitrary length collection of key-value pairs with a single keyType and a single valueType.
Field Name | Description |
---|---|
type | Always the string map |
keyType | The type of element used for the key of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
valueType | The type of element used for the key of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
valueContainsNull | Indicates if map values have null values. |
Example Table Schema:
|-- a: integer (nullable = false)
|-- b: struct (nullable = true)
| |-- d: integer (nullable = false)
|-- c: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- e: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- d: integer (nullable = false)
|-- f: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
JSON Encoded Table Schema:
{
"type" : "struct",
"fields" : [ {
"name" : "a",
"type" : "integer",
"nullable" : false,
"metadata" : {
"comment": "this is a comment"
}
}, {
"name" : "b",
"type" : {
"type" : "struct",
"fields" : [ {
"name" : "d",
"type" : "integer",
"nullable" : false,
"metadata" : { }
} ]
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "c",
"type" : {
"type" : "array",
"elementType" : "integer",
"containsNull" : false
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "e",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "d",
"type" : "integer",
"nullable" : false,
"metadata" : { }
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "f",
"type" : {
"type" : "map",
"keyType" : "string",
"valueType" : "string",
"valueContainsNull" : true
},
"nullable" : true,
"metadata" : { }
} ]
}
Partition values are stored as strings, using the following formats. An empty string for any type translates to a null
partition value.
Type | Serialization Format |
---|---|
string | No translation required |
numeric types | The string representation of the number |
date | Encoded as {year}-{month}-{day} . For example, 1970-01-01 |
timestamp | Encoded as {year}-{month}-{day} {hour}:{minute}:{second} . For example: 1970-01-01 00:00:00 |
boolean | Encoded as the string true or false |
A File JSON object can optionally contain statistics about the data in the file being added to the table. These statistics can be used for eliminating files based on query predicates or as inputs to query optimization.
Global statistics record information about the entire file. The following global statistic is currently supported:
Name | Description |
---|---|
numRecords | The number of records in this file. |
Per-column statistics record information for each column in the file and they are encoded, mirroring the schema of the actual data. For example, given the following data schema:
|-- a: struct
| |-- b: struct
| | |-- c: long
Statistics could be stored with the following schema:
|-- stats: struct
| |-- numRecords: long
| |-- minValues: struct
| | |-- a: struct
| | | |-- b: struct
| | | | |-- c: long
| |-- maxValues: struct
| | |-- a: struct
| | | |-- b: struct
| | | | |-- c: long
Example 1:
Column Type:
|-- a: long
JSON stats:
{
"numRecords" : 1,
"maxValues" : {
"a" : 123
},
"minValues" : {
"a" : 123
},
"nullCount" : {
"a" : 0
}
}
Example 2:
Column Type:
|-- a: struct
| |-- b: struct
| | |-- c: long
JSON stats:
{
"minValues" : {
"a" : {
"b" : {
"c" : 123
}
}
},
"maxValues" : {
"a" : {
"b" : {
"c" : 123
}
}
},
"nullCount" : {
"a" : {
"b" : {
"c" : 0
}
}
},
"numRecords" : 1
}
The following per-column statistics are currently supported:
Name | Description |
---|---|
nullCount | The number of null values for this column |
minValues | A value smaller than all values present in the file for this column |
maxValues | A value larger than all values present in the file for this column |
The client may send a sequence of predicates to the server as a hint to request fewer files when it only wishes to query a subset of the data (e.g., data where the country
field is US
). The server may try its best to filter files based on the predicates. This is BEST EFFORT, so the server may return files that don’t satisfy the predicates. For example, if the server fails to parse a SQL expression, the server can skip it. Hence, the client should always apply predicates to filter the data returned by the server.
At the beginning, the server will support the following simple comparison expressions between a column in the table and a constant value.
Expression | Example |
---|---|
= |
col = 123 |
> |
col > 'foo' |
< |
'foo' < col |
>= |
col >= 123 |
<= |
123 <= col |
<> |
col <> 'foo' |
IS NULL |
col IS NULL |
IS NOT NULL |
col IS NOT NULL |
More expression support will be added incrementally in future.
A profile file is a JSON file that contains the information to access a Delta Sharing server. There are three fields in this file.
Field Name | Descrption |
---|---|
shareCredentialsVersion | The file format version of the profile file. This version will be increased whenever non-forward-compatible changes are made to the profile format. When a client is running an unsupported profile file format version, it should show an error message instructing the user to upgrade to a newer version of their client. |
endpoint | The url of the sharing server. |
bearerToken | The bearer token to access the server. |
expirationTime | The expiration time of the bearer token in ISO 8601 format. This field is optional. |
Example: |
{
"shareCredentialsVersion": 1,
"endpoint": "https://sharing.delta.io/delta-sharing/",
"bearerToken": "<token>",
"expirationTime": "2021-11-12T00:12:29.0Z"
}