Databricks Virtual Schema for Lua ("VSDAB" for short) is an implementation of a Virtual Schema.
With VSDAB you can make a read-only connection from a schema in a Databricks database to a so-called "Virtual Schema". Exasol Virtual Schemas make external data sources accessible in our data analytics platform through regular SQL commands. The contents of the external data sources are mapped to virtual tables which look like and can be queried like any regular Exasol table.
Each Virtual Schema needs a data source. In the case of Databricks Virtual Schema for Lua, this source is a database schema in a Databricks database. We call that the "origin schema".
Conceptually Virtual Schemas are very similar to database views. They have an owner (typically the one who creates them) and share that owner's access permissions. This means that for a Virtual Schema to be useful, the owner must have the permissions to view the source.
Users of the Virtual Schema must have permissions to view the Virtual Schema itself, but they don't need permissions to view the source.
Each Virtual Schema requires an Adapter. Think of this as a plug-in for Exasol that defines how to access data from a specific source.
Check the section "Installation" for details on how to install the VSDAB adapter.
The main function of a Virtual Schema is to take a query and turn it into a different one that reads from the data source. The input query, that is the query users of a Virtual Schema run, is always a SELECT statement.
For VSDAB the output query is an IMPORT statement that uses a JDBC connection to Databricks, allowing data to be fetched over a network connection.
The output query is also called the "push-down query", since it is pushed down to the data source. See section "Examining the Push-down Query".
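For example, a user query on a virtual table and a push-down query generated for it could look as follows. This is only a sketch: the catalog and schema names are placeholders, and the exact push-down text produced by the adapter may differ. The named connection (here DATABRICKS_JDBC_CONNECTION) is created later in this guide.

SELECT LAST_NAME FROM VSDAB_VIRTUAL_SCHEMA.CUSTOMERS;

-- Possible push-down query generated by the adapter:
IMPORT FROM JDBC AT DATABRICKS_JDBC_CONNECTION
STATEMENT 'SELECT `LAST_NAME` FROM `my_catalog`.`my_schema`.`CUSTOMERS`'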
What you will need before you begin:

- Exasol version 8.31.0 or later
- A database schema where you can install the adapter script
- The database privilege to install the script
- A copy of the adapter script databricks-virtual-schema-dist-<version>.lua from the release page (check for the latest release). Make sure you pick the file with -dist- in the name, because that is the installation package that contains everything you need.
- Connection details for your Databricks database, consisting of the JDBC URL and credentials.
ExaLoader is responsible for executing the push-down query on Databricks and fetching the results. This section describes how to configure the JDBC driver for ExaLoader. You can find more details in the Exasol Administration documentation.

- Download the latest Databricks JDBC driver.
- Unpack the downloaded ZIP file.
- Upload file DatabricksJDBC42.jar to the default BucketFS bucket under path drivers/jdbc/databricks/ (i.e. bfsdefault/default/drivers/jdbc/databricks/), see the BucketFS documentation for details.
In order to enable ExaLoader to fetch data from Databricks you must register the driver for ExaLoader as described in the installation procedure for JDBC drivers. To do that, create a file settings.cfg with the following content and upload it to the default BucketFS bucket under drivers/jdbc/databricks/ (i.e. bfsdefault/default/drivers/jdbc/databricks/):
DRIVERNAME=DATABRICKS
PREFIX=jdbc:databricks:
NOSECURITY=YES
FETCHSIZE=100000
INSERTSIZE=-1
Important: Make sure that file settings.cfg

- contains a trailing empty line.
- uses Unix line breaks (LF) instead of Windows line breaks (CRLF).

JDBC driver registration won't work if the file format is wrong.
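To avoid line-ending problems entirely, you can generate the file programmatically. The following Python sketch writes settings.cfg with the content from this section, guarantees LF line breaks and a trailing empty line, and verifies the format before upload:

```python
# Write settings.cfg with Unix (LF) line breaks and a trailing empty line,
# regardless of the platform's default line separator.
SETTINGS = (
    "DRIVERNAME=DATABRICKS\n"
    "PREFIX=jdbc:databricks:\n"
    "NOSECURITY=YES\n"
    "FETCHSIZE=100000\n"
    "INSERTSIZE=-1\n"
    "\n"  # trailing empty line required for driver registration
)

# newline="" disables newline translation, so "\n" stays LF even on Windows.
with open("settings.cfg", "w", newline="") as f:
    f.write(SETTINGS)

# Verify the file format before uploading it to BucketFS.
raw = open("settings.cfg", "rb").read()
assert b"\r" not in raw, "file must use Unix (LF) line breaks"
assert raw.endswith(b"\n\n"), "file must end with a trailing empty line"
```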
For the purpose of this user guide we will assume that you install the adapter in a schema called VSDAB_SCHEMA.
If you are not the administrator of the database, please ask an administrator to create that schema for you and grant you write permissions.
CREATE SCHEMA VSDAB_SCHEMA;
Now you need to install the adapter script (i.e. the plug-in driving the Virtual Schema):
--/
CREATE OR REPLACE LUA ADAPTER SCRIPT VSDAB_SCHEMA.VSDAB_ADAPTER AS
table.insert(package.searchers,
    function (module_name)
        local loader = package.preload[module_name]
        if loader == nil then
            error("Module " .. module_name .. " not found in package.preload.")
        else
            return loader
        end
    end
)
<copy the whole content of databricks-virtual-schema-dist-<version>.lua here>
/
;
The first fixed part is a module loading preamble that is required with Exasol version 8.
Please note:

- We recommend using DbVisualizer for executing this statement.
- Note the leading --/ and the trailing / and ; on lines of their own.
- When DbVisualizer asks you to enter data for parameter markers, uncheck option "SQL Commander" > "SQL Commander Options" > "Parameterized SQL". See this Knowledge Base article for details.
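To check that the installation worked, you can look the adapter script up in the system catalog. Assuming you have access to the SYS schema, a query like the following sketch should return exactly one row:

SELECT SCRIPT_NAME, SCRIPT_TYPE
FROM SYS.EXA_ALL_SCRIPTS
WHERE SCRIPT_SCHEMA = 'VSDAB_SCHEMA'
AND SCRIPT_NAME = 'VSDAB_ADAPTER';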
Create a named connection:
CREATE OR REPLACE CONNECTION DATABRICKS_JDBC_CONNECTION
TO 'jdbc:databricks://$WORKSPACE_HOST_NAME:$PORT/default;transportMode=http;ssl=1;AuthMech=$AUTH_MECH;Auth_Flow=$AUTH_FLOW;OAuth2ClientId=$OAUTH_CLIENT_ID;OAuth2Secret=$OAUTH_SECRET;httpPath=$HTTP_PATH;'
USER '$USERNAME'
IDENTIFIED BY '$PASSWORD';
Fill in the following placeholders in the JDBC URL:

- $WORKSPACE_HOST_NAME: Hostname of your Databricks workspace, e.g. abc-1234abcd-5678.cloud.databricks.com
- $PORT: Port of your Databricks workspace, usually 443
- $HTTP_PATH: Partial URL corresponding to the Spark server, e.g. /sql/1.0/warehouses/abc123def456ghi7
- $AUTH_MECH: Depends on the authentication method, see below for details
- $AUTH_FLOW: Depends on the authentication method, see below for details
- $OAUTH_CLIENT_ID: Depends on the authentication method, see below for details
- $OAUTH_SECRET: Depends on the authentication method, see below for details
You can find the JDBC URL by logging in to your Databricks workspace:
- Click on entry "SQL Warehouses" in the left menu bar
- Click on tab "SQL Warehouses"
- Click on the entry for your SQL Warehouse
- Click on tab "Connection details"
- Select JDBC URL for version "2.6.5 or later"
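Assembling the URL by hand is error-prone. The following Python sketch shows a hypothetical helper (not part of VSDAB) that builds the URL from its parts; the option names follow the connection example above:

```python
# Hypothetical helper that assembles a Databricks JDBC URL from its parts.
def build_jdbc_url(host, http_path, port=443, auth_mech=3, extra=None):
    """Build a Databricks JDBC URL; `extra` holds additional key=value options."""
    options = {
        "transportMode": "http",
        "ssl": "1",
        "AuthMech": str(auth_mech),
    }
    if extra:
        options.update(extra)
    options["httpPath"] = http_path
    opts = "".join(f"{key}={value};" for key, value in options.items())
    return f"jdbc:databricks://{host}:{port}/default;{opts}"

# Token authentication (AuthMech=3):
print(build_jdbc_url(
    "abc-1234abcd-5678.cloud.databricks.com",
    "/sql/1.0/warehouses/abc123def456ghi7",
))
```

For M2M authentication you would pass auth_mech=11 and extra={"Auth_Flow": "1", "OAuth2ClientId": ..., "OAuth2Secret": ...}, matching the options described in the following sections.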
Enter the credentials in the USER and IDENTIFIED BY fields of the connection. You have two options for authentication, described in the following sections:
- Personal Access Token (PAT)
- Service Principal (OAuth M2M)
Create a "personal access token" in your workspace as described in the Databricks documentation.
Once you have generated your token, enter it in the connection:
CREATE OR REPLACE CONNECTION DATABRICKS_JDBC_CONNECTION
TO '...'
USER 'token'
IDENTIFIED BY '$TOKEN';
- Use the string token as username in USER and enter your generated token as password in IDENTIFIED BY.
- Modify the JDBC URL:
  - Set $AUTH_MECH to 3.
  - The value for $AUTH_FLOW is not used, omit option Auth_Flow.
  - Omit option OAuth2ClientId.
  - Omit option OAuth2Secret.
Example JDBC URL for token authentication:
jdbc:databricks://abc-1234abcd-5678.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/abc123def456ghi7;
Important: The Databricks documentation recommends adding the credentials as fields UID and PWD to the JDBC URL. This may leak credentials in log and error messages. We recommend entering the credentials in the USER and IDENTIFIED BY fields of the connection as described above.
Create client ID and client secret as described in the Databricks documentation.
CREATE OR REPLACE CONNECTION DATABRICKS_JDBC_CONNECTION
TO '...'
USER ''
IDENTIFIED BY '';
- Use empty strings in USER and IDENTIFIED BY.
- Modify the JDBC URL:
  - Set $AUTH_MECH to 11.
  - Set $AUTH_FLOW to 1.
  - Set $OAUTH_CLIENT_ID to the OAuth M2M client ID.
  - Set $OAUTH_SECRET to the OAuth M2M secret.
Example JDBC URL for M2M authentication:
jdbc:databricks://abc-1234abcd-5678.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=11;Auth_Flow=1;OAuth2ClientId=$OAUTH_CLIENT_ID;OAuth2Secret=$OAUTH_SECRET;httpPath=/sql/1.0/warehouses/abc123def456ghi7;
Important: The OAuth client ID and secret are stored in the JDBC URL. This URL may occur in log and error messages, potentially leaking the credentials. This is due to a restriction of the Databricks JDBC driver, which does not allow specifying client ID and secret as username and password.
You can find additional information about the JDBC connection URL in the Databricks JDBC driver guide.
CREATE VIRTUAL SCHEMA VSDAB_VIRTUAL_SCHEMA
USING VSDAB_SCHEMA.VSDAB_ADAPTER
WITH
CONNECTION_NAME = 'DATABRICKS_JDBC_CONNECTION'
CATALOG_NAME = '<Databricks catalog name>'
SCHEMA_NAME = '<Databricks schema name>'
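Once the statement has succeeded, you can verify that the virtual tables are visible. Assuming you have access to the SYS schema, a query like the following sketch lists the tables mapped into the new Virtual Schema:

SELECT TABLE_NAME
FROM SYS.EXA_ALL_TABLES
WHERE TABLE_SCHEMA = 'VSDAB_VIRTUAL_SCHEMA';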
The Exasol optimizer decides which SQL constructs are pushed down to Databricks based on the original query and on the capabilities reported by the VSDAB adapter.
VSDAB supports the capabilities listed in the file adapter_capabilities.lua.
Sometimes you want to prevent specific constructs from being pushed down. In this case, you can tell the VSDAB adapter to exclude one or more capabilities from being reported to the core database.
The core database will then refrain from pushing down the related SQL constructs.
Just add the property EXCLUDED_CAPABILITIES to the Virtual Schema creation statement and provide a comma-separated list of the capabilities you want to exclude, as shown in the following example:
CREATE VIRTUAL SCHEMA VSDAB_VIRTUAL_SCHEMA
USING VSDAB_SCHEMA.VSDAB_ADAPTER
WITH
SCHEMA_NAME = '<schema name>'
EXCLUDED_CAPABILITIES = 'SELECTLIST_PROJECTION, ORDER_BY_COLUMN'
-- ...
While in theory you could drop and re-create a Virtual Schema, there is a more convenient way to apply changes to the adapter properties.
Use ALTER VIRTUAL SCHEMA ... SET ... to update the properties of an existing Virtual Schema.
The following example points the Virtual Schema to a different source schema by changing the properties CATALOG_NAME and SCHEMA_NAME:
ALTER VIRTUAL SCHEMA VSDAB_VIRTUAL_SCHEMA
SET
CATALOG_NAME = '<New Databricks catalog name>'
SCHEMA_NAME = '<New Databricks schema name>'
## Updating a Virtual Schema
All Virtual Schemas cache their metadata. That metadata contains, for example, all information about the structure and data types of the underlying data source.
To let VSDAB know that something changed in the metadata, please use the [`ALTER VIRTUAL SCHEMA ... REFRESH`](https://docs.exasol.com/sql/alter_schema.htm) statement:
```sql
ALTER VIRTUAL SCHEMA VSDAB_VIRTUAL_SCHEMA REFRESH
```
## Using the Virtual Schema
You use Virtual Schemas exactly like you would use a regular schema. The main difference is that they are read-only.
So if you want to query a table in a Virtual Schema, just use the `SELECT` statement.
Example:
```sql
SELECT * FROM VSDAB_VIRTUAL_SCHEMA.<table>
```
To understand what a Virtual Schema really does, and as a starting point for optimizing your queries, it often helps to take a look at the push-down query Exasol generates. This is as easy as prepending EXPLAIN VIRTUAL to your query.
Example:
EXPLAIN VIRTUAL SELECT * FROM VSDAB_VIRTUAL_SCHEMA.<table>
See the readme for known limitations.