Pipeline deployed using Databricks CLI on Windows fails to start compute if environment dependency has whitespace in path #4456

@rtsh254

Description

Describe the issue

When deploying a pipeline with serverless: true and configuring its dependencies via an environment that references a requirements.txt file, the pipeline fails to start its compute if the workspace path to the requirements.txt file contains whitespace.

Example Repo and Project Folder Structure

The repository structure holding the various DAB projects is as follows:

C:\Repo Working Folders\
├───Databrick Asset Bundle Solutions
    ├───Dashboards
    ├───DLTs
    │   ├───Project_1
    │   └───Project_2 <--- Databricks Asset Bundle Here
    │       ├───.databricks
    │       ├───.shared_files
    │       │   └───python
    │       │       └───requirements.txt <--- This is the pipeline's requirements.txt
    │       ├───.tmp
    │       ├───config
    │       │   ├───json
    │       │   │   └───samples_config.json
    │       │   └───sql
    │       ├───devops_pipelines
    │       ├───notebooks
    │       │   └───Transformation.MaterializedView.ipynb
    │       ├───resources
    │       │   └───pipelines
    │       │       ├───pipeline_1.yml
    │       │       ├───pipeline_2.yml
    │       │       └───pipeline_3.yml
    │       ├───sql_deployment
    │       │   └───.tmp
    │       ├───typings
    │       └───databricks.yml
    ├───Transforms
    ├───ODM
    ├───ODW
    └───shared_variables.yml

Project_2 will serve as the example asset bundle having the issue.

Configuration

Databricks Asset Bundle (databricks.yml)

# Databricks asset bundle definition - databricks.yml
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: DLT.Project_2

sync:
  include:
    - .shared_files
    - notebooks

include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - ../../../shared_variables.yml

Pipeline definition (resources/pipelines/pipeline_2.yml)

resources:
  pipelines:
    project_2_pipeline_2:
      name: Example Project 2 Pipeline 2
      tags:
        Environment: ${bundle.target}
        Layer: DLT
        Site: Examples
      configuration:
        config_file_path: ${workspace.file_path}/config/json/samples_config.json
        catalog_name: ${var.catalog_name}
        schema_name: examples_dlt_${bundle.target}
      libraries:
        - notebook:
            path: ${workspace.file_path}/notebooks/Transformation.MaterializedView
      schema: examples_dlt_${bundle.target}
      development: true
      photon: true
      catalog: ${var.catalog_name}
      serverless: true
      environment: 
        dependencies: 
          - -r ${workspace.file_path}/.shared_files/python/requirements.txt

The asset bundle deploys successfully, but on inspecting the deployed pipeline the environment section is malformed: the path has been wrapped at the whitespace. The deployed YAML appears below.

Deployed environment element in pipeline yaml

environment:
  dependencies:
    - -r /Workspace/Users/<user>/Databrick
    Asset
    Bundle
    Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt

Expected Behaviour

The expected behaviour is that the pipeline executes successfully after the compute starts and loads the required libraries from the requirements.txt file.

Actual Behaviour

When the pipeline is executed, the compute fails to start with the following error in the stdout log.

ERROR: Invalid requirement: 'Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt' Hint: It looks like a path. File 'Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt' does not exist.
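The truncated path in the error matches what shell-style whitespace tokenization of the unquoted dependency string would produce. Below is a minimal illustration of that failure mode, not the CLI's or pip's actual code; the `someone` user segment is a placeholder for the redacted username:

```python
import shlex

# Unquoted dependency string as deployed ("someone" is a placeholder user).
dep = ('-r /Workspace/Users/someone/Databrick Asset Bundle '
       'Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt')

# Shell-style tokenization splits the path on spaces; only the final
# fragment survives as the supposed requirements file, matching the error.
print(shlex.split(dep)[-1])
# -> Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt

# Quoting the path keeps it intact through the same tokenization.
quoted = ('-r "/Workspace/Users/someone/Databrick Asset Bundle '
          'Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt"')
print(shlex.split(quoted)[-1])  # the full path, spaces preserved
```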

Manually changing the deployed pipeline YAML to the below, i.e. placing quotes around the entire path, resolves the startup issue and the pipeline runs successfully.

environment:
  dependencies:
    - -r "/Workspace/Users/<user>/Databrick
    Asset
    Bundle
    Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt"

The asset bundle YAML was then modified to use quotes, as below:

environment:
  dependencies:
    - -r "${workspace.file_path}/.shared_files/python/requirements.txt"

However, this configuration does not deploy, failing with the following error message:

Error: unable to determine if C:\Repo Working Folders\Databrick Asset Bundle Solutions\DLTs\Project_2\resources\pipelines"\Workspace\Users\<user>\Databrick Asset Bundle Solutions\DLT.Project_2\files.shared_file\python\requirements.txt"
is not a notebook: open resources/pipelines/"/Workspace/Users/<user>/Databrick Asset Bundle Solutions/DLT.Project_2/files/.shared_files/python/requirements.txt": The filename, directory name, or volume label syntax is incorrect.

It appears that if the first character is not the root / character, the path is treated as a relative path and the Databricks CLI attempts to resolve the full path.
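This observation is consistent with a simple leading-slash check: once the value starts with a quote character, it no longer begins with /, so it would be classified as relative. A minimal sketch using Python's posixpath, purely as an assumption about the check involved (not the CLI's actual code; "someone" is a placeholder user):

```python
import posixpath

# A quoted workspace path, as it appears in the modified bundle YAML.
quoted_path = '"/Workspace/Users/someone/.shared_files/python/requirements.txt"'

print(posixpath.isabs(quoted_path))             # False: first char is '"', not '/'
print(posixpath.isabs(quoted_path.strip('"')))  # True once the quotes are removed
```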

OS and CLI version

This issue is occurring on Databricks CLI version 0.283.0 and appears to affect the Windows version only, as a Databricks CLI user on macOS was able to deploy the original configuration (no quotes) and the pipeline ran successfully.

Is this a regression?

Not to our knowledge.

Please note the above configurations are representative only, but reflect the actual configuration experiencing the issue.

Metadata

Labels: CLI (CLI related issues)