Implements Token Federation for Python Driver #552

Open: wants to merge 40 commits into main (base branch)

@madhav-db madhav-db commented May 7, 2025

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

This PR adds token federation support to the Databricks SQL Python connector, which allows using external identity provider tokens (like GitHub Actions OIDC tokens) with Databricks SQL.

Key Changes

Core Implementation

  • Added token federation as a new auth type with supporting classes and methods
  • Implemented token exchange mechanism to convert external tokens to Databricks tokens

Code Architecture

  • Added DatabricksTokenFederationProvider class to handle token federation
  • Added Token class to manage token lifecycle and expiry
  • Implemented timezone-aware datetime handling to prevent comparison issues
  • Added IdP detection to support various identity providers (Azure AD, GitHub, Google, AWS)

API & Configuration

  • Added identity_federation_client_id parameter for token federation
  • Added proper OIDC discovery for finding token endpoints
  • Added fallback to the original external token when token exchange or refresh fails

Testing

  • Added unit tests with mocking for token federation components
  • Added end-to-end test for GitHub OIDC tokens

Future Improvements

  • Token federation should be refactored as a feature that works with different auth types instead of being an auth type itself
  • OAuthProvider should be integrated with token federation to allow token exchange for OAuth-acquired tokens
  • Use a standardized approach for feature flags across the codebase

This PR enables Databricks SQL connector users to leverage external identity providers for authentication, particularly useful in CI/CD environments like GitHub Actions.

How is this tested?

  • Unit tests
  • E2E Tests
  • Manually (via CI/CD)
  • N/A

Related Tickets & Documents

Notes for reviewers:

Token Federation Flow

1. Client Initialization

  • User creates a SQL connection with auth_type="token-federation" and provides an external token
  • Can be initialized either with access_token or a custom credentials_provider
  • LIMITATION: Currently implemented as a standalone auth type, not a feature that can be combined with other auth types
  • TODO: Refactor to make token federation a feature that works with any auth type via a use_token_federation flag
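
A minimal sketch of the initialization described above. The `auth_type`, `access_token`, and `identity_federation_client_id` parameters are the ones named in this PR; the host, HTTP path, `EXTERNAL_OIDC_TOKEN` environment variable, and the `build_connect_kwargs` helper are placeholders invented for illustration. The helper only assembles keyword arguments, so it runs without a workspace.

```python
import os
from typing import Any, Dict, Optional


def build_connect_kwargs(
    server_hostname: str,
    http_path: str,
    external_token: str,
    identity_federation_client_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a token-federation connection (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "server_hostname": server_hostname,
        "http_path": http_path,
        "auth_type": "token-federation",
        "access_token": external_token,
    }
    # The client ID is optional, so it is only passed when provided.
    if identity_federation_client_id is not None:
        kwargs["identity_federation_client_id"] = identity_federation_client_id
    return kwargs


kwargs = build_connect_kwargs(
    server_hostname="example.cloud.databricks.com",  # placeholder host
    http_path="/sql/1.0/warehouses/abc123",  # placeholder path
    external_token=os.environ.get("EXTERNAL_OIDC_TOKEN", "placeholder-token"),
)
# With the connector from this branch installed, the connection would be opened as:
#   from databricks import sql
#   with sql.connect(**kwargs) as connection:
#       ...
```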

2. Auth Provider Selection

  • get_auth_provider() in auth.py detects token federation auth type
  • Creates a DatabricksTokenFederationProvider wrapper around the credential source
  • TODO: Remove TOKEN_FEDERATION as an auth_type while maintaining backward compatibility
  • TODO: Allow wrapping of existing providers (DatabricksOAuthProvider, AccessTokenAuthProvider, etc.)

3. Token Evaluation

  • When headers are requested, the federation provider:
    1. Gets external token from underlying provider
    2. Parses JWT claims to check token issuer
    3. Determines if token needs exchange based on issuer comparison
  • The token evaluation works with any valid JWT, regardless of how it was obtained
  • TODO: Design interfaces to wrap any auth provider with token federation capability
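
The evaluation step can be sketched roughly as follows. This is an illustration using only the standard library, not the code in this PR: the function names are hypothetical, the claims are read without signature verification (the token is only inspected, not trusted), and the issuer comparison is assumed to be a hostname comparison.

```python
import base64
import json
from typing import Any, Dict
from urllib.parse import urlparse


def decode_jwt_claims(token: str) -> Dict[str, Any]:
    """Return the JWT payload claims WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _hostname(url: str) -> str:
    """Extract a lowercase hostname, adding a scheme if it is missing."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return (urlparse(url).hostname or "").lower()


def needs_exchange(token: str, databricks_host: str) -> bool:
    """True when the token issuer differs from the target Databricks host."""
    issuer = decode_jwt_claims(token).get("iss", "")
    return _hostname(issuer) != _hostname(databricks_host)


def make_unsigned_jwt(claims: Dict[str, Any]) -> str:
    """Build a dummy JWT for demonstration; the signature segment is fake."""
    def b64(obj: Dict[str, Any]) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{b64({'alg': 'none'})}.{b64(claims)}.sig"
```

For example, a token whose `iss` is `https://token.actions.githubusercontent.com` would need an exchange when the target host is a Databricks workspace, while a token issued by the workspace itself would not.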

4. Token Exchange

  • If token is from a different issuer than the target Databricks host:
    1. Uses OIDC discovery to find token endpoint
    2. Exchanges external token for Databricks token via token exchange protocol
    3. Stores exchanged token and original external token for future reference
  • If token is from same issuer, uses original token without exchange
  • This process works correctly for any token regardless of source
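
As a rough illustration of the exchange request, the sketch below builds a form body using the grant type and token type identifiers defined by OAuth 2.0 Token Exchange (RFC 8693). The helper name, the `scope` default, and the decision to send `client_id` only when set are assumptions made for the example and may not match the PR's implementation; no network call is made.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Identifiers defined by RFC 8693 (OAuth 2.0 Token Exchange).
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"


def build_token_exchange_data(
    external_token: str,
    client_id: Optional[str] = None,
    scope: str = "sql",  # assumed scope value, for illustration only
) -> Dict[str, str]:
    """Build the form fields for a token exchange request (hypothetical helper)."""
    data = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": external_token,
        "subject_token_type": JWT_TOKEN_TYPE,
        "scope": scope,
    }
    if client_id:
        data["client_id"] = client_id
    return data


data = build_token_exchange_data("external.jwt.value", client_id="my-client-id")
# The fields are sent as application/x-www-form-urlencoded to the token
# endpoint found via OIDC discovery.
body = urlencode(data)
```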

5. Token Refresh

  • Before token expiry (controlled by TOKEN_REFRESH_BUFFER_SECONDS = 10):
    1. Requests fresh external token from underlying provider
    2. Exchanges this fresh token for a new Databricks token
    3. Updates stored tokens and headers
  • LIMITATION: Relies on underlying provider for fresh tokens
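
The refresh check can be illustrated with a small sketch. `TOKEN_REFRESH_BUFFER_SECONDS = 10` comes from this PR; the `needs_refresh` function and its `now` parameter are hypothetical, and naive expiry values are treated as UTC, as the Token class in this PR does.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_REFRESH_BUFFER_SECONDS = 10


def needs_refresh(expiry: datetime, now: Optional[datetime] = None) -> bool:
    """True once the current time is within the buffer window before expiry."""
    if expiry.tzinfo is None:
        # Naive datetimes are interpreted as UTC so comparisons never mix
        # naive and aware values.
        expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now >= expiry - timedelta(seconds=TOKEN_REFRESH_BUFFER_SECONDS)
```

Passing `now` explicitly keeps the check deterministic in unit tests.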

6. Fallback Handling

  • If token exchange or refresh fails, falls back to original external token
  • Logs appropriate warnings/errors
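
A minimal sketch of the fallback behaviour, assuming the exchange is available as a callable that raises on failure. The `headers_with_fallback` name and the hard-coded `Bearer` scheme are illustrative choices, not the PR's actual code.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


def headers_with_fallback(
    external_token: str, exchange: Callable[[str], str]
) -> Dict[str, str]:
    """Return auth headers, falling back to the external token if exchange fails."""
    try:
        token = exchange(external_token)
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning(
            "Token exchange failed, falling back to external token: %s", exc
        )
        token = external_token
    return {"Authorization": f"Bearer {token}"}
```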

Future Provider Integration Plan

To properly integrate token federation with all auth providers in authenticators.py:

  1. Decorator Pattern Implementation:

    • Create a wrapper class that can decorate any AuthProvider with token federation capabilities
    • Allow wrapping of DatabricksOAuthProvider, AccessTokenAuthProvider, etc.
  2. Configuration Changes:

    • Add a use_token_federation boolean flag to connection parameters
    • Modify get_auth_provider() to apply token federation wrapper when flag is set
  3. Provider Interface Enhancement:

    • Update CredentialsProvider interface to expose necessary token information
    • Ensure DatabricksOAuthProvider properly implements this interface for token access
  4. Backward Compatibility:

    • Maintain support for existing auth_type="token-federation" during transition
    • Add deprecation warnings and migration guidance

The core token exchange functionality works well for any token, but the current architecture limits token federation to being a separate auth type. The primary improvement needed is architectural - enabling token federation to work with other auth types (including OAuth-based ones) while maintaining backward compatibility.
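
The decorator approach in the plan above might look roughly like this. Every class and function here is a simplified stand-in written for illustration (including the minimal `AuthProvider` base, `TokenFederationWrapper`, and `select_auth_provider`); none of it is the connector's real interface.

```python
from typing import Callable, Dict, Optional


class AuthProvider:
    """Simplified stand-in for the connector's auth provider interface."""

    def add_headers(self, request_headers: Dict[str, str]) -> None:
        pass


class StaticTokenProvider(AuthProvider):
    """Example inner provider that always supplies the same bearer token."""

    def __init__(self, token: str) -> None:
        self._token = token

    def add_headers(self, request_headers: Dict[str, str]) -> None:
        request_headers["Authorization"] = f"Bearer {self._token}"


class TokenFederationWrapper(AuthProvider):
    """Decorates any provider: exchanges its bearer token, keeps other headers."""

    def __init__(self, inner: AuthProvider, exchange: Callable[[str], str]) -> None:
        self._inner = inner
        self._exchange = exchange

    def add_headers(self, request_headers: Dict[str, str]) -> None:
        self._inner.add_headers(request_headers)
        scheme, _, token = request_headers.get("Authorization", "").partition(" ")
        if scheme.lower() == "bearer" and token:
            request_headers["Authorization"] = f"Bearer {self._exchange(token)}"


def select_auth_provider(
    inner: AuthProvider,
    use_token_federation: bool = False,
    exchange: Optional[Callable[[str], str]] = None,
) -> AuthProvider:
    """Apply the wrapper only when the (proposed) flag is set."""
    if use_token_federation and exchange is not None:
        return TokenFederationWrapper(inner, exchange)
    return inner
```

Because the wrapper only rewrites the `Authorization` value, any additional headers set by the inner provider pass through unchanged.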

github-actions bot commented May 7, 2025

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

…nd enhance unit tests for accurate expiry verification

@madhav-db madhav-db requested a review from jprakash-db May 12, 2025 06:11
    """
    try:
        # Add protocol if missing to ensure proper parsing
        if not url1.startswith(("http://", "https://")):

Can this URL-normalization logic be extracted into a utility function? It is used in multiple places.

    """
    try:
        token = self.get_current_token()
        return {"Authorization": f"{token.token_type} {token.access_token}"}

Here we are not passing any external headers other than Authorization. With Managed Identities, the external provider supplies additional headers.

This is called by the ExternalAuthProvider to get headers for authentication.
"""
# First call the underlying credentials provider to get its headers
header_factory = self.credentials_provider(*args, **kwargs)

this header_factory doesn't seem to be used anywhere

else:
    # Token is from a different host, need to exchange
    logger.debug("Token from different host, exchanging token")
    new_token = self._exchange_token(access_token)

If the token exchange fails, the external credentials token is not stored, so every subsequent call hits the external provider again.

header_factory = self.credentials_provider(*args, **kwargs)

# Get the standard token endpoint if not already set
if self.token_endpoint is None:

Why is this here? Shouldn't this code sit just before the token exchange is initiated? It feels out of place in the logical flow.

return self.external_headers

# Return empty dict as a last resort
return {}

What do you mean by "last resort"? There is no error case before this statement, so I don't think we will ever reach it.

self.token_endpoint, data=token_exchange_data, headers=self.EXCHANGE_HEADERS
)

if response.status_code != 200:

Why not HttpError?

@jprakash-db

@madhav-db
Suggestion / Clarification

  • I don't understand why we are overriding the credentialsProvider. Why can't we decorate the AuthProvider, since that is the last thing supplied for authentication?
  • I would suggest using DatabricksAuthProvider instead of the existing one.
  • The current AuthProvider would then be wrapped by DatabricksAuthProvider, and the current credential_providers would follow CredentialsProvider -> ExternalAuthProvider -> DatabricksAuthProvider.

description: 'Identity federation client ID'
required: true

# Run on PRs that might affect token federation

Will this not run as part of the release process? What if a release contains no changes in the code folders listed here? We should still run this as part of every release, if not on every PR.

@@ -29,6 +34,7 @@ def __init__(
tls_client_cert_file: Optional[str] = None,
oauth_persistence=None,
credentials_provider=None,
identity_federation_client_id: Optional[str] = None,

Is this for the workload identity federation flow?

Token Federation Support:
-----------------------
Currently, token federation is implemented as a separate auth type, but the goal is to
refactor it as a feature that can work with any auth type. The current implementation

@gopalldb gopalldb May 16, 2025

Does this mean that both will be supported in future: a separate 'use_token_federation' flag as well as the separate auth type? Deprecating anything that has been released in a client is not easy.


TODO: Future refactoring needed:
1. Add a use_token_federation flag that can be combined with any auth type
2. Remove TOKEN_FEDERATION as an auth_type while maintaining backward compatibility

This will be hard to remove once it has been introduced.

Returns:
str: The formatted hostname
"""
if not hostname.startswith("https://"):

What if someone has passed the hostname with http://?

# Ensure expiry is timezone-aware
if expiry.tzinfo is None:
    # Convert naive datetime to aware datetime
    self.expiry = expiry.replace(tzinfo=timezone.utc)

Does this mean that a datetime without a timezone will be treated as UTC?


def __str__(self) -> str:
    """Return the token as a string in the format used for Authorization headers."""
    return f"{self.token_type} {self.access_token}"

Headers are typically of the form "scheme token", where the scheme can be something like "bearer" or "api key". The token type looks more like one of our internal concepts.

@madhav-db

@madhav-db Suggestion / Clarification

  • I don't understand why we are overriding the credentialsProvider. Why can't we decorate the AuthProvider, since that is the last thing supplied for authentication?
  • I would suggest using DatabricksAuthProvider instead of the existing one.
  • The current AuthProvider would then be wrapped by DatabricksAuthProvider, and the current credential_providers would follow CredentialsProvider -> ExternalAuthProvider -> DatabricksAuthProvider.

I will be doing this work in a follow up PR. This change isn't related to token federation as a feature, and makes more sense to be done separately.

@jprakash-db

Can you clarify why it is not related to token federation?
Token federation is just exchanging any external token for an in-house token. It should work directly with access tokens, external credentials providers, etc., and I don't think we should support it only partially, as this PR does. Alternatively, you can raise a stacked PR on top of this one, but we should merge only after all flows are supported.

3 participants