Add Renderable Protocol for codec-driven Spark-native rendering #1458

@dimitri-yatsenko

Description

Context

The Databricks integration replicates DataJoint tables into Delta Lake in two layers:

  • Bronze: Lakehouse Sync CDC mirror of operational Lakebase tables. Binary blob columns land as opaque BINARY. Acceptable for backup / point-in-time queries.
  • Silver: Linked Delta Tables. Curated, consumer-facing. Published via DML hooks, per-branch UC namespaces. Used by Spark SQL, Genie, BI tools, and Delta Sharing recipients.

The silver layer requires that every column carrying complex data render to Spark-native types: ARRAY<T>, STRUCT<...>, MAP<K,V>, primitives, and nested combinations. No opaque BINARY. No runtime fallback.
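To make the "Spark-native, no opaque BINARY" requirement concrete, here is a minimal sketch of how a downstream pipeline might sanity-check a rendered value. The helper name `is_spark_native_shape` and the choice of which Python scalars count as primitives are assumptions for illustration only; nothing like this is proposed for datajoint-python.

```python
from typing import Any

# Assumption: these Python scalars stand in for Spark primitive types.
_PRIMITIVES = (bool, int, float, str)


def is_spark_native_shape(value: Any) -> bool:
    """Return True if value is built only from primitives, lists, and dicts."""
    if value is None or isinstance(value, _PRIMITIVES):
        return True  # None is treated as a nullable field
    if isinstance(value, (bytes, bytearray, memoryview)):
        return False  # opaque binary is what the silver layer excludes
    if isinstance(value, (list, tuple)):
        # would map to ARRAY<T>
        return all(is_spark_native_shape(item) for item in value)
    if isinstance(value, dict):
        # would map to STRUCT<...> or MAP<K,V>
        return all(
            isinstance(k, _PRIMITIVES) and is_spark_native_shape(v)
            for k, v in value.items()
        )
    return False  # anything else (ndarray, custom objects) needs rendering first
```

A nested value such as `{"t": [0.0, 0.1], "label": "lick"}` passes, while any value that still contains `bytes` does not.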

This issue requests the minimum contract in datajoint-python that lets codec authors (in-tree or plugins) opt in to Spark rendering. The actual renderers — and the silver-layer publish pipeline that consumes them — live downstream of this issue.

Supersedes #1457

#1457 proposed adding render_spark() as an abstract method on dj.Codec. Design discussion (in datajoint-databricks#42) converged on a cleaner factoring:

  • <blob> / <blob@> stay general-purpose and non-renderable. They hold arbitrary Python values — Spark can't make assumptions about their shape, and adding a render_spark() to BlobCodec would force every renderer to enumerate the type taxonomy of mYm-tagged binary content.
  • Renderable codec types are purpose-built. A codec dedicated to "1D float array" or "2D image" or "labeled timeseries" knows its value shape and can render it cleanly. Pipeline authors opt in to silver eligibility by choosing a renderable codec type for the column.
  • The contract is a Protocol, not an abstract method. Codecs opt in by implementing the method; non-renderable codecs need do nothing. No NotImplementedError stubs across every built-in codec.

Proposed addition

Roughly ten lines in src/datajoint/codecs.py (or a new src/datajoint/rendering.py):

from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Renderable(Protocol):
    """
    A codec that can render its decoded values to Spark-native types.

    Opt-in. Codecs implementing this method declare that their decoded
    values can be expressed as primitives, lists, or dicts of the same —
    i.e., shapes that map cleanly to Spark's StructType / ArrayType / MapType.

    Consumers (e.g., a Databricks publish pipeline) check
    ``isinstance(codec, Renderable)`` per column to determine eligibility.
    """

    def render_spark(self, decoded: Any, *, key: dict | None = None) -> Any: ...

That's it. No method added to Codec. No default implementation on any built-in codec. No new exception type.

Why a Protocol, not an abstract method

  • Smaller OSS surface. ~10 lines of Protocol declaration vs. an abstract method requiring NotImplementedError stubs on every built-in codec.
  • Cleaner opt-in semantics. Codec authors implement the method when they want silver eligibility; they don't have to acknowledge it otherwise.
  • No retroactive churn for plugin codecs. Existing third-party codecs (dj-zarr-codecs, dj-photon-codecs, etc.) work unchanged. They opt in by adding the method when they choose to.
  • Composable with structural typing. Consumers use isinstance(codec, Renderable) (enabled by @runtime_checkable) — no inheritance chain required.

Scope

In scope (this issue):

  • The Renderable Protocol declaration in datajoint-python.

Out of scope (lives elsewhere):

  • Renderable codec implementations. Specific typed codecs (<float_array@>, <int_array@>, <image_2d@>, <struct@>, <labeled_array@>, <timeseries@>, <sparse_matrix@>, domain-specific shapes) ship as plugins. They register via the existing codec auto-registration mechanism.
  • Silver-layer publish pipeline. Eligibility check, Spark schema construction, Delta write, branch-namespace management, Lakehouse Sync hooks — all downstream of datajoint-python.
  • BlobCodec does not implement Renderable. <blob> and <blob@> columns remain non-renderable by design. Authors who want silver eligibility migrate columns to typed renderable codecs.
  • No decode_spark (reverse direction). Delta consumers query rendered columns directly via Spark SQL; round-tripping back through DataJoint is not a target of this work.
  • No best-effort BINARY fallback. Codecs either implement Renderable (and render) or they don't (and remain bronze-only).

Example: how a renderable codec opts in

A plugin codec for 1D float arrays (sketch, not shipped in datajoint-python):

import datajoint as dj
import numpy as np

class FloatArrayCodec(dj.Codec):
    name = "float_array"

    def get_dtype(self, is_store): ...
    def encode(self, value, *, key=None, store_name=None): ...
    def decode(self, stored, *, key=None) -> np.ndarray: ...

    def render_spark(self, decoded: np.ndarray, *, key=None):
        return decoded.tolist()  # → ARRAY<DOUBLE>

isinstance(FloatArrayCodec(), Renderable) returns True by virtue of the method being present. No subclassing, no registration step beyond what codec auto-registration already does.

Prerequisites and related work

References
