Context
The Databricks integration replicates DataJoint tables into Delta Lake in two layers:
- Bronze: Lakehouse Sync CDC mirror of operational Lakebase tables. Binary blob columns land as opaque BINARY. Acceptable for backup / point-in-time queries.
- Silver: Linked Delta Tables. Curated, consumer-facing. Published via DML hooks, per-branch UC namespaces. Used by Spark SQL, Genie, BI tools, and Delta Sharing recipients.
The silver layer requires that every column carrying complex data render to Spark-native types — ARRAY<T>, STRUCT<...>, MAP<K,V>, primitives, and nested combinations. No opaque BINARY. No runtime fallback.
This issue requests the minimum contract in datajoint-python that lets codec authors (in-tree or plugins) opt in to Spark rendering. The actual renderers — and the silver-layer publish pipeline that consumes them — live downstream of this issue.
Supersedes #1457
#1457 proposed adding render_spark() as an abstract method on dj.Codec. Design discussion (in datajoint-databricks#42) converged on a cleaner factoring:
- <blob> / <blob@> stay general-purpose and non-renderable. They hold arbitrary Python values, so Spark can't make assumptions about their shape, and adding a render_spark() to BlobCodec would force every renderer to enumerate the type taxonomy of mYm-tagged binary content.
- Renderable codec types are purpose-built. A codec dedicated to "1D float array" or "2D image" or "labeled timeseries" knows its value shape and can render it cleanly. Pipeline authors opt in to silver eligibility by choosing a renderable codec type for the column.
- The contract is a Protocol, not an abstract method. Codecs opt in by implementing the method; non-renderable codecs need do nothing. No NotImplementedError stubs across every built-in codec.
Proposed addition
Roughly ten lines in src/datajoint/codecs.py (or a new src/datajoint/rendering.py):
    from typing import Any, Protocol, runtime_checkable

    @runtime_checkable
    class Renderable(Protocol):
        """
        A codec that can render its decoded values to Spark-native types.

        Opt-in. Codecs implementing this method declare that their decoded
        values can be expressed as primitives, lists, or dicts of the same,
        i.e., shapes that map cleanly to Spark's StructType / ArrayType / MapType.

        Consumers (e.g., a Databricks publish pipeline) check
        ``isinstance(codec, Renderable)`` per column to determine eligibility.
        """

        def render_spark(self, decoded: Any, *, key: dict | None = None) -> Any: ...
That's it. No method added to Codec. No default implementation on any built-in codec. No new exception type.
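To make "primitives, lists, or dicts of the same" concrete, here is a minimal sketch of how a downstream consumer might map a rendered value to a Spark SQL type string. The helper name spark_type_of and the scalar mapping (int to BIGINT, float to DOUBLE, and so on) are assumptions for illustration; nothing here is part of datajoint-python or of the proposed Protocol. Dicts are treated as STRUCT, not MAP, purely to keep the sketch short.

```python
from typing import Any

# Assumed mapping from Python scalars to Spark SQL type names.
# bool comes before int because bool is a subclass of int.
_SCALARS = ((bool, "BOOLEAN"), (int, "BIGINT"), (float, "DOUBLE"), (str, "STRING"))


def spark_type_of(value: Any) -> str:
    """Return a Spark SQL type string for a rendered value (hypothetical helper)."""
    for py_type, spark_name in _SCALARS:
        if isinstance(value, py_type):
            return spark_name
    if isinstance(value, list):
        if not value:
            raise TypeError("cannot infer an element type from an empty list")
        element_types = {spark_type_of(item) for item in value}
        if len(element_types) != 1:
            raise TypeError(f"mixed element types: {sorted(element_types)}")
        return f"ARRAY<{element_types.pop()}>"
    if isinstance(value, dict):
        if not all(isinstance(name, str) for name in value):
            raise TypeError("struct field names must be strings")
        fields = ", ".join(f"{name}: {spark_type_of(v)}" for name, v in value.items())
        return f"STRUCT<{fields}>"
    # bytes, NumPy arrays, and arbitrary objects are rejected: no BINARY fallback.
    raise TypeError(f"not Spark-native: {type(value).__name__}")


print(spark_type_of([1.0, 2.0]))                       # ARRAY<DOUBLE>
print(spark_type_of({"t": [0.0, 0.5], "label": "a"}))  # STRUCT<t: ARRAY<DOUBLE>, label: STRING>
```

A value that fails this kind of check (for example raw bytes) is what the contract means by "not renderable": the sketch raises instead of degrading to BINARY.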
Why a Protocol, not an abstract method
- Smaller OSS surface. ~10 lines of Protocol declaration vs. an abstract method requiring NotImplementedError stubs on every built-in codec.
- Cleaner opt-in semantics. Codec authors implement the method when they want silver eligibility; they don't have to acknowledge it otherwise.
- No retroactive churn for plugin codecs. Existing third-party codecs (dj-zarr-codecs, dj-photon-codecs, etc.) work unchanged. They opt in by adding the method when they choose to.
- Composable with structural typing. Consumers use isinstance(codec, Renderable) (enabled by @runtime_checkable); no inheritance chain required.
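The structural check can be demonstrated without datajoint installed. The sketch below redeclares the Protocol locally and uses three stand-in classes (PlainCodec, ListCodec, WrongSignatureCodec), all hypothetical. It also shows a known limitation of @runtime_checkable: isinstance() only checks that the method exists, not that its signature matches.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class Renderable(Protocol):
    # Optional[dict] is equivalent to the dict | None annotation in the proposal.
    def render_spark(self, decoded: Any, *, key: Optional[dict] = None) -> Any: ...


class PlainCodec:
    """Stand-in for a codec that does not opt in (hypothetical)."""

    def decode(self, stored, *, key=None):
        return stored


class ListCodec:
    """Stand-in for a codec that opts in by defining render_spark (hypothetical)."""

    def decode(self, stored, *, key=None):
        return list(stored)

    def render_spark(self, decoded, *, key=None):
        return list(decoded)


class WrongSignatureCodec:
    """Has the method name but an incompatible signature (hypothetical)."""

    def render_spark(self):
        return None


assert not isinstance(PlainCodec(), Renderable)
assert isinstance(ListCodec(), Renderable)  # no inheritance from Renderable needed
# The runtime check is presence-only, so a mismatched signature still passes.
assert isinstance(WrongSignatureCodec(), Renderable)
```

Because of that last case, a consumer that relies on isinstance() alone may still want to validate the rendered output, or lean on a static type checker to catch signature mismatches.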
Scope
In scope (this issue):
- The Renderable Protocol declaration in datajoint-python.
Out of scope (lives elsewhere):
- Renderable codec implementations. Specific typed codecs (<float_array@>, <int_array@>, <image_2d@>, <struct@>, <labeled_array@>, <timeseries@>, <sparse_matrix@>, domain-specific shapes) ship as plugins. They register via the existing codec auto-registration mechanism.
- Silver-layer publish pipeline. Eligibility check, Spark schema construction, Delta write, branch-namespace management, Lakehouse Sync hooks: all downstream of datajoint-python.
- BlobCodec does not implement Renderable. <blob> and <blob@> columns remain non-renderable by design. Authors who want silver eligibility migrate columns to typed renderable codecs.
- No decode_spark (reverse direction). Delta consumers query rendered columns directly via Spark SQL; round-tripping back through DataJoint is not a target of this work.
- No best-effort BINARY fallback. Codecs either implement Renderable (and render) or they don't (and remain bronze-only).
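For orientation only, the downstream eligibility check could look something like the following. This is a sketch under assumptions: partition_columns, the column-to-codec mapping, and the two stand-in codec classes are hypothetical, and the real publish pipeline lives outside datajoint-python.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class Renderable(Protocol):
    def render_spark(self, decoded: Any, *, key: Optional[dict] = None) -> Any: ...


class FloatArrayLike:
    """Stand-in for a typed, renderable codec (hypothetical)."""

    def render_spark(self, decoded, *, key=None):
        return [float(x) for x in decoded]


class BlobLike:
    """Stand-in for a general-purpose blob codec that does not opt in (hypothetical)."""


def partition_columns(codecs_by_column: dict) -> tuple:
    """Split column names into (silver-eligible, bronze-only) lists."""
    silver, bronze_only = [], []
    for column, codec in codecs_by_column.items():
        if isinstance(codec, Renderable):
            silver.append(column)
        else:
            # No BINARY fallback: the column stays out of the silver layer.
            bronze_only.append(column)
    return silver, bronze_only


silver, bronze_only = partition_columns(
    {"trace": FloatArrayLike(), "payload": BlobLike()}
)
print(silver)       # ['trace']
print(bronze_only)  # ['payload']
```

The point of the sketch is that eligibility is decided per column from the codec alone, with no inspection of stored values.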
Example: how a renderable codec opts in
A plugin codec for 1D float arrays (sketch, not shipped in datajoint-python):
    import datajoint as dj
    import numpy as np

    class FloatArrayCodec(dj.Codec):
        name = "float_array"

        def get_dtype(self, is_store): ...
        def encode(self, value, *, key=None, store_name=None): ...
        def decode(self, stored, *, key=None) -> np.ndarray: ...

        def render_spark(self, decoded: np.ndarray, *, key=None):
            return decoded.tolist()  # → ARRAY<DOUBLE>
isinstance(FloatArrayCodec(), Renderable) returns True by virtue of the method being present. No subclassing, no registration step beyond what codec auto-registration already does.
Prerequisites and related work
- REPLICA IDENTITY FULL config option. Independent; bronze needs it, silver doesn't. Both required for the integration overall.
- datajoint-databricks#5: data-product curation layer. Closes once silver-layer publish works end-to-end with renderable codecs.
- Codec).
References
- src/datajoint/codecs.py:59-180
- datajoint-databricks/DECISIONS.md §Linked Delta Tables enforce a strict supported-types contract
- datajoint-databricks#42