
[INLONG-12090][SDK] Build standard Python Dataproxy SDK wheels based on PEP-517 #12091

Open
hzqmwne wants to merge 1 commit into apache:master from hzqmwne:master

Conversation


@hzqmwne hzqmwne commented Feb 27, 2026

[INLONG-12090][SDK] Build standard Python Dataproxy SDK wheels based on PEP-517

Fixes #12090 (Partially)

Motivation

This PR improves the packaging and distribution story of the InLong DataProxy Python SDK by moving toward a PEP 517 compliant build. The goal is to make it easier for the community to publish and maintain prebuilt manylinux wheels for multiple CPython versions, instead of relying on legacy/local installation patterns.

In addition, this PR is written with the expectation that the project maintainers will eventually publish the SDK on PyPI and keep releases up-to-date with both InLong and CPython evolution (including timely rebuilds when Python or manylinux baselines move forward).

Modifications

  • Added PEP 517 build metadata for the Python SDK:

    • inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-python/pyproject.toml
      Note: the metadata in this file is illustrative. Maintainers should adjust it as needed (package name, description, URLs, classifiers, etc.). The version field is expected to be continuously updated according to the chosen release strategy (e.g., aligned with the InLong release train, independently versioned, etc.).
  • Updated CMake integration to support both legacy and PEP 517 builds:

    • inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-python/CMakeLists.txt
      Changes include: prefer the vendored pybind11/ when present (kept only for compatibility with the legacy build.sh), otherwise locate pybind11 from the Python build environment, and install artifacts (including the .pyi) into SKBUILD_PLATLIB_DIR for wheel builds.
  • Added Python type stub file to improve typing support:

    • inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-python/inlong_dataproxy.pyi
      This stub will require ongoing maintenance as the binding surface evolves.
  • Added a manylinux build container example for producing wheels across multiple CPython versions:

    • inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-docker/Dockerfile_python
      This Dockerfile demonstrates building an sdist and multiple wheels, followed by auditwheel repair.
      This Dockerfile is intended for local testing only; the official project maintainers should publish the Python SDK to pypi.org.
  • Documented the wheel build entrypoint:

    • inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-docker/README.md
      Includes a short snippet on how to run the Docker build to produce wheels.
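
For reference, a minimal sketch of what such a pyproject.toml might contain, assuming a scikit-build-core backend (an assumption inferred from the SKBUILD_PLATLIB_DIR usage in the CMake changes); every metadata value below is a placeholder for maintainers to replace:

```toml
# Illustrative only -- name, version, and all metadata are placeholders.
[build-system]
requires = ["scikit-build-core>=0.8", "pybind11>=2.11"]
build-backend = "scikit_build_core.build"

[project]
name = "inlong-dataproxy"            # placeholder package name
version = "0.0.1"                    # to be driven by the chosen release strategy
description = "Python SDK for Apache InLong DataProxy"
requires-python = ">=3.8"
classifiers = [
    "Programming Language :: Python :: 3",
    "Operating System :: POSIX :: Linux",
]
```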
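
The dual-mode pybind11 lookup described for CMakeLists.txt can be sketched roughly as follows; the target and source file names here are placeholders, not the PR's actual code:

```cmake
# Illustrative sketch -- prefer the vendored copy, else the build env's one.
if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/pybind11/CMakeLists.txt)
  add_subdirectory(pybind11)               # legacy build.sh compatibility
else()
  find_package(pybind11 CONFIG REQUIRED)   # PEP 517 isolated build env
endif()

pybind11_add_module(inlong_dataproxy src/py_module.cc)  # placeholder source

# scikit-build-core defines SKBUILD_PLATLIB_DIR during wheel builds.
if(DEFINED SKBUILD_PLATLIB_DIR)
  install(TARGETS inlong_dataproxy LIBRARY DESTINATION ${SKBUILD_PLATLIB_DIR})
  install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/inlong_dataproxy.pyi
          DESTINATION ${SKBUILD_PLATLIB_DIR})
endif()
```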
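
The manylinux container flow can be sketched as follows; the base image tag, copy paths, and output locations are assumptions, not the contents of the actual Dockerfile_python:

```dockerfile
# Illustrative only -- image tag and paths are assumptions.
FROM quay.io/pypa/manylinux2014_x86_64

COPY . /work
WORKDIR /work/dataproxy-sdk-python

# Build one wheel per bundled CPython, then repair to a manylinux tag.
RUN for PYBIN in /opt/python/cp3*/bin; do \
        "$PYBIN/pip" wheel . --no-deps -w /tmp/dist; \
    done && \
    for WHL in /tmp/dist/*.whl; do \
        auditwheel repair "$WHL" -w /wheelhouse; \
    done
```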
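
The shape of the type stub can be illustrated as below; every method shown is an invented placeholder (only the InLongApi class name appears elsewhere in this PR), and the real .pyi must mirror the actual pybind11 binding surface:

```python
# Hypothetical excerpt of inlong_dataproxy.pyi -- method names and
# signatures are placeholders; keep the real stub in sync with the bindings.
class InLongApi:
    def init_api(self, config_file: str) -> int: ...
    def send(self, group_id: str, stream_id: str, msg: str) -> int: ...
    def close_api(self, max_waiting_sec: int) -> int: ...
```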

Verifying this change


This PR does not add new automated tests. It can be verified via packaging/build smoke checks:

  1. Build wheels in manylinux via Docker:

    • cd inlong-sdk/dataproxy-sdk-twins
    • docker build -f dataproxy-sdk-docker/Dockerfile_python .
  2. Validate produced artifacts:

    • Ensure repaired wheels are generated by auditwheel (typically under wheelhouse/ in the image).
    • Install a wheel in a clean environment and verify import:
      • python -c "import inlong_dataproxy; from inlong_dataproxy import InLongApi"
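
End-to-end, the smoke check above amounts to something like the following; the image tag, the wheelhouse location inside the image, and the output paths are assumptions:

```shell
cd inlong-sdk/dataproxy-sdk-twins
docker build -f dataproxy-sdk-docker/Dockerfile_python -t inlong-py-wheels .

# Copy repaired wheels out of the image (wheelhouse path is an assumption)
docker run --rm -v "$PWD/out:/out" inlong-py-wheels \
    sh -c 'cp /wheelhouse/*.whl /out/'

# Install a wheel into a clean venv and run the import check
python3 -m venv /tmp/sdk-venv
/tmp/sdk-venv/bin/pip install out/*.whl
/tmp/sdk-venv/bin/python -c "import inlong_dataproxy; from inlong_dataproxy import InLongApi"
```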

Documentation

  • Does this pull request introduce a new feature? no (build/packaging infrastructure improvement)
  • If yes, how is the feature documented? not applicable
  • If a feature is not applicable for documentation, explain why? The PR focuses on packaging workflow rather than adding new runtime APIs.
  • If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation: not applicable

Additional Notes / Follow-ups for Maintainers

  • Static linking & licensing: Because several third_party libraries are linked statically, the resulting Python extension .so has no external third-party runtime dependencies, which is beneficial for distribution and manylinux wheel packaging. However, this also means we must pay close attention to the licenses of those bundled third-party libraries. I did not audit them in this PR, so maintainers should review licensing/compliance carefully before merging.

  • PyPI publishing expectation: Maintainers will need to create the project on PyPI and publish manylinux wheels for multiple CPython versions, with timely rebuilds as InLong/CPython releases progress.

  • Platform & Python baseline & sdist: The current Python SDK effectively targets Linux only. This PR sets the baseline to Python >= 3.8; since Python 3.8 is approaching EOL and the manylinux ecosystem evolves, the baseline may need to be raised again soon. It is recommended to reflect these constraints clearly in the PyPI project description/classifiers.

    Given this, whether to ship an sdist in addition to wheels is still an open decision. Because the current sdist bundles prebuilt artifacts (e.g., .a archives of third-party libraries and the DataProxy C++ SDK), it is not a "source" distribution in the usual sense and is effectively Linux-bound; shipping binaries inside an sdist is itself a debated practice. The upside is that when a prebuilt wheel is not available, users on a compatible Linux system still have a good chance of building the package locally. On unsupported platforms (e.g., Windows), however, the build will inevitably fail, and the resulting error messages are likely to be confusing and unfriendly.

    Alternatively, the sdist could be reshaped into a true source-only distribution that enables cross-platform builds, but handling the third_party dependency chain is non-trivial and would require maintainers to evaluate the best approach.

  • Type stubs maintenance: The added .pyi improves usability but should be kept in sync with future API changes.

  • Long-term direction (optional): If maintainers want to reduce per-version wheel builds (e.g., via abi3) and/or improve automated .pyi type stub generation, migrating bindings from pybind11 to nanobind could be considered. This is optional and maintainer-driven.

  • Cross-platform ambition: Most third_party dependencies of the DataProxy C++ SDK are cross-platform. If the DataProxy C++ SDK itself gains Windows/macOS support, the Python SDK can follow naturally. A pragmatic first step could be supporting MSYS2 ucrt64 on Windows. If maintainers decide to pursue cross-platform support in the C++ layer, follow-up work for Python packaging can be proposed accordingly.

  • Release workflow: Publishing to PyPI typically benefits from an automated pipeline (PyPI “Trusted Publishing” is commonly recommended; build matrices are often handled by cibuildwheel). However, the final approach should match the project’s preferred release infrastructure and governance.
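
One possible shape for such a pipeline (purely illustrative; the workflow name, action versions, build selectors, and package directory are all assumptions, and none of this is part of the PR):

```yaml
name: build-python-sdk-wheels
on: [workflow_dispatch]

jobs:
  build-wheels:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pypa/cibuildwheel@v2.19.2
        with:
          package-dir: inlong-sdk/dataproxy-sdk-twins/dataproxy-sdk-python
        env:
          CIBW_BUILD: "cp38-* cp39-* cp310-* cp311-* cp312-*"
          CIBW_SKIP: "*-musllinux*"
      - uses: actions/upload-artifact@v4
        with:
          name: wheels
          path: wheelhouse/*.whl
```

A separate publish job using pypa/gh-action-pypi-publish would be the usual companion for Trusted Publishing, subject to the project's release governance.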


Successfully merging this pull request may close these issues.

[Improve][SDK] Publish DataProxy Python SDK to PyPI
