feat: separate Dockerfile for Hadoop #1186

Open: wants to merge 4 commits into base branch main.

2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -90,6 +90,7 @@ All notable changes to this project will be documented in this file.
- opa: Enable custom versions ([#1170]).
- use custom product versions for Hadoop, HBase, Phoenix, hbase-operator-tools, Druid, Hive and Spark ([#1173]).
- hbase: Bump dependencies to the latest patch level for HBase `2.6.1` and `2.6.2` ([#1185]).
- hadoop: Separate Dockerfiles for Hadoop build and HDFS image ([#1186]).

### Fixed

@@ -206,6 +207,7 @@ All notable changes to this project will be documented in this file.
[#1180]: https://github.com/stackabletech/docker-images/pull/1180
[#1184]: https://github.com/stackabletech/docker-images/pull/1184
[#1185]: https://github.com/stackabletech/docker-images/pull/1185
[#1186]: https://github.com/stackabletech/docker-images/pull/1186

## [25.3.0] - 2025-03-21

2 changes: 2 additions & 0 deletions conf.py
@@ -13,6 +13,7 @@
airflow = importlib.import_module("airflow.versions")
druid = importlib.import_module("druid.versions")
hadoop = importlib.import_module("hadoop.versions")
hadoop_jars = importlib.import_module("hadoop.hadoop.versions")
hbase = importlib.import_module("hbase.versions")
hbase_jars = importlib.import_module("hbase.hbase.versions")
hbase_phoenix = importlib.import_module("hbase.phoenix.versions")
@@ -48,6 +49,7 @@
{"name": "airflow", "versions": airflow.versions},
{"name": "druid", "versions": druid.versions},
{"name": "hadoop", "versions": hadoop.versions},
{"name": "hadoop/hadoop", "versions": hadoop_jars.versions},
{"name": "hbase", "versions": hbase.versions},
{"name": "hbase/hbase", "versions": hbase_jars.versions},
{"name": "hbase/phoenix", "versions": hbase_phoenix.versions},
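
The hadoop.hadoop.versions module imported above is new in this PR and its file is not shown in this excerpt. A hypothetical sketch of its shape, modelled on the existing per-product version modules such as druid/versions.py further down; the keys and version numbers below are assumptions, not the file's actual contents:

```python
# Hypothetical sketch of hadoop/hadoop/versions.py -- the real file is not part of this diff.
# conf.py only requires the module to expose a top-level `versions` list, mirroring the other
# per-product modules (druid/versions.py, hbase/hbase/versions.py, ...).
versions = [
    {
        "product": "3.3.6",   # Hadoop version built by hadoop/hadoop/Dockerfile (cf. druid/versions.py)
        "java-devel": "11",   # assumed builder JDK; illustrative value only
        "protobuf": "3.7.1",  # assumed; the builder stage compiles protobuf from source
    },
]
```
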
6 changes: 3 additions & 3 deletions druid/Dockerfile
@@ -1,7 +1,7 @@
# syntax=docker/dockerfile:1.16.0@sha256:e2dd261f92e4b763d789984f6eab84be66ab4f5f08052316d8eb8f173593acf7
# check=error=true

FROM stackable/image/hadoop AS hadoop-builder
FROM stackable/image/hadoop/hadoop AS hadoop-builder

FROM stackable/image/java-devel AS druid-builder

@@ -12,7 +12,7 @@ ARG STAX2_API
ARG WOODSTOX_CORE
ARG AUTHORIZER
ARG STACKABLE_USER_UID
ARG HADOOP
ARG HADOOP_HADOOP

# Setting this to anything other than "true" will keep the cache folders around (e.g. for Maven, NPM etc.)
# This can be used to speed up builds when disk space is of no concern.
@@ -75,7 +75,7 @@ mvn \
--no-transfer-progress \
clean install \
-Pdist,stackable-bundle-contrib-exts \
-Dhadoop.compile.version=${HADOOP}-stackable${RELEASE} \
-Dhadoop.compile.version=${HADOOP_HADOOP}-stackable${RELEASE} \
-DskipTests `# Skip test execution` \
-Dcheckstyle.skip `# Skip checkstyle checks. We dont care if the code is properly formatted, it just wastes time` \
-Dmaven.javadoc.skip=true `# Dont generate javadoc` \
6 changes: 3 additions & 3 deletions druid/versions.py
@@ -4,23 +4,23 @@
# https://druid.apache.org/docs/30.0.1/operations/java/
"java-base": "17",
"java-devel": "17",
"hadoop": "3.3.6",
"hadoop/hadoop": "3.3.6",
"authorizer": "0.7.0",
},
{
"product": "31.0.1",
# https://druid.apache.org/docs/31.0.1/operations/java/
"java-base": "17",
"java-devel": "17",
"hadoop": "3.3.6",
"hadoop/hadoop": "3.3.6",
"authorizer": "0.7.0",
},
{
"product": "33.0.0",
# https://druid.apache.org/docs/33.0.0/operations/java/
"java-base": "17",
"java-devel": "17",
"hadoop": "3.3.6",
"hadoop/hadoop": "3.3.6",
"authorizer": "0.7.0",
},
]
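
The key rename here ("hadoop" to "hadoop/hadoop") corresponds to the build-argument rename in druid/Dockerfile above (HADOOP to HADOOP_HADOOP). The sketch below illustrates the apparent mapping between dependency keys and ARG names; it is an assumption about the convention used by the Stackable image build tooling, not a copy of its actual implementation.

```python
# Assumed convention: a versions.py dependency key becomes a Dockerfile ARG name by replacing
# "/" and "-" with "_" and upper-casing the result. This matches the pair visible in this diff
# ("hadoop/hadoop" -> HADOOP_HADOOP) but is illustrative only.
def build_arg_name(dependency_key: str) -> str:
    return dependency_key.replace("/", "_").replace("-", "_").upper()


if __name__ == "__main__":
    assert build_arg_name("hadoop/hadoop") == "HADOOP_HADOOP"
```
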
174 changes: 37 additions & 137 deletions hadoop/Dockerfile
@@ -1,144 +1,14 @@
# syntax=docker/dockerfile:1.16.0@sha256:e2dd261f92e4b763d789984f6eab84be66ab4f5f08052316d8eb8f173593acf7
# check=error=true

FROM stackable/image/java-devel AS hadoop-builder

ARG PRODUCT
ARG RELEASE
ARG ASYNC_PROFILER
ARG JMX_EXPORTER
ARG PROTOBUF
ARG TARGETARCH
ARG TARGETOS
ARG STACKABLE_USER_UID

WORKDIR /stackable

COPY --chown=${STACKABLE_USER_UID}:0 shared/protobuf/stackable/patches/patchable.toml /stackable/src/shared/protobuf/stackable/patches/patchable.toml
COPY --chown=${STACKABLE_USER_UID}:0 shared/protobuf/stackable/patches/${PROTOBUF} /stackable/src/shared/protobuf/stackable/patches/${PROTOBUF}

RUN <<EOF
rpm --install --replacepkgs https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
microdnf update
# boost is a build dependency starting in Hadoop 3.4.0 if compiling native code
# automake and libtool are required to build protobuf
microdnf install boost1.78-devel automake libtool
microdnf clean all
rm -rf /var/cache/yum
mkdir /opt/protobuf
chown ${STACKABLE_USER_UID}:0 /opt/protobuf
EOF

USER ${STACKABLE_USER_UID}
# This Protobuf version is the exact version as used in the Hadoop Dockerfile
# See https://github.com/apache/hadoop/blob/trunk/dev-support/docker/pkg-resolver/install-protobuf.sh
# (this was hardcoded in the Dockerfile in earlier versions of Hadoop, make sure to look at the exact version in Github)
RUN <<EOF
cd "$(/stackable/patchable --images-repo-root=src checkout shared/protobuf ${PROTOBUF})"

# Create snapshot of the source code including custom patches
tar -czf /stackable/protobuf-${PROTOBUF}-src.tar.gz .

./autogen.sh
./configure --prefix=/opt/protobuf
make "-j$(nproc)"
make install
(cd .. && rm -r ${PROTOBUF})
EOF

ENV PROTOBUF_HOME=/opt/protobuf
ENV PATH="${PATH}:/opt/protobuf/bin"

RUN <<EOF
# async-profiler
ARCH="${TARGETARCH/amd64/x64}"
curl "https://repo.stackable.tech/repository/packages/async-profiler/async-profiler-${ASYNC_PROFILER}-${TARGETOS}-${ARCH}.tar.gz" | tar -xzC .
ln -s "/stackable/async-profiler-${ASYNC_PROFILER}-${TARGETOS}-${ARCH}" /stackable/async-profiler

# JMX Exporter
mkdir /stackable/jmx
curl "https://repo.stackable.tech/repository/packages/jmx-exporter/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar" -o "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar"
chmod -x "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar"
ln -s "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar" /stackable/jmx/jmx_prometheus_javaagent.jar
EOF

WORKDIR /build
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/stackable/patches/patchable.toml /build/src/hadoop/stackable/patches/patchable.toml
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/stackable/patches/${PRODUCT} /build/src/hadoop/stackable/patches/${PRODUCT}
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/stackable/fuse_dfs_wrapper /build
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/stackable/jmx /stackable/jmx
USER ${STACKABLE_USER_UID}
# Hadoop Pipes requires libtirpc to build, whose headers are not packaged in RedHat UBI, so skip building this module
# Build from source to enable FUSE module, and to apply custom patches.
# Also skip building the yarn, mapreduce and minicluster modules: this will result in the modules being excluded but not all
# jar files will be stripped if they are needed elsewhere e.g. share/hadoop/yarn will not be part of the build, but yarn jars
# will still exist in share/hadoop/tools as they would be needed by the resource estimator tool. Such jars are removed in a later step.
RUN <<EOF
cd "$(/stackable/patchable --images-repo-root=src checkout hadoop ${PRODUCT})"

ORIGINAL_VERSION=$(mvn help:evaluate -Dexpression=project.version -q -DforceStdout)
NEW_VERSION=${PRODUCT}-stackable${RELEASE}

mvn versions:set -DnewVersion=${NEW_VERSION}

# Since we skip building the hadoop-pipes module, we need to set the version to the original version so it can be pulled from Maven Central
sed -e '/<artifactId>hadoop-pipes<\/artifactId>/,/<\/dependency>/ { s/<version>.*<\/version>/<version>'"$ORIGINAL_VERSION"'<\/version>/ }' -i hadoop-tools/hadoop-tools-dist/pom.xml

# Create snapshot of the source code including custom patches
tar -czf /stackable/hadoop-${NEW_VERSION}-src.tar.gz .

mvn \
--batch-mode \
--no-transfer-progress \
clean package install \
-Pdist,native \
-pl '!hadoop-tools/hadoop-pipes' \
-Dhadoop.version=${NEW_VERSION} \
-Drequire.fuse=true \
-DskipTests \
-Dmaven.javadoc.skip=true

mkdir -p /stackable/patched-libs/maven/org/apache
cp -r /stackable/.m2/repository/org/apache/hadoop /stackable/patched-libs/maven/org/apache

cp -r hadoop-dist/target/hadoop-${NEW_VERSION} /stackable/hadoop-${NEW_VERSION}
sed -i "s/${NEW_VERSION}/${ORIGINAL_VERSION}/g" hadoop-dist/target/bom.json
mv hadoop-dist/target/bom.json /stackable/hadoop-${NEW_VERSION}/hadoop-${NEW_VERSION}.cdx.json

# HDFS fuse-dfs is not part of the regular dist output, so we need to copy it in ourselves
cp hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /stackable/hadoop-${NEW_VERSION}/bin

# Remove source code
(cd .. && rm -r ${PRODUCT})

ln -s /stackable/hadoop-${NEW_VERSION} /stackable/hadoop

mv /build/fuse_dfs_wrapper /stackable/hadoop/bin

# Remove unneeded binaries:
# - code sources
# - mapreduce/yarn binaries that were built as cross-project dependencies
# - minicluster (only used for testing) and test .jars
# - json-io: this is a transitive dependency pulled in by cedarsoft/java-utils/json-io and is excluded in 3.4.0. See CVE-2023-34610.
rm -rf /stackable/hadoop/share/hadoop/common/sources/
rm -rf /stackable/hadoop/share/hadoop/hdfs/sources/
rm -rf /stackable/hadoop/share/hadoop/tools/sources/
rm -rf /stackable/hadoop/share/hadoop/tools/lib/json-io-*.jar
rm -rf /stackable/hadoop/share/hadoop/tools/lib/hadoop-mapreduce-client-*.jar
rm -rf /stackable/hadoop/share/hadoop/tools/lib/hadoop-yarn-server*.jar
find /stackable/hadoop -name 'hadoop-minicluster-*.jar' -type f -delete
find /stackable/hadoop -name 'hadoop-client-minicluster-*.jar' -type f -delete
find /stackable/hadoop -name 'hadoop-*tests.jar' -type f -delete
rm -rf /stackable/.m2

# Set correct groups; make sure only required artifacts for the final image are located in /stackable
chmod -R g=u /stackable
EOF
FROM stackable/image/hadoop/hadoop AS hadoop-builder

FROM stackable/image/java-devel AS hdfs-utils-builder

ARG HDFS_UTILS
ARG PRODUCT
ARG RELEASE
ARG HADOOP_HADOOP
ARG STACKABLE_USER_UID

# Starting with hdfs-utils 0.4.0 we need to use Java 17 for compilation.
@@ -161,25 +31,31 @@ WORKDIR /stackable
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/hdfs-utils/stackable/patches/patchable.toml /stackable/src/hadoop/hdfs-utils/stackable/patches/patchable.toml
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/hdfs-utils/stackable/patches/${HDFS_UTILS} /stackable/src/hadoop/hdfs-utils/stackable/patches/${HDFS_UTILS}

COPY --from=hadoop-builder --chown=${STACKABLE_USER_UID}:0 /stackable/patched-libs /stackable/patched-libs

# The Stackable HDFS utils contain an OPA authorizer, group mapper & topology provider.
# The topology provider provides rack awareness functionality for HDFS by allowing users to specify Kubernetes
# labels to build a rackID from.
# Starting with hdfs-utils version 0.3.0 the topology provider is no longer a standalone jar but is included in hdfs-utils.
RUN <<EOF
cd "$(/stackable/patchable --images-repo-root=src checkout hadoop/hdfs-utils ${HDFS_UTILS})"

# Make Maven aware of custom Stackable libraries
mkdir -p /stackable/.m2/repository
cp -r /stackable/patched-libs/maven/* /stackable/.m2/repository

# Create snapshot of the source code including custom patches
tar -czf /stackable/hdfs-utils-${HDFS_UTILS}-src.tar.gz .

mvn \
--batch-mode \
--no-transfer-progress\
clean package \
-P hadoop-${PRODUCT} \
-P hadoop-${HADOOP_HADOOP} \
-Dhadoop.version=${HADOOP_HADOOP}-stackable${RELEASE} \
-DskipTests \
-Dmaven.javadoc.skip=true

mkdir -p /stackable
cp target/hdfs-utils-$HDFS_UTILS.jar /stackable/hdfs-utils-${HDFS_UTILS}.jar
rm -rf hdfs-utils-main

@@ -191,8 +67,13 @@ FROM stackable/image/java-base AS final

ARG PRODUCT
ARG RELEASE
ARG HADOOP_HADOOP
ARG HDFS_UTILS
ARG STACKABLE_USER_UID
ARG ASYNC_PROFILER
ARG JMX_EXPORTER
ARG TARGETARCH
ARG TARGETOS

LABEL \
name="Apache Hadoop" \
@@ -203,10 +84,13 @@ LABEL \
summary="The Stackable image for Apache Hadoop." \
description="This image is deployed by the Stackable Operator for Apache Hadoop / HDFS."

COPY --chown=${STACKABLE_USER_UID}:0 --from=hadoop-builder /stackable /stackable
COPY --chown=${STACKABLE_USER_UID}:0 --from=hdfs-utils-builder /stackable/hdfs-utils-${HDFS_UTILS}.jar /stackable/hadoop-${PRODUCT}-stackable${RELEASE}/share/hadoop/common/lib/hdfs-utils-${HDFS_UTILS}.jar
COPY --chown=${STACKABLE_USER_UID}:0 --from=hadoop-builder /stackable/hadoop-${HADOOP_HADOOP}-stackable${RELEASE} /stackable/hadoop-${HADOOP_HADOOP}-stackable${RELEASE}
COPY --chown=${STACKABLE_USER_UID}:0 --from=hadoop-builder /stackable/*-src.tar.gz /stackable

COPY --chown=${STACKABLE_USER_UID}:0 --from=hdfs-utils-builder /stackable/hdfs-utils-${HDFS_UTILS}.jar /stackable/hadoop-${HADOOP_HADOOP}-stackable${RELEASE}/share/hadoop/common/lib/hdfs-utils-${HDFS_UTILS}.jar
COPY --chown=${STACKABLE_USER_UID}:0 --from=hdfs-utils-builder /stackable/hdfs-utils-${HDFS_UTILS}-src.tar.gz /stackable

COPY --chown=${STACKABLE_USER_UID}:0 hadoop/stackable/jmx /stackable/jmx
COPY --chown=${STACKABLE_USER_UID}:0 hadoop/licenses /licenses

# fuse is required for fusermount (called by fuse_dfs)
@@ -230,6 +114,22 @@ rm -rf /var/cache/yum
# Without this fuse_dfs does not work
# It is so non-root users (as we are) can mount a FUSE device and let other users access it
echo "user_allow_other" > /etc/fuse.conf

ln -s "/stackable/hadoop-${HADOOP_HADOOP}-stackable${RELEASE}" /stackable/hadoop

# async-profiler
ARCH="${TARGETARCH/amd64/x64}"
curl "https://repo.stackable.tech/repository/packages/async-profiler/async-profiler-${ASYNC_PROFILER}-${TARGETOS}-${ARCH}.tar.gz" | tar -xzC /stackable
ln -s "/stackable/async-profiler-${ASYNC_PROFILER}-${TARGETOS}-${ARCH}" /stackable/async-profiler

# JMX Exporter
curl "https://repo.stackable.tech/repository/packages/jmx-exporter/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar" -o "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar"
chmod -x "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar"
ln -s "/stackable/jmx/jmx_prometheus_javaagent-${JMX_EXPORTER}.jar" /stackable/jmx/jmx_prometheus_javaagent.jar

# Set correct permissions and ownerships
chown --recursive ${STACKABLE_USER_UID}:0 /stackable/hadoop /stackable/jmx /stackable/async-profiler "/stackable/async-profiler-${ASYNC_PROFILER}-${TARGETOS}-${ARCH}"
chmod --recursive g=u /stackable/jmx /stackable/async-profiler "/stackable/hadoop-${HADOOP_HADOOP}-stackable${RELEASE}"
EOF

# ----------------------------------------
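
The builder stage deleted above (the protobuf build, the patched Hadoop Maven build with the FUSE module, and the jar clean-up) is what the remaining stages, as well as druid/Dockerfile, now consume via FROM stackable/image/hadoop/hadoop AS hadoop-builder. The new hadoop/hadoop/Dockerfile itself is not included in this excerpt; the following condensed sketch of what it presumably provides is reconstructed from the removed lines and from the paths the final image copies out of the hadoop-builder stage, so treat the details as assumptions.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical sketch of hadoop/hadoop/Dockerfile -- the actual file is not part of this excerpt.
FROM stackable/image/java-devel AS hadoop-builder

# PRODUCT is the Hadoop version; downstream images refer to it via the HADOOP_HADOOP build argument.
ARG PRODUCT
ARG RELEASE
ARG PROTOBUF
ARG STACKABLE_USER_UID

WORKDIR /build
USER ${STACKABLE_USER_UID}

# Condensed from the stage removed above:
#  1. Build protobuf from source and put it on the PATH.
#  2. Check out and patch the Hadoop sources, set the version to ${PRODUCT}-stackable${RELEASE},
#     and build the distribution with native code and FUSE enabled:
#
#       mvn --batch-mode --no-transfer-progress clean package install \
#           -Pdist,native \
#           -pl '!hadoop-tools/hadoop-pipes' \
#           -Dhadoop.version=${PRODUCT}-stackable${RELEASE} \
#           -Drequire.fuse=true \
#           -DskipTests \
#           -Dmaven.javadoc.skip=true
#
#  3. Strip sources, minicluster/test jars and other unneeded artifacts.
#
# Downstream images expect this stage to expose:
#   /stackable/hadoop-${PRODUCT}-stackable${RELEASE}   # Hadoop distribution, including fuse_dfs
#   /stackable/patched-libs                            # patched org.apache.hadoop Maven artifacts
#   /stackable/*-src.tar.gz                            # source snapshots (Hadoop, protobuf)
```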