
remove support for hadoop ingestion #19109

Merged
clintropolis merged 15 commits into apache:master from
clintropolis:goodbye-hadoop-indexing
Mar 17, 2026

Conversation

@clintropolis
Member

@clintropolis clintropolis commented Mar 7, 2026

Description

This PR drops support for the Apache Hadoop based ingestion tasks, which run Druid ingestion processes using Hadoop YARN. It was officially deprecated in Druid 34 in #18286, where it was scheduled for deletion in Druid 37, based on discussion in this dev list thread https://lists.apache.org/thread/5jyl10tm4glvzlfps74o3q4r3rb0h1x9 and elsewhere over the last several years. Note that Druid will still work with a Hadoop ecosystem: the druid-hdfs-storage extension is not part of this removal, so we still support reading from and writing to HDFS; we still support reading Hadoopy formats like druid-orc-extensions and others; and the druid-kerberos and druid-ranger-security extensions still function.

Hadoop ingestion served us well, and we wouldn't be where we are without it, but I think it is time to say goodbye 🫡.

There are a number of reasons for doing this. The biggest pain point over the years has been Java version support and dependencies, but Hadoop ingestion is also not even close to as well maintained or tested as the rest of Druid, and no one puts much effort into making new features work with it. The main reason for that is that the codepaths for Hadoop ingestion and the 'native' Druid ingestion types diverged several years ago with the switch from InputRowParser to InputSource/InputFormat starting in #8823.

As for the intended replacements, the Kubernetes based task runner of kubernetes-overlord-extensions is a superior alternative for task auto-scaling and has matured since becoming a core extension. Additionally, we believe that SQL based ingestion with the multi-stage query engine is the future of batch ingestion, so it is the focus of most new development.

I am aware that Hadoop 3.5 is currently nearing release, which would solve the Java 17+ problems that are the most immediate issue we are facing, but I still think this is the right move at this time. That said, if anyone in the community wishes to collect the pieces we are deleting here and migrate them into a 'contrib' extension, then I think we would be happy to accept adding it back in that form, where as a contrib extension it would much better reflect the state Hadoop ingestion has actually been in for the last several years.

Legacy integration tests

This also removes the legacy integration-tests, since the Hadoop ingestion tests were the sole remaining legacy ITs!

Release note

Support for Apache Hadoop-based ingestion was removed from Apache Druid 37.0.0. Please use
SQL-based ingestion or native batch instead.

The associated materialized-view-selection and materialized-view-maintenance contrib extensions were also removed
as part of this change, since they only supported Hadoop-based ingestion.

Note that Druid still supports using druid-hdfs-storage as deep storage and other Hadoop ecosystem extensions and
functionality that was not specific to Hadoop-based ingestion.
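For anyone migrating, here is a minimal sketch of what an equivalent SQL-based (multi-stage query) batch ingestion looks like. The datasource name, input URI, and column list below are hypothetical placeholders, not taken from this PR; `REPLACE INTO ... OVERWRITE ALL`, `TABLE(EXTERN(...))`, and `PARTITIONED BY` are the standard Druid SQL-based ingestion constructs.

```sql
-- Overwrite a hypothetical "wikipedia" datasource from an external JSON file.
-- The URI and column definitions here are made-up examples.
REPLACE INTO "wikipedia" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "page",
  "added"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "added", "type": "long"}]'
  )
)
PARTITIONED BY DAY
```

This runs as a task on the multi-stage query engine rather than on a Hadoop cluster, which is why no Hadoop configuration appears anywhere in the spec.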

@gianm
Contributor

gianm commented Mar 7, 2026

Let's make sure to document the reasons why we are doing this. In the PR description please link to the dev thread where Hadoop support was last discussed: https://lists.apache.org/thread/5jyl10tm4glvzlfps74o3q4r3rb0h1x9 and also the PR where removal in 37 was added to the docs: #18530. (I don't think the release notes need to link to these, though.)

Please also post in the dev list thread with a summary of next steps. I don't see a post in there wrapping up the discussion.


Release manager must also ensure that CI is passing successfully on the release branch. Since CI on branch can contain additional tests such as ITs for different JVM flavours. (Note that CI is sometimes flaky for older branches).
To check the CI status on a release branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show
To check the CI status on a r`elease branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show

Suggested change
To check the CI status on a r`elease branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show
To check the CI status on a release branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commit should show


@gianm gianm left a comment


Generally looks good to me 🫡

A few comments on the documentation and the stub task.

"to": "/docs/latest/ingestion/hadoop"
},
{
"from": "/docs/development/extensions-contrib/materialized-view",

`from` should include `/latest/` (same for the next two)


Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it,
you have to take caution to not override segments created by real-time processing (if you have a real-time pipeline set up).
Note that Druid still supports using `druid-hdfs-storage` as deep storage and other Hadoop ecosystem extensions and

druid-hdfs-storage would be good as a link.

* reasons in the event we come across any of these tasks.
*/
@Deprecated
public class HadoopIndexTaskStub extends AbstractBatchIndexTask

Why not extend AbstractTask? It might be simpler.
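For context on the stub being discussed, here is a minimal self-contained sketch of the fail-fast pattern: a stub class that exists only so leftover Hadoop tasks can still be deserialized and then fail with an actionable message. This is a hypothetical standalone class for illustration, not the real Druid `AbstractTask`/`AbstractBatchIndexTask` hierarchy, whose APIs are much larger.

```java
// Hypothetical, self-contained sketch of the fail-fast stub pattern.
// The real HadoopIndexTaskStub extends Druid's task base classes (not shown).
public class HadoopIndexTaskStub {
    private final String id;

    public HadoopIndexTaskStub(String id) {
        this.id = id;
    }

    // Instead of running any ingestion, the stub immediately reports failure
    // with a migration hint so operators understand why the task did not run.
    public String run() {
        return "FAILED: task [" + id + "] uses removed Hadoop-based ingestion; "
               + "use SQL-based or native batch ingestion instead";
    }
}
```

Whichever base class the real stub extends, the important property is that it never schedules work: it only produces a failed status with a clear explanation.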


On the query side, the Druid Broker is responsible for ensuring that a consistent set of segments is involved in a given query. It selects the appropriate set of segment versions to use when the query starts based on what is currently available. This is supported by atomic replacement, a feature that ensures that from a user's perspective, queries flip instantaneously from an older version of data to a newer set of data, with no consistency or performance impact.
This is used for Hadoop-based batch ingestion, native batch ingestion when `appendToExisting` is false, and compaction.
This is used for native batch ingestion when `appendToExisting` is false and compaction.

Should have a comma and mention SQL:

This is used for SQL `REPLACE`, native batch ingestion when `appendToExisting` is false, and compaction.


- Supervised "seekable-stream" ingestion methods like [Kafka](../ingestion/kafka-ingestion.md) and [Kinesis](../ingestion/kinesis-ingestion.md) are idempotent due to the fact that stream offsets and segment metadata are stored together and updated in lock-step.
- [Hadoop-based batch ingestion](../ingestion/hadoop.md) is idempotent unless one of your input sources is the same Druid datasource that you are ingesting into. In this case, running the same task twice is non-idempotent, because you are adding to existing data instead of overwriting it.
- [Native batch ingestion](../ingestion/native-batch.md) is idempotent unless

would be nice to mention SQL REPLACE here.

- Supervised "seekable-stream" ingestion methods like [Kafka](../ingestion/kafka-ingestion.md) and [Kinesis](../ingestion/kinesis-ingestion.md). With these methods, Druid commits stream offsets to its [metadata store](metadata-storage.md) alongside segment metadata, in the same transaction. Note that ingestion of data that has not yet been published can be rolled back if ingestion tasks fail. In this case, partially-ingested data is
discarded, and Druid will resume ingestion from the last committed set of stream offsets. This ensures exactly-once publishing behavior.
- [Hadoop-based batch ingestion](../ingestion/hadoop.md). Each task publishes all segment metadata in a single transaction.
- [Native batch ingestion](../ingestion/native-batch.md). In parallel mode, the supervisor task publishes all segment metadata in a single transaction after the subtasks are finished. In simple (single-task) mode, the single task publishes all segment metadata in a single transaction after it is complete.

would be nice to mention SQL REPLACE here.

@clintropolis clintropolis merged commit 01abc30 into apache:master Mar 17, 2026
37 checks passed
@clintropolis clintropolis deleted the goodbye-hadoop-indexing branch March 17, 2026 20:11
