
remove support for hadoop ingestion #19109

Merged
clintropolis merged 15 commits into apache:master from
clintropolis:goodbye-hadoop-indexing
Mar 17, 2026

Conversation

@clintropolis
Member

@clintropolis clintropolis commented Mar 7, 2026

Description

This PR drops support for the Apache Hadoop based ingestion tasks, which run Druid ingestion processes using Hadoop YARN. It was officially deprecated in Druid 34 in #18286, where it was scheduled for deletion in Druid 37, based on discussion in this dev list thread https://lists.apache.org/thread/5jyl10tm4glvzlfps74o3q4r3rb0h1x9 and elsewhere over the last several years. Note that Druid will still work with a Hadoop ecosystem: the druid-hdfs-storage extension is not part of this removal, so we still support reading from and writing to HDFS; we still support reading Hadoopy formats like druid-orc-extensions and others; and the druid-kerberos and druid-ranger-security extensions still function.

Hadoop ingestion served us well, and we wouldn't be where we are without it, but I think it is time to say goodbye 🫡.

There are a number of reasons for doing this. The biggest pain point over the years has been Java version support and dependencies, but Hadoop ingestion is also not even close to as well maintained or tested as the rest of Druid, and no one puts much effort into making new features work with it. The main reason for that is that the codepaths for Hadoop ingestion and the 'native' Druid ingestion types diverged several years ago with the switch from InputRowParser to InputSource/InputFormat starting in #8823.

As for the intended replacements, the Kubernetes based task runner of kubernetes-overlord-extensions is a superior alternative for task auto-scaling and has matured since becoming a core extension. Additionally, we believe that SQL based ingestion with the multi-stage query engine is the future of batch ingestion, so it is the focus of most new development.

I am aware that Hadoop 3.5 is currently nearing release, which would solve the Java 17+ problems that are the most immediate issue we are facing, but I still think this is the right move at this time. That said, if anyone in the community wishes to collect the pieces we are deleting here and migrate them into a 'contrib' extension, then I think we would be happy to accept adding it back in that form, where as a contrib extension it would much better reflect the state Hadoop ingestion has actually been in for the last several years.

Legacy integration tests

This also removes the legacy integration-tests, since the Hadoop ingestion tests were the sole remaining legacy ITs!

Release note

Support for Apache Hadoop-based ingestion was removed from Apache Druid 37.0.0. Please use
SQL-based ingestion or native batch instead.

The associated materialized-view-selection and materialized-view-maintenance contrib extensions were also removed
as part of this change, since they only supported Hadoop-based ingestion.

Note that Druid still supports using druid-hdfs-storage as deep storage and other Hadoop ecosystem extensions and
functionality that was not specific to Hadoop-based ingestion.
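For anyone migrating, here is a minimal sketch of what an equivalent SQL-based (multi-stage query) batch ingestion looks like. The datasource name, input URI, and column list below are hypothetical placeholders, not taken from this PR; `REPLACE INTO ... OVERWRITE ALL`, `TABLE(EXTERN(...))`, and `PARTITIONED BY` are the standard Druid SQL-based ingestion constructs.

```sql
-- Overwrite a hypothetical "wikipedia" datasource from an external JSON file.
-- The URI and column definitions here are made-up examples.
REPLACE INTO "wikipedia" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "page",
  "added"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "added", "type": "long"}]'
  )
)
PARTITIONED BY DAY
```

This runs as a task on the multi-stage query engine rather than on a Hadoop cluster, which is why no Hadoop configuration appears anywhere in the spec.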

@gianm
Contributor

gianm commented Mar 7, 2026

Let's make sure to document the reasons why we are doing this. In the PR description please link to the dev thread where Hadoop support was last discussed: https://lists.apache.org/thread/5jyl10tm4glvzlfps74o3q4r3rb0h1x9 and also the PR where removal in 37 was added to the docs: #18530. (I don't think the release notes need to link to these, though.)

Please also post in the dev list thread with a summary of next steps. I don't see a post in there wrapping up the discussion.


Release manager must also ensure that CI is passing successfully on the release branch. Since CI on branch can contain additional tests such as ITs for different JVM flavours. (Note that CI is sometimes flaky for older branches).
To check the CI status on a release branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show
To check the CI status on a r`elease branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show

Suggested change
To check the CI status on a r`elease branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commmit should show
To check the CI status on a release branch, you can go to the commits page e.g. https://github.com/apache/druid/commits/24.0.0. On this page, latest commit should show


@gianm gianm left a comment


Generally looks good to me 🫡

A few comments on the documentation and the stub task.

"to": "/docs/latest/ingestion/hadoop"
},
{
"from": "/docs/development/extensions-contrib/materialized-view",

`from` should include `/latest/` (same for the next two)


Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it,
you have to take caution to not override segments created by real-time processing (if you have a real-time pipeline set up).
Note that Druid still supports using `druid-hdfs-storage` as deep storage and other Hadoop ecosystem extensions and

druid-hdfs-storage would be good as a link.

* reasons in the event we come across any of these tasks.
*/
@Deprecated
public class HadoopIndexTaskStub extends AbstractBatchIndexTask

Why not extend AbstractTask? It might be simpler.
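For context on the stub being discussed, here is a minimal self-contained sketch of the fail-fast pattern: a stub class that exists only so leftover Hadoop tasks can still be deserialized and then fail with an actionable message. This is a hypothetical standalone class for illustration, not the real Druid `AbstractTask`/`AbstractBatchIndexTask` hierarchy, whose APIs are much larger.

```java
// Hypothetical, self-contained sketch of the fail-fast stub pattern.
// The real HadoopIndexTaskStub extends Druid's task base classes (not shown).
public class HadoopIndexTaskStub {
    private final String id;

    public HadoopIndexTaskStub(String id) {
        this.id = id;
    }

    // Instead of running any ingestion, the stub immediately reports failure
    // with a migration hint so operators understand why the task did not run.
    public String run() {
        return "FAILED: task [" + id + "] uses removed Hadoop-based ingestion; "
               + "use SQL-based or native batch ingestion instead";
    }
}
```

Whichever base class the real stub extends, the important property is that it never schedules work: it only produces a failed status with a clear explanation.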


On the query side, the Druid Broker is responsible for ensuring that a consistent set of segments is involved in a given query. It selects the appropriate set of segment versions to use when the query starts based on what is currently available. This is supported by atomic replacement, a feature that ensures that from a user's perspective, queries flip instantaneously from an older version of data to a newer set of data, with no consistency or performance impact.
This is used for Hadoop-based batch ingestion, native batch ingestion when `appendToExisting` is false, and compaction.
This is used for native batch ingestion when `appendToExisting` is false and compaction.

Should have a comma and mention SQL:

This is used for SQL `REPLACE`, native batch ingestion when `appendToExisting` is false, and compaction.


- Supervised "seekable-stream" ingestion methods like [Kafka](../ingestion/kafka-ingestion.md) and [Kinesis](../ingestion/kinesis-ingestion.md) are idempotent due to the fact that stream offsets and segment metadata are stored together and updated in lock-step.
- [Hadoop-based batch ingestion](../ingestion/hadoop.md) is idempotent unless one of your input sources is the same Druid datasource that you are ingesting into. In this case, running the same task twice is non-idempotent, because you are adding to existing data instead of overwriting it.
- [Native batch ingestion](../ingestion/native-batch.md) is idempotent unless

would be nice to mention SQL REPLACE here.

- Supervised "seekable-stream" ingestion methods like [Kafka](../ingestion/kafka-ingestion.md) and [Kinesis](../ingestion/kinesis-ingestion.md). With these methods, Druid commits stream offsets to its [metadata store](metadata-storage.md) alongside segment metadata, in the same transaction. Note that ingestion of data that has not yet been published can be rolled back if ingestion tasks fail. In this case, partially-ingested data is
discarded, and Druid will resume ingestion from the last committed set of stream offsets. This ensures exactly-once publishing behavior.
- [Hadoop-based batch ingestion](../ingestion/hadoop.md). Each task publishes all segment metadata in a single transaction.
- [Native batch ingestion](../ingestion/native-batch.md). In parallel mode, the supervisor task publishes all segment metadata in a single transaction after the subtasks are finished. In simple (single-task) mode, the single task publishes all segment metadata in a single transaction after it is complete.

would be nice to mention SQL REPLACE here.

@clintropolis clintropolis merged commit 01abc30 into apache:master Mar 17, 2026
37 checks passed
@clintropolis clintropolis deleted the goodbye-hadoop-indexing branch March 17, 2026 20:11
