Labels: :Data Management/Ingest Node, >bug, Team:Data Management
Description
Elasticsearch Version
8.16.1
Installed Plugins
No response
Java Version
bundled
OS Version
Problem Description
Summary
The issue observed here is exactly the same as in #91964 and remains unresolved. There is a reproducible Java-level deadlock on Elasticsearch ingest nodes (running the attachment processor) during log rollover. The deadlock causes ingest nodes to hang indefinitely and is especially likely to occur under heavy ingest/logging load. After reviewing the discussion in #93878, it is clear that the community has focused on log loss, logging bridges, and log4j-related permission issues, but has not identified or addressed the JVM-level deadlock caused by the lock cycle between log4j2's RollingFileManager and its BufferedWriter.
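For illustration, the deadlock has the classic inverted lock ordering shape: one thread holds the RollingFileManager monitor and waits for the BufferedWriter monitor, while another thread holds the BufferedWriter monitor and waits for the RollingFileManager monitor. Below is a minimal, self-contained Java sketch of that shape only; lockA/lockB and the thread names are illustrative stand-ins, not the actual log4j2 code.

// Minimal illustration of the inverted lock ordering seen in the jstack below.
// lockA stands in for the RollingFileManager monitor, lockB for the BufferedWriter
// monitor; this is NOT log4j2 code, just the deadlock shape.
public class LockOrderDeadlock {
    private static final Object lockA = new Object(); // ~ RollingFileManager monitor
    private static final Object lockB = new Object(); // ~ BufferedWriter monitor

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            synchronized (lockA) {       // acquire the "manager" lock first
                sleepQuietly(100);
                synchronized (lockB) {   // then block on the "writer" lock
                    System.out.println("write thread proceeded");
                }
            }
        }, "write-T#16");

        Thread roller = new Thread(() -> {
            synchronized (lockB) {       // acquire the "writer" lock first
                sleepQuietly(100);
                synchronized (lockA) {   // then block on the "manager" lock -> cycle
                    System.out.println("rollover thread proceeded");
                }
            }
        }, "write-T#15");

        writer.start();
        roller.start();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}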
Details
- Elasticsearch version: 8.16.1
- log4j2 version: 2.19.0 (bundled)
- Node type: Dedicated ingest node
jstack evidence
Below is a minimal excerpt from a real production jstack, showing the deadlock:
Found one Java-level deadlock:
=============================
"elasticsearch[1751529376016684232][cluster_coordination][T#1]":
  waiting to lock monitor 0x00007f3f214190e0 (object 0x00000010012ae8f0, a org.apache.logging.log4j.core.appender.rolling.RollingFileManager),
  which is held by "elasticsearch[1751529376016684232][write][T#16]"
"elasticsearch[1751529376016684232][write][T#16]":
  waiting to lock monitor 0x00007f3f02e4a000 (object 0x0000001011305730, a java.io.BufferedWriter),
  which is held by "elasticsearch[1751529376016684232][write][T#15]"
"elasticsearch[1751529376016684232][write][T#15]":
  waiting to lock monitor 0x00007f3f214190e0 (object 0x00000010012ae8f0, a org.apache.logging.log4j.core.appender.rolling.RollingFileManager),
  which is held by "elasticsearch[1751529376016684232][write][T#16]"
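Beyond a manual jstack, the cycle can also be confirmed programmatically with the standard java.lang.management API. The sketch below is a generic deadlock detector, not Elasticsearch code; it would have to run inside (or be attached via JMX to) the affected node's JVM.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Generic JVM deadlock check: prints the threads involved in any
// monitor/ownable-synchronizer deadlock cycle detected by the platform MXBean.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("No deadlock detected");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
            System.out.printf("%s is blocked on %s held by %s%n",
                    info.getThreadName(),
                    info.getLockName(),
                    info.getLockOwnerName());
        }
    }
}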
Impact
- Ingest nodes become permanently stuck, requiring a process restart.
- All ingest pipelines and possibly cluster coordination are affected.
Attachments
- Full jstack available upon request.
Steps to Reproduce
- Use the attachment processor on a dedicated ingest node.
- The hang occurs almost always during log rollover under heavy ingest load (a standalone stress sketch follows below).
See also the analysis in #93878 (comment).
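For a rough idea of how to make the race window easier to hit outside Elasticsearch, the following hypothetical stress driver logs from many threads at once. It assumes a log4j2 configuration with a size-based RollingFileAppender and a small rollover threshold so rollovers happen frequently; the class name, thread count, and message volume are illustrative.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical stress driver: many threads logging rapidly to force frequent
// rollovers. Requires a log4j2 config with a small size-based rollover trigger.
public class RolloverStress {
    private static final Logger LOGGER = LogManager.getLogger(RolloverStress.class);

    public static void main(String[] args) throws InterruptedException {
        int threadCount = 32;
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 1_000_000; n++) {
                    LOGGER.info("worker {} message {} padding to grow the log file quickly", id, n);
                }
            }, "stress-writer-" + i);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}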
Logs (if relevant)