Out of memory in case of Splunk indexer slowness/failure #423

Closed
ludovic-boutros opened this issue Feb 28, 2024 · 9 comments

@ludovic-boutros
Contributor

Hello,
We are using the Splunk Sink Connector with these main parameters:

{
	"name": "SplunkHECSinkConnector",
	"config":{
		"connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
		"tasks.max": "6",
		"splunk.hec.ack.enabled": "true",
		"splunk.hec.max.outstanding.events": "50000",
		"splunk.hec.max.retries": "-1",
		"splunk.hec.backoff.threshhold.seconds": "60",
		"splunk.hec.threads": "1"
	}
}

As I understand it, we should never have more than 50000 events per task kept in memory.
But that is not the case if Splunk indexers encounter slowness or failures.
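To make explicit what I expect from these settings (a minimal sketch based on my reading of the documentation, not the connector's actual code): the task should track events that have been sent to HEC but not yet acknowledged, and stop pulling new records while that count is at or above splunk.hec.max.outstanding.events. Class and method names below are mine.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper illustrating the expected backpressure behavior.
class OutstandingEventGate {
    private final int maxOutstanding;                  // e.g. 50000 per task in our config
    private final AtomicInteger outstanding = new AtomicInteger(0);

    OutstandingEventGate(int maxOutstanding) {
        this.maxOutstanding = maxOutstanding;
    }

    boolean canAcceptMore() {                          // checked before taking new records
        return outstanding.get() < maxOutstanding;
    }

    void onBatchSent(int eventCount) {                 // events now in flight to HEC
        outstanding.addAndGet(eventCount);
    }

    void onBatchAcked(int eventCount) {                // acked events can leave memory
        outstanding.addAndGet(-eventCount);
    }
}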

We can observe errors and messages such as the following in the Kafka Connect logs:

[2024-02-27 06:39:24,527] INFO [SplunkHECSinkConnector|task-5] handled 394 failed batches with 193452 events (com.splunk.kafka.connect.SplunkSinkTask:154)

I have attached the Kafka Connect metrics captured during a Splunk indexer stress test.
You can observe the out-of-memory error and the number of active records.

[Screenshot: out-of-memory-splunk-sink]
@VihasMakwana
Contributor

@ludovic-boutros thanks for this.
I will take a look and share my thoughts.

@VihasMakwana
Contributor

VihasMakwana commented Apr 1, 2024

@ludovic-boutros can you attach the entire Kafka Connect logs?

@ludovic-boutros
Contributor Author

@VihasMakwana we will open a case on the Splunk side in order to send you the complete logs more securely.
We decreased the outstanding events property to 10000 and we still have OOM issues.

[Screenshots: Kafka Connect metrics attached]

@ludovic-boutros
Contributor Author

I still don't understand how the sink-record-active-count can be so high. Shouldn't it stay under the splunk.hec.max.outstanding.events value?

@ludovic-boutros
Contributor Author

I have set some classes to debug log level.
Here is an interesting one:

[2024-04-09 10:54:52,049] DEBUG [SplunkHECSinkIntConnector|task-0] tid=152 received 261 records with total -4656978 outstanding events tracked (com.splunk.kafka.connect.SplunkSinkTask:83)

I don't think this negative number is normal ;)

@ludovic-boutros
Contributor Author

My understanding is that this incorrect outstanding event count makes the outstanding events limit ineffective, which leads to an out-of-memory error in case of Splunk slowness or failure.
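To spell out the reasoning with the value from the debug log above (my own illustration, not the connector's code): the room-for-more check compares the tracked count against the configured limit, and a large negative count always passes, so backpressure never engages while failed batches keep accumulating for retry.

// Illustration only; the constant values come from our config and the debug log above.
public class NegativeCountDemo {
    public static void main(String[] args) {
        int maxOutstanding = 10000;      // splunk.hec.max.outstanding.events
        int tracked = -4656978;          // corrupted outstanding event count from the log

        boolean roomForMore = tracked < maxOutstanding;  // always true with a negative count
        System.out.println(roomForMore); // prints true: the task keeps accepting records,
                                         // failed batches pile up for retry,
                                         // and the heap eventually fills up
    }
}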

@ludovic-boutros
Contributor Author

@VihasMakwana I did not manage to understand how the outstanding event count could drop below zero.
Do you think we could just add a small check in the event tracker to keep it greater than or equal to zero, for example along the lines of the sketch below?
I don't know whether it would have side effects, but it would at least ensure that the max outstanding events limit is effective.
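Something like this, as a minimal sketch with my own class and method names (not a patch against the actual event tracker):

import java.util.concurrent.atomic.AtomicInteger;

class ClampedEventTracker {
    private final AtomicInteger outstanding = new AtomicInteger(0);

    void addEvents(int events) {
        outstanding.addAndGet(events);
    }

    void removeEvents(int events) {
        // Clamp at zero so a double decrement can never drive the count
        // negative; this treats the symptom, not the root cause.
        outstanding.updateAndGet(current -> Math.max(0, current - events));
    }

    int outstandingEvents() {
        return outstanding.get();
    }
}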

@ludovic-boutros
Contributor Author

ludovic-boutros commented Apr 24, 2024

@VihasMakwana I have patched the connector to prevent a negative event count, and we can see the effect on the number of events kept in memory. It does not fix the root cause, but it at least addresses the symptoms.
[Screenshot: Kafka Connect metrics after the patch]

@ludovic-boutros
Contributor Author

Resolved by #431
