
auparse_next_event returns 0 after auparse_feed #416

Open
neel opened this issue Nov 20, 2024 · 7 comments

neel commented Nov 20, 2024

Given au initialized with auparse_init(AUSOURCE_FEED, 0), auparse_next_event returns 0 even after data has been passed to auparse_feed, as shown in the example below. This happens even when no callback is set.

if (auparse_feed(_au, l.data(), l.size()) != 0) {
    std::cerr << "Failed to feed data to auparse. Error: " << std::strerror(errno) << std::endl;
}
auparse_flush_feed(_au);
int res = auparse_next_event(_au); // res is 0
while (res > 0) {
    // never reached: the events were already consumed inside auparse_feed
    res = auparse_next_event(_au);
}

This happens because consume_feed drains all pending events unconditionally, whether or not a callback is registered.

static void consume_feed(auparse_state_t *au, int flush)
{
	//if (debug) printf("consume feed, flush %d\n", flush);
	while (auparse_next_event(au) > 0) {
		if (au->callback) {
			(*au->callback)(au, AUPARSE_CB_EVENT_READY, au->callback_user_data);
		}
	}
     ...
}

I'd prefer a non-callback way of extracting events, which is achievable with a simple change:

static void consume_feed(auparse_state_t *au, int flush)
{
	//if (debug) printf("consume feed, flush %d\n", flush);
	if (au->callback) {
		while (auparse_next_event(au) > 0) {
			(*au->callback)(au, AUPARSE_CB_EVENT_READY, au->callback_user_data);
		}
	}
	...
}

This way user code has the freedom to choose a non-callback way of using auparse_feed and to pull events with auparse_next_event, as sketched below.
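
For illustration, this is roughly how such pull-style usage could look if the change behaves as intended. It is only a sketch: it assumes the patch above is applied, that au was created with auparse_init(AUSOURCE_FEED, 0) with no callback registered, and the helper name parse_chunk is a placeholder.

#include <auparse.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

void parse_chunk(auparse_state_t *au, const std::string &chunk)
{
    if (auparse_feed(au, chunk.data(), chunk.size()) != 0)
        std::cerr << "auparse_feed failed: " << std::strerror(errno) << std::endl;
    auparse_flush_feed(au);

    // With the proposed change, completed events would stay queued for pulling here.
    while (auparse_next_event(au) > 0) {
        auparse_first_record(au);
        do {
            std::cout << "record type " << auparse_get_type(au) << std::endl;
        } while (auparse_next_record(au) > 0);
    }
}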


stevegrubb commented Nov 22, 2024

Auparse was designed to handle static and dynamic event collections. auparse_next_event is for static collections of events such as buffers or files. For dynamic event collections, meaning ones that change in real time, auparse_feed should be used with a callback. It was designed around a callback because, on each feed, it searches for a newly completed event and, if one is found, immediately delivers it through the callback function. If this were deferred to auparse_next_event, that search for a completed event would be duplicated. Also, the feed function invokes the callback as many times as it has complete events, meaning that passing in one record could unlock 5 events for the callback (note the kernel does not serialize events, auparse does). It's really more efficient to do it this way.

I'd really prefer not to mix the two idioms, as it will just confuse people as well as lose efficiency. Static collections can just use the loop method, and dynamic ones use feed + callback.
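
For reference, the dynamic-source idiom described here looks roughly like the following. This is a minimal sketch, assuming raw audit records arrive on stdin; the chunk size, the placeholder name on_event, and the printing done in the callback are illustrative only.

#include <auparse.h>
#include <cstdio>

// Invoked by auparse each time the fed data completes a whole event.
static void on_event(auparse_state_t *au, auparse_cb_event_t type, void *user_data)
{
    (void)user_data;                  // unused in this sketch
    if (type != AUPARSE_CB_EVENT_READY)
        return;
    auparse_first_record(au);
    do {
        std::printf("record type %d\n", auparse_get_type(au));
    } while (auparse_next_record(au) > 0);
}

int main()
{
    auparse_state_t *au = auparse_init(AUSOURCE_FEED, 0);
    auparse_add_callback(au, on_event, nullptr, nullptr);

    char buf[4096];
    std::size_t len;
    while ((len = std::fread(buf, 1, sizeof(buf), stdin)) > 0)
        auparse_feed(au, buf, len);   // may fire on_event several times per call

    auparse_flush_feed(au);           // deliver any event still buffered at EOF
    auparse_destroy(au);
    return 0;
}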

stevegrubb commented:

Switching the placement of "if (au->callback)" and auparse_next_event is simple. But I'd be curious why callbacks don't work for your situation. If it's because some variables are out of scope, you can collect them into a structure and pass it as the user_data argument to auparse_add_callback. It then gets handed to the callback function so that it's in scope. Many people don't know this and use global variables instead.
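
As an illustration of that user_data technique, here is a sketch; the struct name parse_ctx, its members, and the callback name on_event are hypothetical, not part of auparse.

#include <auparse.h>
#include <string>
#include <vector>

// Bundle whatever the callback needs into one struct instead of using globals.
struct parse_ctx {
    std::vector<std::string> *sink;   // hypothetical destination for results
    unsigned long events = 0;
};

static void on_event(auparse_state_t *au, auparse_cb_event_t type, void *user_data)
{
    if (type != AUPARSE_CB_EVENT_READY)
        return;
    auto *ctx = static_cast<parse_ctx *>(user_data);  // back in scope, no globals
    ctx->events++;
    const char *sc = auparse_find_field(au, "syscall");
    if (sc)
        ctx->sink->push_back(sc);
}

// At setup time:
//   parse_ctx ctx{&results};
//   auparse_add_callback(au, on_event, &ctx, nullptr);  // 4th arg: optional cleanup fn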


neel commented Nov 28, 2024

Currently I am using auparse_init(AUSOURCE_BUFFER, data).

My project is in C++. I have a parser class. The parser::parse method will be called by async callbacks in a multithreaded environment. So, initially I was setting up auparse_add_callback with a static function as the callback and this as the user data. From the static function I was casting the user data back to parser and calling a non-static method on the same object.

But this becomes very problematic and messy if the static function (the callback) gets called from multiple threads, unless I want to underutilize all the processor threads I have by putting everything behind a single mutex. The user data passed to auparse_add_callback is shared state. A thread-safe approach would require additional user data to be provided to auparse_feed itself. Unlike the existing implementation of auparse_feed, which copies and appends the input to a common buffer, such an alternate implementation would read from the provided input without copying. However, even with that implementation, the per-transaction parser state/scope passed to auparse_feed would still require initialization and destruction.

Initially I was using AUSOURCE_FEED to avoid creating and destroying the auparse parser for every parsing operation. But even with the alternate hypothetical design that cannot be avoided. So I switched to auparse_init(AUSOURCE_BUFFER, data) instead and process the buffer without any callback.
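
For context, the callback-free, buffer-based processing described here looks roughly like this (a sketch; it assumes data is a NUL-terminated string that already holds complete audit events, and process_buffer is a placeholder name):

#include <auparse.h>
#include <iostream>

void process_buffer(const char *data)
{
    auparse_state_t *au = auparse_init(AUSOURCE_BUFFER, data);
    if (!au)
        return;

    // Static collection: a plain pull loop, no callback involved.
    while (auparse_next_event(au) > 0) {
        const char *sc = auparse_find_field(au, "syscall");
        if (sc)
            std::cout << "syscall=" << auparse_interpret_field(au) << '\n';
    }
    auparse_destroy(au);
}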

stevegrubb commented:

auparse was intended to allow multiple instances of parsing by having multiple copies of the auparse_state_t variable returned by auparse_init. However, using AUSOURCE_BUFFER can be problematic if the buffer does not contain whole events. I don't know the source of the data that is being fed, but the kernel does not serialize the records it emits. Events can be interleaved, which is one of the problems auparse solves, but it needs enough data to determine that an event is finished. So, there is that issue.

Also, I reviewed the code to make sure it can handle multi-threaded use. I found a problem in the auparse/interpret.c file with a static global variable, il. I am working up a patch to move that into the auparse_state_t struct, which would make the library ready for multi-threading. It will be in the 4.0.3 release. Until then, there could be random problems with interpretations or crashes.


neel commented Dec 2, 2024

However, using AUSOURCE_BUFFER can be problematic if the buffer does not contain whole events.

Yes, that is the problem I am encountering too. I haven't started fixing that yet as I was working on something else. In my case I have a collector that collects events and transmits them over ZMQ. I don't want the collector to be heavy. So, I intend to collect enough to describe an event comprehensively and then transmit. Is it enough to check the items key and count the number of following records, transmitting once all of the items records have been collected? I mean, how can I know that, given one record, sufficient information about the event (e.g. the arguments and the return value of the syscall) has been captured?

I am mainly interested in syscalls.

stevegrubb commented:

The problem of event completeness is really what auparse was designed to solve. You shouldn't need to know all the details of how the kernel creates events to start a project. Auparse insulates you from these details.

If the various threads are all from different data sources, you can instantiate multiple auparse_state_t variables by calling auparse_init for each data source. The problem would be that at some point the events need to be serialized with each other.

Also, I would be open to ideas on how to serialize multiple threads at the auparse_feed API. Perhaps the callback could choose from a pool of threads?


neel commented Dec 6, 2024

Also, I would be open to ideas on how to serialize multiple threads at the auparse_feed API. Perhaps the callback could choose from a pool of threads?

I think a resource pool would be better than a thread pool. Parsing the feed requires state that cannot be shared by two threads simultaneously, so a pool of auparse states would be better, I guess. Entry to these states would be guarded by a counting semaphore. User code may have a thread pool from which it invokes the auparse_feed function. A thread spawned by the user code would either wait on the semaphore or process the data on an available auparse state. I've recently designed a similar resource pool for Lua, but it is a bit more complicated due to type bindings, and it's in C++.
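
To make the idea concrete, here is a rough sketch of such a pool: a fixed number of independent auparse_state_t instances, with entry guarded by a C++20 counting semaphore. The pool size, the feed_chunk helper, and everything around it are assumptions, and chunks belonging to the same log stream would still have to keep hitting the same state so events can be reassembled.

#include <auparse.h>
#include <array>
#include <mutex>
#include <semaphore>
#include <string>

constexpr std::ptrdiff_t kSlots = 4;            // assumed pool size

struct slot {
    auparse_state_t *au = nullptr;
    std::mutex busy;
};

std::array<slot, kSlots> pool;
std::counting_semaphore<kSlots> available{kSlots};

void init_pool()
{
    for (auto &s : pool)
        s.au = auparse_init(AUSOURCE_FEED, 0);  // one state per slot, never shared
}

// Called from any worker thread with one chunk of raw audit data.
void feed_chunk(const std::string &chunk)
{
    available.acquire();                        // block until some state is free
    for (auto &s : pool) {
        std::unique_lock lk(s.busy, std::try_to_lock);
        if (!lk.owns_lock())
            continue;                           // slot in use, try the next one
        auparse_feed(s.au, chunk.data(), chunk.size());
        break;
    }
    available.release();
}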

My design is actually like this: I am collecting audit logs through an agent. I want to minimize the complexity of collection, because there can be many other collectors, e.g. log files, pcap, etc. Rather, I want to collect and transmit with minimal processing. If I use auparse while collecting, then I have to serialize the parsed result and then transmit it to the server. But the server will deserialize it anyway, and the raw audit log is already serialized and parsable. So I am avoiding parsing while collecting; rather, I'd parse it on the server.

I have one question.

I am only interested in syscalls. Is it okay if the collector first checks whether an entry is of type SYSCALL? If it is, it creates a bucket in a dictionary. After that, if it gets an entry with the same msg id, it puts it under the same bucket. It closes the bucket once it gets a PROCTITLE-type entry. Once the bucket is closed, it takes all the entries and sends them over ZMQ. Once I receive them on the server, I process them using auparse; there I use AUSOURCE_BUFFER. So I expect that all entries I process on the server after one receive operation are associated with a single syscall event.
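
For illustration, the collector-side bucketing described above might look like the sketch below. It is assumption-laden: extraction of the record type and the audit serial from the raw line, the on_record entry point, and the send_over_zmq transport are placeholders, and whether PROCTITLE is really always the last record is exactly the open question here.

#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<std::string>> buckets;    // key: audit event serial

void send_over_zmq(const std::vector<std::string> &event);  // assumed transport

// "type" and "serial" are assumed to be pre-extracted from "msg=audit(ts:serial):".
void on_record(const std::string &type, const std::string &serial,
               const std::string &raw_line)
{
    if (type == "SYSCALL") {
        buckets[serial] = {raw_line};               // open a bucket for this event
    } else if (auto it = buckets.find(serial); it != buckets.end()) {
        it->second.push_back(raw_line);             // same msg id, same bucket
        if (type == "PROCTITLE") {                  // treat PROCTITLE as the closer
            send_over_zmq(it->second);
            buckets.erase(it);
        }
    }
}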

Is it possible to get any information about the event after the PROCTITLE record has been written? If so, should I wait a few seconds more before closing the bucket?
