[bug]: 0.11.1: More CPU usage for the database server #1066
The new version includes improvements to distributed queue management. In other deployments, version 0.11 was actually faster.
Here is the feedback from my database provider: a few days ago, the database was handling fewer than 2,000 queries per second, including around 100 insertions per second. Currently, they are observing about 20,000 queries per second, including more than 5,000 insertions per second.
Try disabling the metrics store to see if it helps.
For your information: the change in behavior seems to have started around 10:30 AM (upgrade to 0.11.1), with a gradual increase in the number of queries. The queries mainly involve your DATABASE_NAME database. These queries follow a similar pattern:
For your information, at 12 PM I switched from a MySQL instance with 2 CPUs to an instance with 4 CPUs. The overall CPU usage remains stuck at 90% regardless of the number of CPUs in the instance. I suspect an issue related to the queue, as mentioned in my previous messages. In any case, there seems to be a problem since version 0.11.1, and the looping logs are related to the queue. Sorry to insist, but this is causing a massive slowdown for all my users, whereas in version 0.10.7 everything was perfect. I am currently stuck and therefore relying on you to resolve this issue.
See #1069 for the solution.
Try to obtain the top queries that are causing most of the load. That will tell us what is going on.
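One way to get those top queries, sketched here on the assumption that MySQL's performance_schema is enabled (the table and column names below are standard MySQL, not anything Stalwart-specific):

```sh
# List the 10 statement digests that consumed the most total execution time.
# SUM_TIMER_WAIT is reported in picoseconds, hence the division by 1e12.
mysql -u root -p -e "
  SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1e12 AS total_seconds
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY SUM_TIMER_WAIT DESC
  LIMIT 10;"
```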
There are multiple issues open, and I'm travelling at the moment, so it's hard to follow them from my phone; let's stick to this one thread please. Here is a reply to each one of the problems you reported:
No problem, I'll stay here. How do you explain that when I disable all the options in the spam filter, I no longer get any warning messages about the concurrent connection limit? I also see a significant drop in the number of MySQL connections, and the entire infrastructure becomes smooth again. I also no longer get any queue locked messages. Disabling the spam filter features resolves all these messages, whether they are informational or errors. Their excessive presence was not normal. I'm bringing all this up and insisting a bit to help Stalwart and assist you in identifying an issue, not to bother you. I dealt with the problem all day and I'm glad I resolved it by disabling the spam filter features. However, if I can help you figure out what is causing all the issues encountered (hence the multiple bugs reported, as initially I didn't think they were related), it would be my pleasure.
The spam filter does not trigger any outbound messages and does not interact with the queue in any way, so it is strange that the concurrency errors disappear when disabling it. The filter is executed at the SMTP DATA stage before a message is accepted in the system. Perhaps in 0.10 you had the spam filter disabled and you were not used to seeing all those extra queries?
I'm completely lost. The issue has reappeared after an hour without problems. I have no outgoing emails, nothing is being sent. How can the limit be reached if nothing is going out?
Check how many records you have in tables 'e' and 'q'; that is the message queue.
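For example, the row counts could be checked with a query along these lines (DATABASE_NAME stands in for the Stalwart database mentioned earlier):

```sh
# Count the rows in the two queue tables referred to above.
mysql -u root -p DATABASE_NAME -e "
  SELECT 'e' AS tbl, COUNT(*) AS cnt FROM e
  UNION ALL
  SELECT 'q', COUNT(*) FROM q;"
```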
Okay, so all the emails in the queue are looping. Please take a look at the following video. Enregistrement.de.l.ecran.2025-01-08.a.19.45.40.mov
And do you see these messages from the webadmin?
The 52 messages are visible in the web admin. I performed the following test: So, it seems that all emails in the queue are continuously looping rather than being sent once every 30 minutes, for example. I experience a heavy load as soon as there is an email in the queue because it loops and overloads everything.
Please export the queue by running stalwart with --console and I'll look into it soon.
What is the command to export the queue, please?
In the meantime, I used the CLI and sent it to you via Discord.
I'll look into this tomorrow evening.
Alright. I sent you a video on Discord to illustrate the retry action that's not working. CLI: Queue message detail:
Do you have another method to force a retry?
Try downgrading to 0.10 or use a single server for outgoing mail. This can be achieved by pausing the queue on all servers except one.
If I revert to version 0.10.7, I encounter corrupted data errors.
I redirected port 25 to a single machine instead of the 4 machines, but how can I pause the queue on the other machines? Currently, the inbound message queues are looping at a rate of 3 to 4 times per second for each queue ID. You can see this in the videos, etc.
Morning update. The night was calm, and everything seems OK this morning. No emails in the queue, reception is working well, and the CPU load of the database is low. Let's see how it performs during the day, but it looks good for the lock issue. Regarding the startup issue, it only occurs when I need to restart all the servers. In such cases, they go into out-of-memory mode for about 5 minutes. This could possibly be due to pending delivery emails or the influx of pending authentication connections all arriving at once? I don't have 100 messages in 20 seconds, but about 2 incoming emails per second. If I restart one server (1/4) now, there's no issue. No out-of-memory problems. It seems to occur only during a full cluster restart. However, the cache behavior seems odd because I have 6,900 mailboxes and the cache configuration mentioned earlier. Yet, this morning, I have 508.7 used, 687.0 buff/cache, with 7 GB of available RAM. In theory, shouldn't the cache at least use 3 GB per cluster machine? Stalwart uses only 5% of total memory (total memory: 8 GB).
Good news so far. If there are no issues today I'll release version
Stalwart has a default inbound limit of 8192 concurrent connections (across all services). More concurrent connections than the memory can handle could in theory cause an OOM, but it is strange that it happens on all servers. If you would like to debug why this is happening, you can isolate one server and, after a full restart, run Stalwart under gdb:

```sh
$ sudo gdb -p <STALWART_PID>
set pagination off
set print thread-events off
thread apply all bt
c
```

And then post the stack trace here. If that does not provide enough information, we'll have to compile Stalwart with a heap tracer.
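If attaching interactively is awkward, a non-interactive variant of the same gdb commands (standard gdb options, nothing Stalwart-specific) might look like this:

```sh
# Dump a backtrace of every thread to a file without an interactive session.
sudo gdb -batch \
  -ex "set pagination off" \
  -ex "thread apply all bt" \
  -p <STALWART_PID> > stalwart-backtrace.txt
```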
Yes, I was proposing to reduce the cache to help troubleshoot the cause. But running
Thank you for your feedback. If I isolate a single server, a full restart does not cause any issues. There is no "out of memory" error. I can note down the command and execute it the next time I encounter this reboot issue. Also, why doesn't Stalwart use the 3 GB available to it during normal operation?
The cache only grows in size when entries are added to it (i.e. when a new user logs in). And the 3 GB probably won't be reached anyway, as you have fewer than 10k users and each cache entry has a size ranging from a few bytes to 1 MB (for huge mailboxes).
Alright, thank you. Indeed, I have 6,900 active mailboxes. That makes it very clear. So, during the next issue (computers fail and always will! 😊), I will open a "bug" on GitHub for the OOM with the trace data. Does that work for you? In any case, I'll update this topic tonight with a summary of the day regarding queues and locks.
Sure, thank you.
Great. And once this issue is gone, consider using Redis, as it is the recommended in-memory store for distributed environments (see the documentation). Although this bug should've been caught by the test suite, it is a consequence of what Stalwart needs to do to simulate an in-memory store on a database. I understand that Redis is an additional dependency to maintain but, on the other hand, you will be taking some load off your SQL cluster.
I don't mind switching to Redis, but I can't afford another migration for my clients. I've already been working on this for 4 months. How can I migrate from MySQL to Redis? Do you have a step-by-step guide? My cloud provider (Scaleway) allows me to have a high-availability Redis setup across 3 nodes managed by them, so I don't mind switching. Currently, MySQL is also running in a master-slave setup managed by Scaleway.
After rereading the documentation, I understand that this concerns only temporary information. So, in theory, I can simply replace the MySQL store with Redis in the 'In-Memory Store' section without any issues. If that's correct, how much RAM should I plan for?
There is no need to migrate; practically all the information stored in the in-memory store is temporary, see the full list of what is stored here. Basically you will be losing the Bayes filter training data; all other keys are temporary. To migrate, just add your Redis database as a new store and configure it as your default in-memory store. Do this on all servers and restart them. To avoid service disruptions in case something is misconfigured in Redis, I suggest that you deploy a test server, install Stalwart and configure it to use Redis. Once you are able to log in (and see some actual data in Redis) you can configure your cluster to use Redis.
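As a very rough sketch of what that change could look like in the configuration, applied here with a shell heredoc; the store id, key names (`type`, `urls`, `storage.lookup`) and config path are assumptions and should be checked against the Stalwart documentation before use:

```sh
# Append a Redis store definition and point the in-memory (lookup) store at it.
# All key names and the config path below are assumptions; verify in the docs.
cat >> /opt/stalwart-mail/etc/config.toml <<'EOF'

[store."redis"]
type = "redis"
urls = "redis://redis.internal:6379"

[storage]
lookup = "redis"
EOF

# Repeat on every server in the cluster, then restart Stalwart on each one.
```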
Thank you for the clarification.
Yes, it is small data, mostly counters. If you enable the Bayes classifier you can end up having hundreds of thousands of keys, but each key/value pair will have a maximum size of 24 bytes.
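A quick back-of-the-envelope check of the RAM question above, using an illustrative (assumed) figure of 500,000 Bayes keys:

```sh
# 500,000 keys * 24 bytes is on the order of a dozen megabytes,
# so Redis memory sizing is unlikely to be a concern here.
echo "$(( 500000 * 24 / 1024 / 1024 )) MiB (approx.)"
```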
Bad news. I once again have hundreds of emails in the queue that are not being delivered. Again, I run the CURL command and restart. The emails are then delivered.
Try setting the log level to trace and search for the IDs of some of the messages that are not being delivered.
Sorry, debug is enough. Trace is too verbose.
I think I was able to reproduce the issue in the test suite by using MySQL as the queue backend instead of RocksDB. I'll send you an update shortly.
Something you can test from your side: rather than deleting the locks, try pausing and resuming the queue on one or multiple servers.
Unfortunately, I can't. Since I ran the CURL command and rebooted, all 1,000 pending emails were distributed, and now that the message queue is empty, distribution is immediate and no longer blocked. I'll switch to Redis as soon as possible because my clients are starting to get tired of this issue (and honestly, I'm not sleeping well either ^^).
But it remains to be seen whether the queue blocking issue is the same as the 'locks'. They seem to be two different things, but I don't have enough knowledge of the Stalwart program to say for sure.
I was able to reproduce the issue, but when the locks expired (5 minutes) the system automatically delivered all pending messages.
On my side, there was no delivery for 2 hours. It's as if the lock doesn't expire at all. For Redis, in the configuration, you request a URL like redis:///. How can I disable SSL verification?
Do you think you have a lead on the issue with messages stuck in the queue via MySQL? Thank you so much.
I ran tests for over an hour using MySQL as the backend, with queues containing thousands of messages and up to 20 concurrent delivery tasks. Unfortunately I couldn't reproduce the issue. Even if I manually lock a message, the lock expires in 5 minutes and the message is delivered. Do you have any errors in the MySQL logs or store-related errors in Stalwart? The only possibility is that some of the queries are failing. Before you switch to Redis, can you upgrade to the latest version in the repository? It's the same as the one you have but produces a new event called
Use
Alright, I will compile the latest version and deploy it to production. No, no errors in the logs. The emails are placed in the queue and nothing more. They just aren't delivered. Alright, I'll try pause/resume next time.
Also, you can check whether a message is stuck in the queue with the SQL query generated by this shell command (replace 123456 with the queueId):

```sh
$ echo "SELECT * FROM m WHERE k = CONCAT(UNHEX('15'), UNHEX('$(printf '%016x' 123456)'));"
```

Make sure you are in
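If it helps, the generated statement can also be piped straight into the mysql client rather than copied by hand (DATABASE_NAME is a placeholder for the database mentioned earlier):

```sh
# Build the key for queueId 123456 and run the lookup in one step.
echo "SELECT * FROM m WHERE k = CONCAT(UNHEX('15'), UNHEX('$(printf '%016x' 123456)'));" \
  | mysql -u root -p DATABASE_NAME
```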
Alright, thank you.
I've deployed the latest version on one of the machines.
Yes, all machines with debug-level logging please. Otherwise the issue will be harder to track down.
Ok, it's done. I'll keep you updated.
I just sent you the log from one of the machines on Discord.
The message queue is now empty, but on machine 3, I still see the following log looping: It's as if it hasn't realized that there are no more messages in the queue, so it still thinks it's at the limit of 100. Node 1: sporadic queue processing, very little. Node 2: no queue processing. Node 3: the message is looping. Node 4: normal queue processing.
What happened?
Since updating the cluster machines (4 machines) to version 0.11.1, I have noticed a 20-30% increase in CPU usage on the database server.
Is this an expected increase following this update?
How can we reproduce the problem?
Not simple to reproduce. The red bar represents the update.
Version
v0.11.x
What database are you using?
MySQL
What blob storage are you using?
S3-compatible
Where is your directory located?
Internal
What operating system are you using?
Linux
Relevant log output
No response