
[bug]: 0.11.1 : More CPU usage for the database server #1066

Closed
NathanHbt opened this issue Jan 8, 2025 · 118 comments
Labels
bug Something isn't working

Comments

@NathanHbt

What happened?

Since updating the cluster machines (4 machines) to version 0.11.1, I have noticed a 20-30% increase in CPU usage on the database server.
Is this increase expected after the update?

How can we reproduce the problem?

Not simple to reproduce. The red bar represents the update.

Screenshot 2025-01-08 at 11:29:09
Screenshot 2025-01-08 at 11:29:14

Version

v0.11.x

What database are you using?

MySQL

What blob storage are you using?

S3-compatible

Where is your directory located?

Internal

What operating system are you using?

Linux

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
NathanHbt added the bug label Jan 8, 2025
@mdecimus
Member

mdecimus commented Jan 8, 2025

The new version includes improvements to distributed queue management; in other deployments version 0.11 was actually faster.
Can you check which queries are generating the load?
It could also be that a purge task is running at the moment.

@NathanHbt
Author

For your information, since the update, the CPU usage is at 95%, and the number of connections has increased from 50 to a constant 250.
I have a lot of “waiting for handler commit” and activity in the “m” table.
Screenshot 2025-01-08 at 13:41:01

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

Screenshot 2025-01-08 at 13:42:26

@NathanHbt
Author

Screenshot 2025-01-08 at 13:54:58

@NathanHbt
Author

Here is the feedback from my database provider:

A few days ago, the database was handling less than 2,000 queries per second, including around 100 insertions per second.

Currently, they are observing about 20,000 queries per second, including more than 5,000 insertions per second.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try disabling the metrics store to see if it helps

@NathanHbt
Author

Screenshot 2025-01-08 at 15:24:19

Are you talking about this?

@NathanHbt
Author

For information:

The change in behavior seems to have started around 10:30 AM (upgrade to 0.11.1), with a gradual increase in the number of queries.

The queries mainly involve your DATABASE_NAME database.

These queries follow a similar pattern:
INSERT INTO m (k, v) VALUES (…) ON DUPLICATE KEY UPDATE v = VALUES(v).

@NathanHbt
Author

Screenshot 2025-01-08 at 17:07:49

For your information, at 12 PM, I switched from a MySQL instance with 2 CPUs to an instance with 4 CPUs. The overall CPU usage remains stuck at 90% regardless of the number of CPUs in the instance.

I suspect an issue related to the queue, as mentioned in my previous messages. In any case, there seems to be a problem since version 0.11.1, and the looping logs are related to the queue.

Sorry to insist, but this is causing a massive slowdown for all my users, whereas in version 0.10.7 everything was perfect. I am currently stuck and therefore relying on you to resolve this issue.

@NathanHbt
Author

See #1069 for solution.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try to obtain the top queries that are causing most of the load. That will tell us what is going on.

@mdecimus
Member

mdecimus commented Jan 8, 2025

There are multiple issues open. I'm travelling at the moment and it's hard to follow them from my phone, so let's stick to this one thread please. Here is a reply to each one of the problems you reported:

  • Concurrent limit exceeded: This is not an error message; the server is just telling you that there is a message due for immediate delivery, but it cannot be processed at the moment because the node has exceeded the maximum number of concurrent outbound connections. This is a new feature to protect the server from crashing while trying to deliver more messages than it can handle. However, the default log level should be changed to debug, as it can generate a massive amount of events, as you have seen.
  • Queue event locked: Also not an error; it means that another server is processing that message.
  • Spam filter causing database load: It's probably the Bayes filter storing the weights and reputation data. I recommend that you use Redis or similar as the in-memory store. Even if you don't use the spam filter, you should be using an in-memory store rather than SQL. Read the updated documentation, which explains what the in-memory store is used for.

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

No problem, I’ll stay here.

How do you explain that when I disable all the options in the spam filter, I no longer get any warning messages about the concurrent connection limit? I also see a significant drop in the number of MySQL connections, and the entire infrastructure becomes smooth again.

I also no longer get any queue locked messages.

Disabling the spam filter features resolves all these messages, whether they are informational or errors. Their excessive presence was not normal.

I’m bringing all this up and insisting a bit to help Stalwart and assist you in identifying an issue, not to bother you. I dealt with the problem all day and I’m glad I resolved it by disabling the spam filter features. However, if I can help you figure out what is causing all the issues encountered (hence the multiple bugs reported, as initially I didn’t think they were related), it would be my pleasure.

Screenshot 2025-01-08 at 18:08:18

@mdecimus
Member

mdecimus commented Jan 8, 2025

The spam filter does not trigger any outbound messages and does not interact with the queue in any way, so it is strange that the concurrency errors disappear when you disable it. The filter is executed at the SMTP DATA stage, before a message is accepted into the system.
What the spam filter will generate is a lot of queries required to run the Bayes classifier and the reputation modules, which is why I suggested using Redis (which you should do in any case).

Perhaps in 0.10 you had the spam filter disabled and you were not used to seeing all those extra queries?

@NathanHbt
Author

I’m completely lost. The issue has reappeared after an hour without problems.
Where could this message come from:
“Concurrency limit exceeded (queue.concurrency-limit-exceeded) details = Outbound concurrency limit exceeded, limit = 25”?

I have no outgoing emails, nothing is being sent. How can the limit be reached if nothing is going out?

@mdecimus
Member

mdecimus commented Jan 8, 2025

Check how many records you have in tables 'e' and 'q'; that is the message queue.
You should also be able to see them on the queue management page in the webadmin.
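
If you have direct SQL access, a quick way to count them is a query along these lines (a sketch only; 'e' and 'q' are the table names mentioned above, and it assumes you are connected to the Stalwart database):

SELECT 'e' AS queue_table, COUNT(*) AS row_count FROM e
UNION ALL
SELECT 'q', COUNT(*) FROM q;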

@NathanHbt
Author

Only 52 messages in these tables.
Screenshot 2025-01-08 at 19:42:18
Screenshot 2025-01-08 at 19:42:23

@NathanHbt
Author

Okay, so all the emails in the queue are looping. Please take a look at the following video.

Enregistrement.de.l.ecran.2025-01-08.a.19.45.40.mov

@mdecimus
Member

mdecimus commented Jan 8, 2025

And do you see these messages in the webadmin?
Also, are you experiencing high load at this moment?

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

The 52 messages are visible in the web admin.
Watch the video. You can see that a message in the queue on the web admin is continually being re-pushed.

I performed the following test:
  • Deleted the email from the queue via the interface.
  • Restarted the server.
The log no longer loops on this queue ID.

So, it seems that all emails in the queue are continuously looping rather than being sent once every 30 minutes, for example.

I experience a heavy load as soon as there is an email in the queue, because it loops and overloads everything.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Please export the queue by running stalwart with --console and I'll look into it soon.

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

What is the command to export the queue, please?
I only have:
Enter commands (type 'help' for available commands).

help
Available commands:
scan <from_key> <to_key>
delete <from_key> [<to_key>]
get
put []
help
exit/quit

In the meantime, I used the CLI and sent it to you via Discord.

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

I just sent you a private video on Discord that clearly shows the issue.

By clearing the queue of 50 messages and restarting the cluster nodes, the database CPU load returns to normal (80% => 30%). This suggests there is an infinite loop in the process handling queued messages, which could explain the thousands of log entries and the limits being reached.

I am not sharing the video here for confidentiality reasons.

Screenshot 2025-01-08 at 20:26:43

@mdecimus
Member

mdecimus commented Jan 9, 2025

I'll look into this tomorrow evening.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

Alright.
For your information, in the meantime, I am forced to manually clear the message queue.
I tried creating a script to force a “retry” on the messages in the queue, but neither the interface nor the CLI works for the retry. The email remains in the queue, and no server error message is displayed. The only solution is to “cancel” a message, which results in losing it.

I sent you a video on Discord to illustrate the retry action that’s not working.

CLI:
queue retry 307D7024CA03859
Successfully rescheduled 1 message(s).

Queue message detail:
+----------+-----------+
| Status   | Scheduled |
| Details  |           |
| Retry #  | 0         |
+----------+-----------+

@NathanHbt
Author

Do you have another method to force a retry?
This heavily impacts my infrastructure, and managing 3 days of issues will be challenging.

@mdecimus
Member

mdecimus commented Jan 9, 2025

Try downgrading to 0.10 or use a single server for outgoing mail. This can be achieved by pausing the queue on all servers except one.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

If I revert to version 0.10.7, I encounter corrupted data errors.

ERROR Data corruption detected (store.data-corruption) causedBy = crates/store/src/write/mod.rs:365, causedBy = crates/store/src/dispatch/store.rs:125, causedBy = crates/jmap/src/services/index.rs:122, details = Failed to iterate over index emails

I redirected port 25 to a single machine instead of the 4 machines, but how can I pause the queue on the other machines?

Currently, the inbound message queues are looping at a rate of 3 to 4 times per second for each queue ID. You can see this in the videos, etc.

@NathanHbt
Author

NathanHbt commented Jan 14, 2025

Morning update. The night was calm, and everything seems OK this morning. No emails in the queue, reception is working well, and the CPU load of the database is low. Let’s see how it performs during the day, but it looks good for the lock issue.

Regarding the startup issue, it only occurs when I need to restart all the servers. In such cases, they go into out-of-memory mode for about 5 minutes. This could possibly be due to pending delivery emails or the influx of pending authentication connections all arriving at once? I don’t have 100 messages in 20 seconds, but about 2 incoming emails per second.

If I restart one server (1/4) now, there’s no issue. No out-of-memory problems. It seems to occur only during a full cluster restart.

However, the cache behavior seems odd given that I have 6,900 mailboxes and the cache configuration mentioned earlier. Yet this morning I see 508.7 MB used and 687.0 MB buff/cache, with 7 GB of available RAM. In theory, shouldn't the cache use at least 3 GB per cluster machine? Stalwart uses only 5% of total memory (total memory: 8 GB).

@mdecimus
Member

> Morning update. The night was calm, and everything seems OK this morning. No emails in the queue, reception is working well, and the CPU load of the database is low. Let’s see how it performs during the day, but it looks good for the lock issue.

Good news so far. If there are no issues today I'll release version 0.11.2.

> Regarding the startup issue, it only occurs when I need to restart all the servers. In such cases, they go into out-of-memory mode for about 5 minutes. This could possibly be due to pending delivery emails or the influx of pending authentication connections all arriving at once? I don’t have 100 messages in 20 seconds, but about 2 incoming emails per second.

Stalwart has a default inbound limit of 8192 concurrent connections (across all services). More concurrent connections than the memory can handle could in theory cause an OOM but it is strange that it happens on all servers. If you would like to debug why this is happening you can isolate one server and, after a full restart, run Stalwart under gdb as follows:

$ sudo gdb -p <STALWART_PID>
(gdb) set pagination off
(gdb) set print thread-events off
(gdb) thread apply all bt
(gdb) c

And then post here the stack trace. If that does not provide enough information, we'll have to compile Stalwart with a heap tracer.

> In theory, shouldn’t the cache at least use 3 GB per cluster machine?

Yes, I was proposing to reduce the cache to help troubleshoot the cause. But running gdb will be more useful.

@NathanHbt
Author

NathanHbt commented Jan 14, 2025

Thank you for your feedback.

If I isolate a single server, a full restart does not cause any issues. There is no “out of memory” error.
The OOM occurs only when I restart all four servers simultaneously. This leads to a Stalwart reboot loop for about 5 minutes, after which it stabilizes and works properly.

I can note down the command and execute it the next time I encounter this reboot issue.

Also, why doesn’t Stalwart use the 3 GB available to it during normal operation?

@mdecimus
Member

> Also, why doesn’t Stalwart use the 3 GB available to it during normal operation?

The cache only grows when entries are added to it (i.e. when a new user logs in). And the 3 GB probably won't be reached anyway, as you have fewer than 10k users and each cache entry ranges from a few bytes to 1 MB (for huge mailboxes).
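
(As a rough illustration: even if all 6,900 mailboxes were cached at an average of, say, 50 KB per entry, that would amount to only around 350 MB per node, so the 3 GB setting is an upper bound rather than a target.)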

@NathanHbt
Author

Alright, thank you. Indeed, I have 6,900 active mailboxes. That makes it very clear.

So, during the next issue (computers fail and always will! 😊), I will open a “bug” on Github for the OOM with the trace data. Does that work for you?

In any case, I’ll update this topic tonight with a summary of the day regarding queues and locks.

@mdecimus
Member

> I will open a “bug” on GitHub for the OOM with the trace data. Does that work for you?

Sure, thank you.

> In any case, I’ll update this topic tonight with a summary of the day regarding queues and locks.

Great.

And once this issue is gone, consider using Redis, as it is the recommended in-memory store for distributed environments (see the documentation). Although this bug should've been caught by the test suite, it is a consequence of what Stalwart needs to do to simulate an in-memory store on a database. I understand that Redis is an additional dependency to maintain, but on the other hand you will be taking some load off your SQL cluster.

@NathanHbt
Author

I don’t mind switching to REDIS, but I can’t afford another migration for my clients. I’ve already been working on this for 4 months.

How can I migrate from MySQL to REDIS? Do you have a step-by-step guide?

My cloud provider (Scaleway) allows me to have a high-availability REDIS setup across 3 nodes managed by them, so I don’t mind switching. Currently, MySQL is also running in a master-slave setup managed by Scaleway.

@NathanHbt
Author

After rereading the documentation, I understand that this concerns only temporary information. So, in theory, I can simply replace the MySQL store with REDIS in the ‘In-Memory Store’ section without any issues.

If that’s correct, how much RAM should I plan for?

@mdecimus
Member

There is no need to migrate; practically all the information stored in the in-memory store is temporary (see the full list of what is stored here). Basically you will be losing the Bayes filter training data; all other keys are temporary.

To migrate, just add your Redis database as a new store and configure it as your default in-memory store. Do this on all servers and restart them.
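
As a rough sketch, the relevant configuration could look something like the snippet below; the store name, URL and exact key names are illustrative, so verify them against the Redis store and in-memory store documentation for your version:

[store."redis"]
type = "redis"
redis-type = "single"        # assumption: "cluster" would be used for a clustered Redis
urls = "redis://127.0.0.1"   # placeholder; point this at your managed Redis endpoint

[storage]
lookup = "redis"             # select the Redis store as the default in-memory (lookup) store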

To avoid service disruptions in case something is misconfigured in Redis I suggest that you deploy a test server, install Stalwart and configure it to use Redis. Once you are able to login (and see some actual data in Redis) you can configure your cluster to use Redis.

@NathanHbt
Author

Thank you for the clarification.
I will test this on my development infrastructure once version 0.11.2 is released. This will allow me to stay updated across the board and test REDIS. I imagine a large-scale configuration for REDIS isn’t necessary since it handles only a small amount of data?

@mdecimus
Member

Yes, it is small data, mostly counters. If you enable the Bayes classifier you can end up having hundreds of thousands of keys, but each key/value pair will have a maximum size of 24 bytes.
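
(For a rough sense of scale: 500,000 keys at 24 bytes each is only about 12 MB of payload before Redis' own per-key overhead, so a small Redis instance should be more than enough.)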

@NathanHbt
Author

NathanHbt commented Jan 14, 2025

Bad news. I once again have hundreds of emails in the queue that are not being delivered.
If I run a test right now, the email remains in the queue and is not delivered.

Again, I run the CURL command and restart. The emails are then delivered.

@mdecimus
Member

Try setting the log level to trace and search for the IDs of some of the messages that are not being delivered.

@mdecimus
Member

Sorry, debug is enough. Trace is too verbose.

@mdecimus
Member

mdecimus commented Jan 14, 2025

I think I was able to reproduce the issue in the test suite by using MySQL as the queue backend instead of RocksDB. I'll send you an update shortly.

@mdecimus
Member

Something you can test from your side: rather than deleting the locks, try pausing and resuming the queue on one or multiple servers.

@NathanHbt
Author

Unfortunately, I can’t. Since I ran the CURL command and rebooted, all 1,000 pending emails were distributed, and now that the message queue is empty, distribution is immediate and no longer blocked.

I’ll switch to Redis as soon as possible because my clients are starting to get tired of this issue (and honestly, I’m not sleeping well either ^^).

@NathanHbt
Author

But it remains to be seen if the queue blocking issue is the same as the ‘locks.’ They seem to be two different things, but I don’t have enough knowledge of the Stalwart program to say for sure.

@mdecimus
Member

mdecimus commented Jan 14, 2025

I was able to reproduce the issue, but once the locks expire (5 minutes) the system automatically delivers all pending messages.

@NathanHbt
Author

NathanHbt commented Jan 14, 2025

On my side, there was no delivery for 2 hours. It’s as if the lock doesn’t expire at all.
After removing the locks and restarting the server, everything was transmitted gradually.

For REDIS, in the configuration, you request a URL like redis:///. How can I disable SSL verification?
2156:M 14 Jan 2025 14:13:29.312 # Error accepting a client connection: error:1408F10B:SSL routines:ssl3_get_record:wrong version number (conn: fd=16)
Redis configuration problem:

Error occurred while creating a new object: Redis error (store.redis-error): reason = Failed to create initial connections- IoError

@NathanHbt
Author

Do you think you have a lead on the issue with messages stuck in the queue via MySQL?
I managed to set up REDIS, but not with my usual provider because they require the connection to be secured with SSL/TLS, and this doesn’t seem to be supported. Before switching to REDIS with another provider, I’d like to know if you have any insights.
Also, for REDIS, is it possible to provide a certificate for the connection?

Thank you so much.

@mdecimus
Member

mdecimus commented Jan 14, 2025

> On my side, there was no delivery for 2 hours. It’s as if the lock doesn’t expire at all.

I ran tests for over an hour using MySQL as the backend, with queues containing thousands of messages and up to 20 concurrent delivery tasks. Unfortunately I couldn't reproduce the issue. Even if I manually lock a message, the lock expires after 5 minutes and the message is delivered. Do you have any errors in the MySQL logs or store-related errors in Stalwart? The only possibility is that some of the queries are failing.

Before you switch to Redis, can you upgrade to the latest version in the repository? It's the same as the one you have but produces a new event called queue-back-pressure that will tell us whether the node is out of connections. Also, rather than clearing the locks try pausing and resuming the queue. Then check in the logs the message history for the queue ids that are stuck (in debug mode).

> For REDIS, in the configuration, you request a URL like redis:///. How can I disable SSL verification?

Use rediss with #insecure, see here.
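
As an illustration (host and port are placeholders), the store URL would then look something like this, with the #insecure fragment disabling certificate verification:

urls = "rediss://redis.example.org:6379#insecure"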

@NathanHbt
Author

NathanHbt commented Jan 14, 2025

Alright, I will compile the latest version and deploy it to production.

No, no errors in the logs. The emails are placed in the queue and nothing more. They just aren’t delivered.

Alright, I’ll try pause/resume next time.

@mdecimus
Member

Also you can check whether a message is stuck in the queue with the SQL query generated by this shell command (replace 123456 with the queueId):

$ echo "SELECT * FROM m WHERE k = CONCAT(UNHEX('15'), UNHEX('$(printf '%016x' 123456)'));"
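
For example, with queueId 123456 the command above expands to the following statement (the hex string is simply the queue id zero-padded to 16 hexadecimal digits):

SELECT * FROM m WHERE k = CONCAT(UNHEX('15'), UNHEX('000000000001e240'));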

> No, no errors in the logs.

Make sure you are at the debug level so we can see any queue locked errors; these are not included at the info level.

@NathanHbt
Author

Alright, thank you.
The issue is that the problem occurs very randomly. From yesterday at 10 PM until today at noon, there were no issues. Since 2 PM, when I cleared the locks, there hasn’t been a problem either. So I just hope I’ll be at the computer when the issue arises; otherwise, I’ll end up with a lot of emails in the queue and customer tickets :/

@NathanHbt
Author

I’ve deployed the latest version on one of the machines.
Would you like me to run it on all four machines with the logs set to ‘debug’ on each?

@mdecimus
Member

Yes, all machines with debug level logging please. Otherwise the issue will be harder to track down.

@NathanHbt
Author

OK, it's done. I'll keep you updated.

@NathanHbt
Author

I just sent you the log from one of the machines on Discord.
I notice that out of 4 machines, only one is working on the queue. The others seem stuck.

@NathanHbt
Author

The message queue is now empty, but on machine 3, I still see the following log looping:

It’s as if it hasn’t realized that there are no more messages in the queue, so it still thinks it’s at the limit of 100.
  • Node 1: sporadic queue processing, very little.
  • Node 2: no queue processing.
  • Node 3: the message is looping.
  • Node 4: normal queue processing.

WARN Queue backpressure detected (queue.back-pressure) reason = Queue outbound processing capacity for this node exceeded., total = 1, details = [219353324338561184, 219353342449565860, 219353061458457102, 219353210580644385, 219353056351883925, 219353313987015207, 219353140342829591, 219353358647964214, 219353179624583706, 219353274023686694, 219352672073953763, 219353092001378833, 219353353449128104, 219353077992403472, 219353313970241692, 219353317784961182, 219353200451400223, 219352533789848025, 219352795925463141, 219353355175080501, 219353126784741907, 219353327090024609, 219352921337246200, 219353190609465886, 219353317277446697, 219353011837743620, 219353346857779366, 219352499486732759, 219352969645142526, 219353210310115477, 219352559603691995, 219353183183454351, 219353213950767650, 219352839634301424, 219353350487945779, 219352728055329256, 219352964616172029, 219353139652870280, 219353205983690899, 219353236373516835, 219353178299175575, 219352986615296513, 219353324648935980, 219353158814544408, 219353132398817812, 219352645777764834, 219352994125197826, 219353065881350671, 219353183477051931, 219353058673439244, 219353037066482183, 219352932053692921, 219353094337606162, 219353207671894560, 219353183242166726, 219353320460927135, 219352605841699294, 219352932785590930, 219353350265651367, 219352590412952029, 219353059300487693, 219352835024761326, 219353136291131925, 219353002465571331, 219353162744607257, 219353253849084453, 219352884393816565, 219352525944402392, 219353179528118413, 219353137366970902, 219352907934347767, 219352796976132587, 219353322335777323, 219352570345304540, 219352639830241761, 219353347287687624, 219352880390353396, 219353051318727177, 219353040902173192, 219353328266523181, 219353185259631132, 219353361651085880, 219352678711439844, 219352613706019295, 219352756647899625, 219352821724623341, 219352687272014310, 219352905000432118, 219352982941086208, 219353187438572061, 219352493086224854, 219353346641769009, 219352876781154803, 219352718056108519, 219352942449275387, 219353015436456453, 219352806706917868, 219352631779275232, 219353054690947594, 219353321169760810, 219353245468865060, 219353335700930723, 219353343695270448, 219353058065265163, 219353336200049199, 219353359184835127, 219352977285067263, 219353348801835570, 219352787811578346, 219352933611872706, 219352933037260914, 219352933305692666, 219353031299314182, 219353317214524057, 219353334125965870, 219353354118115892, 219352946997998076, 219353058251915392, 219352847418929649, 219352682683445733, 219352552909582810, 219352837541343727, 219352850552074738, 219353135406137479, 219353316423905832], limit = 100
