
Leak of connections to Redis #279

Closed
simonbru opened this issue Oct 6, 2021 · 13 comments · Fixed by #296

Comments

@simonbru

simonbru commented Oct 6, 2021

We have an issue with daphne opening connections to Redis and (sometimes) never closing them. After a while, daphne cannot accept new connections and raises either OSError(24, 'Too many open files') or MaxClientsError('ERR max number of clients reached').

This leak sometimes occurs when a WebsocketConsumer closes the connection before accepting it (i.e. it rejects the connection).
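For illustration, a consumer of the kind described above might look like the following (a minimal sketch with a hypothetical class name, not the code of the reproduction project linked below):

from channels.generic.websocket import WebsocketConsumer

class DeniedConsumer(WebsocketConsumer):
    def connect(self):
        # Reject the handshake: close the connection without ever calling self.accept().
        self.close()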

I created a micro project to help reproduce this issue: https://github.com/simonbru/channels-redis-issue

I can reproduce the leak on channels-redis>=3.3.0 but not channels-redis==3.2.0. To be more specific, I can reproduce the leak starting from this commit: 3656d87

Environment

  • Debian 10
  • Python 3.7
  • Redis 6.2.1

Python dependencies

  • aioredis==1.3.1
  • channels==3.0.4
  • channels-redis==3.3.0 (issue also occurs on latest commit: 781228c)
  • daphne==3.0.2
  • django==2.2.24

Configuration

  • Channel backend: channels_redis.core.RedisChannelLayer
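For reference, the corresponding CHANNEL_LAYERS setting looks roughly like this (a minimal sketch; the host and port are placeholders, not taken from this report):

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("127.0.0.1", 6379)],
        },
    },
}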
@carltongibson
Member

Python 3.7

I suspect it's this. Try updating to PY38+ and please report back.

@simonbru
Author

simonbru commented Oct 8, 2021

I tried with Python 3.8 and 3.9.
The issue still occurs, but now there are more details in the logs, which may be helpful:

10.156.11.1:58086 - - [08/Oct/2021:03:05:48] "WSCONNECTING /ws/denied/" - -
10.156.11.1:58086 - - [08/Oct/2021:03:05:48] "WSREJECT /ws/denied/" - -
10.156.11.1:58086 - - [08/Oct/2021:03:05:48] "WSDISCONNECT /ws/denied/" - -
10.156.11.1:58088 - - [08/Oct/2021:03:05:48] "WSCONNECTING /ws/denied/" - -
2021-10-08 03:05:48,647 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4171' coro=<RedisConnection._read_data() done, defined at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6921520>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,648 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4179' coro=<RedisConnection._read_data() done, defined at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6921b50>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
10.156.11.1:58088 - - [08/Oct/2021:03:05:48] "WSREJECT /ws/denied/" - -
10.156.11.1:58088 - - [08/Oct/2021:03:05:48] "WSDISCONNECT /ws/denied/" - -
10.156.11.1:58090 - - [08/Oct/2021:03:05:48] "WSCONNECTING /ws/denied/" - -
10.156.11.1:58090 - - [08/Oct/2021:03:05:48] "WSREJECT /ws/denied/" - -
10.156.11.1:58090 - - [08/Oct/2021:03:05:48] "WSDISCONNECT /ws/denied/" - -
10.156.11.1:58092 - - [08/Oct/2021:03:05:48] "WSCONNECTING /ws/denied/" - -
10.156.11.1:58092 - - [08/Oct/2021:03:05:48] "WSREJECT /ws/denied/" - -
10.156.11.1:58092 - - [08/Oct/2021:03:05:48] "WSDISCONNECT /ws/denied/" - -
10.156.11.1:58094 - - [08/Oct/2021:03:05:48] "WSCONNECTING /ws/denied/" - -
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-3947' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca69d3100>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-3971' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca699f130>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-3995' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca69c0d00>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4019' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca694f9d0>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4043' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca69bb790>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4067' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6968190>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4091' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6974c40>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,671 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4115' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca697fd90>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,672 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4139' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca690d070>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,672 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4163' coro=<RedisConnection._read_data() running at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6919b20>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,672 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4187' coro=<RedisConnection._read_data() done, defined at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6926e80>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,672 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4195' coro=<RedisConnection._read_data() done, defined at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca692f9d0>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
2021-10-08 03:05:48,672 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-4203' coro=<RedisConnection._read_data() done, defined at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:180> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcca6934220>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /usr/local/lib/python3.9/site-packages/aioredis/connection.py:168]>
10.156.11.1:58094 - - [08/Oct/2021:03:05:48] "WSREJECT /ws/denied/" - -
10.156.11.1:58094 - - [08/Oct/2021:03:05:48] "WSDISCONNECT /ws/denied/" - -

@carltongibson
Member

OK, if you could work to isolate what's going on there that would help too.

@jureslak

jureslak commented Nov 18, 2021

I investigated this together with @jberci over the past few afternoons, because we were facing the same issue.

tl;dr: This is an issue with a bad implementation of ConnectionsPool in aioredis 1.3.x. The issue is fixed in 2.0.0, but channels_redis has not yet migrated (maybe this is a good reason to?)

The OP is correct that the issue was introduced in 3656d87. However, the leak does not occur only when closing the connection without accepting, but in general. It's a timing issue, and closing the connection quickly makes it more likely to appear (the call to accept() almost always takes enough time for no leaks to occur).

From the channels_redis perspective, the cause of the issue is the replacement of aioredis.create_redis calls with aioredis.create_redis_pool, but as mentioned, this is due to a leaky implementation of ConnectionsPool in aioredis.

After creating the connection, a _read_data task is scheduled from the RedisConnection initializer (https://github.com/aio-libs/aioredis-py/blob/8a207609b7f8a33e74c7c8130d97186e78cc0052/aioredis/connection.py#L162). The connection is leaked if the websocket request is cancelled before this task completes. Putting await asyncio.sleep(5) inside the _read_data method allows us to reproduce the issue reliably.
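For example, one way to approximate that delay without editing the installed package is to wrap the method (a repro-only sketch, assuming aioredis 1.3.1 internals; not part of any fix):

import asyncio
from aioredis.connection import RedisConnection

_orig_read_data = RedisConnection._read_data

async def _delayed_read_data(self):
    # Widen the window between connection creation and the first read,
    # which makes the cancellation race easy to hit.
    await asyncio.sleep(5)
    await _orig_read_data(self)

RedisConnection._read_data = _delayed_read_data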

There are two problems in the ConnectionsPool. The first is with the cleanup logic of the create_pool function (https://github.com/aio-libs/aioredis-py/blob/8a207609b7f8a33e74c7c8130d97186e78cc0052/aioredis/pool.py#L59), specifically:

try:
    await pool._fill_free(override_min=False)
except Exception:
    pool.close()
    await pool.wait_closed()
    raise

When the pending receive tasks are cancelled in await_many_dispatch from channels (https://github.com/django/channels/blob/7e7b71a405db1dbc7509f290b1b158bd07b74c1b/channels/utils.py#L56), an asyncio.CancelledError is raised. On Python 3.8+, CancelledError is not a subclass of Exception, only of BaseException, so the cleanup logic for the pool never runs.
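A tiny standalone demonstration of that point (Python 3.8+, where CancelledError derives from BaseException rather than Exception; the names here are made up and unrelated to aioredis):

import asyncio

async def create_resource():
    try:
        await asyncio.sleep(10)  # stands in for pool._fill_free(...)
    except Exception:
        print("cleanup")         # never reached on cancellation
        raise

async def main():
    task = asyncio.create_task(create_resource())
    await asyncio.sleep(0)       # let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        print("cancelled, but the cleanup branch did not run")

asyncio.run(main())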

However, even after fixing that and ensuring the close method of the ConnectionsPool class is called, the issue persists.
Investigating further, we see that the pool has no registered connections. Looking at the connection creation logic in _fill_free (https://github.com/aio-libs/aioredis-py/blob/8a207609b7f8a33e74c7c8130d97186e78cc0052/aioredis/pool.py#L386), we see the following code (comments omitted):

conn = await self._create_new_connection(self._address)
await conn.execute('ping')
self._pool.append(conn)

After obtaining the connection with _create_new_connection, conn.execute('ping') is immediately awaited. This introduces a race condition: if the task is cancelled before the ping finishes, i.e. "between" the two async calls, conn will never be added to the pool and thus also never cleaned up (for reproducibility, delaying _read_data, which reads the 'ping' result, for a long time lets us trigger this consistently). The fix is to do the bookkeeping synchronously after the connection is obtained, so that the connection is tracked by the pool before any new async task is scheduled. In the case above, that simply means swapping the last two lines of the snippet.

After these two fixes, the issue can no longer be reproduced.

Notably, aioredis 2.0.0 does not suffer from these issues: it uses BaseException (https://github.com/aio-libs/aioredis-py/blob/master/aioredis/connection.py#L1423) and performs the bookkeeping synchronously (https://github.com/aio-libs/aioredis-py/blob/master/aioredis/connection.py#L1406).

Sadly, the issue cannot easily be fixed in channels_redis itself. Two options are reverting back to create_redis or upgrading to aioredis 2.0.0. For any production issues, one can monkey-patch pool.py in aioredis 1.3.1 (as sketched below) by:

  • swapping lines 386 and 387, and
  • changing Exception on line 59 to BaseException.
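Purely for illustration, the two patched spots would then read roughly as follows (a sketch of the changes described above with the surrounding aioredis 1.3.1 code omitted, not a drop-in patch):

# aioredis/pool.py, create_pool(): catch BaseException so that cancellation
# also triggers the cleanup.
try:
    await pool._fill_free(override_min=False)
except BaseException:
    pool.close()
    await pool.wait_closed()
    raise

# aioredis/pool.py, ConnectionsPool._fill_free(): register the connection in
# the pool before awaiting the health-check ping.
conn = await self._create_new_connection(self._address)
self._pool.append(conn)
await conn.execute('ping')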

@carltongibson
Member

Hi @jureslak, thanks for the write-up there.

I think updating aioredis is the way to go, no?

Would you be up for making a PR there?

@testrevolution

Hi!

We have the same problem with our app. As a workaround, we have to restart daphne periodically. Any progress here?

@gymlt

gymlt commented Mar 16, 2022

Sadly, the issue cannot easily be fixed in channels_redis itself. Two options are reverting back to create_redis or upgrading to aioredis 2.0.0. For any production issues, one can monkey-patch pool.py in aioredis 1.3.1 by:

  • swapping lines 386 and 387, and
  • changing Exception on line 59 to BaseException.

Could you provide a working monkey-patching example for this? Thanks in advance.

@carltongibson linked a pull request Apr 6, 2022 that will close this issue
@carltongibson
Member

There's WIP going on now to address the update (see the pull request linked above).

@mikaraunio

mikaraunio commented Apr 23, 2022

Even with channels-redis==3.2.0, I'm seeing leakage under heavier loads with a bare-bones project, this time using AsyncConsumer and adding each consumer to a group on connect (accepting the connections, not rejecting them).
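A sketch of that kind of consumer (hypothetical names and group, not the actual test project's code):

from channels.consumer import AsyncConsumer

class GroupConsumer(AsyncConsumer):
    async def websocket_connect(self, event):
        # Join a shared group, then accept the connection.
        await self.channel_layer.group_add("broadcast", self.channel_name)
        await self.send({"type": "websocket.accept"})

    async def websocket_disconnect(self, event):
        await self.channel_layer.group_discard("broadcast", self.channel_name)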

This also occurs with 3.4.0 with @jureslak's patch to aioredis, and with @ipmb's redis-py branch.

"Heavier load" here means hundreds to thousands of new connections per second. With 10k total connections and an arrival rate of 2000/s on an 8-core Hetzner CX51 instance, I'm leaking up to 8k Redis connections.

As the cause is most likely unrelated, should I file a new issue?

Update: I checked that this only occurs with AsyncConsumer; using a sync consumer, there's absolutely no leakage in this scenario, even with unpatched 3.4.0.

@carltongibson
Member

Thanks @mikaraunio.

At first pass it seems related, no?

If you can help pin down the leak that would be handy.

Planning to circle back to this after Django 4.1a1 is out the door.

@mikaraunio

Yes @carltongibson, agreed: this could be related, and in any case I don't see the need for multiple Redis connection leak issues.

I have created a demo project with a single Daphne worker at:

https://github.com/mikaraunio/channels-redis-leak

When creating 500 client connections in one second, this leaks around 300 Redis connections on my test system.

If I increase the arrival rate and client count, I can get this to leak with a SyncConsumer as well, along with Task was destroyed but it is pending! errors. This could be the lone Daphne worker not being able to keep up and triggering @simonbru's original issue, though.

I have been unable to reproduce this with RedisPubSubChannelLayer, and am considering moving over to it. Nevertheless, I would love to try and find out what's causing the leak here when I have the bandwidth.

@alirezamastery

alirezamastery commented May 15, 2022

I encountered a similar memory leak with RedisChannelLayer on our production server:

Python 3.9.6
Redis server v=5.0.3

Dependencies:

  • channels==3.0.4
  • channels-redis==3.4.0
  • daphne==3.0.2
  • Django==3.2.6

WebsocketConsumer is used

Memory usage slowly goes up, and we have to restart the processes periodically to release it.

Top entries from tracemalloc:

/home/django/django_venv/lib/python3.9/site-packages/channels_redis/core.py:857: size=88.7 MiB, count=827789, average=112 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/backends/postgresql/operations.py:216: size=10.9 MiB, count=9000, average=1270 B
/home/django/django_venv/lib/python3.9/site-packages/rest_framework/serializers.py:515: size=2830 KiB, count=48099, average=60 B
<frozen importlib._bootstrap_external>:647: size=1690 KiB, count=18730, average=92 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/utils.py:97: size=1447 KiB, count=24883, average=60 B
/home/django/django_venv/lib/python3.9/site-packages/rest_framework/serializers.py:497: size=1314 KiB, count=9671, average=139 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py:511: size=1050 KiB, count=6398, average=168 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/models/base.py:418: size=1038 KiB, count=13454, average=79 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/models/sql/query.py:304: size=721 KiB, count=650, average=1136 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/backends/utils.py:119: size=564 KiB, count=9002, average=64 B
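For reference, a per-line report like the one above can be produced roughly as follows (a sketch, not necessarily the exact code used here):

import tracemalloc

tracemalloc.start()
# ... let the application handle traffic for a while ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)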

There are also a lot of these warnings:

WARNING  Application instance <Task pending name='Task-62612' coro=<ProtocolTypeRouter.__call__() running at /home/django/django_venv/lib/python3.9/site-packages/channels/routing.py:71> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.9/asyncio/futures.py:384, <TaskWakeupMethWrapper object at 0x7f58ed71dd90>()]>> for connection <WebSocketProtocol client=['127.0.0.1', 43526] path=b'/ws/'> took too long to shut down and was killed.

Everything added with channel_layer.group_add when accepting the connection is removed in the disconnect method with channel_layer.group_discard.

The tracemalloc result is also similar to #212, but since that issue was closed I didn't mention it there.

@alirezamastery

I changed the backend to RedisPubSubChannelLayer and let it run for a week without restarting:

/home/django/django_venv/lib/python3.9/site-packages/aioredis/stream.py:86: size=1173 MiB, count=59858, average=20.1 KiB
/home/django/django_venv/lib/python3.9/site-packages/channels_redis/pubsub.py:67: size=15.3 MiB, count=142424, average=112 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/backends/postgresql/operations.py:216: size=9266 KiB, count=9000, average=1054 B
<frozen importlib._bootstrap_external>:647: size=3433 KiB, count=37123, average=95 B
/usr/local/lib/python3.9/threading.py:250: size=1729 KiB, count=3361, average=527 B
/usr/local/lib/python3.9/queue.py:152: size=1315 KiB, count=2552, average=528 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py:511: size=1158 KiB, count=7061, average=168 B
/usr/local/lib/python3.9/linecache.py:137: size=880 KiB, count=8718, average=103 B
/home/django/django_venv/lib/python3.9/site-packages/django/db/utils.py:97: size=852 KiB, count=13209, average=66 B
/usr/local/lib/python3.9/threading.py:364: size=762 KiB, count=1479, average=527 B
