Timeouts when fetching RQ jobs due to jobs not being deleted from user registries
Sentry Issue: ARKINDEX-BACKEND-75
SystemExit: 1
(37 additional frame(s) were not displayed)
...
File "redis/connection.py", line 324, in read_response
raw = self._buffer.readline()
File "redis/connection.py", line 256, in readline
self._read_from_socket()
File "redis/connection.py", line 198, in _read_from_socket
data = recv(self._sock, socket_read_size)
File "redis/_compat.py", line 72, in recv
return sock.recv(*args, **kwargs)
File "gunicorn/workers/base.py", line 203, in handle_abort
sys.exit(1)
This issue has been occurring for quite a while now, and I was waiting for some extra logs to come in with the next release. But after looking at some aggregated stats in Sentry, I noticed that Mélodie was causing over 99% of the errors. I looked at https://arkindex.teklia.com/rq/ but there were no pending jobs, just a few failed jobs that I quickly deleted. I then checked in RQ itself whether something was amiss, and noticed that Mélodie's RQ user registry was very full even though the queue itself is empty:
>>> from django_rq.queues import get_queue
>>> queue = get_queue('default')
>>> queue.count
0
>>> registry = queue.user_registry(8)
>>> registry.count
270842
I did a dumb thing and removed all those non-existent job IDs by deleting the registry key entirely:
>>> queue.connection.delete(registry.key)
This seems to have stopped the events coming into Sentry.
I did not check much further, but it seems that RQ job IDs are never removed from the user registry, possibly because of the automatic expiration set in Redis: when a job key expires, no Python code runs, so stale IDs are only cleaned up when /api/v1/jobs/ is called. If you start enough jobs without ever calling /api/v1/jobs/ (for example from a shell script), you can fill the user registry to the brim. I have no idea how to fix this properly, since I can't change the way Redis expiration works. There is some automated cleaning code in RQ, some of it written in Lua, but even that would be slow with hundreds of thousands of stale jobs!
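One option might be a periodic cleanup task that drops registry entries whose job hash no longer exists in Redis. Here is a minimal sketch, assuming the user registry is a Redis sorted set of job IDs stored under registry.key, like RQ's built-in registries (user_registry is Arkindex-specific, so the exact storage may differ, and the user ID used here is just an example):
>>> from django_rq.queues import get_queue
>>> from rq.job import Job
>>>
>>> queue = get_queue('default')
>>> registry = queue.user_registry(8)
>>> connection = queue.connection
>>>
>>> # Walk the whole registry; with hundreds of thousands of entries this
>>> # should probably be chunked and run from a periodic task, not a request.
>>> stale = []
>>> for job_id in connection.zrange(registry.key, 0, -1):
...     job_id = job_id.decode()
...     # Job.key_for() gives the 'rq:job:<id>' hash key; if Redis already
...     # expired it, the registry entry is stale and can be dropped.
...     if not connection.exists(Job.key_for(job_id)):
...         stale.append(job_id)
...
>>> if stale:
...     connection.zrem(registry.key, *stale)
This still scans the whole registry, so it would not be cheap at 270k entries, but at least it would keep the registry from growing without bound.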
I have to point out that none of these issues would occur if we were just using Postgres as the queue; Redis is completely unnecessary for our use case.