To start things off, let me introduce myself.
Hi! I'm Vincent, the Lead Developer and Creator of Derailed. I've been around since the start and know all of Derailed's history. I've been working on Derailed for a full year now, and on this article itself for the past couple of weeks.
Because Derailed is still an ever-changing project, a lot has changed since I started writing. But I've made sure to keep this blog up to date and to make sure everything here is true to life.
To give a small timeline, I started writing this up in a notes app before the Auth Service was rewritten to use gRPC. Yeah, that long... So let's get started!
Derailed, being a chat platform, has always needed some sort of Gateway for users to connect to and receive real-time events, like message creation, in order.
This was a big part of the reason for adopting Erlang/OTP and the rest of our tooling: we didn't want to maintain and operate a message queue. That was partly because of the cost of upkeep and scaling, but also because of the higher cost of bandwidth: with a broker in the middle there would be three tunnels, two of which would take far more bandwidth by carrying every single event.
We only recently moved away from exclusively using Python, so the first attempts were written in Python and were really feeble.
The original Gateways used message queues like Redis Pub/Sub and Kafka. While both are robust tools, they really weren't made for this job: they are two completely different tools, and both carry a key-value (KV) aspect we didn't need.
But this was how it was for Derailed all the way up to December.
With our options exhausted, and after reading way too many articles and blog posts, I started looking at Elixir.
I had first started theorizing about an Elixir-based Gateway in late April, but it never got anywhere because of my lack of programming knowledge... until December.
In December, what we'll call Derailed's "Elixir Fest" happened. I took the time to sketch out a real Elixir v0 Gateway: the Pre-Pre-Alpha.
This Gateway was built as a single, non-separable piece to make climbing the learning curve easier.
But it still consisted of multiple modules: Guilds, Session Registry, gRPC, and the WebSocket Gateway.
It was extremely simple and not built to scale at all; it was more an overview of what Elixir could do for us and whether it'd be a better choice.
In early 2023, I took up the task of doing what we've never done before: making a production-ready Elixir Gateway.
Firstly, I started with the design. This was meant to be scalable, so we decided upon a design centered around a consistent hash ring, modularity, and the ability to deploy everything both separately and together.
The consistent hash ring design is simple: take a list of nodes from a source, lay them out on a ring, and hash each key onto the same ring to find the node responsible for it.
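To make that a bit more concrete, here's a tiny, hypothetical sketch of the lookup side in Elixir. This isn't our actual Gateway code (the real ring has more smarts, like virtual nodes), and the module and function names are made up, but the idea is the same:

```elixir
defmodule Derailed.RingSketch do
  # Hypothetical stand-in for a real consistent hash ring, for illustration only.
  # Every caller that hashes the same key over the same node list gets the same node.
  def node_for(key, nodes) when nodes != [] do
    # Place each node on the ring by hashing it.
    ring =
      nodes
      |> Enum.map(fn node -> {:erlang.phash2(node), node} end)
      |> Enum.sort()

    # Hash the key onto the same ring.
    point = :erlang.phash2(key)

    # Walk clockwise: the first node at or past the key's position owns it,
    # wrapping back to the start of the ring if we fall off the end.
    case Enum.find(ring, fn {position, _node} -> position >= point end) do
      {_position, node} -> node
      nil -> ring |> hd() |> elem(1)
    end
  end
end

# e.g. Derailed.RingSketch.node_for(guild_id, [Node.self() | Node.list()])
```

Given the same Guild ID and the same node list, the same node always comes back, which is what lets any part of the cluster agree on where a Guild process should live.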
The other two come from using an Elixir Umbrella project, which lets each app be developed and deployed either separately or together.
With that together, the project started. The main goals were the three design points above plus two others: migrating to our v2 auth service, and moving any database actions to PostgreSQL.
When a user connects to the Gateway, its handler creates a Session (a GenServer) on a remote node. The Session then connects to remote Guild processes (also GenServers), which dispatch any new message to the Session and then on to the user if necessary. After that, the Session connects to the remote Presence processes for each of its Guilds and publishes a presence.
This is the basic design of the Gateway, although with much more validation and other work happening in practice. A user can also request members, which runs a concurrent function that fetches all of a Guild's members from the database.
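As a rough illustration of that flow, here's a heavily simplified, hypothetical Session process. The module names, message shapes, and the registry helper are illustrative only, not our real code:

```elixir
defmodule Derailed.SessionSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(%{user_id: user_id, ws_pid: ws_pid, guild_ids: guild_ids}) do
    # Subscribe this Session to every Guild process the user is a member of.
    # Derailed.GuildRegistrySketch is a made-up helper that finds or starts
    # the Guild process on whichever node the hash ring points at.
    Enum.each(guild_ids, fn guild_id ->
      guild_pid = Derailed.GuildRegistrySketch.get_or_start(guild_id)
      GenServer.cast(guild_pid, {:subscribe, self()})
    end)

    {:ok, %{user_id: user_id, ws_pid: ws_pid}}
  end

  @impl true
  def handle_info({:dispatch, event_type, payload}, state) do
    # Guild processes push events here; relay them to the WebSocket handler,
    # which encodes and sends them down to the user.
    send(state.ws_pid, {:send_event, event_type, payload})
    {:noreply, state}
  end
end
```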
And if you're asking: well, how did we solve the message queue problem?
Well, we just used Elixir's own solution, since it allows ultra-fast batch and single sending of messages to processes. Although, instead of the traditional send/2, we use an alternative made by Discord: Manifold.send/2.
With this, we can send messages between processes not in milliseconds like with a message broker, but in microseconds.
All in all, sending a message from a Guild is easy:
%[https://gist.github.com/VincentRPS/4d0564b601b6158bb8ab1bf60c54185e]
This is heavily oversimplified to keep the article short, but it's the basic gist of how it's done.
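To give a rough idea of what that fan-out can look like (my own simplified sketch, not the gist above), a Guild process holding its subscribed Session pids could publish an event with a single Manifold call:

```elixir
defmodule Derailed.GuildSketch do
  use GenServer

  def start_link(guild_id), do: GenServer.start_link(__MODULE__, guild_id)

  @impl true
  def init(guild_id), do: {:ok, %{guild_id: guild_id, sessions: MapSet.new()}}

  @impl true
  def handle_cast({:subscribe, session_pid}, state) do
    # A Session registers itself to receive this Guild's events.
    {:noreply, %{state | sessions: MapSet.put(state.sessions, session_pid)}}
  end

  @impl true
  def handle_call({:publish, event_type, payload}, _from, state) do
    # One Manifold call fans the event out to every subscribed Session,
    # whichever nodes they happen to live on.
    Manifold.send(MapSet.to_list(state.sessions), {:dispatch, event_type, payload})
    {:reply, :ok, state}
  end
end
```

The nice part is that Manifold batches deliveries per remote node, so the Guild process does roughly the same amount of work whether its Sessions live on one node or on ten.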
To transfer event data super-fast from our Pythonic Monolith to the Elixir Splitbase, we use a system called the tunnel, powered by gRPC.
We send a raw JSON string to the Gateway so it can decode the message and send it to clients more easily, and also bypass protobuf's typing.
On both sides, these transactions are highly typed and also highly optimized to increase speed and productivity.
On the Python API, we use separate concurrent tasks to publish events so we aren't forced to block on long waits just to send an event. On the Gateway side, it's the opposite: since publishing takes only microseconds, we block to keep a strict in-order policy.
This gives us the benefit of never having something like a CHANNEL_DELETE event land before a CHANNEL_EDIT for the same channel, which could be catastrophic for caching systems that rely on knowing reliably whether an object exists.
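To sketch how the Gateway's end of the tunnel can look, here's a hypothetical handler assuming a Tunnel service generated from a proto by protoc-gen-elixir, with made-up field names (guild_id, event_type, data) and a made-up ack struct. It decodes the raw JSON once and dispatches synchronously, which is what keeps the strict in-order policy:

```elixir
defmodule Derailed.TunnelSketch.Server do
  # Assumes a generated Derailed.TunnelSketch.Service module with a Publish RPC;
  # both are hypothetical names used for illustration.
  use GRPC.Server, service: Derailed.TunnelSketch.Service

  def publish(request, _stream) do
    # The API sends the event body as a raw JSON string, so we decode it here
    # instead of modelling every event shape in protobuf.
    payload = Jason.decode!(request.data)

    # Made-up helper: find or start the Guild process wherever the ring says it lives.
    guild_pid = Derailed.GuildRegistrySketch.get_or_start(request.guild_id)

    # A synchronous call means the next event for this Guild isn't handled
    # until this one has been dispatched, so ordering is preserved.
    GenServer.call(guild_pid, {:publish, request.event_type, payload})

    %Derailed.TunnelSketch.PublishAck{ok: true}
  end
end
```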
To keep things fast and centralized, in mid-December we developed the first iteration of the Derailed auth service. This first version used actix (a Rust HTTP framework), but we quickly found it to be a bottleneck.
Seeing as we had just implemented the Gateway's way of receiving events from the API over gRPC, we thought gRPC might have better potential in this field, as it could serve requests in microseconds, which is precisely what we needed. So on January 12th, we committed Auth v2, a gRPC rewrite of the old auth service. We kept most of our previous Rust-based code and reorganized it for gRPC, and more specifically for Tonic (the Rust crate we were using for the gRPC server).
The move was swift, as everything was committed at once to all repositories... except for Gateway v0. As it was a pre-pre-alpha with v1 on the horizon, I decided against putting effort into moving Gateway v0 to Auth v2, and to just do it when v1 needs it.
The two goals during this rewrite were to:
- Simplify and enhance our code from v0
- Improve scalability over the v0 base
And from the top, you can see these two constantly sprouting up. Our old code was quite messy and unorganized compared to v1, and it was also spectacularly unscalable. As v0 was a pre-pre-alpha and more of a concept, we didn't care about this much. But since v1 is meant to be production-ready, these two became a problem, so they became the main goals of v1.
And that's it for now. Not all endings end with an... ending? Well, y'know what I mean: it's time to keep everything moving.
That's it, thanks for reading through this blog (which feels more like a rant lol) and have a good day. Goodbye!