Performance and scalability
This page aims to give some background to help prioritize work on the Zulip’s server’s performance and scalability. By scalability, we mean the ability of the Zulip server on given hardware to handle a certain workload of usage without performance materially degrading.
First, a few notes on philosophy.
We consider it an important technical goal for Zulip to be fast, because that’s an important part of user experience for a real-time collaboration tool like Zulip. Many UI features in the Zulip web app are designed to load instantly, because all the data required for them is present in the initial HTTP response, and both the Zulip API and web app are architected around that strategy.
The Zulip database model and server implementation are carefully designed to ensure that every common operation is efficient, with automated tests designed to prevent the accidental introductions of inefficient or excessive database queries. We much prefer doing design/implementation work to make requests fast over the operational work of running 2-5x as much hardware to handle the same load.
See also scalability for production users.
When thinking about scalability and performance, it’s important to understand the load profiles for production uses.
Zulip servers typically involve a mixture of two very different types of load profiles:
Open communities like open source projects, online classes, etc. have large numbers of users, many of whom are idle. (Many of the others likely stopped by to ask a question, got it answered, and then didn’t need the community again for the next year). Our own chat.zulip.org is a good example for this, with more than 15K total user accounts, of which only several hundred have logged in during the last few weeks. Zulip has many important optimizations, including soft deactivation to ensure idle users have minimal impact on both server-side scalability and request latency.
Fulltime teams, like your typical corporate Zulip installation, have users who are mostly active for multiple hours a day and sending a high volume of messages each. This load profile is most important for self-hosted servers, since many of those are used exclusively by the employees of the organization running the server.
The zulip.com load profile is effectively the sum of thousands of organizations from each of those two load profiles.
Major Zulip endpoints
It’s important to understand that Zulip has a handful of endpoints that result in the vast majority of all server load, and essentially every other endpoint is not important for scalability. We still put effort into making sure those other endpoints are fast for latency reasons, but were they to be 10x faster (a huge optimization!), it wouldn’t materially improve Zulip’s scalability.
For that reason, we organize this discussion of Zulip’s scalability around the several specific endpoints that have a combination of request volume and cost that makes them important.
That said, it is important to distinguish the load associated with an
API endpoint from the load associated with a feature. Almost any
significant new feature is likely to result in its data being sent to
the client in
GET /messages, i.e. one of the
endpoints important to scalability here. As a result, it is important
to thoughtfully implement the data fetch code path for every feature.
Furthermore, a snappy user interface is one of Zulip’s design goals, and so we care about the performance of any user-facing code path, even though many of them are not material to scalability of the server. But only with regard to the requests detailed below, is it worth considering optimizations which save a few milliseconds that would be invisible to the end user, if they carry any cost in code readability.
In Zulip’s documentation, our general rule is to primarily write facts that are likely to remain true for a long time. While the numbers presented here vary with hardware, usage patterns, and time (there’s substantial oscillation within a 24 hour period), we expect the rough sense of them (as well as the list of important endpoints) is not likely to vary dramatically over time.
The “Average impact” above is computed by multiplying request volume by average time; this tells you roughly that endpoint’s relative contribution to the steady-state total CPU load of the system. It’s not precise – waiting for a network request is counted the same as active CPU time, but it’s extremely useful for providing intuition for what code paths are most important to optimize, especially since network wait is in practice largely waiting for PostgreSQL or memcached to do work.
As one can see, there are two categories of endpoints that are
important for scalability: those with extremely high request volumes,
and those with moderately high request volumes that are also
expensive. It doesn’t matter how expensive, for example,
POST /users/me/subscriptions is for scalability, because the volume is
Zulip’s Tornado-based real-time push
system, and in particular
GET /events, accounts for something like 50% of all HTTP requests to a
production Zulip server. Despite
GET /events being extremely
high-volume, the typical request takes 1-3ms to process, and doesn’t
use the database at all (though it will access
redis), so they aren’t a huge contributor to the overall CPU usage
of the server.
Because these requests are so efficient from a total CPU usage perspective, Tornado is significantly less important than other services like Presence and fetching message history for overall CPU usage of a Zulip installation.
It’s worth noting that most (~80%) Tornado requests end the
longpolling via a
heartbeat event, which are issued to idle
connections after about a minute. These
heartbeat events are
useless aside from avoiding problems with networks/proxies/NATs that
are configured poorly and might kill HTTP connections that have been
idle for a minute. It’s likely that with some strategy for detecting
such situations, we could reduce their volume (and thus overall
Tornado load) dramatically.
Currently, Tornado is sharded by realm, which is sufficient for
arbitrary scaling of the number of organizations on a multi-tenant
system like zulip.com. With a somewhat straightforward set of work,
one could change this to sharding by
user_id instead, which will
eventually be important for individual large organizations with many
thousands of concurrent users.
POST /users/me/presence requests, which submit the current user’s
presence information and return the information for all other active
users in the organization, account for about 36% of all HTTP requests
on production Zulip servers. See
presence for details on this system and
how it’s optimized. For this article, it’s important to know that
presence is one of the most important scalability concerns for any
chat system, because it cannot be cached long, and is structurally a
Because typical presence requests consume 10-50ms of server-side processing time (to fetch and send back live data on all other active users in the organization), and are such a high volume, presence is the single most important source of steady-state load for a Zulip server. This is true for most other chat server implementations as well.
There is an ongoing effort to rewrite the data model for presence that we expect to result in a substantial improvement in the per-request and thus total load resulting from presence requests.
The request to generate the
page_params portion of
(equivalent to the response from GET
/api/v1/register used by
mobile/terminal apps) is one of Zulip’s most complex and expensive.
Zulip is somewhat unusual among web apps in sending essentially all of the data required for the entire Zulip web app in this single request, which is part of why the Zulip web app loads very quickly – one only needs a single round trip aside from cacheable assets (avatars, images, JS, CSS). Data on other users in the organization, streams, supported emoji, custom profile fields, etc., is all included. The nice thing about this model is that essentially every UI element in the Zulip client can be rendered immediately without paying latency to the server; this is critical to Zulip feeling performant even for users who have a lot of latency to the server.
There are only a few exceptions where we fetch data in a separate AJAX request after page load:
Message history is managed separately; this is why the Zulip web app will first render the entire site except for the middle panel, and then a moment later render the middle panel (showing the message history).
A few very rarely accessed data sets like message edit history are only fetched on demand.
A few data sets that are only required for administrative settings pages are fetched only when loading those parts of the UI.
GET / and
/api/v1/register that fetch
are pretty rare – something like 0.3% of total requests, but are
important for scalability because (1) they are the most expensive read
requests the Zulip API supports and (2) they can come in a thundering
herd around server restarts (as discussed in fetching message
The cost for fetching
page_params varies dramatically based
primarily on the organization’s size, varying from 90ms-300ms for a
typical organization but potentially multiple seconds for large open
organizations with 10,000s of users. There is also smaller
variability based on a individual user’s personal data state,
primarily in that having 10,000s of unread messages results in a
somewhat expensive query to find which streams/topics those are in.
We consider any organization having normal
page_params fetch times
greater than a second to be a bug, and there is ongoing work to fix that.
It can help when thinking about this to imagine
page_params as what
in another web app would have been 25 or so HTTP GET requests, each
fetching data of a given type (users, streams, custom emoji, etc.); in
Zulip, we just do all of those in a single API request. In the
future, we will likely move to a design that does much of the database
fetching work for different features in parallel to improve latency.
For organizations with 10K+ users and many default streams, the
majority of time spent constructing
page_params is spent marshalling
data on which users are subscribed to which streams, which is an area
of active optimization work.
Fetching message history
Bulk requests for message content and metadata (
GET /messages) account for ~3% of
total HTTP requests. The zulip web app has a few major reasons it does
a large number of these requests:
Most of these requests are from users clicking into different views – to avoid certain subtle bugs, Zulip’s web app currently fetches content from the server even when it has the history for the relevant stream/topic cached locally.
When a browser opens the Zulip web app, it will eventually fetch and cache in the browser all messages newer than the oldest unread message in a non-muted context. This can be in total extremely expensive for users with 10,000s of unread messages, resulting in a single browser doing 100 of these requests.
When a new version of the Zulip server is deployed, every browser will reload within 30 minutes to ensure they are running the latest code. For installations that deploy often like chat.zulip.org and zulip.com, this can result in a thundering herd effect for both
GET /messages. A great deal of care has been taking in designing this auto-reload system to spread most of that herd over several minutes.
Typical requests consume 20-100ms to process, much of which is waiting to fetch message IDs from the database and then their content from memcached. While not large in an absolute sense, these requests are expensive relative to most other Zulip endpoints.
Some requests, like full-text search for commonly used words, can be more expensive, but they are sufficiently rare in an absolute sense so as to be immaterial to the overall scalability of the system.
This server-side code path is already heavily optimized on a per-request basis. However, we have technical designs for optimizing the overall frequency with which clients need to make these requests in two major ways:
Improving client-side caching to allow caching of narrows that the user has viewed in the current session, avoiding repeat fetches of message content during a given session.
Adjusting the behavior for clients with 10,000s of unread messages to not fetch as much old message history into the cache. See this issue for relevant design work.
Together, it is likely that these changes will reduce the total scalability cost of fetching message history dramatically.
Requests to fetch uploaded files (including user avatars) account for
about 5% of total HTTP requests. Zulip spends consistently ~10-15ms
processing one of these requests (mostly authorization logic), before
handing off delivery of the file to
nginx or S3 (depending on the
configured upload backend).
We can significantly reduce the volume of these requests in large production Zulip servers by fixing Zulip’s upload authorization strategy preventing browser caching of uploads, since many of them result from a single client downloading the same file repeatedly as it navigates around the app.
Sending and editing messages
Sending new messages (including
incoming webhooks) represents less than 0.5% of total request volume.
That this number is small should not be surprising even though sending
messages is intuitively the main feature of a chat service: a message
sent to 50 users triggers ~50
GET /events requests.
A typical message-send request takes 20-70ms, with more expensive requests typically resulting from Markdown rendering of more complex syntax. As a result, these requests are not material to Zulip’s scalability. Editing messages and adding emoji reactions are very similar to sending them for the purposes of performance and scalability, since the same clients need to be notified, and these requests are lower in volume.
That said, we consider the performance of these endpoints to be some of the most important for Zulip’s user experience, since even with local echo, these are some of the places where any request processing latency is highly user-visible.
Typing notifications are slightly higher volume than sending messages, but are also extremely cheap (~3ms).
Other API actions, like subscribing to a stream, editing settings, registering an account, etc., are vanishingly rare compared to the requests detailed above, fundamentally because almost nobody changes these things more than a few dozen times over the lifetime of their account, whereas everything above are things that a given user might do thousands of times.
As a result, performance work on those requests is generally only important for latency reasons, not for optimizing the overall scalability of a Zulip server.
Queue processors and cron jobs
The above doesn’t cover all of the work that a production Zulip server
does; various tasks like sending outgoing emails or recording the data
that powers /stats are run by
queue processors and cron jobs, not in
response to incoming HTTP requests. In practice, all of these have
been written such that they are immaterial to total load and thus
architectural scalability, though we do from time to time need to do
operational work to add additional queue processors for particularly
high-traffic queues. For all of our queue processors, any
serialization requirements are at most per-user, and thus it would be
straightforward to shard by
realm_id if required.
In addition to the above, which just focuses on the total amount of CPU work, it’s also relevant to think about load on infrastructure services (memcached, redis, rabbitmq, and most importantly postgres), as well as queue processors (which might get backlogged).
In practice, efforts to make an individual endpoint faster will very likely reduce the load on these services as well. But it is worth considering that database time is a more precious resource than Python/CPU time (being harder to scale horizontally).
Most optimizations to make an endpoint cheaper will start with optimizing the database queries and/or employing caching, and then continue as needed with profiling of the Python code and any memcached queries.
For a handful of the critical code paths listed above, we further optimize by skipping the Django ORM (which has substantial overhead) for narrow sections; typically this is sufficient to result in the database query time dominating that spent by the Python application server process.
Zulip’s server logs are designed to provide insight when a request consumes significant database or memcached resources, which is useful both in development and in production.