Sachith
Sachith.Engineer
22 May 2024

How WhatsApp Actually Works: A Gritty Deep Dive into Real-Time State at Scale

W.P.S Lakshitha

Author


Let’s talk about the beast that is WhatsApp. Most people think it’s just a simple texting app, but under the hood? It’s a masterclass in distributed systems that would make any sane engineer weep. We’re talking 100 billion messages a day. No massive dashboards. No bloat. Just pure Erlang and FreeBSD holding the world together.

To understand why you’re being "ignored" by a contact, you have to look at the intersection of persistent state, binary-optimized protocols, and receipts that only fire after on-device decryption. It's not just a UI trick. It's a high-stakes game of latency and state tracking.

The ELI5 (And Why Your "Pipe" Matters)

The whole system relies on a "perpetual pipe."

In the boring world of web browsing, your device asks for data and then hangs up. WhatsApp doesn't do that. It keeps a persistent TCP or WebSocket connection open as long as the app is even slightly awake. This "pipe" is a two-way street for tiny, invisible signals we call acknowledgments (ACKs).

When you hit send, that message flies through the pipe to a server.

  • The server shouts "got it!" and sends back the first gray tick.
  • Then, the server finds the recipient's pipe and shoves the message down.
  • If their phone catches it, it sends a signal back. Second gray tick.
  • But the blue ticks? That’s different. That happens only when the UI layer detects the message has actually hit the viewport.
  • At that exact microsecond, the device fires a read_receipt packet.

Message State | Visual Indicator | The "Trigger" | What’s Actually Happening
--- | --- | --- | ---
Sent | Single Gray Tick | Client-to-Server Handoff | Server sends an ACK to the sender's Erlang process.
Delivered | Double Gray Tick | Device-Level Receipt | Recipient’s phone sends a delivery ACK after decryption.
Read | Double Blue Tick | Viewport Exposure | The app triggers a read_receipt packet back to the mothership.
Pending | Single Gray Tick | Recipient Offline | Message sits in the "offline queue" inside Mnesia.

The "ignored" feeling? It’s just the delta between the "Delivered" and "Read" states. Because the connection is persistent, the server knows exactly when a phone is reachable. If "Background App Refresh" is on, the phone ACKs the message silently. You get the double gray ticks, but they haven't touched their phone yet.
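To make the tick logic concrete, here is a minimal sketch of the sender-side state machine. The ACK names and the `UNSENT` starting state are hypothetical labels picked for illustration, not WhatsApp's actual identifiers.

```python
from enum import IntEnum


class MessageState(IntEnum):
    UNSENT = 0     # hypothetical: still on the sender's device, no ACK yet
    SENT = 1       # server ACK received: single gray tick
    DELIVERED = 2  # recipient device ACK received: double gray tick
    READ = 3       # read_receipt received: double blue tick


# Hypothetical ACK names mapped to the state each one unlocks.
ACK_TO_STATE = {
    "server_ack": MessageState.SENT,
    "delivery_ack": MessageState.DELIVERED,
    "read_receipt": MessageState.READ,
}


def apply_ack(current: MessageState, ack: str) -> MessageState:
    """Advance the state; late or duplicate ACKs never move it backwards."""
    return max(current, ACK_TO_STATE[ack])
```

Taking the maximum means a delivery ACK that shows up after a read receipt cannot turn blue ticks gray again.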

The Deep Tech: Erlang and the Actor Model

Why Erlang? Because it was built by Ericsson for phone switches that cannot fail. It runs on the BEAM Virtual Machine, and it’s basically magic for concurrency.

The Actor Model is the secret sauce. In this world, every single user connection is a "process." But don't confuse these with heavy OS threads; those would crash your server in minutes. These are lightweight Erlang processes. They use maybe 2KB of RAM each, so 2 million of them cost roughly 4GB in process overhead. This efficiency is how a single physical box can handle over 2 million concurrent connections without breaking a sweat.

Processes don't share memory. They talk via asynchronous message passing. When User A messages User B:

  1. User A’s process does a quick lookup in a distributed hash table.
  2. It finds User B’s process.
  3. If they’re on the same cluster, the message hops directly between processes in-memory.
  4. No middleman. No slow external queues. Just raw speed.
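The routing steps above can be imitated in a few lines. This is a rough Python analogue using asyncio tasks and queues, not Erlang, and the `registry` dict, `UserProcess` class, and `None` stop signal are all assumptions made for the sketch.

```python
import asyncio
from typing import Dict, List, Optional, Tuple


class UserProcess:
    """Toy stand-in for one lightweight per-connection process."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self.mailbox: asyncio.Queue = asyncio.Queue()  # private, never shared
        self.received: List[Tuple[str, str]] = []

    async def run(self) -> None:
        # Handle one message at a time; only this task touches the state.
        while True:
            sender, text = await self.mailbox.get()
            if text is None:  # hypothetical stop signal
                break
            self.received.append((sender, text))


# Hypothetical routing table: user id -> process. A plain dict stands in
# for the distributed lookup described in the article.
registry: Dict[str, UserProcess] = {}


async def send(sender: str, recipient: str, text: Optional[str]) -> None:
    # Find the recipient's process and drop the message into its mailbox.
    await registry[recipient].mailbox.put((sender, text))


async def demo() -> List[Tuple[str, str]]:
    registry.clear()
    for uid in ("alice", "bob"):
        registry[uid] = UserProcess(uid)
    bob_task = asyncio.create_task(registry["bob"].run())
    await send("alice", "bob", "hello")
    await send("alice", "bob", None)  # ask bob's process to stop
    await bob_task
    return registry["bob"].received
```

Running `asyncio.run(demo())` delivers one message from alice to bob without the two ever sharing a variable, which is the whole point of the model.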

FunXMPP: Because XML is Fat

Standard XMPP is great, but it’s verbose. Using it on a shaky 2G connection in a rural area is a recipe for failure. So, the engineers "slimmed" it down into a proprietary version called FunXMPP.

They used binary tokenization. Instead of sending the literal string "message", they send a single byte: 0x59. It’s clever. It’s fast. And it saves a massive amount of overhead.

XMPP String | FunXMPP Token | Savings
--- | --- | ---
message | 0x59 | ~87%
s.whatsapp.net | 0x91 | ~94%
type | 0xa7 | ~75%
body | 0x12 | ~75%

The protocol treats XML as a set of lists (starting with a byte like \xf8). This lets the device parser pre-allocate memory. It doesn't have to guess. This saves battery life—which, let’s be honest, is the only thing users actually care about.
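Here is a small sketch of the idea, assuming the token values from the table above and a made-up `RAW_8` marker for strings that have no token. Treat the byte values as illustrative rather than as the real FunXMPP dictionary.

```python
from typing import List

# Token values copied from the article's table; illustrative only.
TOKENS = {"message": 0x59, "s.whatsapp.net": 0x91, "type": 0xA7, "body": 0x12}
REVERSE = {value: name for name, value in TOKENS.items()}

LIST_8 = 0xF8  # list header, followed by a one-byte item count
RAW_8 = 0xFC   # hypothetical marker: raw string with a one-byte length


def encode(items: List[str]) -> bytes:
    if len(items) > 255:
        raise ValueError("too many items for a one-byte count")
    out = bytearray([LIST_8, len(items)])
    for item in items:
        if item in TOKENS:
            out.append(TOKENS[item])  # known string: one byte
        else:
            raw = item.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("string too long for a one-byte length")
            out += bytes([RAW_8, len(raw)]) + raw
    return bytes(out)


def decode(data: bytes) -> List[str]:
    if data[0] != LIST_8:
        raise ValueError("expected a list header")
    count, pos = data[1], 2
    items: List[str] = []
    for _ in range(count):  # the count is known before any item is read
        tag = data[pos]
        pos += 1
        if tag == RAW_8:
            length = data[pos]
            pos += 1
            items.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        else:
            items.append(REVERSE[tag])
    return items
```

Because the item count sits right after the header byte, a parser knows how much to allocate before it reads the first item.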

Cryptographic Triggers (The E2EE Headache)

End-to-End Encryption (E2EE) makes status tracking a nightmare. The server is a "blind router." It can't see what's inside the encrypted blobs. It just moves them.

They use the Signal Protocol with a Double Ratchet Algorithm. Every message has a unique key. If one key gets leaked, the rest of your history stays safe. But for a "Read" status to work, a specific dance has to happen:

  1. The recipient's phone derives the key and decrypts the payload.
  2. The UI confirms it's visible.
  3. The device sends the read_receipt ACK back over its already-established encrypted session; no fresh handshake is needed.
  4. The server routes this back, and the sender's phone decrypts the status update.
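The "every message has a unique key" property comes from a key chain that only moves forward. Below is a minimal sketch of that symmetric chain step, assuming HMAC-SHA256 with the one-byte constants suggested in the public Double Ratchet specification. It leaves out the Diffie-Hellman half of the ratchet entirely and is not meant for real use.

```python
import hashlib
import hmac
from typing import List, Tuple


def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (message_key, next_chain_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


def derive_message_keys(initial_chain_key: bytes, count: int) -> List[bytes]:
    """Walk the chain forward, collecting one key per message."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        message_key, chain_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys
```

Since HMAC cannot be run backwards, a leaked message key does not give up the chain key that produced it, so the other message keys in the chain stay out of reach.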

And now the Signal Protocol is picking up PQXDH (Post-Quantum Extended Diffie-Hellman). Why? To make sure your "ignored" status is safe from future quantum computers. Talk about over-engineering (in a good way).

The Tech Stack (What Actually Matters)

They don't follow trends. They follow performance.

Layer | Technology | Why they use it
--- | --- | ---
OS | FreeBSD | The networking stack is just better at handling millions of tiny packets.
Runtime | Erlang/OTP | Massive concurrency and "hot code swapping" (updating without restarts).
App Server | ejabberd (Modded) | They took a standard XMPP server and gutted it for scale.
Web Server | Yaws | Handles the heavy lifting for media and WebSockets.
Security | Rust | They’re slowly swapping C++ for Rust to stop memory bugs in media parsing.

Mnesia vs. Cassandra vs. Redis

It’s all about the right tool. Mnesia (Erlang-native) handles the real-time routing. Cassandra handles the "offline queue" because it’s a beast at writes. Redis? That’s for ephemeral stuff, like the "Typing..." indicator. If you stop typing, the TTL (Time-To-Live) expires, and the status vanishes. Simple.
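The TTL trick is easy to picture with a toy in-memory store. This sketch does not talk to Redis; the `TypingIndicators` class, its 5-second default, and the injectable `clock` are assumptions chosen so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, Set, Tuple


class TypingIndicators:
    """In-memory stand-in for keys that expire on their own."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at: Dict[Tuple[str, str], float] = {}

    def typing(self, chat_id: str, user_id: str) -> None:
        # Every keystroke event refreshes the expiry time.
        self._expires_at[(chat_id, user_id)] = self.clock() + self.ttl

    def who_is_typing(self, chat_id: str) -> Set[str]:
        now = self.clock()
        # Drop anything whose TTL has run out; nobody has to send "stopped".
        self._expires_at = {
            key: expiry for key, expiry in self._expires_at.items() if expiry > now
        }
        return {user for (chat, user) in self._expires_at if chat == chat_id}
```

In Redis the same effect would come from writing a key with an expiry and letting the server delete it, so no cleanup message is ever needed.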

Real-World Engineering Hurdles

The Thundering Herd Problem

Imagine a country’s internet goes down and then pops back up. 10 million phones all try to reconnect at the exact same second. That’s a "Thundering Herd." It can crush a load balancer. They fix this with exponential backoff plus Jitter. Basically, the app doubles its wait after each failed attempt, up to a cap, and then adds a random amount of time before retrying. The math looks something like this:

t_retry = min(2^attempt * base_delay, max_delay) + random(0, jitter) 

It turns a spike into a manageable wave.
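The formula above translates almost directly into code. The default values for `base_delay`, `max_delay`, and `jitter` here are placeholders picked for the sketch, not numbers from WhatsApp.

```python
import random


def retry_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                jitter: float = 5.0, rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    backoff = min((2 ** attempt) * base_delay, max_delay)  # capped doubling
    return backoff + rng.uniform(0.0, jitter)              # random spread
```

Two phones that fail at the same instant will almost never pick the same delay, so their retries land at different times instead of in one spike.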

Vector Clocks and Logical Truth

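Physical clocks are liars. You can't trust them in a distributed system. If Message 2 arrives before Message 1, how does the app know? They use Vector Clocks. Each participant keeps one logical counter per participant, bumps its own entry on every event, and merges entry by entry whenever a message arrives:

V_local[i] = max(V_local[i], V_received[i]) for every i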

This ensures the "Happened-Before" relationship stays intact, even if your phone's clock is set to 1999 for some reason.
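A short sketch shows how the merge rule above turns into an ordering test. The function names and the dict-based clock are choices made for this example; node names like "alice" are placeholders.

```python
from typing import Dict

VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Local event: bump this node's own counter."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, node: str) -> VectorClock:
    """On receive: take the entry-wise max, then count the receive itself."""
    merged = {
        n: max(local.get(n, 0), received.get(n, 0))
        for n in set(local) | set(received)
    }
    return tick(merged, node)


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a <= b in every entry and strictly less in at least one."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))
```

If neither clock happened before the other, the two events are concurrent, and no wall clock is needed to tell.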

The "Aha!" Moments

Vertical Scaling > Kubernetes

Everyone is obsessed with microservices. But WhatsApp scaled to 500 million users by pushing vertical scaling to the limit. They optimized the Erlang VM and the FreeBSD kernel to support 2 million connections on one server. It's elegant. It's fewer moving parts.

"Let It Crash"

This is the Erlang mantra. Don't write 50 layers of defensive try-catch blocks. If a user’s process hits an error? Let it die. A "supervisor" will notice and restart it in a clean state. This isolation means one buggy message won't take down the whole service.
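Python has nothing like OTP supervisors, but the shape of the idea fits in a single-threaded toy. The `Worker` class, the "poison" message, and the `max_restarts` limit are all invented for this sketch.

```python
from typing import Callable, List, Tuple


class Worker:
    """Toy per-user process with a bit of private state."""

    def __init__(self) -> None:
        self.handled = 0

    def handle(self, message: str) -> str:
        if message == "poison":  # hypothetical malformed input
            raise ValueError("cannot parse message")
        self.handled += 1
        return message.upper()


def supervise(make_worker: Callable[[], Worker], messages: List[str],
              max_restarts: int = 3) -> Tuple[List[str], int]:
    """Feed messages to a worker; on a crash, replace it with a clean one."""
    worker = make_worker()
    results: List[str] = []
    restarts = 0
    for message in messages:
        try:
            results.append(worker.handle(message))
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # too many crashes: give up and escalate
            worker = make_worker()  # fresh state, no defensive cleanup
    return results, restarts
```

Notice that the worker itself has no error handling at all. Recovery lives in one place, and the bad message costs one restart instead of the whole run.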

Rust and Media

Media files (JPEGs, MP4s) are dangerous. They're common vectors for exploits. By moving media parsing to Rust, they’ve basically built a memory-safe shield around your phone.

The Bottom Line

The "ignored" status is just the final link in a chain of perfectly timed, invisible handshakes. It’s a synchronized dance between lightweight processes and binary streams. WhatsApp proves that massive scale isn't about adding more "stuff"—it’s about stripping away the noise until only the performance remains. For a senior dev, the lesson is clear: the Actor model wins. Period.
