Scalability Archives - Roopinder Singh

James Governor mentions a very interesting pattern of web 2.0 – asymmetrical follow. In a nutshell, it’s basically an unbalanced communication network – you have some nodes on a network (hubs) that tend to have a lot of inbound links compared to others. In other words, it is a situation where popular people whose words/thoughts/opinions/tweets get read by the masses, and they (the popular people) do not reciprocate.

The thing is, asymmetrical follow exists everywhere. Celebrities have a lot of inbound links (tabloids, fans, press, etc), but they do not necessarily have a reverse link back. Blogs by their nature are asymmetrical as well – the blogger publishes a post thats read by visitors, and comments on the blogs, or even pingbacks do not necessarily get read or responded to by the poster. Its not being rude or anything anti-social about it, but its a pattern, and James is right – it is core to Web 2.0. Back in Dec ‘07, JP mentions that Twitter is neither a push or a pull network, but it is actually publish-subscribe.

The point of this post surrounds what James mentions in his article:

But Twitter wasn’t designed for whales. It was designed for small shoals of fish. Which brings us to one of the big issues with Asymmetrical Follow – it introduces unexpected scaling problems. Twitter’s architecture didn’t cope all that well at first, but has performed a lot better since the message broker was re-architected using Scala LIFT, a new web application programming framework). The technical approach that is most appropriate to support Asymmetrical Follow is well known in the world of high scale enterprise messaging- its called Publish And Subscribe.

Publish-Subscribe is a very common pattern in technology. Having worked in 2 investment banks, I have seen plenty of implementations that do the exact same thing: Publishers fire data once to a middleware, and that middleware layer sends that data off to many subscribers. Sounds simple enough to implement, right ? Well, it’s not.

Designing a good, reliable, highly performant Publish-Subscribe framework is not easy. Getting the initial bits is simple and trivial, but the problem that a lot of people face is scalability. If you are looking to build a Pub/Sub layer on your own, then the first thing you have to do is stop and take a reality check. Its not worth the trouble. Buy it from someone, or reuse another framework (like Twitter have done with Scala LIFT). I am not kidding. I have seen millions of dollars go down the drain in missed opportunities, direct trading losses, etc all due to poorly designed and implemented Pub/Sub layers.

Pub/Sub frameworks are a lot like caches (for e.g. memcached) but with a twist. Not only do you have to cache data, but you have to tell subscribers when this data has changed. In fact, they are closer to finite state machines than caches.

Here are a few that I have used in the past and highly recommend any of them:

Oracle Coherence (this seems to be the hot stuff that everyone is talking about)
Reuters RMDS
Tibco Rendezvous

Now, if you’re still stubborn and think you’re up for the challenge, then here are a few pointers:

Design it really, really well

Sit down with a few people and walk them through your design. Your design has to look into how memory is managed, the threading model, the communication mechanism, etc. Find as many defects as possible and don’t take it personally. Do this before you even write a single line of code.

Have a solid, clean API

I have used some really arcane APIs in the past, and oddly, some of them are provided by electronic markets (no names mentioned here). Remember, the API will be used by publishers and subscribers. The cleaner the API, the less bugs it will introduce in publishers/subscribers code.

Non-blocking IO

If you’re using TCP sockets as a communication layer, do use select() (non-blocking IO). You need to break away from the one-thread-per-client model. That model, while easy to code to, just does not scale at all. I have been in way too many situations where I have inherited a system that uses the one-thread-per-client model, and all of a sudden it does not work in production because they’ve just scaled from 30 connected clients to 3000. BTW, if you’re developing in Java, I highly recommend using Apache MINA to reduce the stress of writing non-blocking IO code.

Watch out for data state inconsistencies

A common approach that a lot of frameworks use is to send a snapshot followed by updates of changes.

Publishers should send the following messages upon startup:

An initial message saying “This is the beginning of my initial data”
The initial data itself
A final message saying “This is the end of my initial data”

From then on, publishers should just send updates.

Subscribers will get the reverse. A call to subscribe() should result in at least the following callbacks:

A callback saying “This is the beginning of the data”
The initial snapshot data itself
A callback saying “This is the end of the data”

From then on, subscribers should just send updates.

Handling the subscribe() call in your framework is going to be tricky. You’ll need to be careful of locking your cache to ensure that no one updates it while you’re taking the initial data for the subscriber. Alternatively, you could create a snapshot copy of the cache, but keep an eye on your memory usage.

Correctness over performance

Don’t worry so much about reducing latency from 200ms to 20ms. Getting your implementation correct is far more important than performance. I’m not saying performance is not important, BUT you need to get it to work correctly before looking into performance.

Build a load test framework

You will definitely need one of these. There have been way too many times in the past where I need to reproduce a production problem related to scalability, only to find that the original authors of the system did not bother building a load test framework.

I hope by now you would realise that designing and implementing a Publish/Subscribe framework is not trivial. Buy it off the shelf as someone out there has gone through all of this pain for you.

I am a big believer that your system architecture should reflect the underlying business. It will not be a good fit if you are trying to retrofit an incorrect architecture as you will end up having loads of problems (scalability, maintenence, etc) in the long run. Asymmetric follow is here to stay, and in your next project think about how it is going to affect your architecture and what you need to do at the initial outset to get it right.