(Warning: the following paragraph is almost certainly false…)

Below is the actual software architecture showing how services within Netflix running on AWS collaborate to provide customers with all their video goodness.

(We now return you to our normal viewing)

The truth is this picture is just some random grab from a Google image search. And actually, statistically speaking, this diagram might be exactly what I described, but in all honesty the chances are slim… but that’s beside the point, because the main cloud architect from Netflix recently said this about the very architecture this diagram is purporting to describe…

“We don’t keep track of dependencies. We let every individual developer keep track of what they have to do. …We can’t provide an architecture diagram, it has too many boxes and arrows. There are literally hundreds of services running…I don’t worry about it because we’ve built a decoupled system where every service is capable of withstanding the failure of every service it depends on.”

My first reaction to reading this (and the rest of the quite informative article) was “boy – this guy’s a brave soul to make those statements publicly”, which I almost immediately realised was a fairly cowardly response, as the architecture is obviously working for them. Seen from the outside, Netflix have a good reputation when it comes to service reliability and delivery, and they were the only big AWS client to survive the US East EBS meltdown in April 2011 (http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html). Based on all of this, I have to conclude that the good folk at Netflix really understand how to architect for large-scale distribution on AWS – which is an art that many of their virtualised neighbours on AWS are still grappling with.

So after thinking about what worried me so much about how the Netflix architecture was described in that article, I realised that if I came across an architectural diagram anything like the one I displayed earlier, I would immediately be very suspicious of the people behind the design and how it came to be this way. Perhaps my initial bias is somewhat unfounded…

Indeed, I automatically assume that a highly coupled architecture (like the one the diagram shows) implies a highly dependent architecture: a system of components/services that rely on the availability of their collaborators. That is obviously not the case with Netflix. Furthermore, given that AWS is based on commodity hardware which is literally failing all over the place, building solutions that are highly tolerant of failure is a cornerstone of any reliable cloud architecture.
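
To make the difference between "coupled" and "dependent" a bit more concrete, here is a minimal Python sketch of a service that calls a collaborator but survives its failure by falling back to a default. This is purely illustrative and is not how Netflix actually does it; every name in it (`call_with_fallback`, `home_page`, the recommendation functions and `POPULAR_TITLES`) is invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical static default, served when the collaborator is unavailable.
POPULAR_TITLES: List[str] = ["Popular 1", "Popular 2"]


def call_with_fallback(primary: Callable[[], List[str]],
                       fallback: List[str]) -> List[str]:
    """Return the collaborator's result, or the fallback if the call fails."""
    try:
        return primary()
    except Exception:
        # The collaborator is down: degrade gracefully instead of failing too.
        return fallback


def broken_recommendations() -> List[str]:
    # Hypothetical dependency that is currently unreachable.
    raise ConnectionError("recommendation service unreachable")


def healthy_recommendations() -> List[str]:
    # Hypothetical dependency that is working normally.
    return ["Title A", "Title B"]


def home_page(recommend: Callable[[], List[str]]) -> Dict[str, List[str]]:
    # The page is coupled to the recommender (it calls it), but it does not
    # depend on the recommender being available in order to render.
    return {"rows": call_with_fallback(recommend, POPULAR_TITLES)}


if __name__ == "__main__":
    print(home_page(healthy_recommendations))
    print(home_page(broken_recommendations))
```

The arrow between the two boxes still exists on the diagram, but the home page keeps rendering whether or not the thing at the other end of the arrow is alive. A real system would presumably also want timeouts and some way to stop hammering a dead service, which this sketch leaves out.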

So perhaps I should focus less on coupling and start looking more at how dependencies and failover are managed?

Perhaps this sort of diagram isn’t necessarily an automatic red flag?

Indeed, AWS’s much-published thoughts on 2-pizza teams will naturally produce more distributed architectures based on much larger numbers of small connected services.