This interview has already been picked up and commented upon (and /.’ed), but if you have not yet taken a look, I recommend reading this ACM piece on Hotmail and what it means to manage one of the largest services on the web.
Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e., millions of gigabytes), and serves, according to this Wikipedia article, 221 million users who generate billions of e-mail transactions daily. It is operated by 100 sysadmins, which is not that large a team.
Phil Smoot, the PM in charge of Hotmail product development out of the Microsoft Silicon Valley campus, shares a number of insights, from which I noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:
- QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.
- [...] if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.
- We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site. [a staged-rollout sketch follows these excerpts]
- The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets expensive fast. Migration is something you spend more time thinking about over lots of servers versus a few servers. For example, migrating terabytes worth of data takes a long time and involves complex capacity planning and data-center floor and power consumption issues. You also do more up-front planning around how to go backwards if the new version fails.
- We strive to build tools that can replay live-site transactions and real-type live-site loads against single nodes. The notion is that the application itself is logging this data on the live site so that it can be easily consumed in our QA labs. Then as applications bring in new functionalities, we want to add these new transactions to the existing test beds. [a record-and-replay sketch follows these excerpts]
- The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes—and backing them up to cheap disks—is probably much more where we’re headed. How you can do this in a disconnected fashion is an interesting problem. That is, how are you going to protect the system from viruses and software and administrative scripting bugs? What you’ll start to see is the emergence of the use of data replicas and applying changes to those replicas, and ultimately the requirement that these replicas be disconnected and reattached over time. [a change-log replica sketch follows these excerpts]
- As you go to, let’s say, a commodity model, you have to assume that everything is going to fail underneath you, that you have to deal with these failures, that all the data has to be replicated, and that the system essentially self-heals. For example, if you are writing out files, you put a checksum in place that you can verify when the file is read. If it wasn’t correct, then go get the file somewhere else and repair the old file. [a checksum-and-repair sketch follows these excerpts]
- Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.
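To make the staged-rollout point concrete, here is a minimal sketch (in Python) of pushing a new version out in widening rings of 1, 10, 100 and 1,000 servers, with a health check between rings. The deploy_to and health_ok helpers are placeholders of mine, not Hotmail tooling; they stand in for the automation hooks Smoot says have to be built into the service from the start.

```
import time

# Hypothetical helpers: deploy_to() pushes the new bits to one server and
# health_ok() runs whatever smoke tests the service defines.

def deploy_to(server, version):
    print(f"deploying {version} to {server}")

def health_ok(server):
    return True  # replace with real error-rate / latency checks

def staged_rollout(servers, version, rings=(1, 10, 100, 1000)):
    """Roll version N+1 out in widening rings instead of flipping the whole
    site at once; halt (leaving the rest on version N) if a ring looks bad."""
    done = 0
    for ring_size in rings:
        ring = servers[done:done + ring_size]
        if not ring:
            break
        for server in ring:
            deploy_to(server, version)
        time.sleep(1)  # bake time before judging the ring
        if not all(health_ok(s) for s in ring):
            print(f"halting rollout: a ring of {len(ring)} servers failed health checks")
            return False
        done += len(ring)
    return True

# e.g. staged_rollout([f"mbx{i:05d}" for i in range(10000)], "N+1")
```

While a rollout like this is in flight, version N and N+1 servers coexist on the site, which is exactly why clients, servers and data formats all have to tolerate both versions.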
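The record-and-replay idea sketches just as easily, assuming the live site logs each transaction as one JSON line and the node under test exposes a handle() entry point (both assumptions of mine, not Hotmail’s actual tooling):

```
import json

def log_transaction(logfile, verb, **args):
    """Called on the live site: append one transaction per line so the log
    can later be consumed in the QA lab."""
    logfile.write(json.dumps({"verb": verb, "args": args}) + "\n")

def replay(logpath, node):
    """Called in the QA lab: feed the captured transactions to a single node
    under test; node.handle() is whatever entry point the service exposes."""
    with open(logpath) as f:
        for line in f:
            txn = json.loads(line)
            node.handle(txn["verb"], **txn["args"])
```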
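The “back up changes to cheap disks” point can be sketched as a change log on the primary plus replicas that, when reattached, apply only the entries they have not yet seen; keeping the replica detached most of the time is what shields it from viruses and scripting bugs. The classes and operations below are illustrative assumptions:

```
class Replica:
    """A normally disconnected copy of the data that catches up by applying
    only the change-log entries it has not seen yet."""

    def __init__(self):
        self.state = {}    # e.g. message id -> contents
        self.applied = 0   # number of change-log entries already applied

    def catch_up(self, change_log):
        """Reattach, apply the new changes, then detach again."""
        for op, key, value in change_log[self.applied:]:
            if op == "put":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.applied += 1

# The primary appends every mutation to the change log as it happens:
change_log = [("put", "inbox/1", "hello"),
              ("put", "inbox/2", "world"),
              ("delete", "inbox/1", None)]
replica = Replica()
replica.catch_up(change_log)   # run periodically, while the replica is attached
```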
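Finally, the self-healing example translates almost directly into code: store a digest next to each file, verify it on every read, and on a mismatch fetch a good copy from a replica and repair the local file. The file layout and the fetch_from_replica hook are assumptions of mine:

```
import hashlib

def write_with_checksum(path, data):
    """Write the file and store its digest alongside it."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha1", "w") as f:
        f.write(hashlib.sha1(data).hexdigest())

def read_with_repair(path, fetch_from_replica):
    """Verify the checksum on read; on a mismatch, get the file from another
    replica and repair the old copy."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha1") as f:
        expected = f.read().strip()
    if hashlib.sha1(data).hexdigest() != expected:
        data = fetch_from_replica(path)     # go get the file somewhere else...
        write_with_checksum(path, data)     # ...and repair the old file
    return data
```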
Jeff, thanks for sharing this interview. Wow, 10,000 servers and 100 system administrators; it’s hard not to have a profound appreciation for the complexity and scale of Hotmail. I wonder how much of their energy goes into fighting spam?
cheers
Wayne Lambright
http://sfsurvey.com
Posted by: lambright | January 16, 2006 at 08:04 AM
Fascinating.
I have been interested in this issue of scaling your infrastructure for some time now (http://rodrigo.typepad.com/english/2005/10/planning_your_w.html)
Posted by: Rodrigo A. SEPÚLVEDA SCHULZ | January 16, 2006 at 03:55 PM
Thanks for the pointer! Very interesting read.
Posted by: Dorrian | January 16, 2006 at 04:47 PM
Very interesting. The funny thing about Hotmail is that new users are treated better than old users: my account created in 1997 still offers only 2 MB of storage, while the one created in 2005 offers 250 MB.
Posted by: Jeff | January 17, 2006 at 12:53 PM
Excellent article; there is some food for thought for everyone here!
Emmanuel
http://galide.jazar.co.uk
Posted by: manu | January 22, 2006 at 08:13 AM