Velocity 2010: Tom Cook, “A Day in the Life of Facebook Operations”
Here are my notes from the Velocity 2010 lecture "A Day in the Life of Facebook Operations"
A Day in the Life of Facebook Operations
Tom Cook, Facebook
Velocity 2010
June 22-24, 2010
(40 minutes, 48 seconds)
- Description of the size of Facebook in terms of minutes on site, pieces of new content, etc.
- User growth curve
- Server footprint growth curve
- Data centers in the Bay Area and Virginia (and soon Oregon)
The stack:
Load Balancer -> (assigns a web server)
Web Server -> (assembles data)
Services (fast, complicated), Memcached (fast, simple), Databases (slow, persistent)
Web server (HipHop for PHP)
- source code transformer
- converts PHP to C++, compiled with gcc
Memcached
- 300+ TB live data in RAM
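The "fast, simple" cache in front of a "slow, persistent" database is commonly used in a look-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache. The talk notes do not describe Facebook's actual client code, so the following is only a minimal sketch of that general pattern. The class name `LookAsideCache` and the `load_from_db` callback are hypothetical, and a plain dict stands in for memcached.

```python
from typing import Callable, Dict, Optional


class LookAsideCache:
    """Toy look-aside cache. A dict stands in for memcached (assumption)."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._cache: Dict[str, str] = {}
        self._load_from_db = load_from_db  # hypothetical slow, persistent lookup
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        # Fast path: the value is already in RAM.
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Slow path: go to the persistent store, then remember the answer.
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Call after a write so the next read refetches from the database.
        self._cache.pop(key, None)


if __name__ == "__main__":
    database = {"user:1": "alice"}
    cache = LookAsideCache(database.get)
    cache.get("user:1")  # miss, loads from the database
    cache.get("user:1")  # hit, served from the cache
    print(cache.hits, cache.misses)
```

The second read never touches the database, which is the point of keeping hot data in RAM.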
MySQL
- persistent store
- lots of sharding
- facebook.com/MySQLatFacebook
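"Lots of sharding" means rows are split across many MySQL instances, and some function maps each key to the instance that holds it. The talk did not say how Facebook chooses a shard, so this is just a sketch of one simple scheme (a stable hash modulo the shard count). The function names `shard_for` and `group_by_shard` and the use of a user ID as the key are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(user_id: int, num_shards: int) -> int:
    """Map a user ID to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # crc32 is stable across processes, unlike Python's built-in hash() on str.
    return zlib.crc32(str(user_id).encode("utf-8")) % num_shards


def group_by_shard(user_ids: Iterable[int], num_shards: int) -> Dict[int, List[int]]:
    """Batch IDs per shard so each database is queried only once."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for uid in user_ids:
        groups[shard_for(uid, num_shards)].append(uid)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_shard(range(10), num_shards=4))
```

A plain modulo scheme moves most keys when the shard count changes. Real deployments often add a lookup table or consistent hashing on top, which this sketch leaves out.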
Services
- news feed, search, chat, ads, media, etc.
Operations supplies a platform for the Facebook developers to deploy on
So, below the stack, we have:
Deployment, Monitoring
Systems Management
Core Operating System
Operating System
- Linux
- CentOS 5 variant with custom kernel
Systems Management
- Configuration management
- Facebook uses CFengine
- Updates every 15 minutes; each run takes about 30 seconds per machine
- On demand tools
- No open source solution that meets Facebook needs (used to use dsh)
- Wrote their own internal tool
Deployments
- Push for frontend code (web push)
- At least once a day, frequently multiple times a day (bug fixes, etc.)
- New features at least once a week
- Built on top of on-demand control tools
- Code distributed by BitTorrent (1 minute to push code to all servers)
- Backend deployments
- Formal QA process removed, QA is responsibility of engineers
- Engineers deploy their own code
- No ‘commit and quit’ mentality
- Ops ‘embedded’ into engineering teams
- Change logging (every change, who, start time and end time)
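The change-logging point (every change, who, start time and end time) maps naturally onto a small record type. Facebook's internal tool was not shown, so the following is only a sketch: the names `ChangeRecord`, `CHANGE_LOG` and `logged_change` are hypothetical, and an in-memory list stands in for whatever store or feed the real system uses.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class ChangeRecord:
    who: str
    description: str
    start_time: float
    end_time: Optional[float] = None  # filled in when the change finishes


# In-memory stand-in for a real change log (assumption).
CHANGE_LOG: List[ChangeRecord] = []


@contextmanager
def logged_change(who: str, description: str) -> Iterator[ChangeRecord]:
    """Record who made a change and when it started and ended."""
    record = ChangeRecord(who=who, description=description, start_time=time.time())
    CHANGE_LOG.append(record)
    try:
        yield record
    finally:
        # The end time is recorded even if the change raises an exception.
        record.end_time = time.time()


if __name__ == "__main__":
    with logged_change("alice", "web push: bug fix"):
        pass  # the actual deployment steps would run here
    print(CHANGE_LOG)
```

Wrapping the change in a context manager means an engineer cannot forget to close the entry, so a failed push still leaves a complete record.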
Monitoring
- Ganglia (systems focus, graphing; http://ganglia.sourceforge.net/)
- ODS (application focus), written by Facebook
- Nagios (ping, ssh, server up, etc.), alerting feeds into internal tools
- Aggregate alarms, drilldown capabilities
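To illustrate "aggregate alarms, drilldown capabilities": per-host alerts can be rolled up into one count per cluster, with the underlying hosts kept available for inspection. This is a sketch of the idea only, not Facebook's internal tooling. The `Alert` fields and the function names `aggregate` and `cluster_totals` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Alert(NamedTuple):
    cluster: str
    host: str
    check: str  # e.g. "ping" or "ssh"


def aggregate(alerts: Iterable[Alert]) -> Dict[str, Dict[str, List[str]]]:
    """Group alerts as cluster -> check -> list of affected hosts."""
    summary: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for alert in alerts:
        summary[alert.cluster][alert.check].append(alert.host)
    return {cluster: dict(checks) for cluster, checks in summary.items()}


def cluster_totals(summary: Dict[str, Dict[str, List[str]]]) -> Dict[str, int]:
    """Top-level view: one alert count per cluster."""
    return {
        cluster: sum(len(hosts) for hosts in checks.values())
        for cluster, checks in summary.items()
    }


if __name__ == "__main__":
    alerts = [
        Alert("web-a", "web001", "ping"),
        Alert("web-a", "web002", "ping"),
        Alert("web-a", "web002", "ssh"),
        Alert("db-b", "db001", "ssh"),
    ]
    summary = aggregate(alerts)
    print(cluster_totals(summary))   # aggregate view
    print(summary["web-a"]["ping"])  # drill down to the hosts
```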
What Facebook operations deals with…
- Constant Growth
- Constant Failures
Look at network as logical units and dependencies
- Servers
- Racks
- Clusters (thousands of hosts each)
- Data centers
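Treating the network as nested logical units makes it easy to ask what is affected when one unit fails. The sketch below models the hierarchy listed above as nested dicts and lists every host under a given unit. The topology, host names and the function `hosts_under` are all hypothetical and exist only to show the idea.

```python
from typing import Dict, List, Sequence, Union

# data center -> cluster -> rack -> hosts (all names are made up)
Node = Union[List[str], Dict[str, "Node"]]

TOPOLOGY: Dict[str, Node] = {
    "dc1": {
        "cluster-a": {"rack-1": ["web001", "web002"], "rack-2": ["web003"]},
        "cluster-b": {"rack-3": ["db001"]},
    },
}


def _flatten(node: Node) -> List[str]:
    """Collect every host below a node, depth first."""
    if isinstance(node, list):
        return list(node)
    hosts: List[str] = []
    for child in node.values():
        hosts.extend(_flatten(child))
    return hosts


def hosts_under(topology: Dict[str, Node], path: Sequence[str]) -> List[str]:
    """Return the hosts that depend on the unit named by path."""
    node: Node = topology
    for name in path:
        node = node[name]  # raises KeyError for an unknown unit
    return _flatten(node)


if __name__ == "__main__":
    # Which hosts are affected if cluster-a in dc1 fails?
    print(hosts_under(TOPOLOGY, ["dc1", "cluster-a"]))
```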
Constant Communications
- IRC
- Internal news updates
- Banners on top of lots of tools with alerts as to current status
- Change logs / feeds
- Small teams
Recap
- Version control everything
- Optimize early
- Automate
- Use configuration management
- Plan to fail
- Instrument everything