Here are my notes from the Velocity 2010 lecture "A Day in the Life of Facebook Operations".

A Day in the Life of Facebook Operations
Tom Cook, Facebook
Velocity 2010
June 22-24, 2010
(40 minutes, 48 seconds)

  • Description of the size of Facebook in terms of minutes on site, pieces of new content, etc.
  • User growth curve
  • Server footprint growth curve
  • Data centers in the Bay Area and Virginia (and soon Oregon)

The stack:
Load Balancer -> (assigns a web server)
Web Server -> (assembles data)
Services (fast, complicated), Memcached (fast, simple), Databases (slow, persistent)
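
The read path through this stack can be sketched as a cache-aside lookup: try the fast, simple store first and fall back to the slow, persistent one. This is only an illustration, not Facebook's code. The dictionaries standing in for Memcached and the database, and the names CACHE, DATABASE, and fetch_profile, are assumptions made for the example.

```python
# Hypothetical stand-ins: one dict plays the role of Memcached (fast, simple)
# and another plays the role of the database (slow, persistent).
CACHE = {}
DATABASE = {42: {"name": "Alice"}}


def fetch_profile(user_id):
    """Return (profile, source) using a cache-aside lookup."""
    profile = CACHE.get(user_id)
    if profile is not None:
        return profile, "cache"
    profile = DATABASE.get(user_id)  # slow path: go to the persistent store
    if profile is not None:
        CACHE[user_id] = profile  # populate the cache for the next request
    return profile, "database"


first = fetch_profile(42)   # served from the database, then cached
second = fetch_profile(42)  # served from the cache
```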

Web server (HipHop for PHP)

  • source code transformer
  • converts PHP to C++, compiled with gcc

Memcached

  • 300+ TB live data in RAM

MySQL

  • persistent store
  • lots of sharding
  • facebook.com/MySQLatFacebook
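
The talk did not describe how rows are assigned to shards. One common approach is to hash a key such as the user ID to a shard number. A minimal sketch of that idea, assuming a fixed shard count; NUM_SHARDS and shard_for are hypothetical names, and the value 8 is arbitrary:

```python
import zlib

NUM_SHARDS = 8  # assumed value, for illustration only


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash."""
    key = str(user_id).encode("utf-8")
    # crc32 is deterministic across processes, unlike Python's built-in hash()
    # for strings, so the same user always lands on the same shard.
    return zlib.crc32(key) % num_shards
```

A fixed modulus like this makes adding shards expensive, since most keys move; it is shown here only because it is the simplest scheme to read.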

Services

  • news feed, search, chat, ads, media, etc.

Operations supplies a platform for the Facebook developers to deploy on.
So, below the stack, we have:

Deployment, Monitoring
Systems Management
Core Operating System

Operating System

  • Linux
  • CentOS 5 variant with custom kernel

Systems Management

  • Configuration management
    • Facebook uses CFengine
    • Updates every 15 minutes; each run takes about 30 seconds per machine
  • On demand tools
    • No open source solution met Facebook's needs (used to use DHS)
    • Wrote their own internal tool

Deployments

  • Push for frontend code (web push)
    • At least once a day, frequently multiple times a day (bug fixes, etc.)
    • New features at least once a week
    • Built on top of on-demand control tools
    • Code distributed by BitTorrent (1 minute to push code to all servers)
  • Backend deployments
    • Formal QA process removed, QA is responsibility of engineers
    • Engineers deploy their own code
    • No ‘commit and quit’ mentality
    • Ops ‘embedded’ into engineering teams
    • Change logging (every change, who, start time and end time)
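
A change-log entry of the kind described (the change, who made it, start time and end time) could be modeled as below. This is a sketch only; the ChangeRecord class, its field names, and the log_change context manager are hypothetical and are not Facebook's internal tool.

```python
import contextlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChangeRecord:
    description: str
    who: str
    start: datetime
    end: Optional[datetime] = None


CHANGE_LOG: List[ChangeRecord] = []


@contextlib.contextmanager
def log_change(description: str, who: str):
    """Record who made a change and when it started and finished."""
    record = ChangeRecord(description, who, datetime.now(timezone.utc))
    CHANGE_LOG.append(record)
    try:
        yield record
    finally:
        # The end time is set even if the change raises an exception.
        record.end = datetime.now(timezone.utc)


with log_change("restart chat service", who="alice") as change:
    pass  # the actual change would happen here
```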

Monitoring

  • Ganglia (systems focus, graphing) (http://ganglia.sourceforge.net/)
  • ODS (application focus), written by Facebook
  • Nagios (ping, ssh, server up, etc.), alerting feeds into internal tools
  • Aggregate alarms, drilldown capabilities

What Facebook operations deals with…

  • Constant Growth
  • Constant Failures

Look at network as logical units and dependencies

  • Servers
  • Racks
  • Clusters (each some thousands of hosts)
  • Data centers
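
Treating hosts as members of racks, clusters, and data centers lets individual failures be rolled up to the level where they matter, which is also one way to aggregate alarms and still drill down. A rough sketch under stated assumptions: the HOSTS inventory, the host names, and failures_by_level are all invented for illustration.

```python
from collections import Counter

# Hypothetical inventory: host -> (data center, cluster, rack)
HOSTS = {
    "web001": ("dc1", "c1", "r1"),
    "web002": ("dc1", "c1", "r1"),
    "web003": ("dc1", "c1", "r2"),
    "db001": ("dc2", "c7", "r9"),
}


def failures_by_level(failed_hosts):
    """Count failed hosts per rack, per cluster, and per data center."""
    racks, clusters, datacenters = Counter(), Counter(), Counter()
    for host in failed_hosts:
        dc, cluster, rack = HOSTS[host]
        datacenters[dc] += 1
        clusters[(dc, cluster)] += 1
        racks[(dc, cluster, rack)] += 1
    return {"data_center": datacenters, "cluster": clusters, "rack": racks}


summary = failures_by_level(["web001", "web002", "db001"])
```

Two failures in the same rack point at a shared dependency (power, switch) rather than two unrelated host problems, which is the reason to look at the rolled-up counts first.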

Constant Communications

  • IRC
  • Internal news updates
  • Banners on top of lots of tools with alerts as to current status
  • Change logs / feeds
  • Small teams

Recap

  • Version control everything
  • Optimize early
  • Automate
  • Use configuration management
  • Plan to fail
  • Instrument everything