The largest Internet sites manage a precarious balance between innovation and reliability. That's a particular challenge for Facebook, which is experiencing epic user growth while rolling out new features on a regular basis. How does it pull that off?
"We fail all the time," said Jonathan Heiliger, the VP of Technical Operations for Facebook. "It's not without danger. Our goal is to make (failure) transparent to our users, and create an environment where it's safe for our employees to fail."
Heiliger was the keynote speaker at the O'Reilly Velocity 2009 conference Tuesday in San Jose, where he discussed the challenges presented by Facebook's growth. "We believe the most effective technical organizations are those that can change fast, and that takes teamwork," said Heiliger.
At Facebook, that has meant cultivating a culture of collaboration between engineering and operations, two groups that are often at odds. While engineers are eager to innovate, operations prizes stability and uptime. "You get conflict," said Heiliger. "I think in every company there's conflicts between operations and engineering."
To bridge the gap, Facebook changed the process for introducing new features. In most organizations, the engineering team writes code, which is then tested by a quality assurance (QA) team before being deployed and becoming the responsibility of the operations team. Heiliger pursued a different approach.
"We don't actually have QA," he said. "At Facebook, every engineer is responsible for the cradle-to-grave lifecycle of their code and their application. You want to put engineering as close to the customer as possible."
The operations team provides "guard rails" during code implementation to contain any unanticipated problems so they don't crash the site. "We built a whole toolset to manage continuous builds, and that toolset is what has allowed us to scale and grow," said Heiliger.
A key strategy in implementing a new feature is the "dark launch," in which new code and functions for the Facebook platform are introduced for limited numbers of users, often running invisibly in the background. By building a feature into the site's back-end before unveiling a front-facing interface, Facebook gets valuable insight into how the new code will perform when deployed to a larger user base.
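Conceptually, a dark launch can be as simple as a gate that silently exercises the new code path for a slice of users while discarding its output, so engineers can watch load and error rates before anyone sees the feature. The sketch below is a minimal, hypothetical illustration in Python, not Facebook's actual tooling; the function names and the 5 percent bucket are invented for the example.

```python
import hashlib

# Hypothetical dark-launch gate -- an illustration, not Facebook's real code.
# A fixed slice of users silently exercises the new backend path, but its
# result is discarded; every user still sees the output of the existing path.

DARK_LAUNCH_PERCENT = 5  # share of users whose requests also hit the new code


def current_backend_lookup(user_id: int) -> str:
    """Stand-in for the existing, user-facing code path."""
    return f"profile-{user_id}"


def new_backend_lookup(user_id: int) -> str:
    """Stand-in for the new code being dark-launched."""
    return f"profile-v2-{user_id}"


def in_dark_launch(user_id: int, percent: int = DARK_LAUNCH_PERCENT) -> bool:
    """Deterministically bucket users so the same user is always in or out."""
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return bucket < percent


def handle_request(user_id: int) -> str:
    if in_dark_launch(user_id):
        try:
            new_backend_lookup(user_id)  # exercised only to generate load and metrics
        except Exception:
            pass  # failures in the dark path must never reach the user
    return current_backend_lookup(user_id)  # users always get the existing path


if __name__ == "__main__":
    print(handle_request(1234))
```

The value is in the measurements: because the dark path carries real production traffic, its latency and failure rates can be compared against the live path before the feature is ever switched on for users.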
The most recent performance challenge was the introduction of Facebook usernames earlier this month, which was appropriately code-named "Hammer" by the technical team. "We were about to hammer the crap out of our site," Heiliger said.
The change was rolled out at midnight Eastern time on a Saturday morning. "We intentionally picked a pretty odd time to do it, because there's not a lot of traffic at that time and it provided us some additional capacity," said Heiliger. More than 300,000 usernames were claimed in the first three minutes, he said, with more than 1 million names secured in the first hour.
Heiliger also noted the critical importance of investing in data center infrastructure. "Our approach is to be frugal, but invest where necessary," he said. "We've invested pretty significantly in our infrastructure, and as a result we can scale. Data centers and infrastructure wind up being the second largest expense at Facebook after people."
Data Center Knowledge has previously reported that Facebook appears to be spending $20 million to $25 million a year on the data center space that houses its servers. Facebook has managed its infrastructure costs through its relationships with the two largest "wholesale" data center landlords, the real estate investment trusts Digital Realty Trust and DuPont Fabros Technology.
Heiliger said those relationships have been important, but are not without their tensions. "I don't think we're cheap, but our vendors complain and gripe about what a hard bargain we drive," he said.