Edited transcript of Facebook (FB) at Credit Suisse 2014 Annual Technology Conference
Facebook’s Vice President of Infrastructure, Jason Taylor presents at Credit Suisse 2014 Annual Technology Conference on December 02, 2014, 12:30 PM ET. Following are the webcast audio and the associated transcript of the event…
Stephen Ju – Analyst, Credit Suisse
All right, I think we’re going to go ahead and get started. Stephen Ju from the Credit Suisse Internet Equity Research team joined by Jason Taylor, who heads the infrastructure development effort at Facebook.
So without further ado, take it away.
Jason Taylor – VP, Infrastructure, Facebook
Great. So my name is Jason Taylor and I run a group called Infrastructure Foundation at Facebook.
We’re responsible for server design, server supply chain, overall capacity management. So capacity engineering, performance reviews, things like that. And then also the long-term infrastructure plan.
So today, I’m going to kind of walk through a little bit of our infrastructure and talk about a few efficiency programs that we’re excited about and look towards the future of efficiency at large-scale computing.
So Facebook is large. 82% of our monthly active users are outside the United States. We are a global deployment, with a data center in Luleå, Sweden, and several in the United States.
So 1.35 billion people connect with us monthly, 1.2 billion on mobile, and a kind of stunning 930 million photos are uploaded to the site every day. So it's a lot of media, a lot of content that's distributed on Facebook.
6 billion likes, 12 billion messages per day. It’s a very active site, very dynamic and we built an infrastructure to accommodate that.
Now for the last five years, efficiency has really been a top priority at the company, and initially I would say that it was really about necessity. We were facing a huge uptick in Facebook adoption and usage, and efficiency has always been core to our ability to scale. As we reached a large scale, it became necessary for long-term financial viability and for our ability to build platforms that scale well.
Now from a cost perspective, efficiency really breaks down into three areas. For data centers, heat management is one of the most important things that we do in terms of core efficiency. In a poorly designed facility, a facility that doesn't concentrate on heat very much, you could easily pay 50% to 90% additional electricity for every watt that you deliver to a server.
Now at Facebook, because we've designed both our own servers and our own data centers, that heat tax is only 7%. That means we are using cold air from the outside, we're not chilling air at all. We pass it across the servers, mix it in a hot aisle, and then evacuate it out the other side of the building. So in terms of raw thermal efficiency our data centers are second to none.
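The "heat tax" figures above are essentially a statement about PUE (power usage effectiveness). A small illustrative sketch, using the percentages quoted in the talk as assumed inputs rather than measured data:

```python
# Illustrative "heat tax" arithmetic based on the talk's quoted figures.
# PUE = total facility power / power delivered to servers, so the
# overhead per server watt is (PUE - 1). The PUE values are assumptions.

def overhead_watts(server_watts: float, pue: float) -> float:
    """Extra facility watts spent for the given watts delivered to servers."""
    return server_watts * (pue - 1.0)

# A poorly designed facility paying 50% to 90% extra (PUE 1.5 to 1.9):
legacy = overhead_watts(1_000_000, 1.9)    # 900,000 W of cooling/overhead

# A facility at the quoted ~7% heat tax (PUE ~1.07):
efficient = overhead_watts(1_000_000, 1.07)  # 70,000 W of overhead

print(f"legacy overhead: {legacy:,.0f} W, efficient overhead: {efficient:,.0f} W")
```

At a megawatt of server load, the difference between those two facilities is more than 800 kW of pure overhead, which is why heat management dominates the data-center side of the cost picture.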
Now with servers, we pride ourselves on having a vanity-free design and we really focus on supply chain optimization. In 2011 we released our first data center and our first set of servers. We also started the Open Compute Project, which I'm sure many of you are familiar with, where we give away the designs to our servers and are very open about how we design, what our approach is and how we think about efficiency at the server level.
The other main efficiency win really comes from software. Horizontal wins like HHVM (formerly HipHop, or HPHP), and wins in cache, database and web, are all absolutely critical in continuing to deliver really efficient infrastructure.
So, during peak time, when one of our front-end clusters is running really hot, we can run for 10 hours of the day at about 90% to 93% server utilization. So we work a tremendous amount on making sure that not only are the individual servers and the software optimized, but also the whole data center is optimized to deliver content.
We like blue, so all of our servers have blue LEDs. What you see here are the fronts of the servers, and that enclosed space is hot aisle containment. Cold air comes in from the ceiling, it's pulled through the servers, and inside that hot aisle containment the temperature can reach up to 100 degrees. That hot air is then evacuated out of the building, or potentially mixed back in during the winter.
So it's a thermally very efficient system. The other thing you'll notice is that all of our servers look the same. That's because we really work hard on having a very homogeneous footprint, so that you get good wins in terms of serviceability, maintenance, drivers, everything else.
Now we've also been very open about all of our efficiency wins. Not only have we publicly released data center designs, not only have we released server designs, but we've also released most of the core software that powers Facebook. HHVM is our core PHP web server; it is ballpark five to six times more efficient than a traditional Apache stack.
Flashcache is a caching layer that we use on databases, which trades off flash caching against access to slower hard drives. Presto is one of our data processing and data warehouse pieces of software. RocksDB is a newer release; Proxygen, Thrift and Folly are general libraries we use. In all cases, whenever we're able to support open sourcing a project, we try to get it out there. And the reason for this is that we really believe that the entire industry can benefit from the efficiency work that we do, and that we can benefit from the industry feeding back and contributing new ideas and designs.
And fundamentally, our company is going to win or lose based on our product, not because of our infrastructure, and cost efficiency wins on infrastructure are something we'd like the entire industry to benefit from.
So, in terms of our architecture, we keep it pretty simple. We have front-end clusters. Front-end clusters are synonymous with network clusters. These are large, about 12,000 servers per cluster, and this is the stamp of capacity that we push out in order to serve web requests from users.
Service clusters contain many of our dedicated services: search, photos, messages and others. And then our back-end clusters are all built and optimized for database storage, so we think a lot about power redundancy in those clusters.
So to take one of our services, it's useful to think through a little bit about how one of these large services works. If you are on Facebook and you're viewing the main feed (on desktop it's the center column, on mobile it's the main experience), all of that is the recent activity of all of your friends on Facebook. That data is all kept in an index called the news feed rack. All of the recent activity of everyone using Facebook over the last few days is kept on each one of these racks, and the design is a leaf-aggregator design. The leaves contain all the data, all of the storage of recent activity, kept in RAM, and the aggregator is the thing that runs the ranking algorithm and consolidates all of the information to respond to a request.
So a web hit comes in and says, I need some stories. It goes to a news feed rack, picks any one of the aggregators and says, give me some stories. The aggregator then blasts a query out in parallel to the other 40 servers in the rack, gathers that subset of data, ranks it based on your interests and then sends it off to be displayed.
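The leaf-aggregator flow just described can be sketched in a few lines. This is an illustrative model, not Facebook's implementation: the `Leaf` and `Aggregator` classes, the `(story_id, score)` tuples and the sort-by-score ranking are all assumptions standing in for the real index and ranking algorithm.

```python
# Minimal sketch of the leaf-aggregator pattern: the aggregator fans a
# query out to every leaf in the rack in parallel, merges the candidate
# stories, ranks them, and returns the top results.
from concurrent.futures import ThreadPoolExecutor

class Leaf:
    """Holds one shard of recent activity, entirely in memory."""
    def __init__(self, stories):
        self.stories = stories            # list of (story_id, score) tuples

    def query(self):
        return self.stories               # this shard's candidate stories

class Aggregator:
    """Queries all leaves in the rack, then ranks the merged results."""
    def __init__(self, leaves):
        self.leaves = leaves

    def get_stories(self, limit=10):
        # Blast the query out to every leaf in parallel.
        with ThreadPoolExecutor() as pool:
            shards = pool.map(lambda leaf: leaf.query(), self.leaves)
        candidates = [story for shard in shards for story in shard]
        # Rank by score; the real ranking algorithm is far more involved.
        candidates.sort(key=lambda story: story[1], reverse=True)
        return candidates[:limit]

rack = Aggregator([Leaf([(1, 0.9), (2, 0.4)]), Leaf([(3, 0.7)])])
print(rack.get_stories(limit=2))   # prints: [(1, 0.9), (3, 0.7)]
```

Because any aggregator in the rack can serve any request and each leaf only returns its own shard, the rack scales by adding leaves while the fan-out latency stays roughly that of the slowest leaf.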
Now if you're an engineer at Facebook, you can have any server you like, as long as it's one of these five servers. We don't allow any variations. One of our main teams at Facebook infra is capacity engineering. They wear black T-shirts that say "No" on the front. They are very good at saying no to all kinds of engineering requests, because fundamentally, software is far more flexible than hardware, and your hardware becomes inefficient when you have a lot of variations. So the more homogeneity you can keep in your infrastructure, the easier it is to optimize, the better supply chain you have, and the better efficiencies you get up the stack in terms of operating and understanding the servers.
And so at Facebook each year we only have five types of servers. There is Web, which is our main compute workhorse. Database, which over the years has evolved from purely a disk thing to entirely flash. Hadoop is our main data warehouse for data processing, and that's a lot of compute and a lot of disk.
Photos is all about the lowest dollars per gig. We certainly store a lot of media at Facebook, so optimizing for dollars per gig is really important. And then that last type is the news feed rack, the news feed service we talked about. Most engineers would like a lot of memory and a lot of compute, and that's what that rack gives you.
So the advantages of five server types, and of really constraining it to that, are pretty classic. You get volume pricing. If you are putting in a large order, you can really work deeply with your suppliers in a way that's beneficial to their supply chains and very predictable, and they can pass on savings to you, which is important.
It's also important for repurposing. If we have five services that all have very large projections in terms of how well or when something could launch, and they're all using the same type of server, then we can be very flexible in reallocating servers from one service to another. And what that means is that we don't have servers or infrastructure that just lies fallow waiting for products to launch. You guys do mutual funds; it's like a mutual fund. It's all dollars, some are up and some are down, and you can really manage it well.
The other key advantage is easier operations. In a typical data center facility you might have a data center tech to server ratio of about one to 400 or 450. In our facilities it's somewhere between 1 to 15,000 and 1 to 20,000, because all of our servers are the same. They are all optimized for serviceability, and we work hard to make that easy. That also translates into operations and software benefits, just efficiency up and down for all of the consumers of the devices.
Now some of the drawbacks, and these drawbacks are intrinsic to any hardware purchase: as soon as you allocate hardware, the hardware lands and lives for three or four years. However, the software needs change over time. So at any point in time, your hardware and your software don't fit together very well.