Edited transcript of Facebook (FB) at Credit Suisse 2014 Annual Technology Conference
Facebook’s Vice President of Infrastructure, Jason Taylor presents at Credit Suisse 2014 Annual Technology Conference on December 02, 2014, 12:30 PM ET. Following are the webcast audio and the associated transcript of the event…
Listen to the Webcast Audio MP3: Facebook (FB) at Credit Suisse 2014 Annual Technology Conference – Webcast Audio
Stephen Ju – Analyst, Credit Suisse
All right, I think we’re going to go ahead and get started. Stephen Ju from the Credit Suisse Internet Equity Research team joined by Jason Taylor, who heads the infrastructure development effort at Facebook.
So without further ado, take it away.
Jason Taylor – VP, Infrastructure, Facebook
Great. So my name is Jason Taylor and I run a group called Infrastructure Foundation at Facebook.
We’re responsible for server design, server supply chain, overall capacity management. So capacity engineering, performance reviews, things like that. And then also the long-term infrastructure plan.
So today, I’m going to kind of walk through a little bit of our infrastructure and talk about a few efficiency programs that we’re excited about and look towards the future of efficiency at large-scale computing.
So Facebook is large. 82% of our monthly active users are outside the United States. We are a global deployment. We have international data centers, one in Luleå, Sweden, and several in the United States.
So 1.35 billion people connect with us monthly, 1.2 billion on mobile, and a kind of stunning 930 million photos are uploaded to the site every day. So it's a lot of media, a lot of content that's distributed on Facebook.
6 billion likes, 12 billion messages per day. It’s a very active site, very dynamic and we built an infrastructure to accommodate that.
Now for the last five years, efficiency has really been a top priority at the company, and initially I would say that it was about necessity. We were facing a huge uptick in adoption and usage of Facebook, and efficiency has always been core just to be able to scale. And as we reached a large scale, it became necessary for long-term financial viability and for our ability to build platforms that scale well.
Now from a cost perspective, efficiency really breaks down into three areas. For data centers, heat management is one of the most important things that we do in terms of core efficiency. In a poorly designed facility, a facility that doesn't concentrate on heat very much, you could easily pay an additional 50% to 90% on your electricity bill for every watt that you deliver to a server.
Now at Facebook, because we've designed both our own servers and our own data centers, that heat tax is only 7%, which means that we are using cold air from the outside; we're not chilling air at all. We're passing it across the servers, mixing it in a hot aisle and then evacuating it out the other side of the building. So in terms of raw thermal efficiency, our data centers are second to none.
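To put the heat-tax numbers in concrete terms, here is a minimal sketch of the overhead arithmetic (essentially a Power Usage Effectiveness comparison). The function name and the 1 MW load are illustrative assumptions; only the 7% and 50% to 90% overhead figures come from the talk.

```python
def facility_power(server_watts, overhead_fraction):
    """Total facility draw: IT load plus cooling/power-delivery overhead."""
    return server_watts * (1.0 + overhead_fraction)

# A poorly designed facility can add 50-90% overhead per watt delivered to a server.
legacy = facility_power(1_000_000, 0.90)      # 1 MW of servers, 90% heat tax
# A free-air-cooled design keeps the heat tax near 7% (a PUE of roughly 1.07).
efficient = facility_power(1_000_000, 0.07)

print(legacy - efficient)  # watts saved per megawatt of IT load
```

The difference, several hundred kilowatts per megawatt of servers, is why heat management dominates data center cost efficiency.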
Now with servers, we pride ourselves on having a vanity-free design, and we really focus on supply chain optimization. In 2011 we released our first data center and our first set of servers. We also started the Open Compute Project, which I'm sure many of you are familiar with, where we give away the designs of our servers and are very open about how we design, what our approach is, and how we think about efficiency at the server level.
The other main efficiency win really comes from software. Horizontal wins like HHVM or HPHP, and wins in cache, database, and web are all absolutely critical to continuing to deliver really efficient infrastructure.
So, during a peak time, when one of our front-end clusters is really running hot, we can run for 10 hours of the day at about 90% to 93% server utilization. So we work a tremendous amount on making sure that not only are the individual servers and the software optimized, but also that the whole data center is optimized to provide content.
We like blue, so all of our servers have blue LEDs. What you see here are the fronts of the servers, and that enclosed space is hot aisle containment. Cold air comes in from the ceiling, it's sucked through the servers, and inside that hot aisle containment the temperature can reach up to 100 degrees. That hot air is then evacuated out of the building, or potentially mixed back in during winter.
So thermally it's a very efficient system, and the other thing you'll notice is that all of our servers look the same. That's because we really work hard on having a very homogeneous footprint, so that you get good wins in terms of serviceability, maintenance, drivers, everything else.
Now we've also been very open about all of our efficiency wins. Not only have we publicly released data center designs and server designs, but we've also released most of the core software that powers Facebook. HHVM is our core PHP web server. It is ballpark five to six times more efficient than a traditional Apache stack.
Flashcache is a caching layer we use on databases that trades off flash caching against access to slower hard drives. Presto is one of our data-processing, data-warehouse pieces of software. RocksDB is a recent release, along with Proxygen, Thrift, and Folly, a general-purpose library we use. In all cases, as soon as we're able to support open-sourcing a project, we try to get it out there. And the reason for this is that we really believe the entire industry can benefit from the efficiency work that we do, and that we can benefit from the industry feeding back and contributing new ideas and designs.
And fundamentally, our company is going to win or lose based on our product, not on our infrastructure, and cost-efficiency wins on infrastructure are something we'd like the entire industry to benefit from.
So, in terms of our architecture, we keep it pretty simple.
We have front-end clusters, which are synonymous with network clusters. These are large, about 12,000 servers per cluster, and this is the stamp of capacity that we push out in order to serve hot requests from users.
Service clusters contain many of our dedicated services: search, photos, messages and others. And then our back-end clusters are all built and optimized for database storage, so we think a lot about the power redundancy in those clusters.
So to take one of our services, it's useful to think through a little bit of how one of these large services works. If you are on Facebook and you're viewing the main feed (on desktop it's the center column, on mobile it's the main experience), all of that is the recent activity of all of your friends on Facebook. That data is kept in an index called the news feed rack. All of the recent activity for the last few days of all people using Facebook is kept on each one of these racks, and the design is a leaf-aggregator design. The leaves contain all the data, all the storage of recent activity, all kept in RAM, and the aggregator is the thing that runs the ranking algorithm and consolidates all of the information to respond to a request.
So a web hit comes in and says, I need some stories. It goes to a news feed rack, picks any one of the aggregators, and says, give me some stories. The aggregator then blasts a query out in parallel to the other 40 servers in the rack, gathers that subset of data, ranks it based on your interests, and then sends it off to be displayed.
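The flow just described is a classic scatter-gather query. The sketch below is an illustrative toy under assumed data shapes (`author`/`score` fields and a sequential fan-out), not Facebook's actual code:

```python
def query_leaf(leaf, friends, limit):
    """Each leaf server returns the recent stories it holds for the user's friends."""
    return [story for story in leaf if story["author"] in friends][:limit]

def aggregate(leaves, friends, rank_key, limit=10):
    """Scatter the query to every leaf in the rack, then rank the merged candidates."""
    candidates = []
    for leaf in leaves:  # in the real rack this fan-out runs in parallel across ~40 servers
        candidates.extend(query_leaf(leaf, friends, limit))
    return sorted(candidates, key=rank_key, reverse=True)[:limit]

# Toy data: three leaves, each holding a slice of recent activity in RAM.
leaves = [
    [{"author": "alice", "score": 3}],
    [{"author": "bob", "score": 5}],
    [{"author": "carol", "score": 1}],  # not a friend, filtered out at the leaf
]
top = aggregate(leaves, friends={"alice", "bob"}, rank_key=lambda s: s["score"])
print([s["author"] for s in top])  # highest-ranked story first
```

Keeping the leaves in RAM and ranking only at the aggregator is what lets one rack answer a request in a single parallel round trip.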
Now if you're an engineer at Facebook, you can have any server you like, as long as it's one of these five servers. We don't allow many variations. One of our main teams at Facebook infra is capacity engineering. They wear black T-shirts that say "no" on the front. They are very good at saying no to all kinds of engineering requests, because fundamentally, software is far more flexible than hardware, and you pay for it: your hardware becomes inefficient when you have a lot of variations. So the more homogeneity you can keep in your infrastructure, the easier it is to optimize, the better supply chain you have, and the better your efficiencies up the stack in terms of operating and understanding the servers.
And so at Facebook, each year we only have five types of servers. There is web, which is our main compute workhorse. Database, which over the years has evolved from a purely disk-based design to entirely flash. Hadoop is our main data warehouse for data processing, and that's a lot of compute and a lot of disk.
Photos is all about the lowest dollars per gig. We certainly store a lot of media at Facebook, so optimizing for dollars per gig is really important. And then that last rack type is the news feed service we talked about. Most engineers would like a lot of memory and a lot of compute, and that's what that rack gives you.
So the advantages of five server types, and of really constraining it to that, are pretty classic. You get volume pricing. If you are putting in a large order, you can work deeply with your suppliers in a way that's beneficial to their supply chains and very predictable, and they can pass savings on to you, which is important.
It's also important for repurposing. If we have several services that all have very large projections in terms of how well or when something could launch, and they're all using the same type of server, then we can be very flexible in reallocating servers from one service to another. And what that means is that we don't have servers or infrastructure that just lies fallow waiting for products to launch. You guys do mutual funds; it's like a mutual fund. It's all dollars, some are up and some are down, and you can really manage it well.
The other key advantage is easier operations. In a typical data center facility you might have a data center tech-to-server ratio of about one to 400 or 450. In our facilities it's somewhere between one to 15,000 and one to 20,000, because all of our servers are the same. They are all optimized for serviceability, and we work hard to make that easy. That also translates into operations and software benefits, just efficiency up and down for all of the consumers of the devices.
Now some of the drawbacks, and these are drawbacks intrinsic to any hardware: as soon as you allocate hardware, the hardware lands and lives for three or four years. However, the software's needs change over time, and so at any point in time your hardware and your software don't fit very well.
Now because at Facebook the rack is the computer, we're really thinking not just about how a piece of software fits an individual server; because we allocate in racks, the question is how the software lives on the whole rack.
And what I want to talk about at this point is an idea that we've mentioned before, along with a few fun results.
A disaggregated rack. Here we're shooting for better component-to-service fit over time, and we're also looking to extend the useful life of servers. Now if you think of that rack of news feed servers, ignore the fact that it's a bunch of servers, and ask what it really is, you've got a bunch of compute, a bunch of RAM, and some flash.
Now exactly where the compute is and where the RAM is shouldn't matter, as long as it's all within the rack and you've got a nice healthy network. So if you break this disaggregated rack idea down into a few major components, you've got processors, or compute servers; you've got RAM servers; you might have a storage server; and you might have flash. Rather than put all of that in one server, where at any time you are going to hit a weakest link, not enough compute or not enough RAM, you break it all up, put it on a high-bandwidth backplane, and then switch resources in and out as the service needs them.
And so at any time you're not wasting resources, your server-to-service fit can be better, both across services and over time, and you can also accommodate a longer hardware refresh.
So, take a Type 6 server, which is a news feed server: the vertical axis is CPU and the horizontal axis is RAM. The news feed server fits CPU and RAM very well, and it should, because we designed the server for news feed.
However, if you go to another service, maybe search or one of the other index services, they might need more RAM than CPU, and what that means is that they're simply never using that CPU resource.
The other thing that can happen is that at the beginning of a service's life, maybe during year one, the fit is perfect, but come year two they need more RAM, or they need more flash. Being able to allocate that hardware just in time, along with the needs of the service, provides a huge benefit, because otherwise you're buying more servers when all you really needed was more RAM. You can't open the cases on 10,000 servers and upgrade the RAM; that doesn't work. So being able to add in sleds of RAM is a huge benefit.
The third benefit is that you can keep the hardware for as long as it will physically last. Many times when you're running computers at scale, you're decommissioning them based on the one critical resource that's no longer good enough, and that means you're throwing away other resources that are perfectly fine.
Compute really doesn't get old. RAM is a solid-state device; a pure RAM device can operate essentially forever. Disk can wear out over time, and flash, depending on your write volume, might wear out over time, but it can actually live for a while.
And so, if you think of a disaggregated rack for, say, Graph Search, rather than have computers that have all three resources, you'd have compute servers, things that are mostly compute with some RAM, which would be a Type 1 server for us; a flash sled, which would be anywhere between 30 and 256 terabytes of flash on a single sled; a RAM sled, which might have 256 or 512 gigs of RAM; and then storage.
Now in year one, the ratios that we pick might be perfect. But then in year two the service might have had an efficiency win, or the index might grow over time, and the best thing to do would be to just give that service more flash. So rather than allocate an entire separate rack, thereby doubling the cost, you allocate just that one resource; you just slam in another flash sled. That kind of flexibility really leads to some pretty nice efficiency wins.
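The economics of "slam in another flash sled" can be illustrated with toy numbers. The unit costs below are invented for the example; only the structure of the comparison reflects the talk.

```python
# Hypothetical unit costs in arbitrary dollars; only the ratios matter here.
UNIT_COST = {"compute_server": 40, "ram_sled": 8, "flash_sled": 12}

def rack_cost(inventory):
    """Total cost of a disaggregated rack described as {part: count}."""
    return sum(UNIT_COST[part] * count for part, count in inventory.items())

year_one = {"compute_server": 1, "ram_sled": 1, "flash_sled": 1}

# Year two: the index outgrew the flash. Two ways to respond:
double_the_rack = rack_cost(year_one) * 2                        # buy everything again
add_a_flash_sled = rack_cost(year_one) + UNIT_COST["flash_sled"]  # buy only what's short

print(double_the_rack, add_a_flash_sled)
```

Buying only the constrained resource costs a fraction of replicating the whole rack, which is the core of the disaggregation argument.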
So we still maintain all of our core strengths, our volume pricing, custom configuration, all of those sorts of things, but what this really allows us to do is much smarter technology refreshes. Because we're thinking at the rack level, we can evolve the hardware with the service.
Now last year I talked to you all about the approximate TCO wins: over a three-year to six-year period, we were looking at between 12% and 20% OpEx savings on the conservative side, and between 14% and 30% with a more aggressive approach. Keep in mind nothing has fundamentally changed about the computers that we're allocating. We're just allocating them in a different way and helping our software teams to be a little bit more flexible in how they bring on resources, and this works for pretty much anybody at scale.
Now, at the time we talked last year, we were working on this project, and we've now landed it for one of our services. We've got 20 to 30 services, maybe 40 major ones, but for the one that we started with, we were actually able to realize a 40% savings in the total cost of operating the equipment.
So by doing nothing more than bringing the necessary resources online at the right time, and by customizing the rack in a scalable, flexible way that maintains all of our supply chain wins, we were able to realize a 40% reduction in cost on this one service, and this is an approach and a technique that we think pretty much anyone can use.
Now, one of our ideas with the disaggregated rack is to be able to adopt new types of resources, and if you look at the last 20 years, the types of things that are in servers at scale are pretty much the same.
In the photo below, Amir Michael is holding one of our first Facebook-built servers. From an architectural perspective, that server is almost identical to a 386 in a tower case from 20 years ago.
There have been a few new technologies: math coprocessors, two processors per server, multicore; all of those are good. The only game changers in the last 20 years have really been GPUs, which are great at vector math, and flash memory, which has been a phenomenal win over the last four or five years.
What we're looking forward to is really major advancements in the network. So 100-gig NICs, 400 gigs between switches: those are all perfectly reasonable now given the state of technology. All of the flash providers have been pushing for higher and higher IOPS and heavier-duty flash. We actually feel that in the at-scale environment, we want to look in the other direction and go for lower IOPS, because you really don't need that many, and we can be much more careful about how we use flash. What that means is that we can get much denser flash sleds and realize nice benefits there.
So we think there is always going to be a nice market for high-performance flash, but for the data center world, a lot of the flash interest is going to shift towards lower and lower endurance flash.
Last year we did a talk where I asked the industry: please make the worst flash possible. We can work with, not necessarily low-quality flash, but lower-endurance flash: TLC or even beyond in terms of density.
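A back-of-the-envelope endurance model shows why low-endurance flash is enough at scale. The parameter values and the write-amplification factor below are assumptions for illustration, not figures from the talk.

```python
def flash_lifetime_years(capacity_tb, pe_cycles, writes_tb_per_day, write_amp=2.0):
    """Years until a sled exhausts its program/erase budget (rough model)."""
    total_writable_tb = capacity_tb * pe_cycles / write_amp
    return total_writable_tb / writes_tb_per_day / 365.0

# Even cheap TLC-class flash at ~1,000 P/E cycles lasts decades on a 30 TB sled
# if the service only writes about 1 TB per day to it.
print(round(flash_lifetime_years(30, 1_000, 1), 1))
```

The takeaway is that endurance scales with capacity: at data-center sled sizes, even "the worst flash possible" outlives the hardware refresh cycle for read-heavy workloads.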
Now there are also a number of RAM alternatives we think are coming up that are very interesting. Phase-change memory, if it works out, is going to be pretty good, and resistive memory is roughly on par with it in terms of technology. And then there's also cold flash, or WORM solid-state storage. Not all storage needs to be on spinning media; it's perfectly reasonable for very immutable data to be put on solid-state devices.
So if you look at our eye chart for technologies over the next few years, it is an eye chart. There is a lot of detail there, but if we simplify a bit, give me one more second to take a photo, sorry, we’re down to some pretty basic evolution.
So from, say, 2009 to 2020, we think that compute in at-scale data centers is going to be much more focused on single-processor system-on-a-chip designs; the whole efficiency ecosystem that's developing there is really strong. Yes, we will still have RAM, but we think phase-change memory and resistive memory will also be strong players.
WORM, so solid-state storage: this is a technology that might evolve over the next, say, three to five years, but permanent immutable storage in solid state is completely reasonable, and we think the bit densities can get to a point where it is superior in a TCO sense compared to hard drives.
Optical data storage has a great future just in terms of the ability to densify media over time. We've seen a number of nice announcements in terms of archival discs, taking what look like Blu-ray discs from 100 gigs to 500 gigs or even up to a terabyte. And then 100-gig fiber to the servers: we think we're only a couple of years away from that.
So these are the technologies that we're interested in, the technologies that we are excited about.
And with that, I would like to open it up for questions.
Question-and-Answer Session
Stephen Ju – Analyst, Credit Suisse
Yes, we probably have time for a couple of questions here, so we'll poll for questions from the audience.
Unidentified Audience: Great. Thank you. I appreciate it. There’s a lot of talk about Mesos and its implications for computing and the software infrastructure stack. What’s your perspective on how Mesos gets adopted and the implications of the change there?
Jason Taylor: Mesos is the virtualization thing; we don't do that. All of our infrastructure is based on bare-metal scaling. Virtualization in the cloud is excellent when you're managing a heavily idle workload. If you have 20 idle servers and you need to compact them onto two servers, that's obviously a win: you didn't buy 18 idle servers. But when you're building for scale and building for throughput, the best efficiency wins come from, one, really balancing the utilization of the gear, which is a lot of what the disaggregation work is about.
But then also really looking deeply at how the software and hardware work together, and looking for big performance wins there. The first, I would say, 3x of the HPHP wins really just came from going from an interpreted language to a compiled language, and that was a very solid win. Most of the wins since then have come both from optimizations in how the code works and from how it works on the specific hardware we run. As soon as you start virtualizing, as soon as you start putting layers of abstraction in there, you decouple the engineers, who are fantastic; you essentially don't allow them to make those kinds of optimizations any more. And so for Facebook, because we have such a large workload, we really focus on using each piece of hardware as much as we can.
Unidentified Audience: To what extent are the efficiencies that you discussed this morning going to allow Facebook to have more capital efficient growth as the company continues to grow its revenues over the three to five years?
Jason Taylor: I think fundamentally efficiency has been a top priority for us for a long time, and I think the most important aspect of efficiency to us is the coherence it brings to all of our software engineers, in terms of unifying them in thinking about how the software and hardware work together. So if you want to talk actual capital spend, you would have to talk with Deborah. But in terms of efficiency, it's core to the way that we work.
Unidentified Audience: [Inaudible] And the expansion of a 100-gig fiber within that footprint.
Jason Taylor: Sure. So I can say that when I've talked about 100-gig fiber and 400-gig fiber, all of that is within the data center. What has happened is that the telecommunications industry has delivered fantastic technology over the last bunch of years; you talked about dark fiber, running major facility to major facility over hundreds of miles. When you look at that core technology, it is ready to be adopted into the data center space, and that kind of bandwidth within a data center is very possible. There are a number of great technologies developing right now in terms of silicon photonics. There are several companies working on essentially fully integrated chip-and-optics packages, which deliver 100-gig performance at really very low cost.
So the biggest thing that happened two or three years ago was flash in the data center. The thing that's happening right now is that the amount of network bandwidth you can buy for a reasonable price is climbing dramatically.
When I joined Facebook, we had 1-gig NICs everywhere, and 1-gig was sort of the standard. In 2011, we shifted to 10-gig. Pretty soon, in the next couple of years, we'll have 25-gig to the servers; I would say within three years we'll have 100-gig to the servers. So over around a six-year period, we're going up 100x in the amount of bandwidth that's available. That's transformative in both how the data center operates and how you write your services. So I think the network, and improvements in networking, are going to be the largest driver of changes in how large-scale internet companies work.
Stephen Ju – Analyst, Credit Suisse
And I think, with that, we’re actually out of time. So thank you very much.
Jason Taylor: Thank you.