Last month, ExtremeTech revealed to you the true scale of internet porn. At any one time, streaming adult videos probably use around 30% of the internet’s total bandwidth, which equates to around 6 terabytes of porn being consumed every second. But what about the other 70%? Netflix, YouTube, and other non-adult video sites are huge bandwidth hogs, possibly accounting for as much as 40% of internet traffic. Digital file lockers, such as Rapidshare and Megaupload, account for around 10% of traffic worldwide. Web surfing and email (and spam!) are another 15%. And then there’s cloud computing.
Today, the vast majority of web services and sites are hosted in the cloud. By this I mean that, instead of companies (such as Ziff Davis/ExtremeTech) managing their own hardware, third-party cloud storage and computing services are used. Amazon Web Services (AWS), Microsoft Azure, and Google are three prominent examples of huge cloud clusters, but there are hundreds of smaller operations that range in size from a whole data center down to a few racks.
The power of the cloud lies in the fact that it can be shoehorned into tasks as disparate as cloud-based supercomputing, webmail, and simple document storage. On a single cloud cluster, Google can host and serve petabytes of YouTube videos and store all of your email and documents. Of all the facets of the cloud, though, today we’re going to focus on cloud storage.
A Microsoft data center
While storage might not be as sexy as terabytes of RAM and thousands of CPU cores, it is the most reliable way of measuring the size of the cloud, especially when we factor in bandwidth usage. From the total amount of storage we can also work out the cost of cloud storage — and from there, we can finally work out why the likes of Google, Microsoft, and Dropbox are falling over themselves to provide cloud storage services.
As with the porn story, we’ll start with some theoretical numbers, and then move on to some real-world figures (and hardware) from Backblaze, a cloud backup provider.
For the most part, real numbers from the big companies, such as Google, Facebook, Amazon, and Microsoft, are few and far between. If you scour the web, though, some rough ballpark figures emerge:
- Facebook, in its IPO filing, said it stores over 100 petabytes (PB) of media (photos and videos). It’s not unrealistic to say that Facebook probably has a total storage capacity well beyond that, once you factor in backups and other data (status updates, likes, and so on), possibly in the 300PB range.
- Microsoft recently admitted that Hotmail stores over 100 petabytes, and that SkyDrive, with “17 million customers,” stores 10PB of data. Like Facebook, Microsoft’s total capacity, once we factor in the rest of Azure and its web properties, is probably well over 300 petabytes.
- Megaupload is relatively tiny in comparison, apparently storing just 25 petabytes.
- Amazon, rather than giving us a nice, easy number of petabytes, instead announces the total number of objects stored by its S3 cloud storage service. As of April 2012, Amazon S3 stored 905 billion objects. If we assume an average size of 100KB, that’s around 90 petabytes; if the average size is 1MB, that’s 900 petabytes — almost an exabyte!
- Dropbox, a year ago, stored “10+ petabytes” of data. It had 25 million users then, and 100 million users today, so all things being equal the company now stores around 40PB of data.
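The Amazon estimate above is simple multiplication, and it is worth seeing how sensitive it is to the assumed object size. The sketch below is only a back-of-the-envelope check: the two average sizes are our guesses, not figures from Amazon, and the function name is made up for illustration.

```python
# Rough estimate of S3's stored data from its published object count.
# The average object sizes are assumptions, not Amazon figures.
S3_OBJECTS = 905e9      # objects stored as of April 2012
BYTES_PER_PB = 1e15     # decimal petabyte

def estimated_petabytes(object_count, avg_object_bytes):
    """Total storage in petabytes for a given average object size."""
    return object_count * avg_object_bytes / BYTES_PER_PB

for label, size_bytes in [("100KB", 100e3), ("1MB", 1e6)]:
    print(f"average {label}: {estimated_petabytes(S3_OBJECTS, size_bytes):.1f} PB")
```

A tenfold change in the assumed object size moves the answer tenfold, which is why the range runs from roughly 90PB to almost an exabyte.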
To put these figures into perspective, an average computer probably has a 500GB or 1TB hard drive, and a petabyte is 1024TB. At the very least, then, Microsoft and Facebook data centers each play host to more than 100,000 hard drives. Without building custom hardware, you can squeeze 48 drives into a 4U enclosure. After accounting for networking gear, that means you’re probably looking at around 400 hard drives per 40U rack, or 250 racks, each of which occupies around one square meter of floor space. This might sound like a lot, but when you consider that Google, Amazon, Facebook, and Microsoft regularly roll out data centers with floor plans of over 30,000 square meters (300,000+ square feet), it’s really not that much. In the grand scheme of things, a lot more space is dedicated to servers (i.e. CPUs) and networking gear.
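For anyone who wants to rerun the drive and rack arithmetic with different inputs, here is a minimal sketch. The 1TB drive size and the 400-drives-per-rack figure are the same rough assumptions used above, and the function names are ours.

```python
import math

TB_PER_PB = 1024  # binary convention, as in the text

def drives_needed(petabytes, drive_tb=1):
    """Number of drives needed to hold the given capacity, with no redundancy."""
    return math.ceil(petabytes * TB_PER_PB / drive_tb)

def racks_needed(drives, drives_per_rack=400):
    """Number of racks, assuming ~400 drives fit in a 40U rack."""
    return math.ceil(drives / drives_per_rack)

drives = drives_needed(100)      # 100PB on 1TB drives
racks = racks_needed(drives)
print(drives, "drives in", racks, "racks")
```

With 1024TB per petabyte this gives 102,400 drives in 256 racks, which rounds to the figures quoted above.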
Bandwidth-wise, we have even less data from the big boys. We know that, as of last year, one million files were being saved to Dropbox every five minutes (200,000 per minute), so today, with four times as many users, that’s around 800,000 files per minute. Amazon S3, which is significantly larger than Dropbox, handles “650,000 requests per second.”
If we assume that the average file stored on Dropbox is 500KB (a mix of photos, videos, and documents) then Dropbox receives a total of 400,000 megabytes (0.4TB) per minute, or 6.7GB per second (around 54Gbps). We don’t have any data on how much data Dropbox sends per minute (i.e. people downloading files from their Dropbox), but it’s probably in the region of 10 to 20Gbps.
Amazon S3, which is mainly used to store static files for websites (images, style sheets, videos), probably has a lower average file size than Dropbox. If we assume an average size of 100KB per file, then 650,000 requests per second comes to a grand total of around 62 gigabytes of data transferred per second, or roughly 500Gbps. This is in the same ballpark as the 800Gbps figure that we estimated for a large porn site, which equates to around 2% of total internet traffic. Amazon is pretty darn big!
Facebook and Microsoft, with between 100 and 300PB of storage each, probably fall somewhere between Dropbox and Amazon in terms of bandwidth usage, maybe 200Gbps apiece.
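Both bandwidth estimates boil down to items per second times average size times eight bits. The sketch below makes that explicit; the average file sizes are our assumptions, and the function name is purely illustrative.

```python
def throughput_gbps(items_per_second, avg_item_bytes):
    """Sustained throughput in gigabits per second (decimal units)."""
    return items_per_second * avg_item_bytes * 8 / 1e9

# Dropbox: ~800,000 files per minute at an assumed 500KB each
dropbox_gbps = throughput_gbps(800_000 / 60, 500e3)

# Amazon S3: 650,000 requests per second at an assumed 100KB each
s3_gbps = throughput_gbps(650_000, 100e3)

print(f"Dropbox inbound: {dropbox_gbps:.1f} Gbps")
print(f"Amazon S3: {s3_gbps:.0f} Gbps")
```

In decimal units S3 works out to 520Gbps; the slightly lower figure quoted above comes from dividing by 1,024 rather than 1,000 at each step. Either way, the result is only as good as the guessed file size.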
But enough theory! Let’s discuss some real-world numbers and real-world hardware!
Try as we might, every cloud storage provider rebuffed our requests for information on their hardware and software setups — except Backblaze, one of the few remaining unlimited cloud backup providers. Backblaze is relatively small, but the real numbers will still astound you.
First things first: Backblaze stores a total of 32 petabytes of user data, mostly on 3TB hard drives. For redundancy, groups of 15 drives are banded together using RAID 6, which writes two parity blocks for every stripe of data, spread across the 15 disks. RAID 6 can rebuild data from the parity blocks as long as there are no more than two concurrent disk failures in a group of 15. And yes, replacing faulty drives is by far the most important task of whoever is on call at the Backblaze data center, so that redundancy can be restored as soon as possible.
Backblaze adds 3 petabytes to its storage cluster every month, in the form of 135-terabyte “pods” — custom-designed 4U rack mounted computer cases that accommodate 45 hard drives each (pictured below). 3 petabytes per month means that Backblaze is adding roughly 22 pods per month — which equates to almost two standard 48U racks.
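The pod figures can be reproduced in a few lines. One caveat: we are assuming each 45-drive pod holds three of the 15-drive RAID 6 groups, which Backblaze did not spell out for us, so treat the usable-capacity number as an estimate.

```python
DRIVE_TB = 3
DRIVES_PER_POD = 45
RAID_GROUP_SIZE = 15
PARITY_DRIVES_PER_GROUP = 2   # RAID 6 gives up two drives' worth of space per group
POD_HEIGHT_U = 4

raw_tb_per_pod = DRIVE_TB * DRIVES_PER_POD

# Assumption: three 15-drive RAID 6 groups per pod
groups_per_pod = DRIVES_PER_POD // RAID_GROUP_SIZE
usable_tb_per_pod = (
    groups_per_pod * (RAID_GROUP_SIZE - PARITY_DRIVES_PER_GROUP) * DRIVE_TB
)

monthly_growth_tb = 3 * 1000  # 3PB per month, decimal units
pods_per_month = monthly_growth_tb / raw_tb_per_pod
rack_units_per_month = round(pods_per_month) * POD_HEIGHT_U

print(raw_tb_per_pod, "TB raw,", usable_tb_per_pod, "TB usable per pod")
print(f"{pods_per_month:.1f} pods per month, about {rack_units_per_month}U of rack space")
```

Under that assumption, each 135TB pod yields 117TB after parity, and a month of growth fills about 88U, just shy of two 48U racks.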
While you might think that the pods themselves are a major operating cost (they cost $7,384 each), hardware is really the least of Backblaze’s concerns; power, data center space, and bandwidth are far more expensive over time.
At $0.20 per kilowatt hour (kWh), each pod costs around $100 per month to power. Data center power is incredibly expensive, which is why many web companies are building data centers in locations with cheap power, such as Oregon. Backblaze currently hosts its servers in a third-party data center, which means it has to pay for the floorspace occupied by its cabinets; again, this works out at around $100 per pod per month (~$1,000 per rack). At the time of writing, Backblaze has a total of around 300 pods stored in 40 rack cabinets, so around $30,000 is spent per month on electricity, and around $40,000 is spent on floor space.
And then, of course, there’s bandwidth. At any one time, Backblaze’s routers have 12Gbps of data inbound (about 1.5 gigabytes per second), and 500Mbps (62.5MB/s) outbound. The disparity is because Backblaze is a backup service, but even so, that 500Mbps figure would suggest that hard drives fail (or data is lost) much more often than I would expect. At the moment, Backblaze receives around 100 million files from its users per day, or 36.5 billion files per year, and in 2011 it restored a total of 600 million files. Unfortunately we don’t know exactly how many restored computers that equates to, but at a conservative estimate of 5,000 files per computer (excluding Windows and program files) that comes to 120,000 very relieved backuppers.
Back in its salad days, Backblaze paid $30 per Mbps per month — now, thankfully, that figure’s down to $2/Mbps/month, which comes to a total of around $25,000 per month. Backblaze does have cheaper offers on the table from other ISPs, and again the company would save considerable money if it built its own data center.
All told, then, the total operating cost (before employees!) of a 40-petabyte cloud storage cluster is $100,000 per month, split fairly evenly between floor space, bandwidth, and power consumption.
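Here is the monthly bill tallied up from the figures above. Note one assumption on our part: we bill bandwidth on inbound plus outbound traffic combined, which reproduces the $25,000 figure but may not be how Backblaze’s ISP actually charges.

```python
PODS = 300
RACKS = 40
POWER_PER_POD = 100        # dollars per month
FLOOR_PER_RACK = 1_000     # dollars per month
BANDWIDTH_MBPS = 12_000 + 500   # assumption: inbound + outbound both billed
PRICE_PER_MBPS = 2         # dollars per Mbps per month

costs = {
    "power": PODS * POWER_PER_POD,
    "floor space": RACKS * FLOOR_PER_RACK,
    "bandwidth": BANDWIDTH_MBPS * PRICE_PER_MBPS,
}
total = sum(costs.values())

for item, dollars in costs.items():
    print(f"{item}: ${dollars:,} ({dollars / total:.0%})")
print(f"total: ${total:,} per month")
```

The sum comes to $95,000, which we round to $100,000, with no single item taking much more than 40% of the total.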
Why cloud storage is a cut-throat business
You may have noticed that, in recent months, Microsoft, Google, Dropbox, and others are fervently fighting to be your cloud storage provider of choice. At first, I couldn’t work this out — but now, after looking at the numbers, it becomes a little clearer.
Backblaze charges $5 per month for unlimited backup — so with operational costs of $100k per month (before employees), it needs just 20,000 users to break even. Backblaze didn’t provide us with its customer count, but its CEO did tell us that the company is profitable (Backblaze has 13 employees, incidentally, and it operates as a startup: i.e. everyone gets a tiny salary).
Another, more prominent example is Dropbox, which charges $100 per year for 50GB of storage. Unlike Backblaze, which runs its own hardware, Dropbox uses Amazon S3 for storage, which charges $0.055 per gigabyte per month for clients using more than 5 petabytes (Dropbox stores around 40PB). If the average Dropbox customer uses 25GB of their 50GB allotment, Dropbox pays Amazon $1.38 per month for that customer, and yet the service costs the customer between $8 and $10 per month.
For cloud storage providers who operate their own data centers, such as Microsoft SkyDrive, Google Drive, and Amazon S3, the profit margins will be significantly fatter than Dropbox’s.
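To see how much room that leaves, here is the per-customer arithmetic as a rough sketch. It assumes the $100 price is an annual fee and that the customer stores 25GB, and it ignores S3’s request and data transfer charges as well as the cost of carrying free users, so the real margin will be lower.

```python
PRICE_PER_YEAR = 100.0          # assumption: the $100 plan is billed annually
S3_PRICE_PER_GB_MONTH = 0.055   # S3 rate for clients storing more than 5PB
AVG_GB_STORED = 25              # assumption: half of the 50GB allotment is used

revenue_per_month = PRICE_PER_YEAR / 12
storage_cost_per_month = AVG_GB_STORED * S3_PRICE_PER_GB_MONTH
gross_margin_per_month = revenue_per_month - storage_cost_per_month

print(f"revenue: ${revenue_per_month:.2f} per month")
print(f"storage cost: ${storage_cost_per_month:.2f} per month")
print(f"gross margin: ${gross_margin_per_month:.2f} per month")
```

On these assumptions a paying customer brings in roughly six times what their storage costs.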
The caveat, though, is that all of these services have many more free users than paid. Dropbox, with 100 million users, is by far the largest cloud storage provider — but the vast majority of those are freeloaders. The number of paid users is probably quite small, and thus the profit margins are probably quite slim too.
The more likely reason for such feverish competition in the cloud space is strategic dominance; for Google and Microsoft, cloud computing is where the next platform war will be pitched. With Windows 8, you will log in with your Live ID and all of your documents and settings will automatically sync with the cloud, allowing you to freely roam between devices (or access your files from any web browser). Chrome OS is entirely web- and cloud-based, each successive version of Android develops deeper ties to the cloud — and of course, with Google+, the company is trying to weave as many of its web properties together as possible.
Ultimately, these companies are vying for control of your files. If you use Google Docs/Drive for word processing, then you’re much more likely to choose Chrome OS or Android, both of which are strongly tied into Google’s cloud — but if you use SkyDrive and Office Web Apps, Windows 8 and Windows Phone 7 make much more sense. In this regard, I’m not entirely sure what the future holds for “standalone” cloud storage services, such as Dropbox. If Microsoft, Google, and Apple can build cloud backup and file sharing into the operating system, what purpose does Dropbox serve?