
neoTactics - Randy Bias - GoGrid
Some Rights Reserved
Cloud strategy & infrastructure

Does OpenStack Change the Cloud Game?

Tue, 07/20/2010 - 20:01

This week Rackspace Cloud, in conjunction with the NASA Nebula project, open sourced some of its Infrastructure-as-a-Service (IaaS) cloud software. This initiative, dubbed ‘OpenStack’, should have a dramatic impact on the current dynamics of building cloud computing infrastructure. Until now there have been two major camps: Amazon API- and architecture-compatible stacks, and VMware’s vCloud. Now there is a third option that could be not only a viable alternative to these two approaches but, more importantly, a fantastic option for service providers and telecommunications companies that face unique challenges.

Let’s dive in and I’ll explain.

Cloud Stack Evolution & ‘Camps’
Amazon Web Services (AWS) spawned a huge ecosystem of knock-offs, management systems, tools, and vendors. They include, but aren’t limited to:

  • AWS API compatible ‘cloud stacks’ including Eucalyptus, Open Nebula, and others
  • Cloud management systems for the AWS APIs and services such as RightScale and enStratus
  • Cloud services layered on top of AWS services such as Jungle Disk (S3), Heroku (S3, EBS, EC2), and more

Previously, I wouldn’t have called the AWS ecosystem a ‘camp’ per se, but if you read our most recent article on Google’s foray into cloud storage, you know it seems likely that Google will provide a 100% compatible version of S3 and EC2 this year. Imagine the impact of Google Compute & Storage with Amazon Web Services compatible APIs. The Google Storage API is already nearly 100% compatible with S3.

Together, as a block, Amazon and Google could create a de facto duopoly for infrastructure clouds, which isn’t good for anyone. We need competition and more than two major players.

Up against the Amazon camp is VMware. In my article on Amazon vs. VMware last year I highlighted how these two businesses were on a collision course. Nothing has changed and competition is mounting between them. The reason is that telcos and service providers are under increasing threat from Amazon and soon Google. They need viable solutions and VMware is attempting to provide a competitive ecosystem.

The VMware cloud initiative, vCloud, is designed to arm enterprises and service providers to be competitive, but has not quite delivered yet. VMware has had a number of problems providing a full cloud stack. The software, now in beta and codenamed ‘Redwood’, has had significant delays in getting to market. Their strategy for cloud infrastructure does not appear unified beyond delivering compute virtualization.

VMware, as a business, understands they need to make their customers competitive. They have made a number of strategic open source acquisitions such as SpringSource, RabbitMQ, and Redis. There are also murmurings that they have some special projects inside that are ‘up the stack’ from their virtualization offerings. In total this shows that VMware ‘gets it’: they want to create a competitive ecosystem. However, each of these is currently a point solution, and there is not yet a coherent story tying them together. Can VMware build a consistent story and strategy around these disparate pieces? Only time will tell…

Besides these two camps, there is a long tail of clouds running various frameworks vying to establish themselves, such as Cloud.com’s CloudStack, 3Tera, Hexagrid, Abiquo, OpenNebula, etc. John Treadway recently posted a roundup describing all of the various cloud stacks out there.

OpenStack is stepping into the ring as a viable third camp. In particular, the OpenStack Storage solution is a clear contender to Amazon S3 & Google Storage. Many service providers and telcos have struggled to find a viable solution using commodity hardware that was price competitive. Suddenly, there is a viable proven solution.

Yet this is only storage. How can it create an effective ‘third camp’ alternative to Amazon and VMware for an entire cloud?

Lock-in, Architecture, Standards and The Truth about Interoperability

Interoperability for infrastructure clouds is poorly understood. Most believe that the problem lies in the on-disk image format (e.g. VMDK vs. VHD vs. qcow) or in the ‘hypervisor’ (although people don’t really understand what this means). The truth is that lock-in has little or nothing to do with disk formats or the hypervisor. Most on-disk image formats are simply representations of block storage (i.e. disk drives). That means converting between a VMware VMDK and a Citrix XenServer/Hyper-V VHD is relatively trivial.
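As a rough illustration of how mechanical format conversion is, here is a minimal Python sketch that builds a command line for QEMU’s `qemu-img convert` tool, assuming `qemu-img` is installed. The extension-to-format table and the helper name `convert_command` are hypothetical choices for this example; `vpc` is qemu-img’s name for the VHD format, and treating `.img` as a raw image is only a common convention.

```python
import shlex
from pathlib import Path

# Hypothetical extension -> qemu-img format table for this example.
# "vpc" is qemu-img's format name for VHD (Hyper-V / XenServer).
QEMU_FORMATS = {
    ".vmdk": "vmdk",
    ".vhd": "vpc",
    ".qcow2": "qcow2",
    ".img": "raw",  # assumption: .img files are raw block images
}


def convert_command(src: str, dst: str) -> list[str]:
    """Build (but do not run) a qemu-img command that converts src to dst."""
    src_fmt = QEMU_FORMATS[Path(src).suffix.lower()]
    dst_fmt = QEMU_FORMATS[Path(dst).suffix.lower()]
    # -f names the input format, -O the output format.
    return ["qemu-img", "convert", "-f", src_fmt, "-O", dst_fmt, src, dst]


if __name__ == "__main__":
    cmd = convert_command("web01.vmdk", "web01.vhd")
    print(shlex.join(cmd))
    # To perform the conversion (requires qemu-img on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

The block copy itself is the easy part; as the rest of this post argues, the hard part is everything the image assumes about its surroundings.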

What about booting the converted disk image on a new hypervisor? Guess what: since most hypervisors now rely on hardware virtualization (HVM) [1] using Intel-VT/AMD-V, most will work with unmodified operating systems out of the box by default. No changes needed. The only downside is that the resulting performance is usually poor. Fixing that requires installing paravirtualization (PV) drivers in the converted image. What does that mean? After converting the image from one format to another, you simply install the PV drivers appropriate to the new hypervisor and the guest OS, a process that requires being methodical but is in no way technically challenging.
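To check whether a Linux host exposes the Intel-VT/AMD-V extensions that HVM guests depend on, you can look for the `vmx` (Intel) or `svm` (AMD) CPU flags in `/proc/cpuinfo`. Here is a minimal Python sketch; the function name is hypothetical, and a virtual machine may not expose these flags to its guest, so treat the result as a hint rather than a guarantee.

```python
from typing import Optional


def hardware_virt_support(cpuinfo_text: str) -> Optional[str]:
    """Return 'Intel VT-x', 'AMD-V', or None based on /proc/cpuinfo flags."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        # x86 Linux prints one "flags : fpu vme ... vmx ..." line per logical CPU.
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return None


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:  # Linux only
        print(hardware_virt_support(f.read()) or "no hardware virtualization flags")
```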

Where is the lock-in then? If it’s not the hypervisor, what makes moving from one cloud to another so difficult? Simply put, it’s architectural differences. Every cloud chooses to do storage and networking differently.

For example, if you wanted to move a virtual machine from GoGrid to Amazon, converting the GoGrid image to an AMI is not difficult. Unfortunately, GoGrid uses two networks, a ‘frontend’ and a ‘backend’, and your cloud storage system is reached via the backend network. Every Amazon virtual server has only a single network interface. If your application assumes a separate backend network, what happens when it moves to a cloud without one? Or vice versa? Similar architectural incompatibilities exist between Rackspace Cloud, Savvis, Terremark, Hosting.com, Joyent, and all of the others.
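One way to make the lock-in argument concrete is to write an application’s architectural assumptions down and check them against each target cloud. The Python sketch below is a deliberately simplified, hypothetical model: the class names, the network labels (`frontend`, `backend`, `primary`) and the idea of a single “storage network” are assumptions for illustration, not any provider’s real API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CloudArchitecture:
    """A simplified, hypothetical model of a cloud's network layout."""
    name: str
    networks: set[str]      # networks attached to every VM
    storage_network: str    # network over which cloud storage is reached


@dataclass
class AppAssumptions:
    """Architectural assumptions an application has baked in (hypothetical)."""
    networks: set[str] = field(default_factory=set)
    storage_network: Optional[str] = None


def portability_gaps(app: AppAssumptions, target: CloudArchitecture) -> list[str]:
    """List the app's assumptions that the target cloud cannot satisfy."""
    gaps = [f"{target.name} has no '{net}' network"
            for net in sorted(app.networks - target.networks)]
    if app.storage_network and app.storage_network != target.storage_network:
        gaps.append(f"{target.name} reaches storage via '{target.storage_network}', "
                    f"not '{app.storage_network}'")
    return gaps


gogrid = CloudArchitecture("GoGrid", {"frontend", "backend"}, "backend")
ec2 = CloudArchitecture("EC2", {"primary"}, "primary")
app = AppAssumptions(networks={"frontend", "backend"}, storage_network="backend")

for gap in portability_gaps(app, ec2):
    print(gap)
```

The image converts cleanly either way; every line this prints is an architectural mismatch that has to be redesigned by hand.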

The problem here, to be a bit more succinct, is that we need reference architectures for how infrastructure clouds are built. Amazon is one such reference. VMware’s vCloud is potentially another. Now there could be a truly open option with the gravity to gather community support.

More on The Third Camp

OpenStack’s potential to build a real community and a set of reference architectures drives toward greater standardization and interoperability. Perhaps more important than a cloud storage alternative is the possibility that a true OpenStack community reaches critical mass, with a level of developer contribution comparable to the Amazon or VMware ecosystems. Commercial and alternative offerings, such as Cloud.com, Hexagrid, and OpenNebula, could then match their APIs and architectures to this set of reference architectures.

Will it happen? It’s hard to say, but the opportunity is there. Rackspace and others are putting serious weight behind this initiative.

What This Means for Telcos and Service Providers

For Telcos and SPs this means an alternative to VMware’s vCloud for commodity service offerings. A way to compete and operate at scale like Amazon and at a similar price point. Standardization through a similar reference architecture means greater compatibility between service provider clouds, which means greater benefit for customers and less lock-in, making them more desirable than the walled gardens.

You don’t want to differentiate on the basic compute, storage, and network offering. You want this to be as standard and interoperable as possible, just like 3G networks, TCP/IP, and similar service provider technologies. A common open platform that everyone uses creates a better opportunity to facilitate wider adoption and a competitive infrastructure service marketplace, where providers differentiate in areas where they have an inherent advantage:

  • Service and support
  • Network & datacenter tie-ins (e.g. MPLS, hosting/co-lo)
  • Bundled service offerings
  • Differentiated value-added cloud services (VACS)

This is a game that all telcos and service providers understand. They have been playing it for the past 15+ years.

Conclusion

OpenStack, with a strong community behind it, should be an important tool for service providers and large telcos to compete at scale with the Amazon and Googles of this world.

We believe OpenStack and the reference architecture(s) associated with it will allow service providers (SP) to get their undifferentiated cloud offerings up and running early. For this reason, Cloudscaling will put real resources into supporting this effort. Getting basic cloud offerings up early then means providers can focus on support, services, bundling, and differentiated services as soon as possible, while embracing as large a customer base as possible. This is just as they compete on top of basic TCP/IP services today.

[1] Clearly, the market leader, Amazon, does not use HVM. They use PVM, a fully paravirtualized mode of Xen. However, even they seem to understand that HVM is the future. Their latest offering, designed for HPC, which is performance sensitive, uses HVM and supports unmodified operating systems. The reality is that the Intel-VT and AMD-V capabilities on the latest round of processors are incredibly fast and will only get faster. The battle is over. HVM and silicon won in this case.


Categories: Companies

Rumor Mill: Google EC2 Competitor Coming in 2010?

Mon, 07/05/2010 - 16:00

I’ve heard from a somewhat reliable source that Google is working on their Amazon EC2 competitor. Yes, some kind of on-demand virtual servers. I would have been the last person to guess that Google would take this direction[1], but you have to admit it makes a certain sense from their perspective. Consider:

  • Amazon’s EC2 is clearly generating Real Revenue (TM) and could be at $500-750M in revenue for 2010
  • Google has a massive global footprint and is north of one million servers
  • The support structure for these servers includes a huge investment in datacenters, networking, and related
  • The Googleplex houses an extremely large number of talented engineers in relevant areas: networking, storage, Linux kernel, server automation, etc.
  • Google Storage recently went into BETA and is accepting developer signups

This last is perhaps one of the more telling signs. As you may be aware, Amazon’s Simple Storage Service (S3) pre-dates Amazon’s Elastic Compute Cloud (EC2). When Amazon launched in Europe they first deployed S3 followed by EC2. The same happened with their Asia/Pac deployment.

Amazon has built AWS in such a way that all of the services are synergistic, but in particular, EC2 is dependent on S3 as a persistent storage system of record. EC2 AMIs originate from and are stored in S3, it’s the long term backing store for Elastic Block Storage (EBS) and EBS snapshots, and it’s safe to assume that many other kinds of critical data that AWS relies on are stored there.
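The dependency argument also explains the rollout order described above: a service can only launch in a region once the services it relies on are there. Here is a minimal sketch using Python’s standard `graphlib` module (3.9+), with a hypothetical, simplified dependency map taken from this paragraph rather than from Amazon’s actual internals.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: service -> services it depends on.
DEPENDS_ON = {
    "S3": set(),
    "EC2": {"S3"},  # AMIs originate from and are stored in S3
    "EBS": {"S3"},  # S3 is the long-term backing store for EBS snapshots
}


def rollout_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = rollout_order(DEPENDS_ON)
print(order)  # S3 always comes first; EC2 and EBS may follow in either order
```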

Would Google take a different approach? It’s doubtful. Amazon’s S3 is built to be a highly scalable storage platform[2]. Google’s own GoogleFS and BigTable serve similar purposes. It’s certain that Google would use related design principles, and hence we could see Google Storage as a prelude to a Google on-demand virtual server service (Google Servers???).

Combined with the rumor I heard from a reasonably informed source, I think we can look forward to an EC2 competitor, hopefully this year.

What I want to bring to folks’ attention here is that if another credible heavyweight enters this market, it will have a tremendous impact in further driving the utilitization of cloud services. In the medium term it will also threaten hosting providers and ‘enterprise clouds’.

Why? I think what many hosting providers fail to understand is that Amazon and Google, particularly if fueled by direct competition, must grow up into the enterprise space. Just as in the Innovator’s Dilemma, they will eventually provide most of the features of any ‘enterprise’ cloud, which means that if you aren’t building to be competitive with Amazon and Google, you aren’t in the public cloud game.

Much more detail on this in a future posting.

[1] My best guess would have been that Google would put more weight behind PaaS solutions like Google App Engine (GAE) and related offerings, which are more ‘google-y’.
[2] See the whitepaper (PDF) on their Dynamo technology behind S3. Also check out Riak from Basho that is designed around the same techniques.


Categories: Companies

Building A Commodity Cloud with EMC?

Wed, 06/30/2010 - 13:09

Just a quick post to note a recent blog post by Chuck Hollis (@chuckhollis) that discusses some of the issues involved in using EMC for commodity clouds. The post centers on a conversation I have been having with Chuck, trying to understand the EMC product line better and to see whether there is a fit for businesses building cost-effective clouds.

Chuck’s blog post covers the discussion fairly well, and he was very helpful. My final takeaway is that there can be a place for EMC, traditionally a ‘premium’ vendor, even in low-cost commodity clouds. The challenge, however, as he rightly identifies, is that EMC the business has a hard time understanding these requirements. The local EMC sales team I’ve been dealing with doesn’t really understand the inherent assumptions I’m making. There is a lot of push to simply purchase the ‘bigger box’.

A strongly recommended read for folks trying to understand both EMC’s potential value proposition in the cloud and how to build ‘scale out’ commodity clouds.


Categories: Companies

Cloud: Change Management & Cloud Operations

Thu, 06/24/2010 - 03:57

Our own Andrew Shafer killed it today at the Velocity Conference. His presentation is a must-read for webops, devops, and anyone aspiring to build 100% uptime cloud services.

It’s hard for folks to internalize how things are changing in Internet-land, but I think you’ll get closer through this presentation.  It’s not the same-old, same-old any more. Cloud computing is the biggest change to how IT functions since the 1980s and the advent of the personal computer (and hence the rise of client-server/enterprise computing).

Enjoy … (and outstanding job, Andrew!)

Change Management Velocity2010



Categories: Companies

Interview with Cloudscaling CEO on Cloud in the Mid-market

Wed, 06/16/2010 - 16:24

A recent interview I did with Alex Bewley of Uptime Software is finally available. Although the podcast is nominally about cloud computing for mid-tier enterprises, we actually cover much broader ground. Alex’s blog posting lists the core topics as:

  • what kinds of businesses are using cloud
  • how you should go about evaluating it
  • how to avoid being outsourced as an IT department
  • what are the barriers to adoption
  • monitoring in the cloud (near and dear to our hearts)
  • designing applications for failure awareness
  • where he thinks the cloud is going

More important, for me personally, I think this is one of my better podcasts. The audio is clear, my responses, while long, are reasonably crisp, and you can tell that the general thinking around here has evolved a lot. Some key messages come through loud and clear that I think are still not well understood:

  1. Cloud computing isn’t about virtualization
  2. This is a disruptive sea change: be the disrupter, not the disrupted
  3. Whole new areas of opportunity, applications, etc. are opening up that didn’t exist before

I really think it’s worth a listen. It’s a little under 20 minutes and moves pretty quickly. Please enjoy, and a big thanks to Alex, who did a great job with the interview. Head over to the original blog post to listen to the podcast with Flash in your browser, or download the MP3 directly if you are using a system without Flash.


Categories: Companies

Getting Velocity – Economy of Motion

Tue, 06/08/2010 - 17:53

Last time, I implied that scale alone doesn’t always lead to operational efficiency in the datacenter.

There is no class someone can take that will teach them everything they need to know to run a datacenter, and the applications it ostensibly exists for, efficiently. Training exists for some of the tools, for some of the hardware, but gaps are still left to fill in with smart people, hopefully working together. (and sometimes getting the people aligned IS the hard part)

If there were a training course covering every technical aspect of this, it would be expensive and take months, probably years, and each class would be out of date by the time it was over.

Two weeks from today, you can get a glimpse of what different parts of that training might look like.

Velocity is the preeminent cloud operations conference. (Let’s face it, a cloud is a big web app. If you can’t do web ops efficiently at scale, keeping a cloud running effectively will be a chance for you to learn these lessons the hard way. These lessons will probably cost you on both sides, as you throw bodies at the problem while your customers leave after every outage.)

This is ‘the’ ops conference for building and managing cloud services at and on every level of whatever ‘cloud’ means to you.

I’ve been to every Velocity so far. Since the first one, there have been two recurring themes: first, automating dynamic infrastructure, and second, developers and operations working together as both paramount and a differentiator. The change is only accelerating, no pun intended.

I’m leaving out half of the story because Velocity also gathers the planet’s expertise focused on client side performance, which is provably important. It’s not my focus at the moment, but we need each other to deliver the highest value.

Many lessons are best learned by doing. Come learn from and connect with the people who have experience building some of the most impressive infrastructures and ops teams on the planet (and the scars to prove it…). You’re at an even greater disadvantage if you don’t, because we’ll be learning and sharing with each other.


Cloudscaling will be there.

Register now for 20% off with this code: vel10s2d


Categories: Companies

More Economies of Scale: Efficiency, Head Count and TCO

Wed, 05/19/2010 - 18:00

James Hamilton’s presentation at Mix 10 illuminated cloud computing economics that few others have direct experience with, but I also believe that this presentation raises interesting questions that didn’t get addressed. (If you haven’t seen James Hamilton’s Mix10 presentation, go watch it now. You should probably also go through Randy’s follow up, and then watch James talk again… I’ve watched it 3 times now.)

This is the first post that will refer to aspects of James’ talk (and I plan at least one more about business models) and in case I haven’t stressed this enough, if you have any interest in understanding the economics of cloud computing, take the time to watch one of the best in this business.

Central to James’ presentation is his breakdown of the total cost of ownership of computational infrastructure. The breakdown is based on his own data from running web-scale services, and he provides a great analysis of the inevitability and sustainability of cloud computing business models. One of the key points James makes is that the only variable cost in the chart is the cost of power.

Cost Breakdown from James Hamilton

The thing I want to focus on here is the missing cost of personnel. James touches on this at different points when discussing administration and automation. He gives the number ‘as low as 3% for services’, so I’m assuming he treats this as a negligible cost. I would argue that this cost is actually highly variable, and, while correlated with scale, is also a function of the types of services an organization provides and how those services relate to the core business. Additionally, automation investments can be scaled down effectively, but that’s what I’ve been working on for a couple of years, so this is likely a reflection of my bias.

Based on James’ biases (which he is straightforward about), that the cost of personnel can be driven down to almost nothing is essentially taken for granted. I contend, based on personal experience and observation, this is still a significant operational cost for many organizations. I would take this a step farther and posit that the level of efficiency James refers to only comes from the crucible of running web services at scale with considerable economic pressure for little or no downtime. Furthermore, this level of efficiency and cost reduction will never materialize in organizations who view IT as a cost center. Efficiency doesn’t just come for free, at scale or otherwise.

To keep the numbers easy, let’s assume an admin is paid $100,000 per year. Then, neglecting networking and storage, that admin can manage some number of machines. If that number is 100, then managing each machine costs $1,000 per year, or ~$83 per month. If that number is 1,000 machines, then each machine costs ~$8.33 per month. If those are $3,000 servers and servers are roughly 54% of the total cost, then, if I understand correctly, the $8.33 is around 5% of the monthly cost when the servers are amortized over 3 years. James gave us price or efficiency ratios for storage, networking and admins. For a large service he listed ‘over 1000 servers/admin’. He did not give us a ratio or a price point per server, but in order to get down to ~3%, the admins need to manage significantly more than 1,000 servers, the server cost must be significantly higher than $3,000, or the admins must be paid significantly less than $100,000. (This also assumes salary is the only cost and nothing is paid for any management tools…)
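The arithmetic above is easy to rerun with your own numbers. This Python sketch assumes, as the paragraph does, that salary is the only personnel cost, that servers are amortized over 36 months, and that the admin cost is measured against the non-personnel monthly cost per server; the function names are hypothetical.

```python
def admin_cost_share(salary_per_year: float, servers_per_admin: float,
                     server_price: float, server_share: float,
                     amortization_months: int = 36) -> float:
    """Admin cost per server-month as a fraction of the (non-personnel)
    monthly cost per server, following the arithmetic in the post."""
    admin_per_server_month = salary_per_year / 12 / servers_per_admin
    # Servers are server_share of total cost, so scale up to the full monthly cost.
    total_per_server_month = server_price / amortization_months / server_share
    return admin_per_server_month / total_per_server_month


def servers_per_admin_for(target_share: float, salary_per_year: float,
                          server_price: float, server_share: float,
                          amortization_months: int = 36) -> float:
    """Invert admin_cost_share: servers per admin needed to hit target_share."""
    total_per_server_month = server_price / amortization_months / server_share
    return salary_per_year / 12 / (target_share * total_per_server_month)


share = admin_cost_share(100_000, 1_000, 3_000, 0.54)
needed = servers_per_admin_for(0.03, 100_000, 3_000, 0.54)
print(f"admin share at 1,000 servers/admin: {share:.1%}")
print(f"servers/admin needed for 3%: {needed:,.0f}")
```

With the post’s figures it reproduces the ~5% estimate (5.4%), and inverting it suggests roughly 1,800 servers per admin are needed to reach ~3% under the same assumptions.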

What do you pay your admins? What do you pay for servers? What is the ratio between them?

Which one do you have the most control over?

(hint: the way to optimize the ratio is not to hire fewer admins, unless your customers like downtime…)

Watching the evolution of the cloud computing landscape, in the rush to bring new services to market or transition away from apparently disrupted business models, I believe many organizations may unnecessarily learn this lesson the hard way. The proper care and feeding of the infrastructure had better be a core competency for those who intend to compete ‘as a Service’ at any level. The operational differentiators have as much to do with process and culture as they do with technology, but doing them well could be the difference between business success and failure. I believe this is difficult to retrofit, especially at scale.

So what should you do? Start by trying to understand what your costs look like today, and I’ll follow up with perspectives and resources that might help with operational efficiency, at any scale. Operations can be a competitive advantage, but only for those organizations who have made the investments in both the people and the infrastructure.


Categories: Companies

Lew Tucker, former Sun Cloud CTO, now Cloudscaling advisor

Tue, 05/11/2010 - 16:01

San Francisco, CA – May 11, 2010 – Cloudscaling today announced that Lew Tucker, former CTO of Sun’s Cloud Computing business unit, has joined Cloudscaling’s advisory board.  Cloudscaling CEO, Randy Bias, expanded on what this means to the company, “If you look at Lew’s history, you will see that he is a true visionary and always at the forefront of the next technology trend. His experiences at Salesforce.com, Sun Microsystems, and Thinking Machines, fit right alongside the deep expertise in cloud and distributed systems that makes Cloudscaling unique.”

Lew’s background spans more than 20 years during which he has been instrumental in driving several major  technology changes, including: AI and massively parallel systems, developer adoption of Java, Salesforce.com’s AppExchange,  and most recently, Cloud Computing.    According to Lew, “At Thinking Machines, in the early 1990’s, we were building massively parallel machines using thousands of individual processors.   At Sun, we drove the evolution of the web with Java and networking, often using the tagline, ‘The Network is the Computer’.   In this next phase, it’s becoming clear that the ‘Cloud is the Computer’  and this promises to be just as disruptive.”

Cloudscaling CEO, Randy Bias, and Lew Tucker both share a long-term interest in the design and architecture of large, scalable systems.  As former VP Technology Strategy of GoGrid, Randy was responsible for building out one of the most complete infrastructure services in the cloud.  As CTO of Cloud Computing for Sun Microsystems, Lew was responsible for the architecture and API for Sun Cloud.

Lew’s joining the advisory board continues to build up the Cloudscaling team’s unique set of resources. “If you want to build significant clouds, you have to have the right team,” said Randy Bias. “We’re the only cloud engineering services team I know of that can point to not one, but many, large-scale cloud environments they have built.”

About Cloudscaling

Cloudscaling is the leading cloud computing engineering services firm. We provide strategy, design and implementation to build cutting edge clouds. Located in San Francisco, the company was founded by experts who have built some of the largest public and private clouds operating today. Visit cloudscaling.com to read the blog and follow the team on twitter.com/cloudscaling.

Contact:
Pat Sharp
pat@cloudscaling.com
725 Cool Springs Blvd., Ste. 600
Franklin, TN
USA
Ph: +1 (615) 732-6192

###


Categories: Companies

Understanding Cloud Datacenter Economies of Scale

Tue, 05/04/2010 - 15:03

James Hamilton’s recent MIX’10 presentation on economies of scale for large cloud providers was quite impressive. James “gets it” like few others in the industry. If you haven’t watched his hour-long presentation, I suggest you do. I also recommend this excellent response from James Urquhart.  My goal in this posting is to highlight, clarify and expand on a few of James Hamilton’s points.  I will focus on Infrastructure-as-a-Service (IaaS) clouds, but the concepts are relevant for other kinds of cloud services.

In his presentation, James focuses on power: utilization, distribution, etc., and while an important element, like him, I don’t think it’s the most important factor.

I also want to dispel the myth that only the largest companies can achieve these economies of scale. Don’t get me wrong; providing a cloud service is a scale game. It requires a certain amount of buying power to compete. However, you don’t need to be MSFT, YHOO, AMZN, or GOOG to compete effectively. Buying power can be had at levels much lower than you might think.

In this article, I refer regularly to James’s comments in his presentation, so I suggest you watch his video first. In order to minimize confusion, I’ve borrowed some pictures from his slides and inserted them here for your reference. This is a long entry, but it will be worth the read, as I’ve got numbers for you that I hope you will find interesting.

Background
Like James, the Cloudscaling team has a history of building large scale services. I’ve worked in this area for 16+ years, as have our COO, Adam Waters, and several of our team members. Understanding the economies of scale, especially for service providers, cloud or otherwise, is fundamental to our DNA. For example, see my previous piece describing how oversubscription works.

Enough of that! Let’s dig in and look at where you can achieve economies of scale, identifying areas James Hamilton may have neglected, and clarifying areas where I think there is still confusion.

Economies of Scale
There are a number of areas where you can achieve economies. James touched on a few in his talk.  While this is not a 100% complete list, here are key areas of opportunity that I see:

  • Datacenter and facilities architecture (power & cooling)
  • Buying power (COGS) for Networking, Compute (Servers), and Storage
  • Development & Labor Costs
  • Standardization & Homogenization
  • Cash Flow

In James Hamilton’s model (see pie chart below) server costs are the dominant cost, but he critically left out development & labor costs.  This can be as much as 10% for a cloud and while it’s possible for large clouds to drive this down to a marginal cost, in practice there are no Infrastructure-as-a-Service clouds of sufficient size to achieve this yet. While James focuses primarily on power & cooling in his presentation, let’s take a closer look at some other areas.


James Hamilton’s Distribution of Cloud Datacenter Costs

Networking
There are  two key areas of networking where you can achieve economies of scale:

  • Buying IP (network) transit (OpEx)
  • Capital expense (CapEx)

Unfortunately, James provides only one example, with numbers from 2006, comparing two companies purchasing bandwidth. The larger company purchased bandwidth at $15 per Mbps per month, billed at the 95th percentile ($/Mbps/mo), versus the smaller company’s $95/Mbps/mo. James uses these numbers to show a 6- or 7-to-1 buying power difference.
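For readers unfamiliar with 95th percentile billing: the carrier samples your traffic rate (commonly every 5 minutes), discards the top 5% of samples for the month, and bills on the highest remaining sample. The sketch below assumes a single stream of samples and one common rounding convention; real carriers differ in sampling interval and in how inbound and outbound traffic are combined. The traffic pattern is made up for illustration.

```python
def billable_mbps(samples_mbps: list[float]) -> float:
    """95th percentile billing: drop the top 5% of samples, bill the highest left."""
    ordered = sorted(samples_mbps)
    # ceil(0.95 * n) in integer arithmetic, minus 1 for a zero-based index
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


# 30 days x 288 five-minute samples = 8,640 samples (made-up traffic pattern)
samples = [400.0] * 8_440 + [2_000.0] * 200
rate = billable_mbps(samples)

for label, price in (("larger company", 15), ("smaller company", 95)):
    print(f"{label}: {rate:.0f} Mbps x ${price}/Mbps/mo = ${rate * price:,.0f}/mo")
```

Because the 200 burst samples fall within the discarded 5% (432 of 8,640), the bill is set by the 400 Mbps baseline, and the only difference between the two companies is the unit price.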

The IP landscape has changed dramatically since 2006. It’s a dirty little secret in the hosting and cloud world that bandwidth is dirt cheap[1]. In fact, you need very little buying power to get rock-bottom prices. Street price for high quality Tier-1 IP transit is <$5/Mbps/mo if you buy 1Gbps commits. That’s a mere $5,000/mo, which is well within the spending range of even small companies. My local coffee shop could probably afford it. Yes, it’s quite possible larger buyers are getting even lower rates than $5/Mbps/mo, but there is a floor and it’s not much less than $5/Mbps/mo, so the disparity in buying power is closer to 3:1 or 2:1.

To give you some perspective, I know for a fact that some Yahoo! datacenters push upwards of 40Gbps.  A lot, certainly, but at $1-3/Mbps, well within the buying capacity of even medium size businesses.
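The bandwidth figures in the last two paragraphs reduce to simple multiplication. This sketch assumes flat per-Mbps pricing on the committed rate and ignores overage charges; the function name is hypothetical.

```python
def monthly_transit_cost(committed_mbps: float, price_per_mbps: float) -> float:
    """Monthly IP transit bill, assuming flat $/Mbps/mo pricing on the commit."""
    return committed_mbps * price_per_mbps


small_commit = monthly_transit_cost(1_000, 5)   # 1 Gbps at ~$5/Mbps/mo
large_low = monthly_transit_cost(40_000, 1)     # ~40 Gbps at $1/Mbps/mo
large_high = monthly_transit_cost(40_000, 3)    # ~40 Gbps at $3/Mbps/mo

ratio_2006 = 95 / 15  # James's 2006 example: roughly 6.3:1
# A 2:1 to 3:1 gap against a $5 street price implies a large buyer paying:
implied_large_price = (5 / 3, 5 / 2)  # ~$1.67 to $2.50 per Mbps/mo

print(f"1 Gbps commit: ${small_commit:,.0f}/mo")
print(f"40 Gbps: ${large_low:,.0f} to ${large_high:,.0f}/mo")
print(f"2006 ratio {ratio_2006:.1f}:1; implied large-buyer price "
      f"${implied_large_price[0]:.2f}-${implied_large_price[1]:.2f}/Mbps/mo")
```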

The second area is using buying power to purchase network hardware at much reduced rates. When buying bulk quantities of network gear, the theory is that network costs can be significantly reduced, hence larger players have significant cost advantages. Unfortunately, this only works partially in practice. Network equipment costs are still increasing significantly, not falling as compute and storage equipment costs are. Combined, networking equipment (CapEx) and power usage (OpEx) make up 21% (in the pie chart above) of datacenter costs, and both are steadily increasing.

Large cloud providers are beginning to address this by systematically removing brand name networking vendors like Cisco and Juniper[2]. It is now possible to buy very high quality, exceptionally cheap networking gear direct from Taiwanese manufacturers. Many of these are the original equipment manufacturers for name brands.

Most people don’t realize how marked up Cisco/Dell/HP/Juniper gear is. For example, these Taiwanese OEMs have networking gear with price points as low as $100/10GE port. Yes, $100 per 10Gig Ethernet port. That’s about 1/10 the Cisco price point. At the same time, the quality of the gear is quite high and in some cases the components and chips are a generation ahead of what’s available from the name brands.

In other words, times are changing. We’re going to see a significant drop in the prices for networking gear across the board for the first time in ages and hopefully networking will get in line with the standard Moore’s Law curve.

Servers
Flogged. Dead Horse. There isn’t any significant buying power to be had in the commodity x86 server market. x86 server vendors, particularly those providing commodity offerings, have thin margins to begin with. 25% is typical. An Amazon or a Google can push these down somewhat by buying in bulk, but not enough to give them more than a marginal advantage. Anyone who can buy $1M USD of servers at once can negotiate a pretty steep discount. Many businesses can afford to buy at that price point.

James Hamilton understands this and points out where the real buying power is: buying customized hardware in bulk that allows for datacenter optimization and cost reductions in power, cooling, and space. By purchasing customized servers from the likes of SGI/Rackable or Verari, with well-spec’ed components designed for their particular datacenter environment, large providers can realize significant savings. That’s where the real opportunities lie.

Vendors like SGI/Rackable and Verari can afford to build to spec in large quantities and amortize that customization across large orders.  These vendors are learning from the large clouds what works best and productizing it.  You will be able to benefit from this learning and productization too. In fact, we help our clients figure out these kinds of issues every day and know that these opportunities are within the reach of all types of cloud businesses.

Development & Labor Costs
Although this was touched on only briefly in the presentation, I think it is the heart of the matter. Amazon leads the pack in rapid development of cloud services (see my post “Is Amazon Winning the Cloud Race?”). Their ability to innovate in both automation and technology allows them to drive significant economies of scale. This implies that development is a much larger share of a cloud’s costs than might be expected.

Amazon’s EC2 was initially built by a 15-person engineering team. I estimate that AWS as a whole probably has 50+ software engineers and 20-30+ support and operations staff [3]. Last year, I estimated there were approximately 40,000 servers at a target price of $2,500 each for EC2. That’s $100M in CapEx on servers. A reasonable estimate of engineering and operational labor costs for AWS is probably close to $10-20M over the past 4 years. A not inconsequential number compared to the CapEx costs.
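The CapEx-versus-labor comparison above can be checked with a few lines of Python. This sketch simply multiplies out the estimates in the text (40,000 servers at $2,500 each, $10-20M of labor over roughly 4 years); these are rough estimates, not published AWS numbers, and the per-year figure assumes labor spend was spread evenly across the 4 years. Footnote [3] suggests the real headcount, and therefore the labor figure, may be considerably higher.

```python
# Compare estimated AWS labor spend against EC2 server CapEx, using the
# rough estimates from the paragraph above (not published AWS figures).

SERVERS = 40_000
PRICE_PER_SERVER = 2_500                  # USD, target price per server
LABOR_RANGE = (10_000_000, 20_000_000)    # USD, cumulative over ~4 years
YEARS = 4                                 # assumed even spread across years


def server_capex(servers: int, price: int) -> int:
    """Total server capital expenditure in USD."""
    return servers * price


def labor_share_of_capex(labor: float, capex: float) -> float:
    """Labor spend expressed as a fraction of server CapEx."""
    return labor / capex


if __name__ == "__main__":
    capex = server_capex(SERVERS, PRICE_PER_SERVER)
    print(f"Server CapEx: ${capex:,}")  # $100,000,000
    for labor in LABOR_RANGE:
        share = labor_share_of_capex(labor, capex)
        print(f"Labor ${labor:,} over {YEARS} years = {share:.0%} of CapEx, "
              f"~${labor / YEARS:,.0f}/year")
```

Even at the low end, labor comes to about a tenth of server CapEx, which is why development and operations staffing belongs in any serious cloud cost model.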

One client recently asked us to compare the operational expense costs for server administration between large clouds (e.g. Microsoft and Google) and a typical enterprise. Usually, enterprises can manage 100-200 servers per admin. Microsoft’s stated goal is 1,000-2,000 servers per admin for its Chicago datacenter (confirmed in the presentation). Google is managing at a scale of 10,000 servers per admin and trying to get to 100,000.

If you do the math, the basic cost of administration is $75/mo/server for the enterprise, $7.50/mo for Microsoft, and a mere $0.75/mo for Google, a 100x difference! When calculating long-term TCO, you’ll find that investing heavily in automation is a “no-brainer” for those whose core competency *is* building at scale and operating IT at the lowest cost possible[4].
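Here is a small Python sketch of the per-server administration math. It assumes one fully loaded administrator costs about $7,500 per month; that figure is not stated in the post but is back-solved from $75/server/month at the low end (100 servers) of the enterprise range. Spreading that fixed cost across more servers per admin reproduces the 10x steps above, plus Google’s 100,000-server target.

```python
# Monthly administration cost per server at different admin ratios.
# Assumption: a fully loaded admin costs about $7,500/month. This is back-solved
# from $75/server/month at 100 servers per admin, not a stated salary.

ADMIN_COST_PER_MONTH = 7_500.0  # USD, assumed (see note above)

SERVERS_PER_ADMIN = {
    "Enterprise": 100,
    "Microsoft (goal)": 1_000,
    "Google": 10_000,
    "Google (target)": 100_000,
}


def admin_cost_per_server(servers_per_admin: int,
                          admin_cost: float = ADMIN_COST_PER_MONTH) -> float:
    """Spread one admin's monthly cost across the servers they manage."""
    return admin_cost / servers_per_admin


if __name__ == "__main__":
    for name, ratio in SERVERS_PER_ADMIN.items():
        cost = admin_cost_per_server(ratio)
        print(f"{name:<17} {ratio:>7,} servers/admin -> ${cost:,.3f}/server/month")
    spread = admin_cost_per_server(100) / admin_cost_per_server(10_000)
    print(f"Enterprise vs. Google: {spread:.0f}x")  # 100x
```

Because the admin cost is held fixed, the per-server figure depends only on the ratio; the real lever is automation that raises servers per admin.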

The lesson here, which James alludes to in his presentation (see his map of 2009 AWS releases below), is that one major economy of scale is the ability to dedicate significant resources to software development. The outcome of most cloud software development is automation or technology that enables the business to scale more efficiently.

James Hamilton's map of 2009 AWS releases

Standardization & Homogenization
Often overlooked is that businesses built at cloud scale *must* run relatively homogeneous environments; standardization is what lets them operate at that scale. For example, Google is reputed to run as few as five hardware configurations across its one-million-plus server base. In contrast, a typical enterprise has hundreds of configurations across a much smaller server base, which dramatically increases operational overhead and expense.

Did you know that the primary driver behind Yahoo!’s cloud computing initiative was to normalize and clean up their 800+ configurations? It’s impossible to operate at massive scale without homogeneity and standardization. A huge benefit of virtualization in cloud environments is that it allows the physical hardware platform to be standardized while a plethora of operating systems run at the virtualized layer. This is also why I am occasionally a little sad when I see large enterprise IT shops insisting on buying from their x86 server hardware vendor of choice. It just doesn’t matter any more. Not if your cloud (internal or external) is designed correctly.

The Cash Flow Problem
A less understood, but just as interesting, area of business scalability for clouds is the interplay of growth, speed, and cash reserves. There’s no question clouds can be very profitable and attractive for operators given their high margins and 100% compound annual growth rates (CAGR). Typical payback periods on installed cloud hardware average a fast 3-6 months. However, new entrants need to plan for an extended period of ever-increasing hardware expenses that stay well ahead of free cash flow. Larger clouds will require $10M in liquidity to meet their rolling hardware acquisition needs, and even small clouds need to think about acquiring hardware in $1M steps. One key to building a highly profitable cloud lies in minimizing the lag between hardware acquisition and revenue generation. The difference between days and months is the difference between profitability and disaster.
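The cash-flow squeeze is easier to see in a toy model. The Python sketch below assumes hardware is bought every month in tranches that grow at 100% CAGR, that each tranche returns its cost in 4 months (within the 3-6 month payback range) once it starts earning, and that nothing else affects cash. The $1M starting tranche, the 24-month horizon, and the function name `peak_funding_need` are all hypothetical. It reports the deepest cumulative cash deficit, roughly the liquidity an operator must have on hand, for different delays between buying hardware and selling it.

```python
# Toy model of hardware purchases vs. cash inflows for a growing cloud.
# All parameters are illustrative assumptions, not figures from a real provider.


def peak_funding_need(
    months: int = 24,
    first_purchase: float = 1_000_000.0,  # hypothetical first hardware step, USD
    annual_growth: float = 1.0,            # 100% CAGR in hardware purchases
    payback_months: float = 4.0,           # within the 3-6 month range above
    lag_months: int = 0,                   # months from purchase to first revenue
) -> float:
    """Return the deepest cumulative cash deficit, i.e. the liquidity required."""
    monthly_growth = (1 + annual_growth) ** (1 / 12) - 1
    purchases = [first_purchase * (1 + monthly_growth) ** m for m in range(months)]
    cash = 0.0
    deepest = 0.0
    for month in range(months):
        cash -= purchases[month]
        # Each tranche earns cost / payback_months per month once it is live.
        for bought, cost in enumerate(purchases[: month + 1]):
            if month - bought >= lag_months:
                cash += cost / payback_months
        deepest = min(deepest, cash)
    return -deepest


if __name__ == "__main__":
    for lag in (0, 1, 3):
        need = peak_funding_need(lag_months=lag)
        print(f"lag {lag} month(s): peak funding need ~${need:,.0f}")
```

Because the model works in whole months it cannot show the days-versus-months contrast directly, but even a few months of lag noticeably deepens the peak deficit, which is the point of the paragraph above.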

A Brief Aside on ‘Utilization Rates’
And that brings up an interesting point I want to address, which is not technically an economy of scale but is worth discussing. I’m intrigued by the notion that ‘utilization rates’ are almost completely meaningless in the context of public cloud providers. James Hamilton cites 30% server utilization (presumably CPU) as a high figure even for cloud service companies. However, this doesn’t matter when you sell your capacity the way an IaaS cloud does. Here’s why: whatever your particular cloud business model is (e.g. selling RAM, CPU, or ‘bundles’ like Amazon’s instances), you *must* sell as much of it as possible at any given time. Anything else is business suicide. You can’t overbuild and only have 30% of your capacity sold.

Most IaaS providers target 75-80% sold capacity. Their clouds are therefore ‘utilized’ at 75-80% in the sense that the capacity is sold, even if that capacity is not 100% busy. In a business that consumes its own capacity, unused capacity is waste; in a business model that sells capacity, the measure that matters is unsold capacity, and carrying too much of it makes you uncompetitive. Not all uses of the term ‘utilization rates’ mean the same thing, especially when discussing an IaaS cloud[5].
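To make the distinction concrete, here is a minimal Python sketch that computes both metrics for the same hypothetical four-host fleet. The host sizes, sold units, and CPU figures are invented so that sold capacity lands near the 75-80% target while CPU utilization sits near Hamilton’s 30%; the `Host` class and its fields are illustrative, not any provider’s data model.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One physical server in a hypothetical IaaS cloud."""
    sellable_units: int   # e.g. instance slots or GB of RAM offered for sale
    sold_units: int       # units currently rented to customers
    cpu_busy: float       # fraction of CPU time actually consumed (0.0-1.0)


def sold_capacity(hosts: list[Host]) -> float:
    """Fraction of sellable capacity that is under contract."""
    return sum(h.sold_units for h in hosts) / sum(h.sellable_units for h in hosts)


def cpu_utilization(hosts: list[Host]) -> float:
    """Average fraction of CPU time in use across hosts."""
    return sum(h.cpu_busy for h in hosts) / len(hosts)


if __name__ == "__main__":
    fleet = [Host(16, 12, 0.25), Host(16, 13, 0.35),
             Host(16, 12, 0.30), Host(16, 13, 0.30)]
    print(f"Sold capacity:   {sold_capacity(fleet):.0%}")
    print(f"CPU utilization: {cpu_utilization(fleet):.0%}")
```

Run as a script, it reports about 78% sold capacity alongside about 30% CPU utilization: by the metric that matters to an IaaS business the fleet is well utilized, even though the CPUs are mostly idle.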

Conclusion
It’s critical to understand the potential economies of scale for cloud providers. They can achieve these economies through size and focus. While larger players have some advantages, many businesses can afford to buy servers and networking gear in enough bulk to see significant price savings. More important than sheer size is the ability to focus on innovation.

Public cloud providers have a core competency that involves delivering IT services at a very cost effective price point. They are the new IT utility companies of the near future. Their ability to focus and spend development resources to achieve ever newer economies of scale will be something that traditional businesses can’t compete with. Traditional enterprise IT vendors will likely continually be playing “catch up” and be unable to provide competitive solutions in time.

Economies of scale are why other business infrastructure in the past, e.g. railways, telecommunications, and shipping, has consolidated into businesses that focus on delivering the infrastructure as a core competency. To think IT is any different is to bet against history.

UPDATE: Added footnote to clarify that AWS is currently aggressively hiring so my initial estimates on engineering staff size are way off.

[1] This means most hosting providers and clouds see very large markups on bandwidth.  They buy it cheap, but don’t typically pass on the cost savings to smaller customers.
[2] If you listen carefully to James Hamilton’s presentation, he alludes to this several times.
[3] It was pointed out in the comments that AWS currently has 100+ technical positions open, so my initial size estimate is probably way off. It’s hard to derive how big the AWS engineering team is from open positions because it depends on how aggressively they are hiring. A wild guess, assuming the answer is ‘very aggressively’, would be 200 engineers, putting their current annual headcount costs at around $30M.
[4] Contrary to James Urquhart in his response, it seems apparent to the Cloudscaling team that cost-effective clouds and high-quality, enterprise-class clouds are not mutually exclusive.
[5] Check out this tangentially related Wikipedia article on capacity utilization rates for the production base of nations.

Categories: Companies

Join Cloudscaling, the Power Behind the Cloud

Fri, 04/23/2010 - 09:00

Innovation and agile development: that’s how we build automated cloud infrastructure for leading global organizations. Our thought leaders and practitioners are creating the best tools and processes for building cloud platforms, sharing our knowledge as we grow.

We’re hiring the devops dream team. Cloudscalers have built major IaaS, PaaS and SaaS systems. We need Senior Developers and System Administrators who recognize ‘Infrastructure is Code’ and embrace the fusion of development and operations.

Contact us if you:

  • Have a background of shipping software in a team environment
  • Are familiar with test-driven development, continuous integration/deployment
  • Are lazy! Have a tendency to automate everything
  • Have strong systems, network and storage experience
  • Have experience automating infrastructure provisioning
  • Are experienced with configuring an application runtime stack
  • Are intimately familiar with Linux (Debian, Red Hat)
  • Have familiarity with existing public cloud computing platforms (AWS, Rackspace)

More about Cloudscalers—we love:

  • Chef / Puppet
  • Building APIs
  • Linux Packaging
  • Virtualization & hypervisors (Xen, ESX, KVM)
  • VMware APIs (specifically virtual infrastructure (VI) and virtual server)
  • Open source cloud tools like Eucalyptus, OpenNebula, Abiquo
  • OpenSolaris

And tons of other nifty things. If building the next generation of cloud computing infrastructure interests you, please let us know! We’ll be listening to @cloudscaling on twitter and you can reach us at jobs@cloudscaling.com.

Categories: Companies

Cloudscaling Expands Executive Bench

Tue, 04/20/2010 - 19:01

San Francisco, CA – April 20, 2010 - In the past 9 months, Cloudscaling, a pioneer of cloud infrastructure engineering and design, has seen phenomenal growth in both clients and staff. The company has tapped a leading cloud veteran, Joe Arnold, to help manage growth and drive its cloud engineering projects.

Joe, who built agile cloud development teams at both Yahoo! and EngineYard, remarked, “Cloudscaling has the strongest history of anyone out there when it comes to building cutting edge infrastructure clouds. I’m very much looking forward to growing the team and helping our clients build and run their own world class clouds.”

CEO and founder, Randy Bias, describes Cloudscaling as more than a consulting firm. He compares it to large scale civil engineering firms that construct power plants and help their clients run them at peak efficiency.  He explains why the company chose Joe.  “Cloudscaling brings more than technology and architecture solutions to our cloud engineering services clients; we also help them build great teams to run with the clouds we design. With EngineYard, Joe proved that cloud is about more than just technology. The discipline imparted upon the EngineYard team is widely credited with its turnaround from an ‘R&D shop’ into the number one Ruby-on-Rails Platform-as-a-Service offering in the cloud computing space. Joe will bring those same skills to every client we work with.”

COO, Adam Waters, adds, “Cloudscaling builds great clouds and great product teams. Unlike our competitors, we teach our clients to fish, by showing them how to not only deliver a differentiated cloud offering, but also how to run fast paced Agile cloud devops teams who can compete with the likes of Amazon, Google, and Yahoo!”

At EngineYard, the leading Ruby on Rails platform, Joe led his team to private beta launch in under 3 months, and public beta with paying customers in under 4 months. The team continued its momentum by pioneering techniques for continuous deployment of cloud computing platforms. Joe will continue to employ the most innovative and Agile software development methods as he grows and develops the elite team of cloud engineers at Cloudscaling.

About Cloudscaling

Cloudscaling is the leading cloud computing engineering services firm. We provide strategy, design and implementation to build cutting edge clouds. Located in San Francisco, the company was founded by experts who have built some of the largest public and private clouds operating today. Visit cloudscaling.com to read the blog and follow the team on twitter.com/cloudscaling.

Contact:
Pat Sharp
pat@cloudscaling.com
725 Cool Springs Blvd., Ste. 600
Franklin, TN
USA
Ph: +1 (615) 732-6192

###

Categories: Companies