Big. Cloud Big.

Amazon’s EC2 is increasingly functioning as an on-demand supercomputer, according to a fascinating piece by Jon Brodkin in Ars Technica. It’s another example of scale and heft moving to the cloud, and in the process democratizing high-end computing capability that was previously available only to the “one percent” of high performance computing (HPC) customers.

As the piece points out, Amazon customers have assembled clusters of 30,000 and 50,000 cores, some of them running “embarrassingly parallel” processes that did not depend on fast interconnect technologies like InfiniBand, the technology used by the top HPC datacenters.

Even if you’re using Amazon’s 10 Gigabit Ethernet connection, you may still be using servers that are literally half the world away from each other. For Amazon customer Schrödinger, which makes simulation software for use in pharmaceutical and biotechnology research, the distance didn’t matter on that aforementioned 50,000-core cluster. Schrödinger pulled Amazon resources from four continents and all seven of Amazon’s data center regions to make the cluster as big as possible.

But the Amazon approach can slow down applications that do require a lot of communication between servers, even at small scales. Schrödinger President Ramy Farid tells us that in one case his firm ran a job on two 8-core Amazon instances with terrible results.

“Certain types of parallel applications do not yet seem to be appropriate to run on the Amazon cloud,” Farid said. “We have successfully run parallel jobs on their eight-core boxes, but when we tried anything more than that, we got terrible performance. In fact, in one case, a job that ran on 16 cores took more wall clock time than the same job that was run on eight cores.”
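To see how 16 cores can lose to eight, here is a toy model of my own (not anything from the Ars piece, and not Schrödinger’s actual workload). It assumes the compute divides perfectly across cores, and that every instance beyond the first adds a fixed communication cost. The function name and all the numbers are hypothetical, picked only to show the shape of the effect.

```python
def wall_clock(cores, compute_s=100.0, cores_per_node=8, comm_s_per_extra_node=10.0):
    """Toy estimate of wall clock seconds for a tightly coupled job.

    Assumptions (hypothetical, for illustration only):
    - the compute divides evenly across all cores
    - each node beyond the first adds a fixed communication cost
    """
    nodes = -(-cores // cores_per_node)  # ceiling division
    return compute_s / cores + comm_s_per_extra_node * (nodes - 1)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 16):
        print(f"{cores:>2} cores: {wall_clock(cores):.2f} s")
```

With these made-up numbers the job keeps getting faster up to eight cores on a single box, then slows down once it spills onto a second box, because the cost of talking across the network outweighs the extra cores. An embarrassingly parallel job corresponds to a communication cost near zero, which is why distance didn’t matter for the 50,000-core cluster.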

Still, Amazon is a multi-purpose platform, running everything from Web hosting to MySQL, so ultimately its performance lags behind massively parallel systems that are tuned only for HPC apps, such as simulating the airflow over a wind turbine.

But sometimes “cloud big” doesn’t stand for democratization; it can also refer to the sui generis systems that Google and other big cloud players have to design for themselves.

It brings me back to a conversation that I had with salesforce.com CEO Marc Benioff back in 2006, the pioneer days of the cloud (How time flies!). I was complaining about a frustrating conversation that I had just had with our head of systems engineering at the time:

“He said that it’s like we were building a handmade car!” I whined.

“Well, of course we are, young padawan. What did you think we did, call Dell and say ‘One Cloud, please’?”

OK, I made the padawan part up.

While salesforce.com doesn’t build its own servers, companies like Google and Facebook do. And a new story by Cade Metz at Wired Enterprise describes an ultra-cool storage array that Facebook has developed, the Open Rack, part of a new storage project codenamed “Knox.”

The dominant paradigm in server racks in cloud data centers is that when a failure is detected or predicted in a rack, the entire rack is replaced. Presumably the remaining servers are checked, repaired/recycled if necessary, and returned to service.

The Open Rack turns this paradigm on its head, designing arrays of disk drives to maximize access so that each drive can be swapped out at the push of a button without bringing the rest offline. The massive arrays are built as clamshells, swinging open with minimal resistance. The drives just pop out.

The designs for the project are being made available through the Open Compute Project, an effort started by Facebook to share such technologies:

The aim of the project, says Frank Frankovsky, the ex-Dell man who oversees the hardware group at Facebook and serves as point man for the Open Compute Project, is not only to improve hardware in the data center, but to do so in a way everyone can benefit from. Web giants such as Google and Amazon already use custom-built gear, and they’re streamlining their supply chains by purchasing this gear straight from manufacturers in Taiwan and China. But they treat their designs like trade secrets, viewing them as a competitive advantage best kept hidden from the rest of the world. Ultimately, Frankovsky believes, you can streamline the process even more if everyone shares their designs.

“The Open Compute Project is really about bringing together a convergence of voices,” he says. And other members of the project agree. Though Knox was designed by engineers at Facebook, the project was officially chaired by Cole Crawford, the director of technology at Nebula, a Silicon Valley startup that sells a hardware system for building Amazon-like cloud services, and according to Crawford, the prototype was built with input from the larger community. “As a community member,” he says, “you are absolutely empowered to give your thoughts and ideas.”

Would this be happening if Microsoft were leading the industry now? Would companies be sharing critical infrastructure technologies instead of using them for competitive differentiation?

That’s an idea that’s Big. Cloud Big.