IT’S IMPORTANT TO recognize that not all high-performance computing systems work the same way. Indeed, choosing between a cloud-based or in-house HPC solution may well depend on the kind of processing work that needs to be done. Dennis Gannon, director of cloud research strategy for the Microsoft Research Connections team, analyzed the work performed by about 90 research groups that were given access to Microsoft Azure cloud resources over the last two years. He concluded that four major architectural differences between cloud clusters and supercomputers—machines running thousands, even tens of thousands of processors—determine which types of high-performance computing should be done where:

Each server in a data center hosts virtual machines, and the cloud runs a fabric scheduler, which manages sets of VMs across the servers. This means that if a VM fails, it can be started up again elsewhere. But it can be inefficient and time-consuming to deploy VMs on each server when setting up the batch applications common to HPC.
Data in data centers is stored and distributed over many, many disks. On supercomputers, by contrast, data is stored not on local disks but on network storage.
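The idea of spreading data over many disks can be sketched as a simple block-placement scheme. This is a hypothetical illustration of the general approach, not the actual placement policy of any particular system:

```python
import hashlib

def place_block(block_id: str, num_disks: int, replicas: int = 3) -> list[int]:
    """Assign a data block to `replicas` distinct disks.

    Illustrative only: real distributed file systems use more
    sophisticated, rack-aware placement policies.
    """
    # Hash the block ID to pick a deterministic starting disk,
    # then place replicas on consecutive disks.
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % num_disks
    return [(start + i) % num_disks for i in range(replicas)]

# A file split into blocks ends up spread across the cluster's disks.
layout = {b: place_block(b, num_disks=12)
          for b in ["file1-blk0", "file1-blk1", "file1-blk2"]}
```

Because placement is deterministic, any node can recompute where a block lives; and because each block is replicated, losing one disk does not lose the data.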

Clouds are perfect for large-data collaboration and data analytics like MapReduce (a strategy for dividing a problem into hundreds or thousands of smaller problems that are processed in parallel and then gathered, or reduced, into one answer to the original question).
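The divide-process-reduce strategy described above can be sketched in a few lines of Python. The three phases below mirror the MapReduce model conceptually; the chunk texts and function names are illustrative, not Hadoop's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk: str):
    # Each chunk is one of the "smaller problems": emit (word, 1)
    # for every word it contains.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Gather all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce each group to a single answer: the word's total count.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the cloud runs VMs", "the cloud scales", "VMs fail and restart"]
# In a real cluster the map calls run in parallel across many machines.
mapped = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(shuffle(mapped))
```

The canonical example is exactly this word count: hundreds of chunks are mapped independently, and the reduce step gathers the partial results into one answer.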

THE GROWTH IN THE VOLUME of the world’s data is currently outpacing Moore’s Law, which observes that the number of transistors on integrated circuits doubles approximately every two years. If chip innovation is taken as the measure of computing capacity, it is not keeping up with the rate at which data is being created.
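As a rough, back-of-the-envelope illustration (the ten-year horizon is an assumed round number, not a figure from the text): a doubling every two years compounds to only a 32-fold increase over a decade.

```python
years = 10
doublings = years // 2         # Moore's Law: one doubling every two years
chip_growth = 2 ** doublings   # 2**5 = 32x over ten years
```

Any data source whose volume doubles faster than every two years will therefore outrun this curve.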

Cloudera’s Hadoop distribution, dubbed the Cloudera Distribution Including Apache Hadoop (CDH), is an example of a data-management platform that combines a number of components, including support for the Hive and Pig languages; the HBase database for random, real-time read/write access; the Apache ZooKeeper coordination service; the Flume service for collecting and aggregating log and event data; Sqoop for relational database integration; the Mahout library of machine learning algorithms; and the Oozie server-based workflow engine, among others.

The sheer volume of data is not why most customers turn to Hadoop. Instead, it’s the flexibility the platform provides.

Hadoop is just one of the technologies emerging to support Big Data analytics, according to James Kobielus, IBM’s Big Data evangelist. NoSQL, a class of non-relational database-management systems, is often used to characterize key-value stores and other approaches to analytics, much of it focused on unstructured content. New social graph analysis tools are applied to many of the new event-based sources to analyze relationships and enable customer segmentation by degrees of influence. And so-called semantic web analysis (which leverages the Resource Description Framework specification) is critical for many text analytics applications.



At its core, Hadoop is a combination of an open source implementation of Google’s MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is a programming model for processing and generating large data sets; it supports parallel computation on large clusters of commodity, potentially unreliable, machines. HDFS is designed to scale to petabytes of storage and to run on top of the file systems of the underlying operating system. Yahoo released the source code for its internal distribution of Hadoop to developers in 2009.
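The execution style this model enables can be sketched on a single machine, with a thread pool standing in for the cluster. This illustrates the model only, with assumed sample data; it is not Hadoop's API:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> Counter:
    # The map step: each worker processes one chunk independently.
    # Because tasks share no state, a failed task on an unreliable
    # node can simply be rescheduled on another worker.
    return Counter(chunk.lower().split())

chunks = ["hadoop stores blocks", "hadoop schedules tasks",
          "tasks restart on failure"]

# A pool of workers stands in for the cluster's machines.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_words, chunks))

# The reduce step: merge the independent partial counts into one result.
total = sum(partials, Counter())
```

The key property is that the map tasks never communicate with each other, which is what lets the framework retry or relocate them freely when hardware fails.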

“It was essentially a storage engine and a data-processing engine combined,” explains Zedlewski. “But Hadoop today is really a constellation of about 16 to 17 open source projects, all building on top of that original project, extending its usefulness in all kinds of different directions.”