2015-09-23



Google Fires Up Spark, Hadoop Service On Its Cloud.

If Google knows a lot about one thing, it is that developers do not want to spend a lot of time setting up infrastructure just so they can create applications that chew on lots of data to run the business. So goes the thinking behind Cloud Dataproc, Google's new managed big-data service for running Hadoop and Spark on Google's cloud computing platform. The new Google Cloud Dataproc service sits between managing the Spark data processing engine or the Hadoop framework directly on virtual machines and a fully managed service like Cloud Dataflow, which lets you orchestrate your data pipelines on Google's platform.


While many of us admire Google for its systems and datacenter designs, what truly makes Google powerful is its ability to quickly develop new technologies and make them easy for programmers to use. Greg DeMichillie, director of product management for Google Cloud Platform, told me Dataproc users will be able to spin up a Hadoop cluster in under 90 seconds, significantly faster than other services, and that Google will charge only 1 cent per virtual CPU per hour in the cluster. The service is meant to let you create clusters quickly, manage them easily, and keep costs down by letting you turn clusters off when you don't need them. Commercial technology vendors such as Cloudera and Hortonworks are trying to solve this problem for users running these technologies in their own data centers, but the easiest option, for those willing to give up some control over their servers, is simply to have a cloud provider take care of it for them. Cloud Dataproc (short for data processing, of course, a very old-school term for what we are all still doing) is in beta testing now; the timing of the commercial release of the service has not been revealed.
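That create-use-delete workflow can be sketched with the gcloud command-line tool. This is a sketch, not official documentation: the cluster name, zone, and worker count are hypothetical placeholders, and the exact command group and flags may differ between gcloud releases (Dataproc sat under the beta command group at launch).

```shell
# Sketch: spin up, use, and tear down a Dataproc cluster from the CLI.
# "demo-cluster", the zone, and the worker count are hypothetical.
gcloud beta dataproc clusters create demo-cluster \
    --zone us-central1-a \
    --num-workers 2        # small cluster; billed per vCPU per hour

# ...run Spark or Hadoop jobs against the cluster here...

# Turn the cluster off as soon as the work is done to stop the meter.
gcloud beta dataproc clusters delete demo-cluster
```

The point of the delete step is exactly the economics described above: because clusters start in about 90 seconds, there is little reason to keep one running between jobs.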


Google is also touting the integration of Cloud Dataproc with the company's other big-data cloud services, including BigQuery, Cloud Storage, and Cloud Bigtable (a database technology), and the ability to work with Dataproc using standard interfaces. DeMichillie and James Malone, Google's product manager for big-data products, told me Google is able to ensure the service's speed thanks to its network infrastructure, but also because it patched a few Spark issues (related to the open source YARN resource manager the company is using for this product) and built optimized images. The pricing for Cloud Dataproc is sure to get the attention of people who are thinking about setting up their own Hadoop or Spark analytics clusters, which is a pain in the neck and takes experts to maintain.

Google is charging a penny per virtual CPU per hour in the virtual clusters it manages on behalf of Cloud Dataproc users. (This is in addition to the charges for Compute Engine instances, network bandwidth, and storage that customers incur to set up the clusters.) Cloud Dataproc can run on regular reserved and on-demand instances, as well as the new preemptible instances that Google debuted a few weeks ago. Minutes, whether 2 or 30, can make a big difference if you need those resources now, or if you are being billed while machines are still spinning up. Big data workloads are becoming more important with each passing day, especially as trends such as the Internet of Things provide a tangible, viable use case for years' worth of talk about data analysis.
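The surcharge is easy to reason about at that rate. A back-of-the-envelope example, using the quoted 1 cent per vCPU per hour; the cluster shape (five 4-vCPU nodes) and the 2-hour runtime are hypothetical, and Compute Engine instance, network, and storage charges come on top:

```shell
# Dataproc surcharge only: rate ($/vCPU/hour) x vCPUs/node x nodes x hours.
awk 'BEGIN { printf "Dataproc surcharge: $%.2f\n", 0.01 * 4 * 5 * 2 }'
# prints: Dataproc surcharge: $0.40
```

Turning the cluster off after those two hours, rather than leaving it idle overnight, is where the pay-per-hour model pays off.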

Also, because the tool uses the standard Spark and Hadoop distributions, you can expect it to be compatible with virtually any existing Hadoop-based product. So, in a sense, instead of the ephemeral storage that was common in the early days of cloud computing, this can be thought of as ephemeral compute with persistent storage. Google does warn that a Cloud Dataproc cluster is limited, like other Compute Engine resources, to a maximum of 24 CPUs and 240 VM instances, and you have to ask for any capacity above and beyond that. AWS offers the Apache distribution of Hadoop on its virtual clusters, and also allows customers to use the Hadoop distribution from MapR Technologies.
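That compatibility means standard Spark artifacts should submit unchanged. A sketch of submitting the stock SparkPi example to a running cluster; the cluster name and the jar path are hypothetical (the path to the examples jar varies by Spark version and image), and the exact flags may differ between gcloud releases:

```shell
# Submit the standard SparkPi example that ships with the Spark
# distribution to an existing cluster. Names and paths are placeholders.
gcloud beta dataproc jobs submit spark \
    --cluster demo-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/lib/spark-examples.jar \
    -- 1000
```

Because the cluster runs stock Spark on YARN, the same jar you test against a local or on-premises cluster should be the jar you submit here.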

EC2 instances run from 4.4 cents per hour for an m1.small instance to a high of $5.52 per hour for a d2.8xlarge storage-optimized instance; the EMR service adds 1.1 cents to 27 cents per hour on top of these EC2 fees. (Those are on-demand prices in the US East region; you can lower the cost with reserved instances.) Amazon added support for Spark to EMR back in June, and in July revamped EMR with a 4.0 release that includes Hadoop 2.6.0, Spark 1.4.1, Hive 1.0, and Pig 0.14. Microsoft similarly has a Hadoop service on its Azure cloud, called HDInsight, which is based on the Hortonworks Data Platform distribution of that analytics platform.
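Stacking the quoted EC2 and EMR figures gives the all-in hourly cost at the two ends of that range. The numbers below are the article's quoted on-demand US East prices, not a live price list:

```shell
# All-in hourly cost per instance = EC2 rate + EMR surcharge,
# at the low end (m1.small) and high end (d2.8xlarge) quoted above.
awk 'BEGIN {
  printf "m1.small:   $%.3f/hour all-in\n", 0.044 + 0.011
  printf "d2.8xlarge: $%.2f/hour all-in\n",  5.52  + 0.27
}'
# prints:
# m1.small:   $0.055/hour all-in
# d2.8xlarge: $5.79/hour all-in
```

As with Dataproc, the managed-service surcharge is small relative to the underlying instance cost, which is why the comparison between the clouds tends to come down to instance pricing and spin-up time.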
