Hadoop Weekly Issue #12

07 April 2013

Happy 7th birthday to Apache Hadoop! The first release of Hadoop was made in April 2006. This week's newsletter marks that anniversary by covering many parts of the Hadoop ecosystem. It's quite impressive how far the project and the ecosystem have come in those 7 short years!

News

April 2nd marked the 7-year anniversary of the first release of Apache Hadoop. In this post, Doug Cutting (the founder of Hadoop) provides 7 thoughts and predictions for Hadoop. He touches on everything from open source to the name of the project to where he sees Hadoop heading in the next 7 years.

http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/

The folks at LiveRamp have come up with a clever technique to speed up joins/cogroups by filtering map-side using Bloom filters. If you haven't seen Bloom filters before, the post explains their usefulness in this context. With this technique, they see performance improvements of 2x for a large job. They have open-sourced an implementation of this technique for Cascading.

http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
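
For readers who haven't seen the idea before, here is a minimal in-memory sketch of Bloom-filtered joining in plain Python. It is not LiveRamp's Cascading implementation: the names BloomFilter and bloom_join, the filter size, and the number of hashes are all made up for illustration. The point is only that a filter built from the small side's keys can discard most non-matching records from the large side before the (expensive) exact join.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_join(small, large, bloom=None):
    """Join two lists of (key, value) pairs, pre-filtering `large` with a Bloom filter.

    Returns (joined_rows, number_of_large_records_that_survived_the_filter).
    """
    small = list(small)
    if bloom is None:
        bloom = BloomFilter()
    for key, _ in small:
        bloom.add(key)

    # "Map side": drop records from the large input whose key cannot match.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # "Reduce side": exact join on the survivors, which also removes
    # any false positives the filter let through.
    index = {}
    for k, v in small:
        index.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    small = [("u1", "a"), ("u2", "b")]
    large = [(f"u{i}", i) for i in range(1000)]
    rows, kept = bloom_join(small, large)
    print(rows, "records surviving the filter:", kept)
```

In a real MapReduce job the filter would be built in a first pass and shipped to the mappers; the savings come from the records that never have to be shuffled and sorted.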

Videos and slides from Hadoop Summit EU are beginning to arrive online. Hortonworks highlights the keynotes from the event, which include presentations from 451 Research, Hortonworks, and a panel featuring HSBC, eBay and others. You can find many more talks (with more being added every week) on the Hadoop Summit YouTube page, too.

http://hortonworks.com/blog/keynotes-from-hadoop-summit-amsterdam-2013/ http://www.youtube.com/user/HadoopSummit?feature=watch

YARN is the new resource scheduler in Hadoop 2.0 for building applications other than vanilla MapReduce on a Hadoop cluster. Josh Patterson has started a new open-source project called Metronome built upon YARN. The software is based upon former projects IterativeReduce and Knitting Boar, and it provides an implementation of parallel linear regression.

http://www.slideshare.net/jpatanooga/hadoop-summit-eu-2013-parallel-linear-regression-iterativereduce-and-yarn https://github.com/jpatanooga/Metronome

VMware has released version 0.8.0 of Serengeti, their open-source initiative to improve Hadoop for virtualization. This release includes support for the CDH4 and MapR distributions as well as improved support for HBase.

http://blogs.vmware.com/vfabric/2013/04/new-serengeti-release-extends-cloud-computing-support-for-hadoop-community.html

Luigi is an open-source Hadoop framework from the folks at Spotify. We've been using it at Foursquare for a few months and really like it. In this presentation, Elias gives an overview of Luigi as well as the evolution of Spotify's thinking about workflow management, which explains how they arrived at Luigi.

http://www.slideshare.net/EliasFreider/luigi-pydata-presentation https://github.com/spotify/luigi

Datanami has a good summary of Eric Baldeschwieler's (aka Eric14) keynote from Hadoop Summit. The synopsis includes Eric's views on the future of Hadoop, from scaling to 10,000 nodes to lots of younger projects in the Hadoop ecosystem like HCatalog, Ambari, Tez, and more.

http://www.datanami.com/datanami/2013-04-03/baldeschwieler:_looking_at_the_future_of_hadoop.html

Microsoft has posted an in-depth analysis of how they use Hadoop on Azure (HDInsight) with Halo 4. The case study includes everything from analytics and BI to email targeting. It's a pretty interesting and impressive analysis considering that just a few months ago Hadoop didn't run on Windows at all.

http://www.microsoft.com/enterprise/it-trends/big-data/articles/Changing-the-Game-Halo-4-Team-Gets-New-User-Insights-from-Big-Data-in-the-Cloud.aspx#fbid=OAmTkNNsaBu

Falcon is a new Apache Incubator project from the folks at InMobi and Hortonworks focusing on ETL. It has a number of use cases, such as disaster recovery, multi-cluster management, and SLA management. It seems to have some overlap with existing projects (e.g. Oozie or Sqoop) but is focused on just ETL within or between Hadoop clusters so far.

http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/ http://www.inmobi.com/inmobiblog/2013/04/02/inmobi-works-with-hortonworks-to-incubate-falcon-with-apache-software-foundation-to-provide-huge-benefits-to-the-big-data-community/

Did you know that Windows Azure can run Linux VMs? This tutorial shows how to build a Linux (CentOS) Hadoop cluster in Windows Azure. After booting a Windows Server for DNS, the rest of the tutorial focuses on Hadoop (they use HDP 1.2.2) on Linux.

http://blogs.msdn.com/b/benjguin/archive/2013/04/05/how-to-install-hadoop-on-windows-azure-linux-virtual-machines.aspx

April Fools' Day was this week, and there were a few fake Hadoop-related product announcements. Here are a couple in case you missed them.

http://www.hadoopsphere.com/2013/04/yas-1000x-faster-sql-on-hadoop-engine.html http://www.wibidata.com/blog/real-time-is-reckless-slow-and-steady-wins-the-race

Releases

Apache Pig 0.11.1 was released. This update includes fixes to Avro, HCatalog, and HBase integrations (and more) as well as improvements including documentation polish.

http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available

KairosDB is a rewrite of OpenTSDB (the time series database and visualization system from StumbleUpon) with a pluggable backend (it defaults to Cassandra but also supports HBase and H2). KairosDB uses Flot for visualization and provides REST APIs for retrieving data. The release on their website is 1.0.0-alpha-4a, so I assume it's still considered alpha quality.

http://nosql.mypopescu.com/post/47102531877/kairosdb-fast-scalable-time-series-database https://code.google.com/p/kairosdb/

Hama is a computing system on top of HDFS that is specialized for matrix and graph problems. Version 0.6.1, which includes improvements, bug fixes and new features, was released this week.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3CCAGQgZQQ1x3w5tRB3eVs-ZNdsBKGz5Qdwy%3DhW5JOOjcmadOUC6Q%40mail.gmail.com%3E https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%220.6.1%22%20AND%20project%20%3D%20HAMA

Cloudera Manager 4.5.1 was released. This bug fix release contains fixes for HDFS, MapReduce, and Hive.

https://groups.google.com/a/cloudera.org/d/msg/cdh-user/qYgMASROJWQ/ctFfmzE1TF0J http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.1/Cloudera-Manager-Enterprise-Edition-4.5.x-Release-Notes/Cloudera-Manager-Enterprise-Edition-4.html

Cloudera announced the Cloudera Developer Kit. The goal of the project is to make it easier to write applications on top of Hadoop. Unlike other high-level frameworks for Hadoop, CDK is focusing on all layers of the stack, not just MapReduce. For example, the data API is one of the first under development, and it focuses on easing the burden of implementing data integration services which would normally have to muck with the nuances of the HDFS APIs.

https://groups.google.com/a/cloudera.org/d/msg/cdh-user/xJV77baI4Ss/5oPZzcaIe7wJ https://github.com/cloudera/cdk

ElasticSearch Hadoop is a set of libraries for MapReduce, Pig, Hive, and Cascading from the folks at ElasticSearch. They haven't yet made a release, so you have to build it yourself, but it includes a bunch of features. The drivers support reading from and writing to ElasticSearch over REST/JSON, and they've made the binary small and independent for easy integration (just add a single jar!).

https://github.com/elasticsearch/elasticsearch-hadoop

CDH3u6 was released. This version includes a handful of fixes in MapReduce, Flume, and HBase.

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH3/CDH3u6/CDH3-Release-Notes/CDH3-Release-Notes.html
