Data Eng Weekly


Hadoop Weekly Issue #33

01 September 2013

This week's newsletter has a lot of content following up on the Hadoop 2.1.0-beta release from last week -- specifically, lots of discussion of API compatibility and YARN. There were also a number of releases this week from Cloudera (CDH, Impala, ODBC driver) and a new version of Parquet.

Technical

Steve Loughran of Hortonworks has a post with a bunch of interesting commentary on the Hadoop 2.1 beta. It includes some extra comments on the release announcement, an overview of some of the work he's done for this release (including YARN improvements, Markdown support, slf4j 1.7.5, and updates to the s3/s3n implementations), and a walkthrough of his testing process for +1'ing the release (which included bringing up an HBase cluster with Hoya).

http://steveloughran.blogspot.com/2013/08/hadoop-21-beta-elephants-have-come-out.html

Wei Yan writes about his experience interning at Cloudera, where he worked on a simulator for evaluating YARN schedulers as well as infrastructure for testing Kerberos integration by running an embedded Key Distribution Center (KDC). The post covers the technical details of the YARN scheduler simulator (like the metrics it provides and the single-JVM approach), the MiniKDC for Kerberos testing, and what he learned about contributing to the Hadoop ecosystem.

http://blog.cloudera.com/blog/2013/08/what-i-learned-during-my-summer-internship-at-cloudera-part-2/
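
For a sense of what the embedded KDC looks like in a test, here's a rough sketch assuming the org.apache.hadoop.minikdc.MiniKdc API: boot a throwaway KDC, mint a keytab, run Kerberos-enabled test code against it, then shut it down. The directory and principal names here are made up for the example.

    import java.io.File;
    import java.util.Properties;

    import org.apache.hadoop.minikdc.MiniKdc;

    public class MiniKdcSketch {
      public static void main(String[] args) throws Exception {
        File workDir = new File("target/minikdc");   // scratch directory for the KDC's state
        workDir.mkdirs();
        Properties conf = MiniKdc.createConf();      // default realm, port, and debug settings
        MiniKdc kdc = new MiniKdc(conf, workDir);
        kdc.start();
        try {
          // Mint a principal plus keytab that the Kerberos-enabled test can log in with.
          File keytab = new File(workDir, "test.keytab");
          kdc.createPrincipal(keytab, "hdfs/localhost");
          System.out.println("KDC up for realm " + kdc.getRealm());
          // ... run the Kerberos-enabled code under test here ...
        } finally {
          kdc.stop();
        }
      }
    }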

The Cloudera blog has a post on how they maintain and test HBase compatibility between minor CDH version changes. The article discusses the types of incompatibility that can be introduced as well as what procedures they use to verify compatibility -- in particular, automated testing of all versions of clients against all versions of a server (within the same major version) and running JDiff to build a report of changes in the APIs.

http://blog.cloudera.com/blog/2013/08/how-cloudera-ensures-hbase-client-api-compatibility-in-cdh/
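
The cross-version matrix described in the post boils down to compiling a client against one CDH minor version's jars and pointing it at a cluster running another. Here's a minimal, hedged sketch of that kind of client smoke test using the HTable/Put/Get client API of that era; the table and column names are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CompatSmokeTest {
      public static void main(String[] args) throws Exception {
        // Compile against one minor version's client jars; run against a server on another.
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "compat_test");    // hypothetical pre-created test table
        try {
          Put put = new Put(Bytes.toBytes("row1"));
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
          table.put(put);

          Result result = table.get(new Get(Bytes.toBytes("row1")));
          String value = Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q")));
          if (!"v".equals(value)) {
            throw new AssertionError("round trip failed across versions: " + value);
          }
        } finally {
          table.close();
        }
      }
    }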

The Hortonworks blog has a post about the API changes and compatibility between Hadoop 1.x and 2.x. The quick summary is that any code compiled against the o.a.hadoop.mapred package should work without any changes, while the o.a.hadoop.mapreduce package requires a recompile (but should be source compatible), and most other tools (cli scripts, MRv1 examples, Pig scripts, Hive, and Oozie) should work (with a few exceptions).

http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/
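
As a quick reminder of which API is which, here's a minimal sketch of a mapper written against each package; the word-count-style logic is just placeholder code for illustration.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Binary compatible: the old o.a.hadoop.mapred API. Jars compiled against
    // Hadoop 1.x are expected to run on 2.x without a recompile.
    class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable key, Text value,
                      org.apache.hadoop.mapred.OutputCollector<Text, LongWritable> out,
                      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        out.collect(value, new LongWritable(1));
      }
    }

    // Source compatible: the newer o.a.hadoop.mapreduce API. The code shouldn't need
    // changes, but it does need to be recompiled against the Hadoop 2 jars.
    class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));
      }
    }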

The 15th episode of the All Things Hadoop podcast features a talk with Stefan Groschupf, the CEO of Datameer. Datameer offers an analytics and visualization system built on Hadoop.

http://allthingshadoop.com/2013/08/26/big-data-open-source-and-analytics/

High-level frameworks for MapReduce, like Hive, strive to be compatible with various releases of Hadoop -- often everything from 0.20.x through 2.x. And while Hadoop distributions like CDH bundle a set of compatible components, using a newer version of a particular component can cause problems. This post covers running Hive 0.11 on CDH (a patched version of 0.10 is included in CDH4), including all the hoops necessary to fix incompatibilities between Hive 0.11 and CDH4.

http://www.justinjworkman.com/big-data/hive-0-11-0-on-cloudera

Slides were posted from the August Cloudera Impala meetup. The presentation covers plans for Parquet 2.0 to improve Impala performance (new encodings and metadata), the widely anticipated user-defined function (UDF) support in the upcoming Impala 1.2 release, and some details on performance rules and tuning practices for Impala. Worth noting is that Impala's UDF support will be a mix of existing Hive jars and native code (including LLVM-based code generation) -- and Impala 1.3 will further improve support. The presentation has example code if you want to see what the API looks like.

http://www.slideshare.net/cloudera/presentations-25757981
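
For context on the Hive-jar side of that story, a classic Hive UDF is a class extending org.apache.hadoop.hive.ql.exec.UDF with one or more evaluate() methods -- the kind of existing jar Impala 1.2 is expected to be able to load. A hedged sketch (the lower-casing function is just a placeholder, not taken from the slides):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class ToLowerUdf extends UDF {
      // Hive resolves the evaluate() method by argument signature.
      public Text evaluate(Text input) {
        if (input == null) {
          return null;
        }
        return new Text(input.toString().toLowerCase());
      }
    }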

Savanna, the project for running Hadoop on OpenStack, is getting a new "Elastic Data Processing" (EDP) component in the upcoming 0.3 release. EDP has components for data discovery (pulling data from RDBMSes, file systems, or NoSQL databases), job definitions, scheduling/dispatching, and a UI for creating jobs, monitoring, and more. EDP runs outside of Hadoop, and it can run jobs on an existing cluster or spin up a new cluster if enough resources aren't available.

http://www.mirantis.com/blog/savannah-0-3-edp-analytics-as-a-service/

News

Hortonworks CEO Rob Bearden told Reuters that the company plans to go public in the next 15-24 months. The article includes some other details, such as Hortonworks' forecast of $100 million in revenue for 2014 as well as profitability.

http://www.reuters.com/article/2013/08/28/net-us-hortonworks-ipo-idUSBRE97R19M20130828

Releases

Impala 1.1.1 was released. The release includes several performance improvements, Hive compatibility enhancements, and bug fixes. It also adds security auditing and better compatibility with Hive when using Parquet files.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Impala-1-1-1/m-p/1127#U1127

CDH 4.3.1 was released. In addition to including maintenance fixes/enhancements to Hadoop, HDFS, HBase, MapReduce, and Oozie, this release fixes the recently announced CVE affecting Hadoop RPC.

http://community.cloudera.com/t5/Release-Announcements/Announcing-CDH-4-3-1/m-p/979#U979

The Cloudera ODBC 2.5 driver for Apache Hive and Cloudera Impala was released with broader ODBC API support, compatibility with Mac OS and Windows XP, and improved performance.

http://community.cloudera.com/t5/Release-Announcements/Announcing-ODBC-2-5-Driver-for-Apache-Hive-and-Cloudera-Impala/m-p/1129#U1129

Parquet 1.1.1 was released with a number of bug fixes, improvements in the Hive SerDe (fixes for short and byte types), and dictionary encoding for non-string types.

https://github.com/Parquet/parquet-mr/blob/master/CHANGES.md

Cassandra 1.2.9 was released with a number of fixes, including several related to the Cassandra-MapReduce integration.

https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-1.2.9

Ganitha is a new library for machine learning and statistics built on Scalding. The initial release provides an integration between Scalding and Apache Mahout as well as an implementation of K-Means clustering.

https://github.com/tresata/ganitha

Events

Curated by Mortar Data ( http://www.mortardata.com )

Monday, September 2
Parallel Iterative Machine Learning on Hadoop and YARN (Santa Monica, CA)
http://www.meetup.com/LA-Machine-Learning/events/128624772/

Tuesday, September 3
Introduction to Summingbird (San Francisco, CA)
http://www.meetup.com/Bay-Area-Storm-Users/events/135403842/

Wednesday, September 4
Hadoop Based Machine Learning (Austin, TX)
http://www.meetup.com/Austin-ACM-SIGKDD/events/136676422/

Thursday, September 5
Workflow Engines for Hadoop (New York, NY)
http://www.meetup.com/NYC-Data-Engineering/events/136300272/

Saturday, September 7
Hadoop Meetup (Bangalore, India)
http://www.meetup.com/Bangalore-Baby-Hadoop-group/events/132976612/