Data Eng Weekly


Hadoop Weekly Issue #15

28 April 2013

The last full week of April was pretty busy for the Hadoop ecosystem -- two core projects (Hadoop and HBase) saw releases, there was some exciting funding news (congrats to Qubole!), and there were plenty of interesting technical articles. This also marks the 15th issue of Hadoop Weekly -- thanks to everyone who has spread the word!

Technical

The naming of components in Hadoop-related projects has often caused confusion (e.g. HDFS's secondary namenode). Apache HBase is no exception -- the HMaster is often misunderstood because, contrary to what its name suggests, not all writes go through it. This article elaborates on the role of the HMaster in HBase.

https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master

Tachyon is the in-memory distributed file system from Berkeley's AMPLab that recently had its initial release. This article provides more details about the system, including how it might be good for MapReduce jobs and fit into the Hadoop ecosystem.

http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html

"Meet the Project Founder" is a new blog series from Cloudera. Their first story features Doug Cutting, the founder of Hadoop and Cloudera's Chief Architect. Doug is incredibly prolific in open-source -- he's started the Apache Lucene, Nutch, Hadoop, and Avro projects.

http://blog.cloudera.com/blog/2013/04/meet-the-project-founder-doug-cutting-first-in-a-series/

Gravity migrated from MySQL to HBase as their primary data store. In addition to talking about their use-case of both online and batch processing via MapReduce, this article speaks to the recent improvements in the ease of deployment of HBase and the Hadoop stack, and how it's changing the data storage landscape.

http://gigaom.com/2013/04/22/how-hbase-converted-myspaces-mysql-champion-and-is-driving-hadoop-mainstream/

Apache MRQL (incubating) is another low-latency analytics solution on HDFS. Unlike other systems, it's SQL-like but not SQL, and it can take advantage of Hama's Bulk Synchronous Parallel (BSP) model. These differences make it possible to do iterative processing, e.g. computing k-means (of which there is an example in this post).

http://www.hadoopsphere.com/2013/04/mrql-sql-on-hadoop-miracle.html
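To illustrate why k-means counts as iterative processing, here is a minimal sketch in plain Python. It is not MRQL and is not taken from the linked post; the function name `kmeans_1d`, the one-dimensional data, and the starting centroids are all assumptions made for illustration. Each loop iteration is a full pass over the data, which is the pattern that is costly to express as a chain of separate MapReduce jobs.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Hypothetical helper: cluster 1-D points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(clusters[i]) / len(clusters[i]) if clusters[i] else centroids[i]
            for i in range(len(centroids))
        ]

        # Stop early once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(kmeans_1d(data, [0.0, 5.0]))
```

Because every iteration depends on the centroids produced by the previous one, a system that keeps state between passes avoids re-launching a job and re-reading input for each step.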

In the second part of his series on Dr. Dobb's, Tom White gives a thorough walkthrough of writing your first MapReduce job. He covers the classic "hello world" of MapReduce -- word count.

http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197/
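For readers who haven't seen word count before, the following is a rough in-memory sketch of the idea in plain Python. It does not use the Hadoop API and is not the code from the article; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen to mirror the map, shuffle, and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello world", "Hello Hadoop"]))
```

In a real job the mapper and reducer run on different machines and the shuffle moves data over the network; this sketch only shows the data flow.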

Since entering the Hadoop game, VMware has been improving Hadoop on virtualized hardware. This article covers 7 myths related to Hadoop -- some related to virtualization (and probably controversial) and others more generally applicable.

http://blogs.vmware.com/vfabric/2013/04/myths-about-running-hadoop-in-a-virtualized-environment.html

It can be difficult to keep up with all the SQL-on-Hadoop solutions (it seems like there is a new one each week!). This article covers four of them -- Impala, Hadapt, HAWQ, and the Berkeley Data Analytics Stack (BDAS) -- including the trade-offs you make when selecting one or the other (and importantly, the maturity of the product).

http://www.openbi.com/content/sql-hadoop

This article covers converting the Hortonworks Sandbox virtual machine image to a Rackspace instance. It's a pretty interesting idea, and the process appears to be quite easy.

http://devops.rackspace.com/getting-started-with-hadoop-using-hortonworks-sandbox.html

Based upon a Dell whitepaper, Hortonworks has highlighted six important hardware decisions for designing a Hadoop cluster -- from the operating system to storage to network.

http://hortonworks.com/blog/6-key-hardware-considerations-for-deploying-hadoop-in-your-environment/

These slides give a good technical overview of the HAWQ (Greenplum's SQL-on-HDFS solution) architecture (starting on page 21) as well as the features of Spring's Hadoop integration (starting on page 37).

http://www.slideshare.net/marklpollack/pivotal-hd-and-spring

On the WANdisco blog, Konstantin Boudnik provides an interesting analysis of the Hadoop 2.0-alpha series, which, as he notes, is on its 5th release in the past 11 months.

http://blogs.wandisco.com/2013/04/22/hadoop-2-alpha-elephant-or-not/

In the paper, "Nobody ever got fired for using Hadoop on a cluster" (no link, due to copyright restrictions), the authors observe that while MapReduce is great for many tasks, there are a growing number of situations (mostly due to the dropping price of memory) in which data can fit in RAM on a single machine.

When getting started with Hadoop, it can be a challenge just to decide which distribution to use -- it seems like each week a new vendor is announcing their new distribution. This fragmentation has been a source of discussion recently, and this article speaks a bit about that and also celebrates Apache Bigtop as the system making all of these releases possible.

http://blogs.wandisco.com/2013/04/22/on-coming-fragmentation-of-hadoop-platform/

News

Qubole, the Hadoop-as-a-Service startup from a team of former Facebook employees, has raised $7 million in Series A financing. It's fantastic to see a vote of confidence in a company that's lowering the barrier to entry of Hadoop.

http://gigaom.com/2013/04/23/hadoop-startup-qubole-raises-7m-for-hive-as-a-service/

Cloudera's HUE has some interesting new features in the recent 2.3 release. They include Oozie improvements and a new Pig Editor.

http://blog.cloudera.com/blog/2013/04/whats-new-in-hue-2-3/

Spring XD is a new project from the folks at SpringSource focussing on tools for data ingestion, real-time analytics, workflow management, and data export.

http://blog.springsource.org/2013/04/23/introducing-spring-xd/

Sematext's Performance Monitoring (SPM) suite is adding support for Hadoop. SPM is a proactive monitoring tool that can be self-hosted or used as a service.

http://blog.sematext.com/2013/04/23/sneak-peek-hadoop-monitoring-comes-to-spm/

Releases

HBase 0.94.7 was released and is the new stable version. The release contains performance improvements and bug fixes.

http://mail-archives.apache.org/mod_mbox/hbase-user/201304.mbox/%3C1366934770.31322.YahooMailNeo%40web140601.mail.bf1.yahoo.com%3E

Hadoop 2.0.4-alpha was released. This is intended to be the final alpha release, with a beta following up soon.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3C5B011941-90CA-4EEF-BAB9-39A6BFE99B1D%40hortonworks.com%3E

KijiSchema and KijiMR, the HBase schema management software and accompanying MapReduce libraries, were updated to versions 1.0.2 and rc61, respectively. These releases include bug fixes and improvements.

http://www.kiji.org/2013/04/22/announcing-kijischema-1-0-2-and-kijimr-rc61/

The MySQL project has announced a new product (currently in MySQL Labs) called the MySQL Hadoop Applier for replaying the MySQL binlog onto a file in HDFS. Notably, it currently only supports INSERT commands, but data can be inserted into HDFS in near-real time.

http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html

http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-1.html

Cloudera released CDH 4.2.1, which includes a number of improvements and bug fixes.

https://groups.google.com/a/cloudera.org/d/msg/cdh-user/yM2tng8-kqI/LinM89A4vhUJ

Events

Curated by Mortar Data

Tuesday, April 30 Big Data Jobs in London Meetup (London, UK) http://www.meetup.com/Big-Data-Jobs-in-London/events/110496712/

Tuesday, April 30 Improving Hive; MapR Hbase M7 (Washington D.C.) http://www.meetup.com/Hadoop-DC/events/114264532/

Tuesday, April 30 Cloudera - Enterprise Big Data Platform (Hamilton Township, NJ) http://www.meetup.com/nj-hadoop/events/113996532/

Wednesday, May 1 Hadoop Hackathon! (Houston, TX) http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/102954462/

Thursday, May 2 Big Data, Data Science, and Hadoop (San Francisco, CA) http://www.meetup.com/San-Francisco-Bay-Area-Microsoft-BI-User-Group/events/114347422/

Thursday, May 2 Data Science for Sustainability (Redwood City, CA) http://www.meetup.com/Data-Science-for-Sustainability/events/113231972/

Saturday, May 4 Accumulo Hackathon (Washington D.C.) http://www.meetup.com/Hadoop-DC/events/112435332/