Data Eng Weekly


Hadoop Weekly Issue #57

16 February 2014

There were a lot of product releases and announcements in the ecosystem this week as folks met in Santa Clara for StrataConf. Among the highlights were announcements from MapR related to their distribution and a partnership with HP, a new beta of Cloudera’s CDH5, and the public preview of Hadoop 2 on Windows Azure. In addition, there are a number of interesting technical articles about HBase, MapReduce v2, Pig, Hadoop security, and more. Congrats to folks on all the releases, partnerships, and great articles. Also, a big congrats to Splice Machine for raising a new round of financing.

Technical

The Cloudera blog has an interesting technical post about performance in MRv2. The post describes some of the major revamps that took place in MRv2, and it describes some performance regressions found by running the same jobs on both MRv1 and MRv2. The post walks through the low-level debugging that was done to identify the root cause of two of the issues, and it explains the fixes that were made. It’s a pretty technical overview including discussion of the perf tool, CPU cache latency, fadvise, and more.

http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/

Understanding the HBase memory model, in particular how it caches data, is an important part of tuning an HBase deployment. This post walks through the two main parts of memory that HBase manages: the MemStore and the BlockCache. It focuses on the implementation of the BlockCache, explains how the BlockCache speeds up queries, and gives a tour of the three BlockCache implementations shipped with HBase. The post is also annotated with in-depth technical details.

http://www.n10k.com/blog/blockcache-101/
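
To make the caching idea concrete, here is a minimal sketch of a least-recently-used read cache in Python. It is not HBase code: the class name, the `load_block` callback, and the block ids are all made up for illustration, and the LRU-style cache that ships with HBase is considerably more elaborate. It only shows the basic mechanism of serving repeat reads from memory and evicting the coldest block when the cache is full.

```python
from collections import OrderedDict


class TinyBlockCache:
    """Hypothetical LRU read cache; a simplification, not HBase's implementation."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity      # maximum number of blocks held in memory
        self.load_block = load_block  # stand-in for an expensive read from disk
        self.blocks = OrderedDict()   # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            # Cache hit: serve from memory and mark the block as recently used.
            self.hits += 1
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: load the block, keep it, and evict the coldest if over capacity.
        self.misses += 1
        data = self.load_block(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return data


cache = TinyBlockCache(capacity=2, load_block=lambda block_id: f"data-{block_id}")
for block_id in ["b1", "b2", "b1", "b3", "b2"]:
    cache.get(block_id)

print(cache.hits, cache.misses, list(cache.blocks))
```

In this toy run only the second read of `b1` is a hit. Reading `b3` evicts `b2`, so the final read of `b2` goes back to the loader, which is the kind of churn that cache sizing is meant to avoid.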

HiveServer2 is the latest and greatest way to interact with Hive. The new service provides JDBC and ODBC, and the new Hive CLI client, beeline, connects to HiveServer2 via JDBC. Beeline introduces a number of changes (vs. the Hive CLI) across several CLI operations: specifying a connection, running in embedded mode, variable handling, and more. A post on the Cloudera blog has a detailed overview of the changes in beeline, which is essential knowledge for anyone looking to migrate.

http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/

Rounding out a trifecta of interesting technical posts this week, Cloudera elaborates on their reference architecture for running CDH in AWS. The post is a FAQ covering areas of the AWS deployment model such as VPC, security groups, subnets, instance types, and EBS. From personal experience, I can attest that a lot of the recommendations in this post ring true and are very valuable advice.

http://blog.cloudera.com/blog/2014/02/best-practices-for-deploying-cloudera-enterprise-on-amazon-web-services/

Apache Pig gained new functionality to compute CUBEs and ROLLUPs in version 0.11. Many data scientists and engineers working with Hadoop might not be familiar with these primitives, but they are pretty common in the data warehousing world. This walkthrough is a great introduction to CUBE/ROLLUP, which is illustrated by real examples in PigLatin.

http://joshualande.com/2014/02/11/cube-rollup-pig-data-science/
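
For readers who haven't met these primitives, here is a small sketch of the idea in plain Python rather than PigLatin. The rows, dimension names, and helper functions are invented for illustration and are not taken from the linked post. CUBE aggregates over every subset of the grouping dimensions, while ROLLUP aggregates only over the hierarchical prefixes; a rolled-up dimension is shown as `None` here.

```python
from collections import defaultdict
from itertools import combinations


def cube_sets(dims):
    """CUBE: every subset of the dimensions (2**n grouping sets)."""
    return [set(c) for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def rollup_sets(dims):
    """ROLLUP: only the hierarchical prefixes (n + 1 grouping sets)."""
    return [set(dims[:i]) for i in range(len(dims), -1, -1)]


def aggregate(rows, dims, measure, grouping_sets):
    """Sum `measure` per grouping set; rolled-up dimensions become None."""
    totals = defaultdict(int)
    for row in rows:
        for keep in grouping_sets:
            key = tuple(row[d] if d in keep else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)


# Made-up sample data.
rows = [
    {"region": "east", "product": "a", "sales": 10},
    {"region": "east", "product": "b", "sales": 5},
    {"region": "west", "product": "a", "sales": 7},
]
dims = ["region", "product"]

cube = aggregate(rows, dims, "sales", cube_sets(dims))
rollup = aggregate(rows, dims, "sales", rollup_sets(dims))

for key in sorted(cube, key=str):
    print(key, cube[key], "(also in rollup)" if key in rollup else "")
```

The cube contains per-product totals across all regions, such as `(None, "a")`. The rollup does not, because it only collapses dimensions from right to left: (region, product), then (region), then the grand total.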

The latest release of Parkour, the Clojure library for Hadoop, includes support for running Hadoop MapReduce jobs via the Clojure REPL. This tutorial walks through configuring an AWS Elastic MapReduce cluster to run queries from the Parkour REPL. The tutorial implements some non-trivial MR jobs on the Google Book n-gram corpus, and it also includes an example of writing tests in Parkour.

http://blog.platypope.org/2014/2/8/interactive-hadoop-with-parkour/

AMPLab’s Big Data Benchmark has been updated to include new versions of Impala, Hive (including Hive on Tez), and Shark. The results continue to show impressive numbers from Redshift, Impala, and Shark, with Hive on Tez gaining ground. I’d suggest taking the results with a grain of salt, though, since they’re only targeting a single dataset and set of queries. But the benchmark is open-source, and you could use the scripts to recreate the evaluation with your own dataset and queries.

https://amplab.cs.berkeley.edu/benchmark/

There are a number of companion tutorials to the Hortonworks Sandbox, a single-node Hadoop cluster VM. The RHadoop framework for running R on Hadoop is covered in a recently-contributed community tutorial. The tutorial shows how to use RStudio to run a MapReduce job that builds a model to predict visitors to a website based upon historic web logs.

https://github.com/hortonworks/hadoop-tutorials/blob/master/Community/T01_RHadoop_visitors_prediction.md

The Gartner blog has a post recapping a recent presentation by Square on encrypting data at rest in Hadoop. Square stores both redacted data and encrypted protobuf-serialized data, which fulfill 80% and 20% of their Hadoop workload, respectively. This is one of the first home-grown encryption systems that I’ve heard of (although I suspect more folks are doing it). Work is in progress to bring similar functionality to Hadoop and HBase (Intel’s distribution has it already), but some folks obviously can’t wait for that to land.

http://blogs.gartner.com/nick-heudecker/how-square-secured-your-data-in-hadoop/

News

MapR and HP have announced a partnership to bring HP Vertica to MapR’s distribution. Vertica, which is an MPP SQL engine, runs directly on the MapR file system and alongside MapR compute resources. Unlike other variants of SQL-on-Hadoop, it doesn’t seem that Vertica will tightly integrate with the ecosystem (e.g. it won’t read Hadoop file formats or use the Hive metastore), but Vertica is a much more mature system than anything else in the SQL-on-Hadoop realm.

http://www.mapr.com/blog/hp-vertica-mapr-%E2%80%93-enterprise-sql-hadoop-option-sophisticated-bi-needs

Red Hat and Hortonworks announced that they’ve expanded their partnership. The joint initiative includes support for the Red Hat Storage file system, the RHEL OpenStack Platform, and Red Hat JBoss Data Virtualization, and it further integrates the two companies’ support teams.

http://www.cio.com/article/748045/Red_Hat_and_Hortonworks_Expand_Strategic_Big_Data_Alliance

Slides and some videos from this week’s StrataConf have been posted online. There are a number of talks about Hadoop and related technologies from folks at Cloudera, MapR, Silicon Valley Data Science, and more. Forbes has a quick rundown of some of the highlights of the conference.

http://strataconf.com/strata2014/public/schedule/proceedings http://www.forbes.com/sites/danwoods/2014/02/13/the-buyers-arrive-a-round-up-of-strata-2014-in-santa-clara/

There were a lot of partnerships and announcements in conjunction with StrataConf this week. GigaOm has a good wrap-up of the news including announcements from DataStax, a new tool from Alpine Data Labs, additional vendor support for Storm, and a patent award to Zettaset.

http://gigaom.com/2014/02/12/this-week-in-big-data-clouds-collaboration-and-cassandra/

Splice Machine, which has built a transactional SQL engine atop HBase, has raised $15 million in Series B funding. According to the announcement, Splice Machine will be offering a public beta in Q1 2014, which the company says has much better price/performance than Oracle databases.

http://www.splicemachine.com/splice-machine-raises-15m-series-b-to-power-real-time-big-data-applications/

WibiData, the company behind the open-source Kiji Framework, has announced a partnership with DataStax to bring Kiji to Cassandra. Kiji, which provides a so-called entity-centric API, currently supports HBase for data storage. Adding Cassandra will bring support for two of the most-deployed column-family databases in the Hadoop ecosystem. The announcement suggests support for KijiSchema and KijiMR will be released within a few weeks.

http://www.kiji.org/2014/02/11/working-on-cassandra-integration-with-datastax/

GigaOm recently hosted Cloudera CSO and co-founder Mike Olson on the Structure Show podcast. They’ve extracted five key updates on the Hadoop landscape from the ideas discussed in that show. The ideas include “At least part of the database market is safe” and “MapReduce will fade away as innovation flourishes.”

http://gigaom.com/2014/02/15/5-things-everyone-should-know-about-hadoop/

High-Performance Computing (HPC) clusters and Hadoop clusters are typically built with vastly different goals in mind. As a result, the underlying hardware and network topology tend to be very different (HPC often uses expensive, proprietary components whereas Hadoop uses commodity hardware). But there’s an interesting trend of running Hadoop on HPC. For example, the San Diego Supercomputer Center now supports launching a “personal Hadoop cluster” on Gordon, the world’s #88 HPC system. There are many more details on the trend and the implementation in two articles on HPCWire.com.

http://www.hpcwire.com/2014/02/11/hpc-hacking-hadoop/ http://www.hpcwire.com/2014/02/14/adapting-hadoop-hpc-environments/

GCN covers another place that Hadoop is gaining traction—inside of the US Government. In addition to migrations from NAS or SAN to HDFS, Hadoop enables cheaper and simpler network topologies.

http://gcn.com/Articles/2014/02/14/big-data-data-centers.aspx?Page=1

Releases

Cloudera announced the second beta of CDH5. The updated beta includes lots of new features, including HDFS Caching, NFS Gateway, SSL encryption for Hive on non-Kerberos clusters, and native Parquet support in Hive.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-Beta-2/m-p/5979

MapR announced support for YARN and Hadoop 2.x. In an introductory blog post, MapR describes their philosophy for supporting the new technology—allowing customers to use either MRv1 or MRv2/YARN. They also support both simultaneously (as well as other technologies like Storm) on the same cluster.

http://www.mapr.com/blog/take-charge-hadoop-2x-and-yarn

Windows Azure announced a public preview of Hadoop 2.2 in their HDInsight Hadoop-as-a-Service offering. In the announcement, Microsoft promotes the benefits of YARN, describes some of their work on the Stinger initiative, and highlights some example uses of HDInsight.

http://blogs.technet.com/b/dataplatforminsider/archive/2014/02/14/windows-azure-hdinsight-supporting-hadoop-2-2-in-public-preview.aspx

WANdisco has announced a new product called “Non-Stop HBase.” The product extends HBase to replicate regions in memory to improve latency in case of a RegionServer failure. The implementation, like their Non-Stop Hadoop implementation, uses a patented technology for which the implementation details aren’t public. But the claims of both consistency and continuous availability have raised some eyebrows.

http://siliconangle.com/blog/2014/02/13/hbase-activeactive-replication-technology-consistency-and-continuous-availability-bigdatasv/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Building Hadoop Data Applications with Kite (Palo Alto) - Tuesday, February 18
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/164304642/

Samza: Reliable Stream Processing atop Apache Kafka & YARN by Sriram S./Linkedin (Los Angeles) - Tuesday, February 18
http://www.meetup.com/LA-HUG/events/160628632/

43rd Bay Area Hadoop User Group (HUG) Monthly Meetup - An Evening on Apache Tez (Sunnyvale) - Wednesday, February 19
http://www.meetup.com/hadoop/events/116895522/

February SF Hadoop Users Meetup (San Francisco) - Wednesday, February 19
http://www.meetup.com/hadoopsf/events/164267012/

Texas

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, February 19
http://www.meetup.com/Austin-ACM-SIGKDD/events/160803822/

Missouri

St. Louis Hadoop Users Group Meetup (Saint Louis) - Tuesday, February 18
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/154751152/

Illinois

Save the Date For Dean Wampler's Talk - Wednesday, February 19
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

Georgia

Parkour: Hadoop MapReduce in idiomatic Clojure (Atlanta) - Tuesday, February 18
http://www.meetup.com/Atl-Clj/events/161519062/

Pennsylvania

February Meetup (Pittsburgh) - Wednesday, February 19
http://www.meetup.com/HUG-Pittsburgh/events/159885652/

New York

Unlock your Hadoop Data with Apache Spark (New York) - Monday, February 17
http://www.meetup.com/Hadoop-NYC/events/164264122/

Hadoop 2 with YARN, and Tez (New York) - Tuesday, February 18
http://www.meetup.com/Hadoop-NYC/events/163481002/

North Carolina

Setting up a Hadoop Cluster on CentOS (Durham) - Saturday, February 22
http://www.meetup.com/LinuxUserGroup/events/162748222/

CANADA

Monthly Solution Architect Scrum (Toronto) - Thursday, February 20
http://www.meetup.com/TorontoHUG/events/161727402/

ENGLAND

HBase London - Feb meetup @Cloudera (London) - Monday, February 17
http://www.meetup.com/HBase-London/events/154461262/

February Hadoop Meetup: Hadoop-as-a-Service & Zookeeper (London) - Tuesday, February 18
http://www.meetup.com/hadoop-users-group-uk/events/164182592/

POLAND

Spark! (Krakow) - Thursday, February 20
http://www.meetup.com/datakrk/events/163657932/

INDIA

Test your Hadoop Knowledge (Hyderabad) - Monday, February 17
http://www.meetup.com/hyderabad-hadoop/events/166029912/