Data Eng Weekly


Issue #5

17 February 2013

Welcome to the fifth edition of Hadoop Weekly! With ApacheCon NA and StrataConf coming up in just under two weeks, I think it's the calm before the early-spring storm -- making this week's issue a bit shorter than usual. With that said, there are a bunch of interesting articles and releases to share. Hope you enjoy!

Tech News and Posts

Qubole offers a SaaS analytics platform running on AWS. In this post, they detail their Sqoop service. Apache Sqoop is a project for transferring data between a database and a Hadoop-supported filesystem. Qubole's post describes some of the enhancements that they've made to Sqoop and also previews the UI they offer for interacting with their product. Large-scale ETL like this is really common with Hadoop, and it's interesting that companies are now offering SaaS solutions to the problem.

http://www.qubole.com/blog/index.php/sqoop-to-s3-as-service-in-aws-cloud/

Xconomy has a glossary of all the big-data systems and software to come out of Facebook. It might be useful to have this handy if you read (or reread) last week's Wired.com article about Facebook's data infrastructure team.

http://www.xconomy.com/san-francisco/2013/02/14/big-data-at-facebook-a-glossary/

Until very recently, the HDFS NameNode has been a single point of failure (SPOF) in the Hadoop stack. In practice, this hasn't been as big a problem as you might think, but it's still been an issue, particularly for downstream systems like HBase that are used to serve real-time data. Companies like Facebook have been running forked versions of HDFS to remove the SPOF, but full support has been added to Apache Hadoop over the past year. In this presentation, Todd Lipcon details the motivation for and evolution of HDFS's HA support.

http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum

MongoDB supports MapReduce, but writing low-level MapReduce code usually isn't a good use of time. The folks at MortarData detail the why and how of using Pig to analyze data from MongoDB, and they provide a working demo using their Mortar product.

http://blog.mortardata.com/post/43080668046/mongodb-hadoop-why-how

Apache ZooKeeper provides several primitives that facilitate things like leader election, service discovery, consensus, and other hard problems in distributed systems. This article is a great overview of ZooKeeper and a good technical discussion of how to use it to solve some tough problems in distributed computing.

http://blog.cloudera.com/blog/2013/02/how-to-use-apache-zookeeper-to-build-distributed-apps-and-why/

eWEEK published a set of 10 predictions from MapR CEO and co-founder John Schroeder. Some of the predictions aren't particularly earth-shattering (e.g. that there will continue to be a talent shortage), but some of his other predictions are looking a few miles into the future -- including the rise of HBase for blob storage and lightweight OLTP.

http://eweek.com/database/slideshows/hadoop-emerging-as-dominant-big-data-analytics-platform-10-reasons-why/

Scalding is the Scala wrapper for Cascading from the folks at Twitter. Dean Wampler, who has literally written the book on Scala and Hive, does a great job motivating higher-level MapReduce frameworks, and then he gives an overview of Scalding. He touches on advanced topics such as Matrix operations and also compares Scalding with Hive and Pig.

http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf

Cassandra 1.2 introduced virtual nodes (vnodes) to ease node loss/gain, improve load distribution, and better support heterogeneous clusters. Vnodes are a common paradigm in distributed systems, and I was actually a bit surprised that they weren't already in Cassandra. In any case, this presentation is a good overview of vnodes -- both in general and in terms of Cassandra's implementation.

https://www.slideshare.net/slideshow/embed_code/16561245
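
If you haven't run into vnodes before, here's a minimal Python sketch of the general idea: a consistent-hash ring where each physical node claims many positions instead of one. This is not Cassandra's implementation. The `Ring` class, the MD5-based `token` function, the `weight` parameter, and the default of 64 vnodes per node are all assumptions made up for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is an arbitrary choice for this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring in which each physical node owns many tokens."""

    def __init__(self, vnodes_per_node: int = 64):
        self.vnodes_per_node = vnodes_per_node
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node name

    def add_node(self, name: str, weight: int = 1) -> None:
        # A beefier machine can simply claim more vnodes (hypothetical 'weight').
        for i in range(self.vnodes_per_node * weight):
            t = token(f"{name}#{i}")
            self._owner[t] = name
            bisect.insort(self._tokens, t)

    def remove_node(self, name: str) -> None:
        # Drop every token the node owned; its ranges fall to the next tokens clockwise.
        self._tokens = [t for t in self._tokens if self._owner[t] != name]
        self._owner = {t: n for t, n in self._owner.items() if n != name}

    def lookup(self, key: str) -> str:
        # The owner is the first vnode clockwise from the key, wrapping around.
        i = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]
```

Because a node's tokens are scattered around the ring, removing that node hands its keys to many different neighbors rather than dumping them all on one, and keys owned by the surviving nodes don't move at all. Giving a node a larger `weight` is one simple way to model the heterogeneous-cluster case.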

Releases

Apache Hadoop 2.0.3-alpha was released this week! The release contains a bunch of neat new features, but the biggest highlights are the Quorum-based Journal Manager for HDFS HA (see Todd's presentation above for details) and the addition of CPU to resource scheduling in YARN. The "alpha" label of 2.0.3 has more to do with the stability of APIs than anything else, according to the developers, and they hope to have a beta out in a few months.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201302.mbox/%3C83B6DC82-1506-47D2-8A1F-87B49740773E%40hortonworks.com%3E
http://hortonworks.com/blog/announcing-apache-hadoop-2-0-3-release-and-roadmap/

Amazon Redshift, the Amazon Web Services solution for low-latency SQL on big data, is now generally available. For folks running Elastic MapReduce or Hadoop in EC2, Amazon Redshift makes it easy to load data from S3 into an MPP database in order to run low-latency queries.

http://aws.typepad.com/aws/2013/02/amazon-redshift-now-broadly-available.html

Cloudera Manager 4.1.4 was released. This version fixes a high-priority bug that caused all items moved to the trash to be deleted after 1 minute, regardless of the user settings.

https://ccp.cloudera.com/display/ENT41DOC/Cloudera+Manager+4.1.x+Release+Notes

Events

Structure Data
March 20-21, 2013
New York, NY
http://event.gigaom.com/structuredata/