Hadoop Weekly Issue #106

01 February 2015

The two main topics in this week’s newsletter both gained traction in 2014 and will likely be major topics for 2015: Apache Spark and security. On the security front, Cloudera has two posts this week and Hortonworks announced a new data governance initiative. Coverage of Spark includes an interview with Databricks co-founder Ion Stoica and a technical post on streaming k-means.

Technical

K Young, the CEO of Mortar (disclosure: Mortar sponsors the events section of this newsletter), has a post on data lakes, data pipelines, and data directories. Although data lakes are a hot topic right now, K argues that it's better to invest in data pipelines, and he discusses how Luigi is a good solution for building a pipeline.

https://www.linkedin.com/pulse/hows-data-lake-k-young

The Cloudera blog has a post about a new integration for CDH 5.3 between Sentry (the role-based access control layer for Hive) and HDFS ACLs. The post looks at how the integration allows Sqoop and Sentry to co-exist for the first time.

http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-apache-sentry-integration-with-hdfs/

Hortonworks has the third post in a series on predicting airline delays with Hadoop. This post looks at using Scalding and R (previous posts covered Spark and Pig). Like the previous posts, there's an IPython notebook that walks through all the individual steps.

http://hortonworks.com/blog/data-science-hadoop-predicting-airline-delays-part-3/

The Hortonworks blog has a post summarizing some recent improvements to YARN that are part of HDP 2.2. Topics include: support for long-running applications (Apache Slider), new types of resource management (CPU in addition to RAM slots, node labeling), and improvements to operational support (including rolling upgrades).

http://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-substantial-step-forward-enterprise-hadoop/

"The Morning Paper" is a blog that recaps various computer science papers. This week, it looked at the ZooKeeper paper from 2010. It’s a good overview that serves as supplemental reading material or a refresher if it’s been a while since you read it.

http://blog.acolyer.org/2015/01/27/zookeeper-wait-free-coordination-for-internet-scale-systems/

Spark 1.2 introduced a streaming implementation of k-means with the ability to dynamically detect (and remove) clusters over time. The key to this feature is forgetfulness, which is implemented as a half-life parameter to decay old data. The Databricks blog has a post with more details on the algorithm, including several visualizations of it in action.

http://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
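
To make the forgetfulness idea concrete, here is a minimal pure-Python sketch of one way a decayed center update can work. It is not Spark's code: the function names are hypothetical, and it assumes the half-life is measured in batches, so that a cluster's accumulated weight is multiplied by a decay factor once per batch before new points are blended in.

```python
def decay_from_half_life(half_life):
    """Per-batch decay factor a, chosen so that a ** half_life == 0.5."""
    return 0.5 ** (1.0 / half_life)


def update_center(center, weight, batch_points, decay):
    """Blend one cluster center with the points newly assigned to it.

    center       -- current center, a list of floats
    weight       -- effective number of points the center already represents
    batch_points -- new points assigned to this cluster in the current batch
    decay        -- factor in (0, 1]; 1.0 means nothing is ever forgotten
    """
    discounted = weight * decay  # old data counts for less on every batch
    m = len(batch_points)
    if m == 0:
        # No new points: the center stays put but its weight keeps shrinking.
        return center, discounted
    dims = len(center)
    batch_mean = [sum(p[d] for p in batch_points) / m for d in range(dims)]
    new_weight = discounted + m
    new_center = [
        (center[d] * discounted + batch_mean[d] * m) / new_weight
        for d in range(dims)
    ]
    return new_center, new_weight
```

With a decay of 1.0 this reduces to an ordinary running mean. Smaller values let a center follow drifting data more quickly, and a cluster that stops receiving points sees its weight shrink toward zero, which is the kind of signal that could be used to retire it.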

Cloudera had a post describing the enterprise security features that are part of CDH 5. Topics include Apache Sentry, integration with Active Directory and Kerberos, centralized audit logging, and encryption (plus key management). Not all of these features are available in the free version of CDH, but Cloudera claims many of the features aren't available in another distribution, either.

http://vision.cloudera.com/production-ready-hadoop-an-overview-of-security-in-cloudera-5/

The Confluent blog has a post from Martin Kleppmann, the author of the upcoming book “Designing Data-Intensive Applications.” The post is an edited transcript of a recent talk on stream processing. It covers a large number of topics, including streaming aggregation, relation to database systems, and several tools. The post is a great overview of important concepts in stream processing.

http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/

The Mortar blog has the transcript and video of a recent talk at the NYC Pig User Group. The talk describes the types of problems that Pig is really good for, its shortcomings, and the strengths and weaknesses of the user-facing APIs.

http://blog.mortardata.com/post/109495522361/pig-jonathan-coveney-talk

LinkedIn has written about its usage of Kafka and plans for the future. The post provides insight into what they’re using Kafka for (including monitoring, messaging, analytics) and tools they’ve built around it (a REST API, schema registry, auditing service). Future plans include support for security, improved reliability/availability, and cost efficiency.

http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future

The AWS Big Data Blog has a tutorial describing how to set up an Elastic MapReduce cluster with Elasticsearch and Kibana.

http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR

News

The Apache Software Foundation has announced that Samza has graduated from the incubator and is now a top-level project. Samza, the distributed stream processing framework, uses YARN for fault tolerance and integrates with Kafka.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces71

MapR announced an initiative this week to provide free on-demand Hadoop training for developers, analysts, and administrators. Currently available courses are “Hadoop Essentials,” “Hadoop Operations: Cluster Administration,” and “Developing Hadoop Applications.” Future courses will cover HBase, Drill, and Hive.

https://www.mapr.com/company/press-releases/mapr-unveils-free-demand-training-program-50m-kind-contribution-hadoop

Typesafe and Databricks announced results of a recent survey of Scala and Spark developers. Among the highlights: 13% of respondents already have Spark in production and 20% plan to deploy it to production in the coming year. ReadWrite has more coverage of the survey and a follow-up interview with Typesafe’s architect for Big Data Products and Services, Dean Wampler.

http://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wampler

TechRepublic has an interview with Ion Stoica, the co-founder of Databricks, about Spark. The post emphasizes Spark’s versatility: it supports batch, streaming, SQL, and machine learning. There are a few other interesting tidbits, including mention of Spark support for R in the future and the importance of libraries for Spark.

http://www.techrepublic.com/article/spark-promises-to-up-end-hadoop-but-in-a-good-way/

Hortonworks announced the Data Governance Initiative to develop software to meet enterprise requirements for data governance. Along with Hortonworks, Aetna, Merck, Target, and SAS will be working on the initiative, which will include further integrating Apache Falcon and Apache Ranger.

http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/

The Splice Machine RDBMS, which is built atop HDFS and HBase, is now certified for Hortonworks HDP.

http://hortonworks.com/blog/splice-machine-now-hdp-certified/

Releases

SequenceIQ has announced a new beta release of Cloudbreak, the cloud-agnostic Hadoop-as-a-Service framework. The new version includes user accounts, a usage explorer, support for heterogeneous clusters, support for OpenStack, and more.

http://blog.sequenceiq.com/blog/2015/01/28/cloudbreak-last-beta/

HFactory is a platform for building HBase-based applications using Scala. This week, version 1.2 was released with a few enhancements and new features.

http://hfactory.io/blog/hfactory-1-2-release-notes/

VoltDB announced version 5.0, which includes expanded Hadoop ecosystem support. Specifically, VoltDB is now integrated with HDFS, MapReduce, and Kafka. It also supports exporting data as Avro.

http://www.prnewswire.com/news-releases/voltdb-announces-version-50-expands-hadoop-integration-extends-leading-fast-data-application-development-platform-300026817.html

Cloudera announced bug-fix releases of Cloudera Manager (5.2.2 and 5.3.1) and Cloudera Navigator (2.1.2 and 2.2.1).

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Manager-5-2-2-and-Cloudera-Manager-5-3-1/m-p/24101#U24101

Cloudera also announced a new release of the Impala ODBC and JDBC drivers. The new versions support HiveServer2 from CDH 4.1+ and Impala 1.0+.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Impala-ODBC-v2-5-23-and-JDBC-v2-5-16-Drivers/m-p/24211#U24211

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Interactive Session on Sparkling Water = Spark + H2O (Mountain View) - Tuesday, February 3
http://www.meetup.com/Silicon-Valley-Big-Data-Science/events/219076654/

Bayesian Networks with R and Hadoop (Palo Alto) - Wednesday, February 4
http://www.meetup.com/Hadoop-Talks/events/219644755/

Nonstop HBase: Making HBase Safe and Bulletproof by Ryan Rawson of WANDisco (Los Angeles) - Thursday, February 5
http://www.meetup.com/Los-Angeles-HBase-User-group/events/219045881/

Building Real-World Machine Learning Applications with PredictionIO and Spark ML (Mountain View) - Friday, February 6
http://www.meetup.com/Silicon-Valley-Machine-Learning/events/219609337/

Washington

Hadoop-Based Big Data Analytics with Datameer (Bellevue) - Thursday, February 5
http://www.meetup.com/CloudTalk/events/219236686/

Arizona

Oozie or Easy: Managing Hadoop Workflows the Easy Way (Tempe) - Wednesday, February 4
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/219168066/

Colorado

Hands-on Spark Workshop for Beginners (Boulder) - Saturday, February 7
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/220044380/

Texas

Sean Busbey on Apache Accumulo (Austin) - Wednesday, February 4
http://www.meetup.com/austin-data-geeks/events/190580872/

Oklahoma

Machine Learning and Data Ingestion with Apache Storm, Kafka (Oklahoma City) - Thursday, February 5
http://www.meetup.com/Big-Data-in-Oklahoma-City/events/219965196/

Minnesota

Performance Tuning Cassandra at Target (Minneapolis) - Monday, February 2
http://www.meetup.com/Minneapolis-St-Paul-Cassandra-Meetup/events/218885496/

Tennessee

Intro to Hadoop Components and Distributions (Brentwood) - Monday, February 2
http://www.meetup.com/Data-Science-Nashville/events/219721972/

Maryland

Introduction to Big Data Techniques for Cybersecurity (Rockville) - Monday, February 2
http://www.meetup.com/Capital-Area-Cyber-Security/events/219333009/

Introduction to Apache Accumulo: Architecture and Use Cases (Jessup) - Tuesday, February 3
http://www.meetup.com/Accumulo-Users-DC/events/220167811/

Massachusetts

Get Started with Hadoop Experts: Big Data for Social Good Challenge (Cambridge) - Tuesday, February 3
http://www.meetup.com/Big-Data-Developers-in-Boston/events/219861979/

CANADA

Greenplum Deep Dive (Toronto) - Tuesday, February 3
http://www.meetup.com/Toronto-Pivotal-User-Group/events/219718869/

MEXICO

Primera Reunión de Apache Spark (Mexico City) - Friday, February 6
http://www.meetup.com/Mexico-City-Apache-Spark-Meetup/events/219029479/

IRELAND

First Galway Data Meetup, with Michael Hausenblas of MapR (Galway) - Tuesday, February 3
http://www.meetup.com/Galway-Data-Meetup/events/219176364/

FRANCE

Spark Meetup at Viadeo (Paris) - Wednesday, February 4
http://www.meetup.com/Paris-Spark-Meetup/events/220141774/

Batch on Hadoop with Cascading (Lyon) - Thursday, February 5
http://www.meetup.com/Lyon-Hadoop-Meetups/events/219273421/

GERMANY

Hadoop and Data Warehouse–Friends, Enemies or Profiteers? What about Real-Time? (Cologne) - Wednesday, February 4
http://www.meetup.com/NoSQL-Usergroup-Cologne/events/219874149/

AUSTRIA

Cassandra: How It Works and What It's Good For! (Vienna) - Wednesday, February 4
http://www.meetup.com/Vienna-Cassandra-Users/events/220027008/

ISRAEL

Lessons I Learned Building a Big Data Startup (Tel Aviv) - Monday, February 2
http://www.meetup.com/BI-on-the-Bar/events/218719652/

Tez vs Spark (Tel Aviv) - Sunday, February 8
http://www.meetup.com/HadoopIsrael/events/219382194/

CROATIA

Introduction to Spark (Zagreb) - Tuesday, February 3
http://www.meetup.com/Apache-Spark-Zagreb-Meetup/events/219811877/

AUSTRALIA

Big Data Integration Research (Canberra) - Tuesday, February 3
http://www.meetup.com/Data-Science-Canberra/events/220176788/
