Data Eng Weekly


Hadoop Weekly Issue #123

31 May 2015

As has been the recent trend, a number of posts in this week's issue are only tangentially related to Hadoop. I've included them in the hopes that they're useful for folks working with distributed systems (whether built atop Hadoop or not). For instance, there's a fantastic article on using logs for data integration, a post on Mesos/Omega/Borg, and a post on the consequences of disk wiping in distributed consensus.

Technical

Pipelined scans are a new feature targeted for a future release of Apache HBase. For scans over large numbers of records, the client prefetches additional rows while the current batch is being processed. The Yahoo Hadoop blog has more details on the implementation and provides some experimental results in which the feature improves throughput by nearly 3x.

http://yahoohadoop.tumblr.com/post/119636047561/high-throughput-pipelined-scans-in-hbase
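
The prefetching idea can be illustrated with a minimal Python sketch. This is not the HBase client API or the implementation from the post: pipelined_scan, make_fetcher, and fetch_batch are hypothetical names, and an in-memory list stands in for the remote scanner. A background thread fetches the next batch while the caller is still processing the current one.

```python
import queue
import threading


def pipelined_scan(fetch_batch, depth=1):
    """Yield rows while a background thread prefetches the next batches.

    fetch_batch is a hypothetical callable that returns the next list of
    rows, or an empty list once the scan is exhausted.
    """
    batches = queue.Queue(maxsize=depth)  # bounds how far ahead we prefetch

    def prefetch():
        while True:
            batch = fetch_batch()
            batches.put(batch)
            if not batch:  # an empty batch marks the end of the scan
                return

    threading.Thread(target=prefetch, daemon=True).start()

    while True:
        batch = batches.get()
        if not batch:
            return
        yield from batch  # the next fetch overlaps with this processing


def make_fetcher(rows, batch_size):
    """Build a stand-in for a remote scanner over an in-memory list."""
    position = 0

    def fetch_batch():
        nonlocal position
        batch = rows[position:position + batch_size]
        position += batch_size
        return batch

    return fetch_batch


scanned = list(pipelined_scan(make_fetcher(list(range(10)), batch_size=3)))
```

The bounded queue is the design choice that matters here: it caps how many batches are held in memory ahead of the consumer, so a slow consumer does not cause unbounded buffering.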

IBM BigInsights 4.0, which was released in late March, supports SQL querying of data in HBase. It includes a number of important features like windowing and OLAP aggregate functions, nested sub-queries, predicate pushdown, and secondary indexes. The IBM Hadoop Dev blog has many more details on the features of SQL-on-HBase in BigInsights.

https://developer.ibm.com/hadoop/blog/2015/05/26/rich-sql-support-hbase-biginsights-4-0/

The morning paper covered some publications relevant to folks working with Hadoop. First is a paper from Google on "Pregel: A System for Large-Scale Graph Processing" (and the inspiration for Apache Giraph). Second is GraphLab, which is a framework for parallelizing "...asynchronous iterative algorithms with sparse computational dependencies..." Third, Distributed GraphLab describes how to evolve the GraphLab abstraction to a distributed setting.

http://blog.acolyer.org/2015/05/26/pregel-a-system-for-large-scale-graph-processing/
http://blog.acolyer.org/2015/05/27/graphlab-a-new-framework-for-parallel-machine-learning/
http://blog.acolyer.org/2015/05/28/distributed-graphlab-a-framework-for-machine-learning-and-data-mining-in-the-cloud/

The Confluent blog has a transcript and the video of a recent talk by Martin Kleppmann entitled "Using Logs to Build a Solid Data Infrastructure (or: Why Dual Writes Are a Bad Idea)." The post describes the challenges involved in data integration, how an append-only log can be used to solve them, how logs are used in database storage engines, database replication, distributed consensus, and Apache Kafka, and how to build data integration on top of a distributed log.

http://blog.confluent.io/2015/05/27/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/
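
As a rough illustration of the log-based approach (a toy sketch, not code from the talk and not the Kafka API), the Python below has writers append to a single ordered log while each downstream store applies records from its own offset. The Log and Consumer classes and the database and cache dictionaries are hypothetical stand-ins. Because every store reads the same sequence, a lagging store converges to the same state rather than diverging as it can with dual writes.

```python
class Log:
    """An append-only log: records are added at the end and never changed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


class Consumer:
    """Applies log records, in order, to one downstream store."""

    def __init__(self, log, apply):
        self.log = log
        self.apply = apply
        self.offset = 0  # next offset to read

    def poll(self):
        for record in self.log.read_from(self.offset):
            self.apply(record)
            self.offset += 1


def apply_to(store):
    """Return a function that writes a (key, value) record into store."""
    def apply(record):
        key, value = record
        store[key] = value
    return apply


log = Log()
database = {}
cache = {}
db_consumer = Consumer(log, apply_to(database))
cache_consumer = Consumer(log, apply_to(cache))

log.append(("x", 1))
log.append(("x", 2))
db_consumer.poll()
log.append(("y", 3))
db_consumer.poll()
cache_consumer.poll()  # a lagging consumer still sees the same order
```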

This post provides a great overview and summary of the Mesos, Omega, and Borg papers. It provides some background by contrasting the problems of heterogeneous datacenters with those of homogeneous HPC clusters. Next, it describes Mesos' two-level scheduling, Omega's optimistic scheduling, and Borg (which is the production datacenter scheduler at Google). There are a number of interesting details from the Borg paper mentioned, such as the median cluster size of 10K nodes and the distinction between priorities when scheduling services and batch jobs.

http://www.umbrant.com/blog/2015/mesos_omega_borg_survey.html
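
Two-level scheduling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Mesos API: the Framework class, allocate function, and CPU numbers are all hypothetical, and real allocators use fairness policies rather than a fixed order. The first level decides which resources to offer to which framework; the second level (the framework) decides which of its own tasks to launch with the offer.

```python
class Framework:
    """Second level: a framework decides which of its tasks to launch."""

    def __init__(self, name, pending_cpus):
        self.name = name
        self.pending = list(pending_cpus)  # CPU demand of each waiting task

    def on_offer(self, offered_cpus):
        """Accept as many pending tasks as fit; return the CPUs consumed."""
        used = 0
        still_pending = []
        for demand in self.pending:
            if used + demand <= offered_cpus:
                used += demand
            else:
                still_pending.append(demand)
        self.pending = still_pending
        return used


def allocate(free_cpus, frameworks):
    """First level: offer the free CPUs to each framework in turn."""
    usage = {}
    for framework in frameworks:
        used = framework.on_offer(free_cpus)
        usage[framework.name] = used
        free_cpus -= used  # unused resources go into the next offer
    return usage, free_cpus


batch = Framework("batch", [4, 4, 4])
service = Framework("service", [2, 1])
usage, remaining = allocate(10, [batch, service])
```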

Distributed consensus implementations often use a disk for persisting data. With that in mind, it's still a bit surprising that wiping a disk can lead to data loss in a system like ZooKeeper. This post describes the details of the problem (using ZooKeeper as a reference), and it explores several solutions (e.g. using super-majorities and db tokens).

http://fpj.me/2015/05/28/dude-wheres-my-metadata/
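
A toy model shows why a wiped disk is dangerous under majority quorums. The Python below is an illustrative sketch, not ZooKeeper code and not taken from the post: the three servers A, B, and C, the commit function, and the quorums_missing helper are hypothetical. A write acknowledged by a majority is visible to every possible quorum until one of the acknowledging servers loses its disk and rejoins empty.

```python
from itertools import combinations

SERVERS = {"A", "B", "C"}
MAJORITY = len(SERVERS) // 2 + 1

# Each server's durable state: the set of writes it has persisted.
disks = {name: set() for name in SERVERS}


def commit(write, ackers):
    """Persist a write on the acknowledging servers; succeed on a majority."""
    for name in ackers:
        disks[name].add(write)
    return len(ackers) >= MAJORITY


def quorums_missing(write):
    """List every majority quorum in which no member has the write."""
    return [
        set(group)
        for group in combinations(sorted(SERVERS), MAJORITY)
        if not any(write in disks[name] for name in group)
    ]


committed = commit("w1", {"A", "B"})
before_wipe = quorums_missing("w1")

disks["B"] = set()  # B's disk is wiped and B rejoins with empty state
after_wipe = quorums_missing("w1")
```

Before the wipe, every majority contains A or B, so no quorum can miss the write. Afterwards, B and C together form a majority in which neither server has it.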

This post describes how a data infrastructure based around Apache Kafka can be used to populate multiple (e.g. a new/prototype) data stores in parallel. This strategy enables much more informed decisions than an all-at-once switchover.

http://radar.oreilly.com/2015/05/validating-data-models-with-kafka-based-pipelines.html

This presentation describes the Kafka Mesos Framework for running Kafka as part of a Mesos cluster. It describes Mesos, Kafka, the integration, and the commands to run Kafka on Mesos.

http://www.slideshare.net/charmalloc/making-apache-kafka-elastic-with-apache-mesos

This post describes how to install and configure Apache Sentry and Sqoop2 such that Sqoop2 uses Sentry's authorization. It also gives some examples of creating users/roles and verifying that the permissions work as intended.

http://ingest.tips/2015/05/28/sqoop2-integration-with-sentry/

When getting started with Oozie, it can be hard to understand what is happening as you submit jobs via the Oozie client. This post describes the process in detail and some common pitfalls. Specifically, it looks at common issues and workarounds related to the Oozie "launcher job" failing: running out of memory, deadlocking a cluster, and configuring Hive's scratchdir.

https://www.altiscale.com/hadoop-blog/oozie-launcher-tips-for-tackling-its-challenges/

The Databricks blog has a guest post on tuning garbage collection for Apache Spark. The post is full of details, including a description of how Java GCs work, an overview of Spark's memory management, notable JVM arguments related to GC, tips for analyzing GC performance, and more.

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

This walkthrough (both in video and transcript form) describes the architecture of Apache Drill. It covers things like when Drill takes advantage of data locality, the components of the Drill cluster (it's a homogeneous architecture), and connecting to Drill via ODBC/JDBC and REST.

https://www.mapr.com/blog/how-deploy-apache-drill-and-connect-bi-tools-whiteboard-walkthrough

This presentation contains practical advice and information related to Apache Flink (there have been some good introductory posts/presentations in previous issues of Hadoop Weekly). It covers things like running a Flink cluster, unit tests for Flink, debugging Flink (including remote debugging), job tuning, and much more.

http://www.slideshare.net/robertmetzger1/apache-flink-hands-on

I really enjoy reading about folks' practical experiences with Hadoop (whether good or bad). This post describes what "bad things will happen" when filling the datanode disks (in this case during a distro upgrade). It details the symptoms, including snippets from the logs, and suggests a few setting changes to mitigate the problem.

http://gbif.blogspot.com/2015/05/dont-fill-your-hdfs-disks-upgrading-to.html

News

Hortonworks has announced a new program for academic institutions to train students. Universities that become Hortonworks Academic Partners get access to Hortonworks course materials for HDP Operations, HDP Developer, and/or HDP Data Science.

http://hortonworks.com/blog/hortonworks-university-annoucnes-an-academic-program/

Salesforce announced partnerships with Google, Cloudera, Hortonworks, New Relic, Informatica, and Trifacta to integrate their offerings with Salesforce's cloud.

http://www.forbes.com/sites/alexkonrad/2015/05/28/salesforce-teams-up-to-bring-big-data-to-analytics-cloud/

ApacheCon is being split into two conferences: ApacheCon: Big Data and ApacheCon: Core. Both events will be held in Budapest; the former September 28-30 and the latter October 1-2.

http://www.linuxfoundation.org/news-media/announcements/2015/05/linux-foundation-announces-new-conference-support-collaboration

Releases

Apache HBase has published "CVE-2015-1836: Apache HBase remote denial of service, information integrity, and information disclosure vulnerability". There are hotfix upgrades of the 0.98, 1.0.1, and 1.1.0 releases, and the following post describes the mitigation steps.

http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCA+RK=_CFiTfQ2d0V+kuJx_y5izmYccaKjXaJ3V72KK7tbOhbkg@mail.gmail.com%3E

Version 4.3.1 of Apache BookKeeper was released. The version includes bug fixes and improvements.

http://bookkeeper.apache.org/docs/r4.3.1/releaseNotes.html

Apache Mahout 0.10.1 was released. This version upgrades to Spark 1.2.2 (and supports older versions), fixes a memory bug in co-occurrence analysis, and more.

http://mail-archives.apache.org/mod_mbox/mahout-user/201505.mbox/%3CCAOtpBjg4RiHnEW71ixdp%3DXDtvQDixb8waLEWqBYDnmWtag3V%2Bw%40mail.gmail.com%3E

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

Using Spark at Vungle (San Francisco) - Tuesday, June 2
http://www.meetup.com/spark-users/events/222313872/

June 2015 Meetup, Pre-Hadoop Summit (San Francisco) - Wednesday, June 3
http://www.meetup.com/hadoopsf/events/222457953/

Apache Drill: Learn the Basics with Neeraja Rentachintala of MapR (Palo Alto) - Wednesday, June 3
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/222536108/

Introduction to Spark and Tachyon (Emeryville) - Thursday, June 4
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/

Graph Analytics in Spark (Mountain View) - Thursday, June 4
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472

Washington

Data Engineering with XFrames (Seattle) - Wednesday, June 3
http://www.meetup.com/Seattle-Spark-Meetup/events/219336592/

Arizona

Splunk and Hadoop Stack (Tempe) - Wednesday, June 3
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/221442792/

Texas

A Pragmatic Approach to Apache Spark and Functional Programming (Addison) - Monday, June 1
http://www.meetup.com/DFW-Data-Science/events/222499306/

Virginia

Spark after Dark (Sterling) - Thursday, June 4
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/222666121/

New Jersey

HBase: NoSQL Design & Data Access Patterns (Hamilton Township) - Tuesday, June 2
http://www.meetup.com/nj-hadoop/events/222637573/

UNITED KINGDOM

Introduction to RHadoop for R Users (Birmingham) - Tuesday, June 2
http://www.meetup.com/BirminghamR/events/222784680/

NORWAY

Hadoop at If Insurance (Oslo) - Tuesday, June 2
http://www.meetup.com/oslohug/events/222708776/

SPAIN

Benchmarking Hadoop (Barcelona) - Wednesday, June 3
http://www.meetup.com/BDOOP-BigData-Operations-On-Perfomance-Barcelona/events/183912402/

FRANCE

DataLake, MapReduce & Spark, Véhicules Connectés (Paris) - Thursday, June 4
http://www.meetup.com/Hadoop-User-Group-France/events/222610836/

GERMANY

Introduction to Apache Flink Workshop (Berlin) - Wednesday, June 3
http://www.meetup.com/Apache-Flink-Meetup/events/220557545/

ISRAEL

Using Apache Kafka as a Large Scale Distributed Message Bus (Herzeliyya) - Monday, June 1
http://www.meetup.com/Coding-with-AppsFlyer/events/222372156/

Hadoop Ecosystem Workshop (Tel Aviv-Yafo) - Tuesday, June 2
http://www.meetup.com/full-stack-developer-il/events/222342395/

AUSTRALIA

Spark SQL and Beyond, from the Spark SQL Lead Developer (Melbourne) - Tuesday, June 2
http://www.meetup.com/Melbourne-Apache-Spark-Meetup/events/222448994/