Data Eng Weekly


Hadoop Weekly Issue #167

25 April 2016

Welcome to a special Monday edition of Hadoop Weekly. There's lots of great technical content this week from Spark to Kafka to Beam to Kudu. If you're looking for something even more bleeding edge than some of those technologies, Apache Metron (incubating) had its first release. Metron, which is a general-purpose security system built on Hadoop, is a project to keep an eye on going forward.

Technical

This presentation serves as a guide to building a stream processing system in AWS. It describes relatively simple solutions, such as Amazon Kinesis with AWS Lambda and the Kinesis S3 connector, as well as more complex solutions for real-time analytics that combine several AWS services.

http://cdn.oreillystatic.com/en/assets/1/event/144/Building%20a%20scalable%20architecture%20for%20processing%20streaming%20data%20on%20AWS%20Presentation.pdf

This post describes how to use Spark Testing Base, which is a testing framework for Spark written in Scala, from Java. The example code shows how to refactor Spark code to isolate the logic to test as well as how to deal with some of the gnarly Scala APIs from Java.

http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
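
The refactoring idea from the post above, pulling the logic out of the job so it can be tested on its own, can be sketched without Spark at all. This is a minimal illustration in Python rather than the Java code from the post; the record format and the names parse_line and total_by_user are hypothetical.

```python
def parse_line(line):
    """Parse a hypothetical 'user,amount' record; return None if it is malformed."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    user, amount = parts[0].strip(), parts[1].strip()
    try:
        return (user, float(amount))
    except ValueError:
        return None


def total_by_user(lines):
    """Plain-Python stand-in for the job body.

    In a real Spark job the same parse_line function would be passed to
    map/filter and the summing would be done by a reduce-by-key step.
    """
    totals = {}
    for record in map(parse_line, lines):
        if record is None:
            continue  # skip malformed input instead of failing the job
        user, amount = record
        totals[user] = totals.get(user, 0.0) + amount
    return totals
```

Because parse_line is an ordinary function, it can be unit tested directly, and only a thin layer of wiring needs a Spark context.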

The Altiscale blog has an overview of the pros and cons of building thin and uber jars when working with Spark. There are examples of building both types in Maven and SBT.

https://www.altiscale.com/blog/spark-on-hadoop-thin-jars/

LinkedIn has posted about their Kafka ecosystem, which includes a special Kafka producer, a REST API for non-Java clients, monitoring, an Avro schema registry, Gobblin (a tool for loading data to Hadoop), and more.

https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin

This tutorial on Spark Streaming shows how to pull tweets using the twitter4j API, filter based on hashtag, and perform sentiment analysis on the tweets as they're processed.

https://www.mapr.com/blog/spark-streaming-and-twitter-sentiment-analysis
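
The per-tweet steps in the tutorial above (filter by hashtag, then score sentiment) can be sketched as plain functions. This is a toy illustration, not the tutorial's code: the word lists are made up, the scoring is a simple word count, and the names has_hashtag and sentiment are hypothetical. In a streaming job, functions like these would be applied to each tweet as it arrives.

```python
import re

# Tiny, made-up word lists; a real analysis would use a proper lexicon or model.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "awful"}


def has_hashtag(text, tag):
    """Return True if the tweet text contains the given hashtag (case-insensitive)."""
    tags = {t.lower() for t in re.findall(r"#(\w+)", text)}
    return tag.lower().lstrip("#") in tags


def sentiment(text):
    """Label text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```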

Apache Kudu (incubating) is an exciting companion to Apache Impala (incubating) because it can efficiently answer both broad analytics and very targeted queries. This post describes the technical details of the integration, how Kudu's design provides efficient querying capabilities, how to perform write/update/delete operations with Impala and Kudu, and more.

http://blog.cloudera.com/blog/2016/04/how-to-use-impala-and-kudu-together-for-analytic-workloads/

MapR has a post about using spark-sklearn to scale out an existing scikit-learn model. It walks through building a model from the Inside Airbnb dataset and describes how to plug in spark-sklearn for cross validation.

https://www.mapr.com/blog/predicting-airbnb-listing-prices-scikit-learn-and-apache-spark

The AWS big data blog has a tutorial describing how to use HBase and Hive with Amazon EMR. The post includes an introduction to HBase, describes how to restore an HBase table from S3, demonstrates Hive and HBase integration, and more.

http://blogs.aws.amazon.com/bigdata/post/Tx3EGE8Z90LZ9WX/Combine-NoSQL-and-Massively-Parallel-Analytics-Using-Apache-HBase-and-Apache-Hiv

This post describes some of the challenges in providing real-world experience to students taking a big data course. The author has gone through several iterations and options and seems to have finally landed on a good solution—Altiscale's Hadoop-as-a-Service.

https://www.altiscale.com/blog/hadoop-as-a-service-in-the-classroom/

The Cloudera blog has a guest post in which the author compares Parquet and Avro across two data sets: one that's narrow (3 columns) and one that's wide (103 columns). Using test queries and operations in Spark and Spark SQL, the author finds that queries against Parquet and Avro data sometimes perform similarly, but in many cases queries against Parquet are much faster and the serialized data is much smaller.

http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/

This article describes how to use SparkR with a distribution, like CDH, that doesn't officially support it. By leveraging YARN and locally installed R packages on the workers, jobs can be executed with little additional work.

http://www.nodalpoint.com/sparkr-in-cloudera-hadoop/

There have been a number of open-source frameworks to execute MapReduce and similar jobs with a higher-level programming model. Historically, these have been tied to individual execution frameworks (e.g. MapReduce, Storm), but there's recently been work to make them agnostic. Apache Beam (incubating) aims to take that even further, generalizing across execution models for both batch and streaming and offering built-in support for complex compute models.

http://www.datanami.com/2016/04/22/apache-beam-emerges-ambitious-goal-unify-big-data-development/

The Apache blog has a 7-part series presenting experimental results for HBase write throughput across HDD, SSD, and RAMDISK. In performing the analysis, the authors uncovered a few issues in HBase and HDFS and proposed fixes for them.

https://blogs.apache.org/hbase/entry/hdfs_hsm_and_hbase_part

News

Tom White, the author of "Hadoop: The Definitive Guide," has written about how he became involved in Apache Hadoop. His early contributions were around integrating Hadoop with Amazon Web Services, which has been an important part of the project's success.

http://vision.cloudera.com/how-i-got-into-hadoop/

Fluo, which is a distributed processing engine for Apache Accumulo, has been submitted to the Apache incubator.

https://wiki.apache.org/incubator/FluoProposal

A new conference for Apache Phoenix, the SQL-on-HBase system, has been announced for the day after HBaseCon. The half-day conference will feature tracks on Phoenix internals and use cases.

http://hortonworks.com/blog/announcing-first-annual-phoenixcon-apache-phoenix-user-conference/

Releases

Apache Metron, a security framework built on Hadoop, has released version 0.1. Hortonworks is supporting it as a tech preview, and has written about the features, how to get started, how to contribute, how to use the Metron UI, and more.

http://hortonworks.com/blog/apache-metron-tech-preview-1-come-get/
http://hortonworks.com/blog/apache-metron-use-case-finding-needle-haystack/

Apache NiFi 0.6.1 was released this week. It's a bug fix release that addresses just over 10 bugs.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201604.mbox/%3CCALJK9a7yLnFeJ7Z=eU6mOB-DXvo8MHUr=_RshSjZcTbTcAHDZA@mail.gmail.com%3E

Apache Flink 1.0.2 was released this week. The new release includes bug fixes, a performance improvement when using RocksDB, and several improvements to documentation.

http://flink.apache.org/news/2016/04/22/release-1.0.2.html

Amazon has announced a new version of Amazon EMR with support for HBase 1.2.

https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-hbase-1-2-is-now-available/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Spark 101 (San Francisco) - Tuesday, April 26
http://www.meetup.com/SF-Spark-and-Friends/events/229851313/

Big Data Application Meetup (Palo Alto) - Wednesday, April 27
http://www.meetup.com/BigDataApps/events/228460935/

Tackling Data Challenges at Netflix and Twitter (Los Gatos) - Wednesday, April 27
http://www.meetup.com/SF-Data-Engineering/events/230356137/

Apache Flink Technical Deep Dive w/ Stephan Ewen! (Palo Alto) - Thursday, April 28
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/230409559/

Spark with Couchbase to Electrify Your Data Processing (Santa Monica) - Thursday, April 28
http://www.meetup.com/Couchbase-Los-Angeles/events/229345154/

Colorado

"What Is All the Hype about Apache Spark" (Denver) - Tuesday, April 26
http://www.meetup.com/Colorado-Data-Science/events/229999064/

Utah

Data Science @ Blue Coat - Chris Larsen Speaking (Draper) - Thursday, April 28
http://www.meetup.com/BigDataUtah/events/229871745/

Texas

Big Data Architecture for O&G (Houston) - Tuesday, April 26
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/230054103/

Oil and Gas Use Case: Spin Up & Visualize (Addison) - Thursday, April 28
http://www.meetup.com/DFW-Analytics-Big-Data-and-Beyond/events/230054005/

Wisconsin

Overview and Demo of the Apache NiFi Project (Madison) - Tuesday, April 26
http://www.meetup.com/BigDataMadison/events/225352208/

Ohio

April Edition of MOHUG (Dublin) - Tuesday, April 26
http://www.meetup.com/MOHUG-Mid-Ohio-Hadoop-User-Group/events/229662512/

Georgia

How the Weather Company Leverages Billions of Data Points & Predictive Analytics (Atlanta) - Wednesday, April 27
http://www.meetup.com/Data-Science-ATL/events/229761943/

HBase as a File System (Roswell) - Wednesday, April 27
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/230178596/

Virginia

Spark Saturday DC (McLean) - Saturday, April 30
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/229486436/

Maryland

Apache NiFi: Because It Ain’t Data Science without the Data (Laurel) - Wednesday, April 27
http://www.meetup.com/Data-Science-MD/events/229738833/

New York

Building Data Pipelines for Solr with Apache NiFi (New York) - Tuesday, April 26
http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/events/225945610/

Analysis of Streaming Sensor Data with Spark & Kafka on Bluemix (New York) - Wednesday, April 27
http://www.meetup.com/Big-Data-Developers-in-NYC/events/230374767/

CANADA

Integration of Apache Kafka with Apache Spark (Toronto) - Wednesday, April 27
http://www.meetup.com/Logger/events/228807188/

April Meetup (Ottawa) - Wednesday, April 27
http://www.meetup.com/Ottawa-Big-Data-Enthusiasts/events/230044245/

UNITED KINGDOM

Apache Kudu Intro: Storage for Fast Analytics on Fast Data (London) - Thursday, April 28
http://www.meetup.com/Data-Science-London/events/230394455/

NORWAY

Big Data, No Fluff: Let’s Get Started with Hadoop #7 (Oslo) - Thursday, April 28
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/225589995/

SWEDEN

Spark as the Catalyst for Advanced Analytics (Stockholm) - Wednesday, April 27
http://www.meetup.com/Machine-Learning-Stockholm/events/230166300/

FRANCE

Spark Meetup at Criteo (Paris) - Thursday, April 28
http://www.meetup.com/Paris-Spark-Meetup/events/229847857/

SWITZERLAND

Spark Streaming: Dealing with State, by Francois Garillot (Renens) - Thursday, April 28
http://www.meetup.com/Big-Data-Romandie/events/230345605/

POLAND

Introducing Apache Ignite (Warsaw) - Tuesday, April 26
http://www.meetup.com/warsaw-hug/events/229293076/

Tabular Data Analysis in Apache Spark Using DataFrames (Warsaw) - Wednesday, April 27
http://www.meetup.com/Poland-CodiLime-Tech-Talk/events/230339771/

INDIA

Introduction to Flink Streaming (Bangalore) - Saturday, April 30
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/230192489/

AUSTRALIA

Kafka and OrientDB (Sydney) - Tuesday, April 26
http://www.meetup.com/Sydney-Alt-Net/events/229675660/