Data Eng Weekly


Hadoop Weekly Issue #145

15 November 2015

This issue is short and sweet, with coverage of Amazon EMR, Apache Apex, Apache Ambari, and more. Apache Spark, Apache Cassandra, and Apache Mahout all released new versions this week, and Succinct Spark is an interesting new project to keep an eye on.

Technical

Amazon Elastic MapReduce (EMR) offers some unique features that take advantage of the elasticity and capabilities (such as the S3 blob store) of the AWS cloud environment. This article covers several of them: EMR's distinction between core and task nodes to support elasticity, the EMR FileSystem, S3DistCp, and more.

http://cloudacademy.com/blog/amazon-emr-five-ways-to-improve-the-way-you-use-hadoop/

This post from the Cloudera blog describes how to continuously ingest data into HDFS and Hive (using one minute batches) for querying by Impala. In addition to the basic concepts, the post describes how to productionize this setup by using staging tables (which are compacted daily) and a view over the compacted and active staging tables.

http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
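The staging-table pattern described above can be sketched in a few lines of Python. This is only a toy in-memory model, not the SQL from the Cloudera post: the class name StagedStore and its methods are hypothetical, and plain lists stand in for Hive/Impala tables and the view.

```python
from dataclasses import dataclass, field


@dataclass
class StagedStore:
    """Toy model: many small ingest batches plus one compacted table."""
    active: list = field(default_factory=list)     # one entry per small batch
    compacted: list = field(default_factory=list)  # rows merged by compaction

    def ingest(self, batch):
        # Each small batch lands as its own unit in the active staging area.
        self.active.append(list(batch))

    def compact(self):
        # Periodic job: fold all small batches into the compacted table,
        # then empty the active staging area.
        for batch in self.active:
            self.compacted.extend(batch)
        self.active = []

    def view(self):
        # Queries read the union of compacted rows and batches that
        # have not been compacted yet.
        rows = list(self.compacted)
        for batch in self.active:
            rows.extend(batch)
        return rows


store = StagedStore()
store.ingest([1, 2])
store.ingest([3])
before = store.view()
store.compact()
store.ingest([4])
after = store.view()
```

The point of the view is that readers never need to know whether a row has been compacted yet; they always query one logical table.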

This post describes how Apache Apex (incubating), the batch and stream processing platform, uses checkpointing for fault tolerance. Checkpoint data is written to HDFS, either asynchronously or synchronously (depending on the delivery guarantees), which allows any node in the cluster to recover the state.

https://www.datatorrent.com/blog-introduction-to-checkpoint/
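As a rough illustration of the checkpoint-and-recover idea (not Apex's actual API), here is a minimal Python sketch. The CountingOperator class, its methods, and the dict standing in for HDFS are all hypothetical; the checkpoint is written synchronously for simplicity.

```python
import copy


class CountingOperator:
    """Stateful operator that counts events by key."""

    def __init__(self):
        self.counts = {}
        self.next_offset = 0  # index of the next event to process

    def process(self, events, upto):
        # Consume events from the current offset up to (not including) `upto`.
        while self.next_offset < upto:
            key = events[self.next_offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.next_offset += 1

    def checkpoint(self, store):
        # Persist a snapshot of the state; `store` stands in for shared storage.
        store["latest"] = copy.deepcopy(
            {"counts": self.counts, "next_offset": self.next_offset}
        )

    @classmethod
    def recover(cls, store):
        # Any worker with access to the store can rebuild the operator.
        op = cls()
        saved = store.get("latest")
        if saved is not None:
            op.counts = copy.deepcopy(saved["counts"])
            op.next_offset = saved["next_offset"]
        return op


events = ["a", "b", "a", "c", "a"]
store = {}

op = CountingOperator()
op.process(events, 3)
op.checkpoint(store)
op.process(events, 4)  # this progress is lost when the operator "crashes"

recovered = CountingOperator.recover(store)
recovered.process(events, len(events))  # replay from the checkpointed offset
```

Because the offset is saved together with the counts, the recovered operator replays only the events after the checkpoint and ends up with the same totals as an uninterrupted run.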

The IBM Hadoop Dev blog has a post describing how to create email alerts to track when the status of a service becomes UNKNOWN (which is the case during a YARN HA failover).

https://developer.ibm.com/hadoop/blog/2015/11/10/alert-notification-track-unknown-status-ha-failover/

This tutorial covers setting up an Amazon EMR cluster with Apache Spark and Apache Zeppelin (incubating). Next, the post gives the steps (e.g. setting up an SSH tunnel) needed to access the Zeppelin web UI. It then walks through building a recommendation engine using the MovieLens dataset and a matrix factorization function from MLlib.

http://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

This presentation describes best practices for several non-trivial features of Spark: RDD re-use, working with key-value data, Spark accumulators, and Spark SQL. The presentation also has a preview of some work underway in Spark MLlib.

http://www.slideshare.net/hkarau/beyond-shuffling-global-big-data-tech-conference-2015-sj

News

Typesafe recently announced that they'll provide commercial support for Apache Spark.

http://www.eweek.com/database/typesafe-launches-support-for-apache-spark.html

This post lists 10 free Hadoop tutorials ranging from short to multi-step and in both text and video form.

http://www.datasciencecentral.com/profiles/blogs/hadoop-tutorials

The call for speakers for Strata+Hadoop World London, which takes place in May/June of 2016, is open until December 11th.

http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/cfp/425

Releases

Apache Spark 1.5.2 was released this week. It's a maintenance release with over 60 resolved issues.

http://spark.apache.org/releases/spark-release-1-5-2.html

Apache Mahout 0.11.1 was also released this week. The new version contains 10 bug fixes and several performance improvements—to Spark support, to dot product calculations, and to %*% calculations.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201511.mbox/%3CCAOtpBjh8fARuPV5rCW_-VhYxHyv=tzqrcwuMYnAuZY=k=+rZnw@mail.gmail.com%3E

The 3.0 release of Apache Cassandra contains a number of new performance optimizations, data storage savings, and developer enhancements. The new version was released this week.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces82

Succinct Spark is a new library for interacting with data in the Succinct distributed store. Succinct compresses and indexes input data, which can lead to massive speedups to Spark programs in certain situations. The introduction describes several ways that Succinct is integrated with Spark, provides example code, and describes performance benefits.

https://databricks.com/blog/2015/11/10/succinct-spark-from-amplab-queries-on-compressed-rdds.html
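To give a feel for why an index over the input data can speed up search queries, here is a small Python sketch of substring search over a suffix array. This is an assumption made for illustration: it is a plain, uncompressed suffix array, not Succinct's data structure or the Succinct Spark API, and the function names are hypothetical.

```python
def build_suffix_array(text):
    # Start positions of all suffixes, sorted by the suffix they begin.
    # Simple O(n^2 log n) construction; fine for a small demonstration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return sorted start positions of `pattern` using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "banana"
sa = build_suffix_array(text)
hits = find_all(text, sa, "ana")
misses = find_all(text, sa, "xyz")
```

Once the index is built, each lookup needs only a logarithmic number of comparisons instead of a scan over the whole input, which is the kind of situation where an indexed store can pay off.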

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Hive Contributors Meetup (Santa Clara) - Monday, November 16
http://www.meetup.com/Hive-Contributors-Group/events/226495286/

Intro to Apache Spark for Java and Scala Developers (Mountain View) - Wednesday, November 18
http://www.meetup.com/sv-jug/events/226109708/

One Hadoop, Multiple Clouds (Palo Alto) - Wednesday, November 18
http://www.meetup.com/cloudcomputing/events/226450900/

Kafka November Meetup (Mountain View) - Wednesday, November 18
http://www.meetup.com/http-kafka-apache-org/events/225592591/

Washington

Deep Dive Into Spark Streaming (Bellevue) - Wednesday, November 18
http://www.meetup.com/Big-Data-Bellevue-BDB/events/219852695/

Missouri

Hadoop in the Cloud (Clayton) - Tuesday, November 17
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/225046512/

Minnesota

Building a Real-Time Transformation Engine on Spark Streaming (Saint Paul) - Thursday, November 19
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/226472910/

Ohio

Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, November 16
http://www.meetup.com/Cleveland-Hadoop/events/225670548/

Florida

First Miami HUG Meetup! (Coral Gables) - Thursday, November 19
http://www.meetup.com/Miami-Hadoop-User-Group/events/226352443/

North Carolina

Agile, Nimble, Tenacious: Modern Data Analytics with Apache Spark (Charlotte) - Wednesday, November 18
http://www.meetup.com/CharlotteHUG/events/219153256/

Virginia

Practical Introduction to Apache Flink (Vienna) - Thursday, November 19
http://www.meetup.com/Washington-DC-Area-Apache-Flink-Meetup/events/225769282/

Connecticut

Connecticut Big Data #3 (Windsor) - Wednesday, November 18
http://www.meetup.com/Connecticut-Big-Data/events/224851797/

Massachusetts

Hands-On Presto Workshop (Boston) - Tuesday, November 17
http://www.meetup.com/bostonhadoop/events/226523527/

Spark, Big Data and Analytics Meetup (Boston) - Thursday, November 19
http://www.meetup.com/Big-Data-Developers-in-Boston/events/226659145/

URUGUAY

Hadoop: Intro and Experiences with AWS Elastic Map Reduce (Montevideo) - Wednesday, November 18
http://www.meetup.com/Montevideo-BigData-DataScience-Meetup/events/226378357/

NORWAY

Lets Get Started with Hadoop #4 (Oslo) - Thursday, November 19
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/222558927/

GERMANY

Schedoscope: Pain-free Scheduling for Agile Hadoop Data Warehouses (Munich) - Tuesday, November 17
http://www.meetup.com/Hadoop-User-Group-Munich/events/225557422/

POLAND

Lighting Talks: Flink Streaming, Spark Streaming, Control-M, Qlik (Warsaw) - Thursday, November 19
http://www.meetup.com/warsaw-hug/events/226348148/

ROMANIA

6th BigData/DataScience Cluj-Napoca Meetup (Cluj-Napoca) - Tuesday, November 17
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/226443828/

INDIA

Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, November 21
http://www.meetup.com/Big-Data-Developers-in-Mumbai/events/223301848/

Apache Spark: Introduction to Spark DataFrames/SQL and Deep Dive (Bangalore) - Saturday, November 21
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/226419828/

AUSTRALIA

Spark in Cloud, SparkR and Machine Learning, Demos from Spark Hackathon (Sydney) - Tuesday, November 17
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/225866988/

NEW ZEALAND

Apache Spark Intro, RDD Basics and SparkSQL (Auckland) - Thursday, November 19
http://www.meetup.com/Auckland-Apache-Spark-User-Group/events/225893445/