Hadoop Weekly Issue #141

11 October 2015

Spark is the topic of over half of the technical articles this week. As evidenced by new features and companies sharing practical knowledge, it is maturing (and gaining plenty of adoption) as a product. Aside from the great articles on Spark, I highly recommend the visualization covering the fundamentals of Raft's distributed consensus algorithm.

Technical

In Spark 1.5, SparkR gained support for distributed computation of generalized linear models. This tutorial shows how to use the SparkR APIs to build a linear model for predicting airline delays.

https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html

This tutorial describes how to build an Apache Spark cluster on Amazon Web Services using spot instances (which provide significant cost savings). The instructions describe using the AWS web console, installing Spark using a recent release, and configuring Spark's important settings.

http://blog.insightdatalabs.com/spark-cluster-step-by-step/

The MapR blog has a guide to Spark Streaming, which discusses Spark Streaming's API and streaming model (microbatch). It also describes processing semantics (at least once, exactly once, at most once), which vary depending on the input source for Spark Streaming.

https://www.mapr.com/blog/quick-guide-spark-streaming
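
To make the three processing semantics mentioned above concrete, here is a minimal sketch in plain Python. It does not use the Spark Streaming API, and everything in it is an assumption made for illustration: the function names, the (record_id, value) record format, and the idea of modelling a failure as the source replaying a batch that was already processed.

```python
# Illustrative simulation only; this does not use Spark Streaming.
# A "source" delivers batches of (record_id, value) pairs. A failure is
# modelled either as a replayed batch or as a batch that is lost.

def at_least_once(batches):
    """Process every delivered record, including replayed duplicates."""
    total = 0
    for batch in batches:
        for _record_id, value in batch:
            total += value
    return total


def effectively_once(batches):
    """Skip records whose id was already processed, so replays are harmless."""
    seen = set()
    total = 0
    for batch in batches:
        for record_id, value in batch:
            if record_id in seen:
                continue  # duplicate caused by a replay
            seen.add(record_id)
            total += value
    return total


def at_most_once(batches, lost_batch_indexes):
    """Never replay: batches lost to a failure are simply dropped."""
    total = 0
    for index, batch in enumerate(batches):
        if index in lost_batch_indexes:
            continue
        for _record_id, value in batch:
            total += value
    return total


first = [(1, 10), (2, 20)]
second = [(3, 30)]
# The source replays the first batch after a simulated failure.
replayed = [first, first, second]

print(at_least_once(replayed))             # 90: the replay is double counted
print(effectively_once(replayed))          # 60: duplicates are filtered out
print(at_most_once([first, second], {0}))  # 30: the lost batch never arrives
```

The deduplicating variant is only one way to get an exactly-once result; it assumes every record carries a stable identifier.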

Datanami has an article describing how Uber has migrated from a data system built on Amazon EMR and Celery/Python ETL to a new system built on Spark and Kafka. Uber makes heavy use of Spark Streaming and Spark SQL, and they've built two Spark-based tools to keep the system running smoothly. The first, called Paricon, is used to validate data contracts when schemas change, and the second, called Komondor, takes care of common ingestion pieces (like dedup).

http://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/

Compared to other distributed systems, Kafka is relatively easy to configure and operate. But that's not to say it never has problems—this presentation describes several situations where folks have experienced trouble.

http://www.slideshare.net/gwenshap/nyc-kafka-meetup-2015-when-bad-things-happen-to-good-kafka-clusters

The Stitch Fix blog has a post describing their experience with Spark. It covers how they think about when to use Spark, the Spark Data Source API, caching, the DataFrame API, and Spark SQL. There are some good tips and anecdotes, e.g. that Stitch Fix converted some Python jobs to use the DataFrame API and saw 6x performance improvements.

http://multithreaded.stitchfix.com/blog/2015/10/06/spark-for-data-science/

This post describes some statistical tests added to Spark's MLlib for Goodness-of-Fit. It gives some background on the tests and explains how they're implemented in Spark.

http://blog.cloudera.com/blog/2015/10/continuous-distribution-goodness-of-fit-in-mllib-kolmogorov-smirnov-testing-in-apache-spark/
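
The linked post's URL names the Kolmogorov-Smirnov test, so as a rough illustration of what such a goodness-of-fit test measures, here is a small single-machine sketch in plain Python. It is not MLlib's implementation or API; the function names and the sample values are made up for the example. It computes the one-sample statistic D, the largest gap between the empirical CDF of a sample and a reference CDF.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """One-sample KS statistic: max distance between empirical and given CDF."""
    values = sorted(sample)
    n = len(values)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(values):
        theoretical = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - theoretical, theoretical - i / n)
    return d


# Hypothetical sample, compared against a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(ks_statistic(sample, normal_cdf))
```

A small D means the sample is close to the reference distribution. Turning D into a p-value needs the Kolmogorov distribution, which this sketch leaves out.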

The MapR blog has a recap of the three talks given at the recent Bay Area Apache Flink Meetup. The talks covered stateful distributed stream processing, Gelly (the Flink graph processing API), and the future of Apache Flink.

https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink

Kudu, the new distributed storage engine from Cloudera, includes APIs in Java, C++, and Python (in alpha). These articles give an overview and introduction to the Kudu APIs in Python and Java.

http://peter-hoffmann.com/2015/getting-started-with-the-cloudera-kudu-storage-engine-in-python.html
http://harshj.com/writing-a-simple-kudu-java-api-program/

Sparkling Water is a library for combining H2O.ai's machine learning APIs and UI with Apache Spark. This post describes how Spark and H2O work together (both the API and architecture) and walks through an example of building a deep learning model using Sparkling Water.

http://blog.cloudera.com/blog/2015/10/how-to-build-a-machine-learning-app-using-sparkling-water-and-apache-spark/

This visualization provides an excellent introduction to the Raft distributed consensus algorithm. During the visualization (which lasts about 5 minutes), several animations describe leader election and log replication. If you're a visual learner (or even if not), this is one of the best ways to learn the fundamentals of Raft.

http://thesecretlivesofdata.com/raft/
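
For readers who prefer code to animation, the leader election step can be sketched in a few lines of Python. This is a deliberately simplified model, not a Raft implementation: the class and function names are invented for the example, there are no timeouts or networking, and the vote check ignores the log comparison that real Raft also performs. It only shows two rules: a node grants at most one vote per term, and a candidate needs a strict majority of the cluster to win.

```python
class Node:
    """A cluster member that tracks its term and who it voted for."""

    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, candidate, term):
        """Grant a vote at most once per term; reject stale terms."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            # A newer term resets this node's vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate, peers, reachable):
    """Start a new term, vote for self, and ask each reachable peer."""
    candidate.current_term += 1
    term = candidate.current_term
    candidate.voted_for = candidate.name
    votes = 1  # the candidate's own vote
    for peer in peers:
        if peer.name in reachable and peer.request_vote(candidate.name, term):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2  # strict majority required


a, b, c, d, e = [Node(name) for name in "abcde"]
# "a" can reach two peers: 3 of 5 votes, so it wins.
won_majority = run_election(a, [b, c, d, e], reachable={"b", "c"})
# "d" can reach only "e": 2 of 5 votes, so it cannot win.
won_minority = run_election(d, [a, b, c, e], reachable={"e"})
print(won_majority, won_minority)  # True False
```

The second election shows why a minority partition cannot elect its own leader.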

News

The Call for Abstracts for Hadoop Summit Europe, which takes place in Dublin on April 13-14, is open until October 30th.

http://mosaicevents.com/he16/speakers/

Releases

Apache Ignite, the in-memory data-fabric, released version 1.4.0 this week. It's the first release since Ignite graduated from the Apache incubator, and it adds SSL support, a faster JDBC driver, and more.

https://ignite.apache.org/news.html#release-1.4.0

Apache Accumulo 1.6.4 was released. The new version of the distributed key-value store includes bug-fixes and performance improvements. Notably, this release contains a fix for silent data-loss during bulk import.

https://accumulo.apache.org/release_notes/1.6.4.html

Cook is a new open-source Mesos framework scheduler from Two Sigma. Cook is a batch-scheduler designed to balance latency and throughput when there are more jobs than a Mesos cluster has capacity for. It has built-in support for Spark (including a Spark scheduler backend).

https://github.com/twosigma/Cook/releases/tag/v1.0.0

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

The Data Scientists' Guide to Apache Spark (San Francisco) - Monday, October 12
http://www.meetup.com/SF-Data-Science/events/225205575/

Spark at Thomson Reuters and Project Tungsten (San Francisco) - Tuesday, October 13
http://www.meetup.com/spark-users/events/225434934/

Samza October Meetup (Sunnyvale) - Tuesday, October 13
http://www.meetup.com/Bay-Area-Samza-Meetup/events/225378902/

Impala: Tuning and Best Practices (San Mateo) - Wednesday, October 14
http://www.meetup.com/Bay-Area-Cloudera-User-Group/events/225076519/

Washington

ML on Spark Roundtable (Bellevue) - Wednesday, October 14
http://www.meetup.com/Seattle-Spark-Meetup/events/220003831/

Texas

Learn about SpliceMachine (Houston) - Tuesday, October 13
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/224954252/

A Deeper Dive Into Apache Drill and Big Data with MongoDB (Addison) - Tuesday, October 13
http://www.meetup.com/Dallas-mongoDB-Meetup/events/225888117/

Georgia

Ted Dunning: Data Science & Business Intelligence (Kennesaw) - Tuesday, October 13
http://www.meetup.com/Atlanta-Society-for-Business-Intelligence/events/225083024/

New York

Transactions on Hadoop/HBase (New York) - Thursday, October 15
http://www.meetup.com/Hadoop-NYC/events/225572119/

UNITED KINGDOM

10th Spark London Meetup (London) - Monday, October 12
http://www.meetup.com/Spark-London/events/225721863/

Deep Dive: Spark SQL + DataFrames + Cassandra Connector (Edinburgh) - Tuesday, October 13
http://www.meetup.com/Scotland-Data-Science-Technology-Meetup/events/225504741/

October HUG UK MeetUp (London) - Thursday, October 15
http://www.meetup.com/hadoop-users-group-uk/events/225851486/

IRELAND

Spark After Dark/Spark for Fraud Detection (Dublin) - Thursday, October 15
http://www.meetup.com/Dublin-Spark-Meetup/events/225726069/

GERMANY

Flink Forward 2015 (Berlin) - Monday, October 12
http://www.meetup.com/Apache-Flink-Meetup/events/224238292/

CZECH REPUBLIC

Lambda & Kappa Architektura (Prague) - Thursday, October 15
http://www.meetup.com/CS-HUG/events/225866425/

HUNGARY

Cloudera Meets StarSchema and VirtDB to Rock Your Data (Budapest) - Monday, October 12
http://www.meetup.com/Budapest-Analytics-Rockstars/events/225750272/

INDIA

Mumbai Spark Meetup 4Q2015 (Mumbai) - Saturday, October 17
http://www.meetup.com/Big-Data-Developers-in-Mumbai/events/223301848/

Spark on YARN (Bangalore) - Saturday, October 17
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/225649429/

AUSTRALIA

Not Your Dad’s Old HBase (Melbourne) - Thursday, October 15
http://www.meetup.com/HadoopMelbourne/events/225693129/
