Data Eng Weekly


Hadoop Weekly Issue #191

23 October 2016

This week's issue is short and sweet with a few technical posts, two interesting news articles, and several exciting releases (including Apache Kafka 0.10.1.0). With Spark Summit Europe this week, expect lots of great content in the next issue. And if you're attending, please send interesting slides/talks my way!

Technical

Cloudera's CDH has supported intra-node disk balancing since version 5.8.2 (the feature is also part of the Apache Hadoop 3.0.0 alpha release). With it, a DataNode can rebalance data blocks across its own disks using the hdfs diskbalancer command. This post describes how the tool works and shows how to run it.

http://blog.cloudera.com/blog/2016/10/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
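
To make the idea of intra-node balancing concrete, here is a minimal Python sketch of planning moves between the disks of a single node. It is an illustration only, not the HDFS implementation: the Volume class, the plan_moves function, the greedy strategy, and the 10% threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    capacity: int  # total size, in arbitrary units
    used: int      # space currently occupied, same units


def plan_moves(volumes, threshold=0.10):
    """Greedily plan (source, destination, amount) moves between volumes.

    Hypothetical helper: the goal is for every volume to end up at the
    node-wide average used ratio.
    """
    total_used = sum(v.used for v in volumes)
    total_capacity = sum(v.capacity for v in volumes)
    ideal_ratio = total_used / total_capacity

    # Nothing to do if every volume is already close to the ideal ratio.
    if all(abs(v.used / v.capacity - ideal_ratio) <= threshold for v in volumes):
        return []

    # Positive surplus: the volume holds more than its ideal share.
    surplus = {v.name: v.used - round(ideal_ratio * v.capacity) for v in volumes}
    donors = sorted((n for n in surplus if surplus[n] > 0), key=lambda n: -surplus[n])
    receivers = sorted((n for n in surplus if surplus[n] < 0), key=lambda n: surplus[n])

    moves = []
    for src in donors:
        for dst in receivers:
            amount = min(surplus[src], -surplus[dst])
            if amount <= 0:
                continue
            moves.append((src, dst, amount))
            surplus[src] -= amount
            surplus[dst] += amount
    return moves


# Made-up example: one nearly full disk and two mostly empty ones.
disks = [Volume("disk1", 100, 90), Volume("disk2", 100, 10), Volume("disk3", 100, 20)]
plan = plan_moves(disks)
```

The real tool works on block files and is driven through the hdfs diskbalancer command described in the post; the sketch only shows why a plan is a list of source-to-destination moves.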

This post demonstrates the capabilities of the spark.ml library by building a logistic regression model to predict malignancy of cases from the Wisconsin Diagnostic Breast Cancer data set. The example code covers parsing the input, exploring the dataset with built-in statistics, extracting features, training the model, and evaluating it.

https://www.mapr.com/blog/predicting-breast-cancer-using-apache-spark-machine-learning-logistic-regression
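
For readers without a Spark cluster at hand, the train-then-evaluate flow can be sketched in plain Python. This is not spark.ml and not the post's code: it swaps in a hand-written batch gradient descent, and the train and predict helpers, the learning rate, and the toy feature values are all assumptions made up for illustration (they are not taken from the Wisconsin data set).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias with batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row under the current model.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 when the modeled probability is at least 0.5, else 0."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Made-up, already-scaled toy features; label 1 stands in for "malignant".
rows = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(rows, labels)
correct = sum(predict(weights, bias, x) == y for x, y in zip(rows, labels))
accuracy = correct / len(rows)
print("training accuracy:", accuracy)
```

A real evaluation would hold out a test split rather than score the training rows, which is one of the steps the post covers with Spark's own tooling.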

The Amazon Big Data blog has a tutorial for running RStudio with sparklyr on EMR. Thanks to a bootstrap action, a cluster with RStudio running on the master node can be launched with a single command.

https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/

The Databricks blog features a list of seven tips for debugging Apache Spark code on Databricks. Most of the suggestions, like "Scale up Spark jobs slowly for really large datasets" and "Examine the partitioning for your dataset," are generally applicable to all Spark users.

https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html

News

InfoQ has an interview with Yahoo VP of Engineering Peter Cnudde. Topics include Hadoop, Spark adoption at Yahoo (mostly for in-memory computing, not for ETL), and Caffe-on-Spark for deep learning.

https://www.infoq.com/articles/peter-cnudde-yahoo-big-data

ZDNet contributor Tony Baer reads between the lines of recent benchmarks by Cloudera and Hortonworks. The takeaways are as follows: 1) "SQL's the gateway drug to Hadoop," 2) Cloudera is trying to challenge Amazon (in this case Redshift), and 3) Hortonworks (via Hive's Live Long and Process, or LLAP) has caught up with the investment Cloudera made in Impala.

http://www.zdnet.com/article/sql-on-hadoop-benchmarks-get-serious/

Releases

Apache Kafka 0.10.1.0 was released this week. It contains improvements from over 500 pull requests and the implementation of 15 Kafka Improvement Proposals. The Confluent blog has the highlights: additions and improvements to the Kafka server (time-based indexes, replication quotas, and improved log compaction), improvements to the Kafka client APIs (interactive queries for Kafka Streams, improved memory management, secure quotas, and more), and bug fixes.

http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3CCAJL4t_oz9q4T9vn6Z-EBoazWJFyqHw4Y0L-PTowD%2BpFhcPv0VQ%40mail.gmail.com%3E
http://www.confluent.io/blog/announcing-apache-kafka-0-10-1-0/

Apache Fluo (incubating) recently had its first release since entering the incubator. Fluo is a tool for making "incremental updates to large data sets stored in Apache Accumulo" à la Google's Percolator.

https://fluo.apache.org/release/fluo-1.0.0-incubating/

Apache Flume 1.7.0 was released. It adds support for a taildir source and includes a number of improvements and bug fixes, many of which concern Flume's integration with Apache Kafka.

http://flume.apache.org/releases/1.7.0.html

Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release (version 1.0.0 was also released recently, in August). This release adds a number of improvements and bug fixes.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version0.7.1

Apache Giraph 1.2.0 was released. Highlights of the release include a new blocks API, support for graphs that don't fit in memory, and a new set of default configuration options based on Facebook's experience with Giraph.

https://blogs.apache.org/giraph/entry/giraph_1_2_0_release

deeplearning4j is a deep learning implementation that integrates with Hadoop and Spark and supports GPUs. Version 0.6.0 was recently released.

https://github.com/deeplearning4j/deeplearning4j

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Uber Engineering Tech Talk Series (San Francisco) - Monday, October 24
http://www.meetup.com/UberEvents/events/234789134/

Real-Time Streaming and Exactly-Once Semantics with Kafka (San Francisco) - Tuesday, October 25
http://www.meetup.com/MemSQL/events/234405914/

Building Your First Spark & C* App + SMACK Stack + The Cassandra Odyssey (San Francisco) - Wednesday, October 26
http://www.meetup.com/SF-Spark-and-Friends/events/234932979/

Apache YARN Committers/Contributors Meetup #4 (Sunnyvale) - Thursday, October 27
http://www.meetup.com/Hadoop-Contributors/events/234971372/

Washington

Kafka Palooza: LinkedIn, Microsoft Azure, MapR (Bellevue) - Monday, October 24
http://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/234836624/

Nevada

PixieDust: Making Python Visualizations Easier for Jupyter Notebooks with Spark (Las Vegas) - Monday, October 24
http://www.meetup.com/Data-Science-Las-Vegas/events/234557659/

Texas

O&G Big Data Use Cases, by Hortonworks (Houston) - Thursday, October 27
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/234282996/

Kansas

Using Data Quality to Support Analytics in Hadoop (Overland Park) - Tuesday, October 25
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597551/

Missouri

Using Data Quality to Support Analytics in Hadoop (Kansas City) - Tuesday, October 25
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597347/

Illinois

Big Data Streaming Platform Ecosystem (Chicago) - Tuesday, October 25
http://www.meetup.com/ChicagoRealTimeStreamingAnalytics/events/234676872/

Apache Spark 101 (Chicago) - Tuesday, October 25
http://www.meetup.com/Chicago-Spark-Users/events/233999667/

Ohio

October Edition of MOHUG (Dublin) - Tuesday, October 25
http://www.meetup.com/MOHUG-Mid-Ohio-Hadoop-User-Group/events/234416779/

Florida

Apache Spark (Miami) - Wednesday, October 26
http://www.meetup.com/Miami-Hadoop-User-Group/events/234992451/

New York

Lambda-in-a-Box: Merging Apache Spark & HBase into an Open-Source Database (New York) - Thursday, October 27
http://www.meetup.com/mysqlnyc/events/233775657/

October Data Engineering Meetup (New York) - Thursday, October 27
http://www.meetup.com/NYC-Data-Engineering/events/234946410/

CANADA

Toronto Apache Spark #14 (Toronto) - Wednesday, October 26
http://www.meetup.com/Toronto-Apache-Spark/events/234878620/

Introduction to MapR (Toronto) - Thursday, October 27
http://www.meetup.com/Toronto-MapR-User-Group/events/231648976/

UNITED KINGDOM

Why SMACK for Fast Data (London) - Monday, October 24
http://www.meetup.com/skillsmatter/events/234588911/

Building Scalable Systems in a Changing Data Landscape (London) - Tuesday, October 25
http://www.meetup.com/data-science-lab/events/234754144/

Spark Structured Streaming in Practice (London) - Wednesday, October 26
http://www.meetup.com/hadoop-users-group-uk/events/234876912/

SPAIN

Season Premiere with Reynold Xin, Co-Founder & Chief Architect at Databricks (Barcelona) - Thursday, October 27
http://www.meetup.com/Spark-Barcelona/events/234463208/

Introduction to Kafka (Malaga) - Friday, October 28
http://www.meetup.com/Linux-Malaga/events/234826330/

BELGIUM

Spark Pre-Summit Meetup (Brussels) - Tuesday, October 25
http://www.meetup.com/Spark-Belgium/events/234234256/

Meeting on Streamsets, Datameer and Kudu (Kontich) - Tuesday, October 25
http://www.meetup.com/Belgium-Cloudera-User-Group/events/234618841/

Spark & Machine Learning Meetup (Brussels) - Thursday, October 27
http://www.meetup.com/Data-Science-Community-Meetup/events/234173917/

INDIA

Introduction to Spark & Use Cases (Hyderabad) - Monday, October 24
http://www.meetup.com/meetup-group-ytFpRTDs/events/234412261/

AUSTRALIA

Rethink SQL for Big Data with Apache Drill (Barton) - Tuesday, October 25
http://www.meetup.com/Canberra-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463561/

Spark Meetup October (Sydney) - Wednesday, October 26
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/233723585/

Rethink SQL for Big Data with Apache Drill (Melbourne) - Thursday, October 27
http://www.meetup.com/Melbourne-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463459/

ESTONIA

Big Data: Spark and TensorFlow (Tallinn) - Monday, October 24
http://www.meetup.com/Advanced-Java-Estonia/events/234612322/