Data Eng Weekly


Hadoop Weekly Issue #149

13 December 2015

The two themes for this week seem to be comparisons and compatibility. On the former, there are articles comparing Spark to Drill and Hadoop as well as Hive to MySQL. Regarding compatibility, MapR announced MapR Streams, which is API-compatible with Kafka, and there's a post on Flink's recently announced support for Storm topologies. In addition, Confluent Platform 2.0 is out, and this week's issue has the first (probably of many) articles looking back on the year of Hadoop. Finally, SCALE has an interview with Doug Cutting, which ties many of these topics together in the context of open-source and the evolution of the Hadoop ecosystem.

Technical

While I wouldn't consider Amazon Redshift a part of the Hadoop ecosystem per se, it's often used in conjunction with Amazon EMR or other cloud-based Hadoop solutions. With that in mind, here are some useful tips for tuning the performance of a Redshift cluster.

http://blogs.aws.amazon.com/bigdata/post/Tx31034QG0G3ED1/Top-10-Performance-Tuning-Techniques-for-Amazon-Redshift

The MapR blog has a comparison of Apache Drill and Apache Spark. It notes that the major difference is that Drill is SQL-first, while Spark supports several query mechanisms, of which SQL is one. Consistent with that focus, Drill supports additional SQL features, such as ANSI SQL syntax, keywords for nested and array data (which are useful for querying JSON), and views.

https://www.mapr.com/blog/apache-spark-vs-apache-drill

This post describes the Hive architecture, schema-on-read, schema-on-write, and some recommendations on when to use Hive and when to use MySQL.

http://blog.matthewrathbone.com/2015/12/08/hive-vs-mysql.html
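
For readers new to the schema-on-read and schema-on-write distinction mentioned in the Hive versus MySQL post above, here is a minimal Python sketch. It is not taken from the linked post: the two-column schema, the function names, and the JSON input are all hypothetical, and the NULL-on-mismatch behavior is only meant to be similar in spirit to what Hive does when a value does not fit the declared type.

```python
import json

# Hypothetical two-column schema, used only for illustration.
SCHEMA = {"id": int, "name": str}


def write_with_schema(table, row):
    """Schema-on-write: validate a row before storing it (MySQL-style)."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(row.get(column), expected_type):
            raise ValueError(f"column {column!r} expects {expected_type.__name__}")
    table.append(row)


def read_with_schema(raw_lines):
    """Schema-on-read: keep raw text as-is and apply the schema at query time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for column, expected_type in SCHEMA.items():
            value = record.get(column)
            try:
                row[column] = expected_type(value) if value is not None else None
            except (TypeError, ValueError):
                # Values that cannot be converted surface as NULL (None).
                row[column] = None
        yield row


if __name__ == "__main__":
    table = []
    write_with_schema(table, {"id": 1, "name": "a"})
    print(table)
    print(list(read_with_schema(['{"id": "7", "name": "b"}', '{"id": "oops"}'])))
```

In the sketch, bad data is rejected at load time in the schema-on-write path, while the schema-on-read path accepts anything and defers the cost of interpretation (and of discovering bad values) to each query.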

Apache Flink 0.10 added beta support for compatibility with Apache Storm. Using this support, a Storm topology can be run on Flink with only minor changes (it must be converted to a Flink topology, which requires changing a few lines of code). In addition, existing Storm Spouts and Bolts can be embedded inside a Flink topology. This post describes the integration and gives examples of both features.

http://flink.apache.org/news/2015/12/11/storm-compatibility.html

This presentation describes how the team at Magnetic has scaled Spark. The slides are somewhat sparse, but they mention how Magnetic is using AWS (they're slowly migrating there from a colo) with details on instance types and auto-scaling.

http://www.slideshare.net/arov/scaling-spark

This presentation describes how Treasure Data does data analytics. As a Ruby shop, Treasure Data uses a mix of languages for their platform. For collecting data, they use fluentd and embulk, and they use Hive and Presto for much of their processing. The presentation describes how they coordinate processing (e.g. PerfectSched and PerfectQueue) and describes several other tools they use (such as MessagePack).

http://www.slideshare.net/tagomoris/data-analytics-service-company-and-its-ruby-usage-56073823

Cloudera CDH 5.5 has support for Apache HTrace (incubating), which can provide granular details about timings of HDFS operations. This post describes how to set up HTrace and htraced (from Cloudera Labs) to record this information and view it with the included web front-end.

http://blog.cloudera.com/blog/2015/12/new-in-cloudera-labs-apache-htrace-incubating/

News

ReadWrite has an article arguing that Hadoop and Spark will continue to coexist for the foreseeable future. Reasons include Spark's lack of a file system (Hadoop provides HDFS) and Hadoop YARN, which can provide a platform for other compute frameworks including various SQL-on-Hadoop systems.

http://readwrite.com/2015/12/02/spark-hadoop-business-intelligence

SCALE has an interview with Hadoop creator Doug Cutting. Topics covered include defining Hadoop as a collection of projects, the addition of Kafka and Spark to this core collection, the rise of Spark across several industries (and with many companies behind it), monetizing open-source big data projects, and the comparison between open-source and proprietary technology from Google.

https://medium.com/s-c-a-l-e/hadoop-creator-doug-cutting-on-evolving-and-succeeding-in-open-source-3277a42e5b6e

PCWorld has an article comparing Hadoop and Spark, which reiterates that the two complement each other well. The article also describes some of the ways they differ (Spark is faster, and the two have different failure recovery modes) and notes that they can be used independently.

http://www.pcworld.com/article/3014515/five-things-you-need-to-know-about-hadoop-v-apache-spark.html#tk.rss_all

Apache Kylin, the OLAP big data system for Hadoop, has graduated from the Apache Incubator. The announcement notes that Kylin is used by several big companies, such as eBay and Meituan.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces85

The MSDN blog has a list of resources about using Azure for data science. In addition to several articles and tools (such as HDInsight for running Hadoop), the post highlights the Azure for Research Award program under which academic and research institutions can apply for research awards of Azure resources.

http://blogs.msdn.com/b/azure_4_research/archive/2015/12/07/data-science-with-microsoft-azure.aspx

This is the first of (likely) many posts reviewing 2015 and looking ahead to 2016. It highlights the rise of Spark, the shift towards SQL (and a couple of new SQL-on-Hadoop engines), the rise of highly scalable machine learning libraries, and more. Looking ahead, the author predicts that appliances and cloud will drive Hadoop adoption, integration of machine learning into analytics tools will improve, and data lakes will start to grow in number.

http://www.onstrategies.com/blog/2015/12/09/big-data-2015-2016-a-look-back-and-a-look-ahead/

Releases

Hortonworks has announced support for Apache Spark 1.5.2 for their distribution. The 1.5.x release line has big speedups for the DataFrame/SQL system, several improvements for Machine Learning APIs, improvements to Spark Streaming, and more.

http://finance.yahoo.com/news/hortonworks-accelerates-spark-scale-enterprise-140000833.html

Confluent has announced version 2.0 of the Confluent Platform, which packages Apache Kafka 0.9. The new version includes improvements to security, new Kafka connectors (for streaming data between Kafka and systems like HDFS and JDBC), new and improved clients, and more.

http://www.confluent.io/blog/confluent-platform-2.0-with-apache-kafka-0.9-ga

MapR has announced MapR Streams, a new streaming product that's integrated with MapR's existing data platform. MapR Streams provides the Kafka API and is compatible with Spark Streaming, Storm, Flink, and Apex.

https://www.mapr.com/blog/announcing-industrys-first-converged-data-platform

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Using Spark, GraphX, and Zeppelin to Analyze Clickstream Data (San Francisco) - Monday, December 14
http://www.meetup.com/SF-Data-Science/events/227128357/

Faster Than Parquet! A Deep Dive into Kudu (San Francisco) - Tuesday, December 15
http://www.meetup.com/San-Francisco-Spark-Hackers/events/226999521/

#OCBigData Holiday Party 2015 (Irvine) - Wednesday, December 16
http://www.meetup.com/OCBigData/events/226172497/

In-Memory Computing with Apache Ignite (Sunnyvale) - Wednesday, December 16
http://www.meetup.com/Bay-Area-In-Memory-Computing/events/226754108/

Texas

Going from Hadoop to Spark, Kept Simple (Houston) - Thursday, December 17
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/227253582/

Houston's 1st Spark Meetup (Houston) - Thursday, December 17
http://www.meetup.com/Houston-Spark-Meetup/events/227185659/

Illinois

Interactive Data Analytics with Flink and Zeppelin (Chicago) - Tuesday, December 15
http://www.meetup.com/Chicago-Apache-Flink-Meetup/events/227129522/

Georgia

Modern Data Management Practices (Alpharetta) - Wednesday, December 16
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/227120899/

North Carolina

Kudu: New Hadoop Storage for Fast Analytics on Fast Data (Charlotte) - Wednesday, December 16
http://www.meetup.com/CharlotteHUG/events/225229197/

New York

Data Driven NYC #42 (New York) - Monday, December 14
http://www.meetup.com/NYC-Data-Business-Meetup/events/226860395/

Hadoop & Spark Panel Discussion (New York) - Monday, December 14
http://www.meetup.com/Metis-New-York-Data-Science/events/226677813/

Greenplum Database: The First Open Source Data Warehouse (New York) - Wednesday, December 16
http://www.meetup.com/Pivotal-NY/events/226886488/

CANADA

Toronto Apache Spark #4 (Toronto) - Monday, December 14
http://www.meetup.com/Toronto-Apache-Spark/events/226531082/

IRELAND

Storm, Spark Streaming + Prometheus Monitoring + Spark/Akka for Data Generation (Dublin) - Monday, December 14
http://www.meetup.com/hadoop-user-group-ireland/events/227132614/

UNITED KINGDOM

London Big Data Meetup - Dec2015 (London) - Monday, December 14
http://www.meetup.com/LondonandSEbigdata/events/226967447/

GERMANY

Flink Meetup #12 (Berlin) - Wednesday, December 16
http://www.meetup.com/Apache-Flink-Meetup/events/226586555/

SERBIA

Apache Spark in Theory and Practice (Belgrade) - Friday, December 18
http://www.meetup.com/Data-Science-Serbia/events/227356896/

ROMANIA

Apache Spark Workshop (Cluj-Napoca) - Wednesday, December 16
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/227169515/

ISRAEL

Spark on Mesos: "The Road Less Travelled" & Profiling Users Using Spark (Tel Aviv-Yafo) - Tuesday, December 15
http://www.meetup.com/Big-things-are-happening-here/events/225339994/

INDIA

Introduction to Apache Spark (Hyderabad) - Saturday, December 19
http://www.meetup.com/HySpark/events/227265458/

NEPAL

Introduction to Apache Spark: Lightning-Fast Cluster Computing (Kathmandu) - Saturday, December 19
http://www.meetup.com/Kathmandu-Apache-Spark-Meetup/events/226446244/