Data Eng Weekly


Hadoop Weekly Issue #166

17 April 2016

Hortonworks had a number of announcements this week at Hadoop Summit Europe, and there's coverage of these throughout the newsletter. Apache Storm hit version 1.0.0 with some impressive new features. In technical news, there are multiple articles about scaling Kafka and services built on Kafka, as well as about testing distributed systems. And if you missed Hadoop Summit, videos of many of the presentations have already been posted.

Technical

Smyte has written about their infrastructure for real-time spam and fraud detection based on a stream of event data. The initial event-processing system was built with Kafka, Redis, Secor, and S3. To scale further at lower cost, they moved to a disk-based solution that speaks the Redis protocol, is backed by RocksDB, and uses Kafka for replication.

https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da
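
For context, here is a minimal sketch of the kind of disk-backed counter the post describes, written in Scala against RocksDB's Java bindings and its built-in "uint64add" merge operator. The database path and key are made up, and this omits the Redis protocol layer and Kafka replication that Smyte describe.

    import java.nio.{ByteBuffer, ByteOrder}
    import org.rocksdb.{Options, RocksDB}

    object CounterSketch {
      def main(args: Array[String]): Unit = {
        RocksDB.loadLibrary()
        // "uint64add" is RocksDB's built-in 64-bit add merge operator, so
        // increments become cheap merge records instead of read-modify-writes.
        val options = new Options()
          .setCreateIfMissing(true)
          .setMergeOperatorName("uint64add")
        val db = RocksDB.open(options, "/tmp/counters")   // hypothetical path

        def encode(n: Long): Array[Byte] =
          ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(n).array()
        def decode(b: Array[Byte]): Long =
          ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getLong()

        val key = "events:signup:2016-04-17".getBytes("UTF-8")  // hypothetical key
        db.merge(key, encode(1L))     // increment by 1
        db.merge(key, encode(4L))     // increment by 4
        println(decode(db.get(key)))  // => 5

        db.close()
      }
    }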

This post describes combining rsyslog, Kafka, and AWS with the ELK stack (Elasticsearch, Logstash, and Kibana) to address back-pressure, scaling problems, and maintenance issues. The post covers the rsyslog integration with Kafka and schema tricks with rsyslog, as well as how to run Kafka, ZooKeeper, and other components in AWS auto scaling groups.

https://www.bashton.com/blog/2016/elk-on-ark/
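
As a rough illustration of the Kafka leg of such a pipeline, here is a minimal Scala consumer using the 0.9+ Java consumer API that tails the JSON log lines rsyslog publishes to Kafka. The broker address, topic, and group id are placeholders; in the article's setup Logstash plays this role and forwards events into Elasticsearch.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    object LogTail {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafka-1:9092")   // placeholder broker
        props.put("group.id", "log-tail")                // placeholder group
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("rsyslog-json"))  // placeholder topic

        // Print each JSON log line as it arrives.
        while (true) {
          val records = consumer.poll(1000)
          for (record <- records.asScala) {
            println(record.value())
          }
        }
      }
    }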

Hortonworks has a post about the upcoming data governance features of Apache Atlas and Apache Ranger. Among these are classification-based access controls, data expiry-based policies, location-specific policies, prohibition of dataset combinations, and cross-component lineage (e.g. tracking data from Kafka to Storm to Hive).

http://hortonworks.com/blog/the-next-generation-of-hadoop-based-security-data-governance/

Apache HAWQ (incubating) is a SQL engine based on Greenplum for querying data on HDFS. This post discusses HAWQ's classic design and upcoming improvements, including how it differs from Spark and MapReduce, some of the challenges of a classic MPP design on Hadoop, and how HAWQ's new design combines MPP and batch techniques to get the best of both worlds.

https://blog.pivotal.io/big-data-pivotal/products/apache-hawq-next-step-in-massively-parallel-processing

The Cloudera blog has a post describing tools they use for fault injection and network partitioning to test distributed systems like Hadoop. Their AgenTEST tool can inject network problems (e.g. dropping packets), saturate resources (e.g. CPU, IO, disk space), and more. For network partitions, they evaluate circular partitioning, bridge partitioning, and other topologies.

http://blog.cloudera.com/blog/2016/04/quality-assurance-at-cloudera-fault-injection-and-elastic-partitioning/
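
As a rough, generic sketch of the network-partition technique (this is not Cloudera's AgenTEST, just an illustration of the underlying idea), here is Scala that shells out to iptables to drop inbound traffic from a peer host and then heal the partition. The peer address is a placeholder and the commands require root.

    import scala.sys.process._

    object PartitionSketch {
      // Drop all inbound traffic from `host`, simulating one side of a partition.
      def isolateFrom(host: String): Int =
        Seq("iptables", "-A", "INPUT", "-s", host, "-j", "DROP").!

      // Remove the rule again to heal the partition.
      def healFrom(host: String): Int =
        Seq("iptables", "-D", "INPUT", "-s", host, "-j", "DROP").!

      def main(args: Array[String]): Unit = {
        val peer = "10.0.0.2"     // placeholder peer address
        isolateFrom(peer)
        Thread.sleep(30 * 1000)   // let the system under test react
        healFrom(peer)
      }
    }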

The Hortonworks blog has a look ahead to HDP 2.4.2, which will include new versions of Spark and Zeppelin. Looking past that release, the post also previews Spark 2.0 and upcoming Zeppelin features.

http://hortonworks.com/blog/apache-spark-apache-zeppelin-whats-coming-in-hdp-2-4-2/

Cask has written about how they do long-running tests to evaluate correctness of distributed systems before and after infrequent events like region compaction in HBase.

http://blog.cask.co/2016/04/long-running-tests-in-cdap/
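
To make the idea concrete, here is a small Scala sketch (not CDAP's actual test harness) against the HBase 1.x client API: write a known row, force a major compaction, and verify the row is still readable afterwards. The table, column family, and qualifier names are placeholders, and a real test would wait for the compaction to finish before checking.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object CompactionCheck {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val tableName = TableName.valueOf("events")   // placeholder table
        val table = conn.getTable(tableName)

        // Write a known row before the "infrequent event".
        val put = new Put(Bytes.toBytes("row-1"))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("before"))
        table.put(put)

        // Trigger a major compaction (runs asynchronously on the region servers).
        val admin = conn.getAdmin
        admin.majorCompact(tableName)

        // Verify the row is still readable afterwards.
        val result = table.get(new Get(Bytes.toBytes("row-1")))
        assert(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"))) == "before")

        admin.close(); table.close(); conn.close()
      }
    }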

This article describes how to use SparkR with Amazon EMR to do geospatial analysis. Using SparkR's Hive integration, the first step is to define a Hive external table based on data in S3. From there, data can be collected into memory and analyzed directly in R, making it easy to produce high-quality visualizations.

http://blogs.aws.amazon.com/bigdata/post/Tx1MECZ47VAV84F/Exploring-Geospatial-Intelligence-using-SparkR-on-Amazon-EMR
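
The same first step, sketched in Scala rather than R: declare a Hive external table over data in S3 via a HiveContext, aggregate in Spark, and collect the small result back to the driver for plotting, mirroring the collect-into-R step in the article. The S3 path, table name, and columns here are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object GeoTableSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("geo-sketch"))
        val hiveContext = new HiveContext(sc)

        // External table over delimited data already sitting in S3 (path/columns made up).
        hiveContext.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS gdelt_events (
            |  event_date STRING, lat DOUBLE, lon DOUBLE, mentions INT
            |) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
            |LOCATION 's3://my-bucket/gdelt/'""".stripMargin)

        // Aggregate in Spark, then collect the (small) result locally.
        val top = hiveContext.sql(
          "SELECT lat, lon, SUM(mentions) AS total FROM gdelt_events " +
          "GROUP BY lat, lon ORDER BY total DESC LIMIT 100")
        top.collect().foreach(println)

        sc.stop()
      }
    }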

MapR has a tutorial on analyzing team-level Major League Baseball statistics using Pig and Hive. Pig is used for initial data munging, and Hive provides SQL-based querying of the data. Via the Hive server and the Hive ODBC driver, the data can then be fetched and analyzed in Microsoft Excel.

https://www.mapr.com/blog/using-hive-and-pig-baseball-statistics

SignalFx pushes 70+ billion messages/day through a 27-node Kafka cluster. Based on their experience scaling Kafka to such high volumes, they've shared a number of tips around instrumenting Kafka, when to alert (e.g. when log flush latency increases or there are under-replicated partitions), and scaling out Kafka.

http://www.confluent.io/blog/how-we-monitor-and-run-kafka-at-scale-signalfx
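
For instance, the under-replicated partition count mentioned above is exposed by each broker over JMX. Here is a minimal Scala polling sketch; the JMX host/port are placeholders, and a real deployment would ship this metric through a monitoring agent rather than ad-hoc polling.

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    object UnderReplicatedCheck {
      def main(args: Array[String]): Unit = {
        // Placeholder JMX endpoint of a broker started with remote JMX enabled.
        val url = new JMXServiceURL(
          "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi")
        val connector = JMXConnectorFactory.connect(url)
        val mbeans = connector.getMBeanServerConnection

        // A sustained non-zero value here is a common alerting condition,
        // alongside rising log flush latency.
        val bean = new ObjectName(
          "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions")
        val underReplicated = mbeans.getAttribute(bean, "Value")
        println(s"UnderReplicatedPartitions = $underReplicated")

        connector.close()
      }
    }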

The data Artisans blog has an article about Flink's ability to count streams of data efficiently, at low latency, and correctly. To demonstrate efficiency, the team re-ran the recent Yahoo! streaming benchmark at higher throughputs (with a few other changes). In terms of correctness, the post highlights Flink's ability to differentiate event and processing time (using Star Wars movie chronology as an analogy). Finally, the post describes an upcoming feature for querying the in-memory state of live Flink jobs.

http://data-artisans.com/counting-in-streams-a-hierarchy-of-needs/
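
A small Flink (Scala API) sketch of the event-time counting the post talks about: each record carries its own timestamp, and windows are driven by that timestamp rather than by wall-clock arrival time. The event type, values, and window size here are made up.

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    object EventTimeCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        // Hypothetical events: (campaign id, event-time in epoch millis).
        val events: DataStream[(String, Long)] = env.fromElements(
          ("campaign-1", 1460900000000L),
          ("campaign-1", 1460900001000L),
          ("campaign-2", 1460900002000L))

        events
          .assignAscendingTimestamps(_._2)   // use the embedded event time
          .map(e => (e._1, 1L))
          .keyBy(0)
          .timeWindow(Time.seconds(10))      // 10-second tumbling windows in event time
          .sum(1)
          .print()

        env.execute("event-time counting sketch")
      }
    }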

This tutorial shows how to build a custom Spark Streaming source that consumes a stream of text data from a TCP socket.

https://medium.com/@anicolaspp/spark-custom-streaming-sources-e7d52da72e80
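
The standard shape of such a source is a custom receiver. Here is a condensed Scala sketch following that pattern (the host and port are placeholders); see the tutorial for the full walkthrough.

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver

    // Reads newline-delimited text from a TCP socket and hands it to Spark Streaming.
    class SocketLineReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("socket-line-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {}  // the reading thread exits when the socket closes

      private def receive(): Unit = {
        val socket = new Socket(host, port)
        val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
        var line = reader.readLine()
        while (!isStopped && line != null) {
          store(line)          // push each line into Spark's block manager
          line = reader.readLine()
        }
        reader.close()
        socket.close()
      }
    }

    object SocketStreamApp {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("socket-sketch"), Seconds(5))
        val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
        lines.count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }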

This post describes how to prevent inadvertent addition of AWS credentials to a patch or git commit when building Hadoop. In addition to some Hadoop-specific recommendations, the post suggests using the git-secrets tool for preventing accidental commits of your access/secret key. If you're using the Hadoop S3 bindings, there's also a call to help evaluate new patches.

http://steveloughran.blogspot.co.uk/2016/04/testing-against-s3-and-object-stores.html

Big Data and Brews has interviews with Ted Dunning of MapR and Jacques Nadeau of MapR. Apache Arrow is a topic of both interviews, and Ted also talks about MapR, Drill, and more.

https://www.youtube.com/watch?v=l3mDDKjDjMk
https://www.youtube.com/watch?v=Xo9CO0a0VJI

News

DataEngConf was recently held in San Francisco. This post has an overview of talks from Uber, Stripe, Microsoft, Instacart, and Jawbone. It also describes a major theme of the conference: "Data Science in real world is a product design and engineering discipline."

https://medium.com/@eugmandel/software-engineering-invades-data-science-notes-from-dataengconf-4a3c066b081f#.g2h0duo44

Hortonworks made several announcements at Hadoop Summit Europe, which took place last week in Dublin. ZDNet has coverage of the highlights, which include an extended partnership with Pivotal (who will now resell HDP), a reseller agreement with Syncsort, and tech previews of Atlas, Ranger, Zeppelin, and Metron. The article also discusses some of the differences in offerings between Hortonworks, Cloudera, and MapR.

http://www.zdnet.com/article/hortonworks-announces-new-alliances-and-releases-hadoop-comes-to-fork-in-road/

Flink Forward 2016 will be held in Berlin, Germany in September. The call for submissions is open until the end of June.

http://flink.apache.org/news/2016/04/14/flink-forward-announce.html

Videos of presentations from Hadoop Summit Dublin are available on YouTube. As one would expect, there are presentations covering all the different parts of the Hadoop ecosystem.

https://www.youtube.com/channel/UCAPa-K_rhylDZAUHVxqqsRA/videos?flow=list&live_view=500&view=0&sort=dd

Releases

Metascope is a new tool that provides metadata management alongside Schedoscope for Hadoop clusters. It provides insight into data, such as lineage, via a web interface. It also supports search, inline documentation, a REST API, and more.

https://github.com/ottogroup/metascope

Apache HBase 1.2.1, which resolves 27 issues since the original 1.2.0 release, was announced this week. The release announcement highlights four priority fixes.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201604.mbox/%3CCAN5cbe7-T5uAYvGRbxw2dfvdbwe5s0nx3vKU8Nt2fzXbKPoQTg@mail.gmail.com%3E

Version 0.12.0 of Apache Mahout, the machine learning library, has been released. With this release, Mahout's "Samsara" math environment now supports Apache Flink and is platform independent. The release announcement shares more about the Flink integration, known issues, and the project roadmap.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201604.mbox/%3CCAOtpBjj5An876PStdn5kMeaF+up-B72WTmCk9j21EXdP=JOCUA@mail.gmail.com%3E

Apache Storm 1.0.0 was released this week. Highlights of the release include improved performance (~3x for most use cases), a new distributed cache API, high availability for the nimbus node, automatic backpressure, dynamic worker profiling, and more.

http://storm.apache.org/2016/04/12/storm100-released.html

Apache Kudu (incubating) version 0.8.0 was released this week. This release adds an Apache Flume sink, implements several improvements, and fixes a handful of bugs.

http://getkudu.io/releases/0.8.0/docs/release_notes.html

Cloudbreak, the system for provisioning Hadoop clusters on cloud infrastructure using Docker, released version 1.2 this week. New features include support for OpenStack and a new recipe feature for running custom server provisioning scripts.

http://hortonworks.com/blog/announcing-cloudbreak-1-2/

Cloudera announced Cloudera Enterprise 5.4.10, which has fixes for Flume, Hadoop, HBase, Hive, Impala, and more.

http://community.cloudera.com/t5/Community-News-Release/ANNOUNCE-Cloudera-Enterprise-5-4-10-Released/m-p/39790#U39790

Presto Accumulo is a new project providing a Presto connector for reading/writing data from/to Accumulo.

https://github.com/bloomberg/presto-accumulo

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Open Source Roundtable (Mountain View) - Tuesday, April 19
http://www.meetup.com/BOLD-Bay-area-Organization-of-Ladies-in-big-Data/events/230133936/

SDBigData Meetup #14 (San Diego) - Wednesday, April 20
http://www.meetup.com/sdbigdata/events/228653315/

52nd Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, April 20
http://www.meetup.com/hadoop/events/229327671/

High Performance Spark + Spark Committers + Internals + Perf Tuning + Profiling (San Francisco) - Thursday, April 21
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/223668878/

Data in Motion: Simplifying Security & Building Custom Integrations (Palo Alto) - Thursday, April 21
http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/230021297/

Hands-On Introduction to Spark & Zeppelin (Santa Clara) - Thursday, April 21
http://www.meetup.com/futureofdata-siliconvalley/events/229478656/

Washington

CassieQ: A Distributed Queue Built on Cassandra (Bellevue) - Monday, April 18
http://www.meetup.com/Cassandra-Seattle-Users/events/229766571/

GraphFrames, Survival Analysis, and SnappyData + Spark (Bellevue) - Wednesday, April 20
http://www.meetup.com/Seattle-Spark-Meetup/events/229878481/

Going Beyond Apache HBase (Bellevue) - Wednesday, April 20
http://www.meetup.com/Big-Data-Bellevue-BDB/events/222646103/

Putting Apache Drill to Use: Best Practices for Production Deployments (Seattle) - Thursday, April 21
http://www.meetup.com/Seattle-Data-Science/events/228465589/

Illinois

ETL to ML: Use Apache Spark as an End-to-End Tool for Advanced Analytics (Chicago) - Monday, April 18
http://www.meetup.com/Chicago-Spark-Users/events/229999470/

Apache Flink 1.0: A New Era for Real-Time Streaming Analytics (Chicago) - Tuesday, April 19
http://www.meetup.com/Chicago-Apache-Flink-Meetup/events/229472165/

Spark & H2O: Sparkling Water + Cybersecurity with Machine Learning (Chicago) - Tuesday, April 19
http://www.meetup.com/Chicago-Big-Data-Science/events/230209808/

Tennessee

How HealthTrust Is Using Cloudera (Franklin) - Wednesday, April 20
http://www.meetup.com/NashvilleTN-Cloudera-User-Group/events/230047633/

Alabama

Real-Time Data Processing Using ZooKeeper, Kafka, and Storm (Huntsville) - Tuesday, April 19
http://www.meetup.com/Huntsville-Big-Data-Meetup/events/230070112/

Virginia

Apache Flink 1.0 (McLean) - Wednesday, April 20
http://www.meetup.com/DCFlinkMeetup/events/229435434/

Maryland

Socialize and Discuss the Roadmap of Azure Services (Chevy Chase) - Monday, April 18
http://www.meetup.com/MD-DC-VA-Azure-Architecture-Meetup/events/229135280/

District of Columbia

Hilton Presents Their Hadoop Journey & Apache Calcite Introduction (Washington) - Wednesday, April 20
http://www.meetup.com/Washington-DC-Hortonworks-User-Group-Meetup/events/229668371/

New Jersey

Hands-On Workshop: Scala + Spark (Hamilton) - Tuesday, April 19
http://www.meetup.com/nj-hadoop/events/229778793/

New York

SnappyData: Create Robust Analytic Applications with Spark Streaming (New York) - Wednesday, April 20
http://www.meetup.com/Pivotal-NY/events/230174931/

Apache Flink 1.0 (New York) - Thursday, April 21
http://www.meetup.com/NYCFlink/events/229435757/