Data Eng Weekly


Hadoop Weekly Issue #235

09 October 2017

Short and sweet issue this week with coverage of Apache Apex, a new serverless ETL library from Nextdoor, Kafka Streams at Pinterest, a Jepsen post on Hazelcast, and more.

Technical

Since its announcement, Apache Kudu has promised to power a number of interesting use cases. This presentation looks at one of those: using it with the Apache Apex stream processing system. It covers the high-level architecture of both projects, how Apex supports effectively-once processing, and more.

https://www.slideshare.net/Hadoop_Summit/low-latency-high-throughput-streaming-using-apache-apex-and-apache-kudu

Nextdoor has written about how they're using AWS Lambda for serverless ETL. Originally, it powered two use cases: streaming logs into Elasticsearch and writing data to S3. For these and other use cases, they've built and open-sourced a new tool called Bender. Bender is written in Java and is driven by JSON and YAML config files.

https://engblog.nextdoor.com/bender-ff65a6edee92

This post describes how a unified analytics platform, such as Databricks, can power multiple use cases and developer personas. Using the Amazon public product ratings dataset, it shows how both an analyst and a data scientist can build reports and machine learning prediction algorithms (respectively) using the notebook features. It also describes using the Databricks dbml-local library for serving of model data and using Spark for stream processing. There are a few Databricks-specific parts to the post (such as the notebook API and scheduling), but in general the post is broadly applicable.

https://databricks.com/blog/2017/10/05/build-complex-data-pipelines-with-unified-analytics-platform.html

Kinesis Analytics is able to run SQL queries over streams if the data is in a suitable format with a clear schema. Some data, such as Apache HTTPD access logs and certain compressed data files, isn't in that format and needs to be translated. This post shows how to use AWS Lambda to do that transformation, with an example in Node.js.

https://aws.amazon.com/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
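As a rough illustration of that kind of preprocessing (in Python rather than the post's Node.js), here is a minimal sketch that turns Apache common-log lines into JSON. It assumes the request/response shape AWS documents for preprocessing functions (a list of records, each with a recordId and base64-encoded data, answered with a result of Ok or ProcessingFailed). The regular expression and the field names in the JSON output are illustrative choices, not taken from the post.

```python
import base64
import json
import re

# Apache common log format: host ident user [time] "request" status size
# (the field names below are illustrative, not from the AWS post)
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def transform_record(record):
    """Convert one base64-encoded log line into a base64-encoded JSON document."""
    line = base64.b64decode(record["data"]).decode("utf-8").strip()
    match = LOG_PATTERN.match(line)
    if match is None:
        # Return the original payload and flag the record as failed.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "data": record["data"],
        }
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    payload = json.dumps(fields).encode("utf-8")
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "data": base64.b64encode(payload).decode("utf-8"),
    }


def lambda_handler(event, context):
    # Every incoming record must be answered, keyed by its recordId.
    return {"records": [transform_record(r) for r in event["records"]]}
```

Once the records are JSON with predictable keys, the stream has the clear schema that the SQL queries need.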

The Pinterest Engineering blog has a post on how they're using Kafka Streams for real-time predictive stream processing. First, the post describes the problem: overdelivery of online ads. Next, it walks through how they use Kafka Streams to measure in-flight spend, which provides a better heuristic for determining whether an ad should be shown to end users as part of an ad inventory system. The post also includes a few tricks they employed to increase the throughput and efficiency of their application.

https://medium.com/@Pinterest_Engineering/using-kafka-streams-api-for-predictive-budgeting-9f58d206c996
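The in-flight spend idea can be sketched in a few lines. This is a simplified illustration, not Pinterest's implementation: the AdBudget type, its fields, and the should_show rule are hypothetical names, and the real system computes in-flight spend with Kafka Streams rather than holding it in a local object.

```python
from dataclasses import dataclass


@dataclass
class AdBudget:
    daily_budget: float    # what the advertiser agreed to spend today
    recorded_spend: float  # spend that has already been billed
    inflight_spend: float  # estimated cost of ads inserted but not yet billed


def predicted_spend(budget: AdBudget) -> float:
    """Billed spend plus spend that is expected to land soon."""
    return budget.recorded_spend + budget.inflight_spend


def should_show(budget: AdBudget, cost_of_next_ad: float) -> bool:
    """Show the ad only if predicted spend stays within the daily budget."""
    return predicted_spend(budget) + cost_of_next_ad <= budget.daily_budget


budget = AdBudget(daily_budget=100.0, recorded_spend=90.0, inflight_spend=8.0)
# Looking only at recorded spend would allow a 5.0 ad (95.0 <= 100.0),
# but the in-flight spend pushes the prediction over budget.
naive_decision = budget.recorded_spend + 5.0 <= budget.daily_budget
predictive_decision = should_show(budget, 5.0)
```

The gap between the naive and predictive decisions is the overdelivery the post sets out to reduce.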

Another great post in the Jepsen series looks at Hazelcast. It tests the behavior of AtomicLong, distributed maps, ID Generators, and other data types. Due to design flaws in the system, these types all have correctness issues in the face of network partitions.

http://jepsen.io/analyses/hazelcast-3-8-3
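To make the failure mode concrete, here is a toy model of a counter that keeps accepting writes on both sides of a network partition. It is not Hazelcast code: the Replica class and the heal rule (keep one side's value and discard the other's) are assumptions chosen purely for illustration.

```python
class Replica:
    """One side's copy of a shared counter (toy model, not Hazelcast)."""

    def __init__(self, value=0):
        self.value = value

    def increment_and_get(self):
        self.value += 1
        return self.value


def heal(kept, discarded):
    # Assumed merge rule for illustration: one side's state wins outright.
    discarded.value = kept.value


# Both sides start from the same value, then the network partitions.
side_a = Replica(10)
side_b = Replica(10)

# Each side keeps acknowledging increments to its own clients.
acks_a = [side_a.increment_and_get() for _ in range(3)]  # [11, 12, 13]
acks_b = [side_b.increment_and_get() for _ in range(2)]  # [11, 12]

heal(kept=side_a, discarded=side_b)

acknowledged = len(acks_a) + len(acks_b)       # 5 increments were acknowledged
applied = side_a.value - 10                    # only 3 survive the heal
duplicates = sorted(set(acks_a) & set(acks_b)) # values handed out twice
```

Two acknowledged increments vanish, and the values 11 and 12 were returned to clients on both sides, which is also why an ID generator built on such a counter can hand out duplicate IDs.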

A presentation from the LA Spark User Group covers a lot of SparkML ground: how to use the various libraries, the SparklingML library (which adds new ML algorithms), and how to build your own estimator.

https://www.slideshare.net/sawjd/an-introduction-into-spark-ml-plus-how-to-go-beyond-when-you-get-stuck

News

Pivotal has a post celebrating the graduation of Apache MADlib to a top-level ASF project. The article covers the origins and journey of the in-database analytics library.

https://content.pivotal.io/blog/apache-madlib-comes-of-age

Releases

Version 4.1 of HUE was released. It includes over 250 bug fixes.

http://gethue.com/hue-4-1-is-out/

Mocked Streams is a library for testing Apache Kafka Streams. This week, version 1.4.0 was released.

https://github.com/jpzk/mockedstreams/releases/tag/v1.4.0

Apache Flume 1.8.0 was released. It includes a number of bug fixes as well as several minor new features and improvements.

https://lists.apache.org/thread.html/d8b91bd130f2fe8e74de6aac89994320c930c1e9b27b8e703bbf8286@%3Cannounce.apache.org%3E

Version 1.4.0 of Apache NiFi is out. Per the release notes, it adds stability and new features, including support for Apache Knox, LDAP-based authorization, and several new processors.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.4.0

Luigi 2.7.1 was released with updates for Redshift, BigQuery, ECS, and more.

https://github.com/spotify/luigi/releases/tag/2.7.1

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Autopiloting Real-Time Stream Processing (San Ramon) - Wednesday, October 11
https://www.meetup.com/datariders/events/243278903/

Introduction to Hadoop and MapReduce! (Sunnyvale) - Sunday, October 15
https://www.meetup.com/Women-Who-Code-Silicon-Valley/events/243689542/

Pennsylvania

Kafka for Java Developers & Java Puzzlers (Pittsburgh) - Thursday, October 12
https://www.meetup.com/The-Pittsburgh-Java-Meetup-Group/events/242379671/

New York

Streaming Data at Scale (New York) - Wednesday, October 11
https://www.meetup.com/NYC-Data-Engineering/events/243487860/

Massachusetts

Open Blueprint for In-Stream Processing (Boston) - Thursday, October 12
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/243677713/

IRELAND

Apache Kafka with Confluent's Tim Berglund (Dublin) - Thursday, October 12
https://www.meetup.com/TechMeetupspace/events/242376704/

UNITED KINGDOM

Kafka Streams API (Hinxton) - Tuesday, October 10
https://www.meetup.com/Genome-Campus-Software-Craftsmanship-Community/events/243518985/

Processing Streaming Data with KSQL & Full-Text Search and Machine Learning (London) - Tuesday, October 10
https://www.meetup.com/Couchbase-London/events/243773774/

PORTUGAL

Inaugurational Meetup! (Porto) - Thursday, October 12
https://www.meetup.com/Porto-Big-Data/events/240546837/

FRANCE

Hadoop v3 and NiFi (Villeurbanne) - Wednesday, October 11
https://www.meetup.com/Meetup-Big-Data-Lyon/events/241790897/

NETHERLANDS

October Kafka Meetup (Utrecht) - Monday, October 9
https://www.meetup.com/Kafka-Meetup-Utrecht/events/243295992/

GERMANY

Migrating Towards Stream Processing and Microservices (Berlin) - Wednesday, October 11
https://www.meetup.com/jug-bb/events/243589911/

SWITZERLAND

The Elastic Stack in the Logistics Industry and as an Alternative to Hadoop (Lausanne) - Wednesday, October 11
https://www.meetup.com/Swisscom-Digital-Lab/events/243769733/

POLAND

Sparkathon: Developing Extensions for Spark Structured Streaming in Scala (Warsaw) - Tuesday, October 10
https://www.meetup.com/WarsawScala/events/243704750/

CHINA

Spark & Flink Meetup 5 (Hangzhou) - Saturday, October 14
https://www.meetup.com/Hangzhou-Apache-Spark-Meetup/events/243514028/

AUSTRALIA

Google Cloud Fundamentals: Big Data & Machine Learning (Brisbane) - Tuesday, October 10
https://www.meetup.com/Brisbane-Artificial-Intelligence/events/243708462/

NEW ZEALAND

Deep Dive with Azure Data Factory (Wellington) - Thursday, October 12
https://www.meetup.com/Wellington-Data-Management-and-Analytics-Meetup/events/242434389/