Data Eng Weekly Issue #274

22 July 2018

Lots of stream processing coverage this week—Apache Kafka, Wallaroo, Apache Samza, WSO2, and Amazon SQS. There are also a couple of posts on Kubernetes, a presentation with database monitoring best practices, and a look at a distributed configuration system at Facebook. In news, there are two new books that may be of interest and a proposed data ethics checklist.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2t96fa7 or visit http://dremio.com to learn more.

Technical

The MapR blog has a tutorial describing how to get started with Apache Drill using several different OSes. It includes an example use case of joining JSON data with tables in MySQL and covers debugging some common problems.

https://mapr.com/blog/how-to-combine-relational-and-nosql-datasets-with-apache-drill/

This presentation describes the main signals (Concurrency, Error Rate, Latency, and Throughput) that are important to measure when monitoring a database system. It describes how they relate to quality of service, several different ways to track these metrics, and common problems across a few different databases.

https://www.xaprb.com/slides/how-to-monitor-your-database/
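
As a rough illustration of how the four signals relate, the sketch below derives them from a list of completed queries in one observation window. The record layout (`start`, `end`, `ok`) and the `summarize` helper are assumptions made for this example and are not taken from the presentation. Concurrency is computed as total query time divided by the window length, which by Little's law equals throughput times average latency.

```python
from dataclasses import dataclass


@dataclass
class Query:
    start: float  # seconds since the start of the window (assumed layout)
    end: float    # seconds since the start of the window
    ok: bool      # False if the query returned an error


def summarize(queries, window_seconds):
    """Compute the four signals for one window (hypothetical helper).

    Assumes every query starts and ends inside the window.
    """
    if not queries or window_seconds <= 0:
        raise ValueError("need at least one query and a positive window")
    latencies = [q.end - q.start for q in queries]
    busy_time = sum(latencies)
    return {
        # Completed queries per second.
        "throughput": len(queries) / window_seconds,
        # Fraction of queries that failed.
        "error_rate": sum(1 for q in queries if not q.ok) / len(queries),
        # Average time per query, in seconds.
        "latency": busy_time / len(queries),
        # Average number of queries in flight at any instant.
        "concurrency": busy_time / window_seconds,
    }
```

In practice a percentile of latency is usually more informative than the average used here; the average is kept only so that the relationship between the signals stays visible.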

This post provides a good overview of the types of challenges there are with running a stateful service on Kubernetes. In this case, they've designed a solution for resilience of the Spark application driver in the face of network partitioning. The recovery process is a bit tricky, and there's an interesting discussion on the referenced PR about this and other possible designs to solve for resilience.

https://banzaicloud.com/blog/spark-resiliency/

Apache Avro is a common serialization format for storing data in Apache Kafka. The Cloudera blog has a post that describes why Avro is a common format and how to use Avro with the Kafka APIs.

http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-1/

Facebook was hitting scalability limits when using ZooKeeper for storing dynamic configuration. This post describes those challenges and how the system they built to replace it, Location-Aware Distribution, solves them.

https://code.fb.com/data-infrastructure/location-aware-distribution-configuring-servers-at-scale/

This post describes a new round-robin channel selector for Apache Flume that helps it scale to almost 10x the throughput of the out-of-the-box batching configuration.

https://medium.com/data-collective/scaling-a-flume-agent-to-handle-120k-events-sec-11f70a428ca2
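
The idea behind round-robin selection can be sketched in a few lines. This is a Python model of the concept only: Flume itself is written in Java, and the class and method names below are hypothetical, not Flume's channel selector interface or the code from the post.

```python
from itertools import cycle


class RoundRobinSelector:
    """Pick the next channel in rotation for each event (hypothetical class)."""

    def __init__(self, channels):
        if not channels:
            raise ValueError("at least one channel is required")
        self._channels = list(channels)
        # Endless iterator over channel indexes: 0, 1, ..., n-1, 0, 1, ...
        self._indexes = cycle(range(len(self._channels)))

    def select(self, event):
        # The event's content is ignored; only arrival order matters.
        return self._channels[next(self._indexes)]


# Usage: spread seven events across three in-memory "channels".
channels = [[], [], []]
selector = RoundRobinSelector(channels)
for event in range(7):
    selector.select(event).append(event)
```

Because each event goes to exactly one channel, the channels (and the sinks draining them) share the load instead of each receiving a full copy of the stream.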

This tutorial builds a stream processing application in Python using the Wallaroo streaming engine. The application is an e-commerce / marketing system that tracks several types of events and triggers a personalized

https://blog.wallaroolabs.com/2018/07/event-triggered-customer-segmentation/

Sorting data before you serialize it as Apache Parquet can make a big difference in query performance. This post explains why and provides some ideas for how to figure out which columns to sort on.

https://medium.com/@pankajroark/sorting-and-parquet-3a382893cde5
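
One way to see the effect is to model the per-row-group minimum and maximum statistics that a reader can use to skip data. The sketch below is a pure-Python simulation under that assumption, not a use of any Parquet library; the helper names and the group size are made up for the example.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into fixed-size groups and return each group's (min, max)."""
    groups = [
        values[i:i + rows_per_group]
        for i in range(0, len(values), rows_per_group)
    ]
    return [(min(g), max(g)) for g in groups]


def groups_to_scan(stats, target):
    """Count the groups whose [min, max] range could contain target."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


# Usage: the same 10,000 values, written unsorted and then sorted.
rng = random.Random(0)
values = [rng.randrange(1000) for _ in range(10_000)]
unsorted_scans = groups_to_scan(row_group_stats(values, 1000), 500)
sorted_scans = groups_to_scan(row_group_stats(sorted(values), 1000), 500)
```

With unsorted data nearly every group spans the full value range, so a point lookup has to read almost all of them. After sorting, the ranges barely overlap and the lookup touches only the group or two that can hold the value.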

In this post, MoEngage writes about how they optimized cost and latency in their SQS pipeline by batching data. They used some interesting techniques to pack an optimal number of messages into each batch while keeping latency low.

https://medium.com/@moengage/how-we-managed-to-reduce-the-sqs-costs-by-90-d8a99b01f368
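
The general trade-off between batch size and latency can be sketched as a small accumulator that flushes on count, total bytes, or age. This is not MoEngage's implementation: the `Batcher` class is hypothetical, and the default limits (10 messages, 256 KB, 0.5 seconds) are assumptions chosen for illustration that should be checked against the current SQS documentation.

```python
import time


class Batcher:
    """Accumulate messages; flush on count, size, or age (hypothetical helper)."""

    def __init__(self, max_count=10, max_bytes=256 * 1024, max_wait=0.5,
                 clock=time.monotonic):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.max_wait = max_wait      # seconds the oldest message may wait
        self._clock = clock           # injectable so tests can fake time
        self._batch = []
        self._bytes = 0
        self._oldest = None

    def add(self, body):
        """Add one message body (str) and return the batches that became ready."""
        size = len(body.encode("utf-8"))
        if size > self.max_bytes:
            raise ValueError("message larger than max_bytes")
        ready = []
        # Flush first if this message would push the batch over the byte limit.
        if self._batch and self._bytes + size > self.max_bytes:
            ready.append(self.flush())
        if not self._batch:
            self._oldest = self._clock()
        self._batch.append(body)
        self._bytes += size
        # Flush if the batch is full or its oldest message has waited too long.
        # Age is only checked when a message arrives; a real system would
        # also need a timer to flush an idle batch.
        if (len(self._batch) >= self.max_count
                or self._clock() - self._oldest >= self.max_wait):
            ready.append(self.flush())
        return ready

    def flush(self):
        """Return the pending batch (possibly empty) and reset the state."""
        batch, self._batch = self._batch, []
        self._bytes = 0
        self._oldest = None
        return batch
```

Each returned batch would then be sent in a single request, so the number of billable requests falls as batches fill, while `max_wait` caps how long any message is held back.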

New York City subway data is available via a RESTful API as a GTFS Realtime feed. This tutorial builds a system to load that data into Apache Kafka and process arrival data in real time. It's written in Python, and the example code is available on GitHub.

https://medium.com/@leihetito/building-a-real-time-nyc-subway-tracker-with-apache-kafka-40d4e09bfe98

Apache Beam now has an Apache Samza runner for executing applications. According to the compatibility matrix, it has pretty good support for the Beam Model (on par with Apache Apex and Apache Spark). This presentation provides more details about the implementation.

https://www.slideshare.net/XinyuLiu11/beam-me-up-samza

This post has an introduction to the WSO2 stream processing framework, including two example programs. The examples (loading data from JMS into MySQL, and extracting data from a database and loading it into Kafka) demonstrate the Siddhi Application DSL and the graphical view that comes with WSO2.

https://medium.com/@suhothayan/perform-realtime-and-periodic-etl-with-wso2-stream-processor-358680689935

Sponsor

Unravel demoed a new, fully automated Spark optimization tool at Spark Summit in San Francisco. They showed how to speed up or improve reliability of any Spark application with a single click. See the demo video or download the slides here.

http://bit.ly/unravel-spark-optimization

News

BlueData has announced a new initiative, called BlueK8s, to bring stateful distributed systems (such as Hadoop, Spark, and Kafka) to Kubernetes. There's a pre-alpha implementation via an open source project called KubeDirector.

https://www.bluedata.com/blog/2018/07/operation-stateful-bluek8s-and-kubernetes-director/

This article introduces a new "checklist for people who are working on data projects." Mostly phrased as questions to consider as part of your project, it covers topics across the data science and data engineering spectrum.

https://www.oreilly.com/ideas/of-oaths-and-checklists

The schedule for Flink Forward Berlin, which takes place in early September, has been published.

https://data-artisans.com/blog/data-artisans-unveils-flink-forward-berlin-program-agenda-50-sessions-with-speakers-from-airbnb-ing-lyft-microsoft-netflix-and-uber

"Streaming Systems" is a new book from O'Reilly. It covers topics like watermarks, exactly-once, streaming joins, and streaming SQL. The book is available for download now and will be available in print in a few weeks.

http://streamingbook.net/

"Next-Generation Big Data" is another new book—it covers using Apache Kudu, Apache Impala, and Apache Spark across topics like real-time processing, data warehousing, and governance.

https://www.apress.com/us/book/9781484231463

Accumulo Summit is October 15th in Columbia, Maryland. Early bird pricing is through September 1st.

http://accumulosummit.com/

There's a new podcast from Confluent on Apache Kafka called "Streaming Audio."

https://itunes.apple.com/us/podcast/streaming-audio-kafka-confluent-cloud-tim-berglund/id1401509765

Jobs

The Data Eng Weekly board has jobs from Etsy (Senior Data Engineer - New York), Netflix (Senior Data Engineer - Los Gatos, CA), Wooga (Data Engineer - Berlin), and Shopify (Software Eng Data Infrastructure - Ottawa/Waterloo/Montreal).

https://jobs.dataengweekly.com/

Releases

Apache Kafka 1.0.2 has been released with over 20 bug fixes and improvements.

https://lists.apache.org/thread.html/d9e8841f0ba54418f97f012ab9106e8ed048926abf355773e91c5c2c@%3Cannounce.apache.org%3E

Databricks Runtime 4.2 includes new features, improvements, and performance upgrades to Databricks Delta (which is getting closer to general availability). It also includes new features in Structured Streaming and a new SQL Deny command for access control.

https://databricks.com/blog/2018/07/18/announcing-databricks-runtime-4-2.html

Apache NiFi 1.7.1 was released. It includes a number of fixes, including for an infinite loop and for issues related to wildcard certificates.

https://lists.apache.org/thread.html/92a877139d40dc14e2677eb32232d83de8f27bfb2a1ef15f36f4e892@%3Cannounce.apache.org%3E

Musoq is a tool for querying lots of different kinds of data, including JSON, Zip, file folders, and more. It's built with .NET, so it might require a little work to make it run on Mac or Linux.

https://github.com/Puchaczov/Musoq

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

KSQL & Event-Powered Kafka Rails Services (San Francisco) - Tuesday, July 24
https://www.meetup.com/KafkaBayArea/events/252167297/

Texas

Real-Time Data Analytics with Apache Spark Streaming (Austin) - Wednesday, July 25
https://www.meetup.com/Austin-ACM-SIGKDD/events/251065591/

NETHERLANDS

Streaming Microservices at Yolt (Amsterdam) - Tuesday, July 24
https://www.meetup.com/Hands-On-Big-Data-Architecture/events/251432058/

Moving from Legacy to Event-Driven with Kafka (Nijmegen) - Wednesday, July 25
https://www.meetup.com/NMGNtech/events/252596432/

GERMANY

Kafka, KSQL & IoT with Crate and Confluent (Berlin) - Tuesday, July 24
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/252154551/

Comsysto Reply Insights from Spark (Munich) - Wednesday, July 25
https://www.meetup.com/Hadoop-User-Group-Munich/events/252354743/

Apache Nifi & Apache Airflow (Unterfohring) - Thursday, July 26
https://www.meetup.com/data-engineering-munich/events/252170998/

ITALY

Towards Writing Better Scalable Spark Applications & Hands-on HBase (Bologna) - Tuesday, July 24
https://www.meetup.com/Bologna-Big-Data-Meetup/events/252021556/

POLAND

Stream Processing and Building Streaming Data Pipelines with Apache Kafka & KSQL (Krakow) - Wednesday, July 25
https://www.meetup.com/Krakow-Kafka/events/252830549/

HUNGARY

Ephemeral Workloads to Support BI Use Cases in Hadoop (Budapest) - Wednesday, July 25
https://www.meetup.com/futureofdata-budapest/events/252828373/

ISRAEL

Apache Kafka Streams Workshop (Tel Aviv) - Monday, July 23
https://www.meetup.com/ApacheKafkaTLV/events/252798143/

Spark SQL and DataFrames Using Azure Databricks (Tel Aviv) - Tuesday, July 24
https://www.meetup.com/AzureIsrael/events/251616457/

Using Chaos Engineering to Level Up Apache Kafka Skills (Tel Aviv-Yafo) - Wednesday, July 25
https://www.meetup.com/Big-things-are-happening-here/events/252510802/

SINGAPORE

BigDataX Presents: Spark 101 (Singapore) - Thursday, July 26
https://www.meetup.com/BigDataX/events/252904590/

NEW ZEALAND

Kafka Streams API by Antony Stubbs from Confluent + More Kafka from Jon Court (Auckland) - Wednesday, July 25
https://www.meetup.com/Auckland-Kafka/events/252500562/