Data Eng Weekly


Data Eng Weekly Issue #255

11 March 2018

The Strata Data Conference was this week, and there's coverage of a few enterprise releases made there. It was also a popular week for open source releases, with Apache Kafka, Apache Flink, and several other projects announcing new versions. There are also several tutorials, a fantastic article on the data engineering space, and great technical deep dives on Kafka at Cloudflare, Instagram's RocksDB storage layer for Apache Cassandra, and Jepsen testing for Aerospike.

Sponsor

At Foursquare, we understand where millions of phones go every day. Our tech and map are changing the landscape of social, travel, and mobile. We’re hiring data engineers, platform engineers, tech leads, full stack web engineers, +++. Join us!

https://bitly.com/foursquare-jobs-data-eng-weekly

Technical

Cloudflare has a great post on their work to optimize a Kafka cluster by enabling compression. The post describes the motivation, the work they did to get the Go client to work efficiently, and their experience with several compression libraries. In the end, they were able to decrease network and storage usage by 4.5x.

https://blog.cloudflare.com/squeezing-the-firehose/
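
A rough illustration of why batch-level compression can pay off on log-like data. This is a minimal sketch using Python's standard zlib module, not Cloudflare's setup or a Kafka client; the message shape and field names are made up for the example.

```python
import json
import zlib


def make_messages(n):
    """Build n hypothetical JSON log events that share field names."""
    return [
        json.dumps(
            {"host": f"edge-{i % 8}", "status": 200, "path": "/index.html", "bytes": 512 + i}
        ).encode()
        for i in range(n)
    ]


def per_message_size(messages):
    """Total size when every message is compressed on its own."""
    return sum(len(zlib.compress(m)) for m in messages)


def batched_size(messages):
    """Size when the whole batch is compressed as a single block."""
    return len(zlib.compress(b"\n".join(messages)))


messages = make_messages(1000)
raw = sum(len(m) for m in messages)
single = per_message_size(messages)
batched = batched_size(messages)

print(f"raw={raw} bytes, per-message={single} bytes, batched={batched} bytes")
```

Compressing the batch as one block lets the compressor reuse the repeated field names across messages, which is why batch size tends to matter as much as the choice of codec.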

This article provides a fantastic survey of the data engineering space, from tools (such as Hadoop and friends) to the roles and responsibilities of a data engineer, online databases and the CAP theorem, and predictions for the future. There's also a good list of common terminology (such as data mart, data lake, OLAP/OLTP) and definitions.

https://medium.com/@richard534/getting-started-with-data-engineering-3d2e728d0c1f

Starting with the high-level concept of event sourcing, this post goes into a few architectural options and then describes how to implement event sourcing with Kafka Streams. There are some good tips, like how to configure standby replicas and how to implement various types of delivery guarantees with Kafka.

https://blog.softwaremill.com/event-sourcing-using-kafka-53dfd72ad45d
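
For readers new to the idea, the core of event sourcing is that current state is derived by replaying an ordered log of events. The following is a minimal in-memory sketch of that idea only; it does not use Kafka or the Kafka Streams API, and the event types and account example are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event types)
    amount: int


def apply(balance, event):
    """Return the new state after applying one event to the old state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial=0):
    """Rebuild state by folding the event log from the beginning."""
    return reduce(apply, events, initial)


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
balance = replay(log)
print(balance)
```

In a Kafka-based design the list would be a topic and the fold would typically run as a stream aggregation, but the state-from-events relationship is the same.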

An overview of getting Apache Spark and all of its dependencies (e.g. Java 8) installed on a Mac.

https://medium.com/luckspark/installing-spark-2-3-0-on-macos-high-sierra-276a127b8b85

This post shows how to package Python code and deploy/run it via an Oozie workflow.

http://www.adaltas.com/en/2018/03/06/execute-python-in-an-oozie-workflow/

Instagram has been working on implementing a RocksDB storage layer for Apache Cassandra. Compared to the Java implementation, the RocksDB engine reduces the amount of intermediate/garbage data that Cassandra generates. This, in turn, leads to much better tail latency due to less garbage collection. The post has some good insight into challenges in the implementation, which has been open sourced.

https://engineering.instagram.com/open-sourcing-a-10x-reduction-in-apache-cassandra-tail-latency-d64f86b43589

For simple ETL, real-time aggregation, event routing, and similar use cases, Apache Pulsar is adding Pulsar Functions. Inspired by AWS Lambda and Google Cloud Functions, Pulsar Functions use a simple API and the Pulsar cluster for deployment. The post covers the design goals, deployment mechanism, runtime guarantees, and more.

https://streaml.io/blog/pulsar-functions/

The Kubernetes blog has a look at Apache Spark 2.3's Kubernetes integration. It has some details on the implementation, how to get started, and what some of the plans are for the future of the integration.

http://blog.kubernetes.io/2018/03/apache-spark-23-with-native-kubernetes.html

This post has some good tips for working with EMR and Spark, like how and why to use EMRFS and advice for sizing a cluster.

https://medium.com/@sdia/learnings-about-aws-elastic-map-reduce-and-spark-a1169edf348f

A short walkthrough of taking Apache Spark 2.3's new Kubernetes support for a spin. This post has a few additional details, like how to build a Docker image with your custom code, that aren't found in the Kubernetes post above.

https://medium.com/@timfpark/cloud-native-big-data-jobs-with-spark-2-3-and-kubernetes-938b04d0da57

There's a new Jepsen post out on Aerospike. For those who aren't familiar, Jepsen is a framework for verifying correctness of distributed systems in the face of failure. The post goes into the details of the Aerospike architecture (such as its gossip and replication system), performance in the face of network partitions, node failure, and clock skew, and makes some recommendations about how to best configure Aerospike.

https://jepsen.io/analyses/aerospike-3-99-0-3

BoulderDB is a custom distributed database built with RocksDB. MakeMyTrip uses it with Spark Streaming (for data ingestion) and Akka (for serving data out). The post goes into the details of how they've implemented scaling and clustering, how they make use of the lambda architecture, and the scalability of the system.

https://medium.com/makemytrip-engineering/boulderdb-makemytrips-personalization-user-data-store-acc8e00f1433

News

Big Data Day LA 2018 has opened up the call for speakers. The event takes place in August, and the speaker submissions are open through June 15th.

https://www.bigdatadayla.com/#speakers

The Technologist’s Hippocratic Oath is "an optional oath for building ethically considered experiences." If you want to avoid ethically murky areas, the oath is full of good lines.

https://builttoadapt.io/technologists-hippocratic-oath-94b88d3fe480

Releases

This week, Cloudera released a new version of their cloud service Altus, MapR has announced new Kubernetes support via the Kubernetes Volume Driver, and AtScale has announced a new version of BI Platform. ZDNet has more coverage of these announcements.

http://www.zdnet.com/article/cloudera-mapr-atscale-announce-new-releases-at-strata/

Apache Kylin, the OLAP system for big data, has released version 2.3.0. It includes over 260 resolved issues. New features include support for Redshift & SQL Server, and a new metric framework.

https://lists.apache.org/thread.html/0446024b4695cf73c33df558034703ce2f970a5288013a27ea542f21@%3Cannounce.apache.org%3E

Version 0.5.0 of the Apache Hivemall (incubating) project has been released. Hivemall provides UDFs for machine learning on Hive/Spark/Pig.

https://lists.apache.org/thread.html/18e6355dd06ff065004516eec6ff861777b4ad526714b61667b4904b@%3Cannounce.apache.org%3E

Kafka Security Manager is a new project for managing Kafka ACLs via an external source of truth, like a configuration file. It also provides notifications for integration with tools like Slack.

https://github.com/simplesteph/kafka-security-manager
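
One way to picture the "external source of truth" approach is as a reconciliation loop: compare the ACLs declared in the source with the ACLs currently in the cluster, then compute what to add and what to remove. The sketch below is an assumption about the general pattern, not Kafka Security Manager's actual code or file format; the ACL tuples are hypothetical.

```python
def reconcile(desired, current):
    """Return (to_add, to_remove) so that current converges on desired."""
    to_add = desired - current
    to_remove = current - desired
    return to_add, to_remove


# Hypothetical ACLs as (principal, resource, operation) tuples.
desired = {
    ("User:alice", "Topic:orders", "Read"),
    ("User:bob", "Topic:orders", "Write"),
}
current = {
    ("User:alice", "Topic:orders", "Read"),
    ("User:carol", "Topic:orders", "Read"),
}

to_add, to_remove = reconcile(desired, current)
print("add:", sorted(to_add))
print("remove:", sorted(to_remove))
```

The differences are also a natural place to hang notifications, since each added or removed ACL is an explicit change event.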

Apache Kafka 1.0.1 is out with 49 fixed issues since the 1.0.0 release.

https://lists.apache.org/thread.html/539a763352ed6e5ca76eae167eeb3c276a49799e00601eaf82b4c9df@%3Cannounce.apache.org%3E

StreamSets Data Protector is a new tool that can be used to obfuscate or remove sensitive (e.g. PII) data before ingestion.

https://streamsets.com/blog/data-protector/

Hortonworks announced the release of Cloudbreak 2.4, their system for running HDP in the cloud. New features include a new CLI tool, support for configuring Kerberos, and support for custom images.

https://hortonworks.com/blog/announcing-cloudbreak-2-4/

Databricks has added an exciting new feature to make it easier to deploy machine learning models from Apache Spark—the ability to export models for scoring and predictions in non-Spark systems.

https://databricks.com/blog/2018/03/07/announcing-machine-learning-model-export-in-databricks.html

Confluent Platform 4.1 is out. The major new feature is the general availability of KSQL, which also had a 0.5 release. Since 0.4, the KSQL team has been focussed on quality and stability improvements.

https://www.confluent.io/press-release/confluent-makes-ksql-available-confluent-platform-announces-general-availability/
https://www.confluent.io/blog/ksql-february-release-streaming-sql-for-apache-kafka/

Apache Flink 1.4.2 is out with a bunch of bug fixes and improvements.

http://flink.apache.org/news/2018/03/08/release-1.4.2.html

BABAR is a new tool from Criteo for profiling YARN applications. Using an agent, it collects system and JVM-level metrics. There is a processor to output a number of different graphs, including a flame graph of JVM-level function execution.

https://github.com/criteo/babar

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Colorado

Introduction to Spark (Denver) - Wednesday, March 14
https://www.meetup.com/Denver-All-Things-Data/events/247286746/

Texas

Building a Streaming Data Platform at HomeAway (Austin) - Tuesday, March 13
https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/248450574/

New Jersey

Spark Structured Streaming: Hands-On Session, Part 1 (Hamilton) - Thursday, March 15
https://www.meetup.com/nj-dapp/events/247824370/

New York

DAG and the Third Generation of Big Data Stream Processing (New York) - Tuesday, March 13
https://www.meetup.com/mysqlnyc/events/248075546/

UNITED KINGDOM

Big Data and Machine Learning (London) - Tuesday, March 13
https://www.meetup.com/Big-Data-and-Machine-Learning-London/events/245884428/

Join Us for Our First Kafka Meetup in Leeds (Leeds) - Wednesday, March 14
https://www.meetup.com/Leeds-Kafka/events/248239199/

Building a Real-Time Complex Event Processing Platform with Apache Flink (Manchester) - Wednesday, March 14
https://www.meetup.com/HadoopManchester/events/248298343/

Streamy Wednesday: Eventsourcing from Back to Front(end) (London) - Wednesday, March 14
https://www.meetup.com/MiniHN/events/247378262/

SWEDEN

Presto: SQL-on-Anything (Stockholm) - Tuesday, March 13
https://www.meetup.com/stockholm-hug/events/247764427/

Experience at Ooyala and Klarna: Apache Kafka (Stockholm) - Thursday, March 15
https://www.meetup.com/Stockholm-Apache-Kafka-Meetup-by-Confluent/events/247968822/

SPAIN

Pipeline ETL + Kubernetes with Google Cloud (Madrid) - Wednesday, March 14
https://www.meetup.com/Innovative-technology-BEEVA/events/248437226/

Integrating Apache Flink with Real-Time NoSQL (Madrid) - Thursday, March 15
https://www.meetup.com/Meetup-de-Apache-Flink-en-Madrid/events/247957538/

FRANCE

Stream Processing: Apache Flink, Kafka Streams (Toulouse) - Friday, March 16
https://www.meetup.com/MonkeyTechDays/events/246514263/

GERMANY

When Not to Use Apache Spark? Data Pipelines with vert.x + RxJava (Berlin) - Tuesday, March 13
https://www.meetup.com/CTO-Roundtable-Berlin/events/247799267/

4 Talks about Distributed Databases (Munich) - Thursday, March 15
https://www.meetup.com/data-engineering-munich/events/247523799/

ROMANIA

GDPR and Big Data + Spark on Azure Demo (Bucharest) - Tuesday, March 13
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/247832827/

AUSTRALIA

Real-Time Sentiment Analysis with NiFi and Zeppelin (Sydney) - Tuesday, March 13
https://www.meetup.com/microsoft-data-wranglers/events/247536666/