Data Eng Weekly


Data Eng Weekly Issue #268

10 June 2018

Lots of interesting stories this week, including Branch's migration from Aerospike to DynamoDB, Freshworks' data stack, observability tools at Netflix, and using KSQL to analyze data captured by a software-defined radio. There are also great looks at InfluxDB, TiDB, Microsoft's ServiceFabric, new features in Apache Pulsar 2.0, and Apache Beam.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2rHK6iw, or visit dremio.com to learn more.

Technical

A thorough comparison of Apache Ranger and Apache Sentry, which both offer authorization and other security features for Hadoop and friends. This post covers how they fit into the wider ecosystem (like Apache Hive and Apache Kafka).

https://www.linkedin.com/pulse/apache-ranger-vs-sentry-mythily-rajavelu/

Branch recently did a large-scale migration of a real-time system from Aerospike to DynamoDB. In this post, they write about the process, the tests that they did before switching over, experiences with DynamoDB autoscaling, and more.

https://medium.com/branch-engineering/from-zero-to-40-billion-links-our-journey-migrating-from-aerospike-to-dynamodb-4ce598926382

The Freshworks blog has the second post in a series on their data lake, which is built on the Hadoop stack, Kafka, Solr, and AWS. There's good coverage of their ingestion pipeline, log analytics, security, and BI tools.

https://blog.freshworks.com/data-lake-freshworks-part-2/

The Coinograph team has written about their experience using InfluxDB to store and query time series data about cryptocurrency trades. They have a good walkthrough of loading in data and running some basic aggregate functions from the Influx CLI.

https://medium.com/coinograph/storing-and-processing-billions-of-cryptocurrency-market-data-using-influxdb-f9f670b50bbd

Here's a great article that combines Apache Kafka and DIY IoT to capture air traffic data from a software-defined radio, analyze it using KSQL, and visualize the results in Kibana.

https://medium.com/@simon.aubury/using-ksql-apache-kafka-a-raspberry-pi-and-a-software-defined-radio-to-find-the-plane-that-wakes-14f6f9e74584

Netflix has written about their observability tools, including how they do real-time analysis to keep relevant slices of application logs, instrument services for request tracing, and use various data stores depending on query patterns.

https://medium.com/netflix-techblog/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17

It seems that GraphQL could be a good partner for a lot of data engineering tasks, especially with its ability to perform batch operations. If you haven't looked at it yet, here's one of the best overviews that I've seen.

https://brandur.org/graphql

The Microsoft Azure team has published a paper about ServiceFabric, their platform for building microservices. It has a lot of interesting features, including strong consistency and support for reliable collections (e.g. distributed dictionaries and queues). As usual, The Morning Paper has a great overview of the paper, and there's a link to get access to the full text. It covers lots of great distributed systems topics, and I can see this quickly becoming an influential system design and paper.

https://blog.acolyer.org/2018/06/05/servicefabric-a-distributed-platform-for-building-microservices-in-the-cloud/

Good overview of the Apache Beam APIs for building an inverted index (and testing the Beam job). It's full of tips for those making the transition from Apache Spark.

https://medium.com/@davide.anastasia/getting-started-with-apache-beam-26bfc5126438
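
For readers new to the term, an inverted index maps each word to the documents that contain it. Below is a minimal plain-Python sketch of that idea. It is not the Beam code from the post; the function name build_inverted_index and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents, keyed by file name.
docs = {
    "a.txt": "beam runs pipelines",
    "b.txt": "spark runs jobs",
}
index = build_inverted_index(docs)
print(index["runs"])
```

In a Beam or Spark job the same shape typically shows up as a map step that emits (word, document) pairs followed by a group-by-key step.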

The Streamlio blog has posts on two new features of Apache Pulsar 2.0—a schema registry and topic compaction. The Apache Kafka ecosystem has had schema registry implementations for Apache Avro data for some time. Pulsar has chosen a slightly different approach by focussing on JSON and binary data to start (with other formats in the future).

https://streaml.io/blog/pulsar-schema-registry/
https://streaml.io/blog/pulsar-topic-compaction/
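
As a rough illustration of what topic compaction means in general (keep only the most recent message for each key), here is a small Python sketch. It models a topic as an in-memory list of (key, value) pairs and says nothing about how Pulsar actually implements the feature; the function name compact and the sample data are hypothetical.

```python
def compact(log):
    """Keep only the latest value per key, ordered by the offset of that latest message."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later message for the same key overwrites the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


# Hypothetical topic contents: two updates for "a", one for "b".
log = [("a", 1), ("b", 2), ("a", 3)]
compacted = compact(log)
print(compacted)
```

A consumer that only needs the current state per key can read the compacted view instead of replaying every historical update.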

Great description of how Kafka Streams divides up work based on the number of topic partitions as well as sub-topologies. The post also describes a few solutions to bottlenecks caused when one particular server becomes overloaded.

https://medium.com/@andy.bryant/kafka-streams-work-allocation-4f31c24753cc
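
To make the partition and sub-topology relationship concrete, here is a small Python sketch. It assumes the commonly documented rule that each sub-topology gets one task per partition, using the largest partition count among its source topics, with task ids of the form <sub-topology>_<partition>. The round-robin spread across threads is a deliberate simplification (the real assignor is more involved), and the topic and thread names are invented.

```python
def plan_tasks(sub_topologies):
    """Return task ids, one per partition of each sub-topology.

    sub_topologies is a list of dicts mapping source topic name to
    partition count; the list position is used as the sub-topology id.
    """
    tasks = []
    for sub_id, partition_counts in enumerate(sub_topologies):
        # The task count follows the source topic with the most partitions.
        for partition in range(max(partition_counts.values())):
            tasks.append(f"{sub_id}_{partition}")
    return tasks


def assign_round_robin(tasks, threads):
    """Spread tasks over threads in order (a simplification, see lead-in)."""
    assignment = {thread: [] for thread in threads}
    for i, task in enumerate(tasks):
        assignment[threads[i % len(threads)]].append(task)
    return assignment


# Hypothetical application with two sub-topologies.
sub_topologies = [{"orders": 4, "customers": 2}, {"repartitioned": 3}]
tasks = plan_tasks(sub_topologies)
assignment = assign_round_robin(tasks, ["thread-1", "thread-2"])
print(len(tasks), assignment)
```

Since the number of tasks is fixed by the partition counts, adding threads or instances beyond that number leaves the extras idle.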

TiDB is a hybrid transactional/analytical processing (HTAP) database that's compatible with MySQL and has built-in Spark support. This tutorial walks through getting it running locally with Docker Compose, loading in some sample data, and querying it both with the MySQL CLI client and Spark.

https://www.pingcap.com/blog/how_to_spin_up_an_htap_database_in_5_minutes_with_tidb_tispark/

Even as many services are cropping up to improve observability, logging is often the first (or best) strategy with a system like Hadoop YARN. Speaking from experience, wrangling logging configs can be time-consuming and frustrating. This post has lots of examples and explanations that should make it much easier to fine-tune YARN logging.

https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01

Data Eng Jobs

Data engineering jobs in Barcelona, Philadelphia, and remote. Check them out or add your own!

https://jobs.dataengweekly.com

News

Datanami has coverage of Project Hydrogen, which is a new effort to make Apache Spark work better with deep learning frameworks. Announced at Spark + AI Summit this week in San Francisco, it aims to solve some of the problems deep learning frameworks have faced in integrating into Spark's runtime.

https://www.datanami.com/2018/06/05/project-hydrogen-unites-apache-spark-with-dl-frameworks/

BlueData announced that their EPIC system for deploying Hadoop with containers has been certified through the Hortonworks Quality Assured Testing Suite. Since there hasn't been much of a story for vendor-supported Hadoop in containers, this seems like it can open up some interesting possibilities for companies that have already gone all in on Docker.

https://www.bluedata.com/blog/2018/06/hortonworks-certification-hdp-on-docker-containers/

Releases

MLflow is a new platform that aims to make it easier to track ML experiments, improve reproducibility of results, and make deployment easier and less risky. The Databricks blog has more (including some code snippets), and the code is up on GitHub.

https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html

Apache Atlas, which provides governance services for Hadoop, released version 1.0.0.

https://lists.apache.org/thread.html/19e8027c778373874bf27a94a8ec5f271163ab6fceddce8b0b53805b@%3Cannounce.apache.org%3E

Apache Storm 1.1.3 and 1.2.2 were released. Both are maintenance releases with bug fixes and improvements, some of which address fault tolerance and performance.

http://storm.apache.org/2018/06/04/storm113-released.html
http://storm.apache.org/2018/06/04/storm122-released.html

Amazon EKS, their hosted Kubernetes service, is now generally available.

https://aws.amazon.com/blogs/aws/amazon-eks-now-generally-available/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Washington

How Uber Uses Open Source Big Data Technologies (Seattle) - Wednesday, June 13
https://www.meetup.com/Uber-Engineering-Events-Seattle/events/251052420/

Utah

Monthly Spark, Big Data, and Data Engineering Meetup (Salt Lake City) - Monday, June 11
https://www.meetup.com/apache-spark-slc/events/248643905/

Apache Kafka (South Jordan) - Thursday, June 14
https://www.meetup.com/Utah-Scala-Enthusiasts/events/250718458/

Missouri

Apache Heron (Saint Louis) - Wednesday, June 13
https://www.meetup.com/SaintLouis_FullStack_WebDevelopment/events/246318945/

Georgia

Crossing the Streams: Stream Processing for Spring Developers (Atlanta) - Tuesday, June 12
https://www.meetup.com/AtlantaSpring/events/249635880/

Building IoT Pipelines with Spark, Kafka, and MemSQL (Atlanta) - Thursday, June 14
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/250582933/

CANADA

Travis Jeffery from Confluent & Elizabeth Giles from PagerDuty Speak Kafka (Toronto) - Thursday, June 14
https://www.meetup.com/Toronto-Kafka/events/251202480/

HUNGARY

Big Data Meetup: BudapestData Edition (Budapest) - Wednesday, June 13
https://www.meetup.com/Big-Data-Meetup-Budapest/events/250204902/

AUSTRALIA

Kafka Deployment & Management in Prod (Sydney) - Tuesday, June 12
https://www.meetup.com/apache-kafka-sydney/events/250192290/