Data Eng Weekly


Hadoop Weekly Issue #236

15 October 2017

Technical highlights in this week's issue include a post on cross-data center replication with Apache Pulsar (incubating) and a new SQL stream processing tool from Uber called AthenaX. There are also a number of benchmarks in this week's issue and several conference-related announcements (videos posted from Big Data LA, CFP for Big Data Technology Warsaw, and a preview of Spark Summit EU).

Technical

Hortonworks uses 21,000 compute hours a day to run tests for Hadoop ecosystem projects. The majority of these are nightly automated tests, but there are four testing environments in total. This post describes more about the testing stages.

https://hortonworks.com/blog/automated-validation-apache-hadoop-ecosystem/

Fivetran, which builds a product for syncing data into a data warehouse, has written up a benchmark comparing Redshift, Snowflake, and BigQuery. It shows that for a client's typical use case (lots of tables), all the systems behave similarly. One should always be careful about reading too much into benchmarks, but in this case they also comment on several previous benchmarks to explain why their analysis differs.

https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1

In another benchmark, the Cloudera blog considers the performance characteristics of the Azure Data Lake Store (ADLS). Using the Teragen/Terasort/Teravalidate benchmark to measure throughput, they ultimately find that ADLS is comparable to Azure disk storage.

http://blog.cloudera.com/blog/2017/10/a-look-at-adls-performance-throughput-and-scalability/

This post describes how to perform cross-data center replication with Apache Pulsar (incubating). It covers the commands needed to set up replication, how to override it on a per-application basis, how to monitor it, how to limit replication bandwidth, and more.

https://streaml.io/blog/geo-replication-patterns-practices/

After hello world, the "twitter data" tutorial seems to be the next example used for most streaming systems. In this case, Confluent has a demo of analyzing tweets in Kafka using KSQL.

https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka

One final benchmark for this week: Databricks has compared Spark Structured Streaming to Flink and Kafka Streams.

https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html

Qubole has a post that shows how to use PyHive from a Jupyter notebook to run queries against a Hive or Presto cluster. While parts of the post are Qubole-specific, the instructions should work as long as HiveServer2 is enabled on your cluster.

https://www.qubole.com/blog/hive-presto-clusters-jupyter-aws-azure-oracle/
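
As a rough illustration of the PyHive approach, the sketch below runs a query through any DB-API 2.0 connection and returns the rows as dictionaries. It is not taken from the Qubole post: the host name, port, username, and the helper names open_hive_connection and run_query are assumptions for illustration. The PyHive import is deferred so the query helper can be tried without a cluster.

```python
def open_hive_connection(host="hiveserver2.example.com", port=10000, username="analyst"):
    """Open a HiveServer2 connection with PyHive (assumes PyHive is installed)."""
    # Imported lazily so the rest of this sketch works without PyHive.
    from pyhive import hive
    # Host, port, and username are placeholders, not values from the post.
    return hive.Connection(host=host, port=port, username=username)


def run_query(conn, sql):
    """Execute `sql` on a DB-API 2.0 connection and return rows as a list of dicts."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # cursor.description holds one tuple per column; the name comes first.
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()
```

In a notebook cell this might be called as run_query(open_hive_connection(), "SHOW TABLES"). Because run_query relies only on the DB-API cursor interface, a connection from pyhive.presto should work with it as well.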

This tutorial is a tour of both tools and machine learning models for ultimately predicting whether a song will make the Billboard top 10. On the tools front, RStudio, H2O.ai, and Amazon Athena are used for analysis with GLM, GBM, and deep learning models. The walkthrough includes a bunch of code and example output.

https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Kontena has a post on using Kafka with microservices. It provides a good overview of how Kafka's distributed log semantics and data retention policies can power event sourcing microservices.

http://blog.kontena.io/event-sourcing-microservices-with-kafka/
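
To make the event-sourcing idea concrete, here is a minimal in-memory sketch in which a plain Python list stands in for a single Kafka topic partition. It is not taken from the Kontena post and uses no Kafka client; the event types (Deposited, Withdrawn) and the AccountLog class are hypothetical. The point is that current state is never stored directly, only derived by replaying an append-only, ordered log from some offset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event type: "Deposited" or "Withdrawn"
    amount: int


@dataclass
class AccountLog:
    """Append-only event log; a list stands in for one Kafka topic partition."""
    events: list = field(default_factory=list)

    def append(self, event):
        """Append an event and return its offset (its position in the log)."""
        self.events.append(event)
        return len(self.events) - 1

    def replay(self, from_offset=0, balance=0):
        """Fold events from `from_offset` onward into a starting balance."""
        for event in self.events[from_offset:]:
            if event.kind == "Deposited":
                balance += event.amount
            elif event.kind == "Withdrawn":
                balance -= event.amount
        return balance


log = AccountLog()
log.append(Event("Deposited", 100))
log.append(Event("Withdrawn", 30))
snapshot_offset = log.append(Event("Deposited", 5)) + 1  # next offset to read
snapshot_balance = log.replay()                          # state at the snapshot
log.append(Event("Withdrawn", 25))

# State can be rebuilt from the start of the log or resumed from a snapshot.
print(log.replay())                                   # prints 50
print(log.replay(snapshot_offset, snapshot_balance))  # prints 50
```

Retention matters in this pattern because a consumer that replays from offset 0 needs the full history to still be present. With a bounded retention period, a snapshot plus the events that follow it would be needed instead.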

News

Videos from Big Data LA, which took place in August, have been posted online.

https://www.bigdatadayla.com/#slides

The Roaring Elephant podcast has an episode this week that recaps a number of sessions from the recent DataWorks Summit in Sydney. Topics covered include Hadoop 3.0, SparkR, and Kerberos.

https://roaringelephant.org/2017/10/10/episode-56-dataworks-summit-sydney-recap-by-dave-part-1/

The Call for Papers for Big Data Technology Warsaw, which takes place in February, ends tomorrow. This post has more about what to expect from the conference.

http://getindata.com/2-3-reasons-become-speaker-big-data-technology-warsaw-summit-2018/

Spark Summit EU takes place in just over a week in Dublin. The Databricks blog has more details and a discount code if you haven't yet signed up.

https://databricks.com/blog/2017/10/11/biggest-eu-summit-ever-one-hundred-presentations-two-conference-days-day-training.html

A CVE for Apache NiFi was announced. Some versions prior to 1.4.0 (released earlier this month) are subject to an XML External Entity (XXE) attack.

https://mail-archives.apache.org/mod_mbox/www-announce/201710.mbox/%3C13B90414-1C62-4858-BD74-051F67F1F6D4%40apache.org%3E

A CVE for Apache ZooKeeper was also announced. In this case, certain commands can cause a denial of service. More details about affected versions and mitigation are in the announcement.

https://mail-archives.apache.org/mod_mbox/www-announce/201710.mbox/%3CCANLc_9KJTmetFt6MrsFQm%2Badr-1w2VeGYyMJMVVZ281-3UmJKw%40mail.gmail.com%3E

Releases

Uber has open-sourced its stream processing framework, AthenaX, which allows users to write streaming jobs in SQL. The announcement describes the motivation and the high-level architecture of AthenaX, which is built on Apache Flink and Apache Kafka. It also uses Apache YARN for deployment. Uber has over 220 applications in multiple data centers running via AthenaX.

https://eng.uber.com/athenax/

Apache Phoenix 4.12 was released. It includes resolution of over 100 issues, a new approximate count distinct function, a new table sampling feature, improvements to secondary indexing, and more.

https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-12-released

Cloudera Director 2.6 was released. Among the changes are new TLS capabilities, improvements related to SSH keys, and improvements to the Microsoft Azure plugin.

http://blog.cloudera.com/blog/2017/10/whats-new-in-cloudera-director-2-6/

Gluon is a new machine learning library from AWS and Microsoft. It's available today as part of Apache MXNet.

https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

TensorFlow on Apache Hadoop YARN (San Francisco) - Wednesday, October 18
https://www.meetup.com/SF-Big-Analytics/events/243896061/

Apache Ignite: Building Consistent, HA Distributed Systems & Memory-Centric SQL (San Ramon) - Thursday, October 19
https://www.meetup.com/datariders/events/243704206/

The Evolving Landscape of Data Engineering & How Systems Fail (San Francisco) - Thursday, October 19
https://www.meetup.com/Data-Engineering-Club/events/243137230/

Utah

Spark MLlib (Salt Lake City) - Monday, October 16
https://www.meetup.com/apache-spark-slc/events/243934568/

Minnesota

KSQL and Data Management with Kafka (Minneapolis) - Wednesday, October 18
https://www.meetup.com/TwinCities-Apache-Kafka/events/243858653/

Georgia

Leveraging Messaging Platforms Such as Kafka for Real-Time Streaming Transaction (Atlanta) - Thursday, October 19
https://www.meetup.com/BigData-Atlanta/events/244083390/

North Carolina

Real-World Deployments with Apache Kafka (Raleigh) - Tuesday, October 17
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/243894809/

Virginia

Spring for Apache Kafka (Richmond) - Wednesday, October 18
https://www.meetup.com/Richmond-Java-Users-Group/events/243370292/

New York

Exactly-Once Semantics in Apache Kafka (New York) - Monday, October 16
https://www.meetup.com/Apache-Kafka-NYC/events/243901137/

IRELAND

Let's Talk Kafka Streams (Dublin) - Thursday, October 19
https://www.meetup.com/hadoop-user-group-ireland/events/243387159/

UNITED KINGDOM

Stream SQL and Real-Time Applications with Apache Flink (London) - Wednesday, October 18
https://www.meetup.com/Apache-Flink-London-Meetup/events/244056097/

NORWAY

Big Data Analytics for Small- and Medium-Size Enterprises (Oslo) - Tuesday, October 17
https://www.meetup.com/OsloBigDataDay/events/237398012/

FINLAND

Helsinki Apache Kafka Meetup October (Helsinki) - Wednesday, October 18
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/243673560/

FRANCE

Stream SQL and Real-Time Applications with Apache Flink (Neuilly-Sur-Seine) - Thursday, October 19
https://www.meetup.com/Paris-Fast-Data-Meetup/events/244055726/

GERMANY

Stream Processing and War Stories (Berlin) - Thursday, October 19
https://www.meetup.com/fast-data-berlin/events/243706029/

ISRAEL

Apache Kafka Streams: Building Distributed, Fault-Tolerant Processing Apps (Tel Aviv-Yafo) - Wednesday, October 18
https://www.meetup.com/Apache-Kafka-TLV/events/243455001/