Data Eng Weekly


Hadoop Weekly Issue #241

19 November 2017

Lots of new releases this week, including Kafka, Hadoop, Hive, and Phoenix. Also, Azure Databricks and Cassandra API compatibility for Azure Cosmos DB are both in preview, and there are great technical articles covering Kafka, StreamSets, Redshift, and the Dist-Keras deep learning framework.

Technical

The Landoop blog has an overview of Lenses, their web- and API-based tool for exploring data in Kafka. Built around the Lenses SQL Engine, it detects data types from streams and supports real-time views, batch queries, functions, and "time traveling." There are tabular, tree, and raw views, and a Jupyter integration via the API.

http://www.landoop.com/blog/2017/11/lenses-how-to-view-kafka-topics-data/

The Qubole blog has a tutorial on using the Dist-Keras framework for deep learning as part of a Spark ML pipeline. While a small part of the post is Qubole-specific, most of it applies to anyone looking to use Spark for deep learning.

http://www.qubole.com/blog/distributed-deep-learning-keras-apache-spark/

Pivotal writes about how they migrated data between GemFire clusters using Apache Kafka for replication. There are some interesting details, including how they avoided an infinite loop using a technique similar to reverse path forwarding.

https://content.pivotal.io/blog/zero-downtime-migration-between-gemfire-releases
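
To illustrate the loop-avoidance idea, here is a minimal sketch of one common approach, assuming two clusters and assuming each update is tagged with the cluster where it was first written. It is not taken from the Pivotal post and does not use the GemFire or Kafka APIs; the `Update`, `Cluster`, and `replicate` names and the `origin` field are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    value: str
    origin: str  # hypothetical field: name of the cluster where the write first happened


class Cluster:
    """Toy stand-in for a data cluster; not the GemFire API."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = []  # stands in for a Kafka topic of updates awaiting replication

    def local_write(self, key, value):
        # A client write originates here, so tag it with this cluster's name.
        self.apply(Update(key, value, origin=self.name))

    def apply(self, update):
        # Store the value and queue the update for replication to the peer.
        self.data[update.key] = update.value
        self.outbox.append(update)


def replicate(source, target):
    """Drain source's outbox into target, dropping updates that originated at target."""
    forwarded = 0
    while source.outbox:
        update = source.outbox.pop(0)
        if update.origin == target.name:
            continue  # would travel back to where it came from: drop it to break the loop
        target.apply(update)
        forwarded += 1
    return forwarded
```

With clusters named "old" and "new", a write on "old" is forwarded once to "new", and the copy queued on "new" is dropped when replicating back. With more than two clusters this simple origin check would likely need to be extended.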

This post describes how to use the StreamSets Data Collector to save data to Amazon S3 and then load it into Snowflake DB.

https://blog.redpillanalytics.com/bulk-loading-zone-becac864cb12

A secure design for data access requires making tradeoffs. This post describes how to secure Amazon Redshift across multiple accounts, a setup whose apparent inconveniences can actually be automated. The walk-through describes loading data via Apache Spark on Amazon EMR and moving data across accounts (by assuming roles with AWS STS).

https://aws.amazon.com/blogs/big-data/create-an-amazon-redshift-data-warehouse-that-can-be-securely-accessed-across-accounts/

News

Azure Cosmos DB has announced preview support for Cassandra API compatibility. The implementation aims to be wire-compatible, so that only a change to the connection string is needed.

https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction

Azure Databricks was announced this week. Currently in preview, it includes a number of Azure-specific optimizations and features.

https://databricks.com/blog/2017/11/15/a-technical-overview-of-azure-databricks.html

Insight has announced that they're expanding their Data Engineering Fellows Program to a third city. The inaugural class of fellows in Boston will start in April. Applications are open now.

https://blog.insightdatascience.com/insight-data-engineering-fellows-program-expands-to-boston-36bd0d54c64f

Releases

Apache Hive 2.3.2 was released this week. It's a bug fix release covering several subcomponents, including the Hive metastore client and Kerberos.

https://lists.apache.org/thread.html/114bda98f7481eaf20a1738e1d31627ca30fbd466c387f73dda42812@%3Cuser.hive.apache.org%3E

Apache Phoenix 4.13.0 was released. It's compatible with HBase 0.98 and 1.3, and it includes improvements to collection of statistics, a critical bug fix for snapshot creation, and other bug fixes.

https://lists.apache.org/thread.html/fafc40ecae3e9679e3ddb2b477c05bb7c480ccf73163c75bf0943645@%3Cuser.phoenix.apache.org%3E

Apache ZooKeeper 3.4.11 was released with a number of improvements and bug fixes.

http://mail-archives.apache.org/mod_mbox/zookeeper-user/201711.mbox/%3cCANLc_9Ke0ka_5C3mjnLqtiD+fuKp9wCZ9s=30ENSiXJtwuZaNg@mail.gmail.com%3e

Confluent has released version 3.3.1 of their Confluent Platform distribution. There are a number of improvements to both the enterprise and open source versions (the latter is now based on Apache Kafka 0.11.0.1 and librdkafka 0.11.1).

https://docs.confluent.io/3.3.1/release-notes.html

Mocked Streams is a library for unit testing applications built on Apache Kafka and Kafka Streams. Version 1.5.0 was released with compatibility for Apache Kafka 1.0.

https://github.com/jpzk/mockedstreams/releases/tag/v1.5.0

Apache Hadoop 2.9.0 was released with a number of major new features across the Timeline Service, YARN Federation, the YARN Web UI, HDFS, and the CapacityScheduler API.

https://lists.apache.org/thread.html/974015c816760b20c6fc251ca0b14ba07bfd21b80b4941e2a4294317@%3Cuser.hadoop.apache.org%3E

A bug fix release of Apache Kafka, version 0.11.0.2, has come out. It contains some important fixes, including one for data loss.

https://lists.apache.org/thread.html/fa438f2553f8da1e20d3aee037039fda0e2b7382eca2b215edf34275@%3Cusers.kafka.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Georgia

Developing Java Streaming Applications (Atlanta) - Tuesday, November 21
https://www.meetup.com/atlantajug/events/244441484/

IRELAND

Leveraging Apache Kafka for Web Crawling and Data Processing (Cork) - Tuesday, November 21
https://www.meetup.com/OpenStack-Cork/events/243678867/

SWEDEN

The First Beam Stockholm (Stockholm) - Wednesday, November 22
https://www.meetup.com/Apache-Beam-Stockholm/events/244196560/

SPAIN

IBM BigSQL, 5 Years with Hadoop and HDF (Madrid) - Wednesday, November 22
https://www.meetup.com/futureofdata-madrid/events/244092435/

FRANCE

Apache Kafka Goes 1.0 (Paris) - Tuesday, November 21
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/244780393/

NETHERLANDS

Mini-Batch Processing with Spark Streaming (Nieuwegein) - Wednesday, November 22
https://www.meetup.com/Hands-On-Big-Data-Architecture/events/243703572/

RUSSIA

What’s New with Hadoop 3.0, Greenplum 5.0, Hive 2? (Moscow) - Wednesday, November 22
https://www.meetup.com/Scale-out-databases-and-engines/events/244930026/

INDIA

Recent Apache Hive Enhancements Powering Enterprise Data Analytics (Bangalore) - Wednesday, November 22
https://www.meetup.com/futureofdata-bangalore/events/244857610/

Data @ Scale Using Apache Spark (Bangalore) - Saturday, November 25
https://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/events/245075829/