Data Eng Weekly


Hadoop Weekly Issue #209

19 March 2017

Some big product announcements at Strata + Hadoop World, a number of open-source project releases, and several strong technical posts all make for a great issue this week.

Technical

Uber has open-sourced Hoodie, their library for incremental processing on Hadoop with Spark. Somewhat reminiscent of early MPP databases, Hoodie supports a read-optimized view of data (using Parquet) and a hybrid view that combines columnar with row-based (Avro) data. The read-optimized store can be used for incremental processing while the hybrid view supports real-time queries. In either case, Hoodie does quite a bit of heavy lifting to support updates in addition to inserts; the introductory post describes the ingest path in detail.

https://eng.uber.com/hoodie/
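
To make the difference between the two views concrete, here is a minimal sketch in plain Python. It is not Hoodie's API: the Record type, the commit_time field, and both function names are hypothetical, and in-memory lists stand in for the Parquet base data and the Avro log. It only illustrates the idea of overlaying newer row-based upserts on a columnar snapshot, with the latest commit winning per key.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    key: str          # record key used to match an update to an existing row
    commit_time: int  # hypothetical, monotonically increasing commit id
    value: dict


def read_optimized_view(base: List[Record]) -> Dict[str, Record]:
    """Only the compacted base data; upserts still in the log are not visible."""
    return {rec.key: rec for rec in base}


def hybrid_view(base: List[Record], delta_log: List[Record]) -> Dict[str, Record]:
    """Base data overlaid with newer log entries; the latest commit wins per key."""
    merged = read_optimized_view(base)
    for rec in sorted(delta_log, key=lambda r: r.commit_time):
        current = merged.get(rec.key)
        if current is None or rec.commit_time >= current.commit_time:
            merged[rec.key] = rec  # an insert (new key) or an update (existing key)
    return merged


if __name__ == "__main__":
    base = [Record("trip-1", 1, {"fare": 10}), Record("trip-2", 1, {"fare": 20})]
    log = [Record("trip-1", 2, {"fare": 12}), Record("trip-3", 2, {"fare": 5})]
    print(sorted(hybrid_view(base, log)))
```

In this toy model, a query against the read-optimized view sees only what has been compacted into the base, while the hybrid view also sees the update to trip-1 and the newly inserted trip-3.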

Docker containers aren't supposed to add much overhead, but it's good to see benchmarking results that prove the case. BlueData put their BlueData EPIC, which runs Hadoop in Docker containers, up against bare-metal Hadoop on the BigBench benchmark. BlueData's proprietary technology (IOBoost and DataTap) prevents this from being an apples-to-apples comparison, but it still shows promise for the future of Big Data in Docker.

https://www.bluedata.com/blog/2017/03/big-data-performance-benchmarking/

The 0.6.0 release of Apache Beam adds a Python SDK. The introductory post gives a brief intro on how to get started with the new SDK and provides an example pipeline for estimating the value of pi.

https://beam.apache.org/blog/2017/03/16/python-sdk-release.html
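
For readers unfamiliar with the technique, the following is a rough sketch of a Monte Carlo pi estimate written as a map step followed by a combine step, which is the general shape such a pipeline tends to take. It deliberately uses only the Python standard library rather than the Beam SDK, and the function names and batch sizes are illustrative assumptions, not taken from the Beam example.

```python
import random
from functools import reduce


def run_trials(batch):
    """Map step: throw random darts at the unit square and count how many
    land inside the quarter circle of radius 1."""
    seed, trials = batch
    rng = random.Random(seed)  # one seeded generator per batch, for repeatability
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return trials, inside


def combine(a, b):
    """Combine step: add up the (total, inside) counts from two batches."""
    return a[0] + b[0], a[1] + b[1]


def estimate_pi(num_batches=10, trials_per_batch=20_000):
    batches = [(seed, trials_per_batch) for seed in range(num_batches)]
    total, inside = reduce(combine, map(run_trials, batches))
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / total


if __name__ == "__main__":
    print(estimate_pi())
```

Because each batch is independent and the combine step is associative, the batches could in principle be processed in parallel by a distributed runner.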

The Morning Paper covers another paper from the FAST conference, this week on Apache Omid (incubating). Omid is a transaction layer atop Apache HBase that was originally built at Yahoo. The post describes the architecture, gives a high-level overview of the transaction model (and how the client interacts), describes the model for high availability, and more.

https://blog.acolyer.org/2017/03/17/omid-reloaded-scalable-and-highly-available-transaction-processing/

This post shows how to use RStudio with Amazon Athena to query data in S3.

https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/

SiliconANGLE has an interview with Holden Karau on the future of Spark. There's a video of the ~15 minute interview and an excerpt about how Spark MLlib's local mode might be useful for applying machine learning models on devices for IoT applications.

http://siliconangle.com/blog/2017/03/15/spark-ml-getting-closer-edge-improve-latency-bigdatasv/

News

Hortonworks has announced that they'll be bringing major revisions of their platform to cloud platforms before releasing them for on-premises deployments.

https://hortonworks.com/blog/microsoft-hortonworks-empower-azure-hdinsight-customers-first-benefit-innovation/

MapR has introduced a new product called MapR Edge, which is a scaled-down version of the MapR platform that can do simple operations on data closer to the source. It's specifically aimed at IoT use cases, and it has a bandwidth-awareness feature for delivering data over unreliable or slow connections.

https://mapr.com/blog/introducing-mapr-edge-internet-things-devices-create-ton-data/

Based on their Sense.io acquisition, Cloudera is launching the Cloudera Data Science Workbench, which is somewhat like Jupyter or Zeppelin notebooks. There's an enterprise take on the system, though, in that it focuses on security and compliance as well as reproducibility.

http://vision.cloudera.com/cloudera-data-science-workbench-self-service-data-science-for-the-enterprise/

Microsoft had a bunch of announcements at Strata + Hadoop World. Among them, they announced that they've built a Spark connector for DocumentDB, which is their distributed data store that supports SQL and is compatible with the MongoDB query interface. They also announced improved security support for Hive's LLAP (via HDP 2.6) and Spark, a 99.9% SLA for Apache Spark 2.1 on Azure HDInsight, improved streaming support from Spark to Azure Event Hubs, and more.

https://azure.microsoft.com/en-us/blog/announcing-new-capabilities-of-hdinsight-and-documentdb-at-strata/

The second annual PhoenixCon is taking place on June 13th (the day after HBaseCon). It runs from 10:30am to 6pm at the Salesforce San Francisco office.

https://www.eventbrite.com/e/phoenixcon-2017-tickets-32872245772

Slides and videos from many presentations given at the Spark Summit East, which was just over a month ago in Boston, have been posted online.

https://spark-summit.org/east-2017/schedule/

Releases

Version 2.6.1 of the Luigi workflow engine was released. The new version includes some updates and bug fixes as well as new support for running jobs via Kubernetes.

https://github.com/spotify/luigi/releases/tag/2.6.1

Apache Drill 1.10.0 came out this week. It adds a new 'create temporary table as ...' command, improved support for timestamps in Parquet files, better fault tolerance over JDBC, support for Kerberos authentication, and more.

https://drill.apache.org/docs/apache-drill-1-10-0-release-notes/

In addition to the Python SDK support, this week's Apache Beam release adds support for Apache HBase, improved support for the Beam model in runner implementations, and more.

https://lists.apache.org/thread.html/9b8de059f29af54754e0e41d7e4a867d2ae25b3b1515d97c0e52f918@%3Cuser.beam.apache.org%3E

Version 4.1 of the Cask Data Application Platform was released. The Cask blog has a post on the major improvements, including fine-grained secure impersonation, hot-cold replication, and improved user experience.

http://blog.cask.co/2017/03/cdap-4-1-more-enterprise-grade-hardening-pre-built-solutions-and-enhanced-ux/

This project provides an example template for running Kafka on OpenShift/Kubernetes.

https://github.com/sabre1041/fis-kafka

Version 0.2.13 of scio, the Scala API for Apache Beam / Google Cloud Dataflow, was released. There are some performance improvements and bug fixes in the release. There's also a beta of version 0.3.0, which is the release in which scio moves from the Dataflow Java SDK to Apache Beam.

https://github.com/spotify/scio/releases/tag/v0.2.13
https://github.com/spotify/scio/releases/tag/v0.3.0-beta2

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Bay Area Apache Spark Meetup (Santa Clara) - Thursday, March 23
https://www.meetup.com/spark-users/events/237793340/

Oregon

Spark and MapR Streams: A Motivating Example (Portland) - Tuesday, March 21
https://www.meetup.com/PDXJUG/events/238323828/

Washington

Spark Streaming @ Expedia and Implementing Spark Streaming Connector (Bellevue) - Thursday, March 23
https://www.meetup.com/Seattle-Spark-Meetup/events/230310598/

Minnesota

Integrating Real-Time Video Data Streams With Spark and Kafka (Saint Paul) - Thursday, March 23
https://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/238056069/

Illinois

What's Changed With Apache Spark's Structured Streaming? (Chicago) - Tuesday, March 21
https://www.meetup.com/Chicago-Spark-Users/events/238143885/

Flink Forward SF Sneak Peak Double Feature! (Chicago) - Wednesday, March 22
https://www.meetup.com/Chicago-Apache-Flink-Meetup-CHAF/events/238027655/

Georgia

Building a Self-Service Analytics Platform on Hadoop (Atlanta) - Wednesday, March 22
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/238108903/

Massachusetts

March Presentation Night (Cambridge) - Tuesday, March 21
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/238211198/

Introduction to Spark Structured Streaming (Cambridge) - Thursday, March 23
https://www.meetup.com/Big-Data-Developers-in-Boston/events/237850996/

UNITED KINGDOM

How Apache Kafka Can Change Your Life (Brighton) - Wednesday, March 22
https://www.meetup.com/Brighton-Java/events/238214732/

SPAIN

Speed Talks (Barcelona) - Tuesday, March 21
https://www.meetup.com/Spark-Barcelona/events/237242600/

Functional Programming in Practice: Abstracting away from Spark (Barcelona) - Thursday, March 23
https://www.meetup.com/Scala-Developers-Barcelona/events/238052363/

FRANCE

Big Data & Data Science (Paris) - Monday, March 20
https://www.meetup.com/Big-Data-Developers-in-Paris/events/238143029/

Introduction to Kudu: Fast Analytics on Fast Data (Neuilly-sur-Seine) - Thursday, March 23
https://www.meetup.com/Cloudera-User-Group-France/events/238303284/

AUSTRIA

Tableau and Structured Streaming in Spark (Vienna) - Tuesday, March 21
https://www.meetup.com/Vienna-Modern-Data-Science/events/237915493/

HUNGARY

Hadoop Deployment into the Cloud (Budapest) - Tuesday, March 21
https://www.meetup.com/futureofdata-budapest/events/237518196/

ISRAEL

High Performance Big Data Techniques Should Be Easy (Tel Aviv-Yafo) - Monday, March 20
https://www.meetup.com/Big-things-are-happening-here/events/238184354/

Data Processing in Hadoop: Analytics & Data Pipelines in Practice (Ra'anana) - Tuesday, March 21
https://www.meetup.com/Big-Data-Israel/events/238037899/

Big Data Meetup: Amazon EMR and Nuviad (Tel Aviv-Yafo) - Tuesday, March 21
https://www.meetup.com/Big-Data-Israel/events/238221948/

INDIA

Introduction to Concurrent Programming With Akka Actors (Bangalore) - Saturday, March 25
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/237647587/