Data Eng Weekly


Hadoop Weekly Issue #183

21 August 2016

This week's issue is short and sweet, featuring articles on Hadoop, Spark, Kafka, and HAWQ. On the release front, there are new versions of Apache Phoenix and Apache Gearpump, a relatively new incubator project for real-time streaming implemented with Akka actors.

Technical

SparkSession, exposed as spark in the spark-shell, is a new API in Spark 2.0. The SparkSession aims to be a unified entry point by providing the same functionality as the SparkContext, SQLContext, and more (including creation of Datasets and DataFrames). The Databricks blog has an overview of the main functionality of SparkSession.

https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
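
As a rough illustration of the "unified entry point" idea, here is a minimal PySpark sketch. It is not taken from the Databricks post. It assumes pyspark 2.0 or later is installed with a local Spark runtime available, and the function name, app name, view name, and sample rows are all made up for the example.

```python
def run_sparksession_sketch():
    """Use one SparkSession for DataFrame, SQL, and RDD work (hypothetical demo)."""
    # Imported inside the function so this file still loads where pyspark is absent.
    from pyspark.sql import SparkSession

    # A single builder call replaces separate SparkContext/SQLContext setup.
    spark = (
        SparkSession.builder
        .master("local[*]")                  # assumption: run locally
        .appName("sparksession-sketch")      # hypothetical app name
        .getOrCreate()
    )
    try:
        # DataFrame creation goes through the session.
        df = spark.createDataFrame(
            [("hadoop", 183), ("spark", 2)],  # made-up sample rows
            ["project", "value"],
        )

        # SQL goes through the same session (previously SQLContext).
        df.createOrReplaceTempView("projects")
        rows = spark.sql(
            "SELECT project FROM projects WHERE value > 100"
        ).collect()

        # The underlying SparkContext is still reachable for RDD work.
        total = spark.sparkContext.parallelize([1, 2, 3]).sum()

        return [row["project"] for row in rows], total
    finally:
        spark.stop()
```

Calling run_sparksession_sketch() starts a local session, runs the three steps, and shuts the session down. In the spark-shell or pyspark shell the session already exists as spark, so the builder step is unnecessary there.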

The Hortonworks blog has a post about Hadoop in the cloud. It discusses some of the challenges (e.g. different semantics in blob stores, different security integration), and the improvements planned to address them (e.g. caching, improved connectors).

http://hortonworks.com/blog/making-elephant-fly-cloud/

Pivotal has created a Docker-based sandbox (both single- and multi-node) for Apache HAWQ (incubating). HAWQ is an MPP database that's integrated with Hadoop. Their introductory blog post has background and information on getting started.

https://blog.pivotal.io/pivotal/products/new-single-multi-node-sandboxes-for-pivotal-hdb-apache-hawq

Apache Hadoop has a tool called create-release for building releases, creating release notes, signing artifacts, and more. This post describes how to use the tool to build your own release, and how to use some of the non-default settings such as building native libraries and building via Docker.

https://effectivemachines.com/2016/08/16/building-your-own-apache-hadoop-distribution/

The LINE Engineers' Blog has a post on how LINE is deploying Kafka with Kafka Streams. The article describes how they decided on Kafka Streams (vs. Samza), some of its compelling features, two applications built on the platform, and some of the improvements to Kafka Streams the LINE team has made.

http://developers.linecorp.com/blog/?p=3960

News

Hortonworks and Microsoft have a strong partnership: the Hortonworks Data Platform powers Azure HDInsight, and Hortonworks worked on Windows support for Hadoop. This interview has more details about the relationship and the state of Hadoop on Microsoft Azure.

http://hortonworks.com/blog/microsoft-hortonworks-qa-hans-weiser-mark-mason/

The Hadoop project recently captured some criteria and example paths to becoming a committer for the project. If you're a contributor or are thinking of contributing, this adds some useful context to what it takes to become a committer.

https://hadoop.apache.org/committer_criteria.html

Cloudera has reaffirmed its commitment to Apache Spark. The post highlights streaming and machine learning applications, and it notes that expectations for Spark 2.0 are high.

http://vision.cloudera.com/enhanced-streaming-and-machine-learning-with-apache-spark-2-0/

Releases

Apache Gearpump (incubating), the real-time stream processing system built on Akka, has released version 0.8.1-incubating. The release includes a number of changes (e.g. link updates, package renames) related to the project's entry into the Apache Incubator.

http://mail-archives.apache.org/mod_mbox/incubator-general/201608.mbox/%3CCAB9X6p1vY3GtFd8QBXJdBh9gP10Ex%3DNn3L69hiqbU5HX6X9OUw%40mail.gmail.com%3E

Apache Phoenix 4.8 was released this week. The release includes a number of bug fixes, OFFSET support for pagination, Apache Hive integration, and more.

https://blogs.apache.org/phoenix/entry/announcing_phoenix_4_8_released
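
To make the pagination use case concrete, here is a small sketch of how a client might build page queries with the new OFFSET clause. The helper, table name, and column name are hypothetical and not part of Phoenix. The sketch only assembles a SQL string of the form LIMIT n OFFSET m and does not connect to a cluster.

```python
def page_query(table, order_by, page, page_size):
    """Build a LIMIT/OFFSET query string for a 1-based page number.

    Hypothetical helper: table and order_by are interpolated directly,
    so they must come from trusted code, never from user input.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    # Page 1 skips 0 rows, page 2 skips page_size rows, and so on.
    offset = (page - 1) * page_size
    # ORDER BY keeps the row order stable so pages do not overlap.
    return "SELECT * FROM {} ORDER BY {} LIMIT {} OFFSET {}".format(
        table, order_by, page_size, offset
    )


# Example with a made-up table and key column: third page of 20 rows.
query = page_query("EVENTS", "EVENT_ID", 3, 20)
```

OFFSET-based paging still has to skip over the earlier rows on each request, so deep pages may get slower. Whether that matters depends on the table and the query.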

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Stream Processing Meetup @ LinkedIn (Mountain View) - Tuesday, August 23
http://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/232864129/

#SDBigData Meetup #17 (San Diego) - Wednesday, August 24
http://www.meetup.com/sdbigdata/events/229927355/

Lambda, Kinesis, Spacepods & More... (Santa Monica) - Wednesday, August 24
http://www.meetup.com/Los-Angeles-AWS-Users-Group/events/233376105/

Apache Cassandra + Spark Makeover w/ Apache Zeppelin & Scaling Cassandra at Uber (San Francisco) - Wednesday, August 24
http://www.meetup.com/CassandraSF/events/229748986/

Focusing on Ingest into Hive/Impala and Streaming with Kafka (Palo Alto) - Thursday, August 25
http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/232976471/

Meetup at Salesforce (San Francisco) - Thursday, August 25
http://www.meetup.com/spark-users/events/232428471/

Texas

Beyond ETL: Real-Time, Streaming Architectures (Plano) - Tuesday, August 23
http://www.meetup.com/DFW-BigData/events/233238809/

Missouri

Processing & Serving 60 Terabytes of Data… Per Day! (Kansas City) - Tuesday, August 23
http://www.meetup.com/Data-Science-KC/events/228335742/

Talend: Integrating Real-time Data Streams with Spark and Kafka (Kansas City) - Thursday, August 25
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/233052359/

Illinois

Flink Cluster on Cloud in Minutes to Analyze Streaming Time Series Data (Chicago) - Wednesday, August 24
http://www.meetup.com/Chicago-Apache-Flink-Meetup/events/233117347/

Wisconsin

2016 BigDataWisconsin Conference (Madison) - Monday, August 22
http://www.meetup.com/BigDataMadison/events/233079426/

North Carolina

Integrating Hadoop and SQL Server and Comparison of All SQL-on-Hadoop Options (Durham) - Thursday, August 25
http://www.meetup.com/futureofdata-triangle/events/233312637/

Virginia

Fast-Data Meetup Event (McLean) - Wednesday, August 24
http://www.meetup.com/Fast-Data-DC-NoVA-MD-DC/events/233035440/

UNITED KINGDOM

Lighting Fires and Predicting User Behaviour with Spark (London) - Wednesday, August 24
http://www.meetup.com/Spark-London/events/233326443/

Harnessing Kafka for Payment Processing at Massive Scale w/ Joe Nash, Improbable (London) - Thursday, August 25
http://www.meetup.com/andchat/events/233307849/

SWEDEN

Join Us for Our First Meetup in Stockholm, Hosted by Spotify (Stockholm) - Tuesday, August 23
http://www.meetup.com/s9s-Polyglot-Resistence-Stockholm/events/232706346/

NETHERLANDS

Spark, Topic Models, and Content Recommendation (Amsterdam) - Friday, August 26
http://www.meetup.com/SEA-Search-Engines-Amsterdam/events/230808199/

TURKEY

Ankara Tech Talks #3 (Ankara) - Friday, August 26
http://www.meetup.com/Ankara-Tech-Talks/events/233202419/

INDIA

Apache Storm and Real-Time Data Ingestion (Noida) - Wednesday, August 24
http://www.meetup.com/futureofdata-noida/events/232990509/

MapReduce and the Art of Thinking Parallel: Dr. Shailesh Kumar (Hyderabad) - Saturday, August 27
http://www.meetup.com/hyderabad-scalability/events/233351716/

Anatomy of Spark SQL Catalyst, Part 2 (Bangalore) - Saturday, August 27
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/233204539/

TAIWAN

MLDM Monday | Spark (Taipei) - Monday, August 22
http://www.meetup.com/Taiwan-R/events/233202305/