Data Eng Weekly


Hadoop Weekly Issue #160

06 March 2016

This week, Hortonworks made several announcements, including a new partnership and changes to the way they're shipping HDP. They also announced support for Spark 1.6, and Spark is a big theme this week (with articles on logging, memory settings, and GraphFrames). On the release front, there are new versions of Apache Hive, MRQL (incubating), and Kudu (incubating).

Technical

As the Hadoop ecosystem has grown to include a number of stream processing and low-latency querying frameworks, the term "real-time" gets thrown around a lot. This post aims to disambiguate the phrase by explaining the differences between sub-second response, human-comfortable response time, event-driven, streaming data processing, and near real-time.

http://bigdatapage.com/4-really-real-meanings-of-real-time/

This post takes a look at configuring and generating logs in Spark. Because Spark serializes closures and distributes them to executors, there is a gotcha in how loggers can be used inside them. The post describes a couple of solutions to this issue.

https://medium.com/@anicolaspp/how-to-log-in-apache-spark-f4204fad78a#.dg4c1n4y4
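
On the JVM, a common way around this kind of gotcha is to keep the logger out of the serialized closure and create it lazily on each executor (in Scala, often a `@transient lazy val`). The linked post may take a different approach. The sketch below is only a rough Python analog of that pattern using the standard library: `RecordParser` and its methods are hypothetical names, and `pickle` stands in for Spark's closure serialization.

```python
import logging
import pickle


class RecordParser:
    """Hypothetical worker object that would be shipped inside a task closure."""

    def __init__(self, name="parser"):
        self.name = name
        self._log = None  # created on first use, never serialized

    @property
    def log(self):
        # Lazily build the logger in whichever process ends up using it.
        if self._log is None:
            self._log = logging.getLogger(self.name)
        return self._log

    def __getstate__(self):
        # Drop the logger before serialization, similar to a transient field.
        state = self.__dict__.copy()
        state["_log"] = None
        return state

    def parse(self, line):
        self.log.debug("parsing %r", line)
        return line.strip().split(",")


# Simulate shipping the object to another process and using it there.
shipped = pickle.loads(pickle.dumps(RecordParser()))
fields = shipped.parse("a,b\n")
```

The point of the design is that the serialized state never contains the logger, so each receiving process builds its own on first use.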

Altiscale has "part 1.1" in a series on Hadoop NodeGroups, in which it discusses enabling the four-layer network topology for a Docker-based deployment and the discovery of a related performance degradation.

https://www.altiscale.com/blog/a-discussion-of-the-configuration-of-rack-aware-replica-placement-with-hadoop-nodegroup-implementation/

The DataTorrent blog has a post describing how Apache Apex (incubating) implements exactly-once processing even when interacting with external systems. It describes how the semantics are maintained with Kafka as input and JDBC as output. There is also an overview of a new Kafka 0.9-based connector, which is much simpler due to the new consumer API in that release.

https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
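
One general way to get exactly-once output into a transactional database is to commit each batch together with a marker recording that the batch was written, so a replay after failure becomes a no-op. The sketch below illustrates that idea only; it is not taken from the Apex code base. SQLite stands in for a JDBC database, and the `events` and `meta` tables, the `committed_window` column, and the `write_window` function are all hypothetical names.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a data table and a one-row marker table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (window_id INTEGER, payload TEXT)")
    conn.execute(
        "CREATE TABLE meta (id INTEGER PRIMARY KEY CHECK (id = 1), "
        "committed_window INTEGER)"
    )
    conn.execute("INSERT INTO meta VALUES (1, -1)")
    conn.commit()
    return conn


def write_window(conn, window_id, payloads):
    """Write one window's rows and its id atomically; skip replayed windows."""
    (committed,) = conn.execute(
        "SELECT committed_window FROM meta WHERE id = 1"
    ).fetchone()
    if window_id <= committed:
        return False  # already committed before the failure, so do nothing
    with conn:  # one transaction: commit on success, roll back on exception
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            [(window_id, p) for p in payloads],
        )
        conn.execute(
            "UPDATE meta SET committed_window = ? WHERE id = 1", (window_id,)
        )
    return True


conn = open_store()
write_window(conn, 0, ["a", "b"])
write_window(conn, 1, ["c"])
write_window(conn, 1, ["c"])  # replay after a simulated restart: ignored
```

Because the rows and the marker are committed in the same transaction, a crash leaves either both or neither in place, and replaying input from the source cannot produce duplicates.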

In another series on the Altiscale blog, part 4 of their Spark on Hadoop series covers Spark memory settings. It looks at how command-line arguments for drivers and executors correspond to the actual memory allocated for JVMs when running Spark-on-YARN. To explain these numbers, it dives into logs and source code.

https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-settings/
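
As a back-of-the-envelope illustration of why the requested executor memory and the YARN container size differ, the sketch below adds an off-heap overhead to the requested heap and rounds the total up to the scheduler's allocation unit. The numbers are assumptions rather than figures from the article: an overhead of max(384 MB, 10% of executor memory), which is the Spark 1.6 default for `spark.yarn.executor.memoryOverhead` as far as I recall, and rounding up to a 1024 MB `yarn.scheduler.minimum-allocation-mb`. The function name `yarn_container_mb` is hypothetical, and actual behavior depends on the Spark version and YARN scheduler in use.

```python
def yarn_container_mb(executor_memory_mb,
                      overhead_factor_pct=10,
                      overhead_min_mb=384,
                      yarn_min_alloc_mb=1024):
    """Estimate the YARN container size for one executor (all values assumed)."""
    # Off-heap overhead: a percentage of the heap, with a fixed floor.
    overhead_mb = max(executor_memory_mb * overhead_factor_pct // 100,
                      overhead_min_mb)
    requested_mb = executor_memory_mb + overhead_mb
    # Round up to the next multiple of the scheduler's allocation unit.
    units = -(-requested_mb // yarn_min_alloc_mb)
    return units * yarn_min_alloc_mb


print(yarn_container_mb(1024))  # 1024 + 384 = 1408, rounded up to 2048
print(yarn_container_mb(4096))  # 4096 + 409 = 4505, rounded up to 5120
```

Under these assumptions, asking for a 4 GB executor would occupy a 5 GB container, which is the kind of gap the article traces through logs and source code.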

The AWS Big Data Blog has a tutorial about using DynamoDB from Spark. The article describes how to create an RDD using the Hadoop DynamoDBInputFormat, an approach that works for any Hadoop InputFormat.

http://blogs.aws.amazon.com/bigdata/post/Tx1G4SQRV049UL0/Analyze-Your-Data-on-Amazon-DynamoDB-with-Apache-Spark

This post describes GraphFrames, a graph library built on Spark DataFrames. The library offers Python, Java, and Scala APIs. There's an example of running some basic graph queries as well as PageRank, and a discussion of the relationship between GraphFrames and Spark's GraphX.

https://databricks.com/blog/2016/03/03/introducing-graphframes.html

News

Hortonworks held an event called "The Future of Data" this week. ZDNet and CMSWire have coverage of the announcements, which include a new partnership with Hewlett Packard Enterprise on Apache Spark, changes to the support model for Hortonworks DataFlow, and a new release schedule for Hortonworks Data Platform (more details below).

http://www.zdnet.com/article/hortonworks-revamps-its-stack-further-embraces-apache-spark/
http://www.cmswire.com/big-data/hortonworks-big-data-play-reaches-beyond-hadoop/

On the heels of Spark Summit East, the Altiscale blog has an article with several news clips about Spark.

https://www.altiscale.com/blog/big-data-news-best-of-spark-summit-east-2016/

Releases

Hortonworks has released HDP 2.4 and announced a new release strategy. In the new cadence, core services (such as Hadoop) will be updated yearly and extended services, such as Spark, will be updated more frequently. The post has a lot more information about the new strategy, and there's another post about the first of the extended releases—Apache Spark 1.6.

http://hortonworks.com/blog/announcing-the-general-availability-of-hortonworks-data-platform-2-4/
http://hortonworks.com/blog/announcing-ga-of-apache-spark-1-6-in-hortonworks-data-platformtm-hdp2-4/

Apache Hive 2.0.0 was recently released. The release resolves over 1,000 issues, which have helpfully been distilled into several highlights on the Cloudera blog. These include an alpha version of an HBase metastore, several Hive-on-Spark improvements, performance optimizations (such as Parquet predicate pushdown), and a new HiveServer2 web UI.

http://mail-archives.apache.org/mod_mbox/hive-user/201602.mbox/%3CD2E8B80B.57345%25sershe@apache.org%3E
http://blog.cloudera.com/blog/2016/03/apache-hive-2-0-is-released/

Apache Kudu 0.7.0-incubating was released this week. The release notes summarize key changes and improvements, which include a new Python client, improved Spark integration, new server-level metrics, and bug fixes (a file descriptor leak and a hang in the Java client).

http://getkudu.io/releases/0.7.0/docs/release_notes.html

Version 0.9.6-incubating of Apache MRQL, the query processing framework, was released this week. MRQL supports MapReduce, Apache Hama, Apache Spark, and Apache Flink as backends, and the supported versions of Flink and Hama have been updated as part of the release. The release notes have more details about the contents of the release.

https://mrql.incubator.apache.org/ReleaseNotes-0.9.6.html

Cloudera Enterprise 5.6.0 was released this week. The new release adds support for EMC's DSSD D5.

http://community.cloudera.com/t5/Community-News-Release/ANNOUNCE-Cloudera-Enterprise-5-6-0-Released/m-p/38148#U38148

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Building Apps with Distributed In-Memory Computing Using Apache Geode (Palo Alto) - Monday, March 7
http://www.meetup.com/Pivotal-Open-Source-Hub/events/228983898/

Building Real-Time Data Pipelines with Spark, Kafka, and Python (San Francisco) - Wednesday, March 9
http://www.meetup.com/MemSQL/events/229055417/

Cloud Control: Efficient Hadoop ETL Processing with 85% Spot Utilization (San Francisco) - Thursday, March 10
http://www.meetup.com/AWS-SANFRANCISCO/events/228930769/

Washington

A Primer Into Jupyter, Spark on HDInsight, and Office 365 Analytics with Spark (Bellevue) - Wednesday, March 9
http://www.meetup.com/Seattle-Spark-Meetup/events/221571237/

Texas

What's New in Hadoop? Hive on Tez and Spark, Compression, Encryption, and More (Houston) - Tuesday, March 8
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/228768861/

Michigan

Neo4j for Process Mining and Hadoop on AWS! (Wyoming) - Wednesday, March 9
http://www.meetup.com/Big-Data-and-Hadoop-Users-Group-of-West-Michigan/events/227414774/

Georgia

Virtualizing Big Data: Effective Approaches Derived from Real Deployment (Atlanta) - Wednesday, March 9
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/229140813/

New Jersey

Scala + Spark SQL Workshop (Hamilton Township) - Thursday, March 10
http://www.meetup.com/nj-hadoop/events/229343096/

New York

Scaling Your R Analytics Using Hadoop & Spark w/ IBM & Galvanize! (New York) - Tuesday, March 8
http://www.meetup.com/Open-Source-Analytics-New-York/events/229076065/

Iowa

Eat, Drink, and Talk about HDInsight (Cedar Rapids) - Tuesday, March 8
http://www.meetup.com/380PASS/events/229246887/

IRELAND

Big Data, AWS & The Data Pipeline + Distributed MPP & Analytics with HPCC (Dublin) - Monday, March 7
http://www.meetup.com/hadoop-user-group-ireland/events/228775135/

UNITED KINGDOM

Tech Nottingham: Joe Nash on Kafka (Nottingham) - Monday, March 7
http://www.meetup.com/Tech-Nottingham/events/229014573/

SMACK & Data Modelling (London) - Tuesday, March 8
http://www.meetup.com/Cassandra-London/events/229147810/

Big Data Bootcamp (London) - Saturday, March 12
http://www.meetup.com/hadoop-users-group-uk/events/229039854/

NORWAY

Big Data, No Fluff: Let’s Get Started with Hadoop #6 (Oslo) - Thursday, March 10
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/225156414/

SPAIN

Speed-Up Distributed Deep Learning with Spark on AWS (Barcelona) - Thursday, March 10
http://www.meetup.com/Spark-Barcelona/events/229220368/

FRANCE

SMACK & Achilles (Paris) - Monday, March 7
http://www.meetup.com/Cassandra-Paris-Meetup/events/229147692/

NightClazz Spark + Machine Learning (Paris) - Thursday, March 10
http://www.meetup.com/NightClazz/events/228870357/

NETHERLANDS

Stream Processing with Apache Flink and Mining Github (Amsterdam) - Thursday, March 10
http://www.meetup.com/GOTO-Nights-Amsterdam/events/228863893/

POLAND

Apache Spark Workshops (Torun) - Saturday, March 12
http://www.meetup.com/Torun-JUG/events/229194936/

LITHUANIA

Hadoop Meetup #5 (Vilnius) - Monday, March 7
http://www.meetup.com/Vilnius-Hadoop-Meetup/events/229225299/

AUSTRIA

Spark Streaming and GraphX (Vienna) - Tuesday, March 8
http://www.meetup.com/Vienna-Spark-Meetup/events/228084201/

CROATIA

Unsupervised Learning with Apache Spark (Zagreb) - Wednesday, March 9
http://www.meetup.com/Apache-Spark-Zagreb-Meetup/events/228721161/

ISRAEL

Know Your Distributed Tools, Apache Tez and Spark (Tel Aviv-Yafo) - Wednesday, March 9
http://www.meetup.com/Big-things-are-happening-here/events/227991013/

INDIA

Big Data Processing with Apache Spark (Hyderabad) - Saturday, March 12
http://www.meetup.com/Big-Data-Hyderabad/events/227853405/

Comparison of Various Streaming Technologies (Bangalore) - Saturday, March 12
http://www.meetup.com/Bangalore-Spark-Enthusiasts/events/229252514/

NEW ZEALAND

ML Pipeline Demo, Spark Code Generation with Talend + More (Auckland) - Tuesday, March 8
http://www.meetup.com/Auckland-Apache-Spark-User-Group/events/228743132/