Data Eng Weekly


Hadoop Weekly Issue #117

19 April 2015

Hadoop Summit Europe was this week in Brussels, and there were a number of announcements and presentations related to that event (and I'm sure we'll see even more in the next few weeks). Among the announcements, Hortonworks has acquired SequenceIQ (makers of tools for Hadoop in the cloud), and HDP now fully supports Apache Spark (in case it wasn't yet clear that Spark is booming). Speaking of Spark, there are technical articles on Spark's GraphX, MLlib, and Catalyst (the Spark SQL optimizer), as well as a few great posts on distributed systems that touch on Hadoop.

Technical

AppNexus has written about their experiences with Parquet. Compared to Snappy-compressed sequence files, Snappy-compressed Parquet files use substantially less storage and improve performance (fewer and faster map tasks) across a number of Hive queries.

http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/

This post looks at how to use Spark's GraphX to process RDF data. Specifically, it runs GraphX's connected components implementation on the graph defined by the related values field of the Library of Congress Subject Headings dataset.

http://www.snee.com/bobdc.blog/2015/04/running-spark-graphx-algorithm.html
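
For readers new to the idea, here is a minimal sketch of what a connected components computation produces. It is plain Python with a union-find structure, not GraphX's API or the post's code, and the integer vertex IDs are made up for illustration. It labels each vertex with the smallest vertex ID in its component, which is the same labeling convention GraphX uses.

```python
def connected_components(edges):
    """Map each vertex to the smallest vertex ID in its component."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the root so labels are deterministic.
            low, high = (ra, rb) if ra < rb else (rb, ra)
            parent[high] = low

    return {v: find(v) for v in list(parent)}


# Hypothetical "related" links between five subject IDs.
labels = connected_components([(1, 2), (2, 3), (4, 5)])
print(labels)
```

Vertices 1, 2, and 3 end up sharing the label 1, and vertices 4 and 5 share the label 4.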

This post describes how to integrate Luigi, the workflow engine, with Google Cloud's BigQuery. The author shares some code for running BigQuery tasks, as well as the improvements in throughput and cluster utilization seen after introducing Luigi.

http://alex.vanboxel.be/2015/04/13/luigi-and-google-cloud-in-production-retrospective/

The Databricks blog has an interesting look at how the Spark SQL "catalyst" optimizer works. It discusses Trees, which are the main data type manipulated by the optimizer, Rules, which optimize a query by transforming from one tree to another, the logical optimization phase, physical planning, and code generation (which makes use of Scala's quasiquotes). The post is based on a paper that's to appear at SIGMOD 2015.

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
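
To make the tree-and-rule idea concrete, here is a small sketch in Python (Catalyst itself is written in Scala). The node names Literal, Attribute, and Add follow the kind of expression tree the post describes, but the classes, the transform helper, and the bottom-up traversal are simplifications assumed for this example and are not Catalyst's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform(node, rule):
    """Rewrite the children first, then offer the rebuilt node to the rule."""
    if isinstance(node, Add):
        node = Add(transform(node.left, rule), transform(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node


# The expression x + (1 + 2) as a tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform(tree, fold_constants)
print(optimized)
```

The rule leaves nodes it does not match untouched, so the result is the tree for x + 3.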

The Hortonworks blog has a look at the new security-related features in Apache Ambari 2.0. These include setting up Kerberos and deploying/configuring Apache Ranger.

http://hortonworks.com/blog/ambari-2-0-for-deploying-comprehensive-hadoop-security/

This presentation from Hadoop Summit covers some of the recent and ongoing work related to SQL with Hive and HBase. It looks at Hive's "Live Long And Process" daemons for sub-second queries, work on a new HBase-backed Hive metastore (to replace an RDBMS), and some thoughts on how Apache Phoenix (which adds SQL atop HBase) could leverage some of Hive's components.

http://www.slideshare.net/alanfgates/hive-hbase-phoenixtp-hadoopsummiteuapr2015

The morning paper has coverage of two Hadoop-related papers this week. The first, "Cross-layer scheduling in cloud systems," looks at how a network-layer aware scheduler can improve throughput. For MapReduce, the authors show improvement during the shuffle phase. The second post looks at "ApproxHadoop: Bringing Approximations to MapReduce Frameworks," which uses sampling, task dropping, and user-defined approximation to reduce execution time when approximation is acceptable.

http://blog.acolyer.org/2015/04/15/cross-layer-scheduling-in-cloud-systems/
http://blog.acolyer.org/2015/04/16/approxhadoop-bringing-approximations-to-mapreduce-frameworks/
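
As a rough illustration of the task-dropping idea only, the sketch below estimates a sum by processing a random subset of input blocks and scaling the partial result up. This is not ApproxHadoop's code; the function name, the block layout, and the fixed seed are assumptions made for this example, and it leaves out any estimate of the error.

```python
import random


def approximate_sum(blocks, block_fraction, seed=0):
    """Estimate the total over all blocks from a random subset of them."""
    rng = random.Random(seed)
    # Process at least one block, and roughly block_fraction of them overall.
    k = max(1, round(len(blocks) * block_fraction))
    chosen = rng.sample(blocks, k)
    partial = sum(sum(block) for block in chosen)
    # Scale the partial result up to the full number of blocks.
    return partial * len(blocks) / k


# Four hypothetical input blocks; only half of them are read.
blocks = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(approximate_sum(blocks, 0.5))
```

With identical blocks the estimate matches the exact total; with skewed data it can be well off, which is why this kind of shortcut only makes sense when approximation is acceptable.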

The CAP theorem is the basis of a lot of distributed system research and applications. Unfortunately, it's often misunderstood—particularly when it comes to availability. This post describes the various parts of the CAP theorem and gives real-world examples of several CAP trade-offs (HDFS is used as the example of a CP system).

http://blog.thislongrun.com/2015/04/cap-availability-high-availability-and_16.html

This post looks at using Spark ML to analyze network data. It describes frequent pattern mining, and how to use Spark 1.3's implementation of Parallel FP-growth to compute it. The authors describe how Spark's implementation scales in comparison to Mahout. The post also describes MLlib's Power Iteration Clustering.

https://databricks.com/blog/2015/04/17/new-mllib-algorithms-in-spark-1-3-fp-growth-and-power-iteration-clustering.html
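
For context, frequent pattern mining looks for sets of items that appear together in at least some minimum fraction of transactions. The sketch below does this by brute force in plain Python. It is not FP-growth and not MLlib's API: it enumerates every subset of every transaction, which is only practical for tiny inputs and is the cost that FP-growth is designed to avoid. The transactions and the 0.5 support threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets meeting the support threshold."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        # Count every non-empty subset of the transaction once.
        for size in range(1, len(unique) + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    threshold = min_support * len(transactions)
    return {combo: n for combo, n in counts.items() if n >= threshold}


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
frequent = frequent_itemsets(transactions, 0.5)
print(frequent)
```

Here the pair ("b", "c") is dropped because it appears in only one of the four transactions, while ("a", "b") and ("a", "c") each appear in two and are kept.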

News

When I got started with Hadoop for a university project, "Hadoop: The Definitive Guide" was my best resource. Hadoop has changed a lot since then, and the book is now on its fourth edition. The Cloudera blog has a post describing what's new in the latest version.

http://blog.cloudera.com/blog/2015/04/hadoop-the-definitive-guide-is-now-a-4th-edition/

Hortonworks announced that they've acquired SequenceIQ, the makers of Cloudbreak (Docker-based, cloud-agnostic provisioning of Hadoop clusters) and Periscope (autoscaling of Hadoop clusters in the cloud). The Hortonworks blog has more details on Cloudbreak and Periscope and the plans for these projects (they are to be available as a Tech Preview and incubated under the Apache Software Foundation).

http://hortonworks.com/blog/hortonworks-acquires-sequenceiq-to-provide-automated-deployment-of-hadoop-everywhere/

In January, Hortonworks established the Data Governance Initiative (DGI) with several industry partners (Aetna, Merck, Target, and SAS). This week, the members of the DGI submitted a new project called Atlas to the Apache Incubator. Apache Atlas aims to address governance requirements like data classification, centralized auditing, search and lineage, and a security/policy engine.

http://hortonworks.com/blog/apache-atlas-project-proposed-for-hadoop-governance/

CMSWire has coverage of an announcement by three members of the Open Data Platform (ODP): Hortonworks, IBM, and Pivotal. The trio announced that they've standardized on Apache Hadoop 2.6 and Apache Ambari for the ODP. The article covers the announcement and discusses Cloudera's and MapR's choice not to join the ODP.

http://www.cmswire.com/cms/big-data/hey-cloudera-mapr-open-data-platform-is-the-real-deal-028787.php

"My data is bigger than your data!" is a blog that tracks the size of Hadoop and other big data systems that folks mention at public venues. It was recently updated with new numbers from Hadoop Summit, including the size of Yahoo!'s Hadoop deployment (600PB and 43K nodes).

http://lintool.github.io/my-data-is-bigger-than-your-data/

Enterprise Apps Today has an interview with Tom White, author of "Hadoop: The Definitive Guide." The interview discusses a number of new and growing parts of the Hadoop ecosystem, including Spark, Crunch, Flume, Parquet, YARN, and Kafka.

http://www.enterpriseappstoday.com/data-management/how-is-hadoop-evolving-to-meet-big-data-needs.html

Apache Zeppelin (incubating) is a relatively new project (added to the incubator last December) for building web-based notebooks (a la IPython) for Spark, SQL, and more. There isn't yet a release, but there are instructions for building from source.

https://zeppelin.incubator.apache.org/

Releases

JethroData, which is a SQL-on-Hadoop solution, recently announced that version 1.0 is generally available. JethroData combines full indexing with a columnar store to power BI tools like Qlik, Tableau, and MicroStrategy. It is compatible with Cloudera's CDH, Hortonworks' HDP, MapR, and Amazon EMR.

http://www.jethrodata.com/blog/jethrodata-is-now-in-ga
http://www.jethrodata.com/blog/whats-new-in-jethrodata-1.0

Seagate has open-sourced their implementation of the Hadoop FileSystem API that's backed by Lustre. It's been tested with both CDH 5.3.1 and HDP 2.2.

https://github.com/Seagate/lustrefs

Hortonworks has announced that Apache Spark 1.2.1 is now generally available as part of HDP 2.2.4. Hortonworks has integrated Spark with the ORC file format, Apache Ambari, and more.

http://hortonworks.com/blog/announcing-apache-spark-now-ga-on-hortonworks-data-platform/

Cloudera announced a number of releases this week. Cloudera Enterprise 5.0.6, 5.1.5, and 5.2.5 all fix key bugs in HDFS, YARN, and Hive (there are slightly different fixes in each release, though). Cloudera Director 1.1.2 includes bug fixes and support for new instance types.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-0-6-CDH-5-0-6-and-Cloudera/m-p/26490#U26490
http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-1-5-and-5-2-5/m-p/26598#U26598
http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Director-1-1-2/m-p/26589#U26589

Apache Spark 1.2.2 and 1.3.1 were released. Version 1.2.2 has a number of fixes to Spark Core and PySpark. Version 1.3.1 includes fixes for Spark SQL, Spark Streaming, PySpark, and Spark Core.

http://spark.apache.org/news/spark-1-2-2-released.html

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

If You Don't Know the "Apache Way," You Really Don't Know Open Source (Palo Alto) - Tuesday, April 21
http://www.meetup.com/Pivotal-Open-Source-Hub/events/221347359/

April 2015 Hive Contributors Meetup (Santa Clara) - Wednesday, April 22
http://www.meetup.com/Hive-Contributors-Group/events/221610423/

Why Is My Spark Job Failing? by Sandy Ryza of Cloudera (Santa Monica) - Thursday, April 23
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/221107260/

Colorado

Apache Spark Tutorial with Paco Nathan (Boulder) - Thursday, April 23
http://www.meetup.com/Boulder-Denver-Big-Data/events/221572239/

Virginia

How McGyver Learned to Leave Duct Tape Behind and Use Spark Instead (McLean) - Wednesday, April 22
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/221595228/

Elastic Analytics with Spark, Mesos and Docker (Vienna) - Thursday, April 23
http://www.meetup.com/bigdatadc/events/221812159/

District of Columbia

Scaling with Couchbase, Kafka, and Apache Spark (Washington) - Tuesday, April 21
http://www.meetup.com/Couchbase-Washington-MD-VA-DC/events/220782854/

New Jersey

Real-Time Data Warehouse: Hadoop on SQL and Hive (Hamilton Township) - Tuesday, April 21
http://www.meetup.com/nj-hadoop/events/221820339/

New York

Couchbase, Kafka, Spark, Hadoop: Polyglot Persistence and the Big Data Pipeline (New York) - Monday, April 20
http://www.meetup.com/nycjava/events/219021763/

Vermont

Hadoop and R, with Brent Sitterly of KSV (Burlington) - Wednesday, April 22
http://www.meetup.com/Burlington-Data-Scientists/events/220960759/

IRELAND

The Data Lake Use Case: Insights in Days or Weeks Rather Than Months (Dublin) - Wednesday, April 22
http://www.meetup.com/hadoop-user-group-ireland/events/221077659/

UNITED KINGDOM

Time-Series Analysis with Spark and Cassandra (London) - Monday, April 20
http://www.meetup.com/data-science-lab/events/221489718/

NORWAY

Hadoop ELT Lab with Hive, Sqoop, Pig, HBase, and Hue (Oslo) - Monday, April 20
http://www.meetup.com/oslohug/events/221486062/

DENMARK

Recommendation Engines in the Cloud with HDInsight/Hadoop in Azure (Aarhus) - Monday, April 20
http://www.meetup.com/Big-Data-Denmark/events/221412974/

GERMANY

NoSQL in a Hadoop World (Munich) - Wednesday, April 22
http://www.meetup.com/Hadoop-User-Group-Munich/events/220704952/

INDIA

An Introduction to Real-Time Spark (Bangalore) - Saturday, April 25
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/221645049/

AUSTRALIA

Apache Spark: Introducing DataFrames for Large Scale Data Science (Sydney) - Thursday, April 23
http://www.meetup.com/Sydney-Apache-spark-Meetup/events/221724508/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit http://hadoopweekly.com