Data Eng Weekly


Hadoop Weekly Issue #66

20 April 2014

There were a number of announcements this week, including new Hadoop integrations from Microsoft, Google, and Amazon. Red Hat and Hortonworks announced the next step of their partnership, and Cloudera announced a new zero-download trial for CDH5. There are also some excellent technical resources including details on the SlideShare analytics stack and a peek under the hood of Hadoop operations at Spotify.

Technical

perf top is a tool for profiling Linux systems. This post explains how to use it with Java, and how to convert the output to a flame graph. It focusses on how to do all of this with Tez, but it is broadly applicable to any Java application.

https://github.com/t3rmin4t0r/notes/wiki/Perf-sampling-Tez
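As a rough illustration of the flame graph step: flame graph generators typically consume "folded" stacks, where each unique call stack becomes one line of semicolon-separated frames followed by a sample count. The sketch below is not taken from the linked post. It assumes the samples have already been parsed into root-first lists of frame names, and the frame names and the fold_stacks helper are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into folded lines of the form 'root;...;leaf count'.

    `samples` is an iterable of stacks, each a root-first list of frame names.
    """
    counts = Counter(";".join(frames) for frames in samples)
    # Sort so the output is deterministic
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples (illustrative frame names, not real Tez frames)
samples = [
    ["main", "runTask", "sortRecords"],
    ["main", "runTask", "sortRecords"],
    ["main", "runTask", "writeOutput"],
]

for line in fold_stacks(samples):
    print(line)
```

Each folded line's count determines how wide that stack is drawn in the resulting graph.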

Slides from the Hadoop Summit talk by Adam Kawa, Data Engineer at Spotify, about Hadoop operations were posted. Spotify is running YARN on a several-hundred-node cluster, and the talk covers analyzing and understanding the usage of their cluster. Examples include analyzing NameNode GC, analyzing HDFS usage, HDFS capacity planning, auto-tuning of MapReduce jobs, and much more.

http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam

The SlideShare engineering blog has an in-depth post about migrating their analytics stack from MySQL and Ruby to HBase and Pig scripts. The post covers the technologies involved, with detailed design diagrams of their processing pipeline, Hadoop infrastructure, and HA setup. It also details their configuration tweaks and lessons learned.

http://engineering.slideshare.net/2014/04/hadoop-and-near-real-time-analytics-at-slideshare/

The Databricks blog has some details about upcoming support for Java 8 lambda expressions in the Spark API. There are a few examples showing how concise the API becomes with the new syntax, which will be supported in Spark 1.0 (targeted for release in May).

http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html

The Cloudera blog has an article about writing, building, and running a simple Spark application on CDH5. The source accompanying the post is available on GitHub, and it includes implementations in both Java and Scala.

http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/

News

MapR has launched a new Developer Central with code samples and articles on best practices for Hadoop. Articles cover Hive, Pig, MapReduce, HBase, the Lambda Architecture, and more.

http://www.mapr.com/blog/sample-code-best-practices-and-technical-resources-now-available-developer-central

Using the advent of commercial cameras as an example, the Cloudera blog has a post exploring the legal and ethical ramifications of big data. It discusses privacy and transparency as well as the current regulations and regulatory efforts.

http://vision.cloudera.com/devising-our-data-destiny/

Accumulo Summit is taking place June 12 in College Park, MD. The call for papers is open until April 30.

http://accumulosummit.com/program/submit-a-talk/

Allied Market Research has released a report about the Hadoop market. The report predicts that the Hadoop market will be worth $50.2 billion by 2020. The release breaks this down into software and hardware, and there are geographic and further breakdowns in the full report.

http://www.pr.com/press-release/553376

Ovum analyst Tony Baer has an article about the recently announced 2.1 release of HDP. It talks about how the release is expanding outside the traditional Hadoop core with Storm for streaming, search, interactive SQL (via Hive + Tez), and Apache Falcon and Knox.

http://ovum.com/2014/04/08/hortonworks-hdp-2-1-release-starts-extending-hadoop-core-platform/

FICO, a maker of predictive analytics software, has acquired Hadoop startup Karmasphere. GigaOm has more details on the deal.

http://gigaom.com/2014/04/17/hadoop-analytics-startup-karmasphere-sells-itself-to-fico/

Releases

Cassandra 2.0.7 was released. It’s a bug fix release containing over 50 resolved tickets.

http://mail-archives.apache.org/mod_mbox/cassandra-user/201404.mbox/%3CCAKkz8Q1-y4w7V%3D7xe-FDe6szH_tz39FHg8p170%2BZD%3DYTk%3Dsejg%40mail.gmail.com%3E

Apache Pig 0.12.1 was released. The new version includes a number of bug fixes as well as documentation improvements and a new version of HBase.

http://mail-archives.apache.org/mod_mbox/www-announce/201404.mbox/%3CCAOehgTmauZraO4sAVQrGHKkT7bUh1U2eXFj_VR-PsFdaR=0vwA@mail.gmail.com%3E

Apache Phoenix (incubating) released version 4.0 (targeting HBase 0.98.1+) and version 3.0 (targeting HBase 0.94.4+). The new releases include support for equi-joins through broadcast hash join, SQL views, and more.

https://blogs.apache.org/phoenix/entry/apache_phoenix_released_next_major
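For readers unfamiliar with the term, a broadcast hash join generally works by shipping the smaller table to every node that holds part of the larger one, building an in-memory hash table from it, and probing that table while scanning the larger side. The single-process sketch below only illustrates the build and probe phases. It is not Phoenix's implementation, and the table contents, column names, and the broadcast_hash_join helper are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_rows, key):
    """Equi-join two lists of dict rows on `key` using a hash table.

    The small side stands in for the table that would be broadcast.
    """
    # Build phase: index the small table by join key
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    # Probe phase: scan the large table and emit matching pairs
    joined = []
    for row in large_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data: a small dimension table and a larger fact table
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
events = [
    {"user_id": 1, "page": "/a"},
    {"user_id": 3, "page": "/b"},
    {"user_id": 1, "page": "/c"},
]

for row in broadcast_hash_join(users, events, "user_id"):
    print(row)
```

Rows from the large side with no matching key (user_id 3 here) are dropped, which is the inner-join behavior of an equi-join.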

This week’s update to Amazon Redshift adds support for copying data directly from an Amazon Elastic MapReduce cluster to a Redshift cluster.

http://aws.amazon.com/releasenotes/Amazon-Redshift/9335310585817102

Google announced preview releases of the Google BigQuery connector and Google Cloud Datastore connector for Hadoop. These implement the Hadoop InputFormat and OutputFormat APIs.

http://googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

Ferry, the project for running development environments of various distributed systems in Docker containers, released a new version this week. The new release includes improvements for YARN and the ability to forward ports from a container to the host.

http://blog.opencore.io/changelog/2014/04/18/changelog/

Microsoft announced support for Apache Avro via the Microsoft Avro Library. The library is optimized by building an in-memory expression tree which is compiled into IL code.

http://blogs.msdn.com/b/windowsazure/archive/2014/04/14/microsoft-avro-library.aspx

Sprunch is a new Scala API atop Apache Crunch. It provides “Pimp My Library”-style extensions via implicits and class extensions. Compared to Scrunch (which is the Scala API part of Apache Crunch), the API aims to be minimalistic and is less than 90 lines of code.

https://github.com/DavW/sprunch

Cloudera announced a new zero-download demo of CDH 5 called “Cloudera Live.” It provides access to the entire Cloudera stack for up to 3 hours at a time via the Hue web interface.

http://vision.cloudera.com/cloudera-live/

Kafkacat is a stand-alone application for reading from and writing to Kafka from stdin/stdout (a la cat). It’s a small, statically linked C program.

https://github.com/edenhill/kafkacat

Microsoft announced a new product called the Analytics Platform System that allows queries across traditional SQL data warehouses and Hadoop.

http://visualstudiomagazine.com/articles/2014/04/16/sql-and-hadoop-products-unveiled.aspx

Hortonworks and Red Hat announced the next step of their partnership by integrating OpenShift PaaS with the Hortonworks Data Platform. The integration allows OpenShift applications to run in Hadoop via YARN. There is an example project that runs a Python Flask server to serve data stored in HBase.

https://www.openshift.com/blogs/combining-big-data-and-rapid-application-development-openshift-and-hortonworks-data-platform

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Data Science Stack Showcase (San Francisco) - Tuesday, April 22
http://www.meetup.com/Bay-Area-Women-in-Machine-Learning-and-Data-Science/events/174197352/

Spark 1.0 and Beyond (San Francisco) - Wednesday, April 23
http://www.meetup.com/spark-users/events/175940092/

Large-Scale Machine Learning with Apache Spark (San Francisco) - Thursday, April 24
http://www.meetup.com/sfmachinelearning/events/174560212/

Big Data Developer Day (Los Angeles) - Saturday, April 26
http://www.meetup.com/Big-Data-Developers-in-Los-Angeles/events/175314662/

Washington

Learnings from Running Spark at Twitter (Bellevue) - Wednesday, April 23
http://www.meetup.com/Seattle-Spark-Meetup/events/161043042/

Seattle Scalability Meetup - Agile Data & Apache Tez (Seattle) - Wednesday, April 23
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/events/165993952/

Colorado

Introduction to Spark (Boulder) - Wednesday, April 23
http://www.meetup.com/Boulder-Denver-Big-Data/events/175122732/

Texas

Advanced Hadoop Based Machine Learning (Austin) - Wednesday, April 23
http://www.meetup.com/Austin-ACM-SIGKDD/events/171159622/

Missouri

Teradata User Group Conference: Central-St. Louis Region Event Agenda (Saint Louis) - Tuesday, April 22
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/174173022/

Pennsylvania

Impala - Straight from the Antelope's Mouth (Philadelphia) - Tuesday, April 22
http://www.meetup.com/PhillyDB/events/175302932/

Hadoop Users Group Pittsburgh April Meetup (Pittsburgh) - Wednesday, April 23
http://www.meetup.com/HUG-Pittsburgh/events/159886132/

Massachusetts

The Future of Data (Cambridge) - Tuesday, April 22
http://www.meetup.com/bostonhadoop/events/173224862/

ISRAEL

Presentation on Hbase viewer by Abraham Elmahrek - Hue developer (Tel Aviv-Yafo) - Wednesday, April 23
http://www.meetup.com/HadoopIsrael/events/161701092/

NETHERLANDS

Large Scale Image Classification and Apache Spark for applied machine learning (Amsterdam) - Wednesday, April 23
http://www.meetup.com/The-Amsterdam-Applied-Machine-Learning-Meetup-Group/events/173953962/

FRANCE

PerfUG : Hadoop et HDFS : Stockage, Requêtage et Performances (Paris) - Thursday, April 24
http://www.meetup.com/PerfUG/events/177041022/