Data Eng Weekly


Hadoop Weekly Issue #101

28 December 2014

For the last issue of Hadoop Weekly in 2014, we have a short but sweet edition. The theme for this week is the future of Hadoop—from the first technical post about Apache NiFi to posts on younger security projects (Apache Ranger and Apache Sentry) to several posts about the Hadoop industry in 2015.

Technical

The Apache Drill blog has a post on the features that the project will be focusing on in 2015. These include improved JSON support, improved access control, integration with Apache Spark, and operational enhancements.

http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

Camus is the open-source project from LinkedIn for loading data from Kafka into HDFS. This post introduces Camus, walks through setting it up, and details how to customize it by writing a custom Decoder and RecordWriter.

http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline

This post gives a technical introduction to NiFi, a new Apache Incubator project. NiFi is a system for integrating data sources that uses a web interface to build data flows. The intro shows how to build a local dropbox folder that uploads data to HDFS whenever a file is added. The post also describes how to integrate NiFi with the Kite SDK.

http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/

Apache Ranger (incubating) is a security project for Hadoop, based on code from XA Secure, which Hortonworks acquired earlier this year. This post looks at the audit features of Ranger, which can store audit data in an RDBMS, in HDFS, or via log4j. The post details the configuration settings for each setup.

http://hortonworks.com/blog/apache-ranger-audit-framework/

Apache Sentry (incubating) is a security project originally backed by Cloudera for fine-grained authorization in Hive, Impala, and search. This post describes a new integration between Sentry and HDFS, and how that integration can be used to import data via Sqoop.

http://ingest.tips/2014/12/25/sqoop-import-in-a-world-governed-by-sentry-2/

News

Spark Packages is a new community index of packages for Spark. The initial set of packages includes integrations with Avro, Kafka, Pig, and more.

http://databricks.com/blog/2014/12/22/announcing-spark-packages.html

With 2015 starting later this week, Silicon Angle ventures a few predictions for Hadoop in the new year. These include the rise of “fast Big Data,” the need for real-time ingestion, support for streaming analytics in Hadoop-as-a-Service platforms, and the ubiquity of YARN.

http://siliconangle.com/blog/2014/12/22/2015-technology-predictions-datatorrent-on-big-data/

ZDNet has an interview with MapR CEO John Schroeder on what’s in store for MapR and the industry in 2015. Predictions include an emphasis on real-time over batch, data agility, fading hype of Hadoop, and vendor consolidation.

http://www.zdnet.com/article/mapr-ceo-talks-hadoop-ipo-possibilities-for-2015/

The Gartner blog has a short post pointing out that while enterprises are ready to adopt Hadoop, expertise with the system is still lagging. With new systems being added to the Hadoop ecosystem so frequently, this problem doesn’t seem likely to go away any time soon.

http://blogs.gartner.com/nick-heudecker/hadoops-achilles-heel-in-2015/

Releases

The Cloudera Labs Kafka integration has been updated to support Kafka 0.8.2-beta. That release includes a number of useful features and improvements, which are summarized in the announcement.

https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/7-QaOzhJqlE/McO0hug8w_wJ

Cloudera Enterprise 5.3 was released this week. It includes a number of fixes and improvements, including stronger support for encryption (folder-level HDFS encryption), a new implementation of the S3-native file system, Kafka-Flume integration, and significant improvements to HBase. The release also includes the latest versions of Apache Spark, Hue, and Impala.

http://blog.cloudera.com/blog/2014/12/cloudera-enterprise-5-3-is-released/

Cloudbreak, the system for auto-scaling Hadoop clusters in the cloud, has released a new version based on Apache Ambari 1.7.0 and Hortonworks HDP 2.2. The post has an overview of the new features and explains some upcoming improvements planned for future versions.

http://blog.sequenceiq.com/blog/2014/12/23/cloudbreak-on-hdp-2-dot-2/

Apache Drill, the SQL engine for Hadoop and NoSQL, released version 0.7 this week. Improvements in this release include the removal of the dependency on UDP multicast (so Drill now works in EC2), automatic partition pruning, Hive 0.13 compatibility, and improved performance on JSON data.

http://drill.apache.org/blog/2014/12/23/drill-0.7-released/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

SFML Office Hours (San Francisco) - Monday, December 29
http://www.meetup.com/sfmachinelearning/events/219423133/

Colorado

Big Data for Business (Centennial) - Thursday, January 1
http://www.meetup.com/Big-Data-for-Business/events/218775347/

New York

Data Workshop #7 (New York) - Sunday, January 4
http://www.meetup.com/NYC-Data-Wranglers/events/219352351/