Apache Zeppelin

AT A GLANCE

Said to be a collaborative data analytics and visualization tool for engines such as Apache Spark and Apache Flink, Apache’s incubating Zeppelin project is a web-based notebook that lets data scientists collaborate on large-scale data exploration. Zeppelin is independent of the execution framework: its interpreter system allows any language or data-processing backend (such as Scala, Spark, Markdown, and shell) to be plugged in.

Zeppelin has a pluggable architecture, so it supports not only Scala but also Python, with built-in Spark integration. It supports not only data exchange between the Scala and Python environments but also SparkContext sharing for better Spark cluster resource utilization. Its notebooks can host rich, interactive analytics GUIs, and the layout system is customizable as well.

Some basic charts are already included in Apache Zeppelin. Visualizations are not limited to SparkSQL queries; output from any language backend can be recognized and visualized.
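
For example, Zeppelin’s display system renders interpreter output that begins with a directive such as %table as an interactive chart. A minimal sketch of what any backend could print, with tab-separated columns (the country names and visit counts here are invented for illustration):

    %table
    country	visits
    US	350
    DE	120
    FR	85

Whether a Scala, Python, or shell paragraph produced this output, Zeppelin picks up the %table prefix and offers its built-in chart types for the data.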

Apache Geode

AT A GLANCE

The initial version of this data store was a distributed caching product that allowed both C++ and Java applications to share objects in a scale-out environment at high speed. GemFire was launched as a result of lessons learned from its predecessor, an object-oriented database that faced performance challenges in highly scaled environments. The problems stemmed primarily from the centralized design of an RDBMS and from a design that used main memory to optimize disk I/O rather than application access patterns. In April 2015, the GemFire core was released as open source under the Apache incubator project Geode.


HISTORY

It was through the team’s experiences on Wall Street and with the DoD (signals intelligence) that GemFire expanded to integrate real-time pub-sub and replication over the wide area network. All of this emerged well before the world had even heard the term NoSQL.

GemFire, Tangosol Coherence (now Oracle), and GigaSpaces were the primary players in the creation of the in-memory data grid (IMDG), a new class of product that went beyond the traditional relational database. In this article, Jags Ramnarayan, the Chief Architect of GemFire, answers questions about in-memory data.

Over time, IMDGs were adopted in virtually every market: scalable web sites, travel applications, the Internet of Things, and more. Anywhere differentiation can be achieved through speed and immediate access to data, Geode is a great fit.

GemFire was then purchased by VMware in 2010 and incorporated into the vFabric platform, where it became a core component of vFabric Suite. Development continued at VMware until April 2013, when GemFire became part of Pivotal. At Pivotal, GemFire has served as the fast, in-memory component of the Big Data Suite.

In February 2015, Pivotal announced plans to open source parts of the Big Data Suite, starting with GemFire.

Amazon Kinesis

AT A GLANCE

Amazon’s Kinesis is a cloud-based service for real-time data processing over large, distributed data streams. It claims to be able to continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.

With the Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis applications that use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more. You can also emit data from Amazon Kinesis to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and AWS Lambda.
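
A minimal sketch of a Kinesis application’s record processor, assuming the original (v1) KCL interfaces; the class name and the dashboard logic are illustrative:

    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
    import com.amazonaws.services.kinesis.model.Record;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class ClickstreamProcessor implements IRecordProcessor {

        public void initialize(String shardId) {
            // called once when the KCL worker assigns this shard
        }

        public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
            for (Record r : records) {
                String event = new String(r.getData().array(), StandardCharsets.UTF_8);
                // e.g., update a real-time dashboard counter or trigger an alert
                System.out.println("clickstream event: " + event);
            }
            try {
                checkpointer.checkpoint();  // mark progress so restarts resume here
            } catch (Exception e) {
                // a real processor would log and retry
            }
        }

        public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
            // checkpoint on a clean hand-off so another worker can take over
        }
    }

The KCL runs one processor instance per shard and handles load balancing and failover across workers.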

Kinesis is a commercial product: there is no upfront cost, but you pay for the resources you use. For as little as $0.015 per shard per hour, each shard provides a 1 MB/second ingest rate and a 2 MB/second egress rate; a ten-shard stream, for example, costs about $3.60 per day in shard hours, before per-record charges.

Apache Chukwa

AT A GLANCE

Apache Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

PROS

Roughly the Chukwa story is:
  • Get data to the centralized store, and do periodic near-real-time analysis.

CONS

At the same level of granularity, the Flume story is:

  • Reliably get data to the centralized store, enable continuous near real-time analysis, and enable periodic batch analysis.
1) Architecture and near-real-time
  • Chukwa’s near real-time == minutes
  • Flume’s near real-time == seconds (hopefully milliseconds)
Both systems have an agent-collector topology for nodes. Architecturally, Chukwa is a batch/mini-batch system. In contrast, Flume is designed more as a continuous stream-processing system.
2) Reliability
Flume’s reliability levels are tunable: just pick the appropriate sink to specify the mode in which you want to collect data. It offers three levels: best effort, store+retry on failure, and an end-to-end mode that uses acks and a write-ahead log.
AFAICT Chukwa is best effort from agent to collector, writes to local disk at the pre-demux collector, and then finally becomes reliable when written to HDFS. This seems stronger than Scribe’s reliability mechanisms (equivalent to Flume’s store-on-failure), but weaker than Flume’s end-to-end reliability mode (write-ahead log and acks).
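
A sketch of how those levels are chosen in a Flume (0.9.x-era) flow specification; the sink names follow my reading of the old Flume OG catalog, and the node names, log path, and port are illustrative:

    # best effort: events the collector never receives are simply dropped
    agentA : tail("/var/log/app.log") | agentBestEffortSink("collector", 35853) ;

    # store on failure: spool to local disk and retry when the collector returns
    agentB : tail("/var/log/app.log") | agentDFOSink("collector", 35853) ;

    # end-to-end: write-ahead log plus acknowledgments from the terminal sink
    agentC : tail("/var/log/app.log") | agentE2ESink("collector", 35853) ;
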
3) Manageability
Flume just requires the deployment of a master (or set of masters) and nodes. It then provides a centralized management point that allows you to configure nodes dynamically and to reconfigure the data-flow topology dynamically.
Chukwa’s deployment story is more restrictive and more complicated than Flume’s. It supports only an agent/collector topology. Despite this restriction, it requires the deployment of more distinct programs than Flume: agents, collectors, and a console daemon. For Chukwa to work as intended, it also depends on a Hadoop cluster (both HDFS and MapReduce) and a MySQL database.
4) Support
Cloudera packages Flume as part of its distribution of Hadoop and will provide commercial support for users of the system.

Riak

AT A GLANCE

Two products:

Riak KV is a distributed NoSQL database that is highly available, scalable, and easy to operate. It automatically distributes data across the cluster to ensure fast performance and fault tolerance. Riak KV Enterprise adds multi-cluster replication for low latency and robust business continuity.

Riak TS is a distributed NoSQL database optimized for time series data. By co-locating time series data, Riak TS provides faster reads and writes making it easier to store, query, and analyze time and location data. Like Riak KV, Riak TS is highly available, scalable, and easy to operate at scale.

Best used: if you want Dynamo-like data storage but there’s no way you’re going to deal with the bloat and complexity; if you need very good single-site scalability, availability, and fault tolerance, but you’re ready to pay for multi-site replication.
For example: point-of-sale data collection, factory control systems, places where even seconds of downtime hurt. It could also be used as an updateable web server.
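
A minimal sketch of storing and fetching a value through Riak’s HTTP interface, assuming a local node on the default port 8098; the bucket and key names are invented:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    public class RiakHttpDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://127.0.0.1:8098/buckets/pos/keys/receipt-42");

            // store a JSON value; PUT creates or replaces the object
            HttpURLConnection put = (HttpURLConnection) url.openConnection();
            put.setRequestMethod("PUT");
            put.setDoOutput(true);
            put.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = put.getOutputStream()) {
                out.write("{\"total\": 19.99}".getBytes("UTF-8"));
            }
            System.out.println("PUT status: " + put.getResponseCode());

            // fetch it back
            HttpURLConnection get = (HttpURLConnection) url.openConnection();
            try (Scanner in = new Scanner(get.getInputStream(), "UTF-8")) {
                System.out.println(in.useDelimiter("\\A").next());
            }
        }
    }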

PROS

  • Tunable trade-offs for distribution and replication
  • Pre- and post-commit hooks in JavaScript or Erlang, for validation and security.
  • Map/reduce in JavaScript or Erlang
  • Links & link walking: use it as a graph database
  • Blobs
  • Large object support (Luwak)
  • Comes in “open source” and “enterprise” editions
  • Full-text search, indexing, querying with Riak Search
  • In the process of migrating the storage backend from Bitcask to Google’s LevelDB
  • Masterless multi-site replication and SNMP monitoring are commercially licensed

CONS

  • Secondary indexes, but you can query only one index at a time

Prediction.IO

AT A GLANCE

Said to “eliminate the friction between software development, data science and production deployment,” PredictionIO is an open-source Machine Learning server for developers and data scientists to build and deploy predictive applications.

The core part of the tool is an engine deployment platform built on top of Apache Spark.

The DASE architecture of an engine is the “MVC for machine learning.” It enables developers to build predictive-engine components with separation of concerns, and data scientists can swap and evaluate algorithms as they wish. Predictive engines are deployed as distributed web services. In addition, there is an Event Server: a scalable data collection and analytics layer built on top of Apache HBase. It takes care of routine data infrastructure so that your data science team can focus on what matters most.
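
A minimal sketch of sending an event to the Event Server’s REST endpoint, which listens on port 7070 by default; the access key, entities, and rating are invented:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SendEvent {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");

            String event = "{\"event\": \"rate\", \"entityType\": \"user\", \"entityId\": \"u1\","
                         + " \"targetEntityType\": \"item\", \"targetEntityId\": \"i3\","
                         + " \"properties\": {\"rating\": 5}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(event.getBytes("UTF-8"));
            }
            System.out.println("Status: " + conn.getResponseCode());  // 201 = accepted
        }
    }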

The template gallery offers a wide range of predictive engine templates for download and developers can customize them easily.
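
A sketch of the usual workflow for a downloaded template, using PredictionIO’s pio command-line tool (the template repository and directory name here are illustrative):

    pio template get PredictionIO/template-scala-parallel-recommendation MyRecommender
    cd MyRecommender
    pio build     # compile the engine
    pio train     # train a model from Event Server data
    pio deploy    # serve the engine as a web service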

Data scientists use the tool to evaluate models and keep track of parameter-adjustment history. With the DASE architecture, they can use or develop reusable components such as data-sampling methods and evaluation metrics. PredictionIO also helps unify different types of data from multiple platforms and provides an interface for data analysis, so data scientists can look at the data using their favorite analytical tools, such as Tableau, IPython Notebook, and Zeppelin.

NOTICE

Salesforce has made another acquisition to build out its technology in machine learning and big data analytics: the company has acquired PredictionIO, a startup based out of Palo Alto that had developed an open source-based machine learning server.

Originally called TappingStone when it was founded in 2012 in London, the startup at first had built a product that it described as “Machine Learning as a Service” before pivoting to an open source model.

Under that new direction, the company gained traction, with some 8,000 developers and 400 apps powered by its technology. As a point of reference, that’s double the number of developers in the community as of 2014.

Apache Giraph

AT A GLANCE

Apache’s Giraph project is said to be “a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes.” Giraph is in use at companies like Facebook and PayPal to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera’s Impala.

As in the core MapReduce framework, all data- and workload-distribution details in Giraph are hidden behind an API. As of February 2015, the Giraph API ran on top of MR1 via a somewhat unstable mechanism, although a YARN-based implementation was planned.

In Giraph, a worker node or a slave node is a host (either a physical server or even a virtualized server) that performs the computation and stores data in HDFS. Such workers load the graph and keep the full graph or just a part of it (in case of distributed graph analysis) in memory. Very large graphs are partitioned and distributed across many worker nodes.

(Note: the term partition has a different meaning in Giraph. Here, the partitioning of the graph is not necessarily the result of the application of a graph-partitioning algorithm. Rather, it is more a way to group data based on a vertex hash value and the number of workers. This approach is comparable to the partitioning that is done during the shuffle-and-sort phase in MapReduce.)

A Giraph algorithm is an iterative execution of “super-steps” that consist of a message-exchange phase followed by an aggregation and node- or edge-property update phase. While vertices and edges are held in memory, the nodes exchange messages in parallel: all worker nodes communicate, sending each other small messages, usually of very low data volume.

After this message-exchange phase, a kind of aggregation is done. This leads to an update of the vertex and/or edge properties, and the super-step is finished. Now the next super-step can be executed and nodes communicate again. This approach is known as the Bulk Synchronous Parallel (BSP) model. The BSP model is vertex-based and generally works with a configurable graph model G<I,V,E,M>, where I is a vertex ID, V a vertex value, E an edge value, and M a message data type, all of which implement the Writable interface (well known from the Hadoop API).
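
A minimal sketch of one such computation, assuming the Giraph 1.x BasicComputation API; it propagates the minimum vertex value across the graph, a common “hello world” for the BSP model, and the class name is invented:

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // G<I, V, E, M>: vertex ID, vertex value, edge value, message type
    public class MinValueComputation
        extends BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

        @Override
        public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                            Iterable<DoubleWritable> messages) {
            double min = vertex.getValue().get();
            for (DoubleWritable m : messages) {   // aggregation/update phase
                min = Math.min(min, m.get());
            }
            if (getSuperstep() == 0 || min < vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(min));
                // message-exchange phase: notify neighbors of the new minimum
                sendMessageToAllEdges(vertex, new DoubleWritable(min));
            }
            vertex.voteToHalt();  // sleep until a message reactivates this vertex
        }
    }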

However, the iterative character of many graph algorithms is a poor fit for the MapReduce paradigm, especially if the graph is very large, like the full set of interlinked Wikipedia pages. In MapReduce, a data set is streamed into the mappers, and aggregation of intermediate results is done in reducers. One MapReduce job can implement one super-step. In the next super-step, the whole graph structure, together with the stored intermediate state of the previous step, has to be loaded from HDFS and stored again at the end. Loading and storing the full graph between processing steps is a really large overhead. And let’s not forget that the intermediate data requires local storage on each worker node while it passes the shuffle-sort phase.

For very large graphs, it is inefficient to repeatedly store and load the more or less fixed structural data. In contrast, the BSP approach loads the graph only once, at the outset. The algorithms assume that runtime-only messages are passed between the worker nodes. To support fault-tolerant operation the intermediate state of the graph can be stored from time to time, but usually not after each super-step.

Couchbase

AT A GLANCE

Couchbase Server is an open source, distributed database engineered for scalability, performance, and availability. It’s a general purpose database that can be deployed as a document database, key-value store, and/or distributed cache.

Its memory-centric, multi-threaded architecture leverages integrated caching, memory-optimized indexes, and memory-to-memory replication to provide consistent high throughput, low latency access to data, and is capable of fully utilizing multi-core processors and terabytes of memory.

N1QL is a full-featured, SQL-based query language that extends SQL to JSON and enables low latency queries on distributed data, regardless of scale. It supports filtering, projection, and aggregation as well as joins and subqueries, and utilizes indexes to improve query performance – including simple, compound, partial, functional, and covering indexes.
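
A minimal sketch of running a N1QL query from Java, assuming the 2.x Couchbase Java SDK and the bundled travel-sample bucket; the filter values are illustrative:

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.query.N1qlQuery;
    import com.couchbase.client.java.query.N1qlQueryResult;
    import com.couchbase.client.java.query.N1qlQueryRow;

    public class N1qlDemo {
        public static void main(String[] args) {
            Cluster cluster = CouchbaseCluster.create("127.0.0.1");
            Bucket bucket = cluster.openBucket("travel-sample");

            // N1QL extends SQL to JSON: filtering, projection, and aggregation
            N1qlQueryResult result = bucket.query(N1qlQuery.simple(
                "SELECT name, country FROM `travel-sample` " +
                "WHERE type = 'airline' AND country = 'France' LIMIT 5"));

            for (N1qlQueryRow row : result) {
                System.out.println(row);  // each row is a JSON fragment
            }
            cluster.disconnect();
        }
    }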

A comprehensive, web-based console simplifies development with integrated query editing and schema browsing. For administrators, it simplifies management and monitoring by performing complex tasks with the push of a button and providing over 200 metrics on everything from cluster health to individual node performance.

What you would find in an RDBMS, and how Couchbase Server provides it:

  • SQL: Express complex queries with N1QL, a declarative query language based on SQL.
  • JOINs: Perform left outer and inner joins with N1QL to support relationships.
  • Schema browsing: Explore the data model with automatically inferred schemas and sample data.
  • Query editing: Create and run queries using a built-in editor with autocomplete and syntax highlighting.
  • Comprehensive indexing: Create compound, partial, functional, and covering indexes to improve query performance.
  • Strong consistency: Specify read-your-own-writes consistency for queries that require stronger consistency.
  • Advanced security: Configure encryption, auditing, role-based access control, and LDAP for secure environments.
  • Disaster recovery: Leverage backup/restore tools and built-in cross-data-center replication for disaster recovery.
  • Monitoring: Actively monitor database state and operations with a complete, integrated admin console.

 

Apache Kudu

AT A GLANCE

A new addition to the open source Apache Hadoop ecosystem, Apache Kudu (incubating) completes Hadoop’s storage layer to enable fast analytics on fast data. Currently, a limited-functionality version of Kudu is available as a Beta.

Like most modern analytic data stores, Kudu internally organizes its data by column rather than row. Columnar storage allows efficient encoding and compression. For example, a string field with only a few unique values can use only a few bits per row of storage. With techniques such as run-length encoding, differential encoding, and vectorized bit-packing, Kudu is as fast at reading the data as it is space-efficient at storing it. Columnar storage also dramatically reduces the amount of data IO required to service analytic queries. Using techniques such as lazy data materialization and predicate pushdown, Kudu can perform drill-down and needle-in-a-haystack queries over billions of rows and terabytes of data in seconds.

Distribution and Fault Tolerance

In order to scale out to large datasets and large clusters, Kudu splits tables into smaller units called tablets. This splitting can be configured on a per-table basis to be based on hashing, range partitioning, or a combination thereof. This allows the operator to easily trade off between parallelism for analytic workloads and high concurrency for more online ones.
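
A sketch of defining that trade-off with the Kudu Java client; the package names assume the org.apache.kudu namespace, and the table, columns, bucket count, and master address are illustrative:

    import java.util.Arrays;
    import org.apache.kudu.ColumnSchema;
    import org.apache.kudu.Schema;
    import org.apache.kudu.Type;
    import org.apache.kudu.client.CreateTableOptions;
    import org.apache.kudu.client.KuduClient;

    public class CreateMetricsTable {
        public static void main(String[] args) throws Exception {
            KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

            Schema schema = new Schema(Arrays.asList(
                new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

            // hash partitioning on host spreads writes for parallelism;
            // range partitioning on ts keeps time-based scans on few tablets
            CreateTableOptions opts = new CreateTableOptions()
                .addHashPartitions(Arrays.asList("host"), 4)
                .setRangePartitionColumns(Arrays.asList("ts"));

            client.createTable("metrics", schema, opts);
            client.shutdown();
        }
    }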

In order to keep your data safe and available at all times, Kudu uses the Raft consensus algorithm to replicate all operations for a given tablet. Raft, like Paxos, ensures that every write is persisted by at least two nodes before responding to the client request, ensuring that no data is ever lost due to a machine failure. When machines do fail, replicas reconfigure themselves within a few seconds to maintain extremely high system availability.

The use of majority consensus provides very low tail latencies even when some nodes may be stressed by concurrent workloads such as MapReduce jobs or heavy Impala queries. But unlike eventual consistency systems, Raft consensus ensures that all replicas will come to agreement around the state of the data, and by using a combination of logical and physical clocks, Kudu can offer strict snapshot consistency to clients that demand it.

Oryx 2

AT A GLANCE

A realization of Nathan Marz’s lambda architecture built on Apache’s Spark and Kafka projects, Oryx 2 is a framework for building machine-learning applications that “includes packaged, end-to-end applications for collaborative filtering, classification, regression and clustering.”

It consists of three tiers, each of which builds on the one below it:

  • A generic lambda architecture tier, providing batch/speed/serving layers, which is not specific to machine learning
  • A specialization on top providing ML abstractions for hyperparameter selection, etc.
  • An end-to-end implementation of the same standard ML algorithms as an application (ALS, random decision forests, k-means) on top

Viewed another way, it also contains the three side-by-side, cooperating layers of the lambda architecture, plus a connecting element:

  • A Batch Layer, which computes a new “result” (think: a model, but it could be anything) as a function of all historical data and the previous result. This may be a long-running operation that takes hours and runs, for example, a few times a day.
  • A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.
  • A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result (see the sketch after this list).
  • A data transport layer, which moves data between layers and receives input from external sources
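
As a concrete example of the Serving Layer’s synchronous API, a minimal sketch of querying the packaged collaborative-filtering (ALS) application over HTTP; the endpoint path, port, and user ID are assumptions based on my reading of the Oryx 2 docs:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RecommendDemo {
        public static void main(String[] args) throws Exception {
            // ask the Serving Layer for top recommendations for one user
            URL url = new URL("http://localhost:8080/recommend/u123");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // item and score pairs
                }
            }
        }
    }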

The project may be reused tier by tier: for example, the packaged app tier can be ignored, and Oryx 2 can serve as a framework for building new ML applications. It can be reused layer by layer, too: for example, the Speed Layer can be omitted if a deployment does not need incremental updates. And it can be modified piece by piece: the collaborative-filtering application’s model-building batch layer could be swapped for a custom implementation based on a new algorithm outside Spark MLlib, while retaining the serving- and speed-layer implementations. (From the Oryx 2 site.)

Michael Hausenblas has posted a good treatment of Oryx in a Dr. Dobb’s article.
