Tools

Amazon Kinesis

AT A GLANCE Amazon’s Kinesis is a cloud-based service for real-time data processing over large, distributed data streams. It claims to be able to continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, …

Apache Chukwa

AT A GLANCE Apache Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a toolkit for displaying, monitoring and analyzing results to …

Apache Giraph

AT A GLANCE Apache’s Giraph project is said to be “a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes.” Giraph is in use at companies like Facebook and PayPal to help represent and analyze the billions (or even trillions) of connections …

Oryx 2

AT A GLANCE A realization of Nathan Marz’s lambda architecture built on Apache’s Spark and Kafka projects, Oryx 2 is a framework for building applications that also includes “packaged, end-to-end applications for collaborative filtering, classification, regression and clustering”. It consists of three tiers, each of which builds on the one …

Apache Aurora

AT A GLANCE Apache’s Mesos is a cluster manager that provides resource isolation and sharing across distributed applications. You might think of it as a “kernel” for your data center. Aurora, a project that was new as of late 2014, is a service scheduler that runs on top of Mesos, enabling you to …

Memcached

AT A GLANCE It’s entirely likely you will eventually encounter a situation where you need very fast access to a large amount of data for a short period of time. For example, let’s say you want to send an email to your customers and prospects letting them know about new …

Fluentd

AT A GLANCE Conceived by Sadayuki “Sada” Furuhashi in 2011, Fluentd is an open source data collector that claims to unify data collection and consumption for better use and understanding of data. Fluentd tries to structure data as JSON as much as possible: this allows it to unify …

Gobblin

AT A GLANCE Gobblin claims to be a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources (e.g., databases, REST APIs, FTP/SFTP servers, filers) into Hadoop. Gobblin handles the common routine tasks required for all data ingestion ETLs, including …

Apache Blur

AT A GLANCE Let’s say you’ve bought into the entire big data story using Hadoop. You’ve got Flume gathering data and pushing it into HDFS, your MapReduce jobs are transforming that data and building key/value pairs that are pushed into HBase, and you even have a couple enterprising data …

H2O

0xdata’s H2O is a statistical, machine learning and math runtime for Big Data analysis. Developed by a predictive analytics company, H2O has established itself as a leader in the ML scene alongside R, Mahout and Spark’s MLlib. According to 0xdata, H2O is the world’s fastest in-memory platform for machine learning …
