ETL & Ingestion Tools

Amazon Kinesis

AT A GLANCE Amazon’s Kinesis is a cloud-based service for real-time data processing over large, distributed data streams. It claims to be able to continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, …


Fluentd


AT A GLANCE Conceived by Sadayuki “Sada” Furuhashi in 2011, Fluentd is an open source data collector that claims to unify data collection and consumption so that data can be better used and understood. Fluentd tries to structure data as JSON as much as possible: this allows it to unify …
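To make the "structure data as JSON" idea concrete, here is a minimal Python sketch, not Fluentd's own code. It assumes a made-up four-field access-log line and a hypothetical helper named `to_event`; the only detail borrowed from Fluentd is that an event carries a tag, a time, and a record.

```python
import json


def to_event(tag, timestamp, line):
    """Turn a raw log line into a structured event (hypothetical helper).

    Assumes the line has exactly four whitespace-separated fields:
    host, HTTP method, path, and status code.
    """
    host, method, path, status = line.split()
    record = {
        "host": host,
        "method": method,
        "path": path,
        "status": int(status),  # typed value instead of raw text
    }
    return {"tag": tag, "time": timestamp, "record": record}


# Example values are illustrative only.
event = to_event("web.access", 1700000000, "127.0.0.1 GET /index.html 200")
print(json.dumps(event))
```

Once every source produces events of this shape, downstream consumers can filter or route on field names rather than re-parsing free text.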


Gobblin


AT A GLANCE Gobblin claims to be a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources, such as databases, REST APIs, FTP/SFTP servers, and filers, into Hadoop. Gobblin handles the common routine tasks required for all data ingestion ETLs, including …


Cloudera Morphlines

AT A GLANCE Cloudera’s Morphlines is an open source framework that is said to reduce the time and skills needed to build and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. Cloudera Morphlines is an …


Apache Samza


AT A GLANCE Apache Samza is a distributed stream processing framework developed by LinkedIn. It uses Apache Kafka for messaging and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Messaging systems are a popular way of implementing near-real-time asynchronous computation. Messages can be added to a message …


Apache Kafka


AT A GLANCE Apache Kafka is a distributed publish-subscribe messaging system for processing large amounts of streaming data. Developed by LinkedIn, it persists messages to disk with high throughput. Because messages are persisted, clients can rewind a stream and consume the …
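The rewind behavior follows from keeping messages in an append-only log and letting each consumer track its own read position. The sketch below illustrates that idea only: it is an in-memory toy rather than Kafka's client API, and the class and method names (`AppendOnlyLog`, `Consumer`, `poll`, `seek`) are hypothetical.

```python
class AppendOnlyLog:
    """Toy stand-in for a persisted topic: messages are never removed on read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Return the offset (position) assigned to the new message.
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        # Reading does not consume: the same range can be read again later.
        return self._messages[offset:]


class Consumer:
    """Tracks its own offset, so rewinding is just resetting a number."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self):
        batch = self._log.read_from(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset


log = AppendOnlyLog()
for msg in ["click", "view", "purchase"]:
    log.append(msg)

consumer = Consumer(log)
first_pass = consumer.poll()   # reads all three messages
consumer.seek(1)               # rewind to the second message
replay = consumer.poll()       # reads "view" and "purchase" again
```

Because the log keeps messages after they are read, a second consumer could start at offset 0 and see the full history independently of the first.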


Apache Sqoop

AT A GLANCE Apache Sqoop is a system for bulk data transfer between HDFS and structured datastores such as relational databases (RDBMSs). It fills a role similar to Flume's, but for structured datastores rather than log data. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data …


Apache Flume


AT A GLANCE Apache’s Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery …
