
Apache Chukwa


Apache Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


Roughly, the Chukwa story is:
  • Get data to the centralized store, and do periodic near-real-time analysis.


At the same level of granularity, the Flume story is:

  • Reliably get data to the centralized store, enable continuous near real-time analysis, and enable periodic batch analysis.
1) Architecture and near-real-time
  • Chukwa’s near real-time == minutes
  • Flume’s near real-time == seconds (hopefully milliseconds).
Both systems have an agent-collector topology for nodes.  Architecturally,
Chukwa is a batch/minibatch system. In contrast, Flume is designed
more as a continuous stream processing system.
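
To make the batch/minibatch versus continuous distinction concrete, here is a minimal sketch in Python. It is not code from either project: the class names, the count-based batch trigger, and the sink callbacks are all assumptions made for illustration (a real minibatch system would more likely flush on a time interval).

```python
from typing import Callable, List


class MiniBatchCollector:
    """Hypothetical collector: buffers events, delivers them in batches."""

    def __init__(self, batch_size: int, sink: Callable[[List[str]], None]):
        self.batch_size = batch_size
        self.sink = sink
        self.buffer: List[str] = []

    def append(self, event: str) -> None:
        # An event is only visible downstream once its batch fills up.
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Hand over whatever is buffered and start a fresh batch.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


class StreamingCollector:
    """Hypothetical collector: forwards every event as soon as it arrives."""

    def __init__(self, sink: Callable[[str], None]):
        self.sink = sink

    def append(self, event: str) -> None:
        self.sink(event)
```

With a batch size of 3, the first two events sit in the buffer until the third arrives, which is the source of the extra latency; the streaming variant delivers each event immediately.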
2) Reliability
Flume’s reliability levels are tunable: just pick the appropriate sink
to specify the mode you want to collect data in.  It offers three
levels: best effort, store and retry on failure, and an end-to-end mode
that uses acks and a write-ahead log.
AFAICT, Chukwa is best effort from agent to collector, writes to local
disk at the pre-demux collector, and finally becomes reliable
when the data is written to HDFS. This seems stronger than Scribe’s reliability
mechanisms (equivalent to Flume’s store on failure), but weaker than
Flume’s end-to-end reliability mode (write-ahead log and acks).
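
The three reliability levels described above can be sketched roughly as follows. This is an illustrative Python model, not Flume's API: `best_effort`, `StoreOnFailure`, `EndToEnd`, and the `send` callback (which returns True when the downstream side acknowledges an event) are all hypothetical names. The "log" here is an in-memory dict; a real write-ahead log would live on disk so that it survives a crash.

```python
from typing import Callable, Dict, List

# Assumed transport: returns True if the downstream side acknowledged the event.
Send = Callable[[str], bool]


def best_effort(event: str, send: Send) -> bool:
    """Try once; if the send fails, the event is simply lost."""
    return send(event)


class StoreOnFailure:
    """Send first; keep the event locally only if the send failed."""

    def __init__(self, send: Send):
        self.send = send
        self.pending: List[str] = []

    def append(self, event: str) -> None:
        if not self.send(event):
            self.pending.append(event)

    def retry(self) -> None:
        # Keep only the events that still fail to send.
        self.pending = [e for e in self.pending if not self.send(e)]


class EndToEnd:
    """Log every event before sending; remove it only after an ack."""

    def __init__(self, send: Send):
        self.send = send
        self.wal: Dict[int, str] = {}  # stand-in for an on-disk log
        self.next_seq = 0

    def append(self, event: str) -> None:
        seq = self.next_seq
        self.next_seq += 1
        self.wal[seq] = event  # write ahead of the send attempt
        if self.send(event):
            del self.wal[seq]

    def replay(self) -> None:
        # Resend everything that has not been acknowledged yet, in order.
        for seq in sorted(self.wal):
            if self.send(self.wal[seq]):
                del self.wal[seq]
```

The difference between the last two is when the event is persisted: store-on-failure only saves an event after a send has already failed, while the end-to-end variant records it before any send is attempted and forgets it only once an ack comes back.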
3) Manageability
Flume just requires the deployment of a master (or set of masters) and
nodes.  It then provides a centralized management point that allows
you to configure nodes dynamically, and to reconfigure the data flow
topology dynamically.
Chukwa’s deployment story is more restrictive and more complicated than
Flume’s.  It only supports an agent/collector topology.  Despite this
restriction, it requires the deployment of more distinct programs than
Flume: agents, collectors, and a console daemon.  For Chukwa to work as
intended, it also depends on a Hadoop cluster (both HDFS and
MapReduce) and a MySQL database.
4) Support.
Cloudera packages Flume as part of its distribution of Hadoop and will
provide commercial support for users of the system.


About davidn
