AT A GLANCE
Apache Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Roughly the Chukwa story is:
- Get data to the centralized store, and do periodic near-real-time analysis.
At the same level of granularity, the Flume story is:
- Reliably get data to the centralized store, enable continuous near real-time analysis, and enable periodic batch analysis.
1) Architecture and Near-realtime.
- Chukwa’s near real-time == minutes
- Flume’s near real-time == seconds (hopefully milliseconds).
Both systems use an agent-collector topology for nodes. Architecturally,
Chukwa is a batch/minibatch system. In contrast, Flume is designed
more as a continuous stream processing system.
Flume’s reliability levels are tunable: just pick the sink that
specifies the mode you want to collect data in. It offers three
levels: best effort, store and retry on failure, and an end-to-end
mode that uses acks and a write-ahead log.
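As a sketch of how this looks in practice, here are the three levels in Flume’s (0.9.x-era) dataflow configuration language, assuming the agentBESink/agentDFOSink/agentE2ESink names from the Flume user guide; the hostname, port, and log path below are hypothetical:

```
# Best effort: events are dropped if the collector is unreachable.
agent1 : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;

# Store on failure: buffer to local disk and retry on collector failure.
agent2 : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;

# End-to-end: write-ahead log plus acks from the terminal sink.
agent3 : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
```

The point is that reliability is a per-flow choice, not a property of the whole deployment.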
AFAICT Chukwa is best effort from agent to collector, writes to local
disk at the pre-demux collector, and then finally becomes reliable
when written to HDFS. This seems stronger than Scribe’s reliability
mechanisms (equivalent to Flume’s store-on-failure mode), but weaker than
Flume’s end-to-end reliability mode (write-ahead log and acks).
Flume just requires the deployment of a master (or set of masters) and
nodes. It then provides a centralized management point that allows
you to configure nodes and reconfigure data flows dynamically.
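For illustration, a flow like the one described above could be wired up interactively from the Flume shell; this is a sketch under the assumption that the 0.9.x shell’s connect and exec config commands are available, and the node names, host, ports, and paths are hypothetical:

```
# Connect to the master, then map logical nodes to sources and sinks.
connect master:35873
exec config agent1 'tail("/var/log/app.log")' 'agentE2ESink("collector1", 35853)'
exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/", "events-")'
```

Re-running exec config with a different source or sink is how the data flow gets reconfigured without redeploying anything on the nodes.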
Chukwa’s deployment story is restrictive and more complicated than
Flume’s. It only supports an agent/collector topology. Despite this
restriction, it requires the deployment of more distinct programs than
Flume: agents, collectors, and a console daemon. For Chukwa to work as
intended, it also has dependencies on a Hadoop cluster (both HDFS and
MapReduce) and a MySQL database.
Cloudera packages Flume as part of its distribution of Hadoop and will
provide commercial support for users of the system.