The brainchild of Nathan Marz, the author of a book on real-time big data, the Lambda Architecture (LA) was created to be a generic, scalable, fault-tolerant data processing architecture based on his experience working on distributed data processing systems at BackType and Twitter. According to the official LA site:
The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.
LA is designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. It attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate pre-computed views, while simultaneously using real-time stream processing to provide dynamic views.
The Lambda Architecture was purpose-built as a robust framework for ingesting streams of fast data while providing efficient real-time and historical analytics. In Lambda, immutable data flows in one direction: into the system. The architecture’s main goal is to execute OLAP-type processing faster than what is possible with current OLAP solutions.
Lambda-based applications are used for:
- Log ingestion and analytics
- Real-time programmatic advertising
- Serving streaming video
- Recommendation engines
Lambda is chosen for:
- Applications that require lower latency
- Data pipeline applications
- Applications that process asynchronous, complex transformations
The Lambda batch layer is usually a “data lake” system like Hadoop, although it could also be an OLAP data warehouse such as HP Vertica or IBM Netezza. This historical archive is used to hold all of the data ever collected. The batch layer supports batch query; batch processing is used to generate analytics, either predefined or ad hoc.
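The batch layer's defining trait is that it recomputes its views from scratch over the entire immutable master dataset. A minimal sketch of that idea, using hypothetical pageview events and a per-URL count as the batch view (the event shape and function names here are illustrative, not part of any specific Lambda implementation):

```python
from collections import defaultdict

def compute_batch_view(master_dataset):
    """Recompute the batch view from scratch over the full master dataset.

    Each event is assumed to be a dict like {"url": ..., "timestamp": ...};
    the view here is simply a pageview count per URL.
    """
    view = defaultdict(int)
    for event in master_dataset:
        view[event["url"]] += 1
    return dict(view)

# The master dataset is append-only: events are added, never updated in place.
master = [
    {"url": "/home", "timestamp": 1},
    {"url": "/home", "timestamp": 2},
    {"url": "/about", "timestamp": 3},
]
print(compute_batch_view(master))  # {'/home': 2, '/about': 1}
```

Because the view is always rebuilt from raw data, a bug in the view logic (a "human mistake") is fixed by correcting the code and rerunning the batch, without any data loss.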
The Lambda speed layer is defined as a combination of queuing, streaming, and operational data stores. In the Lambda Architecture, the speed layer computes much the same analytics as the batch layer, except that it computes them in real time and on only the most recent data. The analytics the batch layer calculates, for example, may be based on data one hour old. It is the speed layer's responsibility to calculate real-time analytics on fast-moving data – data that is zero to one hour old.
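Where the batch layer recomputes, the speed layer updates incrementally, one event at a time, and discards its state once a batch run has absorbed those events. A minimal sketch under the same hypothetical pageview-count example (the class and method names are illustrative):

```python
from collections import defaultdict

class SpeedLayer:
    """Maintains an incremental real-time view over recent events only.

    Unlike the batch layer, which recomputes over all data, the speed
    layer updates its view event by event as data arrives.
    """

    def __init__(self):
        self.realtime_view = defaultdict(int)

    def ingest(self, event):
        # Incremental update: O(1) per event, no full recomputation.
        self.realtime_view[event["url"]] += 1

    def reset(self):
        # Called once a completed batch run has absorbed these events
        # into the batch view, so they are not counted twice.
        self.realtime_view.clear()

speed = SpeedLayer()
speed.ingest({"url": "/home"})
speed.ingest({"url": "/pricing"})
print(dict(speed.realtime_view))  # {'/home': 1, '/pricing': 1}
```

The trade-off is visible in the sketch: incremental updates give low latency, but the speed layer's state is approximate and disposable, which is why the batch layer remains the source of truth.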
Combining the analytics produced by the batch layer and the speed layer provides a complete view of analytics across all data, fresh and historical. The third layer of Lambda, the serving layer, is responsible for serving up results combined from both the speed and batch layers.
- All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
- The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
- The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
- Any incoming query can be answered by merging results from batch views and real-time views.
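The last point, merging batch and real-time views at query time, can be sketched in a few lines. This continues the hypothetical pageview-count example; the view contents shown are invented for illustration:

```python
def query(batch_view, realtime_view, url):
    """Answer a query by merging the pre-computed batch view
    with the speed layer's real-time view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

batch_view = {"/home": 100}     # from the last batch run: data up to ~1 hour old
realtime_view = {"/home": 7}    # from the speed layer: the most recent hour only
print(query(batch_view, realtime_view, "/home"))  # 107
```

The serving layer thus never needs to know which layer produced a given number; it only merges the two views, which is what makes the split between batch and speed layers transparent to clients.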
Developers evaluating the Lambda Architecture for handling streaming data should weigh its inherent complexity: the three layers described (speed, serving, and batch) require developers to build and maintain the same application logic in two separate, complex systems (the batch and speed layers).
Lambda issues include:
- Lambda cannot be used to build responsive, event-oriented applications
- Lambda is limited: immutable data flows in one direction only, into the system, for analytics harvesting.