Lambda architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of the data, while simultaneously using real-time stream processing to provide views of the most recent data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce. [1]

Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. [2] : 32 It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
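The data model described above can be illustrated with a minimal Python sketch. The `Event` fields, the `append` helper, and the `current_city` query are hypothetical names chosen for illustration; the sketch only assumes that events carry a timestamp and are never modified once recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be changed after creation
class Event:
    timestamp: int  # event time, e.g. seconds since some epoch
    user: str
    city: str       # the fact recorded by this event


def append(log, event):
    """Return a new log with the event added; existing entries are untouched."""
    return log + (event,)


def current_city(log, user):
    """Derive state from the time ordering: the latest event for the user wins."""
    events = [e for e in log if e.user == user]
    if not events:
        return None
    return max(events, key=lambda e: e.timestamp).city


log = ()
log = append(log, Event(100, "alice", "Paris"))
log = append(log, Event(200, "alice", "Berlin"))  # appended, not overwritten
log = append(log, Event(150, "bob", "Rome"))
```

Here the earlier "Paris" event stays in the log; the current state is a function of the whole history rather than a value updated in place.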

Overview

Lambda architecture describes a system of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. [3] : 13 The processing layers ingest from an immutable master copy of the entire data set.

Batch layer

The batch layer precomputes results with a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to reliably process all available data when generating views. This means that it can fix any errors by recomputing on the complete data set. Output is typically stored in a database, with updated views replacing the existing precomputed ones. [3] : 18

Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures. [4]
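As a rough illustration of batch recomputation, the following Python sketch rebuilds a view from the entire data set on every run. It stands in for a distributed job and is not Hadoop code; the page-view counting task and the names `compute_batch_view` and `master_dataset` are assumptions made for the example.

```python
from collections import Counter


def compute_batch_view(master_dataset):
    """Recompute page-view counts from scratch over the whole data set."""
    view = Counter()
    for _timestamp, page in master_dataset:
        view[page] += 1
    return dict(view)


# Master data set: (timestamp, page) records, only ever appended to.
master_dataset = [(1, "/home"), (2, "/about"), (3, "/home")]
batch_view = compute_batch_view(master_dataset)

# A later run sees more data and replaces the previous view entirely,
# so any error in the earlier view is corrected by the recomputation.
master_dataset = master_dataset + [(4, "/home")]
batch_view = compute_batch_view(master_dataset)
```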

Speed layer

Diagram showing the flow of data through the processing and serving layers of lambda architecture. Example named components are shown.

The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views of the most recent data. Essentially, the speed layer is responsible for filling the “gap” caused by the batch layer's lag in providing views of the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after the data is received, and they can be replaced when the batch layer's views for the same data become available. [3] : 203

Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored in fast NoSQL databases. [5] [6]
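The behaviour of the speed layer can be sketched in plain Python as follows. This is not Storm or Spark code; the `SpeedLayer` class, its `on_event` and `expire` methods, and the `batch_horizon` parameter are hypothetical names, and the sketch assumes a simple per-page counting view.

```python
from collections import defaultdict


class SpeedLayer:
    """Keeps an incrementally updated view of events the batch layer has not yet covered."""

    def __init__(self):
        self.events = []              # recent (timestamp, page) events
        self.view = defaultdict(int)  # page -> count over recent events

    def on_event(self, timestamp, page):
        """Update the real-time view as soon as an event arrives."""
        self.events.append((timestamp, page))
        self.view[page] += 1

    def expire(self, batch_horizon):
        """Drop events with timestamp <= batch_horizon, which the batch view now covers."""
        self.events = [(t, p) for t, p in self.events if t > batch_horizon]
        self.view = defaultdict(int)
        for _t, page in self.events:
            self.view[page] += 1


speed = SpeedLayer()
speed.on_event(5, "/home")
speed.on_event(6, "/about")
speed.on_event(7, "/home")
speed.expire(batch_horizon=5)  # the batch layer has caught up through timestamp 5
```

Each event updates the view immediately, and the real-time results are discarded once the batch layer has produced a view covering the same data.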

Serving layer

Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.

Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. [7] Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, MongoDB or Elasticsearch for speed-layer output, and ElephantDB, Cloudera Impala or Apache Hive for batch-layer output. [2] : 45 [5]
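A query against the serving layer can be sketched as a merge of the two views. The function name `query` and the dictionary-based views are assumptions for illustration; the sketch further assumes that the views hold additive counts and that the real-time view covers only data not yet included in the batch view.

```python
def query(page, batch_view, realtime_view):
    """Answer a query by combining the precomputed batch view with the real-time view."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


batch_view = {"/home": 3, "/about": 1}     # covers data up to the last batch run
realtime_view = {"/home": 2, "/news": 1}   # covers data received since then

home_total = query("/home", batch_view, realtime_view)
news_total = query("/news", batch_view, realtime_view)
```

For views that are not simple sums (for example distinct counts), the merge step would need a different combining function.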

Optimizations

To optimize the data set and improve query efficiency, various rollup and aggregation techniques are applied to the raw data, [7] : 23 while estimation techniques are employed to further reduce computation costs. [8] While expensive full recomputation is required for fault tolerance, incremental computation algorithms may be selectively added to increase efficiency, and techniques such as partial computation and resource-utilization optimizations can effectively help lower latency. [3] : 93,287,293
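One way to reduce the volume of data that queries must scan is to roll raw events up into coarser time buckets. The sketch below is an assumed example of such a technique, not a description of any particular system; `rollup_hourly` and the hourly bucket size are choices made for illustration.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def rollup_hourly(raw_events):
    """Aggregate (timestamp, page) events into counts per (hour_start, page)."""
    buckets = Counter()
    for timestamp, page in raw_events:
        hour_start = timestamp - timestamp % SECONDS_PER_HOUR
        buckets[(hour_start, page)] += 1
    return dict(buckets)


raw_events = [(10, "/home"), (20, "/home"), (3599, "/about"), (3600, "/home")]
rolled_up = rollup_hourly(raw_events)
```

Four raw events become three rows here; on real data with many events per bucket the reduction is much larger, at the cost of losing per-event detail.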

Lambda architecture in use

Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses Druid for storing and serving both the streamed and batch-processed data. [7] : 42

For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Apache Hadoop, and Druid. [9] : 9,16

The Netflix Suro project has separate processing paths, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views. [10] Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.

Criticism

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach. [11]

In a technical discussion on the merits of employing a pure streaming approach, it was noted that a flexible streaming framework such as Apache Samza could provide some of the same benefits as batch processing without the latency. [12] Such a streaming framework could allow for collecting and processing arbitrarily large windows of data, accommodate blocking, and handle state.
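The idea of a stateful stream processor that works over windows of data can be sketched as follows. This is plain Python and does not use the Apache Samza API; the `TumblingWindowCounter` class and its methods are hypothetical, and fixed-size (tumbling) windows are just one assumed windowing scheme.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per key in fixed-size time windows, keeping state between events."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.state = defaultdict(int)  # (window_start, key) -> count

    def process(self, timestamp, key):
        """Handle one event by updating the count for the window it falls in."""
        window_start = timestamp - timestamp % self.window_size
        self.state[(window_start, key)] += 1

    def result(self, window_start, key):
        """Return the current count for a window without modifying the state."""
        return self.state.get((window_start, key), 0)


counter = TumblingWindowCounter(window_size=60)
for timestamp in (5, 59, 60):
    counter.process(timestamp, "/home")
```

With a window size of 60, the events at timestamps 5 and 59 fall in the window starting at 0 and the event at 60 falls in the next one. Making the window larger requires only more state, not a separate batch code path.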

See also

  • Event stream processing

References

  1. Schuster, Werner. “Nathan Marz on Storm, Immutability in the Lambda Architecture, Clojure”. www.infoq.com. Interview with Nathan Marz, 6 April 2014.
  2. Bijnens, Nathan. “Real-time architecture using Hadoop and Storm”. 11 December 2013.
  3. Marz, Nathan; Warren, James. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013.
  4. Kar, Saroj. “Hadoop Sector will have annual growth of 58% for 2013-2020”. Cloud Times, 28 May 2014.
  5. Kinley, James. “The Lambda architecture: principles for architecting realtime Big Data systems”. Retrieved 26 August 2014.
  6. Ferrera Bertran, Pere. “Lambda Architecture: A state-of-the-art”. Datasalt, 17 January 2014.
  7. Yang, Fangjin; Merlino, Gian. “Real-time Analytics with Open Source Technologies”. 30 July 2014.
  8. Ray, Nelson. “The Art of Approximating Distributions: Histograms and Quantiles at Scale”. Metamarkets, 12 September 2013.
  9. Rao, Supreeth; Gupta, Sunil. “Interactive Analytics in Human Time”. 17 June 2014.
  10. Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. “Announcing Suro: Backbone of Netflix's Data Pipeline”. Netflix, 9 December 2013.
  11. Kreps, Jay. “Questioning the Lambda Architecture”. radar.oreilly.com. O'Reilly. Retrieved 15 August 2014.
  12. Hacker News. Retrieved 20 August 2014.