This tutorial gives you an overview and talks about the fundamentals of Apache STORM.
Introduction
Storm is a distributed, reliable, fault-tolerant system for processing streams of data. The work is delegated to different types of components that are each responsible for a simple specific processing task. The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a component called a bolt, which transforms it in some way. A bolt either persists the data in some sort of storage, or passes it to some other bolt. You can imagine a Storm cluster as a chain of bolt components that each make some kind of transformation on the data exposed by the spout.
To illustrate this concept, here’s a simple example. Last night I was watching the news when the announcers started talking about politicians and their positions on various topics. They kept repeating different names, and I wondered if each name was mentioned an equal number of times, or if there was a bias in the number of mentions.
Imagine the subtitles of what the announcers were saying as your input stream of data. You could have a spout that reads this input from a file (or a socket, via HTTP, or some other method). As lines of text arrive, the spout hands them to a bolt that separates lines of text into words. This stream of words is passed to another bolt that compares each word to a predefined list of politician’s names. With each match, the second bolt increases a counter for that name in a database. Whenever you want to see the results, you just query that database, which is updated in real time as data arrives. The arrangement of all the components (spouts and bolts) and their connections is called a topology.
Figure:- A simple topology
Now imagine easily defining the level of parallelism for each bolt and spout across the whole cluster so you can scale your topology indefinitely. Amazing, right? Although this is a simple example, you can see how powerful Storm can be.
1) Storm was open-sourced by Twitter in September of 2011 and has since been adopted by numerous companies around the world.Storm provides a small set of simple, easy to understand primitives. These primitives can be used to solve a stunning number of realtime computation problems, from stream processing to continuous computation to distributed RPC
2) To understand the parallelism of storm topology, Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster: Worker processes, Executors (threads), Tasks
3) A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.
4) An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).
5) A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
6) The core data model in Trident is the “Stream”, processed as a series of batches. A stream is partitioned among the nodes in the cluster, and operations applied to a stream are applied in parallel across each partition.
7) There are five kinds of operations in Trident:
2) To understand the parallelism of storm topology, Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster: Worker processes, Executors (threads), Tasks
3) A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.
4) An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).
5) A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
6) The core data model in Trident is the “Stream”, processed as a series of batches. A stream is partitioned among the nodes in the cluster, and operations applied to a stream are applied in parallel across each partition.
7) There are five kinds of operations in Trident:
- Operations that apply locally to each partition and cause no network transfer
- Repartitioning operations that repartition a stream but otherwise don’t change the contents (involves network transfer)
- Aggregation operations that do network transfer as part of the operation
- Operations on grouped streams
- Merges and joins
8) By default, Storm can serialize primitive types, strings, byte arrays, ArrayList, HashMap, HashSet, and the Closure collection types. If you want to use another type in your tuples, you’ll need to register a custom serializer. Well much more in-depth about custom serialization, java serialization is explained during Storm training sessions
If you want more visit Mindmajix
Author
Lianamelissa is Research Analyst at Mindmajix. A techno freak who likes to explore different technologies. Likes to follow the technology trends in market and write about them.
No comments:
Post a Comment