Big data is an amount of data so large that it is difficult or impossible for a traditional relational database to handle. The term has seen increasing use in the past few years. In this chapter, we review the various ways that big data is described and how Hadoop is commonly used as a technology to process big data. In addition, we introduce Microsoft HDInsight, an implementation of Hadoop available as a Windows Azure service. Then we explore Microsoft PolyBase, an on-premises solution that integrates relational data stored in Microsoft SQL Server Parallel Data Warehouse (PDW) with non-relational data stored in a Hadoop Distributed File System (HDFS).
Big data
For several decades, many organizations have been analyzing data generated by transactional systems. This data has usually been stored in relational database management systems. A common step in the development of a business-intelligence solution is weighing the cost of transforming, cleansing, and storing this data in preparation for analysis against the perceived value that insights derived from the analysis could deliver. As a consequence, decisions are made about what data to keep and what data to ignore. Meanwhile, the data available for analysis continues to proliferate from a broad assortment of sources, such as server log files, social media, or instrument data from scientific research. At the same time, the cost to store high volumes of data on commodity hardware has been decreasing, and the processing power available for complex analysis of all this data has been increasing. This confluence of events has given rise to new technologies that support the management and analysis of big data.
Describing big data
The point at which data becomes big data is still the subject of much debate among data-management professionals. One approach to describing big data is known as the 3Vs: volume, velocity, and variety. This model, introduced in 2001 by analyst Doug Laney (then at META Group, which Gartner later acquired), has been extended with a fourth V, variability. However, disagreement continues, with some people considering the fourth V to be veracity.
Although it seems reasonable to associate volume with big data, how is a large volume different from the very large databases (VLDBs) and extreme workloads that some industries routinely manage? Examples of data sources that fall into this category include airline reservation systems, point-of-sale terminals, financial trading, and cellular-phone networks. As machine-generated data outpaces human-generated data, the volume of data available for analysis is proliferating rapidly. Many techniques, as well as software and hardware solutions such as PDW, exist to address high volumes of data. Therefore, many people argue that some other characteristic must distinguish big data from other classes of data that are routinely managed.
Some people suggest that this additional characteristic is velocity, or the speed at which the data is generated. As an example, consider the data generated by the Large Hadron Collider experiments, which is produced at a rate of 1 gigabyte (GB) per second. This data must subsequently be processed and filtered to provide 30 petabytes (PB) of data to physicists around the world. Most organizations are not generating data at this volume or pace, but data sources such as manufacturing sensors, scientific instruments, and web-application servers are nonetheless generating data so fast that complex event-processing applications are required to handle high-volume, high-speed throughput. Microsoft StreamInsight is a platform that supports this type of data management and analysis.
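To put the two Large Hadron Collider figures side by side, a back-of-the-envelope calculation can help. The sketch below is ours, not part of the original discussion: it assumes decimal units (1 PB = 1,000,000 GB) and a constant, sustained rate, neither of which the figures above specify, and the function name is purely illustrative.

```python
# Back-of-the-envelope: how long does a sustained data rate take to
# accumulate a given volume? Decimal units are an assumption here.
GB_PER_PB = 1_000_000          # 1 PB = 10**6 GB (decimal convention)
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_accumulate(volume_pb: float, rate_gb_per_s: float) -> float:
    """Return the days needed to produce volume_pb at rate_gb_per_s."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = volume_pb * GB_PER_PB / rate_gb_per_s
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = days_to_accumulate(volume_pb=30, rate_gb_per_s=1)
    print(f"{days:.0f} days")  # prints "347 days"
```

Under these assumptions, 30 PB corresponds to a little under a year of continuous output at 1 GB per second, which gives a sense of the scale involved.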
Data does not necessarily require volume and velocity to be categorized as big. Instead, a high volume of data with a lot of variety can constitute big data. Variety refers to the different ways that data might be stored: structured, semistructured, or unstructured. On the one hand, data-warehousing techniques exist to integrate structured data (often in relational form) with semistructured data (such as XML documents). On the other hand, unstructured data is more challenging, if not impossible, to analyze by using traditional methods. This type of data includes documents in PDF or Word format, images, and audio or video files, to name a few examples. Not only is unstructured data problematic for analytical solutions, but it is also growing more quickly than a file system on a single server can usually accommodate.
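The three categories of variety can be illustrated with a small Python sketch. The sample data and field names below are invented for illustration only. The point is simply that structured and semistructured data can be queried by field name with standard tools, whereas unstructured data arrives as opaque bytes that need specialized processing before any analysis is possible.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, and every row has the same fields.
structured = "order_id,amount\n1001,25.50\n1002,14.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total = sum(float(row["amount"]) for row in rows)

# Semistructured: self-describing tags, but some elements are optional.
semistructured = """
<orders>
  <order id="1001"><amount>25.50</amount><note>gift</note></order>
  <order id="1002"><amount>14.00</amount></order>
</orders>
""".strip()
root = ET.fromstring(semistructured)
# A missing <note> element falls back to an empty string.
notes = [order.findtext("note", default="") for order in root.findall("order")]

# Unstructured: opaque bytes. Without a format-specific parser, about all
# we can say is how large the payload is.
unstructured = b"%PDF-1.4 placeholder bytes standing in for a document"
size = len(unstructured)

print(total)   # 39.5
print(notes)   # ['gift', '']
print(size)
```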
Big data as a branch of data management is still difficult to define with precision, given that many competing views exist and that no clear standards or methodologies have been established. Data that looks big to one organization by any of the definitions we've described might look small to another organization that has evolved solutions for managing specific types of data. Perhaps the best definition of big data at present is also the most general. For the purpose of this chapter, we take the position that big data describes a class of data that requires a different architectural approach than currently available relational database systems can effectively support, such as append-only workloads instead of updates.
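The contrast between append-only workloads and updates can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not a description of how Hadoop or any particular product is implemented; the `Reading` and `AppendOnlyLog` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reading:
    """One immutable record, for example a sensor measurement."""
    sensor_id: str
    value: float


@dataclass
class AppendOnlyLog:
    """Records are only ever added, never modified or deleted."""
    records: list = field(default_factory=list)

    def append(self, reading: Reading) -> None:
        self.records.append(reading)

    def latest(self) -> dict:
        """Derive the current state by replaying the log; later records win."""
        state = {}
        for reading in self.records:
            state[reading.sensor_id] = reading.value
        return state

    def history(self, sensor_id: str) -> list:
        """Every value ever recorded for one sensor, in arrival order."""
        return [r.value for r in self.records if r.sensor_id == sensor_id]


# Update-in-place style: the old value is overwritten and lost.
current = {"s1": 20.0}
current["s1"] = 21.5

# Append-only style: both values are kept, and the state is derived.
log = AppendOnlyLog()
log.append(Reading("s1", 20.0))
log.append(Reading("s1", 21.5))

print(current)            # {'s1': 21.5}
print(log.latest())       # {'s1': 21.5}
print(log.history("s1"))  # [20.0, 21.5]
```

Both styles report the same current value, but only the append-only log retains the full history, at the cost of storing every record and replaying them to derive state.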
Author
Lianamelissa is a Research Analyst at Mindmajix and a technology enthusiast who likes to explore different technologies, follow technology trends in the market, and write about them.