Maruti Techlabs
Big data is primarily defined by the volume of a data set. Big data sets are generally huge, measuring tens of terabytes and sometimes crossing the threshold of petabytes. The term big data was preceded by very large databases (VLDBs), which were managed using database management systems (DBMS). Today, big data falls under three categories of data sets: structured, unstructured and semi-structured.
Structured data sets consist of data that can be used in its original form to derive results. Examples include relational data such as employee salary records. Most modern computers and applications are programmed to generate structured data in preset formats to make it easier to process.
Unstructured data sets, on the other hand, lack a predefined format or schema. Examples include human-written text and Google search results. These loosely organized collections of data require more processing power and time to convert into structured data sets before they can help in deriving tangible results.
Semi-structured data sets combine elements of both structured and unstructured data. They carry some structure, such as tags or markers, yet lack the defining elements needed for direct sorting and processing. Examples include RFID and XML data.
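To make the distinction concrete, here is a minimal sketch of turning semi-structured XML into structured rows with a fixed set of columns. The XML payload, the field names and the `to_rows` helper are all hypothetical and chosen only for illustration; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the data is tagged, but not every record
# carries every field, which is typical of semi-structured data.
XML_DOC = """
<employees>
  <employee id="1"><name>Asha</name><salary>52000</salary></employee>
  <employee id="2"><name>Ben</name></employee>
</employees>
""".strip()


def to_rows(xml_text):
    """Flatten each <employee> element into a dict with identical keys."""
    rows = []
    for emp in ET.fromstring(xml_text).iter("employee"):
        rows.append({
            "id": emp.get("id"),                # attribute value, as a string
            "name": emp.findtext("name"),       # None when the tag is absent
            "salary": emp.findtext("salary"),   # None when the tag is absent
        })
    return rows


rows = to_rows(XML_DOC)
```

Every row now has the same keys, so it can be loaded into a relational table. The missing salary shows up as an explicit `None` instead of a missing tag.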
Big data processing requires a particular setup of physical and virtual machines to derive results. The processing is done in parallel to achieve results as quickly as possible. These days, big data processing techniques also include cloud computing and artificial intelligence. These technologies reduce manual input and oversight by automating many processes and tasks.
The evolving nature of big data has made it difficult to give it a commonly accepted definition. Data sets are assigned big data status based on the technologies and tools required to process them.
Big data analytics — Technologies and Tools
Big data analytics is the process of extracting useful information by analysing different types of big data sets. Big data analytics is used to discover hidden patterns, market trends and consumer preferences, for the benefit of organizational decision making. There are several steps and technologies involved in big data analytics.
Data acquisition has two components: identification and collection of big data. Identification of big data is done by analyzing the two natural formats of data — born digital and born analogue.
Born Digital Data
Born digital data is information that has been captured through a digital medium, such as a computer or a smartphone app. The range of this type of data keeps expanding, since systems keep collecting different kinds of information from users. Born digital data is traceable and can provide both personal and demographic business insights. Examples include cookies, web analytics and GPS tracking.
Born Analogue Data
Information in the form of pictures, videos and other formats that relate to physical elements of our world is termed analogue data. This data must be converted into digital format using sensors and capture devices such as cameras, voice recorders and digital assistants. The increasing reach of technology has also raised the rate at which traditionally analogue data is converted or captured through digital mediums.
The second step in the data acquisition process is the collection and storage of data sets identified as big data. Since older DBMS techniques were inadequate for managing big data, a new approach is used for collecting and storing it. The approach is called MAD: magnetic, agile and deep. Because managing big data requires a significant amount of processing and storage capacity, building such systems in-house is out of reach for most entities that rely on big data analytics. Thus, the most common solutions for big data processing today are based on two principles: distributed storage and Massively Parallel Processing (MPP). Most high-end Hadoop platforms and specialty appliances use MPP configurations in their systems.
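The two principles can be sketched on a single machine. In the example below, the partitions stand in for data stored on separate nodes, and a thread pool stands in for the nodes working in parallel. This is only an illustration under those assumptions: real MPP systems run on separate machines, and the names `partition`, `local_sum` and `parallel_total` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Distribute records round-robin across n_nodes partitions."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def local_sum(chunk):
    """Work one node performs independently on its own partition."""
    return sum(chunk)


def parallel_total(records, n_nodes=4):
    """Aggregate all records by combining per-partition partial results."""
    chunks = partition(records, n_nodes)
    # Threads are a stand-in for independent nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, chunks))
    # A coordinator combines the partial results into the final answer.
    return sum(partials)
```

Each worker touches only its own partition, so adding partitions and workers spreads the load without changing the per-partition logic.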
In-memory Database Systems
These database storage systems are designed to overcome one of the major hurdles in big data processing: the time traditional databases take to access and process information. In-memory database (IMDB) systems store the data in the RAM of big data servers, drastically reducing the storage I/O gap. Apache Spark, although strictly an in-memory processing engine rather than a database, is often cited in this context. VoltDB, NuoDB and IBM solidDB are further examples of IMDB systems.
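The idea of keeping the whole database in RAM can be tried on a small scale with SQLite's in-memory mode, which ships with Python's standard `sqlite3` module. This is a minimal sketch rather than a big data system: SQLite is not one of the products named above, and the table, column names and `average_salary` helper are hypothetical.

```python
import sqlite3


def average_salary(rows):
    """Load rows into a RAM-only SQLite database and aggregate them."""
    # ":memory:" tells SQLite to keep the database entirely in RAM,
    # so no disk I/O is involved in the reads and writes below.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE salaries (employee TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO salaries VALUES (?, ?)", rows)
        (average,) = conn.execute(
            "SELECT AVG(amount) FROM salaries"
        ).fetchone()
        return average
    finally:
        conn.close()  # the in-memory database disappears with the connection
```

The data vanishes when the connection closes. That is the usual trade-off of in-memory storage, and production systems typically address it with snapshots or replication.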
Hybrid Data Storage and Processing Systems — Apache Hadoop
Apache Hadoop is a hybrid data storage and processing system which provides scalability and speed at reasonable cost for small and mid-sized businesses. It uses the Hadoop Distributed File System (HDFS) to store large files across multiple machines, known as cluster nodes. Hadoop has a replication mechanism to ensure smooth operation even when individual nodes fail. At its core, Hadoop implements MapReduce, a parallel programming model introduced by Google. The name originates from the map and reduce functions of functional programming languages, which the model uses for big data processing. MapReduce works on the premise of increasing the number of functional nodes rather than increasing the processing power of individual nodes. Moreover, Hadoop runs on readily available hardware, which has significantly sped up its development and popularity.
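The map and reduce steps can be illustrated with a word count, the customary introductory example for this model. The sketch below runs in a single Python process and does not use Hadoop's API. The function names are hypothetical, and the `shuffle` step imitates the grouping by key that a cluster would perform between the two phases.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(documents):
    # On a cluster, each document (or file block) would be mapped on a
    # different node; here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

Because each `map_phase` call depends only on its own document, the map work can be spread over as many nodes as there are inputs. This is why the model scales by adding nodes.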
Data mining is a recent concept based on the contextual analysis of big data sets to discover relationships between separate data items. The objective is to let different users apply a single data set to different purposes. Data mining can be used for reducing costs and increasing revenues.
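One simple way to look for relationships between separate data items is to count how often pairs of items occur together. The sketch below assumes the data arrives as lists of items, for instance shopping baskets. The `pair_counts` helper and the sample data are hypothetical, and co-occurrence counting is only one of many data mining techniques.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) removes duplicates and fixes the pair order,
        # so ("bread", "milk") and ("milk", "bread") count as one pair.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["eggs", "milk"],
]
counts = pair_counts(baskets)
```

Pairs with high counts are candidates for a real relationship. A fuller analysis would also weigh how common each item is on its own.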
Source: Towards Data Science