Monday, April 14, 2014

Big Data Introduction


Big data refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. The term covers collections of large and complex data sets that are difficult to process using traditional database management tools or traditional data processing applications. Big data is characterized by the three Vs: volume, velocity, and variety.

Data sets grow in size in part because they are increasingly being gathered from many sources such as information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.

As the amount of collected data increases day by day, it becomes difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers." The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. Data gathered at this scale strains an organization and forces it to manage big data with a distributed approach.

Distributed System in Big Data Technology

A distributed system is a collection of independent computers that appears to its users as a single coherent system. A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.
Distributed systems play an important role in managing the big data problems that prevail in today’s world. In the distributed approach, data is placed on multiple machines and made available to the user as if it were on a single system. A distributed system makes proper use of hardware and resources across multiple locations and machines.
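To make the idea of "multiple machines that look like a single system" concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the class name ShardedStore and its methods are hypothetical, each "machine" is just an in-memory dictionary, and hash-based placement is one common way to spread data, not the only one.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys across several 'machines'.

    Assumption: each machine is modelled as an in-memory dict. A real
    system would talk to remote servers over the network instead.
    """

    def __init__(self, num_machines):
        self.machines = [{} for _ in range(num_machines)]

    def _machine_for(self, key):
        # Hash the key so the same key always maps to the same machine.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.machines)

    def put(self, key, value):
        self.machines[self._machine_for(key)][key] = value

    def get(self, key):
        # Returns None when the key has never been stored.
        return self.machines[self._machine_for(key)].get(key)


store = ShardedStore(num_machines=4)
store.put("user:1", "Alice")
store.put("user:2", "Bob")
print(store.get("user:1"))  # the caller never names a machine
```

The caller only ever uses put and get; which machine holds a given key is decided inside the store, which is the sense in which the data appears to live on a single system.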

Example: How Google uses distributed systems to manage data for search engines

Because a large amount of data accumulates on the web every day, it is difficult to manage all documents on a centralized server. To overcome these big data problems, search engine companies like Google use distributed servers. In a distributed search engine there is no central server.


Unlike in traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner, with no single point of control. Several distributed servers are set up in different locations, and information is made accessible to the user from nearby servers. Mirror servers perform different types of caching operations as required.
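The division of indexing and query processing among peers can be sketched in a few lines of Python. This is a simplified illustration and not a description of how Google or any real search engine is built: the names IndexPeer, build_peers, and search_all are hypothetical, the peers are plain objects in one process, and documents are assigned to peers round-robin by ID.

```python
from collections import defaultdict


class IndexPeer:
    """One peer holding an inverted index over its own share of the documents."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of document IDs

    def add_document(self, doc_id, text):
        for word in text.lower().split():
            self.index[word].add(doc_id)

    def search(self, word):
        return self.index.get(word.lower(), set())


def build_peers(documents, num_peers):
    """Split the documents among the peers (assumed rule: doc_id modulo peer count)."""
    peers = [IndexPeer() for _ in range(num_peers)]
    for doc_id, text in documents.items():
        peers[doc_id % num_peers].add_document(doc_id, text)
    return peers


def search_all(peers, word):
    """Send the query to every peer and merge the partial results."""
    results = set()
    for peer in peers:
        results |= peer.search(word)
    return sorted(results)


documents = {
    0: "big data needs distributed systems",
    1: "search engines index the web",
    2: "distributed search engines have many peers",
}
peers = build_peers(documents, num_peers=3)
print(search_all(peers, "distributed"))  # [0, 2]
print(search_all(peers, "engines"))      # [1, 2]
```

Each peer indexes only its own documents, and a query is answered by asking every peer and merging the answers, so no single peer needs to hold the whole index.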
