To start one’s journey into the world of Data Science and Analytics, it behooves one to gain an understanding of the most touted terms in this field. The paper Data Science and Distributed Intelligence provides a good explanation of the often-used terms Big Data, Data Science and MapReduce. Here is my synopsis of these terms, based on my reading of the paper:
Big Data refers to the proliferation of very large amounts of data, both the data sitting in huge databases and the very high-rate streaming data that feeds them. Examples include web server logs generated by user clicks, cell phone logs, Twitter feeds and Facebook comments. An interesting, often-quoted statistic is that 90% of Big Data was produced in the last two years alone.
Data Science is a term that describes the process of gathering data, analyzing it and extracting information from it to produce a “data product”. Such data may involve large unstructured data sets, which is why the term ‘Data Science’ is often used in conjunction with Big Data. Processing techniques for Big Data often draw on the map and reduce operations from functional programming, and this has given rise to the MapReduce programming model, popularized by Google.
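To make the map and reduce operations mentioned above a little more concrete, here is a minimal sketch in plain Python. The numbers and variable names are made up for illustration and do not come from the paper; the point is only that map transforms each element independently, while reduce folds many values into one.

```python
from functools import reduce

# Hypothetical per-page click counts, invented for this example.
clicks_per_page = [3, 5, 2, 7]

# map: apply the same function to every element independently.
# Because no element depends on another, this step is easy to parallelize.
squared = list(map(lambda n: n * n, clicks_per_page))

# reduce: fold the mapped values into a single result,
# starting from an initial accumulator of 0.
total = reduce(lambda acc, n: acc + n, squared, 0)

print(squared)
print(total)
```

The independence of the map step is what makes this style of processing attractive for very large data sets.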
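MapReduce is a programming model (often loosely described as an algorithm) that helps mitigate the problem of processing Big Data and other large data sets. It breaks the processing down into smaller, more manageable chunks that can be handled independently, typically in parallel across many machines, and then combines the partial results.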
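The classic illustration of this idea is counting words. What follows is a single-machine sketch, not how any real framework is implemented: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample chunks are my own assumptions, and a real system would run the map and reduce steps on different machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values collected for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(chunks):
    # Each chunk is mapped independently; here we simply loop over them.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into small chunks.
chunks = ["big data big", "data science", "big science"]
counts = word_count(chunks)
print(counts)
```

Splitting the input into chunks up front is the design choice that matters here: no chunk needs to know about any other until the grouping step, so the expensive part of the work can be spread out.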
Hadoop is a popular open-source implementation of MapReduce.