Data Science Reading List

Based on a discussion in the LinkedIn Data Science group and my own experience, here are some articles and books that belong on a good Data Science reading list for programmers and developers. Many of them are available on the Safari O’Reilly website:

Introduction

Big Data and Hadoop

Data Mining, Analysis and Statistics

Machine Learning

Data Visualization

Data Science, Big Data and MapReduce

To start one’s journey into the world of Data Science and Analytics, it behooves one to gain an understanding of the most touted terms in this field. The paper Data Science and Distributed Intelligence provides a good explanation of the often-used terms Big Data, Data Science and MapReduce. Here is my synopsis of these terms, based on a reading of this paper:

Big Data refers to the proliferation of large amounts of data in huge databases, and to the very high-rate streaming data that produces it. Examples include web server logs generated by user clicks, cell phone logs, Twitter feeds and Facebook comments. An interesting fact is that 90% of Big Data was produced in the last 2 years alone.

Data Science is a term that describes the process of gathering data, analyzing it and extracting information from it to produce a “data product”. Such data may involve large unstructured data sets, hence the term ‘Data Science’ is often used in conjunction with Big Data. Since the processing techniques for Big Data often build on the functional programming operations map and reduce, this has given rise to the MapReduce programming model, popularized by Google.
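For programmers who have not met the functional map and reduce operations before, here is a minimal Python sketch of the two primitives on their own. The sample log lines and the variable names are invented for illustration and do not come from the paper.

```python
from functools import reduce

# Hypothetical sample data: a few web server log lines.
log_lines = [
    "GET /index.html 200",
    "GET /about.html 200",
    "POST /login 302",
]

# map: apply the same function to every record independently.
line_lengths = list(map(len, log_lines))

# reduce: fold the mapped values into a single result.
total_chars = reduce(lambda acc, n: acc + n, line_lengths, 0)

print(line_lengths, total_chars)
```

Because the map step treats each record independently, it is the part that can be spread across many machines; the reduce step then combines the partial results.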

MapReduce is a programming model that helps mitigate the problem of processing Big Data and large data sets by breaking down the processing into smaller, more manageable chunks that can be handled in parallel.
Hadoop is the popular open source implementation of MapReduce.
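To make the idea of breaking the work into chunks concrete, here is a rough single-machine sketch of a MapReduce-style word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample chunks are my own assumptions, chosen only to mirror the usual map, shuffle and reduce stages.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(pairs):
    """Group emitted values by key so each reduce call sees one word."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input, already split into small chunks.
chunks = ["big data big", "data science", "big science"]
counts = word_count(chunks)
print(counts)
```

A framework such as Hadoop would take care of distributing the chunks, moving the grouped pairs between machines and recovering from failures; the sketch only shows the shape of the computation.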