“Data Science is both an art and a science. The science accords means to measure accuracy, significance, and data manipulation strategies. Art infuses creative concepts on problem solving techniques.” - Chris Orwa

If you tour the World Wide Web today, you’ll come across tonnes of Data Science material: books, articles, online tutorials, you name it. It can all get a little overwhelming and confusing, and finding the right book to help you understand the basic concepts can be daunting enough that you may opt not to dig in at all.

I decided to do some research to find at least 10 of the most helpful books every aspiring Data Scientist should lay their hands on. To that end, I spoke with other professionals in the industry to discover the Data Science books most relevant to data scientists or anyone seeking a better understanding of the field.

Success in data science is driven less by the ability to build prediction models in Python or R than by knowledge of the subject. You must have sound knowledge of how things are done and of which algorithms, tools and techniques are being used.

One of the ways to get this knowledge, and the confidence to start off in the field, is by reading books. From my findings, I’ve put together a mix of technical and non-technical books for you. Do note that the reviews and accounts are those of the two experts I spoke to: the one and only Chris Orwa (Black Orwa), Head of AI at StepWise, and the widely read and experienced Ben Mainye of Africa’s Talking.

List of 10 Must Have Data Science Books

1) Superforecasting: The Art and Science of Prediction By Philip E. Tetlock and Dan Gardner

I’ve heard that no list of forecasting books is complete without reference to Superforecasting, The Art and Science of Prediction. This is definitely going into my ‘to-read’ library this year.

“…Superforecasting has been my favorite Data Science book so far! Not just for algorithms, but for providing concepts on how to think and become a great forecaster. Its basis is Philip Tetlock’s experience running the Good Judgment Project.

The American Intelligence community had become wary of their ability to predict world events. They had missed the 9/11 attacks and the Shah’s overthrow in Iran, and had incorrectly identified WMD in Iraq. So they turned to social scientist Professor Philip E. Tetlock for help.

Philip Tetlock had done a research project in which he monitored the predictions of political pundits. When he tallied the data, it showed that political pundits were no better than the general public at predicting geopolitical events. It is this research that caught the eye of IARPA (Intelligence Advanced Research Projects Activity). IARPA needed a similar analysis of their intelligence analysts.

The result was the Good Judgment Project. Philip Tetlock set up an experiment where ordinary citizens with access only to public information could compete against CIA analysts with access to confidential information in predicting geopolitical events.

Guess what? After the first two years of the experiment, the citizen team, with their little training, were 20 percent more accurate in predicting world events than the intelligence community. This outcome brought to the fore the conclusion that good analysis comes from good thinking rather than from more or privileged data. As a Data Scientist, this was a wake-up call to debunk the notion that Big Data leads to better prediction.

The book proceeds to include interviews with top forecasters, whom Philip refers to as superforecasters. They are the subject of the book. In presenting the theory of superforecasters, Tetlock also introduces the reader to the Brier Score, a metric that measures the gap between forecasts and reality for each person. The Brier Score keeps tabs on how accurate a person’s predictions are over time.

In Machine Learning, there’s the temptation to build a model and assume it will work in all circumstances. As such it would be fantastic to include a Brier Score to know how a model performs every time it makes a real-world prediction.

Over and above the technicalities, Tetlock also tackles the personalities of great forecasters. Most people who made accurate predictions were not experts in those fields. They relied on good research to reach conclusions. Experts sometimes ignore research and rely on their experience, which becomes a pitfall.

The amateurs were ready to make mistakes, while experts most times assume that making mistakes is a sign of being less knowledgeable. Overall, the book is full of data science nuggets. You will learn of the origin of Randomized Controlled Trials (RCT) in medicine and of the German army command structure in WW2 that made it highly effective (Auftragstaktik). In the end, the book helps to tie thinking and problems together, a concept forgotten while running algorithms.” - Chris Orwa
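
For the curious, the Brier Score mentioned in the review above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the common two-outcome form of the score: the mean squared difference between the probability a forecaster gave and what actually happened (1 if the event occurred, 0 if it did not). In this form 0 is a perfect score and 1 is the worst possible; other formulations use a different scale. The function name and the sample forecasts are made up for illustration and do not come from the book.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and 0/1 outcomes.

    forecasts: probabilities between 0 and 1, one per event
    outcomes:  1 if the event happened, 0 if it did not
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need the same, non-zero number of forecasts and outcomes")
    squared_gaps = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical track record over five events (illustrative numbers only).
outcomes = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]  # commits to a view and is mostly right
hedger = [0.5, 0.5, 0.5, 0.5, 0.5]     # always says "fifty-fifty"

print("confident forecaster:", round(brier_score(confident, outcomes), 3))
print("fifty-fifty hedger:  ", round(brier_score(hedger, outcomes), 3))
```

The lower score goes to the forecaster who commits to well-calibrated probabilities, while always answering "fifty-fifty" earns 0.25 no matter what happens. The same function could be pointed at a model's predicted probabilities each time a real-world outcome comes in, which is the kind of ongoing check the review has in mind.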

2) The Signal and the Noise: Why So Many Predictions Fail but Some Don’t. By Nate Silver

This is one of the most highly recommended books online. I’ve had this book for the longest time and it’s about time I started on it and finished it. If you’re one of those people who don’t enjoy the mathematics behind data science, this book is for you!

“…This is possibly my second-best Data Science book. Nate Silver is an Economist who made a career performing statistical analysis on baseball, otherwise known as Sabermetrics. In 2008, he turned his interest to politics and made accurate predictions for all States in the US except for one. He writes to give advice on how to make good predictions.

The book has overlapping concepts with Tetlock’s Superforecasting book. It talks about the pitfalls of Big Data and how political pundits make poor predictions. Nate’s book also adds information on how he was able to make accurate baseball predictions. For a statistical nerd, the details on determining a player’s performance are gold! In it, you will learn about PECOTA, the algorithm Nate Silver developed while working at KPMG to forecast baseball players’ performance.

Nate Silver now runs an amazing data journalism website, FiveThirtyEight.” - Chris Orwa

3) The Quants: How a New Breed of Math Whizzes Conquered Wall Street and Nearly Destroyed It. By Scott Patterson

Quants are quantitative analysts. The Quants is suited to people with a non-maths background, or to a manager, executive or data analyst who is interested in learning how to make decisions using numbers and analysis rather than intuition.

“… Once upon a time, I made my living from trading currencies. During this period, I came across this book. It talks about how probability theory was first applied to trading and used to beat the market.

Ed Thorp, a mathematician (PhD) who had applied Brownian motion to blackjack, experimented with the same concept on price volatility and hit the jackpot.

During this period, it was believed that it was impossible to ‘beat the market’, a phrase that follows from the Efficient Market Hypothesis (EMH). EMH states that the current price of a stock factors in all available information, making it impossible to earn above-average returns. Using his model and his ability to predict volatility, Thorp realized that many stocks appeared to be mispriced. He had stumbled upon a gold mine full of arbitrage opportunities. He could now short overpriced stocks.

The book ends with the 2008 market crash that was ostensibly created by quants.” - Chris Orwa

4) The Black Swan: The Impact of the Highly Improbable. By Nassim Nicholas Taleb

This is another non-technical book about unpredictable events where you’ll get to learn the limits of statistical methods.

“…Black Swan explores the limits of statistics. Nassim Taleb, an ex-quant, develops a brilliant idea about certain events that are impossible to predict, e.g. the 9/11 attacks. He refers to such an event as a black swan, in line with the belief among Europeans that all swans were white until they discovered black swans in Australia. Using this metaphor, Taleb dives into life events where statistics fail.

He has other books that complement this title. They are:

  • Fooled by Randomness
  • Antifragile
  • The Bed of Procrustes
  • Skin in the Game

The Black Swan is important in helping Data Scientists understand that we cannot solve all problems with statistics. This could be because of an inadequate understanding of the problem, or because the event lies too far out in the future.

Taleb builds a good concept of Mediocristan and Extremistan, where he critiques the bell curve and how quants apply it to every scenario. He writes, ‘Consequently, if we are in the domain of Extremistan, and we use analytical tools from Mediocristan for prediction, say risk management, we can face enormous surprises. Some of these surprises may be positive and some negative. Their impact will however most likely exceed what we are prepared for.’” - Chris Orwa

5) Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. By Cathy O’Neil

Weapons of Math Destruction has been labeled as being captivating and insightful. It talks about the increasing influence of machine learning to control the news we see, the jobs we can get and the politicians we vote for. Also a must read for me.

“…The book reads like a continuation of Ted Kaczynski’s manifesto ‘The Industrial Society and Its Future’.

Cathy focusses on machine learning and its use in coercing behaviour change as well as in discriminating against the poor and disadvantaged. From the examples provided in the book, there are three categories of Weapons of Math Destruction (WMDs):

The first WMD, Poor Statistics - These are incorrectly calculated stats used to infer human behaviour and performance. Behind them is a lack of understanding of how to interpret or validate certain statistics. A good example is proxy variables, such as geography used to infer purchasing power, reoffending propensity, et cetera.

The second WMD, Misused Correct Statistics - These seem to make up the majority of cases of WMDs. This is more of an ethical issue than one of machines taking over human lives. For instance, when a company uses zip codes to steer customers to high-interest loans, that qualifies as unethical use of machine learning output, and not necessarily anything wrong with the machine learning processes themselves.

The last WMD, Dataset - According to the book, certain attributes within data should never be used for prediction purposes, for instance race, gender, income and zip code. This is because they are likely to correlate with outputs connected with discrimination.” - Chris Orwa

Here are two other non-technical books that I thought you should also have:

6) Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. By Eric Siegel

In Predictive Analytics, Eric Siegel, a renowned expert in data analytics and former professor at Columbia University, explains to us how data scientists use data to help predict anything - from what you will buy, to where you will travel to when you’re likely to quit your job and much more.

This was one of the first non-technical data science books that I got for myself. Sadly, I’m yet to finish it, though I still highly recommend it to anyone who really wants to understand what data science is all about. The book contains a plethora of real-world examples. These examples can be generalized into a number of different applications throughout a company and have a ton of relevance to multiple business departments.

7) Storytelling with Data: A Data Visualization Guide for Business Professionals. By Cole Nussbaumer Knaflic

Storytelling with Data teaches you the fundamentals of data visualization and how to communicate effectively with data. You’ll discover the power of storytelling and the way to make data a pivotal point in your story. The lessons in this illuminative text are grounded in theory, but made accessible through numerous real-world examples, ready for immediate application to your next graph or presentation.

Another book I’ll recommend to those interested in finding insights from data through data visualization. Another ‘to read’ in 2018!

Mathematical/Technical Books

8) OpenIntro Statistics. By David Diez, Christopher Barr, and Mine Çetinkaya-Rundel

Want to start getting your hands dirty with Statistics? Then I highly recommend this book. All the source code that went into making this book is freely accessible online, and all you need is some basic skills in R to run it.

“…This book is available for free at leanpub.com

It was used as a book resource for the course Data Analysis and Statistical Inference on coursera.org. It begins by giving you an overview of data, probability and inferential techniques. The topics range from numerical and categorical data to linear, multiple and logistic regression, with numerous examples to walk you through the material.

The book and the course emphasize the need to learn, for instance, how to work through a hypothesis test by hand and think about the problem, as well as how to code it in a programming language called R.

The team made a whole package where you can practice the concepts they cover which is available for download here https://www.rdocumentation.org/packages/openintro/versions/1.7.1.

I learned statistics and, more importantly, to be more data curious. This laid the foundation for me to question a lot of things and to do a lot of observational studies and experiments. I recommend it if you are starting out, doing computer science, data science or genomic data science, or are just curious.” - Ben Mainye

9) Introduction to Machine Learning with Python: A Guide for Data Scientists. By Andreas Müller and Sarah Guido

This is a perfect book to get introduced to supervised and unsupervised machine learning algorithms using Python so as to make pretty good predictions. It will come in handy if you’re planning on getting into Kaggle competitions.

“…As the title of the book says, it is a guide for data scientists.

The authors go through machine learning algorithms and concepts using examples, with an emphasis on data visualization to understand how models such as support vector machines and K-nearest neighbours draw decision boundaries. As a bonus, they discuss the strengths and weaknesses of several algorithms. You don’t need to be a pro at machine learning. You can just pick it up and start building your own models. I learned how machine learning algorithms work and how to implement them in my own work.

I’ve always used this book as a reference for all the work that I’ve done so far. If you don’t believe me, check my GitHub repository. I recommend it because it was my favorite 2017 book and it has been my handbook while doing competitions on Kaggle and DrivenData.

Andreas Müller and Hugo Bowne-Anderson also made a course, Supervised Learning with scikit-learn, which pumped me up further to get the book. It’s here: https://www.datacamp.com/courses/supervised-learning-with-scikit-learn.” - Ben Mainye

10) Deep Learning with Python. By François Chollet.

Does deep learning tickle your fancy? Then this is a good book to get you started on building deep learning models using Keras, which is a high-level neural network API written in Python. The author of this book is the creator of Keras. You can get this book here.

“… François uses examples wrapped with theory to teach Deep Learning.

He goes through deep learning in parts: Part One focuses on fundamentals of machine learning where you’ll learn the basics of machine learning experiments and how they’re transferable to other areas.

Part Two focuses on the practicals where you’ll apply the knowledge you’ve gained in the first part to real world datasets and new concepts are also presented – you’ll code a lot here.

I like the arrangement of the book, the practical exercises and the advice he gives as you go through the book. I learned to structure my machine learning experiments better and to tune hyperparameters better, especially for neural networks. Plus, I enjoyed chapters 7, 8 and 9 the most.

I recommend getting it because the advice the author gives about machine learning and deep learning is priceless. He even guarantees that you’ll become a Keras expert after reading his book. So get it!” - Ben Mainye

Conclusion

This article has outlined some of the data science books that you need to have in your library to get you started with data science, machine learning and deep learning as well. If you don’t yet have your hands on any of these books then what are you waiting for?

Should you have any thoughts to add on the books mentioned, or have other books that you’d like to recommend, you can comment below or find me on Twitter @categitau_. You can get some of these books at Prestige Bookshop in Nairobi, Kenya, or on Amazon.

I’d like to give credit to Chris Orwa and Ben Mainye for taking their time to give reviews of the books mentioned and the awesome Hazel Apondi for helping out with the editing of this article.