Think about your data intelligently and ask the right questions

Key Features
Master data cleaning techniques necessary to perform real-world data science and machine learning tasks
Spot common problems with dirty data and develop flexible solutions from first principles
Test and refine your newly acquired skills through detailed exercises at the end of each chapter

Book Description
Data
cleaning is the all-important first step to successful data
science, data analysis, and machine learning. If you work with any
kind of data, this book is your go-to resource, arming you with the
insights and heuristics experienced data scientists had to learn
the hard way. In a light-hearted and engaging exploration of
different tools, techniques, and datasets, both real and fictitious,
Python veteran David Mertz teaches you the ins and outs of data
preparation and the essential questions you should be asking of
every piece of data you work with. Using a mixture of Python, R,
and common command-line tools, Cleaning Data for Effective Data
Science follows the data cleaning pipeline from start to end,
focusing on helping you understand the principles underlying each
step of the process. You'll look at data ingestion of a vast range
of tabular, hierarchical, and other data formats, impute missing
values, detect unreliable data and statistical anomalies, and
generate synthetic features. The long-form exercises at the end of
each chapter let you get hands-on with the skills you've acquired
along the way, also providing a valuable resource for academic
courses.

What you will learn
Ingest and work with common data formats like JSON, CSV, SQL and NoSQL databases, PDF, and binary serialized data structures
Understand how and why we use tools such as pandas, SciPy, scikit-learn, Tidyverse, and Bash
Apply useful rules and heuristics for assessing data quality and detecting bias, like Benford's law and the 68-95-99.7 rule
Identify and handle unreliable data and outliers, examining z-score and other statistical properties
Impute sensible values into missing data and use sampling to fix imbalances
Use dimensionality reduction, quantization, one-hot encoding, and other feature engineering techniques to draw out patterns in your data
Work carefully with time series data, performing de-trending and interpolation

Who this book is for
This book is designed to benefit software developers,
data scientists, aspiring data scientists, teachers, and students
who work with data. If you want to improve your rigor in data
hygiene or are looking for a refresher, this book is for you. Basic
familiarity with statistics, general concepts in machine learning,
knowledge of a programming language (Python or R), and some
exposure to data science are helpful.