  • Create Date October 12, 2020
  • Last Updated October 12, 2020

R for Data Science

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, by Hadley Wickham and Garrett Grolemund

Learn how to use R to turn raw data into insight, knowledge, and understanding. This book introduces you to RStudio and the tidyverse, a collection of packages designed to work together to make data science fast and fun. Suitable for readers with no previous programming experience, R for Data Science is designed to get you doing data science as quickly as possible.

Authors Hadley Wickham and Garrett Grolemund guide you through the steps of importing, wrangling, exploring, and modeling your data and communicating the results. You will get a complete, big-picture understanding of the data science cycle, along with the basic tools you need to manage the details.

You will learn how to:

Wrangle - transform your datasets into a form convenient for analysis.

Program - learn powerful R tools for solving data problems with greater clarity and ease.

Explore - examine your data, generate hypotheses, and quickly test them.

Model - provide a low-dimensional summary that captures true “signals” in your dataset.

Communicate - learn R Markdown for integrating prose, code, and results.

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of R for Data Science is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you’ll have the tools to tackle a wide variety of data science challenges, using the best parts of R.

What You Will Learn

Data science is a huge field, and there’s no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. Our model of the tools needed in a typical data science project looks something like this:

First, you must import your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!
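As a minimal sketch of the import step, the snippet below loads a small CSV file into a data frame. It uses only base R so that it runs without extra packages; the book itself teaches the tidyverse tools for this job. The file contents, the column names, and the variable names `csv_path` and `cities` are invented for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The cities and figures below are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "city,year,population",
  "Springfield,2019,30000",
  "Springfield,2020,30500",
  "Shelbyville,2020,21000"
), csv_path)

# Import: read the file from disk into a data frame.
cities <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure (column names and types) of what was loaded.
str(cities)
```

In a real project `csv_path` would point at your own file, and a database or web API would need a different reader, but the result is the same: a data frame you can work with in R.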

Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.

Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
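To make the idea concrete, here is a small hedged sketch of tidying in base R. The data and the names `wide` and `tidy` are made up for illustration. The reshaping is done by hand here for transparency; in the tidyverse this is the job of the tidyr package.

```r
# Untidy: one row per city and one column per year, so the variable
# "year" is hidden in the column names. All values are invented.
wide <- data.frame(
  city = c("Springfield", "Shelbyville"),
  pop_2019 = c(30000, 20500),
  pop_2020 = c(30500, 21000)
)

# Tidy: each column is a variable (city, year, population) and each
# row is one observation (one city in one year).
tidy <- data.frame(
  city = rep(wide$city, times = 2),
  year = rep(c(2019, 2020), each = nrow(wide)),
  population = c(wide$pop_2019, wide$pop_2020)
)

print(tidy)
```

Once the data is in this shape, questions such as "what was the population of each city in 2020?" become a simple row filter rather than a hunt through column names.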

Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight!
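The three kinds of transformation can be sketched in a few lines of base R. The data frame `trips` and its columns are hypothetical, and the tidyverse offers dedicated functions for each step; this is only an illustration of the ideas.

```r
# Made-up trip records for two cities.
trips <- data.frame(
  city = c("A", "A", "B", "B"),
  distance_km = c(10, 30, 15, 80),
  time_h = c(0.5, 1, 0.5, 2)
)

# 1. Narrow in on observations of interest: trips in city "A" only.
city_a <- trips[trips$city == "A", ]

# 2. Create a new variable as a function of existing ones
#    (here, speed from distance and time).
trips$speed_kmh <- trips$distance_km / trips$time_h

# 3. Calculate summary statistics: mean speed and trip count per city.
mean_speed <- aggregate(speed_kmh ~ city, data = trips, FUN = mean)
trip_counts <- table(trips$city)

print(mean_speed)
print(trip_counts)
```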

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

Visualization is a fundamentally human activity. A good visualization will show you things that you did not expect, or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question, or you need to collect different data. Visualizations can surprise you, but don’t scale particularly well because they require a human to interpret them.

Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
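As a small illustration of a model answering a precise question, the sketch below fits a straight line with base R's `lm()`. The data frame `study` and its values are invented. Note the built-in assumption: the model presumes the relationship is linear and will report a slope whether or not that is true, which is exactly why a model cannot question its own assumptions.

```r
# Made-up data: hours studied and the resulting test score.
study <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 55, 61, 64, 68)
)

# Precise question: how much does the score change per extra hour?
# Fit a linear model, which assumes a straight-line relationship.
fit <- lm(score ~ hours, data = study)

# The fitted intercept and slope are a low-dimensional summary of the data.
print(coef(fit))

# Use the model to predict the score for a new value of hours.
predicted <- predict(fit, newdata = data.frame(hours = 6))
print(predicted)
```

Plotting `score` against `hours` before trusting the fit is the kind of check that visualization provides and the model alone does not.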

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualizations have led you to understand the data unless you can also communicate your results to others.

What You Won’t Learn

There are some important topics that this book does not cover. We believe it’s important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can’t cover every important topic.