- Create Date October 12, 2020
- Last Updated May 23, 2021
Data Science with R: A Step By Step Guide With Visual Illustrations and Examples
INTRODUCTION TO DATA MINING
The evolution of technology helped the internet expand at lightning speed. Over time, internet access became available to more and more people. This led to the development of millions of websites and to the use of databases for storing their data. The creation of the first commercial and social web pages created the need to store and manage large amounts of data.
Today, the amount of available data is huge and is growing exponentially. The need to minimize the cost of collecting and storing these data was one of the biggest reasons for the growth of this scientific field.
The huge amount of data stored in databases and data warehouses cannot be used as is. To draw useful conclusions, the data must first be structured. In this chapter we will look at the fundamental stages of extracting valuable and usable information from data.
INTRODUCTION TO R
R is not just a programming language but a development environment as well. It is quite popular and is mostly used for statistical calculations, for creating graphs, and for processing and analyzing data during Data Mining.
The development of R was based on the S programming language, created by John Chambers. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. Over recent years it has become very popular and is now developed by a team known as the R Development Core Team.
Some of the reasons that made R so popular are its ease of learning, its compatibility with the best-known operating systems (Linux, Mac OS and Windows), the plethora of packages with well-written manuals and, last but not least, the fact that it is completely free.
INTRODUCTION TO DATA SCIENCE
Data Science is a new term, which came to replace earlier terms such as Knowledge Discovery in Databases or Data Mining.
All three terms describe a semi-automated process whose main purpose is to analyze a huge volume of data about a specific problem in order to find patterns, drawing on scientific fields such as Statistics, Machine Learning and Pattern Recognition.
Those patterns, found in multiple forms such as associations, anomalies, clusters and classes, constitute structures or instances that appear in the data and are statistically significant.
One of the most important aspects of Data Science is finding or recognizing these patterns (recognizing means that the patterns were not expected in advance) and evaluating them. A pattern should show signs of organization across its structure. These patterns, also known as models, can most of the time be tracked through measurable features or attributes extracted from the data.
Data Science is a new science, which appeared at the end of the 1980s and grew gradually. During this era, Relational Databases were at their zenith. They served the data storage needs of companies and organizations, helping them organize and manage their data so that the mass queries needed for day-to-day operations could be executed faster.
These Database Management Systems (DBMS) followed the so-called OLTP (OnLine Transaction Processing) model, whose purpose is to process transactions. These tools allowed users to find answers to questions they already knew how to ask, or to create reports.
The need to make better use of the data created by these systems, the systems that served the daily needs of a company, led to the development of OLAP (OnLine Analytical Processing) tools. OLAP tools made it easier to answer more advanced queries, allowing larger, Multidimensional Databases (MDB) to work faster and to provide data visualization. Because of this visualization, OLAP tools could also be called data exploration tools. They allowed users (sales managers, marketing managers etc.) to recognize new patterns, but the discovery still had to be made by the user.
For example, a user could query the total revenue generated by each of a company's stores within a country, in order to find the stores with the lowest revenue. The automation of pattern recognition came through methodologies and tools developed in the field of Data Science. With these solutions, pattern recognition was guided by the final goal. For example, if a user wanted a report on the stores that generated the lowest revenue last month, they could ask the system to find various useful insights about store revenue.
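The store revenue query described above can be sketched in a few lines of Python. This is only an illustration, not how any particular OLAP tool works: the `SALES` records, the store names and the helper functions `total_revenue_by_store` and `lowest_revenue_stores` are all hypothetical, invented here to show the aggregate-then-rank idea.

```python
from collections import defaultdict

# Hypothetical sales records: (store, month, revenue). Illustrative values only.
SALES = [
    ("Store A", "2020-09", 120.0),
    ("Store A", "2020-10", 80.0),
    ("Store B", "2020-09", 60.0),
    ("Store B", "2020-10", 40.0),
    ("Store C", "2020-09", 150.0),
    ("Store C", "2020-10", 170.0),
]


def total_revenue_by_store(sales):
    """Sum revenue per store (the aggregation step of the query)."""
    totals = defaultdict(float)
    for store, _month, revenue in sales:
        totals[store] += revenue
    return dict(totals)


def lowest_revenue_stores(sales, n=1):
    """Return the n stores with the lowest total revenue, lowest first."""
    totals = total_revenue_by_store(sales)
    return sorted(totals.items(), key=lambda item: item[1])[:n]


if __name__ == "__main__":
    print(lowest_revenue_stores(SALES, n=2))
```

Here the user still has to decide which question to ask (which grouping, which ranking); that is the manual discovery step that Data Science tools aim to automate.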
The growth of Data Science came gradually and was directly associated with the ability to collect and record huge amounts of data of different types, made possible by the rapid expansion of fast web infrastructures on which commercial applications could rely.
One of the first companies that embraced this advancement was Amazon, which started by selling books and other products and then created a user-friendly system for recommending related products. This system was built and adjusted based on user interactions, using a method called Collaborating
Data are generated continuously, around the clock, by a huge range of human activities: shopping carts, medical records, social media posts, banking and stock market operations and so on. These data come in a wide variety of types (images, videos, real-time data, DNA sequences etc.) and with different acquisition times. If some of these data are not analyzed immediately, they may be difficult to store and process later, and this challenge gave rise to a new scientific field known as Big Data. The goal of Data Science is to address the needs created by this new environment and to provide solutions for the scalable and efficient processing of out-of-core data.
Methods and tools for this purpose have already been developed, such as Hadoop, MapReduce, Hive, MongoDB and GraphPD. The two main goals of practical Data Science are to create models that can be used for predicting and for describing data. Prediction is about using some variables or parts of a database to estimate an unknown or future value of another attribute. Description focuses on finding comprehensible patterns that describe the data, such as clusters or groups of objects with similar attributes.
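The difference between prediction and description can be illustrated with a minimal Python sketch. The text does not prescribe any particular method, so two common ones are assumed here: an ordinary least-squares line fit for prediction, and a one-dimensional k-means (Lloyd's algorithm) for description. The function names `fit_line`, `predict` and `kmeans_1d`, as well as the sample values, are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Prediction: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Estimate the unknown attribute for a new value of x."""
    slope, intercept = model
    return slope * x + intercept


def kmeans_1d(values, centers, iterations=10):
    """Description: group 1-D values around centers (Lloyd's algorithm)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


if __name__ == "__main__":
    # Made-up training data: one known attribute (x) and the attribute to estimate (y).
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(predict(model, 5))

    # Made-up values that fall into two visibly separate groups.
    print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[0.0, 5.0]))
```

The first part uses known attributes to estimate a value that has not been observed, while the second only summarizes the structure already present in the data.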