R Programming Language
Course curriculum includes R Installation & Setting the R Environment; Variables, Operators & Data Types; Structures; Vectors, Vector Manipulation & Sub-Setting; Constants; RStudio Installation & Lists Part 1; Lists Part 2; List Manipulation, Sub-Setting & Merging List to Vector; Matrix Part 1 and Part 2; Matrix Accessing; and lots more
- Self-paced with lifetime access
- Certificate on Completion
- Access on Android and iOS App
This course is designed for software programmers, statisticians and data miners who are looking to develop statistical software using R programming. If you are trying to understand the R programming language as a beginner, this tutorial will give you enough understanding of almost all the concepts of the language, from where you can take yourself to higher levels of expertise.
Before proceeding with this course, you should have a basic understanding of Computer Programming terminologies. A basic understanding of any of the programming languages will help you in understanding the R programming concepts and move fast on the learning track.
Who this course is for:
- All graduates and pursuing students
- R Programming Language for Statistical Computing and Graphical Representation
R:
- R is a programming language
- Free software
- Statistical computing, graphical representation and reporting.
- Designed by: Ross Ihaka and Robert Gentleman; developed at the University of Auckland
- Derived from S and S-plus language (commercial product)
- Typing discipline: Dynamic
- Stable release: 3.5.2 ("Eggshell Igloo") / December 20, 2018
- First appeared: August 1993
- License: GNU GPL
- Functional based language
- Interpreted programming language
- Distributed by CRAN (Comprehensive R Archive Network)
- Open source product (R-Community)
- Functions are available as a package
- Default packages are already attached to the R-console eg base, utils, stats, graphics etc
- Attach the package to the R-application
- Install Add-on packages from CRAN Mirrors.
Write a program to print HELLO WORLD in C language:
#include <stdio.h>
int main(void)
{
printf("HELLO WORLD\n");
return 0;
}
Write a program to print HELLO WORLD in Java:
class Hello
{
public static void main(String args[])
{
System.out.println("HELLO WORLD");
}
}
Write a program to print HELLO WORLD in R:
print("HELLO WORLD")
NOTE: R programming language is very simple to learn compared to traditional programming languages (C, C++, C#, Java).
How to Download & Install R:
- Go to the official website of R, i.e., www.r-project.org
- (or)
- Search "R" in Google and click on first link (The R Project for Statistical Computing).
- Click on "Download R".
- Click on any one of the CRAN Mirrors. Eg: https://cloud.r-project.org
- Click on Download R for Windows.
- Click on Install R for the first time.
- Finally click on Download R 3.5.1 for Windows (32/64 bit).
Setting R Environment:
- R comes with a lot of packages.
- By default only some packages will be attached to the R environment.
- search()
- Displays the currently attached packages
- installed.packages()
- Displays the installed packages in the machine
- library(package name) / require(package name)
- Attaches the packages to the R application
- install.packages("package name")
- Installs the add-on packages from CRAN
- detach(package:package name)
- Detaches the packages from the R environment
Package - Help
- library(help="package name")
Function - Help
- help(function name)
- or
- ?function name
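Ex: A minimal session tying these together (the package name "stringr" is only an example; any CRAN package works the same way):
search()                    # packages currently attached
install.packages("stringr") # install an add-on package from a CRAN mirror
library(stringr)            # attach it to the R application
library(help="stringr")     # package-level help
?str_detect                 # function-level help
detach(package:stringr)     # detach it from the R environment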
Comments in R:
==============
--> A single-line comment is written using # at the beginning of the statement.
# Comments are like helping text in your R Program
--> A multi-line comment can be written by placing it inside an if(FALSE) block:
if(FALSE) {
"We put such comments inside, either
single or double quote" }
Variable Assignment:
===================
1. print()
2. cat()
print():
-------
--> print() function is used to print the value stored in a variable
Ex:
a <- 10
print(a)
cat():
-----
--> cat() function is used to combine multiple items into continuous printed output.
Ex:
a <- "DataHills"
cat("Welcome to ", a)
Datatype of a Variable:
=======================
1. typeof()
2. class()
3. mode()
1. typeof(var_name/value)
-------------------------
--> typeof determines the (R internal) type or storage mode of any object
Ex:
typeof(a)
typeof(10)
2.class(var_name/value)
-----------------------
--> R possesses a simple generic function mechanism which can be used for an object-oriented style of programming.
--> Method dispatch takes place based on the class of the first argument to the generic function.
Ex:
class(a)
class(10)
3. mode(var_name/value)
-----------------------
--> Get or set the type or storage mode of an object.
Ex:
mode(a)
mode(10)
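The three functions agree for many values but differ in detail; a small sketch of the differences:
a <- 10
typeof(a)   # "double"  - internal storage type
class(a)    # "numeric" - what method dispatch sees
mode(a)     # "numeric" - S-compatible storage mode
typeof(10L) # "integer"
class(10L)  # "integer"
mode(10L)   # "numeric"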
Displaying & Deleting Variables in R:
=====================================
1. ls()
2. rm()
1. ls():
--------
--> ls() function is used to display all the variables currently available in the R environment.
Ex:
ls()
--> ls() function can also list only the variables whose names match a pattern, by using the pattern argument.
Ex:
# Display the variables starting with the pattern "a"
ls(pattern="a")
--> ls() function is also used to display hidden variables, i.e., the variables starting with dot (.), by using all.names=TRUE.
Ex: Display the variables which are hidden
ls(all.names=TRUE)
--> rm() function is used to delete a variable.
Ex:
rm(a)
--> rm() function is also used to delete all the variables by using rm() and ls() function together.
Ex: Remove all the variables at a time
rm(list=ls())
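Ex: A short session putting ls() and rm() together (the variable names are just examples):
a <- 10
b <- 20
.hidden <- 30
ls()               # "a" "b"   (.hidden is not shown)
ls(all.names=TRUE) # ".hidden" "a" "b"
rm(a)
ls()               # "b"
rm(list=ls())      # removes all visible variables at a time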
Structures/Objects in R:
========================
1. Vectors
2. Lists
3. Matrices
4. Data Frames
5. Arrays
6. Factors
Vectors:
========
--> Single dimensional object with homogeneous data types.
--> To create a vector use the function c()
--> Here "c" means combine
# if we try like this, it gives an error
a <- 10,20,30,40   # Error: unexpected ','
# then combine all these values by using c()
a <- c(10,20,30,40)
# to check the internal storage of a
typeof(a)
# to check the internal storage of each value in a
lapply(a,FUN=typeof)
sapply(a,FUN=typeof)
or
lapply(a,typeof) # list of values
sapply(a,typeof) # vector of values
--> Vectors are the most basic R structures/objects
--> The types of atomic vectors are:
1. logical
2. integer
3. double
4. complex
5. character
Vector Creation:
================
--> We can create vectors with single element and multiple elements.
--> They are
1. Single Element Vector
2. Multiple Elements Vector
Single Element Vector:
======================
--> When we assign a single value into variable, it becomes a vector of length 1 and belongs to one of the above vector types.
Ex:
a <- 10          # double
b <- 20L         # integer
c <- "DataHills" # character
d <- TRUE        # logical
e <- 2+3i        # complex
Multiple Elements Vector:
=========================
--> When we assign multiple values into a variable, it becomes a vector of length n and belongs to one of the above vector types.
Ex:
a <- c(10,20,30,40,50)
b <- c(20L,40L,60L,80L)
c <- c("Srinivas","DataHills","DataScience","MachineLearning")
d <- c(T,FALSE,TRUE,F,T,F)
e <- c(2+3i,4+4i,5+6i)
# Heterogeneous data type values are converted into homogeneous data type values:
a <- c(10,20,30,40,"DataHills")
Output:
"10" "20" "30" "40" "DataHills"
# The double and character values are converted into characters.
Observe with some examples:
a <- c(10L,20)           # integer + double    -> double
a <- c(T,5)              # logical + double    -> double
a <- c(2+3i,"DataHills") # complex + character -> character
a <- c(9L,30,4+5i)       # integer + double + complex -> complex
Data types have a priority, and values are coerced from lower types to higher types (highest first):
1. CHARACTER
2. COMPLEX
3. DOUBLE
4. INTEGER
5. LOGICAL
a <- c(TRUE,30,20L,2+3i,"DataHills") # all become character
a <- c(TRUE,30,20L,2+3i)             # all become complex
a <- c(TRUE,30,20L)                  # all become double
a <- c(TRUE,20L)                     # all become integer
To generate a sequence of numeric values
<Start_Value>:<End_Value>
1:10
10:1
3.5:10.5
10.5:3.5
# by using seq() function
Syntax: seq(from=VALUE,to=VALUE,by=VALUE)
Ex: seq(from=1,to=10,by=1)
seq(to=10,by=1,from=1)
seq(by=10,to=100,from=10)
seq(1,10,by=2)
seq(from=1,10,2)
seq(1,to=10,2)
seq(1,10,1)
seq(2,20,2)
seq(10,1,1) # Error
seq(10,1,-1)
seq(1,10,pi)
seq(10)
seq(-10)
seq(1:10)
# length.out --> desired length of the sequence,
'length.out' must be a non-negative number.
seq_len is much faster.
seq(length.out=10)
seq_len(10)
seq(1,10,length.out=10)
seq(1,10,length.out=5)
seq(1,10,length.out=11)
# along.with --> take the length from the length of this argument,
it generates the integer sequence 1,2,....
seq_along is much faster.
seq(along.with=10)
seq(along.with=c(20,30,40))
seq(along.with=c("Data",T,2,3,4))
seq(along.with=c("Data",T,2,3,4,5,6,7,8,9,10))
a <- seq(along.with=c("Data",T,2,3,4,5,6,7,8,9,10))
seq(along.with=a)
seq_along(a)
Vector Manipulation:
====================
a <- c(4,7,9,12,8,3)
b <- c(2,3,5,7,8,5)
length(a)
length(b)
add <- a+b
sub <- a-b
mul <- a*b
div <- a/b
# if we apply arithmetic operators to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operation.
a <- c(4,7,9,12,8,3)
b <- c(2,3)
add <- a+b # b is recycled to c(2,3,2,3,2,3)
sub <- a-b
mul <- a*b
div <- a/b
# Elements in a vector can be sorted using the sort() function.
a <- c(9,3,5,8,1,6,5)
sort <- sort(a)
rev_sort <- sort(a,decreasing=T)
a <- c("Srinivas","DataHills","Analysis","MachineLearning")
sort <- sort(a)
rev_sort <- sort(a,decreasing=TRUE)
Sub-setting the Data in Vectors:
================================
--> Extracting the required fields, rows from the R object.
vector[position/logical index/negative index/name]
---------------------------------------------------------------
a <- c("DataScience","DataAnalysis","MachineLearning","R","Python","Weka")
# Accessing vector elements using position
# Here [ ] brackets are used for indexing.
# Indexing starts with position 1.
a[3]
a[2,4] # Error
a[c(2,4)]
a[c(1,4,5)]
course <- a[c(1,4,5)]
# Accessing vector elements using negative indexing
a[-6]
a[-3,-5] # Error
a[c(-3,-5)]
a[-c(3,5)]
a[-c(4,5,6)]
course <- a[-c(4,5,6)]
# Accessing vector elements using logical indexing
a[c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)]
a[c(T,T,T,F,F,F)]
a[T]
a[F]
a[c(T,F)]
a[c(F,T)]
# Accessing vector elements using name
a <- c(a="DataScience",b="DataAnalysis",c="MachineLearning",d="R",e="Python",f="Weka")
a[2]
a[b] # Error
a["b"]
a["d","e"] # Error
a[c("d","e")]
a[c("-d","-e")] # Error
a[c(-"d",-"e")] # Error
a[-c("d","e")] # Error
Constants:
==========
R has a small number of built-in constants.
The following constants are available:
1. LETTERS: the 26 upper-case letters of the Roman alphabet;
2. letters: the 26 lower-case letters of the Roman alphabet;
3. month.abb: the three-letter abbreviations for the English month names;
4. month.name: the English names for the months of the year;
5. pi: the ratio of the circumference of a circle to its diameter.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> pi
[1] 3.141593
But it is not good to rely on these, as they are implemented as variables whose values can be changed.
> pi
[1] 3.141593
seq(1,10,pi)
> pi <- 10
> pi
[1] 10
seq(1,10,pi)
LETTERS[24]
LETTERS[2,3,4,5] # Error
LETTERS[c(2,3,4,5)]
LETTERS[seq(2,5,1)]
LETTERS[2:5]
LETTERS[c(10,11,12,13,14,15,16,17,18,19,20)]
LETTERS[10:20]
LETTERS[-10:-20]
LETTERS[-c(10:20)]
a <- c(10,20,30,40,50,60)
names(a)
names(a) <- c("A","B","C","D","E")
b <- c(70,80,90,100,110,120)
names(b)
names(b) <- LETTERS[21:26]
sales_1 <- c(100,200,300)
names(sales_1) <- c("Jan","Feb","Mar")
names(sales_1) <- month.abb # Error
names(sales_1) <- month.abb[1:3]
names(sales_1) <- month.abb[10:12]
names(sales_1) <- month.abb[c(1,5,10)]
names(sales_1) <- month.abb[seq(1,12,4)]
sales_2 <- c(100,200,300,400,150,250,350,450,120,220,320,420)
names(sales_2) <- month.abb
names(sales_2) <- month.name
RStudio:
========
--> RStudio is a free and open-source integrated development environment (IDE) for R.
--> RStudio requires R 3.0.1+. If you don't already have R, download it.
--> RStudio makes R easier to use.
--> It includes a code editor, debugging & visualization tools.
--> RStudio is a separate piece of software that works with R to make R much more user friendly and also adds some helpful features.
--> RStudio was founded by JJ Allaire.
--> RStudio is written in the C++ programming language.
--> Initial release: 28 February 2011
--> Stable release: 1.1.456 / 19 July 2018
Downloading & Installing RStudio:
===================================
--> Go to the official website of RStudio, i.e., www.rstudio.com
--> Click on RStudio Download
--> Click on RStudio Desktop Open Source License (FREE) Download
--> Click on RStudio 1.1.456 - Windows Vista/7/8/10 (85.8 MB Size)
--> The file will be downloaded automatically to our system
--> Installation is easy, it takes less than 2 min to install.
Lists:
======
--> Single dimensional object with heterogeneous data types.
--> To create a list use function list().
# Create a list containing character, complex, double, integer and logical.
a <- list("DataHills",2+3i,10,20L,TRUE)
# to check the internal storage of a
typeof(a)
# to check the internal storage of each value in a
lapply(a,FUN=typeof)
sapply(a,FUN=typeof)
or
lapply(a,typeof) # list of values
sapply(a,typeof) # vector of values
--> Lists are the R objects which contain elements of different types like
Characters
Complex
Double
Integer
Logical
Vector
Matrix
Function and
another list inside it.
# Create a list containing vectors
a <- list(c(1,2,3),c("A","B","C"),c("R","Python","Weka"),c(10000,8000,6000))
print(a)
typeof(a)
lapply(a,typeof)
# Create a list containing characters, vector, double
b <- list("DataHills","Srinivas",c(10,20,30),15.5)
print(b)
typeof(b)
lapply(b,typeof)
# Create a list containing a vector, matrix, function and list.
c <- list(c(10,20,30),matrix(c(1,2,3,4),nrow=2),search(),list("DataHills",9292005440))
print(c)
typeof(c)
lapply(c,typeof)
Naming List Elements:
=====================
--> The list elements can be given names and they can be accessed using these names.
b <- list(Name1="DataHills",Name2="Srinivas",vector_values=c(10,20,30),single_value=15.5)
print(b)
c <- list(c(10,20,30),matrix(c(1,2,3,4),nrow=2),search(),list("DataHills",9292005440))
names(c) <- c("values","mat","fun","inner_list")
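The named elements can then be accessed with $ or [[ ]]; a small sketch using the lists created above:
b$Name1              # "DataHills"
b[["vector_values"]] # 10 20 30
c$mat                # the 2x2 matrix
c[["inner_list"]]    # the inner list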
Matrix Part 1:
==============
--> Two dimensional object with homogeneous data types.
--> To create a matrix use the function matrix().
matrix(c(1,2,3,4,5,6,7,8,9,10), nrow=5)
matrix(1:10, nrow=5)
# Elements are arranged by row
matrix(1:10, nrow=5, byrow=TRUE)
# Elements are arranged by column
matrix(1:10, nrow=5, byrow=FALSE)
matrix(1:10, ncol=5, byrow=T)
# Create a matrix with row names and column names
matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(c("A","B"),c("C","D","E","F","G")))
matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(LETTERS[1:2],LETTERS[3:7]))
# To check, define, update or delete the names of rows and columns,
# we have to use the functions
rownames(var_name)
colnames(var_name)
a <- matrix(1:10, ncol=5, byrow=TRUE, dimnames=list(LETTERS[1:2],LETTERS[3:7]))
rownames(a)
colnames(a)
rownames(a) <- c("row1","row2")
colnames(a) <- c("col1","col2","col3","col4","col5")
rownames(a)
colnames(a)
print(a)
a <- matrix(1:10, ncol=5, byrow=TRUE)
rownames(a) <- LETTERS[20:21]
colnames(a) <- LETTERS[22:26]
print(a)
a <- matrix(11:20, ncol=5, byrow=TRUE)
x <- c("r1","r2")
y <- c("c1","c2","c3","c4","c5")
rownames(a) <- x
colnames(a) <- y
print(a)
a <- matrix(21:30, ncol=5, byrow=TRUE, dimnames=list(x,y))
print(a)
# Remove the row names and column names
rownames(a) <- NULL
print(a)
colnames(a) <- NULL
# Create a matrix without argument names
matrix(1:10,2)
matrix(1:10,5)
matrix(1:10,2,5)
matrix(1:10,2,5,T)
matrix(1:10,2,5,T,list(c("a","b"),NULL)) # dimnames must be a list of length 2; NULL leaves the columns unnamed
Transpose:
----------
The transpose (reversing rows and columns) is perhaps the simplest method of reshaping a dataset. Use the t() function to transpose a matrix.
a <- matrix(1:9,nrow=3)
print(a)
t(a)
a <- matrix(1:10, ncol=5)
print(a)
t(a)
a <- t(a)
Matrix Part 2:
==============
a <- matrix(1:10,2)
a <- matrix(1:10,2,T)      # T is taken as ncol=1, so it keeps only the first 2 elements (with a warning)
a <- matrix(1:10,nrow=2,T) # same result
a <- matrix(1:10,2,5,T)
# Create some matrices with heterogeneous data type elements and observe
a <- matrix(c(1,2,3,"A","B","C"),2)
print(a)
typeof(a)
a <- matrix(c("Data",2+3i,TRUE,20,30L,FALSE),3)
print(a)
typeof(a)
a <- matrix(c(TRUE,20,30L,FALSE),2)
print(a)
typeof(a)
# Create a matrix whose data length is not a multiple of the number of rows and columns
a <- matrix(1:3,3,3)
a <- matrix(1:3,3,3,T)
a <- matrix(1:5,2,3) # Warning Message
a <- matrix(1:5,2,5)
a <- matrix(1:10,2,5)
a <- matrix(1:10,3,4) # Warning Message
a <- matrix(1:10,5,4)
Dimensions of a Matrix:
-----------------------
--> Retrieve or set the dimension of an object.
--> We have to use
dim(x)
dim(x) <- value
# to check the dimensions of a matrix
dim(a)
x <- 1:12
dim(x) <- c(3,4)
print(x)
Accessing Matrix Elements:
==========================
--> Elements of a matrix can be accessed by using the row and column index (position) of the element.
a <- matrix(1:12,3)
print(a)
# Access the element at 1st column and 1st row
a[1]
a[1,1]
# Access the element at 2nd column and 3rd row
a[3,2]
# Access the elements at 2nd column, 1st, 2nd and 3rd rows
a[c(1,2,3),2]
a[1:3,2]
# Access the element at 2nd column, 1st row and 3rd row
a[c(1,3),2]
# Access the element at 1st row, 2nd & 3rd column
a[1,2:3]
a[1,c(2,3)]
# Access the element at 2nd & 3rd row, 2nd & 3rd column
a[2:3,2:3]
# Access the element at 1st & 3rd row, 1st & 3rd column
a[c(1,3),c(1,3)]
# Access only the 1st row
a[1,1:4]
a[1,]
# Access only the 3rd column
a[1:3,3]
a[,3]
# Access the element at 2nd & 3rd column, all rows except 2nd row
a[-2,2:3]
--> Elements of a matrix can be accessed by using the column and row names of the element.
rownames(a)
colnames(a)
rownames(a) <- LETTERS[1:3]
colnames(a) <- LETTERS[23:26]
a[c("A","B"),c("W","X")]
a[c("A","B","C"),c("W","Z")]
--> Elements of a matrix can be accessed by using the column and row logical index of the element.
a[c(T,F,T),c(F,T,T,F)]
a[c(F,F,F),c(T,T,T,T)] # no rows selected; only the column names are shown
a[c(T,T,F),c(T,T,T,T)]
Data Frames:
============
--> Two dimensional object with heterogeneous data types.
--> To create a data frame use the function data.frame().
Convert the columns of a data.frame to characters:
--------------------------------------------------
--> By default, data frames convert characters to factors.
--> The default behavior can be changed with the stringsAsFactors parameter
--> Here stringsAsFactors can be set to FALSE.
--> If the data has already been created, factor columns can be converted to character columns as shown below.
# Convert all columns of an existing data frame to character (the students data frame is created below)
students[] <- lapply(students, as.character)
str(students)
# Create a students data frame without stringsAsFactors
students <- data.frame(
sid=c(1:5),
sname=c("Srinu","Vasu","Nivas","Reddy","Sai"),
course=c("R","Python","Weka","DS","ML"),
fee=c(10000,8000,5000,2000,5000),
stringsAsFactors=FALSE)
print(students)
typeof(students)
class(students)
dim(students)
nrow(students)
ncol(students)
rownames(students)
colnames(students)
names(students)
str(students)
summary(students)
# creating a emp data frame
emp <- data.frame(
eid=c(101:105),
ename=c("Srinu","Vasu","Nivas","Reddy","Sai"),
salary=c(95000,85000,75000,50000,65000),
desig=c("DataScientist","MLDeveloper","DataAnalyst","Testing","DBA"),
address=c("Hyd","Chennai","Bang","Hyd","Bang"),
stringsAsFactors=FALSE)
print(emp)
str(emp)
summary(emp)
rownames(emp)
colnames(emp)
names(emp)
# creating a course data frame
cid=c(10,20,30)
cname=c("DataScience","DataAnalytics","MachineLearning")
cfee=c(10000,8000,10000)
course <- data.frame(cid,cname,cfee)
course <- data.frame(cid,cname,cfee,stringsAsFactors=F)
Extract Data from Data Frame:
============================
--> Syntax for accessing rows and columns: [, [[, and $
--> Like a matrix with single brackets data[rows, columns]
Using row and column numbers
Using column and row names
--> Like a list:
With single brackets data[columns] to get a data frame
With double brackets data[[one_column]] to get a vector
With $ for a single column data$column_name
--> Here we can extract the specific column from a data frame using column name.
# Extract Specific columns on emp data frame.
emp1 <- data.frame(emp$ename,emp$salary)
# Extract first three rows.
emp[1:3,]
# Extract 2nd and 4th row with 1st and 4th column.
emp[c(2,4),c(1,4)]
emp[1,2]
emp[,2]
emp[1:2,2]
emp[1:2,2:4]
emp$ename
emp[,"ename"]
data.frame(emp$ename)
emp["ename"]
emp$ename[2]
c(emp$ename,emp$salary)
list(emp$ename,emp$salary)
data.frame(emp$ename,emp$salary)
data.frame(emp[,c("ename","salary")])
data.frame(emp[,2:3])
Expand Data Frame:
==================
--> Data frame can be expanded by adding columns and rows.
Add Column:
-----------
--> Add the column using a new column name.
# Add the "contact" coulmn.
emp$contact <- c(9292005440,9898989898,9696969696,9595959595,9191919191)
Column Bind and Row Bind:
=========================
--> Combine R Objects by Rows or Columns
--> Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows
--> c to combine vectors as vectors
--> data.frame to combine vectors and matrices as a data frame.
--> The functions to bind rows and columns are cbind() and rbind()
c <- cbind(1:5, 1:5)
print(c)
c <- cbind(1, 1:5) # Here 1 is recycled
print(c)
c <- cbind(c, 6:10) # insert a column at last
print(c)
c <- cbind(c,11:15)[,c(1,3,4,2)] # insert a column at required position
print(c)
r <- rbind(1:5,1:5)
print(r)
r <- rbind(1,1:5) # Here 1 is recycled
print(r)
r <- rbind(a = 1, b = 1:5)
print(r)
# deparse.level
--> deparse.level = 0 does not construct labels
--> deparse.level = 1 (the default) or 2 constructs labels from the argument names
a <- 40
rbind(1:5, b = 20, Data = 30, a, deparse.level = 0) # middle 2 rownames
rbind(1:5, b = 20, Data = 30, a, deparse.level = 1) # 3 rownames (default)
rbind(1:5, b = 20, Data = 30, a)
rbind(1:5, b = 20, Data = 30, a, deparse.level = 2) # 4 rownames
Add Row:
-------
--> Adding a Single Observation (row)
emp <- rbind(emp,c(106,"Data",45000,"Java","Hyd",9589695895))
--> Adding Many Observations (rows)
emp <- rbind(emp,c(107,"Data",45000,"Java","Hyd",9589695895),
c(108,"Hills",40000,"Testing","Chennai",9589658965))
--> To add more rows to an existing data frame, we need to bring the new rows into the same structure as the existing data frame and use the rbind() function.
--> Here we create a data frame with new rows and merge it with the existing data frame to create the final data frame.
# Create the second data frame
emp_2 <- data.frame(
eid=c(109,110),
ename=c("Nandu","Naga"),
salary=c(95000,85000),
desig=c("DataScientist","MLDeveloper"),
address=c("Hyd","Chennai"),
contact=c(9494949494,9393939393),
stringsAsFactors=FALSE)
# we can merge two data frames using rbind() function
# Bind the two data frames.
emp_final <- rbind(emp,emp_2)
print(emp_final)
--> We can join multiple vectors by using the cbind() function.
cid = c(10, 20, 30)
cname = c("DataScience", "DataAnalytics", "MachineLearning")
cfee = c(10000, 8000, 10000)
# Combine above three vectors.
course <- cbind(cid, cname, cfee)
print(course)
str(course)
typeof(course)
class(course)
course <- data.frame(cid, cname, cfee)
print(course)
typeof(course)
class(course)
str(course)
course <- data.frame(cid, cname, cfee, stringsAsFactors = FALSE)
print(course)
str(course)
# Create another data frame with similar columns
course_new <- data.frame(
cid = c(40, 50, 60),
cname = c("DataScience", "DataAnalytics", "MachineLearning"),
cfee = c(10000, 8000, 10000),
stringsAsFactors = F)
print(course_new)
str(course_new)
typeof(course_new)
class(course_new)
# Combine rows from both the data frames.
course_final <- rbind(course, course_new)
print(course_final)
str(course_final)
typeof(course_final)
class(course_final)
Merging Data Frames:
====================
--> Merge two data frames by common columns or row names
--> Merge is similar to join operations in database
--> Joins are used to retrieve the data from multiple tables.
--> We can merge two data frames by using the merge() function.
--> The column names should be same when merging the data frames.
Syntax:
merge(x, y, by, by.x, by.y, all, all.x, all.y, sort)
Arguments
---------
x, y x and y are data frames or objects
by, by.x, by.y specifies the common columns.
all, all.x, all.y determines the type of merge
sort logical. by default it is TRUE
student_details <- data.frame(
s_name = c("Sreenu","Vasu","Nivas","Reddy","Sai"),
address = c("Hyd","Bang","Chennai","Pune","Mumbai"),
contact = c(9292005440,9898989898,9696969696,9595959595,9292929292),
stringsAsFactors = FALSE)
course_details <- data.frame(
s_name = c("Sreenu","Vasu","Nivas","Reddy","Sai"),
course = c("DataScience","MachineLearning","DataAnalytics","R","Python"),
fee = c(20000,15000,10000,8000,10000),
stringsAsFactors = FALSE)
by, by.x, by.y:
---------------
--> The names of the columns that are common to both x and y.
--> The default is to use the columns with common names between the two data frames.
merge(student_details, course_details, by="s_name")
merge(course_details, student_details, by="s_name")
merge(student_details, course_details) #Here by is optional
--> When both data frames contain more than one common column name, the merge is based on all the common column names.
course_details$address = c("Hyd","Bang","Chennai","Pune","Mumbai")
merge(student_details, course_details)
course_details$address[3:4] <- c("Delhi","Hyd")
merge(student_details, course_details)
course_details$address = NULL
merge(student_details, course_details)
colnames(student_details)[1] <- "student_name"
or
names(student_details)[1] <- "student_name"
print(student_details)
print(course_details)
merge(student_details, course_details, by.x="student_name", by.y="s_name")
merge(course_details, student_details, by.x="s_name", by.y="student_name")
merge(student_details, course_details) # No common column names left, so merge gives the cross join
student_details[c(3,5),1] <- c("Rama","Sita")
merge(student_details, course_details, by.x="student_name", by.y="s_name")
merge(course_details, student_details, by.x="s_name", by.y="student_name")
# Here by default sort is TRUE, set as FALSE
merge(course_details, student_details, by.x="s_name", by.y="student_name",
sort=FALSE)
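--> The all, all.x and all.y arguments control outer joins; a sketch using the same data frames (the default all=FALSE gives an inner join):
merge(course_details, student_details, by.x="s_name", by.y="student_name", all.x=TRUE) # left outer join
merge(course_details, student_details, by.x="s_name", by.y="student_name", all.y=TRUE) # right outer join
merge(course_details, student_details, by.x="s_name", by.y="student_name", all=TRUE)   # full outer join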
Melting and Casting:
====================
--> Melting and casting are used to change the shape of the data in multiple steps to get a desired shape.
--> The functions are melt() and cast().
--> First install the package "reshape".
--> The "reshape" package is used for restructuring and aggregating datasets.
install.packages("reshape")
library(reshape)
mydata <- data.frame(ID=c(1,1,2,2),Time=c(1,2,1,2),
X1=c(5,3,6,2),X2=c(6,5,1,4),stringsAsFactors=F)
print(mydata)
Melt the Data:
--------------
--> Melt an object into a form suitable for easy casting.
--> When we melt a dataset, we restructure it into a format where each measured variable is in its own row, along with the ID variables needed to uniquely identify it.
md <- melt(mydata, id=c("ID","Time"))
--> Note: We must specify the variables needed to uniquely identify each measurement (ID and Time); the variable column holding the measurement variable names (X1 or X2) is created automatically.
--> Now that the data is in melted form, we can recast it into any shape using the cast() function.
Cast the Melted Data:
---------------------
--> cast() function starts with melted data and reshapes it using a formula that we provide and an (optional) function used to aggregate the data.
--> The format is
newdata <- cast(md, formula, FUN)
--> Where md is the melted data,
formula describes the desired end result, and
FUN is the (optional) aggregating function.
--> The formula takes the form
rowvar1 + rowvar2 + … ~ colvar1 + colvar2 + …
--> In this formula,
rowvar1 + rowvar2 + … define the set of crossed variables that define the rows, and colvar1 + colvar2 + … define the set of crossed variables that define the columns.
With Aggregation:
-----------------
cast(md, ID~variable, mean)
cast(md, Time~variable, mean)
cast(md, ID~Time, mean)
Without Aggregation:
--------------------
cast(md, ID+Time~variable)
cast(md, ID+variable~Time)
cast(md, ID~variable+Time)
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
ships - ships damage data
# Data frame giving the number of damage incidents and aggregate months of service by ship type, year of construction, and period of operation.
type: ship type, "A" to "E".
year: year of construction: 1960–64, 65–69, 70–74, 75–79 (coded as "60", "65", "70", "75").
period: period of operation: 1960–74, 75–79.
service: aggregate months of service.
incidents: number of damage incidents.
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created.
cast(molten.ships, type~variable,sum)
cast(molten.ships, year~variable,sum)
cast(molten.ships, type+year~variable,sum)
Arrays:
=======
--> Multidimensional object with homogeneous data types.
--> An array can have one, two or more dimensions.
--> It is similar to a vector
--> A one-dimensional array looks like a vector
--> A two-dimensional array looks like a matrix
--> An array is created using the array() function
Syntax:
array(data, dim, dimnames)
Arguments
---------
data:
--> a vector data to fill the array.
dim:
--> dim attribute which creates the required number of dimensions
dimnames:
--> either NULL or the names for the dimensions.
# Create an array with two elements which are 3x3 matrices each.
a <- array(c("Data", "Hills"), dim = c(3,3,2))
# it creates 2 rectangular matrices each with 3 rows and 3 columns
print(a)
dim(a)
length(a)
typeof(a) # character
class(a) # array
# Create two vectors of different lengths.
a <- 1:3
b <- 4:9
# Take these vectors as input to the array.
c <- array(c(a,b),dim = c(3,3,2))
print(c)
Column Names and Row Names:
---------------------------
--> By using the dimnames parameter we can give names to the rows, columns and matrices in the array.
# Create two vectors of different lengths.
a <- 1:3
b <- 4:9
c <- array(c(a,b),dim = c(3,3,2),dimnames = list(c("R1","R2","R3"),
c("C1","C2","C3"), c("M1","M2")))
x <- c("ROW1","ROW2","ROW3")
y <- c("COL1","COL2","COL3")
z <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
c <- array(c(a,b),dim = c(3,3,2),dimnames = list(x,y,z))
print(c)
dim(c)
length(c)
Accessing Array Elements:
-------------------------
# Print the first row of the second matrix of the array.
c[1,,2]
# Print the second column of the first matrix of the array.
c[,2,1]
# Print the element in the 1st row and 2nd column of the 1st matrix.
c[1,2,1]
# Print the element in the 3rd row and 2nd column of the 2nd matrix.
c[3,2,2]
# Print the 1st Matrix.
c[,,1]
# Print the 2nd Matrix.
c[,,2]
Manipulating Array Elements:
----------------------------
--> An array is made up of matrices in multiple dimensions, so operations on elements of an array are carried out by accessing elements of those matrices.
# create matrices from these arrays.
m1 <- c[,,1]
m2 <- c[,,2]
# Add the matrices.
result <- m1+m2
print(result)
d <- array(11:22,dim = c(3,3,2))
# create matrices from these arrays.
m1 <- c[,,2]
m2 <- d[,,2]
# Add the matrices.
result <- m1+m2
print(result)
Calculations across Array Elements:
-----------------------------------
--> By using the apply() function we can do calculations across the elements in an array.
--> Syntax:
apply(X, MARGIN, FUN)
Arguments
---------
--> X is an array.
--> MARGIN specifies the dimension(s) over which FUN is applied.
E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns
--> FUN is the function to be applied across the elements of the array.
# Use apply to calculate the sum of the rows across all the matrices.
d <- apply(c, MARGIN=1, FUN=sum)
d <- apply(c, 1, sum) #1 indicates rows
print(d)
e <- apply(c, 2, sum) #2 indicates columns
print(e)
f <- apply(c, c(1,2), sum) #c(1, 2) indicates rows and columns
print(f)
g <- apply(c, c(2,1), sum) #c(2,1) indicates columns and rows
print(g)
FACTORS:
========
--> Factors represent categorical values
--> Factors are the data objects which are used to categorize the data and store it as levels.
--> By default, if the levels are not supplied by the user, then R will generate the set of unique values in the vector, sort these values alphanumerically, and use them as the levels.
--> Factors can store both strings and integers.
--> Factors are useful in the columns which have a limited number of unique values. Like "Yes", "No" and "Good", "Bad" etc.
--> Factors are useful in data analysis for statistical modeling.
--> Factors are created using the factor() function
Factors in Data Frame
---------------------
--> After creating a data frame with character data type elements, R treats the character column as categorical data and creates factors on it.
# Create the vectors for data frame.
h <- c(5.6,6.0,5.10,6.2,5.5,5.2,5.8)
w <- c(50,55,58,65,70,65,60)
g <- c("M","F","F","F","M","F","M")
# Create the data frame.
hwg <- data.frame(h,w,g)
print(hwg)
# Test if the gender column is a factor.
is.factor(hwg$g)
# Print the gender column to see the levels.
hwg$g
str(hwg)
# Create a vector as input.
gender <- c("M","F","M","M","F","F","F","M","F","M")
length(gender)
is.factor(gender)
typeof(gender) #character
class(gender) #character
# Apply the factor function.
gender_f <- factor(gender)
is.factor(gender_f)
typeof(gender_f) #integer
class(gender_f) #factor
levels(gender_f)
print(gender)
print(gender_f)
str(gender_f)
Changing Order Levels & Creating Labels:
----------------------------------------
# If we want to change the ordering of the levels, then one option is to specify the levels manually:
gender_f <- factor(gender,levels=c("M","F"))
levels(gender_f)
print(gender_f)
str(gender_f)
gender_f <- factor(gender,levels=c("M","F"),ordered=TRUE)
print(gender_f)
str(gender_f)
gender_f <- factor(gender,levels=c("M","F"),labels=c("Male","Female"))
print(gender_f)
gender_f <- factor(gender,levels=c("M","F"),labels=c("Male","Female"),ordered=T)
print(gender_f)
Generating Factor Levels
------------------------
--> We can generate factor levels by using the gl() function.
--> It takes two integers as input which indicates how many levels and how many times each level.
syntax:
gl(n, k, labels)
Arguments
---------
n is an integer giving the number of levels.
k is an integer giving the number of replications (the number of times each level repeats).
labels is a vector of labels for the resulting factor levels.
a <- gl(2, 4, labels = c("Male", "Female"))
print(a)
speed <- c("high","low","medium","low","high","low")
typeof(speed) #character
class(speed)
speed_f <- factor(speed)
typeof(speed_f)
class(speed_f)
speed_f <- factor(speed,levels=c("low","medium","high"))
speed_f <- factor(speed,levels=c("low","medium","high"),ordered=T)
str(speed_f)
data <- c("E","W","E","N","N","E","W","W","W","E","N")
print(data)
is.factor(data)
data_f <- factor(data)
print(data_f)
is.factor(data_f)
# Apply the factor function with required order of the level.
data_f2 <- factor(data_f,levels = c("E","W","N"))
print(data_f2)
grades <- c(1,2,3,4,4,3,1,2,1,2,3)
grades_f <- factor(grades)
grades
grades_f
str(grades_f)
grades_f <- factor(grades,levels=c(3,1,4,2),ordered=TRUE)
str(grades_f)
is.factor(grades) #FALSE
is.factor(grades_f) #TRUE
Weekdays <- factor(c("Sunday","Monday", "Tuesday", "Wednesday","Thursday","Friday", "Saturday"))
print(Weekdays)
Weekdays <- factor(Weekdays, levels=c("Sunday","Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"), ordered=TRUE)
print(Weekdays)
Weekend <- subset(Weekdays, Weekdays == "Saturday" | Weekdays == "Sunday")
print(Weekend)
# When a level of the factor is no longer used,
# we can drop it using the droplevels() function:
Weekend <- droplevels(Weekend)
print(Weekend)
Functions:
==========
--> A self-contained block of one or more statements designed for a specific task is called a "Function".
--> A programmer builds a function to avoid repeating the same task, or reduce complexity.
--> R function is created by using the keyword function
--> Functions are classified into 2 types. i.e
1. Built-in functions / Predefined functions
2. User defined functions
Built-in Functions:
-------------------
--> There are a lot of built-in functions in R.
Examples of built-in functions are search(), seq(), rep(), c(), sum(), etc.
search()
seq(1,10,2)
rep("DataHills",7)
c(10,20,30,40,50)
sum(1:10)
--> It is possible to see the source code of a function by running the name of the function itself in the console.
search
seq
rep
c
sum
User-defined Function:
----------------------
--> We need to write our own function when we have to accomplish a particular task and no ready-made function exists.
--> User-defined functions are specific to what a user wants and once created they can be used like the built-in functions.
--> Give a user-defined function a name different from the built-in functions; it avoids confusion.
--> The syntax to create a new function is
function_name <- function(arg1, arg2, ... ){
statements
return(object)
}
Function Components:
Function Name
Arguments
Function Body
Return Value
# Create a function without an argument.
addnum <- function()
{
a=10
b=20
result <- a+b
return(result)
}
# Call the function without an argument.
addnum()
# Create a function with single argument.
pownum <- function(a)
{
result <- a^2
return(result)
}
# Call the function pownum supplying 10 as an argument.
pownum(10)
x=c(10,20,30,40,50)
# Call the function pownum supplying a vector 'x' as an argument.
pownum(x)
# Create a function with multiple arguments
addnum <- function(a,b)
{
result <- a+b
return(result)
}
# Call the function by position of arguments.
addnum(10,20)
# Call the function by name of arguments.
addnum(a=20,b=30)
# Create a function with default arguments
subnum <- function(a=10,b=20)
{
result <- a-b
return(result)
}
# Call the function without giving any argument.
subnum()
# Call the function with giving new values of the argument.
subnum(30,10)
addnum <- function(a,b,c)
{
print(a+b)
print(c)
}
# Evaluate the function without supplying one of the arguments.
addnum(10,20)  # prints 30, then throws an error only when print(c) is evaluated
# This is called Lazy Evaluation of function arguments: they are evaluated only when needed by the function body.
Control Flow Statements:
========================
--> These type of statements will control the execution flow of the program.
--> Types of control flow statements are
1. Decision making statements / selection statements
2. Looping statements / Iteration statements
3. Loop control statements
Decision Making:
================
--> By using decision making statements we can create condition-oriented blocks, i.e., depending on the condition the interpreter decides whether a block will be executed or not.
--> In decision making statements, if the condition is TRUE the block is executed; if the condition is FALSE the block is skipped.
--> R provides the following types of decision making statements.
if statement
if..else statement
if..else..if..else statement
switch statement
if(2>5)
print("Welcome")   # conditional: skipped, since 2>5 is FALSE
print("DataHills") # not part of the if, so it always executes
if(!5!=5>8)
{
print("Welcome to")
print("DataHills")
} else {
print("Welcome to")
print("Data Science")
}
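A small sketch of the if..else..if ladder and the switch statement (the values are just examples):
x <- "b"
if(x == "a") {
print("first")
} else if(x == "b") {
print("second")
} else {
print("other")
}
switch(x, a="first", b="second", "other") # returns "second"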
Loops:
======
--> A set of instructions given to the interpreter to execute a set of statements until a condition becomes FALSE is called a Loop.
--> The basic purpose of a loop is code repetition.
--> R provides the following types of loop to handle looping requirements.
repeat Loop
while loop
for loop
--> Loop control statements are
break statement
next statement
Ex:
for(i in 1:5) {
b <- i^2
print(b)
}
# Create a function to print squares of numbers in sequence.
powseq <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
# Call the function powseq supplying 5 as an argument.
powseq(5)
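A sketch of the remaining loop types together with the loop control statements:
i <- 1
while(i <= 5) { # while loop: condition is checked before each iteration
print(i)
i <- i + 1
}
i <- 0
repeat { # repeat loop: must be exited with break
i <- i + 1
if(i == 3) next # next skips the rest of this iteration
print(i)
if(i >= 5) break # break exits the loop
}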
Strings:
========
--> A string can be created using single quotes or double quotes.
--> Internally R stores every string within double quotes, even when we create them with single quotes.
--> The class of an object that holds character strings is “character”.
--> R has several built-in functions that can be used to print or display information, but print() and cat() functions are the most basic.
print("Hello World") #"Hello World"
cat("Hello World\n") #Hello World
# Without the new-line character (\n) the output would be
cat("Hello World") #Hello World>
--> cat() function takes one or more character vectors as arguments.
--> If a character vector has a length greater than 1, its elements are separated by a space (by default)
cat(c("hello", "world", "\n")) #hello world
Valid and Invalid strings:
==========================
chr <- 'this is a string'
chr <- "this is a string"
chr <- "this 'is' valid"
chr <- 'this "is" valid'
chr <- "this is "not" valid"
chr <- 'this is 'not' valid'
--> We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0).
--> Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.
empty_str <- ""
empty_chr <- character(0)
class(empty_str) #character
class(empty_chr) #character
length(empty_str) #1
length(empty_chr) #0
--> The function character() will create a character vector with as many empty strings as we want.
--> We can add new components to the character vector just by assigning it to an index outside the current valid range.
--> The index does not need to be consecutive, in which case R will auto-complete it with NA elements.
chr_vector <- character(2) # create char vector
chr_vector # "" ""
chr_vector[3] <- "Three" # add new element
chr_vector # "" "" "Three"
chr_vector[5] <- "Five" # do not need to be consecutive
chr_vector # "" "" "Three" NA "Five"
String Manipulation with "base" package:
========================================
--> Some of the string manipulation functions which belongs to base package are
paste()
format()
toupper()
tolower()
substring()
nchar()
paste():
========
--> paste() function is used to concatenate (combine) strings.
Syntax:
paste(..., sep = " ", collapse = NULL)
Arguments:
...
represents any number of arguments to be combined.
sep
represents any separator between the arguments. It is optional.
collapse
is an optional string used to join the elements of the result into one single string. It does not affect the spaces within the words of one string.
a <- "Heartly"
b <- 'Welcome to'
c <- "DataHills! "
paste(a,b,c)
paste(a,b,c, sep = "$")
paste(a,b,c, sep = "", collapse = "")
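--> collapse is easier to see with a character vector input; a small sketch:
x <- c("Data", "Science", "Hills")
paste(x, collapse=" ")               # "Data Science Hills" - one single string
paste("A", x, sep="-")               # "A-Data" "A-Science" "A-Hills" - three strings
paste("A", x, sep="-", collapse="+") # "A-Data+A-Science+A-Hills"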
format():
=========
--> format() function is used to format numbers and strings to a specific style.
Syntax:
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Arguments:
x
is the vector input.
digits
is the total number of digits displayed.
nsmall
is the minimum number of digits to the right of the decimal point.
scientific
is set to TRUE to display scientific notation.
width
indicates the minimum width to be displayed by padding blanks in the beginning.
justify
is the display of the string to left, right or center.
# Total number of digits displayed. Last digit rounded off.
format(10.123456789, digits = 9)
# Display numbers in scientific notation.
format(c(7, 10.12345), scientific = TRUE)
# The minimum number of digits to the right of the decimal point.
format(10.12, nsmall = 5)
# Format treats everything as a string.
format(7)
# Numbers are padded with blank in the beginning for width.
format(10.5, width = 7)
# Left justify strings.
format("DataHills", width = 20, justify = "l")
# Justify string with center.
format("DataHills", width = 20, justify = "c")
toupper():
==========
--> toupper() function is used to convert the characters of a string into upper case.
Syntax:
toupper(x)
Here x is the input vector
toupper("Welcome to DataHills") #"WELCOME TO DATAHILLS"
tolower():
==========
--> tolower() function is used to convert the characters of a string into lower case.
Syntax:
tolower(x)
Here x is the input vector
tolower("Welcome to DataHills") #"welcome to datahills"
substring():
============
--> substring() function is used to extract part of a string.
Syntax:
substring(x, first, last)
Arguments:
x
is the character vector input
first
is the position of the first character to be extracted
last
is the position of the last character to be extracted
substring("DataHills", 5, 9) #"Hills"
nchar():
========
--> nchar() function is used to count the number of characters including spaces in a string.
Syntax:
nchar(x)
Here x is the input vector.
nchar("Welcome to DataHills") #20
nchar(9292005440) #10
String manipulation with "stringi" package:
===========================================
--> The string functions which belong to the base package are good only for simple text processing.
--> The stringi package contains advanced string processing functions.
--> For better processing we need the stringi package when dealing with more complex problems such as natural language processing.
--> Features of stringi package are
text sorting,
text comparing,
extracting words,
sentences and characters,
text transliteration,
replacing strings, etc.
#Install and load stringi
install.packages("stringi")
library(stringi)
data <- "Welcome to DataHills.
DataHills provides online training."
# To avoid the \n, build the string with the %s+% concatenation operator
data <- "Welcome to DataHills. " %s+%
"DataHills provides online training."
data <- "Welcome to DataHills. DataHills provides online training on Data Science, Data Analytics, Machine Learning, R Programming, Python, Weka, Pega, SqlServer, MySql, SSIS, SSRS, SSAS and PowerBI. For details contact 9292005440 or datahills7@gmail.com."
stri_split_boundaries():
========================
--> stri_split_boundaries() function is used to split text at boundaries, e.g. to extract words.
Arguments
---------
type
single string; the boundary type to use, e.g. "character", "word", "line", "sentence"
skip_word_none
logical; perform no action for "words" that do not fit into any other categories
skip_word_number
logical; perform no action for words that appear to be numbers
skip_word_letter
logical; perform no action for words that contain letters, excluding hiragana, katakana, or ideographic characters
stri_split_boundaries(data)
stri_split_boundaries(data, type="line")
stri_split_boundaries(data, type="word")
stri_split_boundaries(data, type="word", skip_word_none=TRUE)
stri_split_boundaries(data, type="word", skip_word_letter=TRUE)
stri_split_boundaries(data, type="word", skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(data, type="word", skip_word_number=TRUE)
stri_split_boundaries(data, type="word", skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(data, type="sentence")
stri_split_boundaries(data, type="character")
stri_count_boundaries:
======================
--> Count the number of text boundaries (like character, word, line, or sentence boundaries) in a string.
stri_count_boundaries(data)
stri_count_boundaries(data, type="line")
stri_count_boundaries(data, type="word")
stri_count_boundaries(data, type="sentence")
stri_count_boundaries(data, type="character")
stri_count_words(data)
stri_length(data)
stri_numbytes(data)
stri_startswith & stri_endswith:
================================
--> stri_startswith_* and stri_endswith_* determine whether a string starts or ends with a given pattern.
stri_startswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "s")
stri_startswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "d")
stri_startswith_fixed(c("srinu", "data", "science", "Statistics", "hills"), "s")
stri_startswith_coll(c("srinu", "data", "science", "Statistics", "hills"), "s", strength=1)
stri_endswith_fixed(c("srinu", "data", "science", "statistics", "hills"), "s")
stri_detect_regex(c("srinu", "data", "science", "statistics", "hills"), "^s")
stri_detect_regex(c("srinu", "data", "science", "statistics", "hills"), "s")
stri_startswith_fixed("datahills", "hill")
stri_startswith_fixed("datahills", "hill", from=5)
stri_replace_all:
=================
--> stri_replace_all() function replaces a word with another word based on conditions
--> The vectorize_all parameter defaults to TRUE
stri_replace_all_fixed(data, " ", "#")
stri_replace_all_fixed(data, "a", "A")
stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"), vectorize_all=FALSE)
stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"), vectorize_all=TRUE)
stri_replace_all_fixed(data,c("DataHills","provides"), c("Information","offers"))
stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=FALSE)
stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=TRUE)
stri_replace_all_fixed(data,c("Data","provides"), c("Information","offers"), vectorize_all=FALSE)
stri_replace_all_regex(data,"\\b"%s+%c("Data","provides")%s+%"\\b", c("Information","offers"),vectorize_all=FALSE)
stri_split:
===========
--> stri_split() is used to split sentences on ";", ",", "_" or any other delimiter
stri_split_fixed(data, " ")
stri_split_fixed("a_b_c_d", "_")
stri_split_fixed("a_b_c__d", "_")
stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)
stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
stri_split_fixed("a_b_c__d", "_", n=2) # "a" & remainder
stri_split_fixed("a_b_c__d", "_", n=2, tokens_only=FALSE)
stri_split_fixed("a_b_c__d", "_", n=2, tokens_only=TRUE) # "a" & "b" only
stri_split_fixed("a_b_c__d", "_", n=4, tokens_only=TRUE)
stri_split_fixed("a_b_c__d", "_", n=4, omit_empty=TRUE, tokens_only=TRUE)
stri_split_fixed("a_b_c__d", "_", omit_empty=NA)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""),"_")
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=1, tokens_only=TRUE, omit_empty=TRUE)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=2, tokens_only=TRUE, omit_empty=TRUE)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n=3, tokens_only=TRUE, omit_empty=TRUE)
stri_list2matrix:
=================
--> stri_list2matrix() is used to convert lists of atomic vectors to character matrices
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE)
stri_list2matrix(stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE))
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=FALSE)
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=NA)
stri_trans:
===========
--> The stri_trans_*() functions transform strings either to lower case, UPPER CASE, or Title Case.
stri_trans_toupper(data) #toupper(data)
stri_trans_tolower(data) #tolower(data)
stri_trans_totitle(data)
stri_trans_totitle(data,type="word")
stri_trans_totitle(data,type="sentence")
stri_trans_totitle(data,type="character")
Date and Time:
==============
--> R is able to access the current date, time and time zone
--> Sys.time and Sys.Date functions are used to get the current date and time.
Sys.Date()
Sys.time()
Sys.timezone()
x <- as.Date("2018-10-12")
print(x)
typeof(x) # double
class(x) # Date
as.Date(c('2018-10-11', '2018-10-12'))
a <- Sys.Date()
b <- Sys.time()
typeof(a) # "double"
typeof(b) # "double"
class(a) # "Date"
class(b) # "POSIXct" "POSIXt"
DateTimeClasses:
----------------
--> DateTimeClasses describes the classes "POSIXlt" and "POSIXct" representing calendar dates and times.
--> To format dates we use the format(date, format="%Y-%m-%d") function with either a POSIXct object (from as.POSIXct()) or a POSIXlt object (from as.POSIXlt()).
--> Codes for specifying the formats to the as.Date() function.
Format Code_Meaning
------ -----------
%d day
%m month
%y year in 2-digits
%Y year in 4-digits
%b abbreviated month in 3 chars
%B full name of the month
# It tries to interpret the string as %Y-%m-%d
as.Date("2018-10-15") # no problem
as.Date("2018/10/15") # no problem
as.Date(" 2018-10-15 datahills") # leading whitespace and all trailing characters are ignored
as.Date("15-10-2018") # interpreted as "%Y-%m-%d", giving "0015-10-20", not 15 Oct 2018
as.Date("15/10/2018") # again misread, this time as "%Y/%m/%d"
as.Date("2018-10-15", format = "%Y-%m-%d")
as.Date("2018-10-15") # in ISO format, so does not require formatting string
as.Date("10/15/18", format = "%m/%d/%y")
as.Date("October 15, 2018", "%B %d, %Y")
as.Date("October 15th, 2018", "%B %dth, %Y") # add separators and literals to format
as.Date("15-10-2018",format="%d-%m-%Y")
as.Date("15-10-2018", "%d-%m-%Y")
as.Date("15 Oct, 2018","%d %b, %Y")
as.Date("15Oct2018","%d%b%Y")
as.Date("15 October, 2018", "%d %B, %Y")
Formatting and printing date-time objects:
------------------------------------------
# test date-time object
d = as.POSIXct("2018-10-15 06:30:10.10", tz = "UTC")
format(d,"%S") # 00-61 Second as integer
format(d,"%OS") # 00-60.99… Second as fractional
format(d,"%M") # 00-59 Minute
format(d,"%H") # 00-23 Hours
format(d,"%I") # 01-12 Hours
format(d,"%p") # AM/PM Indicator
format(d,"%Z") # Time Zone Abbreviation
# To add/subtract time, use POSIXct, since it stores times in seconds
as.POSIXct("2018-10-15")
# adding/subtracting times - 60 seconds
as.POSIXct("2018-10-15") + 60
# adding 5 hours, 30 minutes, 10 seconds
as.POSIXct("2018-10-14") + ( (5 * 60 * 60) + (30 * 60) + 10)
# as.difftime can be used to add time periods to a date.
as.POSIXct("2018-10-14") +
as.difftime(5, units="hours") +
as.difftime(30, units="mins") +
as.difftime(10, units="secs")
# To find the difference between dates/times use difftime() for differences in seconds, minutes, hours, days or weeks.
# using POSIXct objects
difftime(
as.POSIXct("2018-10-14 12:00:00"),
as.POSIXct("2018-10-14 11:59:50"),
unit = "secs")
as.POSIXct("07:30", format = "%H:%M") # time, formatting string
strptime("07:30", format = "%H:%M") # identical, but makes a POSIXlt object
as.POSIXct("07 AM", format = "%I %p")
as.POSIXct("07:30:10",
format = "%H:%M:%S",
tz = "Asia/Calcutta") # time string without timezone & set time zone
as.POSIXct("2018-10-15 07:30:10",
format = "%F %T") # shortcut tokens for "%Y-%m-%d" and "%H:%M:%S"
Loading Data into R Objects:
============================
--> R can read and write into various file formats like
Data Extraction from CSV
Data Extraction from URL
Data Extraction from CLIPBOARD
Data Extraction from EXCEL
Data Extraction from DATABASES
Working Directory:
==================
--> getwd() function is used to get current working directory.
--> setwd() function is used to set a new working directory.
getwd() # C:/Users/Sreenu/Documents
setwd("C:/Users/Sreenu")
getwd() # C:/Users/Sreenu
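--> To see which files are available in the working directory before reading them, use list.files():
list.files()               # all files in the working directory
list.files(pattern=".csv") # only the file names matching ".csv"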
Data Extraction from CSV:
=========================
--> read.csv() function is used to import comma-separated value (CSV) files
--> Use sep = "," to set the delimiter to a comma.
Parameter Details
--------- --------
file name of the CSV file to read
header logical: does the .csv file contain a header row with column names?
sep character: symbol that separates the cells on each row
quote character: symbol used to quote character strings
dec character: symbol used as decimal separator
fill logical: when TRUE, rows with unequal length are filled with blanks.
comment.char character: character used as comment in the csv file.
Reading a CSV File Separated by ",":
------------------------------------
--> Read a CSV file available in current working directory
read.csv("emp10.csv", sep=",", stringsAsFactors=TRUE)
emp <- read.csv("emp10.csv")
--> Read a CSV file available in another directory
emp <- read.csv("c:\Users\Sreenu\Desktop\MLDataSets\emp10.csv") # Error
emp <- read.csv("c:\\Users\\Sreenu\\Desktop\\MLDataSets\\emp10.csv")
emp <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp10.csv")
Analyzing the CSV File:
-----------------------
is.data.frame(emp) # TRUE
typeof(emp) # list
mode(emp) # list
class(emp) # data.frame
nrow(emp)
ncol(emp)
dim(emp) # dimensions of the emp file
names(emp) # names of the attributes
str(emp) # structure of the attributes
emp[1:6,] # first 6 rows and all columns
# head() and tail() functions are used to return the first or last records
head(emp) # return first 6 records by default
head(emp, n=3) # return n no. of records
head(emp, 3)
head(emp, -3)
tail(emp) # return last 6 records by default
tail(emp, 3)
tail(emp, -3)
# max salary from emp.
max(emp$sal)
# emp details having max salary.
subset(emp, sal == max(sal))
# all the employees working as data analyst
subset(emp, desig == "data analyst")
# Data Analyst whose sal is greater than 65000
subset(emp, sal > 65000 & desig == "data analyst")
# select only emp desig whose sal is greater than 60000
subset(emp, sal > 60000, select = desig)
# select all columns except desig whose sal is greater than 60000
subset(emp, sal > 60000, select = -desig)
# employees who joined on or after 2013
subset(emp, as.Date(doj) > as.Date("2013-01-01"))
recent_join <- subset(emp, as.Date(doj) > as.Date("2013-01-01"))
print(recent_join)
Writing into a CSV File:
------------------------
--> write.csv() function is used to create the csv file.
--> This file gets created in the current working directory.
recent_join <- subset(emp, as.Date(doj) > as.Date("2013-01-01"))
print(recent_join)
# Write filtered data into a new file.
write.csv(recent_join, "emp6.csv")
newemp <- read.csv("emp6.csv")
print(newemp)
write.csv(newemp, "emp6_1.csv")
new <- read.csv("emp6_1.csv")
print(new)
# By default write.csv adds a column X holding the row names; use row.names = FALSE to drop it.
write.csv(recent_join,"emp6.csv", row.names = FALSE)
newemp <- read.csv("emp6.csv")
print(newemp)
Reading a CSV File Separated by ";":
------------------------------------
wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")
head(wines)
dim(wines)
wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv",sep=";")
head(wines)
dim(wines)
str(wines)
wines <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")
dim(wines)
wines <- read.csv2("c:/Users/Sreenu/Desktop/MLDataSets/winequality-red.csv")
dim(wines)
Data Extraction from EXCEL:
===========================
--> R can read and write Excel files using the xlsx package.
--> Excel is the most widely used spreadsheet program which stores data in the .xls or .xlsx format.
--> Note that the xlsx package depends on the rJava and xlsxjars R packages.
# First install java software in our system, otherwise xlsx will not load into R
install.packages("xlsx")
library("xlsx")
Reading the Excel File:
-----------------------
--> read.xlsx() and read.xlsx2() functions are used to import Excel files.
--> read.xlsx2() is faster than read.xlsx() on big files.
Syntax:
read.xlsx(file, sheetIndex, header=TRUE)
Arguments:
----------
file:
the path to the file to read
sheetIndex:
a number indicating the index of the sheet to read;
Ex:- use sheetIndex=1 to read the first sheet
header:
a logical value. If TRUE, the first row is used as the names of the variables
# Read the first worksheet in the file emp10.xlsx.
emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx") #Error
emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", sheetIndex = 1)
print(emp)
dim(emp)
typeof(emp)
mode(emp)
class(emp)
str(emp)
emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", 2)
print(emp)
Writing the Excel file:
-----------------------
--> write.xlsx() and write.xlsx2() functions are used to export data from R to an Excel file.
--> write.xlsx2() achieves better performance than write.xlsx() for very large data frames (more than one lakh records).
Arguments:
----------
x:
a data.frame to be written into the workbook
file:
the path to the output file
sheetName:
a character string to use for the sheet name.
col.names, row.names:
a logical value specifying whether the column names/row names of x are to be written to the file
append:
a logical value indicating if x should be appended to an existing file.
emp <- read.xlsx("c:/Users/Sreenu/Desktop/MLDataSets/emp10.xlsx", 1)
a <- head(emp)
b <- tail(emp)
# Write the first data set in a new workbook
write.xlsx(a, file="write_emp.xlsx", sheetName="first6", append=FALSE)
# Add a second data set in a new worksheet
write.xlsx(b, file="write_emp.xlsx", sheetName="last6", append=TRUE)
?mtcars
class(mtcars)
names(mtcars)
str(mtcars)
write.xlsx(mtcars,"C:/Users/Sreenu/Desktop/mtcars.xlsx")
# practice on emp100, emp1000 files
emp100 <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp100.csv")
dim(emp100)
emp1000 <- read.csv("c:/Users/Sreenu/Desktop/MLDataSets/emp1000.csv")
dim(emp1000)
Data Extraction from CLIPBOARD:
===============================
--> read.delim("clipboard") function is used to import the copied data.
emp <- read.delim("clipboard")
print(emp)
dim(emp)
str(emp)
typeof(emp)
mode(emp)
class(emp)
Data Extraction from URL:
=========================
read.csv(url("url address"))
wine_red <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"))
dim(wine_red) # 1599 1
wine_red <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"),sep=";")
dim(wine_red) # 1599 12
wine_white <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"),sep=";")
dim(wine_white) # 4898 12
Data Extraction from XML:
=========================
--> Extensible Markup Language (XML) is a file format used to share data on the World Wide Web.
--> XML is similar to HTML; both contain markup tags.
--> We can extract XML files using the "XML" package.
install.packages("XML")
library(XML)
Reading XML File:
-----------------
--> Read XML file by using xmlParse() function.
emp_xml <- xmlParse("C:/Users/Sreenu/Desktop/MLDataSets/emp10.xml")
print(emp_xml)
# Extract the root node from the xml file.
emp_root <- xmlRoot(emp_xml)
# Extract the details of the first node
emp_root[1]
# Get the first element of the first node.
emp_root[[1]][[1]]
# Get the fifth element of the first node.
emp_root[[1]][[5]]
# Get the second element of the third node.
emp_root[[3]][[2]]
# Find number of nodes in the root.
emp_size <- xmlSize(emp_root)
print(emp_size)
XML to Data Frame:
------------------
--> For data analysis it is better to convert the xml file into a data frame.
--> We have to use xmlToDataFrame() function to convert into data frame.
emp_df <- xmlToDataFrame("C:/Users/Sreenu/Desktop/MLDataSets/emp10.xml")
print(emp_df)
dim(emp_df)
Data Extraction from JSON:
==========================
--> JavaScript Object Notation (JSON) files can be read by using the rjson package.
install.packages("rjson")
library(rjson)
Read the JSON File:
-------------------
--> Read the JSON file by using fromJSON() function.
a <- fromJSON(file = "file_name.json")
JSON to Data Frame:
-------------------
--> For data analysis it is better to convert the JSON file to a data frame.
--> We have to use as.data.frame() function to convert into data frame.
b <- as.data.frame(a)
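--> A minimal sketch, assuming a tiny JSON document supplied inline as a string (rjson's fromJSON() can also parse a string directly):
library(rjson)
a <- fromJSON('{"id":[1,2],"name":["Sreenu","Datahills"],"sal":[50000,60000]}')
b <- as.data.frame(a)
print(b)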
Data:
=====
--> Data is a raw fact (collection of characters, numeric values, special characters etc)
--> Whatever we input from the keyboard is known as data.
--> Data will not provide any meaningful statements to the user.
--> Ex:- ec@mw2lo1e3
Information:
============
--> Processed data is called information.
--> Information always provides meaningful statements to the user.
--> Ex:- welcome@123
Database:
=========
--> A database is a collection of information written in a predetermined manner and saved at a particular location.
Database Management System (DBMS):
==================================
--> DBMS is a tool used to maintain & manage the data within the database.
--> DBMS is used for storing, accessing and sharing the information, and for providing security to it.
Models of DBMS:
===============
--> DBMS contains 6 models:
1. File management system (FMS)
2. Hierarchy management system (HMS)
3. Network database management system (NDBMS)
4. Relational database management system (RDBMS)
5. Object relational database management system (ORDBMS)
6. Object oriented relational database management system (OORDBMS)
File Management System:
=======================
--> FMS is the first model of DBMS, designed & developed in the 1950s.
--> In this model, data is stored sequentially, as a continuous stream of characters.
Drawbacks:
----------
--> Costly to maintain
--> Requires more manpower
--> Accessing the data is a time-consuming process
--> It is difficult to maintain large amounts of data
--> There is no security
--> It is not possible to share the information with multiple programmers
--> Programmers get delayed responses.
Hierarchy Management System:
============================
--> HMS is the second model of DBMS, designed & developed by IBM in the 1960s while building a project called IMS (Information Management System).
--> In this model the data is stored in the form of a tree structure, i.e., in levels.
--> In the tree structure the user has to maintain the following levels:
	Root level represents the database name,
	Parent level represents the table names,
	Child level represents the column names of a table,
	Leaf level represents additional columns.
--> The main advantage of the HMS model is fast access to the data at a given location.
Drawbacks:
----------
--> In this model only one programmer can interact with the data at a time.
--> There is no security for the database information.
--> It is not possible to share the database with multiple programmers or locations.
Network Database Management System:
===================================
--> NDBMS is the third model of DBMS, designed & developed by IBM in 1969 while enhancing the features of the IMS project.
--> In this model the data is stored in the form of a tree structure and located within a network environment.
--> The main advantage of NDBMS is that the required database can be shared with multiple programmers at a time, all communicating with the same database.
Drawbacks:
----------
--> There is no proper security for the centralized database system
--> Database redundancy increases (duplicate values)
--> It occupies more memory
--> Application performance is reduced
--> Users get delayed responses
NOTE: The above 3 models are outdated.
Relational Database Management System:
======================================
--> RDBMS is the 4th model of DBMS, designed & developed by the English computer scientist E.F. Codd (at IBM) in 1970.
--> E.F. Codd defined 13 rules, numbered 0 to 12, known as Codd's rules:
Rule 0: Foundation rule
Rule 1: Information rule
Rule 2: Guaranteed access rule
Rule 3: Systematic treatment of null values
Rule 4: Active online catalog
Rule 5: Comprehensive data sub-language rule
Rule 6: View updating rule
Rule 7: High-level insert, update and delete
Rule 8: Physical data independence
Rule 9: Logical data independence
Rule 10: Integrity independence
Rule 11: Distribution independence
Rule 12: Non-subversion rule
--> If a database satisfies at least 6 of Codd's rules, then the DBMS is called an RDBMS product.
--> Here a relation can be defined as a commonness between objects.
--> Relations are again classified into 3 types:
1. one to one relation
2. one to many relation
3. many to many relation
--> When an object has a relationship with exactly one other object, it is known as a one to one relation.
	Ex:- student <-> sid
--> When an object has a relationship with many other objects, it is known as a one to many relation.
	Ex:- student <-> C, C++, Java
--> When many objects have relationships with many other objects, it is known as a many to many relation.
	Ex:- vendor1, vendor2, vendor3 <-> product1, product2, product3
--> E.F. Codd designed the above relations based on a mathematical concept called "Relational Algebra".
--> The above 3 relations are called the Degree of Relationship.
Features of RDBMS:
==================
--> In the RDBMS model the data is stored in the form of tables.
--> A table is a collection of rows and columns.
--> A vertical line is called a column, field or attribute, and a horizontal line is called a row, record or tuple.
--> The intersection of a row & a column is called a CELL (an atomic value).
--> RDBMS provides strong security for the database information.
--> Accessing the data is very easy and user friendly.
--> RDBMS provides the facility to share the database from one location to another without losing data.
--> It is easy to perform manipulations on table data.
--> RDBMS improves application performance and avoids database redundancy problems.
# For data analysis mostly we have to read data from various data bases like MySQL, Microsoft SQL server, Oracle Database server etc.
# So, as data scientist we should be able to write query to get data on various criteria from the database.
Data Extraction from Databases:
===============================
--> R connects with many relational databases like MySQL, Oracle, SQL Server, PostgreSQL, SQLite etc.
--> R can fetch records from databases as a data frame.
--> Once the data is available in the R environment, then it becomes easy to manipulate or analyze using packages and functions.
Different R Packages:
---------------------
MySQL - RMySQL
PostgreSQL - RPostgreSQL
Oracle - ROracle
--> Here I am using MySQL for connecting to R.
--> To work with MySQL we have to install & load the RMySQL package
install.packages("RMySQL")
# Automatically DBI Package is also installed
library(RMySQL)
or
library(DBI)
# library(RMySQL) not required
Connecting R to MySQL:
----------------------
--> For connecting R to MySQL, it takes the username, password, database name and host name as input.
--> dbConnect() function is used to create a connection to a DBMS
Syntax:
dbConnect(DB, ...)
Arguments:
----------
user:- for the user name (default: current user)
password:- for the password
dbname:- for the name of the database
host:- for the host name (default: local connection)
# Here, I am connecting to the "roadway_travels" database
travels = dbConnect(MySQL(), user = 'root', password = 'datahills', dbname = 'roadway_travels', host = 'localhost')
# dbListTables() function is used to display the list of tables available in this database.
dbListTables(travels)
Querying the Tables:
--------------------
--> dbSendQuery() function is used to execute a query on a given database connection.
--> fetch() function is used to store the result set as a data frame in R.
# Query the "bus" tables to get all the rows.
busdb = dbSendQuery(travels, "select * from bus")
# Store the result in a R data frame object. n = 5 is used to fetch first 5 rows.
bus5 = fetch(busdb, n = 5)
print(bus5)
dbClearResult(busdb)
# Query with Filter Clause:
bushyd = dbSendQuery(travels, "select * from bus where SOURCE='HYDERABAD' ")
# Fetch all the records(with n = -1) and store it as a data frame.
busall = fetch(bushyd, n = -1)
print(busall)
dbClearResult(bushyd)
# Updating Rows in the Tables
dbSendQuery(travels, "update bus set SOURCE = 'DELHI' where BNO = 50")
# After executing the above code we can see the table updated in the MySql.
# Inserting Data into the Tables
dbSendQuery(travels,"insert into bus values (70,'ORANGE','PUNE','DELHI','20:00:00','06:30:00')")
# After executing the above code we can see the row inserted into the table in the MySql.
# Dropping Tables in MySql
dbSendQuery(travels, 'drop table reservation')
# After executing the above code we can see the table is dropped in the MySql.
Creating Tables in MySql:
-------------------------
--> dbWriteTable() function is used to create tables in the MySql.
--> It takes a data frame as input.
--> It overwrites the table if it already exists.
# Use the R data frame "mtcars" to create the table in MySql.
dbWriteTable(travels, "mtcars", mtcars[,], overwrite = TRUE)
# After executing the above code we can see the table created in the MySql.
Closing connections:
--------------------
--> dbDisconnect() function is used to disconnect the created connections with MySQL.
dbDisconnect(travels)
dplyr:
======
--> dplyr is a powerful R package to manipulate, clean and summarize unstructured data.
--> The "dplyr" package contains many functions that perform the most common data manipulation operations, such as
applying filter,
selecting specific columns,
sorting data,
adding or deleting columns and
aggregating data.
--> This package was written by Hadley Wickham who has written many useful R packages such as ggplot2, tidyr etc.
--> dplyr functions are similar to SQL commands, such as
select() for selecting columns,
group_by() - group data by grouping column,
join() - joining two data sets.
Also includes inner_join() and left_join().
It also supports subqueries, which SQL is popular for.
--> But SQL was never designed to perform data analysis, it was designed for querying and managing data.
--> There are many data analysis operations where SQL fails or makes simple things difficult.
--> For example, calculating median for multiple columns, converting wide format data to long format etc.
--> Whereas, dplyr package was designed to do data analysis.
# install and load dplyr package
install.packages("dplyr")
library(dplyr)
# dplyr Functions:-
dplyr Function Description Equivalent SQL
============== =========== ==============
select() Selecting columns (variables) SELECT
filter() Filter (subset) rows. WHERE
group_by() Group the data GROUP BY
summarise() Summarise (or aggregate) data -
arrange() Sort the data ORDER BY
join() Joining data frames (tables) JOIN
mutate() Creating New columns COLUMN ALIAS
# I am using the sampledata.csv file which contains income generated by states from year 2002 to 2015.
mydata = read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sampledata.csv")
dim(mydata) # 51 observations (rows) and 16 variables (columns)
# Selecting Random N Rows
--> sample_n() function selects random rows from a data frame (or table).
--> The second parameter of the function tells R the number of rows to select.
sample_n(mydata,3)
# Selecting Random Fraction of Rows
--> sample_frac() function returns randomly N% of rows.
sample_frac(mydata,0.1) # it returns randomly 10% of rows
Data Extraction from Databases Part 2 &
# Remove Duplicate Rows based on all the columns (Complete Row)
--> distinct() function is used to eliminate duplicates.
x1 = distinct(mydata)
dim(x1)
# In this dataset, there is not a single duplicate row so it returned same number of rows as in mydata.
# Remove Duplicate Rows based on a column
--> .keep_all argument is used to retain all other columns in the output data frame.
x2 = distinct(mydata, Index, .keep_all= TRUE)
dim(x2)
# Remove Duplicate Rows based on multiple columns
--> we are using two columns - Index, Y2010 to determine uniqueness.
x2 = distinct(mydata, Index, Y2010, .keep_all= TRUE)
dim(x2)
select():
---------
--> It is used to select only desired columns.
syntax:
select(data , ....)
data : Data Frame
.... : columns by name or by function
# Selecting Columns
--> Selects the column "State" and the columns from "Y2006" to "Y2008".
mydata2 = select(mydata, State, Y2006:Y2008)
# Dropping columns
--> The minus sign before a column tells R to drop the variable.
mydata2 = select(mydata, -Index, -State)
--> The above code can also be written like :
mydata2 = select(mydata, -c(Index,State))
# Selecting or Dropping columns starts with 'Y'
--> starts_with() function is used to select columns whose names start with a given prefix.
mydata3 = select(mydata, starts_with("Y"))
head(mydata3)
--> Adding a negative sign before starts_with() implies dropping the columns starting with 'Y'
mydata33 = select(mydata, -starts_with("Y"))
head(mydata33)
The following helper functions let you select columns based on their names:
Helpers Description
======= ===========
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Columns in character vector.
everything() All columns.
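--> Quick sketches of the remaining helpers (assuming mydata's Index, State and Y2002...Y2015 columns):
select(mydata, ends_with("5"))             # columns ending with "5", e.g. Y2005, Y2015
select(mydata, matches("Y20(0|1)2"))       # regular expression: Y2002, Y2012
select(mydata, num_range("Y", 2002:2004))  # Y2002, Y2003, Y2004
select(mydata, one_of(c("Index", "State")))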
# Selecting columns that contain 'I' in their names
mydata4 = select(mydata, contains("I"))
# Reorder columns
--> Puts the column 'State' in front; the remaining columns follow.
mydata5 = select(mydata, State, everything())
rename():
---------
--> It is used to change column name.
syntax:
rename(data , new_name = old_name)
data : Data Frame
new_name : New column name you want to keep
old_name : Existing column Name
# Rename Columns
--> The rename function can be used to rename columns.
--> we are renaming 'Index' column to 'Index1'.
mydata6 = rename(mydata, Index1=Index)
filter():
---------
--> It is used to subset data with matching logical conditions.
syntax : filter(data , ....)
data : Data Frame
.... : Logical Condition
# Filter Rows
--> To filter rows and retain only those values in which Index is equal to A.
mydata7 = filter(mydata, Index == "A")
# Multiple Selection Criteria
--> The %in% operator can be used to select multiple items.
--> Select rows against 'A' and 'C' in column 'Index'.
mydata7 = filter(mydata, Index %in% c("A", "C"))
# 'AND' Condition in Selection Criteria
--> Filtering data for 'A' and 'C' in the column 'Index' and income greater than 13 lakh in Year 2002.
mydata8 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000)
# 'OR' Condition in Selection Criteria
--> | (OR) in the logical condition. It means any of the two conditions.
mydata9 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)
# NOT Condition
--> The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata, !Index %in% c("A", "C"))
# CONTAINS Condition
--> The grepl() function is used to search for pattern matching.
--> we are looking for records where the column State contains 'Ar'.
mydata10 = filter(mydata, grepl("Ar", State))
summarise():
------------
--> It is used to summarize data.
syntax : summarise(data , ....)
data : Data Frame
..... : Summary Functions such as mean, median etc
# Summarize selected columns
--> we are calculating mean and median for the column Y2015.
summarise(mydata, Y2015_mean = mean(Y2015), Y2015_med=median(Y2015))
# Summarize Multiple Columns
--> we are calculating number of records, mean and median for columns Y2005 and Y2006.
--> summarise_at() function allows us to select multiple columns by their names.
summarise_at(mydata, vars(Y2005, Y2006), funs(n(), mean, median))
Working on another dataset:
---------------------------
# I am using the airquality dataset from the datasets package.
# The airquality dataset contains information about air quality measurements in New York from May 1973 – September 1973.
dim(airquality)
names(airquality)
head(airquality)
sample_n(airquality, size = 10)
sample_frac(airquality, size = 0.1)
# we can return all rows with Temp greater than 70 as follows:
filter(airquality, Temp > 70)
# return all rows with Temp larger than 80 and Month higher than 5.
filter(airquality, Temp > 80 & Month > 5)
# adds a new column that displays the temperature in Celsius.
mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
summarise(airquality, mean(Temp, na.rm = TRUE))
summarise(airquality, Temp_mean = mean(Temp, na.rm = TRUE))
# Group By
--> The group_by function is used to group data by one or more columns.
--> we can group the data together based on the Month, and then use the summarise function to calculate and display the mean temperature for each month.
summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
# Count
--> The count function calculates the no. of observations based on a group.
--> It is slightly similar to the table function in the base package.
count(airquality, Month)
--> This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.
# Arrange
--> The arrange function is used to arrange rows by columns.
--> Currently, the airquality dataset is arranged based on Month, and then Day.
--> We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.
arrange(airquality, desc(Month), Day)
# Pipe
--> The pipe operator in R, represented by %>% can be used to chain code together.
--> It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.
--> For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month.
--> The conventional way to write the code for this would be:
filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))
--> With piping, the above code can be rewritten as:
airquality %>%
filter(Month != 5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm = TRUE))
Plyr:
=====
--> The plyr package is a tool for doing split-apply-combine (SAC) procedures.
--> This is an extremely common pattern in data analysis:
we solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.
Install and Load plyr:
install.packages("plyr")
library(plyr)
Plyr provides a set of functions for common data analysis problems:
-------------------------------------------------------------------
arrange:
re-order the rows of a data frame by specifying the columns to order by
mutate:
add new columns or modify existing columns, like transform, but new columns can refer to other columns that you just created.
summarise:
like mutate but create a new data frame, not preserving any columns in the old data frame.
join:
an adaptation of merge which is more similar to SQL, and has a much faster implementation if you only want to find the first match.
match_df:
a version of join that instead of returning the two tables combined together, only returns the rows in the first table that match the second.
colwise:
make any function work colwise on a dataframe
rename:
easily rename columns in a data frame
round_any:
round a number to any degree of precision
count:
quickly count unique combinations and return the result as a data frame.
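--> A brief sketch of a few of these helpers on the built-in mtcars data:
arrange(mtcars, desc(mpg))          # sort rows, highest mpg first
mutate(mtcars, kpl = mpg * 0.425)   # add a derived column (approx. km per litre)
round_any(123.456, 10)              # 120
count(mtcars, "cyl")                # unique cyl values with their frequencies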
plyr vs dplyr:
==============
--> dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R.
--> dplyr is the next iteration of plyr, focusing only on data frames.
--> dplyr is faster, has a more consistent API and should be easier to use.
--> Lets compare plyr and dplyr with a little example, using the Batting dataset from the fantastic Lahman package which makes the complete Lahman baseball database easily accessible from R.
--> Pretend we want to find the five players who have batted in the most games in all of baseball history.
Install and Load Lahman:
install.packages("Lahman")
library(Lahman)
--> The basic format is two letters followed by ply().
--> The first letter refers to the input format and the second to the output format.
--> The three main letters are:
d = data frame
a = array (includes matrices)
l = list
--> So, ddply means: take a data frame, split it up, do something to it, and return a data frame.
--> ldply means: take a list, split it up, do something to it, and return a data frame.
--> This extends to all combinations.
--> In the following table, the columns are the input formats and the rows are the output format:
Output format  dframe  list  array
=============  ======  ====  =====
data frame ddply ldply adply
list dlply llply alply
array daply laply aaply
In plyr, we might write code like this:
games <- ddply(Batting, "playerID", summarise, total = sum(G))
head(arrange(games, desc(total)), 5)
--> We use ddply() to break up the Batting dataframe into pieces according to the playerID column, then apply summarise() to reduce the player data to a single row.
--> Each row in Batting represents one year of data for one player, so we figure out the total number of games with sum(G) and save it in a new column called total.
--> We sort the result so the most games come at the top and then use head() to pull off the first five.
# If you need functions from both plyr and dplyr, please load plyr first, then dplyr.
# If we load plyr after dplyr - it is likely to cause problems.
In dplyr, the code is similar:
players <- group_by(Batting, playerID)
games <- summarise(players, total = sum(G))
head(arrange(games, desc(total)), 5)
--> Grouping is a top level operation performed by group_by(), and summarise() works directly on the grouped data, rather than being called from inside another function.
--> The other big difference is speed. plyr took about 9 seconds on my computer, and dplyr took 0.2s, roughly a 45x speed-up.
--> This is common when switching from plyr to dplyr, and for many operations you’ll see a 20x-1000x speedup.
--> dplyr provides another innovation over plyr: the ability to chain operations together from left to right with the %>% operator.
This makes dplyr behave a little like a grammar of data manipulation:
Batting %>%
group_by(playerID) %>%
summarise(total = sum(G)) %>%
arrange(desc(total)) %>%
head(5)
tidyr:
======
--> tidyr package is an evolution of reshape2 (2010-2014) and reshape (2005-2010) packages.
--> It's designed specifically for data tidying (not general reshaping or aggregating)
--> tidyr is a new package that makes it easy to “tidy” your data.
--> Tidy data is easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with modelling packages).
Install and Load tidyr:
install.packages("tidyr")
library(tidyr)
Translation b/w the terminology used in different places:
tidyr gather spread
==== ====== ======
reshape(2) melt cast
spreadsheets unpivot pivot
databases fold unfold
# I will use the mtcars dataset from the datasets library.
head(mtcars)
dim(mtcars)
# Let us include the names of the cars in a column called car for easier manipulation.
mtcars$car <- rownames(mtcars)
dim(mtcars)
head(mtcars)
mtcars <- mtcars[, c(12, 1:11)]
head(mtcars)
gather():
=========
--> gather() function is used to convert wide data to a longer format.
--> It is analogous to the melt function from reshape2.
syntax:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
where ... is the specification of the columns to gather.
# We can replicate what melt does as follows:
mtcarsNew <- mtcars %>% gather(attribute, value, -car)
dim(mtcarsNew)
head(mtcarsNew)
tail(mtcarsNew)
--> As we can see, it gathers all the columns except car and places their names and values into the attribute and value columns respectively.
--> The great thing about tidyr is that you can gather only certain columns and leave the others alone.
--> If we want to gather all the columns from mpg to gear and leave the carb and car columns as they are, we can do it as follows:
mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear)
dim(mtcarsNew)
head(mtcarsNew)
spread():
=========
--> spread() function is used to convert long data to a wider format.
--> It is analogous to the cast function from reshape2.
syntax:
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
We can replicate what cast does as follows:
mtcarsSpread <- mtcarsNew %>% spread(attribute, value)
head(mtcarsSpread)
unite():
========
--> unite() function is used to combine two or more columns into a single column.
syntax:
unite(data, col, ..., sep = "_", remove = TRUE)
where ... represents the columns to unite and col represents the column to add.
# Let us create some dummy data:
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
print(data)
# Now, let us combine the date, hour, min, and second columns into a new column called datetime.
# Usually, datetime in R is of the form Year-Month-Day Hour:Min:Second.
dataNew <- data %>%
unite(time, hour, min, second, sep = ':')
print(dataNew)
dataNew <- data %>%
unite(time, hour, min, second, sep = ':') %>%
unite(datetime, date, time, sep = ' ')
print(dataNew)
separate():
===========
--> separate() is used to split one column into two or more columns.
syntax:
separate(data, col, into, sep, remove, convert, extra , fill , ...)
# We can get back the original data we created using separate as follows:
data1 <- dataNew %>%
separate(datetime, c('date', 'time'), sep = ' ') %>%
separate(time, c('hour', 'min', 'second'), sep = ':')
print(data1)
--> It first splits the datetime column into date and time, and then splits time into hour, min, and second.
Factor Analysis:
================
table():
--------
--> returns the count for each categorical value.
--> table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv",stringsAsFactors=TRUE)
dim(cars)
names(cars)
head(cars)
str(cars)
cars[,2]
or
cars$model
table(cars$model)
sum(table(cars$model))
prop.table():
-------------
--> Proportionality
--> Express Table Entries as Fraction of Marginal Table
--> In mathematics, two variables are proportional if there is always a constant ratio between them.
--> The constant is called the coefficient of proportionality or proportionality constant.
prop.table(cars$model) # not possible: prop.table() expects a table, not a factor
prop.table(table(cars$model))
prop.table(table(cars$model))*100 #result in percentage
table(cars$transmission)
prop.table(table(cars$transmission))
prop.table(table(cars$transmission))*100
table(cars$color)
prop.table(table(cars$color))
prop.table(table(cars$color))*100
sum(table(cars$model))
table(cars$model)
sum(table(cars$transmission))
table(cars$transmission)
sum(table(cars$model,cars$transmission))
table(cars$model,cars$transmission)
prop.table(table(cars$model,cars$transmission))
prop.table(table(cars$model,cars$transmission))*100
table(cars$transmission,cars$model)
prop.table(table(cars$transmission,cars$model))
prop.table(table(cars$transmission,cars$model))*100
table(cars$model,cars$transmission,cars$color)
prop.table(table(cars$model,cars$transmission,cars$color))
prop.table(table(cars$model,cars$transmission,cars$color))*100
CrossTable():
-------------
--> CrossTable() function belongs to "gmodels" package (for more analysis)
install.packages("gmodels")
library(gmodels)
--> Cross Tabulation With Tests For Factor Independence
--> The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS. It has a wealth of options.
--> We can control whether
row percentages (prop.r),
column percentages (prop.c),
table percentages (prop.t),
chisq percentages (prop.chisq) by making them TRUE.
CrossTable(cars$model,cars$transmission)
CrossTable(cars$model,cars$transmission,prop.t=F,prop.c=F,prop.r=F,prop.chisq=F)
CrossTable(cars$model,cars$transmission,prop.t=F,prop.c=F,prop.r=T,prop.chisq=F)
CrossTable(cars$model,cars$transmission,prop.t=F,prop.c=T,prop.r=T,prop.chisq=F)
CrossTable(cars$model,cars$transmission,prop.t=T,prop.c=T,prop.r=T,prop.chisq=F)
Statistical Observations:
=========================
--> Statistical analysis in R is performed by using many in-built functions.
--> Most of these functions are part of the R base package & stats package.
--> These functions take R vector as an input along with the arguments and give the result.
min()
max()
mean()
median()
quantile()
min() & max():
==============
--> Returns the (regular or parallel) maxima and minima of the input values.
Syntax:
max(..., na.rm = FALSE)
min(..., na.rm = FALSE)
Arguments:
----------
--> ... numeric or character arguments
--> na.rm is a logical indicating whether missing values should be removed.
# Create a vector.
x <- c(12,41,21,-32,23,24,65,-12,10,-8)
min(x)
max(x)
car <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")
min(car$price)
max(car$price)
mean():
=======
--> It is calculated by taking the sum of the values and dividing with the number of values in a data series.
--> mean() function is used to calculate this value.
Syntax:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments:
----------
--> x is the input vector.
--> trim is used to drop some observations from both ends of the sorted vector.
--> na.rm is used to remove the missing values from the input vector.
# Find Mean.
a <- mean(x)
print(a)
mean(car$price)
Applying Trim Option:
---------------------
--> When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end before calculating the mean.
--> When trim = 0.3, 3 values from each end will be dropped from the calculations to find mean.
--> In this case the sorted vector is (-32 -12 -8 10 12 21 23 24 41 65) and the values removed from the vector for calculating mean are (-32 -12 -8) from left and (24 41 65) from right.
a <- mean(x,trim = 0.3)
print(a)
mean(car$price, trim = 0.3)
Applying NA Option:
----------------------------
--> If there are missing values, then the mean function returns NA.
--> To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA values.
# Create a vector.
x <- c(12,41,21,-32,23,24,65,-12,10,-8,NA)
# Find mean.
a <- mean(x)
print(a) # NA
# Find mean dropping NA values.
a <- mean(x,na.rm = TRUE)
print(a)
mean(car$price, na.rm = TRUE)
median():
========
--> The middle most value in a data series is called the median.
--> median() function is used to calculate this value.
Syntax:
median(x, na.rm = FALSE)
Arguments:
----------------
--> x is the input vector.
--> na.rm is used to remove the missing values from the input vector.
# Find the median.
b <- median(x)
print(b)
median(car$price)
quantile():
===========
--> quantile produces sample quantiles corresponding to the given probabilities.
--> The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
quantile(x)
quantile(car$price)
summary():
==========
--> it gives all the above 5 statistical observations.
--> summary is a generic function used to produce result summaries of the results of various model fitting functions.
--> The function invokes particular methods which depend on the class of the first argument.
summary(x)
summary(car$price)
Mode:
=====
--> The mode is the value that has the highest number of occurrences in a set of data.
--> Unlike mean and median, mode can be used with both numeric and character data.
--> R does not have a standard in-built function to calculate mode.
--> So we create a user function to calculate mode of a data set in R.
--> This function takes the vector as input and gives the mode value as output.
# Create the function.
# unique is used to Extract Unique Elements
# which.max determines the location, i.e., index of the (first) maximum of a numeric (or logical) vector.
# tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.
# match returns a vector of the positions of (first) matches of its first argument in its second.
mod <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,7,5,3,7,6,1,7,2,5,7,9,7,6,0,7,5)
# Calculate the mode using the user function.
a <- mod(v)
print(a)
mod(car$year)
mod(car$price)
# Create the vector with characters.
charv <- c("Analysis","DataHills","DataScience","DataHills")
# Calculate the mode using the user function.
a <- mod(charv)
print(a)
mod(car$model)
mod(car$color)
mean = median
-------------
--> Not skewed
--> Normal distribution
--> Data is equally distributed
mean > median
-------------
--> Right skewed
--> Data is extended more on the right hand side
--> Positive skewness
mean < median
-------------
--> Left skewed
--> Data is extended more on the left hand side
--> Negative skewness
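--> A quick simulation sketch of these rules (using rnorm/rexp from the stats package):
set.seed(1)
sym <- rnorm(1000, mean = 50, sd = 5)   # symmetric data
rsk <- rexp(1000, rate = 0.1)           # right skewed data
mean(sym); median(sym)                  # approximately equal -> no skew
mean(rsk); median(rsk)                  # mean > median -> right skewed
hist(rsk)                               # long tail on the right hand side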
cars <-read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")
str(cars)
min(cars$price)
max(cars$price)
min(cars$mileage)
max(cars$mileage)
range(cars$price)
range(cars$mileage)
mean(cars$price) #12961.93
median(cars$price) #13591.5
mean(cars$mileage)
median(cars$mileage)
# 0 to 1000: mean and median are both 500
a=seq(0,1000,10)
print(a)
mean(a) #500
median(a) #500
range(cars$price)
median(cars$price) #13591
mean(cars$price) #12961
13591-3800 #9791
21992-13591 #8401
# to check the left skewed data in the graph
boxplot(cars$price, horizontal=T)
mean(cars$mileage) #44260
median(cars$mileage) #36385
range(cars$mileage) #4867 151479
36385-4867 #31518
151479-36385 #115094
boxplot(cars$mileage, horizontal=T)
credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")
dim(credit)
names(credit)
str(credit)
range(credit$age) #19 75
mean(credit$age) #35.546
median(credit$age) #33
boxplot(credit$age,horizontal=T)
# quantile
0%(min) 25%(Q1) 50%(median)(Q2) 75%(Q3) 100%(max)
cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")
quantile(cars$price) #it gives the values of 0%,25%,50%,75%,100%
range(cars$price)
median(cars$price)
quantile(cars$price,seq(0.1,1,0.1)) #it gives 10%,20%,.......100%
quantile(cars$price,seq(0.25,0.75,0.25)) #it gives 25%,50%,75%
summary(cars$price)
# it gives Min. 1st Qu. Median Mean 3rd Qu. Max
summary(cars$mileage)
hist(cars$price) #in this histogram we can observe left skewed data
hist(cars$mileage) #in this histogram we can observe right skewed data
boxplot(cars$price,horizontal=TRUE)
boxplot(cars$mileage,horizontal=TRUE)
IQR():
------
--> IQR(x, na.rm = FALSE)
--> Q3-Q1 i.e., middle 50% data
--> Computes interquartile range of the x values.
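--> A quick check on a simple vector:
quantile(1:10, c(0.25, 0.75))  # Q1 = 3.25, Q3 = 7.75
IQR(1:10)                      # 4.5, i.e., Q3 - Q1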
var():
------
--> variance (symbolized by S^2)
--> It returns how much the values vary from the mean.
--> It is calculated as the average squared deviation of each number from the mean of a data set.
--> For example,
a <- c(1,2,3)
mean(a) # 2
var(a) # 1
for the numbers 1, 2, and 3 the mean is 2 and the sample variance is 1:
[(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2] ÷ (3 - 1) = 1
[sum of squared deviations from the mean] ÷ (number of observations - 1) = variance
Syntax:
var(x, na.rm = FALSE)
var(1:10) # 9.166667
var(1:5, 1:5) # 2.5
sd():
-----
--> standard deviation (the square root of the variance, symbolized by S)
--> This function computes the standard deviation of the values in x.
--> If na.rm is TRUE then missing values are removed before computation proceeds.
Syntax:
sd(x, na.rm = FALSE)
sd(1:2)
sd(1:2) ^ 2
# IQR (Interquartile Range):
IQR(cars$price) # it gives middle 50%
14904-10995 # manually Q3-Q1 value
IQR(cars$mileage)
marks <- c(76,80,72,78)
mean(marks) # 76.5
sd(marks)
# 3.416: roughly, mean + sd is the max value (approx.) and mean - sd is the min value (approx.)
var(marks) # 11.666
marks <- c(76.5,76.5,76.5,76.5)
mean(marks) # 76.5
sd(marks) # 0
var(marks) # 0
sd(cars$price) # 3122.482
mean(cars$price) # 12961.93
12961-3122 # 9839
12961+3122 # 16083
cars$price
# check the values between 9839 and 16083: 38 of the 150 values fall outside this range, i.e., 112 values lie within one sd of the mean
mean(cars$mileage) # 44260.65
sd(cars$mileage) # 26982.1
44260-26982 # 17278
44260+26982 # 71242
sort(cars$mileage)
cars_new <- subset(cars,mileage>=17278 & mileage<=71242)
dim(cars_new)
quantile(cars_new$mileage)
mean(cars_new$mileage) # 38356.13
median(cars_new$mileage) # 36124
mean(cars$mileage) # 44260.65
median(cars$mileage) # 36385
sd(cars_new$mileage) # 12265.27
sd(cars$mileage) # 26982
hist(cars$mileage) # right skewed distribution
hist(cars_new$mileage) # nearly normal distribution
Data Visualization / Plotting:
==============================
1. High level plotting
--> Generates a new plot
2. Low level plotting
--> Editing the existing plot
--> R Programming language has many libraries to create charts and graphs.
--> Data can be visualized in the form of
Pie Charts
Bar Charts
Boxplots
Histograms
Line Graphs
Scatterplots
Pie Charts:
===========
--> A pie-chart is a representation of values as slices of a circle with different colors.
--> The slices are labeled, and the numbers corresponding to each slice are also represented in the chart.
--> In R the pie chart is created using the pie() function which takes positive numbers as a vector input.
--> The additional parameters are used to control labels, color, title etc.
Syntax:
pie(x, labels, radius, main, col, clockwise)
Arguments:
----------
--> x is a vector containing the numeric values used in the pie chart.
--> labels is used to give description to the slices.
--> radius indicates the radius of the circle of the pie chart.(value between -1 and +1).
--> main indicates the title of the chart.
--> col indicates the color palette.
--> clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
--> Simple pie-chart using the input vector and labels.
--> It will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
# png - Graphics devices for BMP, JPEG, PNG and TIFF format bitmap files.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
# dev.off function provide control over multiple graphics devices.
dev.off()
Pie Chart Title and Colors:
---------------------------
--> We can expand the features of the chart by adding more parameters to the function.
--> We will use the parameter main to add a title to the chart, and another parameter col, which makes use of the rainbow colour palette while drawing the chart.
--> The length of the palette should be the same as the number of values we have for the chart.
--> Hence we use length(x).
# Give the chart file a name.
png(file = "city_title_colours.jpg")
# Plot the chart with title and rainbow color pallet.
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
# Save the file.
dev.off()
Slice Percentages and Chart Legend:
-----------------------------------
--> We can add slice percentage and a chart legend by creating additional chart variables.
piepercent<- round(100*x/sum(x), 1)
# Give the chart file a name.
png(file = "city_percentage_legends.jpg")
# Plot the chart.
pie(x, labels = piepercent, main = "City pie chart",col = rainbow(length(x)))
# legend --> used to add legends to plots
legend("topright", c("London","New York","Singapore","Mumbai"), cex = 1.0,
fill = rainbow(length(x)))
# Save the file.
dev.off()
3D Pie Chart:
-------------
--> A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a function called pie3D() that is used for this.
# Install & Load plotrix package.
install.packages("plotrix")
library(plotrix)
# Give the chart file a name.
png(file = "3d_pie_chart.jpg")
# Plot the chart.
pie3D(x, labels = labels, explode = 0.1, main = "Pie Chart of Cities")
# Save the file.
dev.off()
Bar Charts:
===========
--> A bar chart represents data in rectangular bars with length of the bar proportional to the value of the variable.
--> R uses the function barplot() to create bar charts.
--> R can draw both vertical and horizontal bars in the bar chart, and each of the bars can be given a different color.
Syntax:
barplot(H,xlab,ylab,main, names.arg,col)
Arguments:
----------
--> H is a vector or matrix containing numeric values used in bar chart.
--> xlab is the label for x axis.
--> ylab is the label for y axis.
--> main is the title of the bar chart.
--> names.arg is a vector of names appearing under each bar.
--> col is used to give colors to the bars in the graph.
--> Creating a bar chart using the input vector and the name of each bar.
# Create the data for the chart
H <- c(5,10,30,3,40)
# Give the chart file a name
png(file = "barchart.png")
# Plot the bar chart
barplot(H)
# Save the file
dev.off()
Bar Chart Labels, Title and Colors:
-----------------------------------
--> The features of the bar chart can be expanded by adding more parameters.
--> The main parameter is used to add title.
--> The col parameter is used to add colors to the bars.
--> names.arg is a vector having the same number of values as the input vector, describing the meaning of each bar.
# Create the data for the chart
H <- c(5,10,30,3,40)
M <- c("Jan","Feb","Mar","Apr","May")
# Give the chart file a name
png(file = "barchart_months_revenue.png")
# Plot the bar chart
barplot(H, names.arg=M, xlab="Month", ylab="Revenue", col="blue", main="Revenue chart", border="red")
# Save the file
dev.off()
Group Bar Chart and Stacked Bar Chart:
--------------------------------------
--> We can create bar chart with groups of bars and stacks in each bar by using a matrix as input values.
--> More than two variables are represented as a matrix which is used to create the group bar chart and stacked bar chart.
# Create the input vectors.
colors = c("red","blue","green")
months <- c("Jan","Feb","Mar","Apr","May")
regions <- c("East","West","North")
# Create the matrix of the values.
Values <- matrix(c(3,9,4,13,8,4,9,7,3,15,8,2,7,11,12), nrow = 3, ncol = 5, byrow = TRUE)
# Give the chart file a name
png(file = "barchart_stacked.png")
# Create the bar chart
barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab = "revenue", col = colors)
# Add the legend to the chart
legend("topleft", regions, cex = 1.3, fill = colors)
# Save the file
dev.off()
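--> The script above draws stacked bars; passing beside = TRUE to the same barplot() call draws the bars side by side instead (grouped bar chart):
png(file = "barchart_grouped.png")
barplot(Values, main = "total revenue", names.arg = months,
        xlab = "month", ylab = "revenue", col = colors, beside = TRUE)
legend("topleft", regions, cex = 1.3, fill = colors)
dev.off()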
Boxplots:
=========
--> Boxplots are a measure of how well the data in a data set is distributed.
--> It divides the data set into three quartiles.
--> This graph represents the minimum, maximum, median, first quartile and third quartile in the data set.
--> It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of them.
--> Boxplots are created in R by using the boxplot() function.
Syntax:
boxplot(x, data, notch, varwidth, names, main)
Arguments:
----------
--> x is a vector or a formula.
--> data is the data frame.
--> notch is a logical value. Set as TRUE to draw a notch.
--> varwidth is a logical value. Set as true to draw width of the box proportionate to the sample size.
--> names are the group labels which will be printed under each boxplot.
--> main is used to give a title to the graph.
--> Use the data set "mtcars" available in the R environment to create a basic boxplot.
# Let's look at the columns "mpg" and "cyl" in mtcars.
head(mtcars)
# Creating the Boxplot:
--> The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
Boxplot with Notch:
-------------------
--> We can draw boxplot with notch to find out how the medians of different data groups match with each other.
--> The below script will create a boxplot graph with a notch for each data group.
# Give the chart file a name.
png(file = "boxplot_with_notch.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon",
main = "Mileage Data",
notch = TRUE,
varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low")
)
# Save the file.
dev.off()
Histograms:
===========
--> A histogram represents the frequencies of values of a variable bucketed into ranges.
--> A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges.
--> Each bar in histogram represents the height of the number of values present in that range.
--> R creates histogram using hist() function.
Syntax:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Arguments:
----------
--> v is a vector containing numeric values used in histogram.
--> main indicates title of the chart.
--> col is used to set color of the bars.
--> border is used to set border color of each bar.
--> xlab is used to give description of x-axis.
--> xlim is used to specify the range of values on the x-axis.
--> ylim is used to specify the range of values on the y-axis.
--> breaks is used to control the number of bars (and hence the width of each bar).
--> Creating a histogram using input vector, label, col and border parameters.
# Create data for the graph.
v <- c(8,14,23,9,38,21,14,44,34,31,17)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")
# Save the file.
dev.off()
Range of X and Y values:
------------------------
--> To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim parameters.
--> The width of each of the bar can be decided by using breaks.
# Give the chart file a name.
png(file = "histogram_lim_breaks.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),
breaks = 5)
# Save the file.
dev.off()
Line Graphs:
============
--> A line chart is a graph that connects a series of points by drawing line segments between them.
--> These points are ordered by one of their coordinate values (usually the x-coordinate).
--> Line charts are usually used in identifying the trends in data.
--> The plot() function in R is used to create the line graph.
Syntax:
plot(v,type,col,xlab,ylab)
Arguments:
----------
--> v is a vector containing the numeric values.
--> type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
--> xlab is the label for x axis.
--> ylab is the label for y axis.
--> main is the Title of the chart.
--> col is used to give colors to both the points and lines.
--> Creating a line chart using the input vector and the type parameter as "o".
# Create the data for the chart.
v <- c(5,10,30,3,40)
# Give the chart file a name.
png(file = "line_chart.jpg")
# Plot the line chart.
plot(v,type = "o")
# Save the file.
dev.off()
Line Chart Title, Color and Labels:
-----------------------------------
--> The features of the line chart can be expanded by using additional parameters.
--> We add color to the points and lines, give a title to the chart and add labels to the axes.
# Give the chart file a name.
png(file = "line_chart_label_colored.jpg")
# Plot the line chart.
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
# Save the file.
dev.off()
Multiple Lines in a Line Chart:
-------------------------------
--> More than one line can be drawn on the same chart by using the lines() function.
--> After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in the chart.
# Create the data for the chart.
v <- c(5,10,30,3,40)
t <- c(15,9,8,25,5)
# Give the chart file a name.
png(file = "line_chart_2lines.jpg")
# Plot the line chart.
plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
lines(t, type = "o", col = "blue")
# Save the file.
dev.off()
Scatterplots:
=============
--> Scatterplots show many points plotted in the Cartesian plane.
--> Each point represents the values of two variables.
--> One variable is chosen in the horizontal axis and another in the vertical axis.
--> Scatterplot is created using the plot() function.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Arguments:
----------
--> x is the data set whose values are the horizontal coordinates.
--> y is the data set whose values are the vertical coordinates.
--> main is the title of the graph.
--> xlab is the label in the horizontal axis.
--> ylab is the label in the vertical axis.
--> xlim is the limits of the values of x used for plotting.
--> ylim is the limits of the values of y used for plotting.
--> axes indicates whether both axes should be drawn on the plot.
--> Using the data set "mtcars" available in the R environment to create a basic scatterplot.
# Let's use the columns "wt" and "mpg" in mtcars.
head(mtcars)
# Creating the Scatterplot:
--> The below script will create a scatterplot graph for the relation between wt (weight) and mpg (miles per gallon).
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = mtcars$wt,y = mtcars$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
# Save the file.
dev.off()
Scatterplot Matrices:
---------------------
--> When we have more than two variables and we want to find the correlation between one variable versus the remaining ones we use scatterplot matrix.
--> We use pairs() function to create matrices of scatterplots.
Syntax:
pairs(formula, data)
Arguments:
----------
--> formula represents the series of variables used in pairs.
--> data represents the data set from which the variables will be taken.
--> Each variable is paired up with each of the remaining variable.
--> A scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables giving 12 plots.
# One variable with 3 others and total 4 variables.
pairs(~wt+mpg+disp+cyl, data = mtcars, main = "Scatterplot Matrix")
# Save the file.
dev.off()
Box Plots: view quantiles
----------
cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")
quantile(cars$price)
boxplot(cars$price)
boxplot(cars$price, outline=FALSE)
boxplot(cars$price, outline=FALSE, col="blue")
boxplot(cars$price, outline=FALSE, col="blue", border="red")
boxplot(cars$price, col="blue", border = c("red","yellow","pink"))
# it contains only one plot, so it takes only one border color.
# IQR = Q3-Q1; the whiskers extend from Q1 - 1.5*IQR to Q3 + 1.5*IQR.
# Also read the 68-95-99.7 rule and understand it.
IQR(cars$price) #3909.5
1.5*IQR(cars$price) # 5864.25
10995-5864 # 5131 Q1-1.5*IQR
14904+5864 # 20768 Q3+1.5*IQR
cars$price
# check the outlier values: 2 below the lower fence and 2 above the upper fence.
cars <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/usedcars.csv")
boxplot(cars$price,outline=FALSE,col="blue",border="red")
boxplot(cars$price,outline=FALSE,col="blue",border="red",horizontal=TRUE)
# Low level plotting:
--------------------
title(main="Cars Price")
# First generate the box plot, then apply the low-level plotting functions; otherwise it gives an error
title(xlab="Price")
boxplot(price ~ model, data=cars) #it generates 3 boxplots for 3 models
boxplot(price ~ model, data=cars, border=c("red","yellow","blue"))
boxplot(price ~ color, data=cars, border=c("red","yellow","blue"))
boxplot(price ~ color, data=cars, border=c("black","blue","gold","gray","green","black","yellow","black","yellow"))
boxplot(price ~ transmission, data=cars)
credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")
names(credit)
boxplot(amount ~ purpose, data=credit)
boxplot(amount ~ default, data=credit)
hist --> histograms are used to display the frequency distribution
----
hist(cars$price)
hist(cars$price,col="blue")
hist(cars$price,border="red")
hist(cars$price,col="blue",border="red")
hist(cars$price,border="red",density=10)
hist(cars$price,border="red",density=20)
hist(cars$price,border="red",density=20,col="blue")
hist(cars$price,border="red",density=20,col="blue",angle=0)
plot --> scatter plots
----
plot(cars$mileage)
plot(cars$mileage,col="red")
plot(cars$mileage,col="red",pch=15)
plot(cars$mileage,col="red",pch=16)
# pch values range from 0 to 25; the default is 1 (open circle). Practice all the pch symbols
plot(cars$mileage,col=ifelse(cars$mileage<50000,"green","red"),pch=20)
plot(cars$mileage,col=ifelse(cars$transmission=="AUTO","green","red"),pch=20)
a <- c(2,10,5,20,15,6,30)
plot(a)
plot(a,pch=20)
plot(a,pch=20,col="blue")
plot(a,pch=20,col="blue",type="p")
plot(a,pch=20,col="blue",type="l")
plot(a,pch=20,col="blue",type="b")
plot(a,pch=20,col="blue",type="o")
plot(a,pch=20,col="blue",type="h")
plot(a,pch=20,col="blue",type="s")
plot(a,pch=20,col="blue",type="S")
x <- c(1,2,3,4)
y <- c(10,20,30,40)
plot(x,y)
plot(x,y,type="l")
plot(x,y,type="l",col="red")
plot(x,y,type="o",col="red")
plot(x,y,type="o",col="red",pch=20)
plot(cars$price,cars$mileage)
barplot(cars$price)
pie(cars$model) # ERROR: 'x' values must be positive numeric; pass a table of counts instead (next line)
pie(table(cars$model))
dotchart(cars$price)
dotchart(cars$mileage)
barplot():
==========
--> Consider the following data preparation:
grades<-c("A","A+","B","B+","C")
Marks <- sample(grades,40,replace=T,prob=c(.2,.3,.25,.15,.1))
print(Marks)
sample():
---------
--> sample takes a sample of the specified size from the elements of x using either with or without replacement.
Syntax:
sample(x, size, replace = FALSE, prob = NULL)
Arguments:
----------
--> x: either a vector of one or more elements from which to choose, or a positive integer.
--> size: a non-negative integer giving the number of items to choose.
--> replace: should sampling be with replacement?
--> prob: a vector of probability weights for obtaining the elements of the vector being sampled.
# A bar chart of the Marks vector is obtained from
barplot(table(Marks), main="Mid-Marks")
--> Notice that the barplot() function places the factor levels on the x-axis in the lexicographical order of the levels.
--> Using the parameter names.arg, the bars in the plot can be placed in the order stated in the vector grades.
# plot to the desired horizontal axis labels
barplot(table(Marks), names.arg=grades, main="Mid-Marks")
# Colored bars can be drawn using the col= parameter.
barplot(table(Marks),names.arg=grades,col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")
# A bar chart with horizontal bars can be obtained as follows:
barplot(table(Marks),names.arg=grades,horiz=TRUE,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")
# A bar chart with proportions on the y-axis can be obtained as follows:
barplot(prop.table(table(Marks)),names.arg=grades,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"), main="Mid-Marks")
# The sizes of the factor-level names on the x-axis can be increased using cex.names parameter.
barplot(prop.table(table(Marks)),names.arg=grades,col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"),main="Mid-Marks",cex.names=2)
--> The height parameter of barplot() can also be a matrix.
--> For example, the columns could be the various subjects taken in a course and the rows the labels of the grades.
# Consider the following matrix:
gradTab <- matrix(c(13,10,4,8,5,10,7,2,19,2,7,2,14,12,5), ncol = 3, byrow = TRUE)
rownames(gradTab) <- c("A","A+","B","B+","C")
colnames(gradTab) <- c("DataScience", "DataAnalytics","MachineLearning")
print(gradTab)
# To draw a stacked bar, simply use the command:
barplot(gradTab,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, main="Mid-Marks")
# To draw juxtaposed bars, use the beside parameter, as given below:
barplot(gradTab,beside = T,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, main="Mid-Marks")
# A horizontal bar chart can be obtained using horiz=T parameter:
barplot(gradTab,beside = T,horiz=T,col = c("lightblue","lightcyan", "lavender", "mistyrose", "cornsilk"),legend.text = grades, cex.names=.75, main="Mid-Marks")
Density plot:
=============
--> A very useful and logical follow-up to histograms would be to plot the smoothed density function of a random variable.
# A basic plot produced by the command
plot(density(rnorm(100)),main="Normal density",xlab="x")
# We can overlay a histogram and a density curve with
x=rnorm(100)
hist(x,prob=TRUE,main="Normal density + histogram")
lines(density(x),lty="dotted",col="red")
Combining Plots:
================
--> It's often useful to combine multiple plot types in one graph (for example a Barplot next to a Scatterplot.)
--> R makes this easy with the help of the functions par() and layout().
par():
------
--> par uses the arguments mfrow or mfcol to create a matrix of nrows and ncols c(nrows, ncols) which will serve as a grid for your plots.
# The following example shows how to combine four plots in one graph:
par(mfrow=c(2,2))
plot(cars, main="Speed vs. Distance")
hist(cars$speed, main="Histogram of Speed")
boxplot(cars$dist, main="Boxplot of Distance")
boxplot(cars$speed, main="Boxplot of Speed")
layout():
---------
--> The layout() is more flexible and allows you to specify the location and the extent of each plot within the final combined graph.
# This function expects a matrix object as an input:
layout(matrix(c(1,1,2,3), 2,2, byrow=T))
hist(cars$speed, main="Histogram of Speed")
boxplot(cars$dist, main="Boxplot of Distance")
boxplot(cars$speed, main="Boxplot of Speed")
Getting Started with R_Plots:
=============================
Scatterplot:
------------
# We have two vectors and we want to plot them.
x_values <- rnorm(n = 20 , mean = 5 , sd = 8) #20 values generated from Normal(5,8)
y_values <- rbeta(n = 20 , shape1 = 500 , shape2 = 10) #20 values generated from Beta(500,10)
# If we want to make a plot which has the y_values in vertical axis and the x_values in horizontal axis, we can use the following commands:
plot(x = x_values, y = y_values, type = "p") # standard scatter-plot
plot(x = x_values, y = y_values, type = "l") # plot with lines
plot(x = x_values, y = y_values, type = "n") # empty plot
# We can type ?plot in the console to read about more options.
Boxplot:
--------
# We have some variables and we want to examine their Distributions
# boxplot is an easy way to see if we have some outliers in the data.
boxplot(y_values)
y_values[c(19 , 20)] <- c(0.95 , 1.05) # replace the two last values with outliers
print(y_values)
boxplot(y_values) # Points are the outliers of y_values.
Histograms:
-----------
# Easy way to draw histograms
hist(x = x_values) # Histogram for x vector
hist(x = x_values, breaks = 3) # use breaks to suggest the number of bars you want
Pie_charts:
-----------
# To visualize the frequencies of a variable, just draw a pie chart
# First we have to generate data with frequencies, for example:
P <- c(rep('A' , 3) , rep('B' , 10) , rep('C' , 7) )
print(P)
t <- table(P) # this is a frequency table of variable P
print(t)
pie(t) # And this is a visual version of the table above
Basic Plot:
===========
--> A basic plot is created by calling plot().
--> Here we use the built-in cars data frame that contains the speed of cars
and the distances taken to stop in the 1920s.
--> To find out more about the dataset, use help(cars).
plot(x = cars$speed, y = cars$dist, pch = 1, col = 1,
main = "Distance vs Speed of Cars",
xlab = "Speed", ylab = "Distance")
--> We can use many other variations in the code to get the same result.
--> We can also change the parameters to obtain different results.
with():
-------
--> Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data.
--> Syntax: with(data, expr, ...)
with(cars, plot(dist~speed, pch = 2, col = 3,
main = "Distance to stop vs Speed of Cars",
xlab = "Speed", ylab = "Distance"))
--> Additional features can be added to this plot by calling points(), text(), mtext(), lines(), grid(), etc.
plot(dist~speed, pch = "*", col = "magenta", data=cars,
main = "Distance to stop vs Speed of Cars",
xlab = "Speed", ylab = "Distance")
mtext("In the 1920s.")
grid(col="lightblue")
Histograms:
===========
--> Histograms allow for a pseudo-plot of the underlying distribution of the data.
hist(ldeaths)
hist(ldeaths, breaks = 20, freq = F, col = 3)
# ldeaths belongs to the datasets package (documented under UKLungDeaths)
# Monthly Deaths from Lung Diseases in the UK
# Three time series giving the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979, both sexes (ldeaths), males (mdeaths) and females (fdeaths).
Analysis with Scatter Plot, Box Plot & Matplot
Matplot:
========
--> matplot is useful for quickly plotting multiple sets of observations from the same object, particularly from a matrix, on the same graph.
--> Here is an example of a matrix containing four sets of random draws, each with a different mean.
xmat <- cbind(rnorm(100, -3), rnorm(100, -1), rnorm(100, 1), rnorm(100, 3))
head(xmat)
--> One way to plot all of these observations on the same graph is to do one plot call followed by three more points or lines calls.
plot(xmat[,1], type = 'l')
lines(xmat[,2], col = 'red')
lines(xmat[,3], col = 'green')
lines(xmat[,4], col = 'blue')
--> However, this is both tedious, and causes problems because, among other things, by default the axis limits are fixed by plot to fit only the first column.
--> Much more convenient in this situation is to use the matplot function, which only requires one call and automatically takes care of axis limits and changing the aesthetics for each column to make them distinguishable.
matplot(xmat, type = 'l')
--> Note that, by default, matplot varies both color (col) and linetype (lty) because this increases the number of possible combinations before they get repeated.
--> However, any (or both) of these aesthetics can be fixed to a single value...
matplot(xmat, type = 'l', col = 'black')
--> ...or a custom vector (which will recycle to the number of columns, following standard R vector recycling rules).
matplot(xmat, type = 'l', col = c('red', 'green', 'blue', 'orange'))
--> Standard graphical parameters, including main, xlab, xlim, work exactly the same way as for plot.
--> For more on those, see ?par.
--> Like plot, if given only one object, matplot assumes it's the y variable and uses the indices for x.
--> However, x and y can be specified explicitly.
matplot(x = seq(0, 10, length.out = 100), y = xmat, type='l')
# In fact, both x and y can be matrices.
xes <- cbind(seq(0, 10, length.out = 100),
seq(2.5, 12.5, length.out = 100),
seq(5, 15, length.out = 100),
seq(7.5, 17.5, length.out = 100))
matplot(x = xes, y = xmat, type = 'l')
Empirical Cumulative Distribution Function:
===========================================
--> A very useful and logical follow-up to histograms and density plots would be the Empirical Cumulative Distribution Function.
--> We can use the function ecdf() for this purpose.
# A basic plot produced by the command
plot(ecdf(rnorm(100)),main="Cumulative distribution",xlab="x")
Create a box-and-whisker plot with boxplot()
============================================
# This example use the default boxplot() function and the iris data frame.
head(iris)
--> The iris dataset has been used for classification in many research publications.
--> It consists of 50 samples from each of three classes of iris flowers.
--> One class is linearly separable from the other two, while the latter are not linearly separable from each other.
--> There are five attributes in the dataset:
sepal length in cm,
sepal width in cm,
petal length in cm,
petal width in cm, and
class: Iris Setosa, Iris Versicolour, and Iris Virginica.
--> Detailed description of the dataset and research publications citing it can be found at the UCI Machine Learning Repository.
--> Below we have a look at the structure of the dataset with str().
--> Note that all variable names, package names and function names in R are case sensitive.
str(iris)
--> From the output, we can see that there are 150 observations (records, or rows) and 5 variables (or columns) in the dataset.
--> The first four variables are numeric.
--> The last one, Species, is categorical (called a "factor" in R) and has three levels of values.
# Simple boxplot (Sepal.Length)
# Create a box-and-whisker graph of a numerical variable
boxplot(iris[,1],xlab="Sepal.Length",ylab="Length (in centimeters)",
main="Summary Characteristics of Sepal.Length (Iris Data)")
# Boxplot of sepal length grouped by species
# Create a boxplot of a numerical variable grouped by a categorical variable
boxplot(Sepal.Length~Species,data = iris)
Mat Plot, ECDF & Box Plot with IRIS Data set
# Bring order
# To change order of the box in the plot you have to change the order of the categorical variable's levels.
# For example if we want to have the order virginica - versicolor - setosa
newSpeciesOrder <- factor(iris$Species, levels=c("virginica","versicolor","setosa"))
boxplot(Sepal.Length~newSpeciesOrder, data = iris)
# Change groups names
# If you want to specify a better name for your groups, use the names parameter.
# It takes a vector of the same size as the number of levels of the categorical variable
boxplot(Sepal.Length~newSpeciesOrder,data = iris,names=c("name1","name2","name3"))
Small improvements:
-------------------
# Color
# col : add a vector of the size of the levels of categorical variable
boxplot(Sepal.Length~Species,data = iris,col=c("green","yellow","orange"))
# Proximity of the box
# boxwex: a scale factor applied to the width of the boxes.
boxplot(Sepal.Length~Species,data = iris,boxwex = 0.1)
boxplot(Sepal.Length~Species,data = iris,boxwex = 1)
See the summaries on which the boxplots are based with plot=FALSE:
------------------------------------------------------------------
--> To see the summaries, set the parameter plot to FALSE (a capture sketch follows the list below).
boxplot(Sepal.Length~newSpeciesOrder,data = iris,plot=FALSE)
$stats #summary of the numerical variable for the 3 groups
$n #number of observations in each group
$conf #extreme values of the notches
$out #outlier values
$group #group to which each outlier belongs
$names #group names
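--> A minimal sketch of capturing that list instead of printing it (the component names below come from the boxplot() return value):
b <- boxplot(Sepal.Length~newSpeciesOrder, data = iris, plot=FALSE)
b$stats # five-number summary for each group
b$n # group sizes
b$out # outlier values, if any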
Additional boxplot style parameters:
====================================
Box:
----
boxlty - box line type
boxlwd - box line width
boxcol - box line color
boxfill - box fill colors
Median:
-------
medlty - median line type ("blank" for no line)
medlwd - median line width
medcol - median line color
medpch - median point (NA for no symbol)
medcex - median point size
medbg - median point background color
Whisker:
--------
whisklty - whisker line type
whisklwd - whisker line width
whiskcol - whisker line color
Staple:
-------
staplelty - staple line type
staplelwd - staple line width
staplecol - staple line color
Outliers:
---------
outlty - outlier line type ("blank" for no line)
outlwd - outlier line width
outcol - outlier line color
outpch - outlier point type (NA for no symbol)
outcex - outlier point size
outbg - outlier point background color
Example:
--------
--> Default and heavily modified plots side by side
par(mfrow=c(1,2))
# Default
boxplot(Sepal.Length ~ Species, data=iris)
# Modified
boxplot(Sepal.Length ~ Species, data=iris,
boxlty=2, boxlwd=3, boxfill="cornflowerblue", boxcol="darkblue",
medlty=2, medlwd=2, medcol="red", medpch=21, medcex=1, medbg="white",
whisklty=2, whisklwd=3, whiskcol="darkblue",
staplelty=2, staplelwd=2, staplecol="red",
outlty=3, outlwd=3, outcol="grey", outpch=NA
)
Displaying multiple plots:
==========================
--> Display multiple plots in one image with the different facet functions.
--> An advantage of this method is that all axes share the same scale across charts, making it easy to compare them at a glance.
--> We'll use the mpg dataset included in ggplot2.
library(ggplot2)
# Wrap charts line by line (attempts to create a square layout):
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~class)
# Display multiple charts on one row, multiple columns:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(.~class)
# Display multiple charts on one column, multiple rows:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(class~.)
# Display multiple charts in a grid by 2 variables:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(trans~class) #"row" parameter, then "column" parameter
set.seed():
===========
--> Random Number Generation
--> runif will not generate either of the extreme values unless max = min or max-min is small compared to min, and in particular not for the default arguments.
# Now put the data into random order; observe the example below and try it
a = c(1,2,3,4,5)
print(a) # 1,2,3,4,5
runif(5) # it gives random values
runif(5) # again it gives random values
runif(5) # again it gives random values
sort(runif(5)) # it sorts the generated random values
sort(runif(5)) # again it sorts a new set of random values
order(runif(5)) # it gives the position of the values # 1 3 4 5 2
order(runif(5)) # it gives the position of the values # 4 1 2 3 5
a = c(10,20,30,40,50)
a[order(runif(5))] # 20 10 40 50 30
a[order(runif(5))] # 50 10 30 20 40
# to reproduce the same random order, we have the seeding technique (set.seed(1))
set.seed(1)
runif(5)
set.seed(1)
runif(5) # it generates the same values
set.seed(2)
runif(5) # it generates different values
set.seed(1)
runif(5) # it reproduces the values generated earlier with seed 1
set.seed(1)
sort(runif(5))
set.seed(1)
order(runif(5))
# now use this technique on our credit.csv data
credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv")
set.seed(1)
credit_n <- credit[order(runif(1000)),]
head(credit)
head(credit_n) # check the data before and after randomization
names(credit)
summary(credit$amount)
summary(credit_n$amount) # data is the same; we just randomized the row order
Prepare your data for plotting:
===============================
--> ggplot2 works best with a long data frame.
--> The following sample data represents the prices of sweets on 20
different days, in a format described as wide because each category has its own column.
set.seed(47)
sweetsWide <- data.frame(date = 1:20,
chocolate = runif(20, min = 2, max = 4),
iceCream = runif(20, min = 0.5, max = 1),
candy = runif(20, min = 1, max = 3))
head(sweetsWide)
--> To convert sweetsWide to long format for use with ggplot2, several useful functions from base R, and the packages reshape2, data.table and tidyr (in chronological order) can be used:
# reshape from base R
sweetsLong <- reshape(sweetsWide, idvar = 'date', direction = 'long',
varying = list(2:4), new.row.names = NULL, times = names(sweetsWide)[-1])
dim(sweetsLong)
head(sweetsLong)
# melt from 'reshape2'
library(reshape2)
sweetsLong <- melt(sweetsWide, id.vars = 'date')
dim(sweetsLong)
head(sweetsLong)
# melt from 'data.table'
# which is an optimized & extended version of 'melt' from 'reshape2'
install.packages("data.table")
library(data.table)
sweetsLong <- melt(setDT(sweetsWide), id.vars = 'date')
dim(sweetsLong)
head(sweetsLong)
# gather from 'tidyr'
library(tidyr)
sweetsLong <- gather(sweetsWide, sweet, price, chocolate:candy)
dim(sweetsLong)
head(sweetsLong)
--> See also Reshaping data between long and wide forms for details on converting data between long and wide format.
--> The resulting sweetsLong has one column of prices and one column describing the type of sweet.
--> Now plotting is much simpler:
library(ggplot2)
ggplot(sweetsLong, aes(x = date, y = price, colour = sweet)) + geom_line()
--> Add horizontal and vertical lines to plot
--> Add one common horizontal line for all categorical variables
# sample data
df <- data.frame(x = c('A', 'B'), y = c(3, 4))
p1 <- ggplot(df, aes(x=x, y=y)) +
geom_bar(position = "dodge", stat = 'identity') + theme_bw()
p1 + geom_hline(aes(yintercept=5), colour="#990000", linetype="dashed")
--> Add one horizontal line for each categorical variable
# sample data
df <- data.frame(x = c('A', 'B'), y = c(3, 4))
# add horizontal levels for drawing lines
df$hval <- df$y + 2
p1 <- ggplot(df, aes(x=x, y=y)) +
geom_bar(position = "dodge", stat = 'identity') + theme_bw()
p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval), colour="#990000", width=0.75)
--> Add horizontal line over grouped bars
# sample data
df <- data.frame(x = rep(c('A', 'B'), times=2),
group = rep(c('G1', 'G2'), each=2),
y = c(3, 4, 5, 6),
hval = c(5, 6, 7, 8))
p1 <- ggplot(df, aes(x=x, y=y, fill=group)) +
geom_bar(position="dodge", stat="identity")
p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval),
colour="#990000",
position = "dodge",
linetype = "dashed")
--> Add vertical line
# sample data
df <- data.frame(group=rep(c('A', 'B'), each=20),
x = rnorm(40, 5, 2),
y = rnorm(40, 10, 2))
p1 <- ggplot(df, aes(x=x, y=y, colour=group)) + geom_point()
p1 + geom_vline(aes(xintercept=5), color="#990000", linetype="dashed")
Set.Seed Function & Preparing Data for Scatter Plots
Scatter Plots:
==============
--> We plot a simple scatter plot using the built-in iris data set as follows:
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
geom_point()
Produce basic plots with qplot:
===============================
--> qplot is intended to be similar to the base R plot() function, trying to always plot your data without requiring too many specifications.
# basic qplot
qplot(x = disp, y = mpg, data = mtcars)
# adding colors
qplot(x = disp, y = mpg, colour = cyl,data = mtcars)
# adding a smoother
qplot(x = disp, y = mpg, geom = c("point", "smooth"), data = mtcars)
Vertical and Horizontal Bar Chart:
==================================
?diamonds # Prices of 50,000 round cut diamonds
ggplot(data = diamonds, aes(x = cut, fill =color)) +
geom_bar(stat = "count", position = "dodge")
--> A horizontal bar chart can be obtained simply by adding coord_flip() to the ggplot object:
ggplot(data = diamonds, aes(x = cut, fill =color)) +
geom_bar(stat = "count", position = "dodge") +
coord_flip()
Violin plot:
============
--> Violin plots are kernel density estimates mirrored in the vertical plane.
--> They can be used to visualize several distributions side-by-side, with the mirroring helping to highlight any differences.
ggplot(diamonds, aes(cut, price)) +
geom_violin()
--> Violin plots are named for their resemblance to the musical instrument; this is particularly visible when they are coupled with an overlaid boxplot.
--> This visualisation then describes the underlying distributions both in terms of
Tukey's 5 number summary (as boxplots) and full continuous density estimates (violins).
ggplot(diamonds, aes(cut, price)) +
geom_violin() +
geom_boxplot(width = .1, fill = "black", outlier.shape = NA) +
stat_summary(fun.y = "median", geom = "point", col = "white")
Statistical Methods:
====================
--> When analyzing data, it is possible to have a statistical approach.
--> The basic tools that are needed to perform basic analysis are:
Correlation analysis
Chi-squared Test
T-test
Analysis of Variance
Analysis of Covariance
Hypothesis Testing
Time Series Analysis
Survival Analysis
--> Large datasets are usually not a problem, as these methods are not computationally intensive, with the exception of correlation analysis.
--> In that case, it is always possible to take a sample, and the results should be robust.
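A minimal sketch of this sampling idea (the diamonds data from ggplot2 and the 5000-row sample size are illustrative assumptions):
library(ggplot2)
set.seed(7)
idx <- sample(nrow(diamonds), 5000) # draw a random sample of 5000 rows
cor(diamonds$carat[idx], diamonds$price[idx]) # should be close to cor() on all rows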
Correlation Analysis:
=====================
--> Correlation Analysis seeks to find linear relationships between numeric variables.
--> This can be of use in different circumstances.
--> First of all, the correlation metric used in the example below is the Pearson coefficient.
--> There is, however, another interesting metric of correlation that is less affected by outliers.
--> This metric is called the Spearman correlation.
--> The Spearman correlation metric is more robust to the presence of outliers than the Pearson method and gives better estimates of the relations between numeric variables when the data is not normally distributed.
--> Correlation is a statistical tool which studies the relationship between two variables.
--> Co-efficient of correlation gives the degree(amount) of correlation between two variables.
Formula: r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
Steps:
1. Denote one series by X and the other series by Y
2. Calculate the means x' and y'
3. Calculate the deviations dx and dy:
dx = x - x', dy = y - y'
4. Square these deviations (dx^2 and dy^2) and sum them
5. Multiply the respective dx and dy and sum the products
6. Apply the formula to calculate r.
Calculate the coefficient of correlation between X and Y for the following data:
X: 10,6,9,10,12,13,11,9
Y: 9,4,6,9,11,13,8,4
Solution: build a table with columns X, dx (x - x'), dx^2, Y, dy (y - y'), dy^2, dx*dy
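A minimal sketch of this calculation in R, using the X and Y values from the exercise (cor() is the built-in equivalent):
x <- c(10,6,9,10,12,13,11,9)
y <- c(9,4,6,9,11,13,8,4)
dx <- x - mean(x) # deviations from the mean of X
dy <- y - mean(y) # deviations from the mean of Y
sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)) # r = 43/48 = 0.8958
cor(x, y) # built-in Pearson correlation gives the same result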
library(ggplot2)
# Select variables that are interesting to compare pearson and spearman correlation methods.
x = diamonds[, c('x', 'y', 'z', 'price')]
# From the histograms we can expect differences in the correlations of both metrics.
# In this case as the variables are clearly not normally distributed, the spearman correlation
# is a better estimate of the linear relation among numeric variables.
par(mfrow = c(2,2))
colnm = names(x)
for(i in 1:4) {
hist(x[[i]], col = 'deepskyblue3', main = sprintf('Histogram of %s', colnm[i]))
}
--> From the histograms in the following figure, we can expect differences in the correlations of both metrics.
--> In this case, as the variables are clearly not normally distributed, the spearman correlation is a better estimate of the linear relation among numeric variables.
# Correlation Matrix - Pearson and spearman
cor_pearson <- cor(x, method = 'pearson')
cor_spearman <- cor(x, method = 'spearman')
# Pearson Correlation
print(cor_pearson)
# Spearman Correlation
print(cor_spearman)
QPlot, ViolinPlot, Statistical Methods & Data Exploration and Visualization
Data Exploration and Visualization:
===================================
Have a Look at iris Data:
-------------------------
dim(iris)
names(iris)
str(iris)
attributes(iris)
iris[1:5, ]
head(iris)
tail(iris)
# draw a sample of 5 rows
a <- sample(1:nrow(iris), 5)
print(a)
iris[a, ]
iris[1:10, "Sepal.Length"]
iris[1:10, 1]
iris$Sepal.Length[1:10]
Explore Individual Variables:
-----------------------------
summary(iris)
quantile(iris$Sepal.Length)
quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
var(iris$Sepal.Length)
hist(iris$Sepal.Length)
plot(iris$Sepal.Length)
plot(density(iris$Sepal.Length))
table(iris$Species)
pie(table(iris$Species))
barplot(table(iris$Species))
Explore Multiple Variables:
---------------------------
# Calculate covariance and correlation between variables with cov() and cor().
cov(iris$Sepal.Length, iris$Petal.Length)
cov(iris[,1:4])
cor(iris$Sepal.Length, iris$Petal.Length)
cor(iris[,1:4])
# Compute the stats of Sepal.Length of every Species with aggregate()
aggregate(Sepal.Length ~ Species, summary, data=iris)
boxplot(Sepal.Length ~ Species, data=iris, xlab="Species", ylab="Sepal.Length")
with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
## same function as above
# plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=as.numeric(iris$Species))
# When there are many points, some of them may overlap. We can use jitter() to add a small amount of noise to the data before plotting.
plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
# A smooth scatter plot can be plotted with function smoothScatter(), which produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate.
smoothScatter(iris$Sepal.Length, iris$Sepal.Width)
# A Matrix of Scatter Plots
pairs(iris)
More Explorations:
------------------
--> A 3D scatter plot can be produced with package scatterplot3d
library(scatterplot3d)
scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
--> Package rgl supports interactive 3D scatter plot with plot3d().
library(rgl)
plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
--> A heat map presents a 2D display of a data matrix, which can be generated with heatmap() in R.
--> With the code below, we calculate the similarity between different flowers in the iris data with dist() and then plot it with a heat map.
distMatrix <- as.matrix(dist(iris[,1:4]))
heatmap(distMatrix)
--> A level plot can be produced with function levelplot() in package lattice.
--> Function grey.colors() creates a vector of gamma-corrected gray colors.
--> A similar function is rainbow(), which creates a vector of contiguous colors.
library(lattice)
levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9, col.regions=grey.colors(10)[10:1])
--> Contour plots can be plotted with contour() and filled.contour() in package graphics, and with contourplot() in package lattice.
?volcano # Understand the volcano dataset.
filled.contour(volcano, color=terrain.colors, asp=1, plot.axes=contour(volcano, add=T))
--> Another way to illustrate a numeric matrix is a 3D surface plot shown as below, which is generated with function persp().
persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
--> Parallel coordinates provide nice visualization of multiple dimensional data.
--> A parallel coordinates plot can be produced with parcoord() in package MASS, and with parallelplot() in package lattice.
library(MASS)
parcoord(iris[1:4], col=iris$Species)
library(lattice)
parallelplot(~iris[1:4] | Species, data=iris)
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)
Save Charts into Files:
-----------------------
--> Save charts into PDF and PS files respectively with functions pdf() and postscript().
--> Picture files of BMP, JPEG, PNG and TIFF formats can be generated respectively with bmp(), jpeg(), png() and tiff().
--> Note that the files (or graphics devices) need to be closed with graphics.off() or dev.off() after plotting.
# save as a PDF file
pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
dev.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()
Machine Learning:
=================
--> It is similar to human learning
--> Machine learning is the subfield of computer science that, according to Arthur Samuel, gives "computers the ability to learn without being explicitly programmed."
--> Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959 while at IBM.
--> Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed.
Traditional Programming vs Machine Learning:
===========================================
--> In traditional programming, if we give inputs + programs to the computer, then computer gives the output.
--> In machine learning, if we give inputs + outputs to the computer, then computer gives the program (Predictive Model).
Example 1: Here "a" and "b" are inputs and "c" is output
---------------
a b c
-- -- --
1 2 3
2 3 5
3 4 7
4 5 9
9 10 ?
What is the output of c?
Example 2: Here "x" is input and "y" is output
---------------
x y
-- --
1 10
2 20
3 30
4 40
5 ?
500 ?
y ~ x : y=10x
Example 3: Here "x" is input and "y" is output
---------------
x y
-- --
1 14
2 18
3 22
4 26
5 ?
750 ?
here we can observe linear regression
y ~ x : y = mx + c, where m is the slope and c is a constant
y=4x+10
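A minimal sketch in R of this idea: lm() can recover the slope and intercept from the table above (the variable names are illustrative):
x <- c(1,2,3,4)
y <- c(14,18,22,26)
model <- lm(y ~ x) # fit y = mx + c
coef(model) # intercept c = 10, slope m = 4
predict(model, data.frame(x = c(5, 750))) # 30 and 3010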
Machine Learning Engineer:
==========================
1. Convert the business data into a statistical model
2. Make the machine develop (train) the model
3. Evaluate the performance of the model
Actual vs Predicted (% accuracy, % error)
4. Techniques to improve the performance.
(Classification, Regression, Clustering)
Types of Machine Learning:
==========================
--> There are 3 types of Machine Learning Algorithms.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
--> Supervised and unsupervised learning are the most widely used by machine learning engineers
--> Reinforcement learning is really powerful but complex to apply to problems.
Supervised Learning:
====================
--> As we know, machine learning takes data as input and output (training data)
--> The training data includes both Inputs and Labels (Targets or Outputs)
--> For example, for addition of two numbers a=5, b=6, result=11: the inputs are 5 and 6 and the target is 11.
--> We first train the model with lots of training data (inputs & targets); then, given new data, we use the logic learned earlier to predict the output
--> This process is called Supervised Learning, which is really fast and accurate.
--> This algorithm consists of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables).
--> Using this set of variables, we generate a function that maps inputs to desired outputs.
--> The training process continues until the model achieves a desired level of accuracy on the training data.
Types of Supervised Learning:
-----------------------------
1. Classification
2. Regression
Classification:
---------------
--> This is a type of problem where we predict the categorical response value where the data can be separated into specific “classes” (ex: we predict one of the values in a set of values).
Some examples are :
--> this mail is spam or not?
--> will it rain today or not?
--> is this picture a cat or not?
Basically, ‘Yes/No’ type questions are called binary classification.
Other examples are :
--> mail is spam or important or promotion?
--> is this picture a cat or a dog or a tiger?
This type is called multi-class classification.
Regression:
-----------
--> This is a type of problem where we need to predict a continuous response value (ex: we predict a number which can vary from -infinity to +infinity)
Some examples are:
--> what is the price of house in a specific city?
--> what is the value of the stock?
--> how many total runs will be on the board in a cricket game?
Supervised Learning Algorithms:
-------------------------------
Decision Trees
KNN
Naive Bayes Classification
Support vector machines for classification problems
Random forest for classification and regression problems
Linear regression for regression problems
Ordinary Least Squares Regression
Logistic Regression
Unsupervised Learning:
======================
--> The training data does not include Targets here, so we don't tell the system where to go; the system has to understand itself from the data we give.
--> Here the training data is not structured (it may contain noisy data, unknown data, etc.)
--> In this algorithm, we do not have any target or outcome variable to predict / estimate.
--> It is used for clustering a population into different groups, which is widely used for segmenting customers into different groups for specific interventions.
--> Unsupervised learning is a bit more difficult to implement and it is not used as widely as supervised learning.
Types of Unsupervised Learning:
-------------------------------
1. Clustering
2. Pattern Detection (Association Rule)
Clustering:
-----------
--> This is a type of problem where we group similar things together.
--> It is a bit similar to multi-class classification, but here we don't provide the labels; the system understands from the data itself and clusters the data.
Some examples are :
--> given news articles, cluster into different types of news
--> given a set of tweets, cluster based on content of tweet
--> given a set of images, cluster them into different objects
Association Rule:
-----------------
--> Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.
--> An association rule has two parts: an antecedent (if) and a consequent (then).
--> An antecedent is an item found within the data.
--> A consequent is an item found in combination with the antecedent.
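A minimal sketch of mining such rules (assuming the arules package and its bundled Groceries transaction data; the support and confidence thresholds are illustrative):
library(arules)
data(Groceries) # transaction data bundled with arules
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(sort(rules, by = "lift")[1:3]) # antecedent (lhs) => consequent (rhs)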
Unsupervised Learning Algorithms:
---------------------------------
K-means for clustering problems
Apriori algorithm for association rule learning problems
Principal Component Analysis
Singular Value Decomposition
Independent Component Analysis
Reinforcement Learning:
=======================
--> Using this algorithm, the machine is trained to make specific decisions.
--> It works this way: the machine is exposed to an environment where it trains itself continually using trial and error.
--> This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.
--> Example of Reinforcement Learning: Markov Decision Process
Machine Learning, Types of ML
Machine Learning:
=================
--> Solve real-time problems
Understand the data
Draw insights from the data (75%)
Test the performance (25%)
The insights are applied to the new data to get the prediction
Ex 1:
Jio --> 3 months free of cost
Airtel
Aircel
Vodafone
Idea
10% of people moved from Airtel --> Jio
40% are waiting to move after the 3-month free Jio service
1 crore customers -- 10% means 10 lakh * Rs 200 = Rs 20 crore loss
for 1 year, 20*12 = Rs 240 crore loss
Airtel did the analysis on that 10% based on their data (IMEI no., location, internet data, recharge plan, ...) and changed their recharge plans, e.g., dynamic pricing (as in airline and bus tickets)
Ex 2:
Marketing Team --> 1000 Customers' information
age,salary,credit,loans,Responds
34,50K,4,2,yes
40,80K,2,4,no
..........,yes
..........,no
..........,yes
age>30,Salary<50K,credit<4, loans<2,responds=yes
yes = called, then no response
no = don't know whether they would respond or not
Data sets divide into 2 parts --> Train + Test
Data sets --> Train(75%) + Test(25%) #thumb rule
Split the data into 4 parts of 25% each; train on 3 parts and test on the 4th, in all 4 ways.
First test on Known data later test on new data.
Target Attribute --> Categorical (Classification)
Target Attribute --> Numerical (Regression)
(Identifying the relation between the attributes)
K-Nearest Neighbour (KNN) Classification:
========================================
--> K-Nearest Neighbors is one of the most basic classification algorithms in Machine Learning.
--> It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.
--> Mostly used for Life Sciences
Ex: BP,Col,BSugar,...HeartAttack
It calculates "Distance between the Data Points"
K is a number; initially its value is 1 (even values can produce 50/50 ties between the nearest points, so it is better to take odd values, e.g., 11 in place of 10)
x=(x1,x2,...xn)
y=(y1,y2,...yn)
d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
       = sqrt(sum from i=1 to n of (xi-yi)^2)
Mathematically this distance is called as "EUCLIDEAN DISTANCE".
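A minimal sketch of the Euclidean distance in R (the vectors are illustrative; dist() is the built-in equivalent):
x <- c(1, 2, 3)
y <- c(4, 6, 8)
sqrt(sum((x - y)^2)) # sqrt(9 + 16 + 25) = 7.0711
dist(rbind(x, y)) # built-in equivalent gives the same distance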
KNN depends on --> Distance
NB depends on --> Probability
DT depends on --> Information gain & entropy
# Steps to follow in machine learning
1. Collecting the data
2. Data Preparation
3. Train the model
4. Evaluate the performance (Actual vs Predicted)
5. Improve the performance
KNN:
Train Data, Train Labels
Test Data, Test Labels
Machine --> Train Data, Train Labels --> Data Model --> Test Data --> Labels for the test data
Distance between test data & train data
predicted vs actual
# Simple Example for KNN Classification
m1,m2,m3,result
80,92,56,pass
90,75,79,pass
40,32,80,fail
73,79,22,fail
56,89,42,pass
75,87,67,pass
33,82,34,fail
20,32,12,fail
67,54,45,pass
10,40,55,fail
# copy this data and load it into R via the clipboard
marks <- read.delim("clipboard", sep=",", stringsAsFactors=F)
print(marks) #check the data
str(marks) #change the result datatype to factor
marks$result <- factor(marks$result)
train_data <- marks[1:7,-4]
test_data <- marks[8:10,-4]
print(train_data)
print(test_data)
train_labels <- marks[1:7,4]
test_labels <- marks[8:10,4]
print(marks)
print(train_data)
print(train_labels)
print(test_data)
print(test_labels)
# install & attach the "class" package to work on KNN algorithm
install.packages("class")
library(class)
search()
predicted_labels <- class::knn(train_data,test_data,train_labels,k=1)
predicted_labels #fail fail fail
test_labels #fail pass fail
predicted_labels <- class::knn(train_data,test_data,train_labels,k=3)
predicted_labels #fail pass fail
test_labels #fail pass fail
predicted_labels <- class::knn(train_data,test_data,train_labels,k=5)
predicted_labels #fail pass fail
test_labels #fail pass fail
predicted_labels <- class::knn(train_data,test_data,train_labels,k=7)
predicted_labels #pass pass pass
test_labels #fail pass fail
# for k=3 & k=5 it is predicting correctly
# We have a cancer data set containing Malignant (harmful, spreads across the body) and Benign (not harmful) cases. -- First understand the data in the data set.
# Collecting the data
cancer <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/cancer.csv",stringsAsFactors=FALSE)
dim(cancer) #569 32
# Data Preparation
names(cancer) # out of 32 columns, "id" is not required for analysis
cancer <- cancer[-1] # drop the 1st column; 31 columns remain
str(cancer) # only 1 string column is available; convert it into a factor data type
dim(cancer) #569 31
names(cancer)
cancer$diagnosis <- factor(cancer$diagnosis,levels=c("B","M"),labels=c("Benign","Malignant"))
str(cancer) # now the 1st column changed to factor data type
table(cancer$diagnosis) # Benign is 357, Malignant is 212
prop.table(table(cancer$diagnosis))
prop.table(table(cancer$diagnosis))*100 # 62.8 37.2
summary(cancer[-1]) # here the data ranges are different; bring them to a common range using normalization, i.e., 0 to 1.
Normalization rescales data so that all values fall within the same range.
NORMALIZE --> (x-min(x)) / (max(x)-min(x))
10,20,5,1,2,15,25,31
(10-1)/(31-1), (20-1)/(31-1), (5-1)/(31-1), ..., (31-1)/(31-1) (here the min value 1 maps to 0 and the max value 31 maps to 1)
0 to 1 --> Normalization
normalize <- function(x)
{
return((x-min(x)) / (max(x)-min(x)))
}
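A quick check of the function on the small example above:
normalize(c(10,20,5,1,2,15,25,31)) # min 1 maps to 0, max 31 maps to 1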
cancer_n <- lapply(cancer[-1],normalize)
class(cancer_n) # it is a list of values, but we want a data frame
cancer_n <- data.frame(lapply(cancer[-1],normalize))
class(cancer_n) # now it is a data frame
summary(cancer_n) # now all the values are in a common range, i.e., 0 to 1
# Now split the data set (100% = 569 observations) into Train (75% = first 427 observations) and Test (25% = last 142 observations).
train_data <- cancer_n[1:427,]
test_data <- cancer_n[428:569,]
train_labels <- cancer[1:427,1]
test_labels <- cancer[428:569,1]
dim(train_data) #427 30
dim(test_data) #142 30
length(train_labels) #427
length(test_labels) #142
# Train the Model
predict_labels <- knn(train_data,test_data,train_labels,k=1)
predict_labels[1:10] # predicted vs actual match
test_labels[1:10]
predict_labels[11:20] # predicted vs actual match
test_labels[11:20]
predict_labels[21:30] # predicted vs actual do not fully match
test_labels[21:30]
KNN Classification with Cancer Data set Part
# Evaluate the performance of the model
table(test_labels)
table(predict_labels)
# for comparing all labels we use CrossTable(), contained in the "gmodels" package
install.packages("gmodels")
library(gmodels)
CrossTable(test_labels,predict_labels,prop.r=FALSE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
CrossTable(test_labels,predict_labels,prop.r=TRUE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
# Improve the performance of the model
# Some of the labels are not matching, let us change the k value from 1 to 3
predict_labels <- knn(train_data,test_data,train_labels,k=3)
CrossTable(test_labels,predict_labels,prop.r=TRUE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
# Change the K values from 3 to 5 and evaluate
predict_labels <- knn(train_data,test_data,train_labels,k=5)
CrossTable(test_labels,predict_labels,prop.r=TRUE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
predict_labels <- knn(train_data,test_data,train_labels,k=7)
CrossTable(test_labels,predict_labels,prop.r=TRUE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
# here some benign cases are misclassified - not a problem
predict_labels <- knn(train_data,test_data,train_labels,k=9)
CrossTable(test_labels,predict_labels,prop.r=TRUE,prop.c=FALSE,prop.t=FALSE,prop.chisq=FALSE,dnn=c("Actual","predicted"))
# here some malignant cases are misclassified - a problem
# now do the prediction with first 25% test data and last 75% train data
train_data <- cancer_n[143:569,]
test_data <- cancer_n[1:142,]
train_labels <- cancer[143:569,1]
test_labels <- cancer[1:142,1]
predict_labels <- knn(train_data,test_data,train_labels,k=7) # here the class proportions are not preserved, so it gives more error
prop.table(table(cancer$diagnosis)) #here B is 63 & M is 37
prop.table(table(train_labels)) #here B is 70 & M is 30
prop.table(table(test_labels)) #here B is 42 & M is 58
KNN Classification with Cancer Data set Part
Naive Bayes: based on Bayes' theorem
============
--> Naive Bayes classifier is a simple classifier that has its foundation on the well known Bayes’s theorem.
--> Naive Bayes algorithm, in particular is a logic based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets.
Probability - the chance of an event occurring
Probability = no. of times an event occurred / total no. of chances
10 days --> 6 days rained, 4 days not rained (in independent events)
p(rain=yes) = 6/10 = 60%
p(rain=no) = 4/10 = 40%
10 days --> 6 days rained, 4 days not rained (in dependent events)
p(rain=yes) = 6/10 = 60% - depends on factors like temperature, cloud cover, ...
p(rain=no) = 4/10 = 40% - depends on factors like temperature, cloud cover, ...
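A minimal sketch of probability as relative frequency in R (the 0/1 vector is illustrative):
rained <- c(1,1,1,1,1,1,0,0,0,0) # 6 rainy days out of 10
mean(rained) # p(rain=yes) = 0.6
prop.table(table(rained)) # p(rain=no) = 0.4, p(rain=yes) = 0.6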
Joint Probability:
--------------------
it comes in 2 types
1. Independent Events
P(A and B) = P(A) * P(B)
2. Dependent Events
         P(B/A) * P(A)
P(A/B) = -------------
             P(B)
                 likelihood * prior prob
posterior prob = ---------------------------
                 evidence (or) marginal prob
# once go to the simple sms_spam1.csv file and understand the data
# Calculate this by using independent events
p(spam) -- 0.4
p(ham) -- 0.6
p(spam/sms) = p(spam) * p(sms)
4/10 * 3/10 = 0.12
p(ham/sms) = p(ham) * p(sms)
6/10 * 3/10 = 0.18 # treating these as independent events is not the correct way, so we have to follow the Bayesian theorem (dependent events)
# Calculate this by using dependent events(bayesian theorem)
p(spam) -- 0.4
p(ham) -- 0.6
              p(sms/spam) * p(spam)
p(spam/sms) = ---------------------
                     p(sms)

              1/4 * 4/10
            = ---------- = 1/3 = 33.33%
                 3/10

             p(sms/ham) * p(ham)
p(ham/sms) = -------------------
                   p(sms)

             2/6 * 6/10
           = ---------- = 2/3 = 66.66%
                3/10
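The same arithmetic in R (a sketch of the worked example above; the variable names are ours):
p_spam <- 4/10; p_sms_given_spam <- 1/4; p_sms <- 3/10
p_sms_given_spam * p_spam / p_sms # p(spam/sms) = 1/3 = 33.33%
p_ham <- 6/10; p_sms_given_ham <- 2/6
p_sms_given_ham * p_ham / p_sms # p(ham/sms) = 2/3 = 66.66%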
# Once go to the sms_spam.csv file and understand the data
# Collecting the data
sms_data <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sms_spam.csv", stringsAsFactors=FALSE)
str(sms_data)
# Data Preparation
sms_data$type = factor(sms_data$type)
str(sms_data)
# install tm package for text mining on our data
install.packages("tm")
library(tm)
library(help="tm") # once gothrough the fn's in tm
# 1. Create the Corpus for the collection of documents
# 2. Clean the data: convert to a common letter case, remove numbers, stopwords, punctuation and extra whitespace
sms_corpus <- Corpus(VectorSource(sms_data$text))
inspect(sms_corpus) # it inspects all 5574 documents
inspect(sms_corpus[1:3]) # it inspects the first 3 documents
# Here the data is in mixed case (upper & lower); convert it all to lowercase by using tm_map()
sms_clean <- tm_map(sms_corpus,tolower)
inspect(sms_corpus[1:3])
inspect(sms_clean[1:3])
# Remove numbers from the data
sms_clean <- tm_map(sms_clean,removeNumbers)
inspect(sms_corpus[1:3])
inspect(sms_clean[1:3])
# Remove stopwords from data, stopwords() contains 174 words
stopwords()
sms_clean <- tm_map(sms_clean,removeWords,stopwords())
inspect(sms_corpus[1:3])
inspect(sms_clean[1:3])
# Remove Punctuation from data
sms_clean <- tm_map(sms_clean,removePunctuation)
inspect(sms_corpus[1:3])
inspect(sms_clean[1:3])
# Remove Whitespace from data
sms_clean <- tm_map(sms_clean,stripWhitespace)
inspect(sms_corpus[1:3])
inspect(sms_clean[1:3])
# Collect the spam messages and clean the data
sms_spam <- subset(sms_data,type=="spam")
spam_corpus <- Corpus(VectorSource(sms_spam$text))
spam_clean <- tm_map(spam_corpus, tolower)
spam_clean <- tm_map(spam_clean, removeNumbers)
spam_clean <- tm_map(spam_clean, removeWords, stopwords())
spam_clean <- tm_map(spam_clean, removePunctuation)
spam_clean <- tm_map(spam_clean, stripWhitespace)
inspect(spam_corpus[1:3])
inspect(spam_clean[1:3])
# Collect the ham messages and clean the data
sms_ham <- subset(sms_data,type=="ham")
ham_corpus <- Corpus(VectorSource(sms_ham$text))
ham_clean <- tm_map(ham_corpus, tolower)
ham_clean <- tm_map(ham_clean, removeNumbers)
ham_clean <- tm_map(ham_clean, removeWords, stopwords())
ham_clean <- tm_map(ham_clean, removePunctuation)
ham_clean <- tm_map(ham_clean, stripWhitespace)
inspect(ham_corpus[1:3])
inspect(ham_clean[1:3])
Naive Bayes Classification with SMS Spam
# Collecting the data
sms_data <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/sms_spam.csv", stringsAsFactors=FALSE)
# The type element is currently a character vector.
# Convert it into a factor.
sms_data$type <- factor(sms_data$type)
# Displays description of each variable
str(sms_data)
head(sms_data)
table(sms_data$type)
prop.table(table(sms_data$type))
# Data preparation - cleaning and standardizing text data
# The tm package can be installed via the install.packages("tm") and
# loaded with the library(tm) command.
library(tm)
# a corpus is used to collect text documents
# In order to create a corpus, VCorpus() from the tm package is used
# VectorSource() is a reader function that creates a source object from the existing sms_data$text
sms_corpus <- VCorpus(VectorSource(sms_data$text))
print(sms_corpus)
# View a summary of the first and second SMS messages in the corpus
inspect(sms_corpus[1:2])
# The as.character() is used to view actual message text
as.character(sms_corpus[[1]])
# The lapply() function is used to apply a procedure to each element of an R data structure.
lapply(sms_corpus[1:2], as.character)
# Text transformation
# The tm_map() function provides a method to apply a transformation
# to a tm corpus.
# New transformation save the result in a new object called sms_cleaned_corpus
# Convert text into lowercase. Here used following functions:
# content_transformer(); tm wrapper function
# tolower(); lowercase transformation function
sms_cleaned_corpus <- tm_map(sms_corpus, content_transformer(tolower))
# Check the difference between sms_corpus and sms_cleaned_corpus
as.character(sms_corpus[[1]])
as.character(sms_cleaned_corpus[[1]])
# Remove numbers from SMS messages
sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removeNumbers)
# Remove filler words using stopwords() and removeWords() functions
sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removeWords, stopwords())
# Remove punctuation characters
sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, removePunctuation)
# Reducing words to their root form using stemming. The tm package provides
# stemming functionality via integration with the SnowballC package.
# The SnowballC package can be installed via the install.packages("SnowballC") and
# loaded with the library(SnowballC) command.
library(SnowballC)
# Apply stemming
sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, stemDocument)
# Remove additional whitespace
sms_cleaned_corpus <- tm_map(sms_cleaned_corpus, stripWhitespace)
# Data preparation - splitting text documents into words(Tokenization)
# Create a data structure called a Document Term Matrix(DTM)
sms_dtm <- DocumentTermMatrix(sms_cleaned_corpus)
sms_dtm
# Divide the data into a training set and a test set with ratio 75:25
# The SMS messages are sorted in a random order.
sms_dtm_train <- sms_dtm[1:4181, ]
sms_dtm_test <- sms_dtm[4182:5574, ]
# Create labels that are not stored in the DTM
sms_train_lables <- sms_data[1:4181, ]$type
sms_test_lables <- sms_data[4182:5574, ]$type
# Compare the proportion of spam in the training and test data
prop.table(table(sms_train_lables))
prop.table(table(sms_test_lables))
# Visualizing text data using word clouds
# The wordcloud package can be installed via the install.packages("wordcloud") and
# loaded with the library(wordcloud) command.
library(wordcloud)
# Create wordcloud from a tm corpus object
pal <-brewer.pal(8,"Dark2")
wordcloud(sms_cleaned_corpus, min.freq=40, random.order = FALSE, colors=pal)
# Create wordcloud for spam and ham data subsets
spam <- subset(sms_data, type == "spam")
wordcloud(spam$text, max.word = 40, scale = c(4, 0.8), colors=pal)
ham <- subset(sms_data, type == "ham")
wordcloud(ham$text, max.word = 40, scale = c(4, 0.8), colors=pal)
# Data preparation - Creating indicator features for frequent words
sms_frequent_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_frequent_words)
sms_dtm_freq_train<- sms_dtm_train[ , sms_frequent_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_frequent_words]
# print the most frequent words in each class.
sms_corpus_ham <- VCorpus(VectorSource(ham$text))
sms_corpus_spam <- VCorpus(VectorSource(spam$text))
sms_dtm_ham <- DocumentTermMatrix(sms_corpus_ham, control = list(tolower = TRUE,removeNumbers = TRUE,stopwords = TRUE,removePunctuation = TRUE,stemming = TRUE))
sms_dtm_spam <- DocumentTermMatrix(sms_corpus_spam, control = list(tolower = TRUE,removeNumbers = TRUE,stopwords = TRUE,removePunctuation = TRUE,stemming = TRUE))
sms_dtm_ham_frequent_words <- findFreqTerms(sms_dtm_ham, lowfreq= 0, highfreq = Inf)
head(sms_dtm_ham_frequent_words)
tail(sms_dtm_ham_frequent_words)
sms_dtm_spam_frequent_words <- findFreqTerms(sms_dtm_spam, lowfreq= 0, highfreq = Inf)
head(sms_dtm_spam_frequent_words)
tail(sms_dtm_spam_frequent_words)
# The following defines a convert_counts() function to convert counts to
# Yes / No strings:
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
# Apply above function to train and test data sets.
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,convert_counts)
# Training a model using Naive Bayes
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_lables)
# Evaluating model
sms_test_pred <- predict(sms_classifier, sms_test)
library(gmodels)
CrossTable(sms_test_pred, sms_test_lables,prop.chisq = FALSE, prop.t = FALSE,dnn = c('predicted', 'actual'))
# Accuracy : Measures of performance
library(caret)
confusionMatrix(sms_test_pred, sms_test_lables, positive = "spam")
# Improving model performance
# Adding Laplace estimator
new_sms_classifier <- naiveBayes(sms_train, sms_train_lables, laplace = 1)
new_sms_classifier_pred <- predict(new_sms_classifier, sms_test)
# Compare the predicted classes to the actual classifications using cross table
CrossTable(new_sms_classifier_pred, sms_test_lables, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('Predicted', 'Actual'))
Train & Evaluate a Model using Navie Bayes
DECISION TREES:
===============
--> Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.
--> It works for both categorical and continuous input and output variables.
--> In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator in the input variables.
Types of Decision Trees:
------------------------
The type of decision tree is based on the type of target variable we have. It can be of two types:
--> Categorical Variable Decision Tree: a decision tree with a categorical target variable is called a categorical variable decision tree.
--> Continuous Variable Decision Tree: a decision tree with a continuous target variable is called a continuous variable decision tree.
Ex: Deciding whether to join a job or not
salary: high, medium, low
working hrs: high, medium, low
distance: long, medium, short
if salary = high --> join = yes
if salary = medium or low,
   working hrs = medium or low,
   distance = medium or short --> join = yes
if salary = low,
   working hrs = low,
   distance = short --> join = yes
--> "ENTROPY" is used in decision trees
--> Generally, entropy refers to disorder or uncertainty and the definition of entropy used in information theory is directly analogous to the definition used in statistical thermodynamics.....if these values are equally probable, the entropy (in bits) is equal to this number.
--> For entropy formula once goto www.pmean.com/definitions/entropy.htm
--> In 1975 Ross Quinlan developed an algorithm ID3 (Iterative Dichotomiser 3) --> C4.4 ---> C5.0
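# A minimal sketch (not part of the course code) of the entropy formula
# H = -sum(p_i * log2(p_i)) that decision trees use to measure node impurity:
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log2(p))
}
entropy(c("yes", "yes", "no", "no"))    # 1 bit: two equally likely classes
entropy(c("yes", "yes", "yes", "yes"))  # 0 bits: a pure node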
# Collecting the data
credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv", stringsAsFactors=TRUE)
dim(credit)
str(credit)
# Go to the UCI Machine Learning Repository, look up the credit data set, and understand the data column by column; the 17th column is default (1, 2)
# In the default column, 1 means not a defaulter (paying regularly) and 2 means a defaulter (not paying regularly)
table(credit$checking_balance)
table(credit$months_loan_duration)
table(credit$credit_history)
table(credit$purpose)
table(credit$default)
# Preparing the Data
default <- subset(credit,default==2)
nondefault <- subset(credit,default==1)
dim(default)
dim(nondefault)
table(default$checking_balance)
table(nondefault$checking_balance)
prop.table(table(default$checking_balance))
prop.table(table(nondefault$checking_balance))
prop.table(table(default$checking_balance))*100
prop.table(table(nondefault$checking_balance))*100
prop.table(table(default$purpose))*100
prop.table(table(nondefault$purpose))*100
# Just to understand the data: what percentage of defaulters and non-defaulters fall into each level of checking_balance, purpose, employment_length, ...
# The age column is numerical; compare its summaries for the two groups to see whether it separates defaulters from non-defaulters.
summary(default$age)
summary(nondefault$age)
# Some of the columns do not help to identify the defaulters; identify those columns
# Here we have to install the package "C50"
install.packages("C50")
library(C50)
# Load the credit.csv data set once again.
credit <- read.csv("C:/Users/Sreenu/Desktop/MLDataSets/credit.csv", stringsAsFactors=TRUE)
str(credit)
credit$default <- factor(credit$default, levels=c(1,2), labels=c("NO","YES"))
str(credit)
# Now put the credit data into random order, using a seed so the shuffle is reproducible
set.seed(1)
credit_n <- credit[order(runif(1000)),]
head(credit_n)
head(credit) # compare the data before and after randomizing
summary(credit$amount)
summary(credit_n$amount) # the data is the same; we only randomized the row order
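# A quick aside: set.seed() makes the shuffle reproducible -
# re-running with the same seed gives the same ordering:
set.seed(1); head(order(runif(1000)))
set.seed(1); head(order(runif(1000)))  # identical to the line above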
names(credit)
train_data <- credit_n[1:750,-17]
test_data <- credit_n[751:1000,-17] # a fixed 75/25 split is not mandatory; we could even train on 100% of the data, as we observe later
train_labels <- credit_n[1:750,17]
test_labels <- credit_n[751:1000,17]
prop.table(table(train_labels))
prop.table(table(test_labels))
# both sets keep roughly the same 70/30 class proportions; change to set.seed(2) and observe
set.seed(2)
credit_n <- credit[order(runif(1000)),]
train_labels <- credit_n[1:750,17]   # recompute the labels from the new shuffle
test_labels <- credit_n[751:1000,17]
prop.table(table(train_labels))
prop.table(table(test_labels))
# here also the class proportions are approximately 70% and 30%.
# Let us make it a 90% training split and observe
train_data <- credit_n[1:900,-17]
test_data <- credit_n[901:1000,-17]
train_labels <- credit_n[1:900,17]
test_labels <- credit_n[901:1000,17]
prop.table(table(train_labels))
prop.table(table(test_labels))
# here also the proportions stay approximately 70/30; the class balance is preserved, so training on 100% of the data is also fine
# Train the model
library(C50)
credit_classifier <- C50::C5.0(credit_n[,-17],credit_n[,17])
# Here I gave all the columns for training; later we will observe subsets of columns
credit_classifier
# Evaluate the performance of the model
# Now observe the tree
summary(credit_classifier)
# The system has built the decision tree classifier, which reads like a chain of if-else statements
# Observe the attribute usage and check the error percentage
# Improve the performance of the model
# Now I am taking only 3 columns
credit_classifier <- C50::C5.0(credit_n[,c(1,2,3)],credit_n[,17])
summary(credit_classifier)
# Now size of the tree decreases and check the error %, and add the required columns and check the error % (it will decrease).
credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17])
summary(credit_classifier)
# In the R help, search for C5.0 and check the parameter - trials
# trials - the number of boosting iterations.
credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17],trials=5)
summary(credit_classifier)
credit_classifier <- C50::C5.0(credit_n[,c(1,2,3,4,5)],credit_n[,17],trials=10)
summary(credit_classifier)
credit_classifier <- C50::C5.0(credit_n[,-17],credit_n[,17],trials=10)
summary(credit_classifier)
# Check the examples on the C5.0 help page, e.g. plotting
plot(credit_classifier) # the tree is very big; try again with fewer columns
credit_classifier <- C50::C5.0(credit_n[,c(1,2,3)],credit_n[,17],trials=10)
plot(credit_classifier)
test_data <- credit_n[1:250,-17]
predict_labels <- predict(credit_classifier,test_data)
library(gmodels)
CrossTable(credit_n[1:250,17],predict_labels,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE,prop.chisq=FALSE)
CrossTable(credit_n[1:250,17],predict_labels,prop.t=FALSE,prop.r=TRUE,prop.c=FALSE,prop.chisq=FALSE)
# Check the number of defaulters and the proportions
test_data <- credit_n[251:500,-17]
predict_labels <- predict(credit_classifier,test_data) # re-predict for each new slice
CrossTable(credit_n[251:500,17],predict_labels,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE,prop.chisq=FALSE)
test_data <- credit_n[501:750,-17]
predict_labels <- predict(credit_classifier,test_data)
CrossTable(credit_n[501:750,17],predict_labels,prop.t=FALSE,prop.r=FALSE,prop.c=FALSE,prop.chisq=FALSE)
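# Besides CrossTable(), the overall accuracy on a slice can be computed
# directly (a small sketch using the objects above):
predict_labels <- predict(credit_classifier, credit_n[501:750, -17])
mean(predict_labels == credit_n[501:750, 17])  # fraction of correct predictions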
Regression:
===========
--> This is a type of problem where we need to predict a continuous response value (ex: predict a number which can vary from -infinity to +infinity)
Some examples are:
--> what is the price of a house in a specific city?
--> what is the value of a stock?
--> how many total runs will be scored in a cricket game?
Algorithms in Regression:
-------------------------
1. Linear Regression
2. Logistic Regression
Linear Regression:
==================
--> Regression analysis is a very widely used statistical tool to establish a relationship model between two variables.
--> One of these variables is called the predictor variable, whose value is gathered through experiments.
--> The other variable is called the response variable, whose value is derived from the predictor variable.
--> In linear regression these two variables are related through an equation where the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph.
--> A non-linear relationship, where the exponent of a variable is not equal to 1, creates a curve.
--> The general mathematical equation for a linear regression is:
y = ax + b
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
lm():
-----
--> lm() function creates the relationship model between the predictor and the response variable.
Syntax:
lm(formula,data)
Arguments:
--> formula is a symbol presenting the relation between x and y.
--> data is the data frame containing the variables in the formula.
Example 1:
----------
x <- c(1,2,3,4,5)
y <- c(14,18,22,26,30)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
summary(relation)
# Predict the y value for x=500
a <- data.frame(x=500)
result <- predict(relation,a)
print(result)
# Predict the y values for the original x values
newdata <- data.frame(x = x)
result <- predict(relation, newdata)
print(result)
# Evaluate the performance of the model
# (a contingency table is not meaningful for continuous predictions;
# compare the predicted and actual values instead)
print(data.frame(actual = y, predicted = result))
# Visualize the regression graphically
plot(x, y, col = "red", pch = 16)
abline(relation)
Example 2:
----------
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
summary(relation)
# Predict the weight of a new person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
# Predict the weights of the existing persons
newdata <- data.frame(x = x)
result <- predict(relation, newdata)
print(result)
# Evaluate the performance of the model
# (compare the predicted and the actual weights)
print(data.frame(actual = y, predicted = round(result, 1)))
# Visualize the Regression Graphically
# Give the chart file a name.
png(file = "linearregression.png")
# Plot the chart (weight on the x-axis, height on the y-axis, as in the labels).
plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
abline(lm(x ~ y))
# Save the file.
dev.off()
Multiple Regression:
====================
--> Multiple regression is an extension of linear regression into relationship between more than two variables.
--> In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
--> The general mathematical equation for multiple regression is:
y = a + b1x1 + b2x2 +...bnxn
y is the response variable.
a, b1, b2...bn are the coefficients.
x1, x2, ...xn are the predictor variables.
--> We create the regression model using the lm() function in R.
--> The model determines the value of the coefficients using the input data.
--> Next we can predict the value of the response variable for a given set of predictor variables using these coefficients (see the predict() sketch after Example 1 below).
Syntax:
lm(y ~ x1+x2+x3...,data)
Example 1:
----------
input <- mtcars[,c("mpg","disp","hp","wt")]
head(input)
# Create the relationship model.
model <- lm(mpg~disp+hp+wt, data = input)
# Show the model.
print(model)
# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
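# As noted above, the fitted coefficients can be used to predict mpg for new
# predictor values; predict() is the simpler route. A sketch - the disp/hp/wt
# numbers below are made up for illustration:
newcar <- data.frame(disp = 200, hp = 100, wt = 2.8)
predict(model, newdata = newcar)
# the same value by hand, from the coefficients extracted above:
a + Xdisp*200 + Xhp*100 + Xwt*2.8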
Example 2:
----------
--> Australian CPI (Consumer Price Index) data, which are quarterly CPIs from 2008 to 2010.
--> In this example, an x-axis is added manually with function axis(), where las=3 makes text vertical.
year <- rep(2008:2010, each=4)
quarter <- rep(1:4, 3)
cpi <- c(162.2, 164.6, 166.5, 166.0, 166.2, 167.0, 168.6, 169.5, 171.0, 172.1, 173.3, 174.0)
plot(cpi, xaxt="n", ylab="CPI", xlab="")
# draw x-axis
axis(1, labels=paste(year,quarter,sep="Q"), at=1:12, las=3)
# Check the correlation between CPI and the other variables, year and quarter.
cor(year,cpi)
cor(quarter,cpi)
# Build a linear regression model with lm(), using year and quarter as predictors and CPI as the response.
fit <- lm(cpi ~ year + quarter)
print(fit)
# With the above linear model, CPI is calculated as
#   cpi = c0 + c1*year + c2*quarter
# where c0, c1 and c2 are the coefficients from the fitted model. Therefore, the CPIs in 2011 can be computed as follows.
# An easier way is function predict(), which will be demonstrated at the end of this subsection.
cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]]*2011 + fit$coefficients[[3]]*(1:4)
attributes(fit)
# differences between observed values and fitted values
residuals(fit)
summary(fit)
# Plot the fitted model
plot(fit)
# We can also plot the model in a 3D plot as below, where function scatterplot3d() creates a 3D scatter plot and plane3d() draws the fitted plane.
# Parameter lab specifies the number of tickmarks on the x- and y-axes.
library(scatterplot3d)
s3d <- scatterplot3d(year, quarter, cpi, highlight.3d=T, type="h", lab=c(2,3))
s3d$plane3d(fit)
# With the model, the CPIs in year 2011 can be predicted as follows, and the predicted values are shown as red triangles
data2011 <- data.frame(year=2011, quarter=1:4)
cpi2011 <- predict(fit, newdata=data2011)
style <- c(rep(1,12), rep(2,4))
plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", xlab="", pch=style, col=style)
axis(1, at=1:16, las=3, labels=c(paste(year,quarter,sep="Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4"))
Generalized Linear Regression:
==============================
--> The generalized linear model (GLM) generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be a function of its predicted value.
--> It unifies various other statistical models, including linear regression, logistic regression and Poisson regression.
--> Function glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
Example:
--------
# A generalized linear model is built below with glm() on the bodyfat data
data("bodyfat", package="TH.data")
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat.glm <- glm(myFormula, family = gaussian("log"), data = bodyfat)
summary(bodyfat.glm)
pred <- predict(bodyfat.glm, type="response")
# type indicates the type of prediction required. The default is on the scale of
# the linear predictors, and the alternative "response" is on the scale of the response variable.
plot(bodyfat$DEXfat, pred, xlab="Observed Values", ylab="Predicted Values")
abline(a=0, b=1)
# if family=gaussian("identity") is used, the built model would be similar
# to linear regression. One can also make it a logistic regression by setting family to binomial("logit").
Non-linear Regression:
======================
--> While linear regression is to find the line that comes closest to data, non-linear regression is to fit a curve through data.
--> Function nls() provides nonlinear regression. Examples of nls() can be found by running ?nls in R (a small sketch follows).
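# A minimal nls() sketch with made-up exponential data (not from the course
# material): fit y = a * exp(b * x) by nonlinear least squares.
x <- seq(1, 10, by = 0.5)
set.seed(3)
y <- 2 * exp(0.3 * x) + rnorm(length(x))
fit_nls <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
summary(fit_nls)
plot(x, y, pch = 16)
lines(x, predict(fit_nls), col = "red")  # fitted curve through the data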
Logistic Regression:
====================
--> Logistic regression is a classification model in which the response variable is categorical.
--> It is an algorithm that comes from statistics and is used for supervised classification problems.
--> The Logistic Regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1.
--> It actually measures the probability of a binary response as the value of response variable based on the mathematical equation relating it with the predictor variables.
--> The general mathematical equation for logistic regression is:
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
y is the response variable.
x1, x2, ... are the predictor variables.
a and b1, b2, ... are the coefficients, which are numeric constants.
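--> The right-hand side is the logistic (sigmoid) function applied to a linear combination of the predictors; a one-line sketch of its shape:
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-5, 0, 5))  # ~0.007, 0.500, ~0.993 -> the output always lies between 0 and 1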
glm():
------
--> glm() function is used to create the regression model and get its summary for analysis.
Syntax:
glm(formula,data,family)
Arguments:
--> formula is the symbol presenting the relationship between the variables.
--> data is the data set giving the values of these variables.
--> family is an R object to specify the details of the model. Its value is binomial for logistic regression.
Example 1:
----------
library(ElemStatLearn)
head(spam)
# Split dataset in training and testing
inx = sample(nrow(spam), round(nrow(spam) * 0.8))
train = spam[inx,]
test = spam[-inx,]
# Fit regression model
fit = glm(spam ~ ., data = train, family = binomial())
summary(fit)
# Make predictions
preds = predict(fit, test, type = "response")
preds = ifelse(preds > 0.5, 1, 0)
tbl = table(target = test$spam, preds)
tbl
sum(diag(tbl)) / sum(tbl)
Example 2:
----------
--> The in-built data set "mtcars" describes different models of a car with their various engine specifications.
--> In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which is a binary value (0 or 1).
--> We can create a logistic regression model between the columns "am" and 3 other columns - hp, wt and cyl.
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
head(input)
# Create Regression Model
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
summary(am.data)
Conclusion:
-----------
--> In the summary as the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", we consider them to be insignificant in contributing to the value of the variable "am".
--> Only weight (wt) impacts the "am" value in this regression model.
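--> The p-values referred to above sit in the last column of the coefficient table and can be extracted directly, e.g.:
summary(am.data)$coefficients[, 4]  # Pr(>|z|) for the intercept, cyl, hp and wt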
Generalized Linear Regression, Non-Linear Regression
Clustering:
===========
--> This is a type of problem where we group similar things together.
--> Somewhat similar to multi-class classification, but here we don't provide the labels; the system learns from the data itself and clusters the data.
Some examples are :
--> given news articles, cluster into different types of news
--> given a set of tweets, cluster based on content of tweet
--> given a set of images, cluster them into different objects
K Means Clustering:
===================
--> K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity.
--> Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data.
--> In k means clustering, we have to specify the number of clusters we want the data to be grouped into.
--> The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster.
--> Then, the algorithm iterates through two steps:
1. Reassign data points to the cluster whose centroid is closest.
2. Calculate new centroid of each cluster.
--> These two steps are repeated till the within cluster variation cannot be reduced any further.
--> The within cluster variation is calculated as the sum of the euclidean distance between the data points and their respective cluster centroids.
Example 1:
----------
# k-means clustering of iris data.
# At first, we remove species from the data to cluster.
# After that, we apply function kmeans() to iris2, and store the clustering result in kmeans.result.
# The cluster number is set to 3 in the code below.
iris2 <- iris
iris2$Species <- NULL
kmeans.result <- kmeans(iris2, 3)
# The clustering result is then compared with the class label (Species) to check whether similar objects are grouped together.
table(iris$Species, kmeans.result$cluster)
# The above result shows that cluster "setosa" can be easily separated from the other clusters, and that clusters "versicolor" and "virginica" are to a small degree overlapped with each other.
# Next, the clusters and their centers are plotted. Note that there are four dimensions in the data and that only the first two dimensions are used to draw the plot below.
# Some black points close to the green center (asterisk) are actually closer to the black center in the four dimensional space. We also need to be aware that the results of k-means clustering may vary from run to run, due to random selection of initial cluster centers.
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
# plot cluster centers
points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex=2)
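# The within-cluster variation mentioned above is reported by kmeans() itself:
kmeans.result$withinss      # within-cluster sum of squares, one value per cluster
kmeans.result$tot.withinss  # their total, which the algorithm tries to minimize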
Example 2:
----------
# Exploring the data:
# The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:
library(datasets)
head(iris)
# After a little bit of exploration, I found that Petal.Length and Petal.Width were similar among the same species but varied considerably between different species, as demonstrated below:
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
# Clustering:
# Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
# Let us compare the clusters with the species.
table(irisCluster$cluster, iris$Species)
# As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor into cluster 2, and virginica into cluster 1. The algorithm wrongly classified two data points belonging to versicolor and six data points belonging to virginica.
# We can also plot the data to see the clusters:
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
Association Rules (Market Basket Analysis):
===========================================
--> Association rules are rules presenting association or correlation between itemsets.
--> An association rule is in the form of A => B, where A and B are two disjoint itemsets, referred to respectively as the LHS (left-hand side) and RHS (right-hand side) of the rule.
--> The three most widely-used measures for selecting interesting rules are support, confidence and lift.
--> Support is the percentage of cases in the data that contains both A and B.
--> Confidence is the percentage of cases containing A that also contain B.
--> Lift is the ratio of confidence to the percentage of cases containing B.
Let us consider the rule A => B in order to compute these metrics:
Support            = (number of transactions with both A and B) / (total number of transactions)  = P(A n B)
Confidence         = (number of transactions with both A and B) / (number of transactions with A) = P(A n B) / P(A)
ExpectedConfidence = (number of transactions with B) / (total number of transactions)             = P(B)
Lift               = Confidence / ExpectedConfidence                                              = P(A n B) / (P(A) * P(B))
--> A classic algorithm for association rule mining is APRIORI.
--> It is a level-wise, breadth-first algorithm which counts transactions to find frequent itemsets and then derive association rules from them.
--> An implementation of it is function apriori() in package arules.
--> Another algorithm for association rule mining is the ECLAT algorithm, which finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting.
--> It is implemented as function eclat() in the same package.
--> With the apriori() function, the default settings are:
1) supp=0.1, which is the minimum support of rules;
2) conf=0.8, which is the minimum confidence of rules; and
3) maxlen=10, which is the maximum length of rules.
Example 1 on Titanic dataset:
=============================
--> The Titanic dataset in the datasets package is a 4-dimensional table with summarized information on the fate of passengers on the Titanic according to social class, sex, age and survival.
--> To make it suitable for association rule mining, we reconstruct the raw data as titanic.raw, where each row represents a person.
str(Titanic)
df <- as.data.frame(Titanic)
head(df)
titanic.raw <- NULL
for(i in 1:4) {
titanic.raw <- cbind(titanic.raw, rep(as.character(df[,i]), df$Freq))
}
titanic.raw <- as.data.frame(titanic.raw)
names(titanic.raw) <- names(df)[1:4]
dim(titanic.raw)
str(titanic.raw)
head(titanic.raw)
summary(titanic.raw)
library(arules)
# find association rules with default settings
rules.all <- apriori(titanic.raw)
quality(rules.all) <- round(quality(rules.all), digits=3)
rules.all
inspect(rules.all)
# use code below if above code does not work
arules::inspect(rules.all)
# rules with rhs containing "Survived" only
rules <- apriori(titanic.raw, control = list(verbose=F),
parameter = list(minlen=2, supp=0.005, conf=0.8),
appearance = list(rhs=c("Survived=No", "Survived=Yes"),
default="lhs"))
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
inspect(rules.sorted)
Removing Redundancy:
--------------------
# find redundant rules
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
which(redundant)
# remove redundant rules
rules.pruned <- rules.sorted[!redundant]
inspect(rules.pruned)
Interpreting Rules:
-------------------
rules <- apriori(titanic.raw,
parameter = list(minlen=3, supp=0.002, conf=0.2),
appearance = list(rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd", "Age=Child", "Age=Adult"),
default="none"),
control = list(verbose=F))
rules.sorted <- sort(rules, by="confidence")
inspect(rules.sorted)
Visualizing Association Rules:
------------------------------
--> Next we show some ways to visualize association rules, including scatter plot, balloon plot, graph and parallel coordinates plot.
--> More examples on visualizing association rules can be found in the vignettes of package "arulesViz".
install.packages("arulesViz")
library(arulesViz)
plot(rules.all)
plot(rules.all, method="grouped")
plot(rules.all, method="graph")
plot(rules.all, method="graph", control=list(type="items"))
plot(rules.all, method="paracoord", control=list(reorder=TRUE))
Association Rules (Market Basket Analysis)
