The field of statistics requires careful standards and practices when collecting, arranging, and classifying pieces of information (singular: “datum”; plural: “data”). These standards help ensure reproducible experiments and analyses that can draw a conclusion, answer a question, or make a valid assertion about that information. Statistics uses charts and graphs to communicate that information and display outcomes in a way that is easy for a non-scientist to understand.
Statistics is very important to many fields, from the social sciences to finance to computer programming and artificial intelligence. Statistics is the “language of data” - understanding how to properly quantify and express information, and visualize the resulting statistics clearly, can help you build compelling arguments and persuade a reader anywhere from the classroom to the boardroom.
This guide compiles many frequently used formulas, visualizations, and theories from this field of study into the R programming language and presents these concepts in a written, textbook-style guide. Each chapter covers a different general subject area and contains a variety of formulas that can be called as functions. These functions may be called in series, or nested into other formulas, to generate useful calculations and visualizations.
We will also explore some real-world applications of these concepts, such as:
…
R and R Markdown provide a powerful platform for general statistical programming. Visualization capabilities in R are enhanced through the “Grammar of Graphics” as implemented by the ggplot2 and ggforce packages. Data manipulation functions like select, mutate, and summarize are added by the dplyr package, enabling more convenient and readable R language constructs using the pipe operator %>%. This operator can help you write intuitive syntax in R that resembles structured query languages like SQL.
This document will not provide basic instruction in R. It is assumed that the reader has some familiarity with the R language, its syntax patterns, installation, and configuration. If this is your first introduction to R, please check out the appendix to learn more about the basics of the language and installing the RStudio programming environment. The Rmd files this document was written in may be downloaded from GitHub, enabling the reader to change any portion of this document and customize the word problems to cover different applied subjects.
This guide is intended for finance professionals, statisticians, data scientists, math teachers, students*, or anyone else who needs to generate, visualize, apply, or solve statistics problems.
This R Markdown file uses several packages, as defined below. This code installs each package only if it is not already installed. These packages are then loaded into the local scope via the library functions.
if(!require(pracma)) install.packages("pracma", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(ggforce)) install.packages("ggforce", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(gtools)) install.packages("gtools", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(matlib)) install.packages("matlib", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(MASS)) install.packages("MASS", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(kableExtra)) install.packages("kableExtra", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(RcppAlgos)) install.packages("RcppAlgos", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(latex2exp)) install.packages("latex2exp", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(xfun)) install.packages("xfun", repos = "http://cran.us.r-project.org", dependencies = TRUE)
if(!require(readxl)) install.packages("readxl", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(readxl)
library(pracma)
library(dplyr)
library(ggplot2)
library(ggforce)
library(gtools)
library(matlib)
library(MASS)
library(kableExtra)
library(RcppAlgos)
library(latex2exp)
library(xfun)
options(scipen=999)
This section provides a basic primer on data. Data is the plural form of datum, which means a single observation; so data is, by definition, more than one observation. Observations or measurements may be qualitative - measuring a trait or attribute - or quantitative - measuring a quantity or value, usually in a concrete numerical or categorical way.
We will explore how to define variables, which can help with important choices about how to analyze and visualize the data being collected. All variables in a strongly typed or explicit environment must have a defined data type - which could be abstract, like a generic “object”, or specific, like “integer”, “decimal”, or “character”. Every programming language has different rules for how variables are defined and whether they are strongly typed. R does not require that you declare a variable's type before assigning its value; R is dynamically typed, so a variable takes on the type that befits the first value assigned to it. A variable may be a numeric value in a given unit (like centimeters of distance, liters of volume, or kilometers per hour of speed), or it may be a categorical value of a given type.
#create some variables of different types
var_num <- 5.01
var_chr <- "hello"
var_vector <- c(1,2,3,4,5)
var_df <- data.frame(greeting="salutations")
#check the types
typeof(var_num)
## [1] "double"
typeof(var_chr)
## [1] "character"
typeof(var_vector)
## [1] "double"
typeof(var_df)
## [1] "list"
As shown above, variables defined in R may take on various types.
The unit of a variable is the unit of measurement you are using. When recording observations, try to avoid mixing units, and always label values clearly as to which unit is being used. In this companion, we will generally add the unit to the variable name as a suffix - so if the unit is centimeters, we will add _cm to the variable name. Below are some unit abbreviations you may see in this companion.
Local Distance
- _km = kilometers
- _m = meters
- _cm = centimeters
- _mm = millimeters
- _um = micrometers (microns)
- _nm = nanometers

Astronomical Distance
- _pc = parsecs
- _ly = light years
- _au = astronomical units

Mass
- _t = tons
- _kg = kilograms
- _g = grams
- _mg = milligrams

Speed
- _kph = kilometers per hour
- _mps = meters per second

Time
- _hr = hours
- _min = minutes
- _sec = seconds
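As a quick illustration of this convention, here is a hypothetical measurement converted between units, with the unit suffix carried in each variable name:
#a hypothetical wing measurement, with the unit kept in each variable name
wing_cm <- 21.5
wing_mm <- wing_cm * 10    #1 cm = 10 mm
wing_m <- wing_cm / 100    #100 cm = 1 m
paste("wing span:", wing_cm, "cm =", wing_mm, "mm =", wing_m, "m")
## [1] "wing span: 21.5 cm = 215 mm = 0.215 m"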
The precision of a variable is a measure of its rounding: the number of digits to the right of the decimal point in a numeric value. It matters because, when taking observations of a population, the experiment should use the same amount of precision, and the same unit of measure, for all measurements.
It’s important to understand the difference between formatting a number through rounding (in which the number of digits to the right of the decimal point is fixed) and formatting using significant digits (in which the final number of digits displayed depends on the position of the zeros). In the sections below we will explore how values are rounded or formatted to show varying amounts of precision. Remember, when you “round off” you are “losing precision” - so rounding is not always a good idea, and should usually happen AFTER all other operations like multiplication, division, and averaging have occurred.
var_pi <- 3.1415926
stmt <- paste("Pi rounded to 2 places is:", round(pi,2))
stmt
## [1] "Pi rounded to 2 places is: 3.14"
pc <- 6.62607015 * 10 ^ -34
stmt <- paste("Plancks constant rounded to 2 places is:", round(pc,2))
stmt
## [1] "Plancks constant rounded to 2 places is: 0"
stmt <- paste("Plancks constant rounded to 36 places is:", round(pc,36))
stmt
## [1] "Plancks constant rounded to 36 places is: 0.000000000000000000000000000000000663"
Significant digits are the digits that carry meaning in a value - essentially any non-zero digit, plus any zeros between them. Significant digits are counted from left to right, so two values with the same number of significant digits can have very different magnitudes. You can limit the significant digits of a variable using format.
var_bignum <- 9876543210
say_bignum <- numbers_to_words(var_bignum)
stmt <- paste(say_bignum," formatted to 3 significant digits is:", format(var_bignum,digits=3))
stmt
## [1] "nine billion, eight hundred seventy-six million, five hundred forty-three thousand, two hundred ten formatted to 3 significant digits is: 9876543210"
var_mpp <- 1.917*10^13
say_mpp <- numbers_to_words(var_mpp)
stmt <- paste(say_mpp," formatted to 3 significant digits is:", format(var_mpp,digits=3))
stmt
## [1] "nineteen trillion, one hundred seventy billion formatted to 3 significant digits is: 19170000000000"
var_pi <- 3.1415926
stmt <- paste("Pi formatted to 3 significant digits is:", format(pi,digits=3))
stmt
## [1] "Pi formatted to 3 significant digits is: 3.14"
pc <- 6.62607015 * 10 ^ -34
stmt <- paste("Plancks constant formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "Plancks constant formatted to 3 significant digits is: 0.000000000000000000000000000000000663"
pc <- 10000.999
stmt <- paste("10000.999 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10000.999 formatted to 3 significant digits is: 10001"
pc <- 10000.99
stmt <- paste("10000.99 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10000.99 formatted to 3 significant digits is: 10001"
pc <- 10000
stmt <- paste("10000 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10000 formatted to 3 significant digits is: 10000"
pc <- 1000
stmt <- paste("1000 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "1000 formatted to 3 significant digits is: 1000"
pc <- 100
stmt <- paste("100 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "100 formatted to 3 significant digits is: 100"
pc <- 10
stmt <- paste("10 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10 formatted to 3 significant digits is: 10"
pc <- 10.1
stmt <- paste("10.1 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10.1 formatted to 3 significant digits is: 10.1"
pc <- 10.12
stmt <- paste("10.12 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "10.12 formatted to 3 significant digits is: 10.1"
pc <- 1
stmt <- paste("1 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "1 formatted to 3 significant digits is: 1"
pc <- 1.2
stmt <- paste("1.2 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "1.2 formatted to 3 significant digits is: 1.2"
pc <- 1.23
stmt <- paste("1.23 formatted to 3 significant digits is:", format(pc,digits=3))
stmt
## [1] "1.23 formatted to 3 significant digits is: 1.23"
A common problem when dealing with data is rounding to an arbitrary amount (such as the nearest 5 or 10) rather than to a decimal place. In the function below, we divide by the amount we want to round to, round that quotient to 0 places of precision, then multiply by the same amount to get the “nearest number” that is evenly divisible by it.
#define a list of misc. numbers
num_list <- c(11,23,43,24,32,56,55,76,60,87,98,92)
#round_to will round val to the nearest amount specified
round_to <- function(val,amount){
rounded_amount <- round(val / amount,0) * amount
rounded_amount
}
#apply to the list using 10 as the factor to round to
sapply(num_list,round_to,10)
## [1] 10 20 40 20 30 60 60 80 60 90 100 90
#apply to the list using 5 as the factor to round to
sapply(num_list,round_to,5)
## [1] 10 25 45 25 30 55 55 75 60 85 100 90
A group of observations about the same subject is called a population, and in many cases we will examine some elements of a population, but not all, which is referred to as a sample. A sample can give you insight into characteristics of the population, even if it does not contain the entire population.
A sample is fundamental to statistics. If we are looking to determine water quality in a well, we take a sample (say, 10 milliliters) and determine how much arsenic and lead is in that volume. Then we can multiply that by the total volume of water we drink each day to determine approximately how much arsenic or lead we are ingesting. It’s a lot easier to take a sample than to measure all the water in the well.
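For instance, here is a minimal sketch of that extrapolation, using made-up concentration and intake figures:
#hypothetical: 0.0004 mg of lead measured in a 10 ml water sample
lead_mg_per_ml <- 0.0004 / 10
#assume roughly 2 liters (2000 ml) of well water consumed per day
daily_intake_ml <- 2000
lead_mg_per_ml * daily_intake_ml
## [1] 0.08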
Sometimes, it’s impossible to measure something without taking a sample. For example, we can’t count the exact number of stars in the galaxy - we can only estimate how many there are based on how many we can see in a given area of the night sky. Samples allow us to extrapolate conclusions based on the limited information we have available.
Thankfully, the information we extrapolate can be remarkably accurate. This is due to concepts we will explore later, including standard deviation, the central limit theorem, the law of large numbers, and other important principles of statistical analysis.
# Set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced.
set.seed(17)
# Create a matrix object
nCols = 5
nRows = 3
population <- matrix(runif(nCols * nRows), ncol = nCols )
# Print the Population
print.listof(list(population))
## Component 1 :
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1550508 0.7768197 0.2068877 0.193943927 0.8346921
## [2,] 0.9683788 0.4078857 0.1871036 0.434231178 0.8300830
## [3,] 0.4682631 0.5387971 0.7799697 0.002274518 0.9569678
The most obvious example of a discrete variable is an integer or whole number.
So what makes a variable “discrete”? A measurement is discrete when it belongs to a well-defined group.
So by this definition, a discrete variable is one that “stands alone”, while a continuous variable, by contrast, is one that “bleeds together” with its neighbors.
As an example, an integer must be either 1 or 2 or 3 …, not 1.5 or 1.78 or 3.1415926…
But this is where it gets tricky. Just because a number has a decimal does not necessarily mean it’s continuous. You could have a set of discrete variables that also contain decimals.
Let’s consider movie ratings. Suppose you have a web site where you can rate a movie:
1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 stars
These discrete variables contain decimals - but they are still discrete, because your entire population of answers will always fall into one of these 9 values. Even if you have a million responses, you only have 9 possible ratings.
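A quick sanity check of this idea in R - every response, however many there are, must be one of the 9 allowed values (the responses below are hypothetical):
#the 9 possible star ratings
allowed_ratings <- seq(1, 5, by = 0.5)
#some hypothetical responses
responses <- c(3, 4.5, 2, 5, 3.5)
all(responses %in% allowed_ratings)
## [1] TRUE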
Determining if a variable is continuous or discrete may not always be easy - and it often depends on the natural process that is generating that variable - and also whether that measurement is subject to any rounding.
Let’s consider a few more examples:
Measuring the size of a bird’s wing precisely will produce a “continuous” variable - it could be 13.35 cm, or 14.42 cm, or 15.18 cm, etc.
So a continuous variable like this “could be anything” within a reasonable expected range. A bird wing is not going to be 60 feet long, or negative 3 inches - that would be ridiculous. But it could be anything from, let’s say, an inch for a tiny hummingbird to over 6 feet for a condor. The range of a population of data is equal to the largest (maximum) value minus the smallest (minimum) value.
So in this sense, a continuous variable could be anything within an expected range, with some expected lower and upper boundaries or “bounds”. When describing variables, it’s also important to consider the precision - formally, the number of decimal places you are measuring to. So a bird wing measured to a tenth of a cm has 1 place of precision; measuring to a hundredth has 2 places, a thousandth has 3, etc.
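To make that concrete, here is one hypothetical wing measurement recorded at 1, 2, and 3 places of precision:
#one hypothetical measurement at increasing levels of precision
wing_cm <- 13.3472
round(wing_cm, 1)
## [1] 13.3
round(wing_cm, 2)
## [1] 13.35
round(wing_cm, 3)
## [1] 13.347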
When making observations, it’s fairly important that all observations are measured to the same minimum level of precision - and that should be clear to whoever is making the measurement. You would not want one of your observers making very precise measurements and another rounding to the nearest whole number. This would create inconsistent data for your population, which could skew or bias the results.
A continuous variable could be a weight, a height, a length, a time span - just about anything that can be measured with a high level of precision and variability.
On the other hand, a discrete variable will usually represent a distinct, fixed quantity - a count, a category, a letter of the alphabet, etc. So discrete variables don’t blend into one another like continuous variables do.
But here’s another complication - if you were to round off a set of continuous measurements to the nearest whole number, that population could then also be considered discrete. So if we decide when we measure birds’ wings that we don’t need to be that precise, and round off that variable to the nearest cm, then your population is always going to be 13, or 14, or 15 cm, etc.
So in that sense, data can be both discrete and continuous - even the same data - depending on how it’s collected and the level of precision you are using.
So - how can we scientifically determine this important element? In short, you can’t. It’s always a question that is answered by the data itself. However, in creating a data collection system, you can certainly add filters to the data to ensure it is discrete (to a certain level of precision), or you can allow the data to be continuous. The function below explores how we could enforce one of these types.
write_vector <- function(v){
s <- paste("{",paste(v, collapse = ","),"}")
s
}
collate_list <- function(data_point,data_type){
#we are taking the data point passed in and applying rounding to ensure it falls within an expected discrete population
new_point <- ifelse(data_type=='discrete',round(data_point),data_point)
new_point
}
data_collection <- c(13.35,14.42,15.18)
discrete_measurements <- sapply(data_collection,collate_list,'discrete')
discrete_measurements
## [1] 13 14 15
continuous_measurements <- sapply(data_collection,collate_list,'continuous')
continuous_measurements
## [1] 13.35 14.42 15.18
#write a sentence to summarize the result
result <- paste("discrete_measurements = ",write_vector(discrete_measurements),"\r\n continuous_measurements = ",write_vector(continuous_measurements),sep="")
#result
| Result |
|---|
| discrete_measurements = { 13,14,15 } |
| continuous_measurements = { 13.35,14.42,15.18 } |
It’s useful to know whether a variable is discrete or continuous, as that can help us decide what to do with the data and how best to display it.
But it is also helpful to consider whether a discrete variable is Categorical, Ordinal, or Cardinal.
An Ordinal variable is a variable in a population that has a natural “order” or position.
Numbers are ordinal, since you expect to see them in some kind of order - 1, then 2, then 3, then 4, etc.
Letters like A, B, C, D are an example of ordinal, discrete variables.
Sizes like Small, Medium, Large (or Tall, Grande, Venti - if you like to go your own way) are another example of ordinal variables.
Ordinal, discrete variables can be displayed in Ascending (going from small to large, or start to end) or Descending (large to small, or end to start) order.
An offshoot of this idea is that of Cardinal variables, which describe a quantity as opposed to a position. So 1 horse, 2 horses, 3 horses are Cardinal, whereas saying that this is the First Horse, Second Horse, and Third Horse is Ordinal.
In addition, you might say that discrete variables are Interval variables if the distance between them is fixed - like 100 pounds, 200 pounds, 300 pounds. These could be considered Discrete, Cardinal, and Interval.
So what about “Categorical” variables? These are variables that are discrete, but generally not subject to an “expected” order.
An example of a categorical variable could be families of insects: Ants, Moths, Bees, Caterpillars, Grasshoppers, Beetles, etc. You don’t necessarily expect to see Ants before Bees in a list, or Moths before Grasshoppers.
You might expect to see these categories ordered by some other “ordinal” measure, like average weight or wingspan; or you might expect to see them in an alphabetized list. But in that case, you are actually ordering by an ordinal variable that is derived from the categorical variable (the first letter of the category name). So again, a category is not generally ordinal - it has no “intrinsic” order.
However, once again there’s some gray area here. Let’s take colors - one might say that red, blue, and yellow are categorical and not ordinal, since they exist independently of one another. If someone asked you your favorite colors, in order, you could say Blue, then Red, then Yellow. But in another sense they ARE ordinal, because the colors of the spectrum are determined by the electromagnetic wavelengths of visible light - and we can recall our dear friend Roy G. Biv to remember the “expected order” of colors (Red, Orange, Yellow, Green, Blue, Indigo, Violet). But once again, the category itself is not ordinal - rather, a property of the color (its wavelength) is ordinal.
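R can model exactly this distinction with factors: a plain factor has no intrinsic order, while an ordered factor encodes an “expected order” through its levels. Here is a small sketch, assuming the Roy G. Biv ordering:
#an ordered factor encodes the expected order of the spectrum
spectrum <- c('red','orange','yellow','green','blue','indigo','violet')
favorite_colors <- factor(c('blue','red','yellow'), levels = spectrum, ordered = TRUE)
#sort now follows the spectrum order, not the alphabet
sort(favorite_colors)
## [1] red    yellow blue
## Levels: red < orange < yellow < green < blue < indigo < violet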
So here we can see that determining whether a variable is ordinal or categorical can be complicated - it depends on what the variable describes, how that relates to the natural systems around us, and how we as humans expect things to be positioned in a list.
In this function we will return a statement that guesses whether a list is ordinal, cardinal, or categorical, based on the type of data in the list.
categorize_list <- function(set,measure){
  #measures that imply a quantity vs. a position
  quantity_measures <- c('quantity','amount','value','width','height','length','weight','mass','volume','wavelength','amplitude')
  position_measures <- c('position','rank','placement','order')
  if(is.numeric(set)){
    y <- 0
    interval_set <- c()
    precision_set <- c()
    for(x in sort(set)){
      #track whether each value is a whole number
      is_integer <- x == round(x)
      precision_set <- c(precision_set,is_integer)
      if(y != 0){
        since_last <- x - y
        interval_set <- c(interval_set,since_last)
      }
      y <- x
    }
    datarelate <- 'uneven'
    if(length(unique(interval_set)) == 1){
      datarelate <- 'even'
    }
    precision <- 'numeric values'
    continuity <- 'continuous'
    #all() collapses the vector of logicals into a single condition
    if(all(precision_set)){
      precision <- 'integer values'
      continuity <- 'discrete'
    }
    if(measure %in% quantity_measures){
      msg <- paste('This set contains ',continuity,' ',precision,', and describes quantity. It has ',datarelate,' intervals. It\'s probably cardinal data.',sep='')
    }else if(measure %in% position_measures){
      msg <- paste('This set contains ',continuity,' ',precision,', and describes position. It has ',datarelate,' intervals. It\'s probably ordinal data.',sep='')
    }else{
      msg <- paste('This set contains ',continuity,' ',precision,', and describes ',measure,'. It has ',datarelate,' intervals. It\'s either ordinal or cardinal data.',sep='')
    }
  }else{
    msg <- paste('This set contains some non-numeric values describing ',measure,'. It\'s probably categorical data.',sep='')
  }
  msg
}
collected_data1 <- c(1,7,14,23,2,4,51,5)
collected_data2 <- c(1,2,3,4,5,6)
collected_data3 <- c(1.1,1.7,2,2.234,3,3.14159,5,5.1)
collected_data4 <- c('bird','fish','cat','dog')
collected_data5 <- c('red','orange','yellow','green','blue','indigo','violet')
collected_data1
## [1] 1 7 14 23 2 4 51 5
categorize_list(collected_data1,'quantity')
## [1] "This set contains discrete integer values, and describes quantity. It has uneven intervals. It's probably cardinal data."
collected_data2
## [1] 1 2 3 4 5 6
categorize_list(collected_data2,'rank')
## [1] "This set contains discrete integer values, and describes position. It has even intervals. It's probably ordinal data."
collected_data3
## [1] 1.10000 1.70000 2.00000 2.23400 3.00000 3.14159 5.00000 5.10000
categorize_list(collected_data3,'length')
## [1] "This set contains continuous numeric values, and describes quantity. It has uneven intervals. It's probably cardinal data."
collected_data4
## [1] "bird" "fish" "cat" "dog"
categorize_list(collected_data4,'animals')
## [1] "This set contains some non-numeric values describing animals. It's probably categorical data."
collected_data5
## [1] "red" "orange" "yellow" "green" "blue" "indigo" "violet"
categorize_list(collected_data5,'colors')
## [1] "This set contains some non-numeric values describing colors. It's probably categorical data."
This function will provide a rough guess as to how each of these lists would probably be categorized.
We know that collected_data1 contains discrete integer values with uneven intervals - let’s take a look at what that set actually contains again.
Now, to get a sense of what’s in the vector, we will sort the list so we can see all the values in order, since the data appears to be ordinal.
When dealing with ordinal data, there are some key commands in base R that are helpful. One of these is the “sort” function, which orders the data in ascending order by default.
Alternatively, you can “negate” the data AND the sorted result to get a descending-order version of the set.
sorted_quantities_asc <- sort(collected_data1)
sorted_quantities_asc
## [1] 1 2 4 5 7 14 23 51
sorted_quantities_desc <- -sort(-collected_data1)
sorted_quantities_desc
## [1] 51 23 14 7 5 4 2 1
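Base R also supports descending order directly through the decreasing argument of sort, which reads a little more plainly than the double negation:
#the same descending sort, using the decreasing argument
sort(collected_data1, decreasing = TRUE)
## [1] 51 23 14 7 5 4 2 1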
Now we will define a data frame. The data frame is R’s version of a database table. To define a data frame, you set each column name and pass in a vector of values to build the rows. A data frame can have a single column or many columns. We will start with a one-column data frame.
#create a data frame with one column
df_quantities <- data.frame(quantity=collected_data1)
df_quantities
Data frames are convenient for working with the ggplot2 package; they allow you to name your vectors and combine them into tables, then clearly define the x/y axes and aesthetics of your ggplot using easy-to-interpret language constructs.
#plot
simple_plot <- ggplot(data=df_quantities, aes(x=rank(-quantity),y=quantity)) +
geom_col()
simple_plot
Here we can visualize the values in the set, using the quantity as the vertical y axis and the rank of the quantity as the order along the horizontal x axis. Rank is the ordinal position a value would occupy if the list were sorted from least to greatest, or A to Z. The rank is negated with a - to place the highest values first, resulting in the chart above.
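The rank function is worth a quick look on its own - it returns the ordinal position each value would occupy in the sorted vector, and negating the input reverses those positions:
#rank returns each value's ordinal position in sorted order
rank(c(30, 10, 20))
## [1] 3 1 2
#negating the input reverses the ranking, placing the largest value first
rank(-c(30, 10, 20))
## [1] 1 3 2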
Now, let’s consider a less abstract scenario: we are asked to compile some statistics about bird specimens in a natural science collection. For each bird, we must record 5 variables - its length in cm, wingspan in cm, wing color, weight in grams, and species common name.
For each type of measurement below, we can run the categorize_list function to confirm the characteristics of that vector of observations. These vectors will then be combined into a data frame.
length_cm <- c(25,20,12,8,30,12,21)
wingspan_cm <- c(38,28,20,4,39,22,36)
weight_g <- c(77,32.2,15,7.5,76.5,16,68)
species_name <- c('American Robin','Baltimore Oriole','House finch','Hummingbird','Scrub jay','Indigo bunting','Starling')
species_color <- c('red','orange','yellow','green','blue','indigo','violet')
length_cm
## [1] 25 20 12 8 30 12 21
categorize_list(length_cm,'length')
## [1] "This set contains discrete integer values, and describes quantity. It has uneven intervals. It's probably cardinal data."
wingspan_cm
## [1] 38 28 20 4 39 22 36
categorize_list(wingspan_cm,'wingspan')
## [1] "This set contains discrete integer values, and describes wingspan. It has uneven intervals. It's either ordinal or cardinal data."
weight_g
## [1] 77.0 32.2 15.0 7.5 76.5 16.0 68.0
categorize_list(weight_g,'weight')
## [1] "This set contains continuous numeric values, and describes quantity. It has uneven intervals. It's probably cardinal data."
species_name
## [1] "American Robin" "Baltimore Oriole" "House finch" "Hummingbird"
## [5] "Scrub jay" "Indigo bunting" "Starling"
categorize_list(species_name,'species')
## [1] "This set contains some non-numeric values describing species. It's probably categorical data."
df_birds <- data.frame(
length_cm = length_cm,
wingspan_cm = wingspan_cm,
weight_g = weight_g,
species_name = species_name
)
head(df_birds)
simple_plot <- ggplot(data=df_birds, aes(x=weight_g,y=wingspan_cm,label=species_name)) + #geom_smooth(method="lm") +
geom_point() +
geom_label(nudge_x=3)
simple_plot
Now let’s jump ahead a bit and get a taste of how data frames and principles of statistics can be used for linear modeling in R. Suppose you are now given the task of estimating the weight of a few more bird specimens that are mounted and cannot be accurately weighed. A generalized linear model (GLM) can use the length and wingspan of these new specimens to predict the unknown weight.
Before we delve into all the theory and math behind this prediction, let’s just use some common sense - if we need to predict an unknown weight, we know intuitively that bigger things are usually heavier, and we know that we can measure the length and wingspan. To test this notion that bigger birds are heavier mathematically, we will use the cor function in R to confirm this phenomenon. The cor function shows the correlation between the variables in our data frame. A correlation of 1 means that two variables move together perfectly in a straight-line relationship.
df_birds_measurements <- df_birds %>% dplyr::select(length_cm,wingspan_cm,weight_g)
cor(df_birds_measurements)
## length_cm wingspan_cm weight_g
## length_cm 1.0000000 0.9230812 0.9347757
## wingspan_cm 0.9230812 1.0000000 0.9113025
## weight_g 0.9347757 0.9113025 1.0000000
Sure enough, weight, wingspan, and length are all highly correlated in these birds, so we could estimate weight based on length alone. But we also have this other metric, the wingspan - longer wings would probably support heavier loads… so which one do we use? In this case, the answer is: use both. A glm can incorporate both the length and the wingspan.
The glm function will generate a model which will predict a value for an unknown variable based on one or more known variables. Later, we will learn more of the inner workings of the glm to understand how it does this.
You can now use your available bird data to make a prediction about the weight of these additional birds.
model <- glm(weight_g ~ wingspan_cm + length_cm, data=df_birds)
summary(model)
##
## Call:
## glm(formula = weight_g ~ wingspan_cm + length_cm, data = df_birds)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## 9.515 -14.826 -5.755 9.575 -4.172 -6.371 12.034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.1140 12.8097 -1.961 0.121
## wingspan_cm 0.8079 1.0657 0.758 0.491
## length_cm 2.4759 1.6902 1.465 0.217
##
## (Dispersion parameter for gaussian family taken to be 159.4973)
##
## Null deviance: 5782.08 on 6 degrees of freedom
## Residual deviance: 637.99 on 4 degrees of freedom
## AIC: 59.452
##
## Number of Fisher Scoring iterations: 2
The model has now been fitted to the data.
new_wing <-c(24,22,34,14)
new_length <-c(16,14,26,10)
new_species <-c('Nuthatch*','Titmouse*','Dove*','Junco*')
#define new observation
newdata = data.frame(wingspan_cm=new_wing, length_cm=new_length)
Now we will introduce some new birds with one field (weight) missing, and use this data to make a prediction on the missing weight.
#use the model to predict the unknown weight_g values
predict(model, newdata, type="response")
## 1 2 3 4
## 33.89072 27.32307 66.72893 10.95600
predicted_weight <- predict(model, newdata, type="response")
The predicted weights come back as a list in the same order that the rows of data were passed in. We can now recombine the data from the new birds with the predicted weights.
df_new <- data.frame(
length_cm = new_length,
wingspan_cm = new_wing,
weight_g = predicted_weight,
species_name = new_species)
df_birds_combined <- df_birds %>% bind_rows(df_new)
simple_plot <- ggplot(data=df_birds_combined, aes(x=weight_g,y=wingspan_cm,label=species_name)) + #geom_smooth(method="lm") +
geom_point() +
geom_label(nudge_x=1,hjust=0) +
ggtitle('Selected Birds Weight by Wingspan (*=estimates)')
simple_plot
OK - so we have seen how we can load data manually, one value at a time, but clearly this is not very helpful when you are working with large data sets, which usually arrive in Excel or some kind of database format.
Excel files are a common format for data collection and interchange, and R has some excellent capabilities for importing Excel file types, so we will start there. You must load the readxl package to gain the read_excel function in R. In this example, the data file is in the same directory as the Rmd file:
tbl_birds <-read_excel("minnesota_birds.xlsx")
df_minnesota_birds <- data.frame(tbl_birds)
summary(df_minnesota_birds)
## Name Scientific.Name Indicator.or.Other.Notes
## Length:444 Length:444 Length:444
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Category State Kingdom Phylum
## Length:444 Length:444 Length:444 Length:444
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Class Order Family Genus
## Length:444 Length:444 Length:444 Length:444
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Conservation.status Length.Min..cm. Length.Max..cm. Body.Mass.Min..g.
## Length:444 Min. : 7.00 Min. : 8.9 Min. : 2.0
## Class :character 1st Qu.: 13.88 1st Qu.: 16.0 1st Qu.: 18.0
## Mode :character Median : 20.50 Median : 25.0 Median : 55.0
## Mean : 28.51 Mean : 35.4 Mean : 327.3
## 3rd Qu.: 39.00 3rd Qu.: 46.0 3rd Qu.: 331.0
## Max. :138.00 Max. :180.0 Max. :9200.0
##
## Body.Mass.Max..g. Wingspan.Min..cm. Wingspan.Max..cm.
## Min. : 3.0 Min. : 8.00 Min. : 10.60
## 1st Qu.: 28.0 1st Qu.: 23.00 1st Qu.: 25.00
## Median : 88.5 Median : 38.00 Median : 42.00
## Mean : 665.6 Mean : 52.67 Mean : 60.62
## 3rd Qu.: 589.5 3rd Qu.: 72.75 3rd Qu.: 83.00
## Max. :14300.0 Max. :240.00 Max. :300.00
## NA's :1 NA's :1
The above code reads in a table from an XLSX file and converts it to a data.frame.
After we do this, we can use dplyr as before to mutate, then select specific fields, then mutate it again, etc. until our data is ready for whatever task we have in store.
In this case I want only the 3 measurements used in my prior data set, so I will discard a number of extra fields that are in the Excel data. Also recall that our model above had a single average measurement for length, wingspan, and weight - not a range of min/max values. So if we want to compare this external data with our own, we may need to fortify the data with new columns of averages to make the values line up.
In these examples we will use the pipe operator to pipe data from left to right. Each time you pipe data, you save the step of assigning that operation’s result to a named variable. Piping is a shorthand way to apply multiple transformations to your data in series, assigning only the final transformation to a named variable.
df_small_birds <- df_birds_combined %>% filter(weight_g <= 50)
df_small_birds
df_large_birds <- df_birds_combined %>% filter(weight_g > 50)
df_large_birds
In the example above only one pipe operation is performed, so this is the most basic version. You just take your data (df_birds_combined), pipe it into a function (filter), and assign the result to a variable (df_small_birds). It works exactly as if you had called a single function and returned the result. Below is an example of using multiple pipe operators on a line - first filtering, then adding a column.
df_small_birds <- df_birds_combined %>% filter(weight_g <= 50) %>% mutate(class='small')
df_small_birds
df_large_birds <- df_birds_combined %>% filter(weight_g > 50) %>% mutate(class='large')
df_large_birds
In the example below we are filtering to larger birds, adding a column, and arranging the result list by weight.
df_large_birds_byweight <- df_birds_combined %>% filter(weight_g > 50) %>% mutate(class='large') %>% arrange(weight_g)
df_large_birds_byweight
dplyr’s data.frame manipulation features, combined with the pipe operator, provide a great tool for running these kinds of multi-step transformations. The pipe lets you pass data from step to step to step, with no intermediate variables “holding” data - which is great for reproducible science: there are no extra variable names to potentially mix up, your development environment only retains the starting and ending state, and there is simply no gap between one operation and the next. You feed in one table and get out another, no matter how many steps are in the process.
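To see the difference, here is the same two-step transformation from above written without the pipe - each step needs its own temporary variable:
#without the pipe, an intermediate variable holds each step
tmp_birds <- filter(df_birds_combined, weight_g > 50)
df_large_birds <- mutate(tmp_birds, class = 'large')
#with the pipe, the same chain reads left to right with no temporaries
df_large_birds <- df_birds_combined %>% filter(weight_g > 50) %>% mutate(class = 'large')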
This next example is intended to be a little complex, because we want to demonstrate how the data can be manipulated multiple times in a single operation. In this case we are selecting some fields, renaming them, averaging 2 field values row by row, then selecting only the averages we want, in a new field order, with a sorted arrangement of rows. That’s 5 different transformations, but the pipe means we don’t need to create 5 new data frames to hold each iteration of the change we’re going to perform. We just run the operations in order, like so:
#assign to df_minnesota_bird_measurements
df_minnesota_bird_measurements <-
#using df_minnesota_birds as the source
df_minnesota_birds %>%
#mutate the field names we want to keep, removing dots and upper case letters, etc.
dplyr::mutate(
species_name = Name,
weight_g_min = Body.Mass.Min..g.,
weight_g_max = Body.Mass.Max..g.,
length_cm_min = Length.Min..cm.,
length_cm_max = Length.Max..cm.,
wingspan_cm_min = Wingspan.Min..cm.,
wingspan_cm_max = Wingspan.Max..cm.
) %>%
#now select only those fields
dplyr::select(
species_name,
weight_g_min,
weight_g_max,
length_cm_min,
length_cm_max,
wingspan_cm_min,
wingspan_cm_max
) %>%
#group them row wise for the next operations
dplyr::rowwise() %>%
#average the min/max values for each row into a new average column
dplyr::mutate(
weight_g=mean(c(weight_g_min, weight_g_max), na.rm=T),
length_cm=mean(c(length_cm_min, length_cm_max), na.rm=T),
wingspan_cm=mean(c(wingspan_cm_min, wingspan_cm_max), na.rm=T)
) %>%
#select just the averages, in the same order as our previous table
dplyr::select(
length_cm,
wingspan_cm,
weight_g,
species_name
) %>%
#reorder the table by wingspan descending
dplyr::arrange(-wingspan_cm)
#look at the table now
head(df_minnesota_bird_measurements,5)
So we’ve now completed this series of transformations as a clear chain of events that takes the data in the external file and makes it match the format we need for further work.
simple_plot <- ggplot(data=df_minnesota_bird_measurements, aes(x=log10(weight_g),y=log10(wingspan_cm),label=species_name)) + geom_smooth(method="loess") +
geom_point() +
#geom_text() +
ggtitle('Selected Birds Weight by Wingspan')
simple_plot
df_birds_all <- df_birds_combined %>% bind_rows(df_minnesota_bird_measurements %>% dplyr::select(length_cm,wingspan_cm,weight_g,species_name) %>% mutate(species_color = 'Not Available'))
df_birds_all %>% arrange(-wingspan_cm) %>% head(20)
#grab a small subset of the data
df_sample <- df_birds_all %>% filter(wingspan_cm>=18 & wingspan_cm < 24 & weight_g >= 15 & weight_g < 25)
simple_plot <- ggplot(data=df_sample, aes(x=log10(weight_g),y=log10(wingspan_cm),label=species_name)) +
#geom_smooth(method="loess") +
geom_point() +
geom_label(alpha=.5) +
ggtitle('Selected Birds Weight by Wingspan')
simple_plot
Now that we have seen how a hand-collected or acquired data set can be used, let’s take a look at how data sets may also be provided in base R or through R packages.
The mtcars data set can be loaded by simply typing “data(mtcars)”, which places a data frame called mtcars into your R environment. This is a “baked in” data set that ships with R. You will see many tutorials in R using this data on 1973 automobiles, or the “iris” data set featuring measurements of iris species traits, or the “diamonds” data set measuring cost and carat size of diamonds. These data sets are great resources for testing whether your code makes sense, trying out visualizations, and similar activities.
data(mtcars)
df_cars <- data.frame(mtcars)
head(df_cars)
Here is another point plot to visualize how horsepower correlates to displacement.
simple_plot <- ggplot(data=df_cars, aes(x=disp,y=hp,label=rownames(df_cars))) + geom_smooth(method="lm") +
geom_point() +
geom_label(nudge_x=3)
simple_plot
simple_plot <- ggplot(data=df_cars, aes(x=cyl,y=mpg,label=rownames(df_cars))) +
geom_point() +
geom_label(nudge_x=.1,hjust=0)
simple_plot
simple_plot <- ggplot(data=df_cars, aes(x=wt,y=drat,label=rownames(df_cars))) +
geom_point() +
geom_label(nudge_x=.1,hjust=0)
simple_plot
### Built-in Data Sets
To see a list of all the built-in data sets available in your currently loaded environment, just type “data()” and you will see the full list.
result <- data()
result
data(Seatbelts)
Seatbelts <- data.frame(Year=floor(time(Seatbelts)),
Month=factor(cycle(Seatbelts),
labels=month.abb), Seatbelts)
id <- as.numeric(rownames(Seatbelts))
Seatbelts <- cbind(id=id, Seatbelts)
Seatbelts <- Seatbelts %>% group_by(Year) %>% mutate(yearavg = round(mean(DriversKilled)))
simple_plot <- ggplot(data=Seatbelts, aes(x=id,y=DriversKilled,label=yearavg)) +
geom_point() +
geom_line() +
geom_text(aes(x=id,y=yearavg,label=ifelse(id %% 12 == 0, yearavg,'')),nudge_x=.1,hjust=0) +
geom_vline(xintercept = 170,color='red')
simple_plot
For a given population, the population arithmetic mean \(\mu\) may be determined by adding all items in the population and dividing by the number of items.
The sample mean, on the other hand, is the mean of the items taken in a random sample. This is represented in statistics as x bar \(\bar{x}\).
#build a small population of 19 integers
N <- c(40,29,23,25,31,22,36,33,44,22,24,34,28,29,31,33,42,32,26)
n <- sample(N,10)
Population variance measures the average squared deviation from the mean across a population.
First we must define the mean \(\mu\) as:
\(\mu =\frac{(x_1) + (x_2) + (x_3) + ... + (x_N)}{N}\)
pop_mean <- function(x) sum(x)/length(x)
mu <- pop_mean(N)
mu
## [1] 30.73684
Then \(\mu\) can be used to get the population variance using:
\(\sigma^2 =\frac{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_N-\mu)^2}{N}\)
pop_variance <- function(x) sum((x-pop_mean(x))^2)/length(x)
sigma_squared <- pop_variance(N)
sigma_squared
## [1] 40.29917
And the standard deviation of the population is the square root of the population variance.
pop_stddev <- function(x) sqrt(pop_variance(x))
sigma <- pop_stddev(N)
sigma
## [1] 6.348163
In many cases you will be using a sample, as opposed to using the entire population, to estimate a value or run some analysis. In this case, you will need these functions to estimate sample variance.
The term “x bar” \(\bar{x}\) refers to the sample mean, and can be defined as:
\(\bar{x} =\frac{(x_1) + (x_2) + (x_3) + ... + (x_n)}{n}\)
#create a function for the sample mean, (the standard arithmetic mean)
sample_mean <- function(x) sum(x)/length(x)
x_bar <- sample_mean(n)
x_bar
## [1] 30.5
And the sample variance formula can use x bar in this equation as:
\(s^2 =\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + ... + (x_n-\bar{x})^2}{n-1}\)
Note that the denominator is n-1 as opposed to just n. This correction results in the sample variance more closely approximating the population variance.
#sample variance
sample_variance <- function(x) sum((x-sample_mean(x))^2)/(length(x)-1)
s_squared <- sample_variance(n)
s_squared
## [1] 60.72222
The sample standard deviation can be expressed as:
\(s =\sqrt{\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + ... + (x_n-\bar{x})^2}{n-1}}=\sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}\)
#sample standard deviation
sample_stddev <- function(x) sqrt(sample_variance(x))
s <- sample_stddev(n)
s
## [1] 7.792446
The range of a set can be found by subtracting the minimum value from the maximum:
\(r =\max(x)-\min(x)\)
#range of a set (named n_range to avoid masking base R's range function)
pop_range <- function(x) max(x)-min(x)
n_range <- pop_range(n)
n_range
## [1] 22
s1 <- c(2,4,6,8,10,12,14,16,18)
s2 <- c(2,6,7,9,10,11,13,14,18)
s1_var <- sample_variance(s1)
s2_var <- sample_variance(s2)
s1_var
## [1] 30
s2_var
## [1] 22.5
As we have seen above, you can use R to get the mean, variance, and standard deviation by applying custom-built functions, derived from base R functions, across a given vector. The strength of R is in this ability to apply functions quickly across a list or vector of N values.
Although it’s important to understand the mechanics of all of these functions, you can use built-in functions in R to get the same values with less code. Here are a few examples:
#build a small population of 19 integers
N <- c(40,29,23,25,31,22,36,33,44,22,24,34,28,29,31,33,42,32,26)
n <- sample(N,10)
#get the mean
mu <- mean(N)
paste("The arithmetic population mean is:",mu)
## [1] "The arithmetic population mean is: 30.7368421052632"
#get the median
med <- median(N)
paste("The population median is:",med)
## [1] "The population median is: 31"
#get the mean of the sample set (n is a random sample, so this value will vary between runs)
x_bar <- mean(n)
paste("The sample mean is:",x_bar)
## [1] "The sample mean is: 30.5"
#determine the range of values in the population
population_range <- diff(range(N))
paste("The population range is:",population_range)
## [1] "The population range is: 22"
#determine the range of values in the sample
sample_range <- diff(range(n))
paste("The sample range is:",sample_range)
## [1] "The sample range is: 20"
You can use dplyr to pipe data through multiple transformations, as demonstrated above. dplyr may also be used to get sums, averages, and medians, or to provide other custom grouping, summarization, and totaling features, using built-in constructs like group_by and summarize, as sketched below.
ages_list <- c(24,26,25,29,33,24,32,34)
heights_list <- c(64,66,68,67,71,68,69,70)
heights_df <- data.frame(height=heights_list,age=ages_list)
mean_height <- heights_df %>% summarise(mean(height))
mean_height
mean_height_over_30 <- heights_df %>% filter(age>30) %>% summarise(mean(height))
mean_height_over_30
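The group_by construct mentioned above works the same way. Here is a small sketch that buckets heights_df into hypothetical age bands, then summarizes each group:
#group the rows into (hypothetical) age bands, then average each group
heights_df %>%
  mutate(age_band = ifelse(age < 30, 'under 30', '30 and over')) %>%
  group_by(age_band) %>%
  summarise(mean_height = mean(height), count = n())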
Effective use of statistical information requires visualizations. As you learn about statistics, you will encounter charts that are very familiar - like bar charts, pie charts, and line charts - and you will also encounter charts that are probably new to you - like histograms, ogives, scatterplots, confusion matrices, and other unusual and exotic-sounding visualizations that are common to the practice of statistical analysis in a research or academic setting.
It is often useful in R to create a simulated data set that matches a preset distribution. Let’s create a list of 1,000 random decimal numbers following a normal distribution. We can take a look at a histogram of this vector using R’s built-in hist plot function. A histogram counts the numeric values in the created vector and indicates with the y axis how many values fall within the range of each decimal “bin” indicated on the x axis.
list_amounts_x <- rnorm(1000)
head(list_amounts_x,50)
## [1] 0.68102765 -0.68203337 -0.72325674 1.67352596 -0.59575563 1.15984384
## [7] 0.11742241 0.25922139 0.38236210 -0.71148174 -1.17756957 -0.96016651
## [13] -0.87895224 -3.55613648 -1.41674984 -0.44876927 -0.77596771 -0.83182805
## [19] 0.05183012 -0.61655131 0.25439675 -0.33152046 -0.20631931 1.21533709
## [25] 1.89205642 0.07419352 1.75169617 -0.23148744 0.54345248 -0.98900140
## [31] 0.31553146 2.44232746 0.54969286 -0.02924337 -0.83078338 1.24643439
## [37] -1.37575529 -0.33110845 0.98091149 2.19077056 0.03054575 -0.78551741
## [43] 0.32544056 -0.88084355 0.20932594 0.15103295 -0.34347879 0.90587760
## [49] 0.91895485 -0.55598749
mean(list_amounts_x)
## [1] 0.007079853
pop_stddev(list_amounts_x)
## [1] 1.018073
hist(list_amounts_x, main = "Normal Distribution",breaks=50)
df_amounts <- data.frame(x=list_amounts_x)
plot <- ggplot(data=df_amounts, aes(x=x),width=400,height=400) +
geom_histogram(bins=50)
plot
As shown above, the numbers in this vector are both positive and negative decimals, which have a mean of around 0 and a standard deviation of 1. The range is about 6 - from just under -3 to just over 3.
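You can confirm the observed spread directly - range returns the minimum and maximum, and diff gives their difference (exact values will vary with each simulation):
#the observed minimum and maximum of the simulated values
range(list_amounts_x)
#the observed spread (max - min)
diff(range(list_amounts_x))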
If we want to generate a multivariate random dataset, we can just create another list of random values with rnorm and add it to a data.frame as the “y” variable for our existing x variable vector. We can visualize this as a sort of “shotgun blast” pattern of dots at x and y positions that are randomly placed in a normal distribution, concentrated around 0.
#create another random normal distribution
list_amounts_y <- rnorm(1000)
#create a data frame with the 2 distributions as the x/y coordinates of each point
df_amounts <- data.frame(x=list_amounts_x,y=list_amounts_y)
#visualize with ggplot
plot <- ggplot(data=df_amounts, aes(x=x,y=y),width=400,height=400) +
geom_point()
plot
You may add as many dimensions of data as needed this way - here’s an example of data that might have x, y, and z variables. This can be visualized by adding color to represent the value of the z variable. The z variable will again use rnorm to generate a random normal distribution, but this time centered around a mean of 100 with a standard deviation of 15.
#create another random normal distribution
list_amounts_z <- rnorm(1000,100,15)
#create a data frame with the 2 distributions as the x/y coordinates of each point
df_amounts <- data.frame(x=list_amounts_x,y=list_amounts_y,z=list_amounts_z)
#visualize with ggplot
plot <- ggplot(data=df_amounts, aes(x=x,y=y,color=z),width=400,height=400) +
geom_point()
plot
Colored scatter plots are a great way to visualize data in an x/y/z coordinate space.
Additional plots we will explore in future updates to this guide include box plots, qq plots, confusion matrices, and advanced chart features like faceting and grouping to provide more visualization options.
Chebychev’s Theorem states that for any distribution of numbers with a population mean \(\mu\) and a standard deviation \(\sigma\), the probability that a randomly chosen outcome lies between \(\mu - k\) and \(\mu + k\) is at least \(1 - \frac{\sigma^2}{k^2}\).
# Chebychev's theorem
chebychev_probability <- function(k,list){
sigma <- pop_stddev(list)
denom <- sigma^2/k^2
prob <- 1 - denom
prob
}
k <- 2
cheb_mu <- mean(list_amounts_x)
result <- chebychev_probability(k,list_amounts_x)
paste("The probability that a randomly chosen outcome lies between ",cheb_mu-k,"and", cheb_mu+k," is:",round(result*100,2),"%")
## [1] "The probability that a randomly chosen outcome lies between -1.99292014688804 and 2.00707985311196 is: 74.09 %"
This concludes “A Statistics Companion in R Markdown”. Thanks for reading, and please check back for more installments:
In “An Algebra Companion in R Markdown” we will look at some fundamental algebra operations and concepts, including polynomials, roots and radicals, factorization, lowest common multiple, and how R handles standard arithmetic equations.
In “A Finite Math Companion in R Markdown”, we explore more fully the subjects of finite mathematics including sets, financial formulas, interest and annuities, matrix operations, and more.
In “A Linear Algebra Companion in R Markdown” we will more thoroughly explore the subject of vector spaces and how to visualize vectors geometrically using the principles of Linear Algebra. That guide will explore more about the inner product, number spaces, and the concepts of magnitude and orthogonality.
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To get the latest version of R, please visit:
Packages used in this document are available from the Comprehensive R Archive Network (CRAN):
For more info on R packages visit:
This document was generated by RMarkdown using R code and LaTeX, and rendered to HTML.
Additional formatting has been applied to results variables using the kable and kableExtra packages. These code blocks are not shown in the output, for readability.