Learning R

Learning the R Language

Around the end of January 2020, I enrolled in the HarvardX Data Science certificate program. After about a year of self-paced study, I completed the nine courses for this program. The courses consist of several units per course, with each unit having a collection of videos that correspond with chapters in the textbook. The courses are taught by Professor Rafael Irizarry, PhD.

The courses require that you install the R environment and RStudio to work through the course material. R code samples are provided in the textbook, and you will have to understand them well to get through the assessments and exercises throughout the course. So far, I have been able to use the template code provided in the textbook within RStudio to successfully reproduce the examples in the book, and through trial and error, I have adapted the examples to answer questions presented at the conclusion of each unit.

The assessments at the end of each chapter are fairly challenging – sometimes a single question requires 4 or 5 facts to be correctly inferred from a data set or chart. You usually have more than one chance to get it right, but not always. I have also run into some technical challenges with the R environment, RStudio, packages, and data sources throughout the process, which I will document in greater detail later in this paper.

Setup Steps

All examples will use R. Before you get started, download and install R and RStudio on your development machine:

https://www.r-project.org/

https://rstudio.com/

If you are interested in the course - it can be found here:

https://www.edx.org/professional-certificate/harvardx-data-science

The HarvardX honor code prohibits posting any answers to questions, class work, etc - so I will instead construct data that demonstrates the basic concepts we are looking at - without spoiling the course or spilling the beans on all the cool subjects you will get to study!

My Background & First Impressions

Starting this course, I was already familiar with programming in general, and SQL in particular for data operations. I have been doing full stack development in the dotnet/SQL sphere for about 20 years professionally. My first impression, coming from a web programming background, is that R, dotnet, and SQL features overlap substantially. The syntax is similar to JavaScript or C#, but more fundamentally built around vectors / lists rather than individual variable values. Although every language can handle arrays and hash tables in some manner, R implements these constructs very elegantly. This makes sense, as statistics is all about working with sets of observations, and this is what R was built for. There is also connectivity between SQL and R via the odbc package, allowing you to blend the two and choose cases where R offers advantages, or cases where SQL might be better for a specific analysis task.

Even under dotnet, you can always split analysis between your SQL, ASP, and HTML/JavaScript layers, but R offers a powerful all-in-one language to go from database to presentation, with a much simpler technology stack. You can create reproducible, discrete code blocks for logical analysis and take data from the database to a finished chart without having to run through so many transformations. Fewer hand-offs between layers means fewer type conversions, encoding headaches, which-quotes-do-I-use-here moments, etc. R is especially suited to charting and visualization of data, random sampling, modeling, and statistical analysis tasks.

As an example - Under a typical setup for a web based ASP.net dashboard you might:

pull data from SQL using a SQL query or SPROC
transform it in your VB or C# code to match the format needed for Google Charts
pass the data to a JavaScript function that calls the Google Charts API
send your data to Google
get back the chart details
render it on your web page via JavaScript

Under an R script, all of these steps above can be shortened to:

pull data from SQL in R, transform it inside of R, generate the plot graphic or HTML in R, and return it to the browser

ONE STEP - No round trips to an outside charting service!

One of my goals / challenges in this learning endeavor is bridging the gap between my large existing code base in VB.NET and JavaScript, SQL databases, and new projects I may want to build in R. How will I bring over data in a secure and controlled way? How will I call R from dotnet? How will I scale local development to a production environment?

I found connecting to my SQL Server to be relatively easy from a local context. But real world deployment of a direct SQL connection will require more investigation. Another approach (vs SQL odbc) would be to use an API middle layer using Entity Framework for data provisioning to R - but that effectively draws a line between your R and your SQL applications, whereas the odbc package in R allows you to use SQL DML seamlessly with R. Either way - it will take some time to experiment and find a suitable architecture.

Course Prerequisites and Level of Effort

The Data Science for Professionals Certificate program requires a mix of programming and math skills. As for math skills going in - the first 4 courses were pretty introductory and anyone with high school algebra skills would be fine in those. The modeling and regression courses get pretty heavy on deeper math skills and you start to get waist deep in procedural programming. Coming from a programming background has been a major asset for me.

The probability related courses in this program cover the same ground as a college-level statistics or analytics course - so if you have already taken college level statistics, the probability, inference, and regression courses will likely cover similar ground. Even so - your college stats class probably did not use R - so it’s cool to see this content taught in the context of R programming. R has many shortcuts for common stats functions.

You don’t have to be a math wizard to take this program but you should probably enjoy working with mathematics and computers, and have some comfort in “command prompt” programming… or the course might prove frustrating.

The assessments are not easy, but definitely passable if you closely read the chapters and ALWAYS run the code examples given in your local RStudio. A free / optional donation text book is provided with the course, and each video slide corresponds with a chapter or collection of chapters. It’s a treat because the videos are narrated by the instructor, who also wrote the textbook (Rafael Irizarry, PhD, Professor of Biostatistics, T.H. Chan School of Public Health, Harvard University). Generally each video is between 2-10 minutes long. His lessons are basically reading you the textbook for the course. It does help to hear the lessons in his own words, with his emphasis and inflection.

There are many assessments throughout each course - so it’s a good deal of work to get through all the course work. If you did not do every exercise, you would struggle to understand the next chapter. You have to look at the data sets your code is producing along the way as you transform the data. You must get 70% or greater total score to pass each course.

Each chapter within a course builds on the previous one and assumes a complete understanding of previous work. Some of the courses briefly review content from earlier courses (since they are not always taken as a cohesive program) - but in general there is not much review or practice - you are expected to really know the material before you move on.

As an example - the Linear regression course has a small “brush up” on Bayesian statistics. Realistically you would need to have taken a Probability course that dedicated multiple chapters to Bayesian statistics or it would be a struggle to get through regression with such a cursory review of the subject.

Some questions are multiple choice and you may get more than one chance to get it right. Those have been the easiest. The DataCamp assessments are tougher - on these you have to write code that results in a correct outcome. Sometimes it’s a line or two - sometimes it’s a whole block of code. But you can’t pass until you get it right or give up. Several times I have spent anywhere from 30 minutes to a couple of hours working on a single question in DataCamp. Lastly, you have some questions where you have to write in a numeric answer - which means you first have to write correct code in RStudio to get to the right answer, and then enter it to the 3rd decimal place or in some cases the 6th decimal place. These are probably the toughest since you have only a couple of chances and it’s so easy to accidentally get the wrong value, even if your code is 99% correct. By the 8th course - machine learning - the code you had to write was pretty challenging!

R Basics

The first course, R Basics, was a broad overview of the R language including all the basic syntax for using R.

https://www.r-project.org/about.html

R is one of the go-to languages for data science professionals. It is an implementation of the S language, which was developed at Bell Labs by John Chambers and colleagues. To me, the R language resembles JavaScript, or a merger of JavaScript and SQL. It integrates data storage, a suite of operators, tools for data analysis, a graphics facility for rendering, and a comprehensive programming language with all the key constructs for conditionals, loops, and functions. Importantly, R has strong functional programming features: it supports a family of apply functions which run a function over each element of a vector, list, or data frame and collect the results. Unlike traditional programming in a loop model, writing code for apply family functions requires that you carefully encapsulate the needed transformations into discrete, repeatable functions. Doing so, and leaning on R’s vectorized operations where possible, can make your code much faster than element-by-element loops. This is important in big data - while loops, for loops, and other similar constructs can bog down your code to the point that it’s not useful.
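As a rough illustration of the loop-versus-apply point above, the sketch below computes the same result three ways. The `prices` vector, the `add_tax` function, and the 8% rate are all made up for this example.

```r
# Hypothetical data: a vector of prices
prices <- c(2.50, 3.19, 2.25, 0.79)

# A small, self-contained function that transforms one value
add_tax <- function(price, rate = 0.08) {
  round(price * (1 + rate), 2)
}

# 1. Traditional loop: fill a result vector one element at a time
loop_result <- numeric(length(prices))
for (i in seq_along(prices)) {
  loop_result[i] <- add_tax(prices[i])
}

# 2. Apply family: sapply() calls add_tax() on each element
apply_result <- sapply(prices, add_tax)

# 3. Vectorized: most arithmetic in R already works on whole vectors
vector_result <- add_tax(prices)

# All three approaches give the same answer
identical(loop_result, apply_result)
identical(apply_result, vector_result)
```

In practice the vectorized form is usually the fastest of the three, and sapply() is mostly a tidier way to express the loop.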

The R environment is basically just a command prompt (think MS-DOS), but there is a free IDE called RStudio, which improves on the functionality of the command prompt and lets you save files, run chunks of code, view data trees, debug, and manage your R project.

In R, the native unit is a vector or a list of values. A list of values may be of any number of types including numeric, character, or boolean (TRUE/FALSE) values. The <- assignment operator is the preferred syntax for assigning a value to the variable on the left. I prefer this over a single = sign, which is traditional in many languages. If you have ever programmed in JavaScript you know how easy it is to mix up the assignment (=) and equality (==) operators.

A vector in R is equivalent to a one-dimensional array in JavaScript or a column in SQL. A vector represents a set of observations of the same parameter / column - a basic list of values, of the same data type, that can be accessed through square brackets. Data frames in R are an extension of this concept, enabling you to build complete tables of rows and named columns, where each column is a vector with the same number of values.

Although the course is called R Basics - there is also a focus on the science behind data and programming - like how you should think about data. Is the data binary? Character? Numeric? If numeric, are the values continuous or discrete? Are characters categorical factors or ordinals? How do two pieces of data combine through traditional arithmetic and boolean arithmetic? All of these core principles drive future decisions when analyzing and manipulating your data, choosing mathematical models, and determining how best to approach wrangling your data sets.

foodlist <- c("Apple","Broccoli","Carrot","Banana","Orange","Pineapple","Guava","Walnut","Almond","Peanut") # character vector
foodcats <- c("Fruit","Veggie","Veggie","Fruit","Fruit","Fruit","Fruit","Legume","Legume","Legume") # character vector of categories (converted to a factor below)
foodprices <- c(2.50,3.19,2.25,.79,2.79,4.79,0.49,5.95,4.95,3.79) # numeric vector
foodoz <- c(16,24,12,16,16,32,1,10,14.5,36) # numeric vector
foodunit <- c("lb","each","each","lb","lb","each","oz","each","each","each") # character vector of categories (converted to a factor below)
foodinpkg <- c(FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE) # logical (boolean) vector

foodpriceperoz <- foodprices / foodoz # arithmetic on vectors produces vectors of same length in same order
foodsoldbyweight <- !foodinpkg # boolean arithmetic on vectors produces vectors of same length in same order

# stringsAsFactors = TRUE keeps the factor columns shown in the output below;
# since R 4.0 strings are no longer converted to factors by default
foodtable <- data.frame(title = foodlist,
                        type = foodcats,
                        priceamt = foodprices,
                        weightoz = foodoz,
                        priceperoz = foodpriceperoz,
                        unit = foodunit,
                        packaged = foodinpkg,
                        byweight = foodsoldbyweight,
                        stringsAsFactors = TRUE)

foodtable

##        title   type priceamt weightoz priceperoz unit packaged byweight
## 1      Apple  Fruit     2.50     16.0  0.1562500   lb    FALSE     TRUE
## 2   Broccoli Veggie     3.19     24.0  0.1329167 each     TRUE    FALSE
## 3     Carrot Veggie     2.25     12.0  0.1875000 each     TRUE    FALSE
## 4     Banana  Fruit     0.79     16.0  0.0493750   lb    FALSE     TRUE
## 5     Orange  Fruit     2.79     16.0  0.1743750   lb    FALSE     TRUE
## 6  Pineapple  Fruit     4.79     32.0  0.1496875 each    FALSE     TRUE
## 7      Guava  Fruit     0.49      1.0  0.4900000   oz    FALSE     TRUE
## 8     Walnut Legume     5.95     10.0  0.5950000 each     TRUE    FALSE
## 9     Almond Legume     4.95     14.5  0.3413793 each     TRUE    FALSE
## 10    Peanut Legume     3.79     36.0  0.1052778 each     TRUE    FALSE

Including Plots

Once you have some data to work with, plotting is a very simple matter of calling the plot function and passing in some parameters to describe the x/y values, at a minimum. Although the plot function shown here is the most basic implementation - R at its purest - in practice, most of the course has used a package called ggplot2 to render plots. This library offers a ton of useful formatting and functionality beyond what the base plot function provides. More on that in the Visualization section of this report.

plot(foodtable$title,foodtable$priceamt)

plot(foodtable$type,foodtable$priceperoz)

hist(foodtable$priceamt)

hist(foodtable$priceperoz)

Transforming data

Basic if/else logic can be applied across vectors using the helpful ifelse(test, result if true, result if false) shortcut function, similar to the JavaScript shorthand conditional: test ? true : false

pricelist <- foodtable$priceamt
over3bucks <- ifelse(pricelist > 3, TRUE, FALSE)
halfprice <- round(pricelist / 2,2)

pricelist

##  [1] 2.50 3.19 2.25 0.79 2.79 4.79 0.49 5.95 4.95 3.79

over3bucks

##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

Investigating data

The R Basics course focuses a lot on defining fundamental concepts like discrete vs continuous data, data types, how to construct functions, conditionals, loops, and so forth, setting up the student with a toolbox to use in the following courses. R offers a lot of the same features as SQL in terms of filtering, grouping, joining, summarizing, and ordering data.

maxrowindex <- which.max(foodtable$priceamt)
minrowindex <- which.min(foodtable$priceamt)

# The name of the food with the maximum price:
foodtable$title[maxrowindex]

## [1] Walnut
## 10 Levels: Almond Apple Banana Broccoli Carrot Guava Orange ... Walnut

# The type of the food with the minimum price:
foodtable$type[minrowindex]

## [1] Fruit
## Levels: Fruit Legume Veggie

# ordering the food name vector by price, in decreasing order
foodtable$title[order(foodtable$priceamt, decreasing=TRUE)]

##  [1] Walnut    Almond    Pineapple Peanut    Broccoli  Orange    Apple    
##  [8] Carrot    Banana    Guava    
## 10 Levels: Almond Apple Banana Broccoli Carrot Guava Orange ... Walnut
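The SQL-style operations mentioned in this section (filtering, grouping, joining, ordering) can be sketched in base R as follows. The small `foods` and `aisles` tables are invented for this example, and the SQL in the comments is only a loose analogy.

```r
# Hypothetical tables, invented for illustration
foods <- data.frame(
  title = c("Apple", "Broccoli", "Carrot", "Banana"),
  type  = c("Fruit", "Veggie", "Veggie", "Fruit"),
  price = c(2.50, 3.19, 2.25, 0.79)
)
aisles <- data.frame(
  type  = c("Fruit", "Veggie"),
  aisle = c(1, 2)
)

# WHERE price > 2
expensive <- subset(foods, price > 2)

# GROUP BY type with AVG(price)
avg_by_type <- aggregate(price ~ type, data = foods, FUN = mean)

# INNER JOIN on the shared "type" column
joined <- merge(foods, aisles, by = "type")

# ORDER BY price DESC
sorted <- foods[order(foods$price, decreasing = TRUE), ]

avg_by_type
```

The dplyr package covers the same ground with filter(), group_by(), summarize(), inner_join(), and arrange().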

Summary

The summary function is very useful - giving you a quick summary of the data types and ranges in your data.

summary(foodtable)

##       title       type      priceamt        weightoz       priceperoz     
##  Almond  :1   Fruit :5   Min.   :0.490   Min.   : 1.00   Min.   :0.04938  
##  Apple   :1   Legume:3   1st Qu.:2.312   1st Qu.:12.62   1st Qu.:0.13711  
##  Banana  :1   Veggie:2   Median :2.990   Median :16.00   Median :0.16531  
##  Broccoli:1              Mean   :3.149   Mean   :17.75   Mean   :0.23818  
##  Carrot  :1              3rd Qu.:4.540   3rd Qu.:22.00   3rd Qu.:0.30291  
##  Guava   :1              Max.   :5.950   Max.   :36.00   Max.   :0.59500  
##  (Other) :4                                                               
##    unit    packaged        byweight      
##  each:6   Mode :logical   Mode :logical  
##  lb  :3   FALSE:5         FALSE:5        
##  oz  :1   TRUE :5         TRUE :5        
##                                          
##                                          
##                                          
##

Head

Head is equivalent to a SELECT TOP X in SQL. It just gets the top few records, but it can also be very useful for splitting your data frame into chunks for additional analysis.

head(foodtable,5)

##      title   type priceamt weightoz priceperoz unit packaged byweight
## 1    Apple  Fruit     2.50       16  0.1562500   lb    FALSE     TRUE
## 2 Broccoli Veggie     3.19       24  0.1329167 each     TRUE    FALSE
## 3   Carrot Veggie     2.25       12  0.1875000 each     TRUE    FALSE
## 4   Banana  Fruit     0.79       16  0.0493750   lb    FALSE     TRUE
## 5   Orange  Fruit     2.79       16  0.1743750   lb    FALSE     TRUE

Tail

The opposite of head - Just getting the bottom few records. You could do this in SQL by ordering in DESC order, selecting, then reordering the selection. But this is just a great, quick alternative to that kind of tedious multi-step operation.

tail(foodtable,3)

##     title   type priceamt weightoz priceperoz unit packaged byweight
## 8  Walnut Legume     5.95     10.0  0.5950000 each     TRUE    FALSE
## 9  Almond Legume     4.95     14.5  0.3413793 each     TRUE    FALSE
## 10 Peanut Legume     3.79     36.0  0.1052778 each     TRUE    FALSE

Glimpse (dplyr)

Glimpse from the dplyr package is another useful function, showing you the data type and the first few values of every column.

dplyr::glimpse(foodtable)

## Rows: 10
## Columns: 8
## $ title      <fct> Apple, Broccoli, Carrot, Banana, Orange, Pineapple, Guav...
## $ type       <fct> Fruit, Veggie, Veggie, Fruit, Fruit, Fruit, Fruit, Legum...
## $ priceamt   <dbl> 2.50, 3.19, 2.25, 0.79, 2.79, 4.79, 0.49, 5.95, 4.95, 3.79
## $ weightoz   <dbl> 16.0, 24.0, 12.0, 16.0, 16.0, 32.0, 1.0, 10.0, 14.5, 36.0
## $ priceperoz <dbl> 0.1562500, 0.1329167, 0.1875000, 0.0493750, 0.1743750, 0...
## $ unit       <fct> lb, each, each, lb, lb, each, oz, each, each, each
## $ packaged   <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRU...
## $ byweight   <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE...

In SQL it can be cumbersome to say, create a temporary table with column definitions, populate it, query it, alter it, etc… That is all a line of code in R.

At each step – you can add more data, link additional sources via joins, and continue meshing and merging your data sets until you have the final product you need for plotting.

It also makes it so easy – one line of code – to go from a table of data, to a chart graphic. One more line – and the graphic is saved as a JPG/PNG etc.
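As a minimal sketch of saving a chart to disk, the example below uses only base R graphics, where the plot call is wrapped between a device call and dev.off(). The `prices` data and the temporary file name are assumptions made for this example. With ggplot2, the ggsave() function plays the same role in a single call.

```r
# Hypothetical data, invented for illustration
prices <- c(Apple = 2.50, Banana = 0.79, Orange = 2.79)

# Write the chart to a temporary PNG file instead of the screen
outfile <- tempfile(fileext = ".png")
png(outfile, width = 600, height = 400)
barplot(prices, main = "Price by item", ylab = "Price (USD)")
dev.off()  # close the device so the file is written

file.exists(outfile)
```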

Visualization

This is an amazing course! The ggplot2 and dplyr packages are magnificent add-ons to the R environment. I would almost consider these two packages to be “essential” to programming in R. First of all, the language is way more streamlined with dplyr: you can “pipe” data through a series of %>% operators and assign the final data to a variable in one statement. This would otherwise require many lines to keep transforming your data, assigning it to a new variable, then continuing to process that variable, which often results in a long series of variables that have only one use as an intermediate step in a series of transformations. With the pipe, these multi-step transformations can be chained together, creating a statement that is very elegant and clean.

Loading the dplyr package gives you many useful utilities along with the %>% pipe operator (which comes from the magrittr package and is re-exported by dplyr). The pipe passes the data on its left into the operation on its right. I highly recommend using this package - it makes development a lot faster and more direct, with fewer temporary, single-use variables needing to be set. Plus it just looks cleaner.

In addition, the ggplot2 package opens up a huge world of additional plotting styles beyond the very basic plots rendered by base R. I can’t imagine why anyone would not use these two packages on virtually any R application. I suspect there are some performance issues with certain plot types, but overall the packages seem to perform very well and generate extremely complex plots relatively quickly.

As to performance - R is supposed to be built for big data. From what I can tell - the performance of the graphics renderer is sufficient for commercial use, although it may be slow at times. Caching and pre-loading strategies may be advisable for some more complex charts - or perhaps there are hardware options to increase performance. These are all areas I plan to investigate.

Loading libraries

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

foodtable %>% ggplot(aes(x=priceamt,y=weightoz)) + geom_point() + geom_line()
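To illustrate the piping style described in this section, here is a small sketch that runs the same transformation with and without the pipe. The `foods` data frame is invented for this example, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data, invented for illustration
foods <- data.frame(
  type  = c("Fruit", "Veggie", "Veggie", "Fruit", "Legume"),
  price = c(2.50, 3.19, 2.25, 0.79, 5.95)
)

# Without the pipe: each step needs its own temporary variable
step1   <- filter(foods, price > 1)
step2   <- group_by(step1, type)
step3   <- summarize(step2, avg_price = mean(price))
no_pipe <- arrange(step3, desc(avg_price))

# With the pipe: the same steps read top to bottom
with_pipe <- foods %>%
  filter(price > 1) %>%
  group_by(type) %>%
  summarize(avg_price = mean(price)) %>%
  arrange(desc(avg_price))

with_pipe
```

Both versions produce the same three-row summary. The piped version simply avoids the single-use `step1` to `step3` variables.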

Probability

OK- things are starting to heat up now. The probability course covered a lot of ground in the mathematics of probability and conditional probability.

The course is broken into sections about discrete probability (where the outcome is a binary or categorical outcome), continuous probability (where the outcome is a numeric value), and a final section about models and the central limit theorem (CLT).

This course is focused on mathematical theories, particularly set theory - the study of populations, and their distributions. The course covers how to identify and classify different types of distributions using histograms and QQ plots, knowing when a set of observations can be considered normally distributed or following another known distribution type. The course provides a solid theoretical foundation and many useful tools to extend your exploratory data analysis capabilities. This class also provides a firm understanding of mean \(\mu\) and standard deviation \(\sigma\) calculations.
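A quick sketch of the tools mentioned above, using simulated data rather than anything from the course. The mean of 10 and standard deviation of 2 are arbitrary choices for this example.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated observations from a normal distribution (mu = 10, sigma = 2)
x <- rnorm(1000, mean = 10, sd = 2)

mu_hat    <- mean(x)  # sample estimate of mu
sigma_hat <- sd(x)    # sample estimate of sigma

# Visual checks for normality
hist(x)
qqnorm(x)   # points close to a straight line suggest a normal distribution
qqline(x)
```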

The central limit theorem and Bayes’ theorem are discussed, and many questions center around calculating probabilities of outcomes with various combinations and prerequisites, like “the probability of A, given B and C”. The probability of A in this context has a known mathematical relationship to the probabilities of B and C, so knowing how these relate is a powerful mathematical tool.

This family of logic is generally termed Bayesian statistics and it comprises a large part of the probability coursework and assessments.
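As a small worked example of this kind of conditional probability, the sketch below applies Bayes’ theorem to a diagnostic test. All three input probabilities are hypothetical numbers chosen for illustration.

```r
# Hypothetical numbers, invented for illustration
p_disease      <- 0.01  # P(D): prevalence of the condition
p_pos_given_d  <- 0.99  # P(+ | D): test is positive when the condition is present
p_pos_given_nd <- 0.05  # P(+ | not D): false positive rate

# Total probability of a positive test
p_pos <- p_pos_given_d * p_disease + p_pos_given_nd * (1 - p_disease)

# Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
p_d_given_pos <- p_pos_given_d * p_disease / p_pos
p_d_given_pos
```

With these inputs the result is 1/6, so only about 17% of positive tests correspond to real cases, because the condition itself is rare.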

The final, fourth section of this course included subjects like home loan values and predicting defaults in the mortgage markets. It’s great to see how these principles of statistical probability can be used to foretell risks and discover problematic behaviors or trends before they happen.

Inference & Modeling

This course focuses heavily on polling subjects and how to assess the quality of your probabilistic predictions.

In this course you learn about sample distributions, calculating spread and determining the standard error and confidence intervals from equations using \(\overline{x}\) and \(\sigma\).

The course discusses the subject of confidence intervals in some depth. A confidence interval of 95% means that the probability that the random interval falls on top of p is 95%.

Confidence Intervals

In this code, a Monte Carlo simulation is used to confirm that a 95% confidence interval contains the true value of p (roughly) 95% of the time.

B <- 10000    # number of simulated polls
N <- 10000    # sample size of each poll
p <- .5       # true proportion

# between() is from dplyr
inside <- replicate(B, {
    X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
    X_hat <- mean(X)
    SE_hat <- sqrt(X_hat*(1-X_hat)/N)
    between(p, X_hat - 2*SE_hat, X_hat + 2*SE_hat)    # TRUE if p in confidence interval
})
mean(inside)

## [1] 0.9535

The course also discusses the concept of power in polling. Polls with too few observations have a large standard error.

This means that the confidence interval for the spread may include 0. In polling this is called “lack of power”. Power can be thought of as the probability of detecting a spread that is different from 0. To increase power, we can increase the sample size, which lowers the standard error and reduces the likelihood that our interval contains 0 - in other words, that the outcome looks like a “toss up”.
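The effect of sample size on power can be sketched with a few lines of arithmetic. The observed proportion of 0.52 and the three sample sizes are hypothetical values chosen for this example.

```r
# Hypothetical poll result, invented for illustration
X_hat <- 0.52                  # observed proportion for one candidate
N     <- c(100, 1000, 10000)   # three possible sample sizes

spread    <- 2 * X_hat - 1                      # estimated spread
SE_spread <- 2 * sqrt(X_hat * (1 - X_hat) / N)  # standard error of the spread

# 95% confidence interval for the spread at each sample size
lower <- spread - qnorm(0.975) * SE_spread
upper <- spread + qnorm(0.975) * SE_spread

# TRUE when the interval still contains 0 (a "toss up")
includes_zero <- lower < 0 & upper > 0
data.frame(N, SE_spread, lower, upper, includes_zero)
```

With these numbers the interval contains 0 at N = 100 and N = 1000, and only excludes 0 once N reaches 10000.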

P Values

The idea of p-values is introduced as a way of quantifying how strong your evidence is against the null hypothesis (for example, that the spread is 0). The p-value is the probability of observing a value as extreme or more extreme than the result, given that the null hypothesis is true.

N <- 100    # sample size
z <- sqrt(N) * 0.02/0.5    # z-score when the observed proportion is 0.02 away from the null value of 0.5
1 - (pnorm(z) - pnorm(-z))

## [1] 0.6891565

We looked at polling data and used dplyr to filter groups and summarize data, to filter out the noise, and try to account for bias in the poll data. By stratifying polls by pollster, you can begin to see patterns emerge and account for some of the possible pollster bias in the polling process. These are broadly termed confounders and you must account for confounders in any analysis of many interconnected variables, otherwise you are liable to double-count effects or fail to account for situations where one effect and another are offsetting each other.

Some of the theory gets pretty complex. Some of the more advanced subjects include using the t-distribution instead of the normal distribution to construct confidence intervals, and how to aggregate the results of multiple polls to create more accurate predictors.

Productivity

The productivity course was mostly review for me, although it did contain some great information about using R Markdown (which is what this document was written in). This is super important as you work through the later courses in the program: the capstone requires two larger projects formatted in this file type.

I completed this course in a couple of days. It was a nice respite from the math-heavy subjects of the prior two courses. This one focused more on Unix, command line syntax, and how to take advantage of GitHub and other resources for code sharing and re-use. Relative to the other courses, I felt like the assessments were super basic and did not require agonizing twenty or thirty minute code sessions to get to a right answer. In many cases, the assessment questions are answered by plugging code into RStudio that is more-or-less provided on the screen.

Linear Regression

This class was fairly difficult. We learned all about calculating correlation and plotting regression lines over a scatter plot. It’s kinda funny because all the complex math that goes into linear regression is encapsulated in R by one-line functions. But this is one of the cool parts about taking a data science course. Anyone can drive a car, but not everyone understands what’s going on under the hood. As a data scientist, you can’t just apply a linear model unless you understand the mechanics of it and why you might choose one approach vs another.

This course really enlightens the student on how these methods operate under the hood. The course dissects the topic of regression, starting by giving you a fundamental understanding of how \(\rho\) is calculated. This is the correlation coefficient, which summarizes the linear relationship between two variables and is most informative when the two sets of data are bivariate normal (jointly normal). This is so important to understand - when can we use one set of data to make predictions about another? This practice - the fundamental challenge of machine learning - is thoroughly explored in this course.
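The one-line functions mentioned above can be sketched on a tiny made-up data set. The `x` and `y` values are invented for this example. The last calculation shows the textbook relationship between the regression slope and the correlation coefficient.

```r
# Hypothetical paired observations, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

rho <- cor(x, y)   # sample correlation coefficient
fit <- lm(y ~ x)   # least-squares regression line

slope     <- unname(coef(fit)["x"])
intercept <- unname(coef(fit)["(Intercept)"])

# The slope can also be recovered from the correlation:
# slope = rho * sd(y) / sd(x)
slope_from_rho <- rho * sd(y) / sd(x)

plot(x, y)
abline(fit)  # draw the fitted line over the scatter plot

c(slope = slope, intercept = intercept, slope_from_rho = slope_from_rho)
```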

The earlier probability course is absolutely fundamental to this course - so be sure you are comfortable with that course before you try to crack this egg.

Data Wrangling

In the data wrangling course, substantial time is spent on the subjects of acquiring and transforming data that may be messy or disorganized. Powerful tools like regular expressions can help the data scientist distill raw data to a clean and usable form.

We looked at how you can quantify a lot of tricky things, like how many times a word or set of words appears in a novel, or how to convert from one form of a date format to another. Packages like lubridate were introduced which extend the value of base R for dates and duration. This is another key package that you should never try to program in R without!
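Here is a small sketch of the kinds of wrangling tasks described above. It uses only base R to stay dependency-free, and the messy input strings are invented for this example. The lubridate package offers helpers such as mdy() for the date conversion shown here.

```r
# Hypothetical messy inputs, invented for illustration
raw_prices <- c("$2.50", "3.19 USD", " 0.79")
raw_dates  <- c("03/15/2020", "12/01/2019")
sentence   <- "The cat and the hat and the bat"

# Regular expression: drop everything except digits and the decimal point
prices <- as.numeric(gsub("[^0-9.]", "", raw_prices))

# Convert month/day/year text to Date objects, then to ISO format
dates     <- as.Date(raw_dates, format = "%m/%d/%Y")
iso_dates <- format(dates, "%Y-%m-%d")

# Count how often each word appears
words       <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
word_counts <- table(words)
word_counts[["the"]]
```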

Machine Learning

All the 7 previous courses were like fueling up your space ship - but ML is where you actually get to go into orbit. The ML course covers a broad array of machine learning techniques, showing you how to apply linear regression models for useful predictions, and extending beyond that into decision trees and random forests.

Substantial time is also devoted to computer vision as we delved into the MNIST data set, which contains a reference set of handwritten digits that you can apply ML to, in order to assign numeric values to handwritten text.

It may sound easy enough, but the science is quite complex - it requires all of the concepts we learned from wrangling the data, analyzing probabilities, visualizing distributions, and building models using inference techniques to infer the correct or most likely guess for the meaning of a handwritten glyph or symbol. The ideas behind ML have provided the world with innumerable benefits from machines capable of sorting mail to highly complex imaging systems for detecting disease in x-rays and DNA assays.
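The basic supervised learning workflow (split the data, fit on one part, evaluate on the other) can be sketched with a plain linear model. This is only an illustration and not course material. It uses the mtcars data set that ships with R, and predicting mpg from weight with a 20% test split is an arbitrary choice.

```r
set.seed(1)  # reproducible train/test split

# mtcars ships with R; hold out about 20% of the rows for testing
n         <- nrow(mtcars)
test_rows <- sample(n, size = round(0.2 * n))
train_set <- mtcars[-test_rows, ]
test_set  <- mtcars[test_rows, ]

# Fit on the training rows only
fit <- lm(mpg ~ wt, data = train_set)

# Evaluate on rows the model has not seen
predicted <- predict(fit, newdata = test_set)
rmse      <- sqrt(mean((test_set$mpg - predicted)^2))
rmse
```

The same split, fit, and evaluate pattern carries over when the linear model is swapped for a decision tree or a random forest.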

Capstone

The capstone portion of the program, for me, was the most challenging. The capstone required 2 different projects. The first one was very constrained, in which you had to reach a certain goal performance for a specified machine learning task.

The second capstone is more self-directed and required that you choose your own data set and build a supervised ML model. These two assignments took me several months to complete, although not everyone appeared to put that much effort in.

As part of the capstone, you are required to grade your peers’ work, so it’s a great opportunity to learn from your peers and see how other people have approached problems.

Unfortunately, about half of the capstones I reviewed were very clearly violating the honor policy by plagiarizing other sources, or in some cases, they just seemed very slapdash and rushed. I truly admired the efforts of many of my peers from around the world to turn in excellent, polished work. I also saw some examples where the conclusions were unclear, or where basic mistakes were made like not labeling the axes, not specifying the units, or not spell checking. So it was disappointing to see some projects showing lack of effort, but also inspiring to see some very well done projects which showed great insight, creativity, and innovation.

I was very proud to receive a grade of 100% on both my capstone projects- it was a lot of hard work, but time well spent!

Challenges / R Troubleshooting

Installing packages – be sure you have called install.packages() for any new libraries you are using.
Environment freezing – had to delete and re-install once.
Often had to restart R, clear workspace, terminate session, reload libraries, etc.
In edX – the assessments require exact answers; you can’t be off by any amount or round to fewer than 3 SIGNIFICANT DIGITS (unless otherwise noted).
Be sure you include the set.seed() statement when you run any simulations – it must be executed before you run random draws such as rnorm or sample, or you will get different results each time and they won’t match the expected right answer in edX.
Several questions require multiple check boxes to be checked. Even if some of the check boxes are correct, the question is considered incorrect until all check boxes are correct.
I was getting the error could not find function “%>%” and determined it was because RStudio had restarted the session (a package had forced the reload) and tidyverse was no longer loaded.
Changing the size of the plot window while you are running a script can crash the session. RStudio sometimes freezes when you use Run App on a Shiny app.
You may get messages saying “Warning in install.packages : cannot remove prior installation of package ‘…” – in this case – restart the R session and try again.
Use the Tools > Check for Package Updates tool if you have compatibility issues.
The debugger is pretty cryptic. You can see the call stack in the debug window, but it’s not usually real clear what the offending code is from the name of the error. Sometimes the line numbers are there, but they may be somewhat hidden in the call stack.
Sometimes changing the “theme” can cause unexpected results with the x/y axis. E.g. under theme_bw a plot works, then under no theme or a different theme, it breaks. Seems to be related to the units of the axes… probably an issue with converting variables between continuous vs discrete, integer vs date, etc.
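Two of the points above, seeding simulations and rounding to significant digits, can be sketched in a few lines. The seed value of 42 and the number being rounded are arbitrary choices for this example.

```r
# Setting the seed before a random draw makes the result repeatable
set.seed(42)
first_run <- rnorm(3)

set.seed(42)            # resetting the seed replays the same draws
second_run <- rnorm(3)

identical(first_run, second_run)

# Reporting a result to 3 significant digits
signif(0.123456, 3)
```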

More Projects

After completing these first 3 courses, I started working on some real-world use cases for R and R Studio in my professional environment.

I came up with a couple interesting ideas

For creating fairly static, non-interactive plots -

I set up an approach that uses the odbc package to connect to SQL Server, which gives me the data I need in a data frame I can pass to ggplot2. The plot can then be converted to a Plotly interactive chart (adds hover effects, pan, zoom, etc.), exported via htmlwidgets to a self-contained HTML page, and uploaded back to the web server. The output plot can be served securely via an aspx page or framed approach.

It’s a little clunky but it works. What it lacks is any real interactivity. This is a game killer. Static charts just aren’t that useful, even with flyouts and hover effects. I am not entirely satisfied with this approach.

I have also set up a test case for using R with dotnet forms via System.Diagnostics.Process.Start on Rscript.exe.

https://github.com/elaborative/R-dotnet

I also have been delving into several public repos and government supplied API-based data sets including demographics data from the US Census Bureau, weather data from NOAA, stock market data from quantmod, and mapping features including choropleths and isochrones (drive times) over leaflet maps.

I have also experimented with Shiny, a web application framework for R that comes with its own server environment. You can run your R Shiny app on a server provided for free.

For more examples of my work in Shiny, please check out my web site and visit the Data Science section.

Thanks for reading about my experience with this e-learning program. I felt that I received far more than my money’s worth from this, and it also helped me in intangible ways - like increasing my understanding of the nature of data, polling techniques, and how to spot spurious arguments based on biased statistics.