TL;DR

In this class we will use R to build models. R is a powerful, flexible programming language for these types of problems. R is easy to install, and I highly recommend installing RStudio as well. Debugging is a difficult art and will be the hardest part of this course. There are some links in the “Learning R” section you should visit. Likewise, read the “In Summary” section to know what to do next.

Introduction to the course

Hello and welcome to EEE/IE 595. This course aims to provide an introduction to predictive risk analytics. We will leverage computational statistics and data mining to build predictive models from large data sets and, more importantly, draw interpretations and insights from those models to understand or mitigate risk in complex systems. In other words, this course will teach you how to use R to build and interpret beginner-to-intermediate statistical learning models. For more information on the specific learning outcomes, please see the syllabus on Blackboard.

This course has two concurrent components: lecture and lab. The lectures are designed to provide a rigorous, theoretical understanding of the current state of statistical learning. The labs will show you how to apply these techniques to real data. This document (like all those on the website) is part of the material that accompanies the lab portion of the course. Accordingly, most (if not all) of the lab work will take place in R.

This is a programming-heavy course. You are not expected to be a brilliant coder already, but you must be ready to code/debug/struggle in order to succeed. That being said, let’s jump in!!!

Introduction to R

A chemist has beakers; a machinist has lathes. As data analysts, we have R. As a programming language, R has been around since 1993 (probably older than most of you!?). The Wikipedia page on R provides a surprisingly good look at the history of the language and a high-level overview of its features and development. At its core, R is an interpreted language (it is not compiled ahead of time like Java or C) oriented toward statistics and graphics. More importantly, R packages are convenient bundles of code that are easy to share. This allows statisticians and computer scientists to develop cutting-edge techniques and share them with the world. This is the most useful feature of R for our course: we will use R packages to apply the cutting edge of statistical learning theory in a simple, intuitive way.
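As a minimal sketch of what using a package looks like in practice, the snippet below installs a package only if it is missing and then loads it. The MASS package, its Boston data set, and the CRAN mirror URL are illustrative assumptions here, not course requirements; substitute whichever package a given lab calls for.

```r
# Name of the package to use; "MASS" is only an illustrative choice
pkg <- "MASS"

# Install the package from CRAN only if it is not already available
# (the mirror URL is an assumption; any CRAN mirror should work)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Load the package so its functions and data sets can be used by name
library(pkg, character.only = TRUE)

# Peek at the first rows of a data set that ships with the package
boston_preview <- head(Boston)
print(boston_preview)
```

Installing is a one-time step per machine, while `library()` has to be called in every new R session that uses the package.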

R is highly flexible and easily customizable (in fact this webpage was created in R!!). We will talk more later about how to use R, but below are some examples of what we can do.

x <- 2+3
print(x)
## [1] 5
output_string <- "hello world"
print(output_string)
## [1] "hello world"
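Since this course is about building models, here is a slightly more representative sketch: fitting a simple linear regression. It assumes nothing beyond base R and uses the built-in `mtcars` data set; the choice of variables and the hypothetical 3,000 lb car are only for illustration and are not taken from the course material.

```r
# Fit a linear model: miles per gallon as a function of car weight
# (mtcars ships with R; wt is measured in units of 1000 lbs)
fit <- lm(mpg ~ wt, data = mtcars)

# Extract and print the estimated intercept and slope
coefs <- coef(fit)
print(coefs)

# Predict mpg for a hypothetical car weighing 3000 lbs
new_car <- data.frame(wt = 3)
pred <- predict(fit, newdata = new_car)
print(pred)
```

Calling `summary(fit)` afterwards shows standard errors and fit statistics, which is usually the first step when interpreting a model.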

To begin, I will provide a cursory introduction to R’s setup as well as some helpful hints and tools I have found through my experience with it.

Learning R

The list of things you can do in R is too extensive for even a large volume of books. So instead, your next task is to familiarize yourself with the basics, and I highly recommend working through one of the following resources to do so. It will be assumed that you have this level of comfort prior to class, so please give this task the appropriate effort. Give yourself a few hours a day for a few days to learn these topics before coming back to this document.

  1. Hopkins Intro to R Coursera
  2. R tutor’s intro page
  3. Native R documentation (note that this is quite exhaustive)
  4. The tryR school

As an aside, I took an older version of the Hopkins R course and found it helpful at a high level, and I read the R documentation frequently. I have no experience with or comments on the other links, but you may find them helpful.
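The R documentation is also available from inside R itself, which is often the quickest way to look something up. The following is a small sketch using `mean` as an arbitrary example function; the same pattern should apply to any function you meet in the labs.

```r
# Open the help page for a function (these two forms are equivalent)
?mean
help("mean")

# List objects on the search path whose names contain a pattern,
# useful when you only half-remember a function name
matches <- apropos("mean")
print(matches)

# Run the worked examples from the bottom of a help page
example(mean)
```

In RStudio the help page opens in the Help pane; in a plain console it is printed as text or opened in a browser, depending on your setup.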

IDEs for R and RStudio

While acquainting yourself with the basics of R in the console (i.e., type command -> see result), you may have found yourself wanting to remember or somehow record your commands. You can easily do so in what is called an R script (from now on referred to simply as a script). A script is a saved set of instructions, and it is what takes R from a glorified calculator to an actual tool for data analysis.

You can write a script file any way you want (e.g., Notepad, a text editor, Vim, or even Microsoft Word if you are self-flagellating enough). By running the script, you are telling R that everything you have written is R code and that it should execute it in order, from top to bottom. R scripts are saved as .R files.

Because scripts can be written in any text editor, there are naturally better and worse ways to write them. My personal favorite (and an extremely common tool in programming) is an Integrated Development Environment (IDE for short). If you’ve written Java in Eclipse, that’s an IDE. NetBeans for C or Java is also an IDE, as is PyCharm for Python. In essence, an IDE is a text editor specifically designed for use with a given programming language. It will have bespoke features and customizations that aid your development. It’s tough to capture the specific benefits in a list, but think of an IDE as a tool built specifically to make your job easier.
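To make the idea of running a script concrete, here is a minimal sketch that writes a two-line script to a temporary file and then runs it with `source()`. The file name `my_first_script.R` is hypothetical; in practice you would write the script in an editor and save it wherever you keep your course files.

```r
# Build a path for a small script inside R's temporary directory
# (the file name is a hypothetical example)
script_path <- file.path(tempdir(), "my_first_script.R")

# Write two lines of R code into the script file
writeLines(c("x <- 2 + 3", "print(x)"), script_path)

# Run the script: R executes its lines in order, top to bottom
source(script_path)

# Objects created by the script are now available in the workspace
print(x * 2)
```

The same file can also be run from a terminal with `Rscript`, followed by the path to the script, without opening an interactive R session.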

With that, I highly recommend using RStudio as your platform for the course. RStudio is free, stable, cross-platform, and relatively simple. You can find the download here. (Note: get the open-source edition, not the commercial one, and the desktop version, not the server version.) Installation should be fast and simple, and once you have it running you can begin running scripts immediately.

RStudio Examples

To start, open RStudio and you should see something like the following: