Important Dates/Information

Due date: Friday, January 19, 2018, 11:59 PM. Note that not all due dates will be consistent for the first few weeks.

Submit an R script, an Rdata file, and a PDF of the answers to the exercises below. Please name all files in this format: LastName_lab#_mmdd.extension. Failure to submit in this format will be penalized by 1 point per file (so you might lose a total of 3 points).

Example: If your name is John Doe and today’s date is January 11, your submission files would be: Doe_lab1_0111.R, Doe_lab1_0111.Rdata, Doe_lab1_0111.pdf

Introduction

Overview

In this lab we will continue learning how to create, manipulate, visualize, and load data. The exercises build in difficulty and complexity from the last activity; however, the overall feel should be similar. As last time, the answers to the lab questions may not be explicitly given in the document, and some searching may be required.

What you should know and have done prior to this

  • Lab 1 (material from class and the assignment)
  • All the material given before lab 1

Learning Goals

  1. Conditional Subsetting
  2. Making Boxplots
  3. Saving Plots

Setup

You should go ahead and execute the code, which is set off in ‘chunks’, for the remainder of this document. Your output might not be identical, but it should be similar to what is shown below each chunk.

Data Importing

Last lab, we imported data from a package. More commonly, I find myself with one or more (usually quite a few) CSV files containing my data. To load a CSV file into your workspace as a data frame, we will do the following:

colleges.df <- read.csv("College.csv",stringsAsFactors = FALSE)

Note: if you get an error here, check your working directory. If the file you are trying to load is not in your working directory, you will need to specify the path in the filename.
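If you are unsure where R is looking, a quick check like the following may help. It is only a sketch: the data subfolder in csv.path is a made-up location, so substitute wherever you actually saved College.csv.

```r
# Where is R currently looking for files?
getwd()

# Hypothetical location; replace with the folder where you saved the file.
csv.path <- file.path("data", "College.csv")

if (file.exists(csv.path)) {
  colleges.df <- read.csv(csv.path, stringsAsFactors = FALSE)
} else {
  # List what R can see, which usually makes the mistake obvious.
  message("Could not find ", csv.path, ". Files in the working directory:")
  print(list.files())
}
```

Building the path with file.path() avoids typing separators by hand, which keeps the same script usable on Windows and macOS.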

Additionally, the argument stringsAsFactors controls whether text-based columns in the CSV are imported as the data type character or the data type factor. From experience, my recommendation is to set this argument to FALSE explicitly (it defaulted to TRUE in versions of R before 4.0.0). Factors can be cumbersome if you don’t need them, so I import as text and explicitly convert to factors when needed.
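As an illustration of converting to a factor only when needed, here is a small sketch. The data frame toy.df and its values are made up so the example runs without any file.

```r
# Toy data frame standing in for an imported CSV (values are invented).
toy.df <- data.frame(school  = c("A", "B", "C"),
                     private = c("Yes", "No", "Yes"),
                     stringsAsFactors = FALSE)

class(toy.df$private)    # imported as text: "character"

# Convert just the column that really is categorical.
toy.df$private <- factor(toy.df$private)

class(toy.df$private)    # now "factor"
levels(toy.df$private)   # "No" "Yes"
```

The school column stays as character, since a name is a label rather than a grouping variable.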

The College dataset contains a number of variables for 777 different universities and colleges in the US. A description of the variables in the dataset is listed below.

VARIABLE : DESCRIPTION
Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : Percent of new students from top 10% of high school class
Top25perc : Percent of new students from top 25% of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate 

This will be the dataset we use today. I am calling it colleges.df, but feel free to call it whatever you’d like.
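After any import it is worth a quick look at what actually arrived. The sketch below writes a tiny made-up CSV to a temporary file so that it runs on its own; with the real data you would call the same three functions on colleges.df.

```r
# Write a tiny invented CSV so the example does not depend on College.csv.
tmp.csv <- tempfile(fileext = ".csv")
writeLines(c("Name,Private,Apps",
             "School One,Yes,1660",
             "School Two,No,2186"),
           tmp.csv)

toy.df <- read.csv(tmp.csv, stringsAsFactors = FALSE)

dim(toy.df)    # number of rows, then number of columns
str(toy.df)    # the data type of every column
head(toy.df)   # the first few rows
```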

Data Types

We previously talked about using R, and it was mostly numerical manipulation. You will frequently find that you have text (categorical) data as well. Sometimes it is explanatory (like someone’s name), and other times it is a description (whether a sample came from lab A or lab B). In the former situation we would use a character data type (also called a string), and in the latter we would use a factor. A character is just a string of text. It can be parsed, but ultimately it functions as text would in any programming language.

A factor, on the other hand, is special to R (see this link for more). Factors function as indicators: if you use a factor, each distinct string is automatically mapped to an integer code in the background (1, 2, 3, ...). Imagine I was trying to compare samples from two populations, labeled Purdue and Notre Dame in each row. If I let that column be a factor, R would replace Purdue and Notre Dame with its own numerical representation, which means it can evaluate things like “the average height of populations subset by school” almost automatically for us.
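To make the Purdue and Notre Dame example concrete, here is a minimal sketch. The heights are invented numbers, and tapply is just one of several ways to compute a per-group average.

```r
school <- c("Purdue", "Notre Dame", "Purdue", "Notre Dame")
height <- c(70, 66, 68, 64)   # made-up heights in inches

school.f <- factor(school)

levels(school.f)       # the distinct labels, sorted alphabetically
as.integer(school.f)   # the integer code stored behind each label

# Average height within each school
avg.height <- tapply(height, school.f, mean)
avg.height
```

Here Notre Dame gets code 1 and Purdue code 2 simply because levels are sorted alphabetically by default.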

Importantly for you, character values can be checked for equality, i.e.

"abc" == c("abc","def","ghi")
## [1]  TRUE FALSE FALSE

(This may be handy for question 5)
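The TRUE/FALSE vector that a comparison returns can be counted or used to pick out matching elements. A small sketch with made-up lab labels and measurements:

```r
labs   <- c("lab A", "lab B", "lab A", "lab C")
values <- c(1.2, 3.4, 5.6, 7.8)   # invented measurements

is.lab.a <- labs == "lab A"
is.lab.a           # TRUE FALSE TRUE FALSE

sum(is.lab.a)      # TRUE counts as 1, so this is the number of matches
values[is.lab.a]   # only the measurements from lab A
```

Note that == only matches the whole string exactly; "lab A" and "Lab A" would not be equal.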

Conditional Subsetting Continued

Last time we did an example where we took f[f$assets>1000, c('name','sales')] to subset. I can also replace data in the same way. In our College example, let’s say I wanted to make a new column which was a 1 if the graduation rate was above 90 percent, and a 0 otherwise. I could do that like so:

colleges.df$bestGraduationRates <- 0
colleges.df$bestGraduationRates[which(colleges.df$Grad.Rate>90)] <-1

In two steps I am doing the following: (1) creating a new column of all 0s called bestGraduationRates, and then (2) replacing the 0s with 1s wherever a particular row has a Grad.Rate>90.
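The same result can be reached in one step with ifelse. This is an alternative sketch, not the approach used above, and it runs on a made-up data frame rather than the real College data.

```r
# Invented graduation rates, including one exactly at the cutoff.
toy.df <- data.frame(Grad.Rate = c(95, 60, 91, 90, 88))

# 1 where the condition holds, 0 otherwise
toy.df$bestGraduationRates <- ifelse(toy.df$Grad.Rate > 90, 1, 0)

toy.df$bestGraduationRates          # 1 0 1 0 0
table(toy.df$bestGraduationRates)   # how many 0s and how many 1s
```

One difference worth knowing: a missing Grad.Rate yields NA with ifelse, while the two-step version leaves that row at 0.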

Saving Plots

Saving a plot takes three steps: first call the pdf (or png) function to open a file, then make the plot, then close the file with dev.off(), which writes it to disk. ?png shows other options for file formats. Note that all of these commands accept many arguments specifying the size of the file and other properties. What follows is a simple example:

png(filename = "bensPlot.png")   # open the file for plotting
plot(1:100 ~ runif(100))         # draw the plot into the file
dev.off()                        # close the file so it is written to disk
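For a PDF the pattern is the same. The sketch below also sets the page size; the file name examplePlot.pdf and the temporary folder are arbitrary choices made only so the example runs anywhere.

```r
# Save into R's temporary folder; in practice use your own path.
out.file <- file.path(tempdir(), "examplePlot.pdf")

pdf(file = out.file, width = 6, height = 4)   # size in inches
plot(1:100, runif(100), main = "Random values")
dev.off()

file.exists(out.file)   # confirm the file was written
```

pdf measures width and height in inches, whereas png uses pixels by default, so the same numbers mean different things in the two functions.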

Boxplots

Similar to the hist function discussed last time, R also has built-in boxplots. The function boxplot takes data in a specific format (numerical~categorical) and plots the distribution of the numerical variable broken down by the categorical one.

boxplot(Expend~Private,data=colleges.df)

As with most things in R, we have the flexibility to change headers and labels to our heart’s content.
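As a sketch of what changing headers and labels looks like, the example below uses the built-in mtcars dataset (not the College data) so that it runs without any file. The title and label text are arbitrary.

```r
# mtcars ships with R: mpg is numerical, cyl has three distinct values.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              main = "Fuel economy by cylinder count",   # title
              xlab = "Number of cylinders",              # x-axis label
              ylab = "Miles per gallon",                 # y-axis label
              col  = "lightblue")                        # box fill color

bp$names   # the groups that were plotted
```

boxplot also returns the summary statistics it drew, which is why the result can be stored in bp and inspected afterwards.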

Wow, a quick one! Note that besides importing the College.csv file, none of the above commands are required for completion of this lab.

Questions

(For each of the plots, please comment on your findings.)

  1. Which of the predictors are quantitative, and which are qualitative? What is the range of each quantitative predictor? (1 point)
  2. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. (1 point)
  3. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. (2 points)
    • How many elite universities are there?
    • Now use the plot() function to produce side-by-side boxplots of Grad.Rate versus Elite, Expend versus Elite, S.F.Ratio versus Elite, and F.Undergrad versus Elite. Use par(mfrow=c(2,2)) in your code. What does this code do? What stories can you get from these 4 boxplots?
  4. Using corrplot() from the “corrplot” package, plot the correlations of all the quantitative predictors and comment on the plot (i.e., which predictors have the highest correlation, and provide a hypothesis as to why). (2 points)
  5. Find all universities that have ‘Texas’ or ‘TX’ in the university name. (4 points)
    • How many of them are private, and how many are public?
    • Create side-by-side boxplots comparing the expenses of students attending private and public universities in Texas. Which type of university is more expensive, and by what percentage on average?