Important Dates/Information

Due date: Friday, January 12, 2018, 11:59 PM. Note that due dates will not all be consistent for the first few weeks.

Submit the R script, the Rdata file, and a PDF of your answers to the exercises below. Name every file in this format: LastName_lab#_mmdd.extension. Each file that does not follow this format will be penalized 1 point (so you might lose a total of 3 points).

Example: If your name is John Doe and today’s date is January 11, your submission files would be: Doe_lab1_0111.R, Doe_lab1_0111.Rdata, Doe_lab1_0111.pdf

Introduction

Overview

This lab is about simple data analysis and manipulation. You will read in data from packages, look at it, and draw conclusions. The file called “Lab 1 Slide” on Blackboard is your friend. So is Google.

What you should know and have done prior to this

  • R (and RStudio) should be installed on your computer
  • If given code, you should be able to execute it in the console and in an R script.
  • You should understand the basics of R (how to add and subtract, how to assign and call variables, and so on)

Learning Goals

  1. Install Packages
  2. Load Packages
  3. Set working directories
  4. Create R-base plots
  5. Subset/Manipulate Data

Setup

Go ahead and execute the code that is set off in ‘chunks’ throughout the remainder of this document. Your output might not be identical, but it should be similar to what is shown below each chunk.

Implicit in data analysis is that we have access to data. In this case, we are going to take it from an R package. For the details and specifics, see here. In short, packages are the standard way to share R code and data, and packages distributed through CRAN must pass automated checks before they are published.

Data Importing

We will begin by installing the package 'HSAUR2', which contains useful data for learning how to use R. Installing a package is simply a matter of knowing what it is called and invoking the following code. We can replace HSAUR2 with whatever the package is called. Note that the quotes are required here (single or double quotes both work).

install.packages('HSAUR2')
## 
## The downloaded binary packages are in
##  /var/folders/yw/j1_hbyxj5x3gy0_v3vc35_w80000gq/T//RtmpyN1GIy/downloaded_packages
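
Re-running install.packages every time a script is sourced downloads the package again. A minimal sketch of one way around that is below. The helper name install_if_missing is made up for this example (it is not part of R or HSAUR2); requireNamespace and install.packages are the real base R functions doing the work.

```r
# Hypothetical helper (the name is an assumption for this sketch):
# install a package only if R cannot already find it.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)
  invisible(TRUE)             # an installation was attempted
}

# "tools" ships with R, so this call installs nothing.
install_if_missing("tools")

# For this lab the call would be: install_if_missing("HSAUR2")
```

The function returns FALSE (invisibly) when the package was already there, which makes it easy to check what happened.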

Once we have it installed, we can load the package with the library command. Here the quotes are optional: library(HSAUR2) works as well.

library('HSAUR2')
## Loading required package: tools
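
To see what loading actually does, here is a small sketch using the tools package, which ships with R (I am using it instead of HSAUR2 only so the example runs on any installation). It shows the quoted and unquoted forms of library, the search path, and the pkg::fun syntax for calling a function without attaching its package.

```r
library("tools")  # quoted name
library(tools)    # unquoted name, same effect

# An attached package shows up on the search path.
"package:tools" %in% search()

# pkg::fun calls a function without needing library() first.
ext <- tools::file_ext("Doe_lab1_0111.Rdata")
ext
```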

This takes the package and loads it into the current R session. That is, we can use the data and methods the package authors have developed. The next step is to use data from this package. As well as packaging code, R has a built-in way to package data alongside it. That’s done using the creatively-named data command. Below we can see that we are taking the Forbes2000 dataset from the package HSAUR2.

data("Forbes2000", package="HSAUR2")

This will load Forbes2000 (data on the Forbes 2000) into our current environment. The environment contains all the variables, functions, and data we can access through the console. To list what is in the environment we can use the ls command.

ls()
## [1] "Forbes2000" "r"

The R session also has a working directory, which can be shown with the command getwd(). Likewise, it can be set with the setwd() command.

getwd()
## [1] "/Users/brachuno/Documents/__college/grad_school/4_spring_2018/EEE595_TA/lab_assignments/lab1"
#setwd() 
# Note I can comment out a line of R code with the # symbol. This line does not execute.
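
Here is a hedged sketch of the usual pattern for changing the working directory and then putting it back. It uses tempdir(), a scratch folder R creates for each session, so it does not depend on any folder on your machine; the file name scratch.txt is made up for the example.

```r
old_wd <- getwd()                    # remember where we started
setwd(tempdir())                     # move to R's per-session scratch folder

writeLines("hello", "scratch.txt")   # relative path: lands in the working directory
file.exists("scratch.txt")           # TRUE while we are in that folder
file.remove("scratch.txt")           # clean up

setwd(old_wd)                        # go back to the original folder
```

In your own script you would replace tempdir() with the path to your lab folder.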

The path it outputs is where ‘R is running’. Relative file paths are resolved against this folder for the rest of the R session, unless you change it with setwd().

Viewing Data

It may be helpful to get a sense of what my data looks like once I have it. There are a variety of ways to do that. We can start with the elementary print function. Note here we have stopped using quotes. This is because Forbes2000 is a variable not a string.

print(Forbes2000)
# note I have suppressed the output because it is *enormous*

This is great for seeing the data, but is obviously a little too much to take in. If we want to get a sense of the structure of our data, the str command is useful. It shows the class of the object, the number of observations (rows) and variables (columns), and the data type and first few values of each column.

str(Forbes2000)
## 'data.frame':    2000 obs. of  8 variables:
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name       : chr  "Citigroup" "General Electric" "American Intl Group" "ExxonMobil" ...
##  $ country    : Factor w/ 61 levels "Africa","Australia",..: 60 60 60 60 56 60 56 28 60 60 ...
##  $ category   : Factor w/ 27 levels "Aerospace & defense",..: 2 6 16 19 19 2 2 8 9 20 ...
##  $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
##  $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
##  $ assets     : num  1264 627 648 167 178 ...
##  $ marketvalue: num  255 329 195 277 174 ...

Some packages (and all good ones) come with documentation on what the data is, and what it means. This can be accessed with the help command. Of every tool in R, this is certainly the one I use the most. Use it liberally and frequently to solve your problems.

help("Forbes2000")
# You can also use a preceding question mark: ?Forbes2000

Now that you understand the help command (or ? interface), I’m going to breeze over the obviously-named functions and you can read the help on them to your heart’s content. Most of these are also quite useful for debugging.

class(Forbes2000)

dim(Forbes2000)

nrow(Forbes2000)

names(Forbes2000)

class(Forbes2000[,'rank'])

length(Forbes2000[,'rank'])

class(Forbes2000[,'category'])

nlevels(Forbes2000[,'category'])

table(Forbes2000[,'category'])
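
If you want to check your reading of the help pages, it can be easier to try these functions on something tiny first. The sketch below uses a toy data frame with made-up values (it is not taken from Forbes2000); the comments give what each call returns for this toy data.

```r
# Toy data with made-up values, only for illustration.
toy <- data.frame(
  name     = c("A Corp", "B Ltd", "C Inc", "D plc"),
  category = factor(c("Banking", "Retailing", "Banking", "Media")),
  sales    = c(10.5, 3.2, 7.8, 1.1),
  stringsAsFactors = FALSE
)

class(toy)                  # "data.frame"
dim(toy)                    # 4 3  (rows, columns)
nrow(toy)                   # 4
names(toy)                  # "name" "category" "sales"
class(toy[, "category"])    # "factor"
nlevels(toy[, "category"])  # 3 distinct categories
table(toy[, "category"])    # counts per category: Banking 2, Media 1, Retailing 1
```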

Manipulating Data

The next step is to start to play with what we have so we can start to dig in.

First we will assign Forbes2000 to the variable f. In R both <- and = are valid assignment operators; however, my opinion is that <- should be used for assignment whenever possible, so that = is reserved for passing arguments in function calls. (Note that the test for equality is ==, not =.) We now have a variable f we can manipulate instead of Forbes2000.

f<-Forbes2000
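
A short sketch of the three look-alike operators, using throwaway variable names (a, b, m) that are not part of the lab:

```r
a <- 5                      # assignment with <-
b = 5                       # assignment with =, also valid at the top level
a == b                      # comparison with ==, gives TRUE

# Inside a function call, = names an argument instead of assigning.
m <- mean(x = c(1, 2, 3))
m                           # 2
```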

I can easily pull a specific column out of the data with the $ operator. The head command shows only the first 6 rows of its argument. f$name gives us only the values from the name column, so head(f$name) outputs the first 6 company names.

head(f)
##   rank                name        country             category  sales
## 1    1           Citigroup  United States              Banking  94.71
## 2    2    General Electric  United States        Conglomerates 134.19
## 3    3 American Intl Group  United States            Insurance  76.66
## 4    4          ExxonMobil  United States Oil & gas operations 222.88
## 5    5                  BP United Kingdom Oil & gas operations 232.57
## 6    6     Bank of America  United States              Banking  49.01
##   profits  assets marketvalue
## 1   17.85 1264.03      255.30
## 2   15.59  626.93      328.54
## 3    6.46  647.66      194.87
## 4   20.96  166.99      277.02
## 5   10.27  177.57      173.54
## 6   10.81  736.45      117.55
head(f$name)
## [1] "Citigroup"           "General Electric"    "American Intl Group"
## [4] "ExxonMobil"          "BP"                  "Bank of America"
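
The $ operator is one of several equivalent ways to pull out a column. The sketch below compares them on a toy data frame with made-up values (not Forbes2000 data).

```r
# Toy data with made-up values, only for illustration.
toy <- data.frame(
  name  = c("A Corp", "B Ltd", "C Inc"),
  sales = c(10.5, 3.2, 7.8),
  stringsAsFactors = FALSE
)

by_dollar  <- toy$name        # column as a vector
by_bracket <- toy[, "name"]   # same vector, using [row, column] indexing
by_double  <- toy[["name"]]   # same vector, using list-style indexing

identical(by_dollar, by_bracket)  # TRUE
identical(by_dollar, by_double)   # TRUE

one_col_df <- toy["name"]     # single brackets, no comma: still a data frame
class(one_col_df)             # "data.frame"

head(toy, n = 2)              # head takes n to change how many rows are shown
```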

Using the syntax foo[], where foo is a variable name, we can start to access specific parts of data. Let’s learn by example.

f[f$assets>1000, c('name','sales')]
##                 name sales
## 1          Citigroup 94.71
## 9         Fannie Mae 53.13
## 403 Mizuho Financial 24.40

f in this case is a two-dimensional (rows and columns) table. By using the [ ] we are telling R we want parts of it. What comes before the , is the condition we wish to apply to the rows and what comes after is what we wish to apply to the columns. In this case, we want to look at f, but only the rows where the value in the assets column of f is greater than 1000. What comes after the comma tells R that we want only the name and sales columns of the data frame. All together, we get the name and sales of every company that has assets greater than 1000.
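
One thing to watch for with this kind of row condition is missing values. The sketch below uses a toy data frame with made-up values (one assets value is deliberately NA) to show what happens and how which() avoids it.

```r
# Toy data with made-up values, only for illustration.
toy <- data.frame(
  name   = c("A Corp", "B Ltd", "C Inc", "D plc"),
  sales  = c(10.5, 3.2, 7.8, 1.1),
  assets = c(1200, 50, NA, 1500),
  stringsAsFactors = FALSE
)

# The condition is TRUE, FALSE, NA, TRUE. The NA produces a row of NAs.
big <- toy[toy$assets > 1000, c("name", "sales")]
nrow(big)        # 3, including one all-NA row

# which() keeps only the positions that are TRUE, dropping the NA.
big_clean <- toy[which(toy$assets > 1000), c("name", "sales")]
nrow(big_clean)  # 2

# Conditions can be combined with & (and) or | (or).
both <- toy[which(toy$assets > 1000 & toy$sales > 5), ]
both$name        # "A Corp"
```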

I personally like the table function for tabulating things. It can be very handy in specific cases.

table(f$assets>1000)
## 
## FALSE  TRUE 
##  1997     3

Another handy command everyone should be aware of is complete.cases. It tells us which rows of the data frame have values in every column. Here we can see that we have 5 rows without all the columns filled in.

table(complete.cases(Forbes2000))
## 
## FALSE  TRUE 
##     5  1995
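
Once you know some rows are incomplete, the natural next questions are which rows and which columns. A small sketch on toy data (made-up values, not Forbes2000):

```r
# Toy data with made-up values, only for illustration.
toy <- data.frame(
  name    = c("A Corp", "B Ltd", "C Inc"),
  profits = c(1.2, NA, 0.4),
  stringsAsFactors = FALSE
)

ok <- complete.cases(toy)   # TRUE FALSE TRUE
toy[!ok, ]                  # the incomplete rows
toy_complete <- toy[ok, ]   # only the complete rows
colSums(is.na(toy))         # number of NAs in each column
```

The same idea applied to f would show which companies are missing a value.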

subset is also useful for extracting data. In this case, we are making a new variable UK which contains all the rows of f where the country is the United Kingdom. (Again, I’m not including the specifics of most commands and their syntax. help(subset) will tell you more than I ever could and with greater clarity.)

UK<-subset(f,country=="United Kingdom")

Again learning by example, let’s take the rows of UK and sort them by their profits. To see how this works at a fundamental level, try executing just order(UK$profits, decreasing = TRUE) and see what you get.

UK<- UK[order(UK$profits, decreasing = TRUE),]
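
The key point is that order does not return sorted values; it returns the positions that would sort them. A tiny sketch with a made-up vector:

```r
profits <- c(3.1, 9.4, 0.2, 5.0)   # made-up values

idx <- order(profits, decreasing = TRUE)
idx             # 2 4 1 3: position of the largest value, then the next, ...
profits[idx]    # 9.4 5.0 3.1 0.2

# Using idx as a row index is what reorders a whole data frame.
```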

In some instances, there may be a function we wish to apply in multiple places. This could be across rows, across columns, or across items in a list. The apply family of functions does just that. lapply, sapply, and vapply apply a function to each element of a list or vector, while tapply applies a function to groups of values. The particular syntax of each is something I still find difficult and usually have to look up. The next line of code (lapply) takes the function summary and applies it to the object f. lapply (pronounced L-apply I believe) applies the function to each element of the object. For data frames, the elements are the columns. Thus, we will see the summary function applied to each column.

lapply(f, summary)
## $rank
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   500.8  1000.0  1000.0  1500.0  2000.0 
## 
## $name
##    Length     Class      Mode 
##      2000 character character 
## 
## $country
##                       Africa                    Australia 
##                            2                           37 
##    Australia/ United Kingdom                      Austria 
##                            2                            8 
##                      Bahamas                      Belgium 
##                            1                            9 
##                      Bermuda                       Brazil 
##                           20                           15 
##                       Canada               Cayman Islands 
##                           56                            5 
##                        Chile                        China 
##                            4                           25 
##               Czech Republic                      Denmark 
##                            2                           10 
##                      Finland                       France 
##                           11                           63 
##       France/ United Kingdom                      Germany 
##                            1                           65 
##                       Greece              Hong Kong/China 
##                           12                           20 
##                      Hungary                        India 
##                            2                           27 
##                    Indonesia                      Ireland 
##                            7                            8 
##                      Islands                       Israel 
##                            1                            8 
##                        Italy                        Japan 
##                           41                          316 
##                       Jordan                   Kong/China 
##                            1                            4 
##                        Korea                      Liberia 
##                            4                            1 
##                   Luxembourg                     Malaysia 
##                            2                           16 
##                       Mexico                  Netherlands 
##                           17                           28 
##  Netherlands/ United Kingdom                  New Zealand 
##                            2                            1 
##                       Norway                     Pakistan 
##                            8                            1 
##       Panama/ United Kingdom                         Peru 
##                            1                            1 
##                  Philippines                       Poland 
##                            2                            1 
##                     Portugal                       Russia 
##                            7                           12 
##                    Singapore                 South Africa 
##                           16                           15 
##                  South Korea                        Spain 
##                           45                           29 
##                       Sweden                  Switzerland 
##                           26                           34 
##                       Taiwan                     Thailand 
##                           35                            9 
##                       Turkey               United Kingdom 
##                           12                          137 
##    United Kingdom/ Australia  United Kingdom/ Netherlands 
##                            1                            1 
## United Kingdom/ South Africa                United States 
##                            1                          751 
##                    Venezuela 
##                            1 
## 
## $category
##              Aerospace & defense                          Banking 
##                               19                              313 
##     Business services & supplies                    Capital goods 
##                               70                               53 
##                        Chemicals                    Conglomerates 
##                               50                               31 
##                     Construction                Consumer durables 
##                               79                               74 
##           Diversified financials            Drugs & biotechnology 
##                              158                               45 
##             Food drink & tobacco                     Food markets 
##                               83                               33 
## Health care equipment & services     Hotels restaurants & leisure 
##                               65                               37 
##    Household & personal products                        Insurance 
##                               44                              112 
##                        Materials                            Media 
##                               97                               61 
##             Oil & gas operations                        Retailing 
##                               90                               88 
##                   Semiconductors              Software & services 
##                               26                               31 
##  Technology hardware & equipment      Telecommunications services 
##                               59                               67 
##                Trading companies                   Transportation 
##                               25                               80 
##                        Utilities 
##                              110 
## 
## $sales
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   2.018   4.365   9.697   9.548 256.300 
## 
## $profits
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -25.8300   0.0800   0.2000   0.3811   0.4400  20.9600        5 
## 
## $assets
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.270    4.025    9.345   34.040   22.790 1264.000 
## 
## $marketvalue
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.02    2.72    5.15   11.88   10.60  328.50
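
To make the differences within the apply family concrete, here is a sketch on a toy data frame of numeric columns (made-up values). lapply always returns a list; vapply returns a vector and checks that each result has the type you declare; apply with MARGIN = 1 works across rows instead of columns.

```r
# Toy data with made-up values, only for illustration.
toy <- data.frame(
  sales   = c(10.5, 3.2, 7.8),
  profits = c(1.2, NA, 0.4),
  assets  = c(1200, 50, 300)
)

# One mean per column, returned as a list.
col_means_list <- lapply(toy, mean, na.rm = TRUE)

# Same computation, returned as a named numeric vector.
col_means_vec <- vapply(toy, mean, numeric(1), na.rm = TRUE)
col_means_vec

# apply with MARGIN = 1 runs the function once per row.
row_sums <- apply(toy, 1, sum, na.rm = TRUE)
row_sums
```

Extra arguments such as na.rm = TRUE are passed along to the function being applied.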

tapply applies a function to groups of values, where the groups are defined by a second variable (usually a factor). In the next example, we are calculating the median profit broken down by category.

mprofits<-tapply(f$profits, f$category, median, na.rm=TRUE)
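
The same call on a handful of made-up values makes it easier to see what comes back: one result per group, named by the group.

```r
# Made-up values, only for illustration.
profits  <- c(1.0, 3.0, NA, 2.0, 8.0)
category <- factor(c("Banking", "Banking", "Media", "Media", "Banking"))

med <- tapply(profits, category, median, na.rm = TRUE)
med                          # Banking 3, Media 2
sort(med, decreasing = TRUE) # groups ranked by median profit
```

Without na.rm = TRUE, any group containing an NA would have a median of NA.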

Plots

If “best uses of R” was a Family Feud topic, plotting would certainly be top of the board. R plots are very easy to make and with the use of a few other packages R can produce publication-quality plots in one line of code. The plotting functionality (or any functionality for that matter) inherent in R with no packages is called base functionality. We will now cover how to make base plots.

hist(f$marketvalue)

hist takes a single numeric column of data (note we are pulling the market value column from f) and creates a histogram with default values for labels, bin widths, etc. Market values are strongly right-skewed (the mean is more than twice the median in the summary above), so to get a better sense of the data we can take the log of the market value and plot that as well.

hist(log(f$marketvalue))
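
The defaults can be overridden with arguments described in help(hist). The sketch below uses simulated right-skewed values (generated with rlnorm, not taken from Forbes2000) so that it runs without the package; the labels and the number of breaks are arbitrary choices for the example.

```r
set.seed(1)                                   # make the simulation repeatable
value <- rlnorm(500, meanlog = 1, sdlog = 1)  # simulated, right-skewed values

# Same kind of plot as above, with a title, axis label, and more bins.
hist(log(value),
     breaks = 20,
     main   = "Histogram of log(value)",
     xlab   = "log(value)",
     col    = "grey80")

# hist also returns the bin counts; plot = FALSE skips the drawing.
h <- hist(log(value), breaks = 20, plot = FALSE)
sum(h$counts)                                 # 500, every value falls in a bin
```

Note that breaks is treated as a suggestion, so the number of bins drawn may not be exactly 20.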

For all other types of plots (in this case we will talk about scatter plots), the plot function invokes the R base plot functionality. The ~ is used to denote one variable being a function of another (market value as a function of sales in the next example), and if you use help(plot) you can see how to change many other characteristics. I’ve included an example of a plot which can be made rather easily in R to illustrate the power.