Rogosa R-diary, Stat141

These notes serve to annotate the computing content of the class handouts; class handouts are available from the Course Examples and Files

Starting Out
The link on the main course page to the R-project provides the info needed to download and install (windows, MacOS, etc).
See also Manuals link and contributed documentation and Frequently Asked Questions on R
Additional Resources.
       Documentation for package `stats' version 2.3.1          Documentation for package `base' version 2.3.1          and an extensive Package Index
The Verzani book is a wonderful companion to the SW text; the on-line version linked on the main course page is a preliminary rough draft of the book.
In addition, supplemental materials for Verzani at http://wiener.math.csi.cuny.edu/UsingR/Data
The widespread use of R in statistics departments has led to myriad resources on-line (in addition to CRAN). Google searches are really the best recourse for any specific R-question. A couple of introductory guides worthy of note are from Berkeley and from UCLA  and from Vanderbilt.
Especially useful after you get started are A short list of the most useful R commands and the beautifully formatted R Reference Card.  

Besides the online and hard-copy versions of the Verzani text, another R-based text of special usefulness for analysis of variance and regression is a on-line version of a text by J Faraway, Practical Regression and Anova in R

Course Materials and Notes

Stat141 is an introductory course on statistical methods, not a comprehensive course on the capabilities of R. The point being that the approach here is to use R to implement statistical methods in the most direct way, incrementally, with minimal complexity and consequently minimized elegance.
The notes and comments below should be referred to the full copies of the indicated class handouts (hardcopies or from the course files and examples page)

Class 9/26.

Brain weight (data file on website as "brain.dat")

Data sets can have many forms and be imput in many ways. For simplicity/consistency I chose one format to use which is seen in brain.dat (first 10 and final 10 observations below). The top row contains the variable names (labels), each row following has 3 columns: the row number, the value of first variable, brainwt, and the second variable (m/f) indicator for sex.
brainwt	sex
001    1607	m
002    1157	m
003    1248	m
004    1310	m
005    1398	m
006    1237	m
007    1232	m
008    1343	m
009    1380	m
010    1274	m
...........................
176    1190	f
177    1153	f
178    1037	f
179    1120	f
180    1212	f
181    1024	f
182    1135	f
183    1177	f
184    1096	f
185    1114	f
First step is to bring the data file into R; one way to do that is the "read.table" command as in
 brain = read.table(file="D:\\stat141\\brain.dat", header = T)  
Above we tell read.table where the file is located and whether labels are being used. The result is a data set now called brain that R can use.
In the handout, the first command is
  summary(brain) 
which gives 5-number summary (plus mean) for the quantitative variable "barianwt", and a tally for the qualitative (m/f) "sex"
Note that "summary" will operate on the dataset.
The command
  stem(brain$brainwt) 
produces a character stem-and-leaf diagram for the variable with label brainwt. To access that variable without attaching the dataset via "attach(brain)" the syntax is dataset$variable. If the dataset is attached then only the variable name is required.
Conditioning on a classification variable is one of the most common and useful activities in data analysis. Here, compare male and female brains. The commands
 tapply(brain$brainwt, brain$sex, summary)

 tapply(brain$brainwt, brain$sex, stem)   
produce summary and stem-and-leaf diagrams for males and females seperately.
The syntax for tapply(variable to be summarized, classification variable to be conditioned on, type of summary or data display).

MAO

The file MAO.dat on the course website has 3 columns of data (two versions of diagnosis). Read into R by
  maodat = read.table(file="D:\\stat141\\MAOactiv.rd", header = T)  
Since I did the command
 attach(maodat)    
I could/should have referred to the variables such as MAO.activity in
 stem(maodat$MAO.activity)   
directly without the maodat$ prefix .
In addition to the stem-and-leaf display, I also showed boxplots in this example both for MAO.activity for all cases and also conditioning on group (either form)
  boxplot(maodat$MAO.activity)
  tapply(maodat$MAO.activity, maodat$group, boxplot, na.rm=T)
 boxplot(maodat$MAO.activity ~ maodat$group, na.rm=T)
One note on graphics (e.g. boxplot). Graphics in R, at least in the Windows GUI appear in a seperate window. Be aware that graphics will be replaced by the next plot unless you turn on "recording" in the history menu when graphic is in focus. You can save any graphic (pdf or emf are good) using file "save as". Turning on recording allows scrolling back through the series of plots.
One additional feature/complexity of the maodat data set is the presence of missing values, indicated by "NA" in the data set. The option "na.rm=T" in the various commands tells R to ignore the missing values.

A note on saving your R sessions. You can save all the ascii content of your console (input and character output) using the "save to file" option at the bottom of the file menu. You can also save just your input commands by the "save history" option. Having these as text files is useful. Alternatively, when your R console is active you can copy and paste into another application.

Class 9/28.

Cricket

In crickets.dat one variable, singtime. A log transformation of singtime (transform to symmetry) is done by the command
  logcrick = log(cricket$singtime)  

Class 10/3.

Handouts

Horseshoe Crab

The file horseshoecrab.dat has 5 variables, some categorical, some counted, some measured. To get a tally of the number of the females at each satellite count the command
  table(crab$satell)  
To obtain the relative frequencies, divide by the total number of females, obtained from "length" command. Both numerical and graphic display from the commands
table(crab$satell)/length(crab$satell)
barplot(table(crab$satell)/length(crab$satell))  

Verzani, Albatross data

New commands "install.packages" which when working will search internet and sucessfully download onto your machine the library containing data files etc. Once the package is installed on your machine, the library and data commands
library(UsingR)
data(bycatch)  
makes the data set "bycatch" active in the R session.

Extra Calcs from lecture

1. Dichotomizing cricket data
command sort orders the data
command singtime > 10   < transforms values greater than 10 into TRUE and values less than or equal to 10 into FALSE
command sum(singtime > 10)   < produces the number of TRUE
command mean(singtime > 10)   < produces the proportion TRUE
2. Categorical data, evolution example
command rep repeats the argument (character or numeric) the indicated number of times.


Class 10/5.

Random Sampling

The "sample" command takes a number of arguments; in
 sample(k, size = 5, p = rep(1:1, 100)/100)   
the population is represented by the vector k (here the integers between 0 and 99 via k=0:99), the size of the desired sample, and the vector of probabilities p, here set to make all values of k equally likely via the "rep" command. The default setting in "sample" is without replacement, replace = FALSE. To sample with replacement insert "replace = TRUE" in the sample command.


Class 10/10.

Binomial, Bernoulli Trials

The binomial distribution function Pr{Y=k} for k=0,1,2, for 2 trials, each with Pr{success} = 1/4 is obtained from the command
 dbinom(0:2, s=2, p=1/4)   
The binomial cumulative distribution function Pr{Y less than or equal to k} for k=1 for 2 trials, each with Pr{success} = 1/4 is obtained from the command
 pbinom(1, s=2, p=1/4)   

Poisson Distribution

The command
  dpois(2, .610)  
gives the Pr{Y=2} for a poisson distributed quantity with mean number of events .610.
The similar but fancier command
  rbind(k, signif(dpois(k, .610)*200,4))
lists values of the Poisson fit for 200 Prussian calveries 200*Pr{Y=k} for each k=0:6 where rbind lines up (k, dpois) rows and signif(xxx,4) gives output to 4 significant digits.

More of the same, the command
 rbind(k, signif(dpois(k, 2),3), signif(ppois(k, 2),3))   
gives values for the Poisson distribution and cumulative distribution function and each k to 3 significant digits.

Class 10/12.

Continuous Distributions

The command
 unifdat = runif(1000,1,5)   
produces a vector of 1000 values drawn from a rectangular distribution with endpoints (1,5).
The command
 pnorm( 60, 54,4.5) -  pnorm(58, 54,4.5)   
finds the area between 58 and 60 under the density curve for N(54,4.5) variate .
Note the listing of distributions and functions in the "keepsake" section.

percentiles vs quantiles, pnorm vs qnorm
plotting continuous density functions, the command
  curve(dnorm(x, 10, 2), 0,20)  
plots the normal density for N(10,2) for x between 0,20 (5 sd from mean gets the full plot)
  curve(dexp(x, 1/10), 0,25)  
plots the density for an exponential distribution with mean 10, sd 10 between 0,25.

Class 10/17.

The command
 qqnorm(percmoist);qqline(percmoist)     
produces a Normal Probability Plot for the percentage moisture observation . The superimposed qqline goes through quartiles.

How do sample means behave? Demonstrations of CLT

Simulation loop using for(i   ) loops. The code below draws a sample of size 10 for a Uniform[0,1] distribution calculates the mean, stores it in "result" and repeats that 100 times.
 result = c()        #Verzani p.167
> for(i in 1:100) {
+ result[i] = mean(runif(10)) } 
<< q-q plots referenced to the normal distribution are created by the command: qqnorm(result)
To plot a density such as the exponential distribution with mean and sd 1/4, use the R code: curve(dexp(x,1/4), 0, 10)


Class 10/19.

Inference for a single mean

To obtain quantiles for the t-distribution in R, qt(1 - errvec/2,5) for a t-distribution with 5 df and for percentile points defined by civec =c(.9, .95, .99); errvec = 1 - civec #alpha = 1 - confidence (prob of coverage)
To obtain tests and confidence intervals using "t-tests" the command is simply t.test(croach) or t.test(croach, c= .9) for data croach and Type I error rates .05 or .10 respectively.

There are many switches and option for data displays like stem and leaf, one that is often useful is to change the scale as seen in the 10/19 handout:
    stem(diff, s=2)               # change scale of stem-and leaf