Rogosa R-diary, Stat141
These notes serve to annotate the computing content of the class handouts; class handouts are available from the Course Examples and Files
Starting Out
The link on the main course page to the R-project provides the info needed to download and install (windows, MacOS, etc).
See also Manuals link and contributed documentation and Frequently Asked Questions on R
Additional Resources.
Documentation for package `stats' version 2.3.1 Documentation for package `base' version 2.3.1 and an extensive Package Index
The Verzani book is a wonderful companion to the SW text; the on-line version linked on the main course page is a preliminary rough draft of the book.
In addition, supplemental materials for Verzani at http://wiener.math.csi.cuny.edu/UsingR/Data
The widespread use of R in statistics departments has led to myriad resources on-line (in addition to CRAN). Google searches are really the best recourse for any specific R-question. A couple of introductory guides worthy of note are from Berkeley and from UCLA and from Vanderbilt.
Especially useful after you get started are A short list of the most useful R commands and the beautifully formatted R Reference Card.
Besides the online and hard-copy versions of the Verzani text, another R-based text of special usefulness for analysis of variance and regression is a on-line version of a text by J Faraway, Practical Regression and Anova in R
Course Materials and Notes
Stat141 is an introductory course on statistical methods, not a comprehensive course on the capabilities of R. The point being that the approach here is to use R to implement statistical methods in the most direct way, incrementally, with minimal complexity and consequently minimized elegance.
The notes and comments below should be referred to the full copies of the indicated class handouts (hardcopies or from the course files and examples page)
Class 9/26.
Brain weight (data file on website as "brain.dat")
Data sets can have many forms and be imput in many ways. For simplicity/consistency I chose one format to use which is seen in brain.dat (first 10 and final 10 observations below). The top row contains the variable names (labels), each row following has 3 columns: the row number, the value of first variable, brainwt, and the second variable (m/f) indicator for sex.
brainwt sex
001 1607 m
002 1157 m
003 1248 m
004 1310 m
005 1398 m
006 1237 m
007 1232 m
008 1343 m
009 1380 m
010 1274 m
...........................
176 1190 f
177 1153 f
178 1037 f
179 1120 f
180 1212 f
181 1024 f
182 1135 f
183 1177 f
184 1096 f
185 1114 f
First step is to bring the data file into R; one way to do that is the "read.table" command as in
brain = read.table(file="D:\\stat141\\brain.dat", header = T)
Above we tell read.table where the file is located and whether labels are being used. The result is a data set now called brain that R can use.
In the handout, the first command is
summary(brain)
which gives 5-number summary (plus mean) for the quantitative variable "barianwt", and a tally for the qualitative (m/f) "sex"
Note that "summary" will operate on the dataset.
The command
stem(brain$brainwt)
produces a character stem-and-leaf diagram for the variable with label brainwt. To access that variable without attaching
the dataset via "attach(brain)" the syntax is dataset$variable. If the dataset is attached then only the variable name is required.
Conditioning on a classification variable is one of the most common and useful activities in data analysis. Here, compare male and
female brains. The commands tapply(brain$brainwt, brain$sex, summary)
tapply(brain$brainwt, brain$sex, stem)
produce summary and stem-and-leaf diagrams for males and females seperately.
The syntax for tapply(variable to be summarized, classification variable to be conditioned on, type of summary or data display).
MAO
The file MAO.dat on the course website has 3 columns of data (two versions of diagnosis). Read into R by
maodat = read.table(file="D:\\stat141\\MAOactiv.rd", header = T)
Since I did the command attach(maodat)
I could/should have referred to the variables such as MAO.activity in
stem(maodat$MAO.activity)
directly without the maodat$ prefix .
In addition to the stem-and-leaf display, I also showed boxplots in this example both for
MAO.activity for all cases and also conditioning on group (either form)
boxplot(maodat$MAO.activity)
tapply(maodat$MAO.activity, maodat$group, boxplot, na.rm=T)
boxplot(maodat$MAO.activity ~ maodat$group, na.rm=T)
One note on graphics (e.g. boxplot). Graphics in R, at least in the Windows GUI appear in a seperate window. Be aware that graphics will be replaced by the next plot unless you turn on "recording" in the history menu when graphic is in focus. You can save any graphic
(pdf or emf are good) using file "save as". Turning on recording allows scrolling back through the series of plots.
One additional feature/complexity of the maodat data set is the presence of missing values, indicated by "NA" in the data set.
The option "na.rm=T" in the various commands tells R to ignore the missing values.
A note on saving your R sessions. You can save all the ascii content of your console (input and character output) using the "save to file" option at the bottom of the file menu. You can also save just your input commands by the "save history" option. Having these as text files is useful. Alternatively, when your R console is active you can copy and paste into another application.
Class 9/28.
Cricket
In crickets.dat one variable, singtime. A log transformation of singtime (transform to symmetry) is done by the command
logcrick = log(cricket$singtime)
Class 10/3.
Handouts
Horseshoe Crab
The file horseshoecrab.dat has 5 variables, some categorical, some counted, some measured. To get a tally of the number of the females at each satellite count the command
table(crab$satell)
To obtain the relative frequencies, divide by the total number of females, obtained from "length" command. Both numerical and graphic display from the commands
table(crab$satell)/length(crab$satell)
barplot(table(crab$satell)/length(crab$satell))
Verzani, Albatross data
New commands "install.packages" which when working will search internet and sucessfully download onto your machine the library containing data files etc. Once the package is installed on your machine, the library and data commands
library(UsingR)
data(bycatch)
makes the data set "bycatch" active in the R session.
Extra Calcs from lecture
1. Dichotomizing cricket data
command sort orders the data
command singtime > 10 < transforms values greater than 10 into TRUE and values less than or equal to 10 into FALSE
command sum(singtime > 10) < produces the number of TRUE
command mean(singtime > 10) < produces the proportion TRUE
2. Categorical data, evolution example
command rep repeats the argument (character or numeric) the indicated number of times.
Class 10/5.
Random Sampling
The "sample" command takes a number of arguments; in
sample(k, size = 5, p = rep(1:1, 100)/100)
the population is represented by the vector k (here the integers between 0 and 99 via k=0:99), the size of the desired sample, and the
vector of probabilities p, here set to make all values of k equally likely via the "rep" command. The default setting in "sample" is
without replacement, replace = FALSE. To sample with replacement insert "replace = TRUE" in the sample command.
Class 10/10.
Binomial, Bernoulli Trials
The binomial distribution function Pr{Y=k} for k=0,1,2, for 2 trials, each with Pr{success} = 1/4 is obtained from the command
dbinom(0:2, s=2, p=1/4)
The binomial cumulative distribution function Pr{Y less than or equal to k} for k=1 for 2 trials, each with Pr{success} = 1/4 is obtained from the command
pbinom(1, s=2, p=1/4)
Poisson Distribution
The command
dpois(2, .610)
gives the Pr{Y=2} for a poisson distributed quantity with mean number of events .610.
The similar but fancier command
rbind(k, signif(dpois(k, .610)*200,4))
lists values of the Poisson fit for 200 Prussian calveries 200*Pr{Y=k} for each k=0:6 where rbind lines up (k, dpois) rows and signif(xxx,4)
gives output to 4 significant digits.
More of the same, the command
rbind(k, signif(dpois(k, 2),3), signif(ppois(k, 2),3))
gives values for the Poisson distribution and cumulative distribution function and each k to 3 significant digits.
Class 10/12.
Continuous Distributions
The command
unifdat = runif(1000,1,5)
produces a vector of 1000 values drawn from a rectangular distribution with endpoints (1,5).
The command
pnorm( 60, 54,4.5) - pnorm(58, 54,4.5)
finds the area between 58 and 60 under the density curve for N(54,4.5) variate .
Note the listing of distributions and functions in the "keepsake" section.
percentiles vs quantiles, pnorm vs qnorm
plotting continuous density functions, the command
curve(dnorm(x, 10, 2), 0,20)
plots the normal density for N(10,2) for x between 0,20 (5 sd from mean gets the full plot)
curve(dexp(x, 1/10), 0,25)
plots the density for an exponential distribution with mean 10, sd 10 between 0,25.
Class 10/17.
The command
qqnorm(percmoist);qqline(percmoist)
produces a Normal Probability Plot for the percentage moisture observation . The superimposed qqline goes through quartiles.
How do sample means behave? Demonstrations of CLT
Simulation loop using for(i ) loops. The code below draws a sample of size 10 for a Uniform[0,1] distribution calculates the mean, stores it in "result" and repeats that 100 times.
result = c() #Verzani p.167
> for(i in 1:100) {
+ result[i] = mean(runif(10)) } <<
q-q plots referenced to the normal distribution are created by the command: qqnorm(result)
To plot a density such as the exponential distribution with mean and sd 1/4, use the R code: curve(dexp(x,1/4), 0, 10)
Class 10/19.
Inference for a single mean
To obtain quantiles for the t-distribution in R, qt(1 - errvec/2,5) for a t-distribution with 5 df and for percentile points defined by civec =c(.9, .95, .99); errvec = 1 - civec #alpha = 1 - confidence (prob of coverage)
To obtain tests and confidence intervals using "t-tests" the command is simply t.test(croach) or t.test(croach, c= .9) for data croach and Type I error rates .05 or .10 respectively.
There are many switches and option for data displays like stem and leaf, one that
is often useful is to change the scale as seen in the 10/19 handout:
stem(diff, s=2) # change scale of stem-and leaf