Lecture 1




First time using R in Math 152:

1. Create a directory (folder) called Math 152 on a convenient drive (e.g. your U drive).
   This is where you will do all your work with R in the course.

2. Open R. It is available on all machines in the Poppa laboratory under the academic software menu.
   It is also freely available online at http://www.r-project.org/.

3. In the File menu of R select Change Directory.

4. Change to the new directory which you created in step 1.
    e.g. U:\Math 152

5. In the R window type q() at the arrow prompt.

6. Choose "Yes" when asked if you want to save the workspace image.

7. Open your new folder and double click on .RData or the blue R icon. You are ready to go.
   Any objects you create will be saved in your workspace image when you quit and choose to save it.
   The commands you type will be saved in a file called .Rhistory. You can type ls() to see the objects
   currently in your workspace.

8. As you move along with R, you may find it convenient to have a text editor (e.g. Notepad) open at the same time
   for saving things and for editing and writing commands.
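
The directory steps above can also be checked from the R console. The following is a minimal sketch; the object name "a" is only an illustration, and the setwd line is commented out because the path from step 4 may be different on your machine.

```r
# Show the directory R is currently working in.
getwd()

# Change to the course folder (the path from step 4; adjust it to your own drive).
# setwd("U:/Math 152")

# Create an object and confirm that it appears in the workspace listing.
a <- 3
ls()
```

If "a" shows up in the output of ls(), it will be stored in .RData the next time you quit and save the workspace.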

--------------------------------------------------------------------------------------------------------------

Go to my webpage math.cmc.edu/moneill in your internet browser.
From the Math 152 link go to handouts and open hospitals.txt. This is the data set used for many of
the examples in chapter 7 of your text. Save the file (or page) to your working directory as
hospitals.txt.

At the arrow prompt in R, type (or paste):

hospitals<-read.table("U:/Math 152/hospitals.txt",header=T)

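to import the dataset into R as an object (a data frame) named hospitals.
Note that the slashes are in the forward direction, even on Windows. A single backslash inside
an R string would be read as the start of an escape sequence.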


On my computer at home (Windows) I have followed steps 1 through 7 above to create a directory called Math 152
on my C: drive.
With the file hospitals.txt saved in this directory and R started from the workspace image inside it, I just need to type

hospitals<-read.table("hospitals.txt",header=T)

to import the dataset.


Still another method would be to import it directly from my web page:


hospitals<-read.table("http://math.cmc.edu/moneill/Math152/Handouts/hospitals.txt",header=T)


Now type 

hospitals

to see what you've got.
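
For a larger data frame it is often easier to look at a summary than at the whole thing. The sketch below is only an illustration: it writes a tiny made-up file in the same layout as hospitals.txt (a header line followed by one row per hospital) to a temporary location and reads it back. The numbers and the name "toy" are invented and are not the real data.

```r
# Write a small invented file: a header line, then one row per hospital.
tmp <- tempfile(fileext = ".txt")
writeLines(c("discharges beds", "57 10", "1371 270", "783 150"), tmp)

# Read it exactly as hospitals.txt is read above.
toy <- read.table(tmp, header = TRUE)

head(toy)   # the first few rows
str(toy)    # column names and types
dim(toy)    # number of rows and number of columns
```

The same three commands can be applied to hospitals once it has been imported.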



----------------------------------------------------------------------------------------------------------
If you wanted to get the same data file from the diskette that comes with your book, you would
load the diskette, open the file called HOSP with Notepad (or another text editor), insert the headers
"discharges" and "beds" with a space between them on a line of their own above the data,
and then save the file as before.
----------------------------------------------------------------------------------------------------------
To save yourself from having to retype commands or from a lot of copying and pasting with your mouse,
do the following:

After opening R, go to the File menu and select "New script". A script window will open.

Paste this file or any other list of commands you would like to work with into the script window.

Put your mouse cursor on the line you would like to execute and right click to select "Run line or selection".
The command will run in the R console window.

When closing the script, you will be asked whether to save it. Save it as rnotes1.R
to keep it handy in your working directory. You can then open it using the "Open script" command
in the File menu.
-------------------------------------------------------------------------------------------------------------


Now try some of the built in functions from R:

mean(hospitals$discharges)

var(hospitals$discharges)

sd(hospitals$discharges)

hist(hospitals$discharges)

hist(hospitals$discharges,breaks=15)


For documentation, type  ?mean,  ?var, ?sd, ?hist
--------------------------------------------------------------------------------------------------------
Take a look at what happens with 

mean(discharges)


If we get tired of typing hospitals$discharges any time we want to manipulate the discharges 
column of the data frame we can use

attach(hospitals)

Now try

mean(discharges)

detach(hospitals)

mean(discharges)

attach(hospitals)
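
An alternative to attach, not used elsewhere in these notes, is the built-in function with, which looks up column names inside a data frame for a single expression only. A minimal sketch on an invented three-row data frame (the name "toy" and its values are made up):

```r
# A small invented data frame with the same column names as hospitals.
toy <- data.frame(discharges = c(57, 1371, 783), beds = c(10, 270, 150))

m1 <- mean(toy$discharges)          # the $ notation
m2 <- with(toy, mean(discharges))   # the column is found inside toy

m1
m2
```

Both lines compute the same mean, and nothing needs to be detached afterwards.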
---------------------------------------------------------------------------------------------------------
It's possible to write your own functions in R.

e.g.

f<-function(x){x*x}

To create a vector of the first 500 squares we could use

sapply(1:500,f)

or equivalently

sapply(seq(1,500),f)

Check out the ingredients in this example:  1:500, ?sapply,?seq
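
Since arithmetic in R works on whole vectors at once, the same squares can also be obtained without sapply. A short sketch comparing the approaches (the names squares_a, squares_b and squares_c are only for illustration):

```r
f <- function(x) { x * x }

squares_a <- sapply(1:500, f)   # apply f to each element in turn
squares_b <- f(1:500)           # apply f to the whole vector at once
squares_c <- (1:500)^2          # the same thing without a function

identical(squares_a, squares_b)
all(squares_a == squares_c)
```

sapply becomes necessary when the function does not work elementwise on a vector, as with the sampling functions used later in these notes.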
----------------------------------------------------------------------------------------------------------
To create histograms of sample means as in the text we could write a function 
(hospitals must be attached for this)

f<-function(x){mean(sample(discharges,16))}

And then do

hist(sapply(1:500,f))

Notice that our function doesn't really depend on the variable "x", so this is really just a way of 
telling R to do the same thing  many times and record the result as a vector.

R has a built in way to do this 

replicate(500,mean(sample(discharges,16)))


It will be useful to have the following function:

g<-function(m,n,v){replicate(m,mean(sample(v,n)))}

Try to think through what the function "g" does before demonstrating it with


hist(g(500,16,hospitals$discharges))
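
To see what "g" returns without plotting, one can run it on a stand-in population. The sketch below assumes an invented population of 393 values called "pop" (it is not the hospital data), and uses set.seed so that the random samples are the same each time the code is run.

```r
set.seed(152)                            # make the random samples reproducible
pop <- round(rexp(393, rate = 1/800))    # invented population, NOT the real data

g <- function(m, n, v) { replicate(m, mean(sample(v, n))) }

means <- g(500, 16, pop)   # 500 sample means, each from a sample of size 16
length(means)
summary(means)
```

So m is the number of samples drawn, n is the size of each sample, and v is the vector being sampled from.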

For a little more practice with histograms try


hist(g(500,16,hospitals$discharges),main=NULL,xlab="whatever we want",ylab="whatever we want")

x<-hist(g(500,16,hospitals$discharges),main=NULL,xlab="discharges",ylab="counts",col="red")

Naming the histogram in the last line provides another way to examine the information contained in it.
Enter 

x

to see what's going on.
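
The object returned by hist is a list whose pieces can be pulled out by name. A minimal sketch on invented data (the vector "v" is random noise, not the hospital data), using plot=FALSE so that only the numbers are computed:

```r
set.seed(152)
v <- rnorm(200)               # invented data for illustration

h <- hist(v, plot = FALSE)    # compute the histogram without drawing it

names(h)      # the components stored in the histogram object
h$breaks      # the endpoints of the bins
h$counts      # the number of observations in each bin
sum(h$counts) # every observation falls in exactly one bin
```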



Don't forget to at least skim the help files

?sample
?replicate
?hist

-----------------------------------------------------------------------------------------------------------
R has very flexible and powerful graphics capabilities for visualizing data. Since the
software is open source, there is quite a lot of help available online. For example, one can
use Google to search on "R graphics" and find many examples and tutorials.

Here is an attempt at reproducing Figure 7.2 of your text.

Assuming that you have the function "g" defined above, type

op<-par(no.readonly=TRUE)

This gives the name "op" to the original state of R's graphical parameters.

Now do

par(mfcol=c(4,1),pin=c(2,.9))

hist(g(500,8,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)

hist(g(500,16,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)

hist(g(500,32,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)

hist(g(500,64,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)

to get a column  of 4 reasonably sized plots.

Then do 

par(op)

to get back to the original state of the graphical parameters.

By resizing the display window I can now (more or less) reproduce Figure 7.2.
The main point is clear: increasing the sample size decreases the variance of the sample mean, and its
standard deviation is (approximately) inversely proportional to the square root of the sample size.
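
The relationship between sample size and standard deviation can be checked numerically. The sketch below assumes an invented population "pop" of 393 values (not the hospital data) and compares the simulated standard deviation of the sample mean with the standard formula for sampling without replacement, sigma/sqrt(n) times the finite population correction sqrt(1-(n-1)/(N-1)).

```r
set.seed(152)
pop <- round(rexp(393, rate = 1/800))       # invented population, NOT the real data
N <- length(pop)
sigma <- sqrt(mean((pop - mean(pop))^2))    # population standard deviation

g <- function(m, n, v) { replicate(m, mean(sample(v, n))) }

sizes <- c(8, 16, 32, 64)
# standard deviation of 2000 simulated sample means, for each sample size
simulated <- sapply(sizes, function(n) sd(g(2000, n, pop)))
# theoretical value for sampling without replacement
theory <- sigma / sqrt(sizes) * sqrt(1 - (sizes - 1) / (N - 1))

round(cbind(sizes, simulated, theory), 1)
```

The two columns should agree closely, and each doubling of the sample size should shrink the standard deviation by a factor of roughly the square root of 2 (a little more, because of the correction factor).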
--------------------------------------------------------------------------------------------------
Estimating a proportion (hospitals must be attached; 393 is the number of hospitals in the data set):

length(discharges[discharges<1000])/393

phat<-function(x){sample(discharges,25)->s;length(s[s<1000])/25}

hist(sapply(1:500,phat),prob=T)
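
A proportion can also be computed as the mean of a logical vector, since R counts TRUE as 1 and FALSE as 0. A small sketch on an invented population "pop" (not the hospital data), which avoids typing the population size by hand:

```r
set.seed(152)
pop <- round(rexp(393, rate = 1/800))     # invented population, NOT the real data

p1 <- length(pop[pop < 1000]) / length(pop)   # count the small values, then divide
p2 <- mean(pop < 1000)                        # the same proportion in one step

p1
p2
```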

----------------------------------------------------------------------------------------------------------
The following function takes a sample of size 25 from the hospital discharges and produces an interval which,
with probability (approximately) p, contains the mean value of the number of discharges.


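cint<-function(xsamp,p){
  # estimated standard error of the sample mean, with the finite population correction (n=25, N=393)
  sxbar<-sqrt((mean((xsamp-mean(xsamp))^2)/24)*(1-(25/393)))
  # z(a) is the point with area a to its right under the standard normal density
  z<-function(a){qnorm(1-a)}
  c(mean(xsamp)-z((1-p)/2)*sxbar, mean(xsamp)+z((1-p)/2)*sxbar)
}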

qnorm is the "quantile function" for the normal distribution (i.e. it is the inverse function of the CDF).
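
A quick way to see how qnorm and the CDF pnorm fit together, using the familiar 95% case (the names "p" and "zval" are only for illustration):

```r
qnorm(0.975)           # about 1.96
pnorm(qnorm(0.975))    # applying the CDF gives back 0.975

# The multiplier used for a confidence level p:
p <- 0.95
zval <- qnorm(1 - (1 - p) / 2)
zval
```

With p = 0.95 the area left over is 0.05, split as 0.025 in each tail, which is why the quantile is taken at 0.975.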



We can use the function "cint" to reproduce the confidence interval demonstration
in Figure 7.4 of the text as follows.


Create a matrix called w whose columns are our twenty confidence intervals.


replicate(20,cint(sample(discharges,25),.95))->w


Plot the endpoints of the confidence intervals.


plot(c(1:20,1:20),c(w[1,1:20],w[2,1:20]),xlab="",ylab="Number of discharges")


The first vector gives the x coordinates and the second vector the y coordinates.
Now join the appropriate pairs of points with lines.


for(j in 1:20) lines(c(j,j),c(w[1,j],w[2,j]))

Finally, add a horizontal line which indicates the true mean value of the hospital discharges.


lines(c(1,20),c(mean(discharges),mean(discharges)),lty=2)

(Here "lty = 2" makes the last line dashed.)
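
Besides looking at the picture, one can count how many of the twenty intervals actually contain the true mean. The sketch below is self-contained and rests on assumptions: "pop" is an invented population (not the hospital data), and "cint2" is a hypothetical variant of "cint" in which the sample size and population size are not hard-coded.

```r
set.seed(152)
pop <- round(rexp(393, rate = 1/800))    # invented population, NOT the real data

# Hypothetical variant of cint: n is taken from the sample, N is an argument.
cint2 <- function(xsamp, p, N) {
  n <- length(xsamp)
  sxbar <- sqrt(mean((xsamp - mean(xsamp))^2) / (n - 1) * (1 - n / N))
  z <- qnorm(1 - (1 - p) / 2)
  c(mean(xsamp) - z * sxbar, mean(xsamp) + z * sxbar)
}

# Twenty intervals, one per column, as in the matrix w above.
w <- replicate(20, cint2(sample(pop, 25), 0.95, length(pop)))

# TRUE for each interval whose endpoints straddle the true mean.
covered <- w[1, ] <= mean(pop) & mean(pop) <= w[2, ]
sum(covered)
```

With p = 0.95 one expects most, but not necessarily all, of the twenty intervals to cover the true mean.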
----------------------------------------------------------------------------------------------------------
It may be useful to be able to sketch the graph of a normal density:

curve(dnorm(x,mean=1.1,sd=.32),xlim=c(-5,6))

We will cover more about graphics capabilities as needed. For now, note that you can type
?dnorm or ?curve for more information.
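
To connect dnorm with the formula for the normal density, one can check a couple of values by hand. A small sketch using the same mean and standard deviation as in the curve above (the names "m" and "s" are only for illustration):

```r
m <- 1.1
s <- 0.32

peak <- dnorm(m, mean = m, sd = s)       # height of the curve at its center
formula <- 1 / (s * sqrt(2 * pi))        # the same height from the density formula
peak
formula

# Area under the curve within one standard deviation of the mean.
within_one_sd <- pnorm(m + s, mean = m, sd = s) - pnorm(m - s, mean = m, sd = s)
within_one_sd                            # about 0.68
```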
-----------------------------------------------------------------------------------------------------------
Problem 65 in chapter 7  asks you to repeat the treatment of the hospital data made in chapter 7 on 
the data set cancer.txt (also available on my web page). 

It is, of course, not too early to start trying to do that yourself.
------------------------------------------------------------------------------------------------------------

For help with R you may:

1. Check the help menu. In particular there is a .pdf manual there called "An introduction to R".

2. Use an internet search engine such as Google to search on (e.g.) "R tutorial".

3. Consult one of the many textbooks on R or S+ (which is, for our purposes, the same).

e.g. Introductory Statistics with R by Peter Dalgaard
     Modern Applied Statistics with S by W.N. Venables and B.D. Ripley