Lecture 1

First time using R in Math 152:

1. Create a directory (folder) called Math 152 in a convenient drive (e.g. your U drive). This is where you will do all your work with R in the course.
2. Open R. It is available on all machines in Poppa laboratory under the academic software menu. It is also freely available online at http://www.r-project.org/.
3. In the File menu of R, select Change Directory.
4. Change to the new directory that you created in step 1, e.g. U:\Math 152
5. In the R window, type q() at the arrow prompt.
6. Choose "Yes" when asked if you want to save the workspace image.
7. Open your new folder and double click on .RData or the blue R icon. You are ready to go. Any objects you create will be saved in your workspace image (as long as you choose "Yes" when quitting). Everything you type will be saved in a file called .Rhistory. You can type ls() to see the objects that are saved in your workspace image.
8. As you move along with R, you may find it convenient to have Notepad (or another text editor) open at the same time for saving things and for editing and writing commands.

----------------------------------------------------------------------------------------------------

Go to my webpage math.cmc.edu/moneill in your internet browser. From the Math 152 link go to handouts and open hospitals.txt. This is the data set used for many of the examples in chapter 7 of your text. Save the file (or page) to your working directory as hospitals.txt. At the arrow prompt in R, type (or paste):

    hospitals<-read.table("U:/Math 152/hospitals.txt",header=TRUE)

to import the dataset into R as an object (a data frame) named hospitals. Note that the slashes are in the forward direction.

On my computer at home (Windows) I have followed steps 1 through 7 above to create a directory called Math 152 in my C: drive. With the file hospitals.txt saved in this directory and working from the image of R inside of it, I just need to type

    hospitals<-read.table("hospitals.txt",header=TRUE)

to import the dataset.
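
A bare file name like "hospitals.txt" is looked up in R's current working directory, which is why the short form works once R has been started from the course folder. The following is a minimal sketch of that behavior, not part of the course files: the file hospitals_demo.txt, the temporary folder, and the three rows of numbers are all made up for illustration.

```r
# Made-up stand-in for hospitals.txt (these numbers are NOT the real data).
demo_dir  <- tempdir()
demo_file <- file.path(demo_dir, "hospitals_demo.txt")
writeLines(c("discharges beds",
             "57 10",
             "482 120",
             "1253 300"), demo_file)

# setwd() returns the previous working directory, so we can restore it later.
old_wd <- setwd(demo_dir)
getwd()                                   # now the temporary folder

# A relative file name is resolved against the working directory.
demo <- read.table("hospitals_demo.txt", header = TRUE)
setwd(old_wd)                             # put the working directory back

names(demo)                               # column names come from the header line
str(demo)                                 # a data frame with 3 rows and 2 columns
```

If read.table complains that it cannot open the file, checking getwd() is usually the quickest way to find out where R is actually looking.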
Still another method would be to import it directly from my web page:

    hospitals<-read.table("http://math.cmc.edu/moneill/Math152/Handouts/hospitals.txt",header=TRUE)

Now type

    hospitals

to see what you've got.

----------------------------------------------------------------------------------------------------

If you wanted to get the same data file from the diskette that comes with your book, you would load the diskette, open the file called HOSP with Notepad (or another text editor), insert the headers "discharges" and "beds" with a space between them and a blank line between the headers and the data, and then save the file as before.

----------------------------------------------------------------------------------------------------

To save yourself from having to retype commands or from a lot of copying and pasting with your mouse, do the following:

After opening R, go to the File menu and select "New script". A script window will open. Paste this file, or any other list of commands you would like to work with, into the script window. Put your mouse cursor on the line you would like to execute and right click to select "Run line or selection". The command will run in the R console window. When closing the script, you will be asked whether to save it. Save it as rnotes1.R to keep it handy in your workspace. You can then open it using the "Open script" command in the File menu.
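
A saved script can also be run all at once from the console with the built-in function source(). The sketch below is only an illustration: it writes a tiny two-line script to a temporary folder (the script contents and the names sq and first_squares are made up here) and then runs it.

```r
# Write a small made-up script to a temporary location.
script_path <- file.path(tempdir(), "rnotes1.R")
writeLines(c("sq <- function(x) { x * x }",
             "first_squares <- sapply(1:5, sq)"), script_path)

# source() runs every line of the file, as if it had been typed at the prompt.
source(script_path)

# Objects created by the script are now available in the workspace.
print(first_squares)
```

With a script saved in your working directory, source("rnotes1.R") would do the same thing. Adding echo=TRUE prints each command as it runs.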
----------------------------------------------------------------------------------------------------

Now try some of the built-in functions from R:

    mean(hospitals$discharges)
    var(hospitals$discharges)
    sd(hospitals$discharges)
    hist(hospitals$discharges)
    hist(hospitals$discharges,breaks=15)

For documentation, type ?mean, ?var, ?sd, ?hist

----------------------------------------------------------------------------------------------------

Take a look at what happens with

    mean(discharges)

(R should report that the object discharges is not found, since discharges is a column of the data frame hospitals and not an object in its own right.)

If we get tired of typing hospitals$discharges any time we want to manipulate the discharges column of the data frame, we can use

    attach(hospitals)

Now try

    mean(discharges)
    detach(hospitals)
    mean(discharges)
    attach(hospitals)

----------------------------------------------------------------------------------------------------

It's possible to write your own functions in R, e.g.

    f<-function(x){x*x}

To create a vector of the first 500 squares we could use

    sapply(1:500,f)

or equivalently

    sapply(seq(1:500),f)

Check out the ingredients in this example: 1:500, ?sapply, ?seq

----------------------------------------------------------------------------------------------------

To create histograms of sample means as in the text, we could write a function (hospitals must be attached for this)

    f<-function(x){mean(sample(discharges,16))}

and then do

    hist(sapply(1:500,f))

Notice that our function doesn't really depend on the variable "x", so this is really just a way of telling R to do the same thing many times and record the result as a vector.
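
To see this concretely without the hospital data, here is a small sketch on made-up numbers. The vector fake_discharges and the function name f_demo are stand-ins invented for this illustration; only the pattern matches the function above.

```r
set.seed(152)                                        # make the illustration reproducible
fake_discharges <- round(rexp(393, rate = 1 / 800))  # 393 made-up values, NOT the real data

# Same shape as the function above: the argument x is never used.
f_demo <- function(x) { mean(sample(fake_discharges, 16)) }

means <- sapply(1:500, f_demo)
length(means)          # one sample mean for each of the 500 calls
is.numeric(means)      # the results are collected into a numeric vector

# The argument really is ignored: with the same random seed,
# any argument at all gives the same answer.
set.seed(1); a <- f_demo(1)
set.seed(1); b <- f_demo("anything")
identical(a, b)
```

The values 1:500 only serve as a counter here; sapply calls the function once per element and throws the element away.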
R has a built-in way to repeat an expression many times like this:

    replicate(500,mean(sample(discharges,16)))

It will be useful to have the following function:

    g<-function(m,n,v){replicate(m,mean(sample(v,n)))}

Try to think through what the function "g" does before demonstrating it with

    hist(g(500,16,hospitals$discharges))

For a little more practice with histograms try

    hist(g(500,16,hospitals$discharges),main=NULL,xlab="whatever we want",ylab="whatever we want")
    x<-hist(g(500,16,hospitals$discharges),main=NULL,xlab="discharges",ylab="counts",col="red")

Naming the histogram in the last line provides another way to examine the information contained in it. Enter

    x

to see what's going on. Don't forget to at least skim the help files ?sample, ?replicate, ?hist

----------------------------------------------------------------------------------------------------

R has very flexible and powerful graphics capabilities for visualizing data. Since the software is open source, there is quite a lot of help available online. For example, one can use Google to search on "R graphics" and find many examples and tutorials. Here is an attempt at reproducing Figure 7.2 of your text. Assuming that you have the function "g" defined above, type

    op<-par(no.readonly=TRUE)

This stores the current settings of the graphical parameters under the name "op". Now do

    par(mfcol=c(4,1),pin=c(2,.9))
    hist(g(500,8,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)
    hist(g(500,16,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)
    hist(g(500,32,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)
    hist(g(500,64,hospitals$discharges),xlim=c(0,1500),ylab="counts",main=NULL)

to get a column of 4 reasonably sized plots. Then do

    par(op)

to get back to the original settings of the graphical parameters. By resizing the display window I can now (more or less) reproduce Figure 7.2.
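
The same comparison can be made with numbers instead of pictures. The following is a rough sketch on made-up data: fake_pop is an invented stand-in population (not the hospital data), and g_demo simply repeats the definition of "g" so that the block runs on its own. The "theory" column uses the standard formula for the standard deviation of a sample mean when sampling without replacement, including the finite population correction.

```r
set.seed(152)
fake_pop <- round(rexp(393, rate = 1 / 800))   # made-up population of 393 values

# Same definition as "g" in the notes: m sample means, each from a sample of size n.
g_demo <- function(m, n, v) { replicate(m, mean(sample(v, n))) }

sizes <- c(8, 16, 32, 64)

# Standard deviation of 500 simulated sample means for each sample size.
sd_of_means <- sapply(sizes, function(n) sd(g_demo(500, n, fake_pop)))

# Theoretical value: sigma/sqrt(n) times the finite population correction.
N      <- length(fake_pop)
sigma  <- sqrt(mean((fake_pop - mean(fake_pop))^2))   # population sd (divisor N)
theory <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

round(cbind(n = sizes, simulated = sd_of_means, theory = theory), 1)
```

Each time the sample size is multiplied by 4, the simulated standard deviation should drop by roughly half, a little more for the larger samples because of the correction factor.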
The main point is clear: increasing the sample size decreases the variance of the sample mean, and its standard deviation is (approximately) inversely proportional to the square root of the sample size.

----------------------------------------------------------------------------------------------------

Estimating a proportion: (hospitals must be attached)

    length(discharges[discharges<1000])/393
    phat<-function(x){sample(discharges,25)->s;length(s[s<1000])/25}
    hist(sapply(1:500,phat),prob=TRUE)

----------------------------------------------------------------------------------------------------

The following function takes a sample xsamp of size 25 from the hospital discharges and produces an interval which, with (approximately) probability p, contains the mean value of the number of discharges.

    cint<-function(xsamp,p){
      # estimated standard error of the sample mean, with the
      # finite population correction for n=25 out of N=393
      sxbar<-sqrt((mean((xsamp-mean(xsamp))^2)/24)*(1-(25/393)))
      # z(a) is the point with upper tail area a under the standard normal density
      z<-function(a){qnorm(1-a)}
      c(mean(xsamp)-z((1-p)/2)*sxbar, mean(xsamp)+z((1-p)/2)*sxbar)
    }

qnorm is the "quantile function" for the normal distribution (i.e. it is the inverse function of the CDF).

We can use the function "cint" to reproduce the confidence interval demonstration in Figure 7.4 of the text as follows. Create a matrix called w whose columns are our twenty confidence intervals.

    replicate(20,cint(sample(discharges,25),.95))->w

Plot the endpoints of the confidence intervals.

    plot(c(1:20,1:20),c(w[1,1:20],w[2,1:20]),xlab="",ylab="Number of discharges")

The first vector gives the x coordinates and the second vector the y coordinates. Now join the appropriate pairs of points with lines.

    for(j in 1:20) lines(c(j,j),c(w[1,j],w[2,j]))

Finally, add a horizontal line which indicates the true mean value of the hospital discharges.

    lines(c(1,20),c(mean(discharges),mean(discharges)),lty=2)

(Here "lty=2" makes the last line dashed.)
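
One way to check the "(approximately) probability p" claim is to build many intervals and count how often they contain the true mean. The sketch below does this on a made-up population, not the hospital data. The function cint_demo is a hypothetical variant of "cint" that reads the sample size from its argument and takes the population size N as a parameter instead of hard-coding 25 and 393; otherwise it computes the same interval.

```r
set.seed(152)
fake_pop <- round(rexp(393, rate = 1 / 800))   # made-up, skewed population
N <- length(fake_pop)

# Variant of cint: n and N are taken from the arguments.
cint_demo <- function(xsamp, p, N) {
  n <- length(xsamp)
  # var(xsamp)/n equals mean((xsamp - mean(xsamp))^2)/(n - 1), as in cint
  sxbar <- sqrt(var(xsamp) / n * (1 - n / N))
  z <- qnorm(1 - (1 - p) / 2)
  c(mean(xsamp) - z * sxbar, mean(xsamp) + z * sxbar)
}

# 2000 intervals, one per column, each from a fresh sample of size 25.
w <- replicate(2000, cint_demo(sample(fake_pop, 25), 0.95, N))

# Fraction of intervals that contain the true population mean.
true_mean <- mean(fake_pop)
covered <- w[1, ] <= true_mean & true_mean <= w[2, ]
mean(covered)
```

For a skewed population and a sample of only 25, the observed fraction may fall somewhat short of 0.95, which is one reason for the word "approximately".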
----------------------------------------------------------------------------------------------------

It may be useful to be able to sketch the graph of a normal density:

    curve(dnorm(x,mean=1.1,sd=.32),xlim=c(-5,6))

We will cover more about graphics capabilities as needed. For now, note that you can type ?dnorm or ?curve for more information.

----------------------------------------------------------------------------------------------------

Problem 65 in chapter 7 asks you to repeat the treatment of the hospital data made in chapter 7 on the data set cancer.txt (also available on my web page). It is, of course, not too early to start trying to do that yourself.

----------------------------------------------------------------------------------------------------

For help with R you may:

1. Check the Help menu. In particular, there is a .pdf manual there called "An Introduction to R".
2. Use an internet search engine such as Google to search on (e.g.) "R tutorial".
3. Consult one of the many textbooks on R or S+ (which is, for our purposes, the same), e.g.
   Introductory Statistics with R by Peter Dalgaard
   Modern Applied Statistics with S by W.N. Venables and B.D. Ripley