R tutorial: Introduction to R

Summer School in Statistics for Astronomers XVIII

June 2023

Instructor: Eric Feigelson (Penn State)


R is a powerful software environment for data analysis, graphics, and especially statistical analysis. It is available free to the public at www.r-project.org with easily installed binaries for Linux, MacOS and Windows. This notebook provides an introduction to R designed for students and researchers in astronomy. Some familiarity with scripting languages like Matlab or IDL is helpful.


To use Jupyter notebooks, both Python and R need to be installed on your computer. Python's anaconda distribution and conda package management system automatically include Jupyter Notebooks as an application accessed through the Anaconda Navigator.

However, it is possible that your Anaconda release did not automatically include the R kernel for Jupyter. You can tell whether it is present if the R logo (blue R in front of gray ellipse) is shown at the top-right of the Jupyter page. If it is missing, then the code cells below will fail upon execution. To install the R kernel within Python, type:

conda install -c r r-irkernel 


Alternatively, you can make R available to Jupyter from an R console by typing:

install.packages('IRkernel')  ;  IRkernel::installspec() 


For further discussion of the relationship between Python, R, Jupyter and other notebook environments, see https://www.datacamp.com/community/blog/jupyter-notebook-r


Some basic information:

# I.  Set up your session


getwd()                     # find working directory.  

#setwd('/Users/ericfeigelson/Desktop/Rdir')

getwd()                     # see the working directory has changed

citation()                  # quote this citation in any publication using R

sessionInfo()               # learn about your software environment

library()                   # see packages installed on your computer

                            # ~30 are installed automatically with R

# II. Create and characterize a vector


a <- c(33, 44, 92, 58)      # combine numbers into a vector

length(a)

ls()                        # list names of objects in your environment

class(a)                    # state the `class' of an R object (described in III below)

str(a)                      # state the structure of an R object

a                           # state the contents of an R object

# R vectors and arrays start at index 1, 

# unlike Python where the first element has index 0

write(file='output', a)     # write an ASCII file into the working directory

save(file='output_bin', a)  # write a binary file

Save(file='output_bin', a)  # error because 'Save' is not a known function.

                            # R syntax is case sensitive.

sum(a)


# Annotated write to console. The \n symbol is a carriage return.

cat('Sum of ', length(a), ' elements in the vector a = ', sum(a), '\n')

summary(a)                  # many R objects have a built-in 'summary' function

# Manipulation of vector indices


a[1:4]          # show all four elements of the vector

a[3]            # note that R vectors start with index 1 not 0, unlike Python

a > 40          # logical operation

sum(a[a>40])    # note the self-referential use of vector/array indices here

which.max(a)    # R has many built-in functions for vectors and arrays

match(44, a)


R classes

R objects are placed into classes: numeric, character, logical, vector, matrix, factor, data.frame, list, and dozens of others designed by advanced R functions and CRAN packages. plot, print, summary functions are adapted to class objects; see e.g. methods(summary).

Two particularly important classes are the 'data frame' used for tabular data and the 'list' used as a bucket with heterogeneous content. The data frame is a 2-dimensional array with associated column names. The list class allows a hierarchical structure of R objects such as scalars, vectors, arrays, and attributes. Here we make a hierarchical list, use 'str' (structure) to show its contents, and access an element of the list using the $ delimiter.

# III.  R classes and packages


# Make and write a data.frame, a 2D array with column names


d <- data.frame(cbind(seq(1:4), a, a^3))  # Bind columns into data frame

names(d) <- c('ID', 'a', 'a_cubed') # Column names for data frame                                         

d2 <- d[-4,-1]                            # Remove 4th row and 1st column

d ; d2

write.table(d, file='d.txt', quote=FALSE, row.names=FALSE)

# Make and show a list.


b_list <- list(star=c('Sirius', 'Procyon'), SpTy=c('O','B','A'), Hubble_km.s=68)

str(b_list)

b_list[['SpTy']] = list(subtype=seq(0.1:0.9, by=0.1))

str(b_list)

b_list$SpTy$subtype[1:3]


R packages

The 20,000 CRAN packages are obtained on-the-fly when they are needed. Several dozen packages are already in the base-R release, and can be used with the library or require command. Most packages are on mirror sites hosted by institutions around the world, but are most easily downloaded from the commercial cloud at https://cloud.r-project.org.

There is no useful index of the CRAN packages and their many functions. The collection expanded exponentially during the 2000s and is now growing by sevearal packages every day. Expert volunteers in ~40 statistical areas update lists of CRAN packages in their area; these are accesses on the Web at CRAN Task Views. Task Views of particular interest to astronomers include Bayesian, Cluster, HighPerformanceComputing, MachineLearning, Multivariate, Optimization, Robust, Spatial, Survival, and TimeSeries.

Astronomy-specific packages (e.g. stellar evolutionary tracks, celestial mechanics) are listed in the ChemPhys CRAN Task View. The package FITSio reads astronomical FITS formatted files and headers, converting them into an R list. The package astrolibR is a collection of long-established functionalities useful in astronomical research, translated from the IDL Astronomy Library.

# Download and use a CRAN package

install.packages('xtable') # Download CRAN package

library(xtable)                         # Bring LaTeX package into this R session

print(xtable(d), file='d.tex')          # Use a function in the package

                                        # Here we write an ASCII table with LaTeX format


R help files

R help files give essential information on all functions in a standard format:

  1. The top lines give the package where the function resides and a brief description.
  2. Usage gives the list of inputs for the function. Input parameters with an = have a default and do not need to be specified by the user.
  3. Arguments describes these input parameters.
  4. Details summarizes the functionality including formulae and algorithms.
  5. Value gives the output of the function. Typically the program specifies outfn <- fn(x,y,z, option='special') so the full list of output values are available for use, such as plot(outfn$x, outfn$y).
  6. References to published literature where the function is described.
  7. See also links to R functions with related purpose to the current function.
  8. Examples show usage of the function, often with a built-in dataset. Examples in R help files can always be cut-and-pasted into any R console.

R programmers are constantly referring help files. Read a few help files used below in this tutorial, such as seq, mad, and integrate.

help(seq)

help(mad)

help(integrate)

# IV. Arithmetic, algebra, trigonometry, and formatting numerics


5 + 3 ; 5-3 ; 5*3 ; 5/3 ; 5^3

x <- 5 ; y <- 3

x+y

sin(0)  ; sin(pi/2)         # note angles are in radians

ang <- seq(0, pi/2, length=30) 

sin(ang)

trunc(12345.6789) ; round(12345.6789)  

format(12345.6789, digits=2, scientific=TRUE)   


log(20)  ;  log10(20)   # log() in R is base-e. Use log10() for base-10 logarithms.


Exercise 1: Practice R syntax. Practice with some elementary R functions: arithmetic and algebra, 2-dimensional array manipulation, producing multi-element lists. Write a brief program exercising program flow control (if, ifelse, when, repeat, break, stop, etc.)

# V. Astrophysical calculations of galaxy distances 


# The `function' function: Note how one function uses another

# This how R builds new capabilities based on old capabilities in a compact syntax.


# First, make a simple calculation without functions


z <- seq(0.0, 0.5, 0.1)

z

H_0 <- 68  # km/s/Mpc,  Planck value

speed.light <- 3.0E5  # km/s

dist <- speed.light*z / H_0

dist

class(dist)

# Now, make a more complicated calculation with function


Omega_m <- (0.022068 + 0.12029) / (H_0/100)^2    

Omega_Lambda <- 0.6825                    # Planck satellite values


E.H0 <- function(redshift) {sqrt(Omega_m*(1+redshift)^3 + Omega_Lambda)}


lum.dist <- function(redshift) {  

luminosity.distance = (speed.light/H_0) * integrate(E.H0, 0, redshift)$value 

return(luminosity.distance)  }


distGR <- Vectorize(lum.dist)(z)

# Plot the results

# The 'plot' function has extensive options to change format; see 'help(par)'


options(jupyter.plot_scale=1)

options(repr.plot.width = 7, repr.plot.height = 5)


plot(z, distGR, type='l', lty=2, lwd=2, ylab='Distance (Mpc)')

lines(z, dist, lty=1, lwd=2)   

legend(0.0, 2500, lty=c(1,2), lwd=c(2,2), title='Galaxy distances', 

     legend=c('Euclidean', expression(Lambda*CDM)))


Exercise 2. Integrate an astrophysical function. Estimate the age of the Universe as a function of redshift for a standard $\Lambda$CDM universe model: $$ t(z) = H_0^{-1} \int_{z}^{\infty}\frac{dz'}{(1+z')h(z')} $$ where $ h(z) = \sqrt{(1-\Omega_{total})(1+z)^2 + \Omega_m(1+z)^3 + \Omega_{\Lambda} } $, $\Omega_m$ is the matter density parameter, and $\Omega_{\Lambda}$ is the dark energy density parameter. Plot the age of the Universe vs. redshift ($z=0$ to 10) for three hypothetical universes: matter-dominated ($\Omega_m=1.0$ and $\Omega_{\Lambda}=0.0$), dark-energy-dominated ($\Omega_m=0.01$ and $\Omega_{\Lambda}=0.99$), and a realistic universe ($\Omega_m=0.31$ and $\Omega_{\Lambda}=0.69$). This problem and solution is courtesy of graduate student Phoebe Sandhaus, Penn State.

# VI. Examine, summarize and plot univariate distributions: 

# dot plot, box plot, histogram


set.seed(1)

x <- sample(seq(0.01, 3, length.out=500))

y <- 0.5*x + 0.3^(x^2) + rnorm(500, mean=0, sd=(0.05*(1+x^2)))

xy <- cbind(x, y)


plot(xy, pch=20)

summary(x) ; summary(y)   # Summarizes properties of an R object

par(mfrow=c(1,2))  # Set up a two-panel figure

boxplot(x,  notch=T, main='Boxplot for X')

boxplot(y,  notch=T, pch=20, cex=0.5, main='Boxplot for Y')

dev.copy2pdf(file='box.pdf')

par(mfrow=c(1,1))

hist(x, breaks=30, main='', xlim=range(x), ylim=c(0,100), 

     xlab='Yvar', col='royalblue4')

hist(y, breaks=30, main='', xlab='Yvar', 

     col='#ee3b3b70', add=TRUE) # add=TRUE suppresses a new plot

qqnorm(y, pch=20, cex=0.5)  # Quantile function of y compared to normal distribution

qqline(y) # Expected relationship if y is normal

plot(ecdf(x), pch=20, cex=0.0, verticals=TRUE, main='',ylab='EDF',xlab='')  

plot(ecdf(y), pch=20, cex=0.0, verticals=TRUE, add=T)

text(2.0,0.5,"X") ; text(1.4,0.8,"Y")             # text adds annotation within a plot

dev.copy2pdf(file='ecdf.pdf')

# VII. Arrays, data frames and filtering


# Here xy is an `array' of numbers created by `column bind'


xy <- cbind(x, y)  ;  str(xy) 


# A data.frame associates names to the columns


xy <- as.data.frame(xy)         

names(xy) <- c('Xvar', 'Yvar')  


# Collect rows where the first column value exceeds 2


high_x1 <- xy[xy[,1]>2,]        

high_x2 <- subset(xy, xy[,1]>2) # Another way to extract rows

setequal(high_x1, high_x2)      # test equality of two vectors

# VIII. Sampling and bootstrapping 


trials <- sample.int(length(xy[,1]),20) # 20 random rows

xy[trials,]


# 20 bootstrap resamples


trials <- sample.int(length(xy[,1]),20, replace=T) 

xy[trials,]


# Estimate the standard error of the median of Yvar


median(xy[,2]) 


# Median absolute deviation estimate of median s.e.


mad(xy[,2]) / sqrt(500)

library(boot)  # The following function in a base-R library

med <- function(x,index) median(x[index])


# Read help(boot) to understand its output list structure


boot(xy[,2], med, R=1000) # Bootstrap estimate of median s.e. 

hist(boot(xy[,2], med, R=1000)$t, breaks=50, 

     xlab='Bootstrap median of Yvar')


Exercise 3. Bootstrap resampling. Use bootstrap resampling (random sampling with replacement) to estimate uncertainties of a statistic. Create a univariate sample with a weird distribution … maybe sampling from a polynomial or nonlinear function over some interval. First, calculate the median and a robust measure of its standard error: $1.48*(MAD$ where MAD is the median absolute deviation and the 1.48 scales it to the standard deviation for a Gaussian distribution. Second, estimate the uncertainty of the median from a bootstrap resampling. Give the standard error, 95% confidence interval, and plot a histogram of the bootstrap medians.

# IX. Bivariate plots and correlation tests


# Scatterplot. See help(points) for symbol shapes.

par(mfrow=c(1,2))

plot(xy, pch=20, cex=0.5)  

plot(log10(xy), pch=20, cex=0.5, xlab='log(Xvar)', ylab='log(Yvar)')

length(x[x>2]) # State length of a vector.  Use `dim' for an array or data.frame.

# Parametric hypothesis test for bivariate correlation 

cor.test(x[x>2],y[x>2], method='pearson')

# Nonparametric hypothesis test for bivariate correlation

cor.test(x[x>2],y[x>2], method='kendall')


X. R for astronomical data analysis

R was designed in the 1990s for interactive data exploration and analysis by a single user using a single CPU core with a graphics window. The compiler was not optimized for speed. However, today R has much improved compilers and considerable ability to use a multi-core environment (including GPUs). It also was not viewed at a language appropriate for pipeline analysis of large datasets such as those emerging from astronomical satellites or observatories. This criticism is valid in the sense that it does not have a large and growing software library of specialized code for astronomical data, such as the ~2000 CRAN packages in the Bioconductor toolbox for genomics or the diverse packages in the astropy project in Python.

However, R can be operated in batch mode, running scripts with the Rscript (or with more flexibility, CRAN package littler) command line interface from the operating system prompt. These commands can be embedded in parallelized scripts for supercomputers using PBS or related scripts. Using Rscript, we at Penn State have reanalized the full dataset of a major NASA satellite survey (4-year lightcurves for ~150,000 stars observed by Kepler) and ~2 million lightcurves from a ground-based observatory. The motivation for this approach is that the instrument-specific characteristics of the data reduction had already been performed in Python (Levels 2 and 3 data products), and the data were ready for advanced statistical analysis using CRAN packages. In our case, sophisticated packages originally designed for econometrics play a central role, along with a Fortran code we wrote for a computationally intensive step. R code for astronomical pipelines can be remarkably brief if one effectively uses the tens-of-thousands of advanced functions available in CRAN.

The many statistical functionalities of R can also be run directly from Python and several other languages through bi-language interfaces. R scripts can be embedded in Python using rpy2 and, conversely, Python programs can be embedded in R scripts using CRAN package reticulate or several other options.


XI. Resources for further study of R

We give here a miscellaneous collection of commands and resources useful for learning more about R as a programming language.

# A list of the ~30 important CRAN packages embedded in the base-R environment

library()


# A full list of ~400 functions in R's `base' package

library(help = "base")


# Statistics in base R (~400 functions, tens-of-thousands more in CRAN and elsewhere in R)

library(help='stats')


# List current contents of your session environment

ls()


# Programming utilities including:

#    Use `source' to bring in external R scripts

#    Use `edit' to edit an R object

#    Use 'environment' to segregate a collection of objects

#    Functions `debug', `trace' and `browser' assist with code testing

#    Function 'process.events' allows low-level handling of R commands

library(help = 'utils')


# Loops:  for( i in 1:100) { ... }

# Program flow control:  if/else, ifelse, switch, while, repeat, next, break, stop

foo <- 2

if(foo == 1) cat('Hello world!') else cat('Do nothing')

for(i in 1:10) { cat(' Num = ', i, '\n') }


# Graphics and devices in base R (other packages in CRAN)

library(help='graphics')

library(help='grDevices')


# Parallel computing control in base R 

# CRAN has dozens of other high performance computing packages

library(help='parallel')


# Run an R script residing on disk

help(source)


# Save R objects (or your full environment) onto disk

help(save) ; help(load)


# Save or load history of R commands

help(savehistory)  ;  help(loadhistory)


# Connections, pipes, sockets, URLs, clipboard, compression, etc.

help(connections)

 

# Interact with host computer 

Sys.info()

system('ls -l')

system.time(fft(seq(0,1,length.out=1000000))) # A million fast Fourier transforms


# Construct composite strings using 'paste'

# Extract postions of a string using `substring'

band_ir <- 'J'

paste('NGC1068',band_ir,'FITS', sep='.')


# FITS format reader/writer

install.packages('FITSio') ; library(FITSio)


# IDL Astro Library translated into R

install.packages('astrolibR') ; library(astrolibR)


# R/CRAN functions are public domain and can be wrapped from Python

# programs using package rpy2.  Example:

### pip install rpy2

### import rpy2

### import rpy2.robjects as robjects

### R = robjects.r

### ranGauss = R.rnorm(100)

### print ranGauss


# Python code can be wrapped into R using CRAN package 'reticulate' (among others)

# R has similar interfaces to many other languages.