R tutorial: Introduction to R

Summer School in Statistics for Astronomers XVI

June 2021

Instructor: Eric Feigelson (Penn State)

R is a powerful software environment for data analysis, graphics, and especially statistical analysis. It is available free to the public at www.r-project.org with easily installed binaries for Linux, MacOS and Windows. This notebook provides an introduction to R designed for students and researchers in astronomy. Some familiarity with scripting languages like Matlab or IDL is helpful.

To use Jupyter notebooks, both Python and R need to be installed on your computer. Python's anaconda distribution and conda package management system automatically include Jupyter Notebooks as an application accessed through the Anaconda Navigator.

However, it is possible that your Anaconda release did not automatically include the R kernel for Jupyter. You can tell whether it is present if the R logo (blue R in front of gray ellipse) is shown at the top-right of the Jupyter page. If it is missing, then the code cells below will fail upon execution. To install the R kernel within Python, type:

conda install -c r r-irkernel 

Alternatively, you can make R available to Jupyter from an R console by typing:

install.packages('IRkernel')  ;  IRkernel::installspec() 

For further discussion of the relationship between Python, R, Jupyter and other notebook environments, see https://www.datacamp.com/community/blog/jupyter-notebook-r

Some basic information:

R classes

R objects are placed into classes: numeric, character, logical, vector, matrix, factor, data.frame, list, and dozens of others designed by advanced R functions and CRAN packages. plot, print, summary functions are adapted to class objects; see e.g. methods(summary).

Two particularly important classes are the 'data frame' used for tabular data and the 'list' used as a bucket with heterogeneous content. The data frame is a 2-dimensional array with associated column names. The list class allows a hierarchical structure of R objects such as scalars, vectors, arrays, and attributes. Here we make a hierarchical list, use 'str' (structure) to show its contents, and access an element of the list using the $ delimiter.

R packages

The 17,000+ CRAN packages are obtained on-the-fly when they are needed. Several dozen packages are already in the base-R release, and can be used with the library or require command. Most packages are on mirror sites hosted by institutions around the world, but are most easily downloaded from the commercial cloud at https://cloud.r-project.org.

There is no useful index of the 17,000+ CRAN packages and their many functions. The collection expanded exponentially during the 2000s and is now growing by sevearal packages every day. Expert volunteers in ~40 statistical areas update lists of CRAN packages in their area; these are accesses on the Web at CRAN Task Views. Task Views of particular interest to astronomers include Bayesian, Cluster, HighPerformanceComputing, MachineLearning, Multivariate, Optimization, Robust, Spatial, Survival, and TimeSeries.

Astronomy-specific packages (e.g. stellar evolutionary tracks, celestial mechanics) are listed in the ChemPhys CRAN Task View. The package FITSio reads astronomical FITS formatted files and headers, converting them into an R list. The package astrolibR is a collection of long-established functionalities useful in astronomical research, translated from the IDL Astronomy Library.

R help files

R help files give essential information on all functions in a standard format:

  1. The top lines give the package where the function resides and a brief description.
  2. Usage gives the list of inputs for the function. Input parameters with an = have a default and do not need to be specified by the user.
  3. Arguments describes these input parameters.
  4. Details summarizes the functionality including formulae and algorithms.
  5. Value gives the output of the function. Typically the program specifies outfn <- fn(x,y,z, option='special') so the full list of output values are available for use, such as plot(outfn$x, outfn$y).
  6. References to published literature where the function is described.
  7. See also links to R functions with related purpose to the current function.
  8. Examples show usage of the function, often with a built-in dataset. Examples in R help files can always be cut-and-pasted into any R console.

R programmers are constantly referring help files. Read a few help files used below in this tutorial, such as seq, mad, and integrate.

Exercise 1: Practice R syntax. Practice with some elementary R functions: arithmetic and algebra, 2-dimensional array manipulation, producing multi-element lists. Write a brief program exercising program flow control (if, ifelse, when, repeat, break, stop, etc.)

Exercise 2. Integrate an astrophysical function. Estimate the age of the Universe as a function of redshift for a standard $\Lambda$CDM universe model: $$ t(z) = H_0^{-1} \int_{z}^{\infty}\frac{dz'}{(1+z')h(z')} $$ where $ h(z) = \sqrt{(1-\Omega_{total})(1+z)^2 + \Omega_m(1+z)^3 + \Omega_{\Lambda} } $, $\Omega_m$ is the matter density parameter, and $\Omega_{\Lambda}$ is the dark energy density parameter. Plot the age of the Universe vs. redshift ($z=0$ to 10) for three hypothetical universes: matter-dominated ($\Omega_m=1.0$ and $\Omega_{\Lambda}=0.0$), dark-energy-dominated ($\Omega_m=0.01$ and $\Omega_{\Lambda}=0.99$), and a realistic universe ($\Omega_m=0.31$ and $\Omega_{\Lambda}=0.69$). This problem and solution is courtesy of graduate student Phoebe Sandhaus, Penn State.

Exercise 3. Bootstrap resampling. Use bootstrap resampling (random sampling with replacement) to estimate uncertainties of a statistic. Create a univariate sample with a weird distribution … maybe sampling from a polynomial or nonlinear function over some interval. First, calculate the median and a robust measure of its standard error: $1.48*(MAD$ where MAD is the median absolute deviation and the 1.48 scales it to the standard deviation for a Gaussian distribution. Second, estimate the uncertainty of the median from a bootstrap resampling. Give the standard error, 95% confidence interval, and plot a histogram of the bootstrap medians.

X. R for astronomical data analysis

R was designed in the 1990s for interactive data exploration and analysis by a single user using a single CPU core with a graphics window. The compiler was not optimized for speed. However, today R has much improved compilers and considerable ability to use a multi-core environment (including GPUs). It also was not viewed at a language appropriate for pipeline analysis of large datasets such as those emerging from astronomical satellites or observatories. This criticism is valid in the sense that it does not have a large and growing software library of specialized code for astronomical data, such as the ~2000 CRAN packages in the Bioconductor toolbox for genomics or the diverse packages in the astropy project in Python.

However, R can be operated in batch mode, running scripts with the Rscript (or with more flexibility, CRAN package littler) command line interface from the operating system prompt. These commands can be embedded in parallelized scripts for supercomputers using PBS or related scripts. Using Rscript, we at Penn State have reanalized the full dataset of a major NASA satellite survey (4-year lightcurves for ~150,000 stars observed by Kepler) and ~2 million lightcurves from a ground-based observatory. The motivation for this approach is that the instrument-specific characteristics of the data reduction had already been performed in Python (Levels 2 and 3 data products), and the data were ready for advanced statistical analysis using CRAN packages. In our case, sophisticated packages originally designed for econometrics play a central role, along with a Fortran code we wrote for a computationally intensive step. R code for astronomical pipelines can be remarkably brief if one effectively uses the tens-of-thousands of advanced functions available in CRAN.

The many statistical functionalities of R can also be run directly from Python and several other languages through bi-language interfaces. R scripts can be embedded in Python using rpy2 and, conversely, Python programs can be embedded in R scripts using CRAN package reticulate or several other options.

XI. Resources for further study of R

We give here a miscellaneous collection of commands and resources useful for learning more about R as a programming language.