R (and S-Plus) for Ecologists
R is exceptional statistical software for ecologists: it includes a
broad range of the analyses used in ecology, as well as numerous
routines for exploratory data analysis (EDA). Technically, the language is
called S, and R is the open source implementation available for many
systems for free; S-Plus is a commercial implementation of the S language.
R and S-Plus are extremely similar,
although not identical. I will specify explicitly when any of the following
information differs between systems, and a brief summary of the significant
differences is given at the end of this guide. I will generally use S to
mean information common to both packages.
Unfortunately, the syntax of S is
moderately quirky unless you are a C programmer. The Windows versions of S-Plus (and the
most recent unix/linux versions) have a graphical user interface (GUI)
available, but to fully employ the power of S you really want to know
the syntax of the command line or command window. The following
is a general guide to S with hints for performing the
exercises in the accompanying labs.
Data Structures
S is a 4th generation language, meaning that it includes
high-level routines for working with data structures, rather
than requiring extensive programming by the analyst. In S
there are 4 primary data structures we will use repeatedly.
- vectors --- vectors are one-dimensional ordered sets
composed of a single data type. Data types include integers,
real numbers, and strings (character variables)
- matrices --- matrices are two dimensional
ordered sets composed of a single data type, equivalent to the concept of matrix in linear algebra.
- data frames --- data frames are one to multi-dimensional sets,
and can be composed of different data types (although all data in a single
column must be of the same type). In addition, each column and row in a
data frame may be given a label or name to identify it. Data frames are
equivalent to a flat file database, and similar to spreadsheets. Accordingly,
we often refer to specific columns in a data frame as "fields."
- lists --- lists are compound objects of associated data. Like data
frames, they need not contain only a single data type, but can include
strings (character variables), numeric variables, and even such things as
matrices and data frames. In contrast to data frames, list items do not have a
row-column structure, and items need not be the same length; some can be
single values, and others a matrix. It's a little hard to imagine how lists
operate in the abstract, but you will see that many of the results of analyses
in S are returned as lists, so we'll introduce them as necessary that way.
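As a rough sketch of the four structures just described, the following R code builds one of each. All of the values are invented for illustration, and the object names site and demo are my own choices, not part of any lab dataset.

```r
# A vector: one-dimensional, a single data type (invented counts)
spcplt <- c(9, 14, 12, 8, 16)

# A matrix: two-dimensional, a single data type (3 plots x 2 species)
veg <- matrix(c(1, 0, 3,
                2, 5, 0), nrow = 3, ncol = 2)

# A data frame: columns may differ in type, and columns carry names
site <- data.frame(elev = c(1300, 1640, 1840),
                   text = c("loam", "clay.loam", "silty.clay.loam"))

# A list: components of any type and any length
demo <- list(counts = spcplt, cover = veg, site = site)

length(spcplt)   # number of elements in the vector
dim(veg)         # rows and columns of the matrix
names(site)      # column (field) names of the data frame
names(demo)      # component names of the list
```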
Vectors and Matrices
Vectors, matrices, data frames and lists are identified by a name given the
data structure at the time it is created. Names should be unique, and long
enough to clearly identify the contents of the structure. Names can consist
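of letters, numbers, and the character ".". They may not start with a
number, or include the character "$" or any arithmetic symbols,
as these have special meaning in S. (S-Plus and older versions of R also
reserve "_", which was once an assignment operator; current versions of R
allow it in names.)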
Individual items within a vector or matrix can be identified by
subscript (numbered 1-n), which is indicated by a number (or numeric
variable) within
square brackets. For example, if the number of species per plot is stored
in a vector spcplt, then
spcplt[37] = the number of species in plot 37
Matrices are specified in the order "row, column", so that
veg[23,48] = row 23, column 48 in matrix veg
Individual rows or columns within a matrix can be referred to
by implied subscript, where the value of the desired row or column
is specified, but other values are omitted. For example,
veg[,3] = third column of matrix veg
represents the third column of matrix veg, as the row number
before the comma was omitted. Similarly,
veg[5,] = row 5 of matrix veg
represents
row 5, as the column after the comma was omitted. In addition, a number
of specialized subscripts can be used.
veg[] = all rows and columns of matrix veg
spcplt[a:b] = spcplt[a] through spcplt[b]
spcplt[-a] = all of vector spcplt except spcplt[a]
veg[a:b,c:d] = a submatrix of veg from row a to b and column c to d
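The subscript forms listed above can be tried on a small example. This is only a sketch: the matrix and vector here are invented stand-ins for veg and spcplt, small enough to check by eye.

```r
# Hypothetical 4 x 5 matrix filled column by column with 1..20
veg <- matrix(1:20, nrow = 4, ncol = 5)
spcplt <- c(9, 14, 12, 8, 16)

veg[2, 3]        # row 2, column 3
veg[, 3]         # all of column 3
veg[4, ]         # all of row 4
spcplt[2:4]      # elements 2 through 4
spcplt[-1]       # everything except the first element
veg[1:2, 4:5]    # submatrix: rows 1-2, columns 4-5
```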
Data Frames
Data frames can be accessed exactly as can matrices, but can also be
accessed by data frame and column or field name, without knowing the column
number for a specific
data item. For example, in the Bryce dataset, there is a column labeled
"elev" that holds the elevation of each sample plot. This column
can be accessed as bryce$elev, where "bryce" is the name of the data
frame, "elev" is the name of the field or column of interest, and
the "$" is a separator to distinguish data frame from field. If you
are routinely working with one or a few data frames, S can be told the
name(s) of the data frames in an "attach " statement, and the data frame
name and separator can be omitted. For example, if we give the command
attach(bryce)
we can specify the field "elev" simply as "elev" rather than "bryce$elev."
This is more concise notation, but means that we cannot have any variables
with the same name as a field in a data frame that is attached. Data frames
are extraordinarily useful in command line S, and critical in GUI S.
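Here is a minimal sketch of the two ways of reaching a field. The data frame below is a made-up stand-in for bryce (the real dataset has many more plots and fields), and mean.elev is simply a name I chose for the result.

```r
# Hypothetical stand-in for the bryce data frame (values invented)
bryce <- data.frame(elev = c(2400, 2550, 2700),
                    slope = c(5, 12, 30))

bryce$elev            # the elev field, by data frame and field name
bryce[, "elev"]       # the same column, matrix-style, by column name
bryce[2, ]            # the second sample plot (row)

attach(bryce)         # make the fields visible by name alone
mean.elev <- mean(elev)
detach(bryce)         # undo the attach when finished
```

Calling detach() when you are done keeps field names from colliding with variables you create later.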
Lists
As noted above, a list is a compound object composed of associated data. Items
within a list are generally referred to as components. Similar to data
frames, components in a list can be given a name, and the component can be
specified by name at any time. In addition, components can be specified by
their position in the list, similar to a subscript in a vector. However, in
contrast to a vector, list components are specified in double [[ ]] delimiters.
We will ultimately find it quite handy to create our own lists, but for the
first few labs we will just see them as results from analyses, so we'll take
them as they come and demonstrate their properties by example.
For the time being, I'll give a very simple example, using the spcplt
vector above and the names of the veg data frame:
list.demo <- list(spcplt,names(veg))
names(list.demo) <- c('species per plot','species names')
list.demo
$"species per plot":
50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 50011 50012 50013
9 14 12 8 16 11 12 8 8 16 19 18 9
50014 50015 50016 50017 50018 50019 50020 50021 50022 50023 50024 50025 50026
14 19 8 10 12 13 9 15 6 13 18 16 12
50027 50028 50029 50030 50031 50032 50033 50034 50035 50036 50037 50038 50039
19 13 6 13 19 10 15 16 13 16 15 9 27
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
50156 50157 50158 50159 50160 50161 50162 50163 50164 50165 50166 50167 50168
6 5 6 6 7 4 10 13 3 12 4 5 16
50169 50170 50171 50172
10 10 8 12
$"species names":
[1] "ACHMIL" "AGOGLA" "AGRCRI" "AGRDAS" "AGRSCR" "AGRSMI" "AMEUTA" "ANEMUL"
[9] "ANTROS" "APOAND" "ARAHOL" "ARAPEN" "ARCPAT" "ARCUVA" "AREFEN" "ARTARB"
. . . . . . . .
. . . . . . . .
. . . . . . . .
[153] "SPC475" "SPC476" "SPC477" "SPC478" "SPC479" "SPC.70" "SPHCOC" "STAPIN"
[161] "STETEN" "STICOM" "STILET" "STIPIN" "STRCOR" "SWERAD" "SYMORE" "TAROFF"
[169] "TETCAN" "THAFEN" "TOWMIN" "TRADUB" "VALACU" "VICAME"
In this case, the first component "species per plot" has 160 numbers (each with
the plot identifier attached), and the second item has 174 strings.
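To pull individual components back out of such a list, something like the following should work in R. It is a cut-down sketch: the counts are invented and only the first three species names from the listing are used; spcnames is a name I made up for the example.

```r
spcplt <- c(9, 14, 12)                        # invented counts
spcnames <- c("ACHMIL", "AGOGLA", "AGRCRI")   # first three names from the listing

list.demo <- list(spcplt, spcnames)
names(list.demo) <- c("species per plot", "species names")

list.demo[[1]]                    # first component, by position
list.demo[["species names"]]      # second component, by name
list.demo$"species names"         # the same, using the $ separator
list.demo[[2]][3]                 # third element of the second component
```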
S Vector and Matrix Operators
Because S is a 4th generation language, it is often possible to perform
fairly sophisticated routines with little programming. The key is to recognize
that S operates best on vectors, matrices, or data frames, and to capitalize
on that. A large number of functions exists for manipulating vectors, and by
extension, matrices. For example, if veg is a vegetation matrix
of 100 sample plots and 200 species (plots as rows and species as columns),
we can perform the following, where "<-" is the S assignment operator:
- x <- max(veg[,3]) --- assigns the maximum value of species 3 among all plots to x
- y <- sum(veg[,5]) --- assigns the sum of species 5 abundance in all plots to y
- logveg <- log(veg+1) --- creates a new matrix called "logveg" with all values the
log of the respective values in veg (+1 to avoid log(0) which is undefined)
In addition, S supports logical subscripts, where the subscript is applied
whenever the logical function is true. Logical operators include:
- > for "greater than"
- >= for "greater than or equal to"
- < for "less than"
- <= for "less than or equal to"
- == for "equal to"
- != for "not equal to"
- & for "and"
- | for "or"
For example
- q <- sum(veg[,8]>10) --- assigns q the number of plots where the abundance of species 8
is greater than 10 (veg[,8]>10 is evaluated as 1 (true) or 0 (false) for each plot,
so that the sum is of 0's and 1's).
- r <- sum(veg[,8][veg[,8]>10]) --- assigns r the sum of the abundance for species 8 in
plots where species 8 has abundance greater than 10
- deep14 <- max(veg[,14][soil=='deep']) --- assigns the maximum abundance for species 14
on plots with deep soils
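The logical subscripts above are easier to follow on a vector small enough to check by hand. In this sketch the abundances and soil classes are invented, and the result names (n.high, sum.high, deep.max, both) are mine.

```r
# Invented abundances for one species on six plots, with soil depth per plot
abund <- c(0, 12, 25, 3, 0, 40)
soil  <- c("deep", "shallow", "deep", "deep", "shallow", "shallow")

abund > 10                                    # one TRUE/FALSE per plot
n.high   <- sum(abund > 10)                   # number of plots with abundance above 10
sum.high <- sum(abund[abund > 10])            # total abundance on those plots only
deep.max <- max(abund[soil == "deep"])        # maximum abundance on deep soils
both     <- sum(abund > 10 & soil == "shallow")   # two conditions joined with &
```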
One final case deserves special mention. Missing values in a vector or matrix are
always a problem in ecological data sets. Sometimes it is best simply to remove samples
with missing data, but often only one or a few values are missing, and it's best to keep
the sample in the matrix with a suitable missing value code. We'll discuss missing value
codes in more detail in the next section, but for now let's assume that we have missing
values in a vector. To use all of the vector EXCEPT the missing values, use
spcplt[!is.na(spcplt)]
That's complicated enough to merit some discussion. The S function to
identify a missing value is
is.na( )
so that to say all of a vector except missing values, we set a logical test to be true when
values are not missing. Since the S operator for "not" is !, the correct
test is
!is.na( )
and to specify which vector we're testing for missing value, we put the vector in parentheses
as follows:
!is.na(spcplt)
Accordingly, the full expression is
spcplt[!is.na(spcplt)]
While the symbol for a missing value in a vector or matrix is NA, using
spcplt[spcplt!="NA"]
will NOT work.
We can use the missing value test on any vector as necessary. For example, the
vector of elevations, except where the number of species per plot is missing, is
elev[!is.na(spcplt)]
This use of missing values is critical to S because the vectors or matrices combined in an operation
must have the same number of elements. So, if there are missing values in any field we're using in a calculation,
the same record (row) must be omitted from all the other fields as well. In a
later lab I'll demonstrate how to create a "mask" that we can use to simplify
working with vectors or matrices with missing values.
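A short sketch of the missing value test in action, using two invented vectors of the same length (one value of spcplt is deliberately missing):

```r
spcplt <- c(9, NA, 12, 8)              # invented counts, one missing
elev   <- c(1300, 1640, 1840, 1730)    # invented elevations, none missing

is.na(spcplt)                  # TRUE only where the value is missing
spcplt[!is.na(spcplt)]         # the three non-missing counts
elev[!is.na(spcplt)]           # elevations for the same three plots

mean(spcplt)                   # NA: the missing value propagates
mean(spcplt[!is.na(spcplt)])   # mean of the non-missing counts
spcplt[spcplt != "NA"]         # does NOT drop the missing value
```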
Row or Column Operations on a Matrix
Vector operators can be applied to every row or column of a matrix to produce
a vector with the apply command. For example:
spcmax <- apply(veg,2,max) creates a vector "spcmax" with the maximum value
for each species in its respective position. The apply operator is
employed as:
apply("matrix name",1(rowwise) or 2(columnwise),vector operator)
so that
pltsum <- apply(veg,1,sum)
creates a vector of total species abundance in each plot. The vector is as
long as the number of rows in matrix veg. If the function to be applied
doesn't exist, it can be created on the fly as follows:
pltspc <- apply(veg,2,function(x){sum(x>0)})
where function(x){sum(x>0)} counts the number of plots where a species
has abundance greater than 0, and x is set to each column (species) of veg in turn.
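The following sketch runs apply() both ways on a tiny invented matrix. Note that inside the function, x is the row or column itself (a vector), not its number.

```r
# Invented 3-plot x 4-species abundance matrix (plots as rows)
veg <- matrix(c(0, 5, 2,
                1, 0, 0,
                3, 3, 3,
                0, 0, 7), nrow = 3, ncol = 4)

spcmax <- apply(veg, 2, max)                      # maximum abundance of each species
pltsum <- apply(veg, 1, sum)                      # total abundance in each plot
pltspc <- apply(veg, 2, function(x) sum(x > 0))   # plots occupied by each species
spcplt <- apply(veg, 1, function(x) sum(x > 0))   # species present in each plot
```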
Triangular Matrices
Often in community ecology we work with symmetric matrices (e.g. similarity, dissimilarity,
or distance matrices). These matrices take up extra space, since the value of the diagonal
is known by definition and every other value is stored twice (matrix[x,y]=matrix[y,x]).
We can save space by only storing one triangle of the matrix. In addition, some analyses require a vector
argument, rather than a matrix, and it's convenient to convert the triangular matrix to a vector.
This can be done as follows:
triang <- matrix[row(matrix) > col(matrix)]
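As a sketch, here is the same extraction on a small invented dissimilarity matrix, named dis here rather than matrix so that it does not share a name with the matrix() function. In R, lower.tri() is a built-in shortcut for the same test.

```r
# Invented symmetric 4 x 4 dissimilarity matrix
dis <- matrix(c(0.0, 0.2, 0.5, 0.9,
                0.2, 0.0, 0.4, 0.7,
                0.5, 0.4, 0.0, 0.3,
                0.9, 0.7, 0.3, 0.0), nrow = 4)

triang <- dis[row(dis) > col(dis)]   # values below the diagonal, column by column
length(triang)                       # n*(n-1)/2 values for an n x n matrix
same <- all(triang == dis[lower.tri(dis)])   # lower.tri() selects the same cells
```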
Getting Data Into S
Getting data into any program is often the hardest part about using the program.
For S, this is generally not true, as long as the data are reasonably formatted.
The R Development Core Team has developed a special manual to cover the ins and
outs of getting data into and out of R. It's available as a PDF or HTML at http://cran.r-project.org. Much of the
material covered there is also applicable to S-Plus.
The easiest way is to format the data in columns, with column headings, and blanks
or tabs between. For example:
plot elev aspect slope text
1 1300 240 30 loam
2 1640 170 20 clay.loam
3 1840 NA 24 silty.clay.loam
. . . . .
. . . . .
. . . . .
100 1730 70 15 sandy.loam
The columns do not need to be straight, but multi-word variables like "clay loam"
need to be connected or put in quotes. The S convention (but it is just a convention)
is to connect with a period, as shown above. It CANNOT be connected with "$" or "_".
The above file (if named "site.dat" for instance) could be read with the
read.table command as follows:
site <- read.table('site.dat',header=TRUE)
The resulting data frame would be named "site", and the columns would be named
exactly as in the data file. Note that the value for aspect in the third plot
is NA. This is a missing value code, and will cause S to treat that value
as missing, rather than as the text "NA". It's possible to use other codes as missing
values if you specify them in the read.table command. For example,
suppose in your data set you used -999 as the missing value code. To tell S
to set -999 to missing, add the na.strings= argument as follows:
site <- read.table('site.dat',header=TRUE,na.strings="-999")
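To try this without a data file on hand, the sketch below first writes a few invented lines to a temporary file and then reads them back in R. The use of tempfile() and writeLines() is just scaffolding for the example; with a real file you would only need the read.table() line.

```r
# Write a small whitespace-delimited file to a temporary location (invented data)
site.file <- tempfile(fileext = ".dat")
writeLines(c("plot elev aspect slope text",
             "1 1300 240 30 loam",
             "2 1640 170 20 clay.loam",
             "3 1840 -999 24 silty.clay.loam"), site.file)

# Treat -999 as the missing value code while reading
site <- read.table(site.file, header = TRUE, na.strings = "-999")

site$aspect            # third value is NA
is.na(site$aspect)
```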
Alternatively, data can be organized as in traditional
spreadsheet "csv" comma delimited files, as follows:
plot,elev,aspect,slope,text
1,1300,240,30,loam
2,1640,170,20,clay.loam
3,1840,90,24,silty.clay.loam
. . . . .
. . . . .
. . . . .
100,1730,70,15,sandy.loam
In which case it would be read:
site <- read.table('site.dat',header=TRUE,sep=",")
to tell S that the values were separated by commas. Alternatively, in R, you
can use
site <- read.csv('site.dat',header=TRUE)
to read the file, as read.csv() calls read.table() with the
appropriate parameters as defaults.
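A quick way to convince yourself that the two calls agree in R is to read the same comma-delimited file both ways. Again the file is a temporary one with invented contents, and site1 and site2 are names chosen for the comparison.

```r
csv.file <- tempfile(fileext = ".csv")
writeLines(c("plot,elev,aspect,slope,text",
             "1,1300,240,30,loam",
             "2,1640,170,20,clay.loam",
             "3,1840,90,24,silty.clay.loam"), csv.file)

site1 <- read.table(csv.file, header = TRUE, sep = ",")
site2 <- read.csv(csv.file)       # header=TRUE and sep="," are the defaults

all.equal(site1, site2)           # compare the two data frames
```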
Finally, if the data
are in a formatted file with no delimiters (spaces or commas) it can be read
by specifying the columns that start each field. For example:
  1130024030loam
  2164017020clay.loam
  31840 9024silty.clay.loam
. . . . .
. . . . .
. . . . .
1001730 7015sandy.loam
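can be read in S-Plus with
site <- read.table('site.dat',sep=c(1,4,8,11,13))
where the numbers give the column in which each field starts. In R, sep must
be a single character, and fixed-format files are read instead with the
read.fwf() function, which takes a vector of field widths.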
In this case (or in any case where
column headings are absent), they can be entered separately with the names
command. For example:
names(site) <- c("plot","elev","aspect","slope","text")
where c is an S function meaning "combine." Row names (such as plot IDs)
can also be added if desired, using the row.names() function in a similar
way.
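In R, one way to read a fixed-format file like the one above is read.fwf(), which takes field widths rather than starting columns. The sketch below assumes the plot number is right-justified in columns 1-3 and allows up to 20 characters for the final text field; that last width is my own assumption, not something fixed by the file.

```r
fwf.file <- tempfile(fileext = ".dat")
writeLines(c("  1130024030loam",
             "  2164017020clay.loam",
             "  31840 9024silty.clay.loam",
             "1001730 7015sandy.loam"), fwf.file)

# Fields start in columns 1, 4, 8, 11 and 13, so the widths are 3, 4, 3, 2,
# and (assumed here) up to 20 characters for the text field
site <- read.fwf(fwf.file, widths = c(3, 4, 3, 2, 20), strip.white = TRUE)

# No header line, so name the columns and label the rows afterwards
names(site) <- c("plot", "elev", "aspect", "slope", "text")
row.names(site) <- site$plot
```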
There is one element of read.table() that sometimes causes problems.
Ordinarily, read.table() chooses the row labels itself. S-Plus will use the first column that contains all
unique values as the row labels; generally (but not universally) this is the
first column. R simply numbers the rows, unless the header line has one fewer
entry than the data lines, in which case it uses the first column. It is often
best to explicitly specify which column contains row
identifiers (as opposed to data), using the row.names= specifier.
Going back to the original example,
site <- read.table('site.dat',header=TRUE, row.names=1)
makes sure that S knows that the first column is identifiers, not data.
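The effect is easy to see on a small file. In R the argument is spelled row.names; the sketch below uses a temporary file with invented contents and shows that the identifier column moves out of the data and into the row labels.

```r
site.file <- tempfile(fileext = ".dat")
writeLines(c("plot elev aspect slope text",
             "1 1300 240 30 loam",
             "2 1640 170 20 clay.loam",
             "3 1840 NA 24 silty.clay.loam"), site.file)

site <- read.table(site.file, header = TRUE, row.names = 1)

names(site)        # "plot" is no longer a data column
row.names(site)    # the plot identifiers, stored as character strings
site["2", "elev"]  # rows can now be selected by identifier
```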
The beauty of the read.table() function is the way it handles
variables. If any value in a column is alphabetic, it treats the column as
composed of "factors," or categorical variables. (From R 4.0.0 onward this
requires adding stringsAsFactors=TRUE to the read.table() call; otherwise
such columns are read as character strings.) There is NEVER a reason to
convert a categorical variable to numeric. However, if you already have
categorical variables coded as integers, you can explain that to S with the
factor() function after you read the data in. If all values in a column
are numeric, it treats that variable as numeric.
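For the case of categories already coded as integers, a minimal sketch with factor() looks like this. The codes and the labels attached to them are invented; in practice you would take them from your own data dictionary.

```r
# Invented soil texture codes stored as integers
text.code <- c(1, 3, 2, 1, 3)

# Tell S which codes exist and what each one means
texture <- factor(text.code, levels = c(1, 2, 3),
                  labels = c("loam", "clay.loam", "sandy.loam"))

texture              # printed with labels rather than numbers
levels(texture)      # the categories
table(texture)       # number of plots in each category
is.numeric(texture)  # FALSE: no longer treated as a number
```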
Plotting in S
S has a powerful graphics capability that is much of the appeal to using the
system. Many of the analyses have special plotting capabilities that allow you to
plot results without storing multiple intermediate products. (S likes to point
out that it is "object oriented", and that this object orientation is what allows
the generality of its plotting routines. While that is generally true, the SYNTAX of S
is more appropriately viewed as functional, rather than object oriented, and we will concern
ourselves largely with syntax, rather than implementation).
S supports a fairly broad range of graphic devices in addition to excellent
on-screen plotting. Reflecting its origins on unix computers, it is quite good at
Postscript output, but also includes other formats.
The devices available to you for plotting will depend to some extent on your
operating system (Windows versus unix/linux) and whether you are using S-Plus or
R.
X11
In unix/linux, we will be mostly working with X11. If you give S a
plotting command without first opening a device, an X11 window will pop up
automatically to contain the plot. This plotting area is usually a convenient
size for working, and can be resized with the mouse to almost any size.
Normally, this is convenient and sufficient. Sometimes, however, we want
absolute control over the aspect ratio of the plot, so that 100 units on the X
axis is exactly the same size as 100 units on the Y axis. There is a small
number of ways to ensure that the plotting is "square", but all of them assume
that the plotting window has not been re-sized with the mouse. Accordingly, it
is sometimes important to know how to create a plotting window of a specific size.
This is one of the interesting areas of difference between S-Plus and R.
In S-Plus, the window is created by the motif() function, named after
the window manager of the Open Software Foundation. This is true even if you are
not running an OSF operating system or window manager. The S-Plus
motif() function speaks directly to the X11 window manager, and can
pass a large number of X11 options and specifiers. This is quite helpful if
you are familiar with X11, but quite cryptic if you are not. I won't attempt to
teach X11 here, but merely show how to create a plotting window of a specific
size. The size is specified in PIXELS, not centimeters or inches, and includes
a position indicator as well. It is all specified with the X11
geometry command as follows. Suppose we want a plotting window of 800
by 600 pixels in the upper left of our monitor. We would enter
motif("-geometry 800x600+10+10")
This means create a window 800 pixels wide by 600 pixels high down 10 pixels
from the top and 10 pixels from the left edge. Note that the entire expression
is enclosed in quotes, that the expression begins with a dash (to specify an
option), and that the size is delimited with an x (to mean "by") while the offset
is delimited with plus signs. This seems like a fairly complicated scheme, but
is consistent with X11 syntax in general.
In R, the X11 window is controlled by the x11() function. The
size of the window is specified in inches as arguments to the function. For
example, to get a window 8 inches wide by 6 inches tall
x11(height=6,width=8)
This is simpler, except that you can't control the location. You can, however,
move the window with your mouse. As long as you don't resize it you are fine.
Other Devices
The list of other devices you can plot to also depends on operating system and
S-Plus versus R. Recent versions of S-Plus include java.graph,
pdf.graph, and wmf.graph for Java, portable document format, and
Windows metafile respectively, as well as hpgl, hplj, and
postscript for hardcopy output on Hewlett Packard compatible plotters,
HP Laserjet compatible printers, and postscript devices respectively.
R includes postscript, pictex, png, jpeg, and xfig
devices as well as x11.
On either system, type
?Devices or help(Devices)
to get a list of available devices and their names (note the capital D on
Devices). Each of the devices has options that can be set to control
plot size, orientation (landscape or portrait), font size, etc.
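As an illustration of setting device options, here is a sketch for R's postscript device (the arguments shown are R's; the S-Plus postscript function takes different ones). It writes a 6 by 6 inch plot of invented points to a temporary file, and uses asp=1 so that a unit on the X axis is the same length as a unit on the Y axis.

```r
# Open a PostScript device writing to a temporary file; sizes are in inches
plot.file <- tempfile(fileext = ".ps")
postscript(plot.file, width = 6, height = 6,
           horizontal = FALSE, paper = "special")

# Invented coordinates; asp = 1 forces equal scaling on both axes
plot(c(0, 50, 100), c(0, 80, 100), asp = 1, xlab = "X", ylab = "Y")

dev.off()          # close the device so the file is completed
```

Nothing appears on screen; the plot goes straight to the file, and the file is only complete once dev.off() has been called.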
Libraries
While S is an expansive language with a large number of routines
already included, it doesn't include everything, and has several specific areas
of omission with respect to vegetation ecology (e.g. no NMDS or CCA).
Fortunately, the core routines are easily augmented with additional user-written
routines which can be loaded into your copy of S. These routines are
usually provided in what S calls a "library," and which R calls a "package": a bundle containing the
routine itself (which may be partially implemented in FORTRAN or C, as well as
S), help files, often test data, and other items as necessary. Accordingly,
it's necessary to know how to load libraries to make the most of S.
Fortunately, in recent releases (S-Plus 5+ or S-Plus 2000 or R > 1.2) many of the libraries
we want are already included and installed in the correct locations. For
example, we will frequently use functions from the MASS library by Venables and
Ripley. Lucky for us, it is included in both S-Plus and R. Before
going to a great effort to install needed libraries,
find out which libraries are already installed on your
machine. Depending on your operating system and R versus S-Plus, do the
following:
- Windows/S-Plus: Click on the File menu and scroll down to the
Load Libraries item. This will pop up a widget with a scrolling window
listing all the available libraries. Simply click on the desired library, and
it will be loaded. Alternatively, in the Commands window,
enter library(). This will pop up a window listing the installed
libraries. To load the library, include the library name as listed in the
library function. For example, enter library(MASS) to load the MASS library
of Venables and Ripley.
- unix/linux/S-Plus: At the S-Plus prompt, enter library() to see
a list of available libraries. To load the library, include the library name as
listed in the
library function. For example, enter library(MASS) to load the MASS library
of Venables and Ripley.
- Windows/R: Click on the Packages menu and scroll down to the
Load package item. This will pop up a widget listing all available
packages. To load the library, simply click on the desired library.
Alternatively, in the R Console, enter library() or
installed.packages(). The first will produce a simple list of libraries
installed; the second will produce a similar list with additional information on
dependencies on other libraries and other information.
To load the library, include the library name as
listed in the
library function. For example, enter library(MASS) to load the MASS library
of Venables and Ripley.
- unix/linux/R: At the prompt, enter either installed.packages() or
library(). The first will produce a list of installed libraries with
additional information and the second a simple list. To load the library,
include the library name as
listed in the
library function. For example, enter library(MASS) to load the MASS library
of Venables and Ripley.
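From the R command line, the steps above can be sketched as follows. The check with requireNamespace() is my addition so the example does not fail on a machine where MASS happens to be missing, and loaded and pkgs are names chosen for the example.

```r
# Load MASS if it is installed; requireNamespace() only checks availability
if (requireNamespace("MASS", quietly = TRUE)) {
  library(MASS)
  loaded <- "package:MASS" %in% search()   # TRUE once the package is attached
} else {
  loaded <- FALSE
  message("MASS is not installed on this machine")
}

# Names of all installed packages (row names of installed.packages())
pkgs <- rownames(installed.packages())
head(pkgs)
```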
If the library you want is not installed, you will have to install it yourself.
Again, depending on operating system and program, the details are somewhat
different.
Installing S-Plus libraries
First, you have to locate the libraries you want to install. One of the best
repositories for S-Plus libraries is StatLib at the Department of
Statistics at Carnegie Mellon University (http://lib.stat.cmu.edu). Look under
S Archive or simply
http://www.lib.stat.cmu.edu/S for unix/linux or
http://lib.stat.cmu.edu/DOS/S
for Windows. Depending on operating system, the files and conventions differ.
unix/linux formats
- shell archives (also called shar files): These are ASCII files intended to
be processed by the Bourne shell. These work well for reasonably
small libraries, and are quite common for older libraries. If the library
includes FORTRAN or C programs, it will generally include a makefile (ask your
system administrator for help if you are unfamiliar with makefiles) to compile
and link the routines. To unpack a shar file, either make the file executable
(e.g. chmod +x demo.sh) and execute it (just type its name, e.g.
./demo.sh) or
pass the file to the Bourne shell (e.g. sh demo.sh). Then follow the
instructions provided (usually README or Install or Instructions). If at all
possible (and this requires su status) you will want to install the library in
the S-Plus library location (often $SHOME/library or
/usr/splus/library). This in turn implies that you will want to create
a subdirectory under the S-Plus library location and copy the shar file to this
location before unpacking it.
- tar files: These are archives created with the unix tar command. If, for
example, you have a library called "demo.tar", then you can unpack it with
tar -xvf demo.tar
which will unload it in the current directory, creating subdirectories as
necessary to preserve the original structure. If the library includes FORTRAN
or C routines as well as S, it will generally include a Makefile (see your
system administrator if you are unfamiliar with Makefiles). If at all possible (and this
requires su status) you will want to create a suitable subdirectory under the
S-Plus library directory and extract the tar file there.
- tz or tar.gz files: These are simply tar files (see item above) that
have been gzip'ed to save space. Simply uncompress them (e.g. gunzip
demo.tar.gz) and follow the instructions just above.
Windows formats
- zip files: These are archives created by the pkzip or winzip
programs. Simply copy the zip file to a suitable location and unzip it.
Generally, if the library includes FORTRAN or C routines, they will be already
compiled into executable or DLL files for you.
- text files: If the library consists of pure S code, it may be saved as
a simple ASCII file, which can then be included into your session of S-Plus with
the source command. For example, if you downloaded demo to
c:\spluslib then from inside S-Plus (at the command prompt) you could
enter demo <- source("c:/spluslib/demo") (remember to use forward
slashes) and use the function by simply typing its name.
- executable files: These are compiled programs (generally either
FORTRAN or C) that can be executed directly.
Windows .zip files are likely to be pre-compiled and ready to load as specified
above.
Under unix/linux, it's unlikely the library is compiled (although linux binaries
are not too uncommon), so you will need to compile
the executables with the make command. After unpacking the library,
move to the subdirectory that is the root of the library.
Then, at the shell prompt,
enter S-Plus CHAPTER (you might use a different name for S-Plus, such as
S-Plus5 or splus). Then enter S-Plus make to compile the source code
into objects which are suitable for dynamic linking with S-Plus. If your
SHOME environment variable is not defined, or you have your libraries
in an unusual location, you may have to edit the Makefile to get this to work.
Installing packages or libraries in R
The best repository
for R packages is CRAN at
http://cran.r-project.org/. R generally
refers to "packages" rather than "libraries," but packages are simply
collections of libraries. The R site has separate areas for source code (S
functions and FORTRAN or C code in uncompiled ASCII) and binaries (compiled
code for a specific machine). If your machine and operating system are
supported, it's usually simpler to use the pre-compiled binaries.
If your machine is on the internet, R has routines available to automatically
install or update libraries or packages from CRAN. This is one of the areas
where R really outshines S-Plus.
- Windows/R: Click on the Packages menu and scroll down to
Install packages from CRAN. This will pop up a widget that lists all
the packages available for DOS/R. Simply click on the desired package and it
will install. It's wonderful!
- linux/unix/R: At the prompt, enter CRAN.packages() (available.packages()
in current versions of R) to get a
list of packages available for linux/unix/R. After identifying the name of the
desired package, enter install.packages("package name"). R will download
the package from CRAN and attempt to install it in the default library location.
You may need to be root or have su status to install in this directory, so you
probably want to su first, then start R and do the install. If you don't have
su status, you can specify a different location for the library with
install.packages("package name", lib="directory"), where you substitute the
actual package name and a real directory, e.g.
install.packages("pcurve",lib="~/Rlib") to install the pcurve package
into a directory you have write permission for called ~/Rlib.
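Installing into a personal library in R can be sketched as below. The directory is created under R's temporary directory purely so the example is harmless to run; in practice you would use a permanent location such as ~/Rlib. The download itself is wrapped in interactive() (my choice) so that it only runs when you type it at the prompt with a network connection.

```r
# Personal library directory (hypothetical location; change to suit)
my.lib <- file.path(tempdir(), "Rlib")
dir.create(my.lib, showWarnings = FALSE)

# Tell R to search this directory for packages as well
.libPaths(c(my.lib, .libPaths()))

# Download and install only when run by hand; requires a network connection
if (interactive()) {
  install.packages("vegan", lib = my.lib)
  library(vegan, lib.loc = my.lib)
}
```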
Libraries and Packages for Vegetation Ecology
At present, there are two libraries or packages available specifically for
vegetation ecology: vegan from Jari Oksanen and labdsv from Dave
Roberts. At present, vegan is only available for R, but many of the
routines should work in S-Plus with a little work.
vegan is available at CRAN
http://cran.r-project.org/, and
labdsv is available at http://labdsv.nr.usu.edu/.
Between the two of them they provide improved PCA, PCO, NMDS, CA, CCA, FSO, DECORANA, and
a number of other utilities. We will make extensive use of them in subsequent
labs.
On With The Good Stuff
This has been a trivial introduction to an expansive statistical language, but
my intention is to bring this power to vegetation ecologists, and this is more
easily done by example than continued abstract presentation. Accordingly,
further insights into S will be included in specific exercises as appropriate.
Begin with lab1
Differences Between S-Plus and R
The first obvious difference is that S-Plus is a commercial program, while R is
an open source package. The practical significance of that difference is that
S-Plus is not free, while R is. In fact, you cannot buy S-Plus, but only lease
it for a year at a time; when the year is up you must pay for the renewal of
your license to keep your existing copy of S-Plus running. For commercial
enterprises S-Plus is quite expensive, but for academic use it is much more
affordable, and it is available to university students at a significant
discount.
The second major difference is in maturity. S-Plus has been a successful
commercial package for many years, and most elements of it work flawlessly due
to extensive debugging over time. Nonetheless, there have been problems with
incompatibility between versions with upgrades, and the initial port to Windows
was quite problematic (now sorted out). R, on the other hand, is more recent,
and because it is an open source project, the development has been much more
diffuse, with contributions from a great many people. Overall, the management
of the project has been extraordinarily good for an open source project.
As a consequence, R has changed dramatically in capability and stability over the
last few years, and is now very stable and solid. Partly because R was
designed more recently and has no legacy code to support, it is much simpler and
cleaner in many aspects. Creating new libraries and adding libraries is much
simpler, and there is a central repository for contributed libraries.
Practical Differences
As a user, what sort of differences will you observe? Several, but generally
trivial.
- Objects --- Both S-Plus and R store variables, functions, etc. as objects
internally. The way these are stored is quite different, however. S-Plus
creates a directory called .Data, and each object is stored as a file in that
directory. S-Plus stores the object immediately when it's created.
R creates a file called .RData, and stores
each object as a binary record in the file. More importantly, R only commits
the objects to disk when you issue a save() function, or optionally
when you log out and specify the save workspace option.
There are two logical consequences to this. One, if the program crashes (most
likely from running buggy C or FORTRAN code inside a function), R will lose all
of the objects created since the last save, whereas S-Plus will not. In
addition, the fact that S-Plus stores objects as files means that they are
available to the operating system file system utilities, whereas R objects are
not. The practical significance of this is that S-Plus objects have dates and
sizes attached to them. Often, after working on a particular dataset for a
long period of time, I have created hundreds of objects. By calling the
operating system (at least for unix/linux) I can get help. For example, if I
can't remember the exact name of a specific object,
!ls -lt .Data/r*
will give me a list of all objects that start with the letter "r", sorted by
date of last modification (most recent first), along with their sizes.
Sometimes this is really helpful.
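On the R side, the explicit save and the lack of file-level listings can be sketched as follows. The object names and the temporary file are invented for illustration; save(), load(), ls() and object.size() are standard R functions. R does not record a date for each object, so only names and sizes can be recovered this way.

```r
# Hypothetical objects; until they are saved they exist only in memory.
richness <- c(12, 7, 9)
releves  <- data.frame(plot = 1:3, cover = c(40, 55, 10))
elev     <- c(1200, 1350, 1500)

# Commit the objects to disk explicitly. A temporary file is used so that
# an existing .RData is not overwritten; save.image() would instead write
# the whole workspace.
ws_file <- tempfile(fileext = ".RData")
save(richness, releves, elev, file = ws_file)

# Mimic a crash by removing the objects, then restore them from the file.
rm(richness, releves, elev)
loaded_names <- load(ws_file)   # load() returns the names it restored

# Rough analogue of "ls -l .Data/r*": names and sizes (bytes) of objects
# whose names start with "r". There is no date to sort by.
obj_names <- ls(pattern = "^r")
obj_sizes <- sapply(obj_names, function(n) as.numeric(object.size(get(n))))
print(sort(obj_sizes, decreasing = TRUE))
```

Anything created after the call to save() would be lost in a real crash, which is the practical reason to save at regular intervals during a long session.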
- Help Files --- In S-Plus, help files (in recent versions) are written in a
specified version of SGML. On unix or linux, the help appears inside your xterm,
and generally uses "less" for scrolling, with some keyboard shortcuts for hyperlinks.
R uses a multi-format approach, where help files often exist in a "man page"
style (very much like the S-Plus SGML approach), as HTML for Netscape, or in dvi
format for TeX previewers such as xdvi. The ability to pop up help files in
Netscape is very helpful (although many Windows users will have to obtain a copy
of Netscape, as Explorer is not supported). The inclusion of help files as dvi
files means that users can print high quality documentation in postscript
straight from the help files. This is a very nice feature.
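As a rough illustration of the R help system, the sketch below assumes a reasonably current version of R, where the format is chosen with the help_type argument of help() ("text", "html" or "pdf"). The topic and the search pattern are arbitrary examples.

```r
# Look up a help page by exact topic name. In an interactive session,
# printing the result (or typing ?mean) displays the page.
h <- help("mean", help_type = "text")

# help_type = "html" requests the browser version instead, and
# help.start() opens the HTML index of all installed documentation.

# When the exact name is forgotten, search function names on the search
# path by regular expression.
readers <- apropos("^read\\.")
print(head(readers))
```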
- Libraries --- Over the years, libraries (see the discussion of libraries above
if you're not familiar with them) that have proven popular and useful have been
integrated directly into S-Plus, and the functions they include are now
available without an explicit requirement to load a library. Until recently, R,
on the other hand, followed a minimalist philosophy where the base installation
was stripped-down, and where libraries had to be loaded for many functions that
are included in S-Plus. Recently, and certainly by the time of R 2.0, many of
the libraries have been included in base R, similar to S-Plus.
In addition, I find libraries MUCH easier to install in R
than in S-Plus, and once installed, they can be automatically loaded into your
session every time you start R.
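Installing and loading a library in R might look like the sketch below. The package name vegan is only an illustration of a contributed library, and the installation line is commented out because it needs a network connection. The splines library is used for the loading step simply because it ships with every R installation.

```r
# One-time installation from the central repository; left commented out
# because it requires network access.
# install.packages("vegan")

# Load an installed library for the current session.
library(splines)

# Confirm that it is now attached to the search path.
print("package:splines" %in% search())

# To have a library loaded automatically at startup, a line such as the
# following can be placed in a .Rprofile file in the home directory:
# options(defaultPackages = c(getOption("defaultPackages"), "splines"))
```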
- Functions --- Many (actually, nearly all) of the functions that exist in
S-Plus and R are based on the same code. Many were originally written for
S-Plus, and then ported to R. More recently, however, I suspect that many more
new libraries are written for R.
Occasionally, routines will exist in S-Plus and R
that have the same function name but which are not the same code. For
vegetation ecologists, the most important example is gam() for
"Generalized Additive Models". In S-Plus the gam() function is
based on code from Hastie and Tibshirani (1990). Until recently, in
R the gam() function was based on code from Wood (2000).
Now, R users can choose between the Wood algorithm in package mgcv and the
Hastie and Tibshirani algorithm in package gam. The algorithms are
different, but the function names are identical.
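Because the two gam() functions share a name, which one runs depends on which library is attached. The sketch below fits the mgcv version to invented presence/absence data and then checks where the name gam resolves. The data, variable names and model are assumptions made purely for illustration.

```r
# Simulated presence/absence along an elevation gradient (invented data).
set.seed(42)
elev    <- runif(200, 1000, 2500)
present <- rbinom(200, 1, plogis((elev - 1750) / 200))
dat     <- data.frame(elev = elev, present = present)

if (requireNamespace("mgcv", quietly = TRUE)) {
  library(mgcv)   # attaches Wood's gam() and its s() smoother
  fit_wood <- gam(present ~ s(elev), family = binomial, data = dat)
  # Check which package the name gam currently resolves to.
  print(environmentName(environment(gam)))
}

# With package gam installed, the Hastie and Tibshirani version would be
# reached the same way in a fresh session:
#   library(gam)
#   fit_ht <- gam(present ~ s(elev, 4), family = binomial, data = dat)
# Attaching both packages in one session makes one gam() mask the other.
```

Loading only one of the two libraries per session is the simplest way to be sure which algorithm produced a given fit.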
- Graphics --- The default screen graphic for S-Plus under unix and linux is a
black screen with yellow text and glyphs. The color scheme, in order, is
yellow, cyan, magenta, green, blue, red. The default glyph for points is the
asterisk. In R, the default is a white background with black text and glyphs;
the default color scheme is black, red, green, blue, cyan, magenta, yellow.
The default glyph is an open circle. In both systems the graphics are nearly
infinitely configurable, and you can have anything you want. Initially I found
the S-Plus defaults easier on the eyes on a CRT or in a browser. Over time, however,
I have become much more used to the R defaults. In addition, it
takes more care to convert the S-Plus graphics to hardcopy (e.g. postscript
output); the R graphics print just like they look. You will see
examples of both systems in the accompanying material, but over time the
material has become increasingly dominated by R.
In addition, as noted above, the list of graphic devices differs between the two
programs. S-Plus has a broader range of hardcopy output devices, and R a
better selection of web-compatible graphics formats.
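As one example of how configurable the graphics are, the S-Plus look described above can be approximated in R. This is a minimal sketch: the helper name splus_like is hypothetical, pch = 8 is R's asterisk-like symbol rather than an exact match for the S-Plus glyph, and a null PDF device is used so that no file is written.

```r
# Hypothetical helper: give the current device S-Plus-like defaults.
splus_like <- function() {
  # Color order described above for S-Plus on unix/linux.
  palette(c("yellow", "cyan", "magenta", "green", "blue", "red"))
  # Black background, yellow annotation, asterisk-like glyph (pch = 8).
  par(bg = "black", fg = "yellow", col.axis = "yellow",
      col.lab = "yellow", col.main = "yellow", pch = 8)
}

pdf(file = NULL)        # null PDF device, so no file is created
old_pal <- palette()    # remember the current palette
splus_like()
plot(1:6, 1:6, col = 1:6, xlab = "index", ylab = "value")
invisible(dev.off())
palette(old_pal)        # restore the previous palette
```

On an interactive screen device the same call to splus_like() before plot() gives the yellow-on-black appearance. The par() settings apply to the current device only, while the palette stays changed until it is reset.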
Hastie, T. and Tibshirani, R. 1990. Generalized Additive
Models. Chapman and Hall.
Wood, S.N. 2000. Modelling and smoothing parameter estimation with
multiple quadratic penalties. Journal of the Royal Statistical Society,
Series B 62(2):413-428.
Wood, S.N. and Augustin, N.H. 2002. GAMs with integrated model selection using
penalized regression splines and applications to environmental modelling.
Ecological Modelling 157(2-3):157-177.