R for Ecologists

Reading data with `read.table()` versus `read.csv()`

Given the prevalence of spreadsheets, data are often entered or stored in a spreadsheet and then exported as CSV (comma-separated values) files. This is somewhat unfortunate because commas might also occur in text strings in the file, requiring that those strings be enclosed in quotes. This is a dumb convention, but it's widespread, so it's best to know how to deal with it.

read.table() is the workhorse function for getting data into R. It has a LARGE number of arguments (enter ?read.table to see) with mostly reasonable defaults. read.csv() calls read.table() but changes the values of some of the arguments. Specifically, as its name indicates, it changes the default separator from white space (tabs or spaces) to a comma, i.e. sep=",". This is what you want, and it saves you from having to type sep="," in your read.table() function call. However, it also changes

fill = TRUE
comment.char = ""

Both of these changes are problematic. First, fill = TRUE says that if some of your lines have fewer columns than others, that's OK: just fill out the missing columns with NAs. That is disastrous. If some of your rows have fewer columns than others, something is wrong with your data and you should stop and fix it. For example, if you have a sparse matrix or a data set with lots of blanks in it, a CSV file will have ",,,," in it. If somehow one of those commas gets deleted, everything to the right would shift over one column to the left, the row would be shorter by one column, and read.csv() would not care. Your data are now a mess, but you got no warning. Even if all the ",,," occur at the end of the row and there is nothing to the right to displace, you should be concerned that something is wrong with your data and fix it.
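The silent shift described above can be reproduced with a few lines of made-up data. This is only a sketch: the file contents, the column names (site, a, b, c) and the temporary file are invented for illustration, and the row for site "s2" is assumed to have lost one comma (it should read "s2,,,5").

```r
# Hypothetical data: the "s2" row is one comma short,
# so its 5 sits one column too far to the left.
csv_lines <- c(
  "site,a,b,c",
  "s1,1,,4",
  "s2,,5",
  "s3,2,3,9"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# read.csv() uses fill = TRUE: the short row is padded and nothing is reported.
loose <- read.csv(path)
print(loose)

# read.table() keeps fill = FALSE, so the same file stops with an error.
# tryCatch() is used here only so the script can carry on and show the result.
strict <- tryCatch(
  read.table(path, sep = ",", header = TRUE),
  error = function(e) e
)
print(inherits(strict, "error"))

# count.fields() gives the number of fields on each line of the file,
# which is a quick way to locate the damaged row.
n_fields <- count.fields(path, sep = ",")
print(which(n_fields != n_fields[1]))
```

If you like read.csv()'s comma default but want read.table()'s strictness, one option is to pass the argument explicitly, as in read.csv(path, fill = FALSE).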

comment.char = "" means that if "#" occurs in your data it will be interpreted just as any other character, not as a comment. This occurs when people use "#" to mean "number of" and put column headings in their data like

#species #trees
1 5 100
2 12 83

read.table() will ignore everything past the first "#", and name the columns V1 and V2. If you have a "#" somewhere else in your data, read.table() will stop reading that row at the "#", the row will then be too short, and read.table() will exit with an error. read.csv() would just keep going, and rename the columns "X.species" and "X.trees". Maybe that's a good thing, but I strongly believe that you should use column headings that are legitimate R variable names.
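The two behaviours can be compared side by side. The sketch below uses two small invented CSV files (the values and the plot, note and trees columns are hypothetical): one with "#" in the column headings and one with a "#" inside a text field.

```r
# Hypothetical file 1: "#" used in the column headings.
heading_path <- tempfile(fileext = ".csv")
writeLines(c("#species,#trees", "5,100", "12,83"), heading_path)

# read.table() keeps comment.char = "#": the heading line is treated as a
# comment, so the columns get default names.
as_table <- read.table(heading_path, sep = ",")
print(names(as_table))

# read.csv() sets comment.char = "": the heading line is read, and the
# names are converted to syntactically valid ones.
as_csv <- read.csv(heading_path)
print(names(as_csv))

# Hypothetical file 2: a "#" inside a text field in the body of the data.
body_path <- tempfile(fileext = ".csv")
writeLines(c("plot,note,trees", "1,tree #7 dead,100", "2,ok,83"), body_path)

# read.table() cuts the first data row at the "#", finds it too short, and
# stops. tryCatch() is used here only so the script can carry on.
strict <- tryCatch(
  read.table(body_path, sep = ",", header = TRUE),
  error = function(e) e
)
print(inherits(strict, "error"))

# read.csv() reads the "#" as an ordinary character.
loose <- read.csv(body_path)
print(loose$note)
```

Renaming the headings in the file itself (for example to n_species and n_trees, names chosen here only as an illustration) avoids the problem under either function.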
