[R] Importing subsets of rows into R from a large text data file

From: Marc Schwartz <MSchwartz_at_medanalytics.com>
Date: Sun 24 Apr 2005 - 02:32:16 EST


Just an FYI here as I think that this approach sometimes gets overlooked.

When one has a very large dataset to read into R from a text file (as opposed to an external database that can be queried) and in actuality only needs a subset of sequential records for subsequent analysis, there are options to read.table() and family, which enable you to skip records in the incoming data file and also restrict the number of records to be read in, using the 'skip' and 'nrow' arguments.

Simple example:

# Create a data.frame with 20 rows
> df <- data.frame(rec.no = 1:20, Col1 = sample(letters, 20),

+                  Col2 = sample(1:100, 20))

> df

   rec.no Col1 Col2

1       1    m   30
2       2    s   44
3       3    o    1
4       4    i   45
5       5    v   97
6       6    x   34
7       7    f   91
8       8    r    4
9       9    u   99
10     10    g   81
11     11    k   64
12     12    d   68
13     13    c   96
14     14    b   13
15     15    z   15
16     16    a   35
17     17    h   11
18     18    t   67
19     19    l   93
20     20    e   37


# Write it to a text file
> write.table(df, "df.dat", row.names = FALSE)

# Now read in recs 5:10
# Note that the header row (1) plus 4 data rows (total 5)
# are skipped here. Then 6 rows are read in from that point

> read.table("df.dat", skip = 5, nrow = 6)
  V1 V2 V3
1 5 v 97
2 6 x 34
3 7 f 91
4 8 r 4
5 9 u 99
6 10 g 81

Note that the header row is not read here, therefore generic colnames are created.

One quick way around this is to use the 'col.names' argument in read.table() to set the colnames for the incoming data in this fashion:

> read.table("df.dat", skip = 5, nrow = 6,

+             col.names = colnames(read.table("df.dat", nrow = 1, 
+                                             header = TRUE)))
  rec.no Col1 Col2
1      5    v   97
2      6    x   34
3      7    f   91
4      8    r    4
5      9    u   99
6     10    g   81

The construct:

> colnames(read.table("df.dat", nrow = 1, header = TRUE))
[1] "rec.no" "Col1" "Col2"

gets the colnames from the header row in the text file and then assigns them to the subset of data that is read in using the 'col.names' argument.

This is a very simple example, but with a large dataset and with perhaps RAM resource restrictions on your computer, this might be helpful in certain scenarios.

Needless to say the use of post-import indexing is easy and the subset() function brings to bear a great deal of flexibility post import, when more complex logic is required to subset rows based upon multiple parameters and/or where you might also only want a subset of columns.

See ?read.table, ?Extract and ?subset for more information.

HTH, Marc Schwartz



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Sun Apr 24 02:36:54 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:31:25 EST