Re: [R] How do I paste double quotes arround a character string?

From: Philip James Smith <philipsmith_at_alumni.albany.edu>
Date: Thu, 03 Jul 2008 09:33:12 -0400

R Community:

At the risk of getting my hands slapped by posting "too much" on the forum, I've described the strategy for reading only certain portions of huge .csv files below.

I think this could well be of interest to others. I'm sure I'm not alone in needing to read only certain variables (i.e., columns) from VERY large .csv files.

Charles Berry, Ted Harding, and Brian Ripley suggested using the unix "cut" command along with the R pipe() function. Their advice has been invaluable.

With the code as I've written it so far, I'm finding that the "cut" command is not reading the file properly... or at least not in the manner that I'm expecting.

Here was my strategy:
*STEP 1. read the whole huge file --- (almost impossible! even with a very good computer!)
STEP 2. use the pipe and cut commands to read only the desired columns of the file
STEP 3. compare results by tabulating a variable from the whole file with the file obtained in (2)*

I found that the comparison gave different tabulations! :-(

I've provided my code below. I'd be quite grateful for suggestions on how to fix this.

My sincere thanks to all who have or will provide guidance on this problem.

Phil Smith
Duluth, GA  

*## STEP 1: read the whole huge file*

##
## read the whole file
##

    your.file <- c("//home//philipsmith//mydata.csv")
    dat <- read.csv( file = your.file )

##
## read the names from the 1st line of the whole file
## that line contains all of the variable names
##

    col.namz <- c( scan( your.file , what=character(0), nlines=1 , sep=",") )

##
## check to see whether all of the column names from the whole file
## are the same as in col.namz
##

     all( col.namz == names(dat))

##
## they are!! :-)
##
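
A side note on the check above: `==` recycles silently when the two vectors differ in length, and read.csv() by default passes header names through make.names(), so raw header text and names(dat) only agree when the header already consists of syntactically valid names. A minimal sketch of a stricter comparison follows. The temporary file and its header are hypothetical and are not taken from mydata.csv.

```r
# Hypothetical two-line CSV whose header contains names R would rewrite.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,EST IAP07,2007-rate",
             "1,10,0.5"), tmp)

# Raw header text, exactly as it appears on the first line of the file
col.namz <- scan(tmp, what = character(0), nlines = 1, sep = ",", quiet = TRUE)

# Names after read.csv() has applied its default check.names = TRUE
dat <- read.csv(tmp)

# identical() also fails on a length mismatch, unlike all(a == b)
raw.match   <- identical(col.namz, names(dat))
fixed.match <- identical(make.names(col.namz, unique = TRUE), names(dat))
```

If `raw.match` is FALSE but `fixed.match` is TRUE, the positions found by matching against the raw header are still the right ones to hand to cut, since cut works on the file as written.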

*## STEP 2: use the pipe and cut commands to read only the desired columns of the file*
##
## designate which variable names are to be read
## using the unix command "cut" and the function pipe()
##

    colz <- c("ESTIAP07" )

##
## find the column numbers in the whole file that correspond to
## the variables designated to be read by the unix command
## and specified in the colz vector
##

    col.pos <- match( colz , col.namz , nomatch=0 )

    ##
    ## the following line is commented out,
    ## since for this example the number of designated variables
    ## by colz is only 1 variable
    ##
    ## col.pos        <-    paste( col.pos , collapse=',' )

##
## character string of file name for unix read with cut function
##

    fn <- c("/home/philipsmith/mydata.csv")

##
## create a character vector of the unix command
##

    unix.cmd <- paste( "cut -d, -f" , col.pos , " " , fn , sep = '' )
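
For more than one variable, the same command can be assembled from a comma-separated field list. The sketch below is only an illustration: the header names other than ESTIAP07 are hypothetical, unmatched names are dropped because cut rejects field 0, and shQuote() is added as a precaution for file names containing spaces.

```r
# Hypothetical header; only ESTIAP07 appears in the original post.
col.namz <- c("ID", "STATE", "ESTIAP07", "AGEGRP")
colz     <- c("ESTIAP07", "ID", "NOTTHERE")

# Positions of the requested names; names that are not found come back as 0
col.pos <- match(colz, col.namz, nomatch = 0)

# Drop the zeros (cut numbers fields from 1) and sort for readability;
# cut writes the selected fields in file order whatever order is given
col.pos <- sort(col.pos[col.pos > 0])

# Collapse to the "1,3" form that cut expects after -f
field.list <- paste(col.pos, collapse = ",")

fn <- "/home/philipsmith/mydata.csv"
unix.cmd <- paste("cut -d, -f", field.list, " ",
                  shQuote(fn, type = "sh"), sep = "")
unix.cmd
```

Because cut emits columns in file order, the data frame read through pipe() will have its columns in the order of col.namz, not in the order of colz.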

##
## read the designated columns, only, from the whole file
## using pipe() and the unix command cut
##

    gnu.dat <- read.csv( pipe ( description=unix.cmd ) )
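
For comparison, here is an R-only way to keep just the designated columns. This is a different technique from the pipe()/cut approach above: read.csv() still scans every line, but columns whose colClasses entry is "NULL" are skipped and never stored, and CSV quoting is honoured. The small temporary file is hypothetical and only stands in for mydata.csv.

```r
# Hypothetical stand-in for the large file; one field contains a quoted comma.
tmp <- tempfile(fileext = ".csv")
writeLines(c('ID,NAME,ESTIAP07',
             '1,"Smith",10',
             '2,"Jones, Jr.",11',
             '3,"Brown",12'), tmp)

col.namz <- scan(tmp, what = character(0), nlines = 1, sep = ",", quiet = TRUE)
colz     <- c("ESTIAP07")

# "NULL" tells read.csv to skip a column; NA means "guess the type as usual"
keep <- ifelse(col.namz %in% colz, NA, "NULL")

sub.dat <- read.csv(tmp, colClasses = keep)
table(sub.dat$ESTIAP07)
```

Whether this is fast enough on a very large file would need to be tried; it mainly saves memory, not parsing time.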

*## STEP 3. compare results by tabulating a variable from the whole file with the file obtained in (2)*
##
## tabulate the designated variable from the whole file
##

    table( dat$ESTIAP07 )

##
## tabulate the designated variable from the file
## that has the designated columns, only
##

    table( gnu.dat$ESTIAP07 )

 > table( dat$ESTIAP07 )

   1   2   4   5   6   7   8  10  11  12  13  14  16  17  18  19  20  22  24  25
 340 278 304 319 334 295 405 342 519 474 413 476 511 322 517 393 364 377 447 425
  27  28  29  30  31  34  35  36  37  38  40  41  44  46  47  49  50  51  52  53
 462 382 368 502 385 494 454 497 484 385 360 419 355 466 461 369 372 431 384 331
  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  72  73  74  75  76
 478 468 348 323 363 287 322 364 317 363 423 337 409 312 370 360 348 309 244 300
  77  79  80 773
 307 454 445 340

 >
 > ##
 > ## tabulate the designated variable from the file
 > ## that has the designated columns, only
 > ##
 > table( gnu.dat$ESTIAP07 )

   1   2   3   4   5   6   7   8  10  11  12  13  14  16  17  18  19  20  22  24
 342 291   1 308 319 334 295 405 341 518 471 413 476 511 322 517 393 363 377 446
  25  27  28  29  30  31  34  35  36  37  38  40  41  44  46  47  49  50  51  52
 425 461 382 368 502 385 494 454 496 483 385 360 419 354 466 461 369 371 431 384
  53  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  72  73  74  75
 331 478 467 348 322 363 287 320 364 317 363 423 337 408 312 368 360 347 309 243
  76  77  79  80 157 773
 300 307 454 445   1 340
 > ?pipe
 >
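
One possible explanation, which nothing above confirms, is that some lines of the file contain commas inside quoted fields. read.csv() treats such a comma as part of the field, whereas `cut -d,` splits on every comma, so on those lines the requested field number points at a different column. Stray values such as the 3 and 157 that each appear once in the second table would be consistent with that. A rough way to look for such lines is to count fields with and without quote handling. The sketch below uses a small hypothetical file in place of mydata.csv.

```r
# Hypothetical file; line 3 has a comma inside a quoted field.
tmp <- tempfile(fileext = ".csv")
writeLines(c('ID,NAME,ESTIAP07',
             '1,"Smith",10',
             '2,"Jones, Jr.",11',
             '3,"Brown",12'), tmp)

# Fields per line when double quotes are honoured (as read.csv does)
n.quoted <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")

# Fields per line when every comma is a separator (as cut -d, does)
n.raw <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Lines on which the two views of the file disagree
suspect.lines <- which(n.quoted != n.raw)
suspect.lines
```

If `suspect.lines` is empty on the real file, the discrepancy has some other cause and this check rules out only the quoted-comma case.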



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 03 Jul 2008 - 13:39:29 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Jul 2008 - 14:31:03 GMT.
