Re: [R] problem with white space

From: jim holtman <jholtman_at_gmail.com>
Date: Sun, 30 Mar 2008 18:28:08 -0500

How long is it taking? Can you send me the code that you are using?

Another technique is to recode your characters into numbers and store them as integers (here, bytes of class 'raw'). You can then sample the values and reconstruct the output. Here is a faster way:

# create some test data -- might be read in with readLines
# use 'raw' class for the data.
sdata <- sapply(1:10, function(x){
    # encode the characters as numbers
    charToRaw(paste(sample(LETTERS, 50, TRUE), collapse=""))
})
# now create 10 samples of size 100000 and write in files
for (i in 1:10){
    x <- sample(sdata, 100000, TRUE)
    # convert back to characters
    writeLines(rawToChar(x), con=paste("file", i, sep=''))
}
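
A minimal sketch of how the raw-vector idea above might be extended to read the data once with readLines, loop over the replicates, and write each one as 50-character lines with no spaces. The helper name resample_lines, the stand-in input file, the A/C/G/T alphabet, the output names and the replicate count of 3 are all assumptions made for illustration; they are not from the original code.

```r
# Assumption: the input is plain text, 50 characters per line, as described
# in the original question. A small stand-in file is created here.
set.seed(1)
in_file <- tempfile(fileext = ".txt")
writeLines(replicate(100, paste(sample(c("A", "C", "G", "T"), 50, TRUE),
                                collapse = "")),
           con = in_file)

# read once, join the lines, and keep the sequence as a raw vector
raw_data <- charToRaw(paste(readLines(in_file), collapse = ""))

# Hypothetical helper: draw n bytes with replacement and return them
# as strings of at most `width` characters each.
resample_lines <- function(raw_data, n, width = 50) {
    # sample positions rather than values
    idx <- sample.int(length(raw_data), n, replace = TRUE)
    s <- rawToChar(raw_data[idx])
    starts <- seq(1, n, by = width)
    substring(s, starts, pmin(starts + width - 1, n))
}

# one loop replaces the repeated write() calls; use 1:500 for a real run
out_dir <- tempdir()
for (i in 1:3) {
    rep_lines <- resample_lines(raw_data, n = 2000)
    writeLines(rep_lines,
               con = file.path(out_dir, paste("Rep", i, ".txt", sep = "")))
}
```

Each output file then holds the resampled characters as consecutive 50-character lines. Whether this is fast enough at 1-5 million characters per replicate would need to be timed on the real data, for example with system.time().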

On Sun, Mar 30, 2008 at 6:06 PM, Suraaga Kulkarni <suraaga.kulkarni_at_gmail.com> wrote:
> Jim,
>
> Thanks very much. I am very new to R and am trying to understand your code.
> It works perfectly on your sample data of course. I tried your code on my
> data. While it works, it takes too much time to generate each replicate.
> At present I'm outputting the replicates with only 2000 resampled
> characters. I actually need to resample something like 1-5 million
> characters. I work with the human genome, and need to generate 500
> bootstrap replicates of a scaled down version (about 2%) of each chromosome
> by means of resampling with replacement.
>
> Sorry about the cryptic code but I thought my initial description of the
> problem explained it. In any case, your guess was correct.
>
> Let me see if I can rework your code to suit my purposes. In the meanwhile,
> if you have any other suggestions, I'll be happy to hear them.
>
> Thanks again for the prompt response.
>
> S.
>
>
>
> On Sun, Mar 30, 2008 at 6:15 PM, jim holtman <jholtman@gmail.com> wrote:
> > Here is one way of doing it. I would suggest that you read in the
> > data with readLines and then combine it into one single string so that
> > you can use substring on it. Since you did not provide
> > commented, minimal, self-contained, reproducible code, I will take a
> > guess at what your data looks like:
> >
> > # create some test data -- might be read in with readLines
> > sdata <- sapply(1:10, function(x){  # 10 lines of strings with 50 characters
> >     paste(sample(LETTERS, 50, TRUE), collapse='')
> > })
> > # put into one large string so you can do substring on it
> > sdata <- paste(sdata, collapse='')
> > # now create 10 samples of size 20 and write in files (file1, file2, ... file10)
> > for (i in 1:10){
> >     x <- sample(nchar(sdata), 20)
> >     writeLines(paste(substring(sdata, x, x), collapse=''),
> >         con=paste("file", i, sep=''))
> > }
> >
> > On Sun, Mar 30, 2008 at 3:41 PM, Suraaga Kulkarni
> > <suraaga.kulkarni_at_gmail.com> wrote:
> > > Hi,
> > >
> > > I need to resample characters from a dataset that consists of an extremely
> > > long string that is written over hundreds of thousands of lines, each of
> > > length 50 characters. I am currently doing this by first inserting a space
> > > after each character in the dataset and then using the following commands:
> > >
> > > y <- as.matrix(read.table("data.txt"), stringsAsFactors=FALSE)
> > > bstrap <- sample(length(y), 100000, TRUE)
> > > write(y[bstrap], file="Rep1.txt", ncolumns=50, append=FALSE)
> > > bstrap <- sample(length(y), 100000, TRUE)
> > > write(y[bstrap], file="Rep2.txt", ncolumns=50, append=FALSE)
> > > bstrap <- sample(length(y), 100000, TRUE)
> > > .
> > > .
> > > .
> > > and so on for 500 reps.
> > >
> > >
> > > I think there should be a better way of doing this. My specific questions:
> > >
> > > 1. Is there a way to avoid inserting spaces between the characters before
> > > calling the "sample" command (because I don't want spaces between the
> > > resampled characters in the output either; see number 2 below)?
> > >
> > > 2. If I have no choice but to insert the spaces in my data before
> > > resampling, is there a way to output the resampled data without spaces,
> > > simply as 50-character long strings one below the other? I tried inserting
> > > the following command: strip.white=TRUE in the write command line, but it
> > > gave me an error as it did not understand the command.
> > >
> > > 3. Finally, since I have to get 500 such resampled reps from each dataset
> > > (and there are over 20 such huge datasets), is there a way around having to
> > > write a separate write command for each rep?
> > >
> > > Any suggestions will be greatly appreciated.
> > >
> > > Thanks,
> > >
> > > S.
> > >
> > > ______________________________________________
> > > R-help_at_r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem you are trying to solve?
> >
>
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

Received on Sun 30 Mar 2008 - 23:32:20 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 31 Mar 2008 - 01:30:26 GMT.
