Re: [R] Compare two data sets

From: jim holtman <jholtman_at_gmail.com>
Date: Wed, 26 Mar 2008 05:58:56 -0500

The easiest way to find out is to try it and time it. Here is a case where I generated two sets of data with 120,000 character strings each (just random numbers converted to character strings) and then asked for their intersection. It came up with 3 matches in about 0.2 seconds. That would seem fast enough, unless you plan to do this operation tens of thousands of times:

> x <- as.character(runif(120000))
> y <- as.character(runif(120000))
> system.time(z <- intersect(x,y))
   user  system elapsed
   0.22    0.00    0.22
> str(z)
 chr [1:3] "0.289942682255059" "0.75132836541161" "0.638638160191476"
>

Here is the timing when you get about 50,000 matches (48,908 in this run); it is about the same:

> x <- as.character(round(runif(120000),5))
> y <- as.character(round(runif(120000),5))
> system.time(z <- intersect(x,y))
   user  system elapsed
    0.2     0.0     0.2
> str(z)
 chr [1:48908] "0.08385" "0.62639" "0.47603" "0.18578" "0.89447" "0.58435" "0.15297" ...
>
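
If you need to know which rows of the sample have a counterpart in the reference, rather than just the set of common values, something along these lines may help. This is only a sketch: the data frames `mysample` and `myref` are made-up stand-ins for your files, and the column name `id` is an assumption, so substitute whatever key column your data actually uses.

```r
# Hypothetical data: 'myref' stands in for the 112189-row reference file,
# 'mysample' for the smaller sample; the column name 'id' is an assumption.
set.seed(1)
myref    <- data.frame(id = as.character(round(runif(112189), 5)),
                       stringsAsFactors = FALSE)
mysample <- data.frame(id = as.character(round(runif(1000), 5)),
                       stringsAsFactors = FALSE)

# Logical vector: TRUE where a sample id also occurs in the reference
in_ref <- mysample$id %in% myref$id

# Rows of the sample that have a counterpart in the reference
matched <- mysample[in_ref, , drop = FALSE]

# match() gives the position of the first matching reference row (NA if none)
pos <- match(mysample$id, myref$id)

# Time the lookup the same way as the intersect() calls
system.time(in_ref2 <- mysample$id %in% myref$id)
```

Unlike intersect(), which drops duplicates and returns only the common values, `%in%` keeps one result per row of the sample, so it can be used directly to subset the data frame.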

On Tue, Mar 25, 2008 at 10:28 PM, Suhaila Zainudin <suhaila.zainudin_at_gmail.com> wrote:
> Hi,
>
> Thanks for the feedback. I have tried it on the small size sample and ref
> and it works. Now I want to use a larger dataset for myref (the reference
> file). The reference file contains 112189 rows. Can I use the same approach
> that works for the small example? Or are there other alternatives when
> dealing with data of that magnitude?
>
>
> --
> Suhaila Zainudin
> PhD Candidate
> Universiti Teknologi Malaysia

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 26 Mar 2008 - 11:03:02 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 26 Mar 2008 - 11:30:27 GMT.
