[R] merge gives me too many rows

From: Denis Chabot <chabotd_at_globetrotter.net>
Date: Mon 18 Sep 2006 - 01:11:22 GMT


Hi,

I am using merge to add some variables to an existing dataframe. I use the option "all.x=F" so that my final dataframe will only have as many rows as the first file I name in the call to merge.
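For reference, here is a minimal sketch of what `all.x` controls in base R's `merge`, using made-up data frames (`x` and `y` are illustrative stand-ins, not the real `test` and `fish`). With `all.x = FALSE`, rows of the first data frame that have no match in the second are dropped; with `all.x = TRUE` they are kept and padded with `NA`. The option does not by itself fix the number of rows in the result.

```r
# Toy data, not the real 'test' and 'fish' data frames.
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(1, 2), b = c(10, 20))

# Rows of x without a match in y are dropped
inner <- merge(x, y, by = "id", all.x = FALSE)

# Unmatched rows of x are kept; their 'b' value is NA
left <- merge(x, y, by = "id", all.x = TRUE)

nrow(inner)
nrow(left)
```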

With a large dataframe using a lot of "by" variables, the number of rows of the merged dataframe increases from 177325 to 179690:

 > dim(test)
[1] 177325 9

 > test2 <- merge(test, fish, by=c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno"), all.x=F)
 > dim(test2)
[1] 179690 11
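One thing that might be worth checking, as an assumption rather than a confirmed diagnosis, is whether every combination of the "by" variables is unique in the second data frame. The sketch below uses a hypothetical stand-in (`fish_toy`, with two key columns instead of seven); the same calls could be pointed at the real data frame and key vector.

```r
# Hypothetical stand-in for 'fish', with two key columns instead of seven.
fish_toy <- data.frame(
  trait = c(1, 1, 2, 2),
  tagno = c(101, 102, 101, 101),  # (trait = 2, tagno = 101) appears twice
  len   = c(30, 32, 41, 43)
)
keys <- c("trait", "tagno")

# Index of the first duplicated key combination, or 0 if all are unique
first_dup <- anyDuplicated(fish_toy[, keys])

# Flag every row that shares its key combination with another row
dup <- duplicated(fish_toy[, keys]) |
  duplicated(fish_toy[, keys], fromLast = TRUE)
fish_toy[dup, ]
```

If `first_dup` is 0, the keys are unique and this explanation does not apply.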

I tried to make a smaller dataset with R commands that I could post here so that other people could reproduce the problem, but with it merge behaved as expected: the final number of rows was the same as the number of rows in the first dataframe named in the call to merge.

I took a subset of my large dataframe and could mail this to anyone interested in verifying the problem.

 > test3 <- test[100001:160000,]
 > dim(test3)
[1] 60000 9

 > test4 <- merge(test3, fish, by=c("predateu", "origin", "navire", "nbpc", "no_rel", "trait", "tagno"), all.x=F)
 > dim(test4)
[1] 60043 11

I compared test3 and test4 line by line. The first 11419 lines were the same (except for added variables, obviously) in both dataframes, but then lines 11420 to 11423 were repeated in test4. Then no problem for a lot of rows, until rows 45756-45760 in test3. These are offset by 4 in test4 because of the first group of extraneous lines just reported, and are found on lines 45760 to 45764. But they are also repeated on lines 45765 to 45769. And so on a few more times.

Thus merge added lines (repeated a small number of lines) to the final dataframe despite my use of all.x=F.
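A minimal sketch of how repeated rows like these can arise, using made-up data (`x_toy` and `y_toy` are hypothetical, and this is not a confirmed explanation for the real files): when a key value occurs more than once in the second data frame, merge returns one output row per matching pair, regardless of `all.x`. The last lines show one way to list the keys that occur more often in the merged result than in the first data frame.

```r
# Made-up data: one key (tagno = 2) occurs twice in y_toy.
x_toy <- data.frame(tagno = c(1, 2, 3), prey = c("a", "b", "c"))
y_toy <- data.frame(tagno = c(1, 2, 2, 3), len = c(30, 41, 43, 52))

m <- merge(x_toy, y_toy, by = "tagno", all.x = FALSE)
nrow(x_toy)
nrow(m)      # one more row than x_toy, because tagno = 2 matched twice

# Keys that occur more often in the merged result than in x_toy
extra <- table(m$tagno) - table(x_toy$tagno)
repeated_keys <- names(extra)[as.vector(extra) > 0]
repeated_keys
```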

Am I doing something wrong? If not, is there a solution? Not being able to merge is a setback! I was attempting to move the last few things I was doing with SAS to R...

Please let me know if you want the file test3 (2.3 MB as a csv file, but only 352 KB in R (.rda) format).

Sincerely,

Denis Chabot

 > R.Version()
$platform
[1] "powerpc-apple-darwin8.6.0"

$arch
[1] "powerpc"

$os
[1] "darwin8.6.0"

$system
[1] "powerpc, darwin8.6.0"

$status
[1] ""

$major
[1] "2"

$minor
[1] "3.1"

$year
[1] "2006"

$month
[1] "06"

$day
[1] "01"

$`svn rev`
[1] "38247"

$language
[1] "R"

$version.string
[1] "Version 2.3.1 (2006-06-01)"



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon Sep 18 11:17:01 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 18 Sep 2006 - 06:30:05 GMT.
