Re: [R] FW: new to R: don't understand errors

From: Fridolin Wild
Date: Wed 04 Oct 2006 - 09:58:51 GMT

Hello Jerad,

> It was suggested I contact you for possible help with this issue. Well,
> as you can see for the emails below, that is what I was told at R-help.
> Any insight to my lsa problems (also listed below) would be of great
> help.

From what I see, the problem probably does lie within the text files: for performance reasons, it was not possible to include any "check" routines that exclude a file if it contains no words (or only words below a docFrequency) and thus produces an empty column vector.

I am pretty sure that you do not want to use docFrequency with a value like 50 (it would mean that a term in a document is only included if it appears more than 50 times in *that* document).

I will send you the alpha release of the updated lsa package in a separate message. It also includes a parameter called minGlobFreq, which filters out terms that appear fewer than x times in the whole document collection. I guess that is what you were looking for.
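
To make the difference between a per-document and a global frequency threshold concrete, here is a minimal base-R sketch on a hand-made term-by-document count matrix. It does not call the lsa package: tdm, filter_per_doc and filter_global are hypothetical names, and the "keep counts at or above the threshold" semantics is an assumption for illustration, not a statement of how the package implements it.

```r
# Toy term-by-document count matrix (rows = terms, columns = documents).
tdm <- matrix(
  c(3, 0, 1,
    1, 1, 1,
    0, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "dog", "emu"), c("d1", "d2", "d3"))
)

# Per-document threshold (assumed semantics): a count survives only if the
# term occurs at least min_doc_freq times within that one document.
filter_per_doc <- function(m, min_doc_freq) {
  m[m < min_doc_freq] <- 0
  m
}

# Global threshold (assumed semantics): a term survives only if its total
# count over the whole collection is at least min_glob_freq.
filter_global <- function(m, min_glob_freq) {
  m[rowSums(m) >= min_glob_freq, , drop = FALSE]
}

per_doc <- filter_per_doc(tdm, 2)
glob    <- filter_global(tdm, 3)

# Documents whose column is all zero after per-document filtering.
empty_docs <- colnames(per_doc)[colSums(per_doc) == 0]
```

In this toy case "dog" occurs once in every document, so the per-document filter wipes it out and leaves d3 as an empty column, while the global filter keeps it because it occurs three times overall.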

Regarding the sanitizing: if you set minDocFreq to 1 and minWordLength to 1, you should not get an error with your document collection, as you are then basically taking everything (even a single character appearing only once). That is probably not too problematic, as the LSA step will group these low-frequency terms in a lower-order factor anyway. Of course you will still get an error if you use documents that are completely empty, so delete all 0-byte documents beforehand.
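
One possible way to spot the 0-byte documents before building the matrix, using only base R. This is a sketch: find_empty_files is a hypothetical helper name, and the demo directory under tempdir() stands in for your own document folder.

```r
# Hypothetical helper: return the paths of all zero-byte files in a directory.
find_empty_files <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!file.info(files)$isdir]   # ignore sub-directories
  files[file.info(files)$size == 0]
}

# Small demo corpus: one file with text, one completely empty file.
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("some text here", file.path(corpus_dir, "a.txt"))
file.create(file.path(corpus_dir, "b.txt"))

empty <- find_empty_files(corpus_dir)
basename(empty)
# file.remove(empty)   # uncomment to actually delete them
```

The deletion is left commented out so that you can inspect the list first.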

I am thinking about what to do with this sanitizing part. It is not a good idea to integrate that into the textmatrix method -- it would slow things down tremendously.

So what about this idea: does it make sense to provide a collection of sanitizing methods that help to select the files you want to work with (copy them to a different directory, or just return a list with the filenames of the ones that are "good")? What should we do with other sanitizing options (deleting URLs from texts, deleting short words, etc.)?
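
To illustrate the "return a list of good filenames" variant, here is a rough base-R sketch. It is not part of the lsa package: usable_files is a hypothetical name, and the tokenisation (lower-casing, splitting on anything that is not a letter) is an assumption that will not match the package's own word splitting exactly.

```r
# Hypothetical helper: return the files in 'dir' that contain at least one
# word of min_word_length or more letters.
usable_files <- function(dir, min_word_length = 2) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!file.info(files)$isdir]
  keep <- vapply(files, function(f) {
    text  <- paste(readLines(f, warn = FALSE), collapse = " ")
    # Assumed tokenisation: lower-case, split on runs of non-letters.
    words <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
    any(nchar(words) >= min_word_length)
  }, logical(1))
  unname(files[keep])
}

# Demo: one usable file, one with no words, one empty.
demo_dir <- file.path(tempdir(), "sanitize_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("latent semantic analysis", file.path(demo_dir, "good.txt"))
writeLines("1 2 3 !", file.path(demo_dir, "numbers.txt"))
file.create(file.path(demo_dir, "empty.txt"))

good <- usable_files(demo_dir)
basename(good)
```

Returning filenames rather than copying files keeps the check separate from the matrix construction, so the slow part only runs when you ask for it.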

Hope I could be of help,


Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746

Received on Wed Oct 04 20:04:44 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 04 Oct 2006 - 10:30:06 GMT.
