[R] Re constructing a dataframe from a database of newspaper articles

From: David Duffy <David.Duffy_at_qimr.edu.au>
Date: Mon 24 Jul 2006 - 09:19:16 EST

> From: Bob Green <bgreen@dyson.brisnet.org.au>
> I am hoping for some assistance with formatting a large text file which
> consists of a series of individual records. Each record includes specific
> labels/field names (a sample of 1 record (one of the longest ones) is
> below - at end of post). What I want to do is reformat the data, so that
> each individual record becomes a row (some cells will have a lot of text).
> For example, the column variables I want are (a) HD in one column
> (b) BY in one column (c) WC data in one column, (d) PD data in one
> column, (e) SC data in one column, (f) PG data in one column and (g) LP and TD
> text in one column - this column can contain quite a lot of text, e.g. 1900
> words. The other fields are unwanted.
> If there were 150 individual records, when formatted this would be a 7
> column by 150 row dataset.

Most transparently,

txt <- readLines("c:\\cm-mht1.txt")
no_of_records <- length(grep("^HD", txt))
res <- matrix(nrow=no_of_records, ncol=8)
idx <- 0
for (i in 1:length(txt)) {
  # each HD line starts a new record
  if (regexpr("^HD", txt[i])!=-1) idx <- idx+1

  if (regexpr("^HD", txt[i])!=-1) res[idx, 1] <- txt[i]
  if (regexpr("^BY", txt[i])!=-1) res[idx, 2] <- txt[i]
  ...
  if (regexpr("^TD", txt[i])!=-1) res[idx, 8] <- txt[i]
}
# merge the LP (column 7) and TD (column 8) text, then drop column 8
res[,7] <- paste(res[,7], res[,8], sep="; ")
res <- res[,-8]
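
The loop above stores only the line that carries the tag, so a field that runs over several lines (the LP and TD text, most likely) would lose everything after its first line. Below is a rough sketch of one way to keep the continuation lines. It is not tested against the real file. It assumes every field starts with a two-letter upper-case tag at the beginning of a line, and that continuation lines do not start that way. The sample text and the names wanted, field_id, field_tag, field_text and articles are made up for illustration.

```r
# Made-up sample standing in for readLines("c:\\cm-mht1.txt");
# the layout of the real file is an assumption.
txt <- c(
  "HD  First headline",
  "BY  A. Writer",
  "WC  120 words",
  "PD  1 July 2006",
  "SC  Example Paper",
  "PG  3",
  "LP  Lead paragraph, first line",
  "    continued on a second line",
  "TD  Body text of the first article",
  "HD  Second headline",
  "WC  80 words",
  "LP  Lead paragraph of the second article",
  "TD  Body text of the second article"
)

wanted <- c("HD", "BY", "WC", "PD", "SC", "PG", "LP", "TD")

# A line opens a new field if it begins with two capitals and then
# whitespace (or nothing); every other line continues the previous field.
is_tag <- grepl("^[A-Z]{2}([[:space:]]|$)", txt)

# Drop anything that precedes the first tagged line.
keep   <- cumsum(is_tag) > 0
txt    <- txt[keep]
is_tag <- is_tag[keep]

field_id  <- cumsum(is_tag)             # which field each line belongs to
field_tag <- substr(txt[is_tag], 1, 2)  # one tag per field

# Strip the tag from tagged lines, trim, and glue each field's lines together.
body       <- trimws(ifelse(is_tag, substring(txt, 3), txt))
field_text <- as.vector(tapply(body, field_id, paste, collapse = " "))

# Each HD field starts a new record.
record_id <- cumsum(field_tag == "HD")

res <- matrix(NA_character_, nrow = sum(field_tag == "HD"),
              ncol = length(wanted), dimnames = list(NULL, wanted))

# Fill by (row, column) pairs; if a tag repeats within a record,
# the last occurrence wins.
ok <- field_tag %in% wanted & record_id > 0
res[cbind(record_id[ok], match(field_tag[ok], wanted))] <- field_text[ok]

# Six single-field columns plus LP and TD merged into one text column.
articles <- data.frame(res[, wanted[1:6], drop = FALSE],
                       stringsAsFactors = FALSE)
articles$text <- paste(res[, "LP"], res[, "TD"], sep = "; ")
```

Fields missing from a record come out as NA, and paste() turns a missing LP or TD into the string "NA", as in the loop version. A continuation line that happens to begin with two capitals and a space would be mistaken for a new field, so the pattern may need tightening against the real data.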

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD@qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v

R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon Jul 24 10:41:59 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 24 Jul 2006 - 14:22:10 EST.
