Re: [R] constructing a dataframe from a database of newspaper articles

From: jim holtman <>
Date: Mon 24 Jul 2006 - 03:17:57 EST

I was going to suggest that you use Perl, but here is my attempt at keeping it in R. This reads in each line, tries to determine whether it has one of the 'separator' words at the beginning of the line, and then constructs the output.

mySeps <- c("HD", "BY", "WC", "PD", "SC", "PG", "LP", "TD",
    "SN", "LA", "CY")  # section separators
myInc <- 0  # record number
foundHD <- FALSE
myFile <- file('c:/datafile.txt', 'r')
myRec <- list()  # contains the data from each record
myOutput <- list()  # list with each record
while(length(x <- readLines(myFile, n=1)) > 0){
    first <- gsub("^\\s*(\\w+).*", "\\1", x)  # get the first word
    if (!foundHD){  # skip till HD found (assumes this is the start of the article)
        if (first == "HD") foundHD <- TRUE
        else next
    }
    if (first == "NS"){  # skip to next HD (assumes everything after NS is ignored)
        foundHD <- FALSE
        myOutput[[myInc <- myInc + 1]] <- myRec
        myRec <- list()
        next  # do not append the NS line itself to the new record
    }
    if (first %in% mySeps){
        myKey <- first  # use as key to myRec
        x <- sub(first, '', x)
    }
    myRec[[myKey]] <- paste(myRec[[myKey]], x)  # collect data from each mySep
}
close(myFile)

# convert the list to 'long' dataframe for reshape
myResult <- NULL
for (i in 1:length(myOutput)){
    .x <- cbind(i, names(myOutput[[i]]),
        unlist(myOutput[[i]][names(myOutput[[i]])]))
    myResult <- rbind(myResult, .x)
}
rownames(myResult) <- NULL  # field names repeat across records, so drop them as row names
myDF <- as.data.frame(myResult)  # unnamed columns become V2 (field) and V3 (text)
myWide <- reshape(myDF, timevar="V2", idvar='i', direction='wide')
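
If you want to check the logic without the real file, here is a minimal sketch of the same idea run on a few made-up lines held in a character vector. The helper name parse_records, the sample lines and the column names id/field/text are my own inventions for illustration, not anything from your data, and I have restricted the separators to the fields you said you wanted. Treat it as a rough test harness rather than a finished solution.

```r
# Hypothetical helper: same line-by-line state machine, but over a character vector
parse_records <- function(lines,
                          seps = c("HD", "BY", "WC", "PD", "SC", "PG", "LP", "TD")) {
    records <- list()
    rec <- list()
    key <- NULL                                      # NULL means "not inside an article"
    for (x in lines) {
        first <- sub("^\\s*(\\w+).*", "\\1", x)      # first word of the line
        if (is.null(key) && first != "HD") next      # skip until an article starts
        if (first == "NS") {                         # end of the wanted fields
            records[[length(records) + 1]] <- rec
            rec <- list()
            key <- NULL
            next
        }
        if (first %in% seps) {                       # a new field starts here
            key <- first
            x <- sub("^\\s*\\w+\\s*", "", x)         # drop the label itself
        }
        # append the line (or continuation line) to the current field
        rec[[key]] <- trimws(paste(c(rec[[key]], x), collapse = " "))
    }
    records
}

# made-up sample input: two short records with some noise around them
sample_lines <- c(
    "some header noise",
    "HD First headline",
    "continued here",
    "BY By A. Writer",
    "WC 120 words",
    "LP Lead paragraph.",
    "TD Body text",
    "more body text.",
    "NS",
    "GCAT : ignored",
    "HD Second headline",
    "WC 80 words",
    "NS"
)
recs <- parse_records(sample_lines)

# 'long' data frame: one row per record/field pair
long <- do.call(rbind, lapply(seq_along(recs), function(i)
    data.frame(id = i, field = names(recs[[i]]),
               text = unlist(recs[[i]], use.names = FALSE),
               stringsAsFactors = FALSE)))

# 'wide' data frame: one row per record, one column per field
wide <- reshape(long, timevar = "field", idvar = "id", direction = "wide")
```

Fields missing from a record come out as NA in the wide data frame. From there something like write.csv(wide, "articles.csv", row.names = FALSE) would write it out (the file name is just a placeholder). Note that a wrapped continuation line that happens to begin with one of the separator words would be mistaken for a new field, so check a few long records by eye.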

On 7/23/06, Bob Green <> wrote:
> I am hoping for some assistance with formatting a large text file which
> consists of a series of individual records. Each record includes specific
> labels/field names (a sample of one record, one of the longest, is
> below, at the end of the post). What I want to do is reformat the data, so that
> each individual record becomes a row (some cells will have a lot of text).
> For example, the column variables I want are (a) HD in one column
> (b) BY in one column (c) WC data in one column, (d) PD data in one
> column, (e) SC data in one column (f) PG data in one column and (g) LP and TD
> text in one column; this column can contain quite a lot of text, e.g. 1900
> words. The other fields are unwanted.
> If there were 150 individual records, when formatted this would be a 7
> column by 150 row dataset.
> I was advised to:
> 1. read in the file using readLines giving a character vector one element
> per input line.
> 2. convert that to lines of the form:
> id op text
> where each such line is a field and multiline fields have been collapsed
> into a single line of text. This step involves
> detailed processing and you could do it in a loop or you could try a
> vectorized approach. A vectorized approach
> will likely involve using
> 3. the lines created above could be converted to a data frame with three
> columns and
> 4. reshape used to create a "wide" data frame.
> 5. then write it out using write.csv.
> I have got as far as being able to read the text into R - I am unsure if
> the warning is a problem. I am however, not at all sure what I need to do
> next.
> Any assistance is much appreciated,
> Bob
> (A) syntax
> mht <- scan(what="c:\\cm-mht1.txt").
> readLines("c:\\cm-mht1.txt",n = -1)
> [8376] "(c) 2006 Dow Jones Reuters Business Interactive LLC (trading as
> Factiva). All "
> [8377] "rights reserved.
> "
> Warning message:
> incomplete final line found by readLines on 'c:\cm-mht1.txt'
> (B) sample data
> HD Was Charles Manson temporarily insane when he led a wild killing
> rampage in the US in 1969?
> BY By Deborah Cassrels.
> WC 1834 words
> PD 23 June 2001
> SN Courier Mail
> PG 30
> LA English
> CY (c) 2001 Queensland Newspapers Pty Ltd
> LP Was Charles Manson temporarily insane when he led a wild killing
> rampage in the US in 1969? Clearly he was mad and bad. But would
> Queensland have placed him before its Mental Health Tribunal, found
> him of
> unsound mind at the time of his crimes, institutionalised him and
> "treated" his illness? WHY is Queensland the only jurisdiction in
> the
> Commonwealth with a Mental Health Tribunal which establishes if an
> accused
> is fit to face trial or of unsound mind at the time of an alleged
> offence?
> Why is mental incompetence not determined in an adversarial court by
> a
> jury? Under the Mental Health Act 1974, the tribunal, a statutory
> body
> operating since 1985, comprises three-yearly appointments of a
> Supreme
> Court judge and two assisting psychiatrists, whose advice does not
> have to
> be accepted. The judge alone constitutes the tribunal, an
> inquisitorial
> process conducted in the Supreme Court in Brisbane.
> TD Victims or family are not notified of hearings or allowed to
> submit
> victim impact statements. They are prohibited from talking to the
> media
> until 28 days after the decision. And when patients return to the
> community there is no requirement for neighbours or victims to be
> notified. Is this legislation enlightened or are we just suckers,
> falling
> for time and money-saving strategies? The tribunal has earned a
> reputation
> as progressive, humane and economical among some judges who have
> presided
> over it. The inaugural chair, former Supreme Court judge Angelo
> Vasta QC,
> thinks the tribunal system is "enlightened" and "it saves an
> enormous
> amount of expenditure". He points to the humane side of treating the
> ill
> in a secure hospital rather than punishing them for offences but is
> uncomfortable with borderline cases. "Whether people are mad or bad
> ought
> to be established by a very thorough investigation.
> The associated Patient Review Tribunals (of which there are five)
> consist
> of three to six members, including the chair who is a legal officer,
> a
> medical practitioner and a mental health professional. A
> psychiatrist is
> not required. The other three have no specific qualifications and
> can
> include former patients. The tribunals operate in closed hearings
> and
> patients of unsound mind or unfit for trial are reviewed every 12
> months.
> Leave is granted either by the Mental Health Tribunal or the Patient
> Review Tribunal, which determine when a restricted patient is
> discharged
> into the community. Says the Director of Mental Health, Dr Peggy
> Brown:
> "In the case of serious offences you can be assured the period of
> monitoring is quite lengthy." Under the Mental Health Act 2000 to be
> implemented late this year, the tribunal will be replaced by a
> Mental
> Health Court and the Patient Review Tribunal by the Mental Health
> Review
> Tribunal. Queensland Health Minister Wendy Edmond says the name
> change
> reflects transparency, with proceedings under oath and
> cross-examination
> of witnesses. The legislation represents "real change to the rights
> of
> victims of crime". But there is still an embargo on publishing
> decisions
> in the media.
> Dr Brown says when patients are granted leave, victims or families
> can
> apply to be notified but decisions will be made on individual cases.
> "The
> (new) tribunal has to establish that there are reasonable grounds
> for the
> notification order to be made ... and it's also an appealable
> decision,"
> returning to the Mental Health Court.
> Brown says there are efficiencies in the new legislation but "it's
> not
> about saving money". The main advantages were that victims could
> make
> submissions to both bodies. Concerns still might not be addressed
> but
> reasons were expected to be provided. The court's composition and
> sole
> power of the judge will be retained. Victims or relatives can be
> notified
> of hearings and decisions about the patient. If not, reasons must be
> provided. The Patient Review Tribunals will be replaced by one
> tribunal
> with hearings still closed. It will comprise up to five members
> including
> a president (a lawyer of at least seven years' standing),
> psychiatrist or
> medical practitioner and community members and it will be chaired by
> a
> legal officer. Leave will be approved by the corresponding previous
> bodies. Chief Justice Paul de Jersey who presided over the 1995 case
> of
> Ross Farrah, a paranoid schizophrenic, who after murdering his
> girlfriend,
> Christine Nash, was allowed out of the John Oxley Centre to play
> sport and
> see movies, says the proposed legislative changes to the Mental
> Health Act
> appear to be "refinements". Two weeks ago, Nash's teenage son Wade
> committed suicide after suffering years of torment following his
> mother's
> murder. In May 1996, a letter was sent to the tribunal by now former
> director of secure care services at John Oxley Dr Peter Fama. It
> said:
> "Should Ross be committed to the Tribunal for trial on a charge of
> manslaughter or murder, I have to report that he is now fit to be
> placed
> in corrective custody ... There is no clinical need for further
> detention
> of Ross in hospital." De Jersey has been involved in the process of
> amendments in the new Act and believes the "adjustments" are
> satisfactory:
> "It's probably a question of how they're implemented. I thought the
> changes were more concerned with image than effecting substantial
> change
> to the system, calling it a court rather than a tribunal. There is
> some
> attempt to enhance the openness of the procedures such as the advice
> given
> by the existing psychiatrists being revealed in open court to the
> judge
> but they're aspects of streamlining rather than substantive change."
> He
> says many people are irked by a perceived disproportion between the
> treatment of mentally ill offenders and their victims. "As a
> community we
> need much more positively to address the situation of victims." De
> Jersey
> points to the James Bulger murder in the UK eight years ago when two
> 10-year-old boys abducted and battered James, two, to death. The
> killers
> are expected to be freed soon. Says de Jersey: "Whatever one thinks
> of
> future plans for the young offenders it is extraordinary, if
> reportedly
> correct, that so little help has been given to the bereft mother of
> the
> murdered toddler. "Similarly, here, it is generally indefensible
> where
> victims or the families of victims are not informed of details of
> the
> likely release of their offenders, and even before that where they
> are not
> given a proper explanation as to the process and counselling to help
> them
> comprehend that process and as well the consequences of the crime.
> We are
> as a community moving towards a greater focus on the position of
> victims
> but a lot more needs to be done. "The anguish of victims and the
> families
> of victims that insane offenders appear to escape punishment is
> understandable. The issue is whether the community is prepared to
> accept
> that insane offenders primarily need treatment." The Mental Health
> Tribunal worked on two assumptions, that offenders of unsound mind
> should,
> in the interests of the community, be treated rather than punished,
> and
> that a determination whether an offender was of unsound mind could
> responsibly be made by a Supreme Court judge with expert psychiatric
> assistance. "I have wondered whether with the ultimately serious
> crimes
> such as murder the community may not reasonably demand that in the
> interests of reassurance that the determination be made by a jury."
> He
> believes the community's longer term interests would best be served
> by
> medically treating insane offenders in a hospital rather than a
> prison,
> where if rehabilitated, they could contribute to the community. "I
> accept,
> however, that in many cases there will be serious residual concern,
> for
> example, can the offender be trusted, if left unsupervised, to
> continue to
> take the relevant medication?"
> De Jersey admits problems have arisen when offenders, granted leave,
> stopped taking medication but says if they can be relied upon to
> maintain
> stability through medication it would be inhumane to keep them
> locked up.
> Continued medical monitoring was necessary. If conditions were
> breached
> the person should be returned to restricted custody at the
> psychiatric
> hospital. While the most vulnerable in society deserve compassion it
> does
> not surprise there is public concern about lack of proper scrutiny,
> the
> capacity to re-offend and misuse of the legal process by using
> insanity as
> a defence. IN the general quest to improve treatment provisions for
> patients the 2000 Act says: "The new legislation provides for
> involuntary
> treatment in the community as an alternative to being an in-patient
> in a
> mental health service which reflects contemporary clinical practice
> and
> the principle of reform that involuntary treatment must be in the
> least
> restrictive form."
> Perhaps the overwhelming feeling is patients' rights have priority
> over
> victims' rights. Ted Flack, spokesman for the Queensland Homicide
> Victims
> Support Group says the new Act provides a better environment for
> victims'
> participation, but there are serious flaws. The rights of homicide
> victims
> were not guaranteed and this caused an inordinate amount of
> distress.
> "There's still considerable discretion in the hands of the Mental
> Health
> Court and the Mental Health Review Tribunal as to whether they would
> admit
> any evidence from the victims. The new Act is framed in such a way
> as to
> provide guaranteed rights to the person who's suffering from a
> mental
> illness and those rights come appropriately from the international
> conventions, but there are similar international conventions for
> victims
> and they are being completely ignored in the Act." Flack says the
> primary
> purpose of the Mental Health Tribunal is to save money and to
> safeguard
> the rights of the mentally disabled person. He believes the
> criminally
> insane can be catered for properly in jail. "The imprecise science
> of
> psychiatry is not an appropriate set of guidelines for the release
> into
> the community of dangerous killers," he says.
> NS
> GCAT : Political/General News | GCRIM : Crime/Courts | GHEA : Health
> |
> GHOME : Law Enforcement
> RE
> AUSNZ : Australia and New Zealand | AUSTR : Australia
> AN
> Document coumai0020010710dx6n005vl
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


Received on Mon Jul 24 03:23:52 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 24 Jul 2006 - 04:17:31 EST.
