Re: [R] Data type problem when extract data from SQLite to R by using RSQLite

From: chen jia <chen_1002_at_fisher.osu.edu>
Date: Mon, 28 Feb 2011 21:57:13 -0500

Hi Seth,

Thanks for the reply. I provide info from sessionInfo() and about schema that you ask. Please take a look.

The output from sessionInfo() is
> sessionInfo()

R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UT> sessionInfo()

R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyr_1.2.1 RSQLite_0.9-2 DBI_0.2-5 filehash_2.1-1F-8   LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyr_1.2.1 RSQLite_0.9-2 DBI_0.2-5 filehash_2.1-1

The .schema of table annual_data3 is
sqlite> .schema annual_data3
CREATE TABLE "annual_data3"(
  PERMNO INT,
  DATE INT,
  CUSIP TEXT,
  EXCHCD INT,

  SICCD INT,
  SHROUT INT,
  PRC REAL,
  RET REAL,
  ...
  pret_var,
  pRET_sd,
  nmret,
  pya_var,
  pya_sd,
  nya,

  pya_var_ebi,
  pya_sd_ebi,
  pya_var_ebit,
  pya_sd_ebit,
  pya_var_ebitda,
  pya_sd_ebitda,

  logage REAL,
  logasset REAL,
  ...
  loglead1stdaret, loglead2stdaret)

Table annual_data3 is created by joining table annual_data2 and ya_vol. The column pya_var is initially in ya_vol.

dbGetQuery(sql.industry,

           "create table annual_data3 as
            select a.*, b.pya_var, b.pya_sd,
                   b.nya, b.pya_var_ebi, b.pya_sd_ebi,
                   b.pya_var_ebit, b.pya_sd_ebit,
                   b.pya_var_ebitda, b.pya_sd_ebitda
            from   annual_data2 as a left join
                      ya_vol as b
            on     a.permno = b.permno and
                     a.year = b.year
            order by permno, year")

Table ya_vol is created by
dbGetQuery(sql.industry,

           "create table ya_vol as
            select PERMNO, year,
                   variance(ya) as pya_var,
                   stdev(ya) as pya_sd,
                   count(*) as nya,
                   variance(ya_ebi) as pya_var_ebi,
                   stdev(ya_ebi) as pya_sd_ebi,
                   variance(ya_ebit) as pya_var_ebit,
                   stdev(ya_ebit) as pya_sd_ebit,
                   variance(ya_ebitda) as pya_var_ebitda,
                   stdev(ya_ebitda) as pya_sd_ebitda
            from   past_ya
            where  ya is not null
            group by PERMNO, year
            order by PERMNO, year")

The schema info of ya_vol is
sqlite> .schema ya_vol
CREATE TABLE ya_vol(
  PERMNO INT,
  year INT,
  pya_var,
  pya_sd,
  nya,

  pya_var_ebi,
  pya_sd_ebi,
  pya_var_ebit,
  pya_sd_ebit,
  pya_var_ebitda,
  pya_sd_ebitda

);
CREATE INDEX ya_vol_permno_year_idx

            on ya_vol (permno,year);

Interestingly, I find that the problem I reported does not for columns labeled real in the schema info. For example, the type of column RET never changes no matter what the first observation is.

> str(dbGetQuery(sql.industry,

+                "select RET from annual_data3
+                 where RET is not null limit 5"))
'data.frame':	5 obs. of  1 variable:

 $ RET: num -0.03354 -0.02113 0.03797 0.0013 -0.00678 >
> str(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is null limit 5"))
'data.frame':	5 obs. of  1 variable:

 $ RET: num NA NA NA NA NA
> sapply(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is null limit 5"),
+        typeof)
     RET

"double"
> sapply(dbGetQuery(sql.industry,
+                "select RET from annual_data3
+                 where RET is not null limit 5"),
+        typeof)
     RET

"double"

I still don't know how to solve this problem for variable pya_var, please help. Thanks.

Best,
Jia

On Mon, Feb 28, 2011 at 6:48 PM, Seth Falcon <seth_at_userprimary.net> wrote:
> Hi Jia,

>

> On Mon, Feb 28, 2011 at 12:37 PM, chen jia <chen_1002@fisher.osu.edu> wrote:
>> When I extract data from SQLite to R, the data types (or modes) of the
>> extracted data seems to be determined by the value of the first row.
>> Please see the following example.
>

> It would help to provide the output of sessionInfo() as well as the
> schema definition for the table in SQLite (or at least description of
> how it was created).
>

> Here's an example that works as you'd like:
>

>    > library(RSQLite)
>    > db = dbConnect(SQLite(), dbname = ":memory:")
>    > dbGetQuery(db, "create table t (a int, b real, c text)")
>    > df = data.frame(a=c(NA, 1L, 2L), b=c(NA, 1.1, 2.2), c=c(NA, "x",
> "y"),stringsAsFactors=FALSE)
>    > df
>       a   b    c
>    1 NA  NA <NA>
>    2  1 1.1    x
>    3  2 2.2    y
>    > dbGetPreparedQuery(db, "insert into t values (?, ?, ?)", df)
>    > dbGetQuery(db, "select * from t")
>       a   b    c
>    1 NA  NA <NA>
>    2  1 1.1    x
>    3  2 2.2    y
>    > sapply(dbGetQuery(db, "select * from t"), typeof)
>              a           b           c
>      "integer"    "double" "character"
>    > sapply(dbGetQuery(db, "select * from t limit 1"), typeof)
>              a           b           c
>      "integer"    "double" "character"
>    > sapply(dbGetQuery(db, "select a from t limit 1"), typeof)
>            a
>    "integer"
>    > sapply(dbGetQuery(db, "select a from t limit 2"), typeof)
>            a
>    "integer"
>    > sapply(dbGetQuery(db, "select a from t limit 1"), typeof)
>            a
>    "integer"
>
>
>> sessionInfo()

> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>

> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>

> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>

> other attached packages:
> [1] RSQLite_0.9-4 DBI_0.2-5
>

> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>
>
>
>

> --
> Seth Falcon | @sfalcon | http://userprimary.net/
>
-- 
700 Fisher Hall
2100 Neil Ave.
Columbus, Ohio  43210
http://www.fisher.osu.edu/~chen_1002/

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 01 Mar 2011 - 03:15:21 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 01 Mar 2011 - 07:40:18 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive