Accessing the Oracle database from R
dplyr makes the most common data manipulation tasks in R easier. dplyr can use dbplyr. dbplyr provides a transformation from the dplyr verbs to SQL queries. dbplyr 1.1.0 is released 2017-06-27. See here. It uses the DBI (R Database Interface). See here. This interface is implemented by various drivers such as ROracle. ROracle is an Oracle driver based on OCI (Oracle Call Interface) which is a high performance native C interface to connect to the Oracle Database.
Installing ROracle on Windows 10
I encountered several errors when installing ROracle in Windows 10 on R 3.3.3. The steps to take to do this right in one go are the following:
- Determine your R platform architecture. 32 bit or 64 bit. For me this was 64 bit
- Download and install the oracle instant client with the corresponding architecture (here). Download the basic and SDK files. Put the sdk file from the sdk zip in a subdirectory of the extracted basic zip (at the same level as vc14)
- Download and install RTools (here)
- Set the OCI_LIB64 or OCI_LIB32 variables to the instant client path
- Set the PATH variable to include the location of oci.dll
- Install ROracle (install.packages("ROracle") in R)
Warning in install.packages :
package ‘ROracle_1.3-1.zip’ is not available (for R version 3.3.3)
You probably tried to install the ROracle package which Oracle provides on an R version which is too new (see here). This will not work on R 3.3.3. You can compile ROracle on your own or use the (older) R version Oracle supports.
This can be done by installing RTools (here). This will install all the tools required to compile sources on a Windows machine.
Next you will get the following question:
Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘ROracle’
Do you want to attempt to install these from sources?
y/n:
If you say y, you will get the following error:
installing the source package ‘ROracle’
trying URL 'https://cran.rstudio.com/src/contrib/ROracle_1.3-1.tar.gz'
Content type 'application/x-gzip' length 308252 bytes (301 KB)
downloaded 301 KB
* installing *source* package 'ROracle' ...
** package 'ROracle' successfully unpacked and MD5 sums checked
ERROR: cannot find Oracle Client.
Please set OCI_LIB64 to specify its location.
In order to fix this, you can download and install the Oracle Instant Client (the basic and SDK downloads).
Mind that when running a 64 bit version of R, you also need a 64 bit version of the instant client. You can check with the R version command. In my case: Platform: x86_64-w64-mingw32/x64 (64-bit). Next you have to set the OCI_LIB64 variable (for 64 bit else OCI_LIB32) to the specified path. After that you will get the error as specified below:
Next it will fail with something like:
Error in inDL(x, as.logical(local), as.logical(now), ...) :
unable to load shared object 'ROracle.dll':
LoadLibrary failure: The specified module could not be found.
This is caused when oci.dll from the instant client is not in the path environment variable. Add it and it will work! (at least it did on my machine). The INSTALL file from the ROracle package contains a lot of information about different errors which can occur during installation. If you encounter any other errors, be sure to check it.
How a successful 64 bit compilation looks
> install.packages("ROracle")
Installing package into ‘C:/Users/maart_000/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘ROracle’
Do you want to attempt to install these from sources?
y/n: y
installing the source package ‘ROracle’
trying URL 'https://cran.rstudio.com/src/contrib/ROracle_1.3-1.tar.gz'
Content type 'application/x-gzip' length 308252 bytes (301 KB)
downloaded 301 KB
* installing *source* package 'ROracle' ...
** package 'ROracle' successfully unpacked and MD5 sums checked
Oracle Client Shared Library 64-bit - 12.2.0.1.0 Operating in Instant Client mode.
found Instant Client C:\Users\maart_000\Desktop\instantclient_12_2
found Instant Client SDK C:\Users\maart_000\Desktop\instantclient_12_2/sdk/include
copying from C:\Users\maart_000\Desktop\instantclient_12_2/sdk/include
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture
c:/Rtools/mingw_64/bin/gcc -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG -I./oci -I"d:/Compiler/gcc-4.9.3/local330/include" -O2 -Wall -std=gnu99 -mtune=core2 -c rodbi.c -o rodbi.o
c:/Rtools/mingw_64/bin/gcc -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG -I./oci -I"d:/Compiler/gcc-4.9.3/local330/include" -O2 -Wall -std=gnu99 -mtune=core2 -c rooci.c -o rooci.o
c:/Rtools/mingw_64/bin/gcc -shared -s -static-libgcc -o ROracle.dll tmp.def rodbi.o rooci.o C:\Users\maart_000\Desktop\instantclient_12_2/oci.dll -Ld:/Compiler/gcc-4.9.3/local330/lib/x64 -Ld:/Compiler/gcc-4.9.3/local330/lib -LC:/PROGRA~1/R/R-33~1.3/bin/x64 -lR
installing to C:/Users/maart_000/Documents/R/win-library/3.3/ROracle/libs/x64
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (ROracle)
Testing ROracle
You can read the ROracle documentation here. Oracle has been so kind as to provide developer VM's to play around with the database. You can download them here. I used 'Database App Development VM'.
After installation of ROracle you can connect to the database and for example fetch employees from the EMP table. See for example below (make sure you also have DBI installed).
library("DBI")
library("ROracle")
drv <- dbDriver("Oracle")
host <- "localhost"
port <- "1521"
sid <- "orcl12c"
connect.string <- paste(
"(DESCRIPTION=",
"(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
"(CONNECT_DATA=(SID=", sid, ")))", sep = "")
con <- dbConnect(drv, username = "system", password = "oracle", dbname = connect.string, prefetch = FALSE,
bulk_read = 1000L, stmt_cache = 0L, external_credentials = FALSE,
sysdba = FALSE)
dbReadTable(con, "EMP")
This will yield the data in the EMP table.
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
1 7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 NA 30
2 7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 NA 20
3 7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 NA 20
4 7902 FORD ANALYST 7566 1981-12-02 23:00:00 3000 NA 20
5 7369 SMITH CLERK 7902 1980-12-16 23:00:00 800 NA 20
6 7499 ALLEN SALESMAN 7698 1981-02-19 23:00:00 1600 300 30
7 7521 WARD SALESMAN 7698 1981-02-21 23:00:00 1250 500 30
8 7654 MARTIN SALESMAN 7698 1981-09-27 23:00:00 1250 1400 30
9 7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500 0 30
10 7876 ADAMS CLERK 7788 1987-05-23 00:00:00 1100 NA 20
11 7900 JAMES CLERK 7698 1981-12-02 23:00:00 950 NA 30
Using dplyr
dplyr uses dbplyr and it makes working with database data a lot easier. You can see an example here.
Installing dplyr and dbplyr in R is easy:
install.packages("dplyr")
install.packages("dbplyr")
Various functions are provides to work with data.frames, a popular R datatype in combination with data from the database. Also dplyr uses an abstraction above SQL which makes coding SQL for non-SQL coders more easy. You can compare it in some ways with Hibernate which makes working with databases from the Java object world more easy.
Some functions dplyr provides:
filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.
Installing dplyr and dbplyr in R is easy:
install.packages("dplyr")
install.packages("dbplyr")
Some functions dplyr provides:
filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.
I'll use the same example data as with the above sample which uses plain ROracle
library("DBI")
library("ROracle")
library("dplyr")
#below are required to make the translation done by dbplyr to SQL produce working Oracle SQL
sql_translate_env.OraConnection <- dbplyr:::sql_translate_env.Oracle
sql_select.OraConnection <- dbplyr:::sql_select.Oracle
sql_subquery.OraConnection <- dbplyr:::sql_subquery.Oracle
drv <- dbDriver("Oracle")
host <- "localhost"
port <- "1521"
sid <- "orcl12c"
connect.string <- paste(
"(DESCRIPTION=",
"(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
"(CONNECT_DATA=(SID=", sid, ")))", sep = "")
con <- dbConnect(drv, username = "system", password = "oracle", dbname = connect.string, prefetch = FALSE,
bulk_read = 1000L, stmt_cache = 0L, external_credentials = FALSE,
sysdba = FALSE)
emp_db <- tbl(con, "EMP")
emp_db
library("DBI")
library("ROracle")
library("dplyr")
#below are required to make the translation done by dbplyr to SQL produce working Oracle SQL
sql_translate_env.OraConnection <- dbplyr:::sql_translate_env.Oracle
sql_select.OraConnection <- dbplyr:::sql_select.Oracle
sql_subquery.OraConnection <- dbplyr:::sql_subquery.Oracle
drv <- dbDriver("Oracle")
host <- "localhost"
port <- "1521"
sid <- "orcl12c"
connect.string <- paste(
"(DESCRIPTION=",
"(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
"(CONNECT_DATA=(SID=", sid, ")))", sep = "")
con <- dbConnect(drv, username = "system", password = "oracle", dbname = connect.string, prefetch = FALSE,
bulk_read = 1000L, stmt_cache = 0L, external_credentials = FALSE,
sysdba = FALSE)
emp_db <- tbl(con, "EMP")
emp_db
The output is something like:
# Source: table<EMP> [?? x 8]
# Database: OraConnection
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
<int> <chr> <chr> <int> <dttm> <dbl> <dbl> <int>
1 7839 KING PRESIDENT NA 1981-11-16 23:00:00 5000 NA 10
2 7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 NA 30
3 7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 NA 10
4 7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 NA 20
5 7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 NA 20
6 7902 FORD ANALYST 7566 1981-12-02 23:00:00 3000 NA 20
7 7369 SMITH CLERK 7902 1980-12-16 23:00:00 800 NA 20
8 7499 ALLEN SALESMAN 7698 1981-02-19 23:00:00 1600 300 30
9 7521 WARD SALESMAN 7698 1981-02-21 23:00:00 1250 500 30
10 7654 MARTIN SALESMAN 7698 1981-09-27 23:00:00 1250 1400 30
# ... with more rows
If I now want to select specific records, I can do something like:
emp_db %>% filter(DEPTNO == "10")
Which will yield
# Source: lazy query [?? x 8]
# Database: OraConnection
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
<int> <chr> <chr> <int> <dttm> <dbl> <dbl> <int>
1 7839 KING PRESIDENT NA 1981-11-16 23:00:00 5000 NA 10
2 7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 NA 10
3 7934 MILLER CLERK 7782 1982-01-22 23:00:00 1300 NA 10
A slightly more complex query:
emp_db %>%
group_by(DEPTNO) %>%
summarise(EMPLOYEES = count())
Will result in the number of employees per department:
# Source: lazy query [?? x 2]
# Database: OraConnection
DEPTNO EMPLOYEES
<int> <dbl>
1 30 6
2 20 5
3 10 3
You can see the generated query by:
emp_db %>%
group_by(DEPTNO) %>%
summarise(EMPLOYEES = count()) %>% show_query()
Will result in
<SQL>
SELECT "DEPTNO", COUNT(*) AS "EMPLOYEES"
FROM ("EMP")
GROUP BY "DEPTNO"
If I want to take a random sample from the dataset to perform analyses on, I can do:
# Source: table<EMP> [?? x 8]
# Database: OraConnection
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
<int> <chr> <chr> <int> <dttm> <dbl> <dbl> <int>
1 7839 KING PRESIDENT NA 1981-11-16 23:00:00 5000 NA 10
2 7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 NA 30
3 7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 NA 10
4 7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 NA 20
5 7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 NA 20
6 7902 FORD ANALYST 7566 1981-12-02 23:00:00 3000 NA 20
7 7369 SMITH CLERK 7902 1980-12-16 23:00:00 800 NA 20
8 7499 ALLEN SALESMAN 7698 1981-02-19 23:00:00 1600 300 30
9 7521 WARD SALESMAN 7698 1981-02-21 23:00:00 1250 500 30
10 7654 MARTIN SALESMAN 7698 1981-09-27 23:00:00 1250 1400 30
# ... with more rows
If I now want to select specific records, I can do something like:
emp_db %>% filter(DEPTNO == "10")
Which will yield
# Source: lazy query [?? x 8]
# Database: OraConnection
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
<int> <chr> <chr> <int> <dttm> <dbl> <dbl> <int>
1 7839 KING PRESIDENT NA 1981-11-16 23:00:00 5000 NA 10
2 7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 NA 10
3 7934 MILLER CLERK 7782 1982-01-22 23:00:00 1300 NA 10
A slightly more complex query:
emp_db %>%
group_by(DEPTNO) %>%
summarise(EMPLOYEES = count())
Will result in the number of employees per department:
# Source: lazy query [?? x 2]
# Database: OraConnection
DEPTNO EMPLOYEES
<int> <dbl>
1 30 6
2 20 5
3 10 3
You can see the generated query by:
emp_db %>%
group_by(DEPTNO) %>%
summarise(EMPLOYEES = count()) %>% show_query()
Will result in
<SQL>
SELECT "DEPTNO", COUNT(*) AS "EMPLOYEES"
FROM ("EMP")
GROUP BY "DEPTNO"
If I want to take a random sample from the dataset to perform analyses on, I can do:
sample_n(as_data_frame(emp_db), 10)
Which could yield something like:
# A tibble: 10 x 8
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
<int> <chr> <chr> <int> <dttm> <dbl> <dbl> <int>
1 7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500 0 30
2 7499 ALLEN SALESMAN 7698 1981-02-19 23:00:00 1600 300 30
3 7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 NA 20
4 7654 MARTIN SALESMAN 7698 1981-09-27 23:00:00 1250 1400 30
5 7369 SMITH CLERK 7902 1980-12-16 23:00:00 800 NA 20
6 7902 FORD ANALYST 7566 1981-12-02 23:00:00 3000 NA 20
7 7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 NA 30
8 7876 ADAMS CLERK 7788 1987-05-23 00:00:00 1100 NA 20
9 7934 MILLER CLERK 7782 1982-01-22 23:00:00 1300 NA 10
10 7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 NA 10
Executing the same command again will result in a different sample.
Finally
There are multiple ways to get data to and from the Oracle database and perform actions on them. Oracle provides Oracle R Enterprise. Oracle R Enterprise is a component of the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. You can create R proxy objects in your R session from database-resident data. This allows you to work on database data in R while the database does most of the computations. Another feature of Oracle R Enterprise is an R script repository in the database and a feature to allow execution of R scripts from within the database (embedded), even within SQL statements. As you can imagine this is quite powerful. More on this in a later blog!
Finally
There are multiple ways to get data to and from the Oracle database and perform actions on them. Oracle provides Oracle R Enterprise. Oracle R Enterprise is a component of the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. You can create R proxy objects in your R session from database-resident data. This allows you to work on database data in R while the database does most of the computations. Another feature of Oracle R Enterprise is an R script repository in the database and a feature to allow execution of R scripts from within the database (embedded), even within SQL statements. As you can imagine this is quite powerful. More on this in a later blog!
No comments:
Post a Comment