Introduction
Currently we see Hadoop becoming part of the Enterprise Data Warehouse family. But family members should be connected to each other: sometimes we need access to Hadoop data from Oracle Database, and sometimes Hadoop users need enterprise data stored in Oracle Database.
Hive has a very interesting concept: external tables, which allow you to define Java classes that access an external data store and present it as a native Hive table.
Oracle Datasource for Apache Hadoop (formerly Oracle Table Access for Apache Hadoop) turns Oracle Database tables into a Hadoop data source (i.e., an external table), enabling direct and consistent Hive QL/Spark SQL queries, as well as direct Hadoop API access. Applications can join master data or dimension data in Oracle Database with data stored in Hadoop. Additionally, data can be written back to Oracle Database after processing.
Oracle Datasource for Apache Hadoop optimizes a query’s execution plans using predicate and projection pushdown, and partition pruning. Database table access is performed in parallel based on the selected split patterns, using smart and secure connections (Kerberos, SSL, Oracle Wallet), regulated by both Hadoop (i.e., maximum concurrent tasks) and Oracle DBAs (i.e., max pool size).
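As an illustration of what pushdown means in practice (a sketch using the `EmployeeDataSimple6` table defined later in this post), a Hive query like the following lets OD4H send both the column projection and the `WHERE` filter to Oracle, so only the matching rows and the selected columns travel over JDBC:

```sql
-- Only employee_id and salary are fetched from Oracle (projection pushdown),
-- and the salary filter is evaluated inside the database (predicate pushdown).
SELECT employee_id, salary
FROM   EmployeeDataSimple6
WHERE  salary > 10000;
```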
How does Oracle DataSource for Apache Hadoop (OD4H) work?
You can start working with OD4H using the following steps:
- Create a new Oracle table, or, reuse an existing table.
- Create the Hive DDL for creating the external table referencing the Oracle Table.
- Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.
How to set it up
1. Oracle Datasource for Apache Hadoop requires the following (check the current certification information here: https://www.oracle.com/database/big-data-connectors/certifications.html):
- Oracle Database 12c, 11.2.0.4, or earlier (as long as you can query it using the Oracle JDBC driver 12c). Oracle Database 12c is preferable for better performance.
- Hadoop: Cloudera CDH5, Hortonworks 2.x, or Apache Hadoop 2.2.0-2.6.0.
- Apache Hive 0.13.0, 0.13.1, or 1.1.0.
2. Download the Oracle Datasource for Apache Hadoop connector. There are two places where you can get it:
– the standalone archive: http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html
– the Oracle Mammoth archive from support.oracle.com (the BDA Extras bundle contains OD4H)
3. I downloaded the standalone archive, named od4h-1.2.0.0.zip. Let's unzip it on each Hadoop node; I'll use the free tool pssh to run the unzip command on all nodes:

```shell
pssh -h ~/hostsall "unzip /media/sf_nfs2/distrib/bigdata/od4h/od4h-1.2.0.0.zip -d /opt/oracle"
```
4. Add /opt/oracle/od4h/jlib/* to HIVE_AUX_JARS_PATH.
– If you have Cloudera Manager, choose Hive -> Configuration and search for HIVE_AUX_JARS_PATH
– If you don't have Cloudera Manager, set this variable in the hive-env.sh file
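As a sketch of the hive-env.sh variant (assuming the /opt/oracle install location from step 3; adjust the path to wherever you unzipped the archive), the entry could look like this:

```shell
# hive-env.sh -- make the OD4H jars visible to Hive
export HIVE_AUX_JARS_PATH=/opt/oracle/od4h/jlib/*
```

Restart the Hive services afterwards so the new jars are picked up.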
5. That's it. We are ready to create an external Hive table linked to an Oracle table.
A simple example
1. Run the following DDL in Hive:

```sql
DROP TABLE EmployeeDataSimple6;
CREATE EXTERNAL TABLE EmployeeDataSimple6 (
  employee_id int,
  First_Name string,
  Last_Name string,
  Job_ID string,
  Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
  'oracle.hcat.osh.columns.mapping' = 'employee_id, first_name, Last_Name, Job_Id, Salary')
TBLPROPERTIES (
  'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@//scbds.ru.oracle.com:1521/pdb1',
  'mapreduce.jdbc.username' = 'hr',
  'mapreduce.jdbc.password' = 'hr',
  'mapreduce.jdbc.input.table.name' = 'employees',
  'oracle.hcat.osh.useMonitor' = 'true',
  'oracle.hcat.osh.fetchSize' = '10000',
  'oracle.hcat.osh.useOracleParallelism' = 'true'
);
```
Attention! You can't specify a table schema (owner) here, so you must connect as the table owner (in this example, the HR user).
2. Run a query using beeline:

```shell
[root@scbda1 /]# beeline
beeline> !connect jdbc:hive2://localhost:10000 oracle welcome1
0: jdbc:hive2://localhost:10000> select count(*) from EmployeeDataSimple6;
```
3. Get the output:
```
[root@scbda1 /]# beeline
16/09/27 03:18:15 DEBUG util.VersionInfo: version: 2.6.0-cdh5.8.0
2016-09-27 03:18:20,674 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 oracle welcome1
scan complete in 4ms
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> select count(*) from EmployeeDataSimple6;
DEBUG : Acquire a monitor for compiling query
INFO  : Compiling command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28): select count(*) from EmployeeDataSimple6
DEBUG : Encoding valid txns info 9223372036854775807:
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28); Time taken: 2.101 seconds
INFO  : Executing command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28): select count(*) from EmployeeDataSimple6
INFO  : Query ID = hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
DEBUG : Configuring job job_1474930452689_0001 with /user/oracle/.staging/job_1474930452689_0001 as the submit dir
DEBUG : adding the following namenodes' delegation tokens:[hdfs://scbda2.ru.oracle.com:8020]
DEBUG : Creating splits at hdfs://scbda2.ru.oracle.com:8020/user/oracle/.staging/job_1474930452689_0001
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1474930452689_0001
INFO  : The url to track the job: http://scbda1.ru.oracle.com:8088/proxy/application_1474930452689_0001/
INFO  : Starting Job = job_1474930452689_0001, Tracking URL = http://scbda1.ru.oracle.com:8088/proxy/application_1474930452689_0001/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1474930452689_0001
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2016-09-27 02:19:31,719 Stage-1 map = 0%, reduce = 0%
INFO  : 2016-09-27 02:19:44,405 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.73 sec
INFO  : 2016-09-27 02:19:52,834 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.12 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 120 msec
INFO  : Ended Job = job_1474930452689_0001
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.12 sec HDFS Read: 173502 HDFS Write: 4 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28); Time taken: 48.807 seconds
INFO  : OK
DEBUG : Shutting down query select count(*) from EmployeeDataSimple6
+------+--+
| _c0  |
+------+--+
| 107  |
+------+--+
1 row selected (53.106 seconds)
```
Conclusion
From the output we can see that the query was completed by one map task and one reduce task. This is fine for small tables, but for large tables it is a very slow configuration: we need more mappers, and therefore more database sessions, to fetch the data faster. I will cover OD4H performance tuning in my next posts.
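As a brief preview (a hedged sketch: the property names `oracle.hcat.osh.splitterKind` and `oracle.hcat.osh.rowsPerSplit` are taken from the OD4H documentation as I remember it, so verify them against your release), the number of splits, and hence mappers, can be influenced through table properties, for example:

```sql
-- Hypothetical variation of the earlier DDL: ask OD4H to split the Oracle
-- table into row ranges so that several mappers can read it in parallel.
ALTER TABLE EmployeeDataSimple6 SET TBLPROPERTIES (
  'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
  'oracle.hcat.osh.rowsPerSplit' = '20'
);
```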