Introduction
Currently we see Hadoop becoming part of the Enterprise Data Warehouse family. But family members should be connected to each other: sometimes we need access to Hadoop data from Oracle Database, and sometimes Hadoop users need enterprise data stored in Oracle Database.
Hive has a very interesting concept: external tables, which allow you to define Java classes that access an external data store and present it as a native Hive table.
Oracle Datasource for Apache Hadoop (formerly Oracle Table Access for Apache Hadoop) turns Oracle Database tables into a Hadoop data source (i.e., an external table), enabling direct and consistent Hive QL/Spark SQL queries, as well as direct Hadoop API access. Applications can join master data or dimension data in Oracle Database with data stored in Hadoop. Additionally, data can be written back to Oracle Database after processing.
Oracle Datasource for Apache Hadoop optimizes a query’s execution plans using predicate and projection pushdown, and partition pruning. Database table access is performed in parallel based on the selected split patterns, using smart and secure connections (Kerberos, SSL, Oracle Wallet), regulated by both Hadoop (i.e., maximum concurrent tasks) and Oracle DBAs (i.e., max pool size).
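As an illustration of what pushdown means in practice (a sketch using the `EmployeeDataSimple6` table defined later in this post), a Hive query like the following lets OD4H send both the column projection and the `WHERE` filter to Oracle, so only the matching rows and the selected columns travel over JDBC:

```sql
-- Only employee_id and salary are fetched from Oracle (projection pushdown),
-- and the salary filter is evaluated inside the database (predicate pushdown).
SELECT employee_id, salary
FROM   EmployeeDataSimple6
WHERE  salary > 10000;
```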
How does Oracle DataSource for Apache Hadoop (OD4H) work?
You can start working with OD4H using the following steps:
- Create a new Oracle table, or, reuse an existing table.
- Create the Hive DDL for creating the external table referencing the Oracle Table.
- Issue HiveSQL, SparkSQL, or other Spark/Hadoop queries and API calls.
How to set it up
1. Oracle Datasource for Apache Hadoop requires the following (check the current certification information here: https://www.oracle.com/database/big-data-connectors/certifications.html):
- Oracle Database 12c, 11.2.0.4, or earlier (as long as you can query it using the Oracle JDBC driver 12c). Oracle Database 12c is preferable for better performance.
- Hadoop: Cloudera CDH5, Hortonworks 2.x, or Apache Hadoop 2.2.0-2.6.0.
- Apache Hive 0.13.0, 0.13.1, or 1.1.0.
2. Download the Oracle Datasource for Apache Hadoop connector. There are two places where you can get it:
– the standalone archive: http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html
– the Oracle Mammoth archive from support.oracle.com (the BDA Extras bundle contains OD4H)
3. I downloaded the standalone archive, named od4h-1.2.0.0.zip. Let's unzip it on each Hadoop node; I'll use the free tool pssh to run the unzip command on all nodes:

```shell
pssh -h ~/hostsall "unzip /media/sf_nfs2/distrib/bigdata/od4h/od4h-1.2.0.0.zip -d /opt/oracle"
```
4. Add /opt/oracle/od4h/jlib/* to HIVE_AUX_JARS_PATH.
– If you have Cloudera Manager, choose Hive -> Configuration and search for HIVE_AUX_JARS_PATH
– If you don't have Cloudera Manager, set this variable in the hive-env.sh file
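As a sketch of the hive-env.sh variant (assuming the /opt/oracle install location from step 3; adjust the path to wherever you unzipped the archive), the entry could look like this:

```shell
# hive-env.sh -- make the OD4H jars visible to Hive
export HIVE_AUX_JARS_PATH=/opt/oracle/od4h/jlib/*
```

Restart the Hive services afterwards so the new jars are picked up.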
5. That's it. We are ready to create an external Hive table linked to an Oracle table.
A simple example
1. Run the following DDL in Hive:

```sql
DROP TABLE EmployeeDataSimple6;
CREATE EXTERNAL TABLE EmployeeDataSimple6 (
  employee_id int,
  First_Name string,
  Last_Name string,
  Job_ID string,
  Salary int
)
STORED BY 'oracle.hcat.osh.OracleStorageHandler'
WITH SERDEPROPERTIES (
  'oracle.hcat.osh.columns.mapping' = 'employee_id, first_name, Last_Name, Job_Id, Salary')
TBLPROPERTIES (
  'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@//scbds.ru.oracle.com:1521/pdb1',
  'mapreduce.jdbc.username' = 'hr',
  'mapreduce.jdbc.password' = 'hr',
  'mapreduce.jdbc.input.table.name' = 'employees',
  'oracle.hcat.osh.useMonitor' = 'true',
  'oracle.hcat.osh.fetchSize' = '10000',
  'oracle.hcat.osh.useOracleParallelism' = 'true'
);
```
Attention! You can't specify a table schema (owner) here, so you must connect as the table owner (in this example, the HR user).
2. Run a query using beeline:

```shell
[root@scbda1 /]# beeline
beeline> !connect jdbc:hive2://localhost:10000 oracle welcome1
0: jdbc:hive2://localhost:10000> select count(*) from EmployeeDataSimple6;
```
3. Get the output:
```
[root@scbda1 /]# beeline
16/09/27 03:18:15 DEBUG util.VersionInfo: version: 2.6.0-cdh5.8.0
2016-09-27 03:18:20,674 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 oracle welcome1
scan complete in 4ms
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> select count(*) from EmployeeDataSimple6;
DEBUG : Acquire a monitor for compiling query
INFO  : Compiling command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28): select count(*) from EmployeeDataSimple6
DEBUG : Encoding valid txns info 9223372036854775807:
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28); Time taken: 2.101 seconds
INFO  : Executing command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28): select count(*) from EmployeeDataSimple6
INFO  : Query ID = hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
DEBUG : Configuring job job_1474930452689_0001 with /user/oracle/.staging/job_1474930452689_0001 as the submit dir
DEBUG : adding the following namenodes' delegation tokens:[hdfs://scbda2.ru.oracle.com:8020]
DEBUG : Creating splits at hdfs://scbda2.ru.oracle.com:8020/user/oracle/.staging/job_1474930452689_0001
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1474930452689_0001
INFO  : The url to track the job: http://scbda1.ru.oracle.com:8088/proxy/application_1474930452689_0001/
INFO  : Starting Job = job_1474930452689_0001, Tracking URL = http://scbda1.ru.oracle.com:8088/proxy/application_1474930452689_0001/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1474930452689_0001
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2016-09-27 02:19:31,719 Stage-1 map = 0%, reduce = 0%
INFO  : 2016-09-27 02:19:44,405 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.73 sec
INFO  : 2016-09-27 02:19:52,834 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.12 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 120 msec
INFO  : Ended Job = job_1474930452689_0001
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.12 sec HDFS Read: 173502 HDFS Write: 4 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 120 msec
INFO  : Completed executing command(queryId=hive_20160927021919_69e0e0ae-1fa1-4435-8d0b-311060a7de28); Time taken: 48.807 seconds
INFO  : OK
DEBUG : Shutting down query select count(*) from EmployeeDataSimple6
+------+--+
| _c0  |
+------+--+
| 107  |
+------+--+
1 row selected (53.106 seconds)
```
Conclusion
From the output we can see that the query was completed by one map task and one reduce task. This is fine for small tables, but for large tables it is a very slow configuration: we need more mappers, and therefore more database sessions, to fetch the data faster. I will cover OD4H performance tuning in my next posts.
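As a brief preview (a hedged sketch: the property names `oracle.hcat.osh.splitterKind` and `oracle.hcat.osh.rowsPerSplit` are taken from the OD4H documentation as I remember it, so verify them against your release), the number of splits, and hence mappers, can be influenced through table properties, for example:

```sql
-- Hypothetical variation of the earlier DDL: ask OD4H to split the Oracle
-- table into row ranges so that several mappers can read it in parallel.
ALTER TABLE EmployeeDataSimple6 SET TBLPROPERTIES (
  'oracle.hcat.osh.splitterKind' = 'ROW_SPLITTER',
  'oracle.hcat.osh.rowsPerSplit' = '20'
);
```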