Running Hive UDF
Available in Classic
Hive UDFs (user-defined functions) let you run custom code within a Hive query when the logic you need is difficult to express with the built-in functions.
UDFs are usually created to work with data in specific domains, such as search logs or transaction history.
A UDF falls into one of three types, depending on the number of input rows the function receives and the number of output rows it returns. Each type implements a different interface.
- UDF: receives a single row as input and returns a single row as output. Most mathematical and string functions, such as ROUND and REPLACE, fall under this type.
- UDAF: receives multiple rows as input and returns a single row as output. Aggregate functions such as COUNT and MAX fall under this type.
- UDTF: receives a single row as input and returns multiple rows (a table) as output. Functions such as EXPLODE fall under this type.
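For illustration, the sketch below shows one built-in function of each type in a query. The table `orders` and its columns `amount` (a number) and `items` (an array) are hypothetical and used only to make the examples concrete.

```sql
-- UDF: one row in, one row out
SELECT ROUND(amount, 2) FROM orders;

-- UDAF: many rows in, one row out
SELECT COUNT(*), MAX(amount) FROM orders;

-- UDTF: one row in, many rows out (one row per array element)
SELECT EXPLODE(items) FROM orders;
```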
This guide shows how to implement the `org.apache.hadoop.hive.ql.exec.UDF` interface and use the resulting Hive UDF in Cloud Hadoop.
To use a Hive UDF in Cloud Hadoop, proceed with the following steps in order.
UDFs must be implemented in Java. If you want to use another programming language, write a user-defined script (MapReduce script) and invoke it with the SELECT TRANSFORM syntax, as sketched below.
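For reference, the sketch below shows the general shape of the SELECT TRANSFORM syntax. The table `src` and the use of `/bin/cat` as the external command are placeholders for illustration only; in practice you would stream the rows through your own script (for example, one registered with ADD FILE).

```sql
-- Sketch only: 'src' is a hypothetical table with a single STRING column named 'value'.
-- Each row is streamed to the external command as a tab-separated line on stdin,
-- and the command's tab-separated stdout becomes the query result.
SELECT TRANSFORM (value)
  USING '/bin/cat'   -- replace with e.g. 'python your_script.py' registered via ADD FILE
  AS (value_out STRING)
FROM src;
```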
1. Create project
Use IntelliJ to create a Gradle project.
- package: `com.naverncp.hive`
Under the root of the project, add the dependency settings to `build.gradle` as shown below.
- This example uses the same versions as the components installed in Cloud Hadoop 1.2.
```groovy
plugins {
    id 'java'
}

group 'com.naverncp'
version '1.0-SNAPSHOT'
sourceCompatibility = 1.8

repositories {
    mavenCentral()
    maven { url "http://conjars.org/repo" }
}

dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.7.3'
    compile group: 'org.apache.hive', name: 'hive-exec', version: '1.2.2'
    compile group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
```
2. Implement interface
Implement a UDF that satisfies the following conditions.
- The UDF inherits `org.apache.hadoop.hive.ql.exec.UDF`.
- The UDF implements at least one `evaluate()` method.
The `evaluate()` method is not defined in the `org.apache.hadoop.hive.ql.exec.UDF` interface itself, because it is impossible to know in advance how many arguments the function will receive or what their types will be.

```java
// Strip.java
package com.naverncp.hive;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
        name = "Strip",
        value = "returns a stripped text",
        extended = "stripping characters from the ends of strings"
)
public class Strip extends UDF {
    private Text result = new Text();

    // Strips leading and trailing whitespace from the input.
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // Strips the given characters from both ends of the input.
    public Text evaluate(Text str, String stripChar) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChar));
        return result;
    }
}
```
In the class above, two `evaluate()` methods are implemented.
- First method: removes leading and trailing whitespace from the string
- Second method: removes the specified characters from both ends of the string
To use the UDF in Hive, first package the Java class into a `.jar` file.
- The following example uploads the `.jar` file to `hdfs:///user/example`.

```bash
$ ./gradlew clean
$ ./gradlew build
$ scp -i ~/Downloads/example-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
$ ssh -i ~/Downloads/example-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com
[sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/
```
3. Use Hive
Create an arbitrary Hive table using the following command.

```bash
[sshuser@e-001-example-0917-hd ~]$ echo 'dummy' > /tmp/dummy.txt
[sshuser@e-001-example-0917-hd ~]$ hive -e "CREATE TABLE dummy (value STRING); \
LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
OVERWRITE INTO TABLE dummy"
```
Run the Hive CLI using the following command.
- Since HiveServer is installed on the edge node, you don't have to add any options.

```bash
[sshuser@e-001-example-0917-hd ~]$ hive
20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
hive>
```
Register the function in the metastore as below.
- Set the function name with the CREATE FUNCTION statement.
- Hive metastore: the place where metadata related to tables and partitions is saved

```sql
hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
    > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
OK
Time taken: 17.786 seconds
```
If you want to use the function during a Hive session only, rather than saving it permanently in the metastore, use the TEMPORARY keyword as below.

```sql
ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
```
Check that the built `strip` function executes successfully. You can see that the spaces have been successfully removed.

```sql
hive> select strip(' bee ') from dummy;
converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
OK
bee
Time taken: 0.967 seconds, Fetched: 1 row(s)

hive> select strip('banana', 'ab') from dummy;
OK
nan
Time taken: 0.173 seconds, Fetched: 1 row(s)
```
You can delete the function as shown below.

```sql
DROP FUNCTION strip;
```
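If the function was registered with the TEMPORARY keyword instead, it only lives for the duration of the session, and it can likewise be dropped explicitly:

```sql
DROP TEMPORARY FUNCTION strip;
```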
If you package frequently used logic as UDFs suited to your data, you can easily query the data with plain SQL statements.