Running Hive UDF


Available in Classic

Hive UDFs (user-defined functions) let users run custom code within Hive queries when the built-in functions cannot express the desired logic.
UDFs are usually created to work with data in specific domains, such as search logs and transaction histories.

UDFs can be categorized into three types according to the number of input rows a function receives and the number of output rows it returns. Each type implements a different interface.

  • UDF
    A function that receives a single row as input and returns a single row as output.
    Most mathematical and string functions, such as ROUND and REPLACE, fall under this type.

  • UDAF
    A function that receives multiple rows as input and returns a single row as output.
    Aggregate functions such as COUNT and MAX fall under this type.

  • UDTF
    A function that receives a single row as input and returns multiple rows (a table) as output.
    Functions such as EXPLODE fall under this type.
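As a rough illustration of the three shapes, the plain-Java sketch below shows how the row counts differ; it does not use the actual Hive interfaces, and the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UdfShapes {
    // UDF shape: one input row -> one output value (like ROUND).
    static long roundValue(double x) {
        return Math.round(x);
    }

    // UDAF shape: many input rows -> one output value (like MAX).
    static int maxValue(List<Integer> rows) {
        return Collections.max(rows);
    }

    // UDTF shape: one input row -> many output rows (like EXPLODE on an array).
    static List<String> explode(String arrayAsCsv) {
        return Arrays.asList(arrayAsCsv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(roundValue(2.6));                  // 3
        System.out.println(maxValue(Arrays.asList(1, 9, 4))); // 9
        System.out.println(explode("a,b,c"));                 // [a, b, c]
    }
}
```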

This guide explains how to implement a Hive UDF based on the org.apache.hadoop.hive.ql.exec.UDF class and use it in Cloud Hadoop.

To use Hive UDF in Cloud Hadoop, proceed with the following steps in order.

Note

UDFs must be implemented in Java. If you want to use another programming language, create a user-defined script (MapReduce script) and run it with the SELECT TRANSFORM syntax.

1. Create project

  1. Use IntelliJ to create a Gradle project.

    • package: com.naverncp.hive


  2. In the project root, add the dependency settings below to build.gradle.

    • The example uses the same versions as the components installed in Cloud Hadoop 1.2.
    plugins {
        id 'java'
    }
    group 'com.naverncp'
    version '1.0-SNAPSHOT'
    sourceCompatibility = 1.8
    repositories {
        mavenCentral()
        maven {
            url "http://conjars.org/repo"
        }
    }
    dependencies {
        compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.7.3'
        compile group: 'org.apache.hive', name: 'hive-exec', version: '1.2.2'
        compile group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
        testCompile group: 'junit', name: 'junit', version: '4.12'
    }
    

2. Implement interface

  1. Implement a UDF that satisfies the following conditions.

    • The UDF extends org.apache.hadoop.hive.ql.exec.UDF.
    • The UDF implements at least one evaluate() method.

    The evaluate() method is not defined in the org.apache.hadoop.hive.ql.exec.UDF class itself, because it is impossible to know in advance how many arguments a function will receive or what their types will be.

    // Strip.java
    package com.naverncp.hive;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    @Description(
            name = "Strip",
            value = "returns a stripped text",
            extended = "stripping characters from the ends of strings"
    )
    public class Strip extends UDF {
        private Text result = new Text();
        public Text evaluate(Text str){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString()));
            return result;
        }
        public Text evaluate(Text str, String stripChar){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString(), stripChar));
            return result;
        }
    }
    

In the class above, two evaluate methods have been implemented.

  • First method: removes leading and trailing whitespace from the string
  • Second method: removes any of the specified characters from both ends of the string
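To make the stripping semantics concrete without the Hadoop dependencies, here is a minimal plain-Java sketch of what StringUtils.strip does (the StripDemo class is hypothetical, for illustration only):

```java
public class StripDemo {
    // Removes any of the given characters from BOTH ends of the string,
    // mirroring org.apache.commons.lang3.StringUtils.strip(String, String).
    static String strip(String s, String stripChars) {
        int start = 0, end = s.length();
        while (start < end && stripChars.indexOf(s.charAt(start)) >= 0) start++;
        while (end > start && stripChars.indexOf(s.charAt(end - 1)) >= 0) end--;
        return s.substring(start, end);
    }

    // One-argument form: strips common whitespace, like the first evaluate().
    static String strip(String s) {
        return strip(s, " \t\r\n");
    }

    public static void main(String[] args) {
        System.out.println(strip("  bee  "));      // bee
        System.out.println(strip("banana", "ab")); // nan
    }
}
```

Note that the characters are removed from both ends, not only from the end: stripping "ab" from "banana" removes the leading "ba" and the trailing "a", leaving "nan".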
  2. To use the UDF in Hive, first package the Java class into a .jar file.

    • The following example uploads the built .jar under hdfs:///user/example.
    $ ./gradlew clean
    $ ./gradlew build
    $ scp -i ~/Downloads/example-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
    $ ssh -i ~/Downloads/example-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com

    [sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/
    

3. Use Hive

  1. Create a sample Hive table using the following command.

    [sshuser@e-001-example-0917-hd ~]$ echo 'dummy' > /tmp/dummy.txt
    [sshuser@e-001-example-0917-hd ~]$ hive -e "CREATE TABLE dummy (value STRING); \
    LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
    OVERWRITE INTO TABLE dummy"
    
  2. Run Hive CLI using the following command.

    • Since HiveServer is installed on the edge node, you don't need to add any options.
    [sshuser@e-001-example-0917-hd ~]$ hive
    20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
    log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
    
    Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
    hive>
    
  3. Register the function in the metastore as shown below.

    • Set the function name with the CREATE FUNCTION statement.
    • Hive metastore: the repository where metadata about tables and partitions is stored
    hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
        > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    Time taken: 17.786 seconds
    

    If you want to use the function only for the duration of a Hive session, rather than saving it permanently in the metastore, use the TEMPORARY keyword as below.

    ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
    
  4. Check whether the strip function you built runs successfully. You can see that the whitespace has been removed.

    hive> select strip('  bee  ') from dummy;
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    bee
    Time taken: 0.967 seconds, Fetched: 1 row(s)
    hive> select strip('banana', 'ab') from dummy ;
    OK
    nan
    Time taken: 0.173 seconds, Fetched: 1 row(s)
    

You can delete the function as shown below.

DROP FUNCTION strip;
Note

If you implement frequently used logic as UDFs tailored to your data's characteristics, you can easily query the data with plain SQL statements.