Run Hive UDF

Available in VPC

Hive user-defined functions (UDFs) allow you to run your own code within a Hive query, and are used when the desired query is difficult to express using only built-in functions.
Typically, you can write and use UDFs to handle data from specific fields, such as search logs and transaction history.

There are 3 types of UDFs depending on the number of input rows that are passed to the function and the number of output rows that are returned by the function. Each type of function has a different interface to implement.

UDF
A UDF is a function that receives a single input row and returns a single output value per row.
Most mathematical and string functions, such as ROUND and REPLACE, are of the UDF type.
UDAF
A UDAF is a function that receives multiple input rows and returns a single output row.
Aggregate functions, such as COUNT and MAX, are examples of UDAFs.
UDTF
A UDTF is a function that receives a single input row and returns multiple output rows (table).
EXPLODE is an example of a UDTF.
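
The difference between the three types is only the cardinality of input and output rows. A minimal sketch in plain Java, using only the JDK, may make this concrete. The class name UdfShapes and its methods are hypothetical and do not implement any Hive interface; they only mirror the row-in/row-out shape of each type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: these are not Hive interfaces.
public class UdfShapes {
    // UDF shape: a value from one row in, one value out (compare ROUND).
    public static long round(double value) {
        return Math.round(value);
    }

    // UDAF shape: values from many rows in, one value out (compare MAX).
    public static int max(List<Integer> rows) {
        return Collections.max(rows);
    }

    // UDTF shape: one row holding an array in, one output row per element (compare EXPLODE).
    public static List<String> explode(String[] array) {
        return Arrays.asList(array);
    }
}
```

In Hive itself, each shape corresponds to a different base class or interface, which is why the implementation effort differs by type.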

This guide describes how to implement a Hive UDF by extending the org.apache.hadoop.hive.ql.exec.UDF class and how to use it in Cloud Hadoop.

To use Hive UDF in Cloud Hadoop:

Note

You must implement UDFs in Java. To use other programming languages, you can write a user-defined script (MapReduce script) and use it with the SELECT TRANSFORM syntax.

1. Create project

Create a Gradle Project using IntelliJ.
- package: com.naverncp.hive

Add the dependency settings to build.gradle in the project root as follows.

The example uses the same component versions as those installed in Cloud Hadoop 2.0.

plugins {
    id 'java'
}
group 'com.naverncp'
version '1.0-SNAPSHOT'
repositories {
    mavenCentral()
    maven {
        url "http://conjars.org/repo"
    }
}
dependencies {
    // 'compile' and 'testCompile' were removed in Gradle 7; use 'implementation' and 'testImplementation'.
    implementation group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    implementation group: 'org.apache.hive', name: 'hive-exec', version: '3.1.2'
    implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
    testImplementation group: 'junit', name: 'junit', version: '4.12'
}

2. Implement interface

Implement a UDF that meets the following conditions:
- A UDF must extend org.apache.hadoop.hive.ql.exec.UDF.
- A UDF must implement at least one evaluate() method.

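The org.apache.hadoop.hive.ql.exec.UDF class does not declare the evaluate() method, because the number and types of arguments that a function will receive cannot be known in advance. Hive finds the matching evaluate() method at runtime by reflection, so you can overload it with different argument lists.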
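
To illustrate why evaluate() does not need to be declared in a base class, the following plain-Java sketch looks up an overloaded evaluate() method by reflection. It uses only the JDK. The names EvaluateLookup, Upper, and call are hypothetical, and Hive's actual resolver is more elaborate (for example, it also handles type conversion), so treat this as an approximation of the idea rather than Hive's implementation.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of reflective method lookup; not Hive code.
public class EvaluateLookup {
    // Stand-in for a UDF class with two evaluate() overloads.
    public static class Upper {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }

        public String evaluate(String s, Integer maxLength) {
            if (s == null) {
                return null;
            }
            return s.substring(0, Math.min(maxLength, s.length())).toUpperCase();
        }
    }

    // Find an evaluate() overload whose parameters accept the given arguments and invoke it.
    public static Object call(Object udf, Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (!m.getName().equals("evaluate")) {
                continue;
            }
            Class<?>[] params = m.getParameterTypes();
            if (params.length != args.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                if (args[i] != null && !params[i].isInstance(args[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no matching evaluate() method");
    }
}
```

Calling call(new EvaluateLookup.Upper(), "bee") selects the one-argument overload, while passing a second Integer argument selects the two-argument one.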

 // Strip.java
    package com.naverncp.hive;
    import org.apache.commons.lang3.StringUtils; // matches the commons-lang3 dependency declared in build.gradle
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    @Description(
            name = "Strip",
            value = "returns a stripped text",
            extended = "stripping characters from the ends of strings"
    )
    public class Strip extends UDF {
        private Text result = new Text();
        public Text evaluate(Text str){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString()));
            return result;
        }
        public Text evaluate(Text str, String stripChar){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString(), stripChar));
            return result;
        }
    }

Two evaluate methods are implemented in the class above.

1st method: Removes whitespace from the beginning and end of the string.
2nd method: Removes the specified characters from the beginning and end of the string.
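
Before packaging, it can help to check the expected stripping behavior without a Hive or Hadoop classpath. The following JDK-only sketch approximates what the two overloads do. StripSketch and its methods are hypothetical names, and the logic is a simplified stand-in for StringUtils.strip, not the library's code.

```java
// Hypothetical JDK-only approximation of the two strip behaviors.
public class StripSketch {
    // One-argument form: remove leading and trailing whitespace.
    public static String strip(String s) {
        return strip(s, null);
    }

    // Two-argument form: remove any of the given characters from both ends.
    // A null stripChars means "strip whitespace".
    public static String strip(String s, String stripChars) {
        if (s == null) {
            return null;
        }
        int start = 0;
        int end = s.length();
        while (start < end && shouldStrip(s.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && shouldStrip(s.charAt(end - 1), stripChars)) {
            end--;
        }
        return s.substring(start, end);
    }

    private static boolean shouldStrip(char c, String stripChars) {
        return stripChars == null ? Character.isWhitespace(c) : stripChars.indexOf(c) >= 0;
    }
}
```

With this sketch, strip("  bee  ") yields "bee" and strip("banana", "ab") yields "nan", because characters are removed from both ends until a character outside the given set is reached.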

To use UDFs in Hive, first package the Java class into a .jar file.

The following example builds the .jar file and uploads it to hdfs:///user/example.

$ ./gradlew clean
$ ./gradlew build
$ scp -i ~/Downloads/example-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
$ ssh -i ~/Downloads/example-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com

[sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/

3. Use Hive

Run the Hive CLI using the following command.

You don't need to specify any options because HiveServer is installed on the edge node.

[sshuser@e-001-example-0917-hd ~]$ hive
20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
hive>

Register the function under a name in the Hive Metastore using the CREATE FUNCTION syntax.
The Hive Metastore is the store where metadata related to tables and partitions is kept.

hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
    > USING JAR 'hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar';
converting to local hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar
Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar]
OK
Time taken: 17.786 seconds

To use a function only during the current Hive session without storing it permanently in the metastore, use the TEMPORARY keyword as follows:

ADD JAR 'hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar';
CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';

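Check whether the registered strip function works properly. The first query shows that the surrounding spaces are removed, and the second shows that the specified characters are removed from both ends.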

hive> SELECT strip('  bee  ');
converting to local hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar
Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/[Admin account name]/hive-1.0-SNAPSHOT.jar]
OK
bee
Time taken: 0.967 seconds, Fetched: 1 row(s)
hive> SELECT strip('banana', 'ab');
OK
nan
Time taken: 0.173 seconds, Fetched: 1 row(s)

You can delete a function as follows:

DROP FUNCTION strip;

Note

If you create UDFs for logic that you apply frequently to your data, you can query the data easily using SQL syntax.