Run Hive UDF

The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

Available in VPC

Hive user-defined functions (UDFs) allow you to run your own code within a Hive query, and are used when the desired query is difficult to express using only built-in functions.
Typically, you can write and use UDFs to handle data from specific fields, such as search logs and transaction history.

There are 3 types of UDFs depending on the number of input rows that are passed to the function and the number of output rows that are returned by the function. Each type of function has a different interface to implement.

  • UDF
    A UDF is a function that receives a single input row and returns a single output value per row.
    Most mathematical and string functions, such as ROUND and REPLACE, are of the UDF type.

  • UDAF
    A UDAF is a function that receives multiple input rows and returns a single output row.
    Aggregate functions, such as COUNT and MAX, are examples of UDAFs.

  • UDTF
    A UDTF is a function that receives a single input row and returns multiple output rows (table).
    EXPLODE is an example of a UDTF.
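
The three shapes above can be pictured with a plain-Java analogy. This is only a sketch of the row-in/row-out relationships and uses no Hive classes. RowShapes and its method names are hypothetical and were chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java analogy for the three function shapes; no Hive classes are used.
class RowShapes {

    // UDF shape: one input row in, one value out.
    static String udfUpper(String row) {
        return row.toUpperCase(Locale.ROOT);
    }

    // UDAF shape: many input rows in, one aggregated value out.
    static int udafMaxLength(List<String> rows) {
        return rows.stream().mapToInt(String::length).max().orElse(0);
    }

    // UDTF shape: one input row in, zero or more output rows out.
    static List<String> udtfSplit(String row) {
        return Arrays.asList(row.split(" "));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a b", "c d e");
        // One output value per input row.
        System.out.println(rows.stream().map(RowShapes::udfUpper).collect(Collectors.toList()));
        // One output value for all input rows.
        System.out.println(udafMaxLength(rows));
        // Several output rows per input row.
        System.out.println(rows.stream()
                .flatMap(r -> udtfSplit(r).stream())
                .collect(Collectors.toList()));
    }
}
```

In Hive, each shape corresponds to a different base type to implement, which is why the three kinds of functions cannot share one interface.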

This guide describes how to write a Hive UDF by extending the org.apache.hadoop.hive.ql.exec.UDF class and how to use it in Cloud Hadoop.

Note

You must implement UDFs in Java. To use other programming languages, write a user-defined script (MapReduce script) and call it with the SELECT TRANSFORM syntax.

To use a Hive UDF in Cloud Hadoop, follow these steps in order:

1. Create project

  1. Create a Gradle project in IntelliJ.

    • package: com.naverncp.hive

  2. Add dependency settings in build.gradle under the project root as follows:

    • The example uses the same component versions as those installed in Cloud Hadoop 2.0.
    plugins {
        id 'java'
    }
    group 'com.naverncp'
    version '1.0-SNAPSHOT'
    repositories {
        mavenCentral()
        maven {
            url "http://conjars.org/repo"
        }
    }
    dependencies {
        // The compile/testCompile configurations were removed in Gradle 7; use implementation/testImplementation.
        implementation group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
        implementation group: 'org.apache.hive', name: 'hive-exec', version: '3.1.2'
        implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
        testImplementation group: 'junit', name: 'junit', version: '4.12'
    }
    

2. Implement interface

  1. Implement a UDF that meets the following conditions:

    • A UDF must extend org.apache.hadoop.hive.ql.exec.UDF.
    • A UDF must implement at least one evaluate() method.

The evaluate() method is not declared in org.apache.hadoop.hive.ql.exec.UDF because the number and types of arguments that a function receives cannot be known in advance. Hive instead locates the matching evaluate() method by reflection when the query runs.
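
As a rough illustration of why evaluate() can stay outside the base type, the following plain-Java sketch looks up a method named evaluate by reflection and picks the overload that matches the supplied arguments. SampleUdf and EvaluateLookup are hypothetical names, and the matching logic is deliberately simplified. Hive's own resolver also handles type conversion, which is not shown here.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class: evaluate() is an ordinary overloaded
// method, not an override of anything declared in a base type.
class SampleUdf {
    public String evaluate(String s) {
        return s == null ? null : s.trim();
    }

    public String evaluate(String s, String suffix) {
        if (s == null) {
            return null;
        }
        return s.endsWith(suffix) ? s.substring(0, s.length() - suffix.length()) : s;
    }
}

class EvaluateLookup {
    // Finds a public method named "evaluate" whose parameter types accept the
    // given arguments and invokes it. Simplified: no type conversion is done.
    static Object call(Object udf, Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (!m.getName().equals("evaluate")) {
                continue;
            }
            Class<?>[] types = m.getParameterTypes();
            if (types.length != args.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < types.length; i++) {
                if (args[i] != null && !types[i].isInstance(args[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no matching evaluate() method");
    }

    public static void main(String[] args) {
        SampleUdf udf = new SampleUdf();
        System.out.println(call(udf, "  bee  "));            // one-argument overload
        System.out.println(call(udf, "report.csv", ".csv")); // two-argument overload
    }
}
```

Because the overload is chosen from the arguments at call time, a UDF class can expose any number of evaluate() methods with different signatures.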

    // Strip.java
    package com.naverncp.hive;
    import org.apache.commons.lang3.StringUtils; // matches the commons-lang3 dependency in build.gradle
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    @Description(
            name = "Strip",
            value = "returns a stripped text",
            extended = "stripping characters from the ends of strings"
    )
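    public class Strip extends UDF {
        // Reused across calls to avoid allocating a new Text object for every row.
        private final Text result = new Text();

        // Strips whitespace from both ends of the input.
        public Text evaluate(Text str){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString()));
            return result;
        }

        // Strips any of the characters in stripChar from both ends of the input.
        public Text evaluate(Text str, String stripChar){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString(), stripChar));
            return result;
        }
    }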

Two evaluate methods are implemented in the class above.

  • 1st method: Removes whitespace from the beginning and end of a string.
  • 2nd method: Removes the specified characters from the beginning and end of a string.
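
The behavior of the two methods can be sketched in plain Java without any library. StripSketch is a hypothetical class written for illustration. It approximates what the strip calls do and is not the commons-lang implementation.

```java
// Plain-Java sketch of the two behaviors; it does not call commons-lang.
class StripSketch {

    // Removes whitespace from both ends (mirrors the one-argument evaluate()).
    static String strip(String s) {
        return strip(s, null);
    }

    // Removes any character contained in stripChars from both ends
    // (mirrors the two-argument evaluate()). A null stripChars means whitespace.
    static String strip(String s, String stripChars) {
        if (s == null) {
            return null;
        }
        int start = 0;
        int end = s.length();
        while (start < end && shouldStrip(s.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && shouldStrip(s.charAt(end - 1), stripChars)) {
            end--;
        }
        return s.substring(start, end);
    }

    private static boolean shouldStrip(char c, String stripChars) {
        return stripChars == null ? Character.isWhitespace(c) : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("  bee  ") + "]"); // prints [bee]
        System.out.println(strip("banana", "ab"));        // prints nan
    }
}
```

Note that the second argument is treated as a set of characters, not as a substring: every leading or trailing character that appears in it is removed.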

  2. To use UDFs in Hive, first package the Java class into a .jar file.

    • The following example builds the .jar file, copies it to the edge node, and uploads it to hdfs:///user/example.
    $ ./gradlew clean
    $ ./gradlew build
    $ scp -i ~/Downloads/example-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
    $ ssh -i ~/Downloads/example-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com
    
    [sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/
    

3. Use Hive

  1. Run Hive CLI using the following commands:

    • You don't need to specify any options because HiveServer is installed on the edge node.
    [sshuser@e-001-example-0917-hd ~]$ hive
    20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
    log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
    
    Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
    hive>
    
  2. Register a function in Metastore as follows:

    • Set a name using the CREATE FUNCTION syntax.
    • Hive Metastore: A space where metadata related to tables and partitions is stored.
    hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
        > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    Time taken: 17.786 seconds
    

    To use a function only for the current Hive session without storing it permanently in the metastore, use the TEMPORARY keyword as follows:

    ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
    
  3. Check whether the strip function works properly. The first query confirms that surrounding spaces are removed, and the second confirms that the specified characters are removed from both ends.

    hive> select strip('  bee  ');
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    bee
    Time taken: 0.967 seconds, Fetched: 1 row(s)
    hive> select strip('banana', 'ab');
    OK
    nan
    Time taken: 0.173 seconds, Fetched: 1 row(s)
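
The register-and-query steps above can also be driven from a Java program over JDBC. The following is a minimal sketch under several assumptions that do not come from this guide: the Hive JDBC driver is on the classpath, HiveServer2 is reachable at the placeholder URL shown, and the sshuser account can connect without a password. RegisterStripOverJdbc and createFunctionSql are hypothetical names.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class RegisterStripOverJdbc {
    // Assumed endpoint; replace it with the HiveServer2 address of your cluster.
    static final String URL = "jdbc:hive2://localhost:10000/default";

    // Builds the same CREATE FUNCTION statement that is typed into the Hive CLI.
    static String createFunctionSql(String name, String className, String jarUri) {
        return "CREATE FUNCTION " + name + " AS '" + className + "' USING JAR '" + jarUri + "'";
    }

    public static void main(String[] args) throws SQLException {
        // The user name and empty password are assumptions for illustration.
        try (Connection conn = DriverManager.getConnection(URL, "sshuser", "");
             Statement stmt = conn.createStatement()) {
            // Register the function in the metastore.
            stmt.execute(createFunctionSql("strip", "com.naverncp.hive.Strip",
                    "hdfs:///user/example/hive-1.0-SNAPSHOT.jar"));
            // Call the registered function and print each returned row.
            try (ResultSet rs = stmt.executeQuery("SELECT strip('banana', 'ab')")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```

The statement text is identical to what the CLI example uses, so a function registered this way is visible from the Hive CLI as well.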
    

You can delete a function as follows:

    DROP FUNCTION strip;

Note

If you create UDFs for logic that you apply frequently to your data, you can run that logic directly in SQL queries.