Running Hive UDF


Available in Classic

Hive UDFs (user-defined functions) let users run custom code within Hive queries when the built-in functions cannot express the desired logic.
UDFs are usually created to work with data in specific domains, such as search logs and transaction histories.

UDFs can be categorized into three types according to the number of input rows a function receives and the number of output rows it returns. Each type implements a different interface.

  • UDF
    A function that receives a single row as input and returns a single row as output.
    Most mathematical and string functions, such as ROUND and REPLACE, fall under this type.

  • UDAF
    A function that receives multiple rows as input and returns a single row as output.
    Aggregate functions such as COUNT and MAX fall under this type.

  • UDTF
    A function that receives a single row as input and returns multiple rows (a table) as output.
    Functions such as EXPLODE fall under this type.
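As a rough illustration of the three shapes, the plain-Java sketch below shows how the row counts differ; it does not use the actual Hive interfaces, and the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UdfShapes {
    // UDF shape: one input row -> one output value (like ROUND).
    static long roundValue(double x) {
        return Math.round(x);
    }

    // UDAF shape: many input rows -> one output value (like MAX).
    static int maxValue(List<Integer> rows) {
        return Collections.max(rows);
    }

    // UDTF shape: one input row -> many output rows (like EXPLODE on an array).
    static List<String> explode(String arrayAsCsv) {
        return Arrays.asList(arrayAsCsv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(roundValue(2.6));                  // 3
        System.out.println(maxValue(Arrays.asList(1, 9, 4))); // 9
        System.out.println(explode("a,b,c"));                 // [a, b, c]
    }
}
```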

This guide explains how to implement a Hive UDF based on the org.apache.hadoop.hive.ql.exec.UDF class and use it in Cloud Hadoop.

To use Hive UDF in Cloud Hadoop, proceed with the following steps in order.

Note

UDFs must be implemented in Java. If you want to use another programming language, create a user-defined script (MapReduce script) and run it with the SELECT TRANSFORM syntax.

1. Create project

  1. Use IntelliJ to create a Gradle project.

    • package: com.naverncp.hive


  2. In the project root, add the dependency settings below to build.gradle.

    • The example uses the same versions as the components installed in Cloud Hadoop 1.2.
    plugins {
        id 'java'
    }
    group 'com.naverncp'
    version '1.0-SNAPSHOT'
    sourceCompatibility = 1.8
    repositories {
        mavenCentral()
        maven {
            url "http://conjars.org/repo"
        }
    }
    dependencies {
        compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.7.3'
        compile group: 'org.apache.hive', name: 'hive-exec', version: '1.2.2'
        compile group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
        testCompile group: 'junit', name: 'junit', version: '4.12'
    }
    

2. Implement interface

  1. Implement a UDF that satisfies the following conditions.

    • The UDF extends org.apache.hadoop.hive.ql.exec.UDF.
    • The UDF implements at least one evaluate() method.

    The evaluate() method is not defined in the org.apache.hadoop.hive.ql.exec.UDF class itself, because it is impossible to know in advance how many arguments a function will receive or what their types will be.

    // Strip.java
    package com.naverncp.hive;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    @Description(
            name = "Strip",
            value = "returns a stripped text",
            extended = "stripping characters from the ends of strings"
    )
    public class Strip extends UDF {
        private Text result = new Text();
        public Text evaluate(Text str){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString()));
            return result;
        }
        public Text evaluate(Text str, String stripChar){
            if (str == null){
                return null;
            }
            result.set(StringUtils.strip(str.toString(), stripChar));
            return result;
        }
    }
    

In the class above, two evaluate methods have been implemented.

  • First method: removes leading and trailing whitespace from the string
  • Second method: removes any of the specified characters from both ends of the string
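To make the stripping semantics concrete without the Hadoop dependencies, here is a minimal plain-Java sketch of what StringUtils.strip does (the StripDemo class is hypothetical, for illustration only):

```java
public class StripDemo {
    // Removes any of the given characters from BOTH ends of the string,
    // mirroring org.apache.commons.lang3.StringUtils.strip(String, String).
    static String strip(String s, String stripChars) {
        int start = 0, end = s.length();
        while (start < end && stripChars.indexOf(s.charAt(start)) >= 0) start++;
        while (end > start && stripChars.indexOf(s.charAt(end - 1)) >= 0) end--;
        return s.substring(start, end);
    }

    // One-argument form: strips common whitespace, like the first evaluate().
    static String strip(String s) {
        return strip(s, " \t\r\n");
    }

    public static void main(String[] args) {
        System.out.println(strip("  bee  "));      // bee
        System.out.println(strip("banana", "ab")); // nan
    }
}
```

Note that the characters are removed from both ends, not only from the end: stripping "ab" from "banana" removes the leading "ba" and the trailing "a", leaving "nan".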
  2. To use the UDF in Hive, first package the Java class into a .jar file.

    • The following example uploads the built .jar under hdfs:///user/example.
    $ ./gradlew clean
    $ ./gradlew build
    $ scp -i ~/Downloads/example-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
    $ ssh -i ~/Downloads/example-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com

    [sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/
    

3. Use Hive

  1. Create a sample Hive table using the following command.

    [sshuser@e-001-example-0917-hd ~]$ echo 'dummy' > /tmp/dummy.txt
    [sshuser@e-001-example-0917-hd ~]$ hive -e "CREATE TABLE dummy (value STRING); \
    LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
    OVERWRITE INTO TABLE dummy"
    
  2. Run Hive CLI using the following command.

    • Since HiveServer is installed on the edge node, you don't need to add any options.
    [sshuser@e-001-example-0917-hd ~]$ hive
    20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
    log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
    
    Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
    hive>
    
  3. Register the function in the metastore as shown below.

    • Set the function name with the CREATE FUNCTION statement.
    • Hive metastore: the repository where metadata about tables and partitions is stored
    hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
        > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    Time taken: 17.786 seconds
    

    If you want to use the function only for the duration of a Hive session, rather than saving it permanently in the metastore, use the TEMPORARY keyword as below.

    ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
    
  4. Check whether the strip function you built runs successfully. You can see that the whitespace has been removed.

    hive> select strip('  bee  ') from dummy;
    converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
    Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
    Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
    OK
    bee
    Time taken: 0.967 seconds, Fetched: 1 row(s)
    hive> select strip('banana', 'ab') from dummy ;
    OK
    nan
    Time taken: 0.173 seconds, Fetched: 1 row(s)
    

You can delete the function as shown below.

DROP FUNCTION strip;
Note

If you implement frequently used logic as UDFs tailored to your data's characteristics, you can easily query the data with plain SQL statements.