Running Hive UDF

    Available in VPC

    Hive UDFs (User-Defined Functions) let you run your own code inside a Hive query. They are used when the desired query is difficult to express with built-in functions alone.
    UDFs are typically written to process domain-specific field data such as search logs and transaction history.

    UDFs can be divided into three types according to the number of input rows received by the function and the number of output rows returned. Each type of function has a different interface that must be implemented.

    • UDF
      It is a function that receives a single row as input and returns a single row as output.
      Most mathematical and string functions such as ROUND and REPLACE fall under this type.

    • UDAF
      It is a function that receives multiple rows as input and returns a single row as output.
      The aggregate functions such as COUNT and MAX fall under this type.

    • UDTF
      It is a function that receives a single row as input and returns multiple rows (table) as output.
      Functions such as EXPLODE fall under this type.
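
    The three row-count contracts above can be sketched in plain Java. This is only an illustration: the class and method names (UdfShapes, round, max, explode) are hypothetical helpers and not Hive interfaces, and the explode helper takes a comma-separated string for brevity, whereas Hive's EXPLODE takes an array or map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of the three row-count contracts.
// These are hypothetical helpers, not Hive interfaces.
class UdfShapes {

    // UDF shape: one input row -> one output row (like ROUND).
    static long round(double value) {
        return Math.round(value);
    }

    // UDAF shape: many input rows -> one output row (like MAX).
    static int max(List<Integer> rows) {
        return rows.stream()
                .mapToInt(Integer::intValue)
                .max()
                .orElseThrow(() -> new IllegalArgumentException("no rows"));
    }

    // UDTF shape: one input row -> many output rows (like EXPLODE).
    static List<String> explode(String commaSeparatedRow) {
        return Arrays.stream(commaSeparatedRow.split(","))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(round(2.6));                  // 3
        System.out.println(max(Arrays.asList(4, 9, 1))); // 9
        System.out.println(explode("a, b, c"));          // [a, b, c]
    }
}
```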

    This guide explains how to implement a UDF by extending the org.apache.hadoop.hive.ql.exec.UDF class and how to use it in Cloud Hadoop.

    To use Hive UDF in Cloud Hadoop, proceed with the following steps in order.

    Note

    UDFs must be implemented in Java. To use another programming language, write a user-defined script (MapReduce script) and call it with the SELECT TRANSFORM statement.

    1. Create project

    1. Use IntelliJ to create a Gradle project.

      • package: com.naverncp.hive

    2. Under the root of the project, add the dependency settings to build.gradle as shown below.

      • This example uses the same versions as the components installed in Cloud Hadoop 2.0.
      plugins {
          id 'java'
      }
      group 'com.naverncp'
      version '1.0-SNAPSHOT'
      repositories {
          mavenCentral()
          maven {
              url "http://conjars.org/repo"
          }
      }
      dependencies {
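          // 'compile' and 'testCompile' were removed in Gradle 7; 'implementation' and 'testImplementation' replace them.
          implementation group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
          implementation group: 'org.apache.hive', name: 'hive-exec', version: '3.1.2'
          implementation group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
          testImplementation group: 'junit', name: 'junit', version: '4.12'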
      }
      

    2. Implement interface

    1. Implement a UDF that satisfies the following conditions.

      • The UDF extends org.apache.hadoop.hive.ql.exec.UDF.
      • The UDF implements at least one evaluate() method.

    The evaluate() method is not declared in org.apache.hadoop.hive.ql.exec.UDF itself, because the number and types of the arguments a function will receive cannot be known in advance. Hive instead finds a matching evaluate() method on your class by reflection.

      // Strip.java
      package com.naverncp.hive;
      import org.apache.commons.lang3.StringUtils;
      import org.apache.hadoop.hive.ql.exec.Description;
      import org.apache.hadoop.hive.ql.exec.UDF;
      import org.apache.hadoop.io.Text;
      @Description(
              name = "Strip",
              value = "returns a stripped text",
              extended = "stripping characters from the ends of strings"
      )
      public class Strip extends UDF {
          private Text result = new Text();
          public Text evaluate(Text str){
              if (str == null){
                  return null;
              }
              result.set(StringUtils.strip(str.toString()));
              return result;
          }
          public Text evaluate(Text str, String stripChar){
              if (str == null){
                  return null;
              }
              result.set(StringUtils.strip(str.toString(), stripChar));
              return result;
          }
      }
    

    In the class above, two evaluate methods have been implemented.

    • First method: Removes leading and trailing whitespace from the string
    • Second method: Removes the specified characters from both ends of the string
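
    How a call with one or two arguments ends up in the right evaluate() overload can be sketched with plain Java reflection. This is a simplified illustration under stated assumptions, not Hive's actual resolver: TwoOverloads and EvaluateResolver are hypothetical names, the stand-in class does not extend any Hive type, and only exact argument types are matched (null arguments and type conversions are not handled).

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class with two evaluate() overloads.
class TwoOverloads {
    public String evaluate(String str) {
        return "one-arg:" + str;
    }

    public String evaluate(String str, String stripChars) {
        return "two-arg:" + str + "," + stripChars;
    }
}

// Hypothetical resolver: picks the evaluate() overload by argument types.
class EvaluateResolver {
    static Object call(Object udf, Object... args) {
        // Collect the runtime type of each argument (nulls are not supported here).
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass();
        }
        try {
            // Look up the public evaluate() method with exactly these parameter types.
            Method method = udf.getClass().getMethod("evaluate", argTypes);
            return method.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("no matching evaluate() method", e);
        }
    }

    public static void main(String[] args) {
        TwoOverloads udf = new TwoOverloads();
        System.out.println(call(udf, "bee"));          // one-arg:bee
        System.out.println(call(udf, "banana", "ab")); // two-arg:banana,ab
    }
}
```

    Because the lookup is by name and parameter types, adding another evaluate() overload to the class needs no change to the base class.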

    2. To use the UDF in Hive, first package the Java class into a .jar file.

      • The following example uploads the .jar to hdfs:///user/example.
      $ ./gradlew clean
      $ ./gradlew build
      $ scp -i ~/Downloads/suewoon-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
      $ ssh -i ~/Downloads/suewoon-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com

      [sshuser@e-001-example-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/example/
      

    3. Use Hive

    1. Run Hive CLI using the following command.

      • Because HiveServer is installed on the edge node, you do not need to add any options.
      [sshuser@e-001-example-0917-hd ~]$ hive
      20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
      log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
      
      Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
      hive>
      
    2. Register the function in the metastore as shown below.

      • Set the name with the CREATE FUNCTION statement.
      • Hive metastore: The place where metadata related to tables and partitions is saved
      hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
          > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
      converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
      Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
      Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
      OK
      Time taken: 17.786 seconds
      

      To use the function only for the current Hive session instead of saving it permanently in the metastore, use the TEMPORARY keyword as shown below.

      ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
      CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
      
    3. Check that the registered strip function runs correctly. The output shows that the surrounding spaces and the specified characters have been removed.

      hive> select strip('  bee  ');
      converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
      Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
      Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
      OK
      bee
      Time taken: 0.967 seconds, Fetched: 1 row(s)
      hive> select strip('banana', 'ab');
      OK
      nan
      Time taken: 0.173 seconds, Fetched: 1 row(s)
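
    The two results above can also be reproduced off-cluster before uploading a .jar. The sketch below is a plain-Java approximation of the stripping behavior, written without commons-lang or Hive so it runs anywhere; StripCheck and shouldStrip are hypothetical names, and the logic is an assumption about how the two evaluate() methods behave, based on the outputs shown.

```java
// Plain-Java approximation of the strip behavior, for local checks only.
class StripCheck {

    // Removes whitespace from both ends, like the one-argument evaluate().
    static String strip(String str) {
        return strip(str, null);
    }

    // Removes any character contained in stripChars from both ends.
    // A null stripChars means "strip whitespace".
    static String strip(String str, String stripChars) {
        if (str == null) {
            return null;
        }
        int start = 0;
        int end = str.length();
        while (start < end && shouldStrip(str.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && shouldStrip(str.charAt(end - 1), stripChars)) {
            end--;
        }
        return str.substring(start, end);
    }

    private static boolean shouldStrip(char c, String stripChars) {
        return stripChars == null
                ? Character.isWhitespace(c)
                : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("  bee  ") + "]");      // [bee]
        System.out.println("[" + strip("banana", "ab") + "]"); // [nan]
    }
}
```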
      

    You can delete functions as shown below.

    DROP FUNCTION strip;
    
    Note

    If you package frequently used logic as UDFs tailored to the characteristics of your data, you can query that data easily with SQL statements.

