Running Hive UDF
Available in VPC
Hive UDFs (user-defined functions) let you run custom code from within a Hive query when the desired logic is difficult to express with the built-in functions.
UDFs are typically written for domain-specific data such as search logs or transaction histories.
UDFs can be divided into three types according to the number of input rows received by the function and the number of output rows returned. Each type of function has a different interface that must be implemented.
- UDF: a function that receives a single row as input and returns a single row as output. Most mathematical and string functions, such as ROUND and REPLACE, fall under this type.
- UDAF: a function that receives multiple rows as input and returns a single row as output. Aggregate functions such as COUNT and MAX fall under this type.
- UDTF: a function that receives a single row as input and returns multiple rows (a table) as output. Functions such as EXPLODE fall under this type.
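The three cardinalities above can be sketched in plain Java (illustrative only; this has no Hive dependencies, and the method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class UdfTypesDemo {
    // UDF-style: one input row -> one output row (like ROUND or REPLACE)
    static String oneToOne(String row) {
        return row.trim();
    }

    // UDAF-style: many input rows -> one output row (like COUNT or MAX)
    static int manyToOne(List<String> rows) {
        return rows.size();
    }

    // UDTF-style: one input row -> many output rows (like EXPLODE on a delimited string)
    static List<String> oneToMany(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        System.out.println(oneToOne("  bee  "));                      // bee
        System.out.println(manyToOne(Arrays.asList("a", "b", "c")));  // 3
        System.out.println(oneToMany("a,b,c"));                       // [a, b, c]
    }
}
```

In real Hive code, each cardinality maps to a different interface to implement, as the guide shows below for the single-row UDF case.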
This guide describes how to implement a Hive UDF by extending the `org.apache.hadoop.hive.ql.exec.UDF` class and how to use it in Cloud Hadoop.
To use Hive UDF in Cloud Hadoop, proceed with the following steps in order.
UDFs must be implemented in Java. To use another programming language, write a user-defined script (a MapReduce script) and invoke it with the SELECT TRANSFORM statement instead.
1. Create project
Use IntelliJ to create a Gradle project.
- package: `com.naverncp.hive`
Under the root of the project, add the dependency settings to `build.gradle` as shown below.

- The example uses the same versions as the components installed in Cloud Hadoop 2.0.
```groovy
plugins {
    id 'java'
}

group 'com.naverncp'
version '1.0-SNAPSHOT'

repositories {
    mavenCentral()
    maven { url "http://conjars.org/repo" }
}

dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'org.apache.hive', name: 'hive-exec', version: '3.1.2'
    compile group: 'org.apache.commons', name: 'commons-lang3', version: '3.9'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
```
2. Implement interface
Implement a UDF that satisfies the following conditions.

- The UDF inherits `org.apache.hadoop.hive.ql.exec.UDF`.
- The UDF implements at least one `evaluate()` method.
The `evaluate()` method is not defined in the `org.apache.hadoop.hive.ql.exec.UDF` class itself, because it is impossible to know in advance how many arguments a function will receive or what their types will be.
```java
// Strip.java
package com.naverncp.hive;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(
        name = "Strip",
        value = "returns a stripped text",
        extended = "stripping characters from the ends of strings"
)
public class Strip extends UDF {
    private Text result = new Text();

    // Strips leading and trailing whitespace.
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // Strips the given characters from both ends of the string.
    public Text evaluate(Text str, String stripChar) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChar));
        return result;
    }
}
```
The class above implements two `evaluate` methods.

- First method: removes leading and trailing whitespace from the string
- Second method: removes the specified characters from both ends of the string
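The stripping behavior can be checked in plain Java before packaging; in this sketch, `stripEnds` is a hypothetical stand-in for `StringUtils.strip` so that no Hive or Hadoop dependencies are needed:

```java
// StripDemo.java - plain-Java sketch of what the two evaluate() overloads do.
// stripEnds is a hypothetical helper mirroring org.apache.commons.lang3.StringUtils.strip.
public class StripDemo {
    // Removes any character in stripChars (or whitespace if stripChars is null)
    // from both ends of the string.
    static String stripEnds(String str, String stripChars) {
        if (str == null) {
            return null;
        }
        int start = 0;
        int end = str.length();
        while (start < end && matches(str.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && matches(str.charAt(end - 1), stripChars)) {
            end--;
        }
        return str.substring(start, end);
    }

    private static boolean matches(char c, String stripChars) {
        return stripChars == null ? Character.isWhitespace(c)
                                  : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(stripEnds("  bee  ", null)); // bee
        System.out.println(stripEnds("banana", "ab"));  // nan
    }
}
```

The two calls in `main` mirror the Hive queries run later in this guide, so you can predict the query results locally.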
To use the UDF in Hive, first package the Java class into a `.jar` file.

- The following example uploads the `.jar` under `hdfs:///user/suewoon`.
```
$ ./gradlew clean
$ ./gradlew build
$ scp -i ~/Downloads/suewoon-home.pem ~/IdeaProjects/hive/build/libs/hive-1.0-SNAPSHOT.jar sshuser@pub-4rrsj.hadoop.ntruss.com:~/
$ ssh -i ~/Downloads/suewoon-home.pem sshuser@pub-4rrsj.hadoop.ntruss.com
[sshuser@e-001-suewoon-0917-hd ~]$ hadoop fs -copyFromLocal hive-1.0-SNAPSHOT.jar /user/suewoon/
```
3. Use Hive
Run the Hive CLI using the following command.

- Since HiveServer is installed on the edge node, you don't have to add any options.
```
[sshuser@e-001-example-0917-hd ~]$ hive
20/11/06 16:04:39 WARN conf.HiveConf: HiveConf of name hive.server2.enable.doAs.property does not exist
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.5.0-292/0/hive-log4j.properties
hive>
```
Register the function in the metastore as shown below.

- Set the function name with the CREATE FUNCTION statement.
- Hive metastore: the place where metadata related to tables and partitions is stored

```
hive> CREATE FUNCTION strip AS 'com.naverncp.hive.Strip'
    > USING JAR 'hdfs:///user/example/hive-1.0-SNAPSHOT.jar';
converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
Added [/tmp/99c3d137-f58e-4fab-8a2a-98361e3e59a1_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
OK
Time taken: 17.786 seconds
```
If you want to use the function only during the current Hive session, rather than saving it permanently in the metastore, use the TEMPORARY keyword as below.

```
ADD JAR hdfs:///user/example/hive-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.naverncp.hive.Strip';
```
Check that the built `strip` function runs successfully. You can see that the spaces have been removed.

```
hive> select strip(' bee ');
converting to local hdfs:///user/example/hive-1.0-SNAPSHOT.jar
Added [/tmp/70e2e17a-ecca-41ff-9fe6-48417b8ef797_resources/hive-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/example/hive-1.0-SNAPSHOT.jar]
OK
bee
Time taken: 0.967 seconds, Fetched: 1 row(s)
hive> select strip('banana', 'ab');
OK
nan
Time taken: 0.173 seconds, Fetched: 1 row(s)
```
You can delete the function as shown below.

```
DROP FUNCTION strip;
```
If you package frequently used logic as UDFs suited to your data's characteristics, you can easily query the data with plain SQL statements.