Available in VPC
This section describes how to manage Iceberg tables using Hive on NAVER Cloud Platform's Cloud Hadoop.
Apache Iceberg overview
What Is Apache Iceberg?
Iceberg is a high-performance table format designed for analyzing massive datasets.
It not only provides the reliability and simplicity of SQL tables in a big data environment, but also enables simple integration with widely used data processing frameworks such as Spark, Flink, Hive, and Trino.
Apache Iceberg features
- Expressive SQL: supports widely used SQL commands for reading, writing, and updating data. Anyone familiar with SQL commands can build a data lake and perform most data lake operations with Apache Iceberg, without needing to learn a new language.
- Data consistency: ensures that all users reading and writing data see the same data, thereby guaranteeing consistency.
- Full schema evolution: even if the schema changes, there is no need to rewrite the table. You can add columns, rename columns, or remove columns in the data table without modifying the underlying data.
- Hidden partitioning: Iceberg derives partition values from column values, optionally applying transforms. Because you don't have to maintain physical partition columns yourself, partitioning remains hidden from users.
- Time travel and rollback: Iceberg supports data versioning, so you can track how data changes over time. You can query earlier versions of the data, and the time travel function lets you inspect the changes made by update and delete operations.
- Data compaction: data compaction is natively supported, giving you various options for optimizing file layout and size.
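As a rough illustration of full schema evolution, the HiveQL statements below change a table's schema without rewriting its data. This is a sketch only: it assumes an existing Iceberg table named `dct_iceberg_table` (the example table used later in this guide) and a hypothetical `grade` column, and whether each statement is supported depends on your Hive and Iceberg versions.

```sql
-- Illustrative only: assumes an Iceberg table dct_iceberg_table already exists.
-- Add a new column; existing data files are not rewritten.
ALTER TABLE dct_iceberg_table ADD COLUMNS (grade int);

-- Rename an existing column; the underlying files are untouched.
ALTER TABLE dct_iceberg_table CHANGE COLUMN category category_name string;
```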
Using Iceberg in Data Catalog
- In the Data Catalog console, you can create and view Iceberg tables and then manage the data according to the guide below.
- Later on, additional functions such as data compaction, snapshot management, and table optimization will be provided. (Scheduled in the second half of 2025)
Cloud Hadoop settings
Set the Cloud Hadoop cluster's Hive Metastore repository to Data Catalog.
Accessing Hue in the Cloud Hadoop Ecosystem
- Click the cluster, and then check the domain address on the cluster details screen.
- In your web browser's address bar, enter the domain address and port number as follows, then access the Hue web page:
https://{domain address}:8081
Hue (Hadoop User Experience) is a web-based interface that works with Apache Hadoop clusters. Bundled with other Hadoop ecosystem components, Hue can be used to run Hive tasks and Spark jobs.
Verifying database and table created in Data Catalog
You can check in Hue the database and Iceberg type tables created in the Data Catalog console.
For information on creating Iceberg tables in the Data Catalog console, see Table guide.
When creating an Iceberg-type table directly in Hive, you must set the external.table.purge=FALSE table property.
-- Example table creation query
CREATE EXTERNAL TABLE `dct_iceberg_table`(
  `id` bigint,
  `name` string,
  `category` string)
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION
  's3a://{table location}'
TBLPROPERTIES (
  'external.table.purge'='FALSE',
  'table_type'='ICEBERG');
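To confirm the table was registered with the expected properties, you can inspect it with standard Hive commands. The exact output layout varies by Hive version; this is a quick sanity check, not part of the required setup.

```sql
-- Check the storage handler, location, and table properties.
DESCRIBE FORMATTED dct_iceberg_table;

-- Or list just the table properties, including external.table.purge.
SHOW TBLPROPERTIES dct_iceberg_table;
```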
Settings for Hive query execution
- Add the Iceberg runtime library using add jar.
add jar /usr/nch/3.1.0.0-78/hive/lib/iceberg-hive-runtime-1.2.1.jar;
- Add the libfb303 library using add jar.
add jar /usr/nch/3.1.0.0-78/hive/lib/libfb303-0.9.3.jar;
- Set the option values.
SET hive.vectorized.execution.enabled=false;
SET iceberg.engine.hive.lock-enabled=false;
SET hive.execution.engine=mr;
If you do not set these options, INSERT queries execute without errors, but the results may not appear when you query the table.
- hive.vectorized.execution.enabled: because Iceberg can conflict with Hive's vectorization feature, you must set this to false.
- iceberg.engine.hive.lock-enabled: disable this so that Iceberg does not use the lock feature in Hive tables. If enabled, it may limit write operations.
- hive.execution.engine: because Iceberg may not fully integrate with Hive's Tez engine, set this to mr (MapReduce).
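In the Hive CLI or Beeline, running SET with a property name but no value prints the property's current value, which is a quick way to confirm that the settings above took effect in your session:

```sql
-- Each statement prints the current value of the property.
SET hive.vectorized.execution.enabled;
SET iceberg.engine.hive.lock-enabled;
SET hive.execution.engine;
```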
Run Hive query
- Insert data using INSERT.
INSERT INTO dct_iceberg_table VALUES (1, 'John Doe', 'student');
- Check the data with SELECT.
SELECT * FROM dct_iceberg_table;
Changes in Iceberg files upon query execution
- Data files
- When storing table data, Iceberg uses columnar formats such as Parquet, Avro, or ORC (the default is Parquet).
- New data files are created according to the file format specified in Iceberg settings.
- They are saved in the s3a://{bucket name}/{table location}/data/ subdirectory.
- Iceberg uses a unique UUID-based file naming convention to avoid conflicts.
- Manifest file
- This file stores metadata for each data file.
- It includes the data file path, minimum/maximum values (stats) of the file, and partition details.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in {UUID}.avro format.
- Manifest list file
- This manages a list of manifest files, and when a new INSERT task is performed, Iceberg updates changes based on the previous manifest list.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in snap-{snapshot-id}-{UUID}.avro format.
- Metadata JSON file
- After each task (INSERT, DELETE, UPDATE), Iceberg creates a new snapshot of the table.
- This file captures the table's entire state, supporting rollback and time travel functions.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in *.metadata.json format.
- Iceberg table structural changes
s3a://{bucket name}/{table-location}/
├── data/
│   ├── {UUID}.parquet
│   └── {UUID}.parquet
└── metadata/
    ├── snap-{snapshot-id}-{UUID}.avro
    ├── {UUID}.avro
    ├── {UUID}.avro
    └── *.metadata.json
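If you want to inspect these files directly from a cluster node, a recursive hadoop fs listing against the table location shows the same layout. The bucket and table location placeholders below are the ones used throughout this guide; substitute your own values.

```
hadoop fs -ls -R s3a://{bucket name}/{table location}/
```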
Note on supported functions by version
- Check the Hive version in Hue.
SELECT version();
- Depending on your Hive version, Delete and Update may not be supported.
For more information on SQL support by Iceberg across different Hive versions, see Iceberg documentation.