Using Impala


The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

Available in VPC

Impala is a massively parallel processing (MPP) engine that analyzes stored data in real time using interactive SQL. In addition to Hadoop HDFS, it can read data from storage sources such as HBase and Kudu, and it can be used together with Hive and Hue. It uses your existing Hive metastore, so no dedicated metastore is required.
Impala was developed to address the limited real-time query performance and multi-user support of Hive. While Hive and Pig process queries through the MapReduce framework, Impala uses its own distributed query engine, which gives it faster response times.

Note

Impala supports single-statement transactions, but not multi-statement transactions.

Impala components

The Impala service consists of three major components: Impala Daemon, StateStore, and Catalog Service.

  • Impala Daemon (Impalad)
    Installed on the data nodes in the Hadoop cluster, it plans, schedules, and executes the requested queries. Each impalad consists of a Query Planner, a Query Coordinator, and a Query Exec Engine.

    • Query Planner: establishes the execution plan for a query
    • Query Coordinator: manages the job list and scheduling, and sends requests to the executors
    • Query Exec Engine: optimizes and executes queries and returns the results to the Coordinator
  • Impala Statestore (StateStored)
    It monitors the state of the impalad on each data node in the cluster and, when requested by catalogd, broadcasts metadata updates to the impalads.

  • Impala Catalog (Catalogd)
    It propagates metadata changes to the impalads by requesting statestored to broadcast them, so that queries see up-to-date metadata. Metadata changes made through impalad are synchronized automatically, but changes made directly in Hive or HDFS must be synchronized with the REFRESH statement.
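
As a minimal sketch of the manual synchronization described above, the following shell function runs REFRESH through impala-shell after a table's data has been changed outside Impala. The function name `refresh_table` and the variables `IMPALA_SHELL` and `IMPALAD_HOST` are illustrative assumptions, not part of the service; the binary path is the one used in the Impala shell section of this guide. `-i` and `-q` are the impala-shell options for the target impalad and for a query string.

```shell
# Assumed defaults; override them in your environment.
IMPALA_SHELL="${IMPALA_SHELL:-/usr/lib/impala/impala-shell/impala-shell}"
IMPALAD_HOST="${IMPALAD_HOST:-<impalad-HOST-NAME>}"   # placeholder, not a real host

# refresh_table (hypothetical helper): reload the metadata of one table
# after its data was changed directly in Hive or HDFS.
# $1 is the table name, optionally qualified as database.table.
refresh_table() {
  "$IMPALA_SHELL" -i "$IMPALAD_HOST" -q "REFRESH $1;"
}
```

For example, `refresh_table test.testTable` would refresh the table created in the Impala shell section.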

Query execution process

The process of a query execution is as follows:

  1. Submit a query to a specific impalad in the cluster using the Impala shell, ODBC, and so on.
  2. The impalad retrieves the table schema from the Hive metastore, validates the query statement, and collects the data blocks and location information required for execution from the HDFS name node.
  3. Based on the most recently updated Impala metadata, the information required to execute the query is propagated to all impalads in the cluster.
  4. Each impalad that receives the query and metadata reads the data blocks it must process from its local directory and processes the query.
  5. When all impalads have completed their tasks, the impalad that received the query from the user collects the results and returns them to the user.


How to use Impala

This section describes how to use Impala.

Create cluster

From the NAVER Cloud Platform console, create a Cloud Hadoop cluster. For more information on creating clusters, see Create cluster.

Note

Starting with Cloud Hadoop 1.9, you can use clusters with Impala v4.1.0 installed.

Check Impala service from Ambari UI

You can check the Impala service in the Ambari UI. From this page, you can also start and stop each component of the service.


  • Summary: check the hosts on which the components are installed
  • Configs: change the configuration of the Impala service
  • Quick Links: Impala Statestore WEB-UI, Impala Catalog WEB-UI, Impala Server WEB-UI
    • Accessing these links requires tunneling. We recommend using the Web UI link provided by the console instead. For more information, see Accessing Impala WEB UI.

Caution

If you install the Atlas service on a cluster where Impala is installed, the Atlas service may not run properly because of a port conflict.
In Ambari UI > Atlas > CONFIGS > ADVANCED > Advanced application-properties, change atlas.server.http.port to a value other than the port that Impala uses (21000), and then restart the service.
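
Before choosing a new value for atlas.server.http.port, it can help to confirm which ports already have listeners on the host. The sketch below assumes the `ss` utility (iproute2) is available on the node; the helper name `port_is_listening` is hypothetical. It reads `ss -ltn` style output from standard input and succeeds if any listener is bound to the given port.

```shell
# port_is_listening (hypothetical helper): read `ss -ltn` output on stdin
# and return success if a listener is bound to the port given as $1.
port_is_listening() {
  awk -v port="$1" '
    NR > 1 {                      # skip the header line
      n = split($4, parts, ":")   # $4 is "Local Address:Port"
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

For example, `ss -ltn | port_is_listening 21000 && echo "port 21000 is in use"` reports whether the port mentioned above is taken on the current host.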

Impala shell

The Impala shell provides an interactive shell for running queries. For more information on the Impala shell, see Impala shell Documentation.

  • Connect to impalad from the Impala shell
    /usr/lib/impala/impala-shell/impala-shell
    connect <impalad-HOST-NAME>
    
Note

When you connect to impalad from the Impala shell, <impalad-HOST-NAME> is the host name of a data node on which impalad is running. You can find it in Ambari UI > Impala > Quick Links > Impala Server WEB-UI.
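
The interactive connection above can also be made in a single command. The following is a minimal sketch, assuming the same binary path; the helper name `run_query` and the variables `IMPALA_SHELL` and `IMPALAD_HOST` are hypothetical. It passes the host with `-i` and a single statement with `-q`, so impala-shell connects, runs the statement, and exits.

```shell
# Assumed defaults; override them in your environment.
IMPALA_SHELL="${IMPALA_SHELL:-/usr/lib/impala/impala-shell/impala-shell}"
IMPALAD_HOST="${IMPALAD_HOST:-<impalad-HOST-NAME>}"   # placeholder, not a real host

# run_query (hypothetical helper): connect to the impalad, run the
# statement given as $1, print the result, and exit.
run_query() {
  "$IMPALA_SHELL" -i "$IMPALAD_HOST" -q "$1"
}
```

For example, `run_query "SHOW DATABASES;"` lists the databases visible to the impalad you connected to.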

  • Change the Hive property required to create tables
    In Ambari UI > Hive > CONFIGS > ADVANCED > Advanced hive-site, change the value of hive.strict.managed.tables to false, save the change, and restart Hive.

  • Create database and table

    CREATE DATABASE test;
    USE test;
    CREATE TABLE testTable
    (
    ID INT
    );
    
  • Insert data and query it

    INSERT INTO testTable(ID)
    VALUES (1);
    SELECT * FROM testTable;
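
The statements in this section can also be saved to a file and run in one pass with the impala-shell `-f` option. The sketch below is one way to do that; the helper names `write_example_sql` and `run_sql_file` and the variables `IMPALA_SHELL` and `IMPALAD_HOST` are hypothetical. The `IF NOT EXISTS` clauses are an addition to the statements above so that the script can be rerun.

```shell
# Assumed defaults; override them in your environment.
IMPALA_SHELL="${IMPALA_SHELL:-/usr/lib/impala/impala-shell/impala-shell}"
IMPALAD_HOST="${IMPALAD_HOST:-<impalad-HOST-NAME>}"   # placeholder, not a real host

# write_example_sql (hypothetical helper): save the statements from this
# section to the file given as $1.
write_example_sql() {
  cat > "$1" <<'SQL'
CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE IF NOT EXISTS testTable (ID INT);
INSERT INTO testTable (ID) VALUES (1);
SELECT * FROM testTable;
SQL
}

# run_sql_file (hypothetical helper): execute every statement in the file
# given as $1 against the impalad.
run_sql_file() {
  "$IMPALA_SHELL" -i "$IMPALAD_HOST" -f "$1"
}
```

For example, `write_example_sql /tmp/impala_example.sql` followed by `run_sql_file /tmp/impala_example.sql` runs the whole sequence; the file path here is only an illustration.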
    
Note

For more query examples, see pages 35 through 39 of the Apache Impala Guide.

Accessing Impala WEB UI

You can access the Impala WEB UI through [View by application] on the Cloud Hadoop console. For more information, see View by application.

  • The Impala Server WEB UI page shows the overall status of the Impala service. You can also view the query history there.

  • You can use the Impala editor in the Hue WEB UI.


Note

If you cannot see the Impala editor, do the following:
Go to Ambari Web UI > Hue > Configs > Hue Service Module > Hue Impala Module, change the status to ON, and then restart the service.