Using Impala
    Available in VPC

    Impala is a massively parallel processing (MPP) engine that analyzes stored data in real time using interactive SQL. It can read data from various storage sources such as HBase, Kudu, and Hadoop HDFS, and can be used together with Hive and Hue. It uses your existing Hive metastore, so a dedicated metastore is not required.
    Impala was introduced to address Hive's limitations in real-time query performance and multi-user support. While Hive and Pig process queries through the MapReduce framework, Impala responds faster because it uses its own distributed query engine.

    Note

    Impala supports single-statement transactions, but not multi-statement transactions.

    Impala components

    The Impala service consists of three major components: the Impala Daemon, the StateStore, and the Catalog Service.

    • Impala Daemon (Impalad)
      Installed on the data nodes in the Hadoop cluster, it handles the planning, scheduling, and execution of requested queries. Each Impalad consists of a Query Planner, a Query Coordinator, and a Query Exec Engine.

      • Query Planner: establishes an execution plan for queries
      • Query Coordinator: manages the job list and scheduling, and dispatches execution requests to the executors
      • Query Exec Engine: optimizes and executes queries and returns results to the Coordinator
    • Impala Statestore (StateStored)
      It manages the state of the Impalad on each data node in the cluster, and broadcasts the metadata sync operations requested by Catalogd to the Impalads.

    • Impala Catalog (Catalogd)
      When metadata changes, it requests StateStored to broadcast the changes so that they are reflected in each Impalad's metadata. Metadata changes made directly through Impalad are synchronized automatically, while changes made directly in Hive or HDFS must be synchronized using the REFRESH statement, as in the example below.
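
    For example, after new data files are added in HDFS or a table definition is changed in Hive, you can synchronize the metadata from the Impala shell as in the following minimal sketch (testTable is a hypothetical table name):

      -- Reload the file and block metadata for one table after new data files were added in HDFS
      REFRESH testTable;

      -- Discard and reload the full metadata for a table after its definition was changed in Hive
      INVALIDATE METADATA testTable;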

    Query execution process

    The query execution process is as follows:

    1. You run a query using the Impala shell, ODBC, etc. against a specific Impalad within the cluster.
    2. The Impalad retrieves the table schema from the Hive metastore, checks the validity of the query statement, and collects the data blocks and location information required for query execution from the HDFS name node.
    3. Based on the most recently updated Impala metadata, the information necessary for query execution is propagated to all Impalads in the cluster.
    4. Each Impalad that receives the query and metadata reads the data blocks it is responsible for from the local directory and processes the query.
    5. When the tasks are completed on all Impalads, the Impalad that received the query from the user collects the results and delivers them to the user.

    chadoop-30-001.jpg
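
    To see how a statement will be distributed across the Impalads before it runs, you can prefix it with EXPLAIN in the Impala shell. The following is a minimal sketch; testTable is a hypothetical table name:

      -- Print the distributed execution plan without running the query
      EXPLAIN SELECT COUNT(*) FROM testTable;

      -- Run the query, then print per-node execution details (SUMMARY is an impala-shell command)
      SELECT COUNT(*) FROM testTable;
      SUMMARY;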

    Using Impala

    This section describes how to use Impala.

    Create cluster

    From the NAVER Cloud Platform console, create the Cloud Hadoop cluster. For more information on creating clusters, see Create cluster.

    Note

    Starting with Cloud Hadoop 1.9, you can use clusters with Impala v4.1.0 installed.
    chadoop-30-002_ko.jpg

    Check Impala service in Ambari UI

    In the Ambari UI, you can view Impala services as follows. You can start and stop each component of the service from this page.

    chadoop-30-003.jpg

    • Summary: check the host where the components are installed
    • Configs: change configurations of Impala service
    • Quick Links: Impala Statestore WEB-UI, Impala Catalog WEB-UI, Impala Server WEB-UI
      • These links require tunneling to access directly; instead, access them through the web UI link provided by the console. For more information, see Access Impala WEB UI.
    Caution

    If you install the Atlas service on a cluster where Impala is installed, the Atlas service may not run properly due to a port conflict.
    Change the value of Ambari UI > Atlas > CONFIGS > ADVANCED > Advanced application-properties > atlas.server.http.port so that it does not overlap with the port number (21000) used by Impala, and then restart the service.

    Impala shell

    The Impala shell provides an interactive shell for running queries. For detailed instructions on using the Impala shell, see Impala shell Documentation.

    • Connecting to impalad from the Impala shell
      /usr/lib/impala/impala-shell/impala-shell
      connect <impalad-HOST-NAME>;
      
    Note

    When connecting to impalad from the Impala shell, <impalad-HOST-NAME> is the host name of the data node running impalad. You can check the host name in Ambari UI > Impala > Quick Links > Impala Server WEB-UI.
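
    You can also run a single query without entering the interactive shell by passing the impalad host and the query on the command line. The following is a minimal sketch; the host name is a placeholder:

      # Connect to a specific impalad and run one query non-interactively
      /usr/lib/impala/impala-shell/impala-shell -i <impalad-HOST-NAME> -q "SHOW DATABASES;"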

    • To create a table, change the value of the Hive property at Ambari UI > Hive > CONFIGS > ADVANCED > Advanced hive-site > hive.strict.managed.tables to false, save it, and restart Hive.

    • Create database and table

      CREATE DATABASE test;
      USE test;
      CREATE TABLE testTable (
          ID INT
      );
      
    • Search after saving data

      INSERT INTO testTable(ID)
      VALUES (1);
      SELECT * FROM testTable;
      
    Note

    For more query examples, see p.35-p.39 of Apache Impala Guide.
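
    As additional hedged examples that continue with the hypothetical test database and testTable created above, you can inspect the table and copy its data into a Parquet-backed table using CREATE TABLE AS SELECT:

      -- List the tables in the current database and show the schema of testTable
      SHOW TABLES;
      DESCRIBE testTable;

      -- Create a Parquet-backed copy of testTable and load it in one statement
      CREATE TABLE testTable_parquet STORED AS PARQUET AS
      SELECT * FROM testTable;

      -- Verify the copy
      SELECT COUNT(*) FROM testTable_parquet;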

    Access Impala WEB UI

    You can access the Impala WEB UI through [View by application] on the Cloud Hadoop console. For more information, see View by application.
    chadoop-30-004_ko.jpg

    • You can see the overall status of Impala services on the WEB UI page for connecting to Impala Server. You can also view query history.
      chadoop-30-005.jpg

    • You can also use the Impala editor in the Hue WEB UI.
      chadoop-impala-hue_ko.jpg

