Cloud Hadoop overview

    The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

    Available in VPC

    Cloud Hadoop is a fully managed cloud analytics service that lets users freely run open source-based frameworks such as Apache Hadoop, HBase, Spark, Hive, and Presto to process big data easily and quickly. Direct server access through a terminal is allowed, and users can manage clusters themselves through the convenient cluster management features provided by Ambari.
    You can easily set up the initial infrastructure with the Cloud Hadoop service of NAVER Cloud Platform. Two master nodes are provided, and nodes can be added or removed at any time, securing stability, flexible scalability, and availability for services and tasks. In addition, various frameworks and server types are supported for analyzing large-scale data, and clusters can be managed and monitored through a web UI.

    Cloud Hadoop features

    • User convenience

      • Cloud Hadoop automates cluster creation, reducing the burden of infrastructure management.
      • Installation, configuration, and optimization of various open source frameworks are handled for you, so a system ready for analysis is always available.
    • Cost efficiency

      • It's an efficient service: you pay only for the period from cluster creation to termination.
      • Cloud Hadoop stores large-scale data at low cost by using NAVER Cloud Platform's Object Storage as its data store.
    • Flexible scalability and stability

      • You can easily add or remove the instances needed for data analysis at any time.
      • Two master nodes are provided for higher stability and availability of services and tasks.
    • Various frameworks supported

      • Hadoop: a framework that distributes and processes large-scale data sets across clusters of computers using a simple programming model
      • HBase: a distributed, scalable store for large-scale data
      • Spark: a unified analytics engine for processing large-scale data
      • Hive: data warehouse software for reading, writing, and managing large-scale data sets in distributed storage using SQL
      • Presto: a distributed SQL query engine for big data
    • Web UI provided for management and monitoring

      • A UI is provided for managing the information and status of Cloud Hadoop clusters.
      • Root access to the clusters gives you complete control over them and lets you check or edit framework settings.

    Cloud Hadoop user guides

    NAVER Cloud Platform provides various resources and guides to help customers better understand Cloud Hadoop. If you are a developer or marketer who needs detailed information while considering adopting Cloud Hadoop for your company or establishing data-related policies, make good use of the resources below.

    Check the FAQs first.

    Q. Which cluster node types are provided by Cloud Hadoop?
    A. A Cloud Hadoop cluster is a set of nodes configured for distributed storage and analysis of data. Depending on their purpose, there are three types of nodes in a cluster.

    • Edge node: gateway node for external connections
    • Master node: administration node that monitors the worker nodes. Two master nodes are created for high availability, and this number can't be changed
    • Worker node: node that receives commands from the master nodes and actually performs tasks such as data analysis. You can create 2 to 8 worker nodes initially; more nodes can be added or removed dynamically afterward
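The sizing rules above (a fixed pair of master nodes, 2 to 8 initial worker nodes) can be expressed as a small check. This is an illustrative sketch, not an NCP API; the function name and structure are assumptions.

```python
# Illustrative check of the cluster sizing rules described above.
# The node counts (2 fixed masters, 2-8 initial workers) come from this FAQ;
# the helper itself is hypothetical, not part of any NCP SDK.

MASTER_NODES = 2          # fixed for high availability; cannot be changed
MIN_INITIAL_WORKERS = 2
MAX_INITIAL_WORKERS = 8   # more workers can be added dynamically after creation

def validate_initial_worker_count(worker_nodes: int) -> bool:
    """Return True if the requested initial worker count is allowed."""
    return MIN_INITIAL_WORKERS <= worker_nodes <= MAX_INITIAL_WORKERS

print(validate_initial_worker_count(4))   # True
print(validate_initial_worker_count(10))  # False: extra workers are added later, not at creation
```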

    Q. How is Cloud Hadoop service configured?
    A. Cloud Hadoop is a service for easily and conveniently building and managing clusters. You can combine components such as Hadoop, HBase, Spark, and Presto to build and operate a system for processing large-scale data, installing open source frameworks that handle large amounts of data, such as Apache Hadoop, HBase, Hive, and Spark, on the cluster. For the configuration of the Cloud Hadoop service, see the following configuration diagram (architecture).

    [Configuration diagram: chadoop-1_01]

    Q. A "network error: connection timed out" occurs when connecting over SSH with PuTTY.
    A. If you have allowed SSH access (port 22) in the ACG but the SSH connection still fails, SSH access (port 22) may be blocked in the Network ACL (NACL). Allow SSH access (port 22) in the NACL as well.
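To see whether port 22 is open end to end before digging into ACG and NACL rules, a generic TCP reachability check can help. This is a plain Python sketch; the host address is a placeholder you must replace. A timeout usually indicates a silent drop (an ACG/NACL deny), while "connection refused" means the packet arrived but nothing is listening.

```python
import socket

def port_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; False means a timeout or refusal."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with your cluster's edge node address (placeholder):
# print(port_reachable("cluster-edge-node.example.com"))
```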

    Q. What is the bandwidth of an NCP server?
    A. The basic bandwidth of an NCP server is approximately 1 Gbps (1 Gbit/s).

    Q. While reading data on an NCP server, overall traffic is very high. What should I do when network traffic usage is too high?
    A.

    • You can distribute data and traffic by adding more worker nodes.
    • You can separate storage from compute by saving data in Object Storage, then reading and writing that data with Cloud Hadoop's compute resources, thereby reducing network traffic usage.

    Q. In the Cloud Hadoop Ambari Metrics service, how does Maintenance mode differ from normal operation?
    A. The Maintenance mode feature provided by the Ambari Web UI can be set per service or per host.

    • While Maintenance mode is set, no alerts are sent.
    • If Maintenance mode is set per host (server), that host is excluded from batch jobs such as service restarts.

    Q. When running show tables in Hue, the view list does not appear in the Hive interpreter.
    A. show tables lists only regular tables. Run show views to check the list of views.

    Q. When I run a Hive query with an account other than hive, a Permission denied error occurs.
    A. There are two solutions to this problem.

    • Add the account to the YARN queue ACL: log in to the Ambari Web UI > select YARN Queue Manager > select the default queue, then add the account to the users of Administer Queue and the users of Submit Applications.
    • If you use the hive account, no separate account setup is needed.
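For reference, the Ambari steps above map to the Capacity Scheduler's queue ACL properties. A sketch of the resulting capacity-scheduler.xml entries for the default queue follows; the account name myuser is a placeholder.

```xml
<!-- capacity-scheduler.xml: allow "myuser" (placeholder) on the default queue -->
<property>
  <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
  <value>myuser,hive</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
  <value>myuser,hive</value>
</property>
```

Editing these values through YARN Queue Manager, as described above, is the supported route; the XML is shown only to clarify what the UI changes.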

    Q. When I run hadoop fsck / to check the file system, an error occurs.
    A. fsck on HDFS must be run with the hdfs account. Log in as sshuser, switch to the hdfs account with sudo su - hdfs, and then run the command.

    Q. A communication error with S3 occurs while integrating Object Storage (S3) through Hive.
    A. Check the Object Storage address for your Cloud Hadoop region. Even if the server is in a public subnet, if the master server has not been assigned a public IP, it can communicate only with the private domain of Object Storage.

    Note

    The following are the domain addresses of Object Storage:
    Server in a public subnet

    • Internet-based communication is available using kr.object.ncloudstorage.com, which is a public domain.
    • Private communication is available using kr.object.private.ncloudstorage.com, which is a private domain.

    Server in a private subnet

    • Communication is available by default using kr.object.private.ncloudstorage.com, which is a private domain.
    • If you use NAT Gateway, you can communicate by using kr.object.ncloudstorage.com, which is a public domain.
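The endpoint rules above can be summarized in a small helper. The domain names and conditions are as stated in this note; the function itself is illustrative.

```python
# Encodes the Object Storage (KR region) endpoint rules described above.
# The domains come from this guide; the helper function is a hypothetical sketch.

PUBLIC_DOMAIN = "kr.object.ncloudstorage.com"
PRIVATE_DOMAIN = "kr.object.private.ncloudstorage.com"

def object_storage_endpoint(public_subnet: bool,
                            has_public_ip: bool = False,
                            has_nat_gateway: bool = False) -> str:
    """Pick the Object Storage domain a server can actually reach."""
    if public_subnet:
        # Internet-based access needs a public IP; otherwise only the
        # private domain is reachable (see the Hive/S3 question above).
        return PUBLIC_DOMAIN if has_public_ip else PRIVATE_DOMAIN
    # Private subnet: private domain by default; public domain only via NAT Gateway.
    return PUBLIC_DOMAIN if has_nat_gateway else PRIVATE_DOMAIN

print(object_storage_endpoint(public_subnet=False))  # kr.object.private.ncloudstorage.com
```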

    Q. I intend to perform data migration with an Object Storage bucket. Can I connect several Hadoop clusters to a single Object Storage bucket?
    A. No. A bucket designated when creating one Cloud Hadoop cluster cannot be selected when creating another. To migrate, use the following method:

    1. Create a new bucket on Object Storage and upload the data to it.
    2. When creating the new Cloud Hadoop cluster, select the new bucket containing the uploaded data.

    Q. I want to delete the Cloud Hadoop cluster currently in use but keep using its data as is. What should I do?
    A. There are two ways to keep using the data even after deleting the Cloud Hadoop cluster.

    Q. Do I have to select cluster add-ons (HBase, Impala, NiFi, etc.) when creating the cluster, or can I install them later?
    A. You do not have to select add-ons when creating a cluster. You can click [Add service] in the Ambari Web UI to add and use the services later.

    Q. I cannot access Hive View from Apache Ambari.
    A. Ambari 2.7.0 and later versions do not support Hive View. If you want to use Hive View, you can access it via Hue.

    Q. Cloud Hadoop cluster version 1.9 has Presto 0.240 built in. Can I migrate Presto to the latest version?
    A. Version upgrades for Presto (Trino) are not supported. Note that Cloud Hadoop cluster version 2.0 or higher provides Trino 377, a later version than Presto 0.240.
    For more information about the versions supported by Cloud Hadoop, see Supported applications by cluster version.

