Available in VPC
Cloud Hadoop is a fully managed cloud analytics service that lets you process big data easily and quickly with open source-based frameworks such as Apache Hadoop, HBase, Spark, Hive, and Presto. Direct server access through a terminal is allowed, and clusters can be managed conveniently through Ambari.
You can easily configure the initial infrastructure with the Cloud Hadoop service on NAVER Cloud Platform. Two master nodes are provided, and worker nodes can be expanded or reduced at any time, securing stability, flexible scalability, and availability for services and tasks. In addition, various frameworks and server types are supported for analyzing large-scale data, and clusters can be managed and monitored through a web UI.
Cloud Hadoop features
User convenience
- Cloud Hadoop automates cluster creation, reducing the burden of infrastructure management.
- Various open source frameworks are installed, configured, and optimized for you, so an analysis-ready system is available at any time.
Cost efficiency
- You only pay for the time a cluster runs, from creation to termination.
- Cloud Hadoop stores large-scale data at low cost by using NAVER Cloud Platform's Object Storage as its data store.
Flexible scalability and stability
- You can easily reduce or increase the number of instances needed for analyzing data at the desired time.
- Two master nodes are provided for higher stability and availability of services and tasks.
Various frameworks supported
- Hadoop: framework that distributes and processes large-scale data sets across clusters of computers using a simple programming model
- HBase: distributed, scalable store for large-scale data
- Spark: unified analytics engine for large-scale data processing
- Hive: data warehouse software for reading, writing, and managing large-scale data sets in distributed storage using SQL
- Presto: distributed SQL query engine for big data
Web UI provided for management and monitoring
- A UI is provided for managing the information and status of the Cloud Hadoop cluster.
- Root access permissions for clusters give you complete control and allow the frameworks' setting values to be checked or edited.
Cloud Hadoop user guide
- Cloud Hadoop overview: features and benefits, guides, related resources, and FAQs
- Cloud Hadoop scenario: complete usage scenarios
- Prerequisites for using Cloud Hadoop: guide to support specifications for using Cloud Hadoop
- Getting started with Cloud Hadoop: how to create Cloud Hadoop from the NAVER Cloud Platform console
- How to use Cloud Hadoop: detailed instructions for using Cloud Hadoop
- Utilizing the Cloud Hadoop ecosystem: how to utilize the applications provided by Cloud Hadoop
- Integrating Cloud Hadoop: how to integrate Cloud Hadoop with an external system
- Cloud Hadoop resource management: view Cloud Hadoop resource information
- Managing Cloud Hadoop permissions: Cloud Hadoop permissions management method and policy guide
- Cloud Hadoop release notes: update history for Cloud Hadoop versions and guides
Cloud Hadoop related resources
NAVER Cloud Platform provides various resources and guides to help customers understand Cloud Hadoop better. If you are a developer or marketer in need of detailed information while you are considering adopting Cloud Hadoop for your company or establishing data related policies, then make good use of the resources below.
- API guide: instructions for developers
- CLI guide: instructions for developers
- Sub Account user guide: Sub Account guide for those who need administrator accounts of various authority levels for managing Cloud Hadoop
- Ncloud use environment guide: guide to VPC environment and supported functions
- Portal and console user guide: basic usage of the portal and console, as well as subscription to and management of Cloud Hadoop
- Pricing plan, characteristics, detailed features
- Latest service news: latest news on Cloud Hadoop
- FAQ: frequently asked questions by Cloud Hadoop users
- Contact us: send direct inquiries in case of any unresolved questions that aren't answered by the user guide
Check the FAQs first.
Q. Which cluster node types are provided by Cloud Hadoop?
A. A Cloud Hadoop cluster is a set of nodes configured for distributed storage and analysis of data. Depending on their purpose, there are three types of nodes in a cluster.
- Edge node: gateway node for external connections
- Master node: admin node for monitoring worker nodes. 2 master nodes are created with high availability support, and the number of master nodes can't be changed
- Worker node: node that receives commands from master nodes and actually performs tasks such as data analysis. You can initially create 2 to 8 nodes. More nodes can be added/deleted dynamically afterward
Q. How is Cloud Hadoop service configured?
A. Cloud Hadoop is a service for building and managing clusters easily and conveniently. You can install open source frameworks for processing large amounts of data, such as Apache Hadoop, HBase, Hive, Spark, and Presto, on the cluster, and build and operate a system for processing large-scale data. For the configuration of the Cloud Hadoop service, see the following architecture diagram.

Q. A "Network error: Connection timed out" error occurs when connecting via SSH in PuTTY.
A. If you have allowed SSH access (port 22) in the ACG but the SSH connection still fails, SSH (port 22) is probably blocked in the Network ACL (NACL). Allow SSH access (port 22) in the NACL as well.
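To confirm where the block is, you can first test whether port 22 is reachable from your client. This is a minimal sketch; the host address below is a placeholder for your cluster's edge node IP.

```shell
# HOST is a placeholder; replace with your cluster's edge node IP.
HOST=192.168.1.10
PORT=22
# -z: scan without sending data, -w 5: 5-second timeout
if nc -z -w 5 "$HOST" "$PORT" 2>/dev/null; then
  echo "port ${PORT} reachable"
else
  echo "port ${PORT} blocked: check both ACG and NACL rules"
fi
```

If the port reports as blocked even with the ACG rule in place, the NACL is the likely culprit.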
Q. What is the bandwidth of the NCP server?
A. The basic bandwidth of the NCP server is around 1 Gbps (1 Gbit/sec).
Q. Reading data on the NCP server generates too much network traffic overall. What should I do?
A.
- You can distribute data and traffic by adding worker nodes.
- You can separate storage resources from computing resources by saving data in Object Storage, then read and write the Object Storage data with Cloud Hadoop's computing resources to reduce network traffic.
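As a sketch of the second approach, data can be copied from HDFS to Object Storage with hadoop distcp over the S3A connector. The bucket name and paths below are placeholders, and fs.s3a.endpoint must point at the Object Storage domain for your region (private domain shown here; use the public domain if the server has a public IP).

```shell
# Placeholders: my-bucket and both paths are examples only.
BUCKET=my-bucket
SRC=hdfs:///user/warehouse/logs
DEST="s3a://${BUCKET}/warehouse/logs"
# Point the S3A connector at NAVER Cloud Object Storage.
hadoop distcp \
  -Dfs.s3a.endpoint=kr.object.private.ncloudstorage.com \
  "$SRC" "$DEST"
```

Once the data is in Object Storage, Hive external tables or Spark jobs can read it directly from the bucket.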
Q. In the Cloud Hadoop Ambari Metrics service, what is the difference between normal operation and Maintenance mode?
A. The Maintenance mode feature provided by the Ambari Web UI can be set per service or per host.
- While Maintenance mode is set, no alert notifications are sent.
- If Maintenance mode is set on a host (server), that host is excluded from batch jobs such as service restarts.
Q. When running show tables in Hue, the list of views does not appear in the Hive interpreter.
A. show tables only lists regular tables. Run show views to check the list of views.
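Outside Hue, the same check can be run from the command line with Beeline. The connection URL below is a placeholder for your HiveServer2 address; 10000 is HiveServer2's default port.

```shell
# HIVE_HOST is a placeholder; replace with your HiveServer2 host.
HIVE_HOST=localhost
beeline -u "jdbc:hive2://${HIVE_HOST}:10000/default" -e "SHOW VIEWS;"
```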
Q. When I run a Hive query with an account other than hive, a Permission denied error occurs.
A. There are two solutions to this problem.
- Add the account to the YARN queue ACL: log in to the Ambari Web UI > select Yarn Queue Manager > select default (YARN queue), then add the account under Users of Administer Queue and Users of Submit Applications.
- Alternatively, use the hive account, which works without adding an account.
Q. When I run hadoop fsck / to check the file system, an error occurs.
A. HDFS fsck must be run as the hdfs account. Log in as sshuser, switch to the hdfs account with sudo su - hdfs, and then run it.
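The steps above can also be collapsed into a single command; sshuser is the default login account on Cloud Hadoop nodes.

```shell
# Run fsck as the hdfs service account in one step
# (equivalent to: sudo su - hdfs, then hdfs fsck /).
sudo su - hdfs -c "hdfs fsck /"
```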
Q. A communication error with S3 occurs when integrating Object Storage (S3) through Hive.
A. Check the Object Storage address for your Cloud Hadoop region. Even if a server is within a public subnet, if the master server has not been assigned a public IP, it can only communicate with the private domain of Object Storage.
The following are the domain addresses of Object Storage.
Server within Public Subnet
- Internet-based communication is available using kr.object.ncloudstorage.com, which is a public domain.
- Private communication is available using kr.object.private.ncloudstorage.com, which is a private domain.
Server within Private Subnet
- Communication is available by default using kr.object.private.ncloudstorage.com, which is a private domain.
- If you use NAT Gateway, you can communicate by using kr.object.ncloudstorage.com, which is a public domain.
Q. I intend to perform data migration with the Object Storage bucket. Can I connect several Hadoop Clusters to a single Object Storage bucket?
A. The Object Storage bucket designated when creating one Cloud Hadoop cluster cannot be selected again when creating another. To migrate, use the following method.
- Create a new bucket on Object Storage and perform data upload.
- When creating a new Cloud Hadoop, select the new bucket with the data uploaded.
Q. I want to delete the currently used Cloud Hadoop cluster but keep using its data. What should I do?
A. You can keep using the data after deleting the Cloud Hadoop cluster through two methods.
- If you used a metastore via the Data Catalog product when creating Cloud Hadoop, you can reuse the meta tables of applications such as Hive/Trino/Impala as they are even after the cluster is deleted.
- Save the data to be analyzed in Object Storage and integrate it as an external table in Hive of Cloud Hadoop; you can then reuse it.
Q. Do I have to select cluster add-ons (HBASE, Impala, Nifi, etc.) when creating the cluster or can I install them later to be able to use them?
A. You do not have to select add-ons when creating a cluster. You can click [Add service] on Ambari Web UI to add and use the services later.
Q. I cannot access Hive View from Apache Ambari.
A. Ambari 2.7.0 and later versions do not support Hive View. If you want to use Hive View, you can access it via Hue.
Q. If I use the Cloud Hadoop cluster 1.9 version, the Presto 0.240 version is included. Can I migrate Presto to the latest version?
A. The Presto (Trino) version upgrade is not supported. For Cloud Hadoop 2.0 or higher, Trino 377, the successor to Presto 0.240, is supported.
For more information on the versions supported by Cloud Hadoop, see Supported applications by cluster version.
Q. After the Ambari Infra Solr service has stopped, the service does not restart.
A. The Infra Solr service may stop due to full GC caused by too much accumulated log data.
- Infra Solr is a service that stores logs, so depending on the amount of log data accumulated over a long period, full GC may prevent the service from starting.
- If the service does not start, take the following actions.
  - Increase the Infra Solr heap size and start the service. You can adjust the heap size in Ambari Web UI > Infra Solr > Configs.
  - Once the service starts normally, delete log data stored before a certain period from the hadoop_logs collection of Infra Solr.

```shell
# Example: delete data stored more than a month ago
curl "http://{INFRA-SOLR-HOSTNAME}:8886/solr/hadoop_logs/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>evtTime:[* TO NOW-1MONTHS]</query></delete>"
```
Q. When running a Hive query, the System times on machines may be out of sync error occurs.
A. You need to synchronize the system time and the hardware time. Perform the following tasks on all Cloud Hadoop servers.
- Check time
  - Check the system time: date
  - Check the hardware time: hwclock
- Synchronize time
  - Apply the hardware time to the system time: hwclock --hctosys
Q. When running ntpstat, "unsynchronised" occurs.
A. Synchronize the Cloud Hadoop server time by referring to Checking time synchronization settings.
Q. Can I set the query log retention period for Trino?
A. Trino is open source software and does not provide a log retention setting. Instead, you can manage the query history with the following Trino properties.
- query.max-history: sets the maximum number of queries that can be kept
- query.min-expire-age: sets the minimum age before history expires

Trino's query history is kept in memory, so performance may be affected if you set the query.max-history value too high.
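For example, these two properties go in the coordinator's config.properties file. The values below are illustrative placeholders, not recommendations; tune them to your workload and memory budget.

```properties
# Illustrative values only.
query.max-history=100
query.min-expire-age=30m
```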
Q. Can I save Trino query history to a file?
A. Trino is open source software and does not support saving the query history to a file. Instead, you can use the Trino API (http://<TRINO_FQDN>:8285/ui/api/query) to obtain the query history from memory in JSON format and utilize it.
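A minimal sketch of pulling the history with curl and saving it to a file; the FQDN below is a placeholder for your Trino coordinator's host name.

```shell
# TRINO_FQDN is a placeholder; replace with your coordinator's FQDN.
TRINO_FQDN=trino.example.com
URL="http://${TRINO_FQDN}:8285/ui/api/query"
# Save the in-memory query history as JSON before it expires.
curl -s "$URL" -o query_history.json
```

Since the history lives in memory, fetch it periodically if you need a durable record.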
Q. I would like to add a new account to the Hive service.
A. Apache Hive uses local OS accounts, so you can create a new account on the cluster. Perform the following tasks.
- Create a new local account on all Cloud Hadoop servers. It is recommended to use the same uid value on all servers.

```shell
useradd -u {uid} {new_user} -g hadoop
```

- Create a directory in HDFS for the new account.

```shell
hdfs dfs -mkdir /user/{new_user}
hdfs dfs -chown {new_user}:hadoop /user/{new_user}
```
Q. Where can I check the Impala port?
A. You can check the Impala port from Ambari Web UI > Impala > Configs > Advanced impala-port > Hive Server2 port. The Impala port is set as 21050 by default.
Q. How can I obtain the SSL certificate of the edge node?
A. You can copy the edge node's certificate to the server you are currently working on by using the scp command.
Alternatively, you can download the certificate from a web browser: access the application Web UI and click Certificate Viewer > Details > Export.
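As a sketch of the scp approach, with a placeholder host address and certificate path (adjust both to your environment):

```shell
# Placeholders: the edge node IP and certificate path are examples only.
EDGE_HOST=192.168.1.10
CERT_PATH=/etc/ssl/certs/edge.pem
# Copy the certificate from the edge node to the current server.
scp "sshuser@${EDGE_HOST}:${CERT_PATH}" ./edge.pem
```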
Q. I cannot access the Application Web UI.
A. Check if the access source and allowed port are set correctly in the ACG rule of the cluster. If you are using SSL VPN, be careful not to use 0.0.0.0/0 value as the destination address in Routing Table. For more information, see Preliminary tasks for accessing Web UI.