Available in VPC
Cloud Hadoop is a fully managed cloud analytics service that lets you process big data easily and quickly with open source-based frameworks such as Apache Hadoop, HBase, Spark, Hive, and Presto. Direct server access through a terminal is allowed, and clusters can be managed conveniently through Ambari.
You can easily configure the initial infrastructure with the Cloud Hadoop service on NAVER Cloud Platform. Two master nodes are provided, and worker nodes can be expanded or reduced at any time, securing stability, flexible scalability, and availability for services and tasks. In addition, various frameworks and server types are supported for analyzing large-scale data, and clusters can be managed and monitored through a web UI.
Cloud Hadoop features
User convenience
- Cloud Hadoop automates cluster creation, reducing the burden of infrastructure management.
- Various open source frameworks are installed, configured, and optimized for you, so an analysis-ready system is available at any time.
Cost efficiency
- You only pay for the time a cluster runs, from creation to termination.
- Cloud Hadoop stores large-scale data at low cost by using NAVER Cloud Platform's Object Storage as its data store.
Flexible scalability and stability
- You can easily reduce or increase the number of instances needed for analyzing data at the desired time.
- Two master nodes are provided for higher stability and availability of services and tasks.
Various frameworks supported
- Hadoop: framework that distributes and processes large-scale data sets across clusters of computers using a simple programming model
- HBase: distributed, scalable store for large-scale data
- Spark: unified analytics engine for large-scale data processing
- Hive: data warehouse software for reading, writing, and managing large-scale data sets in distributed storage using SQL
- Presto: distributed SQL query engine for big data
Web UI provided for management and monitoring
- A UI is provided for managing the information and status of the Cloud Hadoop cluster.
- Root access permissions for clusters give you complete control and allow the frameworks' setting values to be checked or edited.
Cloud Hadoop user guide
- Cloud Hadoop overview: features and benefits, guides, related resources, and FAQs
- Cloud Hadoop scenario: complete usage scenarios
- Prerequisites for using Cloud Hadoop: guide to support specifications for using Cloud Hadoop
- Getting started with Cloud Hadoop: how to create Cloud Hadoop from the NAVER Cloud Platform console
- How to use Cloud Hadoop: detailed instructions for using Cloud Hadoop
- Utilizing the Cloud Hadoop ecosystem: how to utilize the applications provided by Cloud Hadoop
- Integrating Cloud Hadoop: how to integrate Cloud Hadoop with an external system
- Cloud Hadoop resource management: view Cloud Hadoop resource information
- Managing Cloud Hadoop permissions: Cloud Hadoop permissions management method and policy guide
- Cloud Hadoop release notes: update history for Cloud Hadoop versions and guides
Cloud Hadoop related resources
NAVER Cloud Platform provides various resources and guides to help customers understand Cloud Hadoop better. If you are a developer or marketer in need of detailed information while you are considering adopting Cloud Hadoop for your company or establishing data related policies, then make good use of the resources below.
- API guide: instructions for developers
- CLI guide: instructions for developers
- Sub Account user guide: Sub Account guide for those who need administrator accounts of various authority levels for managing Cloud Hadoop
- Ncloud use environment guide: guide to VPC environment and supported functions
- Portal and console user guide: basic usage of the portal and console, as well as subscription to and management of Cloud Hadoop
- Pricing plan, characteristics, detailed features
- Latest service news: latest news on Cloud Hadoop
- FAQ: frequently asked questions by Cloud Hadoop users
- Contact us: send direct inquiries in case of any unresolved questions that aren't answered by the user guide
Check the FAQs first.
Q. Which cluster node types are provided by Cloud Hadoop?
A. A Cloud Hadoop cluster is a set of nodes configured for distributed storage and analysis of data. Depending on their purpose, there are three types of nodes in a cluster.
- Edge node: gateway node for external connections
- Master node: admin node for monitoring worker nodes. 2 master nodes are created with high availability support, and the number of master nodes can't be changed
- Worker node: node that receives commands from master nodes and actually performs tasks such as data analysis. You can initially create 2 to 8 nodes. More nodes can be added/deleted dynamically afterward
Q. How is Cloud Hadoop service configured?
A. Cloud Hadoop is a service for building and managing clusters easily and conveniently. You can install open source frameworks for processing large amounts of data, such as Apache Hadoop, HBase, Hive, Spark, and Presto, on the cluster, and build and operate a system for processing large-scale data. For the configuration of the Cloud Hadoop service, see the following architecture diagram.

Q. A "Network error: Connection timed out" error occurs when connecting via SSH in PuTTY.
A. If you have allowed SSH access (port 22) in the ACG but the SSH connection still fails, SSH (port 22) is probably blocked in the Network ACL (NACL). Allow SSH access (port 22) in the NACL as well.
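To confirm where the block is, you can first test whether port 22 is reachable from your client. This is a minimal sketch; the host address below is a placeholder for your cluster's edge node IP.

```shell
# HOST is a placeholder; replace with your cluster's edge node IP.
HOST=192.168.1.10
PORT=22
# -z: scan without sending data, -w 5: 5-second timeout
if nc -z -w 5 "$HOST" "$PORT" 2>/dev/null; then
  echo "port ${PORT} reachable"
else
  echo "port ${PORT} blocked: check both ACG and NACL rules"
fi
```

If the port reports as blocked even with the ACG rule in place, the NACL is the likely culprit.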
Q. What is the bandwidth of the NCP server?
A. The basic bandwidth of the NCP server is around 1 Gbps (1 Gbit/sec).
Q. Reading data on the NCP server generates too much network traffic overall. What should I do?
A.
- You can distribute data and traffic by adding worker nodes.
- You can separate storage resources from computing resources by saving data in Object Storage, then read and write the Object Storage data with Cloud Hadoop's computing resources to reduce network traffic.
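As a sketch of the second approach, data can be copied from HDFS to Object Storage with hadoop distcp over the S3A connector. The bucket name and paths below are placeholders, and fs.s3a.endpoint must point at the Object Storage domain for your region (private domain shown here; use the public domain if the server has a public IP).

```shell
# Placeholders: my-bucket and both paths are examples only.
BUCKET=my-bucket
SRC=hdfs:///user/warehouse/logs
DEST="s3a://${BUCKET}/warehouse/logs"
# Point the S3A connector at NAVER Cloud Object Storage.
hadoop distcp \
  -Dfs.s3a.endpoint=kr.object.private.ncloudstorage.com \
  "$SRC" "$DEST"
```

Once the data is in Object Storage, Hive external tables or Spark jobs can read it directly from the bucket.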
Q. In the Cloud Hadoop Ambari Metrics service, what is the difference between normal operation and Maintenance mode?
A. The Maintenance mode feature provided by the Ambari Web UI can be set per service or per host.
- While Maintenance mode is set, no alert notifications are sent.
- If Maintenance mode is set on a host (server), that host is excluded from batch jobs such as service restarts.
Q. When running show tables in Hue, the list of views does not appear in the Hive interpreter.
A. show tables only lists regular tables. Run show views to check the list of views.
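Outside Hue, the same check can be run from the command line with Beeline. The connection URL below is a placeholder for your HiveServer2 address; 10000 is HiveServer2's default port.

```shell
# HIVE_HOST is a placeholder; replace with your HiveServer2 host.
HIVE_HOST=localhost
beeline -u "jdbc:hive2://${HIVE_HOST}:10000/default" -e "SHOW VIEWS;"
```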
Q. When I run a Hive query with an account other than hive, a Permission denied error occurs.
A. There are two solutions to this problem.
- Add the account to the YARN queue ACL: log in to the Ambari Web UI > select Yarn Queue Manager > select default (YARN queue), then add the account under Users of Administer Queue and Users of Submit Applications.
- Alternatively, use the hive account, which works without adding an account.
Q. When I run hadoop fsck / to check the file system, an error occurs.
A. HDFS fsck must be run as the hdfs account. Log in as sshuser, switch to the hdfs account with sudo su - hdfs, and then run it.
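The steps above can also be collapsed into a single command; sshuser is the default login account on Cloud Hadoop nodes.

```shell
# Run fsck as the hdfs service account in one step
# (equivalent to: sudo su - hdfs, then hdfs fsck /).
sudo su - hdfs -c "hdfs fsck /"
```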
Q. A communication error with S3 occurs when integrating Object Storage (S3) through Hive.
A. Check the Object Storage address for your Cloud Hadoop region. Even if a server is within a public subnet, if the master server has not been assigned a public IP, it can only communicate with the private domain of Object Storage.
The following are the domain addresses of Object Storage.
Server within Public Subnet
- Internet-based communication is available using kr.object.ncloudstorage.com, which is a public domain.
- Private communication is available using kr.object.private.ncloudstorage.com, which is a private domain.
Server within Private Subnet
- Communication is available by default using kr.object.private.ncloudstorage.com, which is a private domain.
- If you use NAT Gateway, you can communicate by using kr.object.ncloudstorage.com, which is a public domain.
Q. I intend to perform data migration with the Object Storage bucket. Can I connect several Hadoop Clusters to a single Object Storage bucket?
A. The Object Storage bucket designated when creating one Cloud Hadoop cluster cannot be selected again when creating another. To migrate, use the following method.
- Create a new bucket on Object Storage and perform data upload.
- When creating a new Cloud Hadoop, select the new bucket with the data uploaded.
Q. I want to delete the currently used Cloud Hadoop cluster but keep using its data. What should I do?
A. You can keep using the data after deleting the Cloud Hadoop cluster through two methods.
- If you used a metastore via the Data Catalog product when creating Cloud Hadoop, you can reuse the meta tables of applications such as Hive/Trino/Impala as they are even after the cluster is deleted.
- Save the data to be analyzed in Object Storage and integrate it as an external table in Hive of Cloud Hadoop; you can then reuse it.
Q. Do I have to select cluster add-ons (HBASE, Impala, Nifi, etc.) when creating the cluster or can I install them later to be able to use them?
A. You do not have to select add-ons when creating a cluster. You can click [Add service] on Ambari Web UI to add and use the services later.
Q. I cannot access Hive View from Apache Ambari.
A. Ambari 2.7.0 and later versions do not support Hive View. If you want to use Hive View, you can access it via Hue.
Q. If I use the Cloud Hadoop cluster 1.9 version, the Presto 0.240 version is included. Can I migrate Presto to the latest version?
A. The Presto (Trino) version upgrade is not supported. For Cloud Hadoop 2.0 or higher, Trino 377, the successor to Presto 0.240, is supported.
For more information on the versions supported by Cloud Hadoop, see Supported applications by cluster version.
Q. After the Ambari Infra Solr service has stopped, the service does not restart.
A. The Infra Solr service may stop due to full GC caused by too much accumulated log data.
- Infra Solr is a service that stores logs, so depending on the amount of log data accumulated over a long period, full GC may prevent the service from starting.
- If the service does not start, take the following actions.
  - Increase the Infra Solr heap size and start the service. You can adjust the heap size in Ambari Web UI > Infra Solr > Configs.
  - Once the service starts normally, delete log data stored before a certain period from the hadoop_logs collection of Infra Solr.

```shell
# Example: delete data stored more than a month ago
curl "http://{INFRA-SOLR-HOSTNAME}:8886/solr/hadoop_logs/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>evtTime:[* TO NOW-1MONTHS]</query></delete>"
```
Q. When running a Hive query, the System times on machines may be out of sync error occurs.
A. You need to synchronize the system time and the hardware time. Perform the following tasks on all Cloud Hadoop servers.
- Check time
  - Check the system time: date
  - Check the hardware time: hwclock
- Synchronize time
  - Apply the hardware time to the system time: hwclock --hctosys
Q. When running ntpstat, "unsynchronised" occurs.
A. Synchronize the Cloud Hadoop server time by referring to Checking time synchronization settings.
Q. Can I set the query log retention period for Trino?
A. Trino is open source software and does not provide a log retention setting. Instead, you can manage the query history with the following Trino properties.
- query.max-history: sets the maximum number of queries that can be kept
- query.min-expire-age: sets the minimum age before history expires

Trino's query history is kept in memory, so performance may be affected if you set the query.max-history value too high.
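For example, these two properties go in the coordinator's config.properties file. The values below are illustrative placeholders, not recommendations; tune them to your workload and memory budget.

```properties
# Illustrative values only.
query.max-history=100
query.min-expire-age=30m
```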
Q. Can I save Trino query history to a file?
A. Trino is open source software and does not support saving the query history to a file. Instead, you can use the Trino API (http://<TRINO_FQDN>:8285/ui/api/query) to obtain the query history from memory in JSON format and utilize it.
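A minimal sketch of pulling the history with curl and saving it to a file; the FQDN below is a placeholder for your Trino coordinator's host name.

```shell
# TRINO_FQDN is a placeholder; replace with your coordinator's FQDN.
TRINO_FQDN=trino.example.com
URL="http://${TRINO_FQDN}:8285/ui/api/query"
# Save the in-memory query history as JSON before it expires.
curl -s "$URL" -o query_history.json
```

Since the history lives in memory, fetch it periodically if you need a durable record.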
Q. I would like to add a new account to the Hive service.
A. Apache Hive uses local OS accounts, so you can create a new account on the cluster. Perform the following tasks.
- Create a new local account on all Cloud Hadoop servers. It is recommended to use the same uid value on all servers.

```shell
useradd -u {uid} {new_user} -g hadoop
```

- Create a directory in HDFS for the new account.

```shell
hdfs dfs -mkdir /user/{new_user}
hdfs dfs -chown {new_user}:hadoop /user/{new_user}
```
Q. Where can I check the Impala port?
A. You can check the Impala port from Ambari Web UI > Impala > Configs > Advanced impala-port > Hive Server2 port. The Impala port is set as 21050 by default.
Q. How can I obtain the SSL certificate of the edge node?
A. You can copy the edge node's certificate to the server you are currently working on by using the scp command.
Alternatively, you can download the certificate from a web browser: access the application Web UI and click Certificate Viewer > Details > Export.
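As a sketch of the scp approach, with a placeholder host address and certificate path (adjust both to your environment):

```shell
# Placeholders: the edge node IP and certificate path are examples only.
EDGE_HOST=192.168.1.10
CERT_PATH=/etc/ssl/certs/edge.pem
# Copy the certificate from the edge node to the current server.
scp "sshuser@${EDGE_HOST}:${CERT_PATH}" ./edge.pem
```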
Q. I cannot access the Application Web UI.
A. Check if the access source and allowed port are set correctly in the ACG rule of the cluster. If you are using SSL VPN, be careful not to use 0.0.0.0/0 value as the destination address in Routing Table. For more information, see Preliminary tasks for accessing Web UI.