Data Forest overview
The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.
Available in VPC
Data Forest is a large-scale, multi-tenant big data processing cluster based on Apache Hadoop. Data Forest supports various big data frameworks to simplify data storage, data processing, deep learning analysis, and serving. Because security technology is applied and large-scale data is kept in distributed storage, you can use it safely.
Various features provided by Data Forest
Integrated analysis environment
Data Forest is an integrated analysis platform based on Apache Hadoop that enables data collection and processing, deep learning analysis, and serving. It runs services as YARN applications and provides an environment where users can build a big data ecosystem by combining applications. It can also run deep learning training with frameworks such as TensorFlow and PyTorch by dynamically assigning GPU resources to each user.
Easy and quick configuration of analysis environments
You can launch apps easily and quickly in a container-based serverless environment, and create the app-based Hadoop ecosystem you need to configure the analysis environment. Data Forest provides an integrated multi-tenant platform designed to handle large volumes of data and large numbers of users. Depending on the analysis purpose, you can run both batch and long-lived analyses in a multi-tenant environment.
Flexible scalability
Even after creating an app, you can scale its containers up or down as needed to respond flexibly to traffic. Because it is container-based, it can scale dynamically online and change quickly when needed.
Enhanced security
Data Forest is a secure Hadoop cluster that supports Kerberos/LDAP authentication. It provides a strong security environment by using secret key encryption, so that credentials are not sent over the network. In addition, application permissions are managed through Apache Ranger.
Guaranteed high-level network and disk performance
Data Forest provides Hadoop Distributed File System (HDFS) storage backed by the local disks of the physical servers that host app-based compute nodes, guaranteeing high network and disk performance.
Various components
Data Forest consists of components for storing, analyzing, and visualizing data, and users can create the components suited to each purpose. HDFS, HBase, Kafka, and OpenTSDB are provided for data storage; Spark, Hive, Hive LLAP, Elasticsearch, Grafana, Hue, Trino, and Phoenix for data analysis and processing; and Kibana and Zeppelin for data visualization.
Improved component accessibility and a web-based development environment
To provide seamless access to components within a VPC environment, Data Forest offers a proxy function and a web-based development environment, Jupyter Notebook. Users can access the JupyterLab web page of a created notebook node to run the queries and code needed for big data analysis and machine learning training. Integration with Object Storage also allows flexible reuse of data.
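For example, a typical notebook session can cover the whole flow from raw data to aggregated results. The following is a minimal sketch, assuming a notebook node with PySpark available and data already stored in HDFS; the paths and the `event_date` column are hypothetical placeholders, not values provided by Data Forest.

```python
# Minimal sketch of a notebook workflow on a Data Forest notebook node.
# Assumes PySpark is available in the JupyterLab environment; the HDFS
# paths and the "event_date" column are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataforest-notebook-sketch")
    .getOrCreate()
)

# Read raw CSV data from HDFS.
events = spark.read.option("header", "true").csv("hdfs:///user/example/raw/events.csv")

# A simple aggregation as a stand-in for real analysis.
daily_counts = events.groupBy("event_date").count()

# Write the result back to HDFS; with an S3-compatible connector
# configured, the same call could target Object Storage instead.
daily_counts.write.mode("overwrite").parquet("hdfs:///user/example/output/daily_counts")

spark.stop()
```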
Data Forest user guides
Data Forest is provided in the Korea Region. See the following table of contents and its details to use Data Forest efficiently.
- Data Forest overview: introduction to Data Forest advantages, related resources, and FAQs
- Data Forest usage scenarios: guide to all usage scenarios for Data Forest
- Prerequisites for using Data Forest: guide to supported specifications for using Data Forest
- Getting started with Data Forest: how to configure the client environment to access Data Forest and the Data Forest app
- Using Data Forest
    - Create and manage account: how to create, manage, and verify a Data Forest account
    - Create and manage notebook: guide to creating and managing Data Forest notebooks
    - Create and manage app: guide to creating and managing Data Forest apps
    - Using Data Forest app
        - Access quick links: guide to quick link types and how to access quick links
        - Using Dev: Dev app details and how to use it
        - Using Elasticsearch: Elasticsearch details and precautions
        - Using Grafana: Grafana details, how to add a data source, and how to back up the database
        - Using HBase: HBase details and precautions
        - Using Hive: Hive details, access method, and precautions
        - Using Hue: Hue details guide
        - Using Kafka: Kafka details, how to use Kafka Manager, and precautions when using
        - Using Kibana: Kibana details guide
        - Using OpenTSDB: OpenTSDB details guide
        - Using Phoenix: Phoenix details guide
        - Using Spark History Server: how to view Spark History Server details and jobs
        - Using Trino: Trino details guide
        - Using Zeppelin: Zeppelin details, interpreter settings, and backup instructions
        - Using Zookeeper: Zookeeper details, how to connect with other apps, and precautions when using
    - Monitoring: how to monitor submitted batch jobs and apps
- Utilizing Data Forest ecosystem
    - Using HDFS: how to upload and download files to and from HDFS
    - Using Public Hive: guide to creating Hive databases and tables
    - Using Oozie: how to compose workflows
    - Using Ranger: how to set up Apache Ranger policies
    - Using Spark: how to submit a Spark job
- Data Forest usage examples
    - Copying HDFS data to Object Storage: how to copy HDFS data to Object Storage
    - Registering Spark batch jobs with Oozie scheduler: how to register Spark batch jobs with the Oozie scheduler
    - Data processing with Spark and Hive: how to process data with Spark and Hive using the Zeppelin and Dev apps
- Using AI Forest
    - AI Forest overview: guide to AI Forest
    - AI Forest usage scenarios: scenarios for using AI Forest
    - Create and manage workspace: how to create and manage workspaces
    - Using workspace browser: how to manage and edit source files in a workspace
    - Manage AI app: AI app details, how to view logs, and how to end the app
    - Using AI Forest CLI: scenario guide for using the AI Forest CLI in a Linux environment
    - AI Forest use examples
        - Classifying MNIST handwritten images with TensorFlow: guide to submitting jobs in Singlebatch
        - Detecting objects in the pedestrian dataset with PyTorch: how to write a pedestrian object detection program and submit it as a Singlebatch job
    - Container Registry integration: how to integrate with Container Registry to use Docker images
- Managing Data Forest permissions: Data Forest permission management methods and policy guide
- Data Forest release notes: Data Forest user guide update history
Data Forest related resources
NAVER Cloud Platform provides various resources, in addition to these guides, to help customers understand Data Forest. Developers and marketers who are considering adopting Data Forest or who need detailed information while establishing data-related policies can make active use of the following resources.
- Portal and console user guide: basic guide to subscribing to and managing Data Forest and to using the portal and console
- Pricing plans, characteristics, and detailed features: Data Forest introduction and pricing information
- Latest service news: latest news related to Data Forest
- FAQ: frequently asked questions about Data Forest
- Contact us: send a direct inquiry for any questions the user guides do not resolve
Check the FAQs first.
The FAQ can quickly answer common questions. If your questions are not resolved in the FAQ, see the user guides to find the information you want.
Q. Cloud Hadoop and Data Forest seem to be similar services. What's the difference?
A. The key difference is that Cloud Hadoop is server-based while Data Forest is serverless.
- Cloud Hadoop builds and provides a Hadoop cluster using customer-dedicated resources.
    - It is a self-managed product in which the customer manages Hadoop directly.
    - An open source-based web management tool (Apache Ambari) is provided for self-management.
- Data Forest is a serverless product used by submitting the jobs (DL jobs) required for analysis. Hadoop ecosystem components that must run long-lived can be analyzed easily by creating an app.
    - It is a managed product that guarantees high availability at the product level, rather than requiring customers to manage Hadoop directly.
    - It provides more apps than Cloud Hadoop, and GPU-based deep learning jobs can also be submitted.
Comparison
Feature | Cloud Hadoop | Data Forest |
---|---|---|
Scalability | The user determines the Hadoop cluster size directly | Managed by the service |
Cost | Hadoop cluster maintenance fees apply | Fees apply for app runtime and storage |
Maintenance | The user manages the cluster directly; a management tool (Apache Ambari) is provided | Managed by the service |
Characteristics | The user can freely configure the environment | Provides various apps; GPU-based deep learning jobs can be submitted |
Q. What features are provided to collect and process real-time data or configure ETL environments?
A. Data Forest does not directly provide real-time data collection and processing, but you can configure such an environment by combining various NAVER Cloud Platform services with the Hadoop ecosystem apps provided by Data Forest. A dedicated ETL service is scheduled to be released as a separate product in the future.
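As an illustration, events could be pushed into a Kafka app and consumed downstream by Spark or Hive. The snippet below is a minimal sketch using the third-party kafka-python client; the broker address and topic name are hypothetical placeholders, not values provided by Data Forest.

```python
# Minimal sketch of feeding events into a Kafka app for an ETL-style
# pipeline, using the third-party kafka-python client. The broker
# address and topic name are hypothetical placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-broker.example.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a sample event; a downstream consumer (e.g., a Spark job)
# could read this topic, transform the records, and load them into
# HDFS or a Hive table.
producer.send("raw-events", {"user_id": 42, "action": "click"})
producer.flush()
producer.close()
```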
Q. I want to access the quick links provided by Data Forest. How can I access them?
A. Accessing quick links requires a Data Forest notebook server. You can create one in the Data Forest > Notebooks menu.
For more information, see Access quick links.
Q. When I run the creation command in the terminal of my PC to create an SSH tunnel between my PC and the notebook, I keep being asked for a password.
A. If key-based authentication fails, SSH repeatedly prompts for a password. This can happen when the key used to run the SSH command differs from the login key set when the notebook was created. If you have lost the authentication key, you must change the authentication key.
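As an illustration of avoiding the password prompt, the tunnel can be opened with the private key passed explicitly. The sketch below shells out to the standard ssh client from Python; the key path, ports, user, and host are hypothetical placeholders.

```python
# Minimal sketch of opening the SSH tunnel with the private key passed
# explicitly, so the login key chosen at notebook creation is always
# used. The key path, ports, user, and host are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "ssh",
        "-i", "/path/to/notebook-login-key.pem",  # login key set when creating the notebook
        "-N",                                      # tunnel only; do not open a remote shell
        "-L", "18888:localhost:8888",              # forward local port 18888 to the notebook
        "forest@notebook-host.example.com",
    ],
    check=True,
)
```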