Creating Ncloud TensorFlow Server clusters
Available in Classic
Before use
Q. What is Ncloud TensorFlow Cluster?
- Ncloud TensorFlow Cluster is a service that helps you run TensorFlow code faster on clusters when you train on massive data sets or perform very large computations.
- Using tcm commands from the command line, you can create, add, and delete cluster nodes, and easily share and expand storage between clusters. You can also submit jobs to clusters with your own code.
- Ncloud TensorFlow Cluster uses TensorFlow, an open source machine learning software library developed by Google Brain.
Q. What kind of OS does Ncloud TensorFlow Cluster use?
- The supported OS is “ubuntu-16.04-64-server”.
Q. What packages are available and can I use only the provided packages?
- The Ncloud TensorFlow Cluster master server comes with Anaconda and the tcm CLI package that enables you to run clusters. TensorFlow itself is not installed on the master server.
- TensorFlow is installed on each cluster node that is created by the Ncloud TensorFlow Cluster master server.
Q. Can I use Java or other languages besides Python?
- TensorFlow also provides APIs for other languages such as Java and Go, but their stability is not guaranteed. Therefore, we recommend using Python.
Q. How do I create an Ncloud TensorFlow cluster?
- Create as many servers with standard specifications as you need, and build a cluster from those servers using the tcm commands through a terminal.
- You can choose between the monthly and hourly pricing plans for your TensorFlow master server. After you create the server, set up your access environment to start using it. (Worker and parameter server nodes are created on the hourly plan.)
Q. What server types are available?
The Ncloud TensorFlow Cluster master server can use the Standard server types provided by NAVER CLOUD PLATFORM. For the server nodes that you create with the tcm CLI commands through a terminal, you can choose among the specifications below.
The server specifications available for Ncloud TensorFlow Cluster worker and parameter server nodes are listed below. (All server nodes are created with the same specifications, and GPU server types will be supported later.)
Spec code | Specifications | Description |
---|---|---|
mini | vCPU 4ea, Memory 16GB, HDD 50GB | Appropriate for testing clusters or processing small workloads. |
basic | vCPU 8ea, Memory 32GB, HDD 50GB | Appropriate for processing mid-sized workloads. |
high | vCPU 16ea, Memory 32GB, HDD 50GB | Appropriate for processing large workloads. |
gpu1 | GPU 1ea, GPU Mem 24GB, vCPU 4ea, Memory 30GB, SSD 50GB | Provides one GPU per cluster node. |
gpu2 | GPU 2ea, GPU Mem 48GB, vCPU 8ea, Memory 60GB, SSD 50GB | Provides two GPUs per cluster node, for processing very large workloads. |
(Note that server nodes are created with the same specifications, and all nodes are recognized as worker server nodes. The number of parameter server nodes can be specified.)
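Because every node in a cluster shares one specification, the total capacity of a cluster is simply the per-node figures multiplied by the node count. The following is a minimal sketch of that arithmetic using the values from the table above. The names `NODE_SPECS` and `cluster_totals` are illustrative only and are not part of the tcm CLI.

```python
# Per-node figures copied from the specification table above.
NODE_SPECS = {
    "mini":  {"vcpu": 4,  "memory_gb": 16, "gpu": 0},
    "basic": {"vcpu": 8,  "memory_gb": 32, "gpu": 0},
    "high":  {"vcpu": 16, "memory_gb": 32, "gpu": 0},
    "gpu1":  {"vcpu": 4,  "memory_gb": 30, "gpu": 1},
    "gpu2":  {"vcpu": 8,  "memory_gb": 60, "gpu": 2},
}


def cluster_totals(spec_code, node_count):
    """Return the aggregate resources of node_count nodes of one spec."""
    if spec_code not in NODE_SPECS:
        raise ValueError("unknown spec code: %r" % spec_code)
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    spec = NODE_SPECS[spec_code]
    # All nodes share the same spec, so totals scale linearly.
    return {name: value * node_count for name, value in spec.items()}


if __name__ == "__main__":
    print(cluster_totals("basic", 4))
```

For example, four "basic" nodes add up to 32 vCPUs and 128 GB of memory in total.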
Q. I have created an Ncloud TensorFlow Cluster master server. How can I create a worker server or parameter server?
- Connect to your master server through a terminal, and execute “tcm create [number of servers]”.
- For more information on how to use tcm CLI commands, see "How to Use Ncloud TensorFlow Cluster tcm Commands."
Q. How can I execute TensorFlow code in the cluster?
- Connect to your master server through a terminal, and execute “tcm submit [program path]”.
- For more information on how to use tcm CLI commands, see "How to Use Ncloud TensorFlow Cluster tcm Commands."
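If you want to drive the two commands above from a script, one possible approach is to build the argument lists in Python first. The sketch below only composes and prints the commands. It uses just the two command forms quoted in this FAQ (`tcm create [number of servers]` and `tcm submit [program path]`). The helper names and the path `/root/mnist.py` are hypothetical.

```python
import shlex


def tcm_create_command(server_count):
    """Build the argv for `tcm create [number of servers]`."""
    count = int(server_count)
    if count < 1:
        raise ValueError("server_count must be at least 1")
    return ["tcm", "create", str(count)]


def tcm_submit_command(program_path):
    """Build the argv for `tcm submit [program path]`."""
    if not program_path:
        raise ValueError("program_path must not be empty")
    return ["tcm", "submit", program_path]


if __name__ == "__main__":
    # Print the commands instead of running them. On the master server,
    # each list could be handed to subprocess.run(...) unchanged.
    for argv in (tcm_create_command(3), tcm_submit_command("/root/mnist.py")):
        print(" ".join(shlex.quote(part) for part in argv))
```

Keeping the arguments as a list avoids shell quoting problems if a program path contains spaces.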
About Ncloud TensorFlow Cluster
An Ncloud TensorFlow cluster is a set of tasks that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow server node, which contains a master server that can be used to create sessions, a worker server that executes operations in the graph, and a parameter server that shares computed gradients.
For the servers to do their tasks, each server node must receive a ClusterSpec that describes all of the tasks in the cluster. Ncloud TensorFlow Cluster makes this relatively easy.
With Ncloud TensorFlow Cluster, you don’t need to manage the ClusterSpec yourself and can concentrate on writing your training code. You still need to edit your code slightly so that the ClusterSpec is passed to the sessions. For more information about editing the code, see “Ncloud TensorFlow Cluster MNIST Example.”
This service is designed for distributed parallel processing, but because Anaconda is installed on the server, it also supports basic data preprocessing and visualization.
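To make the idea of a ClusterSpec more concrete, here is a minimal sketch of the kind of cluster definition it describes: a mapping from job names to lists of `host:port` addresses. The service generates this for you, so the sketch is for illustration only. The job names "worker" and "ps", the assignment of ports 2222 and 3333 to those jobs, and the IP addresses are all assumptions, not values confirmed by this guide.

```python
def build_cluster_definition(worker_hosts, ps_hosts,
                             worker_port=2222, ps_port=3333):
    """Return a dict in the shape accepted by tf.train.ClusterSpec.

    The job names and the port-to-job assignment are assumptions made
    for this sketch.
    """
    if not worker_hosts:
        raise ValueError("at least one worker host is required")
    return {
        "worker": ["%s:%d" % (host, worker_port) for host in worker_hosts],
        "ps": ["%s:%d" % (host, ps_port) for host in ps_hosts],
    }


if __name__ == "__main__":
    cluster = build_cluster_definition(
        worker_hosts=["10.0.0.11", "10.0.0.12"],  # hypothetical private IPs
        ps_hosts=["10.0.0.21"],                   # hypothetical private IP
    )
    print(cluster)
    # With TensorFlow 1.x installed, the dict could then be wrapped as:
    #   import tensorflow as tf
    #   spec = tf.train.ClusterSpec(cluster)
    #   server = tf.train.Server(spec, job_name="worker", task_index=0)
```

Every node receives the same definition and differs only in its own job name and task index, which is why the cluster description can be generated once and shared.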
Ncloud TensorFlow Cluster Configuration
Create Ncloud TensorFlow Cluster
Step 1. Connect to Console
Connect to the Console and select Server > Server.
① To create a server, click [Create server].
Step 2. Select server image
Select a TensorFlow server image to create a server.
① Select Application > Tensorflow.
② Select “tensorflow-cluster-ubuntu-16.04-64-master” from Image and click [Next].
Step 3. Set server
Select a storage type, server type, pricing plan, and zone, and enter a server name.
① Select a zone. You can select between “KR-1” and “KR-2.”
② Select a server storage type.
- Select SSD for services that require high-performance I/O and HDD for general services. Please note that you can use SSD as additional storage only if the boot storage is SSD.
③ Select the server type you want. Only the Standard server types are available for your master server.
④ Select either the monthly or hourly pricing plan.
⑤ Enter the number of servers. (Select 1, because one master server forms one cluster. Select more than 1 only if you want multiple clusters.)
⑥ Enter a server name and click [Next].
Step 4. Set authentication key
If you have an existing authentication key, select Use an Existing Authentication Key. Otherwise, create a new authentication key according to the following procedure.
① Select Create a New Authentication Key.
② Enter an authentication key name.
③ Click [Create & Save Auth Key] to save the authentication key file to your local PC.
- Issue a new authentication key.
- After saving it, please keep the authentication key in a safe place on your PC.
- The authentication key is used to obtain an initial administrator password.
④ Click [Next].
Step 5. Set ACG
In Ncloud TensorFlow Cluster, worker nodes communicate with each other to perform jobs. The service needs specific ports for this communication, which you can open by adding an ACG in the NAVER CLOUD PLATFORM Console. There are two ways to add an ACG for Ncloud TensorFlow Cluster:
① Creating a custom ACG
- Go to the ACG setting page.
- Click Create ACG.
- Enter an ACG name and click Create.
- Select the ACG from the ACG list and click Set ACG.
- Add the following 3 access sources as shown in the table below.
Protocol | Access source | Allowed port |
---|---|---|
TCP | 0.0.0.0 | 22 |
TCP | 0.0.0.0 | 2222 |
TCP | 0.0.0.0 | 3333 |
- Click Apply.
② Using “ncloud-default-acg”
“ncloud-default-acg” is an ACG that is registered by default. This ACG setting allows your nodes to access all ports.
With this ACG setting, worker nodes can communicate with each other in Ncloud TensorFlow Cluster.
After preparing the ACG in one of the two ways above, select it for the server.
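After the ACG is applied, you may want to confirm that the ports from the table above (22, 2222, and 3333) are actually reachable. The following is a minimal sketch that uses only the Python standard library. The function names are illustrative, and the address `127.0.0.1` is a placeholder that you would replace with a node address.

```python
import socket

CLUSTER_PORTS = (22, 2222, 3333)  # ports from the ACG table above


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=CLUSTER_PORTS, timeout=2.0):
    """Map each port to True or False depending on reachability."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Placeholder address; replace it with the address of a cluster node.
    for port, is_open in check_ports("127.0.0.1").items():
        print(port, "open" if is_open else "closed or filtered")
```

Note that a port is reported as open only when a process is listening on it, so a "closed" result on a node with no running job does not necessarily mean that the ACG rule is missing.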
Step 6. Confirm
Confirm the settings.
① Make sure that the server image, server, authentication key, and ACG are set properly.
② After the final confirmation, click [Create server].
- It may take several minutes or longer for the server to be created.
Check in the server list
Check if the created server is in the list.
① The server you have created will appear in the list.
② Wait until the server is created, the package is installed, and the server status becomes Running.
Set port forwarding and connect
Set port forwarding
To connect to a server via a terminal program (such as PuTTY), you must set port forwarding.
① Select the server from the server list and click [Set port forwarding].
② Set an external port number in the Set port forwarding window. The external port number must be between 1024 and 65,534. The forwarded port can be used only for connecting to the server, not for other service purposes. (The internal port number is set to 22.)
③ Click [Add] to add the setting to the list at the bottom. Select [Edit] or [Delete] to edit or delete a setting.
④ Click [Apply]. You can then connect over SSH to the configured external port using a terminal program.
Connect to server and use tcm CLI command
Using a terminal program, connect to your server with ssh root@[public IP address for access] -p [port forwarding port number], as shown in the figure below. Once connected, you can use tcm CLI commands.
① Connect to the Ncloud TensorFlow Cluster master server.
② Execute “tcm” and the list of CLI commands is displayed.
③ Execute “tcm info” to see the cluster nodes previously created. (If there is no cluster node, use the “create” command to create one.)
- For more information on how to use tcm CLI commands, see "How to Use Ncloud TensorFlow Cluster tcm Commands."
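The connection command described in this section can also be assembled programmatically. The sketch below builds the `ssh root@[public IP address] -p [port]` command and checks the external port against the 1024 to 65,534 range given in the port forwarding step. The function name is illustrative, and `203.0.113.10` and port `2022` are example values only.

```python
def ssh_command(public_ip, external_port, user="root"):
    """Build the argv for `ssh root@[public IP] -p [external port]`."""
    port = int(external_port)
    # The guide gives 1024 to 65,534 as the allowed external port range.
    if not 1024 <= port <= 65534:
        raise ValueError("external port must be between 1024 and 65534")
    return ["ssh", "%s@%s" % (user, public_ip), "-p", str(port)]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only example address (RFC 5737).
    print(" ".join(ssh_command("203.0.113.10", 2022)))
```

Validating the port before connecting catches a mistyped value early, before the SSH client reports a less specific connection error.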