Ncloud TensorFlow Cluster tcm command user guide


Available in the Classic environment.


Obtaining Permission to Use tcm CLI Commands

To obtain permission for using tcm CLI commands:

  1. Access the Ncloud TensorFlow Cluster master server and run the tcm auth command.

    root@kym:~# tcm auth
    
  2. When you run the command for the first time, a message prompting you to enter the Access Key and Secret Key is displayed because you do not yet have permission to use the tcm CLI. Copy and paste the Access Key; a prompt for the Secret Key then appears, and you can copy and paste it in the same way. To find your Access Key, see Ncloud APIs.

  3. Once authentication is complete, a success message is displayed as shown below. After that, you can freely use the tcm CLI commands (no re-authentication is required as long as the server remains in use).

    Authentication Success.
    

Check the available tcm CLI commands

To check the available tcm CLI commands, run tcm with no arguments on the Ncloud TensorFlow Cluster master server.

root@kym:~# tcm

When you run this command, a list of available commands is displayed as shown below.

root@kym:~# tcm
Type:        Tcm
String form: <__main__.Tcm object at 0x7f9c6bfb91d0>

Usage:       tcm
            tcm add-vm
            tcm create
            tcm delete
            tcm history
            tcm info
            tcm monitor
            tcm mount
            tcm nas-create
            tcm nas-delete
            tcm nas-info
            tcm nas-volume
            tcm rm-vm
            tcm start-vm
            tcm stop-vm
            tcm submit
            tcm unmount

Next, let’s go through each command and its usage.

Create a cluster

  1. You can check the cluster information at any time, both before and after creating a cluster.

    root@kym:~# tcm info
    [NCP TensorFlow Cluster] No clusters. Please create node using 'create' command
    

    Since no cluster has been created yet, a message appears prompting you to create one using the create command.

  2. Run tcm create. Usage details are shown below; arguments in brackets are optional. The default value of --spec is basic.

    Usage:       tcm create COUNT [SPEC]
                tcm create --count COUNT [--spec SPEC]
    
  3. When you run tcm create 3 or tcm create --count 3, 3 servers with the basic specification are created in the cluster.

    You can specify the server node specification with the --spec option; see the example after the following list. The available specifications are as follows (gpu1 and gpu2 are not supported yet).

    • mini: 4 vCPUs, 16 GB memory, 50 GB HDD
    • basic: 8 vCPUs, 32 GB memory, 50 GB HDD
    • high: 16 vCPUs, 32 GB memory, 50 GB HDD
    • gpu1: 1 GPU (24 GB GPU memory), 4 vCPUs, 30 GB memory, 50 GB SSD
    • gpu2: 2 GPUs (48 GB GPU memory), 8 vCPUs, 60 GB memory, 50 GB SSD
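
    For example, either of the following equivalent forms creates three server nodes with the high specification:

    tcm create 3 high
    tcm create --count 3 --spec high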

    Run the tcm create 3 command to create three server nodes. The number of server nodes must be at least 2 and at most 10. You can later scale out the cluster using the add-vm command.

    root@kym:~# tcm create 3
    [NCP TensorFlow Cluster] Creating nodes to configure the cluster...
    [NCP TensorFlow Cluster] Successfully requested. After the cluster is installed, you can check the cluster infomation through the 'tcm info' command.
    root@kym:~#
    
  4. To verify that the servers were created correctly, run the tcm info command. Each server node name is generated from the master server name followed by a random string and a sequence number.

    root@kym:~# tcm info
    +----------------+-------------+------------+---------------------+-------------+--------------+------------+
    |  Server Name   | Instance No |   Status   |     Create Date     | Memory Size |  Private IP  |    SSH     |
    |                |             |            |                     |             |              | Connection |
    +----------------+-------------+------------+---------------------+-------------+--------------+------------+
    | kym-zc8mooq4-1 |    539137   | setting up | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97  |   False    |
    | kym-zc8mooq4-2 |    539140   | setting up | 2018-03-12 15:43:50 |    32.00 GB | 10.39.13.235 |   False    |
    | kym-zc8mooq4-3 |    539143   | setting up | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202  |   False    |
    +----------------+-------------+------------+---------------------+-------------+--------------+------------+
    

    The server status is displayed as setting up, and the inter-server communication status (SSH Connection) is displayed as False. This process may take several minutes. Check the status of the server nodes again.

    +----------------+-------------+---------+---------------------+-------------+--------------+------------+
    |  Server Name   | Instance No |  Status |     Create Date     | Memory Size |  Private IP  |    SSH     |
    |                |             |         |                     |             |              | Connection |
    +----------------+-------------+---------+---------------------+-------------+--------------+------------+
    | kym-zc8mooq4-1 |    539137   | running | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97  |    True    |
    | kym-zc8mooq4-2 |    539140   | running | 2018-03-12 15:43:50 |    32.00 GB | 10.39.13.235 |    True    |
    | kym-zc8mooq4-3 |    539143   | running | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202  |    True    |
    +----------------+-------------+---------+---------------------+-------------+--------------+------------+
    

    When the server status is running and the inter-server communication status (SSH Connection) for all servers is True, cluster creation is complete.

Add cluster server nodes

To add servers to a cluster, use the tcm add-vm [number of servers] or tcm add-vm --count [number of servers] command.

root@kym:~# tcm add-vm 1
[NCP TensorFlow Cluster] Successfully requested.

The command creates one additional server with the same specification as the existing server nodes in the cluster. The tcm add-vm 1 command is equivalent to tcm add-vm --count 1. Each execution of the tcm add-vm command can add up to 10 servers.

You can check the status of the newly added server nodes as shown below. Verify that the server status is running and the inter-server communication status (SSH Connection) is True.

root@kym:~# tcm info
+----------------+-------------+------------+---------------------+-------------+--------------+------------+
|  Server Name   | Instance No |   Status   |     Create Date     | Memory Size |  Private IP  |    SSH     |
|                |             |            |                     |             |              | Connection |
+----------------+-------------+------------+---------------------+-------------+--------------+------------+
| kym-zc8mooq4-1 |    539137   |  running   | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97  |    True    |
| kym-zc8mooq4-2 |    539140   |  running   | 2018-03-12 15:43:50 |    32.00 GB | 10.39.13.235 |    True    |
| kym-zc8mooq4-3 |    539143   |  running   | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202  |    True    |
| kym-zc8mooq4-4 |    539158   | setting up | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214  |   False    |
+----------------+-------------+------------+---------------------+-------------+--------------+------------+

Stop cluster server nodes

This section explains how to stop server nodes in a cluster.

You can stop the servers individually or stop all servers at once. If you enter all instead of a server instance number, the command applies to all server nodes.

Usage:       tcm stop-vm INSTANCE_NO
             tcm stop-vm --instance-no INSTANCE_NO
root@kym:~# tcm stop-vm all
[NCP TensorFlow Cluster] Successfully requested.
root@kym:~# tcm info
+----------------+-------------+---------------+---------------------+-------------+--------------+------------+
|  Server Name   | Instance No |     Status    |     Create Date     | Memory Size |  Private IP  |    SSH     |
|                |             |               |                     |             |              | Connection |
+----------------+-------------+---------------+---------------------+-------------+--------------+------------+
| kym-zc8mooq4-1 |    539137   | shutting down | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97  |   False    |
| kym-zc8mooq4-2 |    539140   | shutting down | 2018-03-12 15:43:50 |    32.00 GB | 10.39.13.235 |   False    |
| kym-zc8mooq4-3 |    539143   | shutting down | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202  |   False    |
| kym-zc8mooq4-4 |    539158   | shutting down | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214  |   False    |
+----------------+-------------+---------------+---------------------+-------------+--------------+------------+
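
To stop a single node instead of the whole cluster, pass its instance number, as in this hypothetical example using the first node from the table above:

tcm stop-vm 539137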

Start cluster server nodes

This section explains how to start server nodes in a cluster.

You can start stopped servers individually or start all servers at once. If you enter all instead of a server instance number, the command applies to all server nodes.

Usage:       tcm start-vm INSTANCE_NO
             tcm start-vm --instance-no INSTANCE_NO
root@kym:~# tcm start-vm 539137
success
root@kym:~# tcm info
+----------------+-------------+---------+---------------------+-------------+--------------+------------+
|  Server Name   | Instance No |  Status |     Create Date     | Memory Size |  Private IP  |    SSH     |
|                |             |         |                     |             |              | Connection |
+----------------+-------------+---------+---------------------+-------------+--------------+------------+
| kym-zc8mooq4-1 |    539137   | booting | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97  |   False    |
| kym-zc8mooq4-2 |    539140   | stopped | 2018-03-12 15:43:50 |    32.00 GB | 10.39.13.235 |   False    |
| kym-zc8mooq4-3 |    539143   | stopped | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202  |   False    |
| kym-zc8mooq4-4 |    539158   | stopped | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214  |   False    |
+----------------+-------------+---------+---------------------+-------------+--------------+------------+

After the servers start, their status changes from booting to running, which may take a few minutes.
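
While you wait, you can poll the cluster status from the console. A convenience sketch, assuming the standard watch utility is available on the master server:

watch -n 30 tcm info    # re-run 'tcm info' every 30 seconds; exit with Ctrl + C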

Delete cluster server nodes

You can delete stopped server nodes using the tcm rm-vm command.

Usage:       tcm rm-vm INSTANCE_NO
             tcm rm-vm --instance-no INSTANCE_NO
root@kym:~# tcm rm-vm 539140
[NCP TensorFlow Cluster] Successfully requested.
root@kym:~# tcm info
+----------------+-------------+---------+---------------------+-------------+-------------+------------+
|  Server Name   | Instance No |  Status |     Create Date     | Memory Size |  Private IP |    SSH     |
|                |             |         |                     |             |             | Connection |
+----------------+-------------+---------+---------------------+-------------+-------------+------------+
| kym-zc8mooq4-1 |    539137   | running | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97 |    True    |
| kym-zc8mooq4-3 |    539143   | stopped | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202 |   False    |
| kym-zc8mooq4-4 |    539158   | stopped | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214 |   False    |
+----------------+-------------+---------+---------------------+-------------+-------------+------------+

If the cluster has only two server nodes, individual nodes can no longer be deleted. In this case, stop all servers using tcm stop-vm, and then delete the cluster using the tcm delete command.

Delete Cluster

To delete a cluster, use the tcm delete command. This operation can only be performed when all server nodes are in a stopped state.

Usage:       tcm delete
root@kym:~# tcm delete
[NCP TensorFlow Cluster] VM is powered on. Please shut down the VM.
root@kym:~# tcm info
+----------------+-------------+---------+---------------------+-------------+-------------+------------+
|  Server Name   | Instance No |  Status |     Create Date     | Memory Size |  Private IP |    SSH     |
|                |             |         |                     |             |             | Connection |
+----------------+-------------+---------+---------------------+-------------+-------------+------------+
| kym-zc8mooq4-1 |    539137   | running | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97 |    True    |
| kym-zc8mooq4-3 |    539143   | running | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202 |    True    |
| kym-zc8mooq4-4 |    539158   | running | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214 |    True    |
+----------------+-------------+---------+---------------------+-------------+-------------+------------+
root@kym:~# tcm stop-vm all
[NCP TensorFlow Cluster] Successfully requested.
root@kym:~# tcm info
+----------------+-------------+---------------+---------------------+-------------+-------------+------------+
|  Server Name   | Instance No |     Status    |     Create Date     | Memory Size |  Private IP |    SSH     |
|                |             |               |                     |             |             | Connection |
+----------------+-------------+---------------+---------------------+-------------+-------------+------------+
| kym-zc8mooq4-1 |    539137   | shutting down | 2018-03-12 15:43:45 |    32.00 GB |  10.39.3.97 |   False    |
| kym-zc8mooq4-3 |    539143   | shutting down | 2018-03-12 15:43:55 |    32.00 GB | 10.39.3.202 |   False    |
| kym-zc8mooq4-4 |    539158   | shutting down | 2018-03-12 16:36:18 |    32.00 GB | 10.39.2.214 |   False    |
+----------------+-------------+---------------+---------------------+-------------+-------------+------------+
root@kym:~# tcm delete
[NCP TensorFlow Cluster] Successfully requested.
root@kym:~# tcm info
[NCP TensorFlow Cluster] No clusters. Please create node using 'create' command

Create shared cluster storage

If the local storage of the cluster master server is insufficient, or if you want server nodes to share training data, you can create NAS storage.

The NAS storage name is required. The volume option specifies the size of the NAS storage in GB; the default is 500 GB. The supported capacity range is 500–10,000 GB, and the size can be increased in 100 GB increments.

Usage:       tcm nas-create NAME [VOLUME]
             tcm nas-create --name NAME [--volume VOLUME]
root@kym:~# tcm nas-create train
success
root@kym:~# tcm nas-info
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+
| Name           | Instance No | Status | Size      | Used Size | Use Ratio(%) | Mount Info                   | ACL Instance List                        |
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+
| n000327_train  | 539170      | CREAT  | 500.00 GB | 272.00 KB | 0.0          | 10.250.48.15:/n000327_train  | []                                       |
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+

You can verify that a 500 GB NAS storage named train has been created.
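
To create storage with a non-default size, pass the volume option. For example, either of the following equivalent, hypothetical invocations would create a 1,000 GB volume named data:

tcm nas-create data 1000
tcm nas-create --name data --volume 1000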

Mount shared cluster storage

The created NAS storage is mounted simultaneously on all server nodes. Each cluster can mount only one NAS storage, and the mount path for all server nodes including the master server is fixed to /mnt/nas.

Usage:       tcm mount NAS_INSTANCE_NO
             tcm mount --nas-instance-no NAS_INSTANCE_NO

Look up the NAS instance number with the nas-info command and pass it as the argument to the mount command.

root@kym:~# tcm mount 539170
[NCP TensorFlow Cluster] Successfully requested.
root@kym:~# tcm nas-info
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+
| Name           | Instance No | Status | Size      | Used Size | Use Ratio(%) | Mount Info                   | ACL Instance List                        |
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+
| n000327_train  | 539170      | CREAT  | 500.00 GB | 348.00 KB | 0.0          | 10.250.48.15:/n000327_train  | ['539104', '539161', '539164', '539167'] |
+----------------+-------------+--------+-----------+-----------+--------------+------------------------------+------------------------------------------+

You can verify that the cluster master server and server nodes have been automatically registered in the NAS ACL instance list.

As shown below, you can check whether the NAS is properly mounted on each server node by running ssh root@[private IP of the server node]. By default, the public key of the cluster master server is automatically copied to all server nodes during cluster creation, so remote commands work without additional configuration.

root@kym:~# ssh root@10.39.3.232 'df -h'
Warning: Permanently added '10.39.3.232' (ECDSA) to the list of known hosts.
Filesystem                   Size  Used Avail Use% Mounted on
udev                          16G     0   16G   0% /dev
tmpfs                        3.2G  8.8M  3.2G   1% /run
/dev/xvda1                    48G  5.2G   40G  12% /
tmpfs                         16G     0   16G   0% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
tmpfs                         16G     0   16G   0% /sys/fs/cgroup
tmpfs                        3.2G     0  3.2G   0% /run/user/1000
10.250.48.15:/n000327_train  500G  320K  500G   1% /mnt/nas
tmpfs                        3.2G     0  3.2G   0% /run/user/0
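
To check every node at once, you can loop over the private IPs reported by tcm info. A minimal sketch, using the hypothetical IPs from the earlier examples:

for ip in 10.39.3.97 10.39.13.235 10.39.3.202; do
    ssh root@"$ip" 'hostname; df -h /mnt/nas'   # show each node's NAS mount
done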

Unmount shared cluster storage

You can unmount the NAS storage simultaneously from the cluster master server and all server nodes using the following command.

Usage:       tcm unmount NAS_INSTANCE_NO
             tcm unmount --nas-instance-no NAS_INSTANCE_NO

Keep in mind that the NAS ACL is automatically released upon unmounting.
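
For example, to unmount the NAS storage mounted earlier (instance number 539170):

tcm unmount 539170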

Delete shared cluster storage

This deletes the shared NAS storage. As with the unmount command, the storage is unmounted and its NAS ACL is automatically released at the same time.

Usage:       tcm nas-delete NAS_INSTANCE_NO
             tcm nas-delete --nas-instance-no NAS_INSTANCE_NO
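
For example, to delete the storage created earlier:

tcm nas-delete 539170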

Resize shared cluster storage volume

You can change the volume size of the NAS storage. The volume size can be adjusted between 500 GB and 10,000 GB, in 100 GB increments.

Usage:       tcm nas-volume NAS_INSTANCE_NO SIZE
             tcm nas-volume --nas-instance-no NAS_INSTANCE_NO --size SIZE
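
For example, to expand the 500 GB storage created earlier to 600 GB (one 100 GB increment):

tcm nas-volume 539170 600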

Cluster Job Submission

This service provides a cluster architecture designed for TensorFlow's Distributed TensorFlow feature. Your Python code must follow the syntax and structure defined by Distributed TensorFlow in order to run correctly. Before submitting a job, make sure that the SSH Connection status of all server nodes is True.

[PS_NUM] specifies the number of parameter servers; the default is 1. Set it to a number smaller than the number of worker servers. The cluster then automatically delivers the updated cluster specification to the user program (provided the program accepts cluster parameters). By default, all server nodes in the cluster operate as worker servers.

When using the shared storage (NAS) mount feature, you can share data across the distributed execution environment.

Usage:       tcm submit FILE_PATH [PS_NUM] [FORCE] [ARGS ...]
             tcm submit --file-path FILE_PATH [--ps-num PS_NUM] [--force FORCE] [ARGS ...]

The following is an example of submitting the Distributed TensorFlow example script provided by default.

root@kym:~# tcm submit /home/ncp/workspace/DistributedTensorFlow.py
root@kym:~# [NCP TensorFlow Cluster] Successfully requested. You can check the log using 'tcm monitor' command.
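
The bundled script follows the standard Distributed TensorFlow pattern. The following is a rough sketch of what such a script looks like (TensorFlow 1.x API; the flag names, the toy model, and the way tcm passes the cluster specification are illustrative assumptions, so consult the bundled DistributedTensorFlow.py for the exact contract):

import argparse
import tensorflow as tf  # TensorFlow 1.x

parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", required=True)       # e.g. "10.39.3.97:2222"
parser.add_argument("--worker_hosts", required=True)   # e.g. "10.39.13.235:2222,10.39.3.202:2222"
parser.add_argument("--job_name", required=True)       # "ps" or "worker"
parser.add_argument("--task_index", type=int, required=True)
args = parser.parse_args()

# Describe the cluster and start this node's server.
cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","),
                                "worker": args.worker_hosts.split(",")})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter servers block here and serve variables
else:
    # Pin variables to the parameter servers and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % args.task_index, cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        x = tf.random_normal([32, 10])                     # toy input batch
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))  # toy objective
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # Worker 0 acts as chief and initializes the shared session state.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(args.task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)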

If the [FORCE] option is set to True, the master server node monitors the execution status of all worker servers and automatically stops the parameter servers once all workers have terminated.
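
For example, a hypothetical submission that uses two parameter servers and enables this behavior:

tcm submit --file-path /home/ncp/workspace/DistributedTensorFlow.py --ps-num 2 --force True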

Since the job runs as a background job, control returns to the console prompt after execution. Logs can be checked using the tcm monitor command, which is explained in the next section.

View Cluster Job Logs

Usage:       tcm monitor

Continuously displays the TensorFlow training logs by tailing the log file. To stop viewing the logs, press Ctrl + C.

Logs from each server node are aggregated and stored on the cluster master server at /home/ncp/ncp.log.
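
Because the aggregated log is a regular file, you can also follow it directly instead of using tcm monitor:

tail -f /home/ncp/ncp.log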

root@kym:~# tcm monitor
[NCP TensorFlow Cluster] If you want to finish monitoring, press Ctr+c
2018-03-12 18:26:54.444489: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session f3f069b8690ca808 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /home/ncp/mnist-data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /home/ncp/mnist-data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /home/ncp/mnist-data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /home/ncp/mnist-data/t10k-labels-idx1-ubyte.gz
Worker 0: Initializing session...
Worker 0: Session initialization complete.
Training begins @ 1520846815.213741
1520846815.350583: Worker 0: training step 1 done (global step: 0)
1520846815.378844: Worker 0: training step 2 done (global step: 0)
1520846815.405878: Worker 0: training step 3 done (global step: 0)
1520846815.428620: Worker 0: training step 4 done (global step: 1)
1520846815.449924: Worker 0: training step 5 done (global step: 1)
1520846815.470958: Worker 0: training step 6 done (global step: 1)
1520846815.495449: Worker 0: training step 7 done (global step: 1)
1520846815.516172: Worker 0: training step 8 done (global step: 2)
1520846815.540724: Worker 0: training step 9 done (global step: 2)
1520846815.565506: Worker 0: training step 10 done (global step: 2)
------------------------------ omitted ------------------------------

1520846960.827992: Worker 0: training step 3031 done (global step: 3000)
Training ends @ 1520846960.828057
Training elapsed time: 145.614315 s
After 3000 training step(s), validation cross entropy = 812.873
##########################################################################################
[NCP TensorFlow Cluster] FINISH DISTRIBUTE JOB
[NCP TensorFlow Cluster] IF YOU WANT TO FINISH MONITORING, PRESS CTR+C
[NCP TensorFlow Cluster] TOTAL TIME : 163.34 seconds
##########################################################################################

View Cluster Job History

Usage:       tcm history

You can check the job submission history as shown below, including the status of jobs currently in progress.

root@kym:~# tcm history
+---------------------+----------------------------------------------+--------+---------------------+-------------+
| Submit Time         | File Name                                    | Status | End Time            | Total Time  |
|                     |                                              |        |                     | (sec)       |
+---------------------+----------------------------------------------+--------+---------------------+-------------+
| 2018-03-12 18:26:39 | /home/ncp/workspace/DistributedTensorFlow.py | Finish | 2018-03-12 18:29:22 | 163.34      |
| 2018-03-12 18:37:48 | /home/ncp/workspace/DistributedTensorFlow.py | submit | None                | None        |
+---------------------+----------------------------------------------+--------+---------------------+-------------+