Monitoring

Available in VPC.

This section describes the layout of the Monitoring interface. Monitoring provides two dashboards that show GPU and storage usage for your project:

- Overview: Monitoring information for overall GPU and storage usage in the project
- Workload/Pod View: Monitoring information for GPU usage by workload type, workload, and Pod

These dashboards show average resource usage over the past 30 days. Each dashboard consists of several charts that help you monitor real-time usage.

Overview

The Overview interface provides the following components:

mlxp_console_monitoring01_ko

Component	Description
① Overview	CPU, Memory, GPU, and Storage usage in the project
② Details	CPU, Memory, and GPU usage by workload type

The Overview dashboard consists of several charts that show CPU, Memory, GPU, and Storage usage for your project at a glance. Metrics are collected every 15 seconds, and the charts display average values.
The following describes the charts included in the Overview interface.

Graph chart	Unit	Description
CPU	Cores	CPU usage in the project. Used: total CPU currently in use; Request: total CPU requested in the resource manifest
Memory	Gi	Memory usage in the project. Used: total Memory currently in use; Request: total Memory requested in the resource manifest
GPU	Count	GPU usage in the project. Used: total GPUs currently in use; Total: total GPUs allocated to the project
Storage	Gi	Storage usage in the project: total PVC capacity

Caution

GPU Total indicates the total GPU resources allocated to the project.
For detailed GPU resource information, refer to the GPU Resources interface.

The Overview interface also shows CPU, Memory, and GPU usage by workload type, so you can check resource usage for each workload type currently running in the project, such as Notebook and PyTorchJob.
The following describes the workload-type details displayed in the Overview interface.

Details	Unit	Description
Workload Type	-	Workload types currently in use (e.g., Notebook, PyTorchJob)
CPU	Cores	CPU usage status. Request: total CPU requested in the resource manifest; Limits: total CPU limit set in the resource manifest
Memory	Gi	Memory usage status. Request: total Memory requested in the resource manifest; Limits: total Memory limit set in the resource manifest
GPU	Count	GPU usage status. Used: total GPUs currently in use
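
The per-type Request and Limits totals in the table above can be thought of as sums over the Pods of each workload type. The following is a minimal sketch of that aggregation, not the console's implementation: the record layout and field names (`workload_type`, `cpu_request`, `cpu_limit`, `gpu_used`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-Pod records; the field names are illustrative assumptions,
# not a platform API.
pods = [
    {"workload_type": "Notebook", "cpu_request": 2.0, "cpu_limit": 4.0, "gpu_used": 1},
    {"workload_type": "PyTorchJob", "cpu_request": 8.0, "cpu_limit": 8.0, "gpu_used": 4},
    {"workload_type": "PyTorchJob", "cpu_request": 8.0, "cpu_limit": 8.0, "gpu_used": 4},
]


def summarize_by_workload_type(pods):
    """Sum CPU requests, CPU limits, and GPUs in use for each workload type."""
    totals = defaultdict(lambda: {"cpu_request": 0.0, "cpu_limit": 0.0, "gpu_used": 0})
    for pod in pods:
        row = totals[pod["workload_type"]]
        row["cpu_request"] += pod["cpu_request"]
        row["cpu_limit"] += pod["cpu_limit"]
        row["gpu_used"] += pod["gpu_used"]
    return dict(totals)


summary = summarize_by_workload_type(pods)
for workload_type, row in summary.items():
    print(workload_type, row)
```

Memory requests and limits would be summed the same way, using Gi as the unit.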

Note

The Overview interface refreshes automatically every minute.

Workload/Pod View

The Workload/Pod View interface provides the following components:

mlxp_console_monitoring02_ko

Component	Description
① View unit	The Workload Type, workload name, and Pod name currently in use in the project
② Time range	The monitoring time range and refresh control
③ Workload/Pod View	CPU, Memory, GPU, and GPU Memory usage by the selected resource unit and time range
④ Details	CPU, Memory, and GPU usage at the Pod level

The Workload/Pod View provides several chart-based dashboards. You can view CPU, Memory, GPU, and GPU Memory usage in your project at the Workload Type, workload, and Pod levels. Metrics are collected every 15 seconds, and the charts display average values. To use the dashboard:

1. Select the resource unit you want to view.
   - Workload Type: Workload types currently in use (e.g., Notebook, PyTorchJob). Only one type can be selected at a time.
   - Workload: The list of workloads in the project. Multiple workloads can be selected.
     - All: View all workloads used in the project.
   - Pod: The list of Pods within a workload. Multiple Pods can be selected.
     - OFF: Pod-level view is unavailable if All or multiple workloads are selected.
     - ALL: When one workload is selected, view all Pods within that workload.
2. Select the time range from the time-range selector or enter a custom period.

Note

The time range cannot exceed 30 days.

3. View the results in the dashboard.
   - To check the exact metric value at a specific point in time, hover over the desired point on the graph.
4. Click the [Refresh] button to retrieve the latest data.
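
If you prepare a custom period outside the console (for example, in a script that records which windows to inspect), the 30-day limit noted above can be checked in advance. The following is a minimal sketch; `validate_time_range` and `MAX_RANGE` are hypothetical names for illustration and are not part of the console or any platform API.

```python
from datetime import datetime, timedelta

# Maximum monitoring window stated in this guide.
MAX_RANGE = timedelta(days=30)


def validate_time_range(start, end):
    """Return the window length, or raise ValueError if the window is invalid."""
    if end <= start:
        raise ValueError("end must be later than start")
    if end - start > MAX_RANGE:
        raise ValueError("time range cannot exceed 30 days")
    return end - start


# A window of exactly 30 days is accepted.
window = validate_time_range(datetime(2024, 1, 1), datetime(2024, 1, 31))
print(window.days)
```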

The following describes the graph charts that make up the Workload/Pod View interface.

Graph chart	Unit	Description
Average CPU Utilization	%	The average CPU utilization (Used/Request) for the selected conditions and time range. Used: total CPU currently in use; Request: total CPU requested in the resource manifest
CPU Usage	Cores	CPU usage for the selected resource unit (workload or Pod)
Average Memory Utilization	%	The average Memory utilization (Used/Request) for the selected conditions and time range. Used: total Memory currently in use; Request: total Memory requested in the resource manifest
Memory Usage	MiB	Memory usage for the selected resource unit (workload or Pod)
Average GPU Utilization	%	The average GPU utilization for the selected conditions and time range
GPU Usage	Cores	GPU usage for the selected resource unit (workload or Pod)
Average GPU Memory Utilization	%	The average GPU Memory utilization for the selected conditions and time range
GPU Memory Usage	MiB	GPU Memory usage for the selected resource unit (workload or Pod)
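
To make the Used/Request ratio in the table above concrete, the following sketch computes an average utilization percentage from a series of usage samples. It assumes one sample per 15-second collection interval and a simple mean of per-sample ratios; the console's exact aggregation method is not described in this guide, and `average_utilization` is a hypothetical helper.

```python
def average_utilization(used_samples, request):
    """Average Used/Request as a percentage over a series of samples.

    used_samples: usage values (for example, CPU cores), assumed to be
                  collected once every 15 seconds.
    request:      the amount requested in the resource manifest.
    Returns None when there is nothing to average or no request is set.
    """
    if request <= 0 or not used_samples:
        return None
    ratios = [used / request for used in used_samples]
    return 100.0 * sum(ratios) / len(ratios)


# Four samples (one minute of data) for a workload that requests 4 cores.
cpu_used = [1.0, 2.0, 3.0, 2.0]
print(average_utilization(cpu_used, request=4.0))  # 50.0
```

A value above 100% is possible under this definition whenever usage exceeds the request, which can happen if the limit is set higher than the request.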

The Workload/Pod View also shows detailed CPU, Memory, and GPU usage at the Pod level, so you can check resource usage for the Pods currently running in your project.
The following describes the Pod-level details displayed in the Workload/Pod View interface.

Details	Unit	Description
Pod Name	-	The name of the Pod within the selected workload
CPU	Cores	CPU usage status. Request: total CPU requested in the resource manifest; Limits: total CPU limit set in the resource manifest
Memory	Gi	Memory usage status. Request: total Memory requested in the resource manifest; Limits: total Memory limit set in the resource manifest
GPU	Count	GPU usage status. Used: total GPUs currently in use
