Monitoring


Available in VPC.

This section describes the layout of the Monitoring interface. Monitoring provides two dashboards that show GPU and storage usage for your project:

  • Overview: Monitoring information for overall GPU and storage usage in the project
  • Workload/Pod View: Monitoring information for GPU usage by workload type, workload, and Pod

These dashboards let you view average resource usage over the past 30 days. Each dashboard consists of several graph charts that help you monitor real-time usage.

Overview

The Overview interface provides the following components:

[Screenshot: Overview interface (mlxp_console_monitoring01_ko)]

Component Description
① Overview CPU, Memory, GPU, and Storage usage in the project
② Details CPU, Memory, and GPU usage by workload type

The Overview dashboard consists of several graph charts that show CPU, Memory, GPU, and Storage usage for your project at a glance. Metrics are collected every 15 seconds and displayed as average values.
The following table describes the graph charts in the Overview interface.

Graph chart Unit Description
CPU Cores CPU usage in the project
  • Used: Total CPU currently in use
  • Request: Total CPU request based on the resource manifest
Memory Gi Memory usage in the project
  • Used: Total Memory currently in use
  • Request: Total Memory request based on the resource manifest
GPU Count GPU usage in the project
  • Used: Total GPU currently in use
  • Total: Total GPUs allocated to the project
Storage Gi Storage usage in the project
  • Total PVC capacity
Caution
  • GPU Total indicates the total GPU resources allocated to the project.
  • For detailed GPU resource information, refer to the GPU Resources interface.

The Overview interface also shows CPU, Memory, and GPU usage by workload type, so you can check resource usage for each workload type currently running in the project, such as Notebook and PyTorchJob.
The following table describes the workload-type details displayed in the Overview interface.

Details Unit Description
Workload Type - Workload types currently in use (e.g., Notebook, PyTorchJob)
CPU Cores Status of CPU usage
  • Request: Total CPU request based on the resource manifest
  • Limits: Total CPU limit based on the resource manifest
Memory Gi Status of Memory usage
  • Request: Total Memory request based on the resource manifest
  • Limits: Total Memory limit based on the resource manifest
GPU Count Status of GPU usage
  • Used: Total GPU currently in use
Note

The Overview interface refreshes automatically every minute.
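The collection and refresh intervals above determine how many data points sit behind each view. The following is a minimal arithmetic sketch, assuming exactly one sample per 15-second collection interval; the constant and function names are hypothetical and are not part of the product.

```python
# Intervals stated in this guide; the names are illustrative only.
COLLECTION_INTERVAL_S = 15   # metrics are collected every 15 seconds
REFRESH_INTERVAL_S = 60      # the Overview interface refreshes every minute
MAX_HISTORY_DAYS = 30        # dashboards cover up to the past 30 days


def samples_in(window_seconds: int, interval_s: int = COLLECTION_INTERVAL_S) -> int:
    """Number of collection points in a window, assuming one sample per interval."""
    return window_seconds // interval_s


samples_per_refresh = samples_in(REFRESH_INTERVAL_S)
samples_in_full_history = samples_in(MAX_HISTORY_DAYS * 24 * 60 * 60)

print(samples_per_refresh)       # 4
print(samples_in_full_history)   # 172800
```

Under this assumption, each automatic refresh picks up four new collection points, which is why the charts display averages rather than individual samples.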

Workload/Pod View

The Workload/Pod View interface provides the following components:

[Screenshot: Workload/Pod View interface (mlxp_console_monitoring02_ko)]

Component Description
① View unit The Workload Type, workload name, and Pod name currently in use in the project
② Time range The monitoring time range and refresh control
③ Workload/Pod View CPU, Memory, GPU, and GPU Memory usage by the selected resource unit and time range
④ Details CPU, Memory, and GPU usage at the Pod level

The Workload/Pod View provides multiple graph-based dashboards. You can view CPU, Memory, GPU, and GPU Memory usage in your project at the Workload Type, workload, and Pod levels. Metrics are collected every 15 seconds and displayed as average values. To use the dashboard:

  1. Select the resource unit you want to view.
    • Workload Type: Workload types currently in use (e.g., Notebook, PyTorchJob). Only one type can be selected at a time.
    • Workload: The list of workloads in the project. Multiple workloads can be selected.
      • All: View all workloads used in the project.
    • Pod: The list of Pods within a workload. Multiple Pods can be selected.
      • OFF: Pod-level view is unavailable if All or multiple workloads are selected.
      • ALL: When one workload is selected, view all Pods within that workload.
  2. Select the time range from the time-range selector or enter a custom period.
Note

The time range cannot exceed 30 days.

  3. View the results in the dashboard.
    • To check the exact metric value at a specific point in time: Hover your mouse over the desired point on the graph.
  4. Click the [Refresh] button to retrieve the latest data.
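The selection rules in the steps above can be summarized in a short sketch. It is not the console's implementation: `pod_selector_mode` and `validate_time_range` are hypothetical helpers that restate the documented behavior (the Pod selector is OFF when All or multiple workloads are selected, and the time range cannot exceed 30 days).

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=30)  # documented upper bound on the time range


def pod_selector_mode(selected_workloads: list[str]) -> str:
    """Return the Pod selector state implied by the workload selection.

    "ALL" means all Pods of the single selected workload are shown;
    "OFF" means the Pod-level view is unavailable.
    """
    if len(selected_workloads) == 1 and selected_workloads[0] != "All":
        return "ALL"
    return "OFF"


def validate_time_range(start: datetime, end: datetime) -> timedelta:
    """Return the span of a custom period, rejecting ranges over 30 days."""
    if end <= start:
        raise ValueError("end must be later than start")
    span = end - start
    if span > MAX_RANGE:
        raise ValueError("time range cannot exceed 30 days")
    return span


# Hypothetical workload names, used only for illustration.
print(pod_selector_mode(["All"]))                # OFF
print(pod_selector_mode(["train-a", "train-b"]))  # OFF
print(pod_selector_mode(["train-a"]))            # ALL
```

Once a single workload is selected, individual Pods can be chosen from its Pod list instead of the default ALL.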

The following describes the graph charts that make up the Workload/Pod View interface.

Graph chart Unit Description
Average CPU Utilization % The average CPU utilization (Used/Request) for the selected conditions and time range.
  • Used: Total CPU currently in use
  • Request: Total CPU request based on the resource manifest
CPU Usage Cores CPU usage for the selected resource unit (workload or Pod)
Average Memory Utilization % The average Memory utilization (Used/Request) for the selected conditions and time range.
  • Used: Total Memory currently in use
  • Request: Total Memory request based on the resource manifest
Memory Usage MiB Memory usage for the selected resource unit (workload or Pod)
Average GPU Utilization % The average GPU utilization for the selected conditions and time range
GPU Usage Cores GPU usage for the selected resource unit (workload or Pod)
Average GPU Memory Utilization % The average GPU Memory utilization for the selected conditions and time range
GPU Memory Usage MiB GPU Memory usage for the selected resource unit (workload or Pod)
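To make the Used/Request ratio concrete, here is a minimal sketch of how an average utilization percentage could be derived from collected samples. It assumes the average is the mean of per-sample Used/Request ratios. The guide does not state whether the console averages ratios or divides averaged totals, so treat this as one possible reading. The function name and sample values are hypothetical.

```python
from statistics import fmean


def average_utilization(used: list[float], requested: list[float]) -> float:
    """Mean of per-sample Used/Request ratios, expressed as a percentage.

    Samples with a zero request are skipped because the ratio is undefined.
    """
    if len(used) != len(requested):
        raise ValueError("used and requested must have the same length")
    ratios = [u / r for u, r in zip(used, requested) if r > 0]
    if not ratios:
        raise ValueError("no samples with a non-zero request")
    return fmean(ratios) * 100


# Three illustrative samples: 1, 2, and 3 cores used against a 4-core request.
cpu_used = [1.0, 2.0, 3.0]
cpu_request = [4.0, 4.0, 4.0]
print(average_utilization(cpu_used, cpu_request))  # 50.0
```

A value above 100% under this definition would mean the workload used more than it requested, which is possible when limits are set higher than requests.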

In the Workload/Pod View, you can also view detailed CPU, Memory, and GPU usage at the Pod level. You can easily check resource usage for Pods currently running in your project.
The following describes the Pod-level details displayed in the Workload/Pod View interface.

Details Unit Description
Pod Name - The name of the Pod within the selected workload
CPU Cores Status of CPU usage
  • Request: Total CPU request based on the resource manifest
  • Limits: Total CPU limit based on the resource manifest
Memory Gi Status of Memory usage
  • Request: Total Memory request based on the resource manifest
  • Limits: Total Memory limit based on the resource manifest
GPU Count Status of GPU usage
  • Used: Total GPU currently in use
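The Request and Limits values above are totals taken from the resource manifest. The sketch below shows how such totals could be computed for one Pod, assuming the manifest follows the Kubernetes container `resources.requests` / `resources.limits` convention with quantities such as `500m` or `512Mi`. The helper names are hypothetical, and only the common suffixes are handled.

```python
# Binary memory suffixes converted to Gi, the unit used in the Details table.
MEMORY_FACTORS_GI = {"Ki": 1 / 1024**2, "Mi": 1 / 1024, "Gi": 1.0, "Ti": 1024.0}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity ("500m" or "2") to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def memory_gi(quantity: str) -> float:
    """Convert a memory quantity ("512Mi", "2Gi", or plain bytes) to Gi."""
    for suffix, factor in MEMORY_FACTORS_GI.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / 1024**3  # no suffix: treat the value as bytes


def pod_totals(containers: list[dict]) -> dict[str, float]:
    """Sum CPU (cores) and Memory (Gi) requests and limits over a Pod's containers."""
    totals = {"cpu_request": 0.0, "cpu_limit": 0.0,
              "memory_request": 0.0, "memory_limit": 0.0}
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        totals["cpu_request"] += cpu_cores(requests.get("cpu", "0"))
        totals["cpu_limit"] += cpu_cores(limits.get("cpu", "0"))
        totals["memory_request"] += memory_gi(requests.get("memory", "0"))
        totals["memory_limit"] += memory_gi(limits.get("memory", "0"))
    return totals


# Illustrative two-container Pod; the values are made up for the example.
example_containers = [
    {"resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                   "limits": {"cpu": "1", "memory": "1Gi"}}},
    {"resources": {"requests": {"cpu": "2", "memory": "2Gi"},
                   "limits": {"cpu": "2", "memory": "2Gi"}}},
]
example_totals = pod_totals(example_containers)
print(example_totals)
```

Summing the same totals over every Pod of a workload type would give figures comparable to the workload-type details in the Overview interface, under the same assumption about the manifest format.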