Troubleshooting Cloud Insight

Available in Classic and VPC

This guide describes problems that users may encounter while using Cloud Insight, along with their causes and resolutions.

If your problem is not covered in this guide, or you cannot resolve it even after reading the guide thoroughly, check whether the following questions answer it, or submit an inquiry for a resolution.

Q. My server is hanging. Metrics are not being collected, and I am not receiving notifications either.

A. When a server hangs, the Agent cannot run because it is not allocated CPU time. The problem may persist until the process causing the hang recovers on its own, or until you forcibly end that process. If the server does not accept any input, you may need to forcibly restart it. To check whether a server is not working normally because of a hang or an issue in the Agent or network, use the agent_status Metric from Server (Classic) or Server (VPC).

Q. An Event occurs while monitoring with is_process_up of Server (VPC), even though there is no issue.

A. The is_process_up data of the Process Plugin is collected once a PID for the process name registered by the user is newly created. If the registered process name includes an asterisk (*), the PID list of all matching processes becomes the monitoring target.

The conditions under which is_process_up fluctuates are listed below.

is_process_up = 1: when the PID list is maintained, or new PIDs are added
is_process_up = 0: when some or all of the PID list disappears

Therefore, is_process_up can be 0 even though the Main process is normal in the following cases:

When a Sub process of the Main process is temporarily created and then deleted
When a Sub process of the Main process is temporarily deleted and then created again
When the number of Sub processes of the Main process decreases

<example> PID changes over time and the resulting is_process_up / process_count metric values when *httpd* is registered as a process name

Time	PID (Main)	PIDs (sub)	is_process_up	process_count	Detail
12:00	123	-	1	1	No sub process
12:01	123	124, 125	1	3	Create sub process
12:02	123	124	0	2	Delete one sub process (125)
12:03	123	124, 126	1	3	Create sub process
12:04	123	124, 127	0	3	Replace a sub process (126 deleted, 127 created)
12:05	123	-	0	1	Delete all sub processes
12:06	-	-	0	0	Delete main process

A process name such as *httpd* is usually monitored to determine whether the Apache service is operating normally.
In this case, is_process_up may not monitor the service properly, so a condition such as process_count == 0 is more useful: the process_count of *httpd* becomes 0 when the Apache service is terminated.
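
The metric values in the example table can be reproduced with a short Python sketch. It assumes that is_process_up compares the current PID list with the list from the previous collection, which is an interpretation of the conditions listed above and not the Agent's actual implementation. The function name process_metrics is hypothetical.

```python
def process_metrics(samples):
    """Return one (is_process_up, process_count) pair per collected PID set.

    Assumption: is_process_up is 1 only when at least one PID exists and
    no PID from the previous collection has disappeared.
    """
    results = []
    previous = set()
    for pids in samples:
        current = set(pids)
        # previous <= current means every earlier PID is still present.
        is_process_up = 1 if current and previous <= current else 0
        results.append((is_process_up, len(current)))
        previous = current
    return results


# PID sets for *httpd* from 12:00 to 12:06, taken from the example table.
samples = [
    {123},
    {123, 124, 125},
    {123, 124},
    {123, 124, 126},
    {123, 124, 127},
    {123},
    set(),
]
metrics = process_metrics(samples)
for is_process_up, process_count in metrics:
    print(is_process_up, process_count)
```

In this sketch, process_count reaches 0 only in the last sample, when the main process is gone, while is_process_up is already 0 at 12:02 even though the main process is still running.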

Q. Dimensions of a deleted File/Process/Port Plugin continue to be displayed when creating an Event Rule.

A. Dimensions of a deleted File/Process/Port Plugin can remain visible on the Event Rule creation screen for up to 2 days after deletion. When a File/Process/Port Plugin is deleted, the Metric information it collected is deleted immediately, but its Dimension information is deleted only after the Metric with those Dimensions has gone uncollected for 2 days.

Q. An Event Rule has been created, but the Total Rule Count does not match the number of monitoring targets and monitoring items.

A. The Total Rule Count of an Event Rule is calculated from the number of Rules actually created. A Rule is actually created only if the configured monitoring target is collecting the Metric of the monitoring item.

<example> If you have 3 monitoring targets, but the monitoring item Metrics are being collected for only 2 of them, then the Total Rule Count is indicated as 2, not 3.

Monitoring item Metrics are not collected for some of the monitoring targets in the following cases.

When Dimensions set in the monitoring item Metric are not being collected for some of the monitoring target servers
When the monitoring item Metric Type is Extended and detailed monitoring settings are required, but some monitoring target servers don't have the detailed monitoring settings configured
When some of the monitoring target servers have been stopped and Metric collection has also been stopped
When the Metrics are not being collected normally for some of the monitoring target servers because of an internal firewall, firewall solution, and so on
When the Metrics are not being collected normally for some of the monitoring target servers due to an Agent operation issue
When the Metric for a monitoring target was not being collected at the time the Event Rule was set up, so the target was excluded from the Total Rule Count (once the Metric starts being collected for that target, it is automatically counted in the Total Rule Count)
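
The counting rule can be illustrated with a minimal Python sketch. It assumes that one Rule corresponds to one pair of monitoring target and monitoring item, which is an interpretation of the description above. The server names and the function total_rule_count are hypothetical.

```python
def total_rule_count(targets, items, collected):
    """Count the (target, item) pairs whose Metric is being collected.

    Assumption: a Rule exists only for pairs found in `collected`.
    """
    return sum(1 for target in targets for item in items
               if (target, item) in collected)


targets = ["server-a", "server-b", "server-c"]  # hypothetical server names
items = ["avg_cpu_used_rto"]

# server-c is not collecting the Metric (for example, it is stopped).
collected = {
    ("server-a", "avg_cpu_used_rto"),
    ("server-b", "avg_cpu_used_rto"),
}

count = total_rule_count(targets, items, collected)
print(count)  # 2

# Once server-c starts collecting the Metric, it is counted as well.
collected.add(("server-c", "avg_cpu_used_rto"))
count_after = total_rule_count(targets, items, collected)
print(count_after)  # 3
```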

Q. The proc_mem_usert of the Server is higher than the mem_usert of the Memory.

A. The following describes each Metric.

SERVER/proc_mem_usert: memory usage rate of all processes in the server
MEMORY/mem_usert: memory usage rate in the entire server

Generally, the server memory is used by multiple elements other than processes, so MEMORY/mem_usert tends to be greater than SERVER/proc_mem_usert.

SERVER/proc_mem_usert can be larger than MEMORY/mem_usert in the following two cases.

SERVER/proc_mem_usert is the sum of the RSS (Local Memory occupied by the process + Shared Memory referred to by the process) of all processes. If multiple processes refer to the same Shared Memory page, that page is counted once in the RSS of each process, so the sum of RSS can exceed the actual memory usage.
The RSS value is only updated while a process is using the CPU. When the CPU load is very high, some processes may not be allocated CPU time, and their RSS values may not be updated properly. As a result, the sum of all RSS values may be greater than the actual memory usage.

To check the memory usage rate accurately, use MEMORY/mem_usert rather than SERVER/proc_mem_usert.
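
The Shared Memory case can be worked through with a small Python sketch. All numbers are made up for illustration, the function memory_rates is hypothetical, and the calculation is simplified: it only considers memory used by processes and ignores the other elements that contribute to MEMORY/mem_usert.

```python
def memory_rates(total_mb, local_mb_per_process, shared_mb):
    """Return (sum-of-RSS rate, actual process memory rate) in percent."""
    # RSS of each process = its Local Memory + the Shared Memory it refers to.
    rss_sum = sum(local + shared_mb for local in local_mb_per_process)
    # Physically, the shared pages exist only once.
    actual = sum(local_mb_per_process) + shared_mb
    return 100 * rss_sum / total_mb, 100 * actual / total_mb


# Four processes, each with 512 MB of Local Memory,
# all referring to the same 1024 MB of Shared Memory on an 8192 MB server.
rss_rate, actual_rate = memory_rates(
    total_mb=8192,
    local_mb_per_process=[512, 512, 512, 512],
    shared_mb=1024,
)
print(rss_rate)     # 75.0
print(actual_rate)  # 37.5
```

The shared 1024 MB is added four times in the sum of RSS, so the rate derived from RSS is twice the rate of the memory the processes actually occupy.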

Q. The used_rto of the CPU is higher or lower than the avg_cpu_used_rto of the Server.

A. The following describes each Metric.

CPU/used_rto: each vCPU's usage rate
- <example> if there are 4 vCPUs, the usage rate of one of them (cpu_idx: 0 to 3)
SERVER/avg_cpu_used_rto: the average CPU usage rate in the entire server

Due to the characteristics of the Linux architecture, a specific process has a tendency to use a specific CPU more, rather than using all CPUs equally. In such a case, CPU/used_rto may appear higher or lower than SERVER/avg_cpu_used_rto.

To check the average CPU usage rate of a server accurately, use SERVER/avg_cpu_used_rto rather than CPU/used_rto.
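
The relationship between the two Metrics can be illustrated in Python. The per-vCPU values are made up, and treating the server-wide value as the arithmetic mean of the per-vCPU values is an assumption for illustration.

```python
# Hypothetical CPU/used_rto values, keyed by cpu_idx, for a 4-vCPU server.
used_rto_by_cpu_idx = {0: 90.0, 1: 10.0, 2: 5.0, 3: 15.0}

# Server-wide average, assumed here to be the plain mean of the vCPU values.
avg_cpu_used_rto = sum(used_rto_by_cpu_idx.values()) / len(used_rto_by_cpu_idx)
print(avg_cpu_used_rto)  # 30.0

# Compare each vCPU with the server-wide average.
for cpu_idx, used_rto in used_rto_by_cpu_idx.items():
    relation = "higher" if used_rto > avg_cpu_used_rto else "lower"
    print(cpu_idx, used_rto, relation)
```

Here cpu_idx 0 is far above the average while the other three vCPUs are below it, so an Event Rule on CPU/used_rto and one on SERVER/avg_cpu_used_rto would behave differently for the same server.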

Q. If the Condition is adjusted after an Event has occurred, no new Event occurs.

A. Even if you adjust the Condition, the Event that has already occurred must end before an Event based on the new Condition can occur.

If you adjust a Condition before the Event has ended and want a new Event to be triggered based on the adjusted Condition, use the Rule deactivation feature. Deactivate the Rule to force-end the Event, adjust the Condition, and then reactivate the Rule. A new Event occurs when the new Condition is met.

For more information about deactivating Rules, see View Rule details or Edit Event Rule.

Q. The Agent is running properly, but no data is collected in Cloud Insight.

A. Even if the Agent is running normally, Outbound communication from the Agent to Cloud Insight may be blocked by the server's internal firewall settings, an installed security solution, and so on. See the following Port list and check whether the firewall allows this traffic.

Classic environment

Source	Destination	Port	Description
Customer VM bandwidth	real-collector.nsight.ncloud.com (10.250.5.199)	TCP 9973	Cloud Insight metrics collection server
Customer VM bandwidth	real-ntp.nsight.ncloud.com (10.250.5.117)	UDP 123	Cloud Insight NTP server
Customer VM bandwidth	real-wai.nsight.ncloud.com (10.250.5.118)	TCP 10280	Server to view information related to Cloud Insight
Customer VM bandwidth	repo-nsight.ncloud.com (10.213.208.165)	TCP 80,443	Cloud Insight Repository server
10.250.26.62	Customer VM bandwidth	ICMP	Cloud Insight Ping Check monitoring server
10.250.26.63	Customer VM bandwidth	ICMP	Cloud Insight Ping Check monitoring server

VPC environment

Source	Destination	Port	Description
Customer VM bandwidth	collector.nsight.ncloud.com (169.254.80.17)	TCP 9973	Cloud Insight metrics collection server
Customer VM bandwidth	ntp.nsight.ncloud.com (169.254.80.19)	UDP 123	Cloud Insight NTP server
Customer VM bandwidth	wai.nsight.ncloud.com (169.254.80.18)	TCP 10280	Server to view information related to Cloud Insight
Customer VM bandwidth	nsight.ncloud.com (169.254.80.16)	TCP 80,443	Cloud Insight Repository server
169.254.80.22	Customer VM bandwidth	ICMP	Cloud Insight Ping Check monitoring server
169.254.80.23	Customer VM bandwidth	ICMP	Cloud Insight Ping Check monitoring server
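
As a quick check from the server, the TCP entries in the table above can be tested with a short Python sketch. It uses the VPC hosts and ports from the table; the function names are hypothetical. It only covers TCP, so the UDP 123 (NTP) and ICMP entries need to be checked separately, and a successful connection only shows that the port is reachable, not that the Agent itself is healthy.

```python
import socket

# TCP endpoints from the VPC environment table above.
VPC_TCP_ENDPOINTS = [
    ("collector.nsight.ncloud.com", 9973),
    ("wai.nsight.ncloud.com", 10280),
    ("nsight.ncloud.com", 80),
    ("nsight.ncloud.com", 443),
]


def check_tcp(host, port, timeout=3.0):
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def blocked_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(host, port) for host, port in endpoints
            if not check_tcp(host, port, timeout)]
```

Calling blocked_endpoints(VPC_TCP_ENDPOINTS) on the affected server returns the endpoints that could not be reached. For the Classic environment, replace the list with the hosts and ports from the Classic table.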

Q. The service data of NAVER Cloud Platform cannot be displayed on the dashboard.

A. You can view the Basic Metrics that each NAVER Cloud Platform service provides by default in Cloud Insight without any additional settings. Extended Metrics, which are provided additionally, must first be set up from the console of each service. (For Server (VPC), you can set them up through Detailed monitoring settings.) Check the service list in the API guide for the Basic and Extended Metrics provided by each service. Also, Extended Metrics are collected only from the point of setup, so make sure the search period in Cloud Insight covers a time after the setup.

Q. When an ASG policy is registered as an action, servers are continuously created/deleted.

A. If you set an ASG policy as an action when creating an Event Rule, the ASG policy is executed when an Event occurs. The policy then keeps running, observing the cooldown time specified in the ASG policy, and stops only when the Event ends. As a result, servers can be continuously created or deleted until the Event ends.

Q. A Custom Schema has been created, but it cannot be viewed on the console.

A. A Custom Schema appears on the console only after data has been collected with it. See SendData API for how to send data. The data can be viewed once collection has begun and the time required by the aggregation interval set when the Custom Schema was created has passed.

Q. Data was sent through the API, but it cannot be viewed on the console.

A. The default period for viewing data in Cloud Insight is 1 hour from the view time. Set the search period so that it includes the time at which you sent the data through the API, and then search again.

Q. The data sent with the API seems to be different from the results shown on the dashboard.

A. Cloud Insight calculates the collected data using various aggregation functions at regular intervals. When the aggregation results are displayed, they may be different from the data you sent through the API. Check the aggregation interval when you view the data on Cloud Insight. You can set the aggregation interval and aggregation functions when creating Custom Schema.
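
The effect of aggregation can be sketched in Python. The bucketing logic, the function name aggregate, and the set of aggregation functions shown are assumptions for illustration and do not describe how Cloud Insight is implemented internally.

```python
from collections import defaultdict


def aggregate(points, interval_sec):
    """Group (timestamp, value) points into fixed intervals and aggregate them."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its aggregation interval.
        buckets[timestamp - timestamp % interval_sec].append(value)
    return {
        start: {
            "AVG": sum(values) / len(values),
            "MIN": min(values),
            "MAX": max(values),
            "SUM": sum(values),
            "COUNT": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Four values sent through the API, as (seconds, value) pairs.
points = [(0, 10), (20, 50), (40, 30), (60, 80)]
result = aggregate(points, interval_sec=60)
print(result[0])   # {'AVG': 30.0, 'MIN': 10, 'MAX': 50, 'SUM': 90, 'COUNT': 3}
print(result[60])  # {'AVG': 80.0, 'MIN': 80, 'MAX': 80, 'SUM': 80, 'COUNT': 1}
```

With a 60-second interval, the three values sent in the first minute appear on the dashboard as a single point, for example 30.0 with an average function, rather than as 10, 50, and 30.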

Q. I didn't do anything, but after a certain point the metrics are not collected.

A. The Cloud Insight Agent may stop working properly once the disk usage of the root (/) path exceeds 99%.
Check the disk capacity of this path and free up space. Then restart the Agent and check whether metrics are collected normally.
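
The disk usage of the root path can be checked with a minimal Python sketch using the standard library. The function names are hypothetical, and the percentage may differ slightly from the value reported by tools such as df, which can account for reserved blocks differently.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used share of the file system containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def agent_at_risk(percent, threshold=99.0):
    """Return True if usage is above the 99% level mentioned in this guide."""
    return percent > threshold


percent = disk_usage_percent("/")
print(f"root (/) usage: {percent:.1f}%")
if agent_at_risk(percent):
    print("Free up disk space on / and then restart the Agent.")
```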
