Troubleshoot event issues

Available in Classic and VPC

You might run into the following problems when using Cloud Insight. This page describes their causes and possible solutions.

Event occurrence despite no issues when monitoring with is_process_up of Server (VPC)

An event occurs while monitoring with is_process_up of Server (VPC), even though there is no issue.

Cause

The is_process_up data of Plugin Process is collected based on the PIDs created for the process name you registered. If you registered a process name that includes an asterisk (*), the PID list of all matching processes is the target.

The value of is_process_up changes under the following conditions:

is_process_up = 1: when the PID list is maintained, or new PIDs are added
is_process_up = 0: when some or all of the PID list disappears

Therefore, is_process_up can be 0 even though the main process is normal in the following cases:

If a sub process of the main process is temporarily created and then deleted
If a sub process of the main process is temporarily deleted and then created again
If the number of sub processes of the main process decreases

Example: PID changes over time and the resulting is_process_up and process_count metric values when *httpd* is registered as a process name

Time	PID (Main)	PIDs (sub)	is_process_up	process_count	Detail
12:00	123	-	1	1	No sub process
12:01	123	124, 125	1	3	Sub processes are created.
12:02	123	124	0	2	Some sub processes are deleted.
12:03	123	124, 126	1	3	A sub process is created.
12:04	123	124, 127	0	3	Some sub processes are replaced.
12:05	123	-	0	1	All sub processes are deleted.
12:06	-	-	0	0	The main process is deleted.
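
The table can be reproduced with a short Python sketch. It is only a model built from the conditions listed above: it assumes that is_process_up is 1 when every PID seen in the previous sample is still present, and 0 when any of them has disappeared or no matching process exists. The function name and the sample layout are illustrative and are not part of Cloud Insight.

```python
# Illustrative model only; not the actual implementation of the Cloud Insight agent.
# Each sample is (time, PIDs matching *httpd* at that time), taken from the table.
SAMPLES = [
    ("12:00", [123]),
    ("12:01", [123, 124, 125]),
    ("12:02", [123, 124]),
    ("12:03", [123, 124, 126]),
    ("12:04", [123, 124, 127]),
    ("12:05", [123]),
    ("12:06", []),
]


def simulate_process_metrics(samples):
    """Return (time, is_process_up, process_count) for each sample."""
    previous = None
    rows = []
    for time, pids in samples:
        current = set(pids)
        if not current:
            up = 0  # no matching process exists
        elif previous is None:
            up = 1  # first sample: assumed to count as "up"
        else:
            # 1 if the previous PID list is maintained or extended,
            # 0 if any previously seen PID has disappeared
            up = 1 if previous <= current else 0
        rows.append((time, up, len(current)))
        previous = current
    return rows


for row in simulate_process_metrics(SAMPLES):
    print(row)
```

Note that at 12:04 process_count stays at 3 while is_process_up drops to 0, because PID 126 was replaced by PID 127.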

Solution

To check whether the Apache service is running normally, process names such as *httpd* are often monitored. In such cases, is_process_up may not reflect the actual state of the service. If the Apache service is terminated, the process_count of *httpd* becomes 0. Therefore, we recommend monitoring with a condition such as process_count == 0.
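
To illustrate the difference between the two conditions, the following Python sketch applies each of them to the metric values from the *httpd* example. The helper name and the row layout are hypothetical; only the values come from the example table.

```python
# Metric values per minute, copied from the *httpd* example table.
ROWS = [
    {"time": "12:00", "is_process_up": 1, "process_count": 1},
    {"time": "12:01", "is_process_up": 1, "process_count": 3},
    {"time": "12:02", "is_process_up": 0, "process_count": 2},
    {"time": "12:03", "is_process_up": 1, "process_count": 3},
    {"time": "12:04", "is_process_up": 0, "process_count": 3},
    {"time": "12:05", "is_process_up": 0, "process_count": 1},
    {"time": "12:06", "is_process_up": 0, "process_count": 0},
]


def times_matching(rows, metric, value):
    """Return the times at which the given metric equals the given value."""
    return [row["time"] for row in rows if row[metric] == value]


# Condition is_process_up == 0 also matches while the main process is alive.
print(times_matching(ROWS, "is_process_up", 0))
# Condition process_count == 0 matches only when no *httpd* process is left.
print(times_matching(ROWS, "process_count", 0))
```

In this data, is_process_up == 0 matches four times, three of them while the main process (PID 123) is still running, whereas process_count == 0 matches only at 12:06.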

Dimensions of a deleted file, process, or port plugin displayed during Event Rule creation

Dimensions of a deleted file, process, or port plugin continue to be displayed when creating an event rule.

Cause

Dimensions of a deleted file, process, or port plugin may be displayed on the Event Rule creation screen for up to 2 days after deletion. When a file, process, or port plugin is deleted, its collected metric information is deleted immediately. Dimension information, however, is deleted only after the metric information has not been collected for 2 days.

Solution

The dimension is not deleted until its metric has not been collected for 2 days. Check again after that period has passed.
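
As a rough aid for deciding when to check again, the following Python sketch computes the earliest time at which a dimension can disappear. It assumes only the 2-day rule described above; the function names and the example timestamps are hypothetical, and the actual cleanup time may differ.

```python
from datetime import datetime, timedelta

# Retention period stated in this guide: 2 days without metric collection.
DIMENSION_RETENTION = timedelta(days=2)


def earliest_removal_time(last_collected):
    """Earliest time at which the dimension can be removed from the screen."""
    return last_collected + DIMENSION_RETENTION


def may_still_be_displayed(last_collected, now):
    """True while fewer than 2 days have passed since the last collection."""
    return now < earliest_removal_time(last_collected)


# Hypothetical example: the plugin was deleted (last collection) on 1 March, 09:00.
last_collected = datetime(2024, 3, 1, 9, 0)
print(earliest_removal_time(last_collected))
print(may_still_be_displayed(last_collected, datetime(2024, 3, 2, 9, 0)))
print(may_still_be_displayed(last_collected, datetime(2024, 3, 3, 9, 0)))
```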

Mismatch between the total rule count value in Event Rule and the values of monitoring targets and monitoring items

I have created an Event Rule, but the total rule count value does not match the values of monitoring targets and monitoring items.

Cause

The total rule count of an Event Rule includes only the rules that are actually created. Whether a rule is actually created depends on whether the configured monitoring target is collecting the metric for the monitoring item.
For example, if you have 3 monitoring targets but the monitoring item metric is being collected for only 2 of them, the total rule count is displayed as 2, not 3.
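
The counting logic can be sketched in Python as follows. The function names and server names are hypothetical; the sketch only mirrors the rule that a target is counted when its monitoring item metric is being collected.

```python
def total_rule_count(targets, collecting):
    """Count the targets for which the monitoring item metric is collected."""
    return sum(1 for target in targets if target in collecting)


def targets_without_rule(targets, collecting):
    """List the targets that do not contribute to the total rule count."""
    return [target for target in targets if target not in collecting]


# Hypothetical example: 3 monitoring targets, metric collected for 2 of them.
targets = ["server-a", "server-b", "server-c"]
collecting = {"server-a", "server-c"}

print(total_rule_count(targets, collecting))
print(targets_without_rule(targets, collecting))
```

Listing the uncounted targets, as in the second function, shows which servers to investigate.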

Solution

Check whether any of the following cases applies, in which monitoring item metrics are not collected for some of the monitoring targets:

When the dimensions set for the monitoring item metric are not being collected for some of the monitoring target servers
When detailed monitoring is not configured for some of the monitoring target servers, although it is required because the monitoring item metric's type is Extended
When metric collection has stopped for some monitoring target servers because those servers are stopped
When metric collection is not working properly for some servers due to an internal firewall or firewall solution
When metric collection for some monitoring target servers is not working normally due to Agent problems
When the metric is later collected for monitoring targets that were originally excluded from the total rule count because the metric was not being collected when the event rule was created (in this case, the target is automatically added to the total rule count)

Event occurrence despite failure to meet the event trigger conditions

I changed the conditions after an event occurred, but an event occurred again even though the changed conditions were not met.

Cause

If you change the conditions of an event rule while an event is ongoing, the existing event ends, and the end event notification is sent with the conditions configured at that time.

The following example ignores elements such as duration:

Time	process_count	Condition	Description
00:00	0	process_count = 1	Non-occurrence of the event
00:01	1	process_count = 1	The process_count = 1 event notification occurs.
00:02	1	process_count = 0	The process_count = 0 end (resolve) event notification occurs.
00:03	0	process_count = 0	The process_count = 0 event notification occurs.
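
The sequence in the table can be modeled with a small Python state machine. This is a simplified sketch, not the Cloud Insight implementation: it ignores duration, assumes that a condition change immediately ends an ongoing event, and labels each notification with the condition configured at that time. All names are illustrative.

```python
# (time, process_count, configured condition value), taken from the table.
TIMELINE = [
    ("00:00", 0, 1),
    ("00:01", 1, 1),
    ("00:02", 1, 0),  # the condition is changed to process_count = 0
    ("00:03", 0, 0),
]


def simulate_notifications(timeline):
    """Return (time, notification) pairs; notification is None if nothing is sent."""
    active = False
    previous_condition = None
    notifications = []
    for time, value, condition in timeline:
        changed = previous_condition is not None and condition != previous_condition
        met = value == condition
        if active and (changed or not met):
            # The ongoing event ends; the notification shows the current condition.
            active = False
            notifications.append((time, f"end: process_count = {condition}"))
        elif not active and met:
            active = True
            notifications.append((time, f"start: process_count = {condition}"))
        else:
            notifications.append((time, None))
        previous_condition = condition
    return notifications


for entry in simulate_notifications(TIMELINE):
    print(entry)
```

At 00:02 the metric value is still 1, yet the notification mentions process_count = 0 because that is the condition configured at the time the event ends.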

Solution

On the NAVER Cloud Platform console, go to Services > Management & Governance > Cloud Insight > Event and check the conditions that were actually configured when the end event caused by the change occurred.

Event occurrence despite the CPU usage below the event rule condition

An event occurred although the CPU usage was lower than the event rule condition.

Cause

The CPU/used_rto metric has cpu_idx:0~N dimensions, depending on the number of CPUs. If you create an event rule without selecting a dimension, the metrics for all dimensions are targeted, and an event occurs if the metric for any one dimension meets the condition.

Example: if the server has 2 CPUs and the event rule and metric values are as follows, the CPU/used_rto value is 45, but an event occurs because the corresponding value for the dimension cpu_idx: 0 is 60, which meets the condition.

Monitoring items and conditions
Metric: CPU/used_rto
Dimension: not selected
Condition: >=50
Aggregation method: AVG
Duration: 1 minute

Min1 data at a certain time

Time	CPU/used_rto (cpu_idx: 0)	CPU/used_rto (cpu_idx: 1)	CPU/used_rto
00:01	60	30	45
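
The evaluation in this example can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes only what is described above: with no dimension selected, the condition is checked against each cpu_idx dimension separately.

```python
# Min1 values at 00:01 from the example, keyed by dimension.
per_cpu = {"cpu_idx:0": 60, "cpu_idx:1": 30}
THRESHOLD = 50  # event rule condition: >= 50


def triggering_dimensions(values, threshold):
    """Return the dimensions whose value meets the >= threshold condition."""
    return [dim for dim, value in values.items() if value >= threshold]


# No dimension selected: every dimension is evaluated on its own.
print(triggering_dimensions(per_cpu, THRESHOLD))

# The server-wide average stays below the threshold.
average = sum(per_cpu.values()) / len(per_cpu)
print(average, average >= THRESHOLD)
```

The event is triggered by cpu_idx:0 alone, even though the average of the two CPUs does not meet the condition.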

Solution

If you need to configure the event according to the server's average CPU usage, use the SERVER/avg_cpu_used_rto metric.

Mismatch between event occurrence content and the data displayed on the Event menu

Event occurrence content differs from the data displayed on the Event menu.

Cause

The data aggregation interval of the graph shown under Services > Management & Governance > Cloud Insight > Event on the NAVER Cloud Platform console varies with the start time and end time of the event (example: Min5). To view the data that actually triggered the event rule, you must view the data with an aggregation interval of Min1.
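
The following Python sketch shows why a coarser aggregation interval can hide the value that triggered an event. The data points and the threshold are made up, and averaging is used as the aggregation method purely for illustration.

```python
# Hypothetical Min1 values for five consecutive minutes.
MIN1_VALUES = [10, 10, 90, 10, 10]
THRESHOLD = 50  # hypothetical event rule condition: >= 50


def aggregate_avg(values, window):
    """Average consecutive groups of `window` values (for example, Min1 to Min5)."""
    return [
        sum(values[i:i + window]) / len(values[i:i + window])
        for i in range(0, len(values), window)
    ]


min5_values = aggregate_avg(MIN1_VALUES, 5)

# The Min1 data contains a value that meets the condition...
print(any(value >= THRESHOLD for value in MIN1_VALUES))
# ...but the Min5 average of the same period does not.
print(min5_values, any(value >= THRESHOLD for value in min5_values))
```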

Solution

View the Min1 data with the view period set to less than 1 hour, either by configuring a separate Dashboard or by viewing the event rule details in the Event Rule menu.

No server data with higher CPU usage on the CPU usage widget

When viewing the TOP 10 widget data in the Service Dashboard, server data with higher CPU usage is not displayed on the CPU usage widget.

Cause

The Service Dashboard TOP 10 list is selected according to the following criteria.

Within the view period (startTime to endTime), the most recently collected metric value of each resource as of endTime is sorted, and the top 10 resources are selected.

If there are more than 10 resources, a resource whose metric value is not in the top 10 according to the above criteria may not be displayed on the Service Dashboard. In addition, a resource that had high metric values during the view period may still not be displayed, because only the most recently collected values as of endTime are compared.
To check the exact metrics of a specific resource, select that resource in the widget data, or configure a separate dashboard and widget for it.
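
The selection criteria can be sketched in Python as follows. This is an illustrative model, not the Service Dashboard implementation: the function name, the data layout, and the sample values are hypothetical, and a top 2 is used instead of a top 10 to keep the example short.

```python
def top_n_by_latest(series, start_time, end_time, n=10):
    """Rank resources by their most recent value within the view period."""
    latest = {}
    for resource, points in series.items():
        in_period = [(t, v) for t, v in points if start_time <= t <= end_time]
        if in_period:
            # Only the most recently collected value is compared.
            latest[resource] = max(in_period, key=lambda point: point[0])[1]
    return sorted(latest, key=latest.get, reverse=True)[:n]


# Hypothetical (minute, CPU usage) samples per server.
series = {
    "server-a": [(1, 95), (2, 95), (3, 5)],   # high earlier, low at endTime
    "server-b": [(1, 20), (2, 25), (3, 30)],
    "server-c": [(1, 10), (2, 20), (3, 40)],
}

print(top_n_by_latest(series, start_time=1, end_time=3, n=2))
```

Here server-a is left out even though it had the highest usage during the view period, because its most recent value is the lowest.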

Solution

This is not a problem: the servers with higher CPU usage were excluded by the selection criteria of the Service Dashboard TOP 10. For more information, see the Cause section.

Note

If you're still having trouble finding what you need, click on the feedback icon and send us your thoughts and requests. We'll use your feedback to improve this guide.