Events Supported by Event Monitoring

Note

The name of a resource that supports event reporting can contain up to 128 characters. Only letters, digits, underscores (_), hyphens (-), and periods (.) are allowed. If the name contains other characters, the event may fail to be reported to Cloud Eye.
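The naming rule in the note above can be checked before a resource is created. The following is a minimal sketch, not part of Cloud Eye: the function name `is_reportable_resource_name` is hypothetical, and it assumes that "letters" means ASCII letters and that an empty name is invalid.

```python
import re

# Pattern derived from the note: 1 to 128 characters drawn from letters,
# digits, underscores, hyphens, and periods.
# Assumption: "letters" is read as ASCII letters only; empty names are rejected.
_RESOURCE_NAME_RE = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_reportable_resource_name(name: str) -> bool:
    """Return True if the name satisfies the documented character and length limits."""
    return _RESOURCE_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("ecs-web_01.prod", "ecs web 01", "a" * 129):
        print(len(candidate), is_reportable_resource_name(candidate))
```

A name that fails this check may still be accepted by the service that owns the resource. The check only indicates that event reporting for the resource may fail.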

Table 1 Elastic Cloud Server (ECS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

ECS

SYS.ECS

Restart triggered due to hardware fault

startAutoRecovery

Major

ECSs on a faulty host are automatically migrated to a properly running host. The ECSs are restarted during the migration.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Restart completed due to hardware failure

endAutoRecovery

Major

The ECS was recovered after the automatic migration.

This event indicates that the ECS has recovered and is working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to other ECSs.

Services are interrupted.

ECS deleted

deleteServer

Major

The ECS was deleted

  • on the management console.
  • by calling APIs.

Check whether the deletion was performed intentionally by a user.

Services are interrupted.

ECS restarted

rebootServer

Minor

The ECS was restarted

  • on the management console.
  • by calling APIs.

Check whether the restart was performed intentionally by a user.

  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

ECS stopped

stopServer

Minor

The ECS was stopped

  • on the management console.
  • by calling APIs.
NOTE:

This event is reported only after CTS is enabled. For details, see the Cloud Trace Service User Guide.

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

NIC deleted

deleteNic

Major

The ECS NIC was deleted

  • on the management console.
  • by calling APIs.

  • Check whether the deletion was performed intentionally by a user.
  • Deploy service applications in HA mode.
  • After the NIC is deleted, check whether services recover.

Services may be interrupted.

ECS resized

resizeServer

Minor

The ECS specifications were resized

  • on the management console.
  • by calling APIs.

  • Check whether the operation was performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS is resized, check whether services have recovered.

Services are interrupted.

GuestOS restarted

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

ECS failure caused by system faults

VMFaultsByHostProcessExceptions

Critical

The host where the ECS resides is faulty. The system will automatically try to start the ECS.

After the ECS is started, check whether this ECS and services on it can run properly.

The ECS is faulty.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

Host breakdown risk

hostMayCrash

Major

The host where the ECS resides may break down, and live migration cannot be used to avoid this risk.

Migrate services running on the ECS first and delete or stop the ECS. Start the ECS only after the O&M personnel eliminate the risk.

The host may break down, causing service interruption.

Scheduled migration completed

instance_migrate_completed

Major

Scheduled ECS migration is completed.

Wait until the ECSs become available and check whether services are affected.

Services may be interrupted.

Scheduled migration being executed

instance_migrate_executing

Major

ECSs are being migrated as scheduled.

Wait until the event is complete and check whether services are affected.

Services may be interrupted.

Scheduled migration canceled

instance_migrate_canceled

Major

Scheduled ECS migration is canceled.

None

None

Scheduled migration failed

instance_migrate_failed

Major

ECSs failed to be migrated as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled migration to be executed

instance_migrate_scheduled

Major

ECSs will be migrated as scheduled.

Check the impact on services during the execution window.

None

Scheduled specification modification failed

instance_resize_failed

Major

Specifications failed to be modified as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled specification modification completed

instance_resize_completed

Major

Scheduled specifications modification is completed.

None

None

Scheduled specification modification being executed

instance_resize_executing

Major

Specifications are being modified as scheduled.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled specification modification canceled

instance_resize_canceled

Major

Scheduled specifications modification is canceled.

None

None

Scheduled specification modification to be executed

instance_resize_scheduled

Major

Specifications will be modified as scheduled.

Check the impact on services during the execution window.

None

Scheduled redeployment to be executed

instance_redeploy_scheduled

Major

ECSs will be redeployed on new hosts as scheduled.

Check the impact on services during the execution window.

None

Scheduled restart to be executed

instance_reboot_scheduled

Major

ECSs will be restarted as scheduled.

Check the impact on services during the execution window.

None

Scheduled stop to be executed

instance_stop_scheduled

Major

ECSs will be stopped as scheduled because they are affected by underlying hardware or system O&M.

Check the impact on services during the execution window.

None

Live migration started

liveMigrationStarted

Major

The host where the ECS resides may be faulty, so the ECS is live migrated in advance to prevent service interruptions caused by a host breakdown.

Wait for the event to end and check whether services are affected.

Services may be interrupted for less than 1s.

Live migration completed

liveMigrationCompleted

Major

The live migration is complete, and the ECS is running properly.

Check whether services are running properly.

None

Live migration failure

liveMigrationFailed

Major

An error occurred during the live migration of an ECS.

Check whether services are running properly.

Services may be interrupted, but the probability is low.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host on which the ECS is located was

  • faulty.
  • recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services are restored.

Services are interrupted.

Scheduled redeployment to be authorized

instance_redeploy_inquiring

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Authorize scheduled redeployment.

None

Local disk replacement canceled

localdisk_recovery_canceled

Major

Local disk failure

None

None

Local disk replacement to be executed

localdisk_recovery_scheduled

Major

Local disk failure

Check the impact on services during the execution window.

None

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Scheduled redeployment canceled

instance_redeploy_canceled

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

None

None

Scheduled redeployment being executed

instance_redeploy_executing

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled redeployment completed

instance_redeploy_completed

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Wait until the redeployed ECSs are available and check whether services are affected.

None

Scheduled redeployment failed

instance_redeploy_failed

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Contact O&M personnel.

Services are interrupted.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

Local disks are faulty.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disk failure

Wait until the local disks are replaced and check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Local disk failure

Wait until the services are running properly and check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty.

Contact O&M personnel.

Local disks are unavailable.

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty or the NPU is disconnected.

Transfer this issue to the Ascend or hardware team for handling.

The NPU cannot be used properly.

NPU: PCIe link error

PCIeErrorFound

Major

The possible cause is deskew_fifo overflow, symbol_unlock, deskew_unlock event, or phystatus timeout.

Transfer the issue to the hardware team.

The NPU cannot be used properly.

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Transfer the issue to the hardware team.

The NPU cannot be used properly.

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the BMS, check the heat dissipation system, and reset the devices.

The ECS may be powered off due to overtemperature and devices may not be found.

NPU: request for instance restart

RebootVirtualMachine

Informational

A fault occurs and the BMS needs to be restarted.

Collect the fault information, and restart the BMS.

Services may be interrupted.

NPU: request for SoC reset

ResetSOC

Informational

A fault occurs and the SoC needs to be reset.

Collect the fault information, and reset the SoC.

Services may be interrupted.

NPU: request for restart AI process

RestartAIProcess

Informational

A fault occurs and the AI process needs to be restarted.

Collect the fault information, and restart the AI process.

The current AI task will be interrupted.

NPU: error codes

NPUErrorCodeWarning

Major

A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

DAVP: die device node not found by vasmi

DAVPSMICardNotFound

Major

The driver may be faulty or the card may be disconnected.

Restart the VM. If the device still cannot be loaded, transfer this issue to the hardware team for handling.

The DAVP cannot be used properly.

DAVP: device not found by lspci

DAVPLspciCardNotFound

Major

The DAVP is disconnected.

Transfer the issue to the hardware team.

The DAVP cannot be used properly.

DAVP: temperature higher than the threshold 85°C

TemperatureOverDfLimit

Major

The core module temperature exceeds 85°C, which causes frequency reduction.

Stop services. Contact the hardware team to check the heat dissipation system and reset the device.

The DAVP card frequency is reduced.

DAVP: temperature higher than the threshold 105°C

TemperatureOverSdLimit

Major

The core module temperature exceeds 105°C, which generates a high temperature alarm.

Stop services. Contact the hardware team to check the heat dissipation system and reset the device.

Power-off protection is triggered. The DAVP cannot be used properly.

DAVP: core unit exception of the device node

DeviceCoreAbnormal

Major

You may need to restart the die device node.

Collect the fault information and restart the die device node.

Services may be interrupted.

VM deletion failure

faultDeleteServer

Major

Failed to delete the ECS.

Check whether services are affected.

The ECS resources fail to be deleted.

Note

Auto recovery: Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.
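Many event IDs in Table 1 follow the pattern `instance_<action>_<phase>`, for example `instance_migrate_scheduled` or `instance_redeploy_failed`. The sketch below splits such an ID into its action and phase so that an alarm handler can react to the phase. The type and function names are hypothetical, and the rule in `needs_operator_action` is only one reading of the Solution column (authorize during `inquiring`, contact O&M personnel on `failed`). IDs with other prefixes, such as `localdisk_recovery_*`, are not handled.

```python
from typing import NamedTuple, Optional

# Actions and phases taken from the instance_* event IDs listed in Table 1.
_ACTIONS = {"migrate", "resize", "redeploy", "reboot", "stop"}
_PHASES = {"inquiring", "scheduled", "executing", "completed", "failed", "canceled"}


class ScheduledEvent(NamedTuple):
    action: str
    phase: str


def parse_scheduled_event(event_id: str) -> Optional[ScheduledEvent]:
    """Split an instance_<action>_<phase> ID; return None for any other ID."""
    parts = event_id.split("_")
    if len(parts) != 3 or parts[0] != "instance":
        return None
    _, action, phase = parts
    if action not in _ACTIONS or phase not in _PHASES:
        return None
    return ScheduledEvent(action, phase)


def needs_operator_action(event: ScheduledEvent) -> bool:
    # Assumption: only the "inquiring" and "failed" phases require a person to act.
    return event.phase in {"inquiring", "failed"}


if __name__ == "__main__":
    for event_id in ("instance_migrate_failed", "instance_resize_completed", "startAutoRecovery"):
        print(event_id, parse_scheduled_event(event_id))
```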

Table 2 Elastic IP (EIP)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

EIP

SYS.EIP

EIP bandwidth exceeded

EIPBandwidthOverflow

Major

The used bandwidth exceeded the purchased bandwidth, which may slow down the network or cause packet loss. The value reported in this event is the maximum within a monitoring period, whereas the EIP inbound and outbound bandwidth metrics show the value at a specific point in time within that period.

The metrics are described as follows:

egressDropBandwidth: dropped outbound packets (bytes)

egressAcceptBandwidth: accepted outbound packets (bytes)

egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s)

ingressAcceptBandwidth: accepted inbound packets (bytes)

ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s)

ingressDropBandwidth: dropped inbound packets (bytes)

Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.

The network becomes slow or packets are lost.

EIP released

deleteEip

Minor

The EIP was released.

Check whether the EIP was released by mistake.

The server that has the EIP bound cannot access the Internet.

EIP blocked

blockEIP

Critical

The used bandwidth of an EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. This event may be caused by a DDoS attack.

Replace the EIP to prevent services from being affected.

Locate and deal with the fault.

Services are impacted.

EIP unblocked

unblockEIP

Critical

The EIP was unblocked.

Use the previous EIP again.

None

EIP traffic scrubbing started

ddosCleanEIP

Major

Traffic scrubbing on the EIP was started to prevent DDoS attacks.

Check whether the EIP was attacked.

Services may be interrupted.

EIP traffic scrubbing ended

ddosEndCleanEip

Major

Traffic scrubbing on the EIP, which was started to prevent DDoS attacks, has ended.

Check whether the EIP was attacked.

Services may be interrupted.

QoS bandwidth exceeded

EIPBandwidthRuleOverflow

Major

The used QoS bandwidth exceeded the allocated bandwidth, which may slow down the network or cause packet loss. The value reported in this event is the maximum within a monitoring period, whereas the EIP inbound and outbound bandwidth metrics show the value at a specific point in time within that period.

egressDropBandwidth: dropped outbound packets (bytes)

egressAcceptBandwidth: accepted outbound packets (bytes)

egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s)

ingressAcceptBandwidth: accepted inbound packets (bytes)

ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s)

ingressDropBandwidth: dropped inbound packets (bytes)

Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.

The network becomes slow or packets are lost.

EIP unbound with resources

EipNotBoundStatus

Major

The EIP is not bound to any instance resource.

None

When an EIP is unbound, you will be billed for IP reservation fees and bandwidth fees (billed by bandwidth).
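The metrics carried by the EIPBandwidthOverflow and EIPBandwidthRuleOverflow events can be condensed into a few figures when deciding whether to increase bandwidth. The sketch below assumes that the event detail is available as a dictionary keyed by the metric names in Table 2. The drop ratio is a derived figure introduced here for illustration and is not defined by Cloud Eye. All function names are hypothetical.

```python
from typing import Dict


def drop_ratio(accepted_bytes: float, dropped_bytes: float) -> float:
    """Share of traffic that was dropped, in the range 0.0 to 1.0."""
    total = accepted_bytes + dropped_bytes
    return dropped_bytes / total if total > 0 else 0.0


def bytes_per_sec_to_mbit(bytes_per_sec: float) -> float:
    """Convert byte/s to Mbit/s (1 Mbit = 1,000,000 bits)."""
    return bytes_per_sec * 8 / 1_000_000


def summarize_overflow(detail: Dict[str, float]) -> Dict[str, float]:
    """Reduce the six documented metrics to drop ratios and peak rates."""
    return {
        "egress_drop_ratio": drop_ratio(
            detail.get("egressAcceptBandwidth", 0), detail.get("egressDropBandwidth", 0)
        ),
        "ingress_drop_ratio": drop_ratio(
            detail.get("ingressAcceptBandwidth", 0), detail.get("ingressDropBandwidth", 0)
        ),
        "egress_peak_mbit_s": bytes_per_sec_to_mbit(detail.get("egressMaxBandwidthPerSec", 0)),
        "ingress_peak_mbit_s": bytes_per_sec_to_mbit(detail.get("ingressMaxBandwidthPerSec", 0)),
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {
        "egressAcceptBandwidth": 900,
        "egressDropBandwidth": 100,
        "egressMaxBandwidthPerSec": 1_250_000,
    }
    print(summarize_overflow(sample))
```

A drop ratio that stays high over several monitoring periods supports the documented solution of increasing the bandwidth.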

Table 3 Elastic Load Balance (ELB)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

ELB

SYS.ELB

The backend servers are unhealthy.

healthCheckUnhealthy

Major

Generally, this problem occurs because services on the backend servers are offline. After this event has been reported several times, it is no longer reported.

Ensure that the backend servers are running properly.

ELB does not forward requests to unhealthy backend servers. If all backend servers in the backend server group are detected unhealthy, services will be interrupted.

The backend server is detected healthy.

healthCheckRecovery

Minor

The backend server is detected healthy.

No further action is required.

The load balancer can properly route requests to the backend server.
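Table 3 states that services are interrupted only when all backend servers in a backend server group are unhealthy. The sketch below replays a sequence of healthCheckUnhealthy and healthCheckRecovery events to test for that condition. It assumes that each event can be associated with a backend server identifier. The table does not specify that field, so the `(event_id, server)` tuples and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def unhealthy_after(events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Replay (event_id, server) pairs in order and return the servers still unhealthy."""
    unhealthy: Set[str] = set()
    for event_id, server in events:
        if event_id == "healthCheckUnhealthy":
            unhealthy.add(server)
        elif event_id == "healthCheckRecovery":
            unhealthy.discard(server)
    return unhealthy


def group_is_down(group_members: Set[str], unhealthy: Set[str]) -> bool:
    """True when the group is non-empty and every member is unhealthy."""
    return bool(group_members) and group_members <= unhealthy


if __name__ == "__main__":
    history = [
        ("healthCheckUnhealthy", "server-a"),
        ("healthCheckUnhealthy", "server-b"),
        ("healthCheckRecovery", "server-a"),
    ]
    still_unhealthy = unhealthy_after(history)
    print(sorted(still_unhealthy), group_is_down({"server-a", "server-b"}, still_unhealthy))
```

Because the unhealthy event stops being reported after several occurrences, a tracker like this should rely on the recovery event, not on the absence of further unhealthy events, to clear a server.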

Table 4 Cloud Backup and Recovery (CBR)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

CBR

SYS.CBR

Failed to create the backup.

backupFailed

Critical

The backup failed to be created.

Manually create a backup or contact customer service.

Data loss may occur.

Failed to restore the resource using a backup.

restorationFailed

Critical

The resource failed to be restored using a backup.

Restore the resource using another backup or contact customer service.

Data loss may occur.

Failed to delete the backup.

backupDeleteFailed

Critical

The backup failed to be deleted.

Try again later or contact customer service.

Charging may be abnormal.

Failed to delete the vault.

vaultDeleteFailed

Critical

The vault failed to be deleted.

Try again later or contact technical support.

Charging may be abnormal.

Replication failure

replicationFailed

Critical

The backup failed to be replicated.

Try again later or contact technical support.

Data loss may occur.

The backup is created successfully.

backupSucceeded

Major

The backup was created.

None

None

Resource restoration using a backup succeeded.

restorationSucceeded

Major

The resource was restored using a backup.

Check whether the data is successfully restored.

None

The backup is deleted successfully.

backupDeletionSucceeded

Major

The backup was deleted.

None

None

The vault is deleted successfully.

vaultDeletionSucceeded

Major

The vault was deleted.

None

None

Replication success

replicationSucceeded

Major

The backup was replicated successfully.

None

None

Client offline

agentOffline

Critical

The backup client was offline.

Ensure that the Agent status is normal and that the backup client can connect to the cloud service platform.

Backup tasks may fail.

Client online

agentOnline

Major

The backup client was online.

None

None

Table 5 Relational Database Service (RDS) — resource exception

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

RDS

SYS.RDS

DB instance creation failure

createInstanceFailed

Major

Generally, the cause is that the number of disks is insufficient due to quota limits, or underlying resources are exhausted.

The selected resource specifications are insufficient. Select other available specifications and try again.

DB instances cannot be created.

Full backup failure

fullBackupFailed

Major

A single full backup failure does not affect the files that have already been backed up, but it prolongs the incremental backup time during point-in-time restore (PITR).

Try again.

Full backup failed.

Read replica promotion failure

activeStandBySwitchFailed

Major

The standby DB instance does not take over workloads from the primary DB instance due to network or server failures. The original primary DB instance continues to provide services within a short time.

Perform the switchover again during off-peak hours.

The primary/standby switchover will fail.

Replication status abnormal

abnormalReplicationStatus

Major

The possible causes are as follows:

The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.

The network between the primary instance and the standby instance or a read replica is disconnected.

Database replication is being repaired. You will be notified immediately after the repair.

The replication status is abnormal.

Replication status recovered

replicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

Check whether services are running properly.

Replication status is recovered.

DB instance faulty

faultyDBInstance

Major

A single or primary DB instance was faulty due to a catastrophic failure, for example, server failure.

Instance status is being repaired. You will be notified immediately after the repair.

The instance status is abnormal.

DB instance recovered

DBInstanceRecovered

Major

RDS uses its high availability mechanism to rebuild the standby DB instance. This event is reported after the instance is rebuilt.

The DB instance status is normal. Check whether services are running properly.

The instance is recovered.

Failure of changing single DB instance to primary/standby

singleToHaFailed

Major

A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located.

Automatic retry is in progress.

Changing a single DB instance to primary/standby failed.

Database process restarted

DatabaseProcessRestarted

Major

The database process is stopped due to insufficient memory or high load.

Check whether services are running properly.

The primary instance is restarted. Services are interrupted for a short period of time.

Instance storage full

instanceDiskFull

Major

Generally, the cause is that the data space usage is too high.

Scale up the storage.

The instance storage is used up. No data can be written into databases.

Instance storage full recovered

instanceDiskFullRecovered

Major

The instance disk is recovered.

Check whether services are running properly.

The instance has available storage.

Kafka connection failed

kafkaConnectionFailed

Major

The network is unstable or the Kafka server does not work properly.

Check whether services are affected.

None

Table 6 Document Database Service (DDS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

DDS

SYS.DDS

DB instance creation failure

DDSCreateInstanceFailed

Major

A DDS instance fails to be created due to insufficient disks, quotas, or underlying resources.

Check the number and quota of disks. Release resources and create DDS instances again.

DDS instances cannot be created.

Replication failed

DDSAbnormalReplicationStatus

Major

The possible causes are as follows:

  1. The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.
  2. The network between the primary instance and the standby instance or a read replica is disconnected.

Submit a service ticket.

  1. Read and write operations on the original instance are not interrupted, but data updates on the standby instance may experience delays.
  2. The replication delay keeps growing between the primary and standby instances, and the standby instance may be disconnected.

Replication status recovered

DDSReplicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

No action is required.

None

DB instance failed

DDSFaultyDBInstance

Major

This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.

Submit a service ticket.

The database service may be unavailable.

DB instance recovered

DDSDBInstanceRecovered

Major

If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Faulty node

DDSFaultyDBNode

Major

This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.

Check whether the database service is available and submit a service ticket.

The database service may be unavailable.

Node recovered

DDSDBNodeRecovered

Major

If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Primary/Standby switchover or failover

DDSPrimaryStandbySwitched

Major

This event is reported when a primary/standby switchover or a failover is triggered.

No action is required.

None

Insufficient storage space

DDSRiskyDataDiskUsage

Major

The storage space is insufficient.

Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide.

The instance is set to read-only and data cannot be written to the instance.

Data disk expanded and writable

DDSDataDiskUsageRecovered

Major

The capacity of a data disk has been expanded and the data disk becomes writable.

No further action is required.

No adverse impact.

Schedule for deleting a KMS key

planDeleteKmsKey

Major

A request to schedule deletion of a KMS key was submitted.

After the KMS key is scheduled for deletion, either decrypt the data encrypted with the KMS key in a timely manner or cancel the key deletion.

After the KMS key is deleted, users cannot encrypt disks.

Table 7 Distributed Database Middleware (DDM)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

DDM

SYS.DDM (DDM 1.0)

SYS.DDMS (DDM 2.0)

Failed to create a DDM instance

createDdmInstanceFailed

Major

The underlying resources are insufficient.

Release resources and create the instance again.

DDM instances cannot be created.

Failed to change class of a DDM instance

resizeFlavorFailed

Major

The underlying resources are insufficient.

Submit a service ticket to the O&M personnel to coordinate resources and try again.

Services on some nodes are interrupted.

Failed to scale out a DDM instance

enlargeNodeFailed

Major

The underlying resources are insufficient.

Submit a service ticket to the O&M personnel to coordinate resources, delete the node that fails to be added, and add a node again.

The instance fails to be scaled out.

Failed to scale in a DDM instance

reduceNodeFailed

Major

The underlying resources fail to be released.

Submit a service ticket to the O&M personnel to release resources.

The instance fails to be scaled in.

Failed to restart a DDM instance

restartInstanceFailed

Major

The associated DB instances are abnormal.

Check whether the associated DB instances are normal. If they are normal, submit a service ticket to the O&M personnel.

Services on some nodes are interrupted.

Failed to create a schema

createLogicDbFailed

Major

The possible causes are as follows:

  • The password for the DB instance account is incorrect.
  • The security groups of the DDM instance and the associated DB instance are incorrectly configured, so the DDM instance cannot communicate with the associated DB instance.

Check whether:

  • the username and password of the DB instance are correct.
  • the security groups associated with the DDM instance and the underlying DB instance are correctly configured.

Services cannot run properly.

Failed to bind an EIP

bindEipFailed

Major

The EIP is abnormal.

Try again later. In case of emergency, contact O&M personnel to rectify the fault.

The DDM instance cannot be accessed from the Internet.

Failed to scale out a schema

migrateLogicDbFailed

Major

The underlying resources fail to be processed.

Submit a service ticket to the O&M personnel.

The schema cannot be scaled out.

Failed to re-scale out a schema

retryMigrateLogicDbFailed

Major

The underlying resources fail to be processed.

Submit a service ticket to the O&M personnel.

The schema cannot be scaled out.

Table 8 Virtual Private Cloud (VPC)

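| Event Source | Namespace | Event Name | Event ID | Event Severity |
|---|---|---|---|---|
| VPC | SYS.VPC | VPC deleted | deleteVpc | Major |
| VPC | SYS.VPC | VPC modified | modifyVpc | Minor |
| VPC | SYS.VPC | Subnet deleted | deleteSubnet | Minor |
| VPC | SYS.VPC | Subnet modified | modifySubnet | Minor |
| VPC | SYS.VPC | Bandwidth modified | modifyBandwidth | Minor |
| VPC | SYS.VPC | VPN deleted | deleteVpn | Major |
| VPC | SYS.VPC | VPN modified | modifyVpn | Minor |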

Table 9 Elastic Volume Service (EVS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

EVS

SYS.EVS

Update disk

updateVolume

Minor

Update the name and description of an EVS disk.

No further action is required.

None

Expand disk

extendVolume

Minor

Expand an EVS disk.

No further action is required.

None

Delete disk

deleteVolume

Major

Delete an EVS disk.

No further action is required.

Deleted disks cannot be recovered.

QoS upper limit reached

NOTE:

This event is no longer supported for EVS and will be removed from Cloud Eye.

reachQoS

Major

The I/O latency increases because the QoS upper limits of the disk are frequently reached and flow control is triggered.

Change the disk type to one with a higher specification.

The current disk may fail to meet service requirements.

Table 10 Identity and Access Management (IAM)

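| Event Source | Namespace | Event Name | Event ID | Event Severity |
|---|---|---|---|---|
| IAM | SYS.IAM | Login | login | Minor |
| IAM | SYS.IAM | Logout | logout | Minor |
| IAM | SYS.IAM | Password changed | changePassword | Major |
| IAM | SYS.IAM | User created | createUser | Minor |
| IAM | SYS.IAM | User deleted | deleteUser | Major |
| IAM | SYS.IAM | User updated | updateUser | Minor |
| IAM | SYS.IAM | User group created | createUserGroup | Minor |
| IAM | SYS.IAM | User group deleted | deleteUserGroup | Major |
| IAM | SYS.IAM | User group updated | updateUserGroup | Minor |
| IAM | SYS.IAM | Identity provider created | createIdentityProvider | Minor |
| IAM | SYS.IAM | Identity provider deleted | deleteIdentityProvider | Major |
| IAM | SYS.IAM | Identity provider updated | updateIdentityProvider | Minor |
| IAM | SYS.IAM | Metadata updated | updateMetadata | Minor |
| IAM | SYS.IAM | Security policy updated | updateSecurityPolicies | Major |
| IAM | SYS.IAM | Credential added | addCredential | Major |
| IAM | SYS.IAM | Credential deleted | deleteCredential | Major |
| IAM | SYS.IAM | Project created | createProject | Minor |
| IAM | SYS.IAM | Project updated | updateProject | Minor |
| IAM | SYS.IAM | Project suspended | suspendProject | Major |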
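The severity column can be used to decide which events trigger a notification. The sketch below selects the IAM event IDs at or above a chosen severity. The ranking Informational < Minor < Major < Critical is an assumption based on the conventional meaning of these labels, since the tables do not state an order. The dictionary holds a subset of the IAM events above, and the function name is hypothetical.

```python
from typing import Dict, List

# Assumed ordering of the four severity labels used in the tables.
SEVERITY_RANK: Dict[str, int] = {"Informational": 0, "Minor": 1, "Major": 2, "Critical": 3}

# A subset of the IAM event IDs and severities listed in Table 10.
IAM_EVENT_SEVERITY: Dict[str, str] = {
    "login": "Minor",
    "logout": "Minor",
    "changePassword": "Major",
    "deleteUser": "Major",
    "addCredential": "Major",
    "createProject": "Minor",
}


def events_at_least(events: Dict[str, str], minimum: str) -> List[str]:
    """Return the event IDs whose severity is at or above the minimum, sorted by ID."""
    threshold = SEVERITY_RANK[minimum]
    return sorted(
        event_id for event_id, severity in events.items()
        if SEVERITY_RANK[severity] >= threshold
    )


if __name__ == "__main__":
    print(events_at_least(IAM_EVENT_SEVERITY, "Major"))
```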

Table 11 Key Management Service (KMS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

KMS

SYS.KMS

Key disabled

disableKey

Major

A key is disabled and cannot be used.

If the customer needs to disable the key, no action is required. However, if the key is disabled by mistake, the customer needs to log in to the DEW console and enable it again.

Services may be affected if the key is being used.

Key deletion scheduled

scheduleKeyDeletion

Minor

A key is scheduled to be deleted and cannot be used.

If the customer needs to delete the key, no action is required. However, if the deletion of the key is scheduled by mistake, the customer needs to log in to the DEW console, cancel the scheduled deletion, and enable the key again.

Services may be affected if the key is being used.

Grant retired

retireGrant

Major

A grant is retired and the key cannot be used.

If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again.

Services may be affected if the key is being used.

Grant revoked

revokeGrant

Major

A grant is revoked and the key cannot be used.

If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again.

Services may be affected if the key is being used.
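The KMS rows above can be captured as a small lookup table, for example to decide which events should trigger a notification. The following is a minimal sketch, not part of any Cloud Eye SDK: the event IDs and severities are copied from Table 11, while the names `KMS_EVENT_SEVERITY` and `events_with_severity` are hypothetical.

```python
from typing import Dict, List

# Event ID -> severity, copied from Table 11 (namespace SYS.KMS).
# The dictionary name and helper below are illustrative assumptions.
KMS_EVENT_SEVERITY: Dict[str, str] = {
    "disableKey": "Major",
    "scheduleKeyDeletion": "Minor",
    "retireGrant": "Major",
    "revokeGrant": "Major",
}


def events_with_severity(severity: str) -> List[str]:
    """Return the KMS event IDs that have the given severity, sorted by name."""
    return sorted(
        event_id
        for event_id, event_severity in KMS_EVENT_SEVERITY.items()
        if event_severity == severity
    )
```

Calling `events_with_severity("Major")` returns the three event IDs that make a key unusable immediately, which may be the ones worth alerting on first.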

Table 12 Object Storage Service (OBS)

Event Source

Namespace

Event Name

Event ID

Event Severity

OBS

SYS.OBS

Bucket deleted

deleteBucket

Major

Bucket policy deleted

deleteBucketPolicy

Major

Bucket ACL configured

setBucketAcl

Minor

Bucket policy configured

setBucketPolicy

Minor

Table 13 Cloud Eye

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

Cloud Eye

SYS.CES

Agent heartbeat interruption

agentHeartbeatInterrupted

Major

The collecting process of the Agent is faulty.

  • Check whether the Agent domain name can be resolved.
  • Check whether your account is in arrears.
  • If the Agent process is faulty, restart the Agent. If the process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.
  • Check whether the server time is consistent with the local standard time.
  • Update the Agent to the latest version.

The Agent will stop collecting and reporting metrics.

Agent back to normal

agentResumed

Informational

The Agent was back to normal.

No action is required.

None

Agent faulty

agentFaulted

Major

The Agent was faulty and this status was reported to Cloud Eye.

  • If the Agent process is faulty, restart the Agent. If the process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.
  • Update the Agent to the latest version.

The Agent will stop collecting and reporting metrics.

Agent disconnected

agentDisconnected

Major

The communication process of the Agent is faulty.

  • Check whether the Agent domain name can be resolved.
  • Check whether your account is in arrears.
  • If the Agent process is faulty, restart the Agent. If the process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.
  • Check whether the server time is consistent with the local standard time.
  • Update the Agent to the latest version.

The Agent will stop collecting and reporting metrics.
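One way to consume the Agent events in Table 13 is to track which servers still have an unresolved Agent fault. The sketch below assumes that `agentResumed` clears an earlier fault event for the same server; that pairing, the `(host, event_id)` record shape, and the function name are assumptions made for illustration, not a documented Cloud Eye data format.

```python
from typing import Iterable, Set, Tuple

# Event IDs taken from Table 13 (namespace SYS.CES).
FAULT_EVENTS = {"agentHeartbeatInterrupted", "agentFaulted", "agentDisconnected"}
# Assumption: agentResumed marks the end of an earlier Agent fault.
RESUME_EVENT = "agentResumed"


def hosts_with_open_agent_faults(events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return hosts whose latest Agent event is a fault.

    `events` is a time-ordered iterable of (host, event_id) pairs.
    This record shape is hypothetical.
    """
    open_faults: Set[str] = set()
    for host, event_id in events:
        if event_id in FAULT_EVENTS:
            open_faults.add(host)
        elif event_id == RESUME_EVENT:
            open_faults.discard(host)
    return open_faults
```

Hosts left in the returned set are the ones where the troubleshooting steps listed in the Solution column still apply.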

Table 14 Host Security Service (HSS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

HSS

SYS.HSS

HSS agent disconnected

hssAgentAbnormalOffline

Major

The communication between the agent and the server is abnormal, or the agent process on the server is abnormal.

Fix your network connection. If the agent is still offline for a long time after the network recovers, the agent process may be abnormal. In this case, log in to the server and restart the agent process.

Services are interrupted.

Abnormal HSS agent status

hssAgentAbnormalProtection

Major

The agent is abnormal probably because it does not have sufficient resources.

Log in to the server and check your resources. If the usage of memory or other system resources is too high, increase their capacity first. If the resources are sufficient but the fault persists after the agent process is restarted, submit a service ticket to the O&M personnel.

Services are interrupted.

Table 15 Image Management Service (IMS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

IMS

SYS.IMS

Create Image

createImage

Major

An image was created.

None

You can use this image to create cloud servers.

Update Image

updateImage

Major

Metadata of an image was modified.

None

Cloud servers may fail to be created from this image.

Delete Image

deleteImage

Major

An image was deleted.

None

This image will be unavailable on the management console.

Table 16 MapReduce Service (MRS)

Event Source

Namespace

Event Name

Event ID

Event Severity

Description

Solution

Impact

MRS

SYS.MRS

DBServer Switchover

dbServerSwitchover

Minor

The DBServer switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may affect Hive service availability.

Flume Channel overflow

flumeChannelOverflow

Minor

The Flume channel overflows.

Check whether the Flume channel configuration is proper and whether the service volume increases sharply.

Flume tasks cannot write data to the backend.

NameNode Switchover

namenodeSwitchover

Minor

The NameNode switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may cause HDFS file read/write failures.

ResourceManager Switchover

resourceManagerSwitchover

Minor

The ResourceManager switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may cause exceptions or even failures of YARN tasks.

JobHistoryServer Switchover

jobHistoryServerSwitchover

Minor

The JobHistoryServer switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may cause failures to read MapReduce task logs.

HMaster Failover

hmasterFailover

Minor

The HMaster failover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may affect HBase service availability.

Hue Failover

hueFailover

Minor

The Hue failover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

The active/standby switchover may affect the display of the Hue page.

Impala HaProxy Failover

impalaHaProxyFailover

Minor

The Impala HaProxy switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may affect Impala service availability.

Impala StateStoreCatalog Failover

impalaStateStoreCatalogFailover

Minor

The Impala StateStoreCatalog failover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may affect Impala service availability.

LdapServer Failover

ldapServerFailover

Minor

The LdapServer failover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

Consecutive active/standby switchovers may affect LdapServer service availability.

Loader Switchover

loaderSwitchover

Minor

The Loader switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

The active/standby switchover may affect Loader service availability.

Manager Switchover

managerSwitchover

Informational

The Manager switchover occurs.

Confirm with O&M personnel whether the active/standby switchover is caused by normal operations.

The active/standby Manager switchover may make the Manager page inaccessible and cause abnormal values for some monitoring items.

Job Running Failed

jobRunningFailed

Informational

A job fails to be executed.

On the Jobs tab page, check whether the failed task is normal.

The job fails to be executed.

Job Killed

jobkilled

Informational

The job is terminated.

Check whether the task is manually terminated.

The job execution process is terminated.

Oozie Workflow Execution Failure

oozieWorkflowExecutionFailure

Minor

Oozie workflows fail to execute.

View Oozie logs to locate the failure cause.

Oozie workflows fail to execute.

Oozie Scheduled Job Execution Failure

oozieScheduledJobExecutionFailure

Minor

Oozie scheduled tasks fail to execute.

View Oozie logs to locate the failure cause.

Oozie scheduled tasks fail to execute.

ClickHouse Service Unavailable

clickHouseServiceUnavailable

Critical

The ClickHouse service is unavailable.

For details, see section "ALM-45425 ClickHouse Service Unavailable" in MapReduce Service User Guide.

The ClickHouse service is abnormal. Cluster operations cannot be performed on the ClickHouse service on FusionInsight Manager, and the ClickHouse service function cannot be used.

DBService Service Unavailable

dbServiceServiceUnavailable

Critical

DBService is unavailable.

For details, see section "ALM-27001 DBService Service Unavailable" in MapReduce Service User Guide.

The database service is unavailable and cannot provide data import and query functions for upper-layer services. As a result, service exceptions occur.

DBService Heartbeat Interruption Between the Active and Standby Nodes

dbServiceHeartbeatInterruptionBetweentheActiveAndStandbyNodes

Major

The heartbeat between the active and standby DBService nodes is interrupted.

For details, see section "ALM-27003 Heartbeat Interruption Between the Active and Standby Nodes" in MapReduce Service User Guide.

During the DBService heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable.

Data Inconsistency Between Active and Standby DBServices

dataInconsistencyBetweenActiveAndStandbyDBServices

Critical

Data is inconsistent between the active and standby DBServices.

For details, see section "ALM-27004 Data Inconsistency Between Active and Standby DBService" in MapReduce Service User Guide.

When data is not synchronized between the active and standby DBServices, the data may be lost or abnormal if the active instance becomes abnormal.

Database Enters the Read-Only Mode

databaseEnterstheReadOnlyMode

Critical

The database enters the read-only mode.

For details, see section "ALM-27007 Database Enters the Read-Only Mode" in MapReduce Service User Guide.

The database enters the read-only mode, causing service data loss.

Flume Service Unavailable

flumeServiceUnavailable

Critical

The Flume service is unavailable.

For details, see section "ALM-24000 Flume Service Unavailable" in MapReduce Service User Guide.

Flume is running abnormally and the data transmission service is interrupted.

Flume Agent Exception

flumeAgentException

Major

The Flume Agent is abnormal.

For details, see section "ALM-24001 Flume Agent Exception" in MapReduce Service User Guide.

The Flume agent instance for which the alarm is generated cannot provide services properly, and the data transmission tasks of the instance are temporarily interrupted. Real-time data is lost during real-time data transmission.

Flume Client Disconnection Alarm

flumeClientDisconnected

Major

The Flume client is disconnected.

For details, see section "ALM-24003 Flume Client Interrupted" in MapReduce Service User Guide.

The Flume Client for which the alarm is generated cannot communicate with the Flume Server and the data of the Flume Client cannot be sent to the Flume Server.

Exception Occurs When Flume Reads Data

exceptionOccursWhenFlumeReadsData

Major

Exceptions occur when Flume reads data.

For details, see section "ALM-24004 Exception Occurs When Flume Reads Data" in MapReduce Service User Guide.

If data is found in the data source and Flume Source continuously fails to read data, the data collection is stopped.

Exception Occurs When Flume Transmits Data

exceptionOccursWhenFlumeTransmitsData

Major

Exceptions occur when Flume transmits data.

For details, see section "ALM-24005 Exception Occurs When Flume Transmits Data" in MapReduce Service User Guide.

If the disk usage of Flume Channel increases continuously, importing data to a specified destination takes longer. When the disk usage of Flume Channel reaches 100%, the Flume agent process pauses.

Flume Certificate File Is Invalid

flumeCertificateFileIsinvalid

Major

The Flume certificate file is invalid or damaged.

For details, see section "ALM-24010 Flume Certificate File Is Invalid or Damaged" in MapReduce Service User Guide.

The Flume certificate file is invalid or damaged, and the Flume client cannot access the Flume server.

Flume Certificate File Is About to Expire

flumeCertificateFileIsAboutToExpire

Major

The Flume certificate file is about to expire.

For details, see section "ALM-24011 Flume Certificate File Is About to Expire" in MapReduce Service User Guide.

The Flume certificate file is about to expire, which has no adverse impact on the system.

Flume Certificate File Is Expired

flumeCertificateFileIsExpired

Major

The Flume certificate file has expired.

For details, see section "ALM-24012 Flume Certificate File Has Expired" in MapReduce Service User Guide.

The Flume certificate file has expired and functions are restricted. The Flume client cannot access the Flume server.

Flume MonitorServer Certificate File Is Invalid

flumeMonitorServerCertificateFileIsInvalid

Major

The Flume MonitorServer certificate file is invalid.

For details, see section "ALM-24013 Flume MonitorServer Certificate File Is Invalid or Damaged" in MapReduce Service User Guide.

The MonitorServer certificate file is invalid or damaged, and the Flume client cannot access the Flume server.

Flume MonitorServer Certificate File Is About to Expire

flumeMonitorServerCertificateFileIsAboutToExpire

Major

The Flume MonitorServer certificate file is about to expire.

For details, see section "ALM-24014 Flume MonitorServer Certificate Is About to Expire" in MapReduce Service User Guide.

The MonitorServer certificate is about to expire, which has no adverse impact on the system.

Flume MonitorServer Certificate File Is Expired

flumeMonitorServerCertificateFileIsExpired

Major

The Flume MonitorServer certificate file has expired.

For details, see section "ALM-24015 Flume MonitorServer Certificate File Has Expired" in MapReduce Service User Guide.

The MonitorServer certificate file has expired and functions are restricted. The Flume client cannot access the Flume server.

HDFS Service Unavailable

hdfsServiceUnavailable

Critical

The HDFS service is unavailable.

For details, see section "ALM-14000 HDFS Service Unavailable" in MapReduce Service User Guide.

HDFS fails to provide services for HDFS service-based upper-layer components, such as HBase and MapReduce. As a result, users cannot read or write files.

NameService Service Unavailable

nameServiceServiceUnavailable

Major

The NameService service is abnormal.

For details, see section "ALM-14010 NameService Service Is Abnormal" in MapReduce Service User Guide.

HDFS fails to provide services for upper-layer components based on the NameService service, such as HBase and MapReduce. As a result, users cannot read or write files.

DataNode Data Directory Is Not Configured Properly

datanodeDataDirectoryIsNotConfiguredProperly

Major

The DataNode data directory is not configured properly.

For details, see section "ALM-14011 DataNode Data Directory Is Not Configured Properly" in MapReduce Service User Guide.

If the DataNode data directory is mounted on critical directories such as the root directory, the disk space of the root directory will be used up after running for a long time. This causes a system fault.

If the DataNode data directory is not configured properly, HDFS performance will deteriorate.

Journalnode Is Out of Synchronization

journalnodeIsOutOfSynchronization

Major

The Journalnode data is not synchronized.

For details, see section "ALM-14012 JournalNode Is Out of Synchronization" in MapReduce Service User Guide.

When a JournalNode is working incorrectly, data on the node is not synchronized with that on other JournalNodes. If data on more than half of JournalNodes is not synchronized, the NameNode cannot work correctly, making the HDFS service unavailable.

Failed to Update the NameNode FsImage File

failedToUpdateTheNameNodeFsImageFile

Major

The NameNode FsImage file failed to be updated.

For details, see section "ALM-14013 Failed to Update the NameNode FsImage File" in MapReduce Service User Guide.

If the FsImage file in the data directory of the active NameNode is not updated, the HDFS metadata combination function is abnormal and requires rectification. If it is not rectified, the Editlog files increase continuously after HDFS runs for a period. In this case, HDFS restart is time-consuming because a large number of Editlog files need to be loaded. In addition, this alarm also indicates that the standby NameNode is abnormal and the NameNode high availability (HA) mechanism becomes invalid. When the active NameNode is faulty, the HDFS service becomes unavailable.

DataNode Disk Fault

datanodeDiskFault

Major

The DataNode disk is faulty.

For details, see section "ALM-14027 DataNode Disk Fault" in MapReduce Service User Guide.

If a DataNode disk fault alarm is reported, a faulty disk partition exists on the DataNode. As a result, files that have been written may be lost.

Yarn Service Unavailable

yarnServiceUnavailable

Critical

The Yarn service is unavailable.

For details, see section "ALM-18000 Yarn Service Unavailable" in MapReduce Service User Guide.

The cluster cannot provide the Yarn service. Users cannot run new applications. Submitted applications cannot be run.

NodeManager Heartbeat Lost

nodemanagerHeartbeatLost

Major

The NodeManager heartbeat is lost.

For details, see section "ALM-18002 NodeManager Heartbeat Lost" in MapReduce Service User Guide.

The lost NodeManager node cannot provide the Yarn service.

The number of containers decreases, so the cluster performance deteriorates.

NodeManager Unhealthy

nodemanagerUnhealthy

Major

The NodeManager is unhealthy.

For details, see section "ALM-18003 NodeManager Unhealthy" in MapReduce Service User Guide.

The faulty NodeManager node cannot provide the Yarn service.

The number of containers decreases, so the cluster performance deteriorates.

Yarn Application Timeout

yarnApplicationTimeout

Minor

Yarn task execution timed out.

For details, see section "ALM-18020 Yarn Task Execution Timeout" in MapReduce Service User Guide.

The alarm persists after task execution times out. However, the task can still be properly executed, so this alarm does not exert any impact on the system.

MapReduce Service Unavailable

mapreduceServiceUnavailable

Critical

The MapReduce service is unavailable.

For details, see section "ALM-18021 MapReduce Service Unavailable" in MapReduce Service User Guide.

The cluster cannot provide the MapReduce service. For example, MapReduce cannot be used to view task logs and the log archive function is unavailable.

Insufficient Yarn Queue Resources

insufficientYarnQueueResources

Minor

Yarn queue resources are insufficient.

For details, see section "ALM-18022 Insufficient Yarn Queue Resources" in MapReduce Service User Guide.

It takes a long time to end an application.

A new application cannot run for a long time after submission.

HBase Service Unavailable

hbaseServiceUnavailable

Critical

The HBase service is unavailable.

For details, see section "ALM-19000 HBase Service Unavailable" in MapReduce Service User Guide.

Operations cannot be performed, such as reading or writing data and creating tables.

System Table Path or File of HBase Is Missing

systemTablePathOrFileOfHBaseIsMissing

Critical

The table directories or files of the HBase system are lost.

For details, see section "ALM-19012 HBase System Table Directory or File Lost" in MapReduce Service User Guide.

The HBase service fails to restart or start.

Hive Service Unavailable

hiveServiceUnavailable

Critical

The Hive service is unavailable.

For details, see section "ALM-16004 Hive Service Unavailable" in MapReduce Service User Guide.

Hive cannot provide data loading, query, and extraction services.

Hive Data Warehouse Is Deleted

hiveDataWarehouseIsDeleted

Critical

The Hive data warehouse is deleted.

For details, see section "ALM-16045 Hive Data Warehouse Is Deleted" in MapReduce Service User Guide.

If the default Hive data warehouse is deleted, databases and tables fail to be created in the default data warehouse, affecting service usage.

Hive Data Warehouse Permission Is Modified

hiveDataWarehousePermissionIsModified

Critical

The Hive data warehouse permissions are modified.

For details, see section "ALM-16046 Hive Data Warehouse Permission Is Modified" in MapReduce Service User Guide.

If the permissions on the Hive default data warehouse are modified, the permissions for users or user groups to create databases or tables in the default data warehouse are affected. The permissions will be expanded or reduced.

HiveServer has been deregistered from zookeeper

hiveServerHasBeenDeregisteredFromZookeeper

Major

HiveServer has been deregistered from ZooKeeper.

For details, see section "ALM-16047 HiveServer Has Been Deregistered from ZooKeeper" in MapReduce Service User Guide.

If Hive configurations cannot be read from ZooKeeper, HiveServer will be unavailable.

Tez or Spark Library Path Does Not Exist

tezlibOrSparklibIsNotExist

Major

The Tez or Spark library path does not exist.

For details, see section "ALM-16048 Tez or Spark Library Path Does Not Exist" in MapReduce Service User Guide.

The Hive on Tez and Hive on Spark functions are affected.

Hue Service Unavailable

hueServiceUnavailable

Critical

The Hue service is unavailable.

For details, see section "ALM-20002 Hue Service Unavailable" in MapReduce Service User Guide.

The system cannot provide data loading, query, and extraction services.

Impala Service Unavailable

impalaServiceUnavailable

Critical

The Impala service is unavailable.

For details, see section "ALM-29000 Impala Service Unavailable" in MapReduce Service User Guide.

The Impala service is abnormal. Cluster operations cannot be performed on Impala on FusionInsight Manager, and Impala service functions cannot be used.

Kafka Service Unavailable

kafkaServiceUnavailable

Critical

The Kafka service is unavailable.

For details, see section "ALM-38000 Kafka Service Unavailable" in MapReduce Service User Guide.

The cluster cannot provide the Kafka service, and users cannot perform new Kafka tasks.

Status of Kafka Default User Is Abnormal

statusOfKafkaDefaultUserIsAbnormal

Critical

The status of the Kafka default user is abnormal.

For details, see section "ALM-38007 Status of Kafka Default User Is Abnormal" in MapReduce Service User Guide.

If the Kafka default user status is abnormal, metadata synchronization between Brokers and interaction between Kafka and ZooKeeper will be affected, affecting service production, consumption, and topic creation and deletion.

Abnormal Kafka Data Directory Status

abnormalKafkaDataDirectoryStatus

Major

The status of the Kafka data directory is abnormal.

For details, see section "ALM-38008 Abnormal Kafka Data Directory Status" in MapReduce Service User Guide.

If the Kafka data directory status is abnormal, the current replicas of all partitions in the data directory are brought offline, and the data directory status of multiple nodes is abnormal at the same time. As a result, some partitions may become unavailable.

Topics with Single Replica

topicsWithSingleReplica

Warning

A topic with a single replica exists.

For details, see section "ALM-38010 Topics with Single Replica" in MapReduce Service User Guide.

Topics with only one replica carry a single point of failure (SPOF) risk. When the node where the replica resides becomes abnormal, the partition does not have a leader, and services on the topic are affected.

KrbServer Service Unavailable

krbServerServiceUnavailable

Critical

The KrbServer service is unavailable.

For details, see section "ALM-25500 KrbServer Service Unavailable" in MapReduce Service User Guide.

When this alarm is generated, no operation can be performed for the KrbServer component in the cluster. The authentication of KrbServer in other components will be affected. The running status of components that depend on KrbServer in the cluster is faulty.

Kudu Service Unavailable

kuduServiceUnavailable

Critical

The Kudu service is unavailable.

For details, see section "ALM-29100 Kudu Service Unavailable" in MapReduce Service User Guide.

Users cannot use the Kudu service.

LdapServer Service Unavailable

ldapServerServiceUnavailable

Critical

The LdapServer service is unavailable.

For details, see section "ALM-25000 LdapServer Service Unavailable" in MapReduce Service User Guide.

When this alarm is generated, no operation can be performed for the KrbServer users and LdapServer users in the cluster. For example, users, user groups, or roles cannot be added, deleted, or modified, and user passwords cannot be changed on the FusionInsight Manager portal. The authentication for existing users in the cluster is not affected.

Abnormal LdapServer Data Synchronization

abnormalLdapServerDataSynchronization

Critical

The LdapServer data synchronization is abnormal.

For details, see section "ALM-25004 Abnormal LdapServer Data Synchronization" in MapReduce Service User Guide.

LdapServer data inconsistency occurs because LdapServer data on Manager or in the cluster is damaged. The LdapServer process with damaged data cannot provide services externally, and the authentication functions of Manager and the cluster are affected.

Nscd Service Is Abnormal

nscdServiceIsAbnormal

Major

The Nscd service is abnormal.

For details, see section "ALM-25005 nscd Service Exception" in MapReduce Service User Guide.

If the Nscd service is abnormal, the node may fail to synchronize data from an LDAP server. In this case, running the id command may fail to obtain data from an LDAP server, affecting upper-layer services.

Sssd Service Is Abnormal

sssdServiceIsAbnormal

Major

The Sssd service is abnormal.

For details, see section "ALM-25006 Sssd Service Exception" in MapReduce Service User Guide.

If the Sssd service is abnormal, the node may fail to synchronize data from LdapServer. In this case, running the id command may fail to obtain LDAP data, affecting upper-layer services.

Loader Service Unavailable

loaderServiceUnavailable

Critical

The Loader service is unavailable.

For details, see section "ALM-23001 Loader Service Unavailable" in MapReduce Service User Guide.

When the Loader service is unavailable, the data loading, import, and conversion functions are unavailable.

Oozie Service Unavailable

oozieServiceUnavailable

Critical

The Oozie service is unavailable.

For details, see section "ALM-17003 Oozie Service Unavailable" in MapReduce Service User Guide.

The Oozie service cannot be used to submit jobs.

Ranger Service Unavailable

rangerServiceUnavailable

Critical

The Ranger service is unavailable.

For details, see section "ALM-45275 Ranger Service Unavailable" in MapReduce Service User Guide.

When the Ranger service is unavailable, Ranger cannot work properly and the Ranger native UI cannot be accessed.

Abnormal RangerAdmin status

abnormalRangerAdminStatus

Major

The RangerAdmin status is abnormal.

For details, see section "ALM-45276 Abnormal RangerAdmin Status" in MapReduce Service User Guide.

If the status of a single RangerAdmin is abnormal, the access to the Ranger native UI is not affected. If the status of two RangerAdmins is abnormal, the Ranger native UI cannot be accessed and operations such as creating, modifying, and deleting policies cannot be performed.

Spark2x Service Unavailable

spark2xServiceUnavailable

Critical

The Spark2x service is unavailable.

For details, see section "ALM-43001 Spark2x Service Unavailable" in MapReduce Service User Guide.

The Spark tasks submitted by users fail to be executed.

Storm Service Unavailable

stormServiceUnavailable

Critical

The Storm service is unavailable.

For details, see section "ALM-26051 Storm Service Unavailable" in MapReduce Service User Guide.

The cluster cannot provide the Storm service externally, and users cannot execute new Storm tasks.

ZooKeeper Service Unavailable

zooKeeperServiceUnavailable

Critical

The ZooKeeper service is unavailable.

For details, see section "ALM-13000 ZooKeeper Service Unavailable" in MapReduce Service User Guide.

ZooKeeper fails to provide coordination services for upper-layer components and the components depending on ZooKeeper may not run properly.

Failed to Set the Quota of Top Directories of ZooKeeper Component

failedToSetTheQuotaOfTopDirectoriesOfZooKeeperComponent

Minor

The quota of top directories of ZooKeeper components failed to be configured.

For details, see section "ALM-13005 Failed to Set the Quota of Top Directories of ZooKeeper Components" in MapReduce Service User Guide.

Components can write a large amount of data to the top-level directory of ZooKeeper. As a result, the ZooKeeper service is unavailable.
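Many Solution cells in the MRS table point to an alarm section such as "ALM-14000 HDFS Service Unavailable". When processing these events in a script, it can help to pull the alarm ID out of that text so it can be looked up in the MapReduce Service User Guide. The snippet below is a minimal sketch that assumes the solution text follows the quoted "ALM-<digits> ..." pattern seen in the table; the function name `extract_alarm_id` is hypothetical.

```python
import re
from typing import Optional

# Matches the alarm ID that opens a quoted section title, e.g. "ALM-14000 ...".
# The pattern is inferred from the Solution column of Table 16.
_ALARM_ID = re.compile(r'"(ALM-\d+)\b')


def extract_alarm_id(solution: str) -> Optional[str]:
    """Return the ALM alarm ID referenced in a solution string, or None."""
    match = _ALARM_ID.search(solution)
    return match.group(1) if match else None
```

Rows whose solution does not reference an alarm section, such as the switchover events, yield `None` and need to be handled through the free-text guidance instead.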

Table 17 Elastic Cloud Server (ECS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ECS

ECS restarted

rebootServer

Minor

The ECS was restarted

  • on the management console.
  • by calling APIs.

Check whether the restart was performed intentionally by a user.

  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

ECS deleted

deleteServer

Major

The ECS was deleted

  • on the management console.
  • by calling APIs.

Check whether the deletion was performed intentionally by a user.

Services are interrupted.

ECS stopped

stopServer

Minor

The ECS was stopped

  • on the management console.
  • by calling APIs.
NOTE:

The ECS is stopped only after CTS is enabled. For details, see Cloud Trace Service User Guide.

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

NIC deleted

deleteNic

Major

The ECS NIC was deleted

  • on the management console.
  • by calling APIs.

  • Check whether the deletion was performed intentionally by a user.
  • Deploy service applications in HA mode.
  • After the NIC is deleted, check whether services recover.

Services may be interrupted.

GuestOS restarted

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

ECS failure caused by system faults

VMFaultsByHostProcessExceptions

Critical

The host where the ECS resides is faulty. The system will automatically try to start the ECS.

After the ECS is started, check whether this ECS and services on it can run properly.

The ECS is faulty.

Scheduled migration to be executed

instance_migrate_scheduled

Major

ECSs will be migrated as scheduled.

Check the impact on services during the execution window.

None

Scheduled specification modification to be executed

instance_resize_scheduled

Major

Specifications will be modified as scheduled.

Check the impact on services during the execution window.

None

Scheduled redeployment to be executed

instance_redeploy_scheduled

Major

ECSs will be redeployed on new hosts as scheduled.

Check the impact on services during the execution window.

None

Scheduled restart to be executed

instance_reboot_scheduled

Major

ECSs will be restarted as scheduled.

Check the impact on services during the execution window.

None

Scheduled stop to be executed

instance_stop_scheduled

Major

ECSs will be stopped as scheduled as they are affected by underlying hardware or system O&M.

Check the impact on services during the execution window.

None

Live migration started

liveMigrationStarted

Major

The host where the ECS is located may be faulty. The ECS is live migrated in advance to prevent service interruptions caused by a host breakdown.

Wait for the event to end and check whether services are affected.

Services may be interrupted for less than 1s.

Live migration completed

liveMigrationCompleted

Major

The live migration is complete, and the ECS is running properly.

Check whether services are running properly.

None

Live migration failure

liveMigrationFailed

Major

An error occurred during the live migration of an ECS.

Check whether services are running properly.

There is a low probability that services are interrupted.

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.

Restart triggered due to hardware fault

startAutoRecovery

Major

ECSs on a faulty host were automatically migrated to another properly running host. During the migration, the ECSs were restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Restart completed due to hardware failure

endAutoRecovery

Major

The ECS was recovered after the automatic migration.

This event indicates that the ECS has recovered and is working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to other ECSs.

Services are interrupted.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

GPU link fault

GPULinkFault

Critical

The GPU of the host on which the ECS is located was faulty or was recovering from a fault.

Deploy service applications in HA mode.

After the GPU fault is rectified, check whether services are restored.

Services are interrupted.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host on which the ECS is located was faulty or was recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services are restored.

Services are interrupted.

Scheduled redeployment to be authorized

instance_redeploy_inquiring

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Authorize scheduled redeployment.

None

Local disk replacement canceled

localdisk_recovery_canceled

Major

Local disk failure

None

None

Local disk replacement to be executed

localdisk_recovery_scheduled

Major

Local disk failure

Check the impact on services during the execution window.

None

Xid event alarm generated on GPU

commonXidError

Major

An Xid event alarm was generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems and may cause services to exit abnormally.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Scheduled redeployment canceled

instance_redeploy_canceled

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

None

None

Scheduled redeployment being executed

instance_redeploy_executing

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled redeployment completed

instance_redeploy_completed

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Wait until the redeployed ECSs are available and check whether services are affected.

None

Scheduled redeployment failed

instance_redeploy_failed

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Contact O&M personnel.

Services are interrupted.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

Local disks are faulty.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disk failure

Wait until the local disks are replaced and check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Local disks are faulty.

Wait until the services are running properly and check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty.

Contact O&M personnel.

Local disks are unavailable.

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty or the NPU is disconnected.

Transfer this issue to the Ascend or hardware team for handling.

The NPU cannot be used properly.

NPU: PCIe link error

PCIeErrorFound

Major

The possible cause is deskew_fifo overflow, symbol_unlock, deskew_unlock event, or phystatus timeout.

Transfer the issue to the hardware team.

The NPU cannot be used properly.

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Transfer the issue to the hardware team.

The NPU cannot be used properly.

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the BMS, check the heat dissipation system, and reset the devices.

The ECS may be powered off due to overtemperature and devices may not be found.

NPU: request for instance restart

RebootVirtualMachine

Informational

A fault occurs and the BMS needs to be restarted.

Collect the fault information, and restart the BMS.

Services may be interrupted.

NPU: request for SoC reset

ResetSOC

Informational

A fault occurs and the SoC needs to be reset.

Collect the fault information, and reset the SoC.

Services may be interrupted.

NPU: request for AI process restart

RestartAIProcess

Informational

A fault occurs and the AI process needs to be restarted.

Collect the fault information, and restart the AI process.

The current AI task will be interrupted.

NPU: error codes

NPUErrorCodeWarning

Major

A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

DAVP: die device node not found by vasmi

DAVPSMICardNotFound

Major

The driver may be faulty or the card may be disconnected.

Restart the VM. If the device still cannot be loaded, transfer this issue to the hardware team for handling.

The DAVP cannot be used properly.

DAVP: device not found by lspci

DAVPLspciCardNotFound

Major

The DAVP is disconnected.

Transfer the issue to the hardware team.

The DAVP cannot be used properly.

DAVP: temperature higher than the threshold 85°C

TemperatureOverDfLimit

Major

The core module temperature exceeds 85°C, which causes frequency reduction.

Stop services. Contact the hardware team to check the heat dissipation system and reset the device.

The DAVP card frequency is reduced.

DAVP: temperature higher than the threshold 105°C

TemperatureOverSdLimit

Major

The core module temperature exceeds 105°C, which generates a high temperature alarm.

Stop services. Contact the hardware team to check the heat dissipation system and reset the device.

Power-off protection is triggered. The DAVP cannot be used properly.
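
The two DAVP temperature events above differ only in the threshold that is exceeded. A minimal sketch of that classification, assuming the core module temperature is already available as a number in °C; the function name and the returned labels are hypothetical and are not part of any DAVP tool:

```python
# Thresholds taken from the two event descriptions above.
FREQUENCY_REDUCTION_C = 85
HIGH_TEMPERATURE_ALARM_C = 105


def davp_temperature_state(core_temp_c):
    """Map a core module temperature (in °C) to the matching event condition."""
    if core_temp_c > HIGH_TEMPERATURE_ALARM_C:
        return "high_temperature_alarm"   # TemperatureOverSdLimit
    if core_temp_c > FREQUENCY_REDUCTION_C:
        return "frequency_reduction"      # TemperatureOverDfLimit
    return "normal"
```

Both descriptions say the temperature "exceeds" the threshold, so the comparison is strict: a reading of exactly 85°C is still treated as normal in this sketch.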

DAVP: core unit exception of the device node

DeviceCoreAbnormal

Major

You may need to restart the die device node.

Collect the fault information and restart the die device node.

Services may be interrupted.

GPU NVML library API error

gpuNvmlApiError

Major

Unknown errors exist in the power, clock, or fan API of the NVML library provided by the GPU driver.

Restart the server or upgrade the driver. If the fault persists, transfer this issue to the hardware team.

GPUs may be unavailable.

VM deletion failure

faultDeleteServer

Major

Failed to delete the ECS.

Check whether services are affected.

The ECS resources fail to be deleted.

GPU throttle alarm

gpuClocksThrottleReasonsAlarm

Informational

  1. The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged.
  2. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat.
  3. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption.
  4. Hardware faults may cause a decrease in clock frequency.

Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.

The GPU slows down, resulting in less powerful compute.
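
The four possible causes listed for this event correspond roughly to the throttle reason bits that the NVIDIA driver reports. The sketch below decodes such a bitmask to help decide whether the slowdown was triggered by hardware. It assumes the mask is obtained with `nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv,noheader` (the field name may differ between driver versions) and that the bit values match the `nvmlClocksThrottleReason*` constants in `nvml.h`. The function names are hypothetical.

```python
# Bit values assumed to match the nvmlClocksThrottleReason* constants in
# nvml.h; verify them against the header shipped with your driver.
THROTTLE_REASONS = {
    0x1: "gpu_idle",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}

# Reasons raised by the hardware itself rather than by the driver software.
HARDWARE_BITS = 0x8 | 0x40 | 0x80


def decode_throttle_reasons(mask_text):
    """Return the names of all known reasons set in a hex bitmask string."""
    mask = int(mask_text.strip(), 16)
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]


def needs_hardware_check(mask_text):
    """True if any hardware-triggered slowdown bit is set."""
    return bool(int(mask_text.strip(), 16) & HARDWARE_BITS)
```

A mask that contains only `gpu_idle` or `sw_power_cap` matches the idle and power-limit causes and usually needs no action. A hardware-triggered bit is the case that the solution above suggests transferring to the hardware team.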

Pending page retirement for GPU DRAM ECC

gpuRetiredPagesPendingAlarm

Major

  1. An ECC error occurred on the hardware. DRAM pages need to be retired.
  2. An uncorrectable ECC error occurred on the GPU memory page and the page needs to be retired. However, the page is suspended and has not been retired yet.

  1. View the event details and check whether the value of retired_pages.pending is yes.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.
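
The first step of the solution above can also be checked directly on the server. A minimal sketch, assuming `nvidia-smi` is on the PATH and that the installed driver supports the `retired_pages.pending` query field; the function names are hypothetical:

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,retired_pages.pending",
    "--format=csv,noheader",
]


def parse_pending(csv_text):
    """Return the indexes of GPUs whose retired_pages.pending value is 'Yes'."""
    pending = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, value = [part.strip() for part in line.split(",", 1)]
        if value.lower() == "yes":
            pending.append(int(index))
    return pending


def gpus_with_pending_retirement():
    """Run nvidia-smi and report GPUs that still have pages waiting for retirement."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_pending(result.stdout)
```

GPUs that do not support page retirement report a placeholder such as `[N/A]`, which this sketch treats as "not pending".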

Pending row remapping for GPU DRAM ECC

gpuRemappedRowsAlarm

Major

Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.

  1. View the event metric "RemappedRow" to check if there are any rows that have been remapped.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.

Insufficient resources for GPU DRAM ECC row remapping

gpuRowRemapperResourceAlarm

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. The standby GPU memory row resources are exhausted, so row remapping cannot be continued.

Transfer the issue to the hardware team.

The GPU cannot work properly.

Correctable GPU DRAM ECC error

gpuDRAMCorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. A correctable ECC error occurs in the DRAM of the GPU. However, the ECC mechanism can automatically rectify the error and programs are not affected.

  1. View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.

Uncorrectable GPU DRAM ECC error

gpuDRAMUncorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. An uncorrectable ECC error occurs in the DRAM of the GPU. This error cannot be automatically corrected using the ECC mechanism. The verification process affects system stability and may cause program crashes.

  1. View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.
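
The volatile ECC counters named in the two events above can be read on the server as well. The sketch below parses the output of `nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader`. The `.total` suffix on the query fields is an assumption about how the installed driver names them (run `nvidia-smi --help-query-gpu` to confirm), and the function names are hypothetical.

```python
def parse_ecc_counts(csv_text):
    """Parse 'index, corrected, uncorrected' rows; unsupported values become None."""
    rows = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        index, corrected, uncorrected = parts
        rows.append({
            "index": int(index),
            "corrected": int(corrected) if corrected.isdigit() else None,
            "uncorrected": int(uncorrected) if uncorrected.isdigit() else None,
        })
    return rows


def gpus_with_uncorrectable_errors(rows):
    """Return the indexes of GPUs with a non-zero uncorrectable error count."""
    return [row["index"] for row in rows if row["uncorrected"]]
```

Volatile counters are reset when the driver is reloaded, so a non-zero value reflects errors since the last reload rather than the lifetime total.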

Inconsistent GPU kernel versions

gpuKernelVersionInconsistencyAlarm

Major

Inconsistent GPU kernel versions.

During driver installation, the GPU driver is compiled against the kernel that is running at that time. If the kernel versions are found to be inconsistent, the kernel was changed after the driver was installed. In this case, the driver becomes unavailable and needs to be reinstalled.

Run the following commands to rectify the issue:

rmmod nvidia_drm

rmmod nvidia_modeset

rmmod nvidia

Then, run nvidia-smi. If the command output is normal, the issue has been rectified.

The GPU cannot work properly.
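
The mismatch described above can be confirmed by comparing the running kernel release with the kernel that the `nvidia` module was built for. A minimal sketch, assuming a Linux server where `modinfo` is available; the function names are hypothetical and this is not necessarily how the monitoring plug-in detects the condition:

```python
import platform
import subprocess


def kernel_from_vermagic(vermagic):
    """The first token of a module's vermagic string is the kernel release it was built for."""
    tokens = vermagic.split()
    return tokens[0] if tokens else ""


def driver_matches_running_kernel(module="nvidia"):
    """False if modinfo finds no such module for the running kernel or its vermagic differs."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return False
    return kernel_from_vermagic(result.stdout) == platform.release()
```

If `driver_matches_running_kernel()` returns `False`, the driver most likely needs to be reinstalled for the current kernel.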

GPU monitoring dependency not met

gpuCheckEnvFailedAlarm

Major

The plug-in cannot identify the GPU driver library path.

  1. Check whether the driver is installed.
  2. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.

The GPU monitoring metrics cannot be collected.
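
Both checks in the solution above can be approximated by locating `nvidia-smi` and comparing its directory with the default installation directory. A minimal sketch; using `nvidia-smi` as a proxy for the driver installation is an assumption, and the function names are hypothetical:

```python
import os
import shutil


def in_default_dir(tool_path, default_dir="/usr/bin"):
    """True if tool_path is located directly in default_dir."""
    if not tool_path:
        return False
    return os.path.dirname(os.path.abspath(tool_path)) == os.path.normpath(default_dir)


def driver_tool_in_default_dir():
    """Look up nvidia-smi on the PATH and check where it is installed."""
    return in_default_dir(shutil.which("nvidia-smi"))
```

`shutil.which` returns `None` when the tool is not on the PATH, which usually means that the driver is not installed or was installed to a custom location.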

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

The files cannot be written.

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker was unavailable.

Check if Docker is normal.

Docker cannot be used.

The container plug-in Ascend-Docker-Runtime was not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards.

NPUs cannot be attached to Docker containers.

IP forwarding was not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers experience network communication problems.
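
The IP forwarding check above amounts to reading one key from /etc/sysctl.conf. A minimal sketch of that lookup; the function name is hypothetical, and files under /etc/sysctl.d/ (which can also set this key) are deliberately not considered:

```python
def ip_forward_setting(sysctl_text):
    """Return the last net.ipv4.ip_forward value found in sysctl.conf text, or None."""
    value = None
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "net.ipv4.ip_forward":
            value = val.strip()  # later entries override earlier ones
    return value
```

For example, `ip_forward_setting(open("/etc/sysctl.conf").read())` returns `"1"` when forwarding is configured as enabled. The value currently in effect can be read from /proc/sys/net/ipv4/ip_forward.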

The shared memory of the container was too small.

The default shared memory is 64 MB, which can be modified as needed.

Method 1

Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2

Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training will fail due to insufficient shared memory.
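
Method 1 above can be sketched as a small helper that sets `default-shm-size` while keeping the other settings in the file. The function name is hypothetical and the size used in the example is illustrative only; choose a value that suits your training jobs.

```python
import json


def with_default_shm_size(daemon_json_text, size):
    """Return daemon.json content with default-shm-size set and other keys preserved."""
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["default-shm-size"] = size
    return json.dumps(config, indent=2, sort_keys=True)
```

For example, `with_default_shm_size(open("/etc/docker/daemon.json").read(), "8G")` produces the updated content. The sketch does not write the file itself, and the Docker daemon has to be restarted before a changed daemon.json takes effect.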

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU card %d was down.

Check the NPU RoCE network port status.

The NPU NIC becomes unavailable.

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d was abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC becomes unavailable.

NPU: RoCE NIC configuration file /etc/hccn.conf not found

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf was not found.

Check whether the /etc/hccn.conf NIC configuration file can be found.

The RoCE NIC becomes unavailable.

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command was abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

The nvidia-fabricmanager version was inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

The container plug-in nvidia-container-toolkit was not installed.

Install the container plug-in nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk attachment inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart.

The disk attachment process fails, preventing the server from restarting.
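
One way to carry out the UUID check above is to compare the `UUID=` entries in /etc/fstab with the UUIDs that the system currently knows about. A minimal sketch, assuming the known UUIDs are taken from the names under /dev/disk/by-uuid; the function names are hypothetical:

```python
def fstab_uuids(fstab_text):
    """Return the UUIDs referenced by UUID= entries in fstab text."""
    uuids = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        spec = line.split()[0]
        if spec.upper().startswith("UUID="):
            uuids.append(spec[len("UUID="):].strip('"'))
    return uuids


def unknown_uuids(fstab_text, known_uuids):
    """Return fstab UUIDs that do not appear in known_uuids (case-insensitive)."""
    known = {uuid.lower() for uuid in known_uuids}
    return [uuid for uuid in fstab_uuids(fstab_text) if uuid.lower() not in known]
```

For example, `unknown_uuids(open("/etc/fstab").read(), os.listdir("/dev/disk/by-uuid"))` (with `os` imported) lists the entries worth correcting before the next restart. Entries that use device paths, labels, or `PARTUUID=` are ignored by this sketch.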

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication will be interrupted.

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port was not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.

The AI software may be unavailable.

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool was unavailable.

Check whether the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

The npu-smi was unavailable.

Check whether the NPU driver is normal.

NPUs cannot be used.

The ascend-dmi was unavailable.

Check whether ToolBox is properly installed.

ascend-dmi cannot be used for performance analysis.

Warning of an NPU driver exception

NpuDriverAbnormalWarning

Major

The NPU driver was abnormal.

Reinstall the NPU driver.

NPUs cannot be used.