ORA-29701: unable to connect to Cluster Synchronization Service, Error 29701

The patient: Grid 19.13, DB 12.2.0.1, OL 7.9 – NODE2 freezes after startup and has to be power-cycled by hand (following interconnect-related errors in the CRS log)

— in the NODE1 DB alert log

2021-11-04T14:26:11.977078+01:00
JIT: pid 44720 requesting full stop
2021-11-04T14:26:18.243731+01:00
JIT: pid 44720 requesting full stop
2021-11-04T14:33:53.984937+01:00
IPC Send timeout detected. Sender: ospid 20158 [oracle@NODE1 (LCK0)]
Receiver: inst 2 binc 16 ospid 18523
2021-11-04T14:33:53.994876+01:00
Communications reconfiguration: instance_number 2 by ospid 20158
2021-11-04T14:34:42.795807+01:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2021-11-04T14:34:42.950298+01:00
IPC Send timeout to 2.1 inc 20 for msg type 65521 from opid 24
2021-11-04T14:34:42.950358+01:00
IPC Send timeout to 2.1 inc 20 for msg type 65521 from opid 24

— and in the NODE2 CRS log (on NODE1 there are only messages about problems with NODE2, without the countdown):

2021-11-04 14:37:21.717 [OCSSD(10029)]CRS-7503: The Oracle Grid Infrastructure process ocssd observed communication issues between node NODE2 and node NODE1, interface list of local node NODE2 is 172.30.1.2:20313, interface list of remote node NODE1 is 172.30.1.1:64128.

2021-11-04 14:37:27.240 [OCSSD(10029)]CRS-1612: Network communication with node NODE1 (1) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.470 seconds

2021-11-04 14:37:34.242 [OCSSD(10029)]CRS-1611: Network communication with node NODE1 (1) has been missing for 75% of the timeout interval. If this persists, removal of this node from cluster will occur in 7.470 seconds

2021-11-04 14:37:39.243 [OCSSD(10029)]CRS-1610: Network communication with node NODE1 (1) has been missing for 90% of the timeout interval. If this persists, removal of this node from cluster will occur in 2.470 seconds

2021-11-04 14:37:42.217 [OCSSD(10029)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/grid/diag/crs/NODE2/crs/trace/ocssd.trc.

2021-11-04 14:37:42.218 [OCSSD(10029)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/NODE2/crs/trace/ocssd.trc
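
A side note on the 50%/75%/90% countdown above: it is driven by the CSS misscount setting (the network heartbeat timeout, 30 seconds by default), so the absolute numbers will look different if misscount was ever changed. A hedged sketch of how to confirm the value and the interconnect definition (run from the GI home as the grid owner; output omitted):

# current CSS network heartbeat timeout in seconds
crsctl get css misscount

# interfaces registered as the private interconnect vs. public
oifcfg getif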

— an attempt to start the DB on NODE2 after CRS came up without any problems

2021-11-08T16:05:19.875473+01:00
Error 29701: unexpected return code 6 from the Cluster Synchronization Service
2021-11-08T16:05:19.881332+01:00
Error 29701: unexpected return code 6 from the Cluster Synchronization Service.
2021-11-08T16:05:19.882182+01:00
Errors in file /u01/app/oracle/diag/rdbms/db/DBNODE2/trace/DBNODE2_lmon_19469.trc:
ORA-29701: unable to connect to Cluster Synchronization Service
2021-11-08T16:05:20.010080+01:00
Error: Shutdown in progress. Error: 29701.
USER (ospid: 19329): terminating the instance due to error 29701

In this case LMON kills itself, and after some number of startup attempts the whole NODE2 machine freezes and has to be power-cycled by hand. The curious part is that CSS does not signal any errors and is up (to be sure, you can check it with: crsctl stat res -t -init); a quick check is sketched below.
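
For the record, a minimal sanity check that CSS really is healthy (a hedged sketch; resource names may vary slightly between versions):

# init-level resources are only visible with -init; ora.cssd should be ONLINE
crsctl stat res -t -init

# or just the CSS daemon resource itself
crsctl stat res ora.cssd -init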

Solution: stop/start the dead DB instance on NODE1 (which looks perfectly fine in both srvctl and crsctl, but not in its logs), and everything returns to normal, i.e. NODE2 comes up without any problems with all its services.
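
A sketch of the fix, assuming the database is registered as db with instances DBNODE1 and DBNODE2 (as the trace file paths above suggest); adjust the names to your environment:

# what clusterware thinks is running
srvctl status database -db db

# bounce the seemingly healthy but dead instance on NODE1
srvctl stop instance -db db -instance DBNODE1 -stopoption immediate
srvctl start instance -db db -instance DBNODE1

# after that NODE2 should start cleanly with all its services
srvctl start instance -db db -instance DBNODE2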