CM FENCING Feature Configuration and Test Procedure

Document type
Technical information
Category
Administration/Configuration
Keywords
CM
FENCE
fencing
Applicable product version
7FS02PS

Overview

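This document describes how the CM Fencing feature, when enabled, protects the CM process and keeps the system stable when a failure occurs. It also explains how to set the CM_FENCE parameter and presents test scenarios.
Note
Test OS / CM version: Rocky Linux 8.10 / TBCM 7.1.1 (Build 277758)
The CM_FENCE parameter was set to Y after the DB had been installed and started.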

Procedure

Perform the following steps after CM and the DB have been installed.

Edit $CM_SID.tip

su - root
[root@tac1 ~]# vi $CM_HOME/config/$CM_SID.tip
CM_FENCE=Y
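For scripted setups, the manual edit above can be made repeatable. The following is a minimal sketch, assuming the tip file uses plain KEY=VALUE lines with no spaces around the equals sign; `set_cm_fence` is a hypothetical helper name, not a TBCM command.

```shell
#!/usr/bin/env bash
# Sketch only: set CM_FENCE=Y in a tip file without creating duplicate entries.
# set_cm_fence is a hypothetical helper, not part of the product.
set_cm_fence() {
    local tip_file="$1"
    local tmp
    if grep -q '^CM_FENCE=' "$tip_file"; then
        # An entry already exists: rewrite its value to Y.
        tmp="$(mktemp)"
        sed 's/^CM_FENCE=.*/CM_FENCE=Y/' "$tip_file" > "$tmp" && cat "$tmp" > "$tip_file"
        rm -f "$tmp"
    else
        # No entry yet: append one.
        printf 'CM_FENCE=Y\n' >> "$tip_file"
    fi
}
```

It would be called as `set_cm_fence "$CM_HOME/config/$CM_SID.tip"`. Writing back with `cat` instead of moving the temporary file keeps the original file's owner and permissions.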
At startup, a message indicating that fencing is enabled is printed.
[root@tac1 ~]# tbcm -b
CM Guard daemon started up.
CM-fence enabled.
import resources from '/cm/cmresource'...
Run cmrctl show param to confirm that the parameter has been applied.
[root@tac1 ~]# cmrctl show param
Parameter Resource Info
================================================
CM_FENCE : Y
================================================
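The same check can be automated. The sketch below reads the command output on standard input and prints the CM_FENCE value; it assumes the value appears on its own line in the form `CM_FENCE : Y`, as in the sample above, and `cm_fence_value` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: extract the CM_FENCE value from `cmrctl show param` output on stdin.
# Assumes a "CM_FENCE : <value>" line; prints nothing if no such line is found.
cm_fence_value() {
    awk -F':' '$1 ~ /^[ \t]*CM_FENCE[ \t]*$/ {
        gsub(/[ \t\r]/, "", $2)   # strip padding around the value
        print $2
        exit
    }'
}
```

A typical use would be `[ "$(cmrctl show param | cm_fence_value)" = "Y" ]` in a pre-test check script.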

FENCING Feature Test Scenarios

Note
The two cases below are example situations created for testing purposes.

Case 1. CM file inaccessible

Detach the disk that holds the CM file from the VM (remove the disk under VM Settings in the console).
CM log of the VM node whose disk was detached
2025/03/26 10:35:45.987 [2] cm_fd.c :0318(05) [cls:0] cmhead read error(5, Input/output error) r_size=-1
2025/03/26 10:35:45.988 [2] cm_fd.c :0318(07) [cls:2] cmhead read error(5, Input/output error) r_size=-1
2025/03/26 10:35:45.993 [1] cm_actio:1194(04) [cls] [ERROR] Cannot access to enough CM files (1/3). SHUTDOWN...
2025/03/26 10:35:45.994 [1] cm_actio:1201(04) [cls] FENCE notify to CM_GUARD (cls=cls)
2025/03/26 10:35:46.194 [1] cm_actio:7526(04) [cls] [WARNING] VIP 'tac2_vip' status 5, global status 12, intr stat 0. Forcibly clearing...
2025/03/26 10:35:46.194 [1] cm_actio:6011(04) [cls] [VIP] (tac1_vip) prework start! vip:192.168.56.11, port:8629 (svc: tac)
2025/03/26 10:35:46.194 [1] cm_actio:5415(04) [cls] Executing command for the instance(tas1) with default environment variable profile
2025/03/26 10:35:46.195 [1] cm_actio:5443(04) [cls] EXECUTE CMD: dbctl_for_cm.sh down abnormal
2025/03/26 10:35:46.194 [1] cm_actio:5415(04) [cls] Executing command for the instance(tac1) with default environment variable profile
2025/03/26 10:35:46.195 [1] cm_actio:5443(04) [cls] EXECUTE CMD: dbctl_for_cm.sh down abnormal
2025/03/26 10:35:46.198 [1] cm_actio:3525(04) [cls] [INST] VIP_LOST msg
2025/03/26 10:35:46.198 [1] cm_actio:3943(04) [cls] [VIP] (tac1_vip) prework done! vip:192.168.56.11, port:8629 (svc: tac)
2025/03/26 10:35:46.241 [2] cm_netwo:0386(00) connection closed. fd:19
2025/03/26 10:35:46.244 [1] cm_util.:0395(04) [cls] start exec ifconfig ens33:1 down
2025/03/26 10:35:46.247 [1] cm_util.:0415(04) [cls] exec ifconfig ens33:1 down success. exit status 255
2025/03/26 10:35:46.248 [1] cm_actio:5039(00) cmd execution rc: -1 (100 < rc < 106)
2025/03/26 10:35:46.248 [1] cm_actio:5289(00) Resource tac1_vip down SUCCESS
2025/03/26 10:35:46.263 [2] cm_netwo:0386(00) connection closed. fd:18
2025/03/26 10:35:46.430 [1] cm_actio:5462(04) [cls] EXECUTE RESULT: 0
2025/03/26 10:35:46.430 [1] cm_actio:5039(00) cmd execution rc: 0 (100 < rc < 106)
2025/03/26 10:35:46.430 [1] cm_actio:5137(00) Resource 'tas1' down SUCCESS (mode: ABNORMAL)
2025/03/26 10:35:46.463 [1] cm_act_s:1235(04) [cls] [SERVICE] New incar no. 3 for service tas
2025/03/26 10:35:46.528 [1] cm_actio:5462(04) [cls] EXECUTE RESULT: 0
2025/03/26 10:35:46.528 [1] cm_actio:5039(00) cmd execution rc: 0 (100 < rc < 106)
2025/03/26 10:35:46.528 [1] cm_actio:5137(00) Resource 'tac1' down SUCCESS (mode: ABNORMAL)
2025/03/26 10:35:46.670 [1] cm_act_s:1281(04) [cls] [SERVICE] incar no. 3 for service tas acked by all scheduled instances
2025/03/26 10:35:46.670 [1] cm_act_s:1235(04) [cls] [SERVICE] New incar no. 3 for service tac
2025/03/26 10:35:46.672 [1] cm_actio:7526(04) [cls] [WARNING] VIP 'tac2_vip' status 6, global status 12, intr stat 0. Forcibly clearing...
2025/03/26 10:35:46.672 [1] cm_actio:7551(04) [cls] all cluster resource down
2025/03/26 10:35:46.675 [2] cm_netwo:0386(00) connection closed. fd:16
2025/03/26 10:35:46.675 [2] cm_netwo:0497(00) delayed close done. fd:16
2025/03/26 10:35:46.988 [1] cm_file.:1022(05) [cls:0] Exit FILEIO thread for +0
2025/03/26 10:35:46.995 [1] cm_file.:1022(07) [cls:2] Exit FILEIO thread for +2
2025/03/26 10:35:47.004 [1] cm_file.:1007(06) [cls:1] Write file down notify!
2025/03/26 10:35:47.005 [2] cm_fd.c :0318(06) [cls:1] cmhead read error(0, Success) r_size=0
2025/03/26 10:35:47.005 [1] cm_file.:0778(06) [cls:1] [ERROR] File hb write failed! size_write = -1, hb_size = 1024
2025/03/26 10:35:47.005 [1] cm_file.:1022(06) [cls:1] Exit FILEIO thread for +1
2025/03/26 10:35:47.687 [1] cm_actio:9346(04) [cls] [ACTION] finish loop
cm guard log
2025/03/26 10:35:45.994 [1] cm_guard:0589(00) [CM_GUARD] FENCE notify from CM (cls=cls)
2025/03/26 10:35:45.997 [1] cm_guard:0459(00) [CM_GUARD] resource release on behalf of CM
2025/03/26 10:35:45.999 [1] cm_util.:0393(00) start exec ifconfig ens33:1 down
2025/03/26 10:35:46.002 [1] cm_util.:0412(00) exec ifconfig ens33:1 down success. exit status 0
2025/03/26 10:35:46.003 [2] cm_vip.c:0661(00) VIP 192.168.56.11 removed from ens33:1
2025/03/26 10:35:48.004 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2254
2025/03/26 10:35:48.010 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2261
2025/03/26 10:35:48.012 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2262
2025/03/26 10:35:48.014 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2263
2025/03/26 10:35:48.016 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2264
2025/03/26 10:35:48.017 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2265
2025/03/26 10:35:48.018 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2266
2025/03/26 10:35:48.019 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2267
2025/03/26 10:35:48.020 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2268
2025/03/26 10:35:48.021 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2423
2025/03/26 10:35:48.022 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2437
2025/03/26 10:35:48.023 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2438
2025/03/26 10:35:48.025 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2439
2025/03/26 10:35:48.026 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2440
2025/03/26 10:35:48.027 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2441
2025/03/26 10:35:48.028 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2442
2025/03/26 10:35:48.029 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2443
2025/03/26 10:35:48.029 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2444
2025/03/26 10:35:48.030 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2445
2025/03/26 10:35:48.031 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2446
2025/03/26 10:35:48.032 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2447
2025/03/26 10:35:48.033 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2448
2025/03/26 10:35:48.034 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2449
2025/03/26 10:35:48.035 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 2450
2025/03/26 10:35:48.147 [1] cm_guard:0490(00) [CM_GUARD] kill CM process with pid '1842'
2025/03/26 10:35:48.148 [1] cm_guard:0504(00) [CM_GUARD] is going to reboot system.
The server then reboots.
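When reviewing a test run, it can help to reduce the CM guard log to the fencing outcome. The following is a minimal sketch that counts the process-kill entries and reports whether the CM kill and reboot messages were logged. It assumes the message texts match the sample log above; `guard_log_summary` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: summarize a CM guard log read from stdin.
# The patterns are taken from the sample log lines; other builds may word them differently.
guard_log_summary() {
    awk '
        /\[CM_GUARD\] kill process with pid/    { killed++ }       # DB-side processes
        /\[CM_GUARD\] kill CM process with pid/ { cm_killed = 1 }  # the CM process itself
        /\[CM_GUARD\] is going to reboot system/ { reboot = 1 }
        END {
            printf "killed_processes=%d cm_killed=%d reboot=%d\n", killed, cm_killed, reboot
        }'
}
```

The function reads the log on standard input, so the guard log file for the node under test can be redirected into it.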

Case 2. cm process hang (File H/B expired)

Attach gdb to the CM process.
$ ps -ef | grep -v grep | grep -v guard | grep $CM_SID | awk '{print $2}'
$ gdb
(gdb) attach [CM PID]
CM guard log of the node whose H/B expired
2025/03/27 10:52:24.119 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 25% expire count
2025/03/27 10:52:57.472 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 50% expire count
2025/03/27 10:53:28.802 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 75% expire count
2025/03/27 10:53:49.022 [1] cm_guard:0868(00) [EXPIRE] Heartbeat from CM missing for 90% expire count
2025/03/27 10:54:01.162 [1] cm_guard:0863(00) [EXPIRE] Heartbeat from CM was expired! (last HB: 3914.211721, current time: 4044.616027)
2025/03/27 10:54:01.166 [1] cm_guard:0935(00) [CM_GUARD] Heartbeat from CM expired...
2025/03/27 10:54:03.175 [1] cm_guard:0459(00) [CM_GUARD] resource release on behalf of CM
2025/03/27 10:54:03.179 [1] cm_util.:0393(00) start exec ifconfig ens33:1 down
2025/03/27 10:54:03.186 [1] cm_util.:0412(00) exec ifconfig ens33:1 down success. exit status 0
2025/03/27 10:54:03.189 [2] cm_vip.c:0661(00) VIP 192.168.56.11 removed from ens33:1
2025/03/27 10:54:05.192 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8944
2025/03/27 10:54:05.196 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8957
2025/03/27 10:54:05.199 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8958
2025/03/27 10:54:05.202 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8959
2025/03/27 10:54:05.204 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8960
2025/03/27 10:54:05.206 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8961
2025/03/27 10:54:05.208 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8962
2025/03/27 10:54:05.209 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8963
2025/03/27 10:54:05.211 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 8964
2025/03/27 10:54:05.212 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9312
2025/03/27 10:54:05.213 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9348
2025/03/27 10:54:05.221 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9349
2025/03/27 10:54:05.233 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9350
2025/03/27 10:54:05.286 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9352
2025/03/27 10:54:05.295 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9353
2025/03/27 10:54:05.301 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9354
2025/03/27 10:54:05.308 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9355
2025/03/27 10:54:05.240 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9351
2025/03/27 10:54:05.310 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9356
2025/03/27 10:54:05.312 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9357
2025/03/27 10:54:05.314 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9358
2025/03/27 10:54:05.315 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9359
2025/03/27 10:54:05.318 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9360
2025/03/27 10:54:05.320 [1] cm_util.:0432(00) [CM_GUARD] kill process with pid 9361
2025/03/27 10:54:05.599 [1] cm_guard:0490(00) [CM_GUARD] kill CM process with pid '1684'
The server then reboots.
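The expiry entry in the guard log records both the last heartbeat time and the current time, so the interval that triggered fencing can be derived from it. The sketch below subtracts the two values. It assumes they are in seconds, which is consistent with the 25%, 50% and 75% warnings in the sample being logged roughly 31 to 33 seconds apart; `hb_gap` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: print (current time - last HB) from the first
# "Heartbeat from CM was expired" line found on stdin.
hb_gap() {
    awk '/Heartbeat from CM was expired/ {
        # "last HB: " is 9 characters, "current time: " is 14 characters.
        if (match($0, /last HB: [0-9.]+/))
            last = substr($0, RSTART + 9, RLENGTH - 9)
        if (match($0, /current time: [0-9.]+/))
            now = substr($0, RSTART + 14, RLENGTH - 14)
        printf "%.6f\n", now - last
        exit
    }'
}
```

For the sample entry (last HB: 3914.211721, current time: 4044.616027) this prints 130.404306, that is, about 130 seconds without a heartbeat before CM guard started fencing.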