현상
drop partition 수행 후, sys.log 상에서 DDD Timeout 로그가 다량 발생하면서 drop partition 작업이 지연되고, select 수행이 안되는 현상이 발생했습니다.
위와 같은 현상 확인을 위해 최소한 다음과 같은 로그 및 정보를 수집합니다.
•
diag.log / diag dump
•
ilog
•
sys.log
•
callstack / .out
•
lock & session 관련 뷰 : v$lock, v$session, v$transaction, v$waiter_session, v$blocker_session
TAC 환경이라면, 각 노드별로 로그를 추출해야 하며 global 뷰로 조회해야 합니다.
수집한 로그를 통한 지연 현상의 원인을 확인하는 과정은 다음과 같습니다.
sys.log / gv$lock
2번 노드 1705 세션에서 DDL을 수행하는데, 3번 노드의 2381 세션을 기다리는 상황입니다.
-- #sys.log
[2024-02-28T16:07:32.959719] [DDL-1705] [I] Executing DDL: ALTER TABLE TMUSER.TBR_TM_SAME_5GRU_CHATBOT_TREND DROP PARTITION PT_2024022511
...
[2024-02-28T16:38:44.660656] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467
[2024-02-28T16:39:14.605657] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467
[2024-02-28T16:39:45.319655] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467
-- #select * from gv$lock
INST_ID TYPE ID1 ID2 LMODE REQUESTED SESS_ID CTIME
3 WLOCK_DD_OBJ 82 1357311467 1 0 2381 101347
2 WLOCK_DD_OBJ 82 1357311467 1 5 1705 56
SQL
복사
diag dump / sys.log / gv$session
diag dump 확인 결과, 3번 노드의 231 thread 상태가 dead로 나오는 것을 확인할 수 있습니다.
sys.log 확인 결과 session kill force로 thread를 강제 종료한 이력을 확인할 수 있습니다.
참고로, 죽은 thread의 cleaup을 대신 진행하는 2413의 세션 정보를 보면 아직 cleanup 작업을 완료하지 못한 상태이며 state는 running 상태 =입니다.
-- #diag dump
Trace session. ID: 2381 (WTHR)
----------------------------------
Status : RUNNING
User name : TMUSER
V.sessid : 2381
SQL E.T. : 70604100(ms)
SQL ID : 9558962701554134066
SQL ppid : 0
Program : JOB_SCHEDULER
Wait event: NONE
Trace attached thread: 2381
----------------------------------
PID: 9572
Physical TID: 2
Thread status: DEAD
----------------------------------
Callstack dump for sessid[2381] (filename: tbsvr.callstack.9572)
----------------------------------------
-- #sys.log
[2024-02-27T12:47:07.050098] [DDL-1790] [I] Executing DDL: alter system kill session '2381,399323' force
[2024-02-27T12:47:07.050357] [FRM-1790] [I] get SESS_KILL_FORCE req 2381
[2024-02-27T12:47:07.050369] [FRM-1790] [I] SEND SESS_KILL FORCE SIGNAL to 2381
[2024-02-27T12:47:07.050413] [FRM-198] [I] SEND SESS_KILL FORCE SIGNAL to 2381
[2024-02-27T12:47:07.050515] [DDL-1790] [I] DDL execution succeeded
[2024-02-27T12:47:07.051128] [FRM-2413] [I] WTHR(2381) recovery
[2024-02-27T12:47:07.051147] [FRM-2413] [I] thr cleanup WTHR(2381) sess(2381)
-- # select * from gv$session;
INST_ID SID SERIAL# AUDSID USER# USERNAME IPADDR COMMAND STATUS SCHEMA#
SCHEMANAME TYPE SQL_ID SQL_CHILD_NUMBER PREV_SQL_ID PREV_CHILD_NUMBER SQL_ET LOGON_TIME
STATE WLOCK_WAIT WAIT_EVENT WAIT_TIME PGA_USED_MEM SQL_TRACE PROG_NAME CLIENT_PID PID WTHR_ID
OS_THR_ID OSUSER MACHINE TERMINAL MODULE
ACTION CLIENT_INFO CLIENT_IDENTIFIER PDML_ENABLED PDML_STATUS PDDL_STATUS
PQ_STATUS ROW_WAIT_OBJ_ID ROW_WAIT_FILE_NO ROW_WAIT_BLOCK_NO ROW_WAIT_ROW_NO CONSUMER_GROUP CONSUMED_CPU_TIME
3 2413 418817 4294967295 0 SYS 0 READY 0
SYS WTHR 0 2024/02/27 12:47:07
RUNNING -1 0 11384 DISABLED SO RECOVERY(WTHR) 0 9577 4
11518
NO DISABLED ENABLED
ENABLED -1 1835296
SQL
복사
callstack
•
2번 노드 1705 : ddl 수행 세션으로 해당 object에 대해서 exclusive lock을 획득해야 하는데 3번 노드의 2381 세션이 lock을 획득 중이라 대기 중으로 확인됩니다.
•
3번 노드 2381 : 작업 도중 session kill force로 thread가 종료되었습니다.
•
3번 노드 2413 : 죽은 2381 세션의 cleanup 작업을 대신 진행 중으로, 이 cleanup 작업이 완료되어야 2381이 잡고있는 lock이 해제됩니다.
Thread 8 (Thread 0x7f8fbbff7700 (LWP 11518)):
#0 0x00007fbbc2b021ad in nanosleep () from /lib64/libc.so.6
#1 0x00007fbbc2b32ec4 in usleep () from /lib64/libc.so.6
#2 0x0000000001030395 in gca_cancel (lksb=0x7fac18616270, ccc_flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:559
#3 0x0000000001031668 in gca_recover_lock (lksb=0x7fac18616270) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:734
#4 0x00000000010675d8 in ccc_cleanup_lksb (so=0x7fac18616270, tid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lksb.c:211
#5 0x0000000000e1b48c in sess_so_cleanup (sessid=2381) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:251
#6 0x000000000057c884 in wthr_so_cleanup_thr () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:3514
#7 0x00000000005a49e2 in thread_main_chk_bitmask (my_tid=2413, bitmask=...) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:516
#8 0x00000000005ab4cb in svr_wthr_main_internal (my_tid=2413) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:919
#9 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288
#10 0x00007fbbc4d2be25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fbbc2b3b34d in clone () from /lib64/libc.so.6
SQL
복사
diag.log
2024/02/27 12:05:36.502 211 diag file '/data1/NMSMNDB5G3/dump/diag/tb_[CCC_WAIT]_1113_140524-12582964.s1176.n3.diag'(5444 bytes) deleted due to limit
2024/02/27 12:06:21.027 2381 trace request! sessid: 2381, trc: 1 args: 2
2024/02/27 12:06:21.027 211 trace started. id 12583232 type 1
2024/02/27 12:06:21.027 211 [nid=3][LOCAL]CCC WAIT trace rsb [49|511_00339724] for trace id 12583232 from sid 2381
2024/02/27 12:06:21.027 211 [1][CCC MASTER][49|511_00339724] shadow in grq (0): SHR-lock granted (rq=-1, bl=5)
2024/02/27 12:06:21.028 211 trace finished. id 12583232
2024/02/27 12:06:36.028 2381 trace request! sessid: 2381, trc: 1 args: 2
2024/02/27 12:06:36.028 211 trace started. id 12583233 type 1
2024/02/27 12:06:36.028 211 [nid=3][LOCAL]CCC WAIT trace rsb [49|511_00339724] for trace id 12583233 from sid 2381
SQL
복사
diag dump
===== CCC_WAIT, TraceID: 12583270 =====
*** 2024/02/27 12:42:06.040 ***
[nid=2] trace start with rsb id [49|511_00339724] from nid: 3 (timeout cnt=144)
Trace CCC RSB(00007fa0d30ea4c0) timeout count=144
----------------------------------------
rsb<00007fa0d30ea4c0>49|511_00339724:nid 3,f 20000,rf 2,mgr 3,mbl 5,lw 0,wf 5,bf 0,rt(0), cf 0,cgr 3,cbl 5,chi -1,ht(0,0),wt(0,0),dt(0,0), ogt(0,0) wr_nid -1 lw 0 snd 1,rcv 2,bh<0000000000000000>
[0/0] msg type: 2
[1/7] msg type: 0
[2/6] msg type: 0
[3/5] msg type: 0
[4/4] msg type: 0
[5/3] msg type: 0
[6/2] msg type: 0
[7/1] msg type: 0
lkb(00007f9fcc25d9e0):st 1,gr 3,rq -1,bl 5,mt 0,nid 3,exf 10000,f 8000000,sbf 0,rcv 1,snd 1,htsn(0,0),rsb<00007fa0d30ea4c0>[49|511_00339724]
lksb(00007fa141dbf720): owner WTHR(tid 1306, node 2, pid 4136), status -22,flags 0x0, lkb 00007f9fcc25d9e0, priv 00007f9be5219da8
No bh in this lcb(00007fa0d30ea670), rsb(00007fa0d30ea4c0)
Trace session. ID: 1306 (WTHR)
----------------------------------
Status : SESS_CLEANUP
User name : TMUSER
V.sessid : 1306
Program : 5GNMS_PCCLIENT.EXE
Wait event: NONE
Trace attached thread: 1306
----------------------------------
PID: 4136
Physical TID: 7
Thread status: DEAD
----------------------------------
Callstack dump for sessid[1306] (filename: tbsvr.callstack.4136)
----------------------------------
00007fa0d30ea4c0
Redo log flushed tsn: (383949224490)
RSB log flush wait tsn: (0)
RSB in CATH scan list: (0)
[49|511_00339724] CRAS req redo tsn: (383949224610)
[49|511_00339724] CATH req redo tsn: (383949224610), for_ccc (1)
----------------------------------
...
SQL
복사
sys.log
1306에 대한 so recovery 중 SIGSEGV가 발생한 것을 확인할 수 있습니다.
[2024-02-23T17:02:30.025691] [FRM-1306] [I] Signal 'SIGSEGV'(11) caught (TID = 1306, OS_THR_ID = 7367, PID = 4136): si_addr = 0000000000001fd0 : the cause of the signal can be found in a process callstack file.
...
[2024-02-23T17:03:55.995824] [FRM-2408] [I] WTHR(1306) recovery
[2024-02-23T17:03:55.995870] [FRM-2408] [I] thr cleanup WTHR(1306) sess(1306)
[2024-02-23T17:03:55.995939] [FRM-2408] [I] Signal 'SIGSEGV'(11) caught (TID = 2408, OS_THR_ID = 7474, PID = 4297): si_addr = 0000000000000190 : the cause of the signal can be found in a process callstack file.
Plain Text
복사
callstack
tbsvr.callstack.4136 파일 확인 결과, SIGSEGV가 발생했고, 2408 세션에서 1306에 대한 so recovery가 수행되었다가 또 다시 SIGSEGV가 발생한 것을 callstack을 통해서도 확인되었습니다.
Thread 5 (Thread 0x7f7f83ff7700 (LWP 7367)):
#0 0x00007fab9c4781ad in nanosleep () from /lib64/libc.so.6
#1 0x00007fab9c4a8ec4 in usleep () from /lib64/libc.so.6
#2 0x0000000000e4b812 in sync_sig_handler (signo=11, siginfo=0x7f7f7800fd70, context=0x7f7f7800fc40) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_signal.c:1587
#3 <signal handler called>
#4 0x00007fab9c509339 in _memmove_ssse3_back () from /lib64/libc.so.6
#5 0x0000000001cfec06 in sm_stat_new (csr=0x7f9bd978b390, is_pe=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:315
#6 0x0000000000e12c8b in refresh_sql_monitor_stats () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/include/chk_sesskill.h:32
#7 bitq_read_internal (bitmask=..., timeout=0, event_type=WE_ACF_MTX_RW, id1=-1, id2=-1) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/bitqueue.c:453
#8 0x0000000000f33060 in wait_mtx (l=<optimized out>, status=MTX_STS_READ_WAITING, use_so=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:158
#9 0x0000000000f3412a in mtx_rdlock_ex (l=0x7f9897642118, nowait=0 '\000', use_so=224 '\340') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:222
#10 0x00000000010433e1 in ccc_unlock (lksb=0x7fa141dbf720, flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lock.c:4735
#11 0x0000000001031a4f in gca_unlock_internal (buf=0x7f9be5219da8, ccc_flags=8160) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:442
#12 0x00000000018c2760 in tcbuf_pin_cleanup (so=<optimized out>, sessid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/tc/tc_buf.c:1369
#13 0x0000000000e19a20 in sess_csr_so_cleanup (so_list=0x7f9bd978b3a8, sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:158
#14 0x0000000001cd99c0 in csr_close (csr=0x7f9bd978b390, is_recovery=1 '\001') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/csr.c:1802
#15 0x0000000000e1f1ce in so_csr_marker_cleanup (so=0x7f994f482a30, sessid=<optimized out>, flag=224 '\340') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_registry.c:163
#16 0x0000000000e18010 in so_cleanup_after (id=1306, so_list=0x7f7fb4d54ad0, marker=0x7f9bd97843e0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:59
#17 0x00000000005ac84f in svr_wthr_main_internal (my_tid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:955
#18 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288
#19 0x00007fab9e6a1e25 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fab9c4b134d in clone () from /lib64/libc.so.6
SQL
복사
Thread 3 (Thread 0x7f7f8a7c7700 (LWP 7474)):
#0 0x00007fab9c4781ad in nanosleep () from /lib64/libc.so.6
#1 0x00007fab9c4a8ec4 in usleep () from /lib64/libc.so.6
#2 0x0000000000e4b812 in sync_sig_handler (signo=11, siginfo=0x7f7f7400fd70, context=0x7f7f7400fc40) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_signal.c:1587
#3 <signal handler called>
#4 clone_exstat (stat=0x7fab88c75018, exn=0x7fab88c75018) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:204
#5 0x0000000001cfeb7d in sm_stat_new (csr=0x7f9bd978b390, is_pe=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:309
#6 0x0000000000e12c8b in refresh_sql_monitor_stats () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/include/chk_sesskill.h:32
#7 bitq_read_internal (bitmask=..., timeout=0, event_type=WE_ACF_MTX_RW, id1=-1, id2=-1) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/bitqueue.c:453
#8 0x0000000000f33060 in wait_mtx (l=<optimized out>, status=MTX_STS_READ_WAITING, use_so=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:158
#9 0x0000000000f3412a in mtx_rdlock_ex (l=0x7f9897642118, nowait=0 '\000', use_so=95 '') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:222
#10 0x0000000001041e71 in ccc_cancel (lksb=0x7fa141dbf720, flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lock.c:5063
#11 0x00000000010303a0 in gca_cancel (lksb=0x7fa141dbf720, ccc_flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:553
#12 0x0000000001031668 in gca_recover_lock (lksb=0x7fa141dbf720) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:734
#13 0x00000000010675d8 in ccc_cleanup_lksb (so=0x7fa141dbf720, tid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lksb.c:211
#14 0x0000000000e19a20 in sess_csr_so_cleanup (so_list=0x7f9bd978b3a8, sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:158
#15 0x0000000001cd94be in csr_close (csr=0x7f9bd978b390, is_recovery=1 '\001') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/csr.c:1566
#16 0x0000000000e1f1ce in so_csr_marker_cleanup (so=0x7fab88c75018, sessid=<optimized out>, flag=95 '_') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_registry.c:163
#17 0x0000000000e1b48c in sess_so_cleanup (sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:251
#18 0x000000000057c884 in wthr_so_cleanup_thr () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:3514
#19 0x00000000005a49e2 in thread_main_chk_bitmask (my_tid=2408, bitmask=...) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:516
#20 0x00000000005ab4cb in svr_wthr_main_internal (my_tid=2408) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:919
#21 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288
#22 0x00007fab9e6a1e25 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fab9c4b134d in clone () from /lib64/libc.so.6
SQL
복사
원인
종료 대상 세션에 session kill 요청(signal)이 전달되면, 해당 세션은 진행 중이던 작업을 취소하거나 소유 중인 자원을 반납하는 so recovery(=so cleanup) 과정을 수행하게 됩니다.
예를 들어, A 세션이 session kill 시그널을 수신하고, 이를 처리하기 위해 B 스레드가 so recovery를 시도하던 중 SIGSEGV오류가 발생해 실패하게 됩니다.
그 후 C 스레드가 다시 A 세션에 대해 so recovery를 시도하지만, 동일하게 SIGSEGV 오류가 발생하여 실패하게 됩니다.
이러한 반복적인 실패로 인해 A 세션이 점유하고 있던 자원이 정상적으로 해제되지 않았고, 그 결과 해당 세션이 보유하던 lock이 풀리지 않은 상태로 남게 됩니다.
즉, so recovery 과정 중 SIGSEGV 오류로 인해 자원 반납이 정상적으로 이루어지지 않으면서, drop partition 작업 지연 및 select 불가 현상이 발생한 것입니다.
해결
패치를 적용해 해결합니다. (적용패치: 264981c)
주의
티맥스티베로에서 제공하는 기술지원을 통해 패치를 적용합니다.