기술 정보
home
채널 소개
home

drop partition 작업 지연 및 select 수행 불가 현상

문서 유형
장애 해결
분야
관리/환경설정
키워드
drop partition
select
적용 제품 버전
6FS06
1 more property

현상

drop partition 수행 후, sys.log 상에서 DDD Timeout 로그가 다량 발생하면서 drop partition 작업이 지연되고, select 수행이 안되는 현상이 발생했습니다.
위와 같은 현상 확인을 위해 최소한 다음과 같은 로그 및 정보를 수집합니다.
diag.log / diag dump
ilog
sys.log
callstack / .out
lock & session 관련 뷰 : v$lock, v$session, v$transaction, v$waiter_session, v$blocker_session
TAC 환경이라면, 각 노드별로 로그를 추출해야 하며  global 뷰로 조회해야 합니다.
수집한 로그를 통한 지연 현상의 원인을 확인하는 과정은 다음과 같습니다.

sys.log / gv$lock

2번 노드 1705 세션에서 DDL을 수행하는데, 3번 노드의 2381 세션을 기다리는 상황입니다.
-- #sys.log [2024-02-28T16:07:32.959719] [DDL-1705] [I] Executing DDL: ALTER TABLE TMUSER.TBR_TM_SAME_5GRU_CHATBOT_TREND DROP PARTITION PT_2024022511 ... [2024-02-28T16:38:44.660656] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467 [2024-02-28T16:39:14.605657] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467 [2024-02-28T16:39:45.319655] [ACF-214] [I] DDD timeouted! tid 1705, lkb 00007f9c28e5ba50, rsb 00007f9c78dec0a0, 1:16-82-1357311467 -- #select * from gv$lock INST_ID TYPE ID1 ID2 LMODE REQUESTED SESS_ID CTIME 3 WLOCK_DD_OBJ 82 1357311467 1 0 2381 101347 2 WLOCK_DD_OBJ 82 1357311467 1 5 1705 56
SQL
복사

diag dump / sys.log / gv$session

diag dump 확인 결과, 3번 노드의 231 thread 상태가 dead로 나오는 것을 확인할 수 있습니다.
sys.log 확인 결과 session kill force로 thread를 강제 종료한 이력을 확인할 수 있습니다.
참고로, 죽은 thread의 cleaup을 대신 진행하는 2413의 세션 정보를 보면 아직 cleanup 작업을 완료하지 못한 상태이며 state는 running 상태 =입니다.
-- #diag dump Trace session. ID: 2381 (WTHR) ---------------------------------- Status : RUNNING User name : TMUSER V.sessid : 2381 SQL E.T. : 70604100(ms) SQL ID : 9558962701554134066 SQL ppid : 0 Program : JOB_SCHEDULER Wait event: NONE Trace attached thread: 2381 ---------------------------------- PID: 9572 Physical TID: 2 Thread status: DEAD ---------------------------------- Callstack dump for sessid[2381] (filename: tbsvr.callstack.9572) ---------------------------------------- -- #sys.log [2024-02-27T12:47:07.050098] [DDL-1790] [I] Executing DDL: alter system kill session '2381,399323' force [2024-02-27T12:47:07.050357] [FRM-1790] [I] get SESS_KILL_FORCE req 2381 [2024-02-27T12:47:07.050369] [FRM-1790] [I] SEND SESS_KILL FORCE SIGNAL to 2381 [2024-02-27T12:47:07.050413] [FRM-198] [I] SEND SESS_KILL FORCE SIGNAL to 2381 [2024-02-27T12:47:07.050515] [DDL-1790] [I] DDL execution succeeded [2024-02-27T12:47:07.051128] [FRM-2413] [I] WTHR(2381) recovery [2024-02-27T12:47:07.051147] [FRM-2413] [I] thr cleanup WTHR(2381) sess(2381) -- # select * from gv$session; INST_ID SID SERIAL# AUDSID USER# USERNAME IPADDR COMMAND STATUS SCHEMA# SCHEMANAME TYPE SQL_ID SQL_CHILD_NUMBER PREV_SQL_ID PREV_CHILD_NUMBER SQL_ET LOGON_TIME STATE WLOCK_WAIT WAIT_EVENT WAIT_TIME PGA_USED_MEM SQL_TRACE PROG_NAME CLIENT_PID PID WTHR_ID OS_THR_ID OSUSER MACHINE TERMINAL MODULE ACTION CLIENT_INFO CLIENT_IDENTIFIER PDML_ENABLED PDML_STATUS PDDL_STATUS PQ_STATUS ROW_WAIT_OBJ_ID ROW_WAIT_FILE_NO ROW_WAIT_BLOCK_NO ROW_WAIT_ROW_NO CONSUMER_GROUP CONSUMED_CPU_TIME 3 2413 418817 4294967295 0 SYS 0 READY 0 SYS WTHR 0 2024/02/27 12:47:07 RUNNING -1 0 11384 DISABLED SO RECOVERY(WTHR) 0 9577 4 11518 NO DISABLED ENABLED ENABLED -1 1835296
SQL
복사

callstack

2번 노드 1705 : ddl 수행 세션으로 해당 object에 대해서 exclusive lock을 획득해야 하는데 3번 노드의 2381 세션이 lock을 획득 중이라 대기 중으로 확인됩니다.
3번 노드 2381 : 작업 도중 session kill force로 thread가 종료되었습니다.
3번 노드 2413 : 죽은 2381 세션의 cleanup 작업을 대신 진행 중으로, 이 cleanup 작업이 완료되어야 2381이 잡고있는 lock이 해제됩니다.
Thread 8 (Thread 0x7f8fbbff7700 (LWP 11518)): #0 0x00007fbbc2b021ad in nanosleep () from /lib64/libc.so.6 #1 0x00007fbbc2b32ec4 in usleep () from /lib64/libc.so.6 #2 0x0000000001030395 in gca_cancel (lksb=0x7fac18616270, ccc_flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:559 #3 0x0000000001031668 in gca_recover_lock (lksb=0x7fac18616270) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:734 #4 0x00000000010675d8 in ccc_cleanup_lksb (so=0x7fac18616270, tid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lksb.c:211 #5 0x0000000000e1b48c in sess_so_cleanup (sessid=2381) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:251 #6 0x000000000057c884 in wthr_so_cleanup_thr () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:3514 #7 0x00000000005a49e2 in thread_main_chk_bitmask (my_tid=2413, bitmask=...) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:516 #8 0x00000000005ab4cb in svr_wthr_main_internal (my_tid=2413) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:919 #9 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288 #10 0x00007fbbc4d2be25 in start_thread () from /lib64/libpthread.so.0 #11 0x00007fbbc2b3b34d in clone () from /lib64/libc.so.6
SQL
복사

diag.log

2024/02/27 12:05:36.502 211 diag file '/data1/NMSMNDB5G3/dump/diag/tb_[CCC_WAIT]_1113_140524-12582964.s1176.n3.diag'(5444 bytes) deleted due to limit 2024/02/27 12:06:21.027 2381 trace request! sessid: 2381, trc: 1 args: 2 2024/02/27 12:06:21.027 211 trace started. id 12583232 type 1 2024/02/27 12:06:21.027 211 [nid=3][LOCAL]CCC WAIT trace rsb [49|511_00339724] for trace id 12583232 from sid 2381 2024/02/27 12:06:21.027 211 [1][CCC MASTER][49|511_00339724] shadow in grq (0): SHR-lock granted (rq=-1, bl=5) 2024/02/27 12:06:21.028 211 trace finished. id 12583232 2024/02/27 12:06:36.028 2381 trace request! sessid: 2381, trc: 1 args: 2 2024/02/27 12:06:36.028 211 trace started. id 12583233 type 1 2024/02/27 12:06:36.028 211 [nid=3][LOCAL]CCC WAIT trace rsb [49|511_00339724] for trace id 12583233 from sid 2381
SQL
복사

diag dump

===== CCC_WAIT, TraceID: 12583270 ===== *** 2024/02/27 12:42:06.040 *** [nid=2] trace start with rsb id [49|511_00339724] from nid: 3 (timeout cnt=144) Trace CCC RSB(00007fa0d30ea4c0) timeout count=144 ---------------------------------------- rsb<00007fa0d30ea4c0>49|511_00339724:nid 3,f 20000,rf 2,mgr 3,mbl 5,lw 0,wf 5,bf 0,rt(0), cf 0,cgr 3,cbl 5,chi -1,ht(0,0),wt(0,0),dt(0,0), ogt(0,0) wr_nid -1 lw 0 snd 1,rcv 2,bh<0000000000000000> [0/0] msg type: 2 [1/7] msg type: 0 [2/6] msg type: 0 [3/5] msg type: 0 [4/4] msg type: 0 [5/3] msg type: 0 [6/2] msg type: 0 [7/1] msg type: 0 lkb(00007f9fcc25d9e0):st 1,gr 3,rq -1,bl 5,mt 0,nid 3,exf 10000,f 8000000,sbf 0,rcv 1,snd 1,htsn(0,0),rsb<00007fa0d30ea4c0>[49|511_00339724] lksb(00007fa141dbf720): owner WTHR(tid 1306, node 2, pid 4136), status -22,flags 0x0, lkb 00007f9fcc25d9e0, priv 00007f9be5219da8 No bh in this lcb(00007fa0d30ea670), rsb(00007fa0d30ea4c0) Trace session. ID: 1306 (WTHR) ---------------------------------- Status : SESS_CLEANUP User name : TMUSER V.sessid : 1306 Program : 5GNMS_PCCLIENT.EXE Wait event: NONE Trace attached thread: 1306 ---------------------------------- PID: 4136 Physical TID: 7 Thread status: DEAD ---------------------------------- Callstack dump for sessid[1306] (filename: tbsvr.callstack.4136) ---------------------------------- 00007fa0d30ea4c0 Redo log flushed tsn: (383949224490) RSB log flush wait tsn: (0) RSB in CATH scan list: (0) [49|511_00339724] CRAS req redo tsn: (383949224610) [49|511_00339724] CATH req redo tsn: (383949224610), for_ccc (1) ---------------------------------- ...
SQL
복사

sys.log

1306에 대한 so recovery 중 SIGSEGV가 발생한 것을 확인할 수 있습니다.
[2024-02-23T17:02:30.025691] [FRM-1306] [I] Signal 'SIGSEGV'(11) caught (TID = 1306, OS_THR_ID = 7367, PID = 4136): si_addr = 0000000000001fd0 : the cause of the signal can be found in a process callstack file. ... [2024-02-23T17:03:55.995824] [FRM-2408] [I] WTHR(1306) recovery [2024-02-23T17:03:55.995870] [FRM-2408] [I] thr cleanup WTHR(1306) sess(1306) [2024-02-23T17:03:55.995939] [FRM-2408] [I] Signal 'SIGSEGV'(11) caught (TID = 2408, OS_THR_ID = 7474, PID = 4297): si_addr = 0000000000000190 : the cause of the signal can be found in a process callstack file.
Plain Text
복사

callstack

tbsvr.callstack.4136 파일 확인 결과, SIGSEGV가 발생했고, 2408 세션에서 1306에 대한 so recovery가 수행되었다가 또 다시 SIGSEGV가 발생한 것을 callstack을 통해서도 확인되었습니다.
Thread 5 (Thread 0x7f7f83ff7700 (LWP 7367)): #0 0x00007fab9c4781ad in nanosleep () from /lib64/libc.so.6 #1 0x00007fab9c4a8ec4 in usleep () from /lib64/libc.so.6 #2 0x0000000000e4b812 in sync_sig_handler (signo=11, siginfo=0x7f7f7800fd70, context=0x7f7f7800fc40) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_signal.c:1587 #3 <signal handler called> #4 0x00007fab9c509339 in _memmove_ssse3_back () from /lib64/libc.so.6 #5 0x0000000001cfec06 in sm_stat_new (csr=0x7f9bd978b390, is_pe=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:315 #6 0x0000000000e12c8b in refresh_sql_monitor_stats () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/include/chk_sesskill.h:32 #7 bitq_read_internal (bitmask=..., timeout=0, event_type=WE_ACF_MTX_RW, id1=-1, id2=-1) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/bitqueue.c:453 #8 0x0000000000f33060 in wait_mtx (l=<optimized out>, status=MTX_STS_READ_WAITING, use_so=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:158 #9 0x0000000000f3412a in mtx_rdlock_ex (l=0x7f9897642118, nowait=0 '\000', use_so=224 '\340') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:222 #10 0x00000000010433e1 in ccc_unlock (lksb=0x7fa141dbf720, flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lock.c:4735 #11 0x0000000001031a4f in gca_unlock_internal (buf=0x7f9be5219da8, ccc_flags=8160) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:442 #12 0x00000000018c2760 in tcbuf_pin_cleanup (so=<optimized out>, sessid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/tc/tc_buf.c:1369 #13 0x0000000000e19a20 in sess_csr_so_cleanup (so_list=0x7f9bd978b3a8, sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:158 #14 0x0000000001cd99c0 in csr_close (csr=0x7f9bd978b390, is_recovery=1 '\001') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/csr.c:1802 #15 0x0000000000e1f1ce in so_csr_marker_cleanup (so=0x7f994f482a30, sessid=<optimized out>, flag=224 '\340') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_registry.c:163 #16 0x0000000000e18010 in so_cleanup_after (id=1306, so_list=0x7f7fb4d54ad0, marker=0x7f9bd97843e0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:59 #17 0x00000000005ac84f in svr_wthr_main_internal (my_tid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:955 #18 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288 #19 0x00007fab9e6a1e25 in start_thread () from /lib64/libpthread.so.0 #20 0x00007fab9c4b134d in clone () from /lib64/libc.so.6
SQL
복사
Thread 3 (Thread 0x7f7f8a7c7700 (LWP 7474)): #0 0x00007fab9c4781ad in nanosleep () from /lib64/libc.so.6 #1 0x00007fab9c4a8ec4 in usleep () from /lib64/libc.so.6 #2 0x0000000000e4b812 in sync_sig_handler (signo=11, siginfo=0x7f7f7400fd70, context=0x7f7f7400fc40) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_signal.c:1587 #3 <signal handler called> #4 clone_exstat (stat=0x7fab88c75018, exn=0x7fab88c75018) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:204 #5 0x0000000001cfeb7d in sm_stat_new (csr=0x7f9bd978b390, is_pe=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/sql_monitor.c:309 #6 0x0000000000e12c8b in refresh_sql_monitor_stats () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/include/chk_sesskill.h:32 #7 bitq_read_internal (bitmask=..., timeout=0, event_type=WE_ACF_MTX_RW, id1=-1, id2=-1) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/bitqueue.c:453 #8 0x0000000000f33060 in wait_mtx (l=<optimized out>, status=MTX_STS_READ_WAITING, use_so=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:158 #9 0x0000000000f3412a in mtx_rdlock_ex (l=0x7f9897642118, nowait=0 '\000', use_so=95 '') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/acf/acf_mtx.c:222 #10 0x0000000001041e71 in ccc_cancel (lksb=0x7fa141dbf720, flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lock.c:5063 #11 0x00000000010303a0 in gca_cancel (lksb=0x7fa141dbf720, ccc_flags=0) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:553 #12 0x0000000001031668 in gca_recover_lock (lksb=0x7fa141dbf720) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_gca.c:734 #13 0x00000000010675d8 in ccc_cleanup_lksb (so=0x7fa141dbf720, tid=<optimized out>, flag=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ccc/ccc_lksb.c:211 #14 0x0000000000e19a20 in sess_csr_so_cleanup (so_list=0x7f9bd978b3a8, sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:158 #15 0x0000000001cd94be in csr_close (csr=0x7f9bd978b390, is_recovery=1 '\001') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/ex/csr.c:1566 #16 0x0000000000e1f1ce in so_csr_marker_cleanup (so=0x7fab88c75018, sessid=<optimized out>, flag=95 '_') at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_registry.c:163 #17 0x0000000000e1b48c in sess_so_cleanup (sessid=1306) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/so_cleanup.c:251 #18 0x000000000057c884 in wthr_so_cleanup_thr () at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:3514 #19 0x00000000005a49e2 in thread_main_chk_bitmask (my_tid=2408, bitmask=...) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:516 #20 0x00000000005ab4cb in svr_wthr_main_internal (my_tid=2408) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/body/tbsvr_wthr.c:919 #21 0x0000000000e31de0 in wthr_init (args=<optimized out>) at /data/autodist/TLeft_263087/tibero6/src/tbsvr/frame/tbsvr_cthr.c:3288 #22 0x00007fab9e6a1e25 in start_thread () from /lib64/libpthread.so.0 #23 0x00007fab9c4b134d in clone () from /lib64/libc.so.6
SQL
복사

원인

종료 대상 세션에 session kill 요청(signal)이 전달되면, 해당 세션은 진행 중이던 작업을 취소하거나 소유 중인 자원을 반납하는 so recovery(=so cleanup) 과정을 수행하게 됩니다.
예를 들어, A 세션이 session kill 시그널을 수신하고, 이를 처리하기 위해 B 스레드가 so recovery를 시도하던 중 SIGSEGV오류가 발생해 실패하게 됩니다.
그 후 C 스레드가 다시 A 세션에 대해 so recovery를 시도하지만, 동일하게 SIGSEGV 오류가 발생하여 실패하게 됩니다.
이러한 반복적인 실패로 인해 A 세션이 점유하고 있던 자원이 정상적으로 해제되지 않았고, 그 결과 해당 세션이 보유하던 lock이 풀리지 않은 상태로 남게 됩니다.
즉, so recovery 과정 중 SIGSEGV 오류로 인해 자원 반납이 정상적으로 이루어지지 않으면서, drop partition 작업 지연 및 select 불가 현상이 발생한 것입니다.

해결

패치를 적용해 해결합니다. (적용패치: 264981c)
주의
티맥스티베로에서 제공하는 기술지원을 통해 패치를 적용합니다.