The Problem

1. Several disks were added to an existing diskgroup in a RAC environment, and the sqlplus session that initiated the add operation never returns control; it has to be disconnected manually.
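For reference, the add operation was of the following general form (the diskgroup name DATA matches the alert.log excerpt further down; the device paths and rebalance power are illustrative, not from the actual system):

SQL> alter diskgroup DATA add disk '/dev/xvdev1', '/dev/xvdev2' rebalance power 6;

In this scenario the statement hangs and the session never regains control.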

2. No rebalance operation is visible in gv$asm_operation:

SQL> select * from gv$asm_operation;
no rows selected

3. The “disk validation pending” message is visible on the other nodes, but there is no “SUCCESS: refreshed membership” message in the ASM alert.log:

Tue Aug 27 23:32:36 2013
NOTE: disk validation pending for group 2/0x75fe02b8 (DATA)
Wed Aug 28 05:28:52 2013

4. The RBAL trace shows the following message repeatedly:

kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)

Note: “DD.0.0” refers to the dismounted disk discovery enqueue, and “6” is the exclusive lock mode.

5. Querying v$asm_disk and v$asm_diskgroup hangs, but querying the v$asm_disk_stat and v$asm_diskgroup_stat views works.
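The *_stat views return the last-known, cached values and do not trigger a new disk discovery, so they can be queried safely while v$asm_disk is blocked. A query of this shape (the column aliases are illustrative) produces the kind of output shown next:

select group_number gn, disk_number dn, mount_status, header_status,
       mode_status, state, name dname
from   v$asm_disk_stat;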

Example v$asm_disk_stat output for the new devices; note the “ADDING” state:

GN DN    m_status    h_status     mo_status     state     dname
 2   10  OPENED      MEMBER       SYNCING       ADDING    DATA_0010   
 2   11  OPENED      MEMBER       SYNCING       ADDING    DATA_0011

The Solution

1. One of the sessions holding the dismounted disk discovery enqueue “DD-00000000-00000000” in exclusive mode is waiting on “kfk: async disk IO” indefinitely.

This process blocks RBAL from acquiring the same enqueue (the dismounted disk discovery enqueue) for the new devices being added on the other nodes of the RAC environment, which is why the following message repeats in the RBAL trace:

kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)

This can be checked with the SQL script (asm_blocking.sql) given below:

set linesize 200
set pagesize 1000

column username format a10
column mod format a20
column blocker format a7
column waiter format a7
column lmode format 9999
column request format 9999
column I format 99
column sid format 9999
col username format a6
col osuser format a8
col s# format 99999
col CS_pid format a13
col pname format a10
col program format a20
col waitsec format 999,999,999
col pid format 9999
--col p1 format 9999
col p2 format a20
col sql format a20

spool locking_information
prompt ########################
prompt # Blocking Information #
prompt ########################
select  b.inst_id||'/'||b.sid blocker,
--      s.module,
        w.inst_id||'/'||w.sid waiter,
        b.type,
        b.id1,
        b.id2,
        b.lmode,
        w.request
from    gv$lock b,
        ( select inst_id, sid, type, id1, id2, lmode, request
          from   gv$lock  where request > 0 ) w
--      gv$session s
where   b.lmode > 0
and     ( b.id1 = w.id1 and b.id2 = w.id2 and b.type = w.type )
--and   ( b.sid = s.sid and b.inst_id = s.inst_id )
order by b.inst_id, b.sid
/
prompt ##########################
prompt # Rebalance Information  #
prompt ##########################
select * from gv$asm_operation
/
prompt ########################
prompt # Locking Information  #
prompt ########################
select a.type, a.id1, a.id2, a.lmode, a.request, a.inst_id inst, a.sid,
case when a.type='DD' and a.id1=0 and a.id2=0 and a.lmode=6
     then 'Dismounted DD enq holder'
end "Dismounted DD enq holder",
s.status, s.module program, s.osuser ,  -- use p.program instead of s.module in 10g/11gR1
substr(w.event, 1, 30) wait_event, w.seconds_in_wait waitsec, w.p1,
case
  when w.event='DFS lock handle' and w.p2=38 then 'ASM diskgroup discovery wait'
  when w.event='DFS lock handle' and w.p2=39 then 'ASM diskgroup release'
  when w.event='DFS lock handle' and w.p2=40 then 'ASM push DB updates'
  when w.event='DFS lock handle' and w.p2=41 then 'ASM add ACD chunk'
  when w.event='DFS lock handle' and w.p2=42 then 'ASM map resize message'
  when w.event='DFS lock handle' and w.p2=43 then 'ASM map lock message'
  when w.event='DFS lock handle' and w.p2=44 then 'ASM map unlock message (phase 1)'
  when w.event='DFS lock handle' and w.p2=45 then 'ASM map unlock message (phase 2)'
  when w.event='DFS lock handle' and w.p2=46 then 'ASM generate add disk redo marker'
  when w.event='DFS lock handle' and w.p2=47 then 'ASM check of PST validity'
  when w.event='DFS lock handle' and w.p2=48 then 'ASM offline disk CIC'
  when w.event='DFS lock handle' and w.p2=52 then 'ASM F1X0 relocation'
  when w.event='DFS lock handle' and w.p2=55 then 'ASM disk operation message'
  when w.event='DFS lock handle' and w.p2=56 then 'ASM I/O error emulation'
  when w.event='DFS lock handle' and w.p2=60 then 'ASM Pre-Existing Extent Lock wait'
  when w.event='DFS lock handle' and w.p2=61 then 'Perform a ksk action through DBWR'
  when w.event='DFS lock handle' and w.p2=62 then 'ASM diskgroup refresh wait'
  else to_char(w.p2)
end  p2 , substr(q.sql_text, 1, 100) sql
from gv$lock a, gv$session s , gv$process p , gv$session_wait w , gv$sqlarea q
where   ( a.inst_id = s.inst_id and a.sid = s.sid )
and     ( s.paddr = p.addr and s.inst_id = p.inst_id )
and     ( s.inst_id = w.inst_id and s.sid = w.sid )
and     ( s.inst_id = q.inst_id(+) and s.sql_address = q.address(+) )
order by s.inst_id, s.sid --, s.audsid
/
spool off
exit

Sample output:

########################
# Locking Information  #
########################

TY        ID1        ID2 LMODE REQUEST  INST   SID Dismounted DD enq holder
-- ---------- ---------- ----- ------- ----- ----- ------------------------
DD          0          0     6       0     2   182 Dismounted DD enq holder

Note that ID1 and ID2 are “0” (i.e. DD-00000000-00000000) and LMODE is “6”, which is exclusive mode.

2. One of the devices being added to the affected diskgroup shows near 100% utilization. For example, in the following “iostat -xt 2” output, xvdev is one of the devices being added:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdev             0.00     0.00    0.00    0.00     0.00     0.00     0.00     8.00    0.00   0.00 100.00

Follow the steps outlined below to resolve the issue:

1. Fix the device showing near 100% utilization at the OS or storage level.

2. After fixing the device in question, simulate the same scenario by creating a dummy diskgroup on the new devices, as described in note 557348.1. Then run the asm_blocking.sql script given above to check whether any process holds “DD-00000000-00000000” for a long time. If the new DUMMY diskgroup can be created without any issue, the same situation will not recur.
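As a sketch of the verification (the DUMMY name, redundancy, and device path are assumptions; follow note 557348.1 for the exact procedure):

SQL> create diskgroup DUMMY external redundancy disk '/dev/xvdev1';

If the create completes normally, drop the test diskgroup afterwards so the devices become available again:

SQL> drop diskgroup DUMMY including contents;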

3. Re-initiate rebalance for the diskgroup if rebalance does not start automatically after the storage issue is fixed at the OS level:

SQL>  alter diskgroup DATA rebalance power 6;
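Once rebalance is running again, its progress can be tracked from gv$asm_operation; the EST_MINUTES column gives the estimated time remaining:

SQL> select inst_id, group_number, operation, state, power, sofar, est_work, est_minutes
  2  from gv$asm_operation;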