This past week we experienced severe slowdowns on our Active Data Guard database. I thought I’d document what we saw and the steps we took to remedy the issue. Both primary and secondary databases are 16-node (two-rack) Exadatas running Oracle version 11.2.0.4.
What We Saw – Symptoms
Users were complaining of slow response times for their read-only queries on Active Data Guard, and even simple operations like logging in and querying the GV$/V$ views were slow. We took several hanganalyze dumps and saw the following predominant wait events:
- library cache lock / library cache load lock
- cursor: pin S wait on X
- gc cr request
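For anyone who hasn't collected one, a hanganalyze dump is taken with oradebug from a SYSDBA session. A minimal sketch; level 3 and the cluster-wide "-g all" option are common choices, not requirements:

oradebug setmypid
oradebug -g all hanganalyze 3    -- "-g all" dumps every RAC instance; level 3 gives moderate detail

The resulting trace files land in the diagnostic trace directory of the instance(s) involved.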
We were also seeing ORA-4031 errors on large pool allocations, and there was frequent memory stealing between the DB cache and the large pool. The shared pool on the recovery instance had grown quite large (over 40 GB).
When I looked at SGA/shared pool allocations, I saw:
select pool, name, bytes
from v$sgastat
order by pool, bytes desc;

POOL        NAME                                  BYTES
----------- -------------------------- ----------------
shared pool gcs dynamic r                13,807,768,000
            gcs dynamic s                 8,961,551,424
            free memory                   7,530,946,112
            ASM extent pointer array      3,191,875,096
            ges enqueues                  2,925,579,056
            gcs resources                 1,118,565,984
            Checkpoint queue                983,056,384
            gcs shadows                     774,391,848
. . .
***********                            ----------------
sum                                      41,204,842,496
Querying GV$RESOURCE_LIMIT, we saw that high numbers of "gcs_resources" and "gcs_shadows" had been, and were still being, used on instances 10 and 11 (MRP/recovery had been running on instance 10 and was moved to instance 11 to try to mitigate the issue):
 INST                                    CURRENT         MAX    INITIAL      LIMIT
   ID RESOURCE_NAME                  UTILIZATION UTILIZATION ALLOCATION      VALUE
----- ------------------------------ ----------- ----------- ---------- ----------
    1 gcs_resources                          800     3594418    9923100  UNLIMITED
    1 gcs_shadows                            800     4341615    9923100  UNLIMITED
    2 gcs_resources                          800     3635583    9923100  UNLIMITED
    2 gcs_shadows                            800     4310007    9923100  UNLIMITED
    3 gcs_resources                          800     3673520    9923100  UNLIMITED
    3 gcs_shadows                            800     4425169    9923100  UNLIMITED
    4 gcs_resources                          800     1154037    9923100  UNLIMITED
    4 gcs_shadows                            800     2249026    9923100  UNLIMITED
    5 gcs_resources                          800     3259731    9923100  UNLIMITED
    5 gcs_shadows                            800     3956792    9923100  UNLIMITED
    6 gcs_resources                          800     3236723    9923100  UNLIMITED
    6 gcs_shadows                            800     3962286    9923100  UNLIMITED
    7 gcs_resources                          800     3251719    9923100  UNLIMITED
    7 gcs_shadows                            800     3944630    9923100  UNLIMITED
    8 gcs_resources                          800      967050    9923100  UNLIMITED
    8 gcs_shadows                            800     1961536    9923100  UNLIMITED
    9 gcs_resources                         1200     3237116    9891090  UNLIMITED
    9 gcs_shadows                           1200     3914301    9891090  UNLIMITED
   10 gcs_resources                         1200    70984040    5377680  UNLIMITED
   10 gcs_shadows                           1200    67610640    5377680  UNLIMITED
   11 gcs_resources                     18004454    18041094    9891090  UNLIMITED
   11 gcs_shadows                       28291646    28651690    9891090  UNLIMITED
   12 gcs_resources                         1200     3208902    9891090  UNLIMITED
   12 gcs_shadows                           1200     4040165    9891090  UNLIMITED
   13 gcs_resources                         1200     3182042    9891090  UNLIMITED
   13 gcs_shadows                           1200     3883302    9891090  UNLIMITED
   14 gcs_resources                         1200     3154007    9891090  UNLIMITED
   14 gcs_shadows                           1200     3899527    9891090  UNLIMITED
   15 gcs_resources                         1200     3146403    9891090  UNLIMITED
   15 gcs_shadows                           1200     3924065    9891090  UNLIMITED
   16 gcs_resources                         1200     2986005    9891090  UNLIMITED
   16 gcs_shadows                           1200     3839600    9891090  UNLIMITED
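For reference, the output above comes from a query along these lines (the column formatting is ours):

select inst_id, resource_name, current_utilization,
       max_utilization, initial_allocation, limit_value
from   gv$resource_limit
where  resource_name in ('gcs_resources', 'gcs_shadows')
order  by inst_id, resource_name;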
We also saw the following recurring messages about 11 seconds apart in the alert.logs on multiple instances:
Wed Sep 28 09:51:27 2016
Global flush: getting parseLock
Global flush: getting parseLock
Wed Sep 28 09:51:38 2016
Global flush: getting parseLock
Global flush: getting parseLock
We are used to seeing the following sequence in our alert.log:
Global flush: getting parseLock
Start global flush...
Complete global flush...
However, the previous night was the last time we had seen a "Complete global flush" message, and starting at 04:04 that morning we got the following "redo cache is full" message:
Wed Sep 28 04:04:12 2016
Recovery: global flush, redo cache is full
To compound the problem, we also had user sessions running on the recovery node. This caused additional contention as it took longer to serve GC requests.
We were also experiencing recovery (apply) lag.
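A quick way to quantify that lag on the standby is V$DATAGUARD_STATS; a sketch of the kind of check we ran (view and column names as documented in 11.2):

select name, value, unit, time_computed
from   v$dataguard_stats
where  name in ('apply lag', 'transport lag');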
What We Found Out
In Active Data Guard, the node that MRP runs on (the recovery node) becomes the master for all blocks, so the GCS regions in the shared pool on that one instance become very large. Every time the shared pool grew, it triggered large flushes (up to 1,000 objects at a time, according to X$KSMLRU). This accounted for our ORA-4031 errors on large pool allocations, as well as for the severe library cache and shared pool contention.
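X$KSMLRU is where we saw those flush counts. Note that this fixed table is reset every time it is selected from, so capture the output in one pass; a hedged sketch of the check, run as SYS:

select ksmlrcom,  -- what the allocation that forced the flush was for
       ksmlrsiz,  -- size of that allocation in bytes
       ksmlrnum   -- number of objects flushed from the shared pool to satisfy it
from   x$ksmlru
where  ksmlrnum > 0
order  by ksmlrnum desc;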
What We Did to Address the Issue
The first thing we did was to migrate all users/services off of nodes 10 and 11. We decided that until we got more definitive answers from Oracle Support, we would dedicate nodes 10 and 11 to recovery. By keeping users off of both nodes, we could then "toggle" MRP back and forth between the two nodes and tweak our configuration without impacting users.
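Mechanically, the toggling looked roughly like this (a sketch; the database, service, and instance names below are placeholders, not our real ones):

-- Move user services off the recovery nodes (11.2 srvctl syntax)
$ srvctl relocate service -d stbydb -s app_ro_svc -i stbydb10 -t stbydb3

-- On the instance currently running MRP, stop recovery:
SQL> alter database recover managed standby database cancel;

-- On the other dedicated node, restart real-time apply:
SQL> alter database recover managed standby database using current logfile disconnect;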
Based on our high-water marks for GCS allocations, Support recommended that we increase the resource limits for those objects and also increase the shared pool and SGA sizes on the two recovery nodes (nodes 10 and 11). Here are the values we used:
_gcs_resources=80000000
_gcs_shadow_locks=80000000
shared_pool_size=48G
db_cache_size=24G
sga_target=84G
sga_max_size=90G
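Because the first two are hidden (underscore) parameters, they must be double-quoted in ALTER SYSTEM, and since sga_max_size cannot change on a running instance, everything went into the spfile ahead of the restart. A sketch, assuming placeholder instance names dbm10 and dbm11:

alter system set "_gcs_resources"    = 80000000 scope=spfile sid='dbm10';
alter system set "_gcs_shadow_locks" = 80000000 scope=spfile sid='dbm10';
alter system set shared_pool_size    = 48G scope=spfile sid='dbm10';
alter system set db_cache_size       = 24G scope=spfile sid='dbm10';
alter system set sga_target          = 84G scope=spfile sid='dbm10';
alter system set sga_max_size        = 90G scope=spfile sid='dbm10';
-- repeat the same six statements with sid='dbm11' for the other recovery instance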
We made the changes and then restarted both instances and MRP/recovery.
Results
After those changes, things stabilized and MRP caught up. Users reported no further problems with their queries.
Additional Info
There is a patch for bug 17597704 that allows blocks to be “multi-mastered” (mastered on all nodes in the cluster, instead of just on the recovery node). We opted not to apply and enable this fix as we felt that it was potentially trading one problem for another. In other words, instead of going to the recovery nodes for the block masters, queries running on one node would potentially have to visit any of the other 15 nodes. Since performance was fine with mastering on the recovery node, we left this one alone.
In addition, in 11.2 these GCS-related allocations are permanent objects and not freeable. There is a patch (21863727) in 12.1.0.2 that makes these objects freeable, but unfortunately there is no remedy for 11.2 at this time. This explains why flushes of the shared pool did not help.