High Load on Oracle Standby: I/O and RECOVERY_PARALLELISM

After moving to a new server, we started receiving pages for spikes of high load on our physical standby. This was not an Active Data Guard configuration: it was a regular physical standby, mounted and not open, so there were no user sessions that could cause this load.

Aside from backups, nothing else ran on this server, and the timing of the spikes did not match the backup windows. Yet the system would suddenly show the following:

  • Load average spiking up, close to 100 (on a 16-core single-thread server)
  • %iowait dominating
  • CPU mostly idle

Here is what sar showed during one such spike:

## Load Average
08:50:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
09:00:01 AM         1      1054      0.16      0.13      0.15         1
09:10:01 AM         1      2346    570.22    255.90    101.66       746
09:20:01 AM         1      1149      1.45     76.85     96.19         0


### CPU behavior 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.48    0.00   13.33   77.82    0.00    3.36

Observations

At first glance, this seemed weird: high load without CPU usage. However, the sar man page gives the following definitions, which help make sense of it.

%iowait – Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

blocked – Number of tasks currently blocked, waiting for I/O to complete.

So the things to note are

  • The spike in load (ldavg-1 peaked at 570.22)
  • 746 blocked processes (these are processes that were waiting for I/O to complete)
  • The CPU run queue is low (runq-sz ~ 1)
  • %system is higher than %user, and %iowait is above 77%
  • Clearly, the CPU is not a bottleneck.
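
These numbers fit together once you recall that the Linux load average counts tasks in uninterruptible sleep (typically waiting on disk I/O) along with runnable ones, so hundreds of blocked tasks inflate the load even while the run queue stays near 1. Here is a minimal Python sketch of that reading, using the three sar rows above. The function names and the thresholds (load above the core count, more blocked than runnable tasks) are our own illustrative choices, not anything sar defines.

```python
# Data rows copied from the sar -q output above (header line omitted).
SAR_Q = """\
09:00:01 AM         1      1054      0.16      0.13      0.15         1
09:10:01 AM         1      2346    570.22    255.90    101.66       746
09:20:01 AM         1      1149      1.45     76.85     96.19         0
"""

CORES = 16  # the standby server has 16 cores


def parse_sar_q(text):
    """Turn sar -q data rows into dicts; skips lines that do not have 8 fields."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue
        rows.append({
            "time": f"{parts[0]} {parts[1]}",
            "runq_sz": int(parts[2]),
            "plist_sz": int(parts[3]),
            "ldavg_1": float(parts[4]),
            "ldavg_5": float(parts[5]),
            "ldavg_15": float(parts[6]),
            "blocked": int(parts[7]),
        })
    return rows


def io_bound_spikes(rows, cores=CORES):
    """Illustrative heuristic: load exceeds the core count while
    blocked (I/O-waiting) tasks outnumber runnable ones."""
    return [r for r in rows
            if r["ldavg_1"] > cores and r["blocked"] > r["runq_sz"]]


spikes = io_bound_spikes(parse_sar_q(SAR_Q))
for r in spikes:
    print(r["time"], "ldavg-1:", r["ldavg_1"], "blocked:", r["blocked"])
```

Only the 09:10:01 sample passes both conditions, which matches the I/O-bound picture: load far above the core count with almost nothing on the run queue.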

Each spike would last barely 5 minutes, but the timing was random. So what was going on? Was there an issue with our I/O, i.e. storage? We looked at iostat next and noticed the following metrics.

Device     r/s   rkB/s   await   aqu-sz   %util
nvme0n5   3000   25256   53.56   160.78   100
nvme0n6   3389   29752   36.38   123.29    88
nvme0n9   3625   31450   30–50   117.88    85

  • await – roughly 30 to 55 ms in this sample. This is the total time an I/O takes, i.e. time spent in the queue plus time performing the I/O.
  • aqu-sz – the average number of I/Os in the queue. It is well above 100 on every device.
  • %util – high on all three devices (85 to 100).
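
As a rough cross-check, Little's law says the average number of requests in flight equals the arrival rate times the average time each request spends in the system. The sketch below applies it to the two rows that show a single await value. It assumes reads dominate the workload (the sample shows no write columns) and that await applies to those reads; the function names are illustrative.

```python
# (device, reads/s, read kB/s, await in ms, reported aqu-sz)
# Values copied from the iostat sample above. nvme0n9 is left out
# because its await is shown only as a range.
SAMPLES = [
    ("nvme0n5", 3000, 25256, 53.56, 160.78),
    ("nvme0n6", 3389, 29752, 36.38, 123.29),
]


def expected_queue_depth(iops, await_ms):
    """Little's law: requests in flight = arrival rate x time in system."""
    return iops * (await_ms / 1000.0)


def avg_read_kb(iops, kb_per_s):
    """Average size of one read request in kB."""
    return kb_per_s / iops


for dev, rps, rkbs, await_ms, aqu_sz in SAMPLES:
    est = expected_queue_depth(rps, await_ms)
    print(f"{dev}: estimated queue {est:.1f} vs reported {aqu_sz}, "
          f"avg read {avg_read_kb(rps, rkbs):.1f} kB")
```

Both estimates land within about 0.1 of the reported aqu-sz, so the long queues follow directly from the read rate and the latency. The average read works out to between 8 and 9 kB, which would fit single-block reads if the database block size is 8 KB (an assumption, not something checked here).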

A note: I highly recommend reading through Jared’s blog to understand the %iowait metric in sar. As Jared explains, %iowait is CPU time spent idle while I/O is outstanding, and it does not by itself indicate a performance problem. That is exactly what is happening here.

It looks more like a queueing problem, i.e. our I/O request queues are getting saturated. That raises the question: why so many I/O requests on a mounted standby?

Checking Database

Well, since this is a mounted standby, there’s only one process doing anything substantial: Managed Recovery Process – MRP. So let’s take a look at what’s going on.

select process, status, thread#, sequence#,
       known_agents, active_agents
from v$managed_standby;


PROCESS   STATUS          THREAD#  SEQUENCE# KNOWN_AGENTS ACTIVE_AGENTS
--------- ------------ ---------- ---------- ------------ -------------
ARCH      CLOSING               1     681669            0             0
DGRD      ALLOCATED             0          0            0             0
DGRD      ALLOCATED             0          0            0             0
ARCH      CLOSING               1     681667            0             0
ARCH      CLOSING               1     681670            0             0
ARCH      CLOSING               1     681668            0             0
ARCH      CLOSING               1     681666            0             0
RFS       RECEIVING             1     681671            0             0
RFS       IDLE                  1          0            0             0
MRP0      APPLYING_LOG          1     681671           17            17

As we can see, MRP0 reports 17 known recovery agents, all 17 of them active. During peak load, each of these parallel processes issues its own I/O, which explains the concurrent I/O requests.

However, for most of the day, the load would not spike. The spikes appeared only during periods of high redo generation on the primary. When a large volume of redo was shipped to the standby, all 17 recovery agents would become active at the same time and issue concurrent I/O requests, leading to a spike in load.

Recovery Parallelism

Our server could handle constant, steady recovery throughput, but not these short spikes of parallel I/O. The next thing to check was recovery parallelism, which is the ApplyParallel property of the Data Guard broker. We had it set to AUTO.

AUTO: the number of parallel processes used for Redo Apply is automatically determined by Oracle based on the number of CPUs that the system has.

DGMGRL> show database verbose orcls

Database - orcls

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          1 second (computed 0 seconds ago)
  Average Apply Rate: 2.07 MByte/s
  Active Apply Rate:  1.38 MByte/s
  Maximum Apply Rate: 158.73 MByte/s
  Instance(s):
    orcls

  Properties:
...
    ApplyParallel                   = 'AUTO'

With ApplyParallel set to AUTO, the server spawned multiple recovery agents (17 in our case). We reduced the parallelism to throttle this, and that helped us manage the load on the system better.

DGMGRL> edit database orcls set property ApplyParallel = 8;

Reducing recovery parallelism behaved as expected and improved overall system stability.

Notably, this behavior was not observed in the previous environment, which ran on Bare Metal Systems (BMS). We attributed the difference to the lower IOPS available on the new platform, Google Compute Engine (GCE).

In other words, the recovery parallelism that was sustainable on BMS was not sustainable on GCE, where it led to increased I/O queueing, higher latency, and elevated system load. So keep an eye on recovery parallelism, and tune it to the I/O capacity of the underlying platform.

But be mindful: although reducing recovery parallelism improves I/O stability, it may also limit peak redo apply throughput. This can lead to lag during periods of sustained high redo generation, or when the standby is catching up after a network outage and has a large backlog of redo to apply.
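
To get a feel for that trade-off, here is a back-of-the-envelope Python sketch of catch-up time: the backlog drains at the apply rate minus the rate at which new redo keeps arriving. Everything here is hypothetical except the two figures taken from the broker output above (158.73 MByte/s maximum apply rate under AUTO, and 2.07 MByte/s average apply rate, used as a stand-in for steady redo generation). The 50 GB backlog and the 80 and 40 MByte/s rates are made up for illustration, and apply rate does not necessarily scale linearly with ApplyParallel.

```python
def catch_up_seconds(backlog_mb, apply_mb_s, incoming_mb_s):
    """Seconds needed to drain a redo backlog while new redo keeps arriving.

    Returns None when apply cannot outpace incoming redo (lag never closes).
    """
    net_mb_s = apply_mb_s - incoming_mb_s
    if net_mb_s <= 0:
        return None
    return backlog_mb / net_mb_s


BACKLOG_MB = 50_000        # hypothetical: 50 GB of redo queued after an outage
INCOMING_MB_S = 2.07       # average apply rate from the broker, as a proxy
APPLY_RATES_MB_S = (158.73, 80.0, 40.0)  # first is the broker's maximum; rest hypothetical

for rate in APPLY_RATES_MB_S:
    secs = catch_up_seconds(BACKLOG_MB, rate, INCOMING_MB_S)
    label = "never" if secs is None else f"{secs / 60:.1f} min"
    print(f"apply {rate:7.2f} MB/s -> catch-up {label}")
```

Measuring the actual apply rate at the reduced parallelism, and plugging it in here, gives a quick estimate of how long the standby would lag after an outage.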

