ORA-01555 on Standby

I recently ran into a case of ORA-01555 on standby database. Initially it looked exactly like what you would expect: long-running queries, undo retention set to 900 seconds, and the familiar “snapshot too old” error.

However, this was a physical standby, i.e. Active Dataguard used for reporting. Moreover, the same SQL statements were executing successfully on the primary database, while failing intermittently on the standby. That immediately suggested this wasn’t just a regular SQL tuning or undo sizing problem.

Configuration

Primary database.
- undo_retention set to 900 i.e. 15 mins
Physical standby – Active Dataguard
- Reporting queries offloaded to the standby system.
- undo_retention set to 900 i.e. 15 mins
Alert log excerpts from standby:

$ grep ORA-01555 alert_orcls.log
alert_orcls.log:ORA-01555 caused by SQL statement below (SQL ID: 2h22ujgdsn7d2, Query Duration=12589 sec, SCN:
alert_orcls.log:ORA-01555 caused by SQL statement below (SQL ID: 2h22ujgdsn7d2, Query Duration=36142 sec, SCN:
alert_orcls.log:ORA-01555 caused by SQL statement below (SQL ID: c4dgx7aubkb95, Query Duration=316846 sec, SCN:
alert_orcls.log:ORA-01555 caused by SQL statement below (SQL ID: 25fgr7b58gfmh, Query Duration=1319 sec, SCN:
alert_orcls.log:ORA-01555 caused by SQL statement below (SQL ID: fgr56qb91gxok, Query Duration=508 sec, SCN:

This led to the following observations

Multiple SQL IDs involved → so not an issue with a single long running SQL
Query duration ranged from a few seconds to several hours

So yes, some of these queries were clearly exceeding the configured retention (900s). But, that still didn’t explain the inconsistent behavior.

Understanding ORA-01555 on Standby

ORA-01555 occurs when Oracle cannot construct a consistent read (CR) version of a block at the required SCN because the necessary undo is no longer available.

On a standby, there is no user-driven DML. However, redo received from the primary ends up continuously updating the undo blocks along with the regular data blocks. So although no users are modifying data locally, the effect is similar: the data required to reconstruct an older version may no longer be available if Undo is overwritten.

The standby database is advancing it’s SCN based on the REDO received

However, this critically also implies that changing the undo_retention parameter on the standby has absolutely no effect. Undo retention is entirely dictated by the primary database.

So, if your primary database only needs 10 minutes of undo, but your standby runs 30-minute reports, you must configure the primary database to retain at least 30 minutes of undo. So that would explain the the issue with c4dgx7aubkb95 which was running for 316846 secs when our undo retention was 900.

Redo Apply Behaviour

However that does not explain why fgr56qb91gxok failed after ~500 secs. This is instead explained by the way Redo Apply is configured in your system – real-time apply or not.

As explained in Oracle Documentation “If the real-time apply feature is enabled, then apply services can apply redo data as it is received, without waiting for the current standby redo log file to be archived…. It also enables real-time reporting on an Oracle Active Data Guard standby by keeping it more closely synchronized with the primary database. “

So if Real Time apply is configured incorrectly or if Standby Redo Logs are missing, then MRP will wait for redo switch on the primary, and apply the batch of changes almost instantly on standby. This batched application causes bursts in System Change Number (SCN) advancement, which can inturn impact the read consistency and lead to ORA-01555.

SCN Jump Example

Time  10:15────────────────────────10:25──10:30─────────10:40─>
Primary:
   SCN: 200 ── 210 ── 220 ── 230 ── 240 ── 250-▶
   Undo: [older versions gradually overwritten]
Standby (NO SRLs / batch apply):
   SCN: 200 ────────────────(idle)────────────▶ 250

Let’s suppose a Query that runs for 10 min, begins at 10:25 on both primary & standby – let’s observe how things go.

On Primary
- 10:25 – SCN of the db is 240.
  - So the query will use this as its Query SCN and refer to undo data as of SCN 240 for Consistent Reads.
  - The system retains undo for SCN 240 until 10:40 based on the 15-minute undo retention.
- 10:35 – The query finishes successfully within the available undo window.
On Standby
- 10:25 – SCN of the db is still 200.
  - So the query will use this as its Query SCN and refer to undo data as of SCN 200 for Consistent Reads.
  - The primary generated undo for SCN 200 at 10:15, and it expires at 10:30 based on the configured retention.
- 10:31 – The standby applies the redo and advances the SCN.
  - At this point, the primary has already overwritten the undo for SCN 200.
  - SQL fails within 5 minutes with ORA-01555

Consistent read on the standby depends entirely on undo generated on the primary and its availability at the time reconstruction is required.

On our system, we could see that we did not have Standby Redo logs configured properly and so they couldn’t keep up with the activity on primary. This inturn led to a lag and that subsequently the ORA-01555.

Undo Still Matters

Some of the queries observed were running for several hours. If those queries must succeed consistently, configure the primary database to retain undo for at least that duration. Don’t just increase the parameter—size the undo tablespace appropriately so Oracle can maintain higher tuned undo retention. And importantly, make the change on the primary.

Checklist

Hence the things to to check on your system when you encounter this.

First verify your Active Dataguard Configuration

Standby should be in sync with Primary
Configure Standby Redo Logs to match the size of online redo logs and add at least one additional group.
Ensure Real Time Apply is configured and working as expected.

Use Dataguard broker’s validate commands to review the configuration

DGMGRL> validate database verbose orcls;
Database Role:     Physical standby database
Primary Database:  orcl
....
....
Current Log File Groups Configuration:
Thread #  Online Redo Log Groups  Standby Redo Log Groups Status
(orcl)               (orcls)
1         3                       1                      Insufficient SRLs

2. Review UNDO_Retention

Undo retention value set on primary is the the one that’s effective – so make the undo retention parameter change on primary (Do also update parameter on standby, but that is for consistency only)
Increase the undo tablespace on the primary if required and ensure the standby reflects the change.
Also review the “Tuned Undo Retention” in v$undostaat.

Finally, I would strongly recommend, going through Connor’s following video that discusses Tuned undo retention and this issue as well.

Discover more from oratrails-aish

Subscribe to get the latest posts sent to your email.