ORA-600 [3020] and Point-In-Time Database Recovery

ORA-600 [3020] is no fun on its own, and combined with a failed recovery it is a recipe for disaster. Recently, we ran into an interesting scenario where a Database Point-In-Time Recovery (PITR) failed with “ORA-00600: internal error code, arguments: [3020]”. This was a 19.21 database (on ExaCC), and the alert log contained the following messages.

ERROR: ORA-00600: internal error code, arguments: [3020] recovery detected a data block with invalid SCN.
This could be caused by a lost write on the primary; do NOT attempt to bypass this error by copying blocks or datafiles from the primary database to the standby database because that would propagate the lost write from the primary to the standby. Reference Doc ID 1265884.1 and Doc ID 30866.1.

ORA-600 [3020] Scenario

We had a single copy of disk backups available, and using those backups we conducted a PITR to create a copy of the database. The restore phase was successful; the problem presented itself during the recovery phase. After applying the incremental backup, when the database attempted to apply the required archive logs, it encountered an ORA-600.
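For context, the restore and recovery were driven through RMAN. A minimal sketch of that flow, assuming a disk backup and an illustrative target time, would look like this (the until time and any channel configuration depend on your environment):

run {
  # illustrative PITR target; the actual value is the point in time being recovered to
  set until time "to_date('2024-05-01 12:00:00','YYYY-MM-DD HH24:MI:SS')";
  restore database;
  # recover applies the incremental backups first, then the required archived logs
  recover database;
}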

The datafile reported along with the ORA-600 was part of a bigfile tablespace sized at 5 TB. As the documentation around these errors indicates, the problem is block corruption. The documented solution is to recover the block from a valid backup/copy of the datafile, or to drop and re-create the object containing the affected blocks.
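When a valid backup of the affected file is available, that block-level repair is usually done with RMAN block media recovery; the file and block numbers below are purely illustrative.

# repair a single corrupt block from a valid backup (illustrative file/block numbers)
recover datafile 17 block 4262;

# or repair everything already recorded in v$database_block_corruption
recover corruption list;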

Since we were in the middle of a PITR, neither of the above solutions would work for us. That is:

  • We did not have an alternate copy or backup of the data file available to help with this particular PITR. And of course, short of finding a time machine, we did not have the luxury of taking another backup that would help us.
  • Since the database was not open, dropping and re-creating the affected objects was not an option.

An alternative that was considered, but not chosen due to the size of the tablespace, was to conduct the recovery by skipping the tablespace altogether. The steps for this would be:

  • Take the datafiles belonging to the tablespace offline
  • Continue with the documented steps of PITR with the skip tablespace clause and open the database
  • Drop the tablespace to get rid of the corruption

This would be a very involved process, and you would not get the PITR data for the tablespace in question at all; a rough sketch of the commands is shown below.
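For completeness, here is that sketch. The tablespace name big_ts is hypothetical, and the exact clauses should be checked against the PITR documentation for your version before use.

# from RMAN: restore and recover everything except the affected tablespace
restore database skip tablespace big_ts;

# "skip forever" takes the skipped datafiles offline for drop during recovery
recover database skip forever tablespace big_ts;
alter database open resetlogs;

-- from SQL*Plus, once the database is open: drop the tablespace to remove the corruption
drop tablespace big_ts including contents and datafiles;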

Solution

We decided to proceed with recovery despite the corruption, i.e. recover the database along with the corrupt blocks, and then drop and re-create the objects containing those blocks once the database was open. To do so, we used the following command.

recover database until cancel allow n corruption;

"n" indicates the maximum allowable corrupt blocks. One of the ways to find a suitable value of "n", is to increase it incrementally from 1 until the recovery moves forward. We had ~50 affected blocks and the command that worked for us was as follows.

recover database until cancel using backup controlfile allow 60 corruption;

alter database open resetlogs;

After performing the PITR, we opened the database using the “resetlogs” option. Although we were past the ORA-600, we still had to address the block corruption. Luckily, the impacted blocks turned out to be “FREE” blocks, i.e. not allocated to any segment. We fixed them using the procedures described in this blog.
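For anyone wanting to confirm whether a corrupt block sits in free space or belongs to an object, queries along the following lines can help; &file_no and &block_no are placeholders for the values reported against the corrupt blocks.

-- blocks flagged as corrupt (this view is refreshed by RMAN validate/backup operations)
select file#, block#, blocks, corruption_type
from   v$database_block_corruption;

-- does a given corrupt block fall inside any segment's extents?
select owner, segment_name, segment_type
from   dba_extents
where  file_id = &file_no
and    &block_no between block_id and block_id + blocks - 1;

-- no rows returned means the block is not allocated to any segment, i.e. a "FREE" block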

Note that there was no corruption in the datafile at the source. This indicates that the corruption likely occurred during backup of the FREE blocks. We are not aware of any relevant known bugs for version 19.21. Do let me know if you find any!
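If you want to rule out corruption at the source yourself, the datafile there can be checked with an RMAN validate; the file number below is a placeholder.

# from RMAN against the source database: physical and logical block checks on the datafile,
# with any findings recorded in v$database_block_corruption
validate check logical datafile 17;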