ERR replikacji danych (Data Guard) + move forward czyli aplikowanie intrementala

— objawy: wysypała się replikacja Data Guarda (DB 11.2.0.4), brak błędow (wisi od kilku dni), rozwiązanie: zaaplikowanie incrementala (bo „uszkodzonego” archa brak – ale to wyjdzie później)

— wysypala sie replikacja DG:

db_primary – Primary database
Error: ORA-16724: cannot resolve gap for one or more standby databases
db_standby – Physical standby database

— w alercie na standby widac, ze jest ok, czyta archa:

Tue Mar 16 10:39:37 2021
Recovery of Online Redo Log: Thread 1 Group 1 Seq 135334 Reading mem 0
Mem# 0: /u02/dbdata/db_standby/onlinelog/group_1.393.982755799
Mem# 1: /u02/dbdata/db_standby/onlinelog/group_1.448.982755803
Media Recovery Log /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc

— plik naturalnie jest

[oracle@nod_standby ~]$ ls -la /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc
-r–r—–+ 1 oracle asmadmin 3196416 Mar 12 13:20 /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc

— i MRP aplikuje archa

SQL> select process,status,thread#,sequence# from v$managed_standby;
PROCESS STATUS THREAD# SEQUENCE#
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CLOSING 2 83790
ARCH CLOSING 2 83792
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
MRP0 APPLYING_LOG 2 83498
RFS IDLE 0 0

— sprawdzmy laga

SQL> set linesize 200
SQL> select name, value, unit, time_computed from v$dataguard_stats where name in (’transport lag’,’apply lag’);
NAME VALUE UNIT TIME_COMPUTED
transport lag +03 21:44:17 day(2) to second(0) interval 03/16/2021 10:44:51
apply lag +03 21:44:28 day(2) to second(0) interval 03/16/2021 10:44:51

— i jaki jest ostatni checkpoint (z tego wyzej mozna tez wyczytac)

alter session set nls_date_format=’dd-mm-yy hh24:mi:ss’;
select max(checkpoint_time) from v$datafile_header;
MAX(CHECKPOINT_TIME
12-03-21 13:00:22

— zatrzymujemy replikacje i odpalamy ponownie (czesto pomaga DB v11)

edit database db_standby set state=apply-off;
edit database db_standby set state=apply-on;

— albo:

alter database recover managed standby database cancel;
–alter database recover managed standby database using current logfile disconnect;

— czyli patrzac wyzej: transport archow jest, w alercie brak info o bledzie, a jedynie, ze czyta archa, w bazie to samo MRP aplikuje to sprawdzamy aplikowanego archa ze standby na primary w alercie (Seq: 135334):

Fri Mar 12 12:59:34 2021
LGWR: Standby redo logfile selected for thread 1 sequence 135334 for destination LOG_ARCHIVE_DEST_2
Thread 1 advanced to log sequence 135334 (LGWR switch)
Current log# 6 seq# 135334 mem# 0: +DATAC1/db_primary/onlinelog/group_6.422.982755755
Current log# 6 seq# 135334 mem# 1: +RECOC1/db_primary/onlinelog/group_6.1005.982755759

Fri Mar 12 13:01:03 2021
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Error 16198 for archive log file 6 to 'db_standby’
ORA-16198: LGWR received timedout error from KSR
LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'db_standby’
LGWR: Failed to archive log 6 thread 1 sequence 135334 (16198)
Thread 1 advanced to log sequence 135335 (LGWR switch)
Current log# 9 seq# 135335 mem# 0: +DATAC1/db_primary/onlinelog/group_9.592.1013471479
Current log# 9 seq# 135335 mem# 1: +RECOC1/db_primary/onlinelog/group_9.21516.1013471483

— i jak widac wyzej mamy niefarta bo zaliczył timeouta – czyli jakąś część archa przekopiował, ale nie do końca (w rezultacie DB standby probowal aplikowac, ale nie dało rady i co ciekawe nie walnął żadnym błędem oprócz informacji z dgmrl-a, ze utworzyl sie gap)

— wylaczamy replikacje

alter database recover managed standby database cancel;
Database altered.

z alerta:

Tue Mar 16 11:50:24 2021
Errors with log /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc
MRP0: Background Media Recovery cancelled with status 16037
Errors in file /u01/app/oracle/diag/rdbms/db_standby/db_standby1/trace/db_standby1_pr00_59425.trc:
ORA-16037: user requested cancel of managed recovery operation

— zmieniamy mu nazwe niech walnie bledem (dla sportu może odpyta jeszcze raz o archa)

[oracle@nod_standby 2021_03_12]$ mv /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc_

— odpalamy aplikowanie logow:

alter database recover managed standby database using current logfile disconnect;
Database altered.

— elegancko pojawil sie blad:

Tue Mar 16 11:53:12 2021
Recovery of Online Redo Log: Thread 1 Group 1 Seq 135334 Reading mem 0
Mem# 0: /u02/dbdata/db_standby/onlinelog/group_1.393.982755799
Mem# 1: /u02/dbdata/db_standby/onlinelog/group_1.448.982755803
Media Recovery Log /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc
Error opening /u02/dbdata/db_standby/archivelog/2021_03_12/o1_mf_2_83498__1615550507481734_.arc
Attempting refetch
Completed: alter database recover managed standby database using current logfile disconnect
Media Recovery Waiting for thread 2 sequence 83498
Fetching gap sequence in thread 2, gap sequence 83498-83498

— próbujemy odtworzyć brakującego archa (czyli tego co nie przesłal się do końca)

restore archivelog FROM SEQUENCE 83498 UNTIL SEQUENCE 83498;
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 03/16/2021 11:54:52
RMAN-20242: specification does not match any archived log in the repository
Tue Mar 16 11:55:07 2021
FAL[client]: Failed to request gap sequence
GAP – thread 2 sequence 83498-83498
DBID 3080213620 branch 982755744
FAL[client]: All defined FAL servers have been attempted.

— NAPRAWA
— pomimo tego, ze na primary i standby mamy ten sam seq to nie mozemy archa odtworzyć bo juz nie ma na backupie (zfs: „incremental / continuous backup”)

SQL> select max(sequence#) from v$archived_log where archived=’YES’;
MAX(SEQUENCE#)
135870

— wylaczamy transport logow

DGMGRL> show database db_primary
Database – db_primary

Role: PRIMARY
Intended State: TRANSPORT-ON

DGMGRL> edit database db_primary set state=’transport-off’;
Succeeded.
DGMGRL> show database db_primary

Database – db_primary

Role: PRIMARY
Intended State: TRANSPORT-OFF

— wyłączamy wrzucanie logow z primary na standby

alter system set log_archive_dest_state_2=’DEFER’ scope=both;

–sprawdzamy scn na standby (select current_scn from v$database;)

CURRENT_SCN
6932123446457

— na primary lecimy z incrementalem:

backup incremental from SCN 6932123446457 database tag=’standby_odswiezenie’ format '/export/warsztat/database_%d_%u_%s’;

— plik controlny z primary

SQL> select name,open_mode,database_role from v$database;
SQL> alter database create standby controlfile as '/export/warsztat/stby_control.ctl’;
alter database create standby controlfile as '/export/warsztat/stby_control.ctl’
*
ERROR at line 1:
ORA-00245: control file backup failed; target is likely on a local file system
— wrzucamy controla na ASM-a i pozniej kopiujemy na locala
alter database create standby controlfile as '+RECOC1/stby_control.ctl’;
ASMCMD> cp +recoc1\stby_control.ctl /tmp/
mv /tmp/stby_control.ctl /export/warsztat/
chown oracle:asmadmin /export/warsztat/stby_control.ctl

— to nizej robimy w przypadku gdy mamy inne lokalizacje plików bazy danych (primary na asm -> standby local)
— sprawdzamy w jakich lokalizacjach sa pliki bo jak podlozymy controla to sie rozleca lokalizacje
— pliki bazy

select 'alter database rename file ”’ || p.file_name || ”’ to ”’ || s.file_name || ”’;’ from dba_data_files p left join dba_data_files@wiech_db_standby s on p.file_id = s.file_id;

— logi (lepiej wykreowac je ponownie żeby jakiś śmietnik nie pozostał)

select 'alter database rename file ”’ || p.member || ”’ to ”’ || s.member || ”’;’ from v$logfile p left join v$logfile@wiech_db_standby s on p.group# = s.group#;

— sprawdzamy gdzie jest control file na standby, kladziemy db i kopiujemy we wszystkie lokalizacje

show parameter control
NAME TYPE VALUE
control_file_record_keep_time integer 20
control_files string /u02/dbdata/db_standby/CONTROLFILE/control01.ctl, /u02/dbdata/db_standby/CONTROLFILE/control02.ctl

mv /u02/dbdata/db_standby/CONTROLFILE/control01.ctl /u02/dbdata/db_standby/CONTROLFILE/control01.ctl_usunac
mv /u02/dbdata/db_standby/CONTROLFILE/control02.ctl /u02/dbdata/db_standby/CONTROLFILE/control02.ctl_usunac
cp /export/warsztat/stby_control.ctl /u02/dbdata/db_standby/CONTROLFILE/control01.ctl
cp /export/warsztat/stby_control.ctl /u02/dbdata/db_standby/CONTROLFILE/control02.ctl

— gdyby zabraklo czegos to wracamy

–cp /u02/dbdata/db_standby/CONTROLFILE/control01.ctl_usunac /u02/dbdata/db_standby/CONTROLFILE/control01.ctl
–cp /u02/dbdata/db_standby/CONTROLFILE/control02.ctl_usunac /u02/dbdata/db_standby/CONTROLFILE/control02.ctl

— montujemy DB

srvctl start database -d db_standby -o mount

— zatrzymujemy mrp (upewniamy sie bo w mouncie…)

recover managed standby database cancel;

— zmieniamy zarzadzanie plikami na manual

alter system set standby_file_management=manual;
alter database rename file…z tych sql-i wyzej


— pozniej włączamy:

alter system set standby_file_management=auto;

— restor incrementala

rman target /
catalog backuppiece '/export/warsztat/database_db_primary_j2vpsn2m_305762′;
catalog backuppiece '/export/warsztat/database_db_primary_j3vpsn2n_305763′;
catalog backuppiece '/export/warsztat/database_db_primary_j4vpsn2o_305764′;
catalog backuppiece '/export/warsztat/database_db_primary_j5vpsn2o_305765′;
catalog backuppiece '/export/warsztat/database_db_primary_j6vpsn2o_305766′;
catalog backuppiece '/export/warsztat/database_db_primary_j7vpsn2o_305767′;
catalog backuppiece '/export/warsztat/database_db_primary_j8vpsn2o_305768′;
catalog backuppiece '/export/warsztat/database_db_primary_j9vpsn2o_305769′;
catalog backuppiece '/export/warsztat/database_db_primary_javpsn2o_305770′;
catalog backuppiece '/export/warsztat/database_db_primary_jbvpsn2o_305771′;
catalog backuppiece '/export/warsztat/database_db_primary_jcvpsn2p_305772′;
catalog backuppiece '/export/warsztat/database_db_primary_jdvpsn2p_305773′;
catalog backuppiece '/export/warsztat/database_db_primary_jevpsn2p_305774′;
catalog backuppiece '/export/warsztat/database_db_primary_jfvpsn2p_305775′;
catalog backuppiece '/export/warsztat/database_db_primary_jgvpsn2p_305776′;
catalog backuppiece '/export/warsztat/database_db_primary_jhvpsn2p_305777′;
catalog backuppiece '/export/warsztat/database_db_primary_jivpsn2p_305778′;
catalog backuppiece '/export/warsztat/database_db_primary_jjvpsn2p_305779′;
catalog backuppiece '/export/warsztat/database_db_primary_jkvpsn2p_305780′;
catalog backuppiece '/export/warsztat/database_db_primary_jlvpsn2p_305781′;
catalog backuppiece '/export/warsztat/database_db_primary_jmvpso77_305782′;

list backup tag standby_odswiezenie;

— rman restor

recover database from tag standby_odswiezenie noredo;

— odpalamy mrp (bedzie blad ale sie nie przejmujemy)

alter database recover managed standby database disconnect from session;


— na primary mozemy zobaczyc:

Tue Mar 16 17:16:14 2021
FAL[server, ARC0]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance db_primary1 – Archival Error. Archiver continuing.
— a to znaczy, ze nie zatrzymalismy transportu logow na dest_2

— na primary:
— alter system set log_archive_dest_state_2=’DEFER’ scope=both; — jak zapomnielismy wylaczyc wyzej to teraz wylaczamy i wlaczamy

alter system set log_archive_dest_state_2=’ENABLE’ scope=both;

— dgmgrl wlaczamy transport logow:

EDIT DATABASE db_primary SET STATE=’TRANSPORT-ON’;

— na primary switch loga (ale nie trzeba jak jest ruch)

alter system switch logfile;

— czyscimy na standby:

RMAN> delete backup tag = 'standby_odswiezenie’;