This is one of the
most neglected areas in many businesses, and when disaster strikes, businesses
often struggle to recover. Every organisation should have a BCP team, headed by
senior management, that is responsible for managing various disasters.
Disasters fall into two types:
· Natural disasters: for example, earthquakes, floods, and storms
· Human-made disasters: for example, vulnerability attacks, power outages, and fires
Any disaster will
impact the availability of the system. Some human-made disasters, such as
vulnerability attacks on parts of the system, will also impact the integrity of
the system. Availability and integrity are two parts of the CIA triad
(confidentiality, integrity, and availability).
A primary goal of a DR solution is to
increase availability and eliminate single points of failure (SPOF),
thus building strong system resilience and fault tolerance.
· A single point of failure (SPOF) is any component that can cause an entire system to fail
· System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event
· Fault tolerance is the ability of a system to suffer a fault but continue to operate
· High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption
As part of business
impact analysis, the BCP team will arrive at RPO, RTO, MTD, and WRT values based
on business risk and criticality. The DR solution is designed around these
values. For example, if the BCP team commits to an MTD (Maximum Tolerable
Downtime) of 4 hours for customer service, the DR system must be built to meet
that target. The BCP team might not be aware of the technology needed to achieve
it; they arrive at this value based on business impact.
The MTD figure communicates how quickly the production environment must be
back to normal after a disruptive event.
RPO (Recovery Point Objective): This represents the maximum amount of data loss, measured in time,
that the business can accept at recovery. The database is backed up on a
schedule. After a disaster, the difference between the last backup time and the
disaster time is the data actually lost, and that gap must not exceed the RPO.
The RPO therefore determines how recent the latest available backup must be. A
lower RPO requires more frequent backups. For example, if our RPO is 5 minutes,
our backup data must be refreshed at least every 5 minutes to meet the RPO
requirement.
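The RPO check described above can be sketched in a few lines of Python. This is a minimal illustration, not a backup tool: the function names `data_loss_window` and `meets_rpo` and the sample timestamps are hypothetical, chosen only to mirror the 5-minute example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Return the gap between the last backup and the disaster."""
    if disaster < last_backup:
        raise ValueError("disaster time cannot be earlier than the last backup")
    return disaster - last_backup


def meets_rpo(last_backup: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True when the data lost at recovery stays within the RPO."""
    return data_loss_window(last_backup, disaster) <= rpo


# Hypothetical values: a 5-minute RPO and a disaster at noon.
rpo = timedelta(minutes=5)
disaster = datetime(2024, 1, 1, 12, 0)

print(meets_rpo(datetime(2024, 1, 1, 11, 57), disaster, rpo))  # 3-minute gap
print(meets_rpo(datetime(2024, 1, 1, 11, 50), disaster, rpo))  # 10-minute gap
```

A backup taken 3 minutes before the disaster satisfies the 5-minute RPO, while one taken 10 minutes before does not.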
RTO (Recovery Time Objective): This represents the maximum amount of time allowed to restore
the environment after a disaster. During this time, the environment is
completely down and not available for production. A lower RTO calls for an
automated, redundant build of the system, while a higher RTO allows for more
manual intervention to get up and running.
WRT (Work Recovery Time): This represents the maximum time allowed, once the system is up and
running, to confirm the integrity of the system. RTO deals with
getting the environment up and running, and WRT makes sure the application
is ready to be handed to business users to start working.
MTD (Maximum Tolerable Downtime): This represents the maximum time the business can tolerate the
system's unavailability. The total downtime consists of RTO + WRT. MTD varies
with business criticality.
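As a worked illustration, the commonly used relationship MTD = RTO + WRT can be checked against a committed MTD. The sketch below assumes that relationship and uses the 4-hour customer service MTD from the earlier example; the function names and the RTO and WRT figures are hypothetical.

```python
from datetime import timedelta


def total_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """Total downtime is restore time (RTO) plus verification time (WRT)."""
    return rto + wrt


def within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True when the planned RTO and WRT fit inside the committed MTD."""
    return total_downtime(rto, wrt) <= mtd


mtd = timedelta(hours=4)  # MTD committed by the BCP team

# Hypothetical plan A: 3 hours to restore, 1 hour to verify.
print(within_mtd(timedelta(hours=3), timedelta(hours=1), mtd))

# Hypothetical plan B: 3.5 hours to restore, 1 hour to verify.
print(within_mtd(timedelta(hours=3, minutes=30), timedelta(hours=1), mtd))
```

Plan A fits exactly within the 4-hour MTD. Plan B overruns it by 30 minutes, so the DR team would need a faster restore or a shorter verification step.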
The BCP team will identify MTD and RPO
based on business requirements, and the DR team will build the system to
achieve them. Lower RPO, RTO, and MTD targets cost more to achieve, and
higher targets cost less. When building a
recovery strategy, we mostly encounter the following three types of
disruption:
· Non-disaster: a significant impact on service and a limited impact on the facility. The solution may include system or software restoration.
· Disaster: a significant impact on both service and the facility. To address it, we need an alternative facility. Software and hardware are restored in the alternative facility to overcome the disruption, and the alternative environment should remain available until the problem in the main facility is resolved.
· Catastrophe: a significant impact on facilities, necessitating both short-term and long-term solutions to rebuild the facility.
Building a recovery site is a major part of
a DR solution. A recovery site helps us recover from a disaster. The following
are the popular types of recovery site:
· Cold Site: Only the facility is available for restoration. It will take a week to get up and running.
· Warm Site: A minimal configuration is available. After a disaster, it will take a few days to get up and running.
· Hot Site: A fully configured redundant site is available and can be running within a few hours.
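The trade-off between site types and recovery time can be sketched as a simple lookup. This is an illustration only: the recovery times below (7 days, 3 days, 4 hours) are assumed stand-ins for "a week", "a few days", and "a few hours", the cheapest-first ordering is an assumption in line with the point that lower recovery targets cost more, and `cheapest_site_for` is a hypothetical helper.

```python
from datetime import timedelta
from typing import Optional

# Assumed recovery times, ordered from cheapest to most expensive site type.
SITE_RECOVERY_TIMES = [
    ("cold", timedelta(days=7)),
    ("warm", timedelta(days=3)),
    ("hot", timedelta(hours=4)),
]


def cheapest_site_for(rto: timedelta) -> Optional[str]:
    """Return the cheapest site type whose recovery time fits within the RTO.

    Returns None when even a hot site is too slow for the given RTO.
    """
    for name, recovery_time in SITE_RECOVERY_TIMES:
        if recovery_time <= rto:
            return name
    return None


print(cheapest_site_for(timedelta(days=10)))  # a cold site is sufficient
print(cheapest_site_for(timedelta(days=4)))   # needs at least a warm site
print(cheapest_site_for(timedelta(hours=6)))  # needs a hot site
print(cheapest_site_for(timedelta(hours=1)))  # no listed site type is fast enough
```

In practice the recovery times would come from the organisation's own testing rather than fixed constants.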