Tuesday, September 13, 2022

Disaster Recovery Plan

 

Disaster recovery is one of the most neglected areas in many businesses, and when disaster strikes, they often struggle to recover. Every organisation should have a business continuity planning (BCP) team, headed by senior management, that is responsible for managing the various kinds of disaster. Disasters fall into two types.

·         Natural disaster: examples are earthquakes, floods, storms, etc.

·         Human-made disaster: examples are vulnerability attacks, power outages, fires, etc.

Any disaster will impact the availability of the system. Sometimes human-made disasters, such as vulnerability attacks on parts of the system, will also impact its integrity. Availability and integrity are two parts of the CIA triad (confidentiality, integrity, and availability).

A primary goal of a DR solution is to increase availability and remove single points of failure (SPOFs), thereby building strong system resilience and fault tolerance.

·         A single point of failure (SPOF) is any component that can cause an entire system to fail

·         System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event

·         Fault tolerance is the ability of a system to suffer a fault but continue to operate

·         High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption
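To make the effect of redundancy concrete, here is a minimal sketch in Python. It assumes components fail independently, which real systems often do not, and the function names and the 0.99 availability figure are hypothetical values chosen for illustration. A chain of required components is a set of single points of failure, while redundant components only fail together.

```python
from math import prod


def series_availability(availabilities):
    """All components are required: any one failing takes the system down."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Components are redundant: the system is down only if all of them fail."""
    return 1 - prod(1 - a for a in availabilities)


single = 0.99  # hypothetical availability of one component

# Two required components in a chain: each one is a SPOF.
print(f"chained:   {series_availability([single, single]):.4f}")

# Two redundant components: either one can carry the load.
print(f"redundant: {redundant_availability([single, single]):.4f}")
```

Under these assumptions the chained system is less available than a single component, and the redundant pair is more available, which is why removing SPOFs and adding redundancy go together.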

As part of the business impact analysis, the BCP team arrives at values for RPO, RTO, MTD and WRT based on business risk and criticality, and the DR solution is designed around these values. For example, if the BCP team commits to an MTD (Maximum Tolerable Downtime) of 4 hours for customer service, the DR system must be built to meet that figure. The BCP team might not know which technology will achieve it; they arrive at the value from business impact alone. The MTD figure communicates how long there is to get the production environment back to normal after a disruptive event.

 


 

RPO (Recovery Point Objective): This represents the maximum amount of data loss, measured in time, that is acceptable at the time of recovery. The database is backed up on a schedule. After a disaster, the difference between the last backup datetime and the disaster datetime is the data that is lost, and the RPO is the upper limit on that gap. The RPO therefore determines how old the most recent backup available for recovery may be. A lower RPO requires more frequent backups. For example, if our RPO is 5 minutes, our backup data must be refreshed at least every 5 minutes to meet the RPO requirement.
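The 5-minute example can be sketched in a few lines of Python. This is only an illustration, not a backup tool: the function names and timestamps are hypothetical, and it assumes the worst-case data loss equals the interval between two backups.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Time between the last backup and the disaster, i.e. the data at risk."""
    if disaster < last_backup:
        raise ValueError("disaster time is earlier than the last backup")
    return disaster - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case a disaster hits just before the next backup,
    so the backup interval itself must not exceed the RPO."""
    return backup_interval <= rpo


rpo = timedelta(minutes=5)

# Hypothetical timestamps: backup at 10:00, disaster at 10:04.
loss = data_loss_window(datetime(2022, 9, 13, 10, 0), datetime(2022, 9, 13, 10, 4))
print(loss, loss <= rpo)

# A 15-minute backup schedule cannot guarantee a 5-minute RPO.
print(schedule_meets_rpo(timedelta(minutes=15), rpo))
```

The second check is the one that matters at design time: the schedule is judged against the worst case, not against how a particular incident happened to turn out.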

RTO (Recovery Time Objective): This represents the maximum amount of time allowed to restore the environment after a disaster. During this time, the environment is completely down and not available for production. A lower RTO calls for automated, redundant systems, while a higher RTO leaves room for more manual intervention to get up and running.

WRT (Work Recovery Time): This represents the maximum time, once the system is up and running, that it will take to confirm the integrity of the system. RTO deals with getting the environment up and running, and WRT makes sure the application is ready to be handed to business users to start working.

MTD (Maximum Tolerable Downtime): This represents the maximum time the business can tolerate the system's unavailability. The total downtime consists of RTO + WRT, so the two together must fit within the MTD. The MTD varies with business criticality.
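As a small worked check of how these values relate, the sketch below takes total downtime as RTO plus WRT and compares it with the MTD. The 4-hour MTD is the customer service example used earlier; the RTO and WRT figures and the function names are hypothetical.

```python
from datetime import timedelta


def total_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """Downtime seen by the business: restore time plus verification time."""
    return rto + wrt


def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """A DR design is acceptable only if RTO + WRT does not exceed the MTD."""
    return total_downtime(rto, wrt) <= mtd


mtd = timedelta(hours=4)       # customer service MTD from the example
rto = timedelta(hours=3)       # hypothetical time to restore the environment
wrt = timedelta(minutes=45)    # hypothetical time to verify integrity

print(total_downtime(rto, wrt), fits_within_mtd(rto, wrt, mtd))

# Verification that takes 90 minutes would push the design past the MTD.
print(fits_within_mtd(rto, timedelta(minutes=90), mtd))
```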

The BCP team will identify MTD and RPO based on business requirements, and the DR team will build the system to achieve them. Lower RTO and MTD values cost more to achieve, and higher RTO and MTD values cost less. When building a recovery strategy, we mostly encounter the following three types of disruption:

·         Non-disaster: a disruption with a significant impact on service and a limited impact on the facility. The solution may include system or software restoration.

·         Disaster: a disruption with a significant impact on both service and facility. To address it we need an alternative facility. Software and hardware are restored in the alternative facility to overcome the disruption, and the alternative environment should remain available until the problem in the main facility is resolved.

·         Catastrophe: a disruption with a significant impact on facilities, necessitating both short-term and long-term solutions to rebuild the facility.

Building a recovery site is a major part of a DR solution, because the recovery site is what lets us recover from a disaster. The following are the common types of site used to manage any disaster:

·         Cold site: only the facility is available to restore into. It will take about a week to get up and running.

·         Warm site: a minimal configuration is available. After a disaster it will take a few days to get up and running.

·         Hot site: a fully configured redundant site is available and can be running in a few hours.
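The trade-off between the three site types can be sketched as a simple selection rule: pick the cheapest site that can be operational within the required recovery time. Everything specific here is an assumption for illustration, including the exact durations (3 days for "a few days", 4 hours for "a few hours"), the relative cost ranks, and the `RecoverySite` and `choose_site` names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoverySite:
    name: str
    time_to_operational: timedelta  # assumed typical recovery time
    relative_cost: int              # assumed rank: higher means more expensive


SITES = [
    RecoverySite("cold", timedelta(weeks=1), 1),
    RecoverySite("warm", timedelta(days=3), 2),
    RecoverySite("hot", timedelta(hours=4), 3),
]


def choose_site(required_recovery_time: timedelta) -> Optional[RecoverySite]:
    """Return the cheapest site that recovers in time, or None if none can."""
    candidates = [s for s in SITES if s.time_to_operational <= required_recovery_time]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


for target in (timedelta(days=10), timedelta(days=5), timedelta(hours=4), timedelta(hours=1)):
    site = choose_site(target)
    print(target, "->", site.name if site else "no site type is fast enough")
```

When no site type is fast enough, as in the 1-hour case, the requirement calls for something beyond a recovery site alone, for example redundancy built into the production system itself.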

 



Thursday, September 1, 2022

Business continuity plan

 

We often think that business continuity plans (BCP) and disaster recovery plans (DRP) are the same. In reality, they are not. DR is a subset of BCP and focuses on how to recover once a disaster has struck. BCP operates at the strategy level: it covers the plans for keeping the business running if a disaster occurs.

Business continuity management (BCM) is a holistic management process to handle both BCP and DRP. BCM provides a framework for integrating resilience with the capability for effective responses in a manner that protects the interests of the organization’s key stakeholders. The main objective of BCM is to allow the organisation to continue to perform business operations under various conditions. BCM is the main approach to managing all aspects of BCP and DRP.

 



The following are a few widely used industry standards and frameworks for BCP.

·         ISO/IEC 27031:2011: describes the concepts and principles of information and communication technology (ICT) readiness for business continuity

·         ISO 22301:2019 Security and resilience — Business continuity management systems — Requirements

·         NIST SP 800-34: Contingency Planning Guide for Federal Information Systems

 

BCP helps organizations to:

·         Respond appropriately to emergency situations

·         Ensure safety

·         Reduce business impact

·         Resume critical business functions

·         Plan to work with vendors for DR

·         Reduce confusion during a crisis

·         Increase customer confidence

·         Get up and running quickly after a disaster

 

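NIST SP 800-34 outlines the following seven steps:

·         Develop the contingency planning policy statement

·         Conduct the business impact analysis (BIA)

·         Identify preventive controls

·         Create contingency strategies

·         Develop an information system contingency plan

·         Ensure plan testing, training, and exercises

·         Ensure plan maintenance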



 

Business continuity planning (BCP) entails assessing organisational risks and developing policies, plans, and procedures to mitigate their impact if they occur. BCP focuses on how to keep the organisation in business after a major disruption. It is about the survivability of the organisation and making sure that critical functions can still take place even after a disaster. The goal of BCP planners is to implement a combination of policies, procedures, and processes such that a potentially disruptive event has as little impact on the business as possible. The BCP process has four major steps.

·         Project scope and planning

·         Business impact analysis

·         Continuity planning

·         Approval and implementation