Wednesday, December 28, 2022

AWS Disaster Recovery

Disaster recovery planning and business continuity planning are essential for any organization to recover from a disaster quickly. Disaster recovery for on-premises environments requires significant effort and planning because it involves many third-party services (transportation, network connectivity, etc.) and staff to set up networking, systems, and so on. Disaster recovery planning for the cloud requires less effort, but we still need strong planning to build in the redundancy that lets us recover quickly.

The resilience of the AWS cloud environment is a shared responsibility. AWS infrastructure is available across different AWS regions. Each region is a fully isolated geographical area; within each region, multiple isolated availability zones are available to handle failures. All AWS regions and availability zones are interconnected with high-bandwidth networking. When we use AWS as our cloud, we have various options to manage the high availability of the system.



Within-region high availability

Each region represents a separate geographic area, and availability zones are highly available data centres within each AWS region. Each availability zone has isolated power, cooling, networking, and so on. AWS provides built-in options for dealing with an availability zone outage: if we configure our environment with multi-AZ redundancy, workloads can fail over to another availability zone when an entire availability zone goes down. A within-region high availability architecture keeps data in the permitted region for compliance while still providing high availability.
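Many managed services expose multi-AZ as a simple configuration option. As a minimal sketch (the instance identifier is a placeholder), enabling Multi-AZ on an existing RDS instance with the AWS CLI looks like this:

# Add a synchronous standby replica in a second availability zone
aws rds modify-db-instance --db-instance-identifier <<instance id>> --multi-az --apply-immediately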




Cross-region high availability

A multi-region disaster recovery strategy helps address the rare scenario of an entire AWS region being down due to a natural disaster or technical issue. Highly sensitive applications should plan for cross-region replication. When we plan this approach, we need to consider the availability commitments of each AWS service we use, as most AWS services publish their own availability targets. Cross-region high availability can be achieved in different ways based on our budget and compliance needs, so we need to choose the proper strategy:


  • Backup and restore
  • Pilot light
  • Warm standby
  • Multi-site active/active

 

Backup and restore

This approach addresses the data loss problem but has a high RPO and RTO. The RPO determines how frequently we schedule data backups. Because the recovery environment is not pre-built, constructing an environment from backed-up data takes time, so the RTO is also very high.
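In practice, backup and restore usually means copying backups to a second region on a schedule. A hedged sketch (the snapshot ID and regions are placeholders), copying an EBS snapshot across regions:

# Copy an EBS snapshot into the DR region; --region sets the destination
aws ec2 copy-snapshot --source-region us-east-1 --source-snapshot-id <<snapshot id>> --region us-west-2 --description "DR copy"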





Pilot light

This approach replicates the data to another region and also sets up the core infrastructure there. Servers are kept switched off and are only started when needed for testing or recovery. This reduces the RTO, and the RPO depends on the backup schedule. The approach is cost-effective in terms of recovery, but database corruption or a malware attack still requires a separate backup.
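Failover in the pilot light model is largely a matter of powering on the dormant servers in the DR region. A minimal sketch (the instance IDs are placeholders):

# Start the pre-provisioned but stopped instances in the DR region
aws ec2 start-instances --instance-ids <<instance id 1>> <<instance id 2>> --region us-west-2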





Warm standby

This method is similar to pilot light, but a scaled-down version of the environment is always running. Disaster recovery testing can be carried out at any time, which improves confidence in the ability to recover quickly. The RTO improves compared to pilot light, and the RPO depends on the replication schedule.
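Failover from warm standby typically means scaling the standby environment up to full production capacity. A sketch (the Auto Scaling group name and target capacity are assumptions):

# Grow the standby Auto Scaling group from its scaled-down footprint to production size
aws autoscaling set-desired-capacity --auto-scaling-group-name <<dr asg name>> --desired-capacity 10 --region us-west-2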



Multi-site active/active

In this approach, sites in different regions are all active and running, and requests are distributed across regions. If one region goes down, another region automatically picks up its requests. This is the most costly approach. RPO and RTO are reduced to near zero, but a backup is still required in case of data corruption or a malware attack.
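Active/active traffic distribution is commonly implemented at the DNS layer, for example with Route 53 weighted records, one per region. A hedged sketch (the hosted zone ID, record name, endpoint, and weight are all placeholders):

# Send a share of traffic to the us-east-1 endpoint; a matching record with a different
# SetIdentifier and Weight would cover the second region
aws route53 change-resource-record-sets --hosted-zone-id <<hosted zone id>> --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "SetIdentifier": "us-east-1",
      "Weight": 50,
      "TTL": 60,
      "ResourceRecords": [{"Value": "<<us-east-1 endpoint>>"}]
    }
  }]
}'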




These strategies increase the chances of staying highly available in a disaster scenario. Each strategy addresses a subset of disasters, not all of them, and the achievable RPO and RTO change depending on the disaster.

 

Cross-region and cross-account high availability

For security or compliance reasons, many organizations require complete separation of environments and access between their primary and secondary regions. This helps mitigate insider threats as well as malware attacks on the primary account. Having our backups or primary database routinely copied to a secondary account helps us recover the primary account.

The AWS Backup service can be used to back up data across accounts. AWS Backup is a fully managed service for centrally and automatically managing backups; using it, we can configure backup policies and monitor the backup activity of our AWS resources in one place.
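As a hedged sketch of the cross-account copy (the vault names, ARNs, and IAM role are placeholders), an existing recovery point can be copied into a vault owned by the secondary account:

# Copy a recovery point from the primary account's vault into the secondary account's vault
aws backup start-copy-job --recovery-point-arn <<recovery point arn>> --source-backup-vault-name <<source vault name>> --destination-backup-vault-arn <<destination vault arn>> --iam-role-arn <<backup role arn>>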

Tuesday, December 27, 2022

High Availability and Disaster Recovery

 


People are often confused about high availability and disaster recovery because the two share some common best practices, such as monitoring for issues, deploying to multiple locations, and automatic failover.

Availability differs from disaster recovery in objective and focus. Availability is concerned with ensuring high system and application availability over a given time frame; application availability is calculated by dividing uptime by the total sum of uptime and downtime. Disaster recovery is concerned with recovering applications, environments, and people from large-scale disasters. Its goals are set by the two most important parameters, Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
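As a quick worked example of that formula (the downtime figure here is made up for illustration):

# 43 minutes of downtime in a 30-day month (43,200 minutes total):
# availability = uptime / (uptime + downtime)
echo "scale=4; (43200 - 43) / 43200 * 100" | bc    # prints 99.9000, i.e. about 99.90%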

The following contrasts the two:

  • High availability is about eliminating single points of failure. Disaster recovery is a process consisting of a set of policies and procedures that are triggered when high availability is lost.

  • High availability helps us make sure our system stays operational in identified failure scenarios. Disaster recovery and business continuity planning address man-made disasters such as cyber attacks, terrorist attacks, and human error, as well as natural disasters such as floods, hurricanes, and earthquakes.

  • Application availability is calculated by dividing uptime by the total sum of uptime and downtime. Disaster recovery is measured against two main objectives: the Recovery Point Objective (RPO), the amount of data lost due to the disaster, and the Recovery Time Objective (RTO), the maximum amount of time required to recover an application.

  • A high availability system is built with proper redundancy; in the cloud, multi-AZ and multi-region hosting help ensure high availability. A proper disaster recovery strategy helps recover quickly from events such as a DDoS attack or a pandemic-like situation such as COVID.

Any product company needs to plan for both high availability and disaster recovery as part of business continuity. High availability planning protects us from high-probability events that occur regularly: instance or data storage failure due to software or hardware issues, a web server not responding because of an unexpected issue, a load-induced outage, and so on. Hosting instances in more than one availability zone or region helps address these issues.

The disaster recovery process helps us recover from major outages caused by human-made or natural disasters. Creating backups of data storage in a different data centre and retaining, say, a month's worth of them aids recovery from data loss or corruption and protects against data loss from cyber attacks. Replicating the data storage and environment in a different location helps us recover quickly from a complete environmental failure, but it does not help with data loss or corruption issues.

Overall, high availability and disaster recovery are aimed at the same problem: keeping systems up and running in an operational state. The main difference is that HA is intended to handle problems while a system is running, while DR is intended to handle problems after a system fails.

Friday, November 4, 2022

Installing the New Relic PHP agent on the AWS EC2 ARM64 (Graviton) family

 

Installing the New Relic PHP agent is normally very simple, but once we upgraded our instance to the AWS EC2 arm64 (Graviton) family, we were unable to install the PHP agent using the usual steps. After some digging around, we found the following link, which solved our issue.

https://docs.newrelic.com/docs/apm/agents/php-agent/installation/php-agent-installation-arm64/

The following are the steps we followed to install the agent on an Amazon Linux 2 instance; I hope they are helpful for others.

1. SSH to the EC2 instance.

2. In the root folder, create a new folder called "newrelic":

mkdir newrelic

3. Get into the newly created folder:

cd newrelic

4. As per the above link, the New Relic PHP agent source code is available at the following GitHub path:

https://github.com/newrelic/newrelic-php-agent

Clone the agent code so it can be compiled for arm64. Run the following command:

git clone https://github.com/newrelic/newrelic-php-agent.git

5. Once the code is cloned, a new folder called "newrelic-php-agent" will be created. Get into that folder to compile the source code:

cd newrelic-php-agent/

 

6. Follow these steps to compile and install the New Relic PHP agent:

a. Run sudo make all in the newrelic-php-agent folder to build; it will take a few minutes.

b. Once it is compiled, run sudo make agent-install

c. Run the following command to create the log folder:

sudo mkdir /var/log/newrelic

d. Run the following command to give full permissions on the log folder:

sudo chmod 777 /var/log/newrelic

e. Run the following command to copy the daemon file to the appropriate folder:

sudo cp bin/daemon /usr/bin/newrelic-daemon

 

7. Copy the newrelic.ini file to the PHP ini path. In Amazon Linux, additional PHP ini files are stored in /etc/php.d/

a. To find the ini file path locations, run the following commands:

To find the additional ini file path: php --ini | grep 'additional'

To find the main ini file path: php --ini | grep ' Configuration File'

To copy the newrelic.ini file to the appropriate path, run the following command:

sudo cp agent/scripts/newrelic.ini.template /etc/php.d/30-newrelic.ini

8. Once newrelic.ini is copied, edit the file and change the following parameters:

a. To edit the file, run the following command:

vi /etc/php.d/30-newrelic.ini

b. Update the following parameters and save:

Add a valid license key to the line newrelic.license = "INSERT_YOUR_LICENSE_KEY"

Change the application name that's shown in one.newrelic.com on the line newrelic.appname = "PHP Application"

 

9. Once everything is done, run the following commands to restart the httpd and php-fpm services and start the New Relic daemon:

sudo systemctl restart httpd.service

sudo systemctl restart php-fpm

/usr/bin/newrelic-daemon start
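To confirm the agent is loaded after the restart (a quick sanity check we added; not part of the original write-up), list the PHP modules:

# "newrelic" should appear in the module list if the agent installed correctly
php -m | grep newrelic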

 

The above steps should help you install the New Relic PHP agent. If you encounter any issues, the following link is helpful for troubleshooting:

https://discuss.newrelic.com/t/php-troubleshooting-framework-install/108683

Tuesday, September 13, 2022

Disaster Recovery Plan

 

This is one of the least-focused areas in many businesses, and when disaster strikes, those businesses often struggle to recover. Every organisation should have a BCP team, headed by senior management, to manage various disasters. Broadly, there are two types of disaster:

·         Natural disasters: examples are earthquakes, floods, storms, etc.

·         Human-made disasters: examples are vulnerability attacks, power outages, fires, etc.

Any disaster will impact the availability of the system, and sometimes human-made disasters like vulnerability attacks on parts of the system also impact its integrity. Availability and integrity are both part of the CIA triad (confidentiality, integrity, and availability).

A primary goal of DR solutions is to increase high availability and eliminate single points of failure (SPOF), thus building strong system resilience and fault tolerance.

·         A single point of failure (SPOF) is any component that can cause an entire system to fail

·         System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event

·         Fault tolerance is the ability of a system to suffer a fault but continue to operate

·         High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption

As part of business impact analysis, the BCP team will arrive at RPO, RTO, MTD, and WRT values based on business risk and criticality, and the DR solution will be designed around these values. For example, if the BCP team commits to an MTD (Maximum Tolerable Downtime) of 4 hours for customer service, the DR system needs to be built to achieve that. The BCP team might not be aware of the technology required; they arrive at this value based on business impact. The MTD figure communicates how quickly the production environment must be brought back to normal after a disruptive event.

 


 

RPO (Recovery Point Objective): This represents the amount of acceptable data loss at the time of recovery. The database is backed up on a schedule; after a disaster, the difference between the last backup time and the disaster time is the data actually lost. The RPO value indicates how recent the latest backup available for recovery must be: a lower RPO means a higher backup frequency. For example, if our RPO is 5 minutes, our backup data must be refreshed every 5 minutes to meet the RPO requirement.

RTO (Recovery Time Objective): This represents the maximum amount of time it should take to restore the environment after a disaster. During this time, the environment is completely down and not available for production. A lower RTO requires automated redundancy built into the system, while a higher RTO can tolerate more manual intervention to get up and running.

WRT (Work Recovery Time): This represents the maximum time it takes, once the system is up and running, to confirm the integrity of the system. RTO deals with getting the environment up and running; WRT ensures the application is verified and ready to be handed back to business users.

MTD (Maximum Tolerable Downtime): This represents the maximum time the business can tolerate the system being unavailable. The total downtime is RTO + WRT; for example, if the RTO is 3 hours and the WRT is 1 hour, the MTD must be at least 4 hours. The MTD varies with the criticality of each business function.

The BCP team identifies the MTD and RPO based on business requirements, and the DR team builds the system to achieve them. Lower RTO and MTD values cost more to achieve; higher values cost less. As part of building a recovery strategy, we mostly plan for the following three types of disruption:

·         Nondisaster: a significant impact on the service but a limited impact on the facility. The solution may involve system or software restoration.

·         Disaster: a significant impact on both the service and the facility. To address this, we need an alternative facility: restore software and hardware there to overcome the disruption, and keep the alternative environment available until the problem in the main facility is resolved.

·         Catastrophe: a significant impact on the facility itself, necessitating both short-term and long-term solutions to rebuild it.

Building a recovery site is a major part of a DR solution, since the recovery site is what lets us recover from a disaster. The following are the popular site types for managing a disaster:

·         Cold site: only the facility is available to restore into; it can take a week or more to get up and running.

·         Warm site: a minimal configuration is available; after a disaster it takes a few days to get up and running.

·         Hot site: a fully configured redundant site is available and can be running within a few hours.

 



Thursday, September 1, 2022

Business continuity plan

 

Often, we think business continuity plans (BCP) and disaster recovery plans (DRP) are the same. In reality, they are not: DR is a subset of BCP and focuses on how to recover once a disaster has struck, while BCP sits at the strategy level and lays out the plans for keeping the business running if a disaster occurs.

Business continuity management (BCM) is a holistic management process to handle both BCP and DRP. BCM provides a framework for integrating resilience with the capability for effective responses in a manner that protects the interests of the organization’s key stakeholders. The main objective of BCM is to allow the organisation to continue to perform business operations under various conditions. BCM is the main approach to managing all aspects of BCP and DRP.

 



The following are a few widely used industrial standards and frameworks that are available for BCP.

·         ISO/IEC 27031:2011: describes the concepts and principles of information and communication technology (ICT) readiness for business continuity

·         ISO 22301:2019 Security and resilience — Business continuity management systems — Requirements

·         NIST SP 800-34: Contingency Planning Guide, which outlines the steps listed below

 

BCP helps organizations achieve:

·         An appropriate response to emergency situations

·         The safety of people

·         Reduced business impact

·         Resumption of critical business functions

·         A plan for working with vendors during DR

·         Reduced confusion during a crisis

·         Increased customer confidence

·         Getting up and running quickly after a disaster

 

NIST SP 800-34 outlines the following steps:

·         Develop the contingency planning policy statement

·         Conduct the business impact analysis (BIA)

·         Identify preventive controls

·         Create contingency strategies

·         Develop the contingency plan

·         Ensure plan testing, training, and exercises

·         Ensure plan maintenance

Business continuity planning (BCP) entails assessing organisational risks and developing the policies, plans, and procedures needed to mitigate their impact if they occur. BCP focuses on how to keep the organisation in business after a major disruption takes place: it is about the survivability of the organisation and making sure critical functions can still take place after a disaster. The goal of BCP planners is to implement a combination of policies, procedures, and processes such that a potentially disruptive event has as little impact on the business as possible. The BCP process has four major steps:

·         Project scope and planning

·         Business impact analysis

·         Continuity planning

·         Approval and implementation

 




Wednesday, August 24, 2022

RDS to Aurora data source migration in AWS QuickSight

 

Recently, we came across a problem when we migrated our existing RDS instance to Aurora. We had already developed a lot of QuickSight reports against RDS, and once we migrated to Aurora, there was no direct way to change the data source connection from RDS to Aurora.

Editing a data source in the AWS console only allows changing the instance ID, username, and password.

The following are the steps to edit a data source connection using the AWS console:

  1. Select Datasets on the left side of the QuickSight console, then click the New Dataset button in the top right corner.
  2. Scroll down to the FROM EXISTING DATA SOURCES section and select a data source.
  3. In the popup, click Edit Data Source.
  4. Change the required details like instance ID, username, and password.
  5. Click "Validate connection."
  6. If the connection validates, click Update data source.

 

The above steps help you update an RDS connection. If you want to change the connection from RDS to Aurora, or to any other connection type, you can use the following AWS CLI commands.

Step 1: Find the ID of the data source to edit. Run the following command, which lists all the available data sources along with their data source IDs:

aws quicksight list-data-sources --aws-account-id <<account id>> --region <<Region>>
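If many data sources are listed, the CLI's built-in --query option can trim the output to just the IDs and names; a small sketch using the DataSourceList field of the response:

aws quicksight list-data-sources --aws-account-id <<account id>> --region <<Region>> --query "DataSourceList[].{Id:DataSourceId,Name:Name}"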

Step 2: Generate an update skeleton using the following AWS CLI command:

aws quicksight update-data-source --generate-cli-skeleton input > edit-data-source.json 

The generated JSON contains the following content, which includes a parameters section for every supported data source type:

"{
    "AwsAccountId": "",
    "DataSourceId": "",
    "Name": "",
    "DataSourceParameters": {
        "AmazonElasticsearchParameters": {
            "Domain": ""
        },
        "AthenaParameters": {
            "WorkGroup": ""
        },
        "AuroraParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "AuroraPostgreSqlParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "AwsIotAnalyticsParameters": {
            "DataSetName": ""
        },
        "JiraParameters": {
            "SiteBaseUrl": ""
        },
        "MariaDbParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "MySqlParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "OracleParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "PostgreSqlParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "PrestoParameters": {
            "Host": "",
            "Port": 0,
            "Catalog": ""
        },
        "RdsParameters": {
            "InstanceId": "",
            "Database": ""
        },
        "RedshiftParameters": {
            "Host": "",
            "Port": 0,
            "Database": "",
            "ClusterId": ""
        },
        "S3Parameters": {
            "ManifestFileLocation": {
                "Bucket": "",
                "Key": ""
            }
        },
        "ServiceNowParameters": {
            "SiteBaseUrl": ""
        },
        "SnowflakeParameters": {
            "Host": "",
            "Database": "",
            "Warehouse": ""
        },
        "SparkParameters": {
            "Host": "",
            "Port": 0
        },
        "SqlServerParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "TeradataParameters": {
            "Host": "",
            "Port": 0,
            "Database": ""
        },
        "TwitterParameters": {
            "Query": "",
            "MaxRows": 0
        },
        "AmazonOpenSearchParameters": {
            "Domain": ""
        },
        "ExasolParameters": {
            "Host": "",
            "Port": 0
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "",
            "Password": "",
            "AlternateDataSourceParameters": [
                {
                    "AmazonElasticsearchParameters": {
                        "Domain": ""
                    },
                    "AthenaParameters": {
                        "WorkGroup": ""
                    },
                    "AuroraParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "AuroraPostgreSqlParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "AwsIotAnalyticsParameters": {
                        "DataSetName": ""
                    },
                    "JiraParameters": {
                        "SiteBaseUrl": ""
                    },
                    "MariaDbParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "MySqlParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "OracleParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "PostgreSqlParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "PrestoParameters": {
                        "Host": "",
                        "Port": 0,
                        "Catalog": ""
                    },
                    "RdsParameters": {
                        "InstanceId": "",
                        "Database": ""
                    },
                    "RedshiftParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": "",
                        "ClusterId": ""
                    },
                    "S3Parameters": {
                        "ManifestFileLocation": {
                            "Bucket": "",
                            "Key": ""
                        }
                    },
                    "ServiceNowParameters": {
                        "SiteBaseUrl": ""
                    },
                    "SnowflakeParameters": {
                        "Host": "",
                        "Database": "",
                        "Warehouse": ""
                    },
                    "SparkParameters": {
                        "Host": "",
                        "Port": 0
                    },
                    "SqlServerParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "TeradataParameters": {
                        "Host": "",
                        "Port": 0,
                        "Database": ""
                    },
                    "TwitterParameters": {
                        "Query": "",
                        "MaxRows": 0
                    },
                    "AmazonOpenSearchParameters": {
                        "Domain": ""
                    },
                    "ExasolParameters": {
                        "Host": "",
                        "Port": 0
                    }
                }
            ]
        },
        "CopySourceArn": ""
    },
    "VpcConnectionProperties": {
        "VpcConnectionArn": ""
    },
    "SslProperties": {
        "DisableSsl": true
    }
}"

Step 3: Modify the JSON file, keeping only the sections you need, then run the following command to update the connection:

aws quicksight update-data-source --cli-input-json file://edit-data-source.json --region <<Region>>

Example JSON to change the data source connection to Aurora (this example points the MySQL parameters at the Aurora hostname):

{
    "AwsAccountId": "<<aws account id>>",
    "DataSourceId": "<<datasource id>>",
    "Name": "<<datasource name>>",
    "DataSourceParameters": {
        "MySqlParameters": {
            "Host": "<<Aurora hostname>>",
            "Port": <<port number>>,
            "Database": "<<database name>>"
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "<<user name>>",
            "Password": "<<password>>"
        },
        "CopySourceArn": ""
    },
    "VpcConnectionProperties": {
        "VpcConnectionArn": "<<VPC ARN>>"
    },
    "SslProperties": {
        "DisableSsl": false
    }
}
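Once the update command succeeds, you can verify that the data source picked up the new connection with describe-data-source (a verification step we suggest; the Status field in the response should show UPDATE_SUCCESSFUL after QuickSight validates the connection):

aws quicksight describe-data-source --aws-account-id <<account id>> --data-source-id <<datasource id>> --region <<Region>>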