Part Three: BC and DR Processes

Overview
Difficulty
Beginner
Duration
47m
Students
32
Description

In this course, we will discuss various vitally important metrics used to determine how well we have mitigated risk and how closely we have matched the requirements of our enterprise. These metrics include Annualized Loss Expectancy (ALE), Recovery Time Objective (RTO), Recovery Point Objective (RPO), Service Delivery Objectives (SDO), Maximum Tolerable Outage/Downtime (MTO/MTD), and Allowable Interruption Window (AIW).

We then move on to look at how these metrics can be applied to business continuity (BC) and disaster recovery (DR) planning and we'll also have a look at BC and DR in general, how it works, and the associated processes and techniques. Finally, we move on to testing BC/DR planning and the types of tests we can use.

If you have any feedback relating to this course, please reach out to us at support@cloudacademy.com.

Learning Objectives

  • Learn about the metrics for measuring performance in managing risk
  • Get a solid understanding of business continuity and disaster recovery
  • Understand how to test business continuity and disaster recovery practices

Intended Audience

This course is intended for those looking to take the CISM (Certified Information Security Manager) exam or anyone who wants to improve their understanding of information security.

Prerequisites

Any experience relating to information security would be advantageous, but not essential. All topics discussed are thoroughly explained and presented in a way allowing the information to be absorbed by everyone, regardless of experience within the security field.

 

Transcript

That completes section nine. Moving on to section 10, we're going to talk further about business continuity and disaster recovery in terms of the process and how it works. To briefly recap from the previous section, disaster recovery recovers the IT systems after some form of disruption has taken place.

Business recovery looks at recovering critical business processes after the same sorts of disruption. In order to achieve business recovery, we have to be able to do disaster recovery, which in this context will include elements of business continuity. In the planning phases, we first conduct a risk assessment and a business impact analysis.

From our findings, we develop our response and recovery strategy. These must be documented and then tracked. Training of the various participants in the teams is required at every step of these processes. Testing is iterative, takes various forms, and will require the regular updating of these plans. These plans and their results must be reviewed and audited and then used to update business operations where discrepancies or additional risks are identified.

So now let's look at the BCP/DRP event cycle. At the left-hand side, our graphic begins with normal operations. Moving straight across the top, we encounter an adverse event, which then causes operations degradation. Here we see the operations curve dropping rather steeply, crossing the minimum operations line and continuing down to a low point that reflects the degree of impact of the adverse event.

As we return to normal operations, the curve begins to slope upwards until it rises back to the level of normal operations. This upward trend indicates that recovery efforts are having a positive impact. From this graphic, several things should be noted.

First, in the period between normal operations and the adverse event's occurrence, we need to do our business impact analysis and our risk assessments to determine the RTO, RPO, and MTD. We also need to set the minimum level of operations that will be needed to ensure that the business will survive through any adverse impact.

At the bottom of the graph, we see business continuity, which is operations focused, running the entire length of the picture. We also see disaster recovery, which is event focused, running from the adverse event back through to normal operations. Between the adverse event and normal operations, we see several additional things.

We see the RTO and RPO extending from the first intersection of the curve with the minimum operations line to the second intersection with that line on the upward curve as we return to normal. Below this we see the MTD, the maximum tolerable downtime, which if exceeded means our business will likely fail.

Now, the time durations are not shown because they are specific to particular outage scenarios and specific business types. As such, they are truly context-specific and would be determined by the individual organization's BCP, BIA, risk assessment, and disaster recovery planning efforts.
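To make the relationship between these three metrics concrete, here is a minimal Python sketch that compares an outage against an organization's targets. The class name, function name, and the example durations used with it are hypothetical illustrations, not values from the course; real targets come out of the organization's own BIA and risk assessment.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical container for targets set during the BIA."""
    rto: timedelta  # recovery time objective
    rpo: timedelta  # recovery point objective (tolerable data-loss window)
    mtd: timedelta  # maximum tolerable downtime


def assess_outage(targets, downtime, data_loss_window):
    """Return a list of findings for one outage scenario."""
    findings = []
    if downtime > targets.mtd:
        # Past the MTD the business will likely fail, so report that first.
        findings.append("MTD exceeded")
    elif downtime > targets.rto:
        findings.append("RTO missed")
    if data_loss_window > targets.rpo:
        findings.append("RPO missed")
    return findings or ["within targets"]
```

For example, with an assumed RTO of 4 hours, RPO of 1 hour, and MTD of 24 hours, a 6-hour outage that loses 2 hours of data would be reported as missing both the RTO and the RPO while still being inside the MTD.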

Now, having all of these things in place gives us the basis for defining what our recovery operations would need to be at a procedural or detailed level. At this point, we need to define exactly how we would carry out our recovery efforts by walking through each area and each set of tasks associated with it. Doing this often requires the creation of multiple teams, each of which would be focused on a particular area and its particular recovery efforts.

The team breakdown is typically one person as leader, a second person as backup to the leader, and a number of other persons acting as team members. Each team is made up of skilled and experienced persons able to handle whatever comes up in their particular area of responsibility. Each member of the team should be redundant to every other member of the team to ensure that all skills needed are always present when required.

Also part of recovery operations will be the selection of alternate sites. They will be used in the event that business processing must be relocated. This would include investigation of cloud computing as a possible alternative.

As with every other part of this process, recovery operations must be looked at as an overall process so that they can be examined for gaps, lessons learned, and any other form of weakness or skills absence that might be present. Performing a risk assessment on this type of process is simply insurance against having missed anything by relying on an insufficiently well-developed recovery process.

Our recovery strategies must also be developed so that they can deal effectively with variable situations and competing objectives. As you deal with the question of recovery strategies, it is important to bear in mind that various types of solutions may apply in one scenario and not in others. This translates into exercising careful consideration and flexibility when composing the elements of your strategy.

It must always be supposed that the strategy ultimately accepted is meant to achieve the best possible recovery within the time available and at reasonable cost. This also means that the recovery effort is not striving to achieve perfection, but rather it is trying to achieve a proper outcome to preserve the operation and get the business running again at the earliest reasonable time within the RTO and RPO parameters.

As always, elements of strategies must include the cost of preparation prior to a crisis, the cost of resourcing during a crisis, and the cost of business interruption insurance as a stopgap measure for the entire event. As I said a moment ago, prudent thinking, informed decision-making, and flexibility of approach must be employed in putting together a strategy.

A single plan will have multiple aspects and it must be presented in this form to senior management for their approval and adoption. As part of the overall business continuity portion of this spectrum of activities, addressing threats will necessarily have to be undertaken.

Different approaches will therefore need to be part of the strategy in dealing effectively with threats. One possible approach will be to eliminate or in some way neutralize a given threat. This may involve various approaches, as different threats will require different responses.

It must be realized that, comparing internal and external threats, more capability to defend against internal threats will be possible, whereas only a limited amount of response capability may be possible against an external threat. By and large, the analyst conducting the work will find that the greatest capability will be found in modifying the asset or the environment of the asset to make it more immune or at least less susceptible to the given threat. This of course means taking the results of the business impact analysis and the risk assessment and composing a plan by which the findings will be acted upon in an optimally effective way to reduce vulnerabilities and exposures and thus become more resistant and resilient to threat actions.

It is also worth mentioning that while there may be possibilities to eliminate threats proactively, this does not eliminate the possibility of being forced to react once the threat has materialized and begun its action. As we look at the overall strategy, one element that we are looking for is the point at which an appropriate action is best and most effectively taken. This may not always be before the threat occurs but may in fact be after the threat materializes, so long as we can respond rapidly and counter that threat.

It is a very common element of every disaster recovery strategy that alternate sites and recovery methods associated with alternate sites will be included. So let's cover the types of alternate sites that may fall into our consideration for inclusion in our plan.

Hot site. A site that is fully populated with resources and requires only the most current data set. Typically, these must be operational within hours but no more than one day. Next comes the warm site. A site with the long-lead-time infrastructure operational but possibly requiring computing and storage resources to be completed. Typically, it must be operational within a matter of days but usually no longer than a week.

Cold sites. These are sites with virtually no infrastructure in place that require complete resourcing and the current data as well. Typically, these may take as long as 30 days or more to put into operation and obviously would not be used for recovering the most critical resources.

We have a redundant site, which is a full replication of the primary site. This is of course considered to be the most expensive option. It also functions as a mirror site. We have a relatively modern construction known as the mobile site. This is a small site built into a semi-tractor trailer, a construction trailer, or a container-sized format with the capability to restore only the most critical systems. It has the advantage of being mobile for relocation as may be necessary.

Then we have the reciprocal agreement. This is typically an agreement between businesses of similar type and size allowing one to use the other business's capabilities when adverse conditions arise for the first. It is often restricted by the size of the businesses involved and frequently complicated by the processing conflicts that can arise between the two partners.

With every option noted above, the timeframes suggested here are taken to mean that whichever option is employed must succeed in bringing the business function fully back online within the RTO and RPO timeframes. Longer timeframes mean greater business resilience, whereas shorter timeframes equate to more brittleness by requiring almost immediate recovery. In the case of each recovery site option, the following factors must be considered.

These are the various timeframes that have been defined by the BIA and the risk analysis process; proximity factors regarding the alternate sites and the primary site; and infrastructure issues, such as environmental, electrical grid, or other factors, that may be common to the primary and the alternate site and may make them both vulnerable to the same outage-causing event at the same time.
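The common-factor check described here can be sketched as a simple comparison of two site profiles. This is only an illustration: the function name, the factor names, and the example values are all hypothetical, and a real assessment would draw on far richer data than a flat dictionary.

```python
def shared_risk_factors(primary, alternate):
    """Return the infrastructure factors on which both sites depend on the
    same thing (for example the same electrical grid), sorted by name."""
    common_keys = primary.keys() & alternate.keys()
    return sorted(key for key in common_keys if primary[key] == alternate[key])


# Hypothetical site profiles: factor name -> what the site depends on.
primary_site = {"power_grid": "grid-A", "flood_zone": "zone-1", "carrier": "carrier-X"}
alternate_site = {"power_grid": "grid-A", "flood_zone": "zone-7", "carrier": "carrier-X"}

print(shared_risk_factors(primary_site, alternate_site))
```

Any factor this returns is a candidate for a single event taking out both sites at once, which is exactly the exposure the alternate site is supposed to remove.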

The bullets that you see on this slide regarding the RTO shrinking and cost rising as a result reflect the fact that the shorter timeframes required for proper recovery may indeed require more expensive solutions, simply because they must be ready at an instant's notice.

Such conditions being present may indicate that other alternatives may be required, such as cloud computing. However, cloud computing is not itself a silver bullet. It in fact brings in other considerations not present in the classic alternatives. It can, however, be an answer to a strategy.

Planning is of course a process that must be done with the best quality information and the highest quality methods if the best results are expected. Thus it is true that no matter how good your planning might be, its implementation must be of equal or better quality in order to make sure that the plan succeeds. So, as the plan is developed, methods and procedures for its implementation must be considered simultaneously in order to address issues before they arise where possible.
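The trade-off between RTO and site cost can be illustrated with a short Python sketch. It uses the upper-bound activation times quoted earlier in this section for hot, warm, and cold sites. The ordering from most to least expensive and the function names are assumptions made for the example, and redundant, mobile, reciprocal, and cloud options are left out because no timeframes were given for them.

```python
from datetime import timedelta

# Upper-bound activation times as quoted in this section, listed from
# most expensive to least expensive (the cost ordering is an assumption).
SITE_ACTIVATION = [
    ("hot", timedelta(days=1)),
    ("warm", timedelta(days=7)),
    ("cold", timedelta(days=30)),
]


def sites_meeting_rto(rto):
    """Return the site types that can be operational within the RTO."""
    return [name for name, activation in SITE_ACTIVATION if activation <= rto]


def cheapest_site_meeting_rto(rto):
    """Return the least expensive site type that still meets the RTO,
    or None if none of the three classic options is fast enough."""
    candidates = sites_meeting_rto(rto)
    return candidates[-1] if candidates else None
```

A result of None for a very short RTO is the case described above: the classic options are not fast enough at their worst-case activation times, so a more expensive arrangement such as a redundant site, or an alternative such as cloud computing, has to be considered.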

On the slide you see a list of things that must be considered during this process. Many of these do not involve technology as such but in fact involve various transitions and processes that include people and their movements. One of the most critical elements that must be considered in the process of developing a business continuity and recovery strategy is communications.

Part of developing the communications plan is to recognize that non-normal conditions will exist and must be prepared for. As an example, when a major hurricane impacted the Gulf Coast of America, first responders were sent in great numbers. The first ones on scene may have attempted to use the communication methods that they routinely used daily, only to find that none of them worked, because the hurricane had damaged the infrastructure so extensively that what remained could not be salvaged.

The response to this, in order to ensure that communications were established and maintained, was somewhat exotic: since cell phones and other forms of mobile devices could not be used, satellite phones had to be provided to ensure that the diverse responders could communicate. In addition to this, local ham radio operators were pressed into service to assist and augment the communications.

Taking the plan itself as a document, it should be constructed in such a way that any person given it could be expected to read it and make sense of it, even to the point of being able to carry out most of what the plan describes.

The plan itself should be organized with a very logical flow that is easy to follow, and it should be written in a very clear, straightforward manner to make understanding and quick grasp easy. Fundamentally, it should reflect a mission, strategies and goals, priorities, and ultimately the communications plan to ensure that all parties can be informed in a straightforward manner under these non-normal circumstances.

This plan manual must also reflect that it has been accepted and clearly approved by senior management and should be taken by all parties as the guide to be followed. One element that must not be neglected is the recognition that each element of the plan may have supply requirements, and without these being properly satisfied, that portion of the plan may fail.

Any plan section that fails may cascade in a series of failures through the succeeding plan sections. Failing to take supplies into account has often led to complications and failures of plan elements in many different settings. Therefore, this may represent a small element that can cause a very large impact if not properly addressed in advance.

As mentioned earlier, communications are one of the most important aspects of any disaster recovery or business continuity plan. In a similar fashion, having networks restored for systems to come online and communicate amongst themselves may be equally critical for success of the recovery effort. Thus communications of all types will have a very high priority on being established as early in the cycle as possible.

In the advance planning stages, this may mean giving consideration to bringing in uninterruptible power supplies to ensure that communications and network components will have power even if the grid power itself is lost. As with the provisioning of satellite phones for first responders, consideration of alternative methods of communicating may extend to the use of other kinds of alternatives for data networks, to the benefit of other recovery activities. And so, in providing for the continuity of network services, we need to look at various aspects of how these services can be brought into operation.

We have redundancy. This may take the form of extra capacity, multiple pathways, special routing protocols, or other factors. In consideration of these things, single points of failure must be avoided, and compensated for if they are unavoidable. We have alternative routing and diverse routing. We have diverse capacity through the use of more than one long-haul network service provider. We also have last-mile circuit protection and telephone system redundancy, where available.

Other considerations should also be applied. For example, storage needs have to be planned for, as does compute capacity, and there are various options that can be employed to accomplish this. For storage purposes, there may be direct-attached storage, network-attached storage (NAS), storage area networks, and RAID. Another option is cloud computing and cloud storage.

As we discussed before, there is a very close relationship between RTO and RPO. Given this very close relationship, the combination will need to be considered as a single element and should be one of the key characteristics when deciding what solution will be sufficient. When it comes to storage or compute, fault tolerance should be considered a necessary inclusion in any system entrusted with failover and data protection.

Load balancing and clustering should also be considered, particularly when it comes to web access or cloud computing. High availability storage systems such as cloud storage can help to reduce the cost and improve the speed of recovery as well. Now throughout this process, we've been focusing on how to plan for and mitigate outages through disaster recovery and business continuity planning.

At the end of this process, we typically find that we have done a more than adequate job of doing so but that residual risk remains in the form of losses for which there is no practical mitigation. This of course is where business interruption and other forms of insurance enter the conversation. The task of planning and acquiring business interruption or similarly named insurance requires care and due diligence to ensure that all aspects that are necessary for your business to be covered are in fact covered, and that all the terms of claims and restitution are understood thoroughly.

As an example, knowing what the policy covers in the form of direct or primary impact and how that differs from consequential damages must be thoroughly understood to avoid unpleasant surprises at the point of assessing damage and filing claims.

On the slide you see a list of the most common things that need to be considered. It is, however, the case that each business has its own concerns, its own assets, its own valuation, and its own particular needs. So this is clearly not a one size fits all problem to be solved.

In the course of consideration of policy acquisition, the two primary subjects that need to be addressed are adequate coverage for assets and conditions of loss and the valuation of the coverage to ensure that it is proportional within the proper limits.

As is always the case with business continuity and disaster recovery planning, once the plans are developed and implemented they will need to be tested periodically in order to ensure that they remain updated with regard to any change in the organization, IT systems or any other critical aspect.

The different forms of testing are well-known. They include the checklist type, the structured walkthrough type, the parallel systems or facilities type, and the full interruption type. These tests can be seen either as a sequence, a progression of increasingly complex and thorough types of testing, or as individual point-type tests. Whichever way these tests are seen, the key ingredient is that the test employed is a meaningful one that demonstrates that the system, the facility, or the business operation can in fact be properly recovered within the specified RTO and RPO timeframes.

The process of plan maintenance is facilitated by various feeder processes built within the organization. These ensure that significant changes with a direct impact on the plan are fed to the manager of the plan, so that the information necessary to update the plan keeps flowing.

In addition, the same information will be necessary to update the business interruption insurance that has been acquired as part of the total BC/DR process.

About the Author
Ross Leo
Instructor
Students
2892
Courses
44
Learning Paths
6

Mr. Leo has been in Information Systems for 38 years, and an Information Security professional for over 36 years. He has worked internationally as a Systems Analyst/Engineer, and as a Security and Privacy Consultant. His past employers include IBM, St. Luke’s Episcopal Hospital, Computer Sciences Corporation, and Rockwell International. A NASA contractor for 22 years, from 1998 to 2002 he was Director of Security Engineering and Chief Security Architect for Mission Control at the Johnson Space Center. From 2002 to 2006 Mr. Leo was the Director of Information Systems, and Chief Information Security Officer for the Managed Care Division of the University of Texas Medical Branch in Galveston, Texas.

 

Upon attaining his CISSP license in 1997, Mr. Leo joined ISC2 (a professional role) as Chairman of the Curriculum Development Committee, and served in this role until 2004. During this time, he formulated and directed the effort that produced what became and remains the standard curriculum used to train CISSP candidates worldwide. He has maintained his professional standards as a professional educator and has trained and certified nearly 8500 CISSP candidates since 1998, and nearly 2500 in HIPAA compliance certification since 2004. Mr. Leo is an ISC2 Certified Instructor.