Implementing Disaster Recovery Processes

This course is the 3rd of 4 modules of Domain 7 of the CISSP, covering Security Operations.

Learning Objectives

The objectives of this course are to provide you with the ability to:

  • Implement recovery strategies
  • Implement disaster recovery processes
  • Test the disaster recovery plan

Intended Audience

This course is designed for those looking to take the most in-demand information security professional certification currently available, the CISSP.


Any experience relating to information security would be advantageous, but not essential.  All topics discussed are thoroughly explained and presented in a way allowing the information to be absorbed by everyone, regardless of experience within the security field.


If you have thoughts or suggestions for this course, please contact Cloud Academy at


We're going to continue our discussion on disaster preparedness and talk about disaster recovery processes in section 12. So here we're going to look at disaster recovery as part of the total business continuity spectrum of activities, and at communications and their importance. In this particular aspect of security, we need to look at disaster recovery as a way to keep the business running even when faced with outages of varying seriousness. It means that we're going to have to take a look at the urgency and importance of the various tasks that we must accomplish during these difficult periods.

Urgency and importance typically create four categories of actions that we have to take. We will have the urgent and important, which of course is where a lot of the things that we're going to do during disaster recovery are going to be. We have the urgent and unimportant. These are potential time-wasters, but we must be careful about how we put something into this particular pigeonhole. We have to be sure that it is unimportant before we delegate it to someone else to handle. We have our non-urgent important. These are strategic initiatives that should go into the building of the plan. And then we have the non-urgent unimportant, which are vacation or other forms of time-wasting. Now the primary goals that we need to follow are going to be to resolve the urgent important with all of the urgency that it seems to indicate. We have to work the non-urgent important strategic items, such as when we build our plan. We can delegate the urgent unimportant and try our best to remove the non-urgent unimportant to keep from being distracted from the real issues confronting us.
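The four categories and the action attached to each can be written down as a small lookup. The following is only a minimal sketch; the names `Action` and `triage` are illustrative assumptions and do not come from the course material.

```python
from enum import Enum


class Action(Enum):
    """What to do with a task, per the urgency/importance categories."""
    RESOLVE_NOW = "resolve immediately"        # urgent and important
    DELEGATE = "delegate to someone else"      # urgent, unimportant
    PLAN = "work as a strategic item"          # non-urgent, important
    REMOVE = "remove as a distraction"         # non-urgent, unimportant


def triage(urgent: bool, important: bool) -> Action:
    """Map a task's urgency and importance to the recommended action."""
    if urgent and important:
        return Action.RESOLVE_NOW
    if urgent:
        # Only reached when the task has been judged unimportant,
        # so be sure of that judgment before delegating.
        return Action.DELEGATE
    if important:
        return Action.PLAN
    return Action.REMOVE


if __name__ == "__main__":
    print(triage(urgent=True, important=True).value)
    print(triage(urgent=False, important=True).value)
```

The point of keeping this as a fixed table is that the sorting is done ahead of time, not improvised during the event.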

Now as we did with our risk assessment, in the building of our disaster recovery plan, we need to look again at our threat and threat agent analysis, because these elements, if they materialize, can cause the kinds of events that our disaster recovery plan and business continuity efforts are going to have to address. So we need to run through these again and make sure that we plan for the contingencies that these will create if they materialize.

Now you've seen this slide before, but here it presents a different sort of strategy that we must address. As we look through these events, all the events that we're going to face will have pretty much the same spectrum of activities and the same kind of timeline, whether it's longer or shorter. As we plan our disaster recovery and business continuity efforts, we need to look at each of these phases and make sure that we have in place the right protective, recovery, preventive, corrective, compensating, and countermeasure types of controls. Should we have an adverse event, these make our disaster recovery efforts easier and more straightforward, giving us an action plan instead of confusion.

Now let's look at our comparative recovery times as we build our strategy. Here we have on the left the recovery point objective, which measures the required data currency needed to achieve RTO, our recovery time objective. On this side, we're talking about data loss, restoration of lost data back to current levels, and then ramping back up to the level of data currency that we need to start again. Tape backup at the furthest extent is the slowest. Remote data journaling is faster and database shadowing is still faster. On the recovery time objective side, this measures the time required to activate the critical systems, assure event survival, and enable recovery. Again, we start at the furthest extent with tape restore, again, the slowest. We have our online restore from one drive set to another. We have remote replication, and then we have clustering. Now as we move into the center where you see the lightning bolt, signifying the adverse event that has occurred, we go from very slow, at the furthest reach to the right and to the left, to the fastest or shortest recovery time there in the center. We also go from lowest cost at the furthest extent outward to the highest cost at the furthest inward. But the thing to bear in mind about these strategies is that the cost of the loss will justify the strategy necessary to avoid the costs incurred. So if it's most expensive there in the center, that would mean that our losses would be highest and exacerbated if we were to go with something as slow as tape restores. So we move into the center because the loss caused by the event should justify by its magnitude the price we pay for the strategy to avoid it or recover from it.
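The idea that the size of the loss justifies the price of the strategy can be worked through with a little arithmetic. This is a rough sketch only: the recovery hours, strategy costs, and hourly loss figures are invented placeholders chosen to illustrate the trade-off, and the names `Strategy`, `total_cost`, and `cheapest_overall` are assumptions, not anything defined in the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery_hours: float  # placeholder time to restore service
    cost: float            # placeholder cost of having the strategy in place


# Ordered from slowest and cheapest to fastest and most expensive,
# matching the order in the transcript. All numbers are illustrative only.
STRATEGIES = [
    Strategy("tape restore", 72.0, 10_000),
    Strategy("online restore", 24.0, 40_000),
    Strategy("remote replication", 4.0, 150_000),
    Strategy("clustering", 0.1, 400_000),
]


def total_cost(strategy: Strategy, loss_per_hour: float) -> float:
    """Price of the strategy plus the loss suffered while recovering."""
    return strategy.cost + strategy.recovery_hours * loss_per_hour


def cheapest_overall(loss_per_hour: float) -> Strategy:
    """Pick the strategy with the lowest combined cost and loss."""
    return min(STRATEGIES, key=lambda s: total_cost(s, loss_per_hour))


if __name__ == "__main__":
    for hourly_loss in (100, 10_000, 100_000):
        best = cheapest_overall(hourly_loss)
        print(f"loss/hour {hourly_loss}: {best.name}")
```

With these placeholder figures, a small hourly loss favors tape restore, while a very large hourly loss pushes the choice in toward clustering, which is the movement toward the center described above.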

So let's take a look at an example timeline of a business continuity or disaster recovery event. Here you see, on the left side, normal operations. Here is when we have to do all of our strategizing, putting together our plan, determining what our recovery time objective should be so that we don't reach the maximum tolerable downtime. The importance of RTO is very great because to achieve it, as I said earlier, all but guarantees we don't reach MTD, which would be a good thing to avoid. So we have to determine what that is. And in parallel with determining the RTO, we must determine what the RPO is as well. Another thing to be established is the minimum level of operations needed to keep us alive in the face of an event so that we will have the opportunity to recover. And as we go across the normal operations line, we have our event. Operations degrade at some rate, possibly very, very quickly, possibly much slower. But as they degrade, they're going to hit and cross through the minimum operations level until they reach their lowest level. At that point, that might be considered the maximum tolerable downtime. And from there, we have to advance forward to return to normal operations. By establishing where we are and establishing that minimum operations level, we're then able to put together the policies, procedures, activities, supplies, etc., that we're going to need so that we can close the gap on the RTO and RPO period and achieve those, so that we avoid reaching MTD. As you see there in the graphic, reaching the MTD and failing to do better means that we're going to continue to go down and ultimately fail to recover at all. If we are able to achieve RTO and RPO, then services still degrade, but we end up paused at the minimum operations level, and that enables us to return to normal. And notice at the bottom of the graphic, it shows business continuity.
Business continuity is a variety of activities focused on the business, what it takes to keep the business in operation from normal through adverse back to normal. Disaster recovery is focused on the event. The disaster recovery part, a necessary reactive control set, responds when the adverse event happens, and it focuses on the event to minimize the impact of the event on operations and help us get back to normal in a reasonably short period of time.
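The relationship between RTO, RPO, and MTD on that timeline can be expressed as a couple of simple checks. This is a minimal sketch under stated assumptions: the function names `classify_outage` and `rpo_met`, the hour units, and the result strings are all illustrative and not taken from the course.

```python
def classify_outage(downtime_hours: float, rto_hours: float, mtd_hours: float) -> str:
    """Compare the length of an outage against RTO and MTD."""
    # The RTO only protects us if it is set inside the MTD.
    if not 0 < rto_hours < mtd_hours:
        raise ValueError("RTO must be positive and shorter than MTD")
    if downtime_hours <= rto_hours:
        return "recovered within RTO"
    if downtime_hours < mtd_hours:
        return "missed RTO, still within MTD"
    return "reached MTD, recovery in doubt"


def rpo_met(hours_since_last_good_copy: float, rpo_hours: float) -> bool:
    """True if the restorable data is current enough to satisfy the RPO."""
    return hours_since_last_good_copy <= rpo_hours


if __name__ == "__main__":
    print(classify_outage(downtime_hours=3, rto_hours=4, mtd_hours=24))
    print(rpo_met(hours_since_last_good_copy=1, rpo_hours=2))
```

Checking that the RTO is shorter than the MTD up front mirrors the planning point made above: achieving RTO is what keeps us from ever reaching MTD.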

Now understanding what we need to do to get there to achieve those particular goals, we put together what is sometimes called a continuity of operations plan. Several actions take place and two sets of actions are joined. We have the business continuity plan where we put together a good project to determine what we need to do to keep in business, stay alive, face adverse events, recover from them, and continue to improve, back to normal, and then continue our operations. This is the proactive or protective part of the total plan spectrum. Then we take our continuity strategy planning product, which feeds into our disaster recovery plan cycle. From there, we do the plan design and development for disaster recovery, again, focused on the event. When we have the plan, then we determine what we need to do in the way of team building and training exercises, and what needs to be documented to ensure that we're prepared by having a trained workforce. And by accomplishing the DRP portion, we have fulfilled the entire spectrum of activities with the reactive or the responsive activities of the DRP. Now as you see at the bottom, we have as the last two activities of our DRP cycle, testing and maintenance. And these are iterative because we test and we update, so that we capture changes in technology, changes in our facilities, changes in our staffing. Because any one of those can cause an adverse event within our plan and our recovery efforts. So we have to account for them, then go back and retest to make sure that all the persons, all the facilities, all the plans, and all the systems are again still in good shape and prepared. Just like any tool that needs to be sharpened, this plan needs to be tested, calibrated if you will, so that we can actually rely on the plan to recover our business.

The intent then of disaster recovery is to restore the services from the contingency state and return them to a normal state. Now, this typically involves tasks across several areas, and these include handling personnel, the actual response itself, the communication which will pervade all the different phases and activities, assessment activities, and various restoration tasks that will have to be performed. And those who will participate, of course, must go through proper training targeted to help them with the skillset they will need to successfully accomplish all of these.

What we should be doing all through this spectrum of activities is documenting the plan. As with anything, we start with something as a very general framework. And as we move through the planning cycle, we progressively elaborate, add detail, add flesh to the bones if you will, covering all the different activities we're going to have to conduct, such as activation and recovery procedures, plan management, in other words managing the document itself to make sure that we put together the proper plan and keep it up to date. It will involve people and therefore it must account for human resource management. There's going to be a cost associated with this, the cost to put together the plan, then the cost of activating the plan, and all the cost associated with the pieces that will be activated when the plan itself is activated. We're going to have to have required documentation of various types and various places in the hands of various team members, so that they could do the job that they're assigned under these contingent circumstances. We have to identify those. We have to plan for internal and external communications, which means we might have to have alternative methods. And we have to have detailed action plans per team and per team member. At the time that we have one of these events, that would be the absolutely worst time to start making decisions about who's going to do what. That leads to chaos, inefficiency, conflicts, and ultimately it may spell catastrophe instead of just disaster.

Now, with all the preparations that are being made, one of the things to bear in mind is the plan itself must be at such a level of detail that you could hand a copy of the plan off to almost anyone, almost anyone, and expect them, with reasonable skill and reasonable time, to be able to read the plan and figure out exactly what needs to be done, so that even if they have a minimum of experience with the procedures, they should be able to fumble their way through the recovery. Now, we're not trying to make sure that anyone can read this. We can't make it too simple. But it does mean that it's readable and understandable to anybody who is likely to have a copy of the plan.

So there is, of course, the response, and the response is one of the very first areas of activity that we're going to engage in once an incident has been identified. Once it has been, it must be reported to a central communications group responsible for initiating a proper response. This may start with a call to the help desk, but the help desk will not be the place where the declaration of the plan will occur. The communications must reach response participants and anyone who is impacted by the event to spread the word that something is going wrong. Everyone in the organization should have the central number so that they can report this. And if the event needs to be escalated, there must be an escalation ladder that can be accessed and followed to ensure that the event information is escalated to the right level, so that if the plan declaration becomes necessary, that can be done by the appropriate person.

The process of communication and its escalation as the situation grows needs to reach the executive emergency management team, which would consist of the senior executives responsible for recovery. Here, they have to respond to and help resolve any of the issues that need their direction. They're going to have to take the plan and make decisions about how the organization is going to manage the business impact of the event. The mechanics of it have already been planned and laid out in the document itself. Now they have the higher-level issues that they must address. Part of that will involve the appointment of a spokesman for the organization to the media to handle all of the communication that will need to go to the public. There are emergency management teams, a division of labor that will deal with all the various and specific areas that require their attention for the recovery effort. This includes people who will directly report to the command center through their team leads and oversee the recovery and restoration process that will be executed by the individual team members.

Now the objectives will differ depending upon which team you happen to be on, but they all have a set of objectives that must be achieved. Each one will need to go out and assess preliminary damage in their particular area. Through their team lead, they will notify management of the current status, the impact, and the plan of action to cope with it. Based on the input taken from all the teams, the executive emergency management team will make the decision to declare, not to declare, or to declare partially, so that instead of declaring the entire thing, which based on the feedback that they've gotten may not be necessary, they'll be able to gauge the response on parity with the events.

Once the declaration has been made, the plan will be initiated, command and control centers will be organized as the central point of contact for team leads, and then they will set about the business of organizing whatever administrative support will have to be provided by yet another team. The emergency management team will then advise and direct how the recovery effort will be conducted.

Some of the responsibilities that the individual teams will have to take on include retrieval of offsite records and recovery information from the offsite storage. They will need to report to the alternate site identified in the procedures, and then, with respect to each team that they belong to, they will execute the recovery processes within their area. Throughout, communication will be one of the most critical functions that any of the team members or leads will perform. So the communication will keep the teams in tune with the executive emergency management team and communicate the status of their efforts. They will identify issues or problems and escalate them as rapidly as is reasonable given the circumstances. If it is a prolonged effort, they may have to establish shifts for the recovery teams, thus highlighting the need for redundant members. There will be of course the function of liaison with critical business personnel. And then as part of their efforts, they will need to assess whether or not restoration or replacement of the equipment or software will become necessary.

Various issues will have to be decided at various points in this recovery process. One of the things that will have to be taken care of is examining the existing technology for the site: what will be needed, what's been damaged, what needs to be replaced. We're going to have to have a recovery strategy in place to handle each one of the locations. If multiple locations are involved, decision-makers at each location will have to be identified and contacted. The declaration process for declaring a disaster needs to be put together before the event occurs, so that as an assessment is being performed by the executive emergency management team, they can go through a checkbox-type series of criteria that will help them make a proper declaration. If alternate sites are going to be needed, we have to make sure that the alternate site is available. And in the event that the building, the primary site where the event is taking place, is closed by public safety officials, we know that we have preparations made and are ready to receive operations at the alternate site. That will, of course, mean making sure that all the workforce can get there and that everybody knows where they need to go and who they need to see. Everyone should already know what they need to do when they get there. And all the arrangements should be made in advance for hotels, transportation services, and the like, so that the process of transition from one site to another is as seamless and as trouble-free as we can make it. If the event is the kind that affects multiple organizations within a geographic area, these kinds of resources may become very scarce or altogether unavailable. Planning for these kinds of contingencies is also part of this effort.
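A checkbox-style set of declaration criteria, prepared before any event, might look something like the following. This is a hedged sketch only: the criterion names, the "full" and "partial" scopes assigned to them, and the function `declaration_level` are hypothetical examples, and a real organization would define its own criteria in its plan.

```python
# Hypothetical criteria, each mapped to the scope of declaration it would justify.
CRITERIA = {
    "primary_site_closed_by_authorities": "full",
    "outage_expected_to_exceed_rto": "full",
    "single_critical_system_unavailable": "partial",
    "single_location_affected": "partial",
}


def declaration_level(checked: set) -> str:
    """Return "full", "partial", or "none" for the boxes that were checked."""
    unknown = set(checked) - set(CRITERIA)
    if unknown:
        # Criteria must be agreed before the event, not invented during it.
        raise KeyError(f"criteria not in the plan: {sorted(unknown)}")
    scopes = {CRITERIA[name] for name in checked}
    if "full" in scopes:
        return "full"
    if "partial" in scopes:
        return "partial"
    return "none"


if __name__ == "__main__":
    print(declaration_level({"single_location_affected"}))
    print(declaration_level({"primary_site_closed_by_authorities"}))
```

Rejecting any criterion that is not already in the table reflects the point made above: the declaration process is decided ahead of time so the assessment is a matter of ticking boxes.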

Now, the communications in detail include communicating with the personnel, that is, the recovery teams and the management, and making sure the general workforce members, including those who are not going to be directly involved in the recovery effort, get timely, accurate communications. For example, those in the organization who will not be part of the direct recovery effort may need to know status. They may need to know when or if to report, and to where, and to whom. We have stakeholders, which include contractors, suppliers, distributors, shareholders, the community, and the authorities. Through a proper channel, which would include management and legal counsel, the messages that we're going to send out to these shareholders, various stakeholders, various contractors and so forth need to be a proper revelation of what has gone on and, in general statements, what has happened and what you're doing about it. There needs to be an assessment: the state of condition, severity, urgency, danger, duration. This is a message that needs to be sent out to virtually everyone, so that they have a sense of what to expect, of what to avoid, where not to go, and so on. Then in getting to the restoration, the communication needs to be with legal and with our insurance, so that insurance assessment and legal liabilities can start being evaluated at the earliest possible point.

About the Author

Mr. Leo has been in Information Systems for 38 years, and an Information Security professional for over 36 years.  He has worked internationally as a Systems Analyst/Engineer, and as a Security and Privacy Consultant.  His past employers include IBM, St. Luke’s Episcopal Hospital, Computer Sciences Corporation, and Rockwell International.  A NASA contractor for 22 years, from 1998 to 2002 he was Director of Security Engineering and Chief Security Architect for Mission Control at the Johnson Space Center.  From 2002 to 2006 Mr. Leo was the Director of Information Systems, and Chief Information Security Officer for the Managed Care Division of the University of Texas Medical Branch in Galveston, Texas.


Upon attaining his CISSP license in 1997, Mr. Leo joined ISC2 (a professional role) as Chairman of the Curriculum Development Committee, and served in this role until 2004.   During this time, he formulated and directed the effort that produced what became and remains the standard curriculum used to train CISSP candidates worldwide.  He has maintained his professional standards as a professional educator and has trained and certified nearly 8500 CISSP candidates since 1998, and nearly 2500 in HIPAA compliance certification since 2004.  Mr. Leo is an ISC2 Certified Instructor.