The course is part of this learning path
This course is the 3rd of 4 modules of Domain 7 of the CISSP, covering Security Operations.
Learning Objectives
The objectives of this course are to provide you with the ability to:
- Implement recovery strategies
- Implement disaster recovery processes
- Test the disaster recovery plan
Intended Audience
This course is designed for those looking to take the most in-demand information security professional certification currently available, the CISSP.
Prerequisites
Any experience relating to information security would be advantageous, but is not essential. All topics discussed are thoroughly explained and presented in a way that allows the information to be absorbed by anyone, regardless of experience in the security field.
Feedback
If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.
We continue now with the Cloud Academy presentation of the CISSP examination preparation review seminar. We're going to continue with Domain 7 - Security Operations, beginning on slide 92. In this particular area, we're going to be talking about preparatory steps and various strategies for disaster recovery: advance planning, including backup storage strategies, recovery site strategies, various agreements, resilience, high availability, quality of service, fault tolerance, drives and data storage, backup and recovery systems, and then staffing for resilience, including a brief discussion of cloud options.
So the backup strategies that we have exist basically to ensure that the data critical to our operation is not lost, or hopefully not even compromised. In thinking about these backup strategies for data, we're going to rely on data storage technologies. There are two objectives we have to meet to make sure we never reach the maximum tolerable downtime, the threshold that determines whether or not we're going to recover at all: the recovery time objective and the recovery point objective. By achieving these, we all but guarantee that we're not going to reach the maximum tolerable downtime, and we stand a far better chance of recovery.
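To make the relationship concrete, here is a minimal sketch, assuming the common framing in which the recovery time objective (RTO) plus the work recovery time (WRT) must fit inside the maximum tolerable downtime (MTD); the figures and the WRT term are illustrative additions, not values from the course.

```python
# Minimal sketch: checking recovery objectives against the MTD.
# The MTD >= RTO + WRT framing and the figures below are illustrative
# assumptions, not values from the course.

def within_mtd(rto_hours: float, wrt_hours: float, mtd_hours: float) -> bool:
    """Return True if recovery time (RTO) plus work recovery time (WRT)
    fits inside the maximum tolerable downtime (MTD)."""
    return rto_hours + wrt_hours <= mtd_hours

# Example: a hypothetical critical payment system.
mtd = 24.0   # business survives at most 24 hours of outage
rto = 8.0    # systems restored within 8 hours
wrt = 12.0   # data verified and backlog cleared within 12 more hours

print(within_mtd(rto, wrt, mtd))  # True: 8 + 12 <= 24
```

Exact definitions of these terms vary by methodology, but the point stands: the recovery objectives must sum to something comfortably inside the MTD.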
So the various recovery strategies we're going to explore are the redundant center, the hot site, both internal and external options, a warm site, a cold site, a mobile site, followed by discussion of cloud options.
Now, the redundant center option is exactly what the name suggests: a full replication of what we use for normal operations. The advantages of a redundant center are near-zero downtime, easier maintenance planning (with potentially complicated execution, of course), and no recovery being required. Being redundant, it should just about instantly take over. It does, of course, have disadvantages. The most obvious is that it's the most expensive option: by its nature, it requires fully redundant hardware, networks, staffing, the works. And while it allows for easier maintenance planning, with two centers, a primary and a redundant, execution can be very complex; you have to maintain both so that when you fail over, you fail over to exactly what you expect. There are also distance limitations that you face with a redundant center.
Now, the hot site is by intention a recovery site for the most critical data and the most critical applications, those things without which your business will all but be guaranteed to fail. You can have an internal or an external version of this, but bear in mind these are only for the most critical applications; the hot site is not intended to recover everything. The advantages of an internal or external hot site are that it allows recovery to be tested, it should be highly available, and it can be operational within hours, and no longer than the time frames specified by the recovery time objective and the recovery point objective. The disadvantages are that it is a costly option, hardware and software compatibility issues can exist if you are using an external hot site, and internal hot site systems are unavailable for any other use, given that they have been set aside for hot site recovery. There is a temptation to want to recover more than the most critical applications, but you're going to size the site and plan its capacity and execution times for the most critical only.
Following this, we have the warm and the cold site. The advantages of warm and cold sites are that they are much less expensive, especially the cold site, and they should be planned for use in longer recoveries after the hot site period expires. Often this is an external site obtained by subscription, such as SunGard or IBM Global Services would provide. These are options you are going to use for prolonged recovery efforts. The disadvantages, of course, are that neither is immediately available: it might take a few weeks to get a warm site up and a month or so to get a cold site up, and this must be planned for. For example, when you go to your hot site, you should begin the process of getting your warm or cold site prepared if you foresee a prolonged recovery effort. And because warm and cold sites are by nature incomplete, they're not fully testable without extensive work to make them so.
A more modern option would be the mobile site. The advantage of a mobile site is that it is relatively easy to transport. This can be a modular approach to building data centers, such as has been experimented with by Facebook, Google, Amazon, and Microsoft, where you take shipping containers, fill them with computing equipment intended for a data center, and then basically stack them like Lego blocks. It's certainly more complicated than that, but that's the general idea. There are several options for a mobile site: containers, semi-truck trailers, or modular office portables. The disadvantages, of course, are that the cold site capability must be built at a predetermined location, the density and design of the container make upgrading and customizing challenging, and maintaining a shipping or equipment contract to move the container in times of disaster can be expensive.
Moving to the BCDR plan in the cloud, we have three options. The first is on-premises with the cloud as the business continuity site: we operate the IT in our own shop and use the cloud solely as a backup provider. The second is where we are already in the cloud at one provider or another and use that same cloud provider as our BCDR site. Or, for maximum safety, we can be a consumer at one cloud provider and use a different provider as the backup site to our primary. And even while the cloud addresses data and application continuity, we still have to plan for working spaces, access methods, and the ability of our workforce to reach these resources in the cloud from someplace so that work can continue.
There is a form of agreement, sometimes referred to as a gentleman's agreement, called a reciprocal agreement, or sometimes mutual aid. It is typically between two businesses that both have computing resources: in the event that one of them runs into difficulties, suffers a disaster of some sort, and needs additional or replacement capacity, it turns to the other party and makes use of the second party's equipment when the second party doesn't need it. Each organization commits to hosting the data and processing of the other in the event of any kind of an outage. This, of course, raises many issues. Unless your disaster occurs at a mutually convenient time, the party experiencing the disaster may have no way to move into the other's location, because that party may be running its own business at that time and unable to give up any capacity whatsoever. And any delay may prove fatal to the ability of the one suffering the outage to survive. So while there are many disadvantages to this particular arrangement, such agreements have been known to work; but this should by no means be your sole strategy for recovery.
There is, of course, the very common process of outsourcing. Planned and executed properly, outsourcing eliminates the issues associated with reciprocal agreements and the cost burdens associated with building alternative sites. The site, of course, is already up, ready for you to move in, and if you happen to be first on the list, you get there and you're able to process. Typically your period is 30 days, but that depends on what the language of your contract actually says. This can be an especially cost-effective solution since the organization only incurs major costs when the plan is activated. In the meantime, you're paying a subscription fee for your place in line to use the resource when the time comes.
One way of ensuring sufficient capacity for BCDR purposes is clustering. Clustering is the joining of two or more systems in a tightly coupled, logically single-image arrangement that can provide service at the same time. The cluster is managed as a single logical image, and the advantage of that is that individual units within the cluster can fail while the cluster itself never loses availability, provided it has sufficient units. From a storage perspective, a strategy was developed some decades ago that we know as RAID, the Redundant Array of Independent Disks. RAID levels have been developed over time to accommodate various strategies, whether for speed or for safety. These have been standardized for quite some time, and they refer to the way multiple disks are configured to work together: some provide enhanced performance, some provide additional reliability, and some provide a combination of both.
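Before turning to the individual RAID levels, here is a hypothetical Python sketch of the clustering idea just described: callers see one logical service while requests are routed around failed members. Real cluster managers add heartbeats, quorum, and shared state; everything here is invented for illustration.

```python
# Hypothetical sketch of cluster-style failover: the caller sees one
# logical service, while requests are routed around failed nodes.

class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

class Cluster:
    """A single logical image over several member nodes."""
    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes

    def handle(self, request: str) -> str:
        for node in self.nodes:           # try members in order
            if node.healthy:
                return node.handle(request)
        raise RuntimeError("all nodes down: cluster unavailable")

cluster = Cluster([Node("node-a"), Node("node-b"), Node("node-c")])
cluster.nodes[0].healthy = False          # one member fails...
print(cluster.handle("GET /orders"))      # ...service continues on node-b
```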
So the RAID levels begin with RAID zero. RAID zero is focused on providing speed: it writes files in stripes across multiple disks without the use of parity. In contrast, RAID one duplicates all disk writes across multiple disks, ensuring that no data is lost by keeping multiple copies of the same data. RAID levels two, three, and four are not in common usage. The one that comes as a factory default most often is RAID five, where data and parity information are striped together across all drives, so that a single drive in, for example, a four-drive array can be lost and hot swapped, and the remaining three drives can rebuild the lost volume over a period of time. RAID six extends RAID five by computing two sets of parity information, allowing the array to survive the loss of two drives. Then we have the combinations: zero plus one, one plus zero, one plus five, five plus one; various combinations that try to obtain the advantages of both speed and safety.
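The parity mechanism behind RAID five is easy to demonstrate. In this minimal sketch, byte strings stand in for disk stripes, parity is the XOR of the data blocks, and a lost block is reconstructed by XOR-ing the survivors with the parity. This illustrates the principle only, not a real controller.

```python
# RAID 5 parity in miniature: parity is the XOR of the data blocks,
# so any one lost block is recoverable from the survivors.

def xor_blocks(blocks: list[bytes]) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"   # data stripes on three drives
parity = xor_blocks([d1, d2, d3])         # parity stripe on a fourth drive

# Drive 2 fails; rebuild its stripe from the survivors plus parity.
rebuilt = xor_blocks([d1, d3, parity])
assert rebuilt == d2
print(rebuilt)  # b'BBBB'
```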
There was an attempt to take the idea of RAID on drives and extend it to tape. The Redundant Array of Independent Tapes, called RAIT, applied the same idea to tape media: utilizing striping, it would write to multiple volumes of tape at the same time, in much the same way that RAID writes to multiple physical volumes of disk at the same time. RAIT never really caught on; as a disaster recovery method it was certainly adequate, but slow. It has been most commonly used in conjunction with tape vaulting.
Now, the backup and recovery systems that we have focus on copying data from one location to another so that it can be restored, and here we have some typical examples. We have the off-site facility, such as Iron Mountain might provide. We have electronic vaulting, bulk data dumps that are electronically stored after transmission. And we have remote journaling, the near-real-time journaling of transactions and redo logs to a secondary site (a brief sketch follows below). Electronic vaulting and remote journaling are often done together.

The one thing that has to be planned for with all the diligence we apply to preventing data loss is staff. We have to plan for staff resilience and avoid single points of failure associated with critical individuals on the operations team. A disaster recovery plan that relies on the guaranteed availability of all key personnel is a disaster recovery plan that may itself suffer disaster. So we have to plan for the possibility that some of our key individuals will not be available, whether through sickness, vacation, having left to find another job elsewhere, or possibly worse. We must plan for resilience in our staffing; otherwise we court the failure of our plan. If two or more individuals are capable of providing similar services, the operation in question will be less affected by the unavailability of any single member. Adequate staffing levels are contingent on when staff will be required: people cannot work endlessly, and they will have to be relieved. So whatever the case, we must plan to have multiple people able to fill multiple roles.
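As promised above, here is a brief, hypothetical sketch of remote journaling: each committed transaction is shipped to the secondary site as it happens, in contrast to electronic vaulting's periodic bulk dumps. The record format and transport are invented for illustration.

```python
# Hypothetical remote-journaling sketch: each committed transaction is
# shipped to the secondary site as it happens (near real time), unlike
# electronic vaulting, which sends periodic bulk dumps.

import json
import time

class RemoteJournal:
    def __init__(self, send) -> None:
        self.send = send          # transport to the secondary site (assumed)

    def commit(self, txn: dict) -> None:
        record = json.dumps({"ts": time.time(), "txn": txn})
        # In a real system this write would be durable and ordered.
        self.send(record)

received = []                                   # stand-in for the secondary site
journal = RemoteJournal(send=received.append)   # stand-in transport
journal.commit({"op": "debit", "account": 42, "amount": 100})
journal.commit({"op": "credit", "account": 7, "amount": 100})
print(len(received), "records replicated to the secondary site")
```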
Mr. Leo has been in Information Systems for 38 years, and an Information Security professional for over 36 years. He has worked internationally as a Systems Analyst/Engineer, and as a Security and Privacy Consultant. His past employers include IBM, St. Luke’s Episcopal Hospital, Computer Sciences Corporation, and Rockwell International. A NASA contractor for 22 years, from 1998 to 2002 he was Director of Security Engineering and Chief Security Architect for Mission Control at the Johnson Space Center. From 2002 to 2006 Mr. Leo was the Director of Information Systems, and Chief Information Security Officer for the Managed Care Division of the University of Texas Medical Branch in Galveston, Texas.
Upon attaining his CISSP certification in 1997, Mr. Leo joined ISC2 as Chairman of the Curriculum Development Committee, and served in this role until 2004. During this time, he formulated and directed the effort that produced what became, and remains, the standard curriculum used to train CISSP candidates worldwide. As a professional educator, he has trained and certified nearly 8500 CISSP candidates since 1998, and nearly 2500 in HIPAA compliance certification since 2004. Mr. Leo is an ISC2 Certified Instructor.