[ISN] Should you maintain three data centers for disaster recovery?

From: InfoSec News (alerts@private)
Date: Thu Aug 09 2007 - 23:32:11 PDT


http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9029881

By Bert Latamore
August 09, 2007 
Computerworld

Pushed in part by U.S. business regulations concerning data 
preservation, financial and other high-end organizations are moving to a 
three data center architecture for disaster recovery, says Wikibon.org 
community member and data center consultant Josh Krischer.

In this architecture, two nearby data centers are linked synchronously, 
and a third, located farther away, is linked asynchronously. However, he 
warned, some data is always lost in a disaster, even when the remote 
copy is done via a synchronous link. Keeping data losses to a minimum is 
critical for some applications, but a more important issue is assuring 
data consistency and integrity at the recovery site. Inconsistent data 
at the recovery site usually requires time-consuming recovery processes, 
which may take days.
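
The difference between the two kinds of link can be illustrated with a 
toy model. This is only a sketch under stated assumptions: the class and 
method names (ThreeSiteStore, write, ship, remote_lag) are hypothetical 
and do not describe any vendor's replication product. It shows why the 
asynchronously linked site can trail the two synchronously linked ones.

```python
from collections import deque


class ThreeSiteStore:
    """Toy model: primary, nearby synchronous copy, remote asynchronous copy."""

    def __init__(self):
        self.primary = []
        self.sync_replica = []
        self.async_replica = []
        self._pending = deque()  # writes not yet shipped to the remote site

    def write(self, record):
        # Synchronous link: the write counts as done only once both
        # nearby copies hold it.
        self.primary.append(record)
        self.sync_replica.append(record)
        # Asynchronous link: the remote copy is updated later, in order.
        self._pending.append(record)

    def ship(self, max_records):
        """Send up to max_records queued writes to the remote site."""
        for _ in range(min(max_records, len(self._pending))):
            self.async_replica.append(self._pending.popleft())

    def remote_lag(self):
        """Writes the remote site would be missing if disaster struck now."""
        return len(self._pending)


store = ThreeSiteStore()
for i in range(5):
    store.write(f"txn-{i}")
store.ship(3)
print(store.remote_lag())   # 2
print(store.async_replica)  # ['txn-0', 'txn-1', 'txn-2']
```

Because queued writes are shipped in order, the remote copy in this 
model is behind but still a consistent prefix of the primary, which is 
the property the article says matters most at the recovery site.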

Speaking at Wikibon.org's weekly Peer Incite teleconference, which is 
open to all interested parties, Wikibon.org co-founder David Floyer 
related his experience consulting with one such company that was 
considering implementing very high-speed continuous asynchronous data 
transfer from its U.S. to its European data centers to guard against a 
potential major loss. "The company had two data centers, 15 miles apart, 
synchronously connected so transactional data is written to both 
simultaneously," he says. "If one goes down, it can recover from the 
other, theoretically with very little loss of data."

The proximity of the two centers, determined in part by the distance 
over which a synchronous link can be maintained, also avoided one of the 
common errors in disaster recovery planning: putting the recovery site 
too far from the main data center. "Putting them far apart may make you 
feel safer," says Floyer, "but it actually makes recovery harder and 
more expensive and may therefore decrease the plan's effectiveness."

However, he says, this company was concerned about the possibility of a 
regionwide disaster that might bring down both data centers. The 
organization was sending a 2TB incremental backup to its European data 
center twice daily, but in a regional disaster that could result in the 
loss of up to 20 hours of transactions. It wanted to invest in an 
advanced network-based system to create an asynchronous link between the 
U.S. and European data centers to reduce the maximum potential loss to a 
few minutes. The implementation and operational costs for this upgrade 
were estimated at about $25 million over three years.
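
The article does not say how a twice-daily backup leads to a 20-hour 
worst case. One plausible reading, which is purely an assumption here, 
is that the exposure equals the interval between backups plus the time 
it takes for a backup to become usable at the remote site. The function 
name and the 8-hour delay below are hypothetical, chosen only so the 
total matches the figure quoted.

```python
def worst_case_loss_hours(backup_interval_hours, shipping_delay_hours):
    """Oldest unprotected transaction age just before a backup lands remotely."""
    return backup_interval_hours + shipping_delay_hours


interval = 24 / 2       # twice daily, per the article
assumed_delay = 8       # ASSUMPTION: not stated in the article
print(worst_case_loss_hours(interval, assumed_delay))  # 20.0
```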


The business leads

This might seem to be an extravagant solution to the problem, and Floyer 
emphasizes that this isn't the answer for everyone. "I worked with a 
retailer, for instance, who decided that a local backup site was 
sufficient for their DR needs. If a regional disaster took out both data 
centers and distribution centers, they expected their business would not 
survive in any case."

IT organizations (ITO) can't make the basic decisions on disaster 
recovery strategy, Floyer says. That strategy must rest on business decisions 
concerning how much data and time a company can afford to lose, how much 
that loss will cost the organization and how best to mitigate that loss 
(e.g., insurance, different technical solutions or accepting the problem 
as a business risk). Only senior business executives and, in some cases, 
the board of directors can make those decisions. So rather than going it 
alone, IT needs to push the business to examine disaster recovery in 
light of its financial and legal compliance situation.

"ITOs in organizations that talk about disaster recovery but fail to 
develop a business-led plan should not be seduced by the opportunity to 
buy more technology or experiment with new products," added Peter 
Burris, Wikibon.org's co-founder and chief content officer. "Instead, 
they must act as aggressively as possible to force the business to lead 
the process."


Triangulating cost

The first responsibility of the business, Krischer says, is to develop a 
business impact analysis to estimate the recovery cost of data lost and 
damage caused by each minute the business is interrupted in a disaster. 
This is based on the amount of business that will be lost, as well as 
other business damages such as reputation loss. This is 
clearly a business rather than an IT calculation, and it's often 
difficult to develop. One of the most common errors in disaster recovery 
planning is misestimating the potential cost of a business outage to the 
corporation.

Business impact can be hard to assess and has multiple aspects. Instead 
of relying on just one estimate -- for example, an internal computation 
of the cost per minute of a business interruption times the maximum 
number of minutes before systems can be restored -- businesses often 
seek multiple estimates from different experts who approach the issue 
from differing perspectives, Burris says.

Some companies, for example, will ask their investment banker for an 
estimate of the impact of a business interruption on the organization's 
capitalization. Another alternative, Burris says, is to have a company 
that specializes in investigating business disasters create an estimate 
of potential loss. Such a company can usually also provide a good 
estimate of the probability of the disaster occurring.

In the case of Floyer's client, the disaster planning team calculated 
the average dollar value of a transaction and the average number of 
transactions per minute to arrive at a basic potential loss per minute 
of lost data.
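
The per-minute calculation described above is a simple product. The 
sketch below uses made-up inputs ($500 per transaction, 200 transactions 
per minute); the article does not give the client's actual figures, and 
the function names are hypothetical.

```python
def loss_per_minute(avg_transaction_value, transactions_per_minute):
    """Basic potential loss for each minute of lost data."""
    return avg_transaction_value * transactions_per_minute


def loss_for_window(avg_transaction_value, transactions_per_minute, minutes_lost):
    """Potential loss if minutes_lost minutes of transactions are unrecoverable."""
    return loss_per_minute(avg_transaction_value, transactions_per_minute) * minutes_lost


# HYPOTHETICAL inputs, for illustration only.
print(loss_per_minute(500.0, 200))            # 100000.0
print(loss_for_window(500.0, 200, 20 * 60))   # 120000000.0
```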

They also needed to calculate the probability of a regional disaster 
that would take both the local data centers down. Probabilities of 
various disasters are usually based on historical information -- how 
often these events have happened in the past -- and often are publicly 
available.

Based on these calculations and the average amount of data that would be 
lost under its existing twice-daily backup schedule, they estimated that the 
company could expect one regional disaster taking down both data centers 
every decade, for a staggering loss of $2.5 billion a decade, or $250 
million per year. The best they could do by improving their disaster 
recovery processes would reduce this to about $1 billion, or $100 
million per year.
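
The annual figures follow from spreading the loss of one event over the 
expected interval between events. A minimal sketch, using only the 
numbers quoted above (the function name is hypothetical):

```python
def annualized_loss(loss_per_event, years_between_events):
    """Expected loss per year for an event expected once per interval."""
    return loss_per_event / years_between_events


# One regional disaster per decade, per the team's estimate.
print(annualized_loss(2_500_000_000, 10))  # 250000000.0
print(annualized_loss(1_000_000_000, 10))  # 100000000.0
```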

The team then approached an insurer and found that the annual premium to 
insure against a regional disaster would be at least $100 million. The 
team then looked at the annual interest the firm would lose if it 
self-insured by posting a reserve as required by the International 
Convergence of Capital Measurement and Capital Standards (Basel II). The 
annual loss of income from the reserve was well over $100 million.

Given these alternatives, the three data center solution was the obvious 
choice. The payback period was seven months, with a net present value 
over three years of over $150 million.
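
The article does not give the cash flows or discount rate behind the 
payback and net present value figures, so they cannot be reproduced 
here. The sketch below only shows how such figures are typically 
computed. All inputs are hypothetical (an upfront cost of 10, a benefit 
of 6 per year for three years, a 10% discount rate) and are not the 
client's numbers.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is now, cash_flows[t] at end of year t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def payback_months(upfront_cost, monthly_benefit):
    """Months of benefit needed to recover the upfront cost (undiscounted)."""
    return upfront_cost / monthly_benefit


# HYPOTHETICAL inputs, in arbitrary currency units.
flows = [-10, 6, 6, 6]
print(round(npv(0.10, flows), 2))
print(payback_months(10.0, 6 / 12))  # 20.0
```

A shorter payback period and a larger net present value both favor an 
option; the team's comparison presumably set these against the annual 
cost of insuring or self-insuring.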


Invitations to the table

Burris and Floyer suggested that at least four and possibly seven groups 
need to be represented at disaster recovery planning sessions:

   1. CXO-level corporate management and possibly corporate directors 
      who must make the final strategic and financial decisions.
   2. The head of each line of business the disaster recovery solution 
      will serve.
   3. Facilities or operations management, which must provide an 
      assessment of relevant external factors that increase risk, such 
      as the proximity of the data center to earthquake fault lines, 
      chemical or nuclear power plants and so on.
   4. IT, which must quantify the potential risks and present the 
      technical disaster recovery options for mitigating that risk.
   5. Corporate auditors to ensure that auditing procedures are included 
      in the recovery plan.
   6. The corporate compliance officer or legal counsel to discuss 
      regulatory and other potential legal exposure, depending on the 
      nature of the organization's business.
   7. Outside consultants to aid the planning process and ensure that 
      nothing important is missed, particularly if the organization 
      lacks depth of internal experience in disaster recovery planning.


Keep it simple

"In a disaster, nothing will work as planned," says Krischer. "So you 
have to improvise." To allow that, companies need to keep their plans as 
simple and flexible as possible. One of his clients focused much of its 
planning effort on ensuring that key business executives would be 
reachable in emergencies to make the business decisions on what to do. 
Discussion focused on what constituted adequate emergency communications and 
whether, for example, the disaster recovery budget should include 
satellite phones for those executives, and whether they would keep those 
phones charged and constantly with them if it did.

Also, he says, "Users will accept lower service levels in a disaster," 
so IT doesn't have to recover all systems immediately to normal service 
levels.


Practice, practice, practice

Floyer's IT client had a second item on its agenda. IT was testing its 
disaster recovery plan twice a year, but the CIO had less than complete 
faith that it would work in a real event. "They were testing an ideal 
scenario with historical data, and when real disasters happen, a lot of 
other things go wrong," Floyer says. "The overall testing strategy is 
one of the most important things that you have to get right." The 
literature is replete with stories of disaster plan failures. "They 
wanted to move operations from one center to another regularly, to make 
what is essentially a disaster recovery from center A to B or C part of 
the normal way activities were scheduled." That requires an expenditure 
of time and money but is the best way to reduce the risk that the 
company will suffer major complications in a real disaster.


Budget and time

Finally, Burris says, "Business management must commit to supporting the 
plan, not just talking about it. The level of that commitment is 
expressed in how close the level of funding they authorize approaches 
the ideal funding level and in their willingness to commit their own 
time to planning, testing and other activities that will prepare the 
organization for the eventual disaster."

Without that level of commitment, he says, IT can't hope to develop an 
adequate disaster response.

-=-

Bert Latamore is a journalist with 10 years' experience in daily 
newspapers and 25 in the computer industry. He has written for several 
computer industry and consumer publications. He lives in Linden, Va., 
with his wife, two parrots and a cat.




