
18 March 2011

Conclusions from Betfair's Outage



Niall Wass and Tony McAlister of betfair recently published a summary of betfair's 6 hour outage on 12 March 2011.  What follows is a review of their analysis.

Most of betfair's customers will have no idea what Niall and Tony are talking about.  "This [policy] should give maximum stability throughout a busy week that includes the Cheltenham Festival, cricket World Cup and Champions League football" is about the only non-tech part of the article that their customers can relate to.  However, for us technologists, the post provides some tasty detail for learning from others' mistakes.

The post is consistent with a growing and positive trend of tech oriented companies disclosing at least some technical detail of what happens to cause failures and what is to be done about it in the future.  Some benefits from this approach:
1. Apologize to your customers if you mess them about - always a good thing to do (Easyjet or Ryan Air - I hope you're reading this).  Even better is to offer your customers a treat - unfortunately betfair only alluded to one and didn't make a concrete commitment.
2. Give public-market analysts some confidence that this publicly traded company isn't about to capsize under technical problems
3. Receive broad review and possibly feedback about the failure.  Give specialist suppliers a chance to pitch to help out in potentially new and creative ways.
4. As a way to drive internal strategy and funding processes in a direction they otherwise might not be moving.

Level of change tends to be inversely proportional to stability.  "In a normal week we make at least 15 changes to the Betfair website…".   This is a powerful lesson that many non-tech people do not understand - the more change you shove into a system, the more you tend to decrease its stability.  This statement also tips us off that betfair has not adopted the more progressive devops and continuous delivery practices for pushing change into production more safely.

The change control thinking continues with "… but we have resolved not to release any new products or features for the next seven days".  This is absolutely the right thing to do when you're having stability issues.  Shut down the change pipeline immediately to anything other than highly targeted stability improvements.  Present the freeze on new features as a "benefit" to the customer (improved stability) and send a hard message to noisy internal product managers: take a deep breath and come back next week to push your agenda.

Although betfair might not be up on their devops and continuous delivery, they have followed the recent Internet services trend of being able to selectively shut down aspects of their service to preserve other aspects:
- "we determined that we needed our website 'available' but with betting disallowed"
- "in an attempt to quickly shed load, we triggered a process to disable some of the computationally intensive features on the site"
- "several operational protections in place to limit these types of changes during peak load"

Selective service shutdown is positive; it hints that:
1. The architecture is at least somewhat component based and loosely coupled.
2. There is a strategy to prioritize and switch off services under system duress
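The priority-and-switch-off strategy above can be sketched as a small feature registry with a load-shedding call. This is a minimal illustration assuming a component-based design with per-feature switches; all names, features, and priority tiers here are my own invention, not betfair's actual system.

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0   # e.g. account balance display - keep the site "available"
    HIGH = 1       # e.g. betting itself
    LOW = 2        # e.g. computationally intensive stats widgets

class FeatureRegistry:
    """Toy registry of switchable site features, ordered by priority."""
    def __init__(self):
        self._features = {}  # name -> [priority, enabled]

    def register(self, name, priority):
        self._features[name] = [priority, True]

    def is_enabled(self, name):
        return self._features[name][1]

    def shed_load(self, keep_at_or_above):
        # Disable every feature in a lower priority tier (higher number).
        for entry in self._features.values():
            entry[1] = entry[0] <= keep_at_or_above

registry = FeatureRegistry()
registry.register("account", Priority.CRITICAL)
registry.register("betting", Priority.HIGH)
registry.register("live-stats", Priority.LOW)

# Under duress: shed the expensive features first, keep betting and accounts.
registry.shed_load(keep_at_or_above=Priority.HIGH)
assert registry.is_enabled("betting")
assert not registry.is_enabled("live-stats")
```

The point of the sketch is the separation: features declare a priority up front, so under duress the decision is a single call rather than an ad-hoc scramble.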

The assertion that betfair spent several hours verifying stability before opening the site to the public suggests bravery under fire.  "We recovered the site internally around 18:00 and re-enabled betting as of 20:00 once we were certain it was stable".  There must have been intense business pressure to resume earning money once it appeared the problem was solved.  However, during a major event, you want to be sure you're back to a stable state before you reopen your services.  A system can be in a delicate state when it is first reopened to public load levels (e.g., page, code, and data reload burden), which is one reason we still like to perform system maintenance during low-use hours: the opening-doors customer slam when the website/service reopens is at least minimized.

The crux of the issue appears to be around content management, particularly web page publication.  Publishing content is tricky as there are two conditions that should be thoughtfully considered:
- Content being served while it is changing, which results in "occasional broken pages caused by serving content", and here-and-gone content where content has been pushed to one server but not another
- Inconsistency between related pieces of content (e.g., a promotional link on one page pointing to a new promotion page that hasn't been published yet)

It appears that betfair's content management system (CMS) is neither async nor real time: "Every 15 minutes, an automated process was publishing…".  Any system designed with hard time dependencies is a timebomb waiting to go off, with the trigger being increasing load: "Yesterday we hit a tipping point as the web servers reached a point where it was taking longer than 15 minutes to complete their update".  A lack of thread-safe design is another indicator of a lack of async design, which tends to enforce thread safety: "servers weren't thread-safe on certain types of content changes".  A batch, rather than on-demand, approach is also symptomatic of the same design problem: "Unfortunately, the way this was done triggered a complete recompile of every page on our site, for every user, in every locale".  So this is likely not an async on-demand pull model but rather a batch publish model.
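To make the batch-vs-pull distinction concrete, here is a hypothetical sketch of the on-demand pull model: publishing only bumps a version number, and a page is recompiled lazily on the first request that sees stale content, instead of every page recompiling for every user and locale on a 15-minute timer. This is my own illustration, not betfair's design.

```python
class PageCache:
    """Lazy, on-demand page compilation keyed by (page, locale)."""
    def __init__(self, compile_fn):
        self._compile = compile_fn      # the expensive render step
        self._cache = {}                # (page, locale) -> (version, html)
        self._versions = {}             # page -> current content version

    def publish(self, page):
        # Publishing is a cheap version bump - no mass recompile, no timer.
        self._versions[page] = self._versions.get(page, 0) + 1

    def get(self, page, locale):
        version = self._versions.get(page, 0)
        key = (page, locale)
        cached = self._cache.get(key)
        if cached is None or cached[0] != version:
            cached = (version, self._compile(page, locale))
            self._cache[key] = cached
        return cached[1]

compiles = []  # track how many expensive compiles actually happen
cache = PageCache(lambda p, l: compiles.append((p, l)) or f"<html>{p}/{l}</html>")
cache.publish("home")
cache.get("home", "en")        # compiled on first demand
cache.get("home", "en")        # served from cache - no recompile
assert len(compiles) == 1
cache.publish("home")          # a content change invalidates lazily
cache.get("home", "en")
assert len(compiles) == 2
```

The load profile follows demand rather than a wall-clock schedule, so there is no 15-minute cliff to fall off as the site grows.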

The post concludes with a statement of what has been done to make sure the problem doesn't happen again:
1. "We've disabled the original automated job and rebuilt it to update content safely" - given the above design issues, thread safety may have been addressed, but until they address the fundamental synchronous design, I'd guess there will be other issues with it in the future.
2. "We've tripled the capacity of our web server farm to spread our load even more thinly" - hey, if you've got the money in the bank to do this, excellent.  However, it probably points to an underlying lack of capacity planning capability.  And of course, every one of those web servers depends on other services (app server, caches, databases, network, storage, …) - what have you done to those services by tripling demand on them?  Lots of spare capacity is great to have, but it can hide engineering problems.
3. "We've fixed our process for disabling features so that we won't make things worse."
4. "We've updated our operational processes and introduced a whole new raft of monitoring to spot this type of issue." - tuning monitoring, alerting, and trending system(s) after an event like this is crucial
5. "We've also isolated the underlying web server issue so that we can change our content at will without triggering the switch to single-threading"

And here are my lessons reminded and learned from the post:
- If you're having a serious problem, stop all changes that don't have to do with fixing the problem
- Selective (de)activation of loosely coupled and component services is a vital feature and design approach
- Make sure the systems are stable and strong after an event before you open the public floodgates
- Synchronous and timer based design approaches are intrinsically dangerous, especially if you're growing quickly
- Capacity planning is important, best done regularly, incrementally and organically (like most things), not in huge bangs.  One huge bang now can cause others in the future.
- Having lots of spare capacity allows you to avoid problems… for a while.  Spare capacity doesn't fix architectural issues, it just delays their appearance.
- Technology is hard and technology at scale is really hard!

Niall and Tony, thanks for giving us the opportunity to learn from what happened at betfair.

05 March 2011

The Trial Environment - Innovation Infrastructure with an Enterprise wrapper

Introduction

A "trial" environment is a high risk production environment that sits within a low risk Enterprise environment.

Circumstances that might drive you to set up a trial environment:
  • Business has revenues derived from enterprise production systems it wants to protect through risk management, but...
  • Business wants to move fast and be innovative, and...
  • Business wants to work with third parties, some of which are "two-guys-and-a-dog" start-ups who can't afford to focus on making their systems enterprise friendly.
The trial environment helps a technical team to balance these potentially conflicting requirements and deliver both risk-managed and risk-embracing services into the business.

(NB: Like most articles on this blog, the trial environment was conceived within a context of internet delivery systems and online gambling.  Please keep that context in mind as you read.)

Rationale

But why would a production IT service want to let lunatic startup companies and me-me-me product managers into a carefully risk-managed production environment?

One reason is to foster innovation.  New innovative products tend to focus on the core features and not the "-ilities" such as scalability, stability, and security.  The new product team shouldn't be spending time specifying and justifying a large enterprise environment when there are crucial features to be coded.  If we can provide infrastructure and costs that play by the same rules as cheeky little start-ups, we can limit their ability to end-run us.

Another reason is that from a business case perspective there is no reason to spend on big expensive kit to meet fanciful revenue forecasts.  It's much better to trend off of real data and provide an environment that can scale to a medium level quickly.

But mostly it's just nice to be able to say "yes" when the panicked bizdev guy comes over to you in desperation needing to close a deal tomorrow, as opposed to "that will take 6 sign-offs, 3 months to order the kit, and will cost £400,000."  Operational process and cost intensity should match revenue upsides and product complexity as closely as possible.

The Business Owner's Point of View

How the trial environment is presented to the business, perhaps product managers, who have to pay for it:
  • Suitable for working with small, entrepreneurial, and/or external companies/teams
  • You can move quickly with it
  • Fewer sign-offs, less paperwork
  • Cheap (after a base environment is set up)
  • Enables you to focus on initial product bring-up and delivery and not overspend on an unproven product.
  • It's billed back to your project as you use it, so no big up-front costs; if your project stops, we aren't left with dead kit
  • Suitable for lower concurrent users and transaction volumes
  • Good for proof of concept projects - if project not signed off, no big capital investment
  • More risky (less stable, scalable, secure) than our enterprise environment
  • Your first point of contact if there are technical problems is the small entrepreneurial company you're working with and not IT support
  • Not particularly secure
  • Not PCI/DSS friendly (so don't store related data or encode related processes in trial)
  • Only small to medium sized products can use trial - we only have so much capacity standing by
  • If there is a failure in the trial environment, it will generally be the responsibility of the third party to fix it.  We won't know much about it.  We'll only take care of power, connectivity and hardware.
  • At a practical level, a failure in the trial environment might mean several days of downtime
  • If your revenue goes up for a product running from trial, we recommend it's moved from trial to enterprise.  That call is yours, so you can manage your revenue risks.
  • A new product that is failing will still accrue operational costs.  Pull the plug if you need to - with trial, shutting down a member environment is trivial.
What happens if things take off for a product in the trial environment?  It's up to the small team or company the product manager is working with to identify this and initiate a project to "enterprise" their product.

From the Entrepreneurial Point of View

How the trial environment is offered as a production option to small, entrepreneurial, and/or external companies or teams:
  • We'll give you an infrastructure that you're comfortable working with that doesn't have the usual enterprise computing overheads
  • We'll take responsibility for deploying and fixing the hardware, power, and connectivity - everything else is yours.
  • Quickly receive the 1 or more servers you need to get your product going - no paperwork and no waiting around for kit to show up
  • We only have a few types of servers on offer - likely a "small" one for web/app servers and a "big" one for a database server.  We'll recommend some options if you're not sure what you need.  The servers are not redundant, fault-tolerant kit.  If you want that in trial, you'll need to build it into your application.
  • Tell us what OS you want. We have 3 standard OSs (Linux, Solaris, and maybe Windows), and if you want something else it's going to be more difficult for everyone.
  • Tell us how much storage you need.  You'll get a little bit local on the server and a flexible capacity will be mounted on your server.  The flexible capacity can grow over time without any retooling or paperwork.
  • Tell us how much network capacity you'll need.  We'll QoS at that level.  Maybe no bursting allowed.
  • Your servers will be on their own subnet - just one flat LAN for everything.  No DMZ, no multi-layer firewalls.
  • You get a firewall in front of you; tell us what inbound and outbound ports you want open for each server you request.  80, 443, and 22 are easy for us; everything else will make us raise an eyebrow.
  • Beyond simple firewalling, you manage your own security, e.g.,  locking down ports/services and OS patching
  • No content switch.  They're expensive, and I'm sure you're clever enough to figure that out with Apache.
  • Put your own monitoring in place, we're not going to watch it for you.  If you need to go from a "small" to "large" server or need more servers, you'll need to let us know.
  • Put your own backups in place.  Specify some flexible storage for them on one of the servers.  We won't be backing up anything.
  • All change control sits with you.  We have no oversight.
  • No remote hands provision is expected to be required.
  • If you're doing anything that affects production or other members of trial, your servers will be powered down immediately.
The Production Operations Team Point of View

How the trial environment is managed by the production operations team:
  • Beneath the edge network, the trial environment is on hardware fully separate and distinct from production.
  • Production operations owns and is responsible for the hardware, network, and power - both initial and on-going.  We provision a base OS and hand over the keys to the product team.  That's it.
  • A fairly generous SLA on responding to HW, network, and power failures reported to production support.
  • The trial environment is ideally implemented with some type of in-house cloud service and/or VMware.  If that's not possible, you'll have to manage a by-box inventory so that you always have a few unused boxes of each type ready to commission.  Keep a stand-by inventory ready to go; effective maintenance of slack, with procurement to backfill, is essential.
  • Create two server types, small and large.  Decide on cores, memory, disk space for each.  You will need to change this view over time, so re-evaluate it every 6-12 months.
  • Establish maybe 3 standard OS installs.  We don't own patching or securing the OS.
  • Use a SAN to enable flexible filesystem provisioning
  • Fixed maximum allocation of internet bandwidth for all members of trial, then fixed allocation to each member.  No trial member should be able to stomp on other trial members or anything in production.  QoS implemented.  Bursting is debatable.
  • Dedicated edge firewall.
  • Network to enable multiple subnets for each different user of trial.  Each user of trial can generally only see only their own network and servers.  Holes/routing between subnets and between enterprise and trial subnets may be conditionally opened for API (and only API; no e.g. DB) access.
  • No content switch, load balancer
  • No backups
  • No hardware RNG
  • Some Single Points of Failure ok
  • We own firmware updates for hardware
  • We don't monitor or alert on any virtual servers.  We do monitor and alert on underlying hardware, including the network kit and SAN.
  • May use a second tier hosting location for trial kit
  • It might be possible to use older kit being decommissioned from production for the trial environment.  While this would likely increase day-to-day operational costs (heterogeneous and older kit), it would bring down the initial capital investment in trial.  Also consider used/refurb kit.  Think cheap.
  • Keep a basic overview of trial and its services updated on the intranet.  Make sure all product managers and bizdev types are educated about it.
  • Periodically review trial usage with each business owner.
As the production team evolves the trial environment offer, it's likely that some of the "we don't do this in trial" items above will change as cost effective and lightweight ways are found to deliver them into trial.  Possible examples are backups, more sophisticated networking (load balancer), a provision of fault tolerant disk on server instances, or a shared (between trial members) database instance.
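The stand-by inventory rule from the list above (always a few unused boxes of each type, with procurement backfilling slack) can be sketched as a simple bookkeeping routine. The server types and target slack numbers here are assumptions for illustration only.

```python
# Assumed slack targets per server type - tune to your own procurement lead times.
TARGET_SLACK = {"small": 3, "large": 1}

class TrialInventory:
    """Track free stand-by boxes and raise procurement backfill orders."""
    def __init__(self, free):
        self.free = dict(free)   # type -> unused boxes on the floor
        self.on_order = []       # outstanding backfill orders

    def commission(self, server_type):
        if self.free[server_type] == 0:
            raise RuntimeError(f"no stand-by {server_type} boxes left")
        self.free[server_type] -= 1
        # Backfill as soon as floor stock plus pending orders dip below target.
        deficit = TARGET_SLACK[server_type] - (
            self.free[server_type] + self.on_order.count(server_type))
        self.on_order.extend([server_type] * max(0, deficit))

    def receive(self, server_type):
        self.on_order.remove(server_type)
        self.free[server_type] += 1

inv = TrialInventory({"small": 3, "large": 1})
inv.commission("small")            # hand a box to a trial member...
assert inv.free["small"] == 2
assert inv.on_order == ["small"]   # ...and a backfill order is raised at once
inv.receive("small")
assert inv.free["small"] == 3 and inv.on_order == []
```

The useful property is that procurement is triggered at commission time, not discovered at the moment a bizdev deal needs a box tomorrow.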

Other Considerations

Some other things to keep in mind to make the trial environment successful:
  • Aspects of production may be accessed via e.g. an API.  This introduces a point of vulnerability to production.  There are good design practices for hardening APIs that are exposed to trial such as logging, monitoring, authentication, rate limiting, and kill switches to protect what is on the production side of the API.
  • It may be cost efficient to spin up a single instance of an expensive service (e.g., Oracle) that can be shared between multiple trial members.  This introduces a fair amount of complexity to manage the DB itself including QoS, security, and change control.
  • The trial environment won't build and run itself for free.  Technical operation staff are required.  The number of staff should be proportional to level of change and size of the environment.
  • If a third party is involved, they must have an internal business representative championing their product or service - someone who understands the product and will champion it regularly.  A bizdevy, bring-the-external-party-in, hurl-it-over-the-wall-to-IT-ops approach doesn't work.
  • A product or service typically requires other functional contributions as well:  game platform operations, marketing, account management and sales, handling of amended contracts for the new product, on-going product management to improve the product, and website integration and updates.
  • Trial could also be used to spin up staging or pre-production test environments.
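On the API-hardening point above, rate limiting in front of a production API exposed to trial is commonly done with a token bucket, and a kill switch can wrap it so ops can cut a trial member off entirely. This is a minimal sketch under my own assumptions; the class names, rates, and the "rejected" convention are all illustrative.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # sustained calls per second
        self.capacity = capacity    # burst ceiling
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller would answer HTTP 429 or similar

class GuardedApi:
    """Production-side guard: kill switch first, then rate limit, then handler."""
    def __init__(self, bucket):
        self.bucket = bucket
        self.killed = False

    def call(self, handler):
        if self.killed or not self.bucket.allow():
            return "rejected"
        return handler()

t = [0.0]                           # fake clock so the demo is deterministic
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
api = GuardedApi(bucket)
assert [api.call(lambda: "ok") for _ in range(3)] == ["ok", "ok", "rejected"]
t[0] += 1.0                         # one second passes: one token refills
assert api.call(lambda: "ok") == "ok"
api.killed = True                   # ops kill switch trumps everything
assert api.call(lambda: "ok") == "rejected"
```

Logging, monitoring, and authentication would sit alongside this in a real deployment; the bucket and the kill switch are just the parts that protect the production side when a trial member misbehaves.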
Conclusion

The trial environment can be used to provide a low cost alternative for startup, experimental, speculative, and just plain insane product ideas.  It's a hosting option that edgy product managers and bizdevs will like because of the lightweight commitment and speed of delivery.  Entrepreneurial teams and startups will like it because it'll feel like something they're used to and won't slow them down.  The production support team may feel uncomfortable with it initially because trial violates a lot of "best practices" in production.  But in the long run they'll see how it becomes a business enabler that fosters innovation in a cost effective way.

Good luck and let me know if you manage to establish a trial environment in your shop!





03 May 2010

IT Hotsite Best Practices

Introduction

A "hotsite" is a general term for unplanned downtime - a failing site, product, or feature that is having significant impact on revenue generation.  A problem is escalated to hotsite level when significant numbers of (potential) customers are affected and the business's ability to earn money is significantly impaired.  Hotsite handling may or may not be used if the problem is not under the direct control of the team running the affected systems (e.g., a critical feature the systems depend on is provided by a remote supplier, such as a web service being used by a mashup).

Hotsites happen.  Costs increase without bound as you push your system design and management toward 100% uptime.  You can aspire to 100% uptime, but it's foolish to guarantee it (e.g., in an SLA).  Change can also cause service disruptions.  In general, the less change, the less downtime.  However, it's rarely commercially viable to strongly limit change.

This article isn't about reducing planned or unplanned downtime, it's a collection of tips, tricks, and best practices for managing an unplanned downtime after it has been discovered by someone who can do (or start to do) something about it.  I'll also focus in on a new type of downtime, one that the people involved haven't seen before.

General strategy - the management envelope

Early in a major problem, it's important to separate technically solving the problem from managing the problem out into the wider business.  Because an unplanned downtime can be extremely disruptive to a business, keeping people informed about the event is often almost as important as solving it.

Although that may feel like an odd statement, as a business grows there are people throughout the business that are trying to manage risk and mitigate damage caused by the downtime.  Damage control must be managed in parallel with damage elimination.

You want to shelter those who are able to technically solve the problem from those who are hungry for status and slow down the problem solving process by "bugging" critical staff for information.  Technical problem solving tends to require deep concentration that is broken by interruptions.

It is the management envelope's responsibility to:
  • Agree periods of "no interruption" time with the technical staff to work on the problem
  • Shelter the team from people who are asking for updates but are not helping to solve the problem
  • Keep the rest of the business updated on a regular basis
  • Set and manage expectations of concerned parties
  • Recognize if no progress is being made and escalate
  • Make sure the escalation procedure (particularly to senior management) is being followed
  • Make sure that problems (not necessarily root cause related) discovered along the way make it into appropriate backlogs and "to-do" lists
General strategy - the shotgun or pass-the-baton

Throughout the event, you have to strike a balance between consuming every possible resource that *might* have a chance to contribute (the "shotgun") versus completely serializing the problem solving to maximize resource efficiency ("pass-the-baton").

Some technologists, particularly suppliers who might have many customers like yourself, may not consider your downtime as critical as you do.  They will only want to be brought in when the problem has been narrowed down to their area and not "waste" their time on helping to collaboratively solve a problem that isn't "their problem".

There is a valid argument here.  It is ultimately better to engage only the "right" staff to solve a problem so that you minimize impact on other deliverables.  Your judgment about who to engage will improve over time as you learn the capabilities of the people you can call on and the nature of your problems.

However, my general belief is that for a 24x7 service like an Internet gambling site that is losing money every second it is down, calling whoever you think you might need to solve the problem is generally fully justified.  And if you're not sure, err on the shotgun side rather than passing the baton from one person to the next.

General strategy - the information flows and formats

Chat.  We use Skype chat with everyone pulled into a single chat.  Skype's chat is time stamped and allows some large number of participants (25+) in a single chat group.  We spin out side chats and small groups to focus on specific areas as the big group chat can become too "noisy", although it's still useful to log information.  It gives us a version history to help make sure change management doesn't spin out of control.  We paste in output from commands and note events and discoveries.  Everything is time threaded together.

The management envelope or technical lead should maintain a separate summary of the problem (e.g., in a text editor) that evolves as understanding of the problem/solution evolves.  This summary can be easily copy/pasted into chat to bring new chat joiners up to speed, keep the wider problem solving team synchronized, and be used as source material for periodic business communications.

Extract event highlights as you go.  It's a lot easier to extract key points as you go than to trawl through hours of chat dialogue afterwards.

Make sure to copy/paste all chat dialogues into an archive.

Email.  Email is used to keep a wider audience updated about the event so they can better manage outwards to partners and (potential) customers.  Send an email to an internal distribution list at least every hour or whenever a breakthrough is made.  Manage email recipients' expectations - note whether there will be further emails on the event or whether this is the last email of the event.

The emails should always lead off with a non-technical summary/update.  Technical details are fine, but put them at the end of the message.

At a minimum, send out a broad distribution email when:
  • The problem is first identified as a likely systemic and real problem (not just a one-off for a specific customer or a fluke event). Send out whatever you know about the problem at that time to give the business as much notice as possible. Don't delay sending this message while research is conducted or a solution is created.
  • Significant information is discovered or fixes created over the course of the event
  • Any changes are made in production to address the problem that may affect users or customers
  • More than an hour goes by since the last update and nothing has otherwise progressed (anxiety control)
  • At the end of a hotsite event, covering the non-tech details on root cause, solution, and impact (downtime duration, affected systems, customer-facing effects)
Chain related emails together over time.  Each time you send out a broad email update, send it out as a Reply-All to your previous email on the event.  This gives new-comers a connected high-level view of what has happened without having to wade through a number of separate emails.
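The Reply-All chaining tip maps directly onto standard mail threading headers: each update carries `In-Reply-To` and `References` pointing back at the earlier messages, so clients group the whole event into one conversation. A small sketch using Python's standard email library; the subject lines and helper name are made up for illustration.

```python
from email.message import EmailMessage
from email.utils import make_msgid

def hotsite_update(subject, body, previous=None):
    """Build one email in a hotsite chain, threading it onto `previous`."""
    msg = EmailMessage()
    msg["Message-ID"] = make_msgid()
    msg.set_content(body)
    if previous is None:
        msg["Subject"] = subject
    else:
        prev_subject = previous["Subject"]
        msg["Subject"] = prev_subject if prev_subject.startswith("Re: ") \
            else "Re: " + prev_subject
        # Threading headers: this is what "Reply-All to your previous email"
        # sets for you, and what gives newcomers the connected view.
        msg["In-Reply-To"] = previous["Message-ID"]
        refs = previous.get("References", "")
        msg["References"] = (refs + " " + previous["Message-ID"]).strip()
    return msg

first = hotsite_update("Hotsite: payment gateway down", "Summary: ...")
second = hotsite_update(None, "Update: root cause found.", previous=first)
assert second["Subject"] == "Re: Hotsite: payment gateway down"
assert second["In-Reply-To"] == first["Message-ID"]
```

In practice you'd just hit Reply-All in your mail client; the sketch only shows why that works and what to preserve if updates are ever sent by a script.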

Phone.  Agree a management escalation process.  Key stakeholders ("The Boss") may warrant a phone call to update them.  If anyone can't be reached quickly by email and help is needed, they get called.  Keep key phone numbers with you in a format that doesn't require a network/internet connection.  A runbook with supplier support numbers on the share drive with a down network or power failure isn't very useful.

The early stage

Potential hotsite problems typically come from a monitor/alert system or customer services reporting customer problems. Product owners/operators or members of a QA team (those with deep user-level systems knowledge) may be brought in to make a further assessment on the scope and magnitude of the problem to see if hotsite escalation is warranted.

Regardless, at some point the first line of IT support is contacted.  These people tend to be more junior and make the best call they can on whether the problem is a Big Deal or not.  This is a triage process, and it is critical in determining how much impact the problem will have on a group of people.  Sometimes a manager is engaged to make the call on whether to escalate an issue to hotsite status. Escalating a problem to this level is expensive as it engages a lot of resources around the business and takes away from on-going work. Therefore, a fair amount of certainty that an issue is critical should be reached before the problem is escalated to hotsite level.  The first line gets better at this escalation call with practice and retrospective review of how events were handled.

Once the event is determined to be a hotsite, a hotsite "management envelope" is identified.  The first line IT support may very well hand all problem management and communications off to the management envelope while the support person joins the technology team trying to solve the problem.

All relevant communications now shift to the management envelope.  The envelope is responsible for all non-technical decisions that are made.  Depending on their skills, they may also pick up responsibility for making technical decisions as well (e.g., approving a change proposal that will/should fix the problem). The envelope may change over time, and who the current owner and decision maker is should be kept clear with all parties involved.

The technical leader working to solve the problem may shift over time as possible technical causes and proposed solutions are investigated.  Depending on the size and complexity of the problem, the technical leader and management envelope will likely be two different people.

Holding pages.  Most companies have a way to at least put up "maintenance" pages ("sorry server") to hide failing services/pages/sites.  Sometimes these blanket holding pages can be activated by your upstream ISP - ideal if the edge of your network or web server layer is down.  Even better is being able to "turn off" functional areas of your site/service (e.g., specific games, specific payment gateways) in a graceful way such that the overall system can be kept available to customers while only the affected parts of the site/service are hidden behind the holding pages.

Holding pages are a good way to give yourself "breathing room" to work on a problem without exposing the customer to HTTP 404 errors or (intermittently) failing pages/services.
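The graceful "turn off functional areas" idea above amounts to a routing layer that intercepts requests to disabled areas and serves the holding page while everything else stays live. A toy sketch in that spirit; the routes, the string-returning "app", and the 503 convention are illustrative assumptions, not a real WSGI stack.

```python
HOLDING_PAGE = "<html>Sorry, this area is down for maintenance.</html>"

class HoldingPageMiddleware:
    """Serve a sorry-server page for disabled areas; pass everything else through."""
    def __init__(self, app, disabled_prefixes):
        self.app = app
        self.disabled = set(disabled_prefixes)   # e.g. {"/games/poker"}

    def __call__(self, path):
        for prefix in self.disabled:
            if path.startswith(prefix):
                return 503, HOLDING_PAGE          # graceful, not a 404
        return self.app(path)

# The "app" here is a stand-in for the real site.
site = HoldingPageMiddleware(lambda path: (200, f"<html>{path}</html>"),
                             disabled_prefixes={"/games/poker"})
assert site("/sports")[0] == 200                  # unaffected areas stay live
assert site("/games/poker/table/9") == (503, HOLDING_PAGE)
```

The 503 status matters: it tells crawlers and monitors the outage is temporary, while the customer sees something friendlier than an intermittently failing page.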

Towards a solution

Don't get caught up in what systemic improvements you need to make in the future.  When the hotsite is happening, focus on bringing production back online and just note/table the "what we need to do in the future" items on the side.  Do not dwell on these underlying issues, and definitely no recriminations.  Focus on solving the problem.

Be very careful of losing version/configuration control.  Any in-flight changes to stop/start services or anything created at a filesystem level (e.g., log extract) should be captured in the chat.  Changes of state and configuration should be approved in the chat by the hotsite owner (either the hotsite tech lead or the management envelope).  Generally agree within the team where in-flight artifacts can be created (e.g., /tmp) and naming conventions (e.g., name-date directory under /tmp as a scratchpad for an engineer).

All service changes up/down and all config file changes or deployment of new files/codes should be debated, then documented, communicated, reviewed, tested and/or agreed before execution.

Solving the problem

At some point there will be an "ah-ha" moment where a problem is found or a "things are looking good now" observation - you've got a workable solution and there is light at the end of the tunnel.

Maintaining production systems configuration control is critical during a hotsite. It can be tempting to whack changes into production to "quickly" solve a problem without fully understanding the impact of the change or testing it in staging.  Don't do it.  Losing control of configuration in a complex 24x7 environment is the surest way to lead to full and potentially unrecoverable system failure.

While it may seem painful at the time, quickly document the change and communicate it in the chat or by email to the parties who can intelligently contribute to it, or at least review it.  This peer review is critical in preventing a problem from getting worse, especially late at night when people are problem-solving on little or no sleep.

Ideally you'll be able to test the change in a staging environment prior to live application.  You may want to bring in your QA team to health-check the area around the change on staging before it goes live.

Regardless, you're then ready to apply the change to production.  It's appropriate to have the management envelope sign off on the fix - certainly someone other than the person who discovered and/or created the fix must weigh the overall risk.

You might decide to briefly hold off on the fix in order to gather more information and pin down the root cause.  A restart will often "solve" the problem in the immediate term, even though the server may fail again in a few days.  For recurring problems, the time you spend working behind the scenes on a more systemic long-term fix should increase with each failure.

In some circumstances (a tired team, over a weekend) it may be better to shut down aspects of the system rather than apply a fix, to avoid the risk of compounding the problem.

Regardless, the steps taken to "solve" the problem, and when to apply them, should be a management decision, taking revenue, risk, and short/long-term thinking into account.

Tidying up the hotsite event

The change documentation should be wrapped up inside your normal change process and put in your common change documentation archive.  It's important you do this before you end the hotsite event in case there are knock-on problems a few hours later.  A potentially new group of people may get involved, and they need to know what you've done and where they can find the changes made.

Some time later

While it may be a day or two later, any time you have an unplanned event, as IT you owe the business a follow-up summary of the problem, effects and solution.

When putting together the root cause analysis, keep asking "Why?" until you bottom out.  The answers may stop being technical and become commercial, and that's ok.  Regardless, don't be like the airlines - "This flight was late departing because the aircraft arrived late."  That answer stops several "whys" short of a root cause.

Sometimes a root cause is never found.  Maybe during the event you eventually just restarted services or systems and everything came back up normally.  You can't find any smoking gun in any of the logs.  You have to make a judgment call on how much you invest in root cause analysis before you let go and close the event.

Other times the solution simply isn't commercially viable.  Your revenues may not warrant a super-resilient architecture or highly expensive consultants to significantly improve your products and services.  Such a cost-benefit review should be in your final summary as well.

At minimum, if you've not solved the problem hopefully you've found a new condition or KPI to monitor/alert on, you've started graphing it, and you're in a better position to react next time it triggers.
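Turning the failure condition into a monitored KPI can be as simple as a threshold check over recent samples. A naive sketch (the metric, sample values, and three-sigma threshold are all illustrative assumptions, not a recommendation for any particular monitoring product):

```python
# Sketch: once a failure condition is identified, graph it and alert on it.
# Flag when the latest sample sits more than `sigmas` standard deviations
# above the mean of recent history.
import statistics

def breached(history, latest, sigmas=3.0):
    """True if `latest` is anomalously high relative to `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return latest > mean + sigmas * stdev

# e.g. open-connection counts sampled over the last five minutes:
recent = [120, 130, 125, 118, 122]
print(breached(recent, 400))  # a sudden spike trips the alert
print(breached(recent, 124))  # normal variation does not
```

In practice you'd wire a check like this into whatever graphing/alerting stack you already run, so that next time the condition triggers you're reacting in minutes rather than rediscovering it mid-outage.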

A few more tips

Often a problem is found that is the direct responsibility of one of your staff.  They messed up.  Under no circumstances should criticism be delivered during the hotsite event.  You have to create an environment where people are freely talking about their mistakes in order to effectively get the problem solved.  Tackle sustained performance problems at a different time.

As more and more systems and owners/suppliers are interconnected, the shotgun approach struggles to scale, as the "noise" in the common chat increases in proportion to the number of people involved.  Although they create more coordination work, side chats are useful for limiting the noise, bringing in just the people you need to work on a sub-problem.

Google Wave looks like a promising way to partition discussions while still maintaining an overall problem collaboration document.  Unfortunately, while it's easy to insist all participants use Skype (many do anyway), it's harder with Wave, which few people have used and for which many don't even have an account or invite available.

Senior leadership should reinforce that anyone (Anyone!  Not just Tech) in the business may be called in to help out with a hotsite event.  This makes the team working on the hotsite fearless about who they're willing to call for help at 3am.

Depending on the nature of your problem, don't hesitate to call your ISP.  This is especially true if you have a product that is sensitive to transient upstream interruptions or changes in the network.  A wave of TCP resets may cause all kinds of seemingly unrelated problems with your application.

Conclusion

Sooner or later your technical operation is going to deal with unplanned downtime.  Data centres aren't immune to natural disasters, and in any case their fault tolerance and failover verification may be no more regular than yours.

When a hotsite event does happen, chances are you're not prepared to deal with it.  By definition, a hotsite is not "business as usual" so you're not very "practiced" in dealing with them.  Although planning and regular failover and backup verification is a very good idea, no amount of planning and dry runs will enable you to deal with all possible events.

When a hotsite kicks off, pull in whoever you might need to solve the problem.  While you may be putting a spanner into tomorrow's delivery plans, it's better to err on the shotgun (versus pass-the-baton) side of resource allocation to reduce downtime and really solve the underlying problems.

And throughout the whole event, remember that talking about the event is almost as important as solving the event, especially for bigger businesses.  The wider team wants to know what's going on and how they can help - make sure they're enabled to do so.

13 December 2009

Supplier Reviews

The following is a sample agenda for a supplier review.

Objectives for Review
  • Agree a plan with supplier to solve key issues
  • All ownership, processes and contact points well understood and up-to-date
  • Understand where the supplier is going, their strategy and roadmap
Areas to Consider
  • What areas does the business want to see improved; what are the pain points with this supplier?
  • Number of new features delivered (e.g., new products and/or services made available, functional improvements, elements of previous supplier review agenda delivered).  Consider both velocity and acceleration of deliveries.
  • General process improvements (e.g., escalation of issues back into supplier)
  • Review major product/service failures and downtime events – root cause?
  • Review product/service roadmap, both current and from last time.  Assess delivery ability against stated roadmap as a way to evaluate currently presented roadmap
  • Evaluate other customers of the supplier - their situations, requests; how are they driving the supplier's strategy and priority choices?
  • Roadmap of technical/internals
  • Upgrade plans
  • Review who owns what and escalation details; general review of support and ability to escalate
  • Review recent and related project retrospectives - any issues to progress with supplier?
  • Can the supply/delivery chain be optimized? Come into session with a summary of the delivery chain, owners and timing - what can be tuned?
  • Supplier costs summary.  Any invoice issues.
  • Alternatives to supplier
Output of Review

If this is a major review, consider setting up a wiki page that pulls together all the information related to the review, e.g., specifications, bug lists, retrospectives, API specs, etc.

Documentation updates, e.g., changes to supplier support contact names and phone numbers.

Action items between now and the next review.