2014-01-08


The week-long outage of Myer’s website starkly demonstrates that the company and its outsourcing partner IBM failed to properly develop and test their infrastructure, or to put in place even the most basic disaster recovery and business continuity plan. It also highlights the incredible immaturity of online retailing in Australia.

If you are involved in IT infrastructure work at all, you will be aware that the field has steadily evolved from something of an art back in the 1990s and early 2000s into something much more akin to a precise, exacting and increasingly standardised science in 2014.

I am sure many of us remember the cowboy days of the ‘Wild West’ when it seemed only the personal skill of a handful of systems administrators, programmers and database experts could keep any major or minor IT system running on a sustainable basis at all — especially websites prone to traffic spikes. Many minor patch upgrades to operating systems or applications would knock services offline; tiny network routing problems unexpectedly became major headaches, and hard disks would fail on a regular basis and need to be physically replaced.

However, it’s safe to say that for most organisations, and for most important applications, those days are long gone.

The onset of virtualisation technologies, especially, has changed the game in terms of application uptime for almost every organisation. Did your application fail after a series of seemingly innocuous setting changes? Boot up an instance you copied five minutes ago, before you made those critical changes which took the live instance down. Has a physical server failed? No problem: shift the instance onto another server. Instance needs a couple of extra CPUs and some more RAM? No worries: reconfigure and reboot. Hell, banks such as ING can even deploy whole new banking IT stacks in a matter of hours these days.
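
To make that concrete, here is a toy Python sketch of the operations involved. The Instance class and its snapshot, rollback, migrate and resize methods are purely illustrative stand-ins of my own, not any vendor’s actual API; platforms like VMware or KVM expose equivalent operations through their own management interfaces.

```python
# A minimal, hypothetical sketch of why virtualisation makes recovery routine.
# It models instances, snapshots and host migration as plain Python objects.
import copy


class Instance:
    def __init__(self, name, cpus=2, ram_gb=4, host="host-a"):
        self.name = name
        self.cpus = cpus
        self.ram_gb = ram_gb
        self.host = host
        self.settings = {}
        self._snapshots = []

    def snapshot(self, label):
        """Capture the full state of the instance before risky changes."""
        self._snapshots.append((label, copy.deepcopy(self.__dict__)))

    def rollback(self, label):
        """Restore the instance to a previously captured state."""
        for snap_label, state in reversed(self._snapshots):
            if snap_label == label:
                saved_snaps = self._snapshots
                self.__dict__.update(copy.deepcopy(state))
                self._snapshots = saved_snaps
                return
        raise KeyError(f"no snapshot named {label!r}")

    def migrate(self, new_host):
        """Physical server failed? Move the instance elsewhere."""
        self.host = new_host

    def resize(self, cpus, ram_gb):
        """Need more grunt? Reconfigure and reboot."""
        self.cpus, self.ram_gb = cpus, ram_gb


web = Instance("web-01")
web.snapshot("before-config-change")
web.settings["max_connections"] = 10   # the "innocuous" change that breaks things
web.rollback("before-config-change")   # back to a known-good state in seconds
web.migrate("host-b")                  # hardware failure handled by moving the VM
web.resize(cpus=4, ram_gb=8)           # traffic spike handled by adding resources
print(web.host, web.cpus, web.ram_gb, web.settings)
```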

In 2014, many disaster recovery and business continuity strategies are built almost wholly on virtualisation. Most major organisations have at least two major datacentres, replicating data between each other continuously. If an earthquake takes out one, the other one goes live. This key concept of disaggregating applications and data from the hardware they run on has solved a billion billion headaches for IT professionals. Many consider VMware’s software manna from Heaven. It should be — it costs enough.
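
The two-datacentre pattern boils down to a health check and a switchover. Here is a minimal sketch, assuming illustrative site names and a stand-in health probe; in real deployments the “go live” step is a DNS or load-balancer change sitting on top of continuous data replication.

```python
# A hypothetical sketch of the two-datacentre failover pattern described above.
# Site names and the health check are illustrative placeholders only.

SITES = {"dc-sydney": True, "dc-melbourne": True}   # True = healthy
ACTIVE = "dc-sydney"


def site_is_healthy(site):
    """Stand-in for a real health probe (HTTP check, replication lag, etc.)."""
    return SITES[site]


def failover_if_needed(active):
    """If the active datacentre is down, promote the surviving one."""
    if site_is_healthy(active):
        return active
    for candidate in SITES:
        if candidate != active and site_is_healthy(candidate):
            print(f"failing over from {active} to {candidate}")
            return candidate
    raise RuntimeError("no healthy datacentre available")


SITES["dc-sydney"] = False            # earthquake takes out the primary
ACTIVE = failover_if_needed(ACTIVE)   # traffic now served from dc-melbourne
print("active site:", ACTIVE)
```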

Versioning has also solved a heap of problems. No significant change should be made to any major piece of code these days without versions being maintained every step of the way. When something goes live and doesn’t behave the way it did in testing, it is normally a cinch to roll back to the previous version and restart.
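
In practice the rollback step is about as simple as the sketch below, which tracks releases in a hypothetical ReleaseManager of my own invention; real shops get the same effect from git tags, container image tags or packaged build artefacts.

```python
# A minimal sketch of versioned deployment with rollback. The version strings
# and the ReleaseManager class are illustrative only.

class ReleaseManager:
    def __init__(self):
        self.history = []          # ordered list of deployed versions

    def deploy(self, version):
        self.history.append(version)
        print(f"deployed {version}")

    def rollback(self):
        """Roll back to the previous known-good version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        broken = self.history.pop()
        previous = self.history[-1]
        print(f"{broken} failed in production; rolled back to {previous}")
        return previous


mgr = ReleaseManager()
mgr.deploy("website-1.4.2")
mgr.deploy("website-1.5.0")   # goes live, doesn't behave as it did in testing
mgr.rollback()                # a cinch: back on 1.4.2 within minutes
```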

Cisco executives were fond of stating during the 2000s that, in their measurements of the uptime of their network routers, it was consistently the human factor, not system errors, which caused almost all of the downtime. The minute a live human makes a change to a live system, so the saying went, up to a percentage point of annual uptime would be lost. I can’t quite remember the precise wording, but you get the picture: machines work better by themselves. If humans get involved, they always eventually screw things up.

All of this systemisation and mechanisation of the IT process has led many IT professionals to become relatively sanguine about the downtime of important systems. When every application or platform now comes with a giant “undo” button attached to roll back any problematic changes, life starts to feel a lot more predictable; the CEO isn’t breathing down your neck every few weeks because important systems are now much more stable. Hell, even Australia’s major banks have drastically cut down their number of Severity 1 incidents, an outcome which I think many of their IT staffers used to believe was an impossible dream.

It’s for precisely all of these reasons that your writer found the week-long outage of Myer’s online retailing site over the Christmas and New Year period so remarkable.

The fact that the website went down at all is hardly surprising. Australians are jumping on online retail with a massive passion right now. The combination of post-Christmas sales with that trend always places strains on such sites, and I’m sure many online retailers in Australia are locked into a cycle of regularly upgrading their web infrastructure to handle growing demand. Every website has its outages, and a few hours here or there, even when there is a great deal of money to be made, isn’t a huge deal.

But the fact that Myer’s website went down for a week is simply stunning.

If you read the dozens of online articles on the subject, there is very little technical detail available about why this happened or how. IBM has confined itself to telling ZDNet that there was a “communication breakdown between Internet servers and a software application” (pre-school language if I ever saw it), while Myer itself has actually praised IBM’s behaviour during the crisis, telling the Sydney Morning Herald that Big Blue allocated a “critical situation team” to finding the “bug” which caused the issue.

Myer chief executive Bernie Brookes made the extraordinary statement that resolving the technical difficulties was not as simple as “running an automatic clean-up process on a personal computer, or turning a device off and on again”.

The problem for any competent IT team is … it damn well should have been.

In 2014, the technology underpinning public-facing websites is very standardised. The relationship between database servers, application servers, webservers, caching systems and the networks and physical servers they run on is very well understood. Over the past decade, the success of a million online retail sites, including very large ones, has meant that keeping such platforms up is very much the bread and butter of IT departments.
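
That bread and butter includes patterns as simple as the cache-aside sketch below, which shows how a cache sits between the web tier and the database so that traffic spikes don’t hammer the slowest component. The product data and cache here are plain dictionaries standing in for a real database and a Redis- or memcached-style cache; nothing about Myer’s actual stack is assumed.

```python
# A hypothetical cache-aside sketch: the web tier checks the cache first and
# only falls through to the database on a miss. Data is purely illustrative.

DATABASE = {"sku-1001": {"name": "Van Heusen shirt", "price": 59.95}}
CACHE = {}


def get_product(sku):
    """Serve from cache if possible; otherwise hit the database and cache the result."""
    if sku in CACHE:
        return CACHE[sku]            # cheap: no database round-trip under load
    product = DATABASE.get(sku)      # expensive path; the one that buckles under spikes
    if product is not None:
        CACHE[sku] = product
    return product


print(get_product("sku-1001"))   # cache miss: read from the database
print(get_product("sku-1001"))   # cache hit: the database never sees this request
```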

Reading between the lines, it appears that one component of Myer’s setup had buckled under the unexpected load of the Boxing Day sales, taking down the rest of the site’s interlinked systems in a graphic demonstration of the concept that a chain is only as strong as its weakest link.
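
There are standard defences against exactly this weakest-link failure mode. One of them is the circuit breaker pattern, sketched below in deliberately simplified form: when a downstream component starts failing, callers fail fast instead of piling more load onto it and dragging the whole chain down. This is purely an illustration of the general technique; nothing is publicly known about how Myer’s components were actually wired together.

```python
# A minimal circuit-breaker sketch. The thresholds and the flaky inventory
# lookup are illustrative placeholders, not anything from Myer's real systems.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the breaker is open, fail fast instead of queueing more load.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream component is unhealthy")
            self.opened_at = None       # half-open: try the call again
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result


def flaky_inventory_lookup(sku):
    raise TimeoutError("inventory service overloaded")


breaker = CircuitBreaker()
for _ in range(5):
    try:
        breaker.call(flaky_inventory_lookup, "sku-1001")
    except Exception as exc:
        print(type(exc).__name__, exc)
```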

But what is puzzling is why this problem kept Myer’s site offline for a week. One would have thought that the retailer would have been using a standardised setup that would have avoided such a headache, with components that had been extensively load-tested in production environments at other major sites. If the ‘bug’ had been recently introduced, Myer should have rolled back its platform to a prior version. If it was a problem of resourcing, Myer should have dialled up the virtual power available to each component. This isn’t rocket science, people: it’s basic IT practice. Bog-standard, boring stuff.
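
Load testing for a Boxing Day-sized spike is equally routine. The toy sketch below hammers a stand-in request handler from increasing numbers of concurrent threads and reports latency; real teams reach for dedicated tools such as JMeter, Gatling or Locust, but the principle of testing at the load you actually expect is the same.

```python
# A toy load-test sketch. The request handler is a stand-in that just sleeps
# for a few milliseconds; only the measurement approach is the point here.
import time
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def handle_request():
    """Stand-in for a page render: sleeps a few milliseconds per request."""
    time.sleep(random.uniform(0.001, 0.01))
    return 200


def run_load_test(concurrency, requests):
    latencies = []

    def one_request(_):
        start = time.perf_counter()
        handle_request()
        latencies.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, range(requests)))
    return statistics.mean(latencies), max(latencies)


for concurrency in (10, 50, 200):
    mean_s, worst_s = run_load_test(concurrency, requests=500)
    print(f"{concurrency:>4} concurrent users: "
          f"mean {mean_s * 1000:.1f} ms, worst {worst_s * 1000:.1f} ms")
```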

In the worst case scenario, you might expect that Myer would have kept its website offline for a maximum of 24 hours. By that stage the Boxing Day load would have dropped away dramatically, additional resources could have been deployed, and the ‘bug’ could have been eliminated. Certainly the retailer had ample resources available to it: the extensive media coverage alone would have guaranteed that IBM would throw some of its top web experts at the issue, as it apparently did.

The fact that the outage was allowed to continue for a week, during Myer’s busiest annual period, indicates that what IBM’s crack team found was much more than a software bug. They must have found a nest of non-standard systems which lacked basic business continuity and disaster recovery functionality. Not just one issue, but many cascading issues. Not just a system which was failing under load, but a system which had never been comprehensively tested for the load that a major retailer like Myer could expect.

And probably, worst of all, a system with many links to legacy Myer infrastructure dating back decades. It only takes five minutes at the checkout counter of a traditional retailer like Myer, looking at its archaic point of sale equipment, to know that such companies don’t like investing in their IT platforms. I should know. A decade ago I used to work as a systems administrator for David Jones.

The other aspect of this situation was how dramatically it highlighted the incredible immaturity of the online retailing scene in Australia.

In the US, online retailing is the absolute norm in many categories. Companies like Amazon have mastered the art of selling products online to such an extent that they now own whole spin-off companies which sell cloud computing platforms to others so that they can build their own mammoth websites.

The offhand comment by Myer chief executive Bernie Brookes during the website outage that Myer made “less than one percent” of its annual sales through its website was extremely revealing. With the company’s annual revenues at about $2.8 billion, that would put Myer’s online sales at no more than around $28 million.

Not only did this comment suggest that Myer was not overly discomforted by the prospect of losing millions due to its post-Christmas outage; it also sent a huge signal to consumers that the retailer was not taking the online delivery mechanism seriously.

In many ways this is sheer business lunacy. Not only is the online retail category in Australia booming (consumer electronics retailer Kogan, for example, sold $1 million of stock in just one day on 31 July 2012), but taking orders online gives retailers such as Myer the ability to drastically cut the costs of their expensive retail outlets in prime locations. It doesn’t cost much to ship a new Van Heusen business shirt to a customer via Australia Post’s new specialised e-parcel delivery platform. I bet it costs a lot more to operate Myer’s menswear departments in-store.

Right now, Myer should be using its premium brand and its reputation for customer service to take the fight to tiny rivals like Kogan. Myer’s website should be a pillar of shining excellence, with a 100 percent uptime guarantee and a dazzling array of products in every category, delivered overnight on a crystal plate with complimentary caviar.

Instead, companies like Kogan are constantly eating Myer’s lunch. I remember regularly browsing books, consumer electronics and clothes in Myer. Now I shop exclusively at the online stores of Amazon, Kogan, Apple, AusPCMarket and The Iconic. It’s quicker, easier and if I don’t like what I receive, I just send it back.

There is absolutely no doubt that online retail is the future. There is also absolutely no doubt that no major Australian corporation should allow its website, let alone one that also acts as a point of sale platform, to stay offline for more than a few hours.

Myer’s actions over the past week speak to deep technical incompetence within the organisation. It’s not clear to what degree its partner IBM is to blame, but I suggest Big Blue should take a close look in the mirror to determine what its part in the debacle was. Beyond that, the scenario also speaks to the sheer ignorance of Myer’s business leadership when it comes to the massively growing online retail sector.

Myer’s corporate jingle, which I can hear echoing in my well-trained grey matter right now thanks to the stellar efforts of the company’s marketing department, is that “Myer is my store”. Well, I can tell the company that there are thousands, perhaps tens or hundreds of thousands, of Australians who probably feel right now that this is no longer true, due to the events of the past week. Myer doesn’t make products, after all; it just sells them. And what digital-savvy consumer wants to buy products through a company whose website is plainly closed for business?

Source: http://delimiter.com.au/2014/01/08/myer-fail-displays-appalling-business-incompetency/
