Retail Decoded: Learning from Disaster

Retail Decoded with JD Trask, CEO and Co-Founder, Raygun.

As internet users, we’ve all experienced the pain and frustration of broken software. Anybody with an internet connection has come across busted links, slow pages, and malfunctioning shopping carts. 

Not only do all users encounter these issues, but every website and app has dealt with errors, mishaps, and even catastrophic failures in their own software. Nobody is immune to these issues. So today, I want to look at a few lessons from some of the most powerful companies in the world on how to handle software failures - or how not to.

Meta, formerly known as Facebook, rolled out Horizon Worlds, an ambitious virtual reality project, in late 2021 as the forerunner of their new metaverse. It was supposed to be the arrival of the next era of digital reality. But less than a year later, the project became a public embarrassment when leaked internal memos revealed major issues: “feedback from our creators, users, playtesters, and many of us on the team is that the aggregate weight of papercuts, stability issues, and bugs is making it too hard for our community to experience the magic of Horizon.” People using Horizon Worlds weren’t having fun, and they weren’t sticking around. Sometimes they couldn’t get it to work at all. The tech giant had tried to do too much, too fast.

The result? Well, it’s now been reported that Meta is putting Horizon Worlds on “quality lockdown” and significantly reducing their user adoption targets. To their credit, Meta heard the feedback from early users (and their own team) that the product just wasn’t up to standard and began efforts to get on top of issues early, so that users could judge the product on its value, not its flaws.

What can we learn from this? While you’re probably not trying to build an entire immersive world like the Meta team, there are still some universal lessons here. The obvious takeaway is to test your software thoroughly and adjust timelines if necessary; it’s not your customers’ job to find your problems. The second is to own mistakes and take decisive action. If Meta had simply ignored their issues and continued to pour advertising funds into convincing more people to try Horizon Worlds, this PR nightmare could’ve been a lot worse. And third, when it comes to software, you’re better off doing the basics really well rather than trying something too elaborate that just doesn’t work. A simple, well-designed storefront is going to serve you far better than a load of flashy animations and slow, distracting pop-ups and plug-ins.

Our second tale comes from Amazon, the mall of the internet. Each year, Amazon prepares for the annual e-commerce blitz of Thanksgiving weekend (also known as Cyber Five). In 2022, they once again handled Black Friday and Cyber Monday almost flawlessly for consumers. Amazon’s team achieved an incredible feat of software engineering, scaling their systems and servers to handle an epic amount of traffic as millions of people surged to their site in search of deals. But for their advertisers, it was another story.

During the biggest sales weekend of the year, when every business was pouring resources into promoting special offers, Amazon’s advertiser reporting platform stopped working. This was a massive blow for companies who were planning to adjust their ad spend according to the performance of their campaigns, and for some, it resulted in blindly overspending on ineffective promotion. This vital service only came back up on the Sunday.

The full results of this outage are yet to be seen. While the consumer market (which is much larger in scale) got Amazon’s usual exacting standards, businesses did not, and many burned through gigantic advertising budgets blindly. This comes at the cost of trust, jeopardizing Amazon’s future ad dollars. All this is in a tougher economic climate where many of the tech giants are fighting for ad spend. Amazon released a fairly uninspiring apology and didn’t offer much in the way of amends.

What does this teach us? During key sales periods, when there’s more pressure on your systems, make sure there’s somebody available to keep an eye on every component of your software. Don’t allocate all your resources and attention to a single area at the expense of another, and don’t scale at the cost of stability. If you’re going to chase multiple markets (e.g. consumer and business), make sure you have the technical resources to serve their separate needs.

Our last lesson is Atlassian. If you’re not in the tech space, you may not be familiar with Atlassian. They’re a pretty big deal, though: 200,000 teams use their products to help build, improve and maintain software. And in April of last year, they experienced an outage that lasted up to 14 days for some of their customers. The cause, it turns out, was a routine maintenance script that was supposed to do a cleanup of archived data for some accounts (sort of like a Roomba for data). Instead, the script was handed the IDs of live customer sites and deleted them outright.
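The failure mode here is worth dwelling on: a destructive script trusted the list of IDs it was given. As a minimal sketch (all names here are hypothetical, not Atlassian’s actual tooling), two cheap safeguards go a long way: make destructive scripts dry-run by default, and refuse to run at all if any ID in the request isn’t verifiably archived:

```python
# Minimal sketch of a safer cleanup script (all names hypothetical).
# Guard 1: dry-run by default, so "just running it" never deletes anything.
# Guard 2: cross-check every requested ID against the known archive list.

def cleanup_archived(ids_to_delete, archived_ids, delete_fn, dry_run=True):
    """Delete only IDs that are verifiably archived; report, don't act, by default."""
    archived = set(archived_ids)
    unexpected = [i for i in ids_to_delete if i not in archived]
    if unexpected:
        # A single live (non-archived) ID aborts the whole job.
        raise ValueError(f"Refusing to delete non-archived IDs: {unexpected}")
    if dry_run:
        return f"DRY RUN: would delete {len(ids_to_delete)} archived records"
    for record_id in ids_to_delete:
        delete_fn(record_id)
    return f"Deleted {len(ids_to_delete)} archived records"
```

The point isn’t this exact code; it’s that a miscommunicated request should fail loudly before anything is destroyed, not after.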

The results for Atlassian were immediate and obvious. Their stock price tanked, and their next earnings report showed a hit to operating costs as they were forced to discount their services to compensate affected accounts. But the real cost is harder to pin down. The discussions that went on across social media, in forums, and in the tech media often centred around the lack of communication from Atlassian during the disastrously long outage. In the first days, Atlassian also publicly announced a bunch of splashy new features while hundreds of users frantically tried to access crucial services that were gone seemingly indefinitely. Not very tactful.

So what can we learn? I can’t help but suspect that much of the damage to Atlassian’s reputation could have been mitigated with better communication, which could have been as simple as more personalised outreach to high-paying accounts (there were multiple reports of stonewalling from even big Atlassian users, with vague mass emails and questions going unanswered). Often, even bad news is much better received when it’s delivered gracefully and proactively. The root cause of the outage was also a communication issue between two teams, one of which misunderstood a request and provided the wrong data for deletion. The ultimate lesson, then, is a really simple one: talk to people. Talk to your team, talk to your customers. Encourage the habit of being unafraid to ask questions and answer them, especially on technical issues. This especially matters when something goes wrong, but good communication can also help prevent problems in the first place.

The moral of the story

Everyone has errors. They’re part of the development process. Software is complex; it’s made and used by human beings, and that means a degree of unpredictability. It’s the visibility of those issues that’s most crucial. It’s like putting off a visit to your doctor or your mechanic: ignoring problems, or being unable to detect them, doesn’t mean they’ll go away. It just means they’ll eventually escalate.

Research shows that only one percent of users who encounter a software error will ever go on to report it. If you’re not health-checking your software, you’re probably significantly underestimating the problem. One recent study found that 94 percent of Europe’s digital checkouts have issues, whether it be poor layout or dysfunctional design.  
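Health-checking doesn’t have to mean an enterprise monitoring suite. As a minimal sketch (the URLs and thresholds below are placeholders, not a recommendation), even a small scheduled script can ping your storefront’s key pages and flag slow or broken responses before customers do:

```python
# Minimal uptime/latency check for a handful of key pages (URLs are placeholders).
import time
import urllib.error
import urllib.request

def check_page(url, timeout=10, slow_threshold=2.0):
    """Return ("ok" | "slow" | "broken", detail) for one URL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            elapsed = time.monotonic() - start
    except urllib.error.HTTPError as exc:
        return ("broken", f"{url}: HTTP {exc.code}")
    except (urllib.error.URLError, TimeoutError) as exc:
        return ("broken", f"{url}: {exc}")
    if elapsed > slow_threshold:
        return ("slow", f"{url}: took {elapsed:.1f}s")
    return ("ok", f"{url}: responded in {elapsed:.2f}s")

# Hypothetical pages a retailer would watch; run this from cron or a scheduler
# and alert on anything that isn't "ok".
PAGES = ["https://example.com/", "https://example.com/cart"]

if __name__ == "__main__":
    for page in PAGES:
        status, detail = check_page(page)
        print(status.upper(), detail)
```

A real setup would add alerting and error tracking on top, but the principle is the same: measure your software yourself instead of waiting for the one percent of users who report problems.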

While it’s healthy to admit that your tech won’t be 100 percent flawless all the time, it remains absolutely mandatory to control and improve what you can, where you can. Don’t insist on perfection, but aim for progress. Be prepared to make apologies and amends to your customers if you stuff up. Have the humility to recognise that your tech (like all tech!) probably needs work, to ask for help with fixing it, and to say sorry if it affects your customers.

That’s all for this month. I hope you found something here that rang true — your feedback and questions on anything in this column are always welcomed. Send comments, queries, or stray observations to jd@raygun.com.