Human error was the suspected root cause of the November outage of Microsoft's Azure cloud storage service. The company expects that recent updates, which automate formerly manual processes, will help prevent similar incidents in the future.
In a blog post detailing the outage, Jason Zander, Corporate Vice President for the Microsoft Azure team, wrote: “Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol. With the tooling updates the policy is now enforced by the deployment platform itself.”
This was not the first time Azure has been hit by human error, and the company pointed to a past incident as well: in February 2013, a lapsed security certificate led to a major outage. Both cases show that even small mistakes can have a major impact in a service as large as Azure, which underscores why Microsoft is automating its manual processes as thoroughly as possible.
The latest Azure outage began on the evening of 18 November (Pacific Time), with intermittent failures in some of the company's storage services. Azure services that depend on the storage service also went offline, most notably Azure Virtual Machines.
The outage was triggered by a configuration change to the storage service that was intended to improve its performance. Like most cloud providers, Microsoft first tries proposed changes on a handful of servers. That way, if there is a problem with the configuration change, engineers can spot it early, before a large number of customers are affected. Only if the change works out well is it then rolled out at a bigger scale.
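The staged-rollout process described above can be sketched in a few lines of code. Everything here is a hypothetical illustration, not Microsoft's actual deployment tooling: the function names, the stage fractions, and the health check are all assumptions.

```python
# Hypothetical sketch of a staged rollout: a change is applied to a small
# slice of servers first, validated, and only then widened. These names and
# stage sizes are illustrative, not Azure's real deployment pipeline.

def staged_rollout(servers, apply_change, is_healthy, stages=(0.01, 0.10, 1.0)):
    """Apply a change in progressively larger slices, halting on failure.

    Returns (succeeded, number_of_servers_touched).
    """
    done = 0
    for fraction in stages:
        target = int(len(servers) * fraction)
        batch = servers[done:target]
        for server in batch:
            apply_change(server)
        done = target
        # Validate the slice before widening the rollout any further.
        if not all(is_healthy(s) for s in batch):
            return False, done  # halt: only 'done' servers were affected
    return True, done

if __name__ == "__main__":
    # Example: a change that "breaks" servers whose id is divisible by 7.
    fleet = list(range(1000))
    state = {}
    ok, touched = staged_rollout(
        fleet,
        apply_change=lambda s: state.__setitem__(s, s % 7 != 0),
        is_healthy=lambda s: state[s],
    )
    print(ok, touched)  # the bad change is caught in the first small slice
```

The point of the early-exit check is exactly what the article describes: a bad change is contained to a small fraction of the fleet instead of reaching every customer.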
In this case, however, the update was assumed to have already been tested and was applied across the rest of the system. The configuration change triggered an elusive bug in the storage service software, sending it into an infinite loop and blocking further communication with other components of the system.
Engineers identified the problem and fixed it without losing any further time. The storage service was back online by 10:50 am, after which all of the virtual machines were restored.
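The enforcement Zander describes, with the policy applied by the deployment platform itself rather than by human protocol, can be sketched as a precondition check in the tooling. The class, method, and change names below are hypothetical illustrations, not Azure's actual implementation.

```python
# Hypothetical sketch of policy enforcement inside a deployment platform:
# a fleet-wide rollout is refused unless the change has a passing test-flight
# record, so the check no longer depends on a human remembering the protocol.
# All names here are illustrative assumptions.

class PolicyViolation(Exception):
    """Raised when a deployment would bypass the test-first policy."""

class DeploymentPlatform:
    def __init__(self):
        self._flight_passed = {}  # change_id -> bool

    def record_flight(self, change_id, passed):
        # Store the outcome of the small-scale test deployment.
        self._flight_passed[change_id] = passed

    def deploy_broadly(self, change_id):
        # The platform, not the operator, enforces the policy.
        if not self._flight_passed.get(change_id, False):
            raise PolicyViolation(
                f"change {change_id!r} has no passing test flight"
            )
        return f"deploying {change_id} fleet-wide"

platform = DeploymentPlatform()
try:
    platform.deploy_broadly("perf-tweak")  # no flight record: refused
except PolicyViolation as err:
    print("blocked:", err)

platform.record_flight("perf-tweak", passed=True)
print(platform.deploy_broadly("perf-tweak"))
```

Moving the check into the platform closes exactly the gap the incident exposed: a human can no longer skip the small-scale flight, whether by mistake or by assumption.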
In sharing the root cause analysis of the November incident, Zander said the reason for putting everything upfront was so that users would see this transparency as proof of Microsoft's commitment to providing a quality cloud hosting service.
Check out the video below for a better insight into the November outage:
In video: CTO Mark Russinovich