CIO

Problems flagged months before HPE hardware failure hit ATO systems

At least 77 events related to components that failed in the December 2016 outage were logged months earlier

Log data generated by the Hewlett Packard Enterprise (HPE) storage hardware being used by the Australian Taxation Office (ATO) revealed potential issues months before the agency’s systems were hit by last year's massive outage.

The hardware trouble struck the ATO in December last year, when an “unprecedented” failure of 3PAR storage area network (SAN) hardware, which HPE had upgraded in November 2015, resulted in widespread outages across many of the ATO’s systems.

Now, the ATO has released its much-anticipated report into the outage, revealing that analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage.

HPE and fellow integration partner, DXC Technology, continue to investigate the issues related to the outage. The report reveals that, although HPE had taken some actions in response to the problems flagged by log data, alerts continued to be reported, indicating that the actions did not resolve the potential SAN stability risk.

Specifically, since May 2016, at least 77 events related to components that were observed to fail in the December 2016 failure were logged in the ATO’s incident resolution tool managed by IT contractor, Leidos.

In addition, at least 159 alerts were recorded in SAN device monitoring and management logs, the ATO report stated.

Some actions had been initiated by Leidos and HPE in response to the indicators, including the collation of incidents by Leidos, and some infrastructure maintenance, including the changing of cables on the Sydney SAN by HPE.

Despite these actions, alerts continued to be reported, indicating that the potential SAN stability risk had not been resolved.

“We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN,” the ATO said in the report.

Ultimately, the ATO said that the massive outage experienced in December 2016 resulted from the compound impact of several factors, including multiple SAN component failures on the agency’s Sydney SAN, which also involved failures associated with stressed fibre optic cabling.

At this stage of the investigation, the ATO considers that stressed fibre optic cabling issues were a major contributor to this outage – regardless of the actions taken by the ATO’s external IT partners, which included the replacement of specific cables.

Other factors contributing to the failure include subsequent unsuccessful attempts by the system to auto-recover in response to the component failures. Consequently, the SAN was unable to provide read/write services to the applications it supported.

Meanwhile, the placement of control, management and monitoring systems “in-band” also played a part, as these systems relied on the same data pathways as the production systems that were supporting impacted services.

The second outage

The report also revealed that a second outage that hit the ATO’s systems on 2 February was caused by further issues associated with the cables.

The second outage followed remedial work by HPE on the SAN fibre optic cables, according to the ATO. The agency was informed that, during one cable replacement exercise, data cards attached to the SAN had been dislodged.

“This caused the 3PAR SAN to act in a similar way to that noted during the December outage,” the ATO said. “This included unsuccessful steps to automatically remediate, followed by a systems shut‑down to preserve data integrity. HPE communicated this Priority 1 incident to us immediately.”

As a result, HPE and the ATO monitored the cables around the clock following the outage, until they were comprehensively replaced between 23 and 26 March.

“We have since been advised that SAN alerts ceased completely once the new fibre optic cables were installed,” the ATO said in its report.

The report also outlines other issues that arose when the initial outage occurred early in the morning on 12 December 2016, revealing that firmware supporting impacted disk drives in the affected SAN prevented those drives from re-booting.

Despite having met ATO-specified conditions for categorisation as a “Priority 1” incident, service provider logs indicated the incident was not escalated to this level until around 7.00am that morning, almost seven hours after the hardware first struck trouble.

Further, system management, configuration, monitoring, and data recovery systems that relied on the SAN also experienced outages, which extended the recovery process for some applications.

In addition, pre-incident design and build decisions were material in extending the time to recover data and bring production and supporting systems online.

The SAN was neither designed nor built to cater for more than a single drive failure or a single cage failure, the report said.

The storage hardware build also included a “daisy‑chain” cage configuration, which exacerbated the risk of errors spreading across cages, as occurred during the incident.

“Although a viable design option at the time of SAN implementation, no evidence has been presented of subsequent options being explored by HPE to mitigate this risk,” the ATO said.

No ATO access under HPE deal

Meanwhile, the report revealed that ATO IT staff had no direct access to the new infrastructure operated by HPE under the agreement the ATO struck with its technology partner in 2015 to replace the pre-existing EMC Corporation SAN with HPE’s own 3PAR SAN2 hardware.

The storage solution provided by HPE to the ATO comprised a primary 3PAR SAN in Sydney with a backup 3PAR SAN in Western Sydney.

Under the deal, the ATO engaged HPE to provide turn-key IT solutions, whereby HPE designed, owned and operated the computing infrastructure and provided services to the “required ATO standard”.

Under the turn-key deal, instead of having direct access to the SAN hardware, the ATO relied upon HPE to provide a full service. The agency also contracted with Leidos as service integrator.

Leidos operates a virtual dashboard over myriad ATO IT systems, and provides a problem management process should issues arise with parts of the ATO’s IT infrastructure.

While procedures were in place to provide manual failover for selected applications in the event of a failure, fully automated failover for the entire suite of applications and services in the event of a complete SAN failure in Sydney was not part of the storage solution, the report stated.

What’s next?

According to the report, the ATO has developed a new storage strategy to enhance its IT stability and resilience. This involves rebuilding its primary and backup storage systems with the newest technology from the HPE product portfolio, working in conjunction with its 3PAR SAN technology.

“All production system workloads are now utilising the enhanced storage system,” the ATO said. “Once data transfer activities are completed, the existing 3PAR SAN will be replaced by a new 3PAR and the current 3PAR SAN decommissioned by late July 2017 for forensic analysis.”

While the ATO is still working with HPE in relation to the storage infrastructure, insights from its experiences relating to the outage will also inform its future IT acquisitions, the agency said.

“As contracts come up for renewal, we need to balance service, stability, resilience and cost,” the ATO said. “Our IT program continues to prioritise government policy reforms and ATO corporate priorities, with a primary focus on another successful Tax Time for 2017.

“Future sourcing of IT is also influenced by whole of government initiatives, including closer collaboration with the Digital Transformation Agency,” it said.

The release of the report comes just days after Australian Commissioner of Taxation, Chris Jordan, told Senate Estimates committee members that the ATO had reached a commercial settlement with HPE, the detailed terms of which are subject to contractual confidentiality.

The move to strike a settlement agreement following the hardware failure and subsequent systems outage echoes the settlement IBM reached with the Australian Government last year over the troubled eCensus project with the Australian Bureau of Statistics (ABS).

Jordan also took the opportunity to reveal the preliminary findings of various investigations into last year’s storage hardware failure, citing a combination of factors and suggesting that the design of the SAN infrastructure played a part in its failure.

“The SAN design and configuration meant we had an over emphasis on performance features rather than stability or resilience - a relatively small disk drive failure had a large impact - only 12 of some 800 disk drives failed, but they impacted most ATO systems,” Jordan said.

“The recovery was slower because some of the recovery tools required were themselves stored on the same SAN that failed,” he said.

In early February, Jordan took aim at HPE, suggesting that the agency’s technology partner had “failed” to reliably provide it with its contracted services.

“Initial indications are there has been a failure by Hewlett Packard Enterprise (HPE) to provide contracted services in a reliable way and ensure stability of our systems,” Jordan said in a statement published on 8 February.