Software testing lessons learned from Knight Capital fiasco

It took only one defect in a trading algorithm for Knight Capital to lose $440 million in about 30 minutes. That $440 million is three times the company's annual earnings.

The shock and sell-off that followed caused Knight Capital's stock to lose 75 percent of its value in two business days. The loss of liquidity was so great that Knight Capital needed to take on an additional $400 million line of credit, which, according to the Wall Street Journal, effectively shifted control of the company from the management group to its new creditors.

Knight Capital was regulated by the Securities and Exchange Commission, routinely audited and PCI compliant. If a bug like that could affect Knight, it could happen to any company. At least that's what Knight Capital CEO Thomas Joyce seemed to imply in an interview with Bloomberg Television. "Technology breaks. It ain't good. We don't look forward to it," he says, adding, "It was a software bug ... It happened to be a very large software bug."

"Technology breaks. It ain't good. We don't look forward to it," Knight Capital CEO Thomas Joyce says. (Image courtesy of Bloomberg)

This incident wasn't the first of its kind. In 2010, something caused the Dow Jones Industrial Average to drop 600 points in roughly five minutes in what is now known as the "flash crash." Nasdaq blamed the disastrous Facebook IPO on a similar technical glitch.

Mistiming, Bad Orders Crash High-Frequency Trading Algorithm

In early June 2012, the New York Stock Exchange (NYSE) received permission from the SEC to launch its Retail Liquidity Program. The RLP, designed to offer individual investors the best possible price, even if it means diverting trades away from the NYSE and onto a so-called "dark market," was set to go live on Aug. 1. That gave trading houses roughly a month and a half to scramble to write code to take advantage of the new feature.

The Knight Capital incident happened in the first 30 minutes of trading on Aug. 1. Something went very wrong in the code that had been introduced overnight. The code itself was a high-frequency trading algorithm designed to buy and sell massive amounts of stock in a short period of time. A combination of mistiming and bad orders led to disastrous results.

Beyond admitting a software defect, the staff at Knight Capital have been reluctant to discuss exactly what caused the defect. They aren't alone-the majority of financial-related inquiries for this article led to responses such as "No comment," "I can't comment" or "We cannot comment on this story."

One technologist at a financial services company, who asked to remain anonymous, suggests two possibilities. It could have been the standard rush to production without proper testing. Parse the statements from Knight Capital carefully, the technologist says, and it's possible that the program that went into production was actually a test program-one designed to simulate trade requests and evaluate if they went through properly. Nanex conducted an analysis of the trades last week and came to the same conclusion.
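
Knight Capital hasn't published the code in question, so any illustration is speculative. Still, the "test program in production" scenario maps to a well-known safeguard: simulation code that refuses to start unless the environment is explicitly marked as non-production. The sketch below is a minimal, hypothetical Python example; the TRADING_ENV variable, the module and the order format are all assumptions, not Knight Capital's actual system.

```python
import os
import sys

# Hypothetical illustration -- not Knight Capital's actual code.
# A simulated-order generator that refuses to run unless the environment
# is explicitly marked as non-production.

ALLOWED_SIM_ENVIRONMENTS = {"dev", "qa", "staging"}

def assert_simulation_allowed() -> None:
    env = os.environ.get("TRADING_ENV", "production").lower()
    if env not in ALLOWED_SIM_ENVIRONMENTS:
        sys.exit(f"Refusing to start: simulation code cannot run in '{env}'")

def generate_simulated_orders(count: int):
    """Emit fake buy/sell requests used only to exercise the order pipeline."""
    assert_simulation_allowed()
    for i in range(count):
        yield {"id": i, "symbol": "TEST", "side": "BUY" if i % 2 == 0 else "SELL", "qty": 100}

if __name__ == "__main__":
    os.environ.setdefault("TRADING_ENV", "qa")  # demo only: mark this run as non-production
    for order in generate_simulated_orders(5):
        print(order)
```

A guard this simple turns "test program deployed to production" into a startup failure rather than 30 minutes of live orders.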

Rick Lane, CTO of Trading Technologies in Chicago, agrees that the problem might be a test program in production-or, possibly, a configuration flag that wasn't ready for production and should have been turned off. He points out that these trading algorithms are developed incredibly quickly, as they are designed to chase fleeting opportunities, and that good change management may take a backseat to speed.

"The scary thing is this happens more often than people think, and not just by trading shops," Lane says. "In September 2010 the Chicago Mercantile Exchange ran a program that accidentally injected test orders into [its] production system-and the CME doesn't even have the kind of time pressure that these trading shops have."

Adding Retrospective to Development Process Can Reduce Errors

Jeff Sutherland, a co-author of the Agile Manifesto who helped formalize the Scrum methodology, adds a third possibility: the team may have been using a development method prone to error.

Sutherland, also a former U.S. Air Force pilot, recommends an external assessment, much like the process the National Transportation Safety Board uses for airplane accidents. Without some assessment, he says, we may never know what went wrong-and we run the risk of trying to prevent the wrong problem.

Only a thorough assessment of Knight Capital's software development lifecycle will tell us what happened at the New York Stock Exchange on Aug. 1, 2012, experts say. (Image courtesy of Ryan Lawler via Wikimedia Commons)

George Dinwiddie, principal consultant at iDIA Computing, also recommends an assessment. Any company can assess its organization using a tool called a retrospective, Dinwiddie says. The retrospective is a formal "look back" process that considers what is actually happening, what the risks are and how the team can improve.

In the Army, retrospectives are called "after-action reviews." The latest thinking in software, though, is to hold that conversation before the software is deployed, so problems can be caught and fixed first. The Agile Retrospective Resource Wiki provides a host of options.

One effective method I recommend is to ask what is going right, what is going wrong and what the team should do differently. Team members write cards listing what they would like to discuss, then vote by placing dots on the cards. The team discusses the two most heavily dotted items in each category.
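
For teams that capture the cards digitally, the selection step is easy to tally. This is a small, illustrative Python sketch, assuming each card is recorded as a (category, text, dots) tuple; the card contents below are invented examples, not from any real retrospective.

```python
from collections import defaultdict

# Illustrative only: picks the two most heavily dotted cards per category.
cards = [
    ("going right", "Pairing on risky changes", 4),
    ("going wrong", "No rollback plan for deploys", 7),
    ("going wrong", "Test data can reach production", 5),
    ("going wrong", "Release windows are too tight", 3),
    ("do differently", "Hold a retrospective before each release", 6),
]

by_category = defaultdict(list)
for category, text, dots in cards:
    by_category[category].append((dots, text))

for category, items in by_category.items():
    top_two = sorted(items, reverse=True)[:2]
    print(category, "->", [text for _, text in top_two])
```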

When there is a problem, another anonymous source points out, someone in the organization usually knows about it but may not feel safe enough to bring up the issue without a large, supportive forum. Retrospectives provide not only an open door but group consensus as well. Someone can raise an issue and get support; that's hard to turn a blind eye to.

4 Ways to Improve Software Testing and Reduce Risk

After the retrospective, your team may come up with a list of risks and issues such as those (theoretically) identified in the Knight Capital case. If so, consider these four techniques to reduce risk.

Knight Capital may never be transparent enough for us to conduct an assessment of what went wrong, or to even see a retrospective report. That shouldn't stop your organization. This could be an opportunity to examine your systems and how they interoperate while determining the value of investing time and energy in risk management.

It's hard work, and it's not eye-popping, but good risk management is likely to keep your company off the CNN, Wall Street Journal or Financial Times home page. That just might turn out to be a most excellent thing.

Matthew Heusser is a consultant and writer based in West Michigan. You can follow Matt on Twitter @mheusser, contact him by email or visit the website of his company, Excelon Development. Follow everything from CIO.com on Twitter @CIOonline, on Facebook, and on Google +.
