This, perhaps, the most painful
This report has all the hallmarks of technical debt in a huge, unmaintained code base that is still in production (the error was caused by executing code that had not been used for almost 9 years), and it tells a terrible and sad story about how software developers and IT operations staff worked together.
Key points:
To enable its customers to participate in the Liquidity Program (PL) at the New York Stock Exchange, which was scheduled to launch on August 1, 2012, Knight made a number of changes to its systems and program code related to order processing. These changes included developing and deploying new code in SMARS. SMARS is an automated, high-speed, algorithmic router that sends orders to the market. One of the main functions of SMARS is to receive orders from other components of the Knight trading platform ("parent" orders) and then, as needed based on available liquidity, to send one or more representative (or "child") orders to external venues for execution.
13. When deployed, the new PL code in SMARS was supposed to replace unused code in the corresponding part of the router. This unused code had previously been needed for the Power Peg function, which the company had not used for many years. Despite this, the code remained present and callable at the time the PL was deployed. The new PL code reused a flag that had previously activated Power Peg. Knight intended to remove the Power Peg code so that, when this flag was set, the new PL functionality would be used instead of Power Peg.
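The risk of reusing a flag can be illustrated with a minimal Python sketch. The names here (`Order`, `handle_order_old`, `handle_order_new`, the `flag` field, the server names) are hypothetical and are not taken from Knight's code; the only point is that the same flag value means different things to servers running different builds.

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    quantity: int
    flag: bool  # hypothetical reused flag: "Power Peg" in the old build, "PL" in the new one


def handle_order_old(order: Order) -> str:
    """Old build: the flag still activates the legacy Power Peg path."""
    return "power_peg" if order.flag else "default"


def handle_order_new(order: Order) -> str:
    """New build: the same flag now activates the PL path."""
    return "pl" if order.flag else "default"


def route(order: Order, fleet: dict) -> dict:
    """Return which code path each server would take for the same order."""
    return {name: handler(order) for name, handler in fleet.items()}


# Seven servers carry the new build; one still runs the old build.
fleet = {f"server{i}": handle_order_new for i in range(1, 8)}
fleet["server8"] = handle_order_old

paths = route(Order("XYZ", 100, flag=True), fleet)
stale = [name for name, path in paths.items() if path == "power_peg"]
print(stale)
```

In a mixed fleet like this, nothing in the order itself reveals which interpretation a given server will apply, which is why removing the old code path everywhere matters more than the flag change itself.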
14. Previously, when Power Peg was in use, a cumulative quantity function counted the number of shares executed in child orders and signaled that child orders should stop being routed once the parent order had been completely filled. In 2003, Knight stopped using Power Peg. In 2005, Knight changed the Power Peg code by moving the parent order tracking function to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after this change, so nobody verified that the procedure would still work correctly if called.
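The role of the cumulative quantity check can be sketched roughly as follows. This is an illustrative model and not Knight's code: `route_children`, `child_size`, `track_fills`, and `max_children` are assumed names, and the cap exists only so that the broken variant terminates in the example.

```python
def route_children(parent_qty: int, child_size: int, track_fills: bool,
                   max_children: int = 1000) -> int:
    """Send child orders for a parent order and return how many were sent.

    track_fills=True models the intended behaviour: filled shares are
    accumulated and routing stops once the parent is fully filled.
    track_fills=False models the fill count never reaching the loop.
    """
    filled = 0
    sent = 0
    while sent < max_children:  # cap added only to keep the demo finite
        if filled >= parent_qty:
            break  # parent order completed, stop routing
        sent += 1  # route one child order; assume it fills completely
        if track_fills:
            filled += min(child_size, parent_qty - filled)
    return sent


print(route_children(1000, 100, track_fills=True))
print(route_children(1000, 100, track_fills=False))
```

With tracking in place, the loop stops after the parent quantity is covered; without it, the only thing that stops the loop in this sketch is the artificial cap.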
15. Beginning on July 27, 2012, Knight deployed the new PL code in SMARS by placing it on a limited number of servers at a time. During the deployment, one of the technicians did not copy the new code to one of the eight SMARS servers. Knight did not have a second technician review the deployment, and nobody realized that the Power Peg code had not been removed from the eighth server and that the new PL code had not been added to it. Knight had no written procedures that required such a review.
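A check of the kind that was missing might look like the sketch below. It assumes that each server can report a checksum of its deployed build; the server names, the `find_mismatched` helper, and the sample build contents are all hypothetical.

```python
import hashlib


def checksum(build: bytes) -> str:
    """SHA-256 digest of a build artifact."""
    return hashlib.sha256(build).hexdigest()


def find_mismatched(expected: str, reported: dict) -> list:
    """Return the servers whose reported checksum differs from the expected one."""
    return sorted(name for name, digest in reported.items() if digest != expected)


new_build = b"smars build with PL code"
old_build = b"smars build with Power Peg code"

expected = checksum(new_build)
reported = {f"smars{i}": checksum(new_build) for i in range(1, 8)}
reported["smars8"] = checksum(old_build)  # the server that was skipped

mismatched = find_mismatched(expected, reported)
if mismatched:
    print("deployment incomplete on:", ", ".join(mismatched))
```

Comparing every server against one expected value, rather than trusting that each copy step happened, turns a skipped server into a visible failure instead of a silent one.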
16. On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the PL. Seven servers processed these orders correctly. However, orders sent to the eighth server with the reused flag set triggered the defective Power Peg code, which was still present on that server. As a result, the server treated the orders as parent orders and began sending child orders to trading centers. Because the function that checked whether the parent order had been filled had been moved to a different stage of the process, the server kept sending child orders nonstop, regardless of the fact that the parent order had already been filled. Although one part of the order processing system recognized that the parent order had been filled, this information did not reach SMARS.
19. On August 1, Knight also received orders that were eligible for the PL but were designated for trading before the market opened. SMARS processed these orders and, starting at about 8:01 a.m., internal systems generated automated messages (called "BNET rejects") that referred to SMARS and described the error as "Power Peg disabled". The Knight system sent 97 such messages before the market opened at 9:30 a.m. Messages of this type were not designed to be treated as system alerts, and staff generally did not read them.
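One way to keep repeated messages like these from being ignored is to escalate on repetition. The sketch below is an assumption about how that could be done and not a description of Knight's tooling: the `Escalator` class, the threshold, and the time window are invented for illustration.

```python
from collections import deque


class Escalator:
    """Escalate when the same error text repeats too often within a time window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self.seen = {}  # error text -> deque of timestamps

    def record(self, error: str, timestamp: float) -> bool:
        """Record one message; return True if it should page a human."""
        times = self.seen.setdefault(error, deque())
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


escalator = Escalator(threshold=5, window_seconds=600)
# Simulate one message per minute as a stand-in for the repeated messages.
results = [escalator.record("Power Peg disabled", t * 60.0) for t in range(10)]
print(results.index(True))
```

The design choice is deliberately simple: a single informational message stays quiet, but the same error arriving again and again within a short window is promoted to something a person has to look at.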
It gets even better:
27. On August 1, Knight had no supervisory procedures for incident response. In other words, the company had no procedures to guide its personnel when serious problems arose. On August 1, Knight relied on its team of technicians to identify and fix the problems in SMARS in a live trading environment. The Knight system continued to send millions of child orders while staff tried to identify the source of the problem. The company even removed the new PL code from the seven servers on which it had been installed correctly. This made the situation worse, because new parent orders then activated the Power Peg code that was present on these servers, just as had already happened on the eighth server.
The entire document is undoubtedly worth reading; it focuses heavily on new verification procedures, performed by people, intended to prevent a similar disaster. The developers' mistakes certainly came down to human factors, but consequences on this scale were the result of a poor deployment script and appalling monitoring. What kind of shop does not even check that the software version is the same across its cluster? Let alone use a deployment script that checks return codes.
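For illustration, a deployment loop that checks return codes could be sketched in Python as below. This is a sketch under stated assumptions, not a reconstruction of Knight's script: the host names are made up, and `fake_copy` is a stand-in that simulates a copy command failing on one host by launching a Python subprocess with a chosen exit code.

```python
import subprocess
import sys


def run_step(command: list) -> None:
    """Run one deployment step and raise if it exits with a non-zero code."""
    subprocess.run(command, check=True)


def deploy(hosts: list, make_command) -> list:
    """Deploy to each host in turn and stop at the first failure.

    Returns the hosts that were deployed successfully.
    """
    done = []
    for host in hosts:
        try:
            run_step(make_command(host))
        except subprocess.CalledProcessError as err:
            print(f"deploy to {host} failed with exit code {err.returncode}")
            break
        done.append(host)
    return done


def fake_copy(host: str) -> list:
    """Stand-in for a real copy command: fails for 'smars8', succeeds otherwise."""
    code = 1 if host == "smars8" else 0
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


hosts = [f"smars{i}" for i in range(1, 9)]
deployed = deploy(hosts, fake_copy)
missing = [h for h in hosts if h not in deployed]
print("not deployed:", missing)
```

Passing `check=True` to `subprocess.run` makes a non-zero exit code raise `subprocess.CalledProcessError`, so a failed step cannot pass unnoticed; the script then reports exactly which hosts were left out.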
One can only hope that the "written test procedures" for unused code meant systematic tests, although Wikipedia suggests otherwise.
And for dessert: the fine came to another 12 million dollars, and the audit showed that the system had been constantly attempting speculative short selling.
Original : http://habrahabr.ru/post/198766/