I have been following the Boeing 737 Max MCAS software issue with great interest since the crash of Ethiopian Airlines Flight 302. For those who are not familiar with the story, the basic and highly simplified version is that the Boeing 737 Max is a redesign of the highly successful Boeing 737.
The 737 Max redesign required larger engines than the original 737, which in turn required the new engines to be mounted farther forward on the wing, which in turn changed the aerodynamics of the plane. This resulted in a higher propensity for the plane to pitch up under certain conditions, which could in turn cause the plane to stall and nose-dive.
Rather than redesigning the airframe to accommodate the new engines, Boeing chose to develop a software solution – MCAS (Maneuvering Characteristics Augmentation System) – to mitigate the pitch problem. The basic concept is that when MCAS recognizes a problem with the angle of attack of the plane – essentially too high a pitch with respect to air speed and other variables – MCAS automatically pushes the nose down (reduces the angle of attack) to an acceptable level.
The explanation above is woefully simplified but sufficiently conceptually accurate to provide the context for this post. Fast forward to today – the recent crashes of Lion Air Flight 610 and Ethiopian Airlines Flight 302 resulted in the grounding of the Boeing 737 Max jets worldwide. The investigation by the FAA and other regulatory bodies appears to be focusing on flaws in the MCAS software.
The focus of this post is not to judge Boeing’s business and engineering decisions regarding the 737 Max redesign. I am not qualified to enter into that debate. My focus is to provide insight into how to get business systems requirements “right” - which result in high-quality highly-functional software – and to provide insight, from my perspective, into how and why Boeing got it “wrong” with the MCAS software.
Epiphany [Def]: An unusually sudden manifestation or perception of the essential nature or meaning of something; an intuitive grasp of reality through something (such as an event) usually simple and striking; an illuminating discovery, realization, or disclosure.
To date, I have over five million miles of air travel. Yes, I do a lot of business travel. I am probably one of the most confident and fearless fliers. On red-eye flights I am typically asleep before the flight is wheels up. I am an ardent supporter of the commercial airline and aircraft industry.
Excerpt from my book Mastering Business Chaos. P. 170:
“Let me interject here to advocate a bit for the airlines. I am a frequent flyer. From my experience, it is amazing how well the airline industry runs in an extremely chaotic environment. There are so many moving parts in the airline industry! There are different types of equipment, different flight crews across the world, international schedules and time zones, multiple languages, FAA regulations and inclement weather (even volcanic ash) – the list is long.
To me, the airline industry does an excellent job of mastering their particular chaos. Despite all these moving parts, many of which are completely out of their control, the airlines are able to achieve solid consistent results across a wide range of metrics such as on-time arrival, departure and flight safety.”
I also have extensive experience and expertise in business strategy, business process reengineering, business and systems analysis and software engineering. That is why I am so deeply disturbed by the article that I just read in Bloomberg: “Boeing's 737 Max Software Outsourced to $9-an-Hour Engineers - How did the company renowned for meticulous design make seemingly basic software mistakes?”
I highly encourage you to read the article. It sparked my epiphany moment with respect to the flaws of the MCAS Software. It’s an excellent article. However, in all fairness to Boeing, the title of the article is a bit misleading. Excerpt from the article:
“Boeing said the company did not rely on engineers from HCL and Cyient for the Maneuvering Characteristics Augmentation System, which has been linked to the Lion Air crash last October and the Ethiopian Airlines disaster in March.”
In other words, the MCAS software, according to Boeing, was not developed by $9-an-hour engineers. However, this is perhaps even more disturbing because the defects resulted from presumably higher-paid, higher-quality engineers from Boeing’s own team working under Boeing’s direct control and supervision.
That said, the flaws in the engineering of the MCAS software just did not “fit” my long-held belief in the integrity of Boeing’s commitment to flight safety. Others felt the same way – excerpt from the article:
“It remains the mystery at the heart of Boeing Co.’s 737 Max crisis: how a company renowned for meticulous design made seemingly basic software mistakes leading to a pair of deadly crashes.”
The article goes on to say:
“The coders from HCL were typically designing to specifications set by Boeing. Still, “it was controversial because it was far less efficient than Boeing engineers just writing the code,” Rabin said. Frequently, he recalled, “it took many rounds going back and forth because the code was not done correctly.”
Boom! That was my Epiphany moment!
There is no mystery. Boeing does not appear to me to be thinking about this correctly. The deficiencies in the MCAS software are not so much about the software engineers (whether Boeing, HCL or Cyient software engineers) writing the code or that the code was not done correctly. The root cause, in my opinion, is inadequate and unsophisticated business and systems analysis, further exacerbated by poor collaboration at all levels among the development team. In other words, there is no mysterious or exotic cause of the deficiencies. The root cause of the deficiencies is as classic and old-school as it gets. More on this later.
Flight Safety and the Customer Value Proposition
In my Business Process Reengineering training course I discuss in detail the concept of customer and business value and dimensions of effectiveness (drivers of customer and business value). A very popular exercise with participants is a facilitated discussion regarding the dimensions of effectiveness in the commercial airline business – from both a customer and from a business perspective.
As part of the discussion I ask the participants to identify the number one dimension of effectiveness in commercial air travel. Participants provide a wide range of excellent responses such as convenient flight schedules, on-time arrival, comfort and seat selection, etc. After some discussion of the various dimensions, I reveal that the #1 dimension of effectiveness from both a customer and company perspective is flight safety!
That’s it, nothing else matters unless customers first and foremost have a high level of confidence that the flight will arrive safely. And, that is why the defects in the 737 Max MCAS software are so disturbing.
Getting the Requirements Right (or wrong in this case)
It’s reasonable to take the position that flight safety was compromised at some level by Boeing’s relentless drive to reduce design and engineering costs and, perhaps more importantly, to reduce the time to market. Based on the article and some reading between the lines, it appears that systems analysts and the software engineering team were under significant time pressure to get the MCAS software into production. That said, time pressure is pretty much business as usual in most organizations – systems analysts and software engineers are constantly under significant time pressure to get software into production.
So, there must be more to the story than time pressure. In my opinion, based on the article and based on my experience gleaned over many years, the primary reason for the undiscovered discoverable defects in the MCAS software was the lack of sophistication of Boeing’s business systems analysis and methods and the lack of experience of the MCAS analysis and development team.
In my Business Systems Analysis course I discuss the concept of value engineering, essentially the tradeoff between quality and the cost of quality, in detail. The example that I use is the distinction between developing software to support a fast food point of sale order entry system vs. developing software for the navigation and guidance systems of a NASA deep space probe. In either case, the development team does not want to release software into production that does not have acceptable functionality and/or is full of bugs and defects. However, with the NASA deep space probe, there is clearly a compelling business case to invest in a deeper level of analysis, design, development and testing vs. a fast food point of sale system.
That said, there are some basic fundamental systems analysis and software engineering concepts that would apply to both of the applications. And, in my opinion, based on the article, those basic fundamental concepts, if applied to MCAS, would have detected the undiscovered discoverable defects early in analysis and resolved the defects prior to design, development, testing – and certainly prior to production.
Break-fix Scenario Analysis
Great software requires three types of high-performing professionals: 1) Subject Matter Experts (SMEs) who understand the business space; 2) Business System Analysts (BSAs) who are experts in gleaning the deep business knowledge of the SMEs and translating that knowledge into clear, thorough and accurate business systems requirements; and 3) software engineers (developers, DBAs, UI and UX specialists, etc.) who actually create the software from the requirements. It also takes significant collaboration among these three types of professionals to get it right. Part of the secret sauce of getting the requirements right and ultimately getting the software right is a technique called break-fix scenario analysis.
The concept of break-fix scenario analysis is to discover and resolve discoverable defects as early as possible. So, for example, based on a series of initial meetings between the SMEs and BSAs, the BSAs create an initial set of baseline business systems requirements. The BSAs then walk through the baseline set of requirements with the SMEs to confirm/validate the requirements and revise the requirements as necessary. The SMEs and the BSAs typically will go through several iterations to get to a reasonably solid set of requirements. This is where break-fix scenario analysis between the SMEs and the BSAs comes into play. It’s the responsibility of competent, sophisticated, experienced BSAs to challenge the SMEs to identify what-if scenarios that break the requirements that have been identified and defined at this point.
I have been doing sophisticated business and business system analysis for many years. It never ceases to amaze me that, regardless of how well I have identified and defined requirements prior to break-fix scenario analysis (and I am very good at discovering and identifying requirements), so many additional requirements are discovered just by asking the SMEs to go a bit deeper and expand their thinking to identify what-if scenarios. I love these scenarios! We talk through the scenarios with the SMEs and revise the requirements accordingly. It is much better to identify these requirements now rather than during software development, and you certainly do not want to identify these requirements in production – which is essentially what happened in the case of MCAS.
From my perspective, it’s almost inconceivable that, if Boeing’s BSAs had applied even rudimentary break-fix scenario analysis with the SMEs during analysis, at least one of the SMEs would not have come up with the scenario of “what-if” the sensor providing input to MCAS was defective or damaged during flight and provided erroneous data to MCAS. That would then lead to the discussion of “how would MCAS know it’s receiving erroneous data” and “what is the impact on the flight controls affected by MCAS.” It seems to me that this type of analysis would then lead to identifying that, under certain conditions, MCAS could put the aircraft into an unrecoverable dive.
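To make the faulty-sensor “what-if” concrete, here is a minimal sketch of how such a scenario could be written down as a requirement and then exercised against a list of what-if cases. This is purely illustrative and is not Boeing’s design or code: the two-sensor cross-check, the function names (decide, is_plausible, run_scenarios) and every threshold value are my own hypothetical assumptions, chosen only to show the shape of the exercise.

```python
from dataclasses import dataclass
from typing import Optional

# All values below are hypothetical and exist only for illustration.
AOA_LIMIT_DEG = 15.0                 # above this, nose-down trim is requested
MAX_DISAGREEMENT_DEG = 5.0           # largest tolerated gap between the two sensors
PLAUSIBLE_RANGE_DEG = (-30.0, 40.0)  # readings outside this range are rejected


@dataclass
class Decision:
    command_nose_down: bool
    reason: str


def is_plausible(reading: Optional[float]) -> bool:
    """A reading is usable only if it exists and lies in the plausible range."""
    if reading is None:
        return False
    low, high = PLAUSIBLE_RANGE_DEG
    return low <= reading <= high  # NaN fails both comparisons, so it is rejected


def decide(left_aoa: Optional[float], right_aoa: Optional[float]) -> Decision:
    """Request nose-down trim only when two independent sensors agree it is needed."""
    if not (is_plausible(left_aoa) and is_plausible(right_aoa)):
        return Decision(False, "sensor reading missing or implausible")
    if abs(left_aoa - right_aoa) > MAX_DISAGREEMENT_DEG:
        return Decision(False, "sensors disagree")
    if min(left_aoa, right_aoa) > AOA_LIMIT_DEG:
        return Decision(True, "both sensors report high angle of attack")
    return Decision(False, "angle of attack within limits")


# What-if scenarios of the kind an SME might raise, expressed as sensor inputs
# (left reading, right reading) in degrees.
SCENARIOS = {
    "normal cruise": (3.0, 3.5),
    "genuine high angle of attack": (17.0, 18.0),
    "one sensor stuck high": (74.5, 4.0),
    "one sensor failed (no data)": (None, 4.0),
    "sensors disagree within plausible range": (20.0, 5.0),
}


def run_scenarios() -> dict:
    """Run every what-if scenario through the requirement and collect the outcomes."""
    return {name: decide(*readings) for name, readings in SCENARIOS.items()}


if __name__ == "__main__":
    for name, decision in run_scenarios().items():
        print(f"{name}: nose_down={decision.command_nose_down} ({decision.reason})")
```

The value is less in the code than in the scenario list: each entry is a question an SME or software engineer could raise in a walkthrough, and each one forces the requirement to state what the system should do when its input cannot be trusted.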
The concept of break-fix scenario analysis also applies to collaboration between BSAs and the software engineering team. The requirements discussion with the software engineering team has a different dynamic than the discussion with SMEs. The discussion with SMEs is from a customer and business facing perspective regarding what the software needs to do. The discussion with software engineers, from a software engineer’s perspective, is “how do I make software out of these requirements?”
Accordingly, software engineers ask deep and interesting questions about the requirements based on their very difficult challenge of developing high-quality, highly-functional software from the requirements provided by the BSAs. Conducting break-fix scenario analysis with software engineers requires BSAs to think about requirements from a different perspective – a software engineering perspective.
And, again, it never ceases to amaze me how many new and/or revised requirements result from conducting break-fix scenario analysis with the software engineering team. Software engineers are very smart people. And, again, it’s inconceivable to me that, if the Boeing BSAs had been conducting at least rudimentary break-fix scenario analysis with the Boeing software engineering team, at least one member of that team would not have asked the question “what-if” the sensor providing input to MCAS provided erroneous data to MCAS.
The focus of this post is not to judge Boeing’s business and engineering decisions regarding the 737 Max redesign. My focus is to provide insight into how to get business systems requirements “right” - which result in high-quality, highly-functional software – and to provide insight, from my perspective, into how and why Boeing got it “wrong” with the MCAS software.
My message is that getting the requirements “right” is difficult and transforming requirements into software is equally difficult - regardless of the time and budget. However, applying solid best practice analysis techniques such as break-fix scenario analysis and utilizing experienced professionals results in thorough, accurate, clear requirements enabling software engineers to create high-quality, highly-functional software.
A key issue in many organizations, particularly organizations that are not in the information technology industry, is that leadership often does not fully understand the level of complexity associated with software development and the level of sophistication and collaboration required of the development team (SMEs, BSAs, and software engineers) to create high-quality, highly-functional software. The result is that leadership often views systems analysis and software engineering skills as a commodity, leading to selecting resources based on low cost rather than on the level of talent.
I can say to you with complete confidence, based on years of experience in the information technology industry, that there is an exponential increase in the quality of analysis and resulting code as you move up the experience curve. There is far more net value created by using experienced (and therefore higher paid) analysts and software engineers than money saved by using less experienced analysts and software engineers. The legendary best-selling book The Mythical Man-Month, by Frederick P. Brooks, Jr., provides an excellent in-depth discussion of this concept.
The combination of using low-cost, less-experienced resources combined with the “just get it done and into production” attitude typically results in software that does not adequately support the needs of the business, is flawed and fraught with defects, and is highly maintenance intensive.
Were the cost savings from getting the MCAS software into production quickly using, presumably, less than highly experienced resources worth the cost of grounding the worldwide 737 Max fleet for many months and, perhaps more importantly, the effect on Boeing’s reputation and customer confidence in Boeing’s products?
Boeing’s seeming lack of sophistication and professionalism in identifying and analyzing business system requirements raises an additional question in my mind – was failing to discover the faulty sensor scenario a one-off “miss” by the team, or are there other discoverable but as yet undiscovered flaws in the MCAS software?
From my perspective, unless the Boeing team engages in sophisticated break-fix scenario analysis at all levels (business analysis, systems analysis, software development), my concern is that additional undiscovered discoverable flaws will be discovered in production (just like the sensor input flaws) rather than in analysis and development. I am a confident and fearless flyer, but it’s going to take some time before I am sufficiently confident to fly in a 737 Max.
A final note. Of my over 5 million miles of flying, over 3 million of those miles are on American Airlines and close to 1 million miles are on Southwest Airlines. I have significant respect for and confidence in both organizations and especially in their pilots and mechanics with respect to identifying and vetting safety issues. Accordingly, if American Airlines and Southwest Airlines and their pilots and mechanics deem the 737 Max and the MCAS software to be safe, I’ll be more inclined, regardless of my loss of confidence in Boeing, to be an early adopter of the 737 Max when the FAA grounding is lifted.
* * * * *