Because I am a system safety engineer, it seems appropriate to write about safety now and then. This post is one of those times.
A couple of events over the past few days have gotten me to wondering. One event was a meeting with system safety engineering friends of mine. The topic of “software system safety” came up for discussion. This issue is concerned with how to deal with the safety aspects of software that controls machinery – all types of machinery including missiles, aircraft and the current hot topic of driverless automobiles. Obviously there can be a few safety considerations with software controlling this type of equipment, especially with the newer “smart” software that teaches itself! The question is along the lines of “how do we ensure that the software will control the machine in a way that is safe?”
When boiled down to the basic bottom line, system safety is a process of analyzing and/or understanding a system (thing, product, operation – whatever), figuring out what bad things could happen, figuring out what could cause those bad things, figuring out how to prevent them from happening, and then doing what is necessary to implement the control measure(s) (as well as verifying that what you think will work is actually in place and actually solves the problem). This is not an easy task, but luckily there are a few techniques to help with this.
There is general agreement that the process I sketched above works for “hardware” because it is relatively easy to visualize the parts and interactions, and it is apparently relatively straightforward to test and simulate whatever needs to be tested. It is usually judged to be “simple” in comparison to the situation involving complex software controlling critical systems. Software is considered more complex because it is not unusual to have millions of lines of code written by a large number of individuals (and computers in some instances), all of it highly interconnected in ways that are extremely difficult to visualize in sufficient detail to have much confidence that all of the potential logical “paths” through the code have even been identified. It is extremely difficult to ensure that all such paths have been analyzed, designed and tested to prevent some sort of safety problem during operation. One advantage of software is that it doesn’t exactly “fail” – it does the same thing given the same input every single time – unless something changes in the hardware system that runs the software. These “somethings” can be bit changes caused by cosmic rays, speed differences in various components changing the order in which information is processed (referred to as “race conditions”), various ‘hiccups’ where a portion of the hardware (perhaps a microprocessor) stalls for an instant (think about having to reboot Windows because it “hangs up” – which would not be such a good thing while landing a jetliner), dropped data, and many, many more potential “hardware” problems that change how the “software” operates.
Because of the apparent increase of complexity and a decrease in visibility for software controlled systems, there has been a strong push to do “something different” for software than for hardware. (I use the word “apparent” because I don’t think any system is as obvious as it appears to be.) The software folks insist that their stuff is so different that an entirely new approach is required, one that somehow does not follow the model that I sketched out in a previous paragraph.
It has been my contention that there is nothing particularly new or different about software safety engineering compared with what I sketched out in a previous paragraph. We still have to do the same things, but we might do them a little differently because of the nature of the system under investigation. I have lots of ideas about how to approach that, but in general they are all just more of the same things that I do for all systems. One point that the software folks seem to ignore (or maybe don’t recognize) is that almost all systems depend critically upon inherently invisible “software” logic systems – it is just that a lot of that software is embedded in the squishy matrix of people’s brains. Ultimately, almost all safety depends not only upon hardware acting as it should, but also upon the control “systems” doing the same. Unfortunately, people’s minds are nowhere near as predictable as software, either from person to person or within the same person at different times. Including the impacts of people in the system is a MUCH more difficult problem than anything that software controls can generate. So at a basic level, it is all based upon some sort of unanalyzable control system.
One of the things that the software folks like to point to as being inherently different from hardware is that there are so many possible paths through the logic that it is extremely difficult even to identify the possibilities, and harder still to test all of them out (a toy count just below gives a feel for how fast those paths multiply). This is where the “hardwood” part of my story comes in.
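Just to put a rough number on that claim, here is a toy sketch of my own: if a piece of code contains n independent two-way decisions, there are 2^n distinct ways to get through it, and the count gets out of hand long before n reaches anything like the size of a real flight-control or braking program.

```python
# Toy illustration: each independent two-way branch doubles the number of
# distinct execution paths, so exhaustive path testing becomes impossible
# long before a program reaches "millions of lines" scale.
def path_count(independent_branches: int) -> int:
    return 2 ** independent_branches

for n in (10, 20, 40, 64):
    print(f"{n:3d} independent branches -> {path_count(n):,} possible paths")

# 64 branches already yields roughly 1.8e19 paths; loops and data-dependent
# behavior make the real number effectively unbounded.
```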
A couple of days ago I was outside preparing for a “hard freeze” forecast by the weather guessers. In central California this means putting towels or other coverings over exposed water pipes and covering delicate plants. While doing that I turned a corner around a post and ran smack into a short piece of 2×4 wood that had been screwed onto the post at head height. I am a slow learner, but being smacked in the head with a 2×4 woke me up enough to check out the safety of the situation. Upon inspection, it was obvious that the 2×4 had been installed many years ago (probably by me) for some long forgotten purpose. This “hazard” was on a predictable path (similar in concept to the software logical paths) to causing an accident. I find it interesting that the “hazard” had been created long ago, that it should have been an obvious issue at that time, and that it should have been identified and removed when its original purpose was abandoned. The identification and removal process could also have been applied each time someone (such as myself) walked down the path past the obviously hazardous condition.
It was very much like one of those “special” software “features” that everyone is so worried about. My board was similar to something that was put into the code for some sort of good reason, but then was abandoned – but not removed or “disarmed.” Once the situation changed, it “all of a sudden” came back to life. The thing that changed in the case of my head banger was that it was raining, making the ground muddy and slippery. Therefore, I was looking down to avoid tripping and slipping instead of looking up, where I would have seen the board just as I had for many years previously. It was never a problem (a “hazard”) before because it was in plain sight, obvious, and easy to avoid – and I avoided it.
The point is that almost all system safety problems have features like this; otherwise they would not exist, because they would have been recognized and fixed. What appears to be obvious in hindsight isn’t so obvious in the moment. Software has a similar problem, and so do simple things such as that board, which didn’t actually “do” anything; it was just there as an innocuous “feature” of my barn. I see no reason to treat software as anything special, or as something that has unusual properties other than that it can become involved with turning “hazards” into “accidents” – just as has happened with all accidents. Of course that doesn’t mean that it doesn’t require special tools, techniques and knowledge to “solve” the safety problem, just like most everything else in the universe of system safety requires special tools, techniques and knowledge. Rather than spending a lot of time and effort attempting to figure out what is “special” about software safety, I think it is best to figure out how it integrates with hardware and people, and to find engineering and management tools capable of ensuring that hazards are identified and controlled.