Please Document the Shop: On the importance of good systems documentation

Friday, May 21st, 2010

By Laird Hariu

We have all heard this: You need to document the computer infrastructure. You never know when you might be “hit by a bus”. We hear this and think many frightening things, reassure ourselves that it will never happen and then put the request on the back burner. In this article I will expand on the phrase “hit by a bus” and then look at the consequences.

Things do happen to prevent people from coming into work. The boss calls home, talks to the wife and makes the sad discovery that Mike won’t be coming in anymore: he passed away last night in bed. People get sudden illnesses that disable them. Car accidents happen.

More often than such tragedies, thank goodness, business conditions change without warning. In reorganizations, whole departments disappear, computer rooms are consolidated and moved, companies are bought and whole workforces replaced. I have had the unhappy experience of living through some of this.

Some organizations have highly transient workforces because of the environment that they operate in. Companies located near universities benefit from an influx of eager young, upwardly mobile university graduates. These workers are eager to gain experience but soon find higher paying jobs in the “real world” further away from campus. These companies have real turnover problems. People are moving up so quickly, they don’t have time to write things down.

Even when you keep people in place and maintain a fairly stable environment, people discover that what they have documented in their heads can just fade away. This is becoming more and more of an issue. Networks, servers and other such infrastructure functions have been around for 20 years in many organizations. Perhaps Fred the maintainer retired five years ago. Perhaps Fred was transferred to sales. The longer systems are around, the more things can happen to Fred. Fred might even be right where he was 20 years ago. He just can’t remember what he did.

What does all this mean? What are the consequences of losing organizational knowledge in a computer organization? To be blunt, it creates a hideous environment for your computer people. The system is a black box to them. They are paralyzed. They are rightfully afraid. Every small move they make can bring down the system in ways they cannot predict. Newcomers take much longer to train. Old-timers learn to survive by looking busy while doing nothing. The politics of the shop and of the whole company are made bloody by the various interpretations of the folklore of the black box. Whoever waves their arms hardest rules the day. This is no way for your people to live.

This is no way for the computer infrastructure to live either. While the games are played, the infrastructure evolves more and more slowly. Before long the infrastructure is frozen. Nobody dares to touch it. The only way to fix it is to completely replace it at considerable expense. In elaborate infrastructures this is easier said than done. The productive lifetime of the platform is shortened because it was not allowed to grow and evolve. Think of the Hubble Telescope without all the repairs and enhancements over the years. It would have burned up on re-entry long ago.

Having made my case, I ask again: for your own good, please document the shop. Make these documents public and make them accurate. Record what actually is rather than what you wish it to be. It is better to be a little embarrassed for a short while than to be misled later on. Update the documentation when changes occur. An out-of-date document can be as bad as no document at all. Make an effort to record facts. At the same time, don’t leave out the general philosophies that guided the design and other qualitative information, because these help your successors interpret the facts when ambiguities occur.

Think of what you leave behind. Persuade your boss to make this a priority as well. Hopefully the people at your next workplace will do the same.

Seven Observations On Software Maintenance And FOSS

Monday, November 30th, 2009

By CJ Fearnley

The November 2009 issue of Communications of the ACM (CACM) has a very interesting article by Paul Stachour and David Collier-Brown entitled “You Don’t Know Jack About Software Maintenance”. The authors argue energetically for using versioned data structures and “continuous upgrading” to improve the state of the art of software maintenance.

The piece got me thinking about FOSS (Free and Open Source Software) and “continuous upgrading”. Here are seven observations on FOSS software maintenance that occurred to me as I reflected on the CACM article:

  1. FOSS projects “continuously” apply bug fixes and feature enhancements at no additional cost to their users. By applying these improvements “continuously”, the user reaps a steady stream of “interest payments” providing ever-improving security, performance, and functionality.
  2. Since FOSS incurs no licensing or license management costs, upgrading FOSS is not hindered by capital expenses.
  3. Typically support in FOSS projects is focused on the current stable version. Therefore, upgrading to the current stable version is the preferred way to receive the best support from FOSS communities.
  4. One of the key reasons behind Debian’s strong track record of “continuous upgrading” is its way of handling the tricky issues involved with dependent library upgrades (such as libc6, libssl.so.0.9.8, etc.). The chapter on Shared Libraries in the Debian Policy Manual details a proven method to effectively handle library upgrade issues (including its sophisticated handling of versions).
  5. When upgrading is applied routinely and “continuously”, it becomes crucial to support customizations across upgrades, since customizations can be one of the biggest obstacles to a smooth upgrade (see my earlier post on customization and upgradeability). One reason for Debian’s effectiveness in this regard is its robust configuration file handling policy.
  6. It is worth noting that the “continuous” implied here is not the one emphasized in dictionaries (which takes its nuances from the mathematical / physics concept of “no interruptions” and the epsilon-delta definition that students of Calculus learn). That concept of “continuous” is impossible in systems administration which is necessarily discrete as are all computer operations. The connotation required here is, perhaps, “unending”, or “eternal” or somesuch.
  7. The “right” frequency for “continuous” upgrades is a complex tradeoff between business requirements and upgrade infrastructure maturity. Debian and Ubuntu provide very mature support for “continuous upgrading”. They support the upgrade of production servers through major release after major release with minimal downtime or risk of a glitch that could affect users. Their current release frequency of about 2 years may be the best we can do given the current state of the art of software maintenance. I hope we can learn to increase the frequency as better engineered upgrade policies are developed.
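
To make observation 5 a little more concrete, here is a minimal sketch in Python of the kind of three-way comparison a package manager can use to decide what to do with a configuration file during an upgrade. It is only loosely modeled on the behavior Debian’s configuration file policy aims for. The function names and return values are hypothetical illustrations, not dpkg’s actual interface.

```python
import hashlib


def digest(text: str) -> str:
    """Fingerprint a file's contents so versions can be compared cheaply."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def conffile_action(old_packaged: str, new_packaged: str, local: str) -> str:
    """Decide what an upgrade should do with one configuration file.

    old_packaged: the file as shipped by the currently installed version
    new_packaged: the file as shipped by the version being installed
    local:        the file as it exists on disk right now
    Returns 'keep', 'install-new' or 'ask-admin' (hypothetical labels).
    """
    old_h, new_h, local_h = digest(old_packaged), digest(new_packaged), digest(local)
    if local_h == new_h:
        return "keep"         # on-disk file already matches the new version
    if local_h == old_h:
        return "install-new"  # the administrator never customized the file
    if new_h == old_h:
        return "keep"         # only the administrator changed it
    return "ask-admin"        # both sides changed: a human has to decide
```

Only the last case interrupts the upgrade, which is why routine upgrades stay smooth even on heavily customized systems.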

I prefer the name “eternally regenerative software administration” over “continuous upgrading”. It avoids the philosophical problems with the word “continuous” and emphasizes the active, “ecological” approach needed to envision the engineering of “regenerativity” in software. By that I mean software maintenance should involve building the system so each new version enables installation of the next while facilitating management of any customizations and integration with other software (including libraries and other “helper” applications). Regenerativity is the process of growth and change used by Nature itself. Software maintenance needs to follow similar principles.
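
One way to picture “each new version enables installation of the next” is a chain of small upgrade steps, where every version knows only how to move its data to the following version. The sketch below is an assumption of mine for illustration and is not taken from the CACM article: the record layout, the field names and the `upgrade` helper are all hypothetical.

```python
from typing import Callable, Dict

Record = dict


def v1_to_v2(rec: Record) -> Record:
    # Hypothetical change: version 2 splits "name" into "first" and "last".
    first, _, last = rec["name"].partition(" ")
    return {"version": 2, "first": first, "last": last,
            "custom": rec.get("custom", {})}  # carry local customizations along


def v2_to_v3(rec: Record) -> Record:
    # Hypothetical change: version 3 adds an "email" field with a default.
    return {**rec, "version": 3, "email": rec.get("email", "")}


# Each version number maps to the step that produces the next version.
MIGRATIONS: Dict[int, Callable[[Record], Record]] = {1: v1_to_v2, 2: v2_to_v3}


def upgrade(rec: Record, target: int) -> Record:
    """Apply one step at a time until the record reaches the target version."""
    while rec["version"] < target:
        step = MIGRATIONS.get(rec["version"])
        if step is None:
            raise ValueError(f"no upgrade path from version {rec['version']}")
        rec = step(rec)
    return rec
```

Because every step preserves the `custom` field, a record that starts at version 1 arrives at version 3 with its local customizations intact.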

Customization, Upgradeability and Eternally Regenerative Software Administration

Friday, October 16th, 2009

By CJ Fearnley

Mary Hayes Weier wrote an interesting article in this week’s edition of InformationWeek on "Alternative IT: CIOs are more receptive than ever to new software models". What is great about her article is how she captured the divergent views on IT models (such as SaaS, cloud computing, etc.) and gave nice vignettes of different organizations trying different parts of various models. I especially valued her use of cognitive dissonance to leave the reader thinking … better informed but without a firm conclusion.

There are so many parts of the article that I could blog about, but the one that touched the core of my thinking about “eternally regenerative software administration” was the quote by Bill Louv, CIO at GlaxoSmithKline, who said

"And here’s the rub: When you customize software, it’s difficult to implement future upgrades from the vendor"

Louv touched the very bane of eternally regenerative software administration! Software should accommodate both customization and upgradeability. These two elements are at the heart of my notion of eternally regenerative software administration: how to preserve customizations while providing smooth upgrades (near zero downtime with almost no glitches) through major release after major release. It is a big challenge, but in our experience the Free and Open Source Software (FOSS) communities are at the leading edge in finding ways to reconcile these conflicting objectives. Here are some of the innovative ideas from the FOSS world which should serve as models or design patterns for all software developers (if only these ideas would become commonplace!).

First, Debian (a FOSS operating system which is the root of Ubuntu, Knoppix, Xandros and many other Linux distributions) requires that its official packages, each a collection of software prepared for easy administration, adhere to a very mature policy. Debian’s policy is a marvel in the FOSS world and to a very large degree is responsible for its strong support for both customization and upgradeability. I think Debian’s reputation for stability and maintainability is almost certainly due to its decision to develop a consensus-driven policy that its software must implement.

For example, Luigi Gangitano, the Debian package maintainer for Drupal (a FOSS content management platform), did a great job making the software both customizable and maintainable. The package supports configuration of multiple virtual hosts which can all be upgraded at once! And the Debian drupal6 package stores the look-n-feel in /etc/drupal/6/themes/ so that each site’s GUI can be customized without interfering with upgrades. If only all web applications were built to be as maintainable as Debian’s Drupal package!

Another example is the overlay support included in RT: Request Tracker, a FOSS ticket tracking system. This allows putting replacement subroutines in special files in /usr/local/share/ which overlay or substitute the upstream code. This approach is more likely to break on upgrades, but it supports minimal changes to the business logic with a decent chance that upgrades will be smooth.
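
The overlay idea can be sketched in a few lines. RT itself is written in Perl and has its own mechanism, so the Python below is only an analogy: the `UPSTREAM` table, the handler names and `build_handlers` are hypothetical, invented here to show how local replacements can sit on top of upstream code without editing it.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]


def upstream_subject(ticket: dict) -> str:
    return f"[#{ticket['id']}] {ticket['subject']}"


def upstream_owner(ticket: dict) -> str:
    return ticket.get("owner", "nobody")


# Stands in for the code shipped by the upstream project (hypothetical).
UPSTREAM: Dict[str, Handler] = {
    "subject": upstream_subject,
    "owner": upstream_owner,
}


def build_handlers(upstream: Dict[str, Handler],
                   overlay: Dict[str, Handler]) -> Dict[str, Handler]:
    """Return the upstream handlers with local replacements laid on top."""
    unknown = set(overlay) - set(upstream)
    if unknown:
        # Fail loudly if an upgrade renamed or removed something we override.
        raise KeyError(f"overlay targets missing upstream code: {sorted(unknown)}")
    handlers = dict(upstream)   # never mutate the upstream table itself
    handlers.update(overlay)    # local code wins where both define a name
    return handlers
```

A site that wants a different subject line supplies only `{"subject": my_subject}` and leaves everything else to upstream. The check for unknown names illustrates why this approach is more fragile than configuration alone: if an upgrade renames the code being overlaid, the customization no longer applies, and it is better to find that out immediately.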

There are countless more examples from the FOSS world of innovative solutions that reconcile customization with upgrades in support of eternally regenerative software administration. What are some of your favorite examples?