Monday, August 22, 2011

Use this data, not that

The Salt Lake Tribune has recently been running a series of articles dealing with some privacy and security concerns raised by a fraud probe into a Utah prenatal health care program. The articles are primarily immigration-themed, but they are also eye-opening from a software design perspective. One of the articles focuses on the fact that the software Utah uses required clinic staff to enter a Social Security Number (SSN) for patient identification. Some of the women coming into the clinics were unable or unwilling to provide this information, so the clinic staff would issue them 'dummy' SSNs to get them into the system. This eventually caused a problem because one of the dummy numbers happened to match the real SSN of a man in Maine. The end result was a case of accidental identity theft.

Anyone who's ever developed software will be unsurprised by the details the article gives about how the data ended up so muddled. The system required a nine-digit ID, so the staff used SSNs. When the SSN was unavailable, they'd make one up. For years they'd prepend a "V" or something similar to try to distinguish the reals from the fakes, but then an upgrade forced the values to become numeric only. Under both schemas ID duplication was occurring, a fact the staff was well aware of. Changing the ID field's parameters was too expensive, so they just lived with the mess. Investigations by the U.S. Social Security Administration (SSA) only prompted the helpful advice to use a different numerical prefix that the SSA doesn't use in SSNs. The state's processes have been modified to continue doing exactly what they've been doing all along, except now they have to keep a separate (most likely paper) log to be used to sort out any difficulties.
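To make the SSA's advice concrete, here is a minimal sketch of a check for whether a nine-digit value could ever be a real SSN. The article does not say which prefix the SSA suggested; as an assumption, the sketch relies on the published rule that area numbers 000, 666 and 900-999, group 00, and serial 0000 are never assigned. The function name is hypothetical.

```python
def could_be_real_ssn(nine_digits: str) -> bool:
    """Return True if the value falls in a range the SSA could have issued."""
    # Reject anything that is not exactly nine ASCII digits (e.g. a "V" prefix).
    if len(nine_digits) != 9 or not (nine_digits.isascii() and nine_digits.isdigit()):
        return False
    area = int(nine_digits[:3])
    group = int(nine_digits[3:5])
    serial = int(nine_digits[5:])
    # Area numbers 000, 666 and 900-999 are never assigned as SSNs.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Neither the group nor the serial part may be all zeros.
    return group != 0 and serial != 0
```

A dummy number for which this returns False cannot collide with anyone's real SSN, but it is still a fake value sitting in a field labeled "SSN", which is the deeper problem.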

There are some very important lessons about software development that can be learned here: first, that using government-issued numbers as IDs is a very bad practice, and second, that good software design cannot ignore the human element.

SSNs are not used as IDs as much in software anymore, but I think some designers and developers don't really understand why this is the case. We may say "People don't want to give us that information" or "We don't want to be responsible for keeping that data private." While it's good to recognize the inherent privacy concerns, these reasons miss the point, and most organizations that would use SSNs in the first place do have valid reasons to collect them. The real reason SSNs make poor IDs is that they cannot be changed, and that they are intended to be a private key.

An example to illustrate: I worked for an automotive shop where the mechanics would track the vehicle work via a touch-screen terminal. The mechanics would log in to said terminals using their SSN. The software running on the terminal communicated only with the server in the back room, and the shop had valid reasons to know the SSNs of the people their customers were entrusting their vehicular safety to. It all seemed like a reasonable setup. But then Employee B found out Employee A's SSN, and began to enter work under Employee A's login. I don't remember why firing Employee B was not an option, but it wasn't. We couldn't change Employee A's SSN without screwing up the payroll system, and we didn't have the resources to redo the terminal software (it was really, really bad code). This left Employee A entirely without recourse.

Using an SSN as a private key to eliminate duplication or provide positive identification for legal purposes is a perfectly valid thing to do. But to use SSN as a username or a public ID number is just wrong. It boxes you into logistical and ethical corners that can be very expensive to get out of.

The complaint is raised that we don't want our users to have to be responsible for yet another number or ID that they have to remember. This is a valid concern, and it has a simple solution: don't do it. Look the patient up by name when they come into the clinic. Issue them a card with the ID number printed on it (and include a barcode or magnetic stripe so it can just be scanned). Issue them an ID badge with an RFID chip. Let them choose a username - these are intended to be public, so people can reuse them ad infinitum. Require SSN as an element of account creation if you must, but store it privately (and securely) and map to it by the public ID of your/their choosing.
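The mapping described above might look something like the following sketch. Everything here is hypothetical (the `Patient` and `PatientRegistry` names, the in-memory dictionary, the eight-character ID format); a real system would sit on a database and encrypt the SSN at rest. The point is only that the public ID is generated independently of the SSN, the SSN is optional, and a compromised public ID can be replaced without touching the private data.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Patient:
    name: str
    # Private field; None when the patient cannot or will not provide an SSN,
    # so nobody has to invent a dummy value.
    ssn: Optional[str] = None


class PatientRegistry:
    """Hypothetical in-memory registry keyed by a public, replaceable ID."""

    def __init__(self) -> None:
        self._by_public_id: Dict[str, Patient] = {}

    def _new_public_id(self) -> str:
        # Random 8-hex-character token, not derived from the SSN in any way.
        while True:
            candidate = secrets.token_hex(4).upper()
            if candidate not in self._by_public_id:
                return candidate

    def register(self, name: str, ssn: Optional[str] = None) -> str:
        public_id = self._new_public_id()
        self._by_public_id[public_id] = Patient(name, ssn)
        return public_id

    def reissue(self, old_public_id: str) -> str:
        # Generate the new ID while the old one is still registered,
        # so the replacement can never equal the ID being retired.
        new_id = self._new_public_id()
        self._by_public_id[new_id] = self._by_public_id.pop(old_public_id)
        return new_id

    def lookup(self, public_id: str) -> Optional[Patient]:
        return self._by_public_id.get(public_id)
```

With a design along these lines, the auto shop from the earlier example could have handed Employee A a new login with a single `reissue` call and left the payroll system alone.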

David Platt once said, "Your user is not you," and I don't think truer words have ever been spoken. Developers tend to have a certain mental block about this; they assume that because the field says "SSN" or "Email" or "Date of Birth" then that's what the user will enter. But we forget that to a user, a field is not a discrete, re-usable piece of information - it is a post-it note where they can write stuff till they need it again. Users will put information wherever they can fit it, regardless of categorization. A balance has to be struck between making forms too daunting and making them too permissive.

Validation goes a long way toward helping with this. I work for a company that receives real-time (multiple per second) data feeds from the largest retail chain on the planet. One element in these feeds is email address. We get the data just as the store associate enters it, and since the software on their side does not require any validation at all - not even a check to be sure it includes a "@"! - the email addresses are not viable. A trivial regex would allow this information, which is invaluable for our marketing purposes, to be usable instead of dross.

Validation is no magic bullet though - the most rigid validation in the world won't alert you to the fact that the patient's birth-date is not 1/1/1970. Unless for some exceptional reason you can verify the person's birth certificate, there's pretty much no way to independently verify that kind of data, and it would not be worth the effort for you to try. So in that instance the software should simply be aware that this value is not ironclad and may need to be treated with kid gloves.
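As a rough illustration of both halves of that point, here is a small sketch: a deliberately loose email check of the "trivial regex" variety, and a flag for birth dates that look like placeholders. The pattern, the function names, and the list of suspect dates are all assumptions for the sake of the example, not anything taken from the systems described above. Note that the birth-date check does not reject anything; it only marks the value as one to be treated with caution.

```python
import re
from datetime import date

# Deliberately loose: something@something.something, with no whitespace.
# This catches free-text junk; it does not prove the address exists.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical set of dates that tend to be entered as placeholders.
SUSPECT_BIRTH_DATES = {date(1970, 1, 1), date(1900, 1, 1)}


def looks_like_email(value: str) -> bool:
    """True if the trimmed value has the basic shape of an email address."""
    return EMAIL_RE.fullmatch(value.strip()) is not None


def birth_date_is_suspect(value: date) -> bool:
    """True if the date is a common placeholder and should not be trusted."""
    return value in SUSPECT_BIRTH_DATES
```

A form could refuse to save a value that fails `looks_like_email`, while a suspect birth date would be saved as entered and simply carry a "not verified" marker for anything downstream that depends on it.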

The bottom line is that as computers and software become more and more ubiquitous, we have to avoid creating any further pitfalls like this.