Friday, April 23, 2010

My Anti-Spam: Protect Email Addresses From Spammers Without Inconveniencing Anybody

It goes without saying that spam is a big problem that affects a lot of people.  While spam blockers are increasingly effective - almost nothing gets past the Gmail one - they aren't perfect, and really, that's attacking the symptom rather than the problem.  The best way to avoid spam is to keep email addresses out of the hands of spammers.  Web developers have a responsibility to make sure the websites they construct protect their customers and clients by taking steps to thwart spam-bot harvesting attempts.

From a technical standpoint, the most effective approach would be to never give the website user access to an email address.  Submissions to company email boxes would go through contact forms built by the web team, and all email communication would be handled server-side.  There are two drawbacks to this approach, however.  The first is that for small companies/websites, developing the code and infrastructure to handle this might be beyond their budget or skills.  Secondly, I'd be willing to bet a large number of people are like me, in that filling out those contact forms is usually an obnoxious and overly cumbersome process that we have no confidence even works in the first place.

For a small website it's tempting to just put the email addresses in "mailto:" links with the "@" and "." written out, with the expectation that the user will fix those "escapes" in their mail client.  However, "me (at) zero (dot) net" can be parsed by a spam-bot just as easily as by the human brain (easier, in fact), so this approach offers zero protection.  Plus, this just confuses some people, even with the handy "Hey replace with the right things" message that a lot of people put next to the link.  It violates the #1 rule of user interface design: "Don't Make Me Think."
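
To see why the written-out form offers so little protection, here is a minimal sketch of how a harvester might undo it. The `deobfuscate` function and its patterns are made up for illustration and are not taken from any real spam-bot; a real one would likely cover more variants.

```javascript
// Hypothetical harvester step: turn "me (at) zero (dot) net" back into an address.
function deobfuscate(text) {
  return text
    // "(at)" or "[at]", with any surrounding spaces, becomes "@"
    .replace(/\s*[\(\[]\s*at\s*[\)\]]\s*/gi, '@')
    // "(dot)" or "[dot]", with any surrounding spaces, becomes "."
    .replace(/\s*[\(\[]\s*dot\s*[\)\]]\s*/gi, '.');
}

console.log(deobfuscate('me (at) zero (dot) net')); // me@zero.net
```

Two short regular expressions are enough, which is far less work than the human visitor has to do.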

A number of forum posts and advice pages I've come across have recommended an approach whereby clicking on an "email me" link calls a JavaScript function that redirects the browser to the appropriate "mailto:" URL.  The reasoning behind this is that the email is protected because it is not directly exposed through the href attribute on the link.  If the JavaScript is placed in a separate *.js file, this is not a bad approach.  It removes the email from the HTML document itself, making it less likely that the spam-bot will find and parse it.

If, however, we assume that the spam-bot is sophisticated enough to download the *.js files included in the page, then this approach offers us no added protection, because once the bot has the *.js file, it will be able to parse the email out of it just as easily as it would from an HTML document.  I honestly don't know how vigilant your average spam-bot is in this regard, but even if it isn't, this smells of "security by obscurity" - "the bot is less likely to go there, so we're secure."  "Hard to find" does not mean protected, so this approach raises a red flag to me.
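
As an illustration of that point, a bot that has fetched a script only needs a pattern match over the text; it does not need to run the JavaScript. The sample script contents and the `findEmails` helper below are made up for the example, and the pattern is a deliberately loose one rather than a full address grammar.

```javascript
// Loose pattern for things that look like email addresses (illustrative only).
var EMAIL_PATTERN = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

// Return every address-like string in a blob of text, whatever the file type.
function findEmails(text) {
  return text.match(EMAIL_PATTERN) || [];
}

// Made-up contents of an external *.js file that "hides" the address.
var scriptSource =
  "function emailMe() {\n" +
  "  window.location.href = 'mailto:help@example.com';\n" +
  "}\n";

console.log(findEmails(scriptSource)); // [ 'help@example.com' ]
```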

In grappling with this issue, I eventually arrived at an approach that might be considered a hybrid of the 'do it server-side' and the 'do it in JavaScript' approaches.  In this solution, the "email me" link in the HTML page has no href (or rather an href="#"), but has an onclick event that calls a JavaScript method.  This JavaScript method creates an XMLHttpRequest that calls out to a handler or service that returns the email address, which is then prefixed with the "mailto:" scheme and navigated to by the browser.

So in many ways this is the "use JavaScript" approach with a twist, that twist being the server-side component that retrieves the actual email address.  This component could be an extremely simple web service, a script that writes out an XML or a simple text response, or anything really.  The exact implementation would depend on your server-side language of choice of course; in ASP.NET, I implemented this as a managed handler.

The advantage of this approach is that the email address is never written to a document - HTML, JavaScript, or otherwise - that can be found and parsed by a bot.  The email address is pulled across the wire in the JavaScript call, and thereby resides only in the browser's memory.  And because the component that delivers the email address is a server-side piece of code, you're able to add functionality to it that blocks requests from people/apps you don't like.  In my particular implementation, I check a white list of acceptable referrers, and do not return anything to a client that doesn't match.
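
A referrer check along those lines might look roughly like the sketch below. It is not the ASP.NET handler described above: the host names and the `isAllowedReferrer` helper are placeholders, and it compares only the host portion of the `Referer` header.

```javascript
// Placeholder list of hosts whose pages are allowed to request addresses.
var ALLOWED_HOSTS = ['www.example.com', 'example.com'];

// True only when the Referer header parses as a URL on an allowed host.
function isAllowedReferrer(refererHeader) {
  if (!refererHeader) {
    return false; // no referrer sent: refuse
  }
  try {
    var host = new URL(refererHeader).hostname.toLowerCase();
    return ALLOWED_HOSTS.indexOf(host) !== -1;
  } catch (e) {
    return false; // header was not a valid absolute URL
  }
}

console.log(isAllowedReferrer('http://www.example.com/contact.html')); // true
console.log(isAllowedReferrer('http://evil.example.net/'));            // false
```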

The web page code ends up looking something like this:

<html>
<head>
  <script type="text/javascript">
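    // Ask the server for the address stored under addressKey, then open the mail client.
    function openEmailClient(addressKey) {
      var request = new XMLHttpRequest();
      // Third argument false = synchronous request, which keeps the example short.
      request.open('POST', 'getEmail?key=' + encodeURIComponent(addressKey), false);
      request.send('');
      // Only navigate if the handler actually returned an address.
      if (request.status === 200 && request.responseText) {
        window.location.href = 'mailto:' + request.responseText;
      }
    }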
  </script>
</head>
<body>
  <p>
    <a href="#" onclick="openEmailClient('help'); return false;">Click here to email someone</a>
  </p>
</body>
</html>
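
The synchronous call above blocks the page while it waits for the server. A non-blocking variant is sketched below; it assumes the same "getEmail" endpoint returning the address as plain text, and `openEmailClientAsync` is a name chosen here for illustration.

```javascript
// Non-blocking version: fetch the address, then hand off to the mail client.
function openEmailClientAsync(addressKey) {
  var request = new XMLHttpRequest();
  request.open('POST', 'getEmail?key=' + encodeURIComponent(addressKey), true);
  request.onreadystatechange = function () {
    // readyState 4 means the response has fully arrived.
    if (request.readyState === 4 && request.status === 200 && request.responseText) {
      window.location.href = 'mailto:' + request.responseText;
    }
  };
  request.send('');
  return false; // lets onclick="return openEmailClientAsync('help');" cancel the "#" jump
}
```

Some browsers may treat a navigation that happens outside the click handler differently, so this is worth testing in the browsers you care about.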

There are a million different ways the "getEmail" component might be implemented - there might be a database involved, you could get the email from the config file, or it might just be hardcoded.  Whichever way you go, at the core of it, it needs to parse the incoming request (query string in this case), take the given key, look up the email address by that key from wherever it might be stored, and write that out as the response.
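
As one possible shape for such a component, here is a minimal sketch written as a Node.js-style request handler rather than an ASP.NET handler. The `ADDRESSES` table, the keys, and the addresses are all hypothetical, and the hardcoded map stands in for whatever store you actually use.

```javascript
// Hypothetical key-to-address table; a database or config file would also work.
var ADDRESSES = {
  help: 'help@example.com',
  sales: 'sales@example.com'
};

// Look up the address for the "key" query-string parameter of a request URL
// such as "/getEmail?key=help". Returns null when the key is missing or unknown.
function lookupEmail(requestUrl) {
  var key = new URL(requestUrl, 'http://localhost').searchParams.get('key');
  // hasOwnProperty guards against keys like "toString" matching inherited properties.
  return key !== null && Object.prototype.hasOwnProperty.call(ADDRESSES, key)
    ? ADDRESSES[key]
    : null;
}

// Handler with the (req, res) signature that Node's http.createServer expects.
function getEmailHandler(req, res) {
  var email = lookupEmail(req.url);
  if (email === null) {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('');
    return;
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end(email);
}

console.log(lookupEmail('/getEmail?key=help'));    // help@example.com
console.log(lookupEmail('/getEmail?key=unknown')); // null
```

It could be mounted with `require('http').createServer(getEmailHandler).listen(8080)`; the port is arbitrary.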

I like this solution because it lets the user have control over the email sending process - they're able to actually send an email, no hand-waving involved - yet offers a measure of security without being particularly difficult to implement.  I had this whole thing done in only an hour, and that's including writing some tests around my "getEmail" component and doing manual testing of the actual page.  It's also very scalable - "getEmail" can be as simple or complex as you need.

This is not, of course, an infallible solution.  An email address can still be harvested if a spam-bot is set up to perform an actual browser click and then process the "mailto" protocol itself.  However, this process is slow, and spammers need volume, so it's pretty unlikely that there's someone out there doing that.  And of course, this approach does nothing to stop a human being from clicking on the link and taking your email - in fact, that's actually the point.  This approach protects against spammers, not stalkers.  If there are privacy issues with a specific person or group of people, then you would need to explore other avenues.  (Also, filtering by referrer is not the greatest option; bots can spoof the Referer header when making HTTP requests.  IP filtering would be a much better option; I just didn't have it in the project I was working on.)
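
For the IP filtering mentioned above, a minimal sketch could be a deny list checked before the address is returned. The addresses are documentation-range placeholders, `isBlockedIp` is a hypothetical helper, and the choice of a deny list over an allow list is an assumption made for the example.

```javascript
// Placeholder deny list (addresses from the reserved documentation ranges).
var BLOCKED_IPS = new Set(['203.0.113.7', '198.51.100.23']);

// Normalise IPv4-mapped IPv6 ("::ffff:203.0.113.7") to plain IPv4 before checking.
function isBlockedIp(remoteAddress) {
  var ip = String(remoteAddress || '').replace(/^::ffff:/i, '');
  return BLOCKED_IPS.has(ip);
}

console.log(isBlockedIp('::ffff:203.0.113.7')); // true
console.log(isBlockedIp('192.0.2.10'));         // false
```

In a Node.js handler the value would typically come from `req.socket.remoteAddress`; behind a proxy the client address arrives in a header instead, and that header can be forged just like the referrer.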

Comments, questions, and concerns are always welcome!