Tuesday, April 15, 2014

Solve your problem by almost asking a question on StackOverflow

If you were to look at my StackOverflow profile, you would see that I’ve only asked a handful of questions in the years that I’ve been a member. However, this is not because I don’t run into programming problems or don’t think to ask questions. In the process of formulating a good StackOverflow question, I usually find the answer myself.

First of all, it’s embarrassing to spend time typing up a question and have someone respond with a Google search providing dozens of hits that provide the answer. So before I even start typing, you’d better believe I Google the hell out of the problem. And many times, I find what I'm looking for.

Assuming Google fails me, now I start explaining my problem. Since my audience is a group of complete strangers that have no knowledge of my application, I can’t simply say “I’m getting a SplinesNotReticulatedException from my FlangerFactory.” Even if I happen to be working on an open source project, I can’t expect the StackOverflow community to take the time to understand my full code base in depth. Thus, I am forced to write a small example demonstrating the problem. This is a fantastic way to narrow down the problem, because if I think the problem is with Dojo, but I can’t write a small example outside of my code base to reproduce the problem, then I’m most likely not dealing with a Dojo bug. In fact, there’s a good chance I’m looking in the wrong area of my code base. And if I can reproduce the problem, then maybe I can file a bug report with Dojo.

Now I have a simple code example that helpful members of the community can work from. So that they don’t waste their time running into the same walls I did, I explain exactly what I’ve tried thus far that hasn’t worked. This saves a lot of back-and-forth “nope, I tried that and it doesn’t work” comments. And in the process of documenting what I’ve already tried, I often come up with new ideas. And sometimes, one of those ideas actually solves the problem.

Of course, while I used StackOverflow as an example, this process really works with any community and medium, including emailing my coworkers. I respect the time of those who help me, so I make every effort to ask for help in the most efficient manner possible. In every forum I use, I always follow this process:

  1. Thoroughly search Google (and any other available non-human resources).
  2. Write a concise example demonstrating the problem.
  3. Explain any failed attempts to solve the problem.

More often than not, I solve the problem myself by going through this process. And if I don’t discover the solution myself, then I have constructed a proper question that gives me the best chance of getting the help I need from my community.

Update: as several Redditers have pointed out, if you've documented a problem and found a solution, you should share it! StackOverflow actually encourages posting and immediately answering your own question, and I imagine most communities feel the same way. My focus in this post was on discussing how to properly research and document a problem before asking for help, and highlighting how often you could end up solving the problem yourself, but I left out the importance of sharing that information with others.

I wouldn't necessarily recommend posting your solution if it's "oops, I found a missing comma in one of my modules", and I don't think we need any more StackOverflow questions that could easily be solved via a Google search (that probably points back to a dozen StackOverflow questions). But if your solution could actually help others, then go ahead and publish it in whatever forum you like, or put it up on your personal blog, or email it to your coworkers...

Thursday, October 31, 2013

Using Node.js streams to massage data into the format you want

Google provides some pretty cool flu data in CSV format, and I wanted to display that in a chart at Dash. However, the raw data isn't quite right for my needs:
  1. It has a bunch of intro/header text (copyright stuff, description of the data, etc), and Dash needs just the raw data.
  2. It shows dozens of states/regions/cities, and I just want to show overall U.S. data and my home state.
Fortunately, Dash can read data from any publicly accessible endpoint, so I decided to throw together a quick Node.js app to massage the data into what I needed. The most straightforward solution was probably to load the whole file, read through it line by line, build up an array of data, then write it out. And since the data feed is currently just under 400KB, maybe that would have been alright. But a better pattern (and more fun, IMO) is to take advantage of Node Streams. As long as we use streams throughout the entire process, we can make sure that only a small buffer is kept in memory at any given time.

If you just want to see the full app, it's on GitHub. Otherwise, read on to see my thought process.

Filter out the intro/header text

First, we'll write a stream that filters out the copyright/overview stuff and passes on the rest:

var stream = require('stream')
  , util = require('util')

function CleanIntro(options) {
  stream.Transform.call(this, options)
}

util.inherits(CleanIntro, stream.Transform)

CleanIntro.prototype._transform = function (chunk, enc, cb) {
  if (this.readingData) {
    this.push(chunk, enc)
  } else {
    // Ignore all text until we find a line beginning with 'Date,''
    var start = chunk.toString().search(/^Date,/m)
    if (start !== -1) {
      this.readingData = true
      this.push(chunk.slice(start), enc)
    }
  }
  cb()
}

A Transform stream simply takes data that was piped in from another stream, does whatever it wants to it, then pushes whatever it wants back out. In our case, we're just ignoring anything before the actual data begins, then pushing the rest of the data back out. Easy.

Parse the CSV data

Now that we have a filter to get just the raw CSV data, we can start parsing it. There are lots of CSV parsing libraries out there; I like csv-stream because, well, it's a stream. So our basic process is to make the HTTP request, pipe it to our header-cleaning filter, then pipe it to csv-stream and start working with the data:
var request = require('request')
  , csv = require('csv-stream')
  , util = require('util')
  , _ = require('lodash')
  , moment = require('moment')
  , OutStream = require('./out-stream')
  , CleanIntroFilter = require('./clean-intro-filter')

// Returns a Stream that emits CSV records from Google Flu Trends.
// options:
//   - regions: an array of regions for which data should be generated.
//     See http://www.google.org/flutrends/us/data.txt for possible values
module.exports = function (options) {
  options = _.extend({
    regions: ['United States']
  }, options)

  var earliest = moment().subtract('years', 1)

  request('http://www.google.org/flutrends/us/data.txt')
    .pipe(new CleanIntroFilter())
    .pipe(csv.createStream({}))
    .on('error',function(err){
        // Oops, got an error
    })
    .on('data',function(data) {
      var date = moment(data.Date)

      // Only return data from the past year
      if (date.isAfter(earliest) || date.isSame(earliest)) {
        // Let's build the output String...
        console.log(data.Date + ',' + _.map(options.regions, function (region) {
          return data[region]
        }).join())
      }
    })
    .on('end', function () {
      // Okay we're done, now what?
    })
}

Alright, now we're getting close. We've built the CSV output, but now what do we do with it? Push it all into an array and return that? NO! Remember, we'll lose the slim memory benefits of streams if we don't keep using them the whole way through.

Write out to another Stream

Instead, let's just make our own writeable stream:
var stream = require('stream')

var OutStream = function() {
  stream.Transform.call(this,{objectMode: false})
}

OutStream.prototype = Object.create(
  stream.Transform.prototype, {constructor: {value: OutStream}} )

OutStream.prototype._transform = function(chunk, encoding, callback) {
  this.push(chunk, encoding)
  callback && callback()
}

OutStream.prototype.write = function () {
  this._transform.apply(this, arguments)
}

OutStream.prototype.end = function () {
  this._transform.apply(this, arguments)
  this.emit('end')
}

And now our parsing function can return that stream and write to it:
module.exports = function (options) {
  options = _.extend({
    regions: ['United States']
  }, options)

  var out = new OutStream()
  out.write('Date,' + options.regions.join())

  var earliest = moment().subtract('years', 1)

  request('http://www.google.org/flutrends/us/data.txt')
    .pipe(new CleanIntroFilter())
    .pipe(csv.createStream({}))
    .on('error',function(err){
        out.emit('error', err)
    })
    .on('data',function(data) {
      var date = moment(data.Date)

      // Only return data from the past year
      if (date.isAfter(earliest) || date.isSame(earliest)) {
        out.write(data.Date + ',' + _.map(options.regions, function (region) {
          return data[region]
        }).join())
      }
    })
    .on('end', function () {
      out.end()
    })

  return out
}

Serve it up

Finally, we'll use Express to expose our data as a web endpoint:
var express = require('express')
  , data = require('./lib/data')
  , _ = require('lodash')

var app = express()

app.get('/', function(req, res){
  var options = {}

  if (req.query.region) {
    options.regions = _.isArray(req.query.region) ? req.query.region : [req.query.region]
  }

  res.setHeader('Content-Type', 'text/csv')

  data(options)
    .on('data', function (data) {
      res.write(data)
      res.write('\n')
    })
    .on('end', function (data) {
      res.end()
    })
    .on('error', function (err) {
      console.log('error: ', error)
    })
})

var port = process.env.PORT || 5000
app.listen(port)
console.log('Listening on port ' + port)

Once again, note that as soon as we get data from our stream, we manipulate and write it out to another stream (the HTTP response, in this case). This keeps us from holding a lot of data in memory unnecessarily.

Now if we fire up the server, we can use curl to see it in action:

$ curl 'http://localhost:5000'
Date,United States
2012-11-04,2492
2012-11-11,2913
2012-11-18,3040
2012-11-25,3641
2012-12-02,4427
[and so on]

$ curl 'http://localhost:5000?region=United%20States&region=Pennsylvania'
Date,United States,Pennsylvania
2012-11-04,2492,2579
2012-11-11,2913,2889
2012-11-18,3040,2785
2012-11-25,3641,3248
2012-12-02,4427,3679
[and so on]

As long the server is running someplace that is accessible to the public, we can head on over to Dash and plug it into a Custom Chart widget, giving us something like this:
Hey, looks like December and January are big months for the flu in the U.S. Who knew?

Want to try this yourself? The full source for this app is on GitHub, along with step-by-step instructions for running the project and creating a widget in Dash. Enjoy!

Wednesday, October 16, 2013

Cut the HealthCare.gov people some slack

On October 1, 2013, HealthCare.gov was opened to the nation. As one of the more tangible aspects of the Affordable Care Act (aka ACA, Obamacare), the glitches it has experienced are getting lots of negative attention. It's been described as an "inexcusable mess" and a "disaster". I'm not going to discuss the merits of the ACA here. But as someone who spends every day building software, I think the criticisms of the HealthCare.gov application have been unfair. Here are a few reasons why we should cut them some slack.

Most webapps aren't immediately released to millions of users

When Google or Facebook are releasing a new app, do you think they just flip a switch and release it to the world? Absolutely not! They slowly release it to beta testers, or add the feature to a subset of users. They encounter bugs, fix them, slowly scale up their infrastructure, and wait until they've seen it succeed before opening it up to the general public.

HealthCare.gov, on the other hand, launched to the entire nation at once. It didn't just have to deal with traffic from thousands (hundreds of thousands? millions?) of people looking for health plans for themselves; I'm sure a lot of curious people (like myself) went there just to check it out. There was no chance to work out the bugs, and no chance to gradually scale the infrastructure. This was almost guaranteed to be a problem.

Now, perhaps they should have slowly rolled out the site. I suspect, however, that the reasons were political. How do you decide who gets to sign up for healthcare first? By state? By last name? Random drawing? People who signed up ahead of time? And how do you explain to everyone else that they have to wait? People expect this from Google, but a highly charged political issue like the ACA makes it dicey.

Much of what happens in the exchange is out of their control

HealthCare.gov is an extremely distributed application. John McDonough, a health policy professor at the Harvard School of Public Health, described it as:
"an enormously complex task. The number of systems that have to work together – federal, state, insurance companies, the Internal Revenue System – the number of systems that have to align here is pretty daunting."
If there's a bug in the IRS endpoints, there's going to be a problem. If one of the hundreds of private insurance companies' systems can't handle the increased workload, there's going to be a problem. And as someone who has built plenty of distributed systems, I know that end users don't say "oh, it must be a problem with system X that this app depends on". They think it's a problem with whatever webapp they're using, and don't care what's going on in the backend.

Nor should they care. But professional IT people should know better.

Most webapps aren't serving as a referendum on a heated national controversy

Frankly, there are a lot of people that want HealthCare.gov to fail primarily because they want the ACA to fail. And many people who oppose the ACA will still have a need to use HealthCare.gov. New Google apps are not typically tied to such a heated political issue, and are generally not used by people who don't want to use them. 

So let's have some empathy for the people who build HealthCare.gov

This application was an enormous undertaking, with many challenges that few (if any) systems out there need to deal with. My expectation and hope is that the bugs will be worked out in the coming weeks, as happens with any new system. Whether the Affordable Care Act will be a success is yet to be seen; but the healthcare exchanges that support it were built by regular IT professionals that are just trying to do their jobs. Let's cut them some slack.

Thursday, March 7, 2013

A case for putting off documentation until the end

I have a bad habit of putting off documenting my code as long as possible; it's often my last task before I submit a pull request. And every time I'm slogging through hundreds or thousands of lines of code writing documentation, I think "Next time, I'm going to do this as I go along." And the next time I write a new feature, I'll probably leave the documentation until the end again.

As bad as this is, though, I've realized there are a few advantages to waiting until your code is finished before writing the documentation:
  1. Minimize wasted effort: In an ideal world, your API and requirements wouldn't change after you've started coding. But how often does that happen? If you need to change your code and you've already documented it, now you have to waste time updating your documentation.
  2. Forced self-review: Have you ever written a section of code, then looked at it later and realized you could have done it better? While going back through your code to document it, you are forced to reevaluate everything you've done; the extra time between coding and documenting can sometimes give you the perspective you need to refactor it.
  3. Catch dead code: Nothing makes me want to delete unnecessary code more than having to document it. Before I document a method, I'll probably check to make sure it's actually still being used.
  4. Consistency: I find I used more consistent terminology in my documentation when I'm doing it all at once. I get in a zone, find a language I think works, and stick with it the whole way through.
Of course, there are some reasons nobody actually encourages this behavior:
One class documented... many more to go...
  1. It's awful: If you don't enjoy documentation (and I definitely do not), it is really painful to write it all at once. Writing the documentation every time you write a method is annoying; writing documentation for dozens of methods is brutal.
  2. Faded memory: By the time you finally write the documentation, you may have forgotten what the code does or why you made certain decisions. On the flip side, this is a good indication that you should refactor your code anyway.
  3. Eh, forget it: When faced with this monumental task, it will be very tempting to just skip the documentation, or put it on your TODO list for later. Sometimes you may be able to get away with this, but you really never should.
So I guess I'm not exactly recommending that you wait until the last minute to write your documentation. I'm just saying that there are some upsides to it. Or maybe I'm really just trying to justify my own behavior...

Thursday, August 2, 2012

A petition to release government-developed software to the OSS community

In July 2012, a petition was created to mandate that U.S. federal government-developed software be released to the open source community. I think this is a fantastic idea, and I'd like to elaborate on some of the points made in the petition (highlighting is mine):
Openness: Open Sourcing ensures basic fairness and transparency by making software and related artifacts available to the citizens who provided funding, consistent with the President’s 2009 declaration that “Information maintained by the Federal Government is a national asset.”
If you pay taxes in the United States, then you are paying for the software developed by the government; taxpayers can reap maximum value for their investment by releasing that software to the open source community.

There's even a logical precedent for this; a "work of the United States government" is not entitled to copyright protection (essentially public domain). An excellent example is photos taken by the federal government, which are public domain and freely available.
Supports the Federal “Shared First” Agenda: Maximizes value to the government by significantly increasing reuse and collaborative development between federal agencies and the private sector...
I think this collaboration could result in a few interesting scenarios:

  • Software developed by the government (which costs taxpayers a significant sum of money) could be leveraged by private sector developers, just as software developed by the private sector and then open sourced has driven so many fantastic projects. In fact, the National Security Agency has already given their Accumulo NoSQL database to Apache.
  • Private sector developers could actually improve the software the government has developed! Imagine what some of those sharp Google Summer of Code developers could do for a summer project! That sounds to me like democracy for the 21st century.
  • Open sourcing the software could be an important bridge to opening up the wealth data our government collects. Obviously there would be privacy and security concerns, but far more data would be released if private sector developers were contributing new APIs to open sourced government software.

What you can do

The easiest thing you can do is sign the petition and get the word out through whatever channels you prefer (Twitter, Facebook, your own blog, whatever). 

Unfortunately, the petition needs 25,000 signatures by August 16 to receive a White House response; with less than 700 signatures as of August 2, this seems unlikely. However, this petition could just be the beginning of a movement. The more publicity it gets, the more likely it will be that the effort will take off.

Friday, June 22, 2012

Beware of Dojo Selects, Stores, and numeric IDs

A handy feature of Dijit Select (or an extending widget such as FilteringSelect or ComboBox) is the integration with Dojo Data stores. You can populate your Select/FilteringSelect/ComboBox with any data that can be accessed through a Dojo Data store, including JSON REST requests, HTTP queries, CSV files, or even Wikipedia.

However, there is a limitation to this that has caused me problems, even though it's documented and I've run into it a couple times: Selects do not play nicely with stores that have non-string IDs (like integers). To quote from a Dojo tutorial:
dijit.form.Select possesses an important limitation: it is implemented in such a way that it does not handle non-string item identities well. Particularly, setting the current value of the widget programmatically via select.set("value", id) will not work with non-string (e.g. numeric) identities.
To drive this point home, I'll give a concrete example illustrating the problem. You can also see the example in action on OrionHub or download the source code and run it yourself.

The goal

We're going to create a pair of Selects that will display a list of ships: one backed by a store with String IDs, and one backed by a store with numeric IDs. The generated page will look like this:


The data

Our store will be a dojo/data/ItemFileReadStore, which will load JSON from a given file. The String ID JSON looks like this:
{
    "identifier": "shipId",
    "label": "shipName",
    "items": [
     { "shipId": "1", "shipName": "Constitution" },
        { "shipId": "2", "shipName": "Enterprise" }, 
        // You get the point...
        { "shipId": "6", "shipName": "Yorktown" }
    ]
}
And the numeric ID JSON looks like this:
{
    "identifier": "shipId",
    "label": "shipName",
    "items": [
     { "shipId": 1, "shipName": "Constitution" },
        { "shipId": 2, "shipName": "Enterprise" }, 
        // You get the point...
        { "shipId": 6, "shipName": "Yorktown" }
    ]
}

The HTML

The HTML is pretty uninteresting; it basically loads the JavaScript files and throws in some DIV placeholders for the widgets:
<html>
 <head>
  <link rel="stylesheet"
   href="http://ajax.googleapis.com/ajax/libs/dojo/1.7/dijit/themes/claro/claro.css">
  
  <script src="//ajax.googleapis.com/ajax/libs/dojo/1.7.2/dojo/dojo.js"
   data-dojo-config="async: true"></script>
  <script src="//ajax.googleapis.com/ajax/libs/dojo/1.7.2/dijit/dijit.js"
   data-dojo-config="async: true"></script>
  <script src="selectStoreExample.js"
   data-dojo-config="async: true"></script>
 </head>
 
 <body class="claro">

  
  <h1>Using String IDs</h1>
  <div>
   shipSelectString: 
   <div id="shipSelectString"></div>
   
   <div id="selectStringEnterpriseButton"></div>
   
   <div id="selectStringYorktownButton"></div>
  </div>
  
  <h1>Using Numeric IDs</h1>
  <div>
   shipSelectNumber: 
   <div id="shipSelectNumber"></div>
   
   <div id="selectNumberEnterpriseButton"></div>
   
   <div id="selectNumberYorktownButton"></div>
  </div>
 </body>
</html>

The JavaScript

Here's where all the Dijit bindings happen. For both the String and numeric ID stores, we create a Select and a couple buttons that call select.set("value", [the ID]):
require(["dojo", "dijit", "dojo/data/ItemFileReadStore", "dijit/form/Select", 
   "dijit/form/Button", "dojo/domReady!"
  ], function(dojo, dijit, ItemFileReadStore, Select, Button) {
 //****** Using Strings for IDs ******
 //Create the data store using a JSON file
 var shipStoreString = new ItemFileReadStore({
  url: "ships-string.json"
 });
 
 //Create and start the Select, using the ItemFileReadStore as its store
 var selectString = new Select({
  name: "shipSelectString",
  store: shipStoreString
  }, "shipSelectString");

 selectString.startup();
 
 //Create buttons that will use the Select widget's set() function to set the value
 new Button({
  label: "Set shipSelectString to Enterprise",
  onClick: function() {
   selectString.set("value", "2");
  }
 }, "selectStringEnterpriseButton");
 
 new Button({
  label: "Set shipSelectString to Yorktown",
  onClick: function() {
   selectString.set("value", "6");
  }
 }, "selectStringYorktownButton");
 
 //****** Using numbers for IDs ******
 //Create the data store using a JSON file
 var shipStoreNumber = new ItemFileReadStore({
  url: "ships-number.json"
 });
 
 //Create and start the Select, using the ItemFileReadStore as its store
 var selectNumber = new Select({
  name: "shipSelectNumber",
  store: shipStoreNumber
  }, "shipSelectNumber");

 selectNumber.startup();
 
 //Create buttons that will use the Select widget's set() function to set the value
 new Button({
  label: "Set shipSelectNumber to Enterprise",
  onClick: function() {
   selectNumber.set("value", 2);
  }
 }, "selectNumberEnterpriseButton");
 
 new Button({
  label: "Set shipSelectNumber to Yorktown",
  onClick: function() {
   selectNumber.set("value", 6);
  }
 }, "selectNumberYorktownButton");
});
Now if you fire up the HTML page in your browser, you'll see the screen shown above in your browser. Click on the buttons, and you'll see that the Select with Strings as IDs changes when you click the buttons, but nothing happens to the Select with numeric IDs. This is the limitation (bug?) that prevents you from using numeric IDs in a store that's backing a Select.

Resources

Wednesday, May 16, 2012

Secure Password Storage - Lots of don'ts, a few dos, and a concrete Java SE example

The importance of storing passwords securely

As software developers, one of our most important responsibilities is the protection of our users' personal information. Without technical knowledge of our applications, users have no choice but to trust that we're fulfilling this responsibility. Sadly, when it comes to passwords, the software development community has a spotty track record.

While it's impossible to build a 100% secure system, there are fortunately some simple steps we can take to make our users' passwords safe enough to send would-be hackers in search of easier prey.

If you don't want all the background, feel free to skip to the Java SE example below.

The Don'ts

First, let's quickly discuss some of the things you shouldn't do when building an application that requires authentication:
  • Don't store authentication data unless you really have to. This may seem like a cop-out, but before you start building a database of user credentials, consider letting someone else handle it. If you're building a public application, consider using OAuth providers such as Google or Facebook. If you're building an internal enterprise application, consider using any internal authentication services that may already exist, like a corporate LDAP or Kerberos service. Whether it's a public or internal application, your users will appreciate not needing to remember another user ID and password, and it's one less database out there for hackers to attack.
  • If you must store authentication data, for Gosling's sake don't store the passwords in clear text. This should be obvious, but it bears mentioning. Let's at least make the hackers break a sweat.
  • Don't use two-way encryption unless you really need to retrieve the clear-text password. You only need to know their clear-text password if you are using their credentials to interact with an external system on their behalf. Even then, you're better off having the user authenticate with that system directly. To be clear, you do not need to use the user's original clear-text password to perform authentication in your application. I'll go into more detail on this later, but when performing authentication, you will be applying an encryption algorithm to the password the user entered and comparing it to the encrypted password you've stored. 
  • Don't use outdated hashing algorithms like MD5. Honestly, hashing a password with MD5 is virtually useless. Here's an MD5-hashed password:  569a70c2ccd0ac41c9d1637afe8cd932. Go to http://www.md5hacker.com/ and you can decrypt it in seconds.
  • Don't come up with your own encryption scheme. There are a handful of brilliant encryption experts in the world that are capable of outwitting hackers and devising a new encryption algorithm. I am not one of them, and most likely, neither are you. If a hacker gets access to your user database, they can probably get your code too. Unless you've invented the next great successor to PBKDF2 or bcrypt, they will be cackling maniacally as they quickly crack all your users' passwords and publish them on the darknet.

The Dos

Okay, enough lecturing on what not to do. Here are the things you need to focus on:
  • Choose a one-way encryption algorithm. As I mentioned above, once you've encrypted and stored a user's password, you never need to know the real value again. When a user attempts to authenticate, you'll just apply the same algorithm to the password they entered, and compare that to the encrypted password that you stored.
  • Make the encryption as slow as your application can tolerate. Any modern password encryption algorithm should allow you to provide parameters that increase the time needed to encrypt a password (i.e. in PBKDF2, specifying the number of iterations). Why is slow good? Your users won't notice if it takes an extra 100ms to encrypt their password, but a hacker trying a brute-force attack will notice the difference as they run the algorithm billions of times.
  • Pick a well-known algorithm. The National Institute of Standards and Technology (NIST) recommends PBKDF2 for passwords. bcrypt is a popular and established alternative, and scrypt is a relatively new algorithm that has been well-received. All these are popular for a reason: they're good. 

PBKDF2

Before I give show you some concrete code, let's talk a little about why PBKDF2 is a good choice for encrypting passwords:
  • Recommended by the NIST. Section 5.3 of Special Publication 800-132 recommends PBKDF2 for encrypting passwords. Security officials will love that.
  • Adjustable key stretching to defeat brute force attacks. The basic idea of key stretching is that after you apply your hashing algorithm to the password, you then continue to apply the same algorithm to the result many times (the iteration count). If hackers are trying to crack your passwords, this greatly increases the time it takes to try the billions of possible passwords. As mentioned previously, the slower, the better. PBKDF2 lets you specify the number of iterations to apply, allowing you to make it as slow as you like.
  • A required salt to defeat rainbow table attacks and prevent collisions with other users. A salt is a randomly generated sequence of bits that is unique to each user and is added to the user's password as part of the hashing. This prevents rainbow table attacks by making a precomputed list of results unfeasible. And since each user gets their own salt, even if two users have the same password, the encrypted values will be different. There is a lot of conflicting information out there on whether the salts should be stored someplace separate from the encrypted passwords. Since the key stretching in PBKDF2 already protects us from brute-force attacks, I feel it is unnecessary to try to hide the salt. Section 3.1 of NIST SP 800-132 also defines salt as a "non-secret binary value," so that's what I go with.
  • Part of Java SE 6. No additional libraries necessary. This is particularly attractive to those working in environments with restrictive open-source policies.

Finally, a concrete example

Okay, here's some code to encrypt passwords using PBKDF2. Only Java SE 6 is required.
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.spec.InvalidKeySpecException;
import java.security.spec.KeySpec;
import java.util.Arrays;

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordEncryptionService {

 public boolean authenticate(String attemptedPassword, byte[] encryptedPassword, byte[] salt)
   throws NoSuchAlgorithmException, InvalidKeySpecException {
  // Encrypt the clear-text password using the same salt that was used to
  // encrypt the original password
  byte[] encryptedAttemptedPassword = getEncryptedPassword(attemptedPassword, salt);

  // Authentication succeeds if encrypted password that the user entered
  // is equal to the stored hash
  return Arrays.equals(encryptedPassword, encryptedAttemptedPassword);
 }

 public byte[] getEncryptedPassword(String password, byte[] salt)
   throws NoSuchAlgorithmException, InvalidKeySpecException {
  // PBKDF2 with SHA-1 as the hashing algorithm. Note that the NIST
  // specifically names SHA-1 as an acceptable hashing algorithm for PBKDF2
  String algorithm = "PBKDF2WithHmacSHA1";
  // SHA-1 generates 160 bit hashes, so that's what makes sense here
  int derivedKeyLength = 160;
  // Pick an iteration count that works for you. The NIST recommends at
  // least 1,000 iterations:
  // http://csrc.nist.gov/publications/nistpubs/800-132/nist-sp800-132.pdf
  // iOS 4.x reportedly uses 10,000:
  // http://blog.crackpassword.com/2010/09/smartphone-forensics-cracking-blackberry-backup-passwords/
  int iterations = 20000;

  KeySpec spec = new PBEKeySpec(password.toCharArray(), salt, iterations, derivedKeyLength);

  SecretKeyFactory f = SecretKeyFactory.getInstance(algorithm);

  return f.generateSecret(spec).getEncoded();
 }

 public byte[] generateSalt() throws NoSuchAlgorithmException {
  // VERY important to use SecureRandom instead of just Random
  SecureRandom random = SecureRandom.getInstance("SHA1PRNG");

  // Generate a 8 byte (64 bit) salt as recommended by RSA PKCS5
  byte[] salt = new byte[8];
  random.nextBytes(salt);

  return salt;
 }
}

The flow goes something like this:
  1. When adding a new user, call generateSalt(), then getEncryptedPassword(), and store both the encrypted password and the salt. Do not store the clear-text password. Don't worry about keeping the salt in a separate table or location from the encrypted password; as discussed above, the salt is non-secret.
  2. When authenticating a user, retrieve the previously encrypted password and salt from the database, then send those and the clear-text password they entered to authenticate(). If it returns true, authentication succeeded.
  3. When a user changes their password, it's safe to reuse their old salt; you can just call getEncryptedPassword() with the old salt.
Easy enough, right? If you're building or maintaining an application that violates any of the "don'ts" above, then please do your users a favor and use something like PBKDF2 or bcrypt. Help them, Obi-Wan Developer, you're their only hope.

References