I know that a lot of people here have tried digit classification using the famous MNIST dataset. But HTM is really about sequences and it’s confusing to build sequences of numbers. Letters are much easier to appreciate, as a human!
So we have written some code to preprocess the widely studied NIST SD19 dataset of handwritten uppercase characters A…Z into the same image format and resolution as the MNIST dataset of digits 0…9 (28×28 grayscale images). If you combine MNIST + NIST SD19 you get an alphanumeric dataset.
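To give a rough idea of what this kind of preprocessing involves, here is a minimal sketch. It is not the code from our repository; the class and method names (`MnistStyle`, `toMnistStyle`) are made up for illustration. It assumes the source image has dark ink on a light background, and it skips refinements such as centering the glyph inside the frame.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, for illustration only.
class MnistStyle {
    static final int SIZE = 28; // MNIST images are 28x28 pixels

    /**
     * Scales a dark-ink-on-light image down to 28x28 and inverts it,
     * so that ink is bright (towards 255) on a dark (0) background,
     * which is the convention MNIST uses.
     */
    static int[][] toMnistStyle(BufferedImage src) {
        // Draw the source into a 28x28 8-bit grayscale image.
        BufferedImage scaled =
                new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, SIZE, SIZE, null);
        g.dispose();

        // Copy the pixels out as [row][column], inverting each value.
        int[][] pixels = new int[SIZE][SIZE];
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int gray = scaled.getRaster().getSample(x, y, 0);
                pixels[y][x] = 255 - gray;
            }
        }
        return pixels;
    }
}
```

You would load each source image with `javax.imageio.ImageIO.read(file)` and pass the result to `toMnistStyle`.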
So now you can feed text into your NuPIC/HTM engines as a sequence of handwritten images, where each character is somewhat ambiguous because it is never drawn the same way twice, and see how they predict sentences etc. Your existing code using MNIST digit encoding will work without any changes.
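As a sketch of how you might turn a sentence into such a sequence: the snippet below picks one random handwritten sample per character. The names (`GlyphSequence`, `render`) are hypothetical, and it assumes you have already loaded the images into a map from character to a list of pixel arrays. How you then encode each image and feed it to your HTM is up to your existing MNIST code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, for illustration only.
class GlyphSequence {
    /**
     * For each character of the text (upper-cased), picks one random
     * handwritten sample from that character's pool. Characters with
     * no samples, such as spaces or punctuation, are skipped.
     */
    static List<int[][]> render(String text,
                                Map<Character, List<int[][]>> samples,
                                Random rng) {
        List<int[][]> sequence = new ArrayList<>();
        for (char c : text.toUpperCase().toCharArray()) {
            List<int[][]> pool = samples.get(c);
            if (pool == null || pool.isEmpty()) {
                continue; // no image available for this character
            }
            sequence.add(pool.get(rng.nextInt(pool.size())));
        }
        return sequence;
    }
}
```

Because a different sample is drawn each time, the same sentence produces a different image sequence on every pass, which is the point of the exercise.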
Java code here to preprocess the images:
See the README for info on where to get the NIST data (they recently made it free to download).