SDR classification

Hi,
I’m trying to build a swipe-keyboard recognizer.

My data has Time, X, Y stroke recordings, so I can give either a time series or a 2D image for each word.
I also have a labeled dictionary to go with stroke recordings.

I use the term glyph for my swiped word. So glyph->dictionary word is my end goal.

I have a caveat:

  • I’d like to be able to add new words to the dictionary.
    • I can generate synthetic strokes for those words, then let the network learn a user’s specific strokes online.

This post discusses classification using one model per class. That won’t be usable with a dictionary of ~160,000 words.

Can SDRs be compared in a time- and space-efficient way? Can I encode an SDR for each dictionary word, and choose the nearest matching word (SDR)?

Most useful to me would be a little direction and explanation.

EDIT: I can rephrase my query.

Assume I have an SDR generated via a SP->TP->SDR process.
If I have a large set of these SDRs, one for each class, can I classify a sample SDR with this set?
What is the orthodox HTM way of doing that?

1 Like

So you want to recognize 160k words encoded as swipes?

1 Like

Ideally, yes.

I could also recognize the 26 Latin letters as swipes forming words. But that doesn’t need HTM.

1 Like

Hello,

Your project sounds interesting and I think an HTM could work for it.

For the general design of it, I would recommend starting with:


Typically, you’d compare two SDRs by calculating the “overlap”, which is the number of ON bits they have in common (aka the dot product). Apply a threshold to the overlap to determine whether they match (the threshold is typically around 10-20, but depends on your setup).
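A minimal sketch of that comparison, with SDRs stored as lists of ON-bit indices (the bit positions and the 2048-bit space are made up for illustration):

```python
# Toy sketch: comparing two SDRs by overlap (shared ON bits).
def overlap(sdr_a, sdr_b):
    """Number of ON bits the two SDRs share (dot product of binary vectors)."""
    return len(set(sdr_a) & set(sdr_b))

def matches(sdr_a, sdr_b, threshold=15):
    """True if the overlap clears the match threshold (typically 10-20)."""
    return overlap(sdr_a, sdr_b) >= threshold

# SDRs as lists of ON-bit indices out of a (say) 2048-bit space.
word_a = [3, 57, 190, 511, 902, 1203, 1777]
word_b = [3, 57, 190, 511, 902, 1500, 1801]
print(overlap(word_a, word_b))  # 5 shared ON bits
```

Using sets of indices keeps the comparison O(min(n, m)) in the number of ON bits, which stays cheap because SDRs are sparse.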

1 Like

For such tasks, HTMs often use a simple single-layer artificial neural network. In the htm.core library, we use htm.algorithms.Classifier.
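To illustrate the idea (this is a toy sketch in the spirit of a single-layer SDR classifier, not the actual htm.core implementation), one can accumulate per-class weights on active input bits and score classes by summed weight at inference:

```python
# Toy single-layer classifier over SDR input (NOT htm.core's Classifier).
from collections import defaultdict

class TinySDRClassifier:
    def __init__(self):
        self.weights = defaultdict(float)  # (class_label, input_bit) -> weight
        self.classes = set()

    def learn(self, active_bits, label):
        self.classes.add(label)
        for bit in active_bits:
            # Strengthen input bits seen together with this label.
            self.weights[(label, bit)] += 1.0

    def infer(self, active_bits):
        # Score each class by summed weight over the active input bits.
        scores = {c: sum(self.weights[(c, b)] for b in active_bits)
                  for c in self.classes}
        return max(scores, key=scores.get)

clf = TinySDRClassifier()
clf.learn([1, 5, 9], "cat")
clf.learn([2, 6, 10], "dog")
print(clf.infer([1, 5, 10]))  # "cat": two bits vote cat, one votes dog
```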

1 Like

In 2004, when the On Intelligence book came out, I read it and played around with HTMs for a little while, revisiting them in 2010.
I’m both pleased people are still working with it and surprised it hasn’t been pushed much further.

My current plan, since I have timesteps with X/Y values, is to physically draw out the strokes as binary images for the SP, feeding the resulting SDRs into the TM. The end result is an array of images incrementally completing a glyph at each step, with the final step being the completed glyph. I’m assuming this will work better than feeding in binarized X/Y values.
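That incremental rasterization step might be sketched like this (grid dimensions and stroke coordinates are placeholders; real stroke data would be scaled to the grid first):

```python
# Sketch: draw an incremental stroke as a binary image for a spatial pooler.
def rasterize(points, width=32, height=16):
    """Set one pixel per (x, y) sample; returns a flat binary grid."""
    grid = [0] * (width * height)
    for x, y in points:
        # Clamp to grid bounds; real stroke data would be rescaled first.
        px = min(max(int(x), 0), width - 1)
        py = min(max(int(y), 0), height - 1)
        grid[py * width + px] = 1
    return grid

# Incremental glyph: each frame rasterizes the stroke so far.
stroke = [(2, 3), (5, 4), (9, 6)]
frames = [rasterize(stroke[: i + 1]) for i in range(len(stroke))]
print([sum(f) for f in frames])  # [1, 2, 3] ON pixels per frame
```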

Is this a sound approach? Is there a better way to do this?

1 Like

TM (temporal memory) is the main workhorse of HTM at its current stage. The way TM works is relatively simple: it predicts the next time step’s input in a sequence. A sequence of X/Y values is fine as raw input from this point of view, for two reasons:

  • it is a temporal sequence, which TM is meant to handle, not static images.
  • an image requires a much larger input frame and loses the temporal information.

So you’re better off encoding each X/Y time step as an SDR and feeding that series to a TM.

But one thing that bugs me: since you say you’re already able to decode 26 letters, why don’t you simply generate a sequence of letters from these 26 glyphs? E.g. glyphs for “S”, “I”, “M”, “P”, “L”, “E” would make the word “SIMPLE”.

EDIT: Another question - how big is your dataset currently and can you share it? I would like to run some experiments myself, thanks.

1 Like

The downside to just matching nearest neighbour on the swipe path is that it has a huge 25% word error rate. So my hope is that SP/TM could get me better results.

I could certainly feed in the nearest letter as a 26-bit vector at each temporal step as enrichment.

I’ve got about a million samples in my test set.

1 Like

So you say it isn’t alphabetical? A glyph for “CAT” is not a sequence of the 3 letter glyphs for “C”, “A” and “T”?

Edit: HTM alone cannot recognize words in a 130k word dictionary. It makes time series predictions.

There is a method for classification that uses the network activation states as input data for a classifier. The default classifier is a simple variation of a linear classifier; I doubt it is usable with a 130k × 2k (input size) matrix.

The NN search might get improved by

  • trying to obtain better embeddings for the full word glyph
  • having a sufficiently large number of samples for each word.

2nd Edit: have you considered reservoir networks?

1 Like

Think swipe paths over a keyboard. Swiping from “C”→“A”→“T” for “cat” would look like a sideways V.

In the process of that swipe you’ll pass over X and S to reach the letter A, and S, D, E and R to reach T. This results in high error rates.

HTM alone cannot recognize words in a 130k word dictionary. It makes time series predictions.

I see your point here. My assumption was to extract the SDR from the TM and use the nearest match to one of the dictionary words, also stored as SDR.

trying to obtain better embeddings for the full word glyph

having a sufficiently large number of samples for each word.

On point #2: I have about a million labeled examples, and I can generate an infinite number of synthetic ones. On point #1: since I can generate synthetic examples, I can generate better embeddings if I can think of them.

2nd Edit: have you considered reservoir networks?

Now there’s a term I haven’t heard in a while. No, it never crossed my mind.

1 Like

To understand how SDR Classifier (or any “Standard” classifier) can be used together with TM to recognize sequences, you need to know that:

  • Use a numeric SDR encoder to encode the X,Y numerical values into a fixed-size SDR, e.g. a 512-bit sparse vector for each X,Y at every time step. These encoders are deterministic; they do not learn anything.
  • TM internally is organized as a 2D matrix of “nodes”: 512 columns by (e.g.) 8 cells in height. You can think of it as a type of binary recurrent network, with 512 × 8 = 4096 “neurons” trying to predict the next input SDR.
  • You feed the TM several sequences (no class attached whatsoever), and when you see a plateau (it cannot improve its prediction rate), freeze it and use the firing state of the 4096 neurons as an embedding for the classifier algorithm, which is fed 4096-bit binary input vectors.
  • It can be any classifier, e.g. k-nearest neighbour, which, if populated with a sufficiently large number of samples (each a 4096-bit vector), might work. It’s just a shot.

If you ponder this strategy, it should work with any recurrent network; the simplest, conceptually and computationally, are reservoirs, aka echo state networks.
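The classifier end of that strategy could be sketched as follows, with overlap (shared ON bits) as the similarity measure. The stored “embeddings” here are tiny placeholder bit-sets; in practice each would be a frozen 4096-bit TM firing state:

```python
# Sketch: k-nearest-neighbour classification over binary embeddings,
# scoring neighbours by overlap (count of shared ON bits).
def knn_classify(query_bits, labelled_embeddings, k=3):
    """labelled_embeddings: list of (set_of_on_bits, label) pairs."""
    scored = sorted(
        labelled_embeddings,
        key=lambda item: len(query_bits & item[0]),
        reverse=True,
    )
    top = [label for _, label in scored[:k]]
    return max(set(top), key=top.count)  # majority vote among k nearest

# Placeholder store; real entries would be 4096-bit TM states per word.
store = [
    ({1, 2, 3, 4}, "the"),
    ({1, 2, 3, 9}, "the"),
    ({7, 8, 9, 10}, "cat"),
]
print(knn_classify({1, 2, 3, 5}, store, k=3))  # "the"
```

For 160k classes a linear scan like this would be slow; an inverted index from ON bit to stored SDRs is the usual way to make overlap search sublinear.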

1 Like

Hi Cezar and others,

So I’ve pondered my problem for a bit and come back to it with a different viewpoint.

Taking the completed glyph and classifying NxM pixels → 160K classes won’t work. I had hoped that SDRs would be robust enough for that; they won’t be.

Taking the raw X/Y data and classifying → 160K classes won’t work either.

So let’s go back to the “easy” problem I dismissed from the very start, swipe input of the (latin) alphabet. We know the keyboard layout along with button boundaries, we also have the swipe time and X/Y data. We can generate velocity and trajectory data from that X/Y data.

There’s a shortcut there: we can use the structure of language to reject bad input. Letter pairs common in words are rarely next to each other on the keyboard. “fg”, “fv” and “fd” rarely occur in real words; “fr” does, though.

Similarly, once a word is complete (swipe released or spacebar), we can use an inverted dictionary (“ztasre” rather than “ersatz”); there’s more variation at the end of words than at the beginning. If we have confidence in our first letter, we can use that to reject any characters not on the “tree” from that point in our dictionary.
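The tree-rejection idea can be sketched with a plain prefix set (the three-word dictionary is a placeholder):

```python
# Sketch: reject decoded letters that cannot continue any dictionary word.
def build_prefixes(words):
    """All non-empty prefixes of every dictionary word."""
    prefixes = set()
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    return prefixes

def allowed_next_letters(prefixes, so_far):
    """Letters that keep the decoded string on the dictionary 'tree'."""
    return {p[len(so_far)] for p in prefixes
            if p.startswith(so_far) and len(p) > len(so_far)}

prefixes = build_prefixes(["cat", "car", "dog"])
print(sorted(allowed_next_letters(prefixes, "ca")))  # ['r', 't']
```

A real 160k-word dictionary would use a trie rather than a flat prefix set, but the rejection logic is the same.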

So what are the orthodox HTM approaches to “debouncing” and rejecting bad input for movement data?

Here’s the X/Y data for the word “the”. Apologies, I tried to overlay it on a keyboard, but my Pixelmator skills failed me. (Path not to scale of the keyboard.)

1 Like

WENT ?

I would suggest that you focus on what eye saccades do and look at the transitions, not the lines in between or any velocity, etc. Direction transitions are what the input source is instigating; the velocity is just an invariant decay you have no control over.

Map the whole keyboard as input points, then give the input layer the transition intensity as a type of attention focus, and sparsity ends up clustered around the target letter (the inverse of thalamic local inhibition). The greater the direction change, the greater the input gain, which may or may not narrow the sparsity of the input pattern. The transition locality would need to be close to, but not directly over, the target letter as a result.

The input-pattern sparsity may also not be consistent as a result of the gain difference, but the sparsity rule could always enforce a given number.

There’s also the problem of ambiguity, e.g. the “TIP” trace is the same as “TOP”, “TYP” or “TUP”, or even “RIO” if the operator isn’t precise enough and shifts the starting point to the left.

2 Likes

Thinking about it a bit more, how do double-letter instances get represented? Or do we rely on the assumption that double letters are not a critical factor within a concept, just a spelling ambiguity in language? Surrounding context may resolve the ambiguity; however, that’s an additional larger-scale context process, possibly beyond the immediate input?

Pool vs Poll

I went to the pool vs I went to the poll

Pool time vs Poll time

Poll table vs pool table

Guessing there are other words that will bring up other issues.

Context matters beyond the sentence if uncertainty remains.

1 Like

Map the whole keyboard as input points, then give the input layer the transition intensity as a type of attention focus and sparsity ends up clustered around the target letter (inverse of thalamus local inhibition). Greater the direction change the greater the input gain which may or may not narrow the sparsity of the input pattern. The transition locality would need to be close to but not directly over the target letter as a result.

Can you explain this a bit more. I’m not sure I follow the input mapping/attention part.

Thinking about it a bit more, how do double letter instances get represented

This was something I was thinking about: “Moose”, “Loonie”, and so on. I’ll have to look at how existing keyboards do this; I assume you just hover over a key for some time period. The same goes for letter accents in some languages, like Swedish, e.g. “kalenderår”. Again, I assume you just hover for a beat before moving on.

look at the transitions and not the lines in between or any velocity, etc.

This would bring up an issue like Cezar’s “tip”/“top” example. Users will zip right through y, u, i, o.

double letters are not a critical factor within a concept,

Are you thinking of decoding into some latent concept space? I’m thinking of decoding into keys. It doesn’t have to be perfect as we have a dictionary/language model we can constrain our outputs to.

Once you’ve decoded “Brai”, the next letter is unlikely to be b, j, or h. Webster’s 2nd ed. says there are 40 words total that start with “brai”, two with five letters.

I’m thinking that perfectly decoding keyboard input isn’t necessary: important, but not necessary. Hanging a good language model (or autocorrect) on top of fairly clean input is what’s necessary in practice.

1 Like

With “tip” and “top”, if you don’t have any requirement for the “wave” to pause on the letters, then you have to rely on the surrounding context to determine the probable word.

Double letters: if you’re using surrounding context, they end up like texting, where al the leters are not neded for overal context. The brian fils the context. Stoop vs. stop: one’s Scottish, the other’s English, lol. Jokes aside…

The map over the keyboard would be, say, 4× the resolution of the key mapping, so that you’re not just getting one point as a winning bit. Spread the possible selected bits around the area or path of the transition point. The direction leading into the letter and out of the letter is information as well. I’m assuming that the wave path is proportional to an actual keyboard layout and not just an arbitrary-size wave that changes each time? Otherwise “ask”, “ash”, “add” and “ass” are just the same if the length of the line is also unknown. quip wet err rip top yip up

The lead-in should narrow the prior letter’s probability, and the lead-out feeds into the next letter’s prediction if you have many layers. Not sure on the network dimensions and layers you’re thinking about.

Re: “brai”, two with five letters: you will always miss the unexpected. The dictionary gives three: brain, brail, braid. Then come the unexpected: Brait plc, Braiz cosmetics, the Braiw M18 impact wrench, Brais Islamic studies, etc.

1 Like

I figured out my Pixelmator skills. Here’s an image of the path for “the” overlaid on a keyboard.

You can see the swipe passes over r and g. The swipe path and keyboard cover a roughly 400×190 px region, depending on phone and orientation, but it’s known and the two match.

I looked into other swipe keyboards: they just ignore “tip/top” conflicts and “moose” issues, and let the language/autocorrect model handle it.

Luckily, the keyboards actually have key layouts/boundaries available, so I can easily know exactly when a swipe is over a key. Then it’s a matter of deciding on intentionality.

Map the whole keyboard as input points, then give the input layer the transition intensity as a type of attention focus and sparsity ends up clustered around the target letter (inverse of thalamus local inhibition). Greater the direction change the greater the input gain which may or may not narrow the sparsity of the input pattern. The transition locality would need to be close to but not directly over the target letter as a result.

I actually think velocity might be useful here as well. If you slow down or pause on “i” as you transition through “tyuiop”, you indicate you’re typing “tip”, not “top”. Changing direction, like on the “h” in my example above, also slows velocity, indicating intent.
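A rough sketch of using speed as an intent signal (the sample trace, units, and threshold are all made up for illustration):

```python
# Sketch: flag samples where swipe speed drops, as candidate intended keys.
import math

def slow_points(samples, speed_threshold=10.0):
    """samples: list of (t, x, y). Returns indices where segment speed
    drops below the threshold (a pause or direction change)."""
    slow = []
    for i in range(1, len(samples)):
        t0, x0, y0 = samples[i - 1]
        t1, x1, y1 = samples[i]
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else float("inf")
        if speed < speed_threshold:
            slow.append(i)
    return slow

# Fast sweep across the top row, then a near-pause at the last sample.
trace = [(0.00, 0, 0), (0.05, 10, 0), (0.10, 20, 0), (0.30, 21, 0)]
print(slow_points(trace))  # [3]: only the final, slow segment qualifies
```

Pairing the slow indices with the key boundaries under those X/Y points would give the "intentional" key candidates.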

1 Like

You mentioned saccades; are these hand or eye movements?

1 Like

@cezar_t @dmac @spitfire

While we’re on the subject of SDR classification, I want to reintroduce the classifier algorithm I developed several years ago that seems to be more performant than the regular linear classifier used in the HTM community. For lack of a better term, I will call it a Cortical Classifier, since it looks like a stripped-down Spatial Pooler augmented for supervised learning.

Types of Classification

Before I start, I want to clarify what I think are two different types of classification tasks. The first is static classification: the task of receiving a single SDR code and classifying it with a label. The second is dynamic classification, which is more of a sensorimotor process involving a series of inputs and a hierarchical cortical process that eliminates possibilities and predicts likely future inputs based on a classified model. The latter is what I think this keyboard-stroke glyph classification problem is, and it highlights the kind of problems that Discrete Cortical Circuits (DCC) inspired by the brain are likely to solve. However, the Cortical Classifier I present here is just a static classifier, which may be of some use in the future when solving dynamic classification.

Demo

To start, you can see the Cortical Classifier in action using my DCC dashboard:

If my domain is white-listed, you should be able to see the demo embedded below. Otherwise, just use the above link.

The demo shows three blocks: a discrete data source that loops through a sequence of integer values, a discrete encoder that transforms discrete values into an SDR encoding, and the third is the Cortical Classifier that takes in the SDR input from the encoder and the discrete training signal that specifies the proper label for each SDR input.

The block for the Cortical Classifier has two activity plots. The top part is the probabilistic interpretation of which label to classify the input, and the bottom part is the cell activations that vote on which class label the current input should be assigned to.

How It Works

The classifier cells are just normal HTM-like cells, with synaptic inputs having binary weights and permanence values applied to the input field. The classifier is parameterized with n number of cells and m number of labels. For each class, l number of cells are assigned for detecting that class. Ideally, these cells specialize to detect different aspects of a class if that class has multiple possible inputs (this is not the case for the demo). You can interpret the cell activations as a probability vector over the classes by, for each class, dividing the number of that class’s activated cells by k.

I said originally that the Cortical Classifier (CC) is like a stripped-down HTM Spatial Pooler. You can imagine the CC as a spatial pooler without the topology, without the neuron boosting, and without the ability for neurons to activate of their own volition. Instead, activation of the cells is done with a k-Winner-Take-All algorithm applied over the whole field, which guarantees a fixed k-sparsity.
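The k-WTA activation and the per-class probability read-out described above could be sketched like this (the cell scores stand in for summed synaptic input; the layout of l cells per class is as described, with classes occupying contiguous cell blocks):

```python
# Sketch: k-Winner-Take-All activation plus a per-class probability read-out.
def k_wta(scores, k):
    """Indices of the k highest-scoring cells (the 'active' cells)."""
    return set(sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[:k])

def class_probabilities(active, cells_per_class, num_classes, k):
    """Fraction of the k winners falling in each class's cell block."""
    counts = [0] * num_classes
    for cell in active:
        counts[cell // cells_per_class] += 1
    return [c / k for c in counts]

# 2 classes x 4 cells each; class 0's cells carry most of the input.
scores = [9, 8, 7, 1, 2, 6, 1, 0]
active = k_wta(scores, k=4)
print(class_probabilities(active, cells_per_class=4, num_classes=2, k=4))
# [0.75, 0.25]
```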

The use of k-WTA seems to be the natural mechanism for activating a layer of neurons, since it efficiently mimics the effect of lateral inhibition. However, since we don’t use topology (which would inhibit cell neighbors), we can easily visualize what is happening by clustering similar cells next to each other, purely for visualization. You can see that effect in the demo as the classifier fully learns to classify around step 60.

The other major aspect of the CC is the learning mechanism. Unlike the HTM Spatial Pooler which is an unsupervised learning algorithm, the CC is a supervised learning algorithm with a training signal that instructs the neurons whether their activation is appropriate. When a neuron activates, it either reinforces its synaptic permanences if the activation was correct or it punishes its synaptic permanences if the activation was incorrect. It does not, however, apply changes to neurons if they don’t activate, even if they were supposed to. I did some experimentation and I found that this was the quickest and most stable way to create a classifier across a wide variety of circumstances. This was just based on experimentation and intuition, and I didn’t formally codify this approach, so there may be some variations that are more appropriate in different situations.
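That learning rule might look roughly like this for a single active cell (the increment/decrement sizes and permanence cap are assumptions, not values from the actual implementation):

```python
# Sketch: reinforce or punish the permanences of an *active* cell only;
# as described above, inactive cells are left untouched even when they
# "should" have fired.
def update_cell(permanences, input_bits, was_correct,
                inc=2, dec=1, max_perm=99):
    """permanences: dict input_bit -> integer permanence, for one cell
    that just activated on this input."""
    for bit in list(permanences):
        if bit in input_bits:
            if was_correct:
                permanences[bit] = min(permanences[bit] + inc, max_perm)
            else:
                permanences[bit] = max(permanences[bit] - dec, 0)
    return permanences

perms = {3: 10, 7: 10, 12: 10}
update_cell(perms, input_bits={3, 7}, was_correct=True)
print(perms)  # {3: 12, 7: 12, 12: 10}: only overlapping synapses reinforced
```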

One final feature that does not exist in HTM is the ability to prune and grow synapses. For each cell, they have a finite number of synapses that are a fraction of the total input field size. Each synapse has a permanence value which is an integer. If the integer is above some positive threshold, the synapse is connected and passes its input value to its connected neuron for summation. Unique to CCs, however, if the permanence goes to zero, the synapse is pruned (i.e. deleted), and a new synapse is grown (created) with a randomly chosen input location. This synaptic prune-grow strategy allows the algorithm to efficiently allocate resources and enables neurons to adapt and specialize to the structure of the input they receive.
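The prune-grow step might be sketched as follows (the initial permanence for a regrown synapse and the growth policy are assumptions for illustration):

```python
# Sketch: prune a synapse whose permanence hits zero and grow a
# replacement at a randomly chosen, currently unconnected input location.
import random

def prune_and_grow(permanences, input_size, init_perm=5, rng=random):
    """permanences: dict input_bit -> integer permanence for one cell."""
    dead = [bit for bit, p in permanences.items() if p <= 0]
    for bit in dead:
        del permanences[bit]  # prune the exhausted synapse
        free = [b for b in range(input_size) if b not in permanences]
        if free:
            permanences[rng.choice(free)] = init_perm  # grow elsewhere
    return permanences

perms = {2: 0, 5: 30, 9: 12}
prune_and_grow(perms, input_size=16, rng=random.Random(0))
print(len(perms))  # 3: the synapse count is maintained after prune-grow
```

Keeping the synapse count fixed while relocating dead synapses is what lets the cell reallocate its limited resources toward informative input bits.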

You can review an older implementation in my old BrainBlocks library from 4 years ago.

In particular, you can view the cell activation step here and the learning step here

The synaptic prune-grow code is in the BlockMemory class. However, looking at the BrainBlocks code, I see we had disabled the prune-grow capability at the time. You can still see the prune-grow code in the learn_move function of the BlockMemory class. I do use the prune-grow functionality in my latest DCC library, run in the demo. I haven’t shared that code yet.

I just thought I would share this because I was surprised how well it performed once I made the desired algorithm HTM-like instead of using a classical ML algorithm.

2 Likes