Learning how birds teach themselves to sing Dad's song


12:10 If there is an example of internal teaching between different brain regions, this is a great one. First an auditory memory of the goal (his dad’s song), then an exploration of the vocal search space (‘babbling’), then selection of the vocalizations that have the smallest error compared to his dad’s song. Gradually the similarity increases (the error decreases) until it’s perfect.


Really - how many say they heard a certain style growing up and it inspired them?
Very common story.


I’d like to expand on the “selection of the vocals that have the smallest error” thing.
Depending on how you see that error and selection happening, it may feel as if there is still some ANN idea in it, or a bias from how we think about learning consciously and actively.

I’d like to find something more… ‘On-Intelligence’-style than this (whether or not that was what you had in mind yourself). Less of a conscious choice. Less error-detection + optimization. More “forward”, maybe. Using just prediction from an original intent (same sound as dad) and an associated motor signal (babbling).
I can’t put my finger on exactly what nags at me, or what it is I’m striving to find…
Simply less “error computation in between”, maybe. More Hebbian, in fact.
Do you guys see what I mean? Is that already how you envisioned the thing? Do you have some thoughts about it?

Should maybe be in its own thread, though.


Yeah, so this idea is pretty general across machine learning, I think. The different jargon is just different ways of saying the same sort of thing.

No matter what method you are using (gradient descent, genetic algorithms, reinforcement learning, etc.), they all have an error function (in genetic algorithms it is called a ‘fitness function’ or ‘objective function’, for example), which is simply a measure of the distance between the current output of the agent and the desired/expected output. They all also have a way to use noise/stochasticity to search the space of possible outputs. Those outputs that have a relatively smaller distance to the desired output are selected, so it could be said they have a ‘smaller error’, so they are ‘good’. In HTM talk they could be said to have a ‘greater overlap’ when talking about SDRs. Either way, it’s the same thing - a comparison/measure between two things, then selecting the representations with the greatest similarity/overlap, then repeating this process. This whole process is common to a vast number of machine learning methods.
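To make that shared skeleton concrete, here is a minimal sketch of "measure the error, add noise, keep what is closer". It is not any particular library's algorithm; the squared-distance error, the Gaussian noise, and the names `error` and `stochastic_search` are all assumptions chosen for illustration.

```python
import random

def error(candidate, target):
    """Distance between the current output and the desired output."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

def stochastic_search(target, steps=2000, noise=0.1, seed=0):
    """Noisy search: perturb the current best, keep the trial if its error is smaller."""
    rng = random.Random(seed)
    best = [0.0] * len(target)
    best_err = error(best, target)
    for _ in range(steps):
        # Stochastic exploration of the output space.
        trial = [b + rng.gauss(0.0, noise) for b in best]
        trial_err = error(trial, target)
        # Selection: only outputs closer to the target survive.
        if trial_err < best_err:
            best, best_err = trial, trial_err
    return best, best_err

target = [0.2, -0.5, 0.9]              # stand-in for the "desired output"
initial_err = error([0.0] * 3, target)
best, best_err = stochastic_search(target)
```

Swapping the error for a fitness function, or the single candidate for a population, would give the genetic-algorithm flavour of the same loop.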

So far HTM is working on feed-forward learning from sensory input, to model the world. Once you have a semantic model of the world you can then leverage that to do the process described above. If you have an objective/goal (say, to learn to produce a song-bird song) then that would require generating noisy/stochastic SDRs within the motor region hierarchy. The motor SDRs that produce an output that sounds similar to the dad’s song SDR will be remembered. So the agent produces an output from a stochastic motor SDR (‘babbling’), and the sound it produces is fed into the auditory region, where it is represented as a sensory input SDR and then compared to the dad’s song SDR. The overlaps are then probably sent to an association area along with the outputs of the motor SDRs that produced that sound. This region is probably where the SDRs are compared and the reinforcing feedback to the motor region is determined.

Repeat that enough times and it will build up a representation in the motor region that produces a very similar output to that of the dad’s song. Again, the process is basically stochastic sampling and semantic selection. It is likely that the motor representation is built from the bottom up in the motor region: it starts with small features (as they are easier to compare), then combines them into more complex features until the top level of the hierarchy represents the whole motor representation of the dad’s song.
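A toy sketch of that loop, with SDRs as sets of active bit indices. Everything here is hypothetical scaffolding: `hear` stands in for the whole vocal-tract-plus-auditory-region path as a fixed bit permutation, and `babble`, `learn_song`, `N_BITS` and `N_ACTIVE` are names and sizes made up for the example. It uses no HTM library and makes no claim about biology.

```python
import random

N_BITS = 256    # SDR width (assumed for illustration)
N_ACTIVE = 10   # active bits per SDR (assumed)

def overlap(a, b):
    """Number of active bits two SDRs share."""
    return len(a & b)

def random_sdr(rng):
    return frozenset(rng.sample(range(N_BITS), N_ACTIVE))

def babble(motor, rng):
    """Stochastic motor variation: swap one active bit for an inactive one."""
    bits = set(motor)
    bits.remove(rng.choice(sorted(bits)))
    inactive = [b for b in range(N_BITS) if b not in bits]
    bits.add(rng.choice(inactive))
    return frozenset(bits)

def hear(motor):
    """Toy stand-in for 'sing, then listen': a fixed permutation of the bits."""
    return frozenset((7 * b + 3) % N_BITS for b in motor)

def learn_song(dad_song, motor, trials=3000, seed=1):
    rng = random.Random(seed)
    best = overlap(hear(motor), dad_song)
    for _ in range(trials):
        trial = babble(motor, rng)
        score = overlap(hear(trial), dad_song)
        # Remember motor SDRs whose sound overlaps dad's song at least as well.
        if score >= best:
            motor, best = trial, score
    return motor, best

setup_rng = random.Random(0)
dad_song = random_sdr(setup_rng)      # the remembered auditory SDR
first_try = random_sdr(setup_rng)     # the initial babble
start = overlap(hear(first_try), dad_song)
motor, final = learn_song(dad_song, first_try)
```

The comparison is done here by one global `overlap` call, which is exactly the part that a local, Hebbian account would still have to explain.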

I don’t know if this is what the brain does, but it shows how this general machine learning method could be implemented in an HTM system to replicate the learning of the song-bird.


[First, a parenthesis about SDR overlap, but that’s not my main question]

I’m quite okay with the idea that, in computers, having a set of possible SDRs makes it possible to choose the one with the greatest overlap with another. If we set aside segments & cells doing, in essence, precisely that against input SDRs, and instead reason about output SDRs, I don’t really understand how the brain would do it, unless there is some output-global scheme enforcing that the result settles on one particular choice among the initially possible ones…
like, a grid? ^^’ But that’s another topic.

What I’d really like to express, I guess, is more related to the underlying implementation of learning: both the HTM spatial pooler and the TM algorithm use a very Hebbian-friendly scheme so far. And I like precisely that fact.
Maybe you’re able to envision a higher abstraction of that phenomenon that one could call error detection, but that’s not natural to me.
So I’m not really able to start from an “error detection” concept in motor development and bring it back to the local Hebbian stuff that I expect is really happening. Could you explain that to me?


Sure, Hebbian learning is THE candidate for error-based learning, which is pretty awesome. Think about a cell that is connected to other cells in an SDR through its distal dendrites. Using Hebbian learning, if the cell is active while receiving input over distal connections from the other cells in that SDR, then the synapses between them get reinforced (‘neurons that fire together wire together’). For feedforward learning the proximal dendrites are the primary drivers of cell activity. When a cell becomes active through its proximal dendrites from sensory input, its distal synapses with the other active cells nearby strengthen. Take this idea but switch out proximal input for apical input.

If apical feedback can drive cell activity like proximal input does, then learning will occur in the same way. The feedback (teaching signal) comes down the apical dendrite and excites the cell (like proximal dendrites would), and the synapses from any other active cells that are connected via distal dendrites will be strengthened. So if this was a stochastic SDR and there was feedback from a teacher region, then the cells in that SDR will get reinforced and will tend to co-activate together as an SDR in the future. The more times the apical teaching signal is fed to the same cells in that SDR, the more they will be reinforced.


So from a cell’s perspective it is feedforward learning flipped on its head. Instead of proximal inputs it is apical.
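A minimal sketch of that flipped learning rule, under the (unconfirmed) assumption that apical feedback can drive a cell. Permanence values, the increment, and the names `apical_teach` and `connected` are invented for the example and are not taken from any HTM implementation.

```python
INITIAL_PERM = 0.2   # starting synapse permanence (assumed)
PERM_INC = 0.1       # Hebbian increment per co-activation (assumed)
CONNECTED = 0.5      # permanence at which a synapse counts as connected (assumed)

def apical_teach(perm, active_cells, apical_feedback):
    """Reinforce distal synapses onto active cells that also receive apical feedback.

    perm maps (presynaptic_cell, postsynaptic_cell) -> permanence.
    """
    driven = active_cells & apical_feedback
    for post in driven:
        for pre in active_cells:
            if pre != post:
                old = perm.get((pre, post), INITIAL_PERM)
                perm[(pre, post)] = min(1.0, old + PERM_INC)

def connected(perm, pre, post):
    return perm.get((pre, post), INITIAL_PERM) >= CONNECTED

perm = {}
good_babble = {3, 17, 42}   # a stochastic SDR the teacher region approves of
other_babble = {5, 8, 60}   # a stochastic SDR that gets no teaching signal
teacher = {3, 17, 42}       # cells targeted by the apical teaching signal

for _ in range(5):
    apical_teach(perm, good_babble, teacher)
    apical_teach(perm, other_babble, teacher)
```

After a few repetitions only the taught SDR has connected lateral synapses, so its cells would tend to recruit each other; there is no explicit error term anywhere, only coincidence of activity and feedback.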

Of course the jury is still out on whether apical dendrites can drive cell activation. But there is some evidence that apical dendrites form ‘plateau potentials’ which cause the cell to burst, which is super convenient for a cell that is learning. Here is a post with the relevant papers.

Does that answer your question?


Thank you very much for those thoughts. Yes, there must be something of an apically-driven activation at some point, if we want hierarchy to make sense at all.
We could even link together “imagination” and “motor” using that.

We need to integrate 3 things here, in contrast maybe to vanilla TM:

  • teacher signal (dad’s song)
  • current try (babbling from motor POV)
  • actual sensed output (kwak)

And we need to wire previous motor to current apical only (or with a strong incentive) “in the context where the output turned out to be less-kwaky” (= comparison??). I’m not sure we currently have a fitting implementation for that. Or do we?

I hope the upcoming paper will provide details on those SMI concerns and that something will finally click in my mind.


I haven’t really thought about it that far. There are relay connections… possibly those play a role.

This type of experimentation can only be performed on a hierarchy. The ‘H’ is still missing from HTM, so for now we don’t have a model to experiment on. But it would be nice to develop a ‘general purpose’ hierarchy model that people can experiment with. It doesn’t need to be perfectly in line with biology. The more we can experiment, the faster we discover things. That would be one awesome community effort to host on GitHub.


Yes, I know. Although… one of the cell arrays in the current SMI draft could maybe be envisioned as ‘above’ the other… even if in the same area. We could try to extrapolate from that towards a more global view of hierarchies… but granted, we’re not there yet.

I’m aware that you and others, Paul in particular, are quite willing to direct thought experiments at that kind of problem, and that you’re sufficiently tooled at home to test those on HTM-based models, so I was wondering if you had insights towards that global scheme specified in such detail.

Once, I almost felt that something of those motor concerns had finally ‘clicked’ in my mind as similar to the HTM-view of the sensory part and that the details for it were at a fingertip… but I fear I’ve lost it. I guess my questions were directed at recovering that.

As long as we agree on the necessity of pursuing the same kind of Hebbian-like explanation to an implementation, we’re on the same page entirely… towards eventually discovering how it all fits together :wink:



Sorry if I explain things that tend to be obvious, but I tend to explicitly set the context for what I’m about to explain. Like Matt said - it can be hard to communicate complex ideas in a forum.

When it comes to SMI you know more than me anyway; I’m afraid I’m not up to date with it all. I’ll check it out this weekend so I can get a better understanding of what you’re talking about there.


Thinking through how to actually implement something like this…

Through SMI, the motor actions that I am taking will form predictions in the sensory input space. These predictions will then be strengthened/degraded based on the actual sensory input that I hear. This is the area of focus for current HTM research. Additionally there needs to be a memory of the sensory input that I would like to mimic.

So far this is easy to imagine an implementation for (assuming we can work out the remaining SMI bits like object pooling that captures semantics, etc).

From there, there needs to be a goal (to hear dad’s song), and a plan of action (motor actions to take which I predict will achieve the goal). This is where the implementation details start to get fuzzy. I definitely believe cell grids are a key component here for driving the long-term goal, but low-level RL is also required for tuning actions along the way.

Additionally, we need to solve the problem of breaking down a high-level concept (sing a song) into its lower-level sequential components (tighten vocal cords, exhale, widen mouth, etc). This part in particular cuts to the heart of hierarchy, as you pointed out. I think overall trying to implement a system for “mimicking dad’s song” is an excellent goal to work toward, because if it can be achieved, we will have covered a lot of the main areas that are needed for embodying HTM.


I’m painfully aware that getting a clue of what is or isn’t obvious to others is a challenge in itself.
As far as I’m concerned, you’ll never have to apologize for that. Keeping things explicit and simple goes a long way towards ensuring that we communicate on the same ground.


This is it really. We need a memory model of the world (or bird-song domain) before we can do anything behavior-driven.

I think it could be tackled bottom-up. It learns to produce low-level features of the song first. Those lower-level features combine up the hierarchy after each feature at each level of the hierarchy has a good overlap with the dad song in the auditory hierarchy. But yes, the temporal ‘unfolding’ of these sequences is fuzzy to imagine right now. Jeff wrote about this in On Intelligence, but I forget the details.


We are forming a group to model in HTM the capabilities required to achieve the zebra finch’s ability to “mimic Dad’s song”. Anyone interested in joining or just listening in on the verbose interactive brainstormy stuff is welcome to join our Slack channel. Purpose of the channel is to prevent cluttering the forum or growing this thread to epic proportions.


I just want to point out that when HTM comes around to doing RL that it could have an advantage over most ML models. The reason I believe this is because you can encode the goal as a semantic representation into the model itself. Then using the ideas discussed above the motor output that caused the sensory feedback will be compared internally with the semantically distributed representation of the goal.

In contrast to traditional RL where there is an external signal calculated by the fitness/objective function to provide positive or negative feedback to the model, with HTM the signal is computed internally by comparing sensory feedback with the goal representation.

The dad’s song bird is a good example, but it can be anything else. I’ll take the example of a robot learning to walk. Like with the dad bird, you provide the model with examples of the goal. In this case you show the model many examples of humans walking. With enough examples the model learns a decent hierarchical representation of walking. So given a novel example of walking, it can easily recognize it as walking, no matter how varied or noisy it is, because it has learned the underlying structure that is common to all versions of walking.

This representation is now the objective. As the representation is distributed across a hierarchy as features, it allows for an easy breakdown into semantic parts. Starting the motor learning with noisy motor outputs, the model will tend to get some low-level feature of walking correct, because it is statistically easier to get small trials correct than larger ones (i.e. the robot might stick its leg out). The motor output is fed back into the sensory input in the same form as the examples. The sensory feedback of that action will activate a corresponding low-level feature in the sensory part of the model (so the robot sticking its leg out will be similar to the examples of how humans stick their leg out in the routine of walking). So it will recognize that small actions are related to the goal because they have some level of semantic overlap with it (although only activating a low-level feature).

The robot will tend to repeat the motor actions that correspond to activating features in the sensory region that belong to the goal representation. Every time it repeats these actions it does so with noise and variation, so as to fill in gaps and string actions together into combinations of actions that get represented as motor features going up the hierarchy. Each action at each level of the hierarchy corresponds to a lateral feature in the sensory hierarchy. It could be thought that a motor feature ‘satisfies’ a lateral sensory feature. Trial and error literally stacks on top of previous successes in the motor hierarchy so as to fill it to the top, eventually satisfying the high-level sensory walking feature.

This rich semantic information in both sensory and motor hierarchies could really set HTM apart from other models for RL. It gets away from the short-term stimulus/response reinforcement in other models. Instead it can learn very large motor sequences over a long period of time, given that it has a sensory (or even abstract) representation of the goal high up in the hierarchy.

For a robot to learn walking, this would mean keeping the high-level sensory feature of walking constantly active during learning, so as to provide top-down biasing in the sensory region. When bottom-up sensory input overlaps with the top-down biasing feedback, those overlapping activations feed over to the motor region to form a Hebbian association between the motor output and the sensory input. The connections can go both ways (sensory to motor and motor to sensory) so as to form a sensory/motor feedback loop. This is necessary to allow sensory input to predict motor actions, and motor actions to predict sensory input. This bidirectional connection between sensory and motor repeats at each level up the hierarchy. The coordination between the regions means learning and motor execution happen at the same time.
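One level of that gating can be sketched as below, purely as an illustration of the idea and not of any real HTM region. The integer link weights, the vote threshold, and the names `associate`, `predict_sensory` and `predict_motor` are all assumptions.

```python
def associate(weights, motor, bottom_up, top_down):
    """Hebbian association, gated by agreement between input and goal bias.

    Only sensory bits that are both sensed (bottom_up) and expected by the
    active goal (top_down) get linked to the motor bits that produced them.
    """
    matched = bottom_up & top_down
    for s in matched:
        for m in motor:
            weights[(m, s)] = weights.get((m, s), 0) + 1
    return matched

def predict_sensory(weights, motor, threshold=2):
    """Motor -> sensory direction: which sensory bits do these motor bits predict?"""
    votes = {}
    for (m, s), w in weights.items():
        if m in motor:
            votes[s] = votes.get(s, 0) + w
    return {s for s, v in votes.items() if v >= threshold}

def predict_motor(weights, sensory, threshold=2):
    """Sensory -> motor direction, reading the same links the other way."""
    votes = {}
    for (m, s), w in weights.items():
        if s in sensory:
            votes[m] = votes.get(m, 0) + w
    return {m for m, v in votes.items() if v >= threshold}

weights = {}
goal_bias = {1, 2, 3, 4}     # top-down 'walking' feature held active
leg_out = {10, 11}           # a noisy motor output
sensed = {1, 2, 9}           # its sensory consequence; bit 9 is off-goal
matched = associate(weights, leg_out, sensed, goal_bias)
```

Here `predict_sensory(weights, leg_out)` gives back only the goal-related bits, while the off-goal bit 9 never gets wired in, which is the "comparison" expressed as a coincidence gate instead of an explicit error.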

There is much more to it than this (in theory), and I don’t know how much of it lines up with biology. But I just wanted to spill that out as a high-level idea of why I think HTM could be so valuable when Numenta gets the hierarchy implemented.


Bummer. I can’t seem to get into the slack channel. I get lots of “reconnecting” but no connection.

If I could I would post this:
So - the road ahead: obtain finch song(s). Run said song through an FFT (there may be one built into Audacity) and get a set of encoded values for training. At some point this will have to be built into what Numenta calls an encoder.
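A rough sketch of that first step, with a synthetic two-tone frame standing in for a finch recording (no real song data is used here). The naive DFT is only there to keep the example dependency-free; in practice something like `numpy.fft.rfft` would be the obvious choice. The "strongest bins become the active bits" encoding and the names `dft_magnitudes` and `encode_frame` are assumptions, not Numenta's encoder.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed)
FRAME = 64           # samples per analysis frame (assumed)

def dft_magnitudes(samples):
    """Magnitude spectrum of one frame via a naive DFT (bins 0 .. n/2 - 1)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]

def encode_frame(samples, n_active=2):
    """Toy encoder: the n_active strongest frequency bins become the active bits."""
    mags = dft_magnitudes(samples)
    ranked = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
    return frozenset(ranked[:n_active])

# Synthetic stand-in for one frame of song: 1000 Hz plus 2500 Hz.
frame = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    + math.sin(2 * math.pi * 2500 * t / SAMPLE_RATE)
    for t in range(FRAME)
]
sdr = encode_frame(frame)   # bins 8 and 20 (bin = freq * FRAME / SAMPLE_RATE)
```

A sequence of such per-frame SDRs would be the training stream for the listener model; a real encoder would also need neighbouring frequencies to share bits so that similar sounds overlap.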

Build a network model roughly structured like the auditory cortex through to the temporal lobe. See below for why this terminus is necessary.

Build some sort of tool to see what our training looks like as it progresses. I would call this section “Critic.” It will be necessary to use this tool to assess our future project “Whistler” to see if it sounds anything like our learned song. That implies an output that is used as a training signal to Whistler to vote its efforts up and down.

Note that this may well be part of the reason that the hippocampus exists and why it is tied to the amygdala: To remember recent history and add good/bad reinforcement weighting to the memory before consolidation.

As far as the connection between the listener and Whistler parts, this will also have to go to the forebrain drive to make the song and decide to try again if it does not match the Critic fitness evaluation.

This is NOT a direct connection between the two hierarchies.

The “limbic” section calls for some vocal play as a need to explore this part of the motor system. The critic calls it good or bad. At first any sound might be considered good. As it starts to match remembered songs there would be more “goodness” reinforcement.
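One possible reading of that Critic, as a sketch only: it votes an attempt up when it matches the remembered song at least as well as anything heard so far, so at first any sound passes and the bar rises as the matches improve. The class name, the +1/-1 votes, and the set-overlap score are assumptions made for the example.

```python
class Critic:
    """Votes Whistler's attempts up (+1) or down (-1) against a remembered song."""

    def __init__(self, remembered_song):
        self.song = frozenset(remembered_song)
        self.best_seen = 0   # no expectations yet, so any first sound is 'good'

    def judge(self, heard):
        score = len(self.song & set(heard))
        vote = 1 if score >= self.best_seen else -1
        # The bar only ever rises: later attempts must match at least as well.
        self.best_seen = max(self.best_seen, score)
        return vote

critic = Critic({1, 2, 3, 4})
votes = [
    critic.judge({9}),         # early noise is still voted up
    critic.judge({1, 2}),      # a partial match raises the bar
    critic.judge({1}),         # now a weaker match is voted down
    critic.judge({1, 2, 3}),   # a better match is voted up again
]
```

A graded vote (for example, score minus the current bar) would be a closer fit to "more goodness reinforcement", at the cost of one more free parameter.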

What do we name our bird listener model?


BTW there were Slack problems across the internet today. I can now connect.