Thinking through how to actually implement something like this…
Through SMI, the motor actions I take will form predictions in the sensory input space. These predictions will then be strengthened or degraded based on the actual sensory input that I hear; this is the area of focus for current HTM research. Additionally, there needs to be a memory of the sensory input that I would like to mimic.
So far, it is easy to imagine an implementation for this (assuming we can work out the remaining SMI pieces, like object pooling that captures semantics, etc.).
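To make that concrete, here is a minimal sketch of the strengthen/degrade idea, assuming sensory input is a binary vector and each motor action keeps a vector of permanence-like weights over sensory bits. The class name, the `inc`/`dec` parameters, and the `target_song` buffer are all placeholders of my own, not anything from an existing HTM library:

```python
import numpy as np

class SensorimotorPredictor:
    """Toy predictor: each motor action maps to permanence values over
    sensory bits. Predictions are strengthened when the predicted bits
    actually become active and degraded when they do not (loosely
    mirroring HTM-style learning)."""

    def __init__(self, n_actions, n_sensory_bits, inc=0.05, dec=0.02):
        self.perm = np.zeros((n_actions, n_sensory_bits))
        self.inc, self.dec = inc, dec

    def predict(self, action, threshold=0.5):
        """Sensory bits we expect to see after taking `action`."""
        return self.perm[action] >= threshold

    def learn(self, action, actual_input):
        """Strengthen permanences on bits that were actually active,
        degrade them on bits that were not."""
        active = actual_input.astype(bool)
        self.perm[action, active] += self.inc
        self.perm[action, ~active] -= self.dec
        np.clip(self.perm[action], 0.0, 1.0, out=self.perm[action])


# Separate memory of the target: the sequence of sensory inputs
# (e.g. dad's song) that we would later like to reproduce.
target_song = []  # binary sensory vectors, appended while listening
```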
From there, there needs to be a goal (to hear dad’s song) and a plan of action (motor actions which I predict will achieve the goal). This is where the implementation details start to get fuzzy. I definitely believe grid cells are a key component here for driving the long-term goal, but low-level RL is also required for tuning actions along the way.
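For the low-level tuning part, I imagine something as simple as this sketch: the long-term goal (the next chunk of the target song) supplies a reward signal, and a small bandit-style RL loop nudges the choice of motor action toward whatever reproduces that chunk best. The `ActionTuner` class and the overlap-based reward are assumptions for illustration, not a proposal for how grid cells would actually encode the goal:

```python
import numpy as np

def overlap(produced, goal):
    """Reward: fraction of the goal's active bits reproduced in the output."""
    return np.sum(produced.astype(bool) & goal.astype(bool)) / max(np.sum(goal), 1)

class ActionTuner:
    """Bandit-style RL over candidate motor actions. The goal supplies
    the reward; this only handles the low-level 'which action gets me
    closer' part."""

    def __init__(self, n_actions, lr=0.1, epsilon=0.1):
        self.values = np.zeros(n_actions)
        self.lr, self.epsilon = lr, epsilon

    def choose(self):
        # Epsilon-greedy exploration over discrete motor actions.
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.values))
        return int(np.argmax(self.values))

    def update(self, action, reward):
        self.values[action] += self.lr * (reward - self.values[action])
```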
Additionally, we need to solve the problem of breaking down a high-level concept (sing a song) into its lower-level sequential components (tighten vocal cords, exhale, widen mouth, etc.). This part in particular cuts to the heart of hierarchy, as you pointed out. Overall, I think trying to implement a system for “mimicking dad’s song” is an excellent goal to work toward, because if it can be achieved, we will have covered many of the main areas needed for embodying HTM.
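Just to show the shape of the decomposition problem, here is a toy sketch where the hierarchy is written out by hand. In a real system these mappings would have to be learned, which is the hard part; the goal names and primitives below are purely illustrative:

```python
# Hypothetical hand-written hierarchy; a real system would need to
# learn these mappings rather than have them specified.
HIERARCHY = {
    "sing_song":   ["sing_note_A", "sing_note_B", "sing_note_A"],
    "sing_note_A": ["tighten_vocal_cords", "exhale", "widen_mouth"],
    "sing_note_B": ["loosen_vocal_cords", "exhale", "narrow_mouth"],
}

def expand(goal):
    """Recursively break a high-level goal into the ordered sequence of
    motor primitives that (we hope) realizes it."""
    if goal not in HIERARCHY:   # already a motor primitive
        return [goal]
    steps = []
    for sub in HIERARCHY[goal]:
        steps.extend(expand(sub))
    return steps

print(expand("sing_song"))
```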