Esperanto NLP using HTM and my findings

Hi all.
I have been messing with NLP using Temporal Memory on the weekends. I coded up a simple program and made it learn the structure of Esperanto sentences. (I’m testing with Esperanto because the language itself is regular and has obvious features.)

And my results are… not surprising. HTM isn’t the most powerful ML model yet; it performs close to, or slightly better than, an RNN. Yet it is still powerful in the sense that there is no need to backprop through the entire sequence, which possibly makes it suitable for long sequences.

This is my blog post detailing the results.

BTW, I guess it is not really NLP since Esperanto is not a natural language. :stuck_out_tongue:

10 Likes

Well… I did some more experiments. And now HTM is learning way better.
The changes boil down to the following.

  1. Sample from the predicted distribution. Always taking the character with the highest predicted probability causes the same words to repeat (see the sketch after this list).
  2. Add some noise, if not using an SP with boosting, to break symmetry inside the TM.
  3. Better hyperparameters
  4. Better dataset
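
Here is a minimal sketch of what point 1 could look like in code, assuming you already have one non-negative score per character (for example, how strongly the TM predicts each character’s bucket). The names `scores`, `alphabet`, and `temperature` are placeholders, not from the actual project:

```python
import numpy as np

def sample_next_char(scores, alphabet, temperature=1.0, rng=None):
    """Pick the next character by sampling instead of taking the argmax.

    `scores` holds one non-negative value per entry of `alphabet`.
    Always taking the argmax tends to loop on the same few words;
    sampling keeps the generated text varied.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float) + 1e-9   # avoid an all-zero vector
    probs = scores ** (1.0 / temperature)             # <1 sharpens, >1 flattens
    probs /= probs.sum()
    return alphabet[rng.choice(len(alphabet), p=probs)]
```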

One key observation is that setPredictedSegmentDecrement helps the HTM learn the structure and memorize common words. The TM performs much worse without it.

Now the algorithm is outputting something like this:

Tis al liajlej vo ka pronis altaŭ pli kstaŭ pro kortonis al laĉanto. Kun li ĵeokonis laĉanton si sukcesi ĉicestrigis al la turo korpofteno pren de alto. I nto en simirviro korpor blaj multo.Amirĥ.Sensonis en diren li s en sen sen linokorpoforto. Zohomalprektis laĉofacili bom .En simeno.Kiameno.Ake i aŭ pren lino ke iajŭen direktis lino.Kiamalpaiano pron larĝa dustrigaĥĥhetrige is en la arbekun si

It is very clear now that the HTM is learning the construction of Esperanto words. Most words end with a, e, o, i, as, is, or n. It has learned that there should be a space after a period (though I don’t know why it doesn’t predict that in some cases). And it has even memorized some common words like la, li, aŭ and ke.

Again, source code is available on GitHub.

I think the result is comparable to the first experiment in this post. Am I getting somewhere?

7 Likes

Fascinating work @marty1885! It’s interesting to see your findings there.

Do you mean that it does better with decrementing incorrectly predicted segments than without doing it at all? Did you try different increment/decrement ratios? Also, are you predicting one character at a time using a category encoder for each different character? Great work.

Thank you.


Yes, decrementing incorrectly predicted segments leads to a better result. Without it, the HTM generates really long words.

Ŝaj buŝondajne utis cer lcesterage onton senta korpoformomeĝon.Kio.Lcilmomes junidis arbaimilludis oko arigecer lmfŭtod.Sehomzor pripeĉo.Nitriis da vaĵvaspŝisonis ontigraĵo irontigaicaj le zormome.Niĥĉodkiunnbrue ĝustor pol eblkceskjo.Lbuŝocestagratis bruoe. Kor ie adriis bruo.Eovosoni detaŭdidis ĉi ta lhealisoj skfacilmomege ĉefe or policajn ĝis.Irtimilsekmirgisrorpoforipeŝoblisvestetorigeĥkunuli


Setting it too high leads to repetition and only generating short/common words.

Sen d sen dla l li dro kaj la . Ciloŭ noŝahon siro kajn peĝiris la brtupeĝone ajn pren dis la ŝovo dris la ce aze li ekto li plen dla sen du feta . En dulon pris li lovo li sta moboal li zeŝamen dren dŝiro kaj la o kie aranoĵola to li brj la aa sen dhone aon sen dblgatubono. Pris la ĉi brfacadiro kaj li ĵoĵohonoĝacaŝoĥen dĵoen dze kien dhone aa done anois la sta pidiris li ĥvesiro kajn sen dhon


Yes, I found that the basic rule is that the increment should be slightly lower than the decrement, e.g. increment=0.03, decrement=0.032. And yes, I encode each character using a category encoder, so the model is a character-level one.
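
For readers who want to picture the character-level category encoder, here is a minimal sketch of the idea, not the actual implementation: each distinct character gets its own non-overlapping block of active bits, so no two characters share any semantics. The class name and bit counts are illustrative only.

```python
import numpy as np

class CharCategoryEncoder:
    """Toy category encoder: one non-overlapping block of bits per character."""

    def __init__(self, alphabet, bits_per_char=16):
        self.alphabet = list(alphabet)
        self.bits_per_char = bits_per_char
        self.size = len(self.alphabet) * bits_per_char
        self.index = {c: i for i, c in enumerate(self.alphabet)}

    def encode(self, char):
        sdr = np.zeros(self.size, dtype=bool)
        start = self.index[char] * self.bits_per_char
        sdr[start:start + self.bits_per_char] = True   # contiguous active block
        return sdr

# enc = CharCategoryEncoder("abcdefghijklmnoprstuvzĉĝĥĵŝŭ. ,")
# sdr = enc.encode("ĉ")   # bits_per_char active bits out of enc.size
```

Here `bits_per_char` is an arbitrary choice; with a pure category encoder there is no semantic overlap between characters to tune.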

1 Like

Very interesting that the decrement should be even higher than the increment value; I wouldn’t have guessed that, but it does make sense now that you describe it. I wonder if this could be slightly adjusted to do next-word prediction, like when you text and it predicts a few possible next words (some of which usually make no sense). Thank you again for sharing!

1 Like

Next-word prediction is certainly doable. But as I pointed out in another post, the TM does not give me any sense of probability, which makes predicting the next character a lot less accurate.
https://discourse.numenta.org/t/possiblity-of-global-inhibition-in-temporal-memory/
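
For what it’s worth, here is one heuristic sketch of how a pseudo-distribution could be read out of the TM even though it does not produce probabilities: count how many predictive columns fall inside each character’s encoder block and normalize the counts. This assumes the column layout mirrors the category encoder above; the function and parameter names are hypothetical.

```python
import numpy as np

def pseudo_distribution(predictive_columns, num_chars, cols_per_char):
    """Heuristic next-character 'probabilities' from TM predictive columns.

    Each character owns a contiguous block of `cols_per_char` columns
    (matching the category encoder), so predictive columns falling in a
    block are counted as votes for that character. The TM itself does
    not output probabilities; this is only an approximation.
    """
    votes = np.zeros(num_chars)
    for col in predictive_columns:
        votes[col // cols_per_char] += 1
    if votes.sum() == 0:                      # nothing predicted: fall back to uniform
        return np.full(num_chars, 1.0 / num_chars)
    return votes / votes.sum()
```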

No need! I think sharing projects is one of, if not the, most important ways for the community to grow. :smile:

3 Likes

Mi parolis la esperantan, sed mi ne parolas pli. Mi forgesis, kar mi neniam praktikis, sed en 2003 mi gastigis francan virinon ĉe mia domo en Savanto, Brazilo. Mi ^sin renkontis en liston de Esperanto-parolantoj el grupoj de yahoo. Mi perdis tuŝon sed ĝi estis bonega sperto. Ŝi vojaĝis tra Sud-Ameriko. Mi perdis vian kontakton.
Tio estis bonega sperto.

***** Mi uzis google tradukiston por helpi min skribi ĉi tion.

2 Likes

I spoke Esperanto, but I don’t speak it anymore. I forgot it, since I never practiced, but in 2003 I hosted a French woman at my house in Salvador, Brazil. I met her on a list of Esperanto speakers from Yahoo! groups. I lost touch, but it was a great experience. She traveled through South America. I lost your contact.
That was a great experience.

***** I used a Google translator to help me write this.

********** I used a Google translator to help me translate this.

3 Likes

Ŝi vojaĝis mia domo en merikis sed ĝizilo. Mi pli. Sperto. Ŝi vojaĝis tra sudamerikis sed ĝiĵorgesisrto. Ŝi vojaĝis tra sudamerikis sed ĝituŝon gastigis francancan virinon ĉe mi ne parolas ĉe mia de ahoo. Mia de ahoo. Mi forgesiston.Tio esperantoparolaniam praktikis sed ĝimo ed ĝiniam pli. Forgesisi forgesisĵlistoj de esperantancan virinon.Tio el grupoj de esperantoj de estis virinon.Tio el grupoj de ahoo. Mia domo eniam pli. Jaĝis tra sudamerikis sed ĝiĉe mi pli. Can vian kar mi forgesismia de esperantancancancancan virinon.Tio estis pli. Gis frantoj el grupoj domo en listoj el grupoj

**** What HTM generates after learning from your paragraph.

Nice to meet another Esperanto speaker.
Bone rekonti alian Esperanto parorantan.

2 Likes

(plene mirinda!) Totally wonderful!
(bela laboro, HTM) nice job, HTM!

2 Likes

Whoaa hold on a sec, I just want to fathom what the HTM did here. So @marty1885, you passed @Matheus_Araujo’s paragraph into your HTM letter by letter:

Mi parolis la esperantan, sed mi ne parolas pli. Mi forgesis, kar mi neniam praktikis, sed en 2003 mi gastigis francan virinon ĉe mia domo en Savanto, Brazilo. Mi ^sin renkontis en liston de Esperanto-parolantoj el grupoj de yahoo. Mi perdis tuŝon sed ĝi estis bonega sperto. Ŝi vojaĝis tra Sud-Ameriko. Mi perdis vian kontakton.
Tio estis bonega sperto.
***** Mi uzis google tradukiston por helpi min skribi ĉi tion.

Then your HTM generated the paragraph:

Is that right?

If so I’m curious:

  • once the model was trained on his paragraph, what did you feed in to invoke the predictions of all those characters? or did it start predicting them online?

  • was his paragraph the only data the model had ever seen?

  • does the paragraph generated by the model make logical sense in Esperanto?

I’m no expert in this area at all, just intrigued by your work here. I know there are LSTM systems used to generate content and even stories one character at a time, and it seems a worthy pursuit to compare HTM on that task. I know there are many potential ways to encode natural language into HTM (like Cortical.io’s approach), but this way of treating characters as distinct categories is certainly simpler to understand (if I have it right). I can’t help but wonder how far it could go. Anyways, thanks again for sharing, and I hope I don’t burden you too much with these questions.

That’s the translation of what the HTM said:

Ŝi vojaĝis mia domo en merikis sed ĝizilo. Mi pli. Sperto. Ŝi vojaĝis tra sudamerikis sed ĝiĵorgesisrto. Ŝi vojaĝis tra sudamerikis sed ĝituŝon gastigis francancan virinon ĉe mi ne parolas ĉe mia de ahoo. Mia de ahoo. Mi forgesiston.Tio esperantoparolaniam praktikis sed ĝimo ed ĝiniam pli. Forgesisi forgesisĵlistoj de esperantancan virinon.Tio el grupoj de esperantoj de estis virinon.Tio el grupoj de ahoo. Mia domo eniam pli. Jaĝis tra sudamerikis sed ĝiĉe mi pli. Can vian kar mi forgesismia de esperantancancancancan virinon.Tio estis pli. Gis frantoj el grupoj domo en listoj el grupoj

She traveled my house in merchandise but hooked. Me the more. Experience. She traveled through South America but a song concert. She traveled through South America but a funeral hosted a French woman with me does not talk to me at Yahoo. Ahoo mine I’m a forgettler. An Esperanto speaking person has practiced but it’s just about it. Forgotten recorders of a Esperantist woman. This is a group of Esperantists of a woman. This is Yahoo group. My house never more. It wore through South America but it was more than me. “Can” you because a forgetfulness of a woman of “Esperantancecancancan” woman. That was more. Since the groups of groups are grouped in lists of groups

I used Google Translate to translate it and made some changes based on what I remember of the language I once spoke.

1 Like

:thinking:

3 Likes

I think it has a pejorative meaning in English, doesn’t it?

Thanks for the translation! It makes sense to me that even if the whole sentences don’t make sense, the letter-to-letter transitions almost all do, except for ‘Esperantancecancancan’ I suppose.

I have to marvel at this last line too.

This makes me wonder what it would sound like if the HTM were fed words one at a time, with each word as a different category (surely necessitating some vocab limit). That of course gets more complex because many words are similar or have similar meanings, which gets lost when they’re treated as distinct categories (enter Cortical.io’s Semantic Folding approach). If nothing else, it would further show the general-purpose ability of HTM to learn many transitions from a limited amount of data, regardless of the data type. Thanks again for sharing.
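
To make the word-level idea concrete, here is a rough sketch of what such an encoder might look like, with the vocabulary limit handled by mapping rare words to a shared "unknown" bucket. All names and sizes here are illustrative, not from the original project.

```python
import numpy as np
from collections import Counter

class WordCategoryEncoder:
    """Toy word-level category encoder with a capped vocabulary."""

    def __init__(self, corpus_tokens, vocab_size=2000, bits_per_word=16):
        common = [w for w, _ in Counter(corpus_tokens).most_common(vocab_size)]
        self.vocab = {w: i for i, w in enumerate(common)}
        self.unknown = len(self.vocab)            # shared bucket for out-of-vocab words
        self.bits_per_word = bits_per_word
        self.size = (len(self.vocab) + 1) * bits_per_word

    def encode(self, word):
        idx = self.vocab.get(word, self.unknown)
        sdr = np.zeros(self.size, dtype=bool)
        start = idx * self.bits_per_word
        sdr[start:start + self.bits_per_word] = True
        return sdr

# tokens = "la hundo vidis la katon".split()
# enc = WordCategoryEncoder(tokens, vocab_size=100)
# sdr = enc.encode("hundo")
```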

1 Like

I’m really amazed by it.

Yes and no on the vocabulary limit.
If you programmed the retina at the same time as the sequence generator, like the brain does, then the two units should always be perfectly synchronized.

1 Like

Which units are you referring to, @Bitking?

I assume that he is referring to training two units: a “word generator” (instead of letters) and a word sequence generator (grammar?). There is no reason that the word generator could not be trained with letters as is being done here and now.

1 Like

Oh, I see what you mean now. Thanks! I just don’t have enough knowledge to infer whether you’re right. I think that if we’re treating HTM as analogous to our brains, it should be able to learn from character sequences. But to make it easier for HTM to learn a language, we should use the same process a baby does: learning the words first by listening, and then expressing through the learnt words. We only learn how to read and write a language after we can understand it and speak it.

2 Likes