Re-Organisation of the C++, PY Community repos

community
structure
repo
hackers

#1

EDIT:
Status

TL;DR: How to structure/organize these repos and their dependencies

Topic (re)started from here: NEW htm-community forks: nupic.cpp, nupic.py

There have been earlier discussions (quite long, but with good points):


(TODO find other threads)

Current picture:

  1. nupic.PY (fork of nupic)
    https://github.com/htm-community/nupic.py
  • the oldest repo
  • Python
    • python only code
    • wrappers for c++-optimized classes (used by default, decent speed)
  • tests
  • serialization
  • NAPI (API to create the HTM models)
  • swarming (parameter optimization)
  • almost 1:1 feature and CODE parity with c++

  2. Nupic.CPP (fork of nupic.core):

  • c++11, focused on speed
  • tests
  • serialization
  • NAPI
  • nupic.bindings (the SWIG bindings that expose the library to other languages)

  3. Htm.JAVA

  • 1:1 feature parity with nupic, official fork
  • does not depend on any of the repos, afaik

Now, the questions are:

  1. Do we want to keep the 1:1 feature and code parity between the repos?
  2. What features/structure do we want to keep in the repos? I.e., how should we split them into modules?

NEW htm-community forks: nupic.cpp, nupic.py
#2
Do we want to keep the 1:1 feature and code parity between the repos?

  • PROs:
    • it’s been a lot of work, so we could keep it
    • nice for testing, where we can run parallel py:cpp tests to confirm new PRs
    • easier for the users to have all-in-one tuned install
  • CONs
    • a LOT of work for PRs; it can deter innovation, especially if you work in one language and don’t care about the other.
    • makes more complex refactoring very hard
    • less KISS
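
A minimal sketch of what such a parallel py:cpp parity check could look like. Both functions are toy stand-ins written in Python (the names `overlap_reference`, `overlap_optimized` and `check_parity` are made up for illustration and are not taken from either repo); in practice the second one would be the C++-backed class exposed through the bindings.

```python
import random


def overlap_reference(active_inputs, connected_synapses):
    """Pure-Python reference: per column, count connected inputs that are active."""
    return [sum(1 for i in synapses if i in active_inputs)
            for synapses in connected_synapses]


def overlap_optimized(active_inputs, connected_synapses):
    """Stand-in for a C++-backed implementation (here just a set-based variant)."""
    active = set(active_inputs)
    return [len(active & set(synapses)) for synapses in connected_synapses]


def check_parity(impl_a, impl_b, n_cases=50, seed=42):
    """Feed both implementations identical random inputs; return mismatching case ids."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    mismatches = []
    for case in range(n_cases):
        active = rng.sample(range(64), 10)
        columns = [rng.sample(range(64), 8) for _ in range(16)]
        if impl_a(active, columns) != impl_b(active, columns):
            mismatches.append(case)
    return mismatches


parity_failures = check_parity(overlap_reference, overlap_optimized)
```

If `parity_failures` comes back empty, the two implementations agreed on every generated case; a PR that changes the behaviour of only one side would show up as a non-empty list.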

What features/structure do we want to keep in the repos? I.e., how should we split them into modules?

Similar arguments go for the monolithic vs. modular repo organization. The monolithic one is more “stable”, slower to develop, and ensured to work. The modular one is more rapid and makes it easier for new people to jump in, or to work only on the features they are interested in.
We will see these problems in what people are currently working on: Proposal to introduce pybind for move toward Python 3 compatibility, optimization, porting, etc.

I’ll post my proposed approach


#3

@chhenning already has a repo that he is working on that re-structures the core. This will contain the Python 3 and Python 2.7 interfaces. SWIG and CapnProto will have been removed (I think that is the plan) and file structure moved around. So when it is ready he will upload it and then it can be the base of the new nupic.community.core module (which you call nupic.cpp).

I will then add my C++/clr interface (windows managed code) which can also be used by C# clients. In the process of doing that I will also add some missing C++ code that would allow conventional C++ clients to register into the Network class any local C++ algorithms (not part of core) in the same way that a Python client can register local Python classes with the Network class.

In answer to an earlier question: yes, we do need a C++ interface. There are a few other parts needed to complete the set so that I can write my own algorithm in C++, register it with the Network class in the core, and use that to try out my algorithm…in other words, a C++ client.
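
The registration idea can be sketched roughly as follows. This is written in Python for brevity and is only an illustration of the pattern: `Network`, `register_region`, `add_region` and `ThresholdRegion` here are hypothetical toy names, not the actual nupic.core API. A C++ version would presumably keep a similar map from a type name to a factory function.

```python
class Network:
    """Toy stand-in for the core Network class (hypothetical, not the NuPIC API)."""

    _region_factories = {}  # region type name -> callable that builds a region

    @classmethod
    def register_region(cls, type_name, factory):
        """Make a locally defined algorithm available under `type_name`."""
        cls._region_factories[type_name] = factory

    def __init__(self):
        self.regions = []  # regions run in the order they were added

    def add_region(self, type_name, **params):
        """Instantiate a registered region type and append it to the chain."""
        region = self._region_factories[type_name](**params)
        self.regions.append(region)
        return region

    def run(self, data):
        """Pass the data through every region in turn."""
        for region in self.regions:
            data = region.compute(data)
        return data


class ThresholdRegion:
    """Example of a client-side algorithm that is not part of the core."""

    def __init__(self, threshold):
        self.threshold = threshold

    def compute(self, values):
        return [1 if value >= self.threshold else 0 for value in values]


# The client registers its own class, then uses it like any built-in region.
Network.register_region("ThresholdRegion", ThresholdRegion)
network = Network()
network.add_region("ThresholdRegion", threshold=0.5)
result = network.run([0.2, 0.5, 0.9])
```

The point of the design is that the core only knows the name and the factory, so the client's algorithm never has to be compiled into the core itself.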


#4

I’ll post my proposed approach

Ok, so here is one of the possible ways to handle the situation.
I’m more the “modular guy”, but I will try to work around that and make the best of both worlds.

  • A) nupic.core.c++

    • blazing fast, small, c++11 (17) only
    • encoders, SP, TM, Anomaly + tests
    • the go-to place to understand the HTM concepts, refactor, or spin off your own completely different fork
    • the API is the public methods of the classes; the data type is the output vector (std::vector, Eigen, …)
    • A1) nupic.bindings
      • builds on core.c++
      • provides bindings to all other clients (c++, c#, py, …)
      • c++-optimized wrappers for PY will be moved here (i.e. TM==cppImpl)
      • NAPI - probably has to be provided here, if we don’t want to have yet another set of bindings (too much)
  • B) nupic.cpp

    • c++ client of nupic.core.c++ & nupic.bindings
    • all other features, serialization
    • it would be nice if NAPI could be here(?)
    • just “any other repo”, just like nupic.c#, …
  • C) nupic.py

    • the python repo
    • already without the c++ wrappers (in bindings)
    • possible future separation too into: core, client
  • D) nupic.all.{py,cpp}

    • here comes the gimmick
    • slightly duplicative repos that integrate all the previous sub-modules; no new code/features added here
    • advantages:
      • keeps the existing monolithic structure for those who want it (Numenta, external projects)
      • allows us to run the PY:CPP tests, …

#5

I don’t understand your layout.
A is OK, but I still don’t like the name.


#6

Can I help explain the fuzzy bits? (don’t care about the name now)

In your other reply:

@chhenning already has a repo that he is working on that re-structures the core. This will contain the Python 3 and Python 2.7 interfaces. SWIG and CapnProto will have been removed (I think that is the plan) and file structure moved around.

It’s this thread, right?
https://discourse.numenta.org/t/proposal-to-introduce-pybind-for-move-toward-python-3-compatibility?source_topic_id=3327
(I’m still unsure about dropping serialization and SWIG so quickly… but maybe it can be mitigated? Continued in that thread.)

So when it is ready he will upload it and then it can be the base of the new nupic.community.core module (which you call nupic.cpp).

With the PROPOSED restructure, would that be A) nupic.core.c++?

I will then add my C++/clr interface (windows managed code) which can also be used by C# clients. In the process of doing that I will also add some missing C++ code that would allow conventional C++ clients to register into the Network class any local C++ algorithms

This would be nupic.bindings (in the new sense, not the current one)?

and use that to try out my algorithm…in other words, a C++ client.

Ah, a client of “nupic”; I call that a 3rd-party app. I do that too and have had this problem.
Maybe you want to start a separate thread for that, as it could also be an interesting discussion?
In a C++ app, I simply used the public API, no changes needed. For non-C++ apps on the local machine I used IPC or network protocols.


#7

I prefer to keep the core and all of the language interfaces together so we don’t have version issues. But if we must break it up into separate repositories then perhaps it would be something like

  • core (the algorithms implemented in C++) as a static library.
  • Execution Harness (Network and Region classes)
  • Python interface
  • C++ interface
  • C++/clr interface
  • C# interface

The problem is that each of the interfaces needs include files from the core (not just a library). So if they are in separate repositories the repositories MUST be the same version and placed in parallel directories or nothing will build.

The original nupic repository (all of the Python code) is a separate thing. The Python 3 conversion will change those so we need a place to put them but that is not the current topic. The Java port is also a separate thing.


#8

oops, the Python interface is actually two… 3.x and 2.7


#9

@rhyolight how do you reference bigger groups here (#htm-hackers, #committer-lounge)? This is a big step, so it would be good to know how most people feel about it. Or at least we could come up with some ITERATIVE process, so we can learn and adapt along the way. Maybe simply start by separating nupic.cpp (core) and nupic.bindings (from core)?


#10

@breznak have you read the discussions we have been having over the last few weeks?


#11

Ok, I think we are getting to understand each other…

core (the algorithms implemented in C++) as a static library.

  • tests and include headers

Execution Harness (Network and Region classes)

Sounds good. Will the serialization be kept here? (I don’t think it’s a good idea to ditch it completely.)

Python interface
C++ interface
C++/clr interface
C# interface

All the interfaces. So you propose it’s better to have separate interfaces (“bindings.py”, “bindings.c#”) rather than all of them in one codebase? Probably yes, the repos will then be quite small and manageable.

The problem is that each of the interfaces needs include files from the core (not just a library). So if they are in separate repositories the repositories MUST be the same version

git submodules, or ideally some dependency management (like pip, but for C++).
The problem you mention would manifest only with BREAKING changes in the API, while you can still do a lot of development inside “your submodule” repo without breaking the outside consumers. Interfaces also enforce good coding practices.
My main argument for this is that NuPIC is already quite old, so the interfaces should be more or less stable by now.
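
To make the “only BREAKING changes matter” point concrete, here is a small sketch of a compatibility check a binding repo could run against the core it is built with. It assumes the repos would follow semantic versioning (major.minor.patch, with the major number bumped on breaking API changes), which nobody in this thread has actually agreed on; `parse_version` and `is_compatible` are illustrative names.

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints, e.g. '1.4.2' -> (1, 4, 2)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(core_version, required_version):
    """A binding built against `required_version` accepts any core release with
    the same major number (no breaking API change) that is at least as new."""
    core = parse_version(core_version)
    required = parse_version(required_version)
    return core[0] == required[0] and core >= required


# A newer core within the same major version is fine ...
newer_minor_ok = is_compatible("1.4.2", "1.2.0")
# ... but a new major version or an older core is rejected.
new_major_ok = is_compatible("2.0.0", "1.2.0")
older_core_ok = is_compatible("1.1.9", "1.2.0")
```

With a rule like this, the core can keep releasing minor and patch versions without every interface repo having to move in lock-step.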


#12

@chhenning I would be very interested in optimizing the hell out of the C++ HTM code. I would love to see your changes and merge them into the proposed structure!

I did some performance analysis… do you have a thread about #optimization? If not, can we start one? I’ve done:

  • some micro-benchmarks
  • considered ditching the hand-made ASM
  • initial test with eigen
  • performance analysis of the HTM chain

#13

Not quite.

What I prefer is to have everything as one repo.
That list shows the sections that it contains.
We have been working on this for a while so that the Python interface is not as intertwined with the other code.

The Execution Harness calls serialization. The original code had two types of serialization, CapnProto and Yaml. The Yaml one is simpler (although slower). The CapnProto serialization was quite invasive and complicated. By removing it we simplified everything a lot.

The execution harness is not really an application although I guess you could use it as such. Read about it in the API specification (Network API).


#14

If nobody is actually using the Network API and the Regions, it would simplify everything to drop that too. About half the code (I would guess) is in some way supporting that API.


#15

I suggest you start a topic in #htm-hackers with a poll if you have questions about the preferences of the community.


#16

@rhyolight do you know if the Network ‘framework’ is being used? Or would it be simpler for people to learn if we provided language interfaces to just the algorithms, and perhaps some example code in each language showing how to connect them up and pass data around?

I personally think the Network harness is kind of nice (and efficient) but do people understand what it is for?
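
As a rough idea of what “example code showing how to connect them up” might look like without the Network harness, here is a toy sketch. The two functions are deliberately simplified stand-ins (`encode_scalar` and `select_top_columns` are made-up names, and this is not how the real encoders or Spatial Pooler are implemented); the only point is that the client passes plain data from one step to the next itself.

```python
def encode_scalar(value, minimum, maximum, n_bits=16, width=4):
    """Toy scalar encoder: a run of `width` active bits whose position tracks the value."""
    value = max(minimum, min(maximum, value))  # clip into the supported range
    start = round((value - minimum) / (maximum - minimum) * (n_bits - width))
    return [1 if start <= i < start + width else 0 for i in range(n_bits)]


def select_top_columns(bits, column_inputs, n_active=2):
    """Toy pooling step: pick the columns that overlap most with the active bits."""
    overlaps = [sum(bits[i] for i in inputs) for inputs in column_inputs]
    # Highest overlap first; ties are broken by the lower column index.
    ranked = sorted(range(len(column_inputs)), key=lambda c: (-overlaps[c], c))
    return sorted(ranked[:n_active])


# Each column looks at a fixed block of four input bits.
column_inputs = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# The "network" is just function calls: encoder output feeds the pooling step.
bits = encode_scalar(0.0, 0.0, 10.0)
active_columns = select_top_columns(bits, column_inputs)
```

The trade-off is that every client has to write this glue code itself, which is exactly what the Network harness takes care of.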


#17

@breznak As @David_Keeney already pointed out we have a somewhat working community port. There is a visual studio solution which showcases the current structure. You can even run python inside the solution. :slight_smile:

A few things I did to nupic.core:

  • removed all python code and python helper code
  • removed dependencies such as Apache Portable Runtime (will use c++11 or boost)
  • removed Cap’n Proto. This might be controversial, but so far it has caused too many issues. Amazingly, a lot of compiler warnings disappeared. :wink: Eventually, we will have to find a way to save and load networks.
  • Added the use of smart pointers to the engine (needed that for pybind)

I agree with you that nupic.core’s biggest offering is the algorithms. The network stuff is neat but there might be better solutions out there.

There is Eigen but there is also Blaze. I’m not sure which one fits our purposes best. A good place to start benchmarking, no?


#18

Getting all the better :wink: … is it this one? https://github.com/chhenning/nupic.core

There is a visual studio solution which showcases the current structure.

Is it this thread? or where is it drafted? https://discourse.numenta.org/t/nupic-core-c-bindings
I don’t have VS, but I can review the branch, some image or description would be helpful.

removed all python code and python helper code
removed dependencies such as Apache Portable Runtime (will use c++11 or boost)
removed Cap’n Proto. This might be controversial, but so far it has caused too many issues. Amazingly, a lot of compiler warnings disappeared. :wink: Eventually, we will have to find a way to save and load networks.
Added the use of smart pointers to the engine (needed that for pybind)

All sounds good to me, except the removed serialization. Is something (YAML) working? Do you think we can get something in a reasonable time? For many experiments… you need a model that trains for a loong time.


#19

Getting all the better :wink: … is it this one? https://github.com/chhenning/nupic.core

This one is old and should be deleted soon. Please use https://github.com/chhenning/nupic

Here is the structure that I have created in conjunction with @David_Keeney and @heilerm.

  1. nupic.core – static cpp lib
  2. nupic.core_test – nupic.core unit tests using gtest
  3. nupic.python.algorithms – pybind11 module
  4. nupic.python.engine – pybind11 module
  5. nupic.python.math – pybind11 module
  6. nupic.python27.algorithms – pybind11 module
  7. nupic.python27.engine – pybind11 module
  8. nupic.python27.math – pybind11 module
  9. nupic.python_test – cpp app to test python modules
  10. python3 - nupic python 3 port
  11. yaml-cpp - static cpp yaml lib

I also provide all necessary binaries to build the system. This obviously only works for Windows.

There are some more serialization methods like writeToString for some of the algorithms. Matrices can also be saved to and loaded from file. But it’s all rather crude. I’m open to suggestions like boost::serialization (saving to a zipped archive, for instance), Cap’n Proto, cereal, etc. But I think we need to talk about what we are actually trying to achieve.
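
For the discussion of requirements, this is the kind of round trip any of those options would have to support. The sketch uses JSON from the Python standard library purely as a placeholder format; `ToySpatialPooler`, `write_to_string` and `read_from_string` are hypothetical names that only mirror the writeToString idea and are not the nupic.core classes.

```python
import json


class ToySpatialPooler:
    """Hypothetical stand-in for a trained model; not the NuPIC class."""

    FORMAT_VERSION = 1  # lets a loader reject state written by an incompatible version

    def __init__(self, n_columns, permanences=None):
        self.n_columns = n_columns
        self.permanences = (list(permanences) if permanences is not None
                            else [0.5] * n_columns)

    def learn(self, active_columns, increment=0.1):
        """Toy learning rule: strengthen the active columns, capped at 1.0."""
        for column in active_columns:
            self.permanences[column] = min(1.0, self.permanences[column] + increment)

    def write_to_string(self):
        """Dump everything needed to rebuild the model."""
        return json.dumps({
            "version": self.FORMAT_VERSION,
            "n_columns": self.n_columns,
            "permanences": self.permanences,
        })

    @classmethod
    def read_from_string(cls, text):
        """Rebuild a model from a string produced by write_to_string."""
        state = json.loads(text)
        if state["version"] != cls.FORMAT_VERSION:
            raise ValueError("unsupported format version: %r" % state["version"])
        return cls(state["n_columns"], state["permanences"])


model = ToySpatialPooler(n_columns=4)
model.learn([1, 3])  # stands in for a long training run
restored = ToySpatialPooler.read_from_string(model.write_to_string())
```

Whatever library is picked, the requirement is the same: the restored model must behave exactly like the one that was saved, so that a long training run does not have to be repeated.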


Requirements from Serialization
#20

These are binaries, folder structure, right? Not separate repositories?

In hindsight, it seems to me that you focus more on “bindings, python, windows/platform”, while my focus is on “c++, optimizations, refactoring”. It would be good if we could define both of these views and try to work through them.

This would be exactly “mine” :

?

nupic.python.algorithms – pybind11 module
nupic.python.engine – pybind11 module
nupic.python.math – pybind11 module
nupic.python27.algorithms – pybind11 module
nupic.python27.engine – pybind11 module
nupic.python27.math – pybind11 module
nupic.python_test – cpp app to test python modules

And these would be nupic.bindings.py(2):

python3 - nupic python 3 port

separate nupic.bindings.py3 (?)

yaml-cpp - static cpp yaml lib

Serialization: maybe even removed?

I can see the benefits of merged py+cpp repos.

TL;DR: we more or less agree on the structure; we should just consider STRICT isolation of the separate layers, and possibly modularization into repos: What will happen when @David_Keeney merges the c# bindings? And some others? And alternative implementations of the algorithms? …I think the repo might get huge, too crowded with issues and activities to manage.
Another thing is 3rd party projects that would like to use some functionality (swarming, python, c++ lib) and don’t want to pull, manage, and build all the other stuff.

Or we could brainstorm the suggested alternative, which offers the best of both worlds:

  • atomic repos
  • layered “OSI” model, where on a higher layer the API is shielded
  • Good thing is all of this can be changed later, unless you base your changes on forks of nupic.core, nupic.
  • It depends what you expect from the repo: your/python-feature repo, the structure is fine; community or generic-use repo, the design might change a bit.