Sequence Classifier based on HTM

Hello everyone, I am currently submitting a dissertation proposal for creating a Sequence Classifier based on HTM. The proof of concept will attempt to recognize stalling code using training samples from VirusShare.com, a huge repository of live viruses (not for the faint of heart). The gist of the research will be to use a parser that disassembles an executable file and feeds the resulting hex values representing the assembly code into the input stream. The main issue I’m struggling with is bridging this low-level data with higher-level abstractions such as stalling code (code that uses various stalling tactics to evade detection). I’ll be posting more of my work here and am hoping for constructive feedback from the community. Below is the abstract. …

A Dissertation Proposal Submitted to Nova Southeastern University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science

A Hierarchical Temporal Memory

Sequence Classifier Application for Classifying Streaming Data

by

Jeffrey V. Barnett

July 2019

Identifying abnormal data within modern data streams is extremely difficult. Modern sequence classifiers such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Rule-Based Classifiers (RBC), Deep Learning, and sequence alignment tools all fail to successfully classify data streams because of two factors: the overwhelming volume of data and a distinctive feature known as concept drift. Concept drift is a term used to describe changes in the statistical properties of an object or learned structure that occur over time and eventually lead to a drastic drop in classification accuracy. Some of the more prevalent and effective malware utilize concept drift to force classifiers to frequently retrain, resulting in a degradation of performance and an increase in the consumption of system resources. One of the simplest methods of forcing concept drift is through the use of stalling methods that execute trivial instructions, wait for user interaction, or invoke system sleep calls. The neocortex (brain) is a predictive modeling system which receives information from sensors (i.e., retina, cochlea, touch) and builds a real-time model of the world that enables it to make predictions, detect anomalies, and generate behavior. Sparse Distributed Representations (SDR) are considered by neuroscientists to be the data structure of the brain. Hierarchical Temporal Memory (HTM) is a type of sequence memory that exhibits the predictive and anomaly detection properties of the neocortex. HTM algorithms conduct training through exposure to a stream of sensory data and are designed for continuous online learning. Current HTM models use SDRs which model the way that neurons contribute to maintaining a sparse pattern of activity which represents the world. 
Previous research on HTMs has shown promise for use in malware detection; however, HTMs currently fall short of practical use in this area due to insufficient behavioral analysis of malware, which is needed to encode sufficient semantic meaning in the SDRs for a classifier to process. This research proposes to develop SDRs of stalling code based on a thorough behavioral analysis and to develop a novel HTM sequence classifier, resistant to concept drift, that is capable of recognizing stalling code in streaming data.

Keywords: Hierarchical Temporal Memory, Intrusion Detection, Malware, Sparse Distributed Representation, Concept Drift

4 Likes

Exciting application here @barnettjv! I think this could be an arena where HTM really shines, and welcome it into our crosshairs.

To me this is where the real work and real innovation of it is, at least in terms of applying NuPIC. If you can drive this part by describing the nature of these hex values so they can be mapped onto encoder(s), then you’re cooking.

Again, I think this is a great concept; I’d just add that you don’t need to frame it specifically as a classifier for stalling code. It can be a streaming anomaly detector which will pick up the changes in recent predictability generated by stalling code as anomalous, along with any other changes in recent system predictability as they set in.

The HTM’s continuous learning should foil their scheme to confuse the guards, but given the general nature of the anomaly detection it could also be helpful to have some domain-specific classification of the anomalies after they are raised. If you’re inclined to open up your drawing board some I’m interested to see how you’ll harness HTM for this.
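The streaming-anomaly framing suggested above can be sketched in a toy form. This is not NuPIC and not HTM — just a minimal, hypothetical first-order transition model that illustrates the idea: each instruction is scored by how well the previous step predicted it, so a repetitive stalling loop first spikes the score and then becomes boringly predictable.

```python
def anomaly_scores(stream):
    """Yield a naive anomaly score per symbol: 1.0 if this symbol was
    never before observed following the previous symbol, else 0.0."""
    transitions = {}   # previous symbol -> set of observed successors
    prev = None
    for sym in stream:
        if prev is None:
            score = 1.0                      # nothing to predict from yet
        else:
            seen = transitions.setdefault(prev, set())
            score = 0.0 if sym in seen else 1.0
            seen.add(sym)
        prev = sym
        yield score

# A stalling pattern (NOP/JMP repeated) becomes predictable quickly:
trace = ["MOV", "ADD", "NOP", "JMP", "NOP", "JMP", "NOP", "JMP"]
scores = list(anomaly_scores(trace))
# scores -> [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

A real HTM's temporal memory does something far richer (high-order sequences, distributed representations, continuous forgetting), but the shape of the signal — novelty spike followed by collapse into predictability — is the property an anomaly detector would exploit here.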

3 Likes

GREAT feedback sheiser1! Although some may warn me about posting my work in an open forum such as this, I’m gonna roll the dice and be as transparent as I can. I’m currently reviewing some notes from Subutai to address our Dean’s concern about convergence. I’ll post the draft proposal here shortly and welcome any/all constructive criticism. My timeline is to have this proposal approved by my committee this week, then be given the blessing to start the research. Look back here in a day or so for what I hope to be the final draft of the proposal.

3 Likes

I had to pick a toy problem that demonstrated HTM’s applicability as a classifier against concept drift. This is my comfort zone. I’ll be using IDA Pro to disassemble the executable files so that I can perform human analysis on the virus, then I’ll have to pull some whiz-bang magic out of my hat and figure out a way to semantically encode the behavior(s) of stalling code.

1 Like

Glad to hear it! In my experience this community has been totally kind and helpful, and is well overseen too.

I wonder what the dean means by convergence — is it that there’s not enough novelty to the project? If that’s the case, that worry should fade away quickly, as you’ll need to innovate to encode the raw data you’re getting. If you come up with a flexible method it could be added as a new encoder to HTM which bridges it to a new application space, which I’d certainly call innovative, especially if it sets a new standard for ML performance in this space (aim no lower!).

I’m certainly interested to stay in the loop as your proposal is reviewed and the project takes form :smile:

Two figures which attempt to depict my proposal…


Figure 6. Proposed HTM system with Sequence classifier (L 2/3) and input modalities. A Theory of How Columns in the Neocortex Enable Learning the Structure of the World. Hawkins et al. (2017), p. 4.

3 Likes

Outdated figure removed.

1 Like

Ok I think I got this. Basically my dissertation will be extending the work Numenta did in the “A Theory of How Columns in the Neocortex Enable Learning the Structure of the World” paper, except the object being studied will be disassembled executable files. The “world” that my research will be trying to model is the computer hardware (mainly registers) and the executable under examination. The sensors will be the Program Counter and Instruction Register, whose values are derived by parsing the hex file and feeding it into an encoder. Same analogy as the cup and fingers/eyes, except the cup is now the executable file (via the parser) and the fingers/eyes are the loc/registers. Thoughts?
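The parser-as-sensor idea can be sketched concretely. Below is a minimal, hypothetical example (the exact listing format is an assumption — IDA Pro output varies by configuration) that turns disassembly lines into (address, opcode, operands) tuples, the "sensor readings" that would then feed an encoder:

```python
import re

# Hypothetical IDA-style listing line: ".text:00401000  mov  eax, [ebp+8]"
LINE_RE = re.compile(
    r"\.text:(?P<addr>[0-9A-Fa-f]+)\s+(?P<op>\w+)\s*(?P<args>.*)")

def parse_listing(lines):
    """Turn disassembly text into (address, opcode, operands) tuples."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield (int(m.group("addr"), 16),   # program-counter sensor
                   m.group("op").upper(),       # instruction-register sensor
                   m.group("args").strip())     # operands for the encoder

listing = [
    ".text:00401000  xor  eax, eax",
    ".text:00401002  jz   short loc_401010",
]
tokens = list(parse_listing(listing))
# tokens[0] -> (0x401000, 'XOR', 'eax, eax')
```

In the cup-and-finger analogy, the address column plays the role of the location signal and the opcode/operand columns play the role of the sensed feature at that location.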

3 Likes

What are you thinking as the distal signal for the feature input layer? Will this just be the previously sensed feature (essentially Temporal Memory with the addition of object layer and voting) or some type of location encoding?

1 Like

I was planning on feeding the disassembled executable file to an encoder which encodes each line of code. I haven’t designed the encoder yet, but the idea is that instructions such as BNE, BEQ, etc. would be semantically similar, and the corresponding label being jumped to would use one of the existing NuPIC encoders.

Is this similarity borne out in their names? Like, is BNE similar to BEQ because they both contain B and E? Or do you know about this similarity from domain knowledge? The encoder may need to get pretty specific the more domain-specific the data type is. It seems like these may be categorical yet semantically overlapping values, which there currently isn’t a direct NuPIC encoder for.

Yes, if I understand your proposed system correctly, that would give you the minicolumns to activate in the “Sensed feature” layer (using the label in the figure you posted). I was just curious about where distal input to this layer would come from in your system, for choosing the winning cells within those minicolumns to activate. In the figure, the distal input is coming from another layer labeled “Location *”, which represents the location of the sensor in object space.

If “locations” don’t make sense for this problem, that part could be simplified to be more like Temporal Memory (where the distal input comes from other cells in the “Sensed feature” layer). You could still incorporate the ideas of the “Object” layer and voting between multiple sensors, as depicted in the figure.

I think this is the place to start. This is simpler to implement, and may work well enough to not need any more complexity.

1 Like

They are a subset of the branching instructions in assembly, which depend on the architecture; below are the branching instructions for the x86 architecture (I think I listed the ones for ARM/Linux above).

Instruction   Meaning
JNE           Jump if not equal
JE            Jump if equal
JG            Jump if greater
JLE           Jump if less than or equal
JL            Jump if less than
JGE           Jump if greater or equal

It seems to me that the key to developing a successful encoder is to focus on features of the observed input stream that relate to the behavior of the system. For example, semantically similar operations might include memory/cache accesses (read or write), floating point operations, etc. Each of the various instructions that share similar behaviors should share some parts of their encoded SDRs.
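One way to realize this "shared bits per behavior" idea is a sketch like the following. This is not a real NuPIC encoder — the widths, the behavior classes, and the use of a seeded RNG for bit assignment are all illustrative assumptions — but it shows how instructions in the same class (e.g., conditional branches) end up with heavily overlapping encodings while unrelated instructions barely overlap:

```python
import random

N_BITS = 256          # total encoding width (illustrative)
CLASS_BITS = 16       # bits shared by every member of a behavior class
INSTR_BITS = 8        # bits unique to each individual instruction

# Hypothetical behavior classes for a few mnemonics
BEHAVIOR_CLASS = {
    "JE": "cond_branch", "JNE": "cond_branch", "JG": "cond_branch",
    "MOV": "mem_access", "PUSH": "mem_access",
    "FMUL": "float_op",  "FADD": "float_op",
}

def encode(mnemonic):
    """Return the set of active bit indices for one instruction."""
    cls = BEHAVIOR_CLASS.get(mnemonic, "other")
    rng = random.Random(cls)            # same seed -> same class bits
    bits = set(rng.sample(range(N_BITS), CLASS_BITS))
    rng = random.Random(mnemonic)       # per-instruction unique bits
    bits |= set(rng.sample(range(N_BITS), INSTR_BITS))
    return bits

overlap = len(encode("JE") & encode("JNE"))   # large: same class
cross   = len(encode("JE") & encode("FMUL"))  # small: different classes
```

The key property for the spatial pooler downstream is exactly the one described above: JE and JNE share their class bits, so their encodings overlap by at least CLASS_BITS, while JE and FMUL overlap only by chance collisions.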

I don’t have a lot of experience with low-level hardware instructions, but I suspect that some of the behavior you might want to encode for will involve multiple instructions. It’s hard to say if the sequence learning algorithms will be able to pick up on these behaviors and associate them together appropriately. You may want to take a look at how Cortical.io creates their semantic fingerprints. In their NLP applications, it’s the words themselves that have semantic meaning rather than the individual letters.

2 Likes

I don’t know about hardware at all but I agree with this 100%. This process of exploration around preprocessing & encoding will likely affect detection performance much more directly than the network design parameters.

@barnettjv - there are many ways of transferring program flow control.
You listed the basic jumps.

Jump

Opcode Mnemonic Description
EB cb JMP rel8 Jump short, relative, displacement relative to next instruction.
E9 cw JMP rel16 Jump near, relative, displacement relative to next instruction.
E9 cd JMP rel32 Jump near, relative, displacement relative to next instruction.
FF /4 JMP r/m16 Jump near, absolute indirect, address given in r/m16.
FF /4 JMP r/m32 Jump near, absolute indirect, address given in r/m32.
EA cd JMP ptr16:16 Jump far, absolute, address given in operand.
EA cp JMP ptr16:32 Jump far, absolute, address given in operand.
FF /5 JMP m16:16 Jump far, absolute indirect, address given in m16:16.
FF /5 JMP m16:32 Jump far, absolute indirect, address given in m16:32.

but also, there are conditional jumps.
These can be forced with things like xor A to clear the accumulator.
Or push a value and pop flags.

Jcc

Jump if Condition Is Met

Opcode Mnemonic Description
77 cb JA rel8 Jump short if above (CF=0 and ZF=0).
73 cb JAE rel8 Jump short if above or equal (CF=0).
72 cb JB rel8 Jump short if below (CF=1).
76 cb JBE rel8 Jump short if below or equal (CF=1 or ZF=1).
72 cb JC rel8 Jump short if carry (CF=1).
E3 cb JCXZ rel8 Jump short if CX register is 0.
E3 cb JECXZ rel8 Jump short if ECX register is 0.
74 cb JE rel8 Jump short if equal (ZF=1).
7F cb JG rel8 Jump short if greater (ZF=0 and SF=OF).
7D cb JGE rel8 Jump short if greater or equal (SF=OF).
7C cb JL rel8 Jump short if less (SF<>OF).
7E cb JLE rel8 Jump short if less or equal (ZF=1 or SF<>OF).
76 cb JNA rel8 Jump short if not above (CF=1 or ZF=1).
72 cb JNAE rel8 Jump short if not above or equal (CF=1).
73 cb JNB rel8 Jump short if not below (CF=0).
77 cb JNBE rel8 Jump short if not below or equal (CF=0 and ZF=0).
73 cb JNC rel8 Jump short if not carry (CF=0).
75 cb JNE rel8 Jump short if not equal (ZF=0).
7E cb JNG rel8 Jump short if not greater (ZF=1 or SF<>OF).
7C cb JNGE rel8 Jump short if not greater or equal (SF<>OF).
7D cb JNL rel8 Jump short if not less (SF=OF).
7F cb JNLE rel8 Jump short if not less or equal (ZF=0 and SF=OF).
71 cb JNO rel8 Jump short if not overflow (OF=0).
7B cb JNP rel8 Jump short if not parity (PF=0).
79 cb JNS rel8 Jump short if not sign (SF=0).
75 cb JNZ rel8 Jump short if not zero (ZF=0).
70 cb JO rel8 Jump short if overflow (OF=1).
7A cb JP rel8 Jump short if parity (PF=1).
7A cb JPE rel8 Jump short if parity even (PF=1).
7B cb JPO rel8 Jump short if parity odd (PF=0).
78 cb JS rel8 Jump short if sign (SF=1).
74 cb JZ rel8 Jump short if zero (ZF = 1).
0F 87 cw/cd JA rel16/32 Jump near if above (CF=0 and ZF=0).
0F 83 cw/cd JAE rel16/32 Jump near if above or equal (CF=0).
0F 82 cw/cd JB rel16/32 Jump near if below (CF=1).
0F 86 cw/cd JBE rel16/32 Jump near if below or equal (CF=1 or ZF=1).
0F 82 cw/cd JC rel16/32 Jump near if carry (CF=1).
0F 84 cw/cd JE rel16/32 Jump near if equal (ZF=1).
0F 84 cw/cd JZ rel16/32 Jump near if 0 (ZF=1).
0F 8F cw/cd JG rel16/32 Jump near if greater (ZF=0 and SF=OF).
0F 8D cw/cd JGE rel16/32 Jump near if greater or equal (SF=OF).
0F 8C cw/cd JL rel16/32 Jump near if less (SF<>OF).
0F 8E cw/cd JLE rel16/32 Jump near if less or equal (ZF=1 or SF<>OF).
0F 86 cw/cd JNA rel16/32 Jump near if not above (CF=1 or ZF=1).
0F 82 cw/cd JNAE rel16/32 Jump near if not above or equal (CF=1).
0F 83 cw/cd JNB rel16/32 Jump near if not below (CF=0).
0F 87 cw/cd JNBE rel16/32 Jump near if not below or equal (CF=0 and ZF=0).
0F 83 cw/cd JNC rel16/32 Jump near if not carry (CF=0).
0F 85 cw/cd JNE rel16/32 Jump near if not equal (ZF=0).
0F 8E cw/cd JNG rel16/32 Jump near if not greater (ZF=1 or SF<>OF).
0F 8C cw/cd JNGE rel16/32 Jump near if not greater or equal (SF<>OF).
0F 8D cw/cd JNL rel16/32 Jump near if not less (SF=OF).
0F 8F cw/cd JNLE rel16/32 Jump near if not less or equal (ZF=0 and SF=OF).
0F 81 cw/cd JNO rel16/32 Jump near if not overflow (OF=0).
0F 8B cw/cd JNP rel16/32 Jump near if not parity (PF=0).
0F 89 cw/cd JNS rel16/32 Jump near if not sign (SF=0).
0F 85 cw/cd JNZ rel16/32 Jump near if not zero (ZF=0).
0F 80 cw/cd JO rel16/32 Jump near if overflow (OF=1).
0F 8A cw/cd JP rel16/32 Jump near if parity (PF=1).
0F 8A cw/cd JPE rel16/32 Jump near if parity even (PF=1).
0F 8B cw/cd JPO rel16/32 Jump near if parity odd (PF=0).
0F 88 cw/cd JS rel16/32 Jump near if sign (SF=1).
0F 84 cw/cd JZ rel16/32 Jump near if 0 (ZF=1).

And with call, you do not have to return; you can pop the stack to clear the address.

Call Procedure

Opcode Mnemonic Description
E8 cw CALL rel16 Call near, relative, displacement relative to next instruction
E8 cd CALL rel32 Call near, relative, displacement relative to next instruction
FF /2 CALL r/m16 Call near, absolute indirect, address given in r/m16
FF /2 CALL r/m32 Call near, absolute indirect, address given in r/m32
9A cd CALL ptr16:16 Call far, absolute, address given in operand
9A cp CALL ptr16:32 Call far, absolute, address given in operand
FF /3 CALL m16:16 Call far, absolute indirect, address given in m16:16
FF /3 CALL m16:32 Call far, absolute indirect, address given in m16:32

To make things hard to detect, you can push an address and return.
The push and return do not have to be next to each other.

RET

Return from Procedure

Opcode Mnemonic Description
C3 RET Near return to calling procedure.
CB RET Far return to calling procedure.
C2 iw RET imm16 Near return to calling procedure and pop imm16 bytes from stack.
CA iw RET imm16 Far return to calling procedure and pop imm16 bytes from stack.

I have coded in intel assembly since the original 8080 came out and I am sure that there are a bunch of other ways to alter flow control that would be hard to characterize.

I expect a good morphing virus to slide code around with things like the string copy instruction, then transfer control to that code. For that matter, it could have code that is munged in some way, unmunged in small chunks as needed, and generally kept scan-proof in memory until needed. If this involved relative-addressed code that is randomly moved around and executed wherever it ends up, your jump addresses would be changing all the time, even for the same block of code.

For that matter, they could have a tiny byte code interpreter and the code could be in something that does not look at all like intel op codes; dot-net anyone?

1 Like

Yes, I noticed that too. Haven’t had much sleep the past 3 days. You wouldn’t know it by the post, but I actually teach assembly language on both x86 and ARM/Linux machines. :slight_smile:

2 Likes