So I decided to make a follow-up post, given that the last post was purely an abstract demonstration of the scompress[] operator. It has since occurred to me that we might be able to apply scompress[] to raw text and extract repeated phrases. I was only 50/50 on whether it would work in practice, or whether we would simply get noise, but for the most part I think it has worked. Let me walk you through my SDB code.
I decided to use the Simple English Wikipedia page about dogs as my source text, on the assumption that simpler English will work better than standard English. Since scompress[] tries to extract repeated substrings, simpler text will presumably contain more of them than standard English; for standard English we would need a much larger dataset to get comparable results.
So, let’s define our set of sentences, splitting them into sequences of single letters using the ssplit operator. One warning: the current code breaks if the text contains any digits, so we have deleted them from our sentences.
learn-page |dog> #=>
seq |0> => ssplit |Dogs (Canis lupus familiaris) are domesticated mammals, not natural wild animals.>
seq |1> => ssplit |They were originally bred from wolves.>
seq |2> => ssplit |They have been bred by humans for a long time, and were the first animals ever to be domesticated.>
seq |3> => ssplit |There are different studies that suggest that this happened between and years before our time.>
seq |4> => ssplit |The dingo is also a dog, but many dingos have become wild animals again and live independently of humans in the range where they occur (parts of Australia).>
seq |5> => ssplit |Today, some dogs are used as pets, others are used to help humans do their work.>
seq |6> => ssplit |They are a popular pet because they are usually playful, friendly, loyal and listen to humans.>
...
seq |54> => ssplit |Some of the most popular breeds are sheepdogs, collies, poodles and retrievers.>
seq |55> => ssplit |It is becoming popular to breed together two different breeds of dogs and call the new dog's breed a name that is a mixture of the parents' breeds' two names.>
seq |56> => ssplit |A puppy with a poodle and a pomeranian as parents might be called a Pomapoo.>
seq |57> => ssplit |These kinds of dogs, instead of being called mutts, are known as designer dog breeds.>
seq |58> => ssplit |These dogs are normally used for prize shows and designer shows.>
seq |59> => ssplit |They can be guide dogs.>
|>
To learn all of those sentences/sequences, we simply invoke the operator:
learn-page |dog>
It turns out that scompress[] processing of raw text works better if we convert everything to lowercase first. The reasoning is that if we preserved case, then “Dogs” and “dogs” would yield “ogs” as the repeating pattern instead of “dogs”. So here is a quick wrapper operator to do the work for us, making use of the to-lower operator, which converts text to all lowercase:
convert-to-lower-case |*> #=>
lower-seq |__self> => to-lower seq |__self>
|>
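To see why case matters, here is a purely illustrative Python sketch (it uses difflib's longest-common-match, not anything from SDB): without case folding, the shared pattern between "Dogs ..." and "dogs ..." loses its first letter.

```python
from difflib import SequenceMatcher

def longest_common_substring(a: str, b: str) -> str:
    """Return the longest substring shared by a and b."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Mixed case drops the leading "d":
print(longest_common_substring("Dogs are pets", "dogs are pets"))  # "ogs are pets"

# After lowercasing, the full word survives:
print(longest_common_substring("dogs are pets", "dogs are pets"))  # "dogs are pets"
```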
Now we apply the convert-to-lower-case operator to all of the sequences that have been defined with respect to “seq” (using the rel-kets operator):
convert-to-lower-case rel-kets[seq]
Next, let’s run our scompress operator on our “lower-seq” sequences, storing them with respect to the “cseq” operator, and using "W: " as the scompress sequence prefix:
scompress[lower-seq, cseq, "W: ", 6, 40]
where 6 is the minimum ngram length and 40 is the maximum ngram length used by scompress[]. Specifying these bounds speeds things up a bit.
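As a rough illustration of what scompress[] is looking for (a naive sketch of the idea, not the actual scompress[] algorithm), here is a Python function that counts every character ngram with length in the given range and keeps the ones that occur more than once:

```python
from collections import Counter

def repeated_ngrams(sentences, min_len=6, max_len=40):
    """Count every character ngram with length in [min_len, max_len]
    across all sentences, and keep those occurring more than once."""
    counts = Counter()
    for s in sentences:
        s = s.lower()  # case folding, as discussed above
        for n in range(min_len, min(max_len, len(s)) + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return {g: c for g, c in counts.items() if c > 1}

# Toy input (fragments of the dog page):
sentences = [
    "Dogs are domesticated mammals.",
    "They were originally bred from wolves.",
    "They have been bred by humans.",
]
print(repeated_ngrams(sentences))  # {' bred ': 2}
```

Note that scompress[] does considerably more than this: it replaces each detected pattern with a named class, so patterns can in turn contain other patterns.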
Now that the hard work is done, let’s take a look at what we have. There are two obvious things we can do next, one is to print out the repeated substrings detected by scompress[], and the other is to measure the system depth of our sequences. See my previous post for a definition of system depth.
Here is the relevant code to print out the repeated substrings, sorted by longest strings first:
filter-W |W: *> #=> |_self>
expand-W |W: *> #=> smerge cseq^20 |_self>
find |repeat patterns> #=> seq2sp expand-W cseq rel-kets[lower-seq] |>
print-coeff |*> #=>
print (extract-value push-float |__self> _ |:> __ |__self>)
|>
print-minimalist |*> #=>
print |__self>
|>
-- print-coeff reverse sort-by[ket-length] find |repeat patterns>
print-minimalist reverse sort-by[ket-length] find |repeat patterns>
Here are the repeating substrings, with length in range [6, 40]:
dogs, hunting dogs, herding dogs,
"man's best friend" because they
they have been bred by humans
man of about years of age,
dog is called a pup
between and years
because they are
different breeds
s of domestication
showed that the
the alpha male.
are sometimes
these dogs are
called mutts,
to that of a
suggest that
domestication
e domesticated
ed from wolves
dog is called
s have lived
loyal and li
wild animals
e great dane
than humans
domesticated
police dogs
ed together
e different
there are a
dogs often
dogs with
dog breeds
sometimes
s are used
dogs can se
domestic
dogs have
years ago
dogs can s
dogs with
modern dog
sometimes
with human
dogs are
there are
guide dogs
the first
and the
the dog
n average
different
a dog in
because
they can
popular
e of the
to human
such as
dogs are
other ar
designer
sometimes
s in the
parents
usually
e dogs,
they are
people
have be
or blind
e other
a dog,
there ar
before
of the
e breed
parents
t least
dogs, h
s closer
this is
human bo
trained
en years
longer
called
police
lifespan
er dogs
of dogs
s and ca
and li
where
ed that
e that
called
human b
e dogs.
a few
dogs,
called
people
friend
often
humans
their
as pets
parents
animals
better
ll and
breed
dogs w
breeds
breeds
poodle
wolves
before
dogs t
e pette
usually
and ar
known
a dog
long t
and a
were
other
dogs,
dogs w
dingo
human
of dog
red by
group
have
they a
breed
the a
breeds
s can
s and
years
t see
shows
, but
being
r this
it is
or pu
dog is
wolves
for d
the re
d not
s are
e dogs
been
So, it did a moderate job of extracting repeated phrases and words, given that the starting point was sequences of individual letters. It would presumably do an even better job with a much larger dataset containing more repeated phrases and words.
Finally, let’s look at the system depth, as defined in my previous post. Here is the relevant code in the SDB language (making use of recursion):
find-depth (*) #=>
depth |system> => plus[1] depth |system>
if( is-equal(|__self>, the |input>), |op: display-depth>, |op: find-depth>) cseq |__self>
display-depth (*) #=>
|system depth:> __ depth |system>
find-system-depth |*> #=>
depth |system> => |0>
the |input> => lower-seq |__self>
find-depth cseq |__self>
coeff-sort find-system-depth rel-kets[lower-seq]
And here is the result:
27|system depth: 4> + 15|system depth: 5> + 12|system depth: 3> + 6|system depth: 2>
Note that not all of the input sequences share the same system depth.
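For readers who don't speak SDB, here is my rough Python interpretation of the system-depth idea (an assumption-laden sketch, not the SDB implementation): a compressed sequence is a list of tokens, where tokens prefixed with "W: " reference other sequences in a table, and the depth counts how many expansion passes are needed to reach plain letters.

```python
def system_depth(token, table):
    """Depth 1 for a plain letter; otherwise 1 plus the depth of the
    deepest sequence that this token references."""
    if token not in table:
        return 1
    return 1 + max(system_depth(t, table) for t in table[token])

# Hypothetical compressed structure, loosely mimicking scompress[] output:
table = {
    "W: 0": ["d", "o", "g"],                    # "dog"           -> depth 2
    "W: 1": ["W: 0", "s"],                      # "dogs" via W: 0 -> depth 3
    "seq":  ["W: 1", " ", "b", "a", "r", "k"],  # "dogs bark"     -> depth 4
}
print(system_depth("seq", table))  # 4
```

So deeper nesting of detected patterns inside other patterns yields a larger system depth, which matches the spread of depths in the result above.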
And that is about it. I don’t think we have done anything useful with scompress[] yet, but it is a start.
I should also mention that the above dataset took about 10 seconds to process; without passing a min and max ngram length to scompress[], it took about 18 seconds. Given all the processing going on inside scompress[], repeatedly breaking the input sequences into smaller and smaller ngrams, I don’t think that is all that terrible. Does anyone have suggestions for other places where scompress[] might be more interesting? It should work for almost any collection of sequences.
Here is the Semantic DB github page.
Here is the code for this post.
Feel free to contact me at garry -at- semantic-db.org