This is BIG:
“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
It might not sound like much, but this is a canary in the coal mine for AGI. If you're not fluent in machine-learning terminology, it might sound a lot more complicated than it is. In the video, he expresses astonishment that this technique works, but it's not too difficult to understand why it does. Let's say we have two identical positive reviews, A: "This movie was amazing!" and B: "This movie was amazing!" Both are obviously marked as "positive" sentiment in the training dataset. When we concatenate the strings into "This movie was amazing!This movie was amazing!" and compress the result, the compression algorithm is smart enough to recognize that the string has simply been repeated and will encode it that way: "[This movie was amazing!]\r", where I am using "\r" to represent whatever escape code the compressor uses to indicate "repeat the last string in brackets." It won't be quite that simple in the compressed format, but this is one of the capabilities of any SOTA compressor. So, when you measure the length of the compressed concatenation, it will be barely longer than the compressed length of either string on its own. In other words, the compression algorithm is measuring commonality, but it's doing it in a very sophisticated way.
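To make the repeated-string case concrete, here is a minimal sketch in Python. It assumes zlib from the standard library as a stand-in compressor (the paper itself uses gzip), and the helper name compressed_len is made up for this illustration.

```python
import zlib

def compressed_len(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

a = "This movie was amazing!"
b = "This movie was amazing!"

len_a = compressed_len(a)
len_ab = compressed_len(a + b)

# The second copy can be stored as a back-reference to the first,
# so it should add only a few bytes to the compressed output.
print(f"C(A) = {len_a} bytes, C(A+B) = {len_ab} bytes")
```

The exact byte counts depend on the compressor and its settings, but the concatenation should come out far smaller than twice the size of a single review.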
Let's take two opposite reviews. A: "This movie was amazing!" B: "This movie was horrible!" Again, the compressor is smart enough to recognize that "This movie was " is common to both strings. So, it will use some short code to repeat that: "[This movie was ]amazing!\rhorrible!" It's not exactly like this, but this is the basic concept of what is happening internally. And the compressor is much more sensitive to repetitions and statistical patterns than I am describing here.
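The shared-prefix effect can be checked the same way. The sketch below again assumes zlib as the compressor; extra_bytes is a hypothetical helper, and the "unrelated" sentence is just an invented string of the same length that shares no phrases with review A.

```python
import zlib

def compressed_len(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def extra_bytes(first: str, second: str) -> int:
    """Bytes added to the compressed output when second is appended to first."""
    return compressed_len(first + second) - compressed_len(first)

a = "This movie was amazing!"
similar = "This movie was horrible!"    # shares "This movie was " with a
unrelated = "Dogs bark at old trucks."  # same length, nothing in common with a

print("extra bytes for the similar review:  ", extra_bytes(a, similar))
print("extra bytes for the unrelated string:", extra_bytes(a, unrelated))
```

Appending the similar review should cost noticeably fewer bytes than appending the unrelated sentence, because the common prefix is encoded as a short reference instead of being spelled out again.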
In information theory, this kind of measure is called the mutual information. And that's really what NCD (normalized compression distance) is acting as a proxy for.
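For reference, NCD is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. Below is a rough sketch of that formula plus a 1-nearest-neighbour lookup. It assumes zlib in place of the paper's gzip, and the labels, the tiny training list, and the function names are all invented for illustration, not taken from the paper's code.

```python
import zlib

def c(text: str) -> int:
    """Compressed length in bytes, using zlib as the compressor."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means more shared content."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def nearest_label(query: str, training: list[tuple[str, str]]) -> str:
    """Return the label of the training text with the smallest NCD to query."""
    return min(training, key=lambda pair: ncd(query, pair[0]))[1]

# Hypothetical toy training set: (text, label) pairs.
training = [
    ("This movie was amazing!", "movie review"),
    ("Dogs bark at old trucks.", "other"),
]

query = "This movie was amazing, really!"
for text, label in training:
    print(f"NCD(query, {text!r}) = {ncd(query, text):.2f}  [{label}]")
print("predicted label:", nearest_label(query, training))
```

With only two training strings this is a toy, but it shows the whole pipeline: no parameters to train, just compressed lengths and a nearest-neighbour search.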
We are on our way to a computable approximation of AIXI (general-purpose AI with provably optimal behavior)...