The Omega Constant, or Say Hello To My Little Friend

ClaytonB



This post will be philosophical and mildly mathematical. I want to focus on the philosophical implications of these ideas, so I will try to present the mathematics in sketch form.

What is Chaitin's constant and why should anyone care?

Mathematics has many famous constants: pi, Euler's number e, the Euler-Mascheroni constant gamma, the Golden Ratio, and many more. Each constant has some significance within its domain of application, and this significance is why we say that constant is important. However, if you don't care about circles, then you have no reason to care about why pi is exactly equal to 3.14159... Such obsessions may seem to you like the trifles of children, especially if you are a man of the world, someone who does a lot of real things and has little time for academic scribblings.

But what about cause-and-effect itself? Does anyone not care about cause-and-effect? If you are rich and powerful, what does that mean but that your wealth and power are causally real, that they can produce (cause) real fruits (effects) in the world?

And what if there was a mathematical constant that contained within itself everything we mean by "causality"? Do you think such a constant would be important?

There is such a constant, and that constant is Omega, or Chaitin's constant.

It's not easy to explain why, but if we could know the value of this constant, it would really be a sort of Philosopher's Stone. The most mind-bending part of Omega is that it doesn't matter what laws of physics you have; its implications for the world are equally binding. The only constraint on Omega's implications for the world is that the world be finite. And even if you lived in an infinite universe, there is another, bigger Omega that would have the same implications for that world as ordinary Omega has for our world. And so on, up the transfinite cardinals -- no matter how big you define your Universe to be, if it follows cause-and-effect at all, there is always an Omega that constrains it!

OK, that's a lot of really big claims, so let's back up and explain our terms to understand a little more clearly what is going on, here.

Back in the 18th century, David Hume explained his famous problem of induction. The problem of induction is this: how do we know that the future will be like the past? This question seems innocent at first, but it is deceptively hard. If you think carefully about it, you will realize that there is no reason to assume the future will be like the past. You can say, "cause-and-effect always held in the past," but how does that tell you anything about how things will be in the future? It doesn't. Cause-and-effect may have held for all time until July 7, 2024, and then cease thereafter. There is no "law of Nature" which, in itself, can prevent this possibility. It is logically possible.

Now, that may seem like a bunch of angels-dancing-on-pinheads stuff, but it isn't. The underlying problem is that we do not have a clear definition of what we mean by the word "caused". What does it mean that "A caused B"? "B happened after A" is clearly not a correct definition, since B could have happened by coincidence or by some cause other than A, and it just happened to occur at that moment. Consider the claim, "Billy hit the ball and caused it to fly out of the park." What exactly do we mean by "caused" here? Well, we mean something like this: if you were to replay the events happening at that moment, and Billy did not swing his bat, the ball would not have flown out of the park. Now, this is not everything we mean by "cause", but it is a good first approximation. The point is that Billy's action is the thing that made the difference with respect to the ball flying out of the park.

But how do we encapsulate this "thing that made the difference"-ness, mathematically? This is where we bring in another concept from philosophy, called Occam's Razor. The razor says that, if there is more than one explanation for a thing, we should prefer the simplest explanation, all else equal. A generalization of Occam's Razor is to place a probability distribution over explanations, ranked by simplicity. So, we say the simplest explanation is the most probable, the next simplest explanation is a little less probable, and so on, down to the least simple explanation.

But what do we mean by "simple"? With some caveats, we can reduce simplicity to explanation length -- the explanation that requires the most words is the least simple, the explanation that can be (correctly and completely) stated in the fewest words is the simplest.

But what is a "word"? How do we measure any of this, mathematically?

To do this, instead of "words" and "explanations", we think of "bits" and "programs". We want the shortest program, that is, the program occupying the fewest bits.

Suppose you have a sequence of numbers, S, perhaps from some observation data. Imagine a program that prints out S. Now, imagine a shortest such program, p. If there is any pattern in the data, whatsoever, p will be able to leverage that pattern to express S in a "compressed" form. So, if S contains very repetitive data, like, "1, 1, 1, 1, 1...", then p can be extremely short, but if it has very complex data generated by a highly random process, the shortest p that prints out S will be very close to the length of S itself.

This is the key idea: p, being the shortest program, contains all information, whatsoever, about any pattern that may exist in S. If it did not -- if p somehow missed some pattern in S -- then that pattern could be exploited to build an even shorter program p' that prints out S, contradicting the fact that p is a shortest program. So, since p is a shortest program that prints out S, we are assured that it contains all of the pattern or structure in S. In particular, we can define the redundancy of S as the difference between the length of S itself and the length of p: |S| - |p|. If S is 1,000 bits long in its raw encoding, and p is, say, 100 bits, then S has a redundancy of 900 bits. (Technically, we take the ratio 1 - (100/1000) to calculate a redundancy of 90%, so we can compare the redundancy of strings of different lengths.)
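
The true shortest program p is uncomputable, but an off-the-shelf compressor gives a crude upper bound on its length, which is enough to illustrate the redundancy calculation. Here is a minimal Python sketch, using zlib purely as a stand-in for p (not the real shortest-program length):

Code:
import os
import zlib

def redundancy(s: bytes) -> float:
    """Estimate the redundancy of s, using zlib's compressed size as a
    crude stand-in for the (uncomputable) length of the shortest program p."""
    compressed = zlib.compress(s, 9)
    return 1.0 - len(compressed) / len(s)

repetitive = b"1" * 1000        # highly patterned data: "1, 1, 1, 1, ..."
random_ish = os.urandom(1000)   # data with (almost) no exploitable pattern

print(f"repetitive data: {redundancy(repetitive):.1%} redundant")  # near 100%
print(f"random data:     {redundancy(random_ish):.1%} redundant")  # near 0% (can even dip slightly below, due to zlib overhead)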

Back to Chaitin's constant -- what does this have to do with the shortest program p for a sequence/string S? Well, Omega is a concrete example of a particular number (a real number between 0 and 1) whose digits (or bits) are *truly random* in every meaningful sense of the word "random", yet it is also completely determined. (Concretely, Omega is defined as the halting probability of a universal Turing machine: the probability that a randomly chosen program eventually halts.) Thus, while p captures all the patterns that exist in some sequence S, Omega is a specific number whose "p" is just itself... and since it has infinitely many digits, there is no shortest "program" p at all that computes it. It just is its own shortest program, and it is infinite in size. It is an absolutely irreducible brute fact of mathematical reality. As Chaitin says of this constant, Omega proves that mathematics has no "theory of everything" because, for any theory of mathematics which is supposed to "explain it all", there is an Omega inside that theory whose bits cannot be predicted and thus break the predictive power of the theory. This is true in all logically possible worlds, at least, all logically possible worlds where it is possible to discuss Turing machines. In other words, physics doesn't matter, multi-verses don't matter... you can't escape Omega. Hence, "Say hello to my little friend."
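
To make the definition a little more concrete: for a prefix-free universal machine U, Omega is the sum of 2^-|p| over every program p that halts on U. A real U has an undecidable halting problem, so the following Python sketch swaps in a made-up toy "machine" (programs of the form "1"*k + "0", with an arbitrary, trivially decidable halting rule) just to show the shape of the dovetailed lower-bound computation. Every name and rule in it is a toy assumption, not Chaitin's actual construction:

Code:
# Toy "machine": program number k is the prefix-free string "1"*k + "0",
# which has length k+1 bits.  Toy halting rule (pure invention): program k
# "runs" for k steps and halts only if k is even.  The toy Omega is then
# the sum of 2**-(k+1) over even k, which works out to 2/3.

def toy_halts_within(k: int, budget: int) -> bool:
    """Hypothetical stand-in for 'program k halts within `budget` steps'.
    For a real universal machine this predicate is undecidable."""
    return k % 2 == 0 and k <= budget

def omega_lower_bound(max_k: int, budget: int) -> float:
    """Dovetail over the toy programs, adding 2**-length for each one seen
    to halt so far.  Real Omega can only ever be approached from below
    like this, never computed exactly."""
    total = 0.0
    for k in range(max_k + 1):
        if toy_halts_within(k, budget):
            total += 2.0 ** -(k + 1)
    return total

for budget in (1, 2, 4, 8, 16):
    print(f"budget={budget:2d}  lower bound = {omega_lower_bound(64, budget):.6f}")
# The bounds creep upward toward the toy machine's Omega (2/3 here); for a real
# universal machine the limit exists but its bits are algorithmically random.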

The universal prior

In the previous section, I introduced Omega, but we still haven't gotten to the root of what "A causes B" means. What exactly do we mean by this? Ray Solomonoff formalized what we mean by causal inference (induction). Marcus Hutter says of Solomonoff's theory:

Solomonoff unified Occam’s razor and Epicurus’ principle of multiple explanations to one elegant, formal, universal theory of inductive inference, which initiated the field of algorithmic information theory. His central result is that the posterior of his universal semimeasure M converges rapidly to the true sequence generating posterior μ, if the latter is computable. Hence, M is eligible as a universal predictor in case of unknown μ... citation

This might be less than clear but I want to establish a concrete baseline on the subject because information on the Internet about this topic (and most advanced topics in math and science) is quickly disappearing as the Censorship-Industrial Complex continues to jump the shark to infinity and beyond.

Hutter explains that Solomonoff's theory shows the existence of a universal predictor M which converges rapidly (as fast as, or faster than, any other predictor) to the actual generating distribution μ of an environment, for any computable μ. Translated into plain language, you can think of Solomonoff's predictor as an artificial super-intelligence with the property that you can put it into any environment (including one with weird laws of physics, or any number of dimensions, etc.) and it will learn the laws of that environment as fast as, or faster than, any other predictor, including one specially purpose-built for that environment! And this is true of any computable environment. This still might not be completely clear, so let me translate to video games. Imagine you wrote a completely hand-crafted computer game with custom levels, etc. You built this game for Bob and M to play. You know a lot about Bob, and so you crafted the game to "tilt the odds" to Bob as strongly as possible. Solomonoff's predictor M will be able to play your game at least as well as Bob, or maybe even better, despite Bob's "home-field advantage". It will learn the rules of your game as fast as Bob possibly can. The only exception would be if you and Bob had had prior communication and there were secrets embedded in the game that you only told to Bob, but that would violate the terms of the experiment. As long as the game is fair (all rules openly stated and equally applied), just carefully tilted to Bob's natural abilities, M will at least keep up with Bob, no matter how skilled Bob is, and no matter how carefully you have tilted the game in Bob's favor. (Technical note: I am actually describing AIXI, which is an extension of Solomonoff's M, but AIXI is really just M with a decision-making loop bolted onto it.)

OK, so now that you have some concept of M, let's go back to cause-and-effect. Imagine hearing an entire song played all the way up to the last note. In general, because of the nature of melodic sequences, the song will usually feel "incomplete", like there's one more note that "needs" to be played. Solomonoff's M works a lot like that. We look at a prefix of S, say, the first n elements of S. Now, we want to know the (n+1)th element of S. Can we figure it out? Well, if there is any pattern in S (the full sequence), we know that the shortest program p will "know about" that pattern. So, if we had p, we could just ask it to print the next element in the sequence. But we don't have p, since we just have S[1:n] (the first n elements of S) and we want to know what S[n+1] is. What Solomonoff says to do is to go ahead and compress S[1:n] to its shortest program p_n. Then, we run this program one more step and see what it outputs. It may or may not agree with p at n+1, but that's OK. Instead of worrying about that, we "normalize" over all programs (even the non-shortest ones) that output S[1:n] and weight them in probability order. This gives us a probability distribution over candidate programs p_n, p_n', p_n'', and so on, and we can sample (randomly) from this distribution, weighted so that a program of length L gets weight proportional to 2^-L. In other words, the shortest program is the most probable, the next-shortest is a little less probable, and so on. We run all of these programs for one extra output step to see what they output for S[n+1], and we weight each result by the sum of the probabilities of all the programs that "voted" for that output. This is like having a room of bettors betting on the output, but the amount they can bet shrinks rapidly with their height: the shorter the bettor, the more they are allowed to bet. Then, we sum up all the bets on each output, and this is the inferred distribution over candidate next elements -- this is the output of Solomonoff's M.
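
Here is a deliberately tiny caricature of that procedure in Python. Real M enumerates *all* programs on a universal prefix machine (which is uncomputable); this sketch swaps in a small hand-picked hypothesis class with made-up "program lengths" in bits, purely to show the 2^-length weighting and the voting step:

Code:
from collections import defaultdict

# Each entry: (name, assumed program length in bits, function n -> nth output bit).
# Both the hypotheses and their "lengths" are toy assumptions for illustration.
HYPOTHESES = [
    ("all zeros",              8, lambda n: 0),
    ("all ones",               8, lambda n: 1),
    ("alternate 0,1",         10, lambda n: n % 2),
    ("01010 then all zeros",  20, lambda n: n % 2 if n < 5 else 0),
]

def predict_next(observed: list[int]) -> dict[int, float]:
    """Keep only hypotheses consistent with the observed prefix, weight each
    by 2**-length, and sum the weights voting for each candidate next bit."""
    votes = defaultdict(float)
    n = len(observed)
    for _, length, f in HYPOTHESES:
        if all(f(i) == bit for i, bit in enumerate(observed)):  # consistent with S[1:n]?
            votes[f(n)] += 2.0 ** -length
    total = sum(votes.values())
    return {bit: w / total for bit, w in votes.items()} if total else {}

print(predict_next([0, 1, 0, 1, 0]))
# ~{1: 0.999, 0: 0.001}: the shorter "alternate" hypothesis dominates the vote,
# but the longer surviving hypothesis still gets a small share of the probability.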

What makes this so powerful is that every possible pattern has been contemplated, including patterns that the human mind cannot even conceive of. Obviously, this makes M itself a mathematical fiction... no one can ever build an actual M. However, M allows us to define with complete precision what we mean by "cause". For A to "cause" B means that A is the best explanation (the shortest program p_n) which predicts the next output, S[n+1], which is what we are calling B. In short, cause-and-effect reasoning is running M in reverse: given some output, calculate the possible causes and identify the most probable cause. "Given output B, calculate whether A caused it." And note that we're not just calculating mere correlations here; we're calculating honest-to-goodness causality.

LessWrong has an article on this topic and I'll quote it here:

Solomonoff induction [makes the scientific method] into an algorithm:

1. Make an observation.

Our observation is our binary sequence of data. Only binary sequences are data, and all binary sequences can qualify as data.

2. Form a hypothesis that explains the observation.

We use a universal Turing machine to find all possible hypotheses, no fuzziness included. The hypothesis “explains” the observation if the output of the machine matches the data exactly.

3. Conduct an experiment that will test the hypothesis.

The only “experiment” is to observe the data sequence for longer, and run the universal machine for longer. The hypotheses whose output continues to match the data are the ones that pass the test.

4. If the experimental results disconfirm the hypothesis, return to step #2 and form a hypothesis not yet used. If the experimental results confirm the hypothesis, provisionally accept the hypothesis.

This step tells us to repeat the “experiment” with each binary sequence in our matching collection. However, instead of “provisionally” accepting the hypothesis, we accept all matching hypotheses with a probability weight according to its length.

Now we’ve found the truth, as best as it can be found. We’ve excluded no possibilities. There’s no place for a scientist to be biased, and we don’t need to depend on our creativity to come up with the right hypothesis or experiment. We know how to measure how complex a hypothesis is. Our only problem left is to efficiently run it.

Behind Solomonoff's M is something called the universal prior. What the heck is that?

Well, Solomonoff's theory is really algorithmic information theory applied to the domain of Bayesian probability.

Bayes' theorem:

Code:
P(A|B) = P(B|A) x P(A) 
         ----------------
               P(B)

... where A and B are events and P(B) ≠ 0.

Note: P(x|y) is read, "the probability of x given y," and just means what you think it means: if y happens, what is the probability of x? That is P(x|y).

For our purposes, we think of P(A) as the "prior probability" of our theory A, before seeing any evidence. In physics, P(A) would be the probability assigned to a physical theory. P(B) is the overall probability of observing the event B, summed over all candidate theories. P(B|A) is the probability of B according to our theory A. P(A|B) is called the posterior probability, and can be thought of as the probability of A (theory) after observing B. Sometimes, B is called D for Data and A is called H for Hypothesis (I find this the easiest way to understand Bayes' theorem).
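
As a quick numeric sanity check (all numbers here are made up for illustration), here is the hypothesis/data reading of the theorem worked out in Python:

Code:
# Bayes' theorem in the Hypothesis/Data reading: P(H|D) = P(D|H) * P(H) / P(D),
# with P(D) expanded over the hypothesis being true or false.  The numbers
# below are invented purely for illustration.

p_h       = 0.01   # prior: how probable the hypothesis is before seeing data
p_d_h     = 0.90   # likelihood: probability of the data if the hypothesis is true
p_d_not_h = 0.05   # probability of the data if the hypothesis is false

p_d   = p_d_h * p_h + p_d_not_h * (1 - p_h)  # total probability of the data
p_h_d = p_d_h * p_h / p_d                    # posterior

print(f"P(H|D) = {p_h_d:.3f}")   # about 0.154: the data raised a 1% prior to ~15%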

Imagine running Solomonoff's M over every possible data sequence (of finite length). So, instead of just some fixed S, we're looking at S_1, S_2, S_3, ... over the set of every possible finite sequence. Imagine every possible output distribution from every possible sequence S_*. THAT is the Universal Prior. And it is kind of a big deal. In a sense, the Universal Prior is absolute -- it cannot be any other way than it is, and it is the same in every logically possible universe (once again, physics doesn't matter, multiverses don't matter; the UP is always the same, and fixed). Super-advanced aliens would have to know about the Universal Prior, because our interest in the UP isn't a particularity of human curiosity. The UP exists whether humans ever thought about it or not. There is provably no "theory of everything" but, if there were, the Universal Prior would be it. In short, the Universal Prior can be thought of as the pattern that is present in every pattern. It is "pattern-ness" itself. And, for this reason, every possible sequence of cause-and-effect is within the Universal Prior somewhere. It is like the ultimate Sherlock Holmes or Columbo. It computes the most likely explanation even from the scantest of information, based on careful observation and deduction according to the rules of the system being analyzed. I'm intentionally anthropomorphizing a bit, but the importance of the Universal Prior is truly under-appreciated. It's practically impossible to avoid the theological implications of something like the Universal Prior. I cannot see how one can study the UP and be an atheist... it is a living contradiction. To understand the UP is to understand the logical inevitability of God. It is doubtless a keyhole-glimpse into the mind of God.

We may think of the Omega constant and the Universal Prior as two opposite ends of one spectrum. On the one hand, Omega formally captures everything we mean by "randomness". On the other hand, the Universal Prior formally captures everything we mean by "pattern" or cause-and-effect. While these two ideas do not, on their own, exhaust everything that there is to be said on the subject of causality and randomness, they do really resolve some of the biggest open questions of human philosophy. While the Universal Prior cannot tell us that the future will be like the past, it does explain why we feel like the future should be like the past. It is not a baseless subjective feeling -- the only way that the future would not be like the past is if we are trapped in some kind of cosmic-scale hoax. And since a cosmic-scale hoax is a vastly less probable explanation than that the world is just what it appears to be, it follows that our feeling that the future will be like the past is not unjustified. There is some extremely tiny chance we could turn out to be wrong. Perhaps the hoax really is that elaborate. That's not logically impossible (not a contradiction-in-itself). But it's also vastly improbable given the information available to us (our immediate environment). In short, the probability that we are in a cosmic-scale Truman Show is vanishingly small. The probability that the world is just what it appears to be is a vastly more parsimonious (and probable) theory.

---

PS: If you find a technical error in this article, please kindly point it out below and I will correct it.
 


In the video, he makes the claim that most programs halt, and so the value of Omega is very close to 1. This is not correct, and it's a substantial mistake with respect to understanding how Omega is calculated. The famous paper Computing a glimpse of randomness by Calude et al. calculates the first 64 bits of an Omega and finds that it is less than 1/64, which is a pretty small number. How come the value they calculated is so far from 1, while the video creator thinks it should be almost 1? Because the numerical value of Omega in the (0,1) interval is somewhat arbitrary and depends almost completely on how you choose to construct the universal Turing machine M on which you define Omega. Thus, when we think of Omega as a concrete numerical value, we usually notate it Omega_M, that is, "Omega relative to universal Turing machine M". The significance of Omega, as explained in the video, is not altered by the relative arbitrariness of its initial bits. We can make any particular Omega_M very close to 0 or very close to 1, depending only on the M we choose. But make no mistake, this doesn't make the bits of Omega arbitrary! Regardless of the M we choose, Omega_M still contains the same total amount of information (an infinite amount, actually) about the Halting Problem on M. And since M is universal, it can simulate any other Turing machine M'. We can reduce the halting problem on any Turing machine M' to the halting problem on M, and answer that with the oracle Omega_M. Thus, it does not matter what universal Turing machine M we choose, so long as it is universal -- Omega_M will have all the information about the halting problem on any other M' we might want to ask about. Thus, the numerical "magnitude" of the probability Omega_M is incidental... we learn nothing meaningful if it is close to 0, close to 1, close to 1/2, or some other value. It is not Omega's magnitude that matters; what matters is that we can use the information in the concrete bits of Omega_M to provide uncomputable speedup to a halting dovetail (this is explained in layman's terms in the final section of the video).
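
To illustrate that last point, here is a Python sketch of the classic way Omega's bits speed up a halting dovetail, reusing the toy machine from the earlier sketch (programs "1"*k + "0", with the invented rule that program k halts iff k is even, so the toy Omega is 2/3 = 0.101010... in binary). For a real universal machine the bits of Omega_M are uncomputable; here they are simply assumed, which is exactly the "oracle" role they play:

Code:
# Given the first n bits of Omega, a dovetail can decide halting for every
# program of length <= n: keep adding 2**-len(p) for programs seen to halt,
# and once the running total reaches Omega truncated to n bits, any program
# of length <= n still running can never halt (halting later would push
# Omega past its n-bit bound).  The "machine" below is the same toy as before.

def toy_omega_first_n_bits(n: int) -> float:
    """The toy Omega (2/3 = 0.101010...) truncated to n binary places."""
    return sum(2.0 ** -(i + 1) for i in range(n) if i % 2 == 0)

def toy_halts(k: int) -> bool:
    """The invented halting rule, used here only to drive the dovetail."""
    return k % 2 == 0

def decide_halting(k_query: int, n_bits: int) -> bool:
    """Decide whether toy program k_query (of length k_query+1 <= n_bits)
    halts, using only the first n_bits of the toy Omega plus a dovetail."""
    assert k_query + 1 <= n_bits, "oracle bits must cover the program's length"
    omega_n = toy_omega_first_n_bits(n_bits)
    lower = 0.0
    for k in range(10_000):               # enumerate programs (a real dovetail
        if toy_halts(k):                  # would interleave simulation steps)
            if k == k_query:
                return True               # saw the queried program halt
            lower += 2.0 ** -(k + 1)
        if lower >= omega_n:
            return False                  # the queried program can never halt
    raise RuntimeError("budget exceeded (cannot happen for this toy machine)")

print(decide_halting(k_query=4, n_bits=8))   # True  (even k: the toy rule says it halts)
print(decide_halting(k_query=5, n_bits=8))   # False (odd k: never halts)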
 


It can sometimes feel like all of this computational theory is little more than chicken-scratch on a chalkboard, but in many ways the theorems of computer science are even more concrete than the foundations of mathematics. Mathematics, by its very nature, eschews any consideration of "how could this be done (physically)?" But even though computer science has purely mathematical objects (e.g. infinities), most of it can be implemented in real systems, which is why we have computers everywhere today. Thus, the Turing machine is not merely a thought-experimental device; we can physically build an arbitrary approximation of a Turing machine (constrained only by the finite size of the tape), and that's one of the reasons why computer science isn't just chicken-scratch on a chalkboard. In many ways, it is even more general than physics, yet it is not merely thought-experimental as most of pure mathematics is. It sits in a middle ground where it is able to treat thought-experimental objects, but from a concrete foundation, meaning its conclusions are binding on any logically possible universe under the constraint of finitude, including ours.
 