WHOA -- OpenAI o3 is a legit big deal... they're cooking with fire now:
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub -- achieves 87.5% on the semi-private evaluation set in its high-compute configuration (above the nominal 85% human-level threshold)
The hype is going to be unbearable. And the hypists, as always, are going to miss the real lesson here. If you read the blog post, what made this big leap possible from the previous GPT-4-class models is
architecture. I would describe GPTs as having a kind of internal "rat's nest computing architecture" that allows them to compute a random subset of cognitive tasks efficiently. And there is a large space of cognitive tasks that were not in their pre-training, so they just suck at those. And they lack a human-like adaptive learning construct, so whatever gets baked in is what you get. o1 introduced chain-of-thought, which allows the model to "think", but at a cost: you pay for every token! o3 is no different in that respect, but they've apparently tuned up its internal architecture so that it is much more efficient at solving the kinds of puzzles that are in the ARC dataset, which require very generalized forms of reasoning and inference.
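To make "you pay for every token" concrete, here's a back-of-the-envelope sketch. The price and token counts are made-up placeholders for illustration, not anyone's actual billing:

```python
# Back-of-the-envelope cost of chain-of-thought inference.
# The price below is a hypothetical placeholder, not real pricing.
PRICE_PER_1K_OUTPUT_TOKENS = 0.06  # hypothetical $/1K output tokens

def cot_cost(answer_tokens: int, reasoning_tokens: int) -> float:
    """Every token the model 'thinks' with is billed like any other output token."""
    return (answer_tokens + reasoning_tokens) / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

# A terse answer vs. the same answer preceded by a long reasoning trace:
print(cot_cost(answer_tokens=50, reasoning_tokens=0))       # 0.003  -> about $0.003
print(cot_cost(answer_tokens=50, reasoning_tokens=20_000))  # 1.203  -> about $1.20
```

The answer is the same fifty tokens either way; the "thinking" is where the bill comes from.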
I'll have to see some demos of o3 before rendering my opinion on whether it can be said to be "thinking" in any meaningful sense (obviously not consciously), but my gut instinct is that, despite its high score on the ARC Prize leaderboard, it's still going to be a No. And even if it is "thinking" in a sense that I would consider suitable to be labeled as such, we have no way to actually assess this or introspect into what is happening in its mind. Note that we can absolutely introspect into the minds of other humans... we do this simply by asking them questions.
The current AI architectures do not maintain state and they do not "develop". And while CoT gives them something that looks and feels a lot like inference, I'm willing to bet that they still don't have unrecantable grounding, meaning they don't really know that certain facts are *absolutely* true. In the long run, after all the AI mega-hype has faded, after the absurd AI bubbles have burst and sober reality finally sets back in, researchers are going to start admitting that there really is no shortcut around
development. All intelligent organisms in Nature, without exception,
develop. They develop from early exploration and play, into adolescent wariness and skill-transfer, to mature solidity and cautious exploration based on long-developed intuition. That is as true of wolves as it is of cats, of deer, of humans. The idea that silicon transistors are some kind of magical exception to this universal pattern of Nature is ridiculous. Computers have been called "a bicycle for the mind". It's a good metaphor, and I will extend it by suggesting that non-developmental AI is "a tractor for the mind" -- extremely useful, but its lack of development means that its mind-state is necessarily in some kind of amnesic condition. It literally has no memory of anything before the question, "Can I patch a nail puncture in my car tire with some bubble gum?" That's not how thinking works, neither in humans nor in animals. Context is an
indispensable ingredient of thinking and the context of these pre-trained models -- even with CoT methods -- is
zero.
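To illustrate the statelessness point, here's a minimal sketch of a chat-completion-style call. The `chat` function is a hypothetical stand-in, not a real client library; the point is that any continuity has to be supplied by the caller re-sending the transcript:

```python
# Sketch of why the model's own context between calls is zero.
# `chat` is a hypothetical stand-in for any chat-completion-style API.
def chat(messages: list[dict]) -> str:
    """A model call sees ONLY the `messages` passed in; nothing persists."""
    return f"(model reply given {len(messages)} message(s) of context)"

# Call 1: the model answers, then retains nothing.
print(chat([{"role": "user", "content": "My car tire has a nail in it."}]))

# Call 2: unless we replay the whole transcript ourselves, the model has
# no memory that call 1 ever happened.
print(chat([
    {"role": "user", "content": "My car tire has a nail in it."},
    {"role": "assistant", "content": "(whatever it said last time)"},
    {"role": "user", "content": "Can I patch it with some bubble gum?"},
]))
# Any "memory" lives in the transcript the caller maintains, not in the model.
```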
Summary: While the 87.5% score on ARC-AGI is no mere stunt, it still doesn't get to the heart of the issue behind the ARC Prize, and the underlying lesson has yet to be learnt. Memory, inference, grounding and contextual understanding (with or without embodiment! I consider disembodied AGI to be legitimately possible) are absolutely necessary ingredients. You can duct-tape them onto your whiz-bang AI machine as an afterthought, but this only shows that you're not really thinking seriously about AI. You're engaged in some kind of magical thinking where AI "just happens" once some inevitable "scaling law" is put into motion, and that's all just a bunch of bull-hockey.
All the big improvements in AI have been the result of architectural changes. You can "strip-mine" the edge of performance on any given architecture, taking an 80% score up to an 81.7% scoreboard "winner" using unlimited compute-scaling, but that kind of behavior is only sustainable under hype. Hype, by its very nature, is transitory. Sooner or later, these companies are going to have to stop the constant hype-baiting and start building honest-to-goodness AI systems that do useful things without trying to suck people into some kind of mass-surveillance-and-mind-control matrix run by crappy AI algorithms...
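For the record, the kind of compute-scaled "strip-mining" I mean is the sample-many-and-vote trick: run the same model on the same task many times and keep the majority answer, buying a few points of accuracy at enormous cost. A toy sketch, where `solve_once` is a made-up stand-in for one stochastic model run, not anyone's actual pipeline:

```python
import random
from collections import Counter

# Sketch of "strip-mining" a benchmark with test-time compute:
# sample the same model N times on a task and keep the majority answer.
def solve_once(task: dict) -> str:
    """One sampled attempt; imagine a noisy model call."""
    return random.choice(task["candidate_answers"])

def solve_with_voting(task: dict, n_samples: int) -> str:
    """Spend n_samples times the compute to squeeze out a few more points."""
    votes = Counter(solve_once(task) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A model that is right about 2/3 of the time per sample:
task = {"candidate_answers": ["right", "right", "wrong"]}
print(solve_with_voting(task, n_samples=1))     # wrong about 1/3 of runs
print(solve_with_voting(task, n_samples=1024))  # almost always "right"
```

Nothing about the model got smarter between those two lines; the scoreboard number just got more expensive.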