The untold story here is that AI tech is being driven, in part, by the collapse of Moore's Law. The exponential physical scaling of silicon transistors from ~1975 to ~2010 was so intense that AI research was often obsolete by the time it could be published. "We made an AI algorithm that performs 25% better than hand-crafted algorithms!" is irrelevant when my hardware can run the old hand-crafted algorithm at twice the speed it ran at when you started your research.
More importantly, this explains why a shift away from digital is inevitable unless there is some magic breakthrough in quantum physics that makes room-temperature, desktop-sized quantum computers possible. I won't be holding my breath. We already have multiple technological alternatives to digital logic that could boost energy efficiency on real AI/ML workloads by 1,000x or more. See the last paper I linked above. Obviously, we're not going to "scrap" digital. Digital will continue to be the beating heart of your PC. But we need high-performance, general-purpose accelerators that let these currently power-hungry AI tasks run at massively lower energy profiles. That's not just blue-sky speculation; it's absolutely possible today, with existing tech. We don't even need new toolchains... literally just make the design and send it through.
As he notes at the beginning of the video, 1T (single-thread) CPU performance is tapering off. Multi-threading is still providing performance boosts, not quite at the pace of Moore's Law, but still significantly better than a flat line. And GPUs are providing data-center scaling because GPUs can be endlessly glued together like Lego bricks to build ever-bigger compute clusters with sub-linear overhead costs. So, if you are operating at AWS scale, GPUs are the future (for now). But the problem is that industry (including a lot of the leading tech-industry "decision-makers") doesn't understand one of the most basic principles of computational complexity theory: proving lower bounds is extremely hard. What does this mean, and why is it relevant?
In most industries, we have some kind of "basic unit" for every commodity. Even tools and other complex machinery will eventually get tranched by the market. So you have various tiers of combines for corn harvesting. Or various tiers of fencing for cattle ranches. Various tiers of feedstock. Various grades of bolts. Various grades of wheat. And so on, and so forth. Everything eventually gets graded, and those grades become the "Lego bricks" of that industry. If you need to do a job requiring X grade of equipment or materials, then your acquisitions department knows what to order. If you have some X+1 grade materials available, and they have no better use, then you can use them instead, clearing out inventory that's just taking up space and avoiding a new order for materials. But if you have some X-1 grade materials, that's not going to work, because the job requires X grade. This is the bread and butter of most commercial activity; this is the beating heart of the economy; this is "how things get done".
In the tech industry, there has been a long and gradual process of taming the computational beast. Over time, there has been a continual drive towards standardized units of data storage, standardized units of networking, and standardized units of compute. However, computing is a commodity wholly unlike all the others, because compute can't actually be reduced to basic units the way practically every other commodity can. Why not?
The reality is that there are two ways of looking at computing. Computing can be thought of as the activity of performing a computation. So, for example, if you want to multiply two numbers, you can use the high-school add-and-shift algorithm to compute the result, and we can call this process "computing". The other way of viewing computing is in terms of definition -- a multiplication is just whatever activity gives the correct answer at the output. For example, I can use a slide-rule to multiply any 3-digit number by any other 3-digit number in a single motion. 6-digit numbers can usually be multiplied in 2 motions. And so on. There are other ways to perform a multiplication involving many digits in a single motion. If I do not specify how you are to multiply, only that the answer be correct (as checked by division), then multiplication is not any specific action; it's any one of an infinite family of possible actions that will always give the correct result.
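To make that concrete, here's a toy sketch in Python (my own illustration, not anything from the video): two completely different procedures, judged only by whether the answer comes out right.

```python
# "Multiplication" defined by its output, not its procedure: two very
# different methods, identical answers.

def shift_and_add(a: int, b: int) -> int:
    """Binary version of the schoolbook method: add shifted copies of a."""
    result = 0
    while b > 0:
        if b & 1:           # if the current bit of b is set...
            result += a     # ...add the current shifted copy of a
        a <<= 1             # shift a left (i.e., multiply it by 2)
        b >>= 1             # move on to the next bit of b
    return result

# Check against the built-in operator, which uses a completely different
# procedure under the hood -- only the answer matters.
for a, b in [(123, 456), (999_999, 999_999), (7, 0), (1, 31)]:
    assert shift_and_add(a, b) == a * b
print("every procedure agrees on the answers")
```

A slide-rule or a lookup table would pass the same check. The check is the spec; the method is whatever you like.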
The problem arises when you try to define a "basic unit of multiplication". In reality, there is no such thing -- or, if there is, we can only define it in some purely abstract sense that is not applicable to real multiplications. The reason for this squishiness is that multiplication algorithms are always susceptible to improvement. Today, the fastest known exact matrix-multiplication algorithm has complexity around O(n^2.371552), according to Wikipedia. While we know that exact matrix multiplication can't be done faster than O(n^2), there is still a lot of mathematical headroom between 2.371552... and 2. That 0.3715... translates into gigawatt-hours of energy spent every year on matrix multiplications that, if the exponent were somehow reduced to 2, would not be required. While nobody has yet improved on this latest exponent, there's nothing stopping a new and better one from being discovered next month.
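The classic example of this exponent-chipping is Strassen's trick, and a toy sketch (again mine, nothing fancy) shows the point: count the scalar multiplications the naive method needs versus Strassen's 7-multiplication recursion. Same answers, different exponent -- and the newer O(n^2.37...) results keep pushing the exponent down the same way, at least on paper.

```python
# How many scalar multiplications does an n x n matrix product "cost"?
# It depends entirely on the algorithm -- there is no fixed basic unit.

import math

def naive_mults(n: int) -> int:
    return n ** 3                       # classic triple loop: n^3 multiplies

def strassen_mults(n: int) -> int:
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)   # 7 half-size products instead of 8

for n in (64, 256, 1024):
    s = strassen_mults(n)
    print(f"n={n}: naive={naive_mults(n)}, strassen={s}, "
          f"effective exponent ~ {math.log(s, n):.3f}")   # -> ~2.807, not 3
```

The "cost of a matrix multiplication" is a moving target, which is exactly why you can't bottle it up as a fixed commodity unit.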
The "squishiness" underneath basic elements of compute makes the idea of a "basic unit of compute" effectively impossible, and this makes computation unlike, say, oil. Oil has a basic unit called a barrel. This basic unit is what allows the market to bring its enormous allocation forces to bear on the oil industry. But compute breaks that model, at least, if you try to naively apply it to compute without understanding why it can't be naively applied.
AI will help solve a lot of this. But first, we're going to have to let the power-hungry digital logic go. AI "fuzzifies" compute, and that's exactly what you need in order to build some kind of basic unit of compute. AI art shows this principle in action. "Hi, I need an image of a tabby cat leaping through the air with a park in the background for my marketing brochure" ... does the customer really care how this image is generated? Probably not. So, you can go down to the park with your cat and a camera, OR you can fire up GIMP and Stable Diffusion and use AI to generate a "fake" image that serves the customer's needs just as well as taking the photo manually would have. Notice the parallel to the paragraph above, where I drew the distinction between high-school multiplication and using a slide-rule or another mechanical method to calculate a multiplication. If the customer requires that a specific cat be photographed in a specific park (and they're willing to pay for that), then so be it, we'll do that. But does anyone care that a multiplication be done using the high-school add-and-shift method?? Of course not. So, computational tasks are inherently indifferent to the "how", and only ever actually care about the "what" (the correct answer). This is why there is no such thing as a basic unit of compute in that sense, and there cannot be.
But with AI, we can actually use the AI layer to chop up problems into basic units for us. That's essentially what we are paying software engineers to do today. The problem is that the demand for software engineers is effectively infinite, because there is always more chopping that could have been done. Consider the accounting department of a large company. They've digitized everything so that all their accounts can be tracked digitally, charted, etc. However, nobody counts the screws used by the assembly team, because they are such a minor line item. Technically, the efficiency of the company could be improved by tracking those screws, too. But who's going to actually do that task? Are you going to hire someone to be the "screw-counter"? And can that actually be profitable, given that you now have to pay this person's salary in order to count screws?? That's where AI fits, because it can perform these insanely fiddly tasks that nobody can economically do manually, but which do yield real efficiency improvements.

Specifically, in the compute space itself, you can think of any compute workload as a blob of "stuff" that needs to be done. That blob can be broken down a zillion different ways. Software engineers will break it down in their particular way. It will be good but, of course, nothing is ever perfect, especially on a timeline. So there is always headroom for improvement. That headroom was inaccessible before because, like screw-counting, exploiting it always meant "one more head" to do the actual slicing-up of the work. We can now use AI to "churn" on such tasks and generate useful reductions of large, real-world compute jobs into many specific compute tasks which are then sent off to be processed individually.
Note that what I'm describing in the previous paragraph is not that much different from the CPU/GPU architecture. This "small, fast central brain with massively parallel auxiliary compute" model scales up to everything, even data-center compute. The data-center has a control center, and the control center is what you're actually talking to when you dispatch your compute workload. Once the control center receives the compute work order, it allocates servers/GPUs to do that compute, and then dispatches the compute to them. The more we "AI-ify" this model, the more closely we can approach full utilization of the system, and the more we can minimize the latencies that come from wake-from-idle and from demand peaks (using price tiers to smooth out the peaks). Again, don't just think in terms of specific compute jobs like "run XYZ software program with ABC input"; rather, think of fuzzier tasks like "Draw an image of a cat leaping through the air with a park in the background", and leave it to the AI to figure out how to break that task down into individual "units of compute". Ironically, the fuzzier we make the "unit of compute", the more we can commoditize it. This is exactly backwards from intuition, and shattering that pervasive misconception will be key to making real forward progress in compute scaling, instead of just feeding the AI hype/mania.
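To be clear about the shape I'm describing (and only the shape -- every name below is hypothetical, and the "AI" step is just a hard-coded stub), here's a minimal control-center sketch:

```python
# Hypothetical sketch of the control-center model: a fuzzy request comes in,
# the AI layer chops it into concrete compute units, and the units are
# dispatched to workers. decompose() is a stand-in for the AI layer.

from concurrent.futures import ThreadPoolExecutor

def decompose(fuzzy_request: str) -> list[dict]:
    # Stand-in for the AI layer: turn one fuzzy job into schedulable units.
    return [{"task": "render_candidate", "prompt": fuzzy_request, "seed": s}
            for s in range(4)]

def run_unit(unit: dict) -> str:
    # Stand-in for a worker (a GPU node, an accelerator card, whatever).
    return f"done: {unit['task']} (seed={unit['seed']})"

def control_center(fuzzy_request: str) -> list[str]:
    units = decompose(fuzzy_request)                  # AI does the chopping
    with ThreadPoolExecutor(max_workers=4) as pool:   # dispatch to the fleet
        return list(pool.map(run_unit, units))

print(control_center("a cat leaping through the air, park in the background"))
```

The point of the stub is the division of labor: the control center owns the fuzzy-to-concrete translation, and the workers only ever see concrete, schedulable units.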
GPUs are, frankly, an insane substrate for AI/ML. It makes no sense whatsoever to push all of these fuzzy/approximate compute workloads through the power-hungry, exact multiplication circuits of a GPU. An approximate multiplication serves just as well, which is why quantum computing can even be considered a possible candidate for replacing digital compute in the AI/ML space.
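A quick back-of-the-envelope illustration (mine, not from the video): run the same matrix product exactly in double precision and approximately in half precision, and compare. For the fuzzy workloads discussed above, that level of error is typically irrelevant.

```python
# Exact vs. approximate multiplication on the same data: the half-precision
# product tracks the exact one closely, which is all a fuzzy workload needs.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))

exact  = a @ b                                          # float64 product
approx = a.astype(np.float16) @ b.astype(np.float16)    # float16 product

# Largest entry-wise error, scaled by the largest entry of the exact result.
err = np.abs(exact - approx.astype(np.float64)).max() / np.abs(exact).max()
print(f"max error relative to largest entry: {err:.5f}")
```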