Category: AI

  • Nature is Laughing at the AI Build Out

    I was catching up on Acquired episodes this week and listened to the spectacular “Google: The AI Company” episode that Ben and David put together. What the two of them have created with their podcast fascinates me because its success feels inevitable. I think Acquired proves that if you are smart and work your ass off at creating pure value, they will, in fact, come.

    But what struck me in the episode was a quote that Ben shared from Greg Corrado of Google Brain, who said that in nature, the way things work is the most energy-efficient way they could work.

    The transformers paper and all the work that preceded it have led us to a fairly effective way to emulate some of what the human brain does. And I think it’s clear that most of the value that LLMs are delivering, and will deliver, is a cognitive supplement or replacement for many of the functions of the human brain.

    So we’re competing against nature. AGI is a declaration of intent to try to replace the human brain. We’ve made it clear that we’re coming after mother nature’s design.

    We’ve got a ways to go. The human brain uses 20 watts of electricity. I have three RTX 5090s here in my basement, and one of them consumes 800 watts when it’s working hard, if you include the CPU and chassis power. And while the RTX 5090 is a Blackwell-powered beast with 32GB of VRAM, the models it’s capable of running don’t even come close to competing with something like GPT 5.2, Claude Opus 4.5 or Gemini 3 Pro.

    It would be unfair for me to ignore concurrency, throughput, 24 hour availability, specialization and other capabilities that differentiate these models from the capability of a single human consuming 20 watts. We need sleep, we suck at multitasking and refuse to admit it, and we tend to work slowly on hard problems. But we’re still doing a lot with our 20 watts.

    The IBM 7090 was a groundbreaking transistorized computer available in 1959. I think that with LLMs and AI as a problem space, we’re in the IBM 7090 era. The 7090 occupied around 2,500 square feet of floor space.

    “I think there is a world market for maybe five computers.”
    Thomas J. Watson, IBM chairman, 1943

    The cellphone you’re probably reading this on has 21 million times the processing power of the IBM 7090. Humans are notoriously bad at intuiting exponential growth. Today we think in terms of large foundational models from providers with market caps of hundreds of billions or trillions of dollars, which consume hardware created by a company worth trillions of dollars. Today we have three IBM 7090s in the AI space: Anthropic, OpenAI and Google. Thomas Watson might suggest we have space for two more.

    It is clear to me that we’re in the IBM 7090 era of AI. Consider the anomaly that is NVIDIA. Geoffrey Hinton, Alex Krizhevsky and Ilya Sutskever, working out of Geoff’s lab at the University of Toronto in 2012, went out and bought two NVIDIA GeForce GTX 580 GPUs to try to win the ImageNet contest, and managed roughly a 10 point improvement in error rate over the next best entry through the parallelization that CUDA provides.

    In 2012, gaming was overwhelmingly the biggest market for GPUs. Today NVIDIA’s data center sales are $51.2B for the latest quarter, and $4.3B for gaming.

    NVIDIA’s existence and success are an acknowledgement that we need to fundamentally reinvent the computers on our desks and in our hands. Ultimately VRAM and RAM will no longer be differentiated and, like the Blackwell architecture, memory bus width will be 512-bit or more, with massive memory bandwidth (around 14 Tbps, or roughly 1.8 TB/s, on the Blackwell RTX 5090 today), along with CUDA-style parallelization.

    But the real reason I wrote this post is the so-called “AI bubble”.

    I’ve actually been a bubble skeptic for some time now. At times I’ve considered that we may be underestimating the value that AI will deliver and that this will be a rare anti-bubble, where market participants profoundly regret underinvesting. But recently my view has been shifting.

    What concerns me is the build-out we’re seeing in power and data centers. We are making potentially flawed assumptions about a few key things:

    1. GPUs will continue to be extremely power hungry, judging by the power investments we’re seeing. Today a single DGX B200 node has 8x B200 GPUs and consumes around 8 kW, with 1,440GB of VRAM. In the absolute best and completely naive case, Anthropic and OpenAI are using a single DGX B200 node per model instance, but my guess is that they’re using several chassis with InfiniBand interconnect per hosted model instance. (A back-of-envelope comparison of these power figures follows this list.)
    2. Model hosting on GPUs will continue to be very expensive, judging by NVIDIA’s market cap. Today a single DGX B200 chassis with 8x B200 GPUs costs around $500,000.
    3. Model hosting will continue to occupy space well beyond normal year-over-year cloud infrastructure growth. Today the WSJ is covering the risk that real estate investors are exposed to in the AI boom.
    4. The power and compute hungry model architectures that we are using today will be comparable to the power and compute needs of future AI model architectures.
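
    Since we’re already doing napkin math on power, here’s a minimal sketch of that comparison in code. It uses only the figures quoted in this post (a 20 watt brain, an 800 watt RTX 5090 box, a roughly 8 kW DGX B200 node); nothing here is measured, it’s purely illustrative.

    ```python
    # Back-of-envelope comparison of the power figures quoted in this post:
    # a 20 W human brain vs. one RTX 5090 workstation vs. one DGX B200 node.
    BRAIN_WATTS = 20             # rough power draw of a human brain
    RTX_5090_BOX_WATTS = 800     # one 5090 under load, including CPU and chassis
    DGX_B200_NODE_WATTS = 8_000  # one 8x B200 node, the naive one-node-per-model case

    def brains_per(watts: float) -> float:
        """How many 20 W brains fit in the same power budget."""
        return watts / BRAIN_WATTS

    print(f"RTX 5090 box : {brains_per(RTX_5090_BOX_WATTS):.0f}x a brain's power budget")
    print(f"DGX B200 node: {brains_per(DGX_B200_NODE_WATTS):.0f}x a brain's power budget")
    # Prints 40x and 400x, before counting multi-node interconnect, cooling
    # overhead, or the breadth of what the brain does with its 20 watts.
    ```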

    What I think will happen is this:

    AI compute will move out of GPU and be part of every computer we build, whether desktop, handheld or data center. The GPU will become a historical curiosity.

    Hosting models that rival human intelligence, capabilities and output on every computer will be the norm.

    Cloud hosting will not go away, and cloud hosting of extraordinarily powerful models for specialized tasks and use cases will be a big business, but it will sit alongside local capabilities and the ubiquity of models and the hardware that supports them.

    The power, space and cost of cloud AI hosting will plummet over the next three decades, with stair step gains multiple times per year. The DeepSeek phenomenon and how it terrified the markets is a harbinger of what is to come. And we will see this in model runners, model architectures, use of existing hardware, and hardware innovation.

    The CUDA monopoly on GPU programmability will end, and programmable, high-performance, low-power AI compute will become available across all computers. NVIDIA’s gross margin on data center GPUs today is approximately 75%, which is an absurd anomaly born of the weird coincidence that gaming graphics hardware happens to be fairly well suited to parallelizing AI compute.

    And finally, and perhaps the most profoundly terrifying prospect: Model architectures will become more power and compute efficient, leading to sudden drops in hardware and power needs, amplified by the increasing compute and power efficiencies in newer hardware.

    So where does this leave us when evaluating the bubbliness of the so-called AI bubble today?

    • Data center real-estate may be overinvested.
    • Power for data centers may be overinvested.
    • The ongoing cost of AI compute, and the profits that accrue to NVIDIA long term may be overestimated.
    • The cloud-hosted-models business will lose market share to on-device models.
    • The concentration of AI capability among foundational model providers won’t last as compute adapts and many models move on device, with the low-latency benefits that provides.

    And finally: as we see stair-step improvements in AI algorithms, we will see precipitous drops in market capitalizations among hardware vendors, power providers, real-estate investors and foundational model providers.

    If mother nature has it right, it’s possible to host something equivalent to human level intelligence using only 20 watts of power, in a space equivalent to the inside of your skull.

  • The Rise of the Operator

    The AI war between Anthropic and OpenAI has produced Claude Code and Codex, coding tools so spectacular that they are allowing teams, including ours, to increase ambition so far beyond what was previously possible that they are rapidly transforming the software industry beyond recognition.

    This change and empowerment is forcing us to reimagine roles in software engineering organizations. Ops engineers are particularly well suited to using AI agents for innovation, given their systems and architectural knowledge. It’s challenging for productive senior developers to embrace AI coding, given their muscle memory, but once they do, Codex and Claude Code are spectacular enablers. Our QA engineers at Defiant Inc are writing powerful utilities and seem to have no problem getting the most out of terminal AI coding agents.

    The skillset required to get the most out of Codex and Claude Code combines systems knowledge, architectural knowledge (how things hang together), dev management, dev methodology, an understanding of test-driven development, QA methodologies and excellent communication.

    There’s another component, which is the role of AI industrial psychologist. You need to know how to talk to these things. For example, have it write a markdown document with your product design and iterate on that. Then have it write integration tests before you write a line of code. Then have it reflect on what it learned from the integration tests and have it rework the product document. Then have it create a PLAN.md in the base of the project with a step by step implementation plan, and update AGENTS.md (or CLAUDE.md) to track progress in that doc. AI industrial psychology at work.
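
    To make that loop concrete, here’s a minimal sketch of the staged workflow driven from a script. It assumes the Claude Code CLI’s non-interactive `claude -p` mode (Codex has an equivalent `codex exec`); the file names and prompt wording are illustrative, not a prescription, and the state between stages lives in the files the agent writes rather than in a shared chat session.

    ```python
    # A sketch of the staged "AI industrial psychologist" workflow described above.
    # Each stage is a separate non-interactive agent run; progress persists in the
    # repo (DESIGN.md, tests, PLAN.md, AGENTS.md), not in a chat session.
    import subprocess

    STAGES = [
        "Write DESIGN.md describing the product: goals, constraints and user flows.",
        "Write integration tests for the behavior in DESIGN.md, before any implementation code.",
        "Reflect on what the integration tests revealed and rework DESIGN.md accordingly.",
        "Create PLAN.md in the project root with a step-by-step implementation plan.",
        "Implement the next unfinished step in PLAN.md and update AGENTS.md with progress.",
    ]

    def run_stage(prompt: str) -> None:
        """Send one staged prompt to the agent and let it work against the repo."""
        subprocess.run(["claude", "-p", prompt], check=True)  # or: ["codex", "exec", prompt]

    if __name__ == "__main__":
        for stage in STAGES:
            run_stage(stage)
    ```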

    I’ve come up with a term in our team to refer to those particularly good at bringing these skills together and getting productive shippable product out of AI agents: Operators.

    I like the word ‘Operator’. My favorite context is the way I heard it used at Suits & Spooks in DC, a conference that used to bring the intelligence community and the tech community together. Someone from a three-letter agency referred to special forces personnel as operators. It sounds badass, but I also think it captures what an operator does: they are in the field, backed up by a team, bringing together a range of skillsets to accomplish a mission. They are also very mission and outcome focused.

    I like how Operator places the emphasis on mission and outcome, because AI agents are particularly bad at completing the second 90% of a project, and an Operator is particularly good at driving the agent across the finish line.

    A few of my own techniques: ask it what is left until the project is complete (simply populating the context window with that already helps), then have it update PLAN.md, and use a prompt with words like “finish line”, “outcome focused”, “deadline” and so on. Drive it across the finish line. AI agents will find ways to fill an infinite project schedule. Kinda like humans will.

    We even have our own Operator Slack emoji now, which is of course Tank answering his headset in the original Matrix (1999) with: “Operator”.

  • RAG Enabled WordPress in Core Could Transform WordPress from CMS to AIMS

    Sometimes you see an idea in an area you’ve been thinking about deeply, and it doesn’t just click. It’s a puzzle piece that absolutely thunders into place. You take a step back, the puzzle is complete, and what was an abstract image has crystallized into a clear vision of an inevitable future. That just happened with a tweet by James LePage from 5 hours ago.

    I’ve been spending a huge amount of time reading about AI, as have many of us. I’ve also been working hard with a product team, which gives me the opportunity to turn what I’m learning into applied knowledge. The benefit of this is that you’re taking a hard look at the tech landscape, identifying real problems and matching them to applied solutions using the latest (meaning this week’s) ideas and technology.

    I suspect that James mis-tweeted when saying WordPress IS a vector database, but I think we all know what he means – and the concept here is incredibly exciting. WordPress:

    • Is called a content management system, but can also be described as a document management system.
    • Is one of perhaps the two most popular document management systems on Earth – with Google Docs being the other.
    • Has more documents under management than perhaps any other platform, already uploaded, categorized and tagged along with metadata for each doc.
    • Has an app store-like dev ecosystem with the plugin repository.
    • Has several hundred million installs with billions of human visitors to those installs (websites).
    • Is an online application that runs 24 hours a day and provides an interface for humans and machines to interact with.
    • Has a mature security ecosystem (shout out to my company Wordfence) with many vendors and solutions.
    • Has solved high performance document storage and retrieval at scale on a live site with live editing.

    Retrieval Augmented Generation, or RAG, is the process of turning documents into embeddings (arrays of numbers that represent the meaning of each doc or chunk of text), storing those embeddings in a database with an index referencing each doc or chunk, and then retrieving the documents or chunks to augment prompts sent to an AI model. It works like this:

    A company has specific knowledge in a big document database. They vectorize the whole lot (generate embeddings for the docs or chunked docs), store it all and build a RAG application. You come along and interact with, let’s say, a chatbot. Here’s what happens (a minimal code sketch follows the list):

    • Your question is turned into an embedding (an array of numbers) representing its meaning.
    • That embedding is used to retrieve chunks of text in the vector database that are related to your question.
    • The docs related to your question are used to create an AI prompt that reads something like “Here is a user’s question and a bunch of documents related to the question. Use this knowledge along with your own to answer the user’s question as best you can”
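
    Here’s a minimal, runnable sketch of that flow. The embed() function is a toy bag-of-words stand-in purely so the script runs; a real system would call an embedding model (local or via API) there, and a real store would use a vector database or index rather than a Python list.

    ```python
    # Toy end-to-end RAG flow: embed chunks, retrieve by similarity, build the prompt.
    import math, re
    from collections import Counter

    def embed(text: str) -> Counter:
        """Stand-in 'embedding': word counts. A real embedding model captures
        semantic meaning; this only makes the example self-contained."""
        return Counter(re.findall(r"[a-z0-9']+", text.lower()))

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # 1. Vectorize the document chunks once and store them with an index.
    chunks = [
        "Our refund policy allows returns within 30 days of purchase.",
        "Support hours are 9am to 5pm Pacific, Monday through Friday.",
        "The Pro plan includes priority support and nightly backups.",
    ]
    store = [(i, chunk, embed(chunk)) for i, chunk in enumerate(chunks)]

    # 2. Embed the user's question and retrieve the most similar chunks.
    question = "When can I get a refund?"
    q_vec = embed(question)
    top = sorted(store, key=lambda row: cosine(q_vec, row[2]), reverse=True)[:2]

    # 3. Augment the prompt with the retrieved chunks; this is what goes to the LLM.
    context = "\n".join(chunk for _, chunk, _ in top)
    prompt = (
        "Here is a user's question and a bunch of documents related to the question.\n"
        f"Documents:\n{context}\n\nQuestion: {question}\n"
        "Use this knowledge along with your own to answer as best you can."
    )
    print(prompt)
    ```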

    That’s it. That’s RAG. But two things make it super powerful:

    1. That we can represent semantic meaning in a way that lets us retrieve on similarity in meaning. That’s breakthrough number one.
    2. That we can retrieve documents similar in meaning to a question and augment the knowledge of an AI as part of our question to that AI model.

    Put differently: Individuals and organizations can immediately put their SPECIFIC KNOWLEDGE to work and that can be a differentiator for them in this world of AI models. It’s not just your little business providing an interface into Sam Altman’s latest model. It’s your little business with its specific knowledge providing a differentiated AI powered application, because your AI knows things that others don’t and perhaps never will. That’s why RAG rocks.

    Here’s the thing about WordPress: every WordPress website is a collection of specific knowledge that in many cases is extremely high quality and has taken years to accumulate. If you could put that specific knowledge to work in an AI context, WOW! You don’t even need to collect and organize the docs. You’ve already done the work to collect and index the knowledge. You just need to use RAG to feed it to an AI for each prompt, whether you’re generating new content, answering user questions, whatever.

    So what James tweeted was a user interface, probably in working condition, for RAG in WP, and I think he’s doing it in the context of this being added to WP core, since he works for Automattic. But the message here is really: hey, consider enabling one of the largest document management systems in the world as an enabler for RAG apps, using the existing dev ecosystem, massive deployed base and massive collection of documents, and overnight turn this into the largest RAG AI dev ecosystem in the world.

    What I mean specifically is this:

    • Document vectorization would be as easy as it appears in James’ screenshots.
    • RAG retrieval would be available in the core API.
    • WordPress plugins would immediately be able to build applications around fetching chunks of text from the existing knowledge a site has and sending RAG augmented prompts to any AI interface, whether it’s self hosted, open source, closed source, REST endpoint or running local.
    • In other words, WP developers would immediately be able to put the specific knowledge that hundreds of millions of websites have spent years collecting to immediate work for the benefit of the site owner.

    The value of a WordPress website immediately increases by an order of magnitude.

    There are challenges that need to be solved. Specifically:

    • What model is generating the embeddings and where is it run? Local? API endpoint? Does it have vendor lock-in or is it open source? Does the host have vendor lock-in or is it open source? Ideally it would run on CPU and be usable directly from PHP, so no new ops dependencies are introduced.
    • Is the orchestration of vectorization and retrieval in PHP? Or is it Python, which may not be available on a WP site? Ideally all PHP so that Python is not a dependency for existing sites.
    • How is retrieval being done? pgvector, which adds PostgreSQL as a dependency? Or some kind of MySQL/MariaDB magic I’m not aware of? (MySQL and MariaDB have only recently started adding vector types and indexes, and most WP hosts aren’t running those versions yet.) Ideally you wouldn’t add a new DB engine as a dependency.

    If you can eliminate the dependencies, you can deploy a new version of WP core overnight and enable this on every site, world-wide for immediate use.

    There’s a massive opportunity here if hosting providers collaborate with WP core to provide local GPU resources for generating embeddings, along with pgvector for retrieval. It’s a whole new source of revenue for them. At the last WCUS I literally went around and asked every host if they’re providing GPU hosting and only one said they are. It’s wide open for the taking and it’s worth billions, since GPU is far more expensive than CPU to rent.

    It’s quite possible that James’ tweet will be a harbinger of the transformation of WordPress from CMS to AIMS (AI management system).

  • The Nanny Scale

    Let’s say you’re writing a Slackbot that plugs into an LLM. When you send the LLM the prompt, instead of only sending the user message to the LLM, you could send all the Slack data associated with that message, including the message itself. Then you could give the LLM tool calling access to the Slack API to perform a range of lookups using data from the Slack request, including, for example, the Slack user ID of the message sender. To continue this example, the LLM might use the API to look up the username and full name of the message sender, so it can have a more natural conversation with that person.

    For the moment, forget about the specifics of implementing a Slack user interface. The question here is, do we give the model all the data and let it call tools to do things with that data when it determines that would be helpful in building a response? Or do we nanny the model a bit more (provide more care and supervision) and only give it a prompt that we’ve crafted, without ALL the available metadata?

    I’m going to call this The Nanny Scale and suggest that as models continue to get smarter we’ll move more towards increasing model responsibility. Where you sit on the scale also varies based on how smart the model you’re using is. If it’s an o1 Pro level model with CoT and tool calling capability, maybe you want to give it all the metadata and as many tools as you can related to that metadata, and just let it iterate with the tools and the data until it decides it’s done and has a response for you.

    If you’re using a small model and then further quantizing it to fit into available memory, thereby risking reducing its IQ even further, you probably want to nanny the model: increase the care and supervision, reduce the responsibility the model has, pre-parse the data (removing anything that could cause confusion, even if it’s potentially useful), and reduce the available tools, if you provide any at all.
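
    Here’s a minimal sketch of the nanny scale as a dial in application code, sticking with the Slackbot example. The Slack event fields and the users.info-style lookup are real Slack concepts, but the payloads, tool schema and request shape here are illustrative rather than any particular SDK.

    ```python
    # The nanny scale as a dial: how much data and responsibility the model gets.
    import json

    SLACK_TOOLS = [
        {
            "name": "slack_users_info",
            "description": "Look up a Slack user's username and full name by user ID.",
            "parameters": {"type": "object", "properties": {"user": {"type": "string"}}},
        },
    ]

    def build_llm_request(event: dict, nanny_level: int) -> dict:
        """nanny_level 2 = heavy supervision: crafted prompt, no metadata, no tools.
        nanny_level 1 = selected metadata, still no tools.
        nanny_level 0 = full event plus tools; the model decides what to look up."""
        if nanny_level >= 2:
            return {
                "prompt": f"A Slack user asked: {event['text']}\nAnswer concisely.",
                "tools": [],
            }
        if nanny_level == 1:
            return {
                "prompt": json.dumps({"text": event["text"], "user": event["user"]}),
                "tools": [],
            }
        return {"prompt": json.dumps(event), "tools": SLACK_TOOLS}

    if __name__ == "__main__":
        event = {"type": "message", "user": "U024BE7LH",
                 "text": "When is the next release?", "channel": "C12345"}
        print(build_llm_request(event, nanny_level=0))
    ```

    A v1 might ship at nanny level 2 with a cheap model; as models get smarter and cheaper, you slide toward level 0 and hand over the metadata and the tools.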

    It’s clear that a kind of Moore’s Law is emerging with regards to model IQ and tool calling capabilities. Eventually we’re going to have very smart models that are very cheap and that can handle having an entire API thrown at them in terms of available tools. But we’re not there yet. Models are expensive, so we like to use cheaper less capable models when we can, and even the top performers aren’t quite ready for 100% responsibility.

    So as we’re building applications we’re going to have to keep this in mind. We’ll launch v1, models will evolve over several months, and for v2 we’re probably going to have to slide the nanny scale down a notch or two, or risk shielding our customers from useful cognitive capabilities that are revealed when a model takes on more responsibility.

  • Amidst the Noise and Haste, Google Has Successfully Pulled a SpaceX

    In 2013 Google started work on TPUs and deployed them internally in 2015. Sundar first publicly announced their existence in 2016 at I/O, letting the world know that they’d developed custom ASICs for TensorFlow. They made TPUs accessible to outside devs via Google Cloud in 2017 and also released the second generation that same year. And since we’re plotting a timeline here, the Attention is All You Need paper that launched the LLM revolution was published in June of that same year.

    OpenAI got a lot of attention with GPT4, a product based on the AIAYN paper, putting LLMs on the map globally, and Google has taken heat for not being the first mover. OpenAI last raised $6.6B at a $157B valuation late last year, which incidentally is the largest VC round ever, and they did this on the strength of GPT4 and a straight-line trajectory that GPT5 will be ASI and/or AGI, or close enough that the hair-splitters won’t matter.

    But as OpenAI lines up Oliver Twist style, asking NVidia “please sir, may I have some more” GPU for their data centers, Google has vertically integrated the entire stack, from chips with their TPUs, to interconnect, to the library (TensorFlow), to the applications that they’re so good at serving to a global audience at massive scale with super low latency, using the water-cooled data centers they pioneered back in 2018 and which NVidia is just getting started with.

    Google has been playing a long game since 2013 and earlier, and doesn’t have to create short term attention to raise a mere $6 billion because they have $24 billion in cash on their balance sheet, and that cash pile is growing.

    What Google has done by vertically integrating the hardware is strategically similar to SpaceX’s Starlink, with vertically integrated launch capability. It’s impossible for any other space based ISP to compete with Starlink because they will always be able to deploy their infrastructure cheaper. Want to launch a satellite based ISP? SpaceX launched the majority of the global space payload last year, so guess who you’re going to be paying? Your competition.

    NVidia’s markup on the H100 is reportedly around 1000%. That means they’re selling it for roughly 10X what it costs to produce. Google has been producing its own TPUs at scale for 10 years. Google’s TPUs deliver slightly better performance than NVidia’s H100 and are probably on par when it comes to dollars per unit of compute. Which means Google is paying roughly 10X less for AI compute than its competitors.

    And this doesn’t take into account the engineering advantages of having the entire stack, from application to chips to interconnect, all in-house, and how they can tailor the hardware to their exact application and operational needs. When comparing NVidia to AMD, the former is often described as having a much closer relationship with developers, releasing fixes to CUDA on very short timelines for their large customers. Google gets that same closeness by default, because the developers and the hardware team are the same company.

    As a final note, I don’t think it’s unreasonable to consider the kind of pure research that drives AI innovation as part of the supply chain. And so one might argue that Google has vertically integrated that too.

    So amidst the noise and haste of startups and their launches, remember what progress there may be in silence.

  • My 2025 AI Predictions

    The $60 million deal that Google cut with Reddit will emerge as incredibly cheap as foundational model providers realize amidst the data crunch that Reddit is one of the few sources of constantly renewed expert knowledge, with motivated experts in a wide range of fields contributing new knowledge on a daily basis for nothing more than social recognition. The deal is non-exclusive as was demonstrated by a subsequent deal with OpenAI, meaning Reddit will begin to print money.

    Google’s vertical integration of hardware via their TPUs, their software applications, and their scientists inventing the algorithms that underpin the AI revolution is going to begin to pay off. Google will launch a number of compelling AI applications and APIs in 2025 that will take them from an academic institution creating algorithms for others, to a powerhouse in the commercial AI sector. Their cost advantage will enable them to deliver those applications at a far lower price to their customers, and in many cases, completely free. Shops like OpenAI lining up for NVidia GPUs will be the equivalent of a satellite ISP trying to compete with Starlink who have vertically integrated launch capability.

    DeepSeek will continue to demonstrate unbelievable cost reductions after delivering V3 for less than $6 million as the group of former hedge fund guys continues to sit in a room and simply outthink OpenAI, which has been hemorrhaging talent and making funding demands approaching absurdity.

    OpenAI will be labeled the Netscape of the AI revolution and be absorbed into Microsoft at the end of the year. But like Netscape, many of their ideas will endure and will shape future standards.

    As companies like Google and High-Flyer/DeepSeek prove how cheap it is to train and operationalize models, there will be a funding reset, and companies like Anthropic, which raised $4 billion from Amazon in November, will need to radically reduce costs, and we may see down rounds.

    We will see new companies emerge that provide tools to implement o1 style chain of thought in a provider and model agnostic way. Why pay o1 token prices for every step in CoT when some of the steps can be done by cheaper (or free) models from other providers?

    China will continue to rival the USA in AI research and in shipped models. The new administration will rethink the current limits on GPU exports which will prove ineffective at accomplishing their goals of slowing the competition.

    And finally, my personal hope is that the conversation around the dangers of AI will shift from a fantastical Skynet scenario to the practical reality that out of the $100 trillion global GDP, $50 trillion is wages, and that this is both the size of the AI opportunity and the scale of the global disruption that AI will create as it goes after human labor and human wages.

    We need to acknowledge this reality and hold to account disingenuous companies and founders who are distracting from this through AGI and ASI scare mongering. This “look at the birdie while we steal your jobs” game needs to end. The only solution I’ve managed to think of is putting open source tools and open source models in the hands of the workers of the world to give them the opportunity to participate in what could, long term, become a utopian society.