Are Large Language Models (OpenAI's ChatGPT, Anthropic's Claude, Google's Gemini and the others) relentlessly getting better?
Not for me. Not so far, anyway. The problems I had with LLMs six months ago are still with me today: They make stuff up, they get facts wrong, they don't fix mistakes when told about them, and their writing is often meh.
Still, I use them every day. Even with their flaws, LLMs help me get more done in less time. With a LLM-powered search, I find out about ideas, people and facts more quickly. Also quicker: checks on my understanding of technical prose ("does this mean what I think it does?"). Another useful thing: Sometimes I'll get a LLM (GPT-4 or Claude) to produce some paragraphs on a topic, as a guide to what "everybody knows" already. (This can, among other things, remind me about aspects of a subject that I'm bound to address.)
In other words, they're fine tools, as long as you don't trust them overmuch. Don't paste their work into your own, don't assume a quote really says what they claim it says, check their math. That's the state of play in April 2024, just as it was in early 2023. Last year, stories emerged of people screwing up in their jobs because they trusted a LLM too much. Such stories still pop up now. And the new-computer glow seems to be wearing off LLMs for businesses and investors. More than a few companies that invested in "generative AI" (a category that extends beyond LLMs, to be fair) are asking where the payoff is. The research and consulting firm Gartner predicts that the majority of companies that built their own LLMs will abandon them by 2028.
Of course, the (not quite) year and a half since ChatGPT appeared is not that long a time. Maybe it's just taking a while to get to the next leap.
In this post I want to briefly point you to two other possibilities.
Maybe the problem won’t go away with the next release
(A) One is that not much more performance can be squeezed out of the Large Language Model architecture -- i.e., the models have hit a wall.
(B) The other is that these models are too improving, and if you don't see it, that's your fault. You aren't using them well.
That both claims are being made, and that both might be true (they aren't mutually exclusive), is a reflection of how new and weird LLMs are.
Hitting a Wall?
Gary Marcus, a cognitive scientist turned AI entrepreneur turned fierce critic of AI as it is, makes the case here that LLMs have reached the point of diminishing returns. One bit of evidence: on a common metric for LLMs, OpenAI's latest, GPT-4 Turbo, released last November, does little better than GPT-4, which had been released the previous March.
There are nuances here, and other points, and you'd do well to read Marcus directly. But, bottom line, he (and others; he is far from alone) thinks the problem is inherent in the way LLMs work. The models predict likely next words in the context they've been given, based on their training on many millions of texts. Since this design makes no effort to account for the reality beyond the words, goes the argument, the systems will never get past their fundamental unreliability. LLMs may be stuck at this untrustworthy level of performance, like humanity blocked by the Trisolaran sophons in The Three-Body Problem. In which case, to paraphrase another Benioff/Weiss series, maybe LLM winter is coming.
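(To make that mechanism concrete, here is a minimal sketch of the next-word step, using the small open GPT-2 model via the Hugging Face transformers library as a stand-in for the far larger proprietary models discussed in this post; the prompt is just an illustration.)

```python
# A minimal sketch of next-word prediction, the core operation the argument
# turns on. GPT-2 is used here only as a small, openly available stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The first person to walk on the Moon was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every token in the vocabulary

# Rank the model's top guesses for the *next* word only. Nothing in this step
# consults the world outside the training text -- it is purely a ranking of
# plausible continuations.
top = torch.topk(logits[0, -1], k=5)
for token_id, score in zip(top.indices, top.values):
    print(repr(tokenizer.decode(token_id)), round(float(score), 2))
```

The point of the sketch: at this level, the model only ranks plausible continuations of the text it was given, which is exactly why critics argue the design can't guarantee the continuation is true.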
This is a controversial view. Many in the field think that LLMs, by dint of their complexity and power, actually have come to have some kind of understanding of real stuff (though it's not clear how, or of what exactly). These people are optimistic that this architecture has a lot more progress in it -- in other words, that it will become more and more useful.
Ethan Mollick, a professor of management at Penn's Wharton School, is one such optimist. He's convinced AI is already better at many things than the average MBA student or high-end consultant. People who don't see that aren't doing AI correctly, he argues here. And, interestingly, he adds, this lack of skill is itself becoming a problem. If people are mistakenly shunning AI that can do good work, that's bad for the AI, and for them.
Again, you'd do well to read Mollick's post. He also has a book just out on this subject, if you want to dive deeper. But what struck me, as I read the post, was the gap between his take and my own experience.
For example, Mollick is a co-author of this paper, which I wrote about last year in The New York Times -- a study of more than 700 consultants at Boston Consulting Group. When it came to blue-sky new ideas for a product, that study found, consultants who simply pasted ChatGPT's output into their work did better than those who tweaked it according to their human taste. I, on the other hand, would not be caught dead cutting-and-pasting ChatGPT output into something with my name on it. Nor Claude's, neither.
What accounts for this difference?
I think the problem may be that often I'm asking LLMs to help find facts about the real world. 1 This is different from checking facts that I already know (which a LLM can be trained to do, Mollick writes).
Even top-line LLMs (I'm using the best versions available to civilians) have an uneasy relationship to reality. As in: I ask whether, in the last ten years, anyone has published a book that argues X, and the model confidently tells me that such a book, with the title Y, was published in 2016 by Z, the noted expert. But a check finds that no such book exists, and that Z, far from being a noted expert, is a self-published crank.
Or I ask which song by such and such a band has the line "a tear in your eye." The model identifies a song that doesn't contain this line, and then adds more lines that it says are in the song -- but that are not.
All this is OK, because it's checkable. And the results (right or wrong, and they're often right) will always be much more targeted to my query than a long list of links. A simple 2020-style Web search feels like getting out a map, while this LLM+search approach is more like using GPS to find that exact hardware store with the broken "d" in its sign. The fact that the GPS sometimes suggests that I drive into a wall doesn't make me stop using it. I just wouldn’t let it drive.
I don't think there's anything unique about my experience. In the BCG study I mentioned, when ChatGPT was put to a task that required judgment and analysis based on evidence (numbers and transcripts of interviews created for the study), it did worse than humans. Where it scored better than people was on a blue-sky brainstorming task (come up with a new idea for a product) -- that is, a chore where making stuff up is a feature, not a bug.
Still, as always with a digital tool, the thought nags: Maybe it's me. Maybe if I were more sophisticated and attuned, I could make this thing work better.
Maybe LLMs Are Fine. Maybe I’m the Problem
Which leads me to the other possibility I want to explore here: That I'm not using the LLMs with the boldness and imagination that I should. If that's true, what am I doing wrong?
Maybe I'm giving up too easily. As Mollick writes, "it is often very hard to know what these models can't do, because most people stop experimenting when an approach doesn't work."
Guilty. I seldom try to improve my prompts, usually because I am in a hurry.
Just now, I tried a different, detailed set of prompts for the song lyric question I mentioned before. They didn't produce any better results. After that, I didn't persist. Mollick writes, "trying to show that AI can do something [is] dependent on a mix of art, skill, and motivation." I may be lacking in one or more of those departments.
How Much Attention Does A Pencil Deserve?
How much art, skill and motivation should one be expected to supply, though, before the workday feels like it's slipping away? You know that experience of planning to work but finding yourself dicking around with updates, settings, or workarounds? Searching around in the user forums until you find the post that explains that because a system update broke a feature, you need to go into settings, authenticate, then click on some setting you never noticed (but don't forget to unclick it later!) and then restart the software, but not if this other app is running, which is invisible but here's a way to find it?
Nobody likes this part of digital life, which is as if you'd sat down to write, picked up a pencil, and spent 20 minutes sharpening and dusting off the eraser end and checking the lead and messaging Ticonderoga to make sure they haven't updated the firmware.
Sure, sometimes these dives into the gearbox yield a better understanding of the software and long-term improvement, but often they're a frustrating waste of time that yields only a just-for-now solution. If the quest for the right LLM prompts turns into another instance of fussing over the software pencil, it may not be worth it.
You might reply that it's a mistake to treat LLMs as software. That it's better to play along, and treat them like human assistants. That doesn't change the cost/benefit question here. If you prefer to imagine the LLM as a new employee, you still face the same problem: How much time and effort should you sink into helping Claude improve, before you think maybe it'd be easier to fire him?
Accepting That It’s All in Beta
Maybe, then, my problem is my definition of "worth it." Maybe, to use these things effectively, you have to be willing to give them time and attention way beyond what we think is normal for something that is supposedly helpful. If you're asking about reliability, you're asking too soon. And by doing so you're cutting yourself off from experiences that might teach you something.
François Candelon, a senior partner at BCG who leads its think tank, the Henderson Institute, compares the state of AI today to aviation in 1905. Flying then involved a few hundred feet at best. It'd have been foolish to arrive at Kitty Hawk with your bags, magazines and the expectation that you'd be flying to Paris.
"We are just at the beginning of it, finding what works, what doesn't work," Candelon, also a co-author of the BCG ChatGPT study, told me when we met earlier this month. "What you need to do is to try to get your hands dirty." Of course LLMs won't fit into work as you know it, he said. "Very often, people think first 'how can I use these technologies to do, in a more productive way, what I'm currently doing?' Instead of trying to imagine what it should be."
That sounds kind of risky (what if you really need to get to Paris for that meeting? An ocean liner would be the sure thing, here in 1905). But people who decide to avoid AI's mess and confusion will end up left behind by those who were open to experimenting, Candelon says.
The Revolution Will Be Permanent
He spoke of a "permanent revolution" mentality in enterprises, where LLMs and other AI change so fast that only people in a state of constant adaptation will succeed. Candelon, with a smile, attributed the idea of "permanent revolution" to Leon Trotsky, the Russian revolutionary. Perplexity.ai, via Claude, told me that in fact Karl Marx had used the phrase in 1850, though Trotsky developed the concept and was the first to use it in Russian. It seems to have gotten that mostly from this quite interesting article, which it cites. If I had not used the Perplexity+LLM approach, I might never have clicked on this particular link, because the same question put to DuckDuckGo and Google simply yields a list of equally valid-looking sources.
The LLM approach's benefit is that it saved me looking through a lot of links. Its cost is ... that it saved me looking through a lot of links. I've saved a half hour, but I know less about what is in those articles. And I've trusted algorithms that told me to focus on just five, one of which was this. This will do for now (after checking to make sure it says what I'm told it says, and that its names and dates align with those in other sources), but it won't always be the better solution.2
So: What Should You Be Doing?
So, where does that leave me, away from the swirling banners of revolution, on an ordinary Tuesday morning, with work to do?
First, I'll keep fumbling around with LLM tools, trying new tricks when I have time. I'll keep an eye on people who are creating ingenious new prompts -- Mollick has a whole site of them here, and Matt Beane, a brilliant researcher on human-AI interaction at UC Santa Barbara, has several specific suggestions for ChatGPT users on how to "program ChatGPT to program you." They are here.
Second, I will try to avoid the pinched just-get-it-done mindset that wants to settle on the one sure way to get the work done with a LLM. I will, in short, try to keep playing around -- within reason. Being passionate about AI has to have its limits. (As Lord Byron wrote to a friend, "there is no such thing as a life of passion any more than a continuous earthquake, or an eternal fever. Besides, who would ever shave themselves in such a state?")
Third, keeping in mind that this is really new technology, I'll try not to depend on analogies with other digital tools to understand it. That is, I don't expect it to improve relentlessly, just because chips have. I don't expect it to be seamless and easy to use, just because most consumer software is. I don't necessarily expect it to be good at math, just because computers and calculators are. This is different.
And I remain agnostic as to whether LLMs have hit a wall.
Another Formidable Humanoid
Boston Dynamics, maker of the iconic Spot "robot dog" and creator of amazing demos of somersaulting robots, today announced a new version of its Atlas robot. As you can see from the video, it is human-shaped and can move in human ways -- but it can also move in surreal, non-human ways.
"Atlas may resemble a human form factor, but we are equipping the robot to move in the most efficient way possible to complete a task, rather than being constrained by a human range of motion," Boston Dynamics' announcement says. Like Figure's Figure 1 (being tested by BMW) and Agility's Digit (being trialled by Amazon), BD's Atlas is also going to be tested in real workplaces soon -- including at least one factory owned by Hyundai (one of Boston Dynamics' investors).
I keep track of these announcements not only for the cool videos but also because they're a clear indicator that humanoid robots are coming. (Of course, saying they're coming is not a claim that they're staying. Will these types of robots be useful, affordable, desirable and better than alternative approaches? All open questions. But they sure are on their way.)
Details matter, so here's my LLM deal: I mostly use Perplexity.ai, which integrates its search engine with either GPT-4 Turbo or Claude Opus (my usual choices, though there are others I haven't experimented with much). Perplexity's answers via Claude come in the voice of a research assistant ("According to the search results...."). Sometimes I interact with GPT-4 or Claude Opus using another paid channel, Bearly.ai (if I want the LLM to play a role, as in "You're a high school sophomore who is interested in robots. How does this sound to you?"). If your eyes are glazing over, the key takeaway is this: I've been using GPT-4 Turbo and Claude Opus, which are the best versions of those models available. I've also tried a few gambits on Meta's Llama 3, released yesterday (it doesn't do search, and its prose is merely serviceable).
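(For the curious, here is a minimal sketch of what that "play a role" prompting looks like under the hood, written against OpenAI's Python client rather than Bearly.ai's interface; the model name, persona and draft text are illustrative assumptions, not a record of my actual prompts.)

```python
# A minimal sketch of role-play prompting: the persona goes in the system
# message, and the material to react to goes in the user message.
# Model name, persona and draft are illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

draft = "Humanoid robots like Atlas may soon work alongside people in factories."

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        # The system message sets the persona the model should adopt.
        {"role": "system",
         "content": "You're a high school sophomore who is interested in robots."},
        # The user message supplies the text and asks for a reaction in that voice.
        {"role": "user",
         "content": f"How does this sound to you?\n\n{draft}"},
    ],
)

print(response.choices[0].message.content)
```

Whatever channel you use to reach the model, the pattern is the same: the persona lives in the setup instruction, and the thing you want judged goes in the request itself.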
To be fair to Perplexity.ai, there is a "Co-pilot" mode in which I can ask it to provide way more detail and a lot more sources. When I use it, I am still depending on the algorithm's selection, but it's a much broader sample.
You are the expert on my shoulder. It is important to recall that Trotsky ended up killed with an ice axe. Human beings are not made to tolerate a state of continuous revolution, and to make that a condition of success is an exhausting way to approach what is valuable in living.