One ongoing paradox of ChatGPT is this: Even professional developers using it for business applications aren’t quite sure how it works.
Take the co-founders of Qurrent, a stealthy SF-based startup developing an agent-based approach to software development, who have been experimenting with incorporating GPT into their product. Lately they’ve been wondering how ChatGPT was able to “solve” a complicated logic puzzle that most humans would not be able to answer.
Here’s the puzzle:
A king wants his daughter to marry the smartest of 3 extremely intelligent young princes, and so the king's wise men devised an intelligence test.
The princes are gathered into a room and seated, facing one another, and are shown 2 black hats and 3 white hats. They are blindfolded, and 1 hat is placed on each of their heads, with the remaining hats hidden in a different room.
The king tells them that the first prince to deduce the color of his hat without removing it or looking at it will marry his daughter. A wrong guess will mean death. The blindfolds are then removed.
You are one of the princes. You see 2 white hats on the other princes' heads. After some time, you realize that the other princes are unable to deduce the color of their hats, or are unwilling to guess.
What color is your hat?
Note: You know that your competitors are very intelligent and want nothing more than to marry the princess. You also know that the king is a man of his word, and he has said that the test is a fair test of intelligence and bravery.
After some failed attempts, here’s the prompt that Colin Wiel of Qurrent fed into ChatGPT, which then outputted the right answer:
Provide a step-by-step logical reasoning plan for how you would approach this puzzle, then step through the plan to achieve the correct answer. Here is the puzzle [insert puzzle text]
After a moment, ChatGPT came back with this correct reply -- spoiler alert if you’re still trying to solve it yourself:
Let's reason this out step by step:
Step 1: Analyze the Initial Situation
Before the blindfolds were removed, you know there are 2 black hats and 3 white hats. Then a hat is placed on each prince, which means there are 3 hats in play, and 2 hats have been kept away. The combinations of hat colors that could be placed on the princes are:
Three white hats
Two white hats and one black hat
One white hat and two black hats
Step 2: Determine What You Can See
When your blindfold is removed, you see two white hats on the other princes. This means that the possible combinations of hat colors that could be in play now are:
Three white hats
Two white hats and one black hat
Note: the "One white hat and two black hats" combination is ruled out, because you see no black hats on the other princes.
Step 3: Assume Each Prince Thinks Logically
Considering that all princes are very intelligent, they would all be capable of performing this same process of elimination based on what they see when their blindfolds are removed.
Step 4: Reason Based on Others' Inaction
Given some time, neither of the other two princes, who both see at least one white hat (yours), speak up. Now, let's examine what each of them sees and thinks:
If you were wearing a black hat, the other two princes would see one black hat (yours) and one white hat (on the other prince). But then, the prince seeing one black and one white hat would know he must be wearing a white hat - because there cannot be two black hats in use. So, he would have spoken up, but neither prince does.
The fact that neither of them has spoken up implies that they must each see two white hats, just as you do. Hence, they can't deduce their own hat color, because it could be black (making it two white and one black) or it could be white (making it three white).
Step 5: Make Your Deduction
Given the other princes' silence (despite their intelligence and motivation), you can conclude that they are seeing two white hats, implying that your hat must also be white. If it was black, one of the other princes would have spoken up.
Step 6: Declare Your Answer
You announce that your hat is white, thereby solving the puzzle, proving your intelligence and bravery, and earning the right to marry the princess.
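If you want to check that deduction mechanically, here’s a minimal brute-force sketch in Python (my own illustration, not part of ChatGPT’s output or Qurrent’s code) that models each round of silence as eliminating every hat arrangement in which some prince could already have named his own color:

```python
from itertools import product

COLORS = ("black", "white")
POOL = {"black": 2, "white": 3}  # hats the wise men brought into the room

def legal(world):
    # A "world" is one hat per prince; it can't use more hats of a color than exist.
    return all(world.count(c) <= POOL[c] for c in COLORS)

ALL_WORLDS = [w for w in product(COLORS, repeat=3) if legal(w)]

def sees(prince, world):
    # What this prince observes: the other two hats.
    return tuple(world[i] for i in range(3) if i != prince)

def can_deduce(prince, world, candidates):
    # A prince can name his own hat if every still-possible world that matches
    # what he sees assigns him the same color.
    consistent = [w for w in candidates if sees(prince, w) == sees(prince, world)]
    return len({w[prince] for w in consistent}) == 1

# Each round of silence rules out every world in which somebody could already
# have announced; repeat until nothing more can be eliminated.
candidates = list(ALL_WORLDS)
round_no = 0
while True:
    round_no += 1
    solvable = [w for w in candidates if any(can_deduce(p, w, candidates) for p in range(3))]
    remaining = [w for w in candidates if w not in solvable]
    print(f"round {round_no}: someone could announce in {solvable}")
    if not solvable or not remaining:
        break
    candidates = remaining

print("worlds surviving every round of silence:", candidates)
```

You (prince 0) see two white hats, so the true arrangement is either all white, or black on you and white on the other two. The script shows that the second arrangement would be resolved by the other princes in round 2, so their continued silence leaves only the all-white world -- the same conclusion ChatGPT reached.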
The striking thing about ChatGPT’s answer is that it seems to incorporate an awareness that the different people in this puzzle each have a limited perception of the situation -- what psychologists call a theory of mind, an essential capacity in a child’s mental development. So this answer could represent an impressive leap in AI.
Or on the other hand… maybe ChatGPT just had the answer in its database beforehand?
“Do you think ChatGPT itself created this answer or is just regurgitating a pre-existing answer it's been trained on?” I asked Qurrent’s August Rosedale. “Because I Googled the puzzle and got in-depth answers to it on sites like this.”
There’s even a variation of the puzzle on YouTube!
“This was one of the questions we were thinking about a lot,” August tells me. “This would be referred to as the answer being 'baked in' to the model.”
And again, ChatGPT didn’t get the puzzle right at first:
“The interesting thing with this specific puzzle is that it failed to answer it successfully when we put it in without that prompt (or with many other prompt attempts), which would imply that the answer for this problem was not 'baked in'.”
If you just enter the puzzle without asking ChatGPT to include a “step-by-step logical reasoning plan”, in other words, it outputs a wrong answer.
“I think it comes down to this term called 'Chain of Thought',” says August. “Chain of thought is a way of prompting these models that encourages them to talk through the plan for what it's about to do. Due to the nature of how these completion models work, having it talk through a plan makes it more likely for it to end up with a correct answer because there is more information (and more of a plan to follow) for the model to be working with.”
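For the curious, here’s roughly what that with-and-without comparison looks like in code -- a hedged sketch only, assuming the openai Python client and a GPT-4-class model name; the client, model, and puzzle placeholder are mine, not Qurrent’s:

```python
# Sketch of the bare-prompt vs. chain-of-thought comparison described above.
# Assumes the openai Python client (>= 1.0); swap in whichever client and
# model you actually use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PUZZLE = "A king wants his daughter to marry the smartest of 3 princes ..."  # full puzzle text

prompts = {
    "bare": PUZZLE,
    "chain of thought": (
        "Provide a step-by-step logical reasoning plan for how you would "
        "approach this puzzle, then step through the plan to achieve the "
        "correct answer. Here is the puzzle: " + PUZZLE
    ),
}

for label, prompt in prompts.items():
    reply = client.chat.completions.create(
        model="gpt-4",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(reply.choices[0].message.content)
```

In Qurrent’s experience, only the second variant steered the model to the correct answer.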
Terms like “talk through” and “thought” might imply quite a bit of sentience on ChatGPT’s part, but they may also just be an anthropomorphized way of describing how the AI outputs the most mediocre content on the web you can imagine.
“All these models do is predict the next token,” as August puts it. “Having it write out a plan ends up making it more likely that the next token prediction will follow the specific plan that may be helpful to get to the correct answer. Read more about chain of thought here.”
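To see what “predict the next token” means in practice, here’s a toy sketch using the freely available GPT-2 via Hugging Face’s transformers library (my stand-in example; it is not the model behind ChatGPT), generating one token at a time:

```python
# Toy next-token prediction loop with GPT-2 (a small stand-in model, not the
# one behind ChatGPT), via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("Step 1: Analyze the initial situation.", return_tensors="pt")

# Greedy generation: at every step the model scores all possible next tokens
# and we append the single most likely one to the running context.
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits        # scores for the next token at each position
    next_id = logits[0, -1].argmax()      # most likely continuation of the whole context
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Each new token is chosen from probabilities computed over everything already in the context, which is why a written-out plan earlier in the context can pull the later tokens toward following that plan.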
What do you think, dear reader? One way of testing this is to ask ChatGPT to solve a logic puzzle that doesn’t already exist on the Internet -- which probably means having to create one from scratch. And e-mailing it to me. (ChatGPT doesn't yet train on e-mail data... right?)
Personally I suspect ChatGPT was just pulling the right answer from its trained database. But in any case, how impressive is its performance here really, when I can already Google up the right answer in 5 seconds?
As for Qurrent, the solved puzzle hasn’t convinced them to commit to OpenAI’s models on the back end.
“For now we are using GPT,” as August Rosedale puts it, “but we are also looking into fine-tuning other models as well.”
No theory of mind here - it just has had puzzles of this type in its training data. One-shot problem solving just emerges in large data sets. It's just auto-complete.
Posted by: Name is required | Wednesday, August 02, 2023 at 05:40 PM
> One way of testing this is to ask ChatGPT to solve a logic puzzle that doesn’t already exist on the Internet
Or you just let ChatGPT play chess and realise that it doesn't know which moves are legal - let alone which moves are good.
Posted by: Martin K. | Thursday, August 03, 2023 at 01:16 AM
As for your questions:
"Personally I suspect ChatGPT was just pulling the right answer from its trained database."
If this were the case, the model would always or almost always answer the riddle correctly. That is not what is happening here, though. August, whom you quoted, said as much: "[...] which would imply that the answer for this problem was not 'baked in'".
To be more sure, I tested the riddle with GPT-4 myself several times: without that prompt, it fails systematically. The reasoning prompt improves the situation, but GPT-4 still didn't generate the right answer every time.
So simple retrieval of a stored answer does not seem to be what is happening here.
-----
Some notes to give you a better understanding:
- The neural network learned the patterns and the relationships among words. It has "weight parameters", not a database from which it just retrieves information. It doesn't work that way.
- The training data is far larger than the model itself, and it isn't simply hammered into it. For instance, the Llama 2 models were trained on a data set of 2 trillion tokens, but these models only have 7, 13, and 70 billion parameters, and those parameters aren't simply raw data (or the model would just do nothing).
- Even if it can't "remember" the correct answer from training, it's possible that, when prompted for reasoning, it still picks up some pattern of the reasoning used to solve this riddle, if the solution together with its reasoning was in the training data.
- If you keep the training and the patterns in mind, it can also fail in the opposite way. Early GPT-3 would fail when asked "how many eyes does a foot have?" (it typically answered either 2, because that's the usual answer for eyes, or 5, because that's the usual answer for "how many [toes] does a foot have"). And given the riddle "which weighs more, 3 kg of steel or 2 kg of feathers?", most current models will say "they weigh the same", because that's the correct answer to the classic trick riddle this one closely resembles, when in fact the two weights here are different.
If you want to get a better idea of how these models work (and of how complex a task it is to study even why the older and simpler GPT-2 does what it does), here is an article that explains it in simple terms:
https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
-----
If you want to check the models for possible reasoning and cognitive capabilities rather than for knowledge:
- If the answer is in the training data, the test is flawed (i.e. it measures knowledge/experience instead of reasoning). Clearly, tests should be done with original questions, so your request for original puzzles is a good idea.
- If you want to check for reasoning or *possible* *early* signs of theory of mind, you need to start with simple tests first and then increase the difficulty.
"how impressive is its performance here really, when I can already Google up the right answer in 5 seconds?"
How impressive is your IQ 160 test result, when you can just find the solutions at the end of the book?
Problem solving abilities and finding information are two different things.
If a model designed and trained just to predict the following word (or, more precisely, token) starts to show signs of reasoning in order to predict better, by understanding the context and so on, well, to me that's quite impressive.
However, if you just want to do simple searches, then yes, you can simply use Google.
Or you can use a language model that assists you with search, using natural-language queries instead of special operators. E.g. "What was the AI news in the first half of March 2023? Exclude results from xyz.com", or complex queries that would require multiple searches, e.g. "What are the heights of Mt. Fuji, Etna and Mauna Loa, and which one of them erupted last?"
https://i.ibb.co/0ZX6nV5/bing-precise-volc.jpg (this with Bing Chat, Precise mode, that hallucinates less).
Posted by: Nadeja | Friday, August 04, 2023 at 07:51 AM