Following up on a reader's comment to my rant about ChatGPT totally botching my writing, I went with their suggestion to try Microsoft's Copilot instead. I asked the same question, "What is Wagner James Au's definition of the Metaverse?" and got this output:
Wagner James Au defines the Metaverse as a vast, immersive virtual world simultaneously accessible by millions of people through highly customizable avatars and powerful experience creation tools integrated with the offline world through its virtual economy and external technology. His book, “Making a Metaverse That Matters: From Snow Crash & Second Life to A Virtual World Worth Fighting For,” explores nascent metaverse platforms, interviews metaverse creators like Neal Stephenson, and aims to dispel myths and misconceptions around this interconnected universe of virtual worlds. In essence, the Metaverse represents a convergence of digital and physical experiences, where users can interact, create, and explore within a shared digital space.
Much better than ChatGPT! I also like how Copilot's output includes source web links for further reading, reinforcing its role as your assistant, and not the ultimate purveyor of truth.
Still, Copilot is decidedly not perfect, adding in bizarre inaccuracies:
I rarely write about the Metaverse being "a convergence of digital and physical experiences" (though others emphasize that angle much more), and I've devoted thousands of words explaining why the Metaverse should not be defined as an "interconnected universe of virtual worlds". Worst of all, putting Neal Stephenson in the "metaverse creators like" category is so profoundly, face-palmingly wrong that if I were teaching a class on the topic and a student wrote that in a paper, I'd deduct a whole grade or two.
So overall I still question the usefulness of LLMs beyond being a highly imperfect, unreliable assistant. Anyway, here's the comment from reader "N", who makes some good points and even shows how Copilot is pretty impressive when discussing Second Life-only content:
It never ceases to amaze me how some people persist with this bullshit, using an LLM as a substitute for Google or as a database for factual information. Especially when the LLM in question is an outdated model running in an old application that has neither web search capabilities nor RAG for grounding. By the way, Copilot Precise, as a search engine assistant, found your definition and quoted it verbatim, even providing a link to the source.
Mitch Wagner, by contrast, also provided a few examples of what (even) ChatGPT is actually useful for (and there are many more) and correctly said it's not reliable as an information source. That "ANNs/LLMs hallucinate" has been known forever. Also, an LLM with a few billion parameters can't physically hold a 15-trillion-token dataset. It's incorrect to label hallucinations as lies, though: "they lie" implies an intentional act of deception, whereas LLMs fill in the gaps with "hallucinated" information. Do you lie when your brain reconstructs what your blind spot can't see, based on the average surroundings?
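To put that scale gap in perspective, here's a quick back-of-the-envelope sketch in Python (the byte figures are rough assumptions of mine, just for illustration):

```python
# Rough, assumed numbers: an 8-billion-parameter model stored in fp16
# versus a 15-trillion-token training set at ~4 bytes of text per token.
params = 8e9
bytes_per_param = 2                      # fp16 weights
model_size_gb = params * bytes_per_param / 1e9

tokens = 15e12
bytes_per_token = 4                      # very rough average
dataset_size_gb = tokens * bytes_per_token / 1e9

print(f"model weights ≈ {model_size_gb:,.0f} GB")
print(f"training text ≈ {dataset_size_gb:,.0f} GB")
print(f"ratio ≈ {dataset_size_gb / model_size_gb:,.0f}:1")
```

The weights come out thousands of times smaller than the text they were trained on, so verbatim recall of every fact simply isn't possible.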
Also, I find it interesting that Mitch Wagner said 'ChatGPT was faster' than getting information from Google. Sometimes, when searching for technical or obscure information, you have to rummage through numerous search results and websites. GPT-4, which powers Copilot Precise, might already have a decent idea of what the solution is, maybe an incomplete one, but it knows what to search for; it then uses its search tool and finds it, or at least puts you on the right track. You can also ask for something more complex than trivial plain searches. Copilot and Perplexity have saved me quite a bit of time, multiple times.
You might be interested in an example of multi-search for Second Life: "What are the most trendy Second Life female mesh bodies as of 2024? Please list them and for each of them search for their respective inworld store in secondlife.com Destinations." In a few seconds, Copilot Precise correctly listed Maitreya Lara and LaraX, eBody Reborn, and Meshbody Legacy, found their stores in Destinations, and linked them. You can also ask it to make a comparative table. It's not always perfect, but it's pretty handy.
I won't take ChatGPT and its old GPT-3.5 as the state of the art, let alone use it as a benchmark to evaluate current advancements in AI research. Today, even compact models like Phi-3 and Llama 3 8B, which can run locally on a laptop or a high-end smartphone, are nearly on par with GPT-3.5 in many tasks. In the meantime, while GPT-4 continues to be helpful for coding, models like Gemini 1.5 Pro and Claude 3 Opus seem better than GPT-4 at generating (and writing) fiction ideas or brainstorming. Gemini has a huge context window and can write a critique of a novel. Llama 3 400B is about to be released, and it looks like OpenAI is releasing a new model this year.
And it keeps improving. I don't think they will be on par with a senior software engineer or a professional writer or poet until we develop an actual strong AI or AGI (or at least good introspection; the labs are working on planning, at least), but conversely I won't say they are useless either, or focus only on the negative.
Of course the GPT-3.5 powering ChatGPT hasn't been updated: OpenAI obviously wouldn't waste millions retraining that older model, much less for minor facts that aren't repeated enough times in the training dataset and can't fit inside its limited number of parameters anyway, let alone for the sake of the wrong use in the wrong application... when their newer models (or even GPT-3.5 itself) are already doing that job in better-suited applications such as Perplexity or Copilot.
I'd quibble that "LLMs hallucinate" is less accurate than saying "LLMs lie", since a hallucination implies there's already a base stable awareness where none actually exists. Then again, pretty much any verb we use related to AI (thinks, decides, chooses, etc.) implies some level of human-type sentience that's simply not there at all.
Microsoft Copilot is just GPT-4 Turbo, but with the added ability to access data you have given Microsoft. I suspect when you tried your original query, you posted it to ChatGPT 3.5, which is the version OpenAI makes available to free users on its website.
Posted by: Aleena | Monday, April 29, 2024 at 06:40 PM
AI text generation should not be expected to give factual information. It's understandable why people continue to expect this (as that is how it is marketed) but that is never going to change reality. Please try to do your best to combat this pervasive misunderstanding in expectations.
Posted by: Adeon Writer | Tuesday, April 30, 2024 at 04:43 AM
>> human-type sentience that's simply not there at all.
exactly.
n explained really well that using an LLM as a substitute for Google or as a database for factual information is misunderstanding what the thing is doing. As Adeon reiterates, AI text generation should not be expected to give factual information.
I really recommend this article: https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35. It's long, but it does explain how the whole thing "is a dream". Also read the first response.
Posted by: Viki | Tuesday, April 30, 2024 at 07:10 AM
Thank you for listening and giving it a try, and I'm glad that you see that it's better! However, I recommended using Copilot in Precise mode specifically because it's designed to be more accurate. In the screenshot you shared, it appears you used the default mode (Balanced), instead.
You can click this (see the linked image) to switch to Precise mode and give it another go:
https://i.postimg.cc/7hnLM3kK/precise.png
Here is what it returned for me when I used Precise mode:
https://i.postimg.cc/9Xdzhh7y/precise-response.png
As you can see, the results in Precise mode are even better, and it didn't make any mistakes. In the few cases where it does, you can still check the sources. This is indeed a much better approach than ChatGPT. Precise mode significantly mitigates "hallucinations", not only by using web search to stay grounded (it tries to determine the most reliable among the available sources), but also by keeping the response more concise; moreover, it likely has the hyperparameter "temperature" set to 0, which limits the randomness of the output.
Hallucinations can be useful for creativity, so Copilot Creative's temperature is set high, which can be fun for brainstorming or suggesting ideas. Balanced mode, like most general-purpose applications, has a moderate temperature setting (it is also powered by a less capable model). Precise mode, on the other hand, likely has its temperature set low or to zero for maximum accuracy. With local models, you have more flexibility to set the temperature as you please.
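If you want to see what that knob actually does, here is a minimal sketch using the OpenAI Python client (the model name and prompt are just placeholders; Copilot's real configuration isn't public):

```python
# Compare the same prompt at a low and a high temperature.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
prompt = "What is Wagner James Au's definition of the Metaverse?"

for temperature in (0.0, 1.0):
    response = client.chat.completions.create(
        model="gpt-4-turbo",      # placeholder; use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0 = near-deterministic, higher = more varied output
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```

At 0 you get nearly the same answer every run; at 1.0 the wording (and sometimes the facts) drift a lot more.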
Posted by: n | Thursday, May 02, 2024 at 09:29 PM
As for the "hallucinations", that's how they are called. It's "lie" that may rather imply (or make think the listener of) awareness.
You wrote: «I'd quibble that "LLMs hallucinate" is less accurate than saying "LLMs lie", since a hallucination implies there's already a base stable awareness where none actually exists.»
But I think it's the other way around.
Lies are typically told with the purpose of deliberately deceiving or misleading someone. This is not the case with LLMs, which generate output based on patterns they have learned during training. The output may contain something that is inaccurate or nonexistent; in other words, they "hallucinate".
The term "hallucination" has been used in machine learning and AI research for decades, and it does not imply any form of awareness:
https://www.ibm.com/topics/ai-hallucinations
(Also, setting the metaphor aside, people often aren't aware they are hallucinating, e.g. with schizophrenia or dementia.)
At most, there are also researchers who prefer to call them "confabulations".
Posted by: n | Thursday, May 02, 2024 at 09:35 PM
As for the expectations, yeah, as I've repeatedly said, LLMs are not databases to retrieve factual information from. To expect that is a misunderstanding of how they work. Conversely, they aren't useless just because they don't meet such expectations or aren't a sci-fi AI (not that I'm saying you claimed that).
You said: «So overall I still question the usefulness of LLMs beyond being a highly imperfect, unreliable assistant», when you were looking for factual information and accurate responses (and you used Balanced). Copilot's Creative mode would have been even worse than Balanced for that task, as it tends to generate fictional facts. This is not because it's flawed, but because it's not designed for such tasks: the Creative mode is intended for generating imaginative content, hence the name "creative". That task was better suited to Copilot Precise: the language model (GPT-4) processes your natural language input, calls the search engine, and quotes the results as precisely as possible. It's also a good thing (and I appreciate it a lot) that it provides links to the sources, so you can verify them.
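In case it helps, here's a toy sketch of that retrieve-then-answer pattern (not how Copilot is actually built; the "retrieved" snippets and URLs below are hard-coded placeholders standing in for a real web-search step):

```python
# Minimal illustration of a search-grounded answer: feed the model the
# retrieved sources and ask it to answer only from them, with citations.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
question = "What is Wagner James Au's definition of the Metaverse?"

# Placeholder results; a real assistant would get these from a search engine.
retrieved = [
    {"url": "https://example.com/source-1", "text": "snippet from source 1"},
    {"url": "https://example.com/source-2", "text": "snippet from source 2"},
]

context = "\n\n".join(
    f"[{i}] {doc['url']}\n{doc['text']}" for i, doc in enumerate(retrieved, start=1)
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    temperature=0,        # "Precise"-style setting
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using only the numbered sources below, citing them like [1]. "
                "If they don't contain the answer, say so.\n\n" + context
            ),
        },
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

The grounding (and the source links you can check) is what makes this setup so much more reliable than asking a bare chatbot to answer from memory.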
However, it's clear that there's a demand for assistants that provide factual information. LLMs, trained on enormous datasets, end up with a vast but imperfect knowledge base. Would you consider an expert useless simply because they can't recall every detail flawlessly? Obviously, you wouldn't rely entirely on memory, but rather consult data and sources. Similarly, there are various ways in which LLMs can do a better job at this task. Even so, again, they're not always perfect, but they're not so terribly imperfect and unreliable either. And they can do many other things too.
Posted by: n | Thursday, May 02, 2024 at 09:52 PM