Michael,

The examples you cite illustrate the strengths and weaknesses of LLMs.  They show why multiple methods of evaluation are necessary. 

1. The failures mentioned in paragraph 1 show that writing a program requires somebody or something that can understand a problem statement and generate a sequence of commands (in some detailed notation) to specify a method for solving that problem.  LLMs can't do that.

2. The second paragraph shows that ChatGPT had a better selection of answers available in May, or perhaps an improvement in its ability to find answers.  It's possible that very few dental clinicians had ever used ChatGPT for that purpose.  Your experiment and the work by the dental clinicians in India may have added enough new patterns that dental clinicians worldwide would have benefited.

3. The third paragraph shows how ChatGPT learns how to do what it does best:  translate from one notation to another.  Since you did all the problem analysis to generate Python with miscellaneous errors, it learned how to translate your personal dialect of Python to the official Python syntax.  That is an excellent example of LLMs at their best.  It was learning how to translate, not learning how to understand.

4.  I would say that there is a major difference.  Wikipedia is not improved by any method of learning (by humans or machines).  Instead, some articles are excellent products of collaboration by experts on the subject matter.  But other articles were written hastily by people who don't have the expertise or the patience to do thorough research on the topic.  The Wikipedia editors usually mark the articles that require further attention.  But many articles fall between the cracks -- nobody knows whether they are accurate or not.

John
 


From: "Michael DeBellis" <mdebellissf@gmail.com>

[Paragraph 1]  I agree. I've asked ChatGPT and Copilot for SPARQL queries, nothing extremely complicated: either things I thought I would ask for rather than going back to the documentation, or in some cases queries to get DBpedia or Wikidata info, because I find the way they structure data to be not very intuitive, and it takes me forever to figure out how to find things like all the major cities in India (if anyone knows some good documentation on the DBpedia or Wikidata models, please drop a note). I think part of the problem is that people see what looks like well-formatted code and assume it actually works. None of the SPARQL queries I've ever gotten back worked.
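For the "major cities in India" case, a query along the following lines is a sketch of what I had in mind, written against the Wikidata endpoint (it assumes the standard Wikidata identifiers -- P31 "instance of", P279 "subclass of", P17 "country", P1082 "population", Q515 "city", Q668 "India" -- and an arbitrary population cutoff of one million):

```sparql
# Sketch: major cities in India from Wikidata (population threshold is illustrative)
SELECT ?city ?cityLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;   # instance of (some subclass of) city
        wdt:P17 wd:Q668 ;             # located in country: India
        wdt:P1082 ?population .       # has a recorded population
  FILTER(?population > 1000000)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?population)
```

The non-obvious part, and the part the chatbots kept getting wrong for me, is the property path `wdt:P31/wdt:P279*`: many Indian cities are tagged as instances of subclasses of "city" (e.g. "million city"), so matching `wdt:P31 wd:Q515` directly misses them.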

[2]  We did an experiment in February this year with dental clinicians in India where we gave them a bunch of questions and had them use ChatGPT to get answers. They rated the answers very highly, even though almost all of the answers were incomplete, out of date, or had minor or major errors. On the other hand, when I ran the same questions through ChatGPT in May (in both cases I used 3.5), the results were radically different: almost all the answers were spot on.

[3]  And for coding, I have to say I find the AI support in PyCharm (my Python IDE) to be a great time saver. Most of the time now I never finish typing: the AI figures out what I'm doing by recognizing patterns, puts the suggested completion in grey, and all I do is hit tab. It's also interesting how it learned. My code is fairly atypical Python, because it involves manipulating knowledge graphs, and at first I was getting mostly worthless suggestions. But after a few days it figured out the patterns for reading and writing to the graph, and it has been an incredible benefit. I like it for the same reason I always copy and paste names whenever I can rather than typing them: it drastically cuts down on typing errors.

[4] All this reminds me of the debates people had about Wikipedia. Some people thought it was worthless because you can always find some example of vandalism where there is garbage in an article, and other people think it is the greatest thing on the Internet. The answer is somewhere in the middle. Wikipedia is incredibly useful, and it is an amazing example of how people can collaborate just to contribute their knowledge; the way people collaborate on that site is so different from most of the Internet. But you should never use it as a primary source. Always check the references. That's the way I feel about Generative AI. Like Wikipedia, I think it is a great resource, in spite of the fact that some people claim it can do much more than it really can, and that it can still be wrong. It's just another tool and, if used properly, an incredibly useful one.

Michael
https://www.michaeldebellis.com/blog

On Saturday, July 27, 2024 at 4:41:24 PM UTC-7 John F Sowa wrote:
Another of the many reasons why Generative AI requires other methods -- such as the 70 years of AI and computer science -- to test, evaluate, and correct anything and everything that it "generates".

As the explanation below says, it does not "UNDERSTAND" what it is doing.  It just finds and reproduces patterns that it finds in its huge volume of data.  Giving it more data gives it more patterns to choose from.  But it does nothing to help it understand any of them.

This method enables it to surpass human abilities on IQ tests, law exams, medical exams, etc. -- for the simple reason that the answers to those exams can be found somewhere on the WWW.  In other words, Generative AI does a superb job of CHEATING on exams.  But it is hopelessly clueless in solving problems whose solution depends on understanding the structure and the goal of the problem.

For similar reasons, the article mentions that self-driving cars fail in complex environments, such as busy streets in city traffic.  The number and kinds of situations are far more varied and complex than anything they have been trained on.  Carnegie Mellon University is involved in more testing of self-driving cars because Pittsburgh has the most complex and varied patterns.  It has more bridges than any other city in the world.  It also has three major rivers, many hills and valleys, steep winding roads, complex intersections, tunnels, foot traffic, and combinations of any or all of the above.

Drivers who test self-driving cars in Pittsburgh say that they can't go for twenty minutes without having to grab the steering wheel to prevent an accident.  (By the way, I learned to drive in Pittsburgh.  Then I went to MIT and Harvard, where the Boston patterns are based on 300-year-old cow paths.)

John