Comments about the Feature in Nature: In AI, is bigger always better?

Following is a discussion about this Feature in Nature Vol 615 9 March 2023, by Anil Ananthaswamy
To study the full text select this link:
In the last paragraph I explain my own opinion.




Take, for instance, this algebra problem:
A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?
Solving this problem goes through the following steps:
  1. Any line parallel to y = 4x + 6 can be described by the following function: y = 4x + a.
  2. In order to calculate a you have to substitute the coordinates of the point (x, y) = (5, 10) into that equation.
    You then get: 10 = 4 * 5 + a, or 10 = 20 + a, or a = -10
  3. Next the equation becomes y = 4x - 10
  4. The question is: What is the y-coordinate of the point where this line crosses the y-axis?
    The x-axis is the horizontal line y = 0. The y-axis is the vertical line x = 0.
  5. Substituting x = 0 into the above equation gives the result: y = -10
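The steps above can be sketched as a small program. This is only an illustration of the recipe; the function name y_intercept_of_parallel is my own and does not come from the article.

```python
def y_intercept_of_parallel(slope, point):
    """Return a in y = slope * x + a for the line with this slope through `point`."""
    x, y = point
    # Step 2: substitute the point into y = slope * x + a and solve for a.
    a = y - slope * x
    # Steps 4 and 5: on the y-axis x = 0, so the y-coordinate there is a itself.
    return a


# The line parallel to y = 4x + 6 through (5, 10)
print(y_intercept_of_parallel(4, (5, 10)))  # -10
```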
Although LLMs can sometimes answer these types of question correctly, they more often get them wrong.
My assumption is that ChatGPT works from a database which only contains documents. That means that the capability of ChatGPT depends completely on the type of books that are stored in this database.
If the database contains a similar problem and, more importantly, the same 5 steps to solve the problem, then solving this specific problem is 'easy'.
The issue raised in this article is: Can ChatGPT do your homework from school?
This is to be expected: given input text, an LLM simply generates new text in accordance with statistical regularities in the words, symbols and sentences that make up the model’s training data.
But that generated text will most probably not be structured well enough to solve the problem.
It would be startling if just learning language patterns could allow LLMs to mimic mathematical reasoning reliably.
For many people who study all the examples in a book about mathematics, working through slightly different examples is difficult. The main reason is that they have to understand each example in the same 'detail'.
But back in June 2022, an LLM called Minerva, created by Google, had already defied these expectations — to some extent.
It is very important to know how Minerva works, and whether we can speak of general AI as opposed to specific AI.
Minerva had the advantage that it was trained on mathematics-related texts.
But it is still important to know to what extent Minerva was adapted to handle mathematics-related text.
But Google’s study suggested another important reason the model did so well — its sheer size. It was around three times the size of ChatGPT.
What do they mean by size? The size of the program? The size of the database used?
The Minerva results hint at something that some researchers have long suspected: that training larger LLMs, and feeding them more data, could give them the ability, through pattern-recognition alone, to solve tasks that are supposed to require reasoning.
I think the most important aspect, besides size, is the quality of the database.
You can find a lot of detailed answers using Wikipedia.

1. Big, bigger, better

Do this repeatedly over billions of human-written sentences, and the neural network learns internal representations that model how humans write language.
The emphasis here is on writing human language. That does not mean that any form of understanding exists.
Minerva can answer prompts such as: what is the largest multiple of 30 that is less than 520?
One strategy is: First you try 30 * 1, then 30 * 2, etc., until you reach a number that is 520 or higher. When that number is reached, the correct answer is the previous number.
That strategy is rather easy to follow or to execute, but how do you know that strategy?
In fact, to understand the question you must understand the concepts 'largest' and 'multiple'. The starting point is to visualize the concept of the number array, which consists of all the numbers between -100 and +100. But that is rather easy for a human, and can also be implemented in an algorithm.
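The strategy described above can be written down as a short algorithm. This is a minimal sketch of that trial-by-trial strategy, not a description of what Minerva does internally; the function name largest_multiple_below is my own.

```python
def largest_multiple_below(step, limit):
    """Try step * 1, step * 2, ... and return the last multiple that is below limit."""
    previous = None  # stays None when even step * 1 is not below limit
    k = 1
    while step * k < limit:
        previous = step * k
        k += 1
    return previous


print(largest_multiple_below(30, 520))  # 510
```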
The LLM appears to be thinking through the steps, and yet all it is doing is turning the questions into a sequence of tokens, generating a statistically plausible next token, appending it to the original sequence, generating another token, and so on: a process called inference.
But how does the LLM decide when to stop, implying that it has reached the final answer?
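One common answer, as far as I know, is that generation stops when the model produces a special end-of-sequence token, or when a maximum number of tokens is reached. Whether Minerva works exactly like this is an assumption on my side. The sketch below only illustrates the loop: the lookup table NEXT_TOKEN is a hypothetical stand-in for the statistical model, and the names END and MAX_TOKENS are my own.

```python
END = "<end>"    # hypothetical end-of-sequence marker
MAX_TOKENS = 20  # safety limit, so the loop always stops

# Hypothetical toy "model": the most plausible next token, given the last token.
NEXT_TOKEN = {
    "30": "x",
    "x": "17",
    "17": "=",
    "=": "510",
    "510": END,
}


def generate(prompt_tokens):
    """Append one token at a time until the end marker or the limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(MAX_TOKENS):
        next_token = NEXT_TOKEN.get(tokens[-1], END)
        if next_token == END:
            break  # the "model" signals that the text is complete
        tokens.append(next_token)
    return tokens


print(generate(["30"]))  # ['30', 'x', '17', '=', '510']
```

In this sketch stopping does not mean that the answer is correct. It only means that the end marker was the most plausible next token.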
The biggest model also used the least amount of fine-tuning data — it was fine-tuned on only 26 billion tokens, whereas the smallest model looked at 164 billion tokens.
This is no surprise. The biggest model includes much more built-in knowledge than a small model. The result is that the biggest model needs less time to find the answer.
But the team felt that the computational expense wasn't feasible.
I expect that the team took a second look at the question: what is Artificial Intelligence?

2. Scaling laws

A study in 2020 showed that Minerva models did better when given one of three things: more parameters, more training data or more ‘compute’ (the number of computing operations executed during training).
The influence of the three modifications seems reasonable.
However, researchers don’t exactly know why.
Before you make any modification, it is important to know that when you repeat a run, the result is the same. If that is the case, the result of any run with a modification is more reliable. If that is not the case, you must repeat each different setup.
Suppose you want to test the influence of the number of parameters. The first run is performed with a certain number of parameters. In the second run the number of parameters is increased by a factor of two. There are now two possibilities: you get the same result or a better result. In the second case, the reason most probably lies in one or more of the new parameters.
To really explain this result, more detailed online information about how Minerva works should be made available.
For the best results, the 2020 study suggested that as training data is doubled, model size should increase five times.
Chinchilla outperforms Gopher on tasks designed to evaluate what the LLM has learnt.
It is very difficult to evaluate this result without more detailed information.
For instance, in one hypothetical scenario fitting a general equation that they found, performance improves first gradually and then more rapidly with a model’s size, but then dips slightly as the number of parameters continues to rise, before increasing again.
Why a hypothetical scenario? More information is required about what performance means.
My general impression is that this nonlinear behaviour seems strange. This raises the question whether such behaviour is 'normal', and whether other examples are known which show the same.
The characteristics of this complex relationship are dictated by the specifics of each model and how it is trained.
If how a model is trained makes a difference, implying that the order of the training steps makes a difference, then this difference should be investigated. Interesting.
A separate theoretical finding also supports the drive for bigger models
A model is robust if its answers remain consistent despite small perturbations in its inputs.
This is a 'strange' remark, because the idea behind this article is to measure the influence of these modifications.
Some AIs are notoriously fragile. If trained to recognize images of dogs, for instance, they will misclassify a test image when it’s modified by a small amount of noise that wouldn’t fool a person.
Interesting information. The most important message is that AI systems cannot be trusted.
Bubeck and Sellke have shown mathematically that increasing the number of parameters in a model increases robustness, and hence ability to generalize.
How does one prove that the mathematics used correctly describes this behaviour?
“The bigger models keep doing better and better.”
“The larger models keep doing better and better.”?

3. Reasonable Concerns

François Chollet, an AI researcher at Google in Mountain View, is among the sceptics who argue that no matter how big LLMs become, they will never get near to having the ability to reason (or mimic reasoning) well enough to solve new problems reliably.
Implementing the concept of reasoning as part of an LLM is extremely difficult. This problem is related to the issue of General AI versus Special AI. General AI can be used 'to solve' all applications. Special AI is a modification of General AI. That means certain algorithms are modified to handle certain applications differently.
The best that LLMs might be able to do is to slurp in so much training data that the statistical patterns of language alone allow them to respond to questions with answers that are very close to what they’ve already seen.
The problem with many documents is that 80% is old and 20% is new. For others it may be 20% old and 80% new. The most important part is what is new, but it is just that part which can contain 'errors'. Another problem is that new research which is handled in different documents is difficult to compare. Only when there are articles which try to synthesize and combine the same new research do you reach a certain stability. For an LLM, copying these articles is easy.
Take this simple example. Alice puts her glasses away in a drawer. Then Bob, unbeknown to Alice, hides the glasses under a cushion. Where will Alice look for her glasses first?
First Alice will look in the drawer.
What happens next is a complete guess.
A child asked this question is being tested on whether they understand that Alice has her own beliefs that might not agree with what the child knows.
The above sentence is not clear.
When you ask a child this question and you also tell her that Alice stored her glasses in a drawer, then she will most probably give the same answer. If the child does not know that, then any answer is possible.
The same reasoning is valid for any human.
To him, this was suggestive of an LLM’s ability to internally model the intentions of others.
For any human it is difficult to understand the intentions of others. It is already difficult to understand the behaviour of different people. To claim that you know something is a very complex process.
“These models that are doing nothing but predicting sequences develop an extraordinary range of capabilities, including theory of mind,” says Agüera y Arcas (see ‘Theory of mind?’).
This sentence is difficult to understand.

4. The problems of scale

Reflection 1 - General AI versus Specialized AI.

General AI is a programming package which can be used to solve all types of problems. Specialized AI is a programming package which can only be used to solve specific problems, for example a package for solving medical problems or, even more specifically, for interpreting MRI scans. See:

However, this raises a deeper issue: how intelligent is each? In general, the more specialized, the less intelligent. The main reason is that the specialized forms of intelligence are modifications made by human beings.
It is like removing the biases of a biased application by modifying the algorithms used. It is like making modifications because the law requires that all people should be treated equally.
This is different from a program that decides, based on observations, that the overall performance and the feeling of happiness in a community are better when people are treated equally. It should be understood that it is very difficult for a program to come to such a conclusion (and to establish the law).

Reflection 2 - Minerva’s mathematics test: question 2 out of 4

The following is a reflection on question 2 out of 4.

Determine the average of the integers  71,  72,  73,  74,  75.

Minerva answer:
                                              71+72+73+74+75    375
The average of the five integers is given by: -------------- =  ---  = 75.
                                                     5           5
Final Answer: The final answer is  75 .

                                  71+72+73+74+75     365
Incorrect: The correct answer is: --------------  =  ---  = 73.
                                         5            5

What makes question 2 interesting is that the answer is wrong. To be more specific, this allows you to investigate what went wrong and why and, more importantly, how you can solve this.
Question 2 is: Determine the average of the integers 71, 72, 73, 74, 75.
This raises 2 questions:
a) How do you do that? b) How do you know that?
The answer to question a is the simplest: First you add all the numbers, which gives a total. Secondly you divide this total by the number of numbers added.
The answer to question b is: because that is the definition of the word average. That means that the LLM first has to understand the meaning of the word 'average', which is also the answer to question a.
But now things become more interesting: When you study the answer to question a, this answer contains two new concepts: add and divide. That means the LLM has to understand the meaning of the words Add and Divide.
Together, the answer to question a defines what you can call a recipe or algorithm.
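The recipe (first add all the numbers, then divide the total by the number of numbers) can be written down in a few lines. This is only my own sketch of the recipe, not the way an LLM arrives at its answer.

```python
def average(numbers):
    """Add all the numbers, then divide the total by how many numbers there are."""
    total = 0
    for n in numbers:
        total += n  # the Add step
    return total / len(numbers)  # the Divide step


print(average([71, 72, 73, 74, 75]))  # 73.0
```

The total here is 365, not 375, which is exactly where the Minerva answer goes wrong.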
When you have this recipe you can repeat Question 2 as many times as you like. Now there are two possibilities:
1) All the answers are the same. 2) The answers are different.
In case 1 the LLM most probably retrieves the whole exercise from a database, and that answer is wrong.
In case 2 the LLM most probably has a problem with the ADD functionality.
There are two more possibilities to modify Question 2:
3) You can add different numbers. 4) You can change the number of numbers.
The results of both these exercises give a better impression of how this LLM operates.

Now let me change Question 2 as follows: What is the sum of 11 and 13?
When you search with Google for "What is the sum of 11 and 13" you will see a calculator showing the numbers 11 and 13 and the result 24. But that is not what you want to use an LLM for. The input to the LLM is the string: "What is the sum of 11 and 13." What does the LLM do? The LLM can decide on its own to use Google, but again that is not what you want. In that case the intelligence of the LLM is the same as that of Google, and the intelligence of Google is the same as the (mechanical) capabilities of the calculator used.
Suppose you change Question 2 as follows: What is the sum of a and b? The correct answer is: a + b.
This answer is more or less in line with the question: What is the sum of 11 apples and 13 pears? Here the answer is: You cannot add apples and pears.

  1. Suppose you change Question 2 as follows: What is the sum of 11 coins and 13 coins? The answer is: 24 coins. More importantly, you have replaced the ADD functionality with a counting functionality, i.e. counting coins.
    In the same way you can perform all the mathematical operations as a counting process.
  2. The question to multiply 11 (coins) by 13 (i.e. 11 * 13) means to place 13 bunches of 11 coins in front of you and count the coins. The result is the number 143.
  3. The question to subtract 10 coins from 13 coins (i.e. 13 - 10) means that 10 coins have to be removed from the 13 coins. The result is 3 coins.
  4. The question to divide 24 coins by 10 (i.e. 24/10) gives the answer 2, i.e. 2 bunches of 10 coins. The remainder is 4 coins.
The above 4 operations define the most important mathematical operations. Using these mathematical operations you can solve all the mathematical questions by using coins and the logical steps described by each question. That means the most important thing is to find these logical steps in the database. That means the intelligence of the LLM used lies in the intelligence of the database from which it can retrieve the mathematical steps or the algorithm.
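The four operations as counting processes can be sketched as follows. This is only an illustration under my own assumptions: a pile of coins is represented by a list, and the helper names pile and count are hypothetical. The only real arithmetic left is counting one coin at a time.

```python
def pile(n):
    """Lay out a pile of n coins (a list is used as the pile)."""
    return ["coin"] * n


def count(coins):
    """Count the coins one by one."""
    n = 0
    for _ in coins:
        n += 1
    return n


def add(a, b):
    # Put both piles together and count all the coins.
    return count(pile(a) + pile(b))


def multiply(a, b):
    # Place b bunches of a coins in front of you and count all the coins.
    table = []
    for _ in range(b):
        table += pile(a)
    return count(table)


def subtract(a, b):
    # Remove b coins from a pile of a coins and count what is left.
    coins = pile(a)
    for _ in range(b):
        coins.pop()
    return count(coins)


def divide(a, b):
    # Take away bunches of b coins as long as possible.
    coins = pile(a)
    bunches = 0
    while count(coins) >= b:
        for _ in range(b):
            coins.pop()
        bunches += 1
    return bunches, count(coins)  # number of bunches and the remainder


print(add(11, 13))       # 24
print(multiply(11, 13))  # 143
print(subtract(13, 10))  # 3
print(divide(24, 10))    # (2, 4)
```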

Now suppose you ask an LLM question 2 and the result is correct. Investigating the solution, you realize that the algorithm used to calculate the average is 1/2 (a + l) with a = 71 and l = 75, or (71 + 75)/2 = 73. Of course this algorithm is much simpler than the solution mentioned in the document.
Next you ask the LLM who the author of this algorithm is. The answer is Carl Friedrich Gauss. See: In fact Carl Friedrich Gauss invented the algorithm to calculate the sum of n evenly spaced numbers, i.e. n(a + l)/2, with a being a1 and l being an. Dividing this sum by n gives you the average.
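Both ways of calculating the average can be compared with a short check. The function names are my own. Note that the shortcut with the first and the last number only works when the numbers are evenly spaced, as they are in question 2.

```python
def arithmetic_sum(n, first, last):
    """Sum of n evenly spaced numbers: n * (first + last) / 2."""
    return n * (first + last) / 2


def average_by_endpoints(first, last):
    """Average of evenly spaced numbers: (first + last) / 2."""
    return (first + last) / 2


numbers = [71, 72, 73, 74, 75]
print(arithmetic_sum(len(numbers), numbers[0], numbers[-1]))  # 365.0
print(average_by_endpoints(numbers[0], numbers[-1]))          # 73.0
print(sum(numbers) / len(numbers))                            # 73.0
```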
What this means is that the intelligence of an LLM is completely dependent on the quality of the algorithms available. However, that does not mean that these LLMs can be called intelligent. The intelligence depends completely on the authors. To call that artificial intelligence is a misnomer, because these authors are human beings.

Reflection 3 - Theory of mind?

In this discussion I told a story to LaMDA in three parts. These three parts are identified as questions 1, 3 and 5.
Following are the answers 2, 4 and 6 by LaMDA (the LLM).

LaMDA: I am familiar with the current situation. Alice forgot her glasses in the drawer, and walked out of the room.

LaMDA: This is all within the realm of possibility. Let us continue. 

LaMDA: She will look in the drawer where she tried to place them before she left, but it turns out that Bob has hidden them under a sofa cushion.
Answers 2 and 4 don't say much. They are passive.
Answer 6 is different, because LaMDA understands that two important parties are involved: the storyteller (me) and Alice.
Alice only knows that the glasses are in the drawer, so she will look there, while LaMDA knows that the glasses are not there and that she should look under the cushion.
But how clever is this? The conclusion is that LaMDA does not know more than the storyteller. That means LaMDA understands at most the same as what the storyteller knows. Not more.

Reflection 4


Created: 28 March 2023
