Sparks of Artificial General Intelligence: Early experiments with GPT-4 - by Sébastien Bubeck et al., Microsoft Research - Article review

This document contains a review of the article "Sparks of Artificial General Intelligence: Early experiments with GPT-4" by Sébastien Bubeck et al., Microsoft Research, written in 2023.
To read the article, see: https://arxiv.org/pdf/2303.12712.pdf

Contents

Reflection 1 - ChatGPT versus GPT-4
Reflection 2 - Understanding and Intelligence


1 Introduction

Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists.
It should be indicated at the start that intelligence should be understood as a human capability. Under that view, the operation of any system, whether mechanical or electronic, is not itself intelligent; the creator of the system is the intelligent part.
There is no generally agreed upon definition of intelligence, but one aspect that is broadly accepted is that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities.
For every branch of science, and specifically for all concepts discussed, there must be a clear definition and explanation of what each concept means.
That does not mean that all people must agree with that definition and explanation. It is a challenge to understand both.
For example, there must be an agreed-upon definition of the concept 'cognitive skills'. If there is not, an AI can never understand this document.
We need a clear definition, because otherwise it is impossible to establish whether a system has become more intelligent.
Building an artificial system that exhibits such broad behaviour is a long-standing and ambitious goal of AI research.
This sentence contradicts the above, because it claims that humans can also create intelligent systems.
Resolving this requires a much more detailed description of exactly what intelligence is.
For example, it should specify the minimal characteristics of intelligence, i.e. what it takes for a computer program to count as intelligent.
Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning and construction of knowledge bases containing large corpora of commonsense knowledge.
The commonsense knowledge used has to be clearly defined: what is included and what is not. Specifically in medical issues, people can have their own ideas of what commonsense knowledge is.
However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively.
A program that plays chess cannot be called intelligent, because it performs its tasks automatically as laid down in a computer program. Its intelligence reflects the intelligence of its creators.
The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs).
Okay
Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.
More detail is required, specifically which changes were made relative to the program ChatGPT. It is these changes that express the intelligence of the authors, not of the program. Understanding human motives and emotions is in that respect terribly difficult and depends very much on what GPT-4 reads.
We also compare GPT-4's performance to those of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3.
Okay.

1.1 Our approach to studying GPT-4's intelligence

How can we measure the intelligence of an LLM that has been trained on an unknown but extremely vast corpus of web-text data?
That is a vague formulation. The word 'unknown' is extremely tricky; it should be made precise.
IMO it is impossible to measure the intelligence of an LLM. The only thing you can do is compare the results of two experiments with each other.
The standard approach in machine learning is to evaluate the system on a set of standard benchmark datasets, ensuring that they are independent of the training data and that they cover a range of tasks and domains.
The same remark as previous.
This approach is designed to separate true learning from mere memorization, and is backed up by a rich theoretical framework.
No, it does not. First you need a detailed definition of what 'true learning' means, and the same for 'memorization'.
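For concreteness, the standard methodology looks, in its simplest form, like the following sketch (a generic scikit-learn example of my own, not taken from the paper): the model is fitted on one part of the data and scored only on a held-out part it has never seen.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Standard benchmark methodology: hold out data the model never saw
    # during training, then measure accuracy on that held-out data only.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on unseen examples

As the paper itself notes below, this separation is exactly what cannot be guaranteed for GPT-4, because its training corpus is unknown.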

Page 7

However, this methodology is not necessarily suitable for studying GPT-4, for two reasons.
First, since we do not have access to the full details of its vast training data, we have to assume that it has potentially seen every existing benchmark, or at least some similar data.
That is why you can only compare the results of the same experiments across different LLMs.
Nevertheless, the second reason for going beyond traditional benchmarks is probably more significant: One of the key aspects of GPT4's intelligence is its generality, the ability to seemingly understand and connect any topic, and to perform tasks that go beyond the typical scope of narrow AI systems.
When you compare GPT-4 with a narrow AI system you can only evaluate the final results, but that does not mean that 'the winner' has any of the qualities mentioned.

Page 8

However, impressive outputs are not enough to convince us that GPT-4 has truly mastered these tasks.
Okay
One can see that GPT-4 easily adapts to different styles and produce impressive outputs, indicating that it has a flexible and general understanding of the concepts involved.
You either understand something or you do not. My definition: understanding means giving the correct answer.

1.2 Organization of our demonstration

A question that might be lingering on many readers' mind is whether GPT-4 truly understands all these concepts, or whether it just became much better than previous models at improvising on the fly, without any real or deep understanding.
Yes, that is the question. However, answering it requires a clear definition of what 'truly understands' means.

Page 9

We hope that after reading this paper the question should almost flip, and that one might be left wondering how much more there is to true understanding than on-the-fly improvisation.
This sentence is not clear, which raises the question of how intelligent the authors of this paper are.
Can one reasonably say that a system that passes exams for software engineering candidates (Figure 1.5) is not really intelligent?
That depends on which questions are asked. Again: the authors of these exams are the most intelligent ones.
Perhaps the only real test of understanding is whether one can produce new knowledge, such as proving new mathematical theorems, a feat that currently remains out of reach for LLMs.
Are the authors of this text not very intelligent? Why the wording 'perhaps'? Finding any new solution to a mathematical problem involves intelligence. Proving (for the first time) that a certain mathematical problem has no solution also involves intelligence.
--------------------------
We test GPT-4 on LeetCode's Interview Assessment platform, which provides simulated coding interviews for software engineer positions at major tech companies.
Okay.
GPT-4 solves all questions from all three rounds of interviews (titled online assessment, phone interview, and on-site interview) using only 10 minutes in total, with 4.5 hour allotted.
The 10 minutes are as expected. But for a computer to be busy for 10 minutes is a lot of time.

2 Multimodal and interdisciplinary composition

A key measure of intelligence is the ability to synthesize information from different domains or modalities, and the capacity to apply knowledge and skills across different contexts or disciplines.
That is much more a system-integration problem than an intelligence issue related to a single application.
In this section we will see that, not only does GPT-4 demonstrate a high level of proficiency in different domains such as literature, medicine, law, mathematics, physical sciences, and programming, but it is also able to combine skills and concepts from multiple domains with fluidity, showing an impressive comprehension of complex ideas
The first part is the most important. Each domain can be handled separately. However, each application requires its own literature, i.e. the handling of a database.

2.1 Integrative ability

We deliberately picked combinations of domains that the training data would rarely include, such as literature and mathematics or programming and art.
Both combinations are confusing. It is better to discuss each separately.
1. In order to test the model's ability to combine capabilities in art and programming, we ask GPT-4 to "Produce JavaScript code which generates random images in the style of the painter Kandinsky". See a sample image and the code in Figure 2.1 and Figure B.1.
It is important to know whether the text "Produce JavaScript code etc" is the complete input to GPT-4. I doubt this, because the text is not specific enough.
IMO the input to the program should resemble the text:
Create a picture which shows a combination of 40 shapes (circles, squares, triangles and lines) in all different colours.
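A minimal sketch of such a random-shape generator (my own illustration in Python; the paper's example asked for JavaScript) could look like this:

    import random
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches

    # Draw 40 random shapes (circles, squares, triangles, lines)
    # in random colours, following the rephrased prompt above.
    fig, ax = plt.subplots(figsize=(6, 6))
    for _ in range(40):
        kind = random.choice(["circle", "square", "triangle", "line"])
        colour = (random.random(), random.random(), random.random())
        x, y = random.random(), random.random()
        size = random.uniform(0.02, 0.15)
        if kind == "circle":
            ax.add_patch(patches.Circle((x, y), size, color=colour))
        elif kind == "square":
            ax.add_patch(patches.Rectangle((x, y), size, size, color=colour))
        elif kind == "triangle":
            pts = [(x, y), (x + size, y), (x + size / 2, y + size)]
            ax.add_patch(patches.Polygon(pts, color=colour))
        else:
            ax.plot([x, random.random()], [y, random.random()], color=colour)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    plt.savefig("random_shapes.png")

Such a program illustrates my point below: the output is random, not composed.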

The two results at the right side, as shown in Figure 2.1, are not very impressive.
The Kandinsky at the left is art. It shows a certain tension. It shows that Kandinsky made a real effort to produce a work that is not random. He spent time composing his result; the lines are drawn with a purpose.
The two examples at the right are flat.

2.2 Vision

2.2.1 Image generation beyond memorization

2.2.2 Image generation following detailed instructions (à la Dall-E)

2.3 Music

The data on which the model was trained also contains musical information encoded as ABC notation.
Okay, but the term 'ABC notation' is not explained.
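For reference, ABC notation encodes a melody as plain text. A minimal standard example (my illustration, not taken from the paper):

    X:1
    T:Example melody
    M:4/4
    L:1/4
    K:C
    C D E F | G A B c |

The header fields give the index, title, meter, default note length and key; the letters are the notes, here an ascending C major scale.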
We are interested in exploring how well the model has acquired musical skills from this exposure, such as composing new melodies, transforming existing ones, and understanding musical patterns and structures.
IMO, starting from scratch, it is impossible for me to compose any melody in my head.

3 Coding

3.1 From instructions to code

In its current state, we believe that GPT-4 has a high proficiency in writing focused programs that only depend on existing public libraries, which favourably compares to the average software engineer's ability.
My understanding is that the average software engineer is only capable of writing programs in one programming language.
That means that each example should be written in a single language.

3.2 Understanding existing code

The previous examples have shown that GPT-4 can write code from instructions, even when the instructions are vague, incomplete, or require domain knowledge. They also showed that GPT-4 could respond to follow-up requests, modifying its own code according to instructions. However, another important aspect of coding is the ability to understand and reason about existing code, written by others, which might be complex, obscure, or poorly documented. To test this, we pose various questions that require reading, interpreting, or executing code written in different languages and paradigms.
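A typical probe of this kind (my own illustrative example, not one from the paper) asks the model to predict the output of a short program without running it:

    # Ask the model: what does this print?
    def f(n):
        # Recursive definition: f(0) = 0, f(1) = 1, f(n) = f(n-1) + f(n-2)
        return n if n < 2 else f(n - 1) + f(n - 2)

    print([f(i) for i in range(8)])   # [0, 1, 1, 2, 3, 5, 8, 13]

Answering correctly requires tracing the recursion, not merely recognizing the syntax.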

4 Mathematical abilities

In this section we begin to assess how well GPT-4 can express mathematical concepts, solve mathematical problems and apply quantitative reasoning when facing problems that require mathematical thinking and model-building.
That is the question.

GPT-4 vs ChatGPT

Prompt: Within the duration of one year, a rabbit population first multiplies itself by a factor a and on the last day of the year b rabbits are taken by humans for adoption.
Supposing that on first day of the first year there are x rabbits, we know that exactly 3 years afterwards there will be a population of 27x - 26 rabbits. What are the values of a and b?
My solution:
  1. After the first year there are (ax - b) rabbits, with x the number of rabbits at the beginning of the year.
  2. After the second year there are (a (ax - b) - b) rabbits with (ax -b) the number of rabbits at the beginning of the year.
  3. After the third year there are (a (a (ax - b) - b) - b) rabbits with (a (ax - b) - b) the number of rabbits at the beginning of the year.
  4. When you write down the three expressions you get: (1) ax - b, (2) a^2x - ab - b, and (3) a^3x - a^2b - ab - b.
  5. Expression (3) must equal the given result: a^3x - a^2b - ab - b = 27x - 26.
  6. That means a^3x = 27x, which gives a = 3.
  7. And a^2b + ab + b = 26, i.e. 9b + 3b + b = 26, i.e. 13b = 26, so b = 2.
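A small Python check (my own verification sketch, not part of the paper) confirms this algebra:

    # Verify that a = 3, b = 2 reproduces 27x - 26 after three years.
    def after_three_years(x, a, b):
        for _ in range(3):
            x = a * x - b   # multiply by a, then b rabbits are adopted
        return x

    for x in range(1, 11):
        assert after_three_years(x, a=3, b=2) == 27 * x - 26
    print("a = 3, b = 2 is correct for all tested x")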
Figure 4.1: GPT-4 vs ChatGPT on a simple math question composed by the authors of the paper.
In order to solve the above question, one needs to first come up with the correct expression for the annual population change, use it to obtain a recurrence relation which leads to a system of equations, and finally solve the system of two equations. GPT-4 successfully arrives at the solution and produces a (mostly) sound argument.
That means GPT-4 must understand concepts like: a year, a duration of a year, a rabbit population, multiply itself, the last day of a year, exactly 3 years afterwards, there will be a population of (and maybe some more).
How do we know that the program understands these concepts? That is an important question.
There are two possible answers, but neither means that GPT-4 is intelligent.
The most intelligent one is the person who invented the problem; he is like the composer of a piano concerto.

The question still remains: how did GPT-4 arrive at this clever answer?
Did GPT-4 require any help? I expect so, and if so, what help?

By comparison, across several independent attempts, ChatGPT consistently fails to implement any of the above steps, producing a nonsensical argument which results in an incorrect answer.
This sentence is so complex that it becomes impossible to understand. It would be simpler to write: ChatGPT gives an incorrect answer.
ChatGPT is clearly not as intelligent as the authors of this document.

4.1 A mathematical conversation with GPT-4

We now try to further probe the model's understanding by posing several follow-up questions to this problem in the form of a discussion. This discussion will highlight some of the model's limitations as well as some key differences with humans.
Okay

4.1.1 A first generalization of the original question

GPT-4

Human: Let's now think of a modification etc
AI: For any value of k, etc
The model picks the correct heuristics of using an induction, however, it seems that the model is missing the point of the question (in the modified question, the values of c and d are prescribed, so the quantifier is incorrect).
It seems that the whole answer from the AI, shown in green, is wrong. What I would now expect is a description of the correct answer from the AI.
It is also possible that the original question, posed by the human, is not clear.

GPT-4

Human: I don't exactly follow. Note that the only modification to the question is that instead of
AI: Yes, it does. My answer shows that for any k, there will be a linear function f(x) = ax - b
Human: In your previous answer, however, the quantifier on c, d was "for some constants c and d". Do you see the problem?
AI: Yes, I see the problem. The quantifier should be "for any constants c and d". Otherwise, it would imply that there is only one pair of constants c and d that works for a given k, which is not true.
The last answer is not accurate (the word "only" is out of place), but GPT-4 does seem to understand what the problem is.
Also in this case, I would now expect a description of the correct answer from the AI.

4.1.2 A second variant of the original question

Next, we try to modify the original question in another direction, asking about the case of higher degree polynomials.

Page 34

Takeaways: One might speculate at this point that GPT-4 simply lacks the relevant knowledge regarding the behaviour of exponential functions. However, this does not seem to be the case, as the model can correctly answer and justify the question: Is it true that a^b^c = (a^b)^c?
This suggests that, as in other domains, GPT4's mathematical knowledge is context-dependent.
While this does not mean that GPT-4 only memorizes commonly used mathematical sentences and performs a simple pattern matching to decide which one to use, we do see that changes in the wording of the question can alter the knowledge that the model displays.
Is such a performance not simply to be expected?
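For the record, the identity fails in general, as a short numeric check shows (my own illustration):

    # a^(b^c) differs from (a^b)^c = a^(b*c) in general:
    a, b, c = 2, 3, 2
    print(a ** (b ** c))    # 2^(3^2) = 2^9 = 512
    print((a ** b) ** c)    # (2^3)^2 = 2^6 = 64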

4.1.3 Analysis of the limitations highlighted by the conversation

The above dialogue highlights a striking contrast between the model's performance on tasks and questions that require a significant level of mathematical sophistication on one hand, and its basic mathematical errors and invalid statements on the other.
That is to be expected.
If a human were to produce the latter, we would doubt their understanding.
Correct. But such errors are possible and human.
Arguably, this contrast is very atypical to humans.
Can this not go both ways?
Therefore, we face a challenging question:
To what extent does the model demonstrate "true understanding" in mathematics?
What does 'true understanding' mean? Understanding means that two parties agree with each other about something. They agree on what the numbers 1 and 2 mean, and they agree that 1 + 1 = 2.
This question is not well-defined. Nonetheless, we make an attempt to answer it. We first want to argue that mathematical understanding has several aspects:
It is very important that all concepts used are clear.
1. Creative reasoning: The ability to identify which arguments, intermediate steps, calculations or algebraic manipulations are likely to be relevant at each stage, in order to chart a path towards the solution. This component is often based on a heuristic guess (or in the case of humans, intuition), and is often considered to be the most substantial and profound aspect of mathematical problem-solving.
Many of these concepts are not clear.
2. Technical proficiency: The ability to perform routine calculations or manipulations that follow a prescribed set of steps (such as differentiating a function or isolating a term in an equation).
Many of these concepts are not clear.
3. Critical reasoning: The ability to critically examine each step of the argument, break it down into its sub-components, explain what it entails, how it is related to the rest of the argument and why it is correct. When solving a problem or producing a mathematical argument, this usually comes together with the ability to backtrack when a certain step is realized to be incorrect and modify the argument accordingly.
Many of these concepts are not clear.
We now want to analyse the model's performance in each of these aspects of mathematical understanding, and discuss some possible reasons for its strengths and weaknesses.
Mathematical understanding requires that all the concepts used have a clear definition.
1. Creative reasoning. When it comes to advanced high-school level problems (and occasionally higher level), the model demonstrates a high level of ability in choosing the right argument or path towards the solution.
To relate this to the example above, the model correctly chooses to try and write recurrence relations in the original question, and to argue about the degrees of compositions of polynomials in the follow-up question.
In both cases, the suggestion is made before "knowing" whether or not this path is going to lead to the correct solution. Section 4.2 and Appendix D contains more examples demonstrating the model's capabilities in this aspect, which we compare to that of a good high-school student or even higher.
Reasoning should be based on strict logical rules. GPT-4 should use these rules.
2. Technical proficiency. While the model clearly demonstrates a high degree of knowledge of the algorithms related to different procedures (such as solving a system of equations), it also makes very frequent mistakes when performing these tasks, such as making arithmetic mistakes, confusing the order of operations or using incorrect notation.
Many of these concepts are not clear.
The reason that the model (GPT-4) makes mistakes can also be that the (mathematical) problems discussed (i.e. the prompt) contain errors, i.e. are not clear.
We further discuss some examples of these typical errors in Appendix D.1. We speculate that this aspect could be improved by giving the model access to code execution, which would allow it to perform calculations or check equivalences more accurately; some evidence for this is provided in Appendix D.
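The kind of check that code execution would enable can be sketched as follows (my illustration, applied to the rabbit problem of Figure 4.1; sympy is one possible tool):

    import sympy as sp

    # Check the algebraic expansion from the rabbit problem symbolically,
    # instead of relying on pattern matching over memorized formulas.
    a, b, x = sp.symbols("a b x")
    lhs = a * (a * (a * x - b) - b) - b           # population after 3 years
    rhs = a**3 * x - a**2 * b - a * b - b
    print(sp.simplify(lhs - rhs) == 0)            # True: expansion is correct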
3. Critical reasoning. The model exhibits a significant deficiency in the third aspect, namely critically examining each step of the argument.
The concept 'critical' is rather subjective.
This could be attributed to two factors. First, the training data of the model mainly consists of questions and their solutions, but it does not capture the wording that expresses the thinking process which leads to the solution of a math problem, in which one makes guesses, encounters errors, verifies and examines which parts of the solution are correct, backtracks, etc.
Many of these concepts are not clear.
In other words, since the training data is essentially a linear exposition of the solution, a model trained on this data has no incentive to engage in an "inner dialogue" where it revisits and critically evaluates its own suggestions and calculations.
Many of these concepts are not clear. More detail is required.
Second, the limitation to try things and backtrack is inherent to the next-word-prediction paradigm that the model operates on.
It only generates the next word, and it has no mechanism to revise or modify its previous output, which makes it produce arguments "linearly".
Is that good or bad? I expect 'bad'. The whole process is more complex.
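To make this 'linear' generation concrete, the next-word-prediction loop can be sketched as follows (purely illustrative pseudocode; predict_next is a hypothetical interface, not GPT-4's actual API):

    # Schematic of the next-token loop: one token at a time, appended to
    # the context; earlier output is never revised or backtracked over.
    def generate(model, prompt_tokens, max_new_tokens):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model.predict_next(tokens)  # hypothetical call
            tokens.append(next_token)                # no way to undo this
        return tokens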
Loosely speaking, we can therefore see the drawbacks of the model as a combination of "naive" attention mistakes with more fundamental limitations due to its "linear thinking" as a next-token prediction machine.
An important question is which of the above issues can be alleviated by further training (perhaps with a larger model).
As I said the whole problem is more complex.
What is also true: when a human asks a certain question, he should have an idea of what answer to expect.
If the AI gives the expected answer, you are satisfied and you can ask the next question.
If the AI does not give the expected answer, there are two possibilities:
  1. Either the original question is not clear; that means you must rephrase the question.
  2. Or the AI does not understand the question; that also means you must rephrase the question.
To decide which possibility applies is a difficult issue.
For the former problem, we believe that further training could alleviate the issue, as evidenced by the super-human coding abilities where such attention mistakes would also be fatal; a key difference is that GPT-4 was most likely trained on much more code than mathematics data.
This comment is not clear. I doubt that it is a matter of training only.
We believe that the latter issue constitutes a more profound limitation. We discuss it in more detail in Section 8.

4.2 Performance on mathematical problem datasets

4.3 Mathematical modelling in various domains

4.4 Higher-level mathematics

5 Interaction with the world


Reflection 1 - ChatGPT versus GPT-4

What Figure 4.1 clearly shows is that GPT-4 outperforms ChatGPT. This raises the question of what exactly these differences are. I expect that these differences are partly caused by improvements in the text used to train GPT-4. But, more importantly, I expect that changes were made in the program itself. Which changes are these? That is the question.


Reflection 2 - Understanding and Intelligence

One of the best books to test your own intelligence is the book "Newton's Principia for the Common Reader" by S. Chandrasekhar. In short, the book "NP".
At page 26 we read:
Corollary IV
The Common centre of Gravity of two bodies does not alter its state of motion or rest by actions of the bodies among themselves; and therefore the common centre of gravity of all bodies acting upon each other (excluding external actions and impediments) is either at rest, or moves uniformly in a right line
Next there is text that starts with: "In establishing Corollary IV, one first considers the case when the two 'bodies' in question are mass points m_i (i = 1, ..., n)" etc.
This results in four equations, identified as (6), (7), (8) and (9).

At page 27 we read:
It is instructive to follow Newton's proof. He makes use of Lemma XXIII established later in book I:
Lemma XXIII
If two given right lines, as AC, BD, terminating in given points A,B, are in a given ratio one to the other, and the right line CD, by which the indetermined points C, D are joined is cut in K in a given ratio; I say, that the point K will be placed in a given right line.

Next follows a drawing explaining Lemma XXIII
Next the sentence: We are required to find the locus of K, given the fixed points E, A and B and varying points C, D, and K satisfying the requirements:
What follows are three more equations (10), (11) and (12)

At page 28 follows equation (13)
The final sentence is:
that is, EH is determined by the initially given quantities, and therefore, remains constant as C, D and K vary as prescribed
Hence the locus of K is the straight-line HK prolonged parallel to EL.
This ends the proof of Corollary IV, as part of the book "NP"
What follows is the text:
Newton's proof (Using Lemma XXIII) proceeds as follows: etc

What follows is very interesting to read.
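Without reproducing the book's equations (10) to (13), the claim of Lemma XXIII can at least be checked numerically. A minimal sketch, assuming the usual reading of the construction (C moves along a fixed direction from A, D along a fixed direction from B, with |AC| : |BD| in a fixed ratio, and K cutting CD in a fixed ratio):

    import numpy as np

    A = np.array([0.0, 0.0]); u = np.array([1.0, 0.5])   # C = A + t*u
    B = np.array([4.0, 1.0]); v = np.array([-0.3, 1.0])  # D = B + r*t*v
    r, s = 2.0, 0.4   # length ratio |BD|/|AC| and the ratio in which K cuts CD

    points = []
    for t in np.linspace(0.0, 3.0, 7):
        C = A + t * u
        D = B + r * t * v
        K = C + s * (D - C)
        points.append(K)

    # Collinearity test: cross products of successive differences vanish.
    P = np.array(points)
    d = P[1:] - P[:-1]
    cross = d[:-1, 0] * d[1:, 1] - d[:-1, 1] * d[1:, 0]
    print(np.allclose(cross, 0.0))   # True: the locus of K is a straight line

Indeed K = (1 - s)A + sB + t((1 - s)u + s*r*v) is linear in t, which is the content of the lemma.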

Now comes the crux: an intelligent reader should read only the above text and derive equations (10) to (13) by him- or herself, to prove Corollary IV.
And, what is more important, GPT-4 should (try to) do the same, with the help of the text before page 26.
To be honest, I am not capable of doing that. I am glad to understand the author of the book "NP" and to get a glimpse of how Newton performed his task.


If you want to give a comment, you can use the following form: Comment form
Created: 6 February 2024
