Measuring Intelligence — The Role of Benchmarks in Evaluating AGI

By SingularityNET May 29, 2024


Dear Singularitarians,

The development of Artificial General Intelligence (AGI) represents one of the ultimate goals of AI research. While the precise definition or characterization of AGI is not broadly agreed upon, the term “Artificial General Intelligence” has multiple closely related meanings, referring to the capacity of an engineered system to:

  • display the same rough sort of general intelligence as human beings;
  • display intelligence that is not tied to a highly specific set of tasks;
  • generalize what it has learned, including generalization to contexts qualitatively very different than those it has seen before;
  • take a broad view, and flexibly interpret its tasks at hand in the context of the world at large and its relation thereto.

Achieving this milestone requires not only robust methods for developing AGI but also means with which we can measure and evaluate AGI’s progress. As researchers worldwide are constantly making strides in this field, the role of benchmarks becomes increasingly important the closer we get to the advent of general intelligence.

In this article, we’ll explore the importance of benchmarks in AGI evaluation, studying how some standardized tests may provide us with a clear and objective measure of a machine’s journey toward true, human-like intelligence.

It all started with the Turing test

The Turing Test, proposed by Alan Turing in 1950, is the most well-known benchmark for AI. It involves three terminals: one controlled by the computer and two by humans.

One human acts as a questioner, and the other human and the computer respond. The questioner must determine which respondent is the machine.

The computer passes the test if the questioner cannot reliably distinguish it from the human. Early programs could manage this only when the exchange was restricted to simple yes/no questions; the test becomes significantly more challenging once conversational or explanatory queries are allowed.
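Below is a minimal sketch of the imitation-game setup described above, written as a scoring harness: a judge labels blinded transcripts as “human” or “machine”, and the machine is said to pass when the judge does no better than chance. The judge function and the transcripts here are placeholders, not a real evaluation.

```python
# Toy harness for a Turing-test-style evaluation.
# The judge below guesses at random as a stand-in; a real test uses a human
# interrogator conversing live with both a human and the machine.
import random

def judge(transcript: str) -> str:
    """Placeholder interrogator: label a transcript as 'human' or 'machine'."""
    return random.choice(["human", "machine"])

# Hypothetical blinded transcripts, each paired with its true source.
transcripts = [
    ("Q: What's 2 + 2?  A: Four. Why do you ask?", "machine"),
    ("Q: How was your morning?  A: Coffee first, then email.", "human"),
] * 50  # repeated only to get a stable estimate from the random judge

correct = sum(judge(text) == source for text, source in transcripts)
accuracy = correct / len(transcripts)
print(f"Judge accuracy: {accuracy:.2f}  (close to 0.50 means the machine is indistinguishable)")
```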

The Robot College Student Test

In 2012, the “Robot College Student” test was proposed by Dr. Ben Goertzel. Its reasoning is simple: if an AI is capable of obtaining a degree in the same way a human does, then it should be considered conscious. This test evaluates an AI’s ability to learn, adapt, and apply knowledge in an academic setting.

Dr. Ben Goertzel’s idea, standing as a reasonable alternative to the famous “Turing test,” might have remained a thought experiment were it not for the successes of several AIs, most notably GPT-3, the language model created by the OpenAI research laboratory. However, Bina48, a humanoid robot AI, was the first to complete a college class, at Notre Dame de Namur University in 2017. Another example is the robot AI-MATHS, which completed two versions of a math exam in China. Although capable of completing college classes and exams, these AIs still have a long way to go before reaching sentience and true general intelligence.

The Coffee Test

The Coffee Test, commonly attributed to Apple co-founder Steve Wozniak and frequently cited by Dr. Ben Goertzel, involves an AI application making coffee in a household setting. The AI must find the ingredients and equipment in an unfamiliar kitchen and perform the simple task of making a cup of coffee. This test assesses the AI’s ability to understand and navigate a new environment, recognize objects, and execute a complex sequence of actions, reflecting its practical intelligence.

Other standardized tests and benchmarks used to evaluate AI

Evaluating whether an AI is on the path to becoming AGI involves assessing its capabilities across the widest possible range of cognitive tasks, as it has to demonstrate versatility, generalization, and adaptability akin to human intelligence.

Here are some key benchmarks and criteria that are often considered:

· Learning and Adaptation

· Common Sense Reasoning

· Creativity and Innovation

· Versatility in Problem-Solving

· Natural Language Understanding (and Generation)

· Perception and Interaction

· Generalization

· Ethical and Moral Reasoning

To assess these benchmarks, a combination of standardized tests, real-world challenges, and continuous evaluation across multiple domains is essential.
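As a toy illustration of what evaluation across multiple domains could look like in practice, the sketch below aggregates hypothetical per-criterion scores into a single profile. The criterion names mirror the list above; the numbers, and the choice to report the weakest area alongside the average, are purely illustrative assumptions rather than any established methodology.

```python
# Toy aggregation of hypothetical per-criterion scores (0.0-1.0) into a profile.
# The numbers are made up; a real evaluation would derive each score from a
# battery of standardized tests and real-world challenges.
from statistics import mean

scores = {
    "learning_and_adaptation": 0.72,
    "common_sense_reasoning": 0.55,
    "creativity_and_innovation": 0.40,
    "versatility_in_problem_solving": 0.63,
    "natural_language_understanding": 0.81,
    "perception_and_interaction": 0.47,
    "generalization": 0.50,
    "ethical_and_moral_reasoning": 0.35,
}

# Report the average alongside the weakest area: generality is limited by the
# weakest capability, not just by the average one.
weakest = min(scores, key=scores.get)
print(f"Average score: {mean(scores.values()):.2f}")
print(f"Weakest area:  {weakest} ({scores[weakest]:.2f})")
```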

Here are some of the current proposed evaluation frameworks:

· The AI2 Reasoning Challenge (ARC) is a benchmark dataset created by the Allen Institute for AI (AI2) to assess an AI’s reasoning abilities using grade-school-level science questions. It contains two sets of questions: an Easy set of surface-level questions and a Challenge set that requires complex reasoning and the integration of multiple sources of knowledge to find the right answer. Its main goal is to push the boundaries of what a machine can comprehend and reason about (a minimal scoring sketch appears after this list).

· The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding (NLU) tasks. It comprises several task types, such as sentiment analysis (is a certain sentiment expressed in a piece of text?), textual entailment (does one sentence logically follow from another?), and semantic similarity (how close in meaning are two different sentences?). GLUE was designed to evaluate and foster progress in the development of AI systems that can understand and generate human language.

· The Winograd Schema Challenge is a test designed to evaluate an AI’s ability to understand context and resolve ambiguities in natural language, specifically focusing on pronoun disambiguation. It probes a deeper understanding of language and context, one that goes beyond mere statistical pattern recognition to include real-world knowledge and reasoning. An AI that succeeds at the Winograd Schema Challenge can make contextually appropriate judgments and therefore demonstrates a more human-like understanding of language (a toy schema example also follows this list).
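As a concrete illustration of benchmark-style evaluation, here is a minimal sketch that scores a trivial baseline on the ARC Challenge set. It assumes the Hugging Face datasets copy of ARC published as allenai/ai2_arc and its field names (question, choices, answerKey); a real evaluation would replace the baseline with an actual model.

```python
# Minimal ARC scoring sketch: accuracy of a trivial "always pick the first
# option" baseline on the ARC-Challenge validation split.
from datasets import load_dataset

def first_choice_baseline(question: str, choice_labels: list[str]) -> str:
    """Placeholder model: always answers with the first option's label."""
    return choice_labels[0]

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation")

correct = 0
for item in arc:
    prediction = first_choice_baseline(item["question"], item["choices"]["label"])
    correct += prediction == item["answerKey"]

print(f"Baseline accuracy: {correct / len(arc):.3f}")  # swap in a real model here
```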
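And here is a toy example of the Winograd schema format itself: a single “special word” flips which referent the pronoun points to, so surface statistics alone are of little help. The schema shown is the classic trophy/suitcase example; the resolver is a hypothetical stand-in that a real evaluation would replace with a model scoring each candidate referent.

```python
# Toy Winograd schema: the special word ("big" vs. "small") determines whether
# "it" refers to the trophy or the suitcase.
SCHEMA = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {word}.",
    "candidates": ["the trophy", "the suitcase"],
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

def resolve(sentence: str, candidates: list[str]) -> str:
    """Hypothetical resolver: a real system would score each candidate with a model."""
    return candidates[0]  # naive guess that ignores the sentence entirely

correct = 0
for word, gold in SCHEMA["answers"].items():
    prediction = resolve(SCHEMA["sentence"].format(word=word), SCHEMA["candidates"])
    correct += prediction == gold

print(f"{correct}/2 schema variants resolved correctly")
```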

How do we create an effective AGI benchmark?

Creating effective benchmarks for AGI is a complex, challenging, and multifaceted problem.

It starts with defining what intelligence is: intelligence spans a wide range of cognitive abilities such as reasoning, problem-solving, learning, perception, and emotional understanding, which makes the creation of comprehensive benchmarks very difficult.

AGI is expected to excel across diverse tasks, from simple arithmetic to complex decision-making and creative thinking, and naturally, this further complicates designing benchmarks to evaluate such a broad spectrum of capabilities.

Since human intelligence evolves with experience and learning, AGI benchmarks must account for this dynamic nature, assessing both static performance and the ability to adapt over time.

With all that said, it’s clear that benchmarks play a central role in evaluating progress towards AGI, as they provide a standardized, objective means of measuring it.

However, we still have a long way to go until an effective benchmark is created due to the sheer magnitude and complexity involved. As research in AGI advances, so too will the sophistication and comprehensiveness of our benchmarks, bringing us closer to the goal of achieving true artificial general intelligence.

About SingularityNET

SingularityNET is a decentralized Platform and Marketplace for Artificial Intelligence (AI) services founded by Dr. Ben Goertzel with the mission of creating a decentralized, democratic, inclusive, and beneficial Artificial General Intelligence (AGI).

  • Our Platform, where anyone can develop, share, and monetize AI algorithms, models, and data.
  • OpenCog Hyperon, our premier neural-symbolic AGI Framework, will be a core service for the next wave of AI innovation.
  • Our Ecosystem, developing advanced AI solutions across market verticals to revolutionize industries.
