Meaning and the world: the symbol grounding problem

By Casey Kennington

What is the meaning of the word 'red'? How do small children learn it? What about AI systems? Prof. Casey Kennington explains how natural languages are intrinsically linked to the real world, and how humans naturally ground symbols in their physical environment... in stark contrast with LLMs.

By speaking or writing a few words in the right order, you can communicate with other people. Have you ever considered how miraculous that is? We all think differently, yet despite our differences, we can share thoughts and ideas with each other.

Yet, misunderstandings do arise. You have probably had to define words in order to be clear about what you mean in a given situation. But have you ever taken a step back and considered how all words come to mean what they do? And in this world of AI that we live in, how do large language models (LLMs) learn language, and does it matter that it’s vastly different from how humans learn language?

One thing I like to ask people (especially students in my Natural Language Processing course) is: “What is the meaning of the word red?” It should be an easy question to answer, but that question usually stumps them, and is often the beginning of a long and interesting discussion. Without fail, one person says, “red is a color.” Yes, but so is green. And yellow. Does that mean they have the same meaning? Of course not; color is more of a category than a meaning or even a definition. Someone will then say something about the light spectrum and wavelengths, giving a more technical definition of what red means. You might nod in agreement, but then we wonder together how it is that small children can learn the meaning of red without needing to have light wavelengths explained to them. All children need is a few examples of red things, and they learn what red means. Correction: they learn what red can refer to.

“The meaningfulness of language lies in the fact that it is about the world,” writes Kathleen Dahlgren (Dahlgren 1976),¹ explaining that the meaning of many words comes from the things they refer to; the connotation (the sense, or idea of a term) is derived from repeated exposures to denotations (entities in the world that the term refers to). In his seminal paper The Symbol Grounding Problem (Harnad 1990),² Stevan Harnad argues that there are many words that arrive at their meaning because of our physical world experience, words like red, chair, dog, or anything else that physically exists. These are concrete words that have real-world, physical examples, whereas abstract words are non-physical ideas. In the world of computational devices and language, we call this area of natural language processing grounded semantics.

Of course, Harnad doesn’t just claim that the meaning of all words is referential. Rather, he makes a claim that the human experience is that we learn concepts often without knowing a word for the concept. For example, a small child knows about hunger, or that a drink can quench thirst before she knows that there are words that refer to those concepts. Only later are the symbolic words hunger, drink, and thirst “grounded” into those concepts when the relevant words are learned through interactive, person-to-person spoken dialogue with other speakers of a language.

The concept-then-symbol direction of learning is more common than we might recognize. Small children who cannot yet speak do it all the time, as the above examples suggest. New concepts also emerge through experience throughout life: someone might experience prejudice without knowing the word racism, or struggle to control their attention before they learn about ADHD.

Throughout our lives we might have any number of ungrounded concepts floating around in our heads. Anyone who speaks multiple languages has realized this. The German word doch, for example, means something like “in contrast to what you were thinking, yes.” If someone asks “are you not going to lunch?” in English, if we answer “yes” we have to then disambiguate if “yes” agrees with the question of “not going” (negative polarity) or if “yes” means “going” (positive polarity). Doch is a polarity switcher: if used in answer to the question about not going to lunch, the answer is clear that it switched from the negative polarity of the question to the positive polarity of the answer. The concept of polarity switch is very clear in many languages like German, French, and Spanish, but it’s clunky in English. It’s a concept we have in our heads, but we don’t make use of it very well.

So concept-then-symbol happens throughout life as we learn new concepts. The other direction, symbol-then-concept, happens a lot as well. In fact, learning symbols then concepts is probably more common, especially during formal education after learning to read. Teachers often introduce a new word (concept) then explain its definition (description of the meaning).

Whether a concept is concrete or abstract does not matter: either direction, concept-then-symbol or symbol-then-concept, can apply. For example, the concrete term red might be learned concept-then-symbol by seeing red things and later hearing the word red used in conjunction with them. The word zebra, also concrete, might instead be learned by a child in the U.S. through symbol-then-concept: the word is learned first, then explained as a “horse with stripes,” a concept the child has never seen but can imagine. Only later might they see what zebras look like and ground the symbol zebra.

I could talk about grounded semantics all day, but the point I am driving at is, I think, very important: LLMs like ChatGPT do not learn language in any way like I have described above with concepts and symbols. ChatGPT is only learning symbols and how they are used in text. It’s not terribly unlike the symbol-then-concept learning, but the concept is always abstract—never grounded. One of the most powerful things about language is that it allows us to talk about anything abstractly (even concrete concepts) both about things generally (e.g., dogs) and things specifically (e.g., Fido) even when they aren’t physically present. Is that really a problem for LLMs?

I think it might be. It’s important to understand the progression of language learning for humans: children generally learn concrete words first, then move towards more abstract words as they become more educated. Only 10% of a four-year-old’s vocabulary is made up of abstract words (Borghi et al. 2019).³ That share increases a lot over the years: a five-year-old is already at 25%, and a twelve-year-old is at roughly 40% (Ponari, Norbury, and Vigliocco 2018).⁴ That means that most words learned in the first decade of our lives refer to something in the world, and there needs to be a concrete conceptual substrate that those words ground into. In the last few years, researchers have taken this to heart and added visual knowledge to LLMs (see Fields and Kennington, 2023⁵ for a review), but concrete words go beyond vision. Words like garlic are grounded in olfactory and gustatory senses, and words like kick are grounded in proprioceptive muscle memory. If you were to close your eyes and make a “thumbs up” gesture, you could do it because the muscle memory of thumbs up is grounded in a particular muscle configuration in your hand. Some LLMs like PaLM-E are trying to make it possible to incorporate multiple modalities (Driess et al. 2023),⁶ but they still rely heavily on learning from large amounts of text to do anything useful. Moreover, they don’t actually solve the symbol grounding problem because they only model concepts that are learned in the symbol-then-concept direction. As Schlangen (2023)⁷ has observed, it seems as though the research progression for LLMs is precisely the opposite of the progression that children experience as they learn language.

It’s crucial to note that one cannot learn an abstract concept from its definition unless one has a grasp of the vocabulary used in the definition. For example, the Oxford Dictionary definition of democracy, an abstract concept, is “control of an organization or group by the majority of its members.” In order, therefore, to understand what democracy means, one has to know the meaning of all of the words in the definition, and the meaning of the composed description. Harnad breaks this important idea down in a recent article. The main point is that, clearly, there are many abstract words that are defined by other abstract words, but if we take a random word in a dictionary, look up each word in its definition and find their respective definitions, we eventually find our way to a “minimal grounding set” of words that cannot be defined by other words (Vincent-Lamarre et al., 2016).⁸ The authors found about 1000 words in the minimal grounding set of each dictionary they analyzed, which suggests that, though the minimal grounding set might be different for everyone, such sets form the basis of language upon which all other linguistic concepts are built. LLMs don’t do that; rather, they form a distributional representation of words based on other words.
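The idea that definitions only help once some words are already grounded can be made concrete with a small sketch. The toy dictionary below is invented purely for illustration (it is not taken from Vincent-Lamarre et al.), and the function does not compute a minimal grounding set, which is a much harder problem. It only checks which words become learnable from definitions alone once a given set of words is assumed to be grounded by experience.

```python
def learnable_from(grounded, dictionary):
    """Return every word that can be understood starting from `grounded`.

    A word becomes known once every word in its definition is known.
    `dictionary` maps each word to the list of words used to define it.
    """
    known = set(grounded)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for word, definition in dictionary.items():
            if word not in known and all(w in known for w in definition):
                known.add(word)
                changed = True
    return known


# Hypothetical toy dictionary: every word is defined only by other words,
# so the definitions are circular, as in a real dictionary.
TOY_DICTIONARY = {
    "zebra": ["horse", "stripe"],
    "horse": ["big", "animal"],
    "stripe": ["long", "line"],
    "big": ["large"],
    "large": ["big"],
    "animal": ["living", "thing"],
    "living": ["alive"],
    "alive": ["living"],
    "thing": ["object"],
    "object": ["thing"],
    "long": ["big", "line"],
    "line": ["long", "mark"],
    "mark": ["line"],
}

# With nothing grounded, definitions alone teach nothing.
nothing_grounded = learnable_from(set(), TOY_DICTIONARY)

# Assume four words are grounded through experience; the rest follow.
well_grounded = learnable_from({"big", "living", "thing", "line"}, TOY_DICTIONARY)

# Leave "line" ungrounded and "zebra" can no longer be reached.
partly_grounded = learnable_from({"big", "living", "thing"}, TOY_DICTIONARY)
```

In this toy setup, four grounded words are enough to reach all thirteen, while removing a single one of them leaves zebra out of reach. The specific words chosen as grounded are an assumption of the sketch, not a claim about which words people actually ground first.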

In that recent article, Harnad was responding to the claim that LLMs ground “indirectly” through the experience of others. I agree with Harnad: while LLMs learn what they can from the text that they are given, meaning is assigned by humans when they read the words that LLMs output; words aren’t meaningful to LLMs, particularly concrete words. Put another way, like dictionaries, LLMs learn circular approximations of the meaning of words (following the distributional hypothesis of semantics). Powerful, given enough text, but still only approximate.
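To see what a purely distributional representation looks like, here is a minimal sketch that represents each word by counts of the words around it. This is a simple counting stand-in for the distributional hypothesis, not how LLMs are actually trained (they learn dense vectors with neural networks), and the corpus, function names, and window size are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, invented for illustration.
TOY_CORPUS = [
    "the red apple is sweet",
    "the green apple is sweet",
    "the red car is fast",
    "the green car is fast",
    "garlic smells strong",
]

vectors = cooccurrence_vectors(TOY_CORPUS)
red_vs_green = cosine(vectors["red"], vectors["green"])    # close to 1
red_vs_garlic = cosine(vectors["red"], vectors["garlic"])  # no shared contexts
```

In this toy corpus red and green appear in exactly the same contexts, so their vectors are indistinguishable. The representation captures that both words behave like colors in text, but nothing in it says what either color looks like, which is the sense in which the learned meaning stays approximate.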

The Chinese Room Thought Experiment and the Intelligent Octopus Thought Experiment both apply here, but the Symbol Grounding Problem is a bit different: in my estimation, computers could learn deeper, grounded meanings given the right setting, learning progression, experience, and data. As of now, easy-to-find text isn’t quite getting current language models all the way. Vision helps, but is still only part of what it will take to solve the Symbol Grounding Problem.

For a longer discussion on these and related topics (such as How does emotion play a role in language learning?), see another article I recently wrote: Kennington (2023).⁹

Featured photograph by Jason Leung on Unsplash.

¹ Dahlgren, Kathleen. 1976. “Referential Semantics.” University of California, Los Angeles.
² Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D: Nonlinear Phenomena 42 (1-3): 335–46.
³ Borghi, Anna M., Laura Barca, Ferdinand Binkofski, Cristiano Castelfranchi, Giovanni Pezzulo, and Luca Tummolini. 2019. “Words as Social Tools: Language, Sociality and Inner Grounding in Abstract Concepts.” Physics of Life Reviews 29 (July): 120–53.
⁴ Ponari, Marta, Courtenay Frazier Norbury, and Gabriella Vigliocco. 2018. “Acquisition of Abstract Concepts Is Influenced by Emotional Valence.” Developmental Science 21 (2). https://doi.org/10.1111/desc.12549.
⁵ Fields, Clayton, and Casey Kennington. 2023. “Vision Language Transformers: A Survey.” arXiv [cs.CV]. http://arxiv.org/abs/2307.03254.
⁶ Driess, Danny, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, et al. 2023. “PaLM-E: An Embodied Multimodal Language Model.” arXiv [cs.LG]. http://arxiv.org/abs/2303.03378.
⁷ Schlangen, David. 2023. “What A Situated Language-Using Agent Must Be Able to Do: A Top-Down Analysis.” arXiv [cs.CL]. http://arxiv.org/abs/2302.08590.
⁸ Vincent-Lamarre, Philippe, Alexandre Blondin Massé, Marcos Lopes, Mélanie Lord, Odile Marcotte, and Stevan Harnad. 2016. “The Latent Structure of Dictionaries.” Topics in Cognitive Science 8 (3): 625–59.
⁹ Kennington, Casey. 2023. “On the Computational Modeling of Meaning: Embodied Cognition Intertwined with Emotion.” arXiv [cs.CL]. http://arxiv.org/abs/2307.04518.