I really enjoyed this piece! It made me think about an extension of the thought experiment:
What if the man in the room isn’t fed the original Chinese, but a translated version of it -- translated by another man in another room? That second man could represent a vision encoder, converting the world into embeddings before the language model ever sees it. The first man (the LLM) then reasons about those embeddings without ever accessing the original world.
To me, this feels close to how visual language models operate -- layers of translation mediating between perception and reasoning. It raises a question about “understanding in degrees”: if each layer compresses and abstracts meaning, how much of what we call social inference can survive that mediation?
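To make that concrete, here's a tiny, purely illustrative Python sketch of the pipeline I have in mind (the function names and the pair-averaging "encoder" are made up for the sketch, not how any real VLM works): one stage compresses the world into an embedding, and a second stage reasons only over that embedding, never the world itself.

```python
from typing import List

def encode_image(pixels: List[float]) -> List[float]:
    """The second man: compresses the 'world' (here, raw pixel values) into a
    shorter embedding. This toy just averages pairs of pixels; a real vision
    encoder would learn its compression instead."""
    return [sum(pixels[i:i + 2]) / len(pixels[i:i + 2])
            for i in range(0, len(pixels), 2)]

def reason_over_embedding(embedding: List[float]) -> str:
    """The first man: only ever sees the embedding, never the original pixels.
    A real LLM would attend over these vectors; this toy just thresholds them."""
    return "a bright scene" if sum(embedding) / len(embedding) > 0.5 else "a dark scene"

world = [0.9, 0.8, 0.7, 0.9]             # the original stimulus
embedding = encode_image(world)          # first translation (lossy)
print(reason_over_embedding(embedding))  # reasoning without access to the world
```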
Do you think these layered translations limit the possibility of genuine understanding in multimodal systems, or could emergent reasoning eventually bridge that gap?
I'm a little stuck on the challenge to 'understanding' being "he doesn't know what a hamburger tastes like." I've never eaten headcheese, but I understand the concept of "headcheese," so much so that I'm disgusted by the idea of it and do not plan to ever eat it. The poor man cannot know what a hamburger tastes like because he never had access to any stimulus that would have allowed him to know that (i.e., he never saw or smelled or ate a hamburger). But I agree with you that if he were operating from a compressed translation manual with associations, say "hamburger" is "food," he would eventually understand that a hamburger is food in the same way a "hot dog" is food. Is the proposition just that a machine that leans purely on text cannot know the "non-textual" qualities of an entity it only knows through text?
I think there are two points here. First, like you said, there’s an embodied kind of understanding the guy in the room (or a text-only model) just can’t have. He’ll never know what a hamburger tastes like without ever eating one, same as we’ll never know what it’s like to be a bat.
But... I don’t think you need that kind of experience to have **some** understanding. Most of what we call “understanding” in everyday life is about patterns and relations, not direct experience. I’ve never eaten headcheese or seen a quark, but I still get what they are through descriptions. So if the man learns compressed rules and starts spotting patterns that let him use words flexibly, I’d say that’s a real (though different) kind of understanding.
I think we're on the same page here. Your text got me thinking about two separate tracks (which are tangential to what you wrote, so no need to directly answer): one comment on compression and an alternative thought experiment.
On compression: the original thought experiment was designed so that the person follows the rules mechanically, and so every output is a correct statement. But once you have compression and the person has some discretion in how he applies the rules (because they are compressed), he must also be capable of producing an incorrect statement. The way you laid out the hypo, compression (which I'm assuming is lossy, otherwise we're just swapping in a different manual) is a necessary step toward understanding. Does that also mean any real understanding must come along with the capacity to misunderstand and produce incorrect statements?
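To make that concrete, here's a toy sketch (Python, with rules I invented for the sketch): the exact manual can only parrot what it literally contains and is never wrong, while the compressed rule covers words it has never seen and, by the same token, can now be wrong.

```python
# Exact manual vs. a lossy compression of it that generalizes -- and can err.

exact_manual = {"hamburger": "food", "hot dog": "food", "granite": "not food"}

def answer_exact(word: str) -> str:
    # Mechanical rule-following: every answer is correct, but only for entries
    # the manual literally contains (anything else raises KeyError).
    return exact_manual[word]

def answer_compressed(word: str) -> str:
    # A compressed "rule of thumb" distilled from the manual: food-ish
    # substrings mean food. It now covers words the manual never listed.
    return "food" if "burger" in word or word.endswith("dog") else "not food"

print(answer_compressed("cheeseburger"))  # generalizes correctly: food
print(answer_compressed("corn dog"))      # generalizes correctly: food
print(answer_compressed("lap dog"))       # the price of compression: food (wrong)
```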
And here's an increment to your experiment. In the original version, the man knows English well and has an ordinary grasp of the English language and of human experience. So he knows what a hamburger is and what a hamburger tastes like. What he doesn't know, at least initially, is the Chinese word for hamburger. Now take your compression idea. Is there a level of compression at which the person would recognize some Chinese word as corresponding to a hamburger (even if he is misrecognizing it)?
Yes, any understanding must come with the possibility of being incorrect in exactly the same way that all models are wrong.