Discussion about this post

Neha Balamurugan

I really enjoyed this piece! It made me think about an extension of the thought experiment:

What if the man in the room isn’t fed the original Chinese, but a translated version of it -- translated by another man in another room? That second man could represent a vision encoder, converting the world into embeddings before the language model ever sees it. The first man (the LLM) then reasons about those embeddings without ever accessing the original world.

To me, this feels close to how visual language models operate -- layers of translation mediating between perception and reasoning. It raises a question about “understanding in degrees”: if each layer compresses and abstracts meaning, how much of what we call social inference can survive that mediation?

Do you think these layered translations limit the possibility of genuine understanding in multimodal systems, or could emergent reasoning eventually bridge that gap?
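To make the "two rooms" picture concrete, here is a minimal LLaVA-style sketch of the pipeline the comment describes: a vision encoder translates pixels into embeddings, a projection bridges them into the language model's space, and the LLM only ever reasons over those embeddings. The model names and the untrained projection layer are illustrative assumptions, not a system discussed in the post; real multimodal models learn that bridge.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

# Illustrative model choices (assumptions, not from the post).
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
llm = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "Second man in the second room": compress the image into patch embeddings.
def encode_image(image):
    pixels = processor(images=image, return_tensors="pt").pixel_values
    return vision(pixel_values=pixels).last_hidden_state  # (1, num_patches, 768)

# Bridge between rooms: project visual features into the LLM's embedding space.
# In a trained VLM this projection is learned; here it is random, for shape only.
project = torch.nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

def answer(image, question):
    visual_tokens = project(encode_image(image))        # the "translated Chinese"
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_tokens = llm.get_input_embeddings()(text_ids)
    # "First man in the first room": the LLM sees only embeddings,
    # never the original scene.
    inputs = torch.cat([visual_tokens, text_tokens], dim=1)
    return llm(inputs_embeds=inputs).logits
```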

Shridhar Jayanthi

I'm a little stuck on the challenge to 'understanding' being "he doesn't know what a hamburger tastes like." I've never eaten headcheese, but I understand the concept of "headcheese," so much so that I'm disgusted by the idea of it and do not plan to ever eat it. The poor man cannot know what a hamburger tastes like because he never had access to any stimulus that would have allowed him to know that (i.e., he never saw, smelled, or ate a hamburger). But I agree with you that if he were operating from a compressed translation manual with associations, say "hamburger" is "food," he would eventually come to understand that a hamburger is food in the same way a "hot dog" is food. Is the proposition just that a machine that leans purely on text cannot know the "non-textual" qualities of an entity it knows only through text?

