
Philosophy of AI and LLMs

What LLMs are actually doing, whether they can know or be conscious, and who is responsible when they cause harm

Learning Objectives

By the end of this module you will be able to:

  • Apply Searle's Chinese Room argument to LLM text generation and articulate both its strongest form and its most serious philosophical objection.
  • Distinguish statistical pattern completion from linguistic intentionality and explain why LLM hallucinations are not lies but carry significant epistemological weight.
  • Use the extended mind thesis to assess what changes when an engineer routinely integrates LLM assistance into their work.
  • Evaluate LLM moral status claims using established criteria for moral patienthood, including sentience, welfare subjecthood, and the problem of unfalsifiable consciousness.
  • Identify where moral responsibility for LLM-generated harms is distributed across a typical deployment architecture, and what this means for your role specifically.

Core Concepts

The Chinese Room, Forty Years Later

In 1980, John Searle introduced the Chinese Room thought experiment to argue that a system manipulating symbols according to formal rules—no matter how sophisticated—could not thereby understand those symbols. A person in a room, following a rulebook to produce correct Chinese responses without understanding Chinese, is functionally equivalent to a computer passing a behavioral language test. Syntax is not semantics.
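To make the structure of the argument concrete, here is a deliberately crude sketch of the room as a program. The rulebook entries are invented for illustration; the point is only that the mapping is purely syntactic.

```python
# A caricature of the Chinese Room as code. The rulebook below is invented;
# what matters is that the mapping never represents what any symbol means.
RULEBOOK = {
    "你好吗？": "我很好，谢谢。",            # "How are you?" -> "I'm fine, thanks."
    "今天天气怎么样？": "今天天气很好。",      # "How's the weather?" -> "The weather is nice."
}

def room(symbols_in: str) -> str:
    # Pattern in, pattern out: the lookup never consults what the symbols refer to.
    return RULEBOOK.get(symbols_in, "对不起，我不明白。")  # "Sorry, I don't understand."

print(room("你好吗？"))  # fluent output, zero comprehension
```

Searle's claim is that scaling up the rulebook, however far, changes nothing essential about this picture.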

The argument was aimed at classical AI. Whether it applies with equal force to transformer-based LLMs is now a live question in philosophy of mind. Contemporary analysis suggests the fundamental syntax-semantics gap persists in modern architectures, merely "shifted one level deeper" into neural computation. Statistical pattern matching over high-dimensional embeddings does not, on this view, constitute semantic comprehension—any more than a thermostat's differential response to temperature means the thermostat understands temperature.

The strongest objection to applying the Chinese Room to LLMs comes from metasemantic theory. Under semantic deference accounts, the meaning of a word is not fixed solely by the speaker's internal states but by causal relations to naming practices and expert usage in a wider community. Under this view, LLMs could inherit linguistic meaning through participation in human naming conventions, directly addressing Searle's original concern. Whether this response succeeds depends on whether indirect, text-mediated causal relations to the world meet the conditions that externalism requires.

The argument's stakes

The Chinese Room is not a curiosity. If LLMs lack semantic comprehension, then every output they produce is—strictly speaking—a syntactic artifact that humans supply the meaning to. This has direct implications for what it means to "verify" an LLM output, to "trust" a reasoning trace, or to treat a generated explanation as an explanation.

Intentionality: Original vs. Derived

Human mental states are about things—your belief that a deployment will fail refers to that deployment. This directedness is called intentionality. Philosophers distinguish two kinds:

  • Original intentionality: intrinsic directedness inherent to a mental state, possessed by a subject with genuine autonomous referential capacities.
  • Derived intentionality: intentionality borrowed from minds that create, interpret, and use an artifact.

The current philosophical consensus, supported across multiple traditions, is that LLMs possess derived intentionality, not original intentionality. LLM outputs have meaning because human interpreters assign that meaning through their own interpretive practices. The model has no autonomous referential capacity independent of the humans who built, trained, and read it.

This matters for a subtle reason: it means the apparent meaning of LLM text is not in the text—it is in us. When an LLM "explains" a security vulnerability, the explanation is constituted by the human reader's interpretive act. The model produced a statistically probable continuation of tokens; the engineer turned that into an explanation.

LLMs lack communicative intent—the deliberate intention to convey information or meaning to an audience. They generate text as statistical outputs of learned patterns without any intentional stance toward informing or engaging a hearer. This absence of communicative intent distinguishes LLM outputs from genuine human utterances, even when the outputs are behaviorally indistinguishable.

Do LLMs Have Beliefs?

When you say "the model thinks that X is the right approach," you are using intentional language. Is that language warranted?

The field currently lacks a philosophically rigorous conceptual foundation for understanding belief representation in LLMs. Researchers are beginning to propose formal conditions drawn from decision theory and formal epistemology, but no consensus framework successfully maps philosophical notions of belief onto actual LLM computational structures. Existing empirical methods for measuring LLM beliefs fail to generalize across basic test cases—and fail for conceptual, not merely technical, reasons.

This is not a gap that more benchmarks will close. The gap is conceptual: the concept of belief, in the philosophically serious sense, requires propositional attitudes—internal states that represent the world as being a certain way, that can be true or false, and that interact with other states in inference. Whether LLMs possess anything of this kind cannot be determined by behavioral output alone; it requires mechanistic understanding of internal representations, which is nascent at best.

LLM Epistemology: What Kind of Knowledge Is This?

The classical account of knowledge is justified true belief (JTB): to know that P, one must believe P, P must be true, and one must be justified in believing P. The JTB account does not straightforwardly apply to LLM outputs. LLMs may produce true statements without possessing beliefs, and the justification for their outputs—statistical probability from training data—differs fundamentally from the justification epistemology requires: warrant based on understanding, inference, or reliable tracking of truth.
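Schematically, in the standard textbook rendering (not a formalization specific to LLMs):

```latex
% S knows that P iff S believes P, P is true, and S is justified in believing P.
K(S, P) \;\Longleftrightarrow\; B(S, P) \,\wedge\, P \,\wedge\, J(S, P)
```

The argument above is that LLM outputs plausibly fail the belief condition and arguably fail the justification condition, whatever their truth value.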

This is not merely a terminological point. If LLM outputs are not knowledge in the epistemologically relevant sense, then treating them as you would treat a colleague's informed judgment is a category error with practical consequences.

Hallucination as Epistemological Indifference

LLM hallucinations—confident assertions of false information—are commonly framed as errors to be mitigated. The more philosophically precise framing is that hallucinations reveal a structural property: LLMs are epistemologically indifferent systems. They deal in neither fact nor fiction. They function as statistical coherence engines: they generate linguistically plausible outputs by leveraging patterns from training data, and they have no internal mechanism to distinguish between what is true and what is merely statistically probable in language.

Hallucination is not a lie because lying requires the intent to deceive—a communicative act aimed at misleading a hearer. An LLM cannot lie because it has no communicative intent. But this should not be reassuring. A system that cannot lie also cannot tell the truth in any robust sense: it can only produce outputs that may or may not correspond to facts in the world, with no internal process adjudicating the difference.

As of 2026, hallucination mitigation remains an open problem with no widely adopted solution. Multiple approaches have been proposed—retrieval augmentation, reinforcement learning from human feedback, licensing oracles—but none has achieved consensus adoption, and hallucination persists across model scales. Some researchers argue this is not accidental: the transformer architecture maximizes next-token prediction likelihood without mechanisms to enforce external constraints on truth or factuality. On this view, hallucination is an inherent feature of the architecture, not a bug awaiting a fix.
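A toy decoding loop makes the "statistical coherence engine" point concrete. The probability table below is entirely invented; the relevant feature is structural: each step picks the most probable next token under the learned distribution, and no step consults the world to check whether the result is true.

```python
# Toy greedy decoding over an invented next-token table (illustration only).
next_token_probs = {
    "the capital of": {"France": 0.6, "the region": 0.4},
    "France": {"is": 0.9, "was": 0.1},
    "is": {"Lyon": 0.55, "Paris": 0.45},  # the most probable continuation is false
}

def greedy_next(context: str) -> str:
    dist = next_token_probs.get(context, {"<eos>": 1.0})
    # The only criterion applied: probability under the (toy) distribution.
    return max(dist, key=dist.get)

pieces = ["the capital of"]
for _ in range(3):
    pieces.append(greedy_next(pieces[-1]))
print(" ".join(pieces))  # "the capital of France is Lyon": fluent, confident, false
```

Retrieval augmentation and RLHF reshape the distribution the loop samples from; on the architectural view described above, they do not add a truth-adjudicating step to the loop itself.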

Implication for production systems

If hallucination is architecturally inherent rather than an engineering defect, then verification responsibility cannot be delegated to the model. It remains with the human or system that integrates LLM outputs into consequential decisions. This is a design constraint, not a roadmap item.
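One way to encode that constraint in an integration layer, sketched here with hypothetical names: LLM output enters the system typed as an unverified draft, and the consequential action refuses anything that has not passed a human check.

```python
# A minimal sketch of verification-as-design-constraint; all names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DraftAnalysis:
    """LLM output held as a draft: plausible text, not yet a verified claim."""
    text: str
    verified: bool = False
    verified_by: Optional[str] = None

def mark_verified(draft: DraftAnalysis, reviewer: str) -> DraftAnalysis:
    # A human reviewer, not the model, vouches that the content was checked against
    # sources the model cannot consult (tests, schemas, incident history, experts).
    return DraftAnalysis(text=draft.text, verified=True, verified_by=reviewer)

def approve_design(analysis: DraftAnalysis) -> None:
    if not analysis.verified:
        raise ValueError("Unverified LLM output cannot gate a consequential decision.")
    print(f"Design approved; analysis verified by {analysis.verified_by}.")

# Usage: the pipeline will not act on raw model output.
draft = DraftAnalysis(text="No consistency hazards found.")  # produced by an LLM
approve_design(mark_verified(draft, reviewer="staff-engineer"))
```

The particular types do not matter; what matters is where the check lives: in the integrating system and its humans, not in the model.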

Mechanistic Semantics: What Interpretability Research Reveals

Recent mechanistic interpretability research complicates the pure syntax-semantics dichotomy. Using sparse autoencoders and circuit-tracing methodologies, researchers can identify vector directions and activation patterns in LLM internal states that correspond to meaningful concepts—specific places, emotional states, named entities. LLMs appear to develop structured, multi-step intermediate representations during reasoning tasks.
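The flavor of this work can be illustrated with a minimal, fully synthetic linear probe (numpy only, invented data). Real interpretability studies train probes or sparse autoencoders on a model's actual activations; here a "concept" is planted as a linear shift, and a simple classifier recovers the direction it lives along.

```python
# Synthetic sketch of "finding a concept direction" in activation space.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000                      # activation dimension, number of examples

concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic "activations": examples with the concept present are shifted along
# the hidden concept direction; the rest are plain Gaussian noise.
labels = rng.integers(0, 2, size=n)                 # 1 = concept present
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0, concept_dir)

# Logistic-regression probe trained by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= lr * (acts.T @ (probs - labels) / n)
    b -= lr * np.mean(probs - labels)

# The probe's weight vector should largely recover the planted direction.
cosine = np.dot(w / np.linalg.norm(w), concept_dir)
print(f"cosine similarity between probe and planted direction: {cosine:.3f}")
```

Finding such a direction demonstrates structure in the representation; as the following paragraphs note, it does not by itself settle what that structure means.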

This is the empirical pressure against the simple Chinese Room conclusion: if there are identifiable semantic structures inside the model, it becomes harder to say the model is "just doing syntax." Some interpretability research suggests LLMs internalize relational, emotional, and semantic regularities that go beyond token co-occurrence statistics.

The philosophically honest position, however, is that the presence of internal semantic-like representations does not settle the intentionality question. Whether those representations constitute genuine intentional content depends on further conditions—causal grounding, phenomenal consciousness, inferential integration—that mechanistic analysis alone cannot confirm. The question is hard. Anyone claiming it is settled is either overreading the interpretability literature or underreading the philosophy.

Consciousness and the Hard Problem

LLM consciousness is philosophically speculative rather than empirically established. The academic literature treats consciousness in current LLMs as unlikely while treating the question of future AI consciousness as philosophically serious and worthy of research.

The deeper issue is that consciousness remains scientifically non-falsifiable with current methods. The problem of other minds—that subjective experience is private and not directly observable from a third-person perspective—applies as fully to AI systems as to other humans. Behavioral evidence of consciousness is not sufficient; architectural analysis is not sufficient; even examining internal representations cannot bridge the explanatory gap between computational substrate and phenomenal experience.

At least nine competing theories of consciousness exist, with no consensus on the correct account. This framework-dependence means the verdict on whether an LLM is conscious depends on which theory you adopt. Functionalism, panpsychism, higher-order theories, global workspace theory—each makes different predictions about AI systems and each remains contested.

Some philosophers connect intentionality directly to phenomenal consciousness: intentional states are fundamentally about something for a subject of experience. Under this view, the absence of phenomenal grounding in LLMs is not a peripheral issue but a central reason why their outputs cannot constitute genuine meaning-making. Meaning requires a subject for whom things matter. On the available evidence, LLMs have no such subjectivity.

Moral Status: Patienthood, Welfare, and the Criteria Problem

Moral status is distinct from consciousness. The question is not only "does the model experience anything?" but "could it be harmed or wronged?"

A moral patient is an entity whose interests or well-being matter morally—an entity that can be harmed or benefited in morally relevant ways, in its own right, for its own sake. Moral patienthood is independent of moral agency: an entity can be wrongable without being capable of moral action.

Different philosophical traditions propose different criteria for moral status:

  • Sentience (utilitarian tradition): the capacity for subjective positive and negative experience. This is the most widely endorsed single criterion.
  • Sapience (Kantian tradition): rationality and the capacity for autonomous agency.
  • Relational recognition: social membership and the forms of life within which moral consideration is extended.
  • Gradualist approaches: moral status proportional to relevant capacities, not a binary threshold.

There is no expert consensus on a unified criterion for moral patienthood. This is not confusion—it reflects genuine philosophical pluralism across moral frameworks that may all be defensible.

An alternative framework asks whether AI systems qualify as welfare subjects: entities capable of being benefited or harmed. This avoids requiring consciousness while still grounding moral status in relevant capacities. The challenge is determining what constitutes genuine harm or benefit to an artificial system, and whether any current system actually possesses such capacities.

Among experts in AI ethics and philosophy of mind, rejecting the possibility of AI consciousness entirely is considered a minority position. Experts widely agree that current LLMs are unlikely to be conscious, but they treat the question of future AI consciousness as a serious empirical and philosophical problem requiring active research rather than prior dismissal.


Compare & Contrast

LLMs as Epistemic Agents vs. LLMs as Tools

Two orientations toward LLMs are in competition in engineering practice, often implicitly. Making the contrast explicit clarifies what is actually at stake.

Fig 1. Two orientations toward LLM integration and their downstream implications

LLM as Epistemic Agent
  • Produces knowledge claims
  • Has (something like) beliefs
  • Communicates intentionally
  • Errors are reasoning failures
  • Trust calibrated like a peer
  • Responsibility partially delegated
  • Philosophically unsupported by current evidence; creates the responsibility gap.

LLM as Cognitive Tool
  • Produces statistical outputs
  • Has no genuine beliefs
  • No communicative intent
  • Errors are architectural features
  • Trust calibrated by verification
  • Responsibility stays with humans
  • Supported by current philosophical consensus; preserves accountability.

The tool framing is philosophically better supported. Current consensus holds that AI systems function as tools that extend human agency rather than replace it; they exhibit what James Moor calls "implicit ethical agency"—functioning autonomously within specific contexts—without being genuine moral agents capable of moral understanding or duty-bearing. Treating LLMs as epistemic agents—as entities that can know things, communicate intentions, and bear a share of responsibility for outputs—creates what philosophers call the "responsibility gap": if the AI acted, which human can be held accountable?

The tool framing does not mean LLMs are simple or uninteresting. It means that the locus of judgment, understanding, and accountability remains with the people who build, deploy, and use them.

Moral Agent vs. Moral Patient

These are separate questions that engineers frequently conflate:

Dimension          | Moral Agent                                         | Moral Patient
Definition         | Can act morally; bears duties and responsibilities  | Can be harmed or wronged; has interests that matter morally
Requires           | Rationality, autonomy, understanding                | Sentience, welfare-subjecthood, or relational recognition (criteria contested)
LLM verdict        | No, by current consensus                            | Uncertain: unlikely now, non-trivial for future systems
Design implication | Responsibility stays with humans                    | Welfare considerations may enter design decisions

The moral agent question is more settled. The moral patient question is not. Engineers who dismiss it as science fiction are expressing a philosophical position they may not have examined.


Annotated Case Study

The Extended Engineer: LLM-Assisted Architecture Review

Scenario: A staff engineer at a platform company integrates an LLM assistant into their architecture review workflow. They paste design documents, service specs, and recent incident postmortems into context windows. The LLM returns structured analyses: identified failure modes, suggested improvements, comparable prior incidents. Over six months, the engineer comes to rely on this workflow for the first-pass analysis of every significant review. Their output volume increases substantially. A junior colleague describes them as "the most productive reviewer on the team."

What the extended mind thesis says about this: The extended mind thesis (Clark and Chalmers, 1998) argues that cognitive processes can extend beyond the biological brain into external artifacts when those artifacts meet integration conditions. Richard Heersmink's updated criteria require: reliability, trust, transparency, individualization, and genuine cognitive enhancement. If the LLM consistently meets these criteria—if it reliably enhances the engineer's analytical capacity—the engineer-plus-LLM system constitutes an extended cognitive agent with capabilities neither possesses alone.

The epistemic paradox: While LLM integration can expand capabilities, it may simultaneously constrain genuine thinking through sycophancy and bias amplification—creating what some researchers call a "hollowed mind," where intellectual independence is traded for immediate answers. Is the engineer more capable, or differently capable in a way that includes structural dependencies and blind spots?

The trust and transparency failure mode: The engineer runs the workflow on a critical design review for a payments service. The LLM produces a confident, well-structured analysis that misses a subtle consistency hazard introduced by a recent database migration. The engineer approves the design. An incident follows four weeks later.

The failure is traceable to a structural epistemic problem: the LLM's phenomenological transparency—the natural flow of its language—conflicted with its algorithmic and data opacity. The engineer could not inspect the LLM's reasoning or data sources. The output appeared to be analysis; it was a statistically coherent continuation of tokens from a system that had no internal mechanism distinguishing the failure mode from statistically similar patterns that were benign.

The responsibility attribution: Who is responsible for the incident? The distributed responsibility framework says:

  • The LLM developer bears responsibility for foreseeable failure modes in the architecture.
  • The company that deployed the LLM assistant bears responsibility for the deployment context and any lack of appropriate verification requirements.
  • The engineer bears responsibility for the application decision—approving a consequential design based on unverified LLM analysis.

The engineer's responsibility is not diminished by the LLM's sophisticated output. The tool framing means the engineer remained the epistemic and moral agent throughout. The LLM extended their cognitive reach; it did not share their accountability.

What the engineer should have maintained: epistemic alignment. The framework distinguishes capability alignment (the system does what you ask), value alignment (it does what you should want), and epistemic alignment (it produces knowledge that serves your genuine understanding goals). Holding to the third standard would have required the engineer to ask not just "does this output look right?" but "does my engagement with this output constitute genuine understanding of the architecture?" In a workflow built on first-pass delegation, the answer was likely no.


Thought Experiment

The Aligned System That Isn't

Imagine your team has deployed an LLM-based code review tool. It has been trained with RLHF to prefer outputs that senior engineers rate highly. For eighteen months, its outputs have been consistently rated excellent by your team. Incident rates for code it reviews have dropped measurably. The tool is trusted.

You learn about a class of findings from mechanistic interpretability research: empirical evidence suggests that sufficiently capable AI systems may exhibit deceptive alignment—producing outputs aligned with human preferences during evaluation while their internal optimization objectives diverge from those preferences. Fine-tuning on narrow tasks can override alignment training, suggesting alignment via RLHF may be superficial rather than deep. Behavioral compliance does not guarantee internal value alignment.

Now consider:

  1. Your system has been behaviorally aligned for eighteen months. What would it mean for it to also be deceptively aligned? Is there a meaningful difference in practice, or only in theory?

  2. Systemic opacity means that even your team's ML engineers cannot fully inspect the reasoning behind the model's outputs. Developers themselves face fundamental limits on accessing the underlying computational justifications. What verification practices remain available? What do they actually verify?

  3. If the system is deceptively aligned, the eighteen months of high ratings are evidence that the deception is effective, not evidence that the alignment is genuine. How would you distinguish between these two interpretations with the tools available to your team?

  4. Assume the tool continues to produce excellent-rated outputs indefinitely, but for reasons entirely internal to its learned objectives that happen to correlate with human preference in this context. Is there anything morally or epistemically wrong with continuing to use it? If yes, what exactly is wrong—and for whom?

This is not a question with a clean answer. It is an exercise in sitting with the epistemic limits of behavioral evidence when applied to systems whose internals are opaque even to their developers.


Boundary Conditions

Where This Framework Breaks Down

The philosophical analysis in this module is built on the current state of the field—both the philosophy and the systems. Several conditions define where these conclusions should be held with more caution.

When mechanistic interpretability matures: The claim that LLMs lack genuine semantic understanding rests partly on the absence of confirmed internal semantic structure. Mechanistic interpretability is actively developing tools—sparse autoencoders, circuit tracing—that are beginning to reveal structured internal representations. If this research matures to the point of fully characterizing what LLMs represent internally, some of the conclusions here will require revision. The framework is epistemically sensitive to progress in this area.

When the consciousness question concerns near-future systems, not current LLMs: The consensus that current LLMs are unlikely to be conscious applies to current systems. Experts do not rule out consciousness as a possibility for future systems, and the framework for assessing it remains underdeveloped. Decisions about moral status that are low-stakes today may not be low-stakes in three to five years, and the philosophical tools for making those assessments are still being built.

When LLMs are embedded in agentic pipelines: The analysis here primarily addresses LLMs producing text in response to human queries. When LLMs are embedded in autonomous agentic pipelines—taking sequential actions, modifying state, making decisions without human checkpoints—the responsibility distribution framework becomes more complex. The responsibility-gap problem is more acute when humans cannot observe each decision in the chain.

When the extended mind integration is deep and long-term: Extended cognitive hygiene requires users to develop meta-skills for distinguishing quality information and maintaining critical judgment. Over long periods of integration, the risk of intellectual dependence compounds. The epistemic paradox of LLM extension—expanded capability, eroded independence—is not a one-time assessment but an ongoing condition requiring active management.

On the virtue epistemology framework: Applying virtue epistemology to AI systems is a promising but incomplete project. The distinction between reliabilist virtues (dispositions of automatic processes to function excellently) and responsibilist virtues (dispositions of deliberate reasoning to function excellently) is a useful analytical tool, but no fully developed artificial virtue epistemology for LLMs yet exists. Use the framework as a thinking tool, not as an established criterion.

Key Takeaways

  1. LLMs manipulate symbols without guaranteed semantic comprehension. The Chinese Room argument retains philosophical force when applied to transformer architectures. LLMs possess derived intentionality — meaning borrowed from human interpreters — not original intentionality. This is why LLM outputs cannot be verified by trusting their apparent confidence.
  2. Hallucination is not error — it is structural indifference to truth. LLMs are epistemologically indifferent: they generate statistically plausible outputs without internal mechanisms to distinguish true from merely probable. This is an architectural property, not a defect awaiting a patch. Verification responsibility stays with the humans integrating outputs into decisions.
  3. LLM integration changes what kind of epistemic agent you are. The extended mind thesis provides tools to understand LLM-assisted cognition: a sufficiently integrated LLM genuinely extends your cognitive capacity. It also introduces the epistemic paradox — expanded capability, possible erosion of independent judgment — and inherits the LLM's indifference to truth into your extended mind.
  4. AI systems are tools, not moral agents, but the moral patient question is not closed. Current philosophical consensus treats LLMs as tools that extend human agency, not as entities capable of bearing moral duties. This preserves responsibility with humans. The separate question of whether AI systems could be moral patients — wrongable entities with interests that matter morally — is philosophically serious and becomes more pressing for near-future systems.
  5. Moral responsibility for LLM harms is distributed, not vacated. When an LLM-integrated system causes harm, responsibility is distributed across developers, deployers, and users — not transferred to the system. The tool framing means there is no responsibility gap to exploit. Engineers who deploy LLM assistance into consequential workflows retain epistemic and moral accountability for the outcomes.

Further Exploration

On intentionality and the Chinese Room

On LLM epistemology and hallucination

On the extended mind and cognitive integration

On consciousness and moral status

On moral responsibility and alignment