Portability, Limits, and Engineering Judgment
How to make durable decisions in an unstable discipline
Learning Objectives
By the end of this module you will be able to:
- Assess the portability of a given prompt strategy across model families and API versions.
- Apply the concept of technique half-life to prioritize where to invest prompt engineering effort.
- Recognize the conditions under which iterability and context accumulation cause long conversations to degrade.
- Articulate an engineering judgment framework that integrates empirical evidence, pragmatic theory, and production constraints.
Core Concepts
Technique Portability
Not all prompt engineering techniques transfer equally across models. Chain-of-thought prompting, for instance, is often presented as universally beneficial. The empirical picture is more complicated. Research across open-weight models ranging from 7B to 120B parameters shows that CoT brittleness—reasoning degradation under perturbation—is not a small-model artifact. It is present at all scales, suggesting it is a structural property of how transformers process reasoning steps, not something scale alone resolves. Intermediate reasoning steps can pollute the context window, and errors early in a chain amplify through subsequent steps regardless of model size.
This matters directly for portability. A chain-of-thought prompt that performs well on one model may degrade on another not because of capability differences alone, but because the fragility mechanism is the same across families.
A technique working reliably on one model is therefore only weak evidence that it will work on another: cross-model portability cannot be assumed, only measured.
Technique half-life is a useful mental model: the period over which a specific prompting technique remains reliably effective before model updates, fine-tuning, or distribution shift erodes it. Some techniques get baked into fine-tuning and become unnecessary. Others get broken by safety tuning or instruction-following changes. Your prompt engineering investment should be weighted toward techniques that are either empirically demonstrated to be durable, or that you are actively measuring in production.
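As a minimal sketch of what "actively measuring" could look like, the function below flags when a technique's rolling accuracy in production evaluations drifts below its validated baseline. The function name, window size, and tolerance are illustrative assumptions, not recommendations:

```python
from collections import deque

def detect_half_life_decay(scores, baseline, window=20, tolerance=0.05):
    """Return the index of the first evaluation at which a technique's
    rolling mean score falls more than `tolerance` below its validated
    baseline, or None if no sustained decay is observed.

    scores   : per-evaluation accuracy values, oldest first
    baseline : accuracy measured when the technique was validated
    """
    recent = deque(maxlen=window)  # sliding window over recent evaluations
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) == window and sum(recent) / window < baseline - tolerance:
            return i
    return None
```

In production this would run over logged evaluation scores; a non-None return is the signal to re-validate or retire the technique rather than keep trusting stale evidence.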
Iterability and Context Accumulation
Derrida's concept of iterability—that utterances are repeatable and decontextualizable by their very nature—has a direct engineering counterpart in multi-turn LLM conversations. For a prompt or instruction to be meaningful at all, it must be recognizable as a type (a command, a constraint, a persona directive). But that same property—being iterable, repeatable—means it can be cited, transformed, or used parasitically as context accumulates.
In long conversations, earlier instructions do not simply persist: they become part of a growing context that can be reinterpreted, contradicted, or diluted by later turns. Derrida's argument is that infelicity—the failure of a speech act to achieve its intended effect—is not exceptional but endemic to language itself. Applied to prompting: the failure of a system prompt to hold its intended meaning across a long conversation is not a bug you can fully eliminate. It is a structural feature of how iterability works.
This produces a concrete engineering implication: long conversation threads are reliability liabilities. System prompt instructions that work perfectly in turn 1 may be effectively overridden or diluted by turn 30, not through explicit contradiction but through accumulated context shifting what "relevant" means.
Context drift does not require the user to explicitly override your system prompt. The accumulation of turns changes the pragmatic weight of earlier instructions. This is a structural property, not a failure of implementation.
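One mitigation is periodic re-anchoring: re-injecting the system instruction so its pragmatic weight does not decay as turns accumulate. A minimal sketch, assuming the common chat-completion message shape; the reminder wording and interval are placeholders to tune against measured drift:

```python
def build_messages(system_prompt, turns, reanchor_every=10):
    """Assemble a chat message list, re-injecting the system prompt every
    `reanchor_every` user turns so earlier instructions are not diluted
    purely by accumulated context.

    turns : list of (user_text, assistant_text) pairs; assistant_text may
            be None for the turn still awaiting a reply.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for i, (user_text, assistant_text) in enumerate(turns):
        if i > 0 and i % reanchor_every == 0:
            # Re-anchor: restate standing instructions before this turn.
            messages.append({"role": "system",
                             "content": "Reminder of standing instructions:\n"
                                        + system_prompt})
        messages.append({"role": "user", "content": user_text})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages
```

Re-anchoring does not eliminate drift, for the structural reasons above; it only keeps the instruction's recency competitive with the rest of the context.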
Gricean Maxims: Useful but Not Laws
Grice's Cooperative Principle and its four maxims (Quantity, Quality, Relation, Manner) are a common mental model for why prompts should be clear and concise. But this framing has known weaknesses. Scholars have criticized the maxims as vague and sometimes conflated with politeness norms, which means they function more as loose heuristics than as precise engineering constraints.
More pointedly, Sperber and Wilson proposed replacing all four maxims with a single principle of relevance: the speaker attempts to be as relevant as possible given the circumstances. This is a more parsimonious and cognitively grounded account. For prompting, the implication is practical: rather than checking your prompt against four separate maxims, ask whether each element contributes to making the model's task more tractable in context. Relevance is the single load-bearing criterion.
Pragmatic Context as a Systematic Weakness
Current language models struggle fundamentally with context-aware pragmatic behavior. This is not an incidental limitation to be solved with more tokens or a better system prompt—it is a structural feature of how these models process language. Tasks requiring reasoning about speaker intentions, implicit information, and context-dependent meanings represent a persistent weak spot across diverse datasets and phenomena.
This has a direct consequence for prompt design: do not assume the model will infer what you mean from context in the way a human conversational partner would. Empirical surveys of LLM pragmatic capabilities confirm this is a structural, not incidental, limitation. The practical adjustment is to make implicit assumptions explicit in the prompt, even when they feel obvious.
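A small illustration of the adjustment: rather than trusting the model to infer audience, scope, or constraints, render them as an explicit block. The function and section headings are hypothetical, not a standard:

```python
def render_prompt(task, assumptions, constraints):
    """Render a task prompt that states its assumptions explicitly
    instead of relying on the model's pragmatic inference."""
    lines = [task, "", "Assumptions (treat as given; do not infer):"]
    lines += [f"- {a}" for a in assumptions]   # e.g. audience, scope, defaults
    lines += ["", "Constraints:"]
    lines += [f"- {c}" for c in constraints]   # e.g. length, format, tone
    return "\n".join(lines)
```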
Key Principles
1. Measure portability; do not assume it. A technique that works on one model is a hypothesis about another. CoT fragility being consistent across model scales is evidence that cross-family portability cannot be assumed without measurement. Build portability testing into your evaluation pipeline.
2. Treat technique half-life as a first-class concern. Invest heavily in techniques you can continuously measure. Avoid deep coupling to prompting behaviors that are undocumented or that you suspect are emergent from current training runs; these carry the shortest expected half-lives and therefore the greatest risk.
3. Make implicit assumptions explicit. LLMs have a structural weakness in pragmatic context-handling. Developers in software practice routinely skip explicit clarification of ambiguous requirements, resolving ambiguity through judgment and project context. LLMs do not have access to your project context. What feels obvious to you must be stated.
4. Long contexts are reliability liabilities. Iterability means context accumulates and derails earlier instructions. Design for this: use explicit re-anchoring, keep system prompts as close to the input as possible, or break long conversations into bounded sessions with re-initialized context.
5. Reject first-principles reasoning about prompts. Pragmatism explicitly argues against abstract, pre-determined frameworks as the primary mode of inquiry. The same discipline applies here. A prompt that "should" work based on your mental model of how the model reasons is a prior. Test it. Accumulated empirical evidence beats confident deduction.
6. Engineering judgment is a cultivated capacity, not a formula. Engineering codes cannot cover every circumstance; good engineering judgment is the capacity to perceive what a situation requires. In prompt engineering, this means knowing when to trust a technique, when to replace it, and when the system has drifted outside the distribution where your evidence applies. This capacity is built through practice, measurement, and honest failure analysis—not through reading more about prompting.
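Principle 1 can be sketched as a tiny harness that runs identical test cases against several model backends and reports per-model pass rates. The callable interface and the 0.9 threshold are illustrative assumptions:

```python
def assess_portability(test_cases, models, min_pass_rate=0.9):
    """Run identical test cases against several model backends and report
    per-model pass rates. The technique counts as portable only if every
    backend clears the threshold.

    test_cases : list of (prompt, check) pairs, where check(output) -> bool
    models     : dict of model name -> callable(prompt) -> output string
    """
    report = {}
    for name, call in models.items():
        passed = sum(1 for prompt, check in test_cases if check(call(prompt)))
        report[name] = passed / len(test_cases)
    portable = all(rate >= min_pass_rate for rate in report.values())
    return portable, report
```

In a real pipeline the callables would wrap actual API clients; keeping them as plain functions means the harness itself stays testable without network access.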
Thought Experiment
You are maintaining a production system that uses a system prompt written 18 months ago, before the current model version was deployed. The prompt was validated against the previous model with 50 test cases. The current model has been fine-tuned since deployment and your error rate has crept upward, but you cannot pinpoint a single failure type.
Consider the following:
- What evidence would convince you the degradation is portability-related versus data-distribution-related versus context-accumulation-related?
- If you could only run 10 additional test cases, what would they target and why?
- Your team proposes rewriting the system prompt from scratch using current best practices. What risks does this introduce, and how would you mitigate them?
- The new model version is being released in 6 weeks. What is the minimum viable portability assessment you could run in that window?
There is no single correct answer. The purpose is to practice integrating technique half-life, portability assessment, and iterability concerns into an actual operational decision under time and resource constraints.
Annotated Case Study
The Distributed Monolith as Prompting Analogy
The distributed monolith anti-pattern in microservices is instructive by analogy. When microservices become tightly coupled through synchronous dependencies, a single service failure triggers cascading failures that bring down seemingly unrelated functionality. The emergent behavior arises not from flaws in individual service design but from cumulative dependency structures at the system level. Effects are difficult to predict without detailed coupling analysis.
Now map this to a complex prompting system:
- Each prompt component—system prompt, few-shot examples, CoT scaffold, persona directive, output schema—is analogous to a service.
- Tight coupling between components means a change to one (say, an updated output schema) can cascade through the others in ways that are difficult to predict without comprehensive testing.
- The emergent behavior of the whole prompt is not predictable from analyzing components in isolation.
Annotation: This is not just a metaphor. Emergent system behavior in tightly coupled architectures is both discoverable in retrospect and practically unpredictable in advance. The same applies to multi-component prompts: post-hoc analysis can usually explain why a prompt broke, but the breaking point was not predictable from reading the components. The engineering response is the same in both domains—reduce coupling, test at the system level, and maintain observability.
What the case illustrates: Prompt engineering debt accumulates the same way architectural debt does. A prompt that was once modular and testable can become a distributed monolith through incremental changes, each of which seemed safe in isolation. The remedy is the same: refactor toward loose coupling, and re-test at the integration level after each significant change.
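The refactoring direction can be made concrete. In the sketch below, each prompt component is an independent, individually testable function, and coupling is confined to one visible configuration object; the component names and config shape are illustrative:

```python
# Each component is an independent, individually testable unit: the
# prompting analogue of a loosely coupled service.
COMPONENTS = {
    "persona": lambda cfg: f"You are a {cfg['role']}.",
    "schema":  lambda cfg: "Respond as JSON with keys: " + ", ".join(cfg["keys"]) + ".",
    "fewshot": lambda cfg: "\n\n".join(cfg["examples"]),
}

def compose_prompt(order, cfg):
    """Assemble the full prompt from named components in an explicit order.
    Changing one component never silently rewrites another; coupling is
    confined to the shared `cfg` dict, which is visible and testable."""
    return "\n\n".join(COMPONENTS[name](cfg) for name in order)
```

The integration-level test then exercises `compose_prompt` over the full ordering, not just each component in isolation, matching the case study's advice to test at the system level.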
Common Misconceptions
"If the technique worked on GPT-4, it will work on Claude." This conflates model capability with technique portability. CoT fragility is consistent across model families and scales, which means a technique's empirical success on one model family is weak evidence of success on another. Portability must be measured.
"My system prompt establishes a durable contract with the model." System prompts are not contracts. They are instructions subject to the same iterability dynamics as any other text in the context window. As conversation length grows, earlier instructions lose pragmatic weight relative to recent turns. The "contract" interpretation overestimates your control.
"Following Gricean maxims will make prompts effective." The maxims are vague—they can be misinterpreted as etiquette guidelines rather than descriptive principles of cooperation. More critically, they give no traction on the actual failure modes in production prompts. Relevance theory's single criterion (does this reduce inference burden in context?) is more actionable than checking four loosely defined maxims.
"The model will infer what I meant from context." Context-aware pragmatic reasoning is a structural weakness of current LLMs, not a gap that clever prompting can fully close. Implicit assumptions do not reliably surface as intended behavior. Explicit beats implicit.
"I can reason from first principles about why a prompt should work." Pragmatism argues that abstract, pre-determined frameworks are insufficient for inquiry. Your mental model of how the model reasons is a prior, not a fact. A prompt that "should" work and a prompt that does work are empirically different things. Measure.
"Speech acts are universal—directness norms do not affect LLM behavior." Speech act realization exhibits systematic cross-cultural variation, and LLMs trained on multilingual corpora inherit this variation. A prompt phrased as a blunt imperative may perform differently from one that uses conventionalized indirectness—and this effect may vary across model families trained on different corpus mixes.
Stretch Challenge
Design a portability assessment protocol for a production prompt you currently own or can specify. The protocol must:
- Define what "portability failure" means for this specific prompt (degraded accuracy, changed output structure, higher refusal rate, or something else).
- Identify the minimum number of test cases required to detect a portability failure with reasonable confidence, and justify that number.
- Include at least one test targeting context accumulation: a case where the prompt is evaluated after a long preceding conversation rather than at turn 1.
- Specify how you would update the protocol when the underlying model is version-bumped.
- Identify one component of your prompt that is most likely to have a short half-life and explain why.
The goal is not a perfect protocol but a defensible one—one you could explain to a colleague who will use it after you have moved on.
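As one possible starting point, the protocol's required decisions can be captured in a small data structure that a successor can read and check for completeness. The fields mirror the bullets above; the names and types are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class PortabilityProtocol:
    """A defensible, explainable protocol definition; a real owner would
    justify each value rather than treat these fields as a checklist."""
    failure_definition: str           # what counts as a portability failure
    n_test_cases: int                 # justified minimum, rationale recorded
    includes_long_context_case: bool  # at least one post-long-conversation eval
    rebaseline_on_version_bump: bool  # rerun and re-record on model updates
    short_half_life_component: str    # the component most likely to decay

    def is_complete(self):
        """Check that every required decision has actually been made."""
        return (bool(self.failure_definition)
                and self.n_test_cases > 0
                and self.includes_long_context_case
                and bool(self.short_half_life_component))
```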
Key Takeaways
- Portability is a measurement problem, not a property you can infer. CoT fragility is consistent across model families and scales, which means success on one model is weak evidence for another. Build portability testing into your workflow.
- Iterability means context derailment is structural. Long conversations accumulate context that reweights earlier instructions. Infelicity—the failure of an instruction to hold its meaning—is endemic to language, not an edge case to be patched.
- LLMs have a documented structural weakness in pragmatic context-handling. Make implicit assumptions explicit. The model does not have your project context, and context-dependent inference is where current systems reliably fail.
- Technique half-life is a first-class engineering concern. Invest in techniques you can continuously measure. Avoid deep coupling to prompting behaviors that are undocumented or likely to change with model updates.
- Engineering judgment is the terminal skill. No prompt engineering curriculum can cover every circumstance. The capacity to perceive what a situation requires—integrating empirical evidence, pragmatic theory, and production constraints—is built through practice and honest failure analysis, not through reading alone.
Further Exploration
On CoT fragility and cross-model robustness
- Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations — The primary evidence base for cross-model CoT brittleness.
- Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting — Adversarial framing of the same fragility in code generation contexts.
On pragmatic theory and LLM limitations
- Pragmatics in the Era of Large Language Models: A Survey — The empirical survey establishing pragmatic context-handling as a structural weakness.
- Sperber & Wilson: Relevance Theory Revisited — The single-principle replacement for Gricean maxims; useful for prompt clarity reasoning.
On iterability and speech act failure
- Iterability and Meaning: The Searle-Derrida Debate — The philosophical grounding for why context derailment is structural, not incidental.
On engineering judgment
- Virtue in Engineering Ethics Education — The case that engineering judgment is a cultivated character capacity, not a checklist.
- Pragmatism (Stanford Encyclopedia of Philosophy) — The philosophical argument against first-principles reasoning as primary inquiry mode.
On computational pragmatics and future directions
- Towards Neuro-Symbolic Approaches for Referring — Hybrid systems that preserve interpretable pragmatic principles alongside neural representations.
- Pragmatic language interpretation as probabilistic inference — RSA models and Bayesian pragmatic reasoning; background for understanding computational pragmatic accuracy.