Formal Proposition Embedding

From natural language to logic: how sentences become formal structures, and why it's harder than it sounds

Lead Summary

Formal proposition embedding is the project of taking a natural-language sentence — an utterance made in some context, by some speaker, about some state of affairs — and mapping it into a machine-interpretable formal structure: a logical formula, a graph, a description-logic expression, or some other symbolic object that downstream systems can reason over, verify, or query.

The project spans philosophy (what do sentences mean?), linguistics (how is meaning built from parts?), computer science (what representation is tractable to reason over?), and NLP engineering (how do you actually build a parser that does this at scale?). Each layer adds its own set of difficulties, and the story of formal proposition embedding is largely the story of researchers discovering that each apparent solution opens onto the next layer of complexity.

This article traces the intellectual arc from Frege's foundational principle of compositionality through Montague's formal grammar, through Discourse Representation Theory's dynamic turn, through the emergence of AMR and semantic parsing, through description logics and ontologies, and through today's neuro-symbolic pipelines that try to combine neural flexibility with symbolic soundness.

Etymology & Terminology

The phrase "formal proposition embedding" combines several concepts. A proposition (from the philosophical tradition) is the abstract content of a claim — what a sentence expresses, independent of who said it or when. Formal means the representation uses a mathematical or logical notation with explicit syntax and semantics. Embedding comes from computational usage: placing something into a representational space — here, placing a natural-language claim into a formal system.

Other terms are used in overlapping ways: semantic parsing refers to the computational task of producing such representations; autoformalization refers specifically to converting informal mathematical or natural-language statements into formally verifiable code; logical form is the output structure in the formal semantics literature; and meaning representation is the broader NLP term.

Core Concepts

The Principle of Compositionality

The dominant theoretical framework for formal proposition embedding is the principle of compositionality: the meaning of a complex expression is a systematic function of the meanings of its component parts and the way they are syntactically combined. As stated by Montague, for whom compositionality was not optional but the only possible way to proceed formally, this principle licenses building the meaning of sentences up from the meanings of words through grammatical rules.

The principle traces back to Frege's context principle (Foundations of Arithmetic, 1884): "only within a complete sentence do words have meaning." Though Frege's formulation suggests a top-down direction (sentence to parts) that actually inverts modern compositional analysis (parts to sentence), his work is universally cited as the foundation. The inversion reflects a tension that persists today: does meaning flow bottom-up from atoms, or top-down from wholes?

In practice, compositionality functions as a methodological norm or working hypothesis in formal semantics rather than a settled empirical fact. Schools of thought differ on whether it is: (a) a basic methodological principle proposals should obey, (b) an empirical claim about how language works, (c) an analytic truth, or (d) merely a useful heuristic. The Stanford Encyclopedia of Philosophy entries on compositionality document this ongoing dispute. Notably, construction grammar and Discourse Representation Theory operate without strict compositionality.

The empirical argument for compositionality is its track record: semantic theories built on compositional principles explain inference patterns and grammaticality contrasts across a wide range of languages and constructions. Yet the same sources note that this success is partial — compositionality works well in constrained formal fragments but struggles with real-world linguistic phenomena.

Lambda Calculus and Intensional Logic

Montague's key technical move was to use lambda calculus as the composition mechanism. Lambda abstraction (λ-operator) allows higher-order functions to serve as semantic contributions of syntactic constituents. Before Montague, this was not standard in logic. The lambda operator enables phrases to contribute their meanings to sentence truth conditions through function application: a noun phrase contributes a function, a verb phrase contributes a function, and application computes the combined meaning.

Montague's framework defines the semantic denotation of a sentence as an intension: a function from possible worlds and moments of time to truth values. This allows sentences to be evaluated across different possible worlds and temporal points, enabling formal treatment of modality, tense, and counterfactual reasoning while maintaining compositionality. The Stanford Encyclopedia of Philosophy on Montague Semantics gives the full technical treatment.

Partee's type-lifting (1973, 1987) extended this framework by allowing semantic rules to change the type of an expression corresponding to a syntactic category shift — enabling flexible composition where, for instance, the meaning of 'and' functioning as a verb-phrase connective is obtained by lifting the meaning of the sentence connective 'and' to a higher-order function. This kind of mechanism shows the practical work required to maintain compositionality across the full range of natural-language phenomena.

Mechanism & Process

The Montague Rule-to-Rule Architecture

In Montague semantics, there is a systematic correspondence between syntactic rules and semantic rules. If there is a syntactic rule combining expressions to form a compound expression, there must be a corresponding semantic rule determining the meaning of that compound. This rule-to-rule correspondence establishes a homomorphism between syntactic structures and their semantic interpretations.

Combinatory Categorial Grammar (CCG), developed by Mark Steedman and others, is the dominant practical descendant of Montague semantics in computational linguistics. CCG inherits the rule-to-rule semantic architecture and extends classical categorial grammar with combinatory rules inspired by combinatory logic. It provides a lexicalized framework that couples syntactic and semantic operations, allowing the semantic construction of a sentence to proceed in lockstep with parsing. CCG is described in Jurafsky & Martin's speech and language processing textbook appendix as the canonical framework for transparent compositional semantics in NLP.

Lambda Dependency-based Compositional Semantics (λ-DCS), developed by Liang (2013), is a more compact variant of lambda calculus that makes existential quantification implicit, eliminating the need for explicit variables. It has been widely adopted as a meaning representation language in semantic parsing systems, particularly for question answering over knowledge bases.

Discourse Representation Theory: Meaning as Context Update

Classical Montague semantics assigns truth conditions to each sentence relative to a fixed model, treating sentences independently. This breaks down for natural language discourse, where pronouns from sentence N bind to referents introduced in sentence N-1, and where presuppositions require knowing prior context.

Discourse Representation Theory (DRT), introduced by Hans Kamp (1981), addresses this by treating sentences not as static truth-conditional units but as context transformers — functions that update a globally maintained discourse context. The basic formal object is the Discourse Representation Structure (DRS): a two-component structure consisting of (1) a universe of discourse referents (variables representing entities under discussion) and (2) a set of conditions specifying properties and relations among those referents, expressible as first-order logic formulas.

As Stanford Encyclopedia of Philosophy's entry on DRT explains, the meaning of a sentence under this view is its context-update potential — a function that transforms an input discourse context into an output context. This unifies anaphora resolution, presupposition projection, and quantifier scoping under a single computational model.

Irene Heim's File-Change Semantics (1982) was developed independently and contemporaneously, using file-card metaphors rather than DRS boxes, but addressing the same problem. The parallel development reflects convergent recognition of a fundamental inadequacy in classical Montague semantics for discourse phenomena.

A DRS is built in two stages: first, sentence-level meaning is composed bottom-up following standard recursive rules; second, the resulting DRS is merged with the existing discourse context, triggering anaphoric resolution and presupposition accommodation.

DRT employs this two-stage compositional process: compositionality in stage 1 is followed by pragmatic context-update in stage 2. The boundary between these stages does not align with the traditional semantics-pragmatics divide — presupposition accommodation and anaphora resolution belong to what looks like pragmatics, yet in DRT they are formally encoded in the DRS update machinery, not relegated to informal post-hoc interpretation. This dissolves the hard semantics-pragmatics boundary.

Key mechanisms in DRT include:

Indefinite noun phrases introduce discourse referents into the DRS universe that remain accessible to subsequent anaphoric elements. "A cat is on the mat. It is asleep" does not require existential quantification with narrow scope; the indefinite "a cat" introduces referent x, and "it" simply retrieves x.
The DRS merging operation integrates a sentence DRS into the existing discourse DRS: referents and conditions from both are preserved, conditions from the new DRS are added with resolved anaphoric references, and presuppositions are checked against available discourse referents. The operation is sequential (not commutative), respecting temporal unfolding.
The Right Frontier Constraint governs which discourse referents are accessible for anaphoric binding — roughly, referents on the "right frontier" (immediately preceding clauses or enclosing higher-level clauses), not referents from completed subordinate clauses.
Donkey sentences ("Every farmer who owns a donkey beats it") illustrate DRT's advantage over classical logic: rather than requiring the indefinite "a donkey" to take wide or narrow scope under the universal quantifier (both producing wrong predictions), DRT introduces the donkey as a discourse referent in the antecedent DRS, making it accessible to the anaphoric "it" without scope paradox.

DRS structures translate systematically into First-Order Logic (FOL) formulas, enabling integration with classical logical provers and knowledge-graph systems. A DRS with discourse referents and conditions translates to FOL sentences where referents become bound variables and conditions become predicates under appropriate quantifier scope. This translation is non-unique (lossy with respect to presupposition triggers and modal scope) but well-studied. The Kamp, van Genabith & Reyle DRT handbook chapter gives the definitive technical treatment.

Segmented DRT (SDRT), developed by Lascarides and Asher, extends classical DRT by adding explicit representation of discourse structure via rhetorical relations (Elaboration, Explanation, Contrast, etc.) that hold between DRS units. Rather than treating discourse as a flat merge, SDRT introduces a hierarchical tree structure where rhetorical relations constrain accessibility and compositional scope — necessary for claims that relate via explicit discourse relations.

Abstract Meaning Representation (AMR)

Abstract Meaning Representation, introduced by Banarescu et al. (2013), takes a different approach: rather than tracking discourse context or targeting FOL, it represents the meaning of a single sentence as a rooted, directed, acyclic graph (DAG) where nodes represent concepts and edges represent semantic relations.

AMR concepts are either English words or PropBank framesets — labeled with semantic arguments (:ARG0, :ARG1, etc.) grounded in PropBank's inventory of over 11,000 frame files. Beyond predicate-argument structure, AMR captures higher-order semantic phenomena including modality, negation, aspect, quantification, and semantic relations through special AMR concepts and role types (:mod, :neg, etc.), extending its expressiveness beyond standard semantic role labeling.

AMR is engineered to abstract away from surface syntactic variations — word order, voice, morphological inflections, function words. Sentences with identical semantic content but different syntactic structures should ideally produce the same AMR graph. This makes it useful as a representation that cuts across the noise of surface form.

Contemporary AMR parsing is formulated as a sequence-to-sequence generation problem: text is encoded and the AMR graph is linearized as a sequence for decoding. This has made neural approaches competitive. AMR has become a canonical semantic parsing benchmark; evaluation relies on graph-matching metrics including SMatch and its variants, which measure structural similarity between predicted and gold-standard representations. However, granular evaluation via GrAPES shows that aggregate SMatch scores mask performance variation across linguistic phenomena — parsers may excel at predicate-argument structure while failing at polarity, coreference, or generalization to unseen labels.

Description Logics and Ontologies

Description Logics (DLs) are a family of decidable fragments of first-order predicate logic, designed for knowledge representation and reasoning over structured domains. Unlike full FOL (undecidable for many tasks), DLs deliberately restrict expressivity to guarantee decidability of key inference operations.

In DLs, concepts are interpreted as unary predicates (classes of individuals, e.g., "Person") and roles as binary predicates (relationships, e.g., "owns"). A knowledge base has two components: the TBox (terminological box) containing general axioms about concepts and roles, and the ABox (assertion box) containing facts about individual objects. Standard reasoning tasks include subsumption, satisfiability, instance-checking, and consistency checking — the basis for evaluating computational complexity across DL families.

A major organizing principle in DL research is the expressivity-tractability trade-off: researchers systematically explore which combinations of logical constructors maintain decidable or polynomial-time reasoning. This has produced a landscape of DL families (ALC, EL, DL-Lite, SHOIN, etc.) with formally characterized complexity profiles. Concept satisfiability in ALC is PSPACE-complete; with general concept inclusions it becomes ExpTime-complete. This motivates tractable fragments (EL, DL-Lite) where satisfiability can be decided in polynomial time.

OWL Full, used in semantic web applications, is undecidable for entailment — no effective automated reasoning tools exist or are expected. OWL DL (based on SHOIN(D)) and restricted profiles (EL, QL, RL) restore decidability by trading expressivity. Domain ontologies provide semantic grounding for semantic parsing: they constrain and validate automatically-generated semantic structures, ensuring consistency and accuracy in knowledge graph generation.

Controversies & Debates

Is Compositionality a Principle, a Hypothesis, or an Empirical Fact?

The philosophical status of compositionality is contested. The question is not merely academic: if compositionality is a methodological principle, then violations can be acceptable when they are necessary; if it is an empirical claim, then systematic violations refute it; if it is an analytic truth, violations are impossible by definition.

Systematic non-compositional phenomena challenge any strong version of the thesis:

Idioms: "kick the bucket" (meaning to die) cannot have its meaning derived from the meanings of "kick," "the," and "bucket." Fully non-compositional idioms are stored and retrieved as units rather than processed incrementally. More compositional idioms allow partial compositional processing, but the degree of compositionality varies and is not predictable from surface form.
Conversational implicature: Grice's implicatures (what is communicated beyond what is literally said) are not determined compositionally. They arise from pragmatic principles operating over sentence meaning, not from composition.
Occasion meanings: Expressions in use carry contextual meanings discerned from situation and intent, not from lexical meaning plus syntax.
Sarcasm and irony: The propositional content expressed is systematically opposite to the literal compositional content.

These phenomena are not exotic edge cases but pervasive features of ordinary language use.

Pragmatic Intrusion

Pragmatic intrusion is the phenomenon whereby contextual factors, speaker intentions, indexicals, and conversational conventions determine the propositional content of an utterance. Stalnaker's framework, documented in the Stanford Encyclopedia of Philosophy on Pragmatics, demonstrates that formal semantics cannot assign propositions to sentences directly; rules must match sentences to propositions relative to features of the context.

Indexical expressions (I, you, here, now, today) and context-sensitive predicates (vague terms whose extension shifts with context) create a fundamental formalization challenge: their truth conditions cannot be determined from the sentence alone. Context is not a finite, closed parameter space — it includes speaker intentions, common ground, discourse history, and social conventions — making it impossible to enumerate all context-determining features. Any formalization therefore requires making arbitrary closure assumptions about which contextual features matter.

Quine's Indeterminacy of Translation

Quine's thesis holds that there is no unique, correct formalization of natural language into formal propositions. Multiple conflicting translations or formalizations can be compatible with all observable behavioral data, and there is no fact of the matter determining which is "correct." This reflects two related doctrines: meaning holism (sentences do not have independent meaning but derive meaning from their position in a whole language and belief system) and reference inscrutability (there is no fact about what a term refers to — "rabbit" could be formalized as referring to rabbits, undetached rabbit parts, or rabbit-stages, with no behavioral evidence distinguishing them).

The practical implication of Quinean indeterminacy

Formalization is not a process of extracting the logical form from a sentence — there is no unique logical form to discover. It is a process of constructing a formal representation by making choices that are underdetermined by the evidence. Any formalization is therefore theory-laden and must be evaluated as such.

Vagueness and the Sorites Paradox

The sorites paradox demonstrates that formalization of vague predicates in classical logic leads to contradiction. If one grain is not a heap, and adding one grain to a non-heap leaves it a non-heap, then no finite number of grains makes a heap — contradicting ordinary judgment. This is not a flaw in any particular logical system but a structural tension between natural-language vagueness and bivalent logic.

Resolving the sorites requires abandoning classical bivalence: supervaluationist semantics (truth-value gaps), fuzzy logic (infinite-valued truth), degree-theoretic approaches. Each alternative represents a distinct formalization choice with distinct consequences. No single classical formalization captures vague predicates adequately.

Variants & Subtypes

Semantic Parsing as the Computational Task

Semantic parsing is the computational task of converting natural-language utterances into machine-interpretable logical forms that can be executed against a knowledge base or theorem prover. Unlike shallow semantic processing (POS tagging, named entity recognition), semantic parsing aims at complete formal semantics suitable for downstream reasoning. It is formal proposition embedding operationalized as an NLP task.

Semantic parsing systems employ diverse formal meaning representation languages across application domains:

Lambda calculus and λ-DCS for compositional linguistic analysis
SQL and FunQL for database query understanding
SPARQL for RDF knowledge bases
First-order logic for theorem proving
Domain-specific languages including Python and Java for specialized tasks

Neural semantic parsing emerged in the mid-2010s following advances in sequence-to-sequence encoder-decoder architectures. Neural approaches frame semantic parsing as sequence-to-sequence translation, achieving competitive results on existing benchmarks with robustness to noise. But they demonstrate a critical failure mode: compositional generalization — the ability to systematically generalize to unseen compositions of seen components. On the COGS benchmark, models achieve 96–99% in-distribution accuracy but only 16–35% compositional generalization accuracy, revealing that neural models memorize rather than truly compose. This gap persists with model scaling, suggesting a structural rather than a capacity limitation.

Autoformalization

Autoformalization — transforming informal mathematical or natural-language statements into formally verifiable representations — has emerged as a distinct research area in the LLM era (2023–2026). It acts as a bridge between natural-language ambiguity and the strict requirements of formal verification, enabling symbolic reasoning over LLM outputs.

The autoformalization survey (arXiv 2505.23486) documents that NL-FOL (natural language to first-order logic) translation remains a persistent and substantially unresolved challenge even for large language models. LLMs struggle with NL-FOL conversion despite strong performance on other reasoning tasks, indicating that formal logic translation requires capabilities distinct from general language understanding.

Large-scale datasets have been generated to support autoformalization research: Goedel-Pset-v1 contains 1.64M formal statements; FormL4 targets large-scale LLM evaluation. These represent a shift toward data-driven scaling. FormaRL achieved 9.6% pass@1 accuracy on autoformalization tasks through reinforcement learning on compiler feedback without labeled data — a 54.8% relative improvement over baseline — illustrating both the practical progress and the distance still to travel.

Components & Structure

The Atomic Decomposition Pipeline

A dominant pattern in applied formal proposition embedding is the decompose-then-verify paradigm: a complex claim is first decomposed into a set of minimal, self-contained atomic facts, then each fact is independently verified against evidence sources, and results are aggregated. This strategy underlies influential systems including FActScore (Min et al., EMNLP 2023), AFEV's iterative framework, and graph-based decomposers like GraphFC.

The rationale is twofold: decomposition distributes verification burden across simpler units, and fine-grained error localization becomes possible — if an atomic fact fails verification, its provenance can be traced back to the original claim.

FActScore operationalizes this as an automated factuality metric: decompose a generation into atomic facts using language models, retrieve evidence for each from Wikipedia, compute the percentage of facts verified as supported. It achieves high correlation with human judgments and has become influential as an automated evaluation metric.

AFEV (Atomic Fact Extraction and Verification) introduces iterative decomposition: extract atoms, retrieve evidence, verify; then use verified atoms to guide subsequent decomposition of remaining content. This reduces accumulated error propagation compared to one-shot decomposition, though it does not eliminate the underlying challenge of independence violations.

Graph-based approaches (GraphFC family) treat atomic decomposition as a graph construction and traversal problem. Claims and evidence are structured as entity-relationship graphs; decomposition is performed via graph construction and graph-guided planning; multi-path exploration enables systematic search of alternative decompositions. This explicitly captures structural interdependencies that linear decomposition loses.

Faithfulness and Granularity

Atomic decomposition introduces a faithfulness challenge: decomposed subclaims must preserve the semantic content of the original claim while becoming independently verifiable. DecMetrics (arXiv 2509.04483) formalizes three quality dimensions:

Completeness — whether decomposed atoms collectively cover all aspects of the original claim, measured via entailment between merged atoms and the original
Correctness — whether each atom is factually accurate and substantiated by the original text
Semantic Entropy — whether atoms are distinct and non-repetitive rather than paraphrasing one another

The granularity challenge has no consensus standard. Substantial disagreement exists between human and LLM annotators regarding appropriate decomposition granularity, particularly for conjunctive and conditional structures — should "A and B" become one atom or two? Finer granularity increases verification precision but risks over-decomposition and loss of context; coarser granularity preserves context but reduces error localization.

Neuro-Symbolic Hybrids

The frontier of formal proposition embedding today is neuro-symbolic pipelines that combine the semantic flexibility of neural language models with the soundness guarantees of symbolic reasoners.

VeriCoT extracts formal logical arguments from chain-of-thought reasoning by formalizing each CoT step into first-order logic and identifying premises that ground the argument. This bridges natural-language step reasoning with formal logic verification, enabling consistency validation of LLM-generated reasoning chains.

VERGE (Formal Refinement and Guidance Engine) introduces semantic routing: rather than forcing all claims into first-order logic, the system routes vague or commonsense claims to consensus-based soft verifiers while reserving strict formalization for claims that support it. This acknowledges that not all natural-language statements are formalizable and proposes a pragmatic architecture that degrades gracefully on open-domain content.

Grammar-constrained LLM decoding applied to formal specification generation (e.g., Req2LTL transforming natural language requirements to LTL) achieves 88–98% accuracy with 100% syntactic validity. Deterministic grammar-based constraints ensure all generated formulas pass structural validation — a major step toward combining neural semantic extraction with symbolic verification.

Neuro-symbolic frameworks combining LLM-driven semantic extraction with symbolic reasoning engines achieve over 99% soundness (near-zero false positive rates) on formal verification tasks. Systems like ARc exceed soundness assurance thresholds unattainable by pure neural approaches. The LogicGraph benchmark enables systematic evaluation of multi-path logical reasoning through neuro-symbolic generation and verification pipelines.

FregeLogic, a hybrid neuro-symbolic architecture for content-robust syllogistic validity prediction, positions modern hybrid methods as direct descendants of classical formal semantics — invoking Frege in both name and theoretical framing.

The Liang and Potts (2015) synthesis demonstrated that the historical divide between logical and statistical approaches to computational semantics is eroding, proposing a unified discriminative learning framework where statistical models learn compositional semantic theories from corpora and databases, relating distributed representations and logical forms within the same theoretical structure.

However, the formal-distributional gap remains deep: formal semantics models truth conditions and reference, while distributional semantics models usage patterns. Neural language models do not consistently behave according to formal semantic predictions — they fail to respect intersective and subsective modifier properties and are inconsistent on compositional phenomena. The gap reflects an irreducible tension in any attempt to reduce formal propositions to distributional embeddings or vice versa.

Limits of Formalization

Gödel-like incompleteness in formal specification

Every formal specification makes unstated assumptions about what counts as a "case," what silence means (default to false or true?), and how to handle edge cases the specification did not anticipate. This is not a flaw in poor specification-writing but a fundamental consequence: natural language contains implicit background knowledge that formal systems must make explicit, but there is always more implicit knowledge to extract.

Specification incompleteness affects all formal proposition embedding systems in high-stakes applications. Extensions to foundational formal semantics (DRT, Dynamic Semantics) were needed to handle phenomena the original systems missed — and each extension exposes new phenomena that require further extension.

The formal-distributional gap extends beyond representation to architecture: LLMs exhibit systematic failures in compositional generalization. These failures persist even with model scaling, suggesting the limitation is structural — the conflict between early-layer compression of inputs and late-layer information retention required for compositional processing creates a fundamental architectural barrier.

Proposition knowledge graphs attempt to unify atomic decomposition with formal semantics by treating atoms as instantiations of graph nodes and edges. The Semantic Ladder framework (arXiv 2603.22136) proposes progressive formalization of natural language content for knowledge graphs, treating formalization as a spectrum rather than a binary threshold. This reframes the question: rather than asking whether a sentence has been formalized, it asks how far along the formalization spectrum it has been moved.

Further Exploration

Foundational Theory

Compositionality — Stanford Encyclopedia of Philosophy — Definitive reference on the principle, its status, and its debates
Montague Semantics — Stanford Encyclopedia of Philosophy — Covers the formal framework, rule-to-rule correspondence, intensional logic, and Partee type-lifting
Discourse Representation Theory — Stanford Encyclopedia of Philosophy — DRT from first principles through extensions; covers DRS structure, merging, anaphora, presupposition
Willard Van Orman Quine — Stanford Encyclopedia of Philosophy — Indeterminacy of translation and reference inscrutability
Vagueness — Stanford Encyclopedia of Philosophy — Sorites paradox, non-classical logics for vague predicates
Pragmatics — Stanford Encyclopedia of Philosophy — Pragmatic intrusion, indexicals, Stalnaker's framework

Formal Technical Sources

Kamp, van Genabith & Reyle — DRT handbook chapter — The authoritative technical treatment of DRS construction, merging, and FOL translation
van Eijck & Kamp — Discourse Representation in Context — DRS merging operation and dynamic semantics
Description Logic: A Formal Foundation — Oxford (Horrocks) — TBox/ABox, expressivity-tractability tradeoffs
A Description Logic Primer — Accessible introduction to DL concepts, roles, inference tasks

Computational / NLP

Combinatory Categorial Grammar — Jurafsky & Martin — CCG as practical descendant of Montague semantics
Survey of Abstract Meaning Representation: Then, Now, Future — Comprehensive review of AMR formalism, parsing, evaluation
Autoformalization in the Era of LLMs: A Survey — Covers the full landscape of LLM-based formal translation
A Survey on Deep Learning for Theorem Proving (DL4TP, COLM 2024) — Five-category taxonomy: autoformalization, premise selection, proof-step generation, proof search, auxiliary tasks
Bringing Machine Learning and Compositional Semantics Together — Liang & Potts — The synthesis of logical and statistical approaches
FActScore: Fine-grained Atomic Evaluation of Factual Precision — Decompose-then-verify operationalized as a factuality metric
Do Large Language Models Have Compositional Ability? — Empirical documentation of compositional generalization failure in LLMs
VERGE: Formal Refinement and Guidance Engine — Semantic routing as a pragmatic approach to partial formalization
A Neurosymbolic Approach to Natural Language Formalization and Verification — CSP-based hybrid framework achieving >99% soundness

Quick reference

Field Formal semantics, computational linguistics, NLP

Core problem Translating natural language into machine-interpretable logical forms

Key frameworks Montague semantics, DRT, AMR, Description Logics

Founding work Frege (1884), Montague (1970s), Kamp/Heim (1981–82)

Practitioner tools CCG parsers, Boxer, neural seq2seq, autoformalization LLMs

Core tension Expressivity vs. tractability; compositionality vs. pragmatic intrusion

Active frontier Neuro-symbolic hybrid pipelines, autoformalization datasets