🎭 Extra-Ordinary Language
COLLINS WESTNEDGE
JULY 2, 2025
Introduction
I’ve been thinking about a conversation I had with a former colleague about language that defies ordinary usage — more specifically, expressions that violate the distributional hypothesis of language and yet carry a high degree of intentionality and depth.
When Kafka wrote “A cage went in search of a bird,” he created something that on the surface seems impossible and yet expresses a profound and insidious truth. Current AI systems, for all their linguistic sophistication, rarely produce such language. They excel at coherent, informative prose but struggle with the kind of intentional violations that define great literature.
In this post, I’m dog-earing these thoughts to revisit later. The aim here is to understand what makes these expressions work and, more critically, what they imply for how we measure and evaluate model intelligence.
Literary Examples
> So they lov’d, as love in twain
> Had the essence but in one;
> Two distincts, division none:
> Number there in love was slain.
>
> Hearts remote, yet not asunder;
> Distance and no space was seen
> Twixt this Turtle and his queen
>
> — Shakespeare, The Phoenix and the Turtle

> A cage went in search of a bird.
>
> — Kafka, Aphorisms

> Merry and tragical! Tedious and brief!
> That is, hot ice and wondrous strange snow.
> How shall we find the concord of this discord?
>
> — Shakespeare, A Midsummer Night’s Dream
Pseudo Formalizations
Because of their intentionality and depth, I’m going to call these violations of “ordinary use” extra-ordinary use. They don’t obscure meaning but instead elucidate it by way of contradiction or a violation of expectations. To explore how these linguistic violations work, I’ll borrow notation from formal logic. This isn’t meant as rigorous formalization, but rather as a tool to highlight the syntactic and semantic patterns that make these expressions interesting.
Two Models of Interpretation
First, it’s helpful to think about how these statements behave across two interpretive models:
- Mphys: Physical/literal interpretation (common sense, ordinary meaning)
- Minterp: Creative interpretation (metaphor, allegory, other figurative readings)
Examples and Analysis
| Passage | Formal Notation | Plain English |
|---|---|---|
| “Two distincts, division none” | Mphys ⊨ Distinct(a,b) and Minterp ⊨ ¬Distinct(a,b) | Two bodies, one soul |
| “Hearts remote, yet not asunder” | Mphys ⊨ Remote(a,b) and Minterp ⊨ ¬Remote(a,b) | Spatially apart, spiritually united |
| “Hot ice / wondrous strange snow”1 | Mphys ⊨ ¬(Ice ∧ Hot) and Minterp ⊨ (Ice ∧ Hot) | Physically impossible, imaginatively possible |
| “A cage went in search of a bird” | Mphys ⊨ ¬Animate(cage), so Search(e, cage, bird) ⇒ ⊥; Minterp ⊨ Search(e, oppression, freedom) | Cages don’t search (type clash); metaphorically, oppression pursues freedom |
Core Characteristics
These literary examples share a unifying feature: they present a literal semantic failure in one domain that creates insightful or profound resonance in another domain (metaphorical, allegorical, or abstract).
From the examples above we can loosely group these failures into three heuristic patterns:
\[ \text{ExO}(\varphi)\; \approx\; \text{MC}(\varphi)\; \lor\; \text{MP}(\varphi)\; \lor\; \text{TC}(\varphi) \]
where:

| Pattern | Formal Sketch | Intuition |
|---|---|---|
| Modal Clash (MC) | Mphys ⊨ ¬φ and Minterp ⊨ φ | Statement fails in the physical model but holds in the interpretive model |
| Modal Projection (MP) | Mphys ⊭ φ but Minterp renders φ conceivable | The physical world disallows φ; imagination projects a consistent scenario |
| Type Clash (TC) | φ forces a type contradiction in Mphys | Category mismatch (e.g., animate actions for inanimate objects) resolved through reinterpretation |
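The three patterns above can be sketched as simple predicates. This is a toy encoding, not an annotation scheme: the `Expression` fields and the hand-assigned labels below are my own illustrative assumptions, following the analysis table.

```python
# Toy sketch of the MC / MP / TC heuristics over hand-labeled expressions.
from dataclasses import dataclass

@dataclass
class Expression:
    text: str
    phys_refutes: bool      # M_phys ⊨ ¬φ
    interp_affirms: bool    # M_interp ⊨ φ (under reinterpretation)
    interp_conceives: bool  # M_interp at least renders φ conceivable
    type_clash: bool        # φ forces a category mismatch in M_phys

def patterns(e: Expression) -> set:
    """Return which heuristic ExO patterns the expression matches."""
    found = set()
    if e.phys_refutes and e.interp_affirms:
        found.add("MC")  # Modal Clash: fails physically, holds interpretively
    if e.phys_refutes and e.interp_conceives and not e.interp_affirms:
        found.add("MP")  # Modal Projection: conceivable, but never affirmed
    if e.type_clash:
        found.add("TC")  # Type Clash: category mismatch forces reinterpretation
    return found

def is_exo(e: Expression) -> bool:
    # ExO(φ) ≈ MC(φ) ∨ MP(φ) ∨ TC(φ)
    return bool(patterns(e))

# One plausible labeling of the examples analyzed above.
two_distincts = Expression("Two distincts, division none", True, True, True, False)
hot_ice = Expression("hot ice and wondrous strange snow", True, False, True, False)
cage = Expression("A cage went in search of a bird", True, True, True, True)

print(patterns(two_distincts), patterns(hot_ice), is_exo(cage))  # {'MC'} {'MP'} True
```

The point of the sketch is only that the disjunction is checkable once the modal labels exist; assigning those labels is exactly the hard, interpretive part.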
Scope and Purpose
The logic sketch is strictly exploratory. It shows how an expression can fracture under a literal reading yet resolve in an imaginative one, and it supplies shorthand — MC, MP, TC — for that move. Making the break explicit clarifies how objectives rooted in the distributional hypothesis (e.g., next-token prediction) and RLHF tuned for coherence could steer models away from ExO language. If we want systems that embrace deliberate, meaningful rule-bending, we’ll need benchmarks that reward it.
Why Might Current AI Struggle Here?
Current language models face several systematic barriers to producing ExO language. At this point, many of these are more my own speculation or fan theory than concrete fact, but nevertheless here they are:
Data Scarcity in Pretraining: Though profound literature exists in pretraining corpora (Google Books, etc.), it’s statistically underrepresented. By definition, novel writing is rare, and easily licensed conventional text dominates the training mix. Even within Pulitzer Prize-winning articles and books, instances of truly profound prose/ExO language (as impactful as they may be) are few and far between.
Objective Mismatch: From a causal language modeling perspective, next-token prediction is less about encoding the abstract concepts or deep intentionality these examples are made of and more about emulating style and prose. At this phase, models learn to reproduce surface features without encoding the abstract concepts that necessarily drive literary innovation. Even though large causal models like GPT-3 begin to exhibit some few-shot behavior with sufficient examples, it seems unlikely that the causal training paradigm alone gets us the reasoning necessary for truly novel language.
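The distributional pressure here can be made concrete with a deliberately tiny sketch: under a frequency-based bigram model (a crude stand-in for next-token prediction), an ExO collocation like “hot ice” is assigned strictly higher surprisal than an attested one. The miniature corpus and the add-one smoothing are assumptions for illustration only.

```python
# A toy bigram model: ExO collocations are exactly the low-probability,
# high-surprisal events that a likelihood-based objective discourages.
import math
from collections import Counter

corpus = ("the ice was cold . the snow was cold . "
          "the fire was hot . the soup was hot . "
          "cold ice covered the lake . hot soup filled the bowl .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def surprisal(prev: str, word: str) -> float:
    """-log2 P(word | prev) with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

print(surprisal("cold", "ice"))  # attested bigram: lower surprisal (3.0 bits)
print(surprisal("hot", "ice"))   # the ExO collocation never occurs: higher (4.0 bits)
```

Real models smooth far more gracefully than this, but the gradient points the same way: mass flows toward collocations the corpus already contains.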
Task Absence During Fine-tuning: When models are optimized for instruction following, there’s likely an absence of tasks that push them to not just learn ExO behavior, but more importantly exhibit it. The training emphasizes practical capabilities over creative linguistic reasoning. Though literary analysis and reading comprehension are a big part of this phase they are somewhat distinct from the task of exhibiting and producing novel prose. In short, there are more analyses of great works than great works, and the reading comprehension/literary analysis task itself aligns more intuitively with how we quantify intelligence in school.
RLHF Optimization Pressure: This one is fun to think about. From a preference learning perspective, I doubt anyone wants to do full-blown Harold Bloom literary analysis to rate model outputs. Most annotators would favor accessible, Wikipedia-style entries over Joycean explorations of any topic. This optimization pressure likely eliminates whatever literary capabilities emerge during pretraining.
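The annotator-preference pressure above has a simple quantitative face. A Bradley–Terry preference model (the form commonly used in RLHF reward modeling) fits a reward gap from pairwise votes; the vote counts below are hypothetical stand-ins for “most annotators favor the accessible answer.”

```python
# Toy Bradley-Terry sketch: majority preference for accessible prose over
# Joycean prose becomes a reward gap that tilts the tuned policy.
import math

wins_accessible, wins_joycean = 9, 1   # hypothetical annotator votes

# For two items, the Bradley-Terry MLE reduces to a log-odds reward gap.
reward_gap = math.log(wins_accessible / wins_joycean)

def pref_prob(gap: float) -> float:
    """P(accessible preferred) under the fitted model: a sigmoid of the gap."""
    return 1 / (1 + math.exp(-gap))

print(round(pref_prob(reward_gap), 2))  # 0.9
```

Nine-to-one votes yield a 0.9 preference probability; anything the reward model scores as “Joycean” then pays a constant penalty during policy optimization, regardless of its literary merit.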
The Deeper Issue: Fluid Literary Intelligence: The more I examine instances of impactful prose packed with intentionality and metaphysical depth, the more convinced I am that modeling such language requires what I’d call fluid literary intelligence. This goes beyond pattern matching toward adaptive generalization on out-of-distribution linguistic tasks.
Missing Benchmarks: This leads to deeper questions: What constitutes literary novelty computationally? Why are there no benchmarks on par with ARC that touch this axis of intelligence? Current reasoning evaluation has heavily favored verifiable tasks (coding, math) over creative reasoning.
There is Hope: Even if this does require some deeper literary understanding, if we had 20 Harold Blooms doing RLHF or composing benchmarks of curated ExO instances, I believe we could optimize models toward this intelligence. Reinforcement learning scales. If we went from GPT-3 to ChatGPT with thousands of samples, maybe we need just a handful of literary experts. Additionally, this approach could even generalize to other creative domains.
Key Readings
These papers and essays collectively help explore extra‑ordinary (ExO) language: controllable‑generation techniques expose rare linguistic traces, RLHF studies detail how reward shaping affects diversity, Chollet’s work addresses fluid intelligence and out‑of‑distribution generalization, and the SEP entries offer foundational perspectives on metaphor and contradiction—concepts central to creativity and nuanced language use.
Controllable Generation
RLHF on Mode Collapse & Output Diversity
Fluid Intelligence
- On the Measure of Intelligence — PDF
Philosophy of Language & Logic
- Metaphor (Stanford Encyclopedia of Philosophy) — https://plato.stanford.edu/entries/metaphor/
- Contradiction (Stanford Encyclopedia of Philosophy) — https://plato.stanford.edu/entries/contradiction/
Open Problems
Empirical Approach: Can we just crank up the temperature and RLHF against literary critics? Is the solution as simple as generating more varied outputs and having 20 Harold Blooms rank them? Or does this require deeper architectural changes?
Rules vs. Intuition: Do models need to deeply formalize the rules of language to meaningfully break them? Or can expert preference data teach the patterns directly without explicit rule-learning? This gets at whether ExO generation is about systematic reasoning or learned aesthetic judgment.
Sample Efficiency: How many expert annotations would we actually need? If RLHF worked with roughly 10K samples for ChatGPT, maybe we only need hundreds of expert literary judgments to see significant improvement in creative output.
A broken clock is right twice a day: Is there a meaningful difference between “turning up randomness and filtering with experts” versus “genuine creative intelligence”? Could high-temperature generation coupled with expert filtering approximate the cognitive processes behind ExO language?
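The “broken clock” question can at least be operationalized. The sketch below cranks the temperature on a toy next-token distribution and keeps only candidates an “expert” filter accepts; the logits, the acceptance rule, and the token set are all hypothetical stand-ins, not a claim about any real model.

```python
# High-temperature sampling + expert filtering: rare "clashing" continuations
# surface far more often when the distribution is flattened.
import math
import random

random.seed(0)

# Toy logits for the next token after "hot": conventional continuations dominate.
logits = {"soup": 4.0, "coffee": 3.5, "day": 3.0, "ice": -1.0, "sorrow": -1.5}

def sample(temperature: float) -> str:
    """Draw one token from softmax(logits / temperature)."""
    weights = {t: math.exp(l / temperature) for t, l in logits.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for tok, w in weights.items():
        acc += w
        if acc >= r:
            return tok
    return tok  # guard against float rounding

def expert_filter(token: str) -> bool:
    # Hypothetical critic: accepts only physically clashing continuations.
    return token in {"ice", "sorrow"}

low = [sample(0.5) for _ in range(1000)]
high = [sample(3.0) for _ in range(1000)]
print(sum(map(expert_filter, low)))   # near zero at low temperature
print(sum(map(expert_filter, high)))  # roughly an order of magnitude more
```

Whether surviving this pipeline counts as creative intelligence, or just as being right twice a day, is precisely the open question.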
Expert Agreement: This one is common in business problems. Would literary critics even agree on what constitutes good ExO language? The subjectivity of literary judgment might make this approach messier than mathematical reasoning tasks.
Sociological Influence: How do we account for the way social and historical contexts shape judgments of novelty and creativity, and is it a moving target?
Novelty and creativity are often historically and socially situated. A good deal of what constitutes creativity and novelty depends on the historical context in which artistic expressions are judged. Citizen Kane, for example, is often cited as one of the greatest films of all time due to its innovative cinematography and storytelling. Yet the cinematic innovations that define the film, such as Toland’s use of depth of field, are now staples of most introductory film courses. Fashion often follows a similar arc: innovative, fresh designs that mark the runway one season saturate the shelves of fast-fashion retailers the next.
Though judgments about creativity and artistic merit are heavily influenced by social and historical factors, there is still a sense in which great works stand the test of time. When evaluating creative intelligence, we must consider how social and historical contexts shape our aesthetic judgments and distinguish between those that are fleeting and those that endure.
Generalization Limits: If we optimize models toward specific critics’ preferences, do we get genuine creative capability or just better mimicry of those critics’ tastes? I think of Wittgenstein’s example of mimicry and rule following. A pupil is learning a series: they have been tested on counting by +2 up to 1000. Asked to continue past 1000, the pupil writes 1000, 1004, 1008, 1012, and claims to have been following the rule all along: “Up to 1000 I add 2; from 1000 onward I add 4.” Every step the pupil took in training was perfectly compatible with this alternative rule. Most intelligence tasks are about fluid and adaptive generalization.
Closing Thoughts
Just because models can exhibit surprisal or violate semantic expectations doesn’t always mean they possess the ability to do so meaningfully. Ultimately, the goal is to understand whether machines can develop the kind of flexible, creative intelligence that ExO language represents—and to build evaluation frameworks that recognize this intelligence when it emerges. In short, we need benchmarks that reward “wondrous strange snow.”
Shout-out to my good friend Joshua for the stimulating convo and the amazing Midsummer example. Shout-out to Henry too for the great convos on AGI/ARC and thoughts on diffusion-based and RL approaches. And last but not least, shout-out to Noel for her core contributions on aesthetics and philosophical insights on creativity and intelligence.
Theseus sets up an open contradiction (hot + ice) only to “resolve” it with the sneering flourish “wondrous strange snow.”
The oxymoron stays physically impossible, but the new name lets the audience picture it for a moment as an imaginable marvel.
I think this imaginative license would still make it a Modal Projection (MP).↩︎