🎭 Extra-Ordinary Language
COLLINS WESTNEDGE
JULY 2, 2025
Introduction
I’ve been thinking about a conversation I had with a former colleague about language that defies ordinary usage — more specifically, expressions that violate the distributional hypothesis of language and yet carry a high degree of intentionality and depth.
When Kafka wrote “A cage went in search of a bird,” he created something that on the surface seems impossible and yet expresses a profound and insidious truth. Current AI systems, for all their linguistic sophistication, rarely produce such language. They excel at coherent, informative prose but struggle with the kind of intentional violations that define great literature.
In this post, I’m dog-earing these thoughts to revisit later. The aim here is to understand what makes these expressions work and, more critically, what they imply for how we measure and evaluate model intelligence.
Literary Examples
> So they lov’d, as love in twain
> Had the essence but in one;
> Two distincts, division none:
> Number there in love was slain.
>
> Hearts remote, yet not asunder;
> Distance and no space was seen
> Twixt this Turtle and his queen
>
> — Shakespeare, The Phoenix and the Turtle

> A cage went in search of a bird.
>
> — Kafka, Aphorisms

> Merry and tragical! Tedious and brief!
> That is, hot ice and wondrous strange snow.
> How shall we find the concord of this discord?
>
> — Shakespeare, A Midsummer Night’s Dream
Pseudo Formalizations
Because of their intentionality and depth, I’m going to call these violations of “ordinary use” extra-ordinary use. They don’t obscure meaning but instead elucidate it by way of contradiction or a violation of expectations. To explore how these linguistic violations work, I’ll borrow notation from formal logic. This isn’t meant as rigorous formalization, but rather as a tool to highlight the syntactic and semantic patterns that make these expressions interesting.
Two Models of Interpretation
First, it’s helpful to think about how these statements behave across two interpretive models:
- Mphys: Physical/literal interpretation (common sense, ordinary meaning)
- Minterp: Creative interpretation (metaphor, allegory, other figurative readings)
Examples and Analysis
| Passage | Formal Notation | Plain English |
|---|---|---|
| “Two distincts, division none” | Mphys ⊨ Distinct(a,b) and Minterp ⊨ ¬Distinct(a,b) | Two bodies, one soul |
| “Hearts remote, yet not asunder” | Mphys ⊨ Remote(a,b) and Minterp ⊨ ¬Remote(a,b) | Spatially apart, spiritually united |
| “Hot ice / wondrous strange snow”1 | Mphys ⊨ ¬(Ice ∧ Hot) and Minterp ⊨ (Ice ∧ Hot) | Physically impossible, imaginatively possible |
| “A cage went in search of a bird” | Mphys ⊨ ¬Animate(cage), so Search(e, cage, bird) ⇒ ⊥; Minterp ⊨ Search(e, oppression, freedom) | Cages don’t search (type clash); metaphorically, oppression pursues freedom |
Core Characteristics
These literary examples share a unifying feature: they present a literal semantic failure in one domain that creates insightful or profound resonance in another domain (metaphorical, allegorical, or abstract).
From the examples above we can loosely group these failures into three heuristic patterns:
\[ \text{ExO}(\varphi)\; \approx\; \text{MC}(\varphi)\; \lor\; \text{MP}(\varphi)\; \lor\; \text{TC}(\varphi) \]
where:

| Pattern | Formal Sketch | Intuition |
|---|---|---|
| Modal Clash (MC) | Mphys ⊨ ¬φ and Minterp ⊨ φ | Statement fails in the physical model but holds in the interpretive model |
| Modal Projection (MP) | Mphys ⊭ φ but Minterp renders φ conceivable | The physical world disallows φ; imagination projects a consistent scenario |
| Type Clash (TC) | φ forces a type contradiction in Mphys | Category mismatch (e.g., animate actions for inanimate objects) resolved through reinterpretation |
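The three patterns above can be sketched as simple predicates. This is a toy encoding, not an annotation scheme: the `Expression` fields and the hand-assigned labels below are my own illustrative assumptions, following the analysis table.

```python
# Toy sketch of the MC / MP / TC heuristics over hand-labeled expressions.
from dataclasses import dataclass

@dataclass
class Expression:
    text: str
    phys_refutes: bool      # M_phys ⊨ ¬φ
    interp_affirms: bool    # M_interp ⊨ φ (under reinterpretation)
    interp_conceives: bool  # M_interp at least renders φ conceivable
    type_clash: bool        # φ forces a category mismatch in M_phys

def patterns(e: Expression) -> set:
    """Return which heuristic ExO patterns the expression matches."""
    found = set()
    if e.phys_refutes and e.interp_affirms:
        found.add("MC")  # Modal Clash: fails physically, holds interpretively
    if e.phys_refutes and e.interp_conceives and not e.interp_affirms:
        found.add("MP")  # Modal Projection: conceivable, but never affirmed
    if e.type_clash:
        found.add("TC")  # Type Clash: category mismatch forces reinterpretation
    return found

def is_exo(e: Expression) -> bool:
    # ExO(φ) ≈ MC(φ) ∨ MP(φ) ∨ TC(φ)
    return bool(patterns(e))

# One plausible labeling of the examples analyzed above.
two_distincts = Expression("Two distincts, division none", True, True, True, False)
hot_ice = Expression("hot ice and wondrous strange snow", True, False, True, False)
cage = Expression("A cage went in search of a bird", True, True, True, True)

print(patterns(two_distincts), patterns(hot_ice), is_exo(cage))  # {'MC'} {'MP'} True
```

The point of the sketch is only that the disjunction is checkable once the modal labels exist; assigning those labels is exactly the hard, interpretive part.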
Scope and Purpose
The logic sketch is strictly exploratory. It shows how an expression can fracture under a literal reading yet resolve in an imaginative one, and it supplies shorthand — MC, MP, TC — for that move. Making the break explicit clarifies how objectives rooted in the distributional hypothesis (e.g., next-token prediction) and RLHF tuned for coherence could steer models away from ExO language. If we want systems that embrace deliberate, meaningful rule-bending, we’ll need benchmarks that reward it.
Why Might Current AI Struggle Here?
Current language models face several systematic barriers to producing ExO language. At this point, many of these are more my own speculation or fan theory than concrete fact, but nevertheless here they are:
Data Scarcity in Pretraining: Though profound literature exists in pretraining corpora (Google Books, etc.), it’s statistically underrepresented. By definition, novel writing is rare, and easily licensed conventional text dominates the training mix. Even within Pulitzer Prize-winning articles and books, instances of truly profound prose/ExO language (as impactful as they may be) are few and far between.
Objective Mismatch: From a causal language modeling perspective, next-token prediction is less about encoding the abstract concepts or deep intentionality these examples are made of and more about emulating style and prose. At this phase, models learn to reproduce surface features without encoding the abstract concepts that necessarily drive literary innovation. Even though large causal models like GPT-3 begin to exhibit some few-shot behavior with sufficient examples, it seems unlikely that the causal training paradigm alone gets us the reasoning necessary for truly novel language.
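The distributional pressure here can be made concrete with a deliberately tiny sketch: under a frequency-based bigram model (a crude stand-in for next-token prediction), an ExO collocation like “hot ice” is assigned strictly higher surprisal than an attested one. The miniature corpus and the add-one smoothing are assumptions for illustration only.

```python
# A toy bigram model: ExO collocations are exactly the low-probability,
# high-surprisal events that a likelihood-based objective discourages.
import math
from collections import Counter

corpus = ("the ice was cold . the snow was cold . "
          "the fire was hot . the soup was hot . "
          "cold ice covered the lake . hot soup filled the bowl .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def surprisal(prev: str, word: str) -> float:
    """-log2 P(word | prev) with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

print(surprisal("cold", "ice"))  # attested bigram: lower surprisal (3.0 bits)
print(surprisal("hot", "ice"))   # the ExO collocation never occurs: higher (4.0 bits)
```

Real models smooth far more gracefully than this, but the gradient points the same way: mass flows toward collocations the corpus already contains.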
Task Absence During Fine-tuning: When models are optimized for instruction following, there’s likely an absence of tasks that push them to not just learn ExO behavior, but more importantly exhibit it. The training emphasizes practical capabilities over creative linguistic reasoning. Though literary analysis and reading comprehension are a big part of this phase they are somewhat distinct from the task of exhibiting and producing novel prose. In short, there are more analyses of great works than great works, and the reading comprehension/literary analysis task itself aligns more intuitively with how we quantify intelligence in school.
RLHF Optimization Pressure: This one is fun to think about. From a preference learning perspective, I doubt anyone wants to do full-blown Harold Bloom literary analysis to rate model outputs. Most annotators would favor accessible, Wikipedia-style entries over Joycean explorations of any topic. This optimization pressure likely eliminates whatever literary capabilities emerge during pretraining.
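The annotator-preference pressure above has a simple quantitative face. A Bradley–Terry preference model (the form commonly used in RLHF reward modeling) fits a reward gap from pairwise votes; the vote counts below are hypothetical stand-ins for “most annotators favor the accessible answer.”

```python
# Toy Bradley-Terry sketch: majority preference for accessible prose over
# Joycean prose becomes a reward gap that tilts the tuned policy.
import math

wins_accessible, wins_joycean = 9, 1   # hypothetical annotator votes

# For two items, the Bradley-Terry MLE reduces to a log-odds reward gap.
reward_gap = math.log(wins_accessible / wins_joycean)

def pref_prob(gap: float) -> float:
    """P(accessible preferred) under the fitted model: a sigmoid of the gap."""
    return 1 / (1 + math.exp(-gap))

print(round(pref_prob(reward_gap), 2))  # 0.9
```

Nine-to-one votes yield a 0.9 preference probability; anything the reward model scores as “Joycean” then pays a constant penalty during policy optimization, regardless of its literary merit.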
The Deeper Issue: Fluid Literary Intelligence: The more I examine instances of impactful prose packed with intentionality and metaphysical depth, the more convinced I am that modeling such language requires what I’d call fluid literary intelligence. This goes beyond pattern matching toward adaptive generalization on out-of-distribution linguistic tasks.
Missing Benchmarks: This leads to deeper questions: What constitutes literary novelty computationally? Why are there no benchmarks on par with ARC that touch this axis of intelligence? Current reasoning evaluation has heavily favored verifiable tasks (coding, math) over creative reasoning.
There is Hope: Even if this does require some deeper literary understanding, if we had 20 Harold Blooms doing RLHF or composing benchmarks of curated ExO instances, I believe we could optimize models toward this intelligence. Reinforcement learning scales. If we went from GPT-3 to ChatGPT with thousands of samples, maybe we need just a handful of literary experts. Additionally, this approach could even generalize to other creative domains.
Key Readings
These papers and essays collectively help explore extra‑ordinary (ExO) language: controllable‑generation techniques expose rare linguistic traces, RLHF studies detail how reward shaping affects diversity, Chollet’s work addresses fluid intelligence and out‑of‑distribution generalization, and the SEP entries offer foundational perspectives on metaphor and contradiction—concepts central to creativity and nuanced language use.
Controllable Generation
RLHF on Mode Collapse & Output Diversity
Fluid Intelligence
- On the Measure of Intelligence — PDF
Philosophy of Language & Logic
- Metaphor (Stanford Encyclopedia of Philosophy) — https://plato.stanford.edu/entries/metaphor/
- Contradiction (Stanford Encyclopedia of Philosophy) — https://plato.stanford.edu/entries/contradiction/
Open Problems
Empirical Approach: Can we just crank up the temperature and RLHF against literary critics? Is the solution as simple as generating more varied outputs and having 20 Harold Blooms rank them? Or does this require deeper architectural changes?
Rules vs. Intuition: Do models need to deeply formalize the rules of language to meaningfully break them? Or can expert preference data teach the patterns directly without explicit rule-learning? This gets at whether ExO generation is about systematic reasoning or learned aesthetic judgment.
Sample Efficiency: How many expert annotations would we actually need? If RLHF worked with roughly 10K samples for ChatGPT, maybe we only need hundreds of expert literary judgments to see significant improvement in creative output.
A broken clock is right twice a day: Is there a meaningful difference between “turning up randomness and filtering with experts” versus “genuine creative intelligence”? Could high-temperature generation coupled with expert filtering approximate the cognitive processes behind ExO language?
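The “broken clock” question can at least be operationalized. The sketch below cranks the temperature on a toy next-token distribution and keeps only candidates an “expert” filter accepts; the logits, the acceptance rule, and the token set are all hypothetical stand-ins, not a claim about any real model.

```python
# High-temperature sampling + expert filtering: rare "clashing" continuations
# surface far more often when the distribution is flattened.
import math
import random

random.seed(0)

# Toy logits for the next token after "hot": conventional continuations dominate.
logits = {"soup": 4.0, "coffee": 3.5, "day": 3.0, "ice": -1.0, "sorrow": -1.5}

def sample(temperature: float) -> str:
    """Draw one token from softmax(logits / temperature)."""
    weights = {t: math.exp(l / temperature) for t, l in logits.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for tok, w in weights.items():
        acc += w
        if acc >= r:
            return tok
    return tok  # guard against float rounding

def expert_filter(token: str) -> bool:
    # Hypothetical critic: accepts only physically clashing continuations.
    return token in {"ice", "sorrow"}

low = [sample(0.5) for _ in range(1000)]
high = [sample(3.0) for _ in range(1000)]
print(sum(map(expert_filter, low)))   # near zero at low temperature
print(sum(map(expert_filter, high)))  # roughly an order of magnitude more
```

Whether surviving this pipeline counts as creative intelligence, or just as being right twice a day, is precisely the open question.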
Expert Agreement: This one is common in business problems. Would literary critics even agree on what constitutes good ExO language? The subjectivity of literary judgment might make this approach messier than mathematical reasoning tasks.
Sociological Influence: How do we account for the way social and historical contexts shape judgments of novelty and creativity, and is it a moving target?
Novelty and creativity are often historically and socially situated. A good deal of what constitutes creativity and novelty depends on the historical context in which artistic expressions are judged. Citizen Kane, for example, is often cited as one of the greatest films of all time due to its innovative cinematography and storytelling. Yet the cinematic innovations that define the film, such as Toland’s use of depth of field, are now staples of most introductory film courses. Fashion often follows a similar arc: innovative, fresh designs that mark the runway one season saturate the shelves of fast-fashion retailers the next.
Though judgments about creativity and artistic merit are heavily influenced by social and historical factors, there is still a sense in which great works stand the test of time. When evaluating creative intelligence, we must consider how social and historical contexts shape our aesthetic judgments and distinguish between those that are fleeting and those that endure.
Generalization Limits: If we optimize models toward specific critics’ preferences, do we get genuine creative capability or just better mimicry of those critics’ tastes? I think of Wittgenstein’s example of mimicry and rule following. A pupil is learning a series: they have been tested on counting by +2 up to 1000. Asked to continue past 1000, the pupil writes 1000, 1004, 1008, 1012, and claims to have been following the rule all along: “Up to 1000 I add 2; from 1000 onward I add 4.” Every step the pupil took in training was perfectly compatible with this alternative rule. Most intelligence tasks are about fluid and adaptive generalization.
Closing Thoughts
Just because models can exhibit surprisal or violate semantic expectations doesn’t always mean they possess the ability to do so meaningfully. Ultimately, the goal is to understand whether machines can develop the kind of flexible, creative intelligence that ExO language represents—and to build evaluation frameworks that recognize this intelligence when it emerges. In short, we need benchmarks that reward “wondrous strange snow.”
Shout-out to my good friend Joshua for the stimulating convo and the amazing Midsummer example. Shout-out to Henry too for the great convos on AGI/ARC and thoughts on diffusion-based and RL approaches. And last but not least, shout-out to Noel for her core contributions on aesthetics and philosophical insights on creativity and intelligence.
Theseus sets up an open contradiction (hot + ice) only to “resolve” it with the sneering flourish “wondrous strange snow.”
The oxymoron stays physically impossible, but the new name lets the audience picture it for a moment as an imaginable marvel.
I think this imaginative license would still make it a Modal Projection (MP).↩︎