ICML 2026 · Full Paper  |  AAAI 2026 · Demo Paper

Cognitive Fatigue in Autoregressive Transformers

Riju Marwah1,2  ·  Ritvik Garimella2  ·  Vishal Pallagani2  ·  Atishay Jain2,3  ·  Michael Stewart2  ·  Amit Sheth2,4

1Guru Gobind Singh Indraprastha University, India    2Artificial Intelligence Institute of South Carolina, USA    3Indian Institute of Technology Kanpur, India    4Indian AI Research Organization, India

Attention decay (At) — mean last-layer attention to the prompt slice; a declining trend signals instruction loss.
Embedding drift (Dt) — ‖ht − h0‖₂ from the prompt anchor; growth signals representation wandering.
Entropy deviation (Et) — departure from a healthy band; too low signals repetition, too high erratic indecision.
Axioms: Monotonicity · Scale invariance · Boundedness · Temporal stability · Compositionality
Fatigue Index: FI = wAφA + wEφE + wDφD · FI ∈ [0,1] · computed at inference time (example value: 0.82)

Three lightweight inference-time signals — normalized under explicit axioms and aggregated into the Fatigue Index — enable real-time monitoring of long-horizon generation reliability without retraining.

Part I The Concept: Cognitive Fatigue in LLMs
What is cognitive fatigue?

Language models do not fail all at once. They drift. Over long generations, a model that starts focused and coherent gradually loses its grip on the original instruction, its internal representations shift away from the task, and its predictions grow erratic or repetitive. By the time the output looks wrong, the deterioration has already been underway for some time.

We formalize this as cognitive fatigue: a progressive, within-run deterioration in instruction adherence, representation stability, and predictive calibration. It is not random noise or an edge case. It is a systematic consequence of how transformer decoders operate over extended sequences, and it is measurable at inference time, token by token, without retraining.

The critical insight is that fatigue has internal signatures. Attention patterns, hidden states, and output distributions all show signs of degradation before the generated text visibly breaks down. This makes it possible to detect unreliable generation as it happens, not after the fact.

The three signals
Signal A
Prompt attention decay
As decoding progresses, the model increasingly conditions on its own recent outputs rather than the original instruction. Transformer decoders expose this through attention weights: the mean last-layer attention mass allocated to the initial prompt slice declines even when the instruction remains relevant. A declining At signals growing instruction neglect.
At = (1/H) Σh Σj≤Lp Attnh(xt, xj)
Weight: wA = 0.40 (highest priority)
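The attention signal above can be sketched in a few lines. This is a minimal illustration, assuming the current decode step's last-layer attention row is available as a (heads × seq_len) array; the array layout and variable names are assumptions, while the averaging formula follows the definition of At.

```python
import numpy as np

def attention_to_prompt(attn_last_layer, prompt_len):
    """Signal A_t: mean last-layer attention mass on the prompt slice.

    attn_last_layer: (num_heads, seq_len) attention row for the current
    decode step (hypothetical layout). prompt_len: prompt length L_p.
    """
    # Sum the mass landing on prompt positions per head, then average heads.
    return float(attn_last_layer[:, :prompt_len].sum(axis=1).mean())

# Toy check: two heads, 4-token context, 2-token prompt.
attn = np.array([[0.5, 0.3, 0.1, 0.1],
                 [0.2, 0.2, 0.3, 0.3]])
a_t = attention_to_prompt(attn, prompt_len=2)  # (0.8 + 0.4) / 2 = 0.6
```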
Signal D
Embedding drift
Long-horizon decoding repeatedly updates a shared residual stream, allowing small perturbations to accumulate. Hidden states gradually drift away from the representational subspace induced by the prompt — a latent degradation process that precedes surface-level incoherence. This internal drift is not directly observable from output text alone.
Dt = ‖ht − h0‖₂
Weight: wD = 0.25 (noisiest signal)
Signal E
Entropy deviation
During extended generation, models become overconfident (low entropy → repetition, degeneracy) or erratically uncertain (high entropy → unrelated to task difficulty). Shannon entropy of the next-token softmax provides a direct view of calibration state. Deviation from a healthy band in either direction signals fatigue.
Et = −Σ p log p  |  healthy band: [Hℓ, Hu]
Weight: wE = 0.35
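The entropy signal and its band deviation can be sketched as follows; the band edges here are illustrative tuning parameters, not values from the paper, and the deviation form (distance to the nearer band edge) is one reasonable reading of "departure from a healthy band".

```python
import numpy as np

def entropy_deviation(logits, h_lo, h_hi):
    """Signal E_t: Shannon entropy of the next-token softmax and its
    deviation from a healthy band [h_lo, h_hi] (band edges assumed)."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()        # softmax
    h = float(-(p * np.log(p + 1e-12)).sum())
    # Zero inside the band; distance to the violated edge outside it.
    dev = max(h_lo - h, 0.0) + max(h - h_hi, 0.0)
    return h, dev

h, dev = entropy_deviation(np.array([2.0, 1.0, 0.5, 0.5]), h_lo=0.8, h_hi=2.5)
```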
The Fatigue Index
FIt = wA φA(At) + wE φE(Et) + wD φD(Dt)
Each φ is a fixed monotone normalization map to [0,1]. Weights wA = 0.40, wE = 0.35, wD = 0.25 encode domain priors — prompt attention most directly governs instruction following; entropy governs degeneracy and repetition; drift signals longer-horizon instability but is noisier. Weights are frozen across all experiments. Higher FI = greater fatigue. FI ∈ [0,1].
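The aggregation itself is a weighted sum, as shown below. The weights are the frozen values reported above, but the specific φ maps are illustrative monotone squashings into [0, 1] — the paper fixes their shape axiomatically without publishing closed forms here, so the scales and functional forms are assumptions.

```python
import numpy as np

# Frozen weights from the paper; the phi maps are illustrative stand-ins.
W_A, W_E, W_D = 0.40, 0.35, 0.25

def phi_attention(a_t):
    # Less prompt attention -> more fatigue (monotone decreasing in a_t).
    return float(np.clip(1.0 - a_t, 0.0, 1.0))

def phi_entropy(dev_t, scale=2.0):
    # Larger band deviation -> more fatigue; saturating map into [0, 1).
    return float(1.0 - np.exp(-dev_t / scale))

def phi_drift(d_t, scale=10.0):
    # Larger L2 drift from the prompt anchor -> more fatigue.
    return float(1.0 - np.exp(-d_t / scale))

def fatigue_index(a_t, dev_t, d_t):
    """FI_t = w_A phi_A(A_t) + w_E phi_E(E_t) + w_D phi_D(D_t), in [0, 1]."""
    return (W_A * phi_attention(a_t)
            + W_E * phi_entropy(dev_t)
            + W_D * phi_drift(d_t))

fi = fatigue_index(a_t=0.6, dev_t=0.0, d_t=0.0)  # only attention decay contributes
```

Because each term is bounded in [0, 1] and the weights sum to 1, FI stays in [0, 1] (A3), and the per-term products give the A5 attribution directly.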
Five axioms any valid fatigue measure must satisfy
A1
Monotonicity
Reduced prompt attention, increased entropy deviation, or increased embedding drift must each monotonically increase FI. Worsening signals must produce worsening scores — never the reverse.
A2
Scale invariance
Monotone reparameterizations of raw signals must preserve the fatigue ordering, ensuring comparability across decoding regimes and numerical scales.
A3
Boundedness
FI lies on a fixed, interpretable scale: FI ∈ [0, 1]. This supports stable thresholds and consistent online monitoring across runs and models without renormalization.
A4
Temporal stability
Small, transient perturbations in signals should not induce large, instantaneous FI changes. Prevents spurious oscillations under inference-time noise. Enforced via hysteresis with distinct activation and deactivation thresholds.
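The hysteresis mechanism behind A4 can be sketched as a small stateful alert; the two threshold values are illustrative, not the thresholds used in the paper.

```python
def make_hysteresis_alert(on_thresh=0.6, off_thresh=0.5):
    """Alert with distinct activation/deactivation thresholds (A4).
    Threshold values here are illustrative."""
    state = {"active": False}
    def step(fi):
        if state["active"]:
            if fi < off_thresh:       # only deactivate below the lower bar
                state["active"] = False
        elif fi > on_thresh:          # only activate above the upper bar
            state["active"] = True
        return state["active"]
    return step

alert = make_hysteresis_alert()
trace = [alert(fi) for fi in [0.55, 0.65, 0.58, 0.45, 0.55]]
# A single 0.6 threshold would drop the alert on the transient dip to 0.58;
# hysteresis holds it active until FI falls below 0.5.
```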
A5
Compositionality
FI must decompose into interpretable per-signal contributions: FI = Σ wk gk(sk,t). This enables attribution — identifying which failure mode is driving a high score — and supports simple stabilizing mechanisms. Together, A1–A3 and A5 characterize the family of valid additive fatigue measures (Theorem E.1).
Mechanisms that drive fatigue
Mechanism | Description | Observed effect
Attention dilution | As sequence length grows, softmax attention mass over prompt tokens is diluted; RoPE and ALiBi positional biases further favor recent tokens. | Declining At; worse performance when evidence is distant from the decode head.
Residual drift | Errors compound in the shared residual stream; deviations propagate rather than cancel due to autoregressive updates and LayerNorm dynamics. | Monotonic increase in Dt; higher drift aligns with repetition and lower F1.
Entropy collapse | Autoregressive training favors sharp predictions; greedy or low-temperature decoding amplifies overconfidence, especially under quantization. | Entropy leaves the healthy band; repetitive and degenerate outputs late in generation.
Context length stress | Long contexts strain numerical precision in KV caches; RoPE phase saturation without scaling degrades long-range recall. | Earlier FI onset; biased attention scores; unstable output distributions.
Reduced precision | 4-bit quantization destabilizes predictive calibration without disrupting prompt focus or representation stability. | Deeper, more variable entropy collapse under NF4 vs. FP16; At and Dt remain similar.
Part II Formalization and Measurement ICML 2026
Abstract

Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms, enabling reliable runtime monitoring. Across nine models (1B–13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (ρ = 0.94), and reveal non-monotonic scaling behavior. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.

Key results
0.95
AUROC predicting task degradation
0.94
Spearman ρ with repetition ratio
>91%
Jitter reduction via hysteresis alerting
9
Models evaluated · 1B–13B params
Finding 1
Fatigue is domain-agnostic. FI accumulates consistently across HotpotQA (reasoning), TriviaQA (knowledge), and SQuAD (comprehension) — 27,405 generated sequences total — with mean values clustering tightly around 0.82 across all three. This rules out benchmark-specific artifacts and supports fatigue as a degradation process intrinsic to long-horizon decoding itself.
Finding 2
Aggregation is necessary. The full FI achieves AUROC = 0.977 on HotpotQA, significantly outperforming every individual signal in isolation: Entropy only (0.954), Drift only (0.930), Attention only (0.308). Attention alone performs particularly poorly — prompt focus is insufficient as a standalone indicator. The multi-component construction is empirically justified.
Finding 3
Degradation is cumulative, not Markovian. Spearman correlation between FI and repetition is ρ > 0.84 over full sequences but only ρ ≈ 0.40 over the first 20 tokens. Early-warning heuristics based on initial tokens are insufficient — an effective monitoring policy must track the full trajectory.
Finding 4
Context length accelerates FI onset. Longer contexts induce earlier and more sustained collapse of prompt-directed attention. Embedding drift increases across all conditions but becomes more variable with length, while entropy exhibits larger fluctuations — consistent with attention dilution and accumulation of residual deviations under extended sequences.
Finding 5
Middle-positioned evidence is systematically underused. Identical evidence placed at the start of a context receives substantially higher attention than the same evidence placed in the middle or end — a positional bias that accelerates prompt-forgetting and explains the "lost-in-the-middle" performance failures documented in prior work.
Finding 6
Quantization primarily destabilizes entropy. Comparing FP16 and 4-bit NF4 decoding under matched prompts and seeds: attention and drift trajectories remain similar, but entropy exhibits deeper and more variable collapse under quantization. Reduced precision disrupts predictive calibration far more than prompt focus or representation stability.
Finding 7
Non-monotonic scaling with instruction tuning. Instruction-tuned models below 3B parameters exhibit faster entropy collapse than base models under matched decoding — this trend reverses at 7B, where instruction tuning improves entropy calibration. At 13B, aggressive alignment produces a distinct failure mode: Llama-2-13B-Chat collapses into low-entropy refusal templates despite grammatical output ("safety fatigue"). Drift slopes remain approximately constant across all model sizes — larger models do not drift less, they drift more coherently.
Part III Chatsparent — Interactive Detection & Mitigation AAAI 2026 · Demo
Demo video
Overview

Today's chatbot interfaces offer little to no friction: seamless conversations conceal when the model is drifting, hallucinating, or failing. This lack of transparency fosters blind trust, even as models produce unstable or repetitive outputs. Chatsparent makes cognitive fatigue visible, measurable, and actionable. The system instruments all three token-level fatigue signals in real time, fuses them into the Fatigue Index, and streams FI live alongside model outputs, giving users a continuous view of generation reliability. When thresholds are crossed, the interface enables retrain-free interventions that restore generation stability without modifying model weights. By turning passive chatbot interaction into an interactive diagnostic experience, Chatsparent reframes autoregressive generation as an active control problem.

Retrain-free interventions
SCA · Soft context anchor
Attention-triggered prompt reinsertion
When At falls below a threshold, the original prompt is re-prepended and only a short recent tail of tokens is retained within the context limit. This "break-glass" action refocuses the model without editing the key–value cache. Triggered adaptively when attention crosses threshold τA = 0.010.
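SCA's context rebuild can be sketched as follows. Token sequences are plain Python lists here for illustration; τA = 0.010 and tail_keep = 128 follow the values reported on this page, but the function name and signature are assumptions.

```python
def soft_context_anchor(prompt_ids, generated_ids, a_t, tau_a=0.010, tail_keep=128):
    """SCA sketch: if prompt attention A_t drops below tau_a, rebuild the
    context as [prompt + recent tail]; otherwise leave it unchanged."""
    if a_t >= tau_a:
        return prompt_ids + generated_ids           # no intervention
    return prompt_ids + generated_ids[-tail_keep:]  # re-anchor on the prompt

# Attention has collapsed below threshold -> prompt is re-prepended,
# only the most recent 128 generated tokens are kept.
ctx = soft_context_anchor([1, 2, 3], list(range(10, 310)), a_t=0.005)
```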
PAR · Periodic attention reset
Scheduled context rebuild
At a fixed cadence k, the context is rebuilt as [prompt + recent tail]. PAR produces bumps in attention around reset boundaries, acting as a preventive nudge against gradual decay before thresholds are breached. Reset every 50 tokens with tail_keep = 128 in reported experiments.
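PAR is the scheduled counterpart of the same rebuild: instead of waiting for a threshold crossing, it fires every k steps. k = 50 and tail_keep = 128 are the reported experimental settings; the function shape is again an assumption.

```python
def periodic_attention_reset(prompt_ids, generated_ids, step, k=50, tail_keep=128):
    """PAR sketch: every k decode steps, rebuild the context as
    [prompt + recent tail], regardless of the current signal values."""
    if step == 0 or step % k != 0:
        return prompt_ids + generated_ids           # between resets: unchanged
    return prompt_ids + generated_ids[-tail_keep:]  # scheduled rebuild

ctx = periodic_attention_reset([1, 2, 3], list(range(200)), step=50)
```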
ERD · Entropy-regularized decoding
Dynamic temperature adjustment
At each step, temperature T ∈ [Tmin, Tmax] is adjusted to track a target entropy H*: if entropy is too low, increase T; if too high, decrease it. ERD curbs entropy collapse and indirectly flattens attention decay while leaving representation dynamics largely unchanged.
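One step of the ERD controller can be sketched as a simple proportional nudge toward the target entropy H*; the step size and temperature bounds below are illustrative, not the paper's settings.

```python
def erd_temperature(entropy, target_h, temp, step_size=0.05, t_min=0.5, t_max=1.5):
    """ERD sketch: one temperature update tracking a target entropy H*.
    step_size and the [t_min, t_max] bounds are illustrative."""
    if entropy < target_h:
        temp += step_size   # too confident: soften the distribution
    elif entropy > target_h:
        temp -= step_size   # too erratic: sharpen it
    return max(t_min, min(t_max, temp))  # clamp to [T_min, T_max]

t = erd_temperature(entropy=0.4, target_h=1.2, temp=1.0)  # -> 1.05
```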
PAUSE · Self-reflection
Chain-of-thought checkpoints
On a fixed cadence or when entropy or drift breach thresholds, the model briefly pauses generation to perform a targeted self-check. Grounding the model's next generation in a re-evaluation of its task context counteracts both attention decay and drift accumulation.
Results — Falcon-7B-Instruct · 4-bit NF4 · HotpotQA
Method | Mean Fatigue Index ↓ | Change vs. baseline | Latency (s)
Baseline | 0.36 | — | 213.5
ERD | 0.31 | −0.05 | 212.5
PAUSE | 0.31 | −0.05 | 228.0
SCA | 0.32 | −0.04 | 225.1
PAR | 0.34 | −0.02 | 222.4

All interventions reduce mean FI with modest latency overhead. ERD achieves the best FI reduction with negligible added latency. Decoding defaults: top-p = 0.95, T = 1.0, max new tokens = 120.


Papers
1
International Conference on Machine Learning (ICML 2026)
Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement
Riju Marwah · Ritvik Garimella · Vishal Pallagani · Atishay Jain · Michael Stewart · Amit Sheth
Formalizes cognitive fatigue as a runtime state variable grounded in three token-level signals. Introduces the Fatigue Index with five explicit axioms and validates it across nine models (1B–13B) on HotpotQA, TriviaQA, and SQuAD under long-context, positional, and precision stress conditions.
2
Association for the Advancement of Artificial Intelligence (AAAI 2026) · Demo
Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in LLMs
Riju Marwah · Vishal Pallagani · Ritvik Garimella · Amit Sheth
An interactive system that streams fatigue signals live alongside model outputs and enables four retrain-free interventions (SCA, PAR, ERD, PAUSE). Turns passive chatbot interaction into a diagnostic experience that exposes model dynamics and improves long-horizon reliability without retraining.

Citation
ICML 2026 — Full paper
@inproceedings{marwah2026cognitivefatigue,
  title     = {Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement},
  author    = {Marwah, Riju and Garimella, Ritvik and Pallagani, Vishal and Jain, Atishay and Stewart, Michael and Sheth, Amit},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}
AAAI 2026 — Demo paper
@inproceedings{marwah2026chatsparent,
  title     = {Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in {LLMs}},
  author    = {Marwah, Riju and Pallagani, Vishal and Garimella, Ritvik and Sheth, Amit},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026}
}