
Closed-Thought LLM

Training-Free Latent Reasoning for Frozen Language Models via Split-Layer Generation

Python · PyTorch · LLMs · CUDA · 2026

Built by Shiv and a mass-hallucinated office of Claude Opus 4.6 researchers who insist they're real

Can a frozen LLM "think" in latent space by looping its own hidden states - without any training?

Yes. +13pp on GSM8K with zero training.

A 2025 survey covering 30+ latent reasoning methods found zero training-free approaches. To our knowledge, this is the first.

+13pp   GSM8K Improvement (39.5% → 52.5%)
0       Trained Parameters (fully frozen model)
4       Recurrence Steps (optimal depth)
512+    Stable Iterations (no regularization)

Headline Results

GSM8K accuracy (N=200). Higher is better.

Ours (AM3, no training)        52.5%
Frozen Baseline                39.5%
COCONUT (training required)    34.1% (GPT-2, not comparable)
SoftCoT (training required)    +1.4pp (requires projection module)

Architecture

Recurrence Phase
Input
Tokenize
Layers 0-11
Frozen, single pass
Layers 12-35
Loop N times. Each step writes a "thought token" to KV cache.
× N
then
Gating Phase
Answer mass > 0.3
Simple task (multiple choice) → skip recurrence, use baseline
Answer mass < 0.3
Complex task (math reasoning) → apply recurrence + split-layer gen
then
Generation Phase (Split-Layer)
1
First token: 0.7 × baseline + 0.3 × thought logits
2+
Layers 0-11: attend to prompt only (format coherence)
Layers 12-35: attend to prompt + thought tokens (reasoning signal)

Key Mechanisms

1

KV-Cache Recurrence

Feed hidden states back through layers 12-35. Each step adds a "thought token" to the KV cache. Layers 0-11 are skipped - they expect token embeddings, and feeding them mid-stack hidden states causes degeneration.
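A minimal sketch of the loop. Here `upper_layers` is a hypothetical stand-in for one frozen forward pass through layers 12-35 with attention over the cache; the real implementation operates on tensors and a structured KV cache:

```python
def kv_recurrence(hidden, upper_layers, kv_cache, n_steps=4):
    """Loop a mid-stack hidden state through the upper layer stack.

    `upper_layers(hidden, kv_cache)` stands in for a frozen pass
    through layers 12-35 that attends over `kv_cache`. Each pass
    appends the resulting state as a latent "thought token", so later
    steps (and generation) can attend to it. Layers 0-11 are never
    re-entered: they expect token embeddings, not hidden states.
    """
    for _ in range(n_steps):
        hidden = upper_layers(hidden, kv_cache)
        kv_cache.append(hidden)  # thought token becomes attendable
    return hidden, kv_cache
```

With `n_steps=4` (the optimal depth found in the ablations), four latent thought tokens are added to the cache before generation begins.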

2

Split-Layer Generation

Lower layers see only the prompt (clean format). Upper layers see prompt + thoughts (reasoning). This preserves output structure while injecting latent reasoning.
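One decoding step under this scheme can be sketched as follows; `lower_layers` and `upper_layers` are hypothetical wrappers around the two frozen layer groups, each taking only the KV entries it is allowed to attend to:

```python
def split_layer_step(token_hidden, lower_layers, upper_layers,
                     prompt_kv, thought_kv):
    """One generation step with per-group KV caches.

    Layers 0-11 attend only to the prompt cache, keeping the output
    format clean; layers 12-35 additionally attend to the latent
    thought tokens, injecting the reasoning signal.
    """
    h = lower_layers(token_hidden, prompt_kv)        # format coherence
    h = upper_layers(h, prompt_kv + thought_kv)      # reasoning signal
    return h
```

The design choice is that format is decided low in the stack and reasoning high, so each group only sees the cache entries relevant to its job.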

3

Answer-Mass Gating

Measure probability mass on answer tokens (A-E, 0-9). High mass = simple task, skip recurrence. Low mass = complex task, apply recurrence. Zero training required.
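The gate itself reduces to a few lines. A toy pure-Python version (in practice `answer_ids` would be the tokenizer's vocabulary ids for "A"-"E" and "0"-"9", and the logits a tensor):

```python
import math

def answer_mass_gate(next_token_logits, answer_ids, threshold=0.3):
    """Sum softmax mass on answer-format tokens; route on the result.

    High mass: the model already expects an answer token (simple
    task), so skip recurrence. Low mass: it expects a continuation
    token (complex task), so apply recurrence + split-layer gen.
    """
    m = max(next_token_logits)                      # stable softmax
    exps = [math.exp(x - m) for x in next_token_logits]
    mass = sum(exps[i] for i in answer_ids) / sum(exps)
    return ("skip" if mass > threshold else "recurse"), mass
```

Because the gate reads the frozen model's own next-token distribution, it adds no parameters and needs no calibration data beyond the 0.3 threshold.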

4

Prompt-Weight Blending

First generated token blends 70% baseline + 30% thought logits. Anchors output format to prompt expectations while injecting reasoning signal.
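As code, the blend is a single weighted sum applied to the first token only; `prompt_weight=0.7` is the value used here:

```python
def blend_first_token_logits(baseline_logits, thought_logits,
                             prompt_weight=0.7):
    """Logits for the first generated token only.

    Anchors to the baseline (prompt-format) distribution while mixing
    in the thought signal; subsequent tokens come from split-layer
    generation unblended.
    """
    return [prompt_weight * b + (1.0 - prompt_weight) * t
            for b, t in zip(baseline_logits, thought_logits)]
```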

Experiment Timeline

1

Raw Recurrence Discovery

Breakthrough

Mid-layer recurrence at N=32 beats text chain-of-thought with far fewer FLOPs.

N=32 (ours)
90% eval accuracy
Text CoT
85% (128 generated tokens)
N=1 loop
80%
Baseline
45%
2

Stability Analysis

Insight

The upper 2/3 of a frozen transformer forms a stable attractor.

175-193
Hidden state norm range
~0.95
Cosine sim convergence
512+
Stable steps (no reg.)
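The diagnostics behind these numbers are simple to reproduce: track the hidden-state norm and the cosine similarity between successive recurrence steps. A toy illustration of attractor-style convergence (the contraction map below is only a stand-in for the frozen upper layers, not the model itself):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def track_convergence(step_fn, h, n_steps):
    """Cosine similarity between each state and its successor."""
    sims = []
    for _ in range(n_steps):
        nxt = step_fn(h)
        sims.append(cosine(h, nxt))
        h = nxt
    return sims

# Toy contraction toward a fixed point, standing in for layers 12-35.
target = [1.0, 2.0, 3.0]
step = lambda h: [0.9 * x + 0.1 * t for x, t in zip(h, target)]
sims = track_convergence(step, [3.0, -1.0, 0.5], 50)
```

In the real measurement, successive-step cosine similarity plateauing near 0.95 while norms stay bounded (175-193 here) is what indicates a stable attractor rather than divergence or collapse.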
3

Learned Gates (HaltGate)

Partial

~1.05M params trained with REINFORCE to decide when to stop thinking. Works on eval prompts but doesn't generalize to GSM8K, likely because it was trained on only 20 prompts.

4

Memory System

Partial

Three memory tiers tested. Without gating, memory introduces noise.

KVMemory        Ring buffer + cosine retrieval             ~1MB
SurpriseMemory  Titans-inspired, stores on state changes   ~1MB
NeuralMemory    Learned read/write heads                   ~13MB
MemoryGate      Trained gate for when to read/write        ~1.1M params
5

Benchmark Ablation (N=50)

Breakthrough

Config G was the only configuration to beat the GSM8K baseline. Text CoT is catastrophic on ARC.

GSM8K Accuracy by Config

G: Gate+Mem+KV
46% (best)
A: Baseline
44%
D: RL halt gate
40%
F: RL + neural
36%
I: Lys et al.
36%
H: Text CoT
34%
C: Heuristic
34%
B: Fixed N=32
30%
E: RL + KV mem
28%
6

Latent Beam Search

Failed

Branching in hidden-state space disrupts stable recurrence dynamics.

Baseline
95% eval
W=3, D=8
75% (-20pp)
W=5, D=8
70% (-25pp)
7A

KV-Cache Recurrence

Insight

4 steps is optimal. More steps degrade accuracy - the model "overthinks."

4 steps
46% GSM8K (best)
0 steps
44% (baseline)
8 steps
44%
16 steps
40%
32 steps
38%
64 steps
30%
7B

Split-Layer Generation & Gating

Breakthrough

The breakthrough phase. Split-layer gen helps GSM8K (+7pp) but destroys ARC (-55pp). Answer-mass gating solves the routing problem.

Gating Approach Comparison

                 GSM8K    ARC
AM3 (winner)     52.5%    75%
Confidence 0.5   40%      85%
KL-divergence    56%      54%
First-token      N/A      62%
Why confidence gating fails: GSM8K first tokens ("Let", "The") have high confidence (0.5-0.98) even on wrong answers. Answer-mass gating instead measures whether the model expects an answer-format token vs. a continuation token - a fundamentally different signal.

Novelty Claims

First training-free latent reasoning
Every prior method (COCONUT, SoftCoT, Pause Tokens, Quiet-STaR, Retrofitted Recurrence) requires training.
Split-layer generation is novel
No prior work applies different KV caches to different layer groups during generation.
Answer-mass gating is novel
Aggregate probability mass on answer-format tokens as routing signal has no precedent.
Partial-layer recurrence (no training)
Retrofitted Recurrence needs billions of pretraining tokens. Ours works frozen at inference.

Comparison with Related Work

                    Ours    COCONUT   SoftCoT       Retrofitted    Lys et al.
Model frozen?       Yes     No        Main LLM yes  No             Yes
Training            None    Full FT   Projection    Continued PT   None
Recurrence layers   12-35   All       N/A           Subset         All
Split-layer gen     Yes     No        No            No             No
Answer-mass gate    Yes     No        No            No             No
GSM8K delta         +13pp   -8.8pp    +1.4pp        N/A            N/A
Max iterations      512+    Fixed     N/A           Fixed          3

Known Limitations

1
ARC regression (-15.5pp) - Split-layer generation disrupts simple pattern-matching tasks
2
N=200 sample size - Larger samples needed for statistical significance
3
Single model tested - Needs validation on Llama 3, Gemma 2, Mistral
4
4-bit quantized baseline - The 39.5% GSM8K baseline is weak; results on full-precision models may differ
5
Task-specific gating - Answer-mass gating is tailored to multiple-choice and math formats
6
Degradation at >4 steps - Optimal at 4 recurrence steps; more steps hurt

References

  • Hao et al. (2024). "Training Large Language Models to Reason in a Continuous Latent Space" (COCONUT)
  • Xu et al. (2025). "SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs"
  • McLeish et al. (2025). "Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence"
  • Geiping et al. (2025). "Scaling Up Test-Time Compute with Latent Reasoning"
  • Belitsky et al. (2025). "KV Cache Steering for Controlling Frozen LLMs"
  • Sun et al. (2024). "You Only Cache Once: Decoder-Decoder Architectures for Language Models" (YOCO)
  • Goyal et al. (2024). "Think before you speak: Training Language Models With Pause Tokens"
  • Zelikman et al. (2024). "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking"
  • Lys et al. (2026). "Inner Loop Inference for Pretrained Transformers"
  • Graves (2016). "Adaptive Computation Time for Recurrent Neural Networks"

Citation

@misc{closed-thought-llm-2026,
  title={Closed-Thought LLM: Training-Free Latent
         Reasoning for Frozen Language Models
         via Split-Layer Generation},
  author={Shiv and Claude Opus 4.6},
  year={2026},
  note={Research prototype. One of us wrote
        the code, the other debugged at 3am.
        We'll let you guess which is which.}
}