Closed-Thought LLM
Training-Free Latent Reasoning for Frozen Language Models via Split-Layer Generation
Built by Shiv and a mass-hallucinated office of Claude Opus 4.6 researchers who insist they're real
Can a frozen LLM "think" in latent space by looping its own hidden states - without any training?
Yes. +13pp on GSM8K with zero training.
A 2025 survey of more than 30 latent-reasoning methods found no training-free approaches. To our knowledge, this is the first.
Headline Results
GSM8K accuracy (N=200). Higher is better.
Architecture
Key Mechanisms
KV-Cache Recurrence
Feed hidden states back through layers 12-35. Each step adds a "thought token" to the KV cache. Layers 0-11 are skipped: they expect token embeddings as input, and re-entering hidden states there causes degeneration.
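A minimal sketch of the control flow, not the repo's actual API (all names here are illustrative, and each "layer" is a toy function standing in for a frozen transformer block):

```python
# Schematic KV-cache recurrence. The real method runs the frozen model's
# blocks 12-35; here each block is a trivial callable on a scalar state.

RECURRENCE_START = 12   # layers 0-11 expect embeddings, so they are skipped
RECURRENCE_END = 36     # exclusive: recurrence runs through layers 12-35

def recur_latent(hidden, layers, kv_cache, steps=4):
    """Loop `hidden` through the upper layers, caching one 'thought token'
    (the resulting hidden state) per recurrence step."""
    for _ in range(steps):
        for layer in layers[RECURRENCE_START:RECURRENCE_END]:
            hidden = layer(hidden)
        kv_cache.append(hidden)  # each step adds a thought token to the cache
    return hidden, kv_cache

# Toy demo: 36 identity "layers" acting on a scalar hidden state.
layers = [lambda h: h * 1.0 for _ in range(36)]
hidden, cache = recur_latent(1.0, layers, kv_cache=[], steps=4)
```

The point is the shape of the loop: only the upper blocks are re-entered, and the cache grows by one entry per step.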
Split-Layer Generation
Lower layers see only the prompt (clean format). Upper layers see prompt + thoughts (reasoning). This preserves output structure while injecting latent reasoning.
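The routing can be sketched as follows (names and the "attention" stand-in are illustrative; real attention is reduced to an average over the visible cache just to show which layers see which tokens):

```python
# Split-layer generation sketch: lower blocks attend over the prompt-only KV
# cache; upper blocks attend over prompt + thought tokens.

SPLIT = 12  # blocks below this index never see the thought tokens

def forward_split(x, n_layers, prompt_kv, thought_kv):
    for i in range(n_layers):
        visible = prompt_kv if i < SPLIT else prompt_kv + thought_kv
        x = x + sum(visible) / len(visible)  # stand-in for attention + MLP
    return x

out_lower = forward_split(0.0, 2, [1.0], [3.0])   # below the split: prompt only
out_upper = forward_split(0.0, 13, [1.0], [3.0])  # layer 12 also sees thoughts
```

Because the lower stack never conditions on thought tokens, the output format stays anchored to the prompt while the upper stack carries the latent reasoning.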
Answer-Mass Gating
Measure probability mass on answer tokens (A-E, 0-9). High mass = simple task, skip recurrence. Low mass = complex task, apply recurrence. Zero training required.
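A sketch of the gate, assuming a threshold of 0.5 and an illustrative token set (the repo's actual threshold and answer-token ids may differ):

```python
import math

# Answer-mass gating: probability mass concentrated on answer tokens
# (A-E, 0-9) signals an easy prompt, so recurrence is skipped; diffuse
# mass triggers latent reasoning. No parameters are trained.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def should_recur(logits, answer_ids, threshold=0.5):
    probs = softmax(logits)
    answer_mass = sum(probs[i] for i in answer_ids)
    return answer_mass < threshold  # low mass on answers -> think more

# Toy vocab of 6 tokens; ids 0 and 1 play the role of answer tokens.
diffuse = should_recur([0.0] * 6, answer_ids=[0, 1])           # mass 1/3
peaked = should_recur([5.0, 5.0, 0.0, 0.0, 0.0, 0.0], [0, 1])  # mass ~0.99
```

Here the diffuse distribution routes to recurrence and the peaked one skips it.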
Prompt-Weight Blending
First generated token blends 70% baseline + 30% thought logits. Anchors output format to prompt expectations while injecting reasoning signal.
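The blend itself is a one-liner; a sketch with the 70/30 weights from the text (function name illustrative):

```python
# Prompt-weight blending for the first generated token only: 70% of the
# clean prompt-conditioned logits plus 30% of the thought-conditioned logits.

BASELINE_W = 0.7  # weight on the prompt-only (clean-format) logits
THOUGHT_W = 0.3   # weight on the thought-conditioned (reasoning) logits

def blend_first_token(baseline_logits, thought_logits):
    return [BASELINE_W * b + THOUGHT_W * t
            for b, t in zip(baseline_logits, thought_logits)]

mixed = blend_first_token([1.0, 0.0], [0.0, 1.0])
```

Subsequent tokens are generated normally; only the first token is anchored this way.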
Experiment Timeline
Raw Recurrence Discovery
Mid-layer recurrence at N=32 beats text chain-of-thought with far fewer FLOPs.
Stability Analysis
The upper 2/3 of a frozen transformer forms a stable attractor.
Learned Gates (HaltGate)
~1.05M params trained with REINFORCE to decide when to stop thinking. Works on eval prompts but fails to generalize to GSM8K, likely because it was trained on only 20 prompts.
Memory System
Three memory tiers tested. Without gating, memory introduces noise.
Benchmark Ablation (N=50)
Config G was the only configuration to beat the GSM8K baseline. Text CoT is catastrophic on ARC.
GSM8K Accuracy by Config
Latent Beam Search
Branching in hidden-state space disrupts stable recurrence dynamics.
KV-Cache Recurrence
4 steps is optimal. More steps degrade accuracy: the model "overthinks."
Split-Layer Generation & Gating
The breakthrough phase. Split-layer gen helps GSM8K (+7pp) but destroys ARC (-55pp). Answer-mass gating solves the routing problem.
Gating Approach Comparison
Novelty Claims
Comparison with Related Work
| | Ours | COCONUT | SoftCoT | Retrofitted | Lys et al. |
|---|---|---|---|---|---|
| Model frozen? | Yes | No | Main LLM yes | No | Yes |
| Training | None | Full FT | Projection | Continued PT | None |
| Recurrence layers | 12-35 | All | N/A | Subset | All |
| Split-layer gen | Yes | No | No | No | No |
| Answer-mass gate | Yes | No | No | No | No |
| GSM8K delta | +13pp | -8.8pp | +1.4pp | N/A | N/A |
| Max iterations | 512+ | Fixed | N/A | Fixed | 3 |
Known Limitations
References
- Hao et al. (2024). "Training Large Language Models to Reason in a Continuous Latent Space" (COCONUT)
- Xu et al. (2025). "SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs"
- McLeish et al. (2025). "Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence"
- Geiping et al. (2025). "Scaling Up Test-Time Compute with Latent Reasoning"
- Belitsky et al. (2025). "KV Cache Steering for Controlling Frozen LLMs"
- Sun et al. (2024). "You Only Cache Once: Decoder-Decoder Architectures for Language Models" (YOCO)
- Goyal et al. (2024). "Think before you speak: Training Language Models With Pause Tokens"
- Zelikman et al. (2024). "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking"
- Lys et al. (2026). "Inner Loop Inference for Pretrained Transformers"
- Graves (2016). "Adaptive Computation Time for Recurrent Neural Networks"
Citation
@misc{closed-thought-llm-2026,
  title={Closed-Thought LLM: Training-Free Latent Reasoning for Frozen Language Models via Split-Layer Generation},
  author={Shiv and Claude Opus 4.6},
  year={2026},
  note={Research prototype. One of us wrote the code, the other debugged at 3am. We'll let you guess which is which.}
}