Out-of-Context: Constrained Tool-Based Exploration of Context

Longer context windows have not eliminated long‑context failure. In practice, adding more tokens often makes models less reliable. Anthropic summarizes “context rot” as recall degrading as context length increases. Operationally, models struggle when the task requires combining evidence spread throughout the input, rather than retrieving a single “needle.”

Recursive Language Models (RLMs) take inspiration from out-of-core data processing, where systems handle datasets larger than memory by deciding what to load. The approach sidesteps context rot by externalizing the long prompt into an environment (a Python REPL in the prototype) and letting a “root” model interact with that environment: it can peek at slices, filter with regex, partition the text into chunks, and selectively invoke sub-calls over relevant snippets.
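
To make the access pattern concrete, here is a minimal sketch of the idea, not the paper’s implementation; the `llm` callable (a “run a model on this short span” function) and the example regex are stand-ins.

```python
import re

# Sketch only: the long prompt lives outside the model as a plain string,
# and the root model explores it through a few cheap operations.
class ContextEnv:
    def __init__(self, text: str):
        self.text = text  # the externalized long prompt

    def peek(self, start: int, end: int) -> str:
        """Inspect a raw slice without loading everything."""
        return self.text[start:end]

    def grep(self, pattern: str, window: int = 200) -> list[str]:
        """Return a fixed-size window around each regex match."""
        return [self.text[max(m.start() - window, 0):m.end() + window]
                for m in re.finditer(pattern, self.text)]

    def chunks(self, size: int = 4000):
        """Partition the text into contiguous chunks."""
        for i in range(0, len(self.text), size):
            yield self.text[i:i + size]

def answer(env: ContextEnv, question: str, llm) -> str:
    # Narrow with a cheap filter first, sub-call only on surviving
    # snippets, then synthesize. The pattern here is hypothetical.
    snippets = env.grep(r"(?i)deadline|due date")
    notes = [llm(f"{question}\n\nEvidence:\n{s}") for s in snippets]
    return llm(f"{question}\n\nSynthesize from notes:\n" + "\n".join(notes))
```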

The strategy is very effective. GPT‑5 scores 44 on OOLONG (131K tokens) while RLM (GPT‑5) reaches 56.50. On BrowseComp‑Plus at 1,000 documents (6–11M tokens), the base model fails due to context limits, while RLM (GPT‑5) reaches 91.33, outperforming both a summarization agent (70.47) and a retrieval‑augmented CodeAct (+BM25) baseline (51.00) in their setup. The ablations sharpen the mechanism: the “REPL, no sub‑calls” variant can still scale beyond the context window (suggesting that externalized access is foundational), while recursive sub‑calling produces the big gains on information‑dense tasks like OOLONG and OOLONG‑Pairs.

The paper’s results are a strong argument for “explore, don’t stuff,” but the prototype implementation also makes clear why a production‑grade version should not look like “freeform Python written by an LLM, indefinitely.” The authors note that their calls are sequential/blocking and observe wide runtime variance, especially for RLMs. They also explicitly highlight high variance in cost due to variable trajectory lengths. Finally, they observe a practical tradeoff point—base models can outperform RLMs in small‑context regimes—implying that always-on "agentic" exploration is not always optimal.

This is where the constructive program becomes the main story. If RLMs demonstrate that externalized access unlocks effective context, the production question is how to keep that advantage while eliminating the failure modes: unsafe execution, cost blow‑ups, and unpredictable latency.

The first step is to replace “arbitrary Python” with a constrained tool surface that captures the small set of operations RLMs repeatedly rediscover anyway: structured grep/filter, chunk→map→reduce, summarization/compression, citation‑grade evidence extraction, and bounded “small model on this span” calls. This does not guarantee correctness, but it does bound the space of failure: tools can be tested, cached, parallelized, and priced; and the agent can be prevented from inventing brittle code paths. The real tradeoff is expressivity: a tight DSL can undershoot the weird edge cases where freeform code is genuinely useful. The practical compromise is to start constrained, expand the toolset based on observed needs, and reserve an explicit escape hatch (e.g., a controlled “inspect raw text slice” operation) rather than granting the full Python standard library.
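
What might that surface look like? A minimal sketch, with tool names and budgets that are purely illustrative: each tool is a typed, budgeted operation, so calls can be validated, cached, and priced before anything executes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    max_calls: int       # per-query call budget
    max_span_chars: int  # largest span the tool may touch

# Hypothetical tool surface; the point is that it is finite and priced.
TOOLS = {
    "grep":             Budget(max_calls=50, max_span_chars=2_000),
    "map_reduce":       Budget(max_calls=5,  max_span_chars=8_000),
    "summarize":        Budget(max_calls=20, max_span_chars=8_000),
    "extract_evidence": Budget(max_calls=20, max_span_chars=4_000),
    "llm_on_span":      Budget(max_calls=30, max_span_chars=4_000),
    "peek_raw":         Budget(max_calls=10, max_span_chars=1_000),  # escape hatch
}

def validate(tool: str, span: str, calls_so_far: int) -> None:
    """Reject out-of-contract calls before execution, not after."""
    budget = TOOLS.get(tool)
    if budget is None:
        raise ValueError(f"unknown tool: {tool}")
    if calls_so_far >= budget.max_calls:
        raise RuntimeError(f"{tool}: call budget exhausted")
    if len(span) > budget.max_span_chars:
        raise ValueError(f"{tool}: span exceeds {budget.max_span_chars} chars")
```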

Second, route before you recurse. Many user queries are closer to “extract a span” than “conduct a research program.” A router—heuristic or learned—can send trivial extraction to deterministic string operations, send “needle” queries to cheap lexical search plus verification, and reserve full tool‑based exploration for queries whose structure implies distributed evidence or aggregation. This directly addresses the paper’s own observation that RLMs can underperform in the small‑input regime: if you can predict that regime cheaply, you can avoid paying for it. The risk is misrouting; the standard mitigation is to treat routing as a ranked shortlist with a verify step (“pipeline A says X with evidence; pipeline B says Y with evidence; reconcile”), rather than a single irreversible decision.
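
A heuristic version of that router fits in a few lines; the cues and thresholds below are made up for illustration, and the output is a ranked shortlist rather than a single irreversible choice.

```python
import re

# Illustrative cues for "distributed evidence / aggregation" queries.
AGGREGATION_CUES = re.compile(
    r"\b(how many|count|compare|across|all of|every|trend|total)\b", re.I)

def route(query: str, corpus_tokens: int) -> list[str]:
    """Return pipelines in rank order; a verify step falls through them."""
    pipelines = []
    if corpus_tokens < 30_000:
        pipelines.append("direct")            # small-context regime: just stuff it
    if AGGREGATION_CUES.search(query):
        pipelines.append("tool_exploration")  # distributed-evidence shape
    else:
        pipelines.append("lexical_search")    # likely a needle query
    pipelines.append("tool_exploration")      # always available as a fallback
    # Deduplicate while preserving rank order.
    return list(dict.fromkeys(pipelines))
```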

Third, separate “find candidates” from “prove the answer.” A hybrid RAG→tool‑exploration pipeline is often the best of both worlds: retrieval narrows a 10M‑token corpus to a manageable candidate set, then constrained tools perform deep synthesis and evidence‑grounded verification. The failure mode is obvious—retrieval can miss the key document—so the hybrid must be tuned for recall (multiple retrievers, generous top‑k, query expansion) and backed by verification that can ask for a second retrieval pass when evidence is weak or contradictory. This framing also makes comparisons to structured long‑sequence baselines more principled: approaches like LLM×MapReduce formalize chunking and aggregation protocols to reduce information loss across splits. RLMs suggest adaptive exploration can beat fixed protocols on diverse tasks, but fixed protocols still matter as strong, predictable baselines and as components inside a constrained toolset.
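
In code, the shape of that pipeline might look like the sketch below; the retriever, explorer, and verifier interfaces are assumed for illustration, not taken from the paper.

```python
def hybrid_answer(query, retrievers, explore, verify, expand_query, top_k=50):
    """Recall-tuned retrieval narrows the corpus; constrained tools synthesize;
    weak or contradictory evidence triggers one more retrieval pass."""
    # Union of multiple retrievers with a generous top_k favors recall.
    candidates = {doc.id: doc
                  for r in retrievers
                  for doc in r.search(query, k=top_k)}

    answer, evidence = explore(query, list(candidates.values()))
    if verify(answer, evidence):
        return answer, evidence

    # Evidence is weak: expand the query and retry once with a wider net.
    for q in expand_query(query):
        for r in retrievers:
            for doc in r.search(q, k=top_k):
                candidates.setdefault(doc.id, doc)
    return explore(query, list(candidates.values()))
```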

Finally, once the action space is finite, learning becomes realistic. Tool use can be trained as a policy: maximize accuracy while penalizing cost and latency. In that regime, “variance” stops being an annoyance and becomes an explicit optimization target: you can train against p95 blow-ups rather than admiring a pretty median. This is where the RLM paper’s most interesting claim leads: the trajectory is learnable, but only if we make “trajectory” a well-defined, auditable object, meaning tool calls with bounded semantics rather than an unbounded program.
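
As a toy example of what that objective could look like (the weights are invented), the reward below scores a whole trajectory and explicitly punishes tail latency rather than just the mean.

```python
def trajectory_reward(correct: bool, cost_usd: float, latency_s: float,
                      latency_p95_s: float,
                      w_cost: float = 0.5, w_lat: float = 0.1,
                      w_tail: float = 2.0) -> float:
    """Score one trajectory: accuracy minus cost, latency, and tail penalties."""
    reward = 1.0 if correct else 0.0
    reward -= w_cost * cost_usd
    reward -= w_lat * latency_s
    # Penalize trajectories beyond the fleet's observed p95, so the policy
    # is trained against blow-ups, not just a pretty median.
    if latency_s > latency_p95_s:
        reward -= w_tail * (latency_s - latency_p95_s)
    return reward
```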

Put together, the contribution is a reframing and a roadmap. RLMs show that long‑context capability is less about ever‑larger windows and more about controllable access patterns to externalized context. The mature version of that idea is constrained tool‑based exploration, routed intelligently, executed in parallel where possible, and optimized against tail risk.
