---
title: Building Babel - a fuzzy LLM vs the OS
categories: [Thoughts]
tags: [os]
math: true
---
*A post about reliability, memory, and the compiler we didn’t mean to write.*

When people talk about “prompt engineering,” it often sounds like a bag of tricks: write clearer instructions, add examples, constrain the format, keep history short, and pray. But if you zoom out, the pattern looks less like copywriting and more like systems engineering. We’re trying to run workloads on a machine whose “CPU” is probabilistic, whose “RAM” is fixed-size, and whose caching behavior depends on keeping the same prefix intact.
## 0 — The foundation: why the OS analogy is structurally correct

### 0.1 A minimal machine model

If you strip away the marketing, an LLM session is a constrained compute device with:
A practical theoretical model looks like this:

* Let $X$ be the set of possible contexts (token sequences) with max length $N$.
* Let $Y$ be token outputs.
* The model implements a stochastic policy:

$$
\pi(y \mid x)
$$

where $x \in X$.

In each interaction, you append some new tokens to $x$; the model emits tokens $y$, producing a new context $x' = \text{append}(x, y)$; then truncation/packing happens due to the context limit $N$.
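The append-then-truncate dynamics above can be sketched as a toy state machine; `N`, `step`, and the token lists are illustrative stand-ins, not any real API:

```python
# Toy sketch of the bounded-context model: a context is a token sequence
# capped at N; each turn appends user and model tokens, then truncates
# the oldest tokens to fit the limit.

N = 8  # tiny context limit for demonstration

def step(context, user_tokens, model_tokens):
    """One interaction: x' = append(x, y), then truncate/pack to N."""
    new_context = context + user_tokens + model_tokens
    return new_context[-N:]  # keep only the most recent N tokens

ctx = []
ctx = step(ctx, ["sys", "prompt"], ["ok"])
ctx = step(ctx, ["do", "task", "now"], ["done", "task", "result"])
print(len(ctx))  # never exceeds N, no matter how many turns run
```

However long the session runs, the buffer stays bounded; what varies is *which* tokens survive, which is exactly why memory policy matters.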
From an OS perspective, the key point is not stochasticity. The key point is **boundedness**:
That’s why memory management dominates in practice.

### 0.3 “Main context ⇒ RAM”: the working-set equivalence

In OS terms, RAM is defined by three properties:
An LLM context window has exactly those properties:

* bounded capacity: fixed token limit $N$
* fast access: everything in-context is “directly addressable” by attention
* content determines behavior: the probability distribution $\pi(\cdot\mid x)$ changes when $x$ changes

That’s enough to justify the equivalence “context behaves like RAM,” even though the representation isn’t bytes.
From an OS dev perspective, this creates an optimization target: keep the “kernel prefix” stable to maximize cache locality across turns.
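As a sketch, with an invented `KERNEL_PREFIX` and `build_prompt` (not a real API), prefix-stable assembly means the prompt is only ever appended to, so consecutive turns share a long common prefix that a KV/prefix cache can reuse:

```python
import os

# Keep the "kernel prefix" byte-identical across turns and make history
# append-only, so a prefix cache can skip recomputing the shared part.

KERNEL_PREFIX = (
    "SYSTEM: You are the task compiler.\n"
    "TOOLS: read, write, search\n"
)  # frozen at session start; never edited afterwards

def build_prompt(history, user_msg):
    # Append-only: nothing before the cache boundary is ever rewritten.
    return KERNEL_PREFIX + "".join(history) + "USER: " + user_msg + "\n"

p1 = build_prompt([], "list files")
p2 = build_prompt(["USER: list files\nASSISTANT: ok\n"], "open a.txt")

# The shared prefix is the part a prefix cache can reuse across turns:
shared = os.path.commonprefix([p1, p2])
print(shared.startswith(KERNEL_PREFIX))  # True: the kernel is always reused
```

Editing anything inside the kernel prefix (even a timestamp) would invalidate the shared prefix and force a full re-read on every turn.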
### 0.6 “Skills compiler” is justified by separation of concerns

OS devs separate *policy* from *mechanism*:
---

## Part 1 — Reliability: the fuzzy ALU problem, and an ECC-shaped solution
In a classic machine, the critical property is that execution is deterministic: given the same instruction stream and machine state, you get the same result. That’s what makes debugging possible, and it’s why “bit flips” are an exceptional event handled by ECC, parity checks, and redundancy.

LLMs invert that. The core model is best understood as a conditional distribution $P(y \mid x)$: the next token depends on the prompt/context. Even if you force deterministic decoding, the *system-level behavior* remains fragile because the mapping from a messy human request to an internal strategy is not explicit and not stable. Small context changes, minor phrasing differences, or irrelevant baggage in the prompt can flip the “mode” the model enters. In practice, this looks like the ALU occasionally returning the wrong result, except the “wrongness” is semantic, not bit-level.

A direct way to improve reliability is to reduce the amount of “semantic work” the model must do while it is producing final outputs. Instead of asking the LLM to execute tasks in free-form language, we ask it to **compile** the request into a small set of **deterministic primitives**. Then we run those primitives in a runtime we control.
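A minimal sketch of this split, with an invented two-instruction ISA and a hard-coded plan standing in for the LLM’s compilation output:

```python
# The LLM (stubbed here) emits a program over a tiny instruction set;
# a deterministic runtime executes it and records a per-instruction trace.

def run(program):
    env, trace = {}, []
    for op, *args in program:
        if op == "SET":            # SET name value
            env[args[0]] = args[1]
            trace.append((op, args, "ok"))
        elif op == "CONCAT":       # CONCAT dst a b
            env[args[0]] = env[args[1]] + env[args[2]]
            trace.append((op, args, "ok"))
        else:
            trace.append((op, args, "error: unknown op"))
            break                  # strict runtime: stop on a bad instruction
    return env, trace

# A "compiled plan" the model might emit for "greet the user by name":
plan = [("SET", "greeting", "hello "), ("SET", "name", "ada"),
        ("CONCAT", "out", "greeting", "name")]
env, trace = run(plan)
print(env["out"])  # deterministic result: "hello ada"
```

Rerunning the same plan always yields the same result and the same trace; the only step that can vary between attempts is the compilation that produced `plan`.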
This architecture deliberately moves uncertainty into one place: compilation. Execution becomes observable and mostly deterministic.

### Why this increases reliability

Let $U$ be the user request (the “spec”), $C$ the compiler (LLM), $P$ the produced plan (ISA program), $R$ the runtime, and $O$ the observed output. Let $V(U,O)\in\{0,1\}$ be a checker that says whether the output satisfies the request (even a weak checker helps).

Because the runtime is deterministic and instrumented, the overall success probability decomposes conceptually into:
Informally:

$$
\Pr[V=1] \approx \Pr[P\ \text{correct}] \cdot \Pr[V=1\mid P\ \text{correct}]
$$

If your runtime is strict and your primitives are deterministic, $\Pr[V=1\mid P\ \text{correct}]$ is high. That’s the central win: you turn “LLM unpredictability everywhere” into “LLM uncertainty mainly at compilation time.” Once the failure surface is concentrated, you can apply ECC-like techniques there.
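With made-up numbers, the decomposition behaves like this:

```python
# Illustrative only: plug invented probabilities into the decomposition
# to see where the failure mass concentrates.
p_plan_correct = 0.90        # Pr[P correct]: the fuzzy compilation step
p_pass_given_correct = 0.99  # Pr[V=1 | P correct]: strict deterministic runtime
p_success = p_plan_correct * p_pass_given_correct
print(round(p_success, 3))   # 0.891: nearly all failures come from compilation
```

Improving the runtime beyond 0.99 buys almost nothing; improving plan correctness is where effort pays off, which is exactly what the ECC-style redundancy below targets.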
### ECC for compilation: redundancy + decoding

ECC works by adding redundancy and then decoding based on constraints that detect and correct errors. You can do the same for plans:

1. generate multiple candidate plans $P_1, \dots, P_k$
2. statically validate them (types, allowed effects, resource access)
3. partially execute cheap prefixes if needed
4. select the plan that passes checks / yields valid outputs
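A toy version of steps 1–4, with a hypothetical validator and hard-coded candidates standing in for multiple LLM samples:

```python
# Sample k candidate plans, reject any that fail static validation,
# and pick the first survivor; surface an error instead of guessing.

ALLOWED_OPS = {"SET", "CONCAT"}

def validate(plan):
    """Static check: only allowed ops, with the right argument count."""
    arity = {"SET": 2, "CONCAT": 3}
    return all(op in ALLOWED_OPS and len(args) == arity[op]
               for op, *args in plan)

def select_plan(candidates):
    for plan in candidates:
        if validate(plan):
            return plan
    return None  # all candidates rejected: report, don't execute blindly

candidates = [
    [("DELETE_EVERYTHING",)],                       # invalid: unknown op
    [("SET", "x", "1"), ("CONCAT", "y", "x", "x")]  # passes validation
]
print(select_plan(candidates) == candidates[1])  # True
```

The validator plays the role of the ECC decoder: redundancy (several samples) plus constraints (the ISA’s types and effects) detect and discard corrupted plans.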
```
bump_ptr = checkpoint[S]   <-- bulk free (arena reset)
```
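A sketch of that stage-checkpoint arena applied to a token buffer; the class and method names are invented for illustration:

```python
# Each stage records the bump pointer at entry; committing a stage frees
# everything the stage allocated by restoring that pointer, keeping only
# a compact summary (the arena-reset pattern from the diagram above).

class ContextArena:
    def __init__(self):
        self.tokens = []          # the "arena": one growing buffer
        self.checkpoint = {}      # stage -> bump pointer at stage entry

    def enter_stage(self, stage):
        self.checkpoint[stage] = len(self.tokens)

    def append(self, toks):
        self.tokens.extend(toks)  # bump allocation: append only

    def commit(self, stage, summary):
        # bulk free: reset the bump pointer, then keep only the summary
        self.tokens = self.tokens[: self.checkpoint[stage]]
        self.tokens.extend(summary)

arena = ContextArena()
arena.enter_stage("S")
arena.append(["step1", "scratch", "step2", "scratch"])
arena.commit("S", ["S:done"])
print(arena.tokens)  # stage scratch freed; only the summary remains
```

Like an arena allocator, there is no per-object bookkeeping: freeing a whole stage is a single pointer reset, which is what makes the context budget predictable.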
### Why this works

This approach gives two strong properties that are “theoretical” in the systems sense.