The Axes
Organizing information is a little challenging. Like, consider your personal notes. How do you keep them neat and tidy? When there's that thing you remember writing down, but not exactly where or how you phrased it, do you just full-text search? You search filenames too, right? And of course you use embeddings in case you can only remember something semantically similar to what you need. We're just talking about searching for a simple fact here, so it's obviously pretty easy; it's not like we're worried about relationships, or things that change over time, or actually tracking why something happened. You know what, let's just not keep notes. I'm sure we'll remember it if it's really that important.
One major annoyance is that even after we've freed ourselves from the onus of writing things down, we're still plagued by these folks marketing their techniques and software and tools to make tracking all these details easier. The problem is that rather suddenly, like right now, there's real potential behind these ideas being pitched at you. It's tricky to see at first because it's a combination of new and old. LLMs are of course the big New Thing. Next in line are embeddings. Hiding in both their shadows are a couple decades of graph theory. They're all standing on a well-aged foundation of full-text search that's so positively boring at this point that it's often tragically overlooked.
Sadly, nobody marketing their product, whether they're chasing mind share or money, is keen to teach you this taxonomy. That's where the clarity is though. You need to know the parts before you can know whether any given picture being painted for you is incomplete. Building a useful context lake is bounded by your understanding of the parts it's made of.
The first axis is structure: how your information is digested. Our doomed system of informal notes usually lives and dies on the first tier, basic chunks of text kept in something like files. Hidden in plain sight in those chunks is a hierarchy. Chunks contain headers, documents contain sections, files are organized into trees. Chunks can be explicitly annotated with relations, think hyperlinks. More interesting though is that chunks can yield much deeper relationships if you extract entities with LLMs. Those entities can be modeled into domain ontologies. Chunk metadata allows temporal tracking of change over time. The ultimate achievement is true causal modeling. There's an entire journey here from dumb text to modeling reality.
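The tiers above stack in one record. Here's a minimal sketch of what that looks like as a data model; the field names and the triple format are hypothetical, not any real system's schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    id: str
    text: str                                  # tier 1: the raw chunk of text
    parent: Optional[str] = None               # tier 2: hierarchy (section, file, tree)
    links: list = field(default_factory=list)  # tier 3: explicit relations, i.e. hyperlinks
    entities: list = field(default_factory=list)
    # tier 4: LLM-extracted (subject, relation, object) triples
    valid_from: Optional[str] = None           # tier 5: temporal metadata
    valid_to: Optional[str] = None

# An illustrative note, with made-up paths and dates.
note = Chunk(
    id="notes/infra.md#3",
    text="We moved the queue to Kafka after the outage.",
    parent="notes/infra.md",
    links=["notes/postmortem.md#1"],
    entities=[("queue", "migrated_to", "Kafka")],
    valid_from="2023-06-01",
)
```

Each tier is additive: you can stop at bare text and still have a working system, and every field you fill in unlocks a richer kind of query later.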
The second axis is about where the intelligence lives. This is where we learn the most from the data lake. Every bit of compute we spend during ingestion makes every future query faster. It also saddles you with assumptions and couples you to how good your tools were at that initial moment of addition. Every bit of compute spent at query time is an expense with a very short ROI horizon; in return you're free to scale that cost up and down on the fly and always use the latest and most sophisticated tools available. There's no right answer here and there never will be. You mix these two the best you can and stay flexible.
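The tradeoff is easiest to see in code. Below is a toy sketch where `embed` is a stand-in for a real embedding model; the point is only where the cost lands, ingest-time on the corpus versus query-time on every search:

```python
def embed(text: str) -> list:
    # Stand-in for a real embedding model: hash characters into a
    # tiny normalized vector. Expensive in real life; trivial here.
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

# Ingest-time compute: paid once per document. Queries get fast,
# but these vectors are frozen with whatever model you had that day.
corpus = ["kafka outage notes", "vacation plans", "queue migration"]
index = [(doc, embed(doc)) for doc in corpus]

def query(q: str) -> str:
    # Query-time compute: one embed call per search. It scales with
    # traffic, but you can swap in a better model tomorrow.
    qv = embed(q)
    dot = lambda dv: sum(a * b for a, b in zip(qv, dv))
    return max(index, key=lambda pair: dot(pair[1]))[0]
```

Re-embedding the corpus with a new model means re-running the ingest loop; improving the query side means changing one function. That asymmetry is the whole axis.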
The third axis is retrieval. This is the other side of the first axis, structure. There's a loose symmetry here that's important. Retrieval is where the tools come together to make something bigger than the sum of the parts, and you need the higher tiers of structure to enable the higher tiers of retrieval. Vector similarity handles semantic search. Lexical search is an old reliable workhorse that casts a broad net. Graph traversal builds on extracted entities and relationships. An agent sits at the top and drives the process. The agent is technically optional. You could run these retrieval queries yourself, targeting different angles and approaches to the search problem, then comb through and combine the results to find the next iteration and the eventual conclusion. It only costs literally all your time. That's why the agent is the capstone, and why this combination suddenly matters.
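The combining step is worth making concrete. One common way to merge ranked lists from different strategies is reciprocal rank fusion; the stores and document names below are toy stand-ins, and an agent would run this in a loop, rewriting the query between rounds:

```python
def rrf(rankings: list, k: int = 60) -> list:
    # Reciprocal rank fusion: a document ranked highly by any
    # strategy accumulates score; agreement across strategies wins.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Three prongs, each returning its own ranked hits (illustrative names).
lexical  = ["doc_kafka", "doc_queue", "doc_outage"]   # keyword matches
semantic = ["doc_outage", "doc_postmortem"]           # embedding neighbors
graph    = ["doc_postmortem", "doc_kafka"]            # entity-linked docs

merged = rrf([lexical, semantic, graph])
```

Each prong alone misses something: the lexical list never saw `doc_postmortem`, the graph walk never saw `doc_queue`. The fused list has all of them, ordered by cross-strategy agreement, which is exactly the drudgery the agent automates.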
The fourth axis is maintenance. You don’t have to build a context lake that digests your organization’s data into sophisticated multi-tiered structures and extracts deeper meaning and causality. You don’t have to maintain a multi-pronged retrieval strategy coordinated by agentic LLMs. That shit is all expensive. That’s also why it’s the new moat. Some kids in a garage may be able to rebuild your software now with a few million tokens. Nobody else can ever apply this perfect storm of tools to your organization’s unique data.