<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.1">Jekyll</generator><link href="https://mdcramer.github.io/feed/llm-personalization.xml" rel="self" type="application/atom+xml" /><link href="https://mdcramer.github.io/" rel="alternate" type="text/html" /><updated>2026-04-27T11:57:27-07:00</updated><id>https://mdcramer.github.io/feed/llm-personalization.xml</id><title type="html">Hackin’ and Tinkerin’ | Llm-personalization</title><subtitle>A collection of blogs related to some of my work on GitHub and elsewhere</subtitle><author><name>Mark Cramer</name></author><entry><title type="html">Evaluations</title><link href="https://mdcramer.github.io/llm-personalization/eval/" rel="alternate" type="text/html" title="Evaluations" /><published>2026-04-26T11:00:00-07:00</published><updated>2026-04-26T00:00:00-07:00</updated><id>https://mdcramer.github.io/llm-personalization/eval</id><content type="html" xml:base="https://mdcramer.github.io/llm-personalization/eval/"><![CDATA[<p>Coming soon but looking to leverage <a href="https://prefeval.github.io/" target="_blank">PrefEval</a>…</p>]]></content><author><name>Mark Cramer</name></author><category term="personalization" /><category term="eval" /><category term="evaluations" /><summary type="html"><![CDATA[Measuring the performance of the personalized LLM.]]></summary></entry><entry><title type="html">Making memories</title><link href="https://mdcramer.github.io/llm-personalization/memories/" rel="alternate" type="text/html" title="Making memories" /><published>2026-04-26T08:45:00-07:00</published><updated>2026-04-26T00:00:00-07:00</updated><id>https://mdcramer.github.io/llm-personalization/memories</id><content type="html" xml:base="https://mdcramer.github.io/llm-personalization/memories/"><![CDATA[<p>After getting the <a href="/llm-personalization/setup/">setup</a> in place, the next step was representing what the system learns about the user.
For now, I have focused on a very specific kind of memory: <strong>preferences</strong>. The system is not yet trying to build a complete user model with biographical facts or long-lived profile information (more below). Instead, it is looking for preference signals in the user’s prompts and using those signals as the first layer of personalization.</p>

<p>This was a deliberate choice to mirror the <a href="/rank-dynamics/selected-for-oral-presentation-at-sigir-09/" target="_blank">real-time implicit personalization</a> we worked on at <a href="/rank-dynamics/" target="_blank">Rank Dynamics</a>. The goal is to observe signals that emerge during interaction and use them quickly to change future behavior.</p>

<h2 id="extracting-preferences">Extracting preferences</h2>

<p>The first step is extracting preferences from the prompts. This is accomplished with a separate backend LLM call whose job is intentionally narrow: return a JSON object with two arrays, <code class="language-plaintext highlighter-rouge">likes</code> and <code class="language-plaintext highlighter-rouge">dislikes</code>, each containing short text snippets. If the user says something like “I like strawberries but I don’t like very sweet desserts,” the extractor attempts to isolate those preferences and store them as separate memory items.</p>
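<p>As a sketch, that narrow contract might look like the following; the instruction wording and the <code class="language-plaintext highlighter-rouge">parse_preferences</code> helper are illustrative, not the prototype’s actual code:</p>

```python
import json

# Hypothetical sketch of the extractor's narrow contract: the backend LLM
# is asked to return ONLY a JSON object with "likes" and "dislikes" arrays.
EXTRACTOR_INSTRUCTIONS = (
    "Identify any user preferences in the message. Respond with JSON only: "
    '{"likes": [...], "dislikes": [...]} using short text snippets.'
)

def parse_preferences(raw_reply: str) -> dict:
    """Validate the model's reply, falling back to empty lists on bad output."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {"likes": [], "dislikes": []}
    return {
        "likes": [s.strip() for s in data.get("likes", []) if isinstance(s, str)],
        "dislikes": [s.strip() for s in data.get("dislikes", []) if isinstance(s, str)],
    }

# A reply for "I like strawberries but I don't like very sweet desserts"
# might come back as:
reply = '{"likes": ["strawberries"], "dislikes": ["very sweet desserts"]}'
```

<p>Guarding the parse this way keeps a malformed model reply from corrupting the memory store; the worst case is simply that nothing is remembered from that turn.</p>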

<p>The current system is not trying to infer personality traits, stable facts or complex emotional states. It is only trying to catch reasonably clear preference evidence, which makes the behavior easier to inspect and debug. Eventually, I will expand to include factual memories about the user as well. Facts, such as where a user lives, what they do for work, or what project they are working on, are not the same as tastes or preferences and, as such, should probably decay much more slowly.</p>

<h2 id="whats-the-vector-victor">What’s the vector, Victor?</h2>

<p>Once a preference is extracted, it is converted into a vector representation using OpenAI embeddings. An important design choice here is that the embedded text is normalized to the semantic content of the preference rather than the full surface form. In other words, the goal is to embed something like “strawberries,” not “I like strawberries” or “I dislike strawberries.” The signed weight is stored separately.</p>

<p>This matters because I want positive and negative evidence about the same concept to land in roughly the same part of vector space. If the user says “I like strawberries” in one moment and “I don’t like strawberries” in another, the prototype should treat those as two pieces of evidence about the same underlying topic, not as unrelated memories. The vector is meant to capture semantic similarity, while the weight captures direction and strength.</p>
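<p>A minimal sketch of this split between semantic vector and signed weight, with toy vectors standing in for real OpenAI embeddings:</p>

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    concept: str          # normalized semantic content, e.g. "strawberries"
    vector: list[float]   # embedding of the concept text (stubbed here)
    weight: float         # signed: positive for likes, negative for dislikes

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Both statements embed the SAME concept text, so their vectors coincide;
# only the sign of the weight differs. (Real vectors would come from an
# embeddings endpoint; these toy values just illustrate the layout.)
strawberry_vec = [0.9, 0.1, 0.0]
like = Memory("strawberries", strawberry_vec, +1.0)
dislike = Memory("strawberries", strawberry_vec, -1.0)
```

<p>Because both records sit at the same point in vector space, later retrieval by similarity will surface them together, and their signed weights can then be combined as conflicting evidence about one topic.</p>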

<p>At the moment, those weights are simple. Preferences are stored with signed values and the decay behavior is configurable, but I have not yet built the more intelligent weighting scheme that I ultimately want. For now, the important thing is that the prototype already separates semantic representation from preference direction, which gives it a useful structure for future refinement.</p>

<h2 id="clustering-related-memories">Clustering related memories</h2>

<p>Once memories are embedded, the next task is grouping related ones together. For this, I chose DBSCAN because, unlike <a href="/apple-2-blog/k-means/" target="_blank">k-means</a>, it does not require pre-specifying the number of clusters. That is attractive because the number of memory themes depends on the user’s behavior rather than on an arbitrary design decision. If the user has expressed many related preferences, clusters should emerge naturally. If not, memories remain unclustered.</p>
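<p>In practice one would reach for a library implementation such as scikit-learn’s <code class="language-plaintext highlighter-rouge">DBSCAN</code>; this pure-Python toy version just illustrates the key property that only a distance threshold (<code class="language-plaintext highlighter-rouge">eps</code>) and a density threshold (<code class="language-plaintext highlighter-rouge">min_samples</code>) are specified, never a cluster count:</p>

```python
import math

def dbscan(vectors, eps=0.3, min_samples=2):
    """Toy DBSCAN over cosine distance; returns one label per vector (-1 = noise)."""
    def cos_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    n = len(vectors)
    # Neighborhoods include the point itself, as in the standard algorithm.
    neighbors = [[j for j in range(n) if cos_dist(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise, though it may later join a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])  # core point: keep expanding
    return labels

# Two similar vectors form a cluster; the unrelated one stays unclustered noise.
labels = dbscan([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
```

<p>Lone memories come back labeled <code class="language-plaintext highlighter-rouge">-1</code>, which is exactly the desired behavior: no theme is invented until enough related evidence accumulates.</p>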

<p>Conceptually, the clustering step is meant to discover topics of preference rather than isolated statements. A user may express the same taste repeatedly in slightly different ways, or may express related attitudes toward a broader concept. Clustering makes it possible to start treating those as part of a shared pattern rather than as independent rows in a database.</p>

<p>This is also where the prototype begins to move beyond simple prompt injection. Instead of just remembering isolated likes and dislikes, it starts building grouped preference areas. That is important because a personalized system should ideally react not just to exact repetition, but to related evidence that accumulates over time.</p>

<h2 id="a-cluster-by-any-other-name">A cluster by any other name…</h2>

<p>A cluster of vectors is useful computationally, but for the system to use clustered memories in prompt injection, and for the interface to make sense to users, each cluster needs a readable description.</p>

<p>The current approach sends the aggregated cluster memories, along with their signed scores, to the LLM and asks it to generate a short directional description. In contrast to the individual memories, the prompt explicitly asks the model to infer the underlying preference direction that best explains the cluster as a whole. So the goal is not merely to name a concept like “mechanical keyboards,” but to produce something more like a preference description, such as a preference for quieter keyboard sounds or a dislike of loud, clicky ones.</p>

<p>The prompt also asks the model to judge whether each memory in the cluster supports or opposes the final cluster description. That matters because positive and negative memories about related concepts may still point toward the same higher-level preference. This gives the prototype a first pass at interpreting preference structure rather than merely storing raw observations.</p>
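<p>A hypothetical sketch of how such a labeling request might be assembled — the wording here is mine, not the prototype’s actual prompt:</p>

```python
def build_cluster_prompt(memories):
    """Assemble the cluster-labeling request.

    memories is a list of (text, signed_score) pairs; the exact wording
    below is an illustrative stand-in for the prototype's prompt."""
    lines = [f"- {text} (score: {score:+.2f})" for text, score in memories]
    return (
        "Here are related preference memories with signed scores "
        "(positive = like, negative = dislike):\n"
        + "\n".join(lines)
        + "\n\nInfer the single preference direction that best explains the "
        "cluster as a whole, and state it in one short sentence. Then, for "
        "each memory, say whether it SUPPORTS or OPPOSES that description."
    )

prompt = build_cluster_prompt([
    ("loud clicky keyboards", -0.8),
    ("quiet low-profile switches", +0.6),
])
```

<p>Passing the signed scores alongside the texts is what lets the model resolve a dislike of one concept and a like of a related one into a single directional description.</p>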

<h2 id="real-time-signals-not-full-identity">Real-time signals, not full identity</h2>

<p>One thing worth emphasizing is that this is still a very lightweight memory system. It is not trying to become a deep, persistent identity model of the user. That may come later in some form, especially once factual memories are added, but the current behavior is closer to short-horizon adaptation. In that sense, it remains aligned with the older Rank Dynamics intuition of using immediate interaction signals to improve relevance in real time.</p>

<p>That is also why decay matters. Some memories should fade if they are not reinforced. Others should become stronger as evidence accumulates. At the moment the decay mechanism is still relatively blunt, and one of the next steps will be to make those decay weights more intelligent. I expect factual memories, when introduced, to behave very differently from preference memories in this respect.</p>
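<p>One simple way to express decay is a half-life, so a memory loses half its strength after a fixed period unless reinforced. The numbers below are illustrative defaults, not the prototype’s configured values:</p>

```python
import math  # stdlib only; 0.5 ** x could equally be written math.exp(-x * math.log(2))

HALF_LIFE_DAYS = 30.0  # hypothetical default; the prototype makes decay configurable

def decayed_weight(weight, age_days, half_life_days=HALF_LIFE_DAYS):
    """Exponential decay: a memory loses half its strength every half-life."""
    return weight * 0.5 ** (age_days / half_life_days)

def reinforce(weight, evidence, cap=3.0):
    """New evidence of the same sign strengthens the memory, up to a cap."""
    return max(-cap, min(cap, weight + evidence))
```

<p>Under this scheme, factual memories could later get a much longer half-life than preference memories simply by storing a different <code class="language-plaintext highlighter-rouge">half_life_days</code> per memory type.</p>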

<h2 id="prompt-injection-today-and-tomorrow">Prompt injection today and tomorrow</h2>

<p>Right now, the prompt injection layer is still simple. The system stores and clusters the memories, but the live chat prompt is still primarily injected with raw remembered likes and dislikes when they seem relevant. In other words, the more sophisticated cluster descriptions already exist, but they are not yet the main representation used to guide the assistant.</p>

<p>That will eventually change. The direction I want to move in is to inject cluster-level natural language descriptions, augmented with some sense of weight or confidence. Instead of giving the model a flat list of remembered items, the prompt could communicate higher-level preference summaries such as strong tendencies, mild tendencies or mixed evidence around a topic. That would be much closer to the kind of structured, interpretable personalization I have in mind.</p>
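<p>A sketch of what that cluster-level injection could look like, with made-up thresholds mapping net weight onto “strong,” “mild” or “mixed” language:</p>

```python
def describe_strength(weight):
    """Map a net cluster weight to hedged natural language.

    The thresholds are illustrative, not the prototype's actual values."""
    if abs(weight) >= 2.0:
        tier = "strong"
    elif abs(weight) >= 0.5:
        tier = "mild"
    else:
        return "mixed or weak evidence about"
    return f"{tier} {'preference for' if weight > 0 else 'dislike of'}"

def build_system_prompt(clusters):
    """clusters: list of (description, net_weight) -> injected prompt preamble."""
    lines = [f"- {describe_strength(w)} {desc}" for desc, w in clusters]
    return "Known user tendencies (use only when relevant):\n" + "\n".join(lines)

injected = build_system_prompt([
    ("quieter keyboard sounds", 2.5),
    ("very sweet desserts", -0.7),
])
```

<p>The “use only when relevant” framing is one possible guard against the over-eager memory behavior Karpathy describes in the motivation post.</p>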

<p>For now, the prototype is in an intermediate and useful state. It can extract preferences, map them into vector space, cluster related memories and generate directional descriptions for those clusters. That is already enough to make the memory system feel less like a bag of saved strings and more like the beginnings of a personalized representation of user intent.</p>]]></content><author><name>Mark Cramer</name></author><category term="memory" /><category term="memories" /><category term="decay" /><summary type="html"><![CDATA[Extracting user preferences from the prompts and then intelligently decaying them.]]></summary></entry><entry><title type="html">Setup</title><link href="https://mdcramer.github.io/llm-personalization/setup/" rel="alternate" type="text/html" title="Setup" /><published>2026-04-26T06:00:00-07:00</published><updated>2026-04-26T00:00:00-07:00</updated><id>https://mdcramer.github.io/llm-personalization/setup</id><content type="html" xml:base="https://mdcramer.github.io/llm-personalization/setup/"><![CDATA[<p>After laying out the <a href="/llm-personalization/motivation/">motivation</a>, the next question was how to build something simple enough to experiment with but real enough to share.</p>

<p>I did not want a heavy framework or a complex cloud architecture. The goal was not to build a polished product but a working prototype that would let me explore personalization in a concrete way. That meant choosing tools that were familiar, lightweight and easy to modify.</p>

<p>The basic stack ended up being:</p>
<ul>
  <li>a single-page frontend in HTML and JavaScript,</li>
  <li>a small Python backend using Flask,</li>
  <li>OpenAI’s API for the model calls and</li>
  <li>SQLite for local storage.</li>
</ul>

<p>The idea was to keep the application legible. I wanted to be able to understand the whole system, change individual pieces quickly, and avoid spending the early days of the project fighting the scaffolding.</p>

<h2 id="to-stream-or-not-to-stream">To stream or not to stream</h2>

<p>I also chose not to implement streaming at the outset. That was partly a product decision and partly an architectural one. Since I wanted to work in a stack I already knew well, I started with a very simple HTML/JavaScript frontend and a Python Flask backend. Streaming would have been possible in that setup but it would have added complexity and pulled attention toward interface polish rather than the core question of the project. The memory logic is the story, so I kept the interaction loop simple and focused development effort there.</p>

<h2 id="my-coding-partner-codex">My coding partner, Codex</h2>

<p>Codex was my development partner. Rather than writing everything <a href="https://mortalwayfare.com/building-the-game-engine-from-scratch/" target="_blank">from scratch</a> in the traditional way, I used Codex conversationally to sketch the initial structure, generate and revise code, troubleshoot issues, and iteratively add features. That ended up shaping not just the speed of development, but the style of it. Instead of trying to design the entire system up front, I moved in small steps, testing each change and then deciding what to do next. My experience with coding agents is that, like when developing something from scratch, it helps to break the project up into little pieces and stage them out. For a prototype like this, that worked extremely well.</p>

<h2 id="bare-bones">Bare bones</h2>

<p>The first version was intentionally minimal. Before worrying about memory, embeddings, clustering or prompt injection, I just wanted a chatbot that worked. The frontend sends a message to the Flask backend, the backend forwards the request to OpenAI, and the response comes back to the browser. That simple loop established the core application shape and made it possible to layer in personalization later without rethinking everything.</p>

<p>To make that work, I needed an OpenAI API key. A ChatGPT subscription is not the same thing as API access, so I created an API key through the OpenAI platform, configured billing and stored the key as an environment variable rather than hardcoding it into the project. Keeping the key out of the code cleanly separates the application from its credentials and makes it easier to run the same project locally and in production.</p>

<p>Local development was done the old-fashioned way, on localhost. I used a familiar Conda-based Python environment, installed the dependencies, ran the Flask app and iterated from there. This made it easy to test changes quickly and keep the feedback loop short.</p>

<h2 id="sharing-with-the-world">Sharing with the world</h2>

<p>Once the <code class="language-plaintext highlighter-rouge">127.0.0.1:5000</code> local version was stable enough to be interesting, I wanted to put it online so that other people could try it. For hosting, I chose Railway for its simplicity. Railway connects straight to a GitHub repository so changes are deployed automatically with a simple <code class="language-plaintext highlighter-rouge">git push</code>. There was no need to build a deployment pipeline from scratch or think deeply about servers. I just needed a reliable way to turn a local experiment into a public URL.</p>

<p>The application also needed a little production-minded handling because it uses SQLite. On a local machine, SQLite is almost effortless. In the cloud, it raises practical questions: where does the database live, and does it survive restarts and redeploys? Railway’s persistent volume support provided a clean answer that let me keep the lightweight local-database approach while still preserving the prototype’s memory store across deployments.</p>
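<p>A sketch of that arrangement: read the database path from an environment variable so the same code runs against a local file or a mounted volume. The variable name, table schema and fallback path here are hypothetical:</p>

```python
import os
import sqlite3

# On Railway, pointing this at a path inside the persistent volume keeps
# memories across redeploys; locally it falls back to a project-directory file.
# MEMORY_DB_PATH is a hypothetical variable name, not the prototype's actual one.
DB_PATH = os.environ.get("MEMORY_DB_PATH", "memories.db")

def get_connection(path=DB_PATH):
    """Open the database and ensure the (illustrative) memories table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        "id INTEGER PRIMARY KEY, concept TEXT, weight REAL, created_at TEXT)"
    )
    return conn
```

<p>Because the path is the only deployment-specific detail, nothing else in the application has to know whether it is running on localhost or in the cloud.</p>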

<p>Once the app was live on Railway, the next step was to make it feel less temporary with a custom domain name. (I bought <a href="https://mymemochat.com" target="_blank">mymemochat.com</a> because it was available and I didn’t feel like spending days thinking about this.) This, unfortunately, turned out to be one of the more finicky parts of the setup. The DNS provided by the company hosting my domain did not play nicely with Railway’s custom-domain and SSL flow, particularly at the root. Railway expects a CNAME-style setup and some providers’ ALIAS-style behavior is unreliable for this because Railway services sit behind dynamic shared IPs. The practical fix was to move DNS handling to <a href="https://www.cloudflare.com/" target="_blank">Cloudflare</a>. That solved two problems at once:</p>
<ol>
  <li>Cloudflare supports CNAME flattening at the root domain and</li>
  <li>it gave me a more predictable path for SSL.</li>
</ol>

<h2 id="getting-to-work">Getting to work</h2>

<p>One thing I appreciated about this setup is that it preserved the same basic development rhythm. I could still work locally on <code class="language-plaintext highlighter-rouge">127.0.0.1</code>, test changes quickly, and only push when something was ready to be seen publicly. Once the code was pushed to GitHub, Railway would redeploy the updated app. That created a smooth bridge between experimentation and publication, which is what I wanted.</p>

<p>This all may sound ordinary, but when building a prototype whose main novelty lies in behavior rather than infrastructure, there is real value in keeping the surrounding system boring. A simple frontend, a small Flask backend, an API key, a local database, a straightforward host and a custom domain were enough to get the project to the point where the more interesting questions could begin.</p>

<p>Those more interesting questions are really what the project is about. Once the basic shell was in place, I could start focusing on the personalization itself: how memories should be extracted, represented, weighted, clustered, decayed and eventually injected back into future prompts.</p>]]></content><author><name>Mark Cramer</name></author><category term="setup" /><category term="streaming" /><category term="API" /><category term="DNS" /><summary type="html"><![CDATA[Some background into how I got the prototype off the ground.]]></summary></entry><entry><title type="html">Motivation</title><link href="https://mdcramer.github.io/llm-personalization/motivation/" rel="alternate" type="text/html" title="Motivation" /><published>2026-04-19T00:00:00-07:00</published><updated>2026-04-22T00:00:00-07:00</updated><id>https://mdcramer.github.io/llm-personalization/motivation</id><content type="html" xml:base="https://mdcramer.github.io/llm-personalization/motivation/"><![CDATA[<p>At <a href="/rank-dynamics" target="_blank">Rank Dynamics</a>, our team worked on the problem of real-time personalization in search. The idea was simple: a short query rarely captures a person’s true intent and static ranking does a poor job of adapting as that intent becomes clearer. We believed relevance should not be frozen at the moment the query is issued but rather improved as the system learns, implicitly, from the user’s actions.</p>

<p>The core problem was <a href="/rank-dynamics/the-problem-too-many-results/" target="_blank">too many search results</a>. <a href="https://www.microsoft.com/en-us/research/people/teevan/publications/" target="_blank">Teevan</a> et al., in “Beyond the Commons: Investigating the Value of Personalizing Web Search,” put it plainly:</p>

<blockquote>
  <p>“Web queries are very short, and it is unlikely that a two- or three-word query can unambiguously describe a user’s informational goal.”</p>
</blockquote>

<p>As such, we laid out our philosophy for addressing this problem with <a href="/rank-dynamics/the-four-quadrants-of-personalization/" target="_blank">four quadrants of personalization</a>. The quadrants distinguish two axes: signals learned explicitly (through keyword entry) vs. implicitly (through actions and inactions) and those that are captured in real time vs. built up over the long term. In particular contexts, like deep research and <a href="/rank-dynamics/more-pudding/" target="_blank">shopping</a>, the implicit signals turned out to be the most powerful. Critically, however, user intent signals have a half-life. Some aspects of personalization don’t change frequently but most signals should decay quickly.</p>

<p>As a segue from search to LLMs, Andrej Karpathy recently <a href="https://x.com/karpathy/status/2036836816654147718" target="_blank">wrote</a>:</p>

<blockquote>
  <p>“One common issue with personalization in all LLMs is how distracting memory seems to be for the models. A single question from 2 months ago about some topic can keep coming up as some kind of a deep interest of mine with undue mentions in perpetuity. Some kind of trying too hard.”</p>
</blockquote>

<p>Large language models (LLMs) make this old philosophy newly interesting.</p>

<p>LLMs are often discussed as if they have replaced search, which feels true to a large extent. In practice, however, they inherit many of the same problems, which are, frankly, fundamental to communication. The interface is more fluid and the output is more polished but the system still has to infer what the user actually wants. In fact, the problem may be even harder now. Instead of choosing which ten links to rank at the top, the system is choosing how to frame an answer, what assumptions to make, which facts to emphasize, how much detail to provide and what tone or style best fits the moment.</p>

<p>A beginner and an expert may need very different explanations of the same idea. A user who prefers concise answers may experience the same response differently from someone who wants more context. A recommendation that is technically sound may still feel wrong if it conflicts with a person’s tastes, habits or current goals. To Andrej’s point, bringing up old signals after the person has moved on can be frustrating. What matters is not only whether the model can generate a plausible response, but whether it can generate one that feels appropriately aligned to a particular user at a particular moment.</p>

<p>I wanted to test whether a lightweight system, built in a weekend, with no custom model training, could meaningfully personalize LLM responses by maintaining a small, decaying memory of what the user cares about and then selectively injecting that context at inference time.</p>

<p>So, over a weekend I built <a href="https://mymemochat.com" target="_blank">mymemochat.com</a>.</p>

<p>It’s a prototype reminiscent of an early Rank Dynamics prototype. (I wish I had saved a screenshot!) At Rank Dynamics we eventually dropped the ‘insights’ in the left panel and quickly moved to delivering our technology through a browser extension (which ended up being installed tens of millions of times), but here the panel makes the process transparent, which helps with debugging and communicating the value proposition.</p>

<p>The whole thing was built with Codex using OpenAI GPT-4o mini on the backend. The system progressively layers personalization on top of the LLM through prompt injection, using a local embedding vector memory store and explicit logic for deciding what should be saved, reinforced, decayed or discarded. I used Codex as a development partner throughout, sketching the architecture myself and breaking it into small chunks, without writing a line of code directly.</p>

<p>In future posts, I’ll walk through the architecture in detail: how memories are extracted and embedded, the decay logic, how RAG augmentation works in practice and what I’d do differently. In the meantime, give it a try and let me know what you think.</p>]]></content><author><name>Mark Cramer</name></author><category term="motivation" /><summary type="html"><![CDATA[Motivation for building a prototype for chatbot personalization.]]></summary></entry></feed>