## Predictive Text Using 2MB of JavaScript, no LLM.
![[v4 3.gif]]
[Try it here](https://adamjgrant.github.io/Tiny-Predictive-Text/) | [Code](https://github.com/adamjgrant/Tiny-Predictive-Text) | [Permy](https://permy.link)
My first attempt at this got some traction on [Hacker News](https://news.ycombinator.com/item?id=39556956). That was enough to make me want to keep going and see what would happen if I pushed the idea further.
## Attempt 1: 13kb of code
This started as a simple proof of concept: using [Permy](https://permy.link) with a small JSON dictionary for predictive text that is surprisingly expressive despite the tiny size of the dictionary (**<8KB**) and even of the full JS implementation itself (**13KB**). ^dc5d69
### Background
This code leverages how [Permy](https://permy.link) combines words through nested arrays of word choices and can loop back on itself through branch references.
![[permy.gif]]
The dictionary starts by defining the 25 most common first words used in sentences as individual branches:
```json
{"main":[{"branch":"the"},{"branch":"be"},{"branch":"to"},{"branch":"of"},{"branch":"and"},{"branch":"a"},{"branch":"in"},{"branch":"that"},{"branch":"have"},{"branch":"i"},{"branch":"it"},{"branch":"for"},{"branch":"not"},{"branch":"on"},{"branch":"with"},{"branch":"he"},{"branch":"as"},{"branch":"you"},{"branch":"do"},{"branch":"at"},{"branch":"this"},{"branch":"but"},{"branch":"his"},{"branch":"by"},{"branch":"from"}]}
```
Each branch then starts with that word and defines the top five most likely words to follow it. For each of those five, it provides the three words most likely to follow in turn.
```json
"the": [
[
"the ",
[
[
"world ",
[
"is ",
"has ",
"seems "
]
],
...
```
Sometimes one of these words is already a branch, so I replaced those with branch references, allowing the sentence creation to loop back on itself and keep writing one or two more words. In this case, a random selection starting with "of" could autocomplete with a random construction of the "the" tree above.
```json
"of": [
[
"of ",
[
[
{
"branch": "the"
},
[
[
"world ",
[
"is ",
"has ",
"knows "
]
],
```
### Limitations
Attempt 1 was just a POC. If it can do this in 13KB, it makes me wonder what it could do with more bytes. I did use an LLM to generate the dictionary, and as [others on HN have commented](https://news.ycombinator.com/item?id=39556956), the selection of the top 25 words that start sentences probably isn't accurate.
A tree of 25x5x3 isn't really that big. It's only 375 possible combinations without considering rebranching. I can see several ways someone could take this even further with a larger dictionary and some other magic here and there. Permy has lots of flexibility in defining links between string nodes and can further be compressed with the permyscript syntax where applicable.
I mainly just wanted to see what would happen if I tried to use this library this way. Now I'm eager to try different things with it.
## Attempt 2: Three word context window, 1.3MB
![[v2.gif]]
Next, I decided to keep going with the idea of having a very tiny dictionary, but this time having many tiny dictionaries.
I also wanted to see what happens when the dictionary is actually trained on text, developing a tree that is truly based on real data, while still keeping it small. Small to me means that if I visit your website and this is a dependency, I don't want to wait for it. So I figured I could stretch to a megabyte or two.
In this attempt, I used training data from Reddit comments and made a dictionary of the 10,000 most common three-word combinations. Then I used a similar approach as above, creating a tree underneath each one out of the words that tend to follow it.
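Just to make the shape of that concrete, here's a rough sketch of pulling the most common three-word combinations out of a pile of comments. This is not the actual training script; the function, regex, and limit handling are only illustrative.
```python
from collections import Counter
import re

def top_trigrams(comments, limit=10_000):
    """Count every three-word combination and keep the most common ones as entry points."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(zip(words, words[1:], words[2:]))
    return [trigram for trigram, _ in counts.most_common(limit)]

print(top_trigrams(["I don't know what I should do", "I don't know what to say"], limit=3))
```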
The result took me from one extreme to the other. In attempt 1, I got suggestions frequently, but they weren't good ones. Now I rarely get suggestions, but when I do, they are creepily human.
## Attempt 3: Including smaller word sets as backup, 1.4MB
![[v3 1.gif|500]]
Much better!
Again, trying to keep what works and adjust what doesn't: in attempt 3, I'm still making a three-word dictionary, but now, if a match is not found, it looks up the last two words typed in a separate two-word dictionary. Finding nothing there, it falls back to the equivalent of attempt 1's single-word dictionary, this time trained on actual data.
The hopeful outcome here is that it will always or almost always give you a suggestion, provided you're using common words, while preferring more meaningful suggestions when they are available.
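A minimal sketch of that fallback order, assuming three dictionaries keyed by the last three, two, or one word(s) typed; the helper below is hypothetical, not the library's API.
```python
def suggest(words, three_word_dict, two_word_dict, one_word_dict):
    """Prefer the most specific (longest) match; fall back to shorter ones."""
    for n, dictionary in ((3, three_word_dict), (2, two_word_dict), (1, one_word_dict)):
        key = " ".join(words[-n:]).lower()
        if key in dictionary:
            return dictionary[key]
    return None  # better no suggestion than a bad one
```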
## Attempt 4: Pruning word branches, ~2MB
The last piece I didn't like about the implementation was the lower branches (the words that come after other words). There was no weighting, so if a dumb completion existed, it was as likely to appear as a smart one. For example, if anyone in the training data ever said "s9d0fwjef" after "I don't know", that was as likely a completion as "what I should".
However, I'm not trying to reverse engineer a large language model here, and I wanted to keep this light. So instead, whenever a word was seen again, I bumped it down to a lower index in the array it lived in (essentially increasing its rank), and I frequently "pruned the branches" by dropping everything but the first five, allowing only the popular to survive.
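Here's a tiny sketch of that bump-and-prune idea. The real trainer's exact bump behavior may differ; this version just swaps a re-found word one position toward the front.
```python
MAX_BRANCHES = 5  # keep only the top five branches when pruning

def bump(branch, word):
    """Append a new word, or move a re-found word one index closer to the front."""
    if word in branch:
        i = branch.index(word)
        if i > 0:
            branch[i - 1], branch[i] = branch[i], branch[i - 1]
    else:
        branch.append(word)

followers = []
for next_word in ["go", "talk", "tell", "talk", "find", "think", "#YOLO$WAGGG", "talk"]:
    bump(followers, next_word)

followers = followers[:MAX_BRANCHES]  # prune: only the popular survive
print(followers)  # ['talk', 'go', 'tell', 'find', 'think'] -- the junk entry fell off
```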
![[v4 2.gif]]
[Try it here](https://adamjgrant.github.io/Tiny-Predictive-Text/) | [Code](https://github.com/adamjgrant/Tiny-Predictive-Text) | [Permy](https://permy.link)
In conclusion, I'm really happy with the way this turned out. It's incredible to me that it even works this well. At the end of the day all it's doing is taking a tree of words and randomly picking a path down until it hits a leaf node. There are no weights or biases because the branches are pruned in the training process instead.
As I write this, the training is still running even though it has run all night. When it finally reaches the end, I'll update the GIFs above and the public codebase so everyone can see the latest. If I throw more time at this, I might want to satisfy my curiosity about what happens with even larger dictionaries, or see whether there are ways to optimize the existing one further, such as reducing file size with tokenization.
## Attempt 8
Weighing in at about 2.9MB:
![[v8.gif]]
I wanted to try something very different this time. Instead of using n-grams, what if there was a way I could look back into the previous words without literally looking at every single word? What if I had a kind of "fingerprint" of the preceding words that would be enough to surface smart suggestions?
I started by measuring the vowel-to-consonant ratio, word length distributions, and unique word ratio, turning them on and off in various combinations to see if any combination gave me a better signal. To make it stick a little better, I kept track of the anchor word, meaning the last word before the predictive text, as a qualifier for which predictive text is surfaced. So it has a context window of one word, plus some intelligence about the statistical properties of that word and the preceding nine.
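For illustration, a rough sketch of that kind of fingerprint over a ten-word window (the anchor plus the preceding nine). The feature list matches what's described above; the names, regex, and rounding are made up, and the real implementation is JavaScript.
```python
import re

def fingerprint(text):
    words = re.findall(r"[a-z']+", text.lower())[-10:]  # anchor word plus preceding nine
    letters = "".join(words).replace("'", "")
    vowels = sum(c in "aeiou" for c in letters)
    return {
        "vowel_consonant_ratio": round(vowels / max(len(letters) - vowels, 1), 2),
        "word_lengths": [len(w) for w in words],            # word length distribution
        "unique_word_ratio": round(len(set(words)) / max(len(words), 1), 2),
        "anchor": words[-1] if words else "",               # last word before the prediction
    }

print(fingerprint("I kept track of the anchor word meaning the last word before"))
```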
Overall, it was an interesting direction, but what actually gave me a better feeling about how this worked was a totally separate feature, which you can see in this video (with some bugs at the beginning). When given a suggestion, the script listens to what I actually start typing and does another lookup in its list of suggestions to see if there's a better-fitting one.
For example, if after "I have a " the suggestions include both "dog" and "cat", it gives me "dog" as a suggestion. But if I type the letter "c", it automatically changes the suggestion to "cat", allowing me to autocomplete what I've already written instead of only tacking text on at the end.
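The refinement itself is simple. A hedged sketch, assuming the engine keeps a ranked list of candidate completions for the current context:
```python
def best_suggestion(candidates, typed_prefix):
    """Prefer the highest-ranked candidate that matches what the user has started typing."""
    matching = [c for c in candidates if c.startswith(typed_prefix)]
    return matching[0] if matching else None

candidates = ["dog", "cat"]              # ranked, "dog" first
print(best_suggestion(candidates, ""))   # -> dog
print(best_suggestion(candidates, "c"))  # -> cat
```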
## Attempt 9
> [!info] To be continued!
For the next attempt, I wanted to take what I had learned again from the previous attempts and bid adieu to neat ideas that yielded no results.
The big idea here was that maybe I don't need to know the full words preceding the completion. I think it will be enough to track the first letter of each word except the "anchor" word. In training text, writers will often misspell words, but they rarely get the first letter wrong. This also should help with file size and serve as a neat trick for tokenization.
So if the text is:
"The next president of the United States", the predictive text is "of the United States" and the anchor word is "of".
What is stored is `["tnp", "of"]` as the context window and `["the", "United", "States"]` as the predictive text. `tnp` is treated as if it's just one word. The tokenization "trick" here is that `tnp` could be the acronym for any other set of three words elsewhere in the training and can therefore reuse that token.
```mermaid
graph LR;
A[👁️ tnp] --> B[⚓️ of] --> C[the] --> D[United] --> E[States]
A --> F[⚓️ who] --> G[...]
```
So the suggestion only matches on the "of" path when both of the following are true (see the sketch after this list):
1. Between two and four words back, I typed words that started with T, then N, then P.
2. One word back (the last word), the full word was "of".
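In code, that two-part match might look something like this. The key format and helper are assumptions, and it's sketched in Python rather than the project's JavaScript:
```python
def context_key(words):
    """Acronym of the three words before the anchor, plus the anchor itself."""
    *preceding, anchor = words[-4:]
    return "".join(w[0].lower() for w in preceding), anchor.lower()

dictionary = {("tnp", "of"): ["the", "United", "States"]}

typed = ["The", "next", "president", "of"]
print(dictionary.get(context_key(typed)))  # -> ['the', 'United', 'States']
```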
This also gives it a stronger four-word context window (kind of). It would also be neat to try, in a later attempt, a full word followed by an acronym followed by an anchor word, to strengthen the matching quality while still compressing the data. My fear, however, is that the context window would then be so wide that the coverage needed for all the permutations would drive the file size way too high.
And since this is currently just a place for me to drop notes to myself, until I actually do this I'm going to ramble on and come back to clean this up:
I think what we're trying to achieve here is a predictive gradient, meaning that the further back in words you go, the less detail you have to work with, while still trying to keep a good signal all the way through.
I had an idea of doing this in three layers, going back a number of words to be determined. The layers are listed in order from the furthest back to the closest to the predictive text point.
1. A set of words with common words removed, then made into an acronym. Maybe we remove prepositions, articles, and conjugations of "to be" and "to have", but not pronouns, because those affect conjugation. This has the downside of making the whole effort language-specific. So for "I am really proud of you for being able to do this", we take "I am really proud of you", turn it into "really proud you", which becomes `rpy`.
2. The next three words as an acronym, nothing removed. So "for being able" becomes `fba`.
3. The next word as an anchor, so `to`.
4. The prediction itself.
Then we store the next three words as predictive text possibilities. We might not even need to mix them into an array like in the first two attempts, because the combination of all of the above is so rare.
So the context window of `the next qualified president of the United States of America` looks like: `nqp->uso->america`
```mermaid
graph TD;
H[nqp] --> A[uso] --> B[⚓️ america] --> C[will] --> D[be] --> E[decided]
A --> F[⚓️ cares] --> G[about] --> I[...]
```
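Here's a rough sketch of building that three-layer key. The stopword list and the layer widths are guesses, not the project's actual choices:
```python
STOPWORDS = {"the", "a", "an", "of", "to", "for", "and", "is", "am", "are", "was", "were"}

def layered_key(words):
    words = [w.lower() for w in words]
    anchor = words[-1]                               # layer 3: the anchor word
    layer2 = "".join(w[0] for w in words[-4:-1])     # layer 2: acronym of the three words before it
    remaining = [w for w in words[:-4] if w not in STOPWORDS]
    layer1 = "".join(w[0] for w in remaining[-3:])   # layer 1: acronym with common words removed
    return layer1, layer2, anchor

text = "the next qualified president of the United States of America"
print(layered_key(text.split()))  # -> ('nqp', 'uso', 'america')
```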
I think it's important to use similarity scoring here, because it could still surface a match if "qualified" were dropped, and depending on how the scoring works, it would also work if the first "the" were dropped, since it's an article anyway.
Another drawback to this approach is that it requires a lot of words already written in a sentence before predictive text can form, so if no good matches are found, we fall back to dropping layer 1. As more words are typed, it comes back, unless the user types letters into the suggestion.
So I think it would make more sense to make the tree 2:3:4:1, where the last layer is sorted by its match to layer 1 in the user's text and the first is always picked up. Actually, that could cause a lot of duplication of layer 4 items via layer 1. Maybe that's okay if we tokenize?
I think we also need to prune primarily on anchor words, then in smaller proportions on the layers further down the tree. So there is really an assembly tree that gets pruned, and a lookup tree made after assembly that could have a different structure.
Assembly tree: 3:2:4:1
Okay, so it turns out I don't need two trees.
Since we're doing it this way, we should require a locale passed in as a flag.
I think the precision gradient is a form of "decaying attention", which is closer to an industry-standard term.
Let's have every layer do runtime similarity scoring and just go from the anchor backwards, with the predictive text at the very end. So on the assembly tree we only need scoring for pruning, and we'd need it on all layers.
# A Deeper Look at How This Works
## Creating the dictionary
Tiny Predictive Text creates many small dictionaries in three groups:
1. Three word phrasings
2. Two word phrasings
3. Single word phrasings
A dictionary is nothing more than a tree of words where the first node is a three-, two-, or one-word phrasing.
```mermaid
graph LR;
A[don't need to] --> B[talk];
A --> C[see];
A --> D[wait];
B --> E[about];
B --> F[through];
B --> G[in];
C --> H[him];
C --> I[them];
C --> J[every];
D --> K[another];
D --> L[for];
D --> M[through];
```
You can also see that in the real dictionary.js by looking at its JSON structure. In the right pane, you can see the merger of the three-, two-, and one-word dictionaries, with the three-word dictionary "I don't know" opened up on the left.
![[Screenshot 2024-03-02 at 11.44.01 AM.png]]
Notice there are no weights on any of these connections. Instead, as a branch gets items added to it, an item that is already there gets moved up in rank, which simply means it's put at a lower index. Then, when the main dictionary is compiled periodically during training, the branches are "pruned": only the first five or so are retained, preferring the ones at a lower index. This keeps the next-words that are more likely to be seen in the training text.
For example:
```mermaid
graph LR;
A[don't need to] --> B[talk];
A --> C[go]
A --> D[tell]
A --> E[find]
A --> F[think]
A --> G["#YOLO$WAGGG"]
A --> H[bwhahaharofl]
```
The last two words never get promoted because they aren't seen as often in the training data, so they're more likely to be cut by branch pruning, provided enough branches grow and enough data is seen. Neither is a problem in the training stage.
We're also pruning the dictionaries themselves. Only a certain number of three-word combinations are allowed, and the same applies to the two- and one-word combinations, each with its own limit.
This is determined by a number describing the dictionaries' overall "width", with each dictionary set getting a percentage stake in it. The more words in a phrasing, the more coverage we need, so the three-word dictionaries get to take up more of the width.
```python
# Across all dictionaries, how many entry word sets total should we regularly prune the
# dictionary back to contain?
TARGET_DICTIONARY_COUNT = 16000
# Of the total TARGET_DICTIONARY_COUNT, what stake in that count should each dictionary get?
THREE_WORD_STAKE_PERCENT = 0.625
TWO_WORD_STAKE_PERCENT = 0.3125
ONE_WORD_STAKE_PERCENT = 0.0625
```
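Assuming those percentages are simply multiplied against the target count, the stakes work out to these per-dictionary limits:
```python
TARGET_DICTIONARY_COUNT = 16000

limits = {
    "three_word": int(TARGET_DICTIONARY_COUNT * 0.625),   # 10,000 entry word sets
    "two_word":   int(TARGET_DICTIONARY_COUNT * 0.3125),  #  5,000 entry word sets
    "one_word":   int(TARGET_DICTIONARY_COUNT * 0.0625),  #  1,000 entry word sets
}
print(limits)
```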
## Using the dictionary
When the user enters text, the suggestion engine is always looking at the last three words and will look for a dictionary matching those three words. To normalize these relationships, we're actually comparing "slugified" word sets, which replace non-alphanumeric characters with an underscore and keep everything lowercase. That's why we have slugs like `i_don_t_know`. Admittedly, it's a drawback that `I dont know` would only match `i_dont_know`; this could be improved in the slugification logic.
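A sketch of that slugification rule as described (the real logic lives in the JS code):
```python
import re

def slugify(phrase):
    """Lowercase everything and replace each non-alphanumeric character with an underscore."""
    return re.sub(r"[^a-z0-9]", "_", phrase.lower())

print(slugify("I don't know"))  # -> i_don_t_know
print(slugify("I dont know"))   # -> i_dont_know (the drawback mentioned above)
```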
```mermaid
graph LR;
A[✅ don't need to] --> B[✅ talk];
A --> C[see];
A --> D[...];
B --> E[about];
B --> F[through];
B --> G[✅ in];
C --> H[him];
C --> I[...];
```
### Is this a Markov Chain?
If a match is found, it will traverse the tree by selecting a random branch each time. No weights are used here because that was all done in the pruning, so this is kind of a Markov chain with equal probabilities. I say kind of because we actually did weight the branches before compiling the dictionary by pruning ones that weren't in the top 5. This heuristic allows us to keep the processing logic and dictionary size lightweight.
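As a sketch, the traversal is just a random walk down to a leaf. The tree below reuses the "don't need to" example from earlier; the structure is illustrative, not the real dictionary format:
```python
import random

tree = {"talk": {"about": {}, "through": {}, "in": {}},
        "see": {"him": {}, "them": {}, "every": {}},
        "wait": {"another": {}, "for": {}, "through": {}}}

def complete(node):
    """Pick a random child at each level until a leaf (empty dict) is reached."""
    words = []
    while node:
        word = random.choice(list(node))  # equal probability; the weighting already happened at prune time
        words.append(word)
        node = node[word]
    return words

print("don't need to", " ".join(complete(tree)))  # e.g. "don't need to see them"
```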
### Looking up
If no three-word dictionary is found, it looks up just the last two words in the two-word dictionary. Finding nothing there, it finally looks in the one-word dictionary before giving up. This was the key insight, around attempt #3, to nailing the sweet spot of providing results that were both high quality and offered often enough.
```mermaid
graph TD;
A["'grilling without the'"] --> B[👎 No match] --> C["'without the'"] --> D[👎 No match] --> E["'the'"] --> F[👍 Match] --> G["'...use of'"]
```