My Own Personal Wikipedia, Part 1
So I’ve been on a kick to sort out how I might come close to self-reliance when it comes to:
- Purifying water for drinking
- Maintaining communications
- Keeping useful information around
- Existing without MDs or DMDs
and the first three of those are hard enough without the fourth, which I have no solution for, or at least nothing to speak of.
I know ways to handle water: for biological threats, iodine. For chemical or mineral threats things get trickier, like maybe gathering materials and activating charcoal for a filter, using detergent, or stockpiling filters, etc.
Communications is basically a joke to me: I’m an Amateur Radio license holder, though I’ve still got my eyes on that Extra ticket. I have radios in metal boxes and both near and far communication abilities, but more is always better. Since I was forced to sell my HF radio, and even after eventually getting a job, I haven’t stopped thinking about replacing it, but I haven’t done so for financial reasons.
Information is interesting: there is no single source, and while I can wander around survival/prepper sites looking for informational PDFs, it’s not super productive because I want something relatively comprehensive. On the web, I’d guess Wikipedia is the standing champion for the most densely collected human knowledge anywhere. But it’s PHP. I’m a crazy person, so making an exception and just working out how to run MediaWiki myself is a no-go, and I can come up with more reasons than just the language it’s written in: it uses MySQL where my world revolves around Postgres, and it requires Elasticsearch if I want search capability. But the folks at Wikipedia are really nice and understand my plight, so they do provide dumps.
Specifically, my interest is in the text of English Wikipedia. I’m sure it’s missing more than just eastern wisdom encoded in Mandarin, but I only speak English, and hosting the images would require far too many bytes of storage. I have plenty of storage for the uncompressed XML, which is around 90 GB, but more than that, I have room to store a stemmed and an unstemmed set of terms in indexes for every single page, and room to separate those pages out so they can be traversed quickly enough for search refinement.
The biggest problem has turned out to be the markdown-style format that is MediaWiki markup. Nearly everything about it seems bespoke, and the PHP parser is written like the Chrome HTML renderer: forgiving of mistakes to a degree that makes a mockery of any specification that could be had. Since I’m unwilling to leave the scope of Go dependencies, because that’s the most debuggable code for me, I had to really search, but I found this fork of an IBM programmer’s library. I have to admit my first searches yielded a likely better parser written in Python, mwparserfromhell, which I was invoking through the command line concurrently; but the computational cost of starting the library on every invocation, multiplied across concurrent calls, was very high, and that led me to keep searching GitHub by language.
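For a sense of why that approach fell over: every page meant spawning a fresh Python process just to reach mwparserfromhell, so the interpreter startup and import cost dominated the actual parsing. Here’s a hedged sketch of that shelling-out from Go; the one-liner is illustrative, not the exact command I ran.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// stripWikitext shells out to Python for a single page. The one-liner below is a
// stand-in for however you wrap mwparserfromhell; the point is that every call
// pays the full interpreter-start and import cost before any parsing happens.
func stripWikitext(wikitext string) (string, error) {
	cmd := exec.Command("python3", "-c",
		"import sys, mwparserfromhell; print(mwparserfromhell.parse(sys.stdin.read()).strip_code())")
	cmd.Stdin = strings.NewReader(wikitext)
	out, err := cmd.Output()
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	plain, err := stripWikitext("'''Apple''' is a [[fruit]].")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(plain) // roughly: "Apple is a fruit."
}
```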
I eventually worked out that instead of using bunzip2, you can stream the lines from a bzip2-compressed file the way you would any normal file, which is great because it saved me 70 GB by not decompressing to disk. I had to buffer and parse the XML page by page, though, because I didn’t want to wait for the whole thing to parse, even if that were somehow possible. This streamed, page-by-page input is the backbone of saving pages in a trie-like format for quick lookup.
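Here’s a minimal sketch of that streaming loop using only the standard library (compress/bzip2 plus a streaming encoding/xml decoder); the dump filename and the fields I pull out of each page are the obvious ones, not necessarily exactly what my code does.

```go
package main

import (
	"compress/bzip2"
	"encoding/xml"
	"fmt"
	"log"
	"os"
)

// Page holds the fields I care about from the dump's <page> elements.
// The tags follow the MediaWiki export schema; everything else is skipped.
type Page struct {
	Title string `xml:"title"`
	Text  string `xml:"revision>text"`
}

func main() {
	// Example path: the standard English dump filename.
	f, err := os.Open("enwiki-latest-pages-articles.xml.bz2")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Stream-decompress: no ~90 GB intermediate file on disk.
	dec := xml.NewDecoder(bzip2.NewReader(f))

	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF at the end of the dump, or a genuine read error
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
			var p Page
			// DecodeElement buffers just this one <page>...</page> subtree.
			if err := dec.DecodeElement(&p, &se); err != nil {
				log.Printf("skipping malformed page: %v", err)
				continue
			}
			fmt.Println(p.Title) // hand off to parsing/indexing here
		}
	}
}
```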
Initially I was trying to store the parsed data all in one directory inside the repo with a .gitignore entry. That slowed oh-my-zsh to a crawl because it would invoke git for the shell status info, and even after I moved the data out of the repo it was still very slow to count/move/delete because there were far too many files in a single directory for the filesystem to handle comfortably. A trie-like folder structure, where two layers of directories correspond to the beginning of the file name, like /a/p/apple.xml, sped that up, and I was finally able to get some work done.
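The mapping from title to path is simple; something along these lines, with the caveat that real titles need handling for one-character names and characters that aren’t filesystem-safe.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pathFor maps a page title to a two-level, trie-like path such as "a/p/apple.xml",
// so no single directory ends up holding millions of files. This is a sketch of
// the shape of the idea, not a complete scheme.
func pathFor(title string) string {
	slug := strings.ToLower(title)
	runes := []rune(slug)
	first, second := "_", "_" // fallback buckets for very short titles
	if len(runes) > 0 {
		first = string(runes[0])
	}
	if len(runes) > 1 {
		second = string(runes[1])
	}
	return filepath.Join(first, second, slug+".xml")
}

func main() {
	fmt.Println(pathFor("Apple")) // a/p/apple.xml
}
```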
Once I had worked out how to make the data clutter workable, search was almost trivial for the first version, because I had only built an index of the stemmed terms, but that also meant it wasn’t incredibly effective. In the reworked version I helped that problem by also storing the precise terms in lower case, so even though the search is case-insensitive it still works for proper names, places, events, etc., especially when they contain infrequent terms/names.
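Roughly, the two-index idea looks like this: every page contributes both its exact lower-cased terms and their stems to an inverted index, so the same lookup can be forgiving for common words and precise for rare names. The stem function below is a toy placeholder (a real Snowball/Porter stemmer belongs there), and my actual indexes live on disk alongside the pages rather than in an in-memory map.

```go
package main

import (
	"fmt"
	"strings"
)

// index maps a term to the set of page paths that contain it.
type index map[string]map[string]bool

func (ix index) add(term, pagePath string) {
	if ix[term] == nil {
		ix[term] = make(map[string]bool)
	}
	ix[term][pagePath] = true
}

// stem is a stand-in for a real stemmer; it only trims a couple of common
// English suffixes to show the shape of the technique.
func stem(term string) string {
	for _, suf := range []string{"ing", "es", "s"} {
		if strings.HasSuffix(term, suf) && len(term) > len(suf)+2 {
			return strings.TrimSuffix(term, suf)
		}
	}
	return term
}

// indexPage adds every token of a page to both the exact and the stemmed index.
func indexPage(exact, stemmed index, pagePath, text string) {
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		tok = strings.Trim(tok, ".,;:()[]'\"")
		if tok == "" {
			continue
		}
		exact.add(tok, pagePath)         // precise lower-cased term
		stemmed.add(stem(tok), pagePath) // stemmed term
	}
}

func main() {
	exact, stemmed := index{}, index{}
	indexPage(exact, stemmed, "a/p/apple.xml", "Apples are pomaceous fruits.")
	fmt.Println(exact["apples"])  // exact form matches
	fmt.Println(stemmed["apple"]) // stemmed form matches too
}
```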
After making search work the first time in a CLI, I decided it wouldn’t be that much harder to use Go’s html/template package (there are safer versions of this package; this is only meant to be run locally), and it wasn’t. With a little help from LLMs (Llama 3.1/3.2, GPT-4o, and GPT-4o-mini) the template was a real breeze and I had a nice dark-mode interface, but I still had to debug and re-prompt. Billions of parameters, years of refinement, and these LLMs still struggle with CSS. Very relatable.
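The web layer really is small; here’s a stripped-down sketch of the template-plus-handler wiring, with the styling and the search hookup reduced to placeholders (the field names and handler shape are illustrative, not my exact code).

```go
package main

import (
	"html/template"
	"log"
	"net/http"
)

// result is what the template renders: the query and the matching titles.
type result struct {
	Query  string
	Titles []string
}

// A dark-mode-ish page; html/template escapes the query for us.
var tmpl = template.Must(template.New("search").Parse(`<!DOCTYPE html>
<html><head><style>
  body { background: #111; color: #ddd; font-family: sans-serif; }
  a { color: #8ab4f8; }
</style></head>
<body>
  <form method="GET"><input name="q" value="{{.Query}}"></form>
  <ul>{{range .Titles}}<li>{{.}}</li>{{end}}</ul>
</body></html>`))

// search is a placeholder for the index lookup described above.
func search(q string) []string {
	if q == "" {
		return nil
	}
	return []string{"Apple", "Apple pie"} // stand-in results
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")
		if err := tmpl.Execute(w, result{Query: q, Titles: search(q)}); err != nil {
			log.Println(err)
		}
	})
	// Local-only by design, per the note above.
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```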
And that’s the meandering, first draft story of how I came to reinvent the internet’s library with my free time instead of being a person & having a life. I really need a replacement car LMAO.