Overview and Backstory
- I bought a copy of a website, newadvent.org, which is full of great stuff.
- I'm not allowed to share the website. Go buy a copy or use https://newadvent.org/
- One thing I wanted to have with the website was my own search function.
- New Advent uses Google as the search engine on their website. I don't like that.
- I discovered this project: https://pagefind.app/
Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users’ bandwidth as possible, and without hosting any infrastructure.
Pagefind runs after Hugo, Eleventy, Jekyll, Next, Astro, SvelteKit, or any other website framework. The installation process is always the same: Pagefind only requires a folder containing the built static files of your website, so in most cases no configuration is needed to get started.
After indexing, Pagefind adds a static search bundle to your built files, which exposes a JavaScript search API that can be used anywhere on your site. Pagefind also provides a prebuilt UI that can be used with no configuration.
The goal of Pagefind is that websites with tens of thousands of pages should be searchable by someone in their browser, while consuming as little bandwidth as possible. Pagefind’s search index is split into chunks, so that searching in the browser only ever needs to load a small subset of the search index. Pagefind can run a full-text search on a 10,000 page site with a total network payload under 300kB, including the Pagefind library itself. For most sites, this will be closer to 100kB.
- One problem I had was that New Advent used .htm pages and not .html pages.
- So when I tried indexing the directory, it wouldn't work because pagefind only recognized .html.
- I had to use several scripts to change stuff around so it works.
- I also had to figure out why I was getting Permission Denied problems while using pagefind.
Permission Problems
- The problem happened when I was in /mnt/drive/scholastia/ and trying to command pagefind to index.
- When I would run: ./pagefind --site "newadvent" I would get a Permission Denied error.
- I couldn't figure out what the problem was.
- I ended up copying the binary to /etc/pagefind/ and then I did a test run, and it worked.
- Then I realized the 18,000 pages weren't indexing.
- Then I realized I had to change all of the .htm extensions to .html
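The cause of the Permission Denied error was never pinned down above. As a rough sketch, the function below checks two common causes: a missing execute bit on the binary, or a filesystem mounted with the noexec option. Both are assumptions, not a confirmed diagnosis, and check_exec is a made-up helper name. It relies on findmnt from util-linux and skips the mount check if that tool is missing.

```shell
#!/bin/bash
# Sketch: report two common reasons a binary gives "Permission denied".
# check_exec is a hypothetical helper, not part of Pagefind.
check_exec() {
    local bin="$1"
    local opts=""
    # findmnt ships with util-linux; skip the mount check if it is missing.
    if command -v findmnt >/dev/null 2>&1; then
        opts=$(findmnt -n -o OPTIONS -T "$bin" 2>/dev/null) || opts=""
    fi
    case ",$opts," in
        *,noexec,*)
            # Nothing on a noexec mount can be executed, whatever its mode bits.
            echo "filesystem is mounted noexec: $bin"
            ;;
        *)
            if [ ! -x "$bin" ]; then
                echo "not executable for this user: $bin (try: chmod +x)"
            fi
            ;;
    esac
    return 0
}
```

Running check_exec ./pagefind from /mnt/drive/scholastia/ would show whether either cause applies. If the drive is mounted noexec, that would also fit with the binary working after it was copied to /etc/pagefind/.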
HTM to HTML
- These words needed to be changed in the href section, and also the HTML tags.
- I realized I had to use sed and some other shit. Here are the scripts I used.
Script 1
- The script below finds .htm EXACTLY and replaces it with .html.
#! /bin/bash
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtm\b/html/g' {} \;
Script 2
- The script below finds all files with the .htm extension and renames them to .html.
- This has to be run in the directory where the *.htm files are.
- You have to copy this script into each directory. There are more elegant ways of doing this but I don't care.
#! /bin/bash
for file in *.htm
do
mv "$file" "${file%.htm}.html"
done
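One of the "more elegant ways" might look like the sketch below. It walks the whole tree with find, so the script doesn't need to be copied into each directory. rename_htm is a made-up function name, and mv -n (never overwrite an existing file) assumes GNU coreutils.

```shell
#!/bin/bash
# Sketch: rename every *.htm file under a directory tree to *.html.
# rename_htm is a hypothetical helper name.
rename_htm() {
    local root="$1"
    # -print0 with read -d '' keeps file names with spaces or newlines intact.
    find "$root" -type f -name '*.htm' -print0 |
        while IFS= read -r -d '' file; do
            # -n: skip the rename if the .html target already exists.
            mv -n -- "$file" "${file%.htm}.html"
        done
}
```

It would be called once on the top directory, for example rename_htm /etc/workspace/newadvent/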
Script 3
- I realized some stuff would turn from .html to .htmll, so I had to fix that with this script.
#! /bin/bash
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtmll\b/html/g' {} \;
Script 4
- The script below changes a weird occurrence: htm turned into htmll4.
- So I needed to change htmll4 into html
#! /bin/bash
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtmll4\b/html/g' {} \;
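A narrower replacement might avoid the cleanup passes in Scripts 3 and 4. This is only a sketch, not what was actually run: it touches only .htm link targets inside double-quoted href attributes, and only in .htm and .html files. fix_links is a made-up name, and sed -i -E assumes GNU sed. Single-quoted or uppercase HREF attributes are not handled.

```shell
#!/bin/bash
# Sketch: rewrite .htm link targets to .html, only inside href="..." attributes.
# fix_links is a hypothetical helper name.
fix_links() {
    local root="$1"
    # The pattern needs .htm to be followed by a closing quote, '#' or '?',
    # so a link that already ends in .html is left alone and the
    # function can be run more than once without stacking extra letters.
    find "$root" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec sed -i -E 's/(href="[^"]*)\.htm(["#?])/\1.html\2/g' {} +
}
```

Because text outside href attributes is never touched, things like doctype strings stay as they are.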
Pagefind Finally Worked
- Everything finally worked out, and I got my search bar. It's JavaScript, which I don't like, but oh well.
- Solr was a pain to deal with.
Search Bar
<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
window.addEventListener('DOMContentLoaded', (event) => {
new PagefindUI({ element: "#search", showSubResults: true });
});
</script>
Output of Indexing
$ ./pagefind --site "/path/to/directory"
Running Pagefind v1.1.0
Running from: "/etc/pagefind"
Source: "/path/to/directory"
Output: "/path/to/directory"
[Walking source directory]
Found 19045 files matching **/*.{html}
[Parsing files]
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.
2 pages found without an <html> element.
Pages without an outer <html> element will not be processed by default.
If adding this element is not possible, use the root selector config to target a different root element.
[Reading languages]
Discovered 1 language: unknown
[Building search indexes]
Total:
Indexed 1 language
Indexed 18974 pages
Indexed 438285 words
Indexed 0 filters
Indexed 0 sorts
Finished in 231.680 seconds
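The output above mentions 2 pages without an <html> element. A rough way to track them down is a grep that lists the .html files containing no <html text at all. This is a plain text search, not Pagefind's own parser, so the result may not match its count exactly. find_missing_html is a made-up name, and --include assumes GNU grep.

```shell
#!/bin/bash
# Sketch: list .html files that never mention an <html tag (case-insensitive).
# find_missing_html is a hypothetical helper name.
find_missing_html() {
    local root="$1"
    # -L prints the names of files with no match. grep's exit status with -L
    # has varied between GNU grep versions, so it is ignored here.
    grep -r -i -L --include='*.html' '<html' "$root" || true
}
```

For example, find_missing_html /path/to/directory would print one file path per line.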