Overview and Backstory

  • I bought a copy of a website, newadvent.org, which is full of great stuff.
  • One thing I wanted to have with the website was my own search function.
  • New Advent uses Google as the search engine on their website. I don't like that.
  • I discovered this project: https://pagefind.app/
From the Pagefind site:

Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users' bandwidth as possible, and without hosting any infrastructure.

Pagefind runs after Hugo, Eleventy, Jekyll, Next, Astro, SvelteKit, or any other website framework. The installation process is always the same: Pagefind only requires a folder containing the built static files of your website, so in most cases no configuration is needed to get started.

After indexing, Pagefind adds a static search bundle to your built files, which exposes a JavaScript search API that can be used anywhere on your site. Pagefind also provides a prebuilt UI that can be used with no configuration. (The prebuilt UI can be seen at the top of the Pagefind site.)

The goal of Pagefind is that websites with tens of thousands of pages should be searchable by someone in their browser, while consuming as little bandwidth as possible. Pagefind's search index is split into chunks, so that searching in the browser only ever needs to load a small subset of the search index. Pagefind can run a full-text search on a 10,000 page site with a total network payload under 300kB, including the Pagefind library itself. For most sites, this will be closer to 100kB.
  • One problem I had was that New Advent uses .htm pages and not .html pages.
  • So when I tried indexing the directory, it didn't work, because Pagefind only picks up .html files by default.
  • I had to use several scripts to rename the files and rewrite the links before indexing worked.
  • I also had to figure out why I was getting Permission Denied errors while running Pagefind.
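
A possible shortcut, not what this README ended up doing: Pagefind's documentation lists a glob option (default `**/*.{html}`) that controls which files get picked up. The sketch below assumes that option accepts .htm files the same way and that the binary sits at ./pagefind. The index_with_htm name and the PAGEFIND variable are made up for illustration, and this is untested against the New Advent copy.

```shell
#!/bin/bash
# Hypothetical wrapper: ask Pagefind to index .htm files as well as .html files.
# PAGEFIND is an assumed override for the binary location (defaults to ./pagefind).
index_with_htm() {
  site="$1"
  # The glob is quoted so the shell passes it to Pagefind untouched.
  "${PAGEFIND:-./pagefind}" --site "$site" --glob "**/*.{htm,html}"
}
```

Usage would be index_with_htm "newadvent". If the result URLs come out wrong, renaming the files as described in the rest of this README is still the safe route.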

Permission Problems

  • The problem happened when I was in /mnt/drive/scholastia/ and trying to run the Pagefind indexer.
  • When I ran ./pagefind --site "newadvent" I got a Permission Denied error.
  • I couldn't figure out what the problem was.
  • I ended up copying the binary to /etc/pagefind/ and did a test run from there. It worked.
  • Then I realized the 18,000 pages weren't being indexed.
  • That was because I had to change all of the .htm extensions to .html.
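
The cause of the Permission Denied error was never tracked down here. Two common reasons for it when launching a binary from a mounted drive are a missing execute bit and a filesystem mounted with noexec; whether either applied in this case is a guess. A small sketch for checking both, assuming GNU stat and util-linux findmnt are installed. check_exec is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report the two usual suspects behind "Permission denied".
check_exec() {
  bin="$1"
  # stat -c '%A' prints the mode string, e.g. -rwxr-xr-x (GNU stat).
  case "$(stat -c '%A' "$bin")" in
    *x*) echo "execute bit: set" ;;
    *)   echo "execute bit: missing (try: chmod +x $bin)" ;;
  esac
  # findmnt -T resolves the mount that holds the file and prints its options.
  if findmnt -n -o OPTIONS -T "$bin" | grep -qw noexec; then
    echo "mount: noexec (binaries cannot run from here)"
  else
    echo "mount: noexec not detected"
  fi
}
```

Usage would be check_exec /mnt/drive/scholastia/pagefind. If the mount turns out to be noexec, copying the binary somewhere else, as was done above, is one way around it.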

HTM to HTML

  • The extension had to be changed inside the pages too: in the href attributes, and also in the other HTML tags.
  • I realized I had to use sed and a few other tools. Here are the scripts I used.

Script 1

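  • The script below finds htm as a whole word (the htm in .htm, but not the htm inside html) and replaces it with html.
  • It runs over every file under the site directory.
#!/bin/bash
# Replace the whole word "htm" with "html" in every file under the site directory (GNU sed).
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtm\b/html/g' {} +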
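
A more cautious variant of Script 1, as a sketch rather than what was actually run: it only touches .htm and .html files, so images and other binaries are left alone, and it only matches .htm with the leading dot. It assumes GNU sed (for -i and \b), and rewrite_htm_links is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: rewrite ".htm" references to ".html" inside page files only.
rewrite_htm_links() {
  root="$1"
  # \.htm\b matches ".htm" but not ".html", because there is no word
  # boundary between the "m" and the "l". Running it twice is harmless.
  find "$root" -type f \( -name '*.htm' -o -name '*.html' \) \
    -exec sed -i -E 's/\.htm\b/.html/g' {} +
}
```

Usage would be rewrite_htm_links /etc/workspace/newadvent/. Unlike Script 1, this leaves a bare word htm without a dot untouched.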

Script 2

  • The script below finds all files with the .htm extension and renames them to .html.
  • It has to be run in the directory where the *.htm files are located.
  • You have to copy this script into each directory. There are more elegant ways of doing this, but I don't care.
#!/bin/bash

# Rename every .htm file in the current directory to .html.
for file in *.htm
do
  [ -e "$file" ] || continue   # skip the literal pattern when nothing matches
  mv -- "$file" "${file%.htm}.html"
done
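
One of the more elegant ways hinted at above, as a sketch: let find walk the whole tree so the script does not have to be copied into each directory. rename_htm_recursive is a made-up name, and the directory is passed in as an argument.

```shell
#!/bin/bash
# Hypothetical helper: rename every *.htm file under a directory tree to *.html.
rename_htm_recursive() {
  root="$1"
  # find hands batches of paths to a small inline script;
  # ${f%.htm} strips the old extension before .html is appended.
  find "$root" -type f -name '*.htm' -exec sh -c '
    for f in "$@"; do
      mv -- "$f" "${f%.htm}.html"
    done
  ' _ {} +
}
```

Usage would be rename_htm_recursive /etc/workspace/newadvent/. Files that already end in .html are not matched, so a second run does nothing.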

Script 3

  • Some references ended up as .htmll instead of .html, so I had to fix that with this script.
#!/bin/bash
# Collapse the doubled "l": replace the whole word "htmll" with "html" (GNU sed).
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtmll\b/html/g' {} +

Script 4

  • The script below fixes a weird occurrence: some htm references had turned into htmll4.
  • So I needed to change htmll4 back into html.
#!/bin/bash
# Replace the whole word "htmll4" with "html" (GNU sed).
find /etc/workspace/newadvent/ -type f -exec sed -i -e 's/\bhtmll4\b/html/g' {} +
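
Scripts 3 and 4 could probably be folded into one pass. This is a sketch under the assumption that htmll and htmll4 are the only broken forms, using GNU sed with extended regexes. fix_broken_extensions is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: turn "htmll" and "htmll4" back into "html" in one pass.
fix_broken_extensions() {
  root="$1"
  # "4?" makes the trailing 4 optional, so both broken forms match.
  # A correct "html" does not match because the pattern needs two l's.
  find "$root" -type f \( -name '*.htm' -o -name '*.html' \) \
    -exec sed -i -E 's/\bhtmll4?\b/html/g' {} +
}
```

Usage would be fix_broken_extensions /etc/workspace/newadvent/.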

Pagefind Finally Worked

  • Everything finally worked out, and I got my search bar. It's JavaScript, which I don't like, but oh well.
  • Solr was a pain to deal with.

Search Bar

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>

Output of Indexing

$ ./pagefind --site "/path/to/directory"

Running Pagefind v1.1.0
Running from: "/etc/pagefind"
Source:       "/path/to/directory"
Output:       "/path/to/directory"

[Walking source directory]
Found 19045 files matching **/*.{html}

[Parsing files]
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.
2 pages found without an <html> element. 
Pages without an outer <html> element will not be processed by default. 
If adding this element is not possible, use the root selector config to target a different root element.

[Reading languages]
Discovered 1 language: unknown

[Building search indexes]
Total: 
  Indexed 1 language
  Indexed 18974 pages
  Indexed 438285 words
  Indexed 0 filters
  Indexed 0 sorts

Finished in 231.680 seconds
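
The log above mentions 2 pages without an <html> element. A rough sketch for finding candidates, assuming a case-insensitive text search for "<html" is a good enough stand-in for whatever Pagefind's parser actually checks (it may not be). list_pages_without_html_tag is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: list .html files that contain no "<html" tag at all.
list_pages_without_html_tag() {
  root="$1"
  # grep -L prints the names of files with no matching line; -i ignores case.
  find "$root" -type f -name '*.html' -exec grep -L -i '<html' {} +
}
```

Usage would be list_pages_without_html_tag /path/to/directory. Any file it prints is worth opening by hand before deciding whether to add the element or use the root selector config that the log suggests.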