Set up ArchiveBox on RHEL-compatible distros
2023-02-15
From their website: ArchiveBox “is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.” It offers a command-line tool, web service, and desktop apps for Linux, macOS, and Windows.
There are several ways to install ArchiveBox. The developers recommend installing with docker-compose, but this gets a little cumbersome on a RHEL-compatible Linux distro, which has strong opinions on container management and prefers Podman over the “old way” (aka Docker). I’ve personally found it easier to install it with Python’s pipx tool and have my web server reverse proxy the ArchiveBox server.
Prerequisites
- Preferably a filesystem with compression and deduplication capabilities, such as BTRFS or ZFS, but any journaling filesystem will work fine if we have another way to back up the archives.
- Minimum of 500MB of RAM, but 2GB or more is recommended for Chrome-based archiving methods.
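As a quick sanity check against the list above, here is a minimal sketch that reports the machine's RAM and the filesystem type under the home directory. The ram_verdict helper and the 1800 MB cutoff are my own choices for illustration (a nominal 2GB machine reports a bit less than 2048 MB), and it assumes the archive will live under $HOME.

```shell
# Classify a RAM size (in MB) against the thresholds listed above.
# The 1800 cutoff is an assumption: a nominal 2GB machine reports a little under 2048 MB.
ram_verdict() {
    if [ "$1" -lt 500 ]; then
        echo "below minimum"
    elif [ "$1" -lt 1800 ]; then
        echo "ok for basic archiving"
    else
        echo "ok for Chrome-based archiving"
    fi
}

# Total RAM in MB; /proc/meminfo reports MemTotal in kB.
mem_mb="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)"
echo "RAM: ${mem_mb} MB ($(ram_verdict "$mem_mb"))"

# Filesystem type of the directory that will hold the archive.
# Assumption: the archive will live under $HOME.
echo "Filesystem: $(stat -f -c %T "$HOME")"
```

If the filesystem line prints btrfs or zfs, compression and deduplication are at least available; whether they are switched on is a separate question.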
Installing dependencies
To get started, we’ll install pipx and the Python development package:
sudo dnf install python3-pip python3-devel pipx
Next, we’ll install required dependencies, some of which may already be available on our system:
sudo dnf install wget curl git libatomic zlib-ng-devel openssl-devel openldap-devel libgsasl-devel python3-ldap python3-msgpack python3-mutagen python3-regex python3-pycryptodomex procps-ng ldns-utils ffmpeg-free ripgrep
We’ll need a recent version of NodeJS. On AlmaLinux, Rocky Linux, RHEL, or CentOS Stream, we can install version 20 by enabling its module with DNF.
sudo dnf module install nodejs:20
On Fedora, we can install the latest NodeJS version from the repositories.
sudo dnf install nodejs
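To confirm which NodeJS version actually ended up on the system, something like the following should do. The node_major helper is a name I made up for this sketch; it only parses the string that node --version prints.

```shell
# Extract the major version from a string such as "v20.11.1".
node_major() {
    v="${1#v}"          # drop the leading "v"
    echo "${v%%.*}"     # keep everything before the first dot
}

if command -v node >/dev/null 2>&1; then
    major="$(node_major "$(node --version)")"
    echo "NodeJS major version: $major"
else
    echo "node not found on PATH"
fi
```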
Then, we’ll install optional dependencies. If we want to use Chrome-based archiving methods, such as fetching PDFs, screenshots, and the DOM of web pages, we’ll need to install the Chromium package. If we want to archive YouTube videos, we’ll need the yt-dlp package.
sudo dnf install yt-dlp chromium
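Before moving on, it can help to check that the tools we just installed are really on the PATH. This is a rough sketch; missing_tools is a hypothetical helper, and the list of binary names is my assumption about what the packages install (Chromium is left out because its binary name varies between chromium and chromium-browser).

```shell
# Print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: the packages above install binaries under these names
# (ripgrep's binary is rg).
missing="$(missing_tools wget curl git node rg yt-dlp)"
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All checked tools are present"
fi
```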
Now we’ll install ArchiveBox with pipx:
pipx install "archivebox[ldap,sonic]"
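pipx puts the archivebox executable in its own binary directory, so the command is only found if that directory is on the PATH. Here is a small sketch that checks this, assuming pipx uses its default location of ~/.local/bin; path_contains is a helper written for this example.

```shell
# Succeed if directory $1 appears as an entry in the PATH-style string $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: pipx uses its default binary directory, ~/.local/bin.
if path_contains "$HOME/.local/bin" "$PATH"; then
    echo "pipx binary directory is on PATH"
else
    echo "Not on PATH yet; run 'pipx ensurepath' and open a new shell"
fi
```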
Initializing the ArchiveBox database
Create a directory to store the archive data in the home directory of the user who will run ArchiveBox (a dedicated archivebox user, for example):
mkdir data
cd data
Run the initialization:
archivebox init --setup
The setup wizard will prompt us to enter a username, email address, and password. This will allow us to log in to our ArchiveBox web dashboard.
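If we want to double-check that the initialization took, one simple signal is the index.sqlite3 database that ArchiveBox keeps in its data directory. The is_initialized helper below is just a sketch for this post, and it assumes the data directory is ~/data as created above.

```shell
# Succeed if the directory looks like an initialized ArchiveBox collection,
# i.e. it contains the index.sqlite3 database.
is_initialized() {
    [ -f "$1/index.sqlite3" ]
}

data_dir="$HOME/data"   # assumption: the directory created above
if is_initialized "$data_dir"; then
    echo "$data_dir is initialized"
else
    echo "$data_dir has no index.sqlite3; run 'archivebox init --setup' there"
fi
```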
Now we need to create a systemd user service for the ArchiveBox server. Create the file ~/.config/systemd/user/archivebox.service with the following contents:
[Unit]
Description=Archivebox server
# User units cannot depend on system targets such as network-online.target,
# so no After= or Requires= ordering is declared here.
[Service]
Type=simple
Restart=always
ExecStart=%h/.local/bin/archivebox server 0.0.0.0:8000
WorkingDirectory=%h/data
[Install]
WantedBy=default.target
Reload the systemd user manager so it picks up the new unit:
systemctl --user daemon-reload
Enable and start the archivebox.service:
systemctl --user enable --now archivebox.service
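One thing worth checking with user services: by default the systemd user manager only runs while the user has an active session, so the server may stop when we log out. Lingering changes that. The sketch below only reports the current state; linger_enabled is a helper I wrote for this example, and it relies on logind keeping a marker file per user under /var/lib/systemd/linger.

```shell
# Succeed if lingering is enabled for the given user
# (logind keeps a marker file per user in this directory).
linger_enabled() {
    [ -e "/var/lib/systemd/linger/$1" ]
}

user="$(id -un)"
if linger_enabled "$user"; then
    echo "Linger already enabled for $user"
else
    echo "Linger not enabled; to enable it, run: loginctl enable-linger $user"
fi
```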
If we’re running a web server already, we can reverse proxy the archivebox server on port 8000. I use Caddy, so this is what I have in my Caddyfile:
archive.hyperreal.coffee {
reverse_proxy localhost:8000
}
If we’re not already running a web server, then we might need to open port 8000 in firewalld’s public zone (the default zone unless we’ve changed it):
sudo firewall-cmd --zone=public --permanent --add-port=8000/tcp
sudo firewall-cmd --reload
We should now be able to reach our ArchiveBox instance at our web server’s domain, or locally by pointing a web browser at http://localhost:8000.
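For a check that doesn’t need a browser, curl can tell us whether anything is answering on port 8000. This is a minimal sketch; describe_status is a hypothetical helper, and treating any 2xx or 3xx response as “up” is my own simplification.

```shell
# Translate curl's %{http_code} output into a short verdict.
describe_status() {
    case "$1" in
        2??|3??) echo "up" ;;
        000)     echo "unreachable" ;;
        *)       echo "responding with HTTP $1" ;;
    esac
}

if command -v curl >/dev/null 2>&1; then
    # --max-time keeps the check from hanging; curl prints 000 when it cannot connect.
    code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' http://localhost:8000 || true)"
    echo "ArchiveBox on port 8000: $(describe_status "$code")"
fi
```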
Here are a few examples of what we can do with the ArchiveBox command-line tool:
archivebox add "https://techne.hyperreal.coffee"
archivebox add < ~/Downloads/bookmarks.html
curl https://example.com/some/rss/feed.xml | archivebox add
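Since archivebox add reads URLs from standard input, we can tidy up a list before handing it over. The following is a sketch only: clean_urls is a helper made up for this post, and ~/urls.txt is a hypothetical file with one URL per line.

```shell
# Read lines on stdin; keep only http(s) URLs, trimmed and de-duplicated.
clean_urls() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -E '^https?://' \
        | sort -u
}

# Hypothetical input file with one URL per line.
url_file="$HOME/urls.txt"
if [ -f "$url_file" ] && command -v archivebox >/dev/null 2>&1; then
    clean_urls < "$url_file" | archivebox add
fi
```

Like the other archivebox commands, this needs to be run from inside the data directory.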
We can specify the depth if we want to archive every URL linked from the given page as well:
archivebox add --depth=1 https://example.com/some/feed.RSS
We can run archivebox on a cron schedule:
archivebox schedule --every=day --depth=1 http://techrights.org/feed/
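To see what ended up scheduled, we can look at the crontab directly. This sketch assumes the schedule command writes its jobs to the current user’s crontab; archivebox_jobs is a small filter written for this example.

```shell
# Filter crontab text on stdin down to the lines that mention archivebox.
archivebox_jobs() {
    grep 'archivebox' || true   # grep exits non-zero when nothing matches
}

if command -v crontab >/dev/null 2>&1; then
    # crontab -l fails when the user has no crontab yet, hence the fallback.
    { crontab -l 2>/dev/null || true; } | archivebox_jobs
fi
```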
‘Dassit! Enjoy ArchiveBox :-)