Experiences with ArchiveBox

Does anyone have experience self-hosting ArchiveBox?

I’ve been thinking of doing it and have been wondering about things such as how much RAM and disk space it takes.


I have had it running for around two years.

The Sonic search engine uses around 60MB of RAM for me, and ArchiveBox itself around 250MB. Sometimes there are bugs where Chromium does not terminate and RAM use goes out of control. I simply use a Docker Compose v2 file with RAM/CPU limits (previously, I'd restart the Docker container once a day; now OOM kills happen once or twice a month).

As for space, with ca. 6000 pages archived, it takes around 80G compressed (the raw data folder is 128G right now). The space needs depend on whether you disable video archival and how many of the archival tools you run per URL (I run all except for PDF and DOM).

If you host in a datacenter, be ready for URL archivals to fail, because many sites block non-residential IPs.
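For anyone sizing a disk from those numbers, here is a rough back-of-the-envelope sketch. It only uses the figures quoted above (about 6000 pages, 128G raw, 80G compressed) and assumes that usage grows linearly with page count and that "G" means 1024 MB; both are assumptions, and the function names are made up for illustration. Real usage will vary a lot with video archival and which extractors are enabled.

```python
# Figures quoted in the post above; everything else is an assumption.
PAGES_ARCHIVED = 6000
RAW_GB = 128
COMPRESSED_GB = 80


def per_page_mb(total_gb: float, pages: int) -> float:
    """Average size per archived page in MB, taking 1 GB as 1024 MB."""
    return total_gb * 1024 / pages


def estimate_gb(pages: int, total_gb: float = RAW_GB,
                sample_pages: int = PAGES_ARCHIVED) -> float:
    """Linear extrapolation of disk use from the sample above."""
    return total_gb * pages / sample_pages


print(f"raw per page:        {per_page_mb(RAW_GB, PAGES_ARCHIVED):.1f} MB")
print(f"compressed per page: {per_page_mb(COMPRESSED_GB, PAGES_ARCHIVED):.1f} MB")
print(f"10000 pages, raw:    {estimate_gb(10000):.0f} GB")
```

So on this one data point, a page costs very roughly 20 MB raw, which is a sample from a single setup rather than a general rule.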

Otherwise, I think it’s an essential tool to have.

Has anyone else attempted to use this after hearing about it?

I would love to hear about best practices with ArchiveBox. I had a huge set of problems trying to get it to work with logged-in websites, primarily because I run Docker on Mac.

ArchiveBox appears to primarily use two mechanisms to download:

  • Chrome - accessing a profile directory
  • Wget - accessing the standardized cookies.txt (using a Netscape standard)
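To make the cookies.txt route concrete, here is a minimal sketch using Python's standard `http.cookiejar.MozillaCookieJar`, which reads and writes the Netscape cookie file format that Wget accepts via `--load-cookies`. The domain, cookie name, and value are placeholders, and the helper names are hypothetical; how the resulting file gets handed to ArchiveBox is not covered here.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(domain: str, name: str, value: str,
                expires: int = 4102444800) -> http.cookiejar.Cookie:
    """Build a cookie object; the default expiry is 2100-01-01 UTC (placeholder)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def write_cookies_txt(path: str, cookies) -> None:
    """Write cookies to `path` in Netscape cookies.txt format."""
    jar = http.cookiejar.MozillaCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    jar.save(ignore_discard=True, ignore_expires=True)


def read_cookies_txt(path: str):
    """Load a cookies.txt file and return (domain, name, value) tuples."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    return [(c.domain, c.name, c.value) for c in jar]


# Round trip with a placeholder session cookie.
with tempfile.TemporaryDirectory() as tmp:
    cookies_path = os.path.join(tmp, "cookies.txt")
    write_cookies_txt(cookies_path, [make_cookie(".example.com", "sessionid", "abc123")])
    read_back = read_cookies_txt(cookies_path)

print(read_back)
```

Because the file is plain text, it does not depend on the host OS or browser build, unlike a Chrome profile directory.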

Chrome blows my mind. ArchiveBox comes with a version of Chrome it can rely on, which is important because I can't point it to my macOS binary (since the container is running some variant of Linux). The issue then compounds, because Chrome profiles aren't backwards or forwards compatible, and I'm not even sure whether they're portable between the two OSs even with the exact same version/build of Chrome.

So if I want to host ArchiveBox in Docker on a macOS host, I think I'll also need Chrome in a Docker container, with its binaries and profile made available to the ArchiveBox container. It seems like a giant mess, so I'm just curious if anyone has better ideas.

I run it on a Linux VM in a datacenter (https://contabo.com/en/), so I gave up on using it with websites that require login a long time ago.

I just opened "Feature Request: a web clipper" (Issue #1203 in ArchiveBox/ArchiveBox on GitHub) to track a request for a web clipper, where you use your browser to capture a WARC, for example, and then store it in ArchiveBox. For the time being, I use Joplin to clip webpages behind login pages.

Yeah, I don't think it's a supported use (though a creative one, :scientist: )
