In mid-2017 I set up an automated scrape of a frequently-updated website. Every day my script would crawl the website, download its contents, and commit these to GitHub. This allowed me to back up not just the site contents but the complete history of changes.
This scrape did its thing for nearly 4 years, until I came to decommission the server where it ran. I was a little surprised to find that the repository had grown to well over 1 GB, even though the site itself only contains around 80 MB of data. The root volume of the server was only 8 GB, so this one scrape was using a pretty big proportion of the disk!
I shouldn’t have been surprised: storing the entire history of a website will quickly add up, especially over such a long period. I wanted to keep the entire history of the site, but I realised I didn’t need to store it on the server itself (GitHub does a fine job of hosting repositories, after all). It was time to go digging for a better solution.
git shallow clone
I’d heard about the idea of a “shallow clone,” where one clones only recent commits from a repository rather than the whole thing. git clone supports the --depth option, which allowed me to clone only the most recent commit from the repository.
--depth
Create a shallow clone with a history truncated to the specified number of commits.
Git – git-clone Documentation (git-scm.com)
Let’s take a look at how this works. First I’m going to create a source repository that has a couple of commits. You can skip this step if you’d prefer to experiment with a real repository.
# Make a bare "remote" repository that does an impression of GitHub
leigh:~$ mkdir remote
leigh:~$ cd remote
leigh:~/remote$ git init --bare
Initialized empty Git repository in /home/leigh/remote/
leigh:~/remote$ cd ..
# Clone the "remote" and add some commits
leigh:~$ git clone file:///home/leigh/remote local
Cloning into 'local'...
warning: You appear to have cloned an empty repository.
leigh:~$ cd local
leigh:~/local$ git commit --allow-empty -m 'First commit'
[master (root-commit) 3c315ce] First commit
leigh:~/local$ git commit --allow-empty -m 'Most recent commit'
[master dc999d5] Most recent commit
leigh:~/local$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 12 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 251 bytes | 251.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To file:///home/leigh/remote
* [new branch] master -> master
# Take a look at the results
leigh:~/local$ git log
commit dc999d56edcb14345da39ea25799879dadc406c7 (HEAD -> master)
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:26:26 2021 +0000
Most recent commit
commit 3c315ceef3d3b5da0e02b0ea0249dfd2052175b3
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:26:14 2021 +0000
First commit
Now let’s clone this repository again, but only capture the most recent commit:
# Clear out the original copy
leigh:~/local$ cd ..
leigh:~$ rm -rf local
# Clone again, passing --depth
leigh:~$ git clone --depth 1 file:///home/leigh/remote local
Cloning into 'local'...
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (2/2), done.
# Take a look at the results
leigh:~$ cd local
leigh:~/local$ git log
commit dc999d56edcb14345da39ea25799879dadc406c7 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:26:26 2021 +0000
Most recent commit
This is useful: we can clone all the files in a repository but ignore all its history.
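If you want to put numbers on the saving, git can tell you how much data each clone is storing. In this toy repository the difference is negligible, but on the real scrape it was the difference between roughly 80 MB and well over 1 GB. A couple of commands (output omitted here) do the job:
# Report how much data git is storing for this clone
git count-objects -vH
# Or simply measure the .git directory itself
du -sh .git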
What next?
I now know how to clone only a single commit, making it much easier to migrate this script to a new server (I have to download only 80 MB rather than > 1 GB).
Unfortunately this doesn’t quite solve the entire problem. In another four years I’ll have accumulated another 1 GB of new commits.
# Add another commit
leigh:~/local$ git commit --allow-empty -m 'New commit'
[master 2555676] New commit
leigh:~/local$ git push
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), 187 bytes | 187.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0)
To file:///home/leigh/remote
dc999d5..2555676 master -> master
# See what we have
leigh:~/local$ git log
commit 2555676490bee5e32b109dc8653596b4bd0de206 (HEAD -> master, origin/master, origin/HEAD)
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:37:57 2021 +0000
New commit
commit dc999d56edcb14345da39ea25799879dadc406c7 (grafted)
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:26:26 2021 +0000
Most recent commit
New commits are added to the history and stored locally as usual. Usefully, I discovered that git fetch also supports the --depth option:
--depth
Limit fetching to the specified number of commits from the tip of each remote branch history. If fetching to a shallow repository created by git clone with --depth=<depth> option (see git-clone[1]), deepen or shorten the history to the specified number of commits. Tags for the deepened commits are not fetched.
Git – git-fetch Documentation (git-scm.com)
Let’s try it!
leigh:~/local$ git fetch --depth 1
remote: Total 0 (delta 0), reused 0 (delta 0)
leigh:~/local$ git log
commit 2555676490bee5e32b109dc8653596b4bd0de206 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Leigh Simpson <code@simpleigh.com>
Date: Sat Feb 13 17:37:57 2021 +0000
New commit
This is perfect: we can create and push a new commit and then throw away previous revisions. GitHub retains the full history of the crawl and I use a lot less disk space on my server.
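A caveat if you try to verify the saving immediately in a sandbox like this: git keeps unreachable objects (and the reflog entries pointing at them) around for a grace period, so the discarded commit won’t disappear from disk straight away. You can force the clean-up by hand, although on the real server the regular git gc run is enough over time:
# Only needed when experimenting: expire reflogs and prune unreachable objects now
git reflog expire --expire=now --all
git gc --prune=now
git count-objects -vH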
Putting it together
I ended up with a script that looks a little like this:
#!/usr/bin/env bash
cd "$(dirname "$0")"
# Download content
wget --config wgetrc http://example.com/
# Craft a new commit and push
git add .
git commit -m "Update for $(date +%Y-%m-%d)"
git push
# Trim the local history to just the most recent commit and garbage-collect
git fetch --depth 1
git gc
I use wget to do the scrape, and configure it using a local file (called wgetrc). The git gc call at the end shouldn’t really be necessary, but it doesn’t hurt.
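For completeness, the wgetrc itself is nothing exotic. The exact options depend on the site being scraped, but something along these lines gives the general idea (illustrative settings rather than my real ones):
# Illustrative wgetrc for a recursive scrape
recursive = on          # follow links within the site
no_parent = on          # never wander above the starting directory
page_requisites = on    # also fetch images, CSS, etc.
timestamping = on       # skip files that haven't changed since the last run
wait = 1                # be polite: pause one second between requests
tries = 3               # give up on a URL after three failed attempts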
Other thoughts
Running out of disk is a pretty disastrous situation for a server, and I’m always keen to minimise this risk. The scrape job described above opens up an interesting attack vector: if the site owner were to upload a large file then my script would happily try to download it. In the process it would consume all available disk space and bring my server to a halt!
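wget’s quota setting blunts this a little: a line such as the one below in the wgetrc makes a recursive retrieval abort once the running total passes the limit. It isn’t a complete answer, though, because a single oversized file is still downloaded in full before the quota kicks in.
quota = 200m    # illustrative limit: stop the recursive retrieval after ~200 MB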
An easy way to resolve this is to move the scrape onto its own dedicated disk. The server will then carry on running even if that disk fills up. The server is an EC2 instance running on Amazon Web Services so this was trivially easy: I created a new volume, attached it to the instance, and mounted it within the operating system. This is a good pattern for any directory that may grow without bound: even logs can explode in volume during an incident.
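For reference, once the new volume is attached to the instance the operating-system side is only a few commands. The device name and mount point below are examples (on newer instance types the device may appear as /dev/nvme1n1 or similar, and using the UUID from blkid in /etc/fstab is more robust):
# Format the new volume, mount it, and make the mount survive a reboot
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /srv/scrape
sudo mount /dev/xvdf /srv/scrape
echo '/dev/xvdf /srv/scrape ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab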
If the disk does fill up then I still want to know about it so I can fix the scrape. This is also pretty simple using AWS. I use CloudWatch Logs and the agent process can monitor metrics such as disk space. I monitor all disk volumes within CloudWatch and trigger alerts when disks start to fill up.
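The configuration for the disk metrics is tiny. A fragment along these lines (a sketch of the shape, not my exact file) asks the unified CloudWatch agent to report disk usage for every mounted filesystem, and an alarm on the resulting used_percent metric does the rest:
{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["*"],
        "metrics_collection_interval": 300
      }
    }
  }
}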
I’ll follow up if this doesn’t work, but I hope that won’t be for a few more years.