This blog post is going to demo migrating a Git repository that contains large files. An overview of the steps are:

  • Create empty repos at gitlab.com and github.com
  • Create demo file structure and push to gitlab.com
  • Create a mirror clone of the repo
  • Pull down with all branches
  • Create large files
  • Install git lfs
  • Try to use git lfs when there are super large files
  • Delete super big files using BFG Repo Cleaner
  • Working with files in LFS

Create empty repos at gitlab.com and github.com

I have already created personal accounts at gitlab.com and github.com. We will use GitLab as the source and GitHub as the target, but these are just examples.

First, let’s create an empty repository in GitLab. To do that I browse to https://gitlab.com/projects/new#blank_project, login, and fill out the following:

  • Project name: demo-large-files
  • Project slug: demo-large-files
  • Uncheck: Initialize repository with a README
  • Leave all the other defaults

gitlab

Next, let’s do the same at GitHub. I then browse to https://github.com/new, login, and fill in the following:

  • Owner: elijahgagne
  • Repository name: demo-large-files
  • Leave all the other defaults

github

Following these steps, I now have two empty repositories at:

Create demo file structure and push to gitlab.com

I like to work locally as much as possible. I use my Mac’s Downloads folder as a “scratch” space of sorts. In the steps below, I do the following:

  • Create a demo-large-files folder
  • Initialize a Git repo in this folder
  • Create some demo folders and files
  • Make a commit
  • Create two branches
  • Create two tags
  • Add a remote
  • Push everything to GitLab
  • Delete everything locally
cd ~/Downloads
mkdir demo-large-files
cd demo-large-files

## Uncomment and run the following command to
## configure git to use a default of "main" instead of "master"
# git config --global init.defaultBranch main
## Alternatively, the following command can be used
## after initialization to rename the branch
# git branch -M main

git init

mkdir folder1 folder2
touch README.md data.csv folder1/file1 folder2/file2

git add -A
git commit -m 'Initial commit'

git branch develop
git branch patch-fix

git tag v1.0.0
git tag production

git remote add origin https://gitlab.com/egagne/demo-large-files.git

git push --set-upstream origin main

# Note that only the main branch was pushed
# You can confirm this at https://gitlab.com/egagne/demo-large-files/-/branches

git push --all

# Note that tags were not pushed
# You can confirm this at https://gitlab.com/egagne/demo-large-files/-/tags

git push --tags

cd ~/Downloads
rm -rf demo-large-files

Create a mirror clone of the repo

In this section we do a mirror clone of the GitLab repo. Mirror clones are useful to pull down all branches, tags, and other repository objects so that you have the full repository.

The code below does the following

  • git clone –mirror
  • Changes the repo to non-bare. This is useful if we plan to push the repo to a new remote.
  • We look at the file structure, branches, and tags
  • Then we do a regular clone and see how it is different. Note that the tags will come over, but not the branches.
  • Finally we delete everything locally
cd ~/Downloads
git clone --mirror https://gitlab.com/egagne/demo-large-files.git

ls -l demo-large-files.git

git -C demo-large-files.git config --get core.bare
git -C demo-large-files.git config --bool core.bare false

cd demo-large-files.git
git branch
git tag

cd ~/Downloads
git clone demo-large-files.git demo-large-files

cd demo-large-files
ls -l

git branch

git tag

cd ~/Downloads
rm -rf demo-large-files demo-large-files.git

Pull down with all branches

Let’s look at this again, but this time we will end up with a regular clone that includes all branches.

cd ~/Downloads
rm -rf demo-large-files
git clone https://gitlab.com/egagne/demo-large-files.git

git tag

git branch

for branch in $(git branch -r | grep -v '\->'); do
  git branch --track "${branch#origin/}" "$branch"
done

git branch

cd ~/Downloads
rm -rf demo-large-files

For the same result, here are steps that do a mirror clone, convert it to a non-bare repo, and then pull in all the branches.

git clone --mirror https://gitlab.com/egagne/demo-large-files.git

git clone demo-large-files.git demo-large-files

cd demo-large-files

git tag
git branch

git remote -v

for branch in $(git branch -r | grep -v '\->'); do
  git branch --track "${branch#origin/}" "$branch"
done

git branch

Create large files

OK, that’s been fun, but let’s move on to working with large files now. The code block does the following.

  • Deletes anything still on the local file system
  • Does a fresh regular clone from GitLab
  • Pulls down all the branches
  • Create two large (128mb) and two super large (2.1gb) files
  • Commits the files
  • Adds GitHub as the remote origin
  • Tries a push

Note that this will fail because GitHub (same is true for GitLab) has a max file size limit of 100mb.

rm -rf demo-large-files.git demo-large-files

cd ~/Downloads
git clone https://gitlab.com/egagne/demo-large-files.git
cd demo-large-files

for branch in $(git branch -r | grep -v '\->'); do
  git branch --track "${branch#origin/}" "$branch"
done

# add some large files and super large files
dd if=/dev/urandom of=file_a_128mb.data bs=1024 count=131072
dd if=/dev/urandom of=file_b_128mb.data bs=1024 count=131072
dd if=/dev/urandom of=file_a_2.1gb.data bs=1024 count=2202010
dd if=/dev/urandom of=file_b_2.1gb.data bs=1024 count=2202010

git add file_*
git commit -m 'large files added'

git remote -v

git remote set-url origin https://github.com/elijahgagne/demo-large-files.git

git push

The push is expected to fail because of the large files.

Install git lfs

Let’s use the Git extension Large File Storage. Since I’m on a Mac, I can use the following commands to install it.

brew install git-lfs
git lfs install

More information is avialable at https://git-lfs.com/

Try to use git lfs when there are super large files

This code block demos the following

  • Use the find command to locate all files >100mb
  • Remove one large and one super large file
  • Run a git garbage collection
  • Import everything >100mb into LFS
  • Try to push

Note that this is going to fail because my personal GitHub account cannot take super large files, even though they are managed by git lfs.

# Find files larger than 100mb
find . -type f -size +100M

git rm file_b_*
git commit -m 'removed one large and one super large file'

git lfs migrate import --everything --above=100Mb
git push --all origin

Delete super big files using BFG Repo Cleaner

Alright, so LFS works for large files, but there are still limits imposed by the Git remote that might preclude super large files. Git isn’t a place to store large data; it’s really meant for storing code.

For this use case we need a way to remove the super large files from the Git repo. This is going to mean rewriting Git history. There is a git filter-branch command built into the command line binary to do this. But even GitHub recommends that instead you consider using the open source tool called BFG Repo Cleaner.

Note that this tool/workflow might also be useful if you’ve accidently committed sensitive data to a repository. I’m of the opinion that if you put a password or encryption key in a repo, your best bet is to change that credential immediately because you might not be able to account for anyone else you pulled down your secrets. But that’s a topic for a different post. Additionally, I’ll add that if you want to overwrite the remotes history, that is possible with a git push --force

This code block demos the following

  • Download the BFG jar file
  • Then we remove everything from LFS
  • Next, we delete the super large file from the working directory and then commit the change. This is needed because BFG, by design, will not delete files that exist in HEAD. It’s documention explains that “your current files are sacred”.
  • We run git log so that we can have a listing of the commit ids
  • Git garbage collection is run to pack the objects
  • Now it’s time to run BFG to delete the large files and then rewrite the git history
  • Logs are written to a ..bfg-report folder
  • We then run git log again and note that some of the commit hashes have changed
  • We try a push again, but it failes because of large files
  • Next we import with LFS and push again, which works
cd ~/Downloads
curl -o https://repo1.maven.org/maven2/com/madgag/bfg/1.14.0/bfg-1.14.0.jar

git lfs migrate export --everything --include="*"

git rm -f file_a_2.1gb.data
git add -A
git commit -m 'removed last super large file'

git log --pretty=format:"%h"

git gc

java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 2G .
git reflog expire --expire=now --all
git gc --prune=now --aggressive

rm -rf ..bfg-report

git log --pretty=format:"%h"

git push --all origin

git lfs migrate import --everything --above=100Mb

git push --all origin

git push --tags

Working with files in LFS

As a last piece, I wanted to explore how LFS works. The following code block shows:

  • We clone a repo with files managed by LFS
  • We check the size and structure of the cloned repo
  • Then we run a git lfs pull and check again
  • This shows that we’ve “hydrated” the files locally
cd ~/Downloads
rm -rf demo-large-files*

git clone https://github.com/elijahgagne/demo-large-files.git demo-large-files

cd demo-large-files

du -sh .
ls -lh
ls -lh .git/lfs

git lfs pull

du -sh .
ls -lh
ls -lh .git/lfs