Introduction to git-annex

git-annex efficiently synchronizes directories, called repositories, stored at different locations on different machines. It does so by handling the file contents and the file list separately. The file list is fully synchronized to all repositories, while the actual file contents are copied only to a subset of repositories. If the content of a file is not available in a repository, git-annex puts a placeholder there in place of the file.

To create a repository in a local directory:

git init .
git annex init "name of the repository"

Repositories can be in either indirect or direct mode; by default they are in indirect mode. Indirect mode requires files to be unlocked before they are edited, which allows every version to be saved. In direct mode files do not need to be locked or unlocked before being edited, but their local history may be lost: if you change a file without having synchronized it to another repository, the original version is lost. The same happens if you did synchronize the file to another repository but also modified it there.

To edit a file when the repository is in indirect mode:

git annex unlock name/of/file

To record the changes to a file after editing when the repository is in indirect mode:

git annex add name/of/file

To discard the changes made to a file when the repository is in indirect mode:

git annex lock name/of/file

To switch a repository to direct mode:

git annex direct

To try to recover the previous version of a file in direct mode (it may fail):

git annex undo name/of/file

To be synchronized, repositories need to be linked. Each repository has a list of remotes: other repositories that it can be synchronized with. The configuration of a remote consists of the unique identifier of the repository and an address that specifies how it can be accessed. For instance, assume there are three repositories A, B and C. The first two are on the same machine FOO, at locations /srv/a and /srv/b, while the third is at location /srv/c on machine BAR, which is accessible via ssh from FOO. Repository A can be configured with two remotes: one for B pointing to /srv/b, and one for C pointing to ssh://BAR/srv/c.

To add a remote to a repository:

git remote add name_of_remote url_to_remote

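As a concrete illustration, the two remotes of repository A from the example above could be configured as in the following sketch. It makes a few assumptions that are not part of the original setup: repository A is created in a temporary directory rather than at /srv/a so that the commands run without special privileges, and the remote names b and c are arbitrary. Adding a remote only records its address, so neither /srv/b nor the machine BAR needs to exist at this point.

```shell
set -eu

# Stand-in for repository A (the example places it at /srv/a).
demo=$(mktemp -d)
git init -q "$demo/a"
cd "$demo/a"

# Remote for B: a plain path, since B lives on the same machine.
git remote add b /srv/b

# Remote for C: an ssh URL, since C lives on machine BAR.
git remote add c ssh://BAR/srv/c

# List the configured remotes together with their addresses.
git remote -v
```

In a real setup each of the three directories would also be initialized with git annex init, as shown at the beginning.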

Synchronization of two repositories can be triggered at will. To help synchronize the repositories, git-annex keeps track of their history (see note 1). Thanks to this it can easily determine what changed between any two of them. If the history of the two repositories is the same, no changes need to be transferred. If the history of one of the two repositories is behind, git-annex only needs to apply the missing changes to it. If the histories diverged, that is, some changes were made only on one repository and others only on the other, then the program merges those changes. To do so it follows two simple rules. For every file that was modified on only one side, it applies the modification to the other repository. For every file that was modified in different ways on both sides, it creates two copies with different names, one coming from the first repository and one from the second.

To synchronize the local repository with all its remotes:

git annex sync --content

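The following sketch puts the pieces together for two repositories on the same machine. It is only an illustration under several assumptions: the repositories live in a temporary directory, the file name file.txt and the repository descriptions are made up, repository B is created by cloning A (so A is already configured in B as the remote named origin), and a reasonably recent git-annex is installed. The script skips the demonstration if git-annex is not found.

```shell
set -eu

if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex not found, skipping the demonstration"
else
    demo=$(mktemp -d)

    # Repository A: create it and record one annexed file.
    git init -q "$demo/a"
    cd "$demo/a"
    git config user.name "Demo"               # local identity so commits work anywhere
    git config user.email "demo@example.com"
    git annex init "repository A"
    echo "hello" > file.txt
    git annex add file.txt
    git commit -q -m "add file.txt"

    # Repository B: a clone of A, which makes A the remote "origin" of B.
    git clone -q "$demo/a" "$demo/b"
    cd "$demo/b"
    git config user.name "Demo"
    git config user.email "demo@example.com"
    git annex init "repository B"

    # Exchange the file list and, because of --content, the file contents.
    git annex sync --content

    # The content is now available in B as well.
    cat file.txt

    # Show which repositories hold a copy of the content.
    git annex whereis file.txt
fi
```

Right after the clone, file.txt in B is only a placeholder; the content arrives with the synchronization.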

In git-annex there is no master copy of the data: any two repositories can be synchronized. Even if some repositories cannot be synchronized with each other directly, repeating the pairwise synchronization enough times brings all repositories to the same state.

During synchronization all repositories receive the full list of files and, depending on how their preferred content has been configured, they receive the content of none, some or all of the files (see note 2). It is possible to configure git-annex so that, for instance, a given repository receives all the versions of the files or just the current version. It is also possible to force the copy of the content of a file to a given repository.

To force the current version of all files to be copied to the local repository during synchronization:

git annex wanted . standard
git annex group . client

To force all the versions of all files to be copied to the local repository during synchronization:

git annex wanted . standard
git annex group . backup

To copy the content of a file to the local repository from any remote that has it:

git annex get path/to/file

In addition to full-fledged repositories, git-annex supports two other types of storage facilities. The first are repositories that can store only the history of the file list; the second, called special remotes, are repositories that can store only the contents of the files. Repositories of the first type are normal git repositories (for instance stored on GitHub). Those of the second type are file storage services (such as Amazon or Dropbox). When synchronizing with these repositories only one kind of information is exchanged: with the first type the file list, with the second type the file contents. It is possible to configure git-annex to store the content of the files on special remotes in encrypted form.

  1. git-annex keeps track of the history of repositories using git. Instead of storing the actual files, it stores symbolic links that point to the hash of the file content. This keeps the history small and avoids being forced to push the content of the files during synchronization.
  2. The program keeps track of which file content (identified by its hash) is stored in which repository on a git branch that is synchronized at the same time as the file list. This means that all repositories, assuming that they are all synchronized, know where all the copies of a given file are.
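
The idea behind note 1 can be sketched with ordinary shell tools. This is a simplified illustration and not the layout git-annex actually uses: the directory name objects, the file name report.dat and the choice of SHA-256 (computed with the coreutils sha256sum command) are assumptions made for the example.

```shell
set -eu

work=$(mktemp -d)
cd "$work"
mkdir objects
printf 'some large content\n' > report.dat

# Hash the content and move it into a store keyed by that hash.
hash=$(sha256sum report.dat | cut -d' ' -f1)
mv report.dat "objects/$hash"

# Leave a symbolic link in place of the file. Only this small link
# would have to be recorded in the history.
ln -s "objects/$hash" report.dat

# The link names the hash of the content it stands for.
readlink report.dat

# The content is still reachable through the link.
cat report.dat
```

A repository that does not hold the content would contain the same link with nothing behind it, which corresponds to the placeholder mentioned at the beginning.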