Bringing Git to data archival
I am increasingly excited about distributed version control and how it enables easy collaboration between software developers without technical and social barriers such as synchronisation of work and maintenance of control.
The obvious question is how DVC systems can be applied to scientific collaboration and to my particular specialism: large-scale data access and archiving for scientists. I'm not alone in asking the question either. There was a flurry of discussion in the blogosphere in 2011 around the idea of a GitHub for Science, and Git has regularly come up when discussing solutions for managing the CMIP5 archive.
Therefore I was really excited when I discovered git-annex. This tool looks like an excellent fit for solving many of the challenges we have faced in developing the data infrastructure for CMIP5, and it has the potential to bring a more radical git-like workflow to how scientists obtain data. To explain why, I need to describe a little about one aspect of data management for CMIP5.
CMIP5 and drslib
My particular contribution to CMIP5 has been drslib, a library for maintaining the directory structure used to store CMIP5's 1-2 PB of data. To cut a long story short, drslib maintains a tree of thousands of dataset directories, each containing a collection of files ranging from MBs to tens of GB. Each dataset can go through several versions and each version is visible on the filesystem as a separate subdirectory. The challenges for drslib are:
- Manage changes from one version to another.
- De-duplicate the storage so that a file which exists in multiple versions does not need to be stored twice.
- Just use the filesystem so that standard data transfer tools like FTP still work.
This is achieved by storing all files in a separate storage subdirectory, inspiringly called "files", and symbolically linking files from there to version subdirectories named v<YYYYMMDD>. For instance, the structure of a dataset with 2 variables and 2 versions looks something like this:
$ tree -Fd .
.
├── files
│   ├── sbl_20111109
│   ├── sbl_20120105
│   ├── snw_20111109
│   └── snw_20120105
├── latest -> v20120105
├── v20111109
│   ├── sbl
│   └── snw
└── v20120105
    ├── sbl
    └── snw
Each leaf directory in files contains the real data, and the leaf directories under v<YYYYMMDD> contain symbolic links.
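The layout above can be sketched in a few lines of Python. This is only an illustration of the idea, not drslib's actual API: the function name add_version, its arguments, and the rule "re-link whatever the previous version pointed at" are assumptions made for the example.

```python
import os
from pathlib import Path


def add_version(dataset: Path, version: str, new_files: dict) -> None:
    """Create v<version> in a drslib-like layout (hypothetical helper).

    new_files maps a variable name to the source files that are new or
    changed in this version. Anything not replaced is carried forward as a
    link to the copy that is already stored, so it is never stored twice.
    """
    vdir = dataset / f"v{version}"
    vdir.mkdir(parents=True)
    latest = dataset / "latest"

    # Carry forward: copy the links (not the data) of the previous version.
    if latest.is_symlink():
        prev = dataset / os.readlink(latest)
        for link in prev.glob("*/*"):
            var_dir = vdir / link.parent.name
            var_dir.mkdir(exist_ok=True)
            # Relative targets stay valid because both links sit at the
            # same depth: <dataset>/v<version>/<variable>/<file>.
            os.symlink(os.readlink(link), var_dir / link.name)

    # Store new or changed files once, under files/<variable>_<version>/.
    for variable, sources in new_files.items():
        store = dataset / "files" / f"{variable}_{version}"
        store.mkdir(parents=True)
        var_dir = vdir / variable
        var_dir.mkdir(exist_ok=True)
        for src in map(Path, sources):
            stored = store / src.name
            stored.write_bytes(src.read_bytes())
            link = var_dir / src.name
            if link.is_symlink():
                link.unlink()  # this version replaces the older copy
            os.symlink(os.path.relpath(stored, var_dir), link)

    # Repoint "latest" at the new version.
    if latest.is_symlink():
        latest.unlink()
    os.symlink(vdir.name, latest)
```

Calling add_version twice, the second time with only the changed files, gives a second version directory whose unchanged entries still resolve into the first version's storage directory.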
git-annex in a nutshell
git-annex stores the content of large files under .git/annex and only checks metadata about each file into git. Each clone of the repository keeps a complete history of where an annexed file can be found, either from a remote's annex, the web or something called a special remote, but the clone only downloads the file itself if requested. It then symbolically links the file into the working tree. The result is remarkably similar to what drslib does, only much better engineered of course!
An example
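The mechanism described above (content moved into an annex, a symlink left in the working tree, content fetched into a clone only on request) can be mimicked with a toy model in Python. This is a sketch under stated assumptions and not how git-annex is implemented: worm_key, object_path, annex_add and annex_get are hypothetical names, the objects directory is kept flat (real git-annex adds two levels of hash directories), and location tracking is reduced to "ask one known remote".

```python
import os
import shutil
from pathlib import Path


def worm_key(path: Path) -> str:
    """Build a WORM-style key from size, mtime and file name (illustrative)."""
    st = path.stat()
    return f"WORM-s{st.st_size}-m{int(st.st_mtime)}--{path.name}"


def object_path(repo: Path, key: str) -> Path:
    # Assumption: flat layout, without git-annex's hash directories.
    return repo / ".git" / "annex" / "objects" / key / key


def annex_add(repo: Path, path: Path) -> str:
    """Move the file's content into the annex and leave a symlink behind."""
    key = worm_key(path)
    obj = object_path(repo, key)
    obj.parent.mkdir(parents=True, exist_ok=True)
    os.replace(path, obj)
    os.symlink(os.path.relpath(obj, path.parent), path)
    return key


def annex_get(repo: Path, path: Path, remote: Path) -> None:
    """Fetch the content behind a dangling symlink from another repository."""
    key = Path(os.readlink(path)).name
    obj = object_path(repo, key)
    if not obj.exists():
        obj.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(object_path(remote, key), obj)
```

In this model a clone that only holds the symlink sees a dangling link until annex_get is called, which is the "only downloads the file itself if requested" behaviour in miniature.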
The resulting annex structure is a little more complex than drslib's files directory but manageably so. The annex has a similar structure to git's object database only with configurable object naming. See git-annex internals for details.
$ find .git/annex/objects -type d | head -n 20
.git/annex/objects
.git/annex/objects/w2
.git/annex/objects/w2/Jp
.git/annex/objects/w2/Jp/WORM-s33431484-m1320856060--snw_LImon_HadGEM2-ES_rcp85_r3i1p1_205512-208011.nc
.git/annex/objects/z3
.git/annex/objects/z3/M0
.git/annex/objects/z3/M0/WORM-s26748560-m1357837445--sbl_LImon_HadGEM2-ES_rcp85_r3i1p1_208012-210011.nc
.git/annex/objects/VM
.git/annex/objects/VM/5W
.git/annex/objects/VM/5W/WORM-s127784-m1325766407--sbl_LImon_HadGEM2-ES_rcp85_r3i1p1_210012-210012.nc
.git/annex/objects/gG
.git/annex/objects/gG/zV
.git/annex/objects/gG/zV/WORM-s127668-m1325767423--snw_LImon_HadGEM2-ES_rcp85_r3i1p1_210012-210012.nc
.git/annex/objects/X6
.git/annex/objects/X6/Mv
.git/annex/objects/X6/Mv/WORM-s33431600-m1357837443--sbl_LImon_HadGEM2-ES_rcp85_r3i1p1_200512-203011.nc
.git/annex/objects/VK
.git/annex/objects/VK/Kj
.git/annex/objects/VK/Kj/WORM-s33431600-m1325766325--sbl_LImon_HadGEM2-ES_rcp85_r3i1p1_200512-203011.nc
.git/annex/objects/qM
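The object names in this listing follow git-annex's key format, BACKEND-sSIZE-mMTIME--NAME, so they can be pulled apart with a small parser. The sketch below is illustrative: AnnexKey, parse_key and mtime_date are hypothetical names, and the regular expression only covers the size and mtime fields seen here, not every field git-annex supports.

```python
import re
from datetime import date, datetime, timezone
from typing import NamedTuple, Optional


class AnnexKey(NamedTuple):
    backend: str
    size: Optional[int]   # bytes, from the -s field
    mtime: Optional[int]  # seconds since the epoch, from the -m field
    name: str


# Assumption: only the -s and -m fields are handled in this sketch.
KEY_RE = re.compile(
    r"^(?P<backend>[A-Z0-9]+)"
    r"(?:-s(?P<size>\d+))?"
    r"(?:-m(?P<mtime>\d+))?"
    r"--(?P<name>.+)$"
)


def parse_key(key: str) -> AnnexKey:
    """Split a key such as 'WORM-s127784-m1325766407--file.nc' into fields."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"not a recognised key: {key!r}")
    size = int(m["size"]) if m["size"] is not None else None
    mtime = int(m["mtime"]) if m["mtime"] is not None else None
    return AnnexKey(m["backend"], size, mtime, m["name"])


def mtime_date(key: AnnexKey) -> Optional[date]:
    """Return the UTC calendar date encoded in the key's mtime, if any."""
    if key.mtime is None:
        return None
    return datetime.fromtimestamp(key.mtime, tz=timezone.utc).date()
```

Parsed this way, the two entries for sbl_LImon_HadGEM2-ES_rcp85_r3i1p1_200512-203011.nc have the same name and size but different mtimes. The WORM backend derives keys from name, size and modification time rather than from content, so they are stored as two separate objects.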
Conclusions
So git-annex is remarkably similar to drslib in the way it de-duplicates large files on the filesystem. It could replace drslib's de-duplication and version transition logic without having any impact on what the end user sees. These features come with the full advantage of git for robust version tracking and cloning. I will be investigating this further as we prepare to take on data from the CORDEX project.
After a few days of investigating git-annex, the software seems remarkably robust and worth pursuing further. Development is active and there are RPM and DEB packages available. If we can make it work, there are many possibilities beyond this narrow use case, such as replication via cloned repositories or allowing advanced users to clone a repository to get a traceable version history.