Wednesday, December 23, 2015

A Look at Git Hashes

Git is a locally stored database with integrity guarantees. To get a feel for how this works, this post takes a look at a very simple repository using the commands hash-object and cat-file. Although these are not commands you would normally use (they are among the "plumbing commands" that lie behind "porcelain commands" like clone and commit), they are very helpful for inspecting the objects that a Git repository manages.


Let's create a Git repository. On a command line, type:

>git init test

You should see something like:

Initialized empty Git repository in C:/Some/Path/test/.git/

It's interesting to note that the response refers to the .git directory as the repository. The parent directory, "test", is called the "working directory" in Git documentation. Git tracks its contents, but it is not the repository; it's what the repository tracks and controls. In a non-distributed version control system, by contrast, the repository would reside in an external location, typically a server.

Here are the contents of an empty .git repository:

12/21/2015  06:06 AM               157 config
12/21/2015  06:06 AM                73 description
12/21/2015  06:06 AM                23 HEAD
12/21/2015  06:06 AM    <DIR>          hooks
12/21/2015  06:06 AM    <DIR>          info
12/21/2015  06:06 AM    <DIR>          objects
12/21/2015  06:06 AM    <DIR>          refs

Let's look at how these files and directories make Git work.  The file HEAD contains text that shows what the current branch is:

>type .git\HEAD
ref: refs/heads/master

These 23 bytes of ASCII text are how Git knows what you are working on. The "refs/heads/master" it points to is also a text file, or will be as soon as you make a commit. It contains 40 hexadecimal characters, the hash value of the current commit. When you create and switch to a new branch, Git does two very simple things: (1) it creates a new file in "refs\heads\" with the name of your new branch, containing the 40 character identifier of the current commit, and (2) it updates HEAD to "ref: refs/heads/<your-new-branch>". Contrast this with SVN or TFS, where creating a branch is a much heavier operation that builds a new directory hierarchy on the server. In Git, branching is just creating a file of about 40 bytes, which is pretty fast.
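To make the two steps concrete, here is a minimal Python sketch that mimics them on a throwaway directory. The function names read_head and create_branch are made up for this illustration, and the sketch ignores things real Git also handles (packed refs, lock files, reflogs, branch names containing slashes), so please don't point it at a real repository.

```python
import os
import tempfile


def read_head(git_dir):
    """Return (ref_name, commit_hash) for a .git directory.

    ref_name is None for a detached HEAD; commit_hash is None when the
    branch file does not exist yet (no commit has been made).
    """
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit hash directly.
        return None, head
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        # Assumption: no packed-refs lookup, so a missing file means "no commit".
        return ref, None
    with open(ref_path) as f:
        return ref, f.read().strip()


def create_branch(git_dir, name, commit_hash):
    """Toy version of creating and switching to a branch."""
    heads = os.path.join(git_dir, "refs", "heads")
    os.makedirs(heads, exist_ok=True)
    # Step 1: a small text file named after the branch, holding the commit hash.
    with open(os.path.join(heads, name), "w", newline="\n") as f:
        f.write(commit_hash + "\n")
    # Step 2: repoint HEAD at the new branch.
    with open(os.path.join(git_dir, "HEAD"), "w", newline="\n") as f:
        f.write("ref: refs/heads/" + name + "\n")


# Demo on a temporary directory standing in for .git (not a real repository).
with tempfile.TemporaryDirectory() as fake_git_dir:
    with open(os.path.join(fake_git_dir, "HEAD"), "w", newline="\n") as f:
        f.write("ref: refs/heads/master\n")
    print(read_head(fake_git_dir))  # ('refs/heads/master', None)
    create_branch(fake_git_dir, "droids", "914ebae549d6f4070184c7db9e1ddbaaf80e1d3b")
    print(read_head(fake_git_dir))
```

The point of the sketch is only that both steps are plain file writes; nothing is copied and no server is contacted.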

The real meat of the repository is the "objects" directory, where Git stores the objects it tracks.  Let's put some things in there.

>copy con droid1.txt
R2-D2
^Z

>copy con droid2.txt
C-3PO
^Z

So now we have two files. If Git is going to manage these files, it will need a way to refer to their contents and to track changes. Git uses SHA1 hashes to do this. They are fast to generate, and accidental collisions between different contents are vanishingly unlikely.

You can get the SHA1 hash value for any file (even one not tracked by Git) using git hash-object.

>git hash-object droid1.txt
668b9c33030c59db9c0f11f777029cc3fc0fdaf1

>git hash-object droid2.txt
27c825d7d5393f79c5b14cf0dd719e3dbb391c4e



This version of the command just outputs the 40 character value, but you can also use the "-w" option to store the contents as an object in the "objects" directory. So after

>git hash-object -w droid1.txt

Git splits the hash into a two character directory name and a 38 character file name, so you would now find a directory named ".git\objects\66" with a file named "8b9c33030c59db9c0f11f777029cc3fc0fdaf1". That file contains a zlib-compressed version of the file contents, preceded by a very short header stating that it is a "blob" and how many bytes long it is. Of course, you would not normally use "hash-object" to do this; you would use "git add" or a GUI tool. But this is what Git does behind the scenes.
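As a rough illustration of what happens under the hood, the Python sketch below builds the header-plus-contents byte string for a blob, hashes it, derives the object path, and round-trips it through zlib. The helper names (blob_object, object_id, loose_path, parse_loose) are invented for this post. The sample content b"R2-D2\n" is an assumption: the exact bytes Git hashes depend on your line-ending settings, so compare the printed value with your own hash-object output before trusting it.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Prefix the contents with the header 'blob <size>' and a NUL byte."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0" + content


def object_id(raw: bytes) -> str:
    """The object name is the SHA1 of header plus contents, as 40 hex digits."""
    return hashlib.sha1(raw).hexdigest()


def loose_path(oid: str) -> str:
    """First two hex digits name the directory, the remaining 38 the file."""
    return ".git/objects/" + oid[:2] + "/" + oid[2:]


def parse_loose(compressed: bytes):
    """Undo the zlib compression and split the header from the contents."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    return kind.decode("ascii"), int(size), body


content = b"R2-D2\n"  # assumption: LF line ending after Git's conversion
raw = blob_object(content)
oid = object_id(raw)
print(oid)
print(loose_path(oid))
print(parse_loose(zlib.compress(raw)))
```

One consequence of this layout is that two files with identical contents share a single object, whatever they are called.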

Let's go back to normal Git commands, and add these files to Git.

>git add *

>git commit -m "These are not the droids you're looking for"
[master (root-commit) 914ebae] These are not the droids you're looking for
 2 files changed, 2 insertions(+)
 create mode 100644 droid1.txt
 create mode 100644 droid2.txt

Up to now, the hashes in this post should match yours, but this will not be true of the commit hash, because the commit contains my name, email address, and the current date. So Git uses one hash to guarantee the contents of a file, and a different hash to guarantee the record of who added it to the repository, and when.

Let's take a look at the artifacts created by this commit.

>type .git\HEAD
ref: refs/heads/master

No change there. I am still on the "master" branch.

>type .git\refs\heads\master
914ebae549d6f4070184c7db9e1ddbaaf80e1d3b

Ah, now there is a hash value. A more user-friendly way of seeing this is the command git log:

>git log
commit 914ebae549d6f4070184c7db9e1ddbaaf80e1d3b
Author: dsolovay <dsolovay@gmail.com>
Date:   Mon Dec 21 06:54:17 2015 -0500

    These are not the droids you're looking for

But what does the commit contain? We can use another git plumbing command, git cat-file, to inspect it:

>git cat-file -p 914e
tree 9b8c255bff7beb1440cc726ebe3346816dc04d67
author dsolovay <dsolovay@gmail.com> 1450698857 -0500
committer dsolovay <dsolovay@gmail.com> 1450698857 -0500

These are not the droids you're looking for

A few things to note. I typed "914e" instead of the full hash; Git allows these shortcuts as long as the prefix is unambiguous. The value of the hash is generated from just the text above (plus a short header giving the object type and size): the "tree" (we will get to that next), the author and the committer (these can differ, for example when someone applies another person's work), the time stamp for each, and the commit message. Change any of these, and you change the hash value, so the hash guarantees the contents. If this were not an initial commit, the commit would also contain the hash of its parent, thus including its ancestry in the SHA1 hash guarantee.
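Here is a hedged Python sketch of that calculation. The function name commit_id is made up, and the byte layout (a "commit <size>" header, a NUL byte, then the text with a trailing newline after the message) is my reading of how Git stores commit objects; extra headers such as signatures or encodings are ignored. The values come from the output above, so the printed hash can be compared with the one git log reported, as a check rather than a promise.

```python
import hashlib


def commit_id(tree, author, committer, message, parents=()):
    """Hash a commit the way the text describes: header plus readable body."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # absent for a root commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    raw = b"commit " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(raw).hexdigest()


who = "dsolovay <dsolovay@gmail.com> 1450698857 -0500"
tree = "9b8c255bff7beb1440cc726ebe3346816dc04d67"
# Assumption: the stored message ends with a single newline.
message = "These are not the droids you're looking for\n"

original = commit_id(tree, who, who, message)
print(original)

# Any change to the inputs gives a different commit hash.
print(commit_id(tree, who, who, "These ARE the droids you're looking for\n"))
print(commit_id(tree, who, who, message, parents=[original]))
```

Because a child commit embeds its parent's hash, altering an old commit would change the hash of every commit after it.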

You can see the power of this guarantee when you fire up, for example, a mongod instance. The hash of the Git commit that the running version was built from appears in the startup output. This shows (sorry) that this is the mongod you're looking for:

2015-12-21T07:12:33.466-0500 Hotfix KB2731284 or later update is not installed, will zero-out data files
2015-12-21T07:12:33.472-0500 [initandlisten] MongoDB starting : pid=8748 port=27017 dbpath=\data\db\ 64-bit host=DanSolovay-PC
2015-12-21T07:12:33.472-0500 [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2015-12-21T07:12:33.473-0500 [initandlisten] db version v2.6.11
2015-12-21T07:12:33.473-0500 [initandlisten] git version: d00c1735675c457f75a12d530bee85421f0c5548

And now let's inspect the tree that the commit referred to:

>git cat-file -p 9b8c
100644 blob 668b9c33030c59db9c0f11f777029cc3fc0fdaf1    droid1.txt
100644 blob 27c825d7d5393f79c5b14cf0dd719e3dbb391c4e    droid2.txt

Again, a hash of a little bit of data, shown here as text. The 100644 is a little bit of UNIX-ese saying this is a non-executable file, "blob" means file (the other objects Git stores are commits, trees, and tags), and then we have the hash of the contents and the name of the file. So if we renamed a file, its blob hash wouldn't change; only the name recorded in the tree would (and, with it, the hash of the tree).

It's important to note that the name or location of the directory itself is not stored in these contents, so if we were to move the directory to a new location (say to test\resources\droids), the hash of the tree object would still be 9b8c255bff7beb1440cc726ebe3346816dc04d67. In fact, if you do these steps locally, your tree should also be named 9b8c255bff7beb1440cc726ebe3346816dc04d67. Any directory, on any computer, in any universe, at any time in history, that has two files named droid1.txt and droid2.txt with the contents "R2-D2" and "C-3PO" (byte for byte, and with the same file mode) will have the hash 9b8c255bff7beb1440cc726ebe3346816dc04d67. Guaranteed.
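For the curious, a tree hash can be sketched in Python along the same lines. The name tree_id is invented here, and the sketch only covers a flat directory of regular files: as far as I know, each entry is stored as the mode, a space, the file name, a NUL byte, and the 20 raw bytes of the blob hash, with entries ordered by name (subdirectories follow a slightly different ordering rule that is skipped here). The blob hashes are the ones listed above, so the result can be compared with the tree hash in this post.

```python
import hashlib


def tree_id(entries):
    """entries: iterable of (mode, name, blob_hash_hex) for regular files only."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += (mode.encode("ascii") + b" " + name.encode("utf-8")
                 + b"\0" + bytes.fromhex(hex_id))
    raw = b"tree " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(raw).hexdigest()


droids = [
    ("100644", "droid1.txt", "668b9c33030c59db9c0f11f777029cc3fc0fdaf1"),
    ("100644", "droid2.txt", "27c825d7d5393f79c5b14cf0dd719e3dbb391c4e"),
]
print(tree_id(droids))  # compare with the tree hash shown by cat-file

# Renaming a file keeps its blob hash but changes the tree hash.
renamed = [droids[0], ("100644", "protocol-droid.txt", droids[1][2])]
print(tree_id(renamed))
```

Notice that the function takes no directory path at all, which is exactly why moving the directory leaves the hash unchanged.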

If you found this exploration entertaining, I highly recommend Chapter 10 of Pro Git, which builds a commit by hand. Just a note that there is a Ruby script at the end of section 2 that does not format correctly on the website, so if you want to walk through that, please go to the PDF version. And if you would like a deeper dive into the .git directory, this blog post is very useful.
