Alexander Brett

Tools for effective branching structures in git

28 August 2015

Creating a good git branching structure is a difficult process. There are many considerations to be juggled, including:

  • Is this easily understandable to developers and PMs, including those who may not have prior experience with git?
  • Is it easy to trace a single change to the branch, developer and ticket for which it was made?
  • Is it possible to roll back changes which introduce issues?
  • Will this scale out to several large teams of developers, and does it need to?

In addition, when working with specific systems, for instance Salesforce.com or CPAN, the sandboxing and release processes suitable for those systems introduce additional requirements around the branching structure.

In fact, the principal trade-off to be made is that a branch model which produces a very clear and traceable result will, in general, require a higher level of fluency with git for all participants.

This article is an exploration of different techniques that can be used to build the branching structure your organisation needs; it’s important to note that there is no one true branching structure, and that anybody who says that there is is wrong!

NB: If you feel that something in this article needs improvement, please feel free to open a pull request

The trivial structure

The simplest possible branching model has one developer working on one feature at a time. When the feature is complete, you tag a release, and continue working on the same branch. This looks like this:

   v0          v1
    |           |
o---o---o---o---o---o

This works well for personal projects, but obviously falls down as soon as you need to switch the priorities on features, fix a bug in an existing feature before resuming work on the in-progress one, or collaborate with anybody else. Nonetheless, it’s important to realise that git branching structures don’t, in fact, have to include multiple branches.

Branching

In order to work on multiple features at once, or get a bugfix done quickly, you can start to use multiple branches, merging changes as appropriate. This works as follows:

  1. You’re working on a feature:

          v0    myFeature
           |       |
       o---o---o---o
    
  2. You need to fix a bug, so you create a new branch starting at the last release

           v0   myFeature
           |       | myBugfix
       o---o---o---o   |
            \----------o
    
  3. When you’ve finished the bug, you release that branch

           v0   myFeature
           |       | v0.1
       o---o---o---o   |
            \----------o
    
  4. You merge the new release into your own branch

           v0              myFeature
           |         v0.1  |
       o---o---o---o---|---*
            \----------o--/
                           ^merge commit
    
  5. When you finish your feature, you release your branch

           v0                 v1
           |          v0.1     |
       o---o---o---o---|---*---o
            \----------o--/
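
The five steps above can be reproduced in a throwaway repository. This is only a sketch: the temporary directory, the `commit` helper and the commit names are illustrative assumptions, while the branch and tag names follow the diagrams.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/myFeature        # start on the feature branch
git config user.name "Example Dev"; git config user.email "dev@example.com"
# hypothetical helper: one new file per commit, so merges never conflict
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base; commit release-0; git tag v0         # the last release
commit feature-1; commit feature-2                # step 1: feature work
git checkout -q -b myBugfix v0                    # step 2: branch from the last release
commit bugfix
git tag v0.1                                      # step 3: release the bugfix
git checkout -q myFeature
git merge -q --no-edit v0.1                       # step 4: merge the release into the feature
commit feature-3
git tag v1                                        # step 5: release the feature
git log --graph --oneline --decorate --all
```

The final `git log --graph` prints a text version of the last diagram, which is a handy way to check that the history has the shape you intended.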
    

Master

Using tags for releases works really well from a release-management and version history point of view, but it can get a bit fiddly as a developer - you have to constantly check which tag is the most recent, and ensuring that you’re branching from and merging the right commits can get a little tedious. If you’re the only one working on the project, it’s probably not going to get too complicated, because you may well have only a couple of branches at once, and since you create each release yourself you’ll be in a better position to remember what’s going on. However, once you have more than one person able to make releases, or you get several branches, you’ll want to handle this potential complexity.

At this point, having a master branch is really useful. Whenever you release, you ensure that master points at that commit. In that way, each time you switch branch, making sure it’s up-to-date with the latest release is simply a matter of merging in master. When you’re dealing with a master branch, your branching diagrams look a little different:

  1. You’re working on a feature

           master
         v0 |  
          \ /     myFeature
       o---o        |
            \---o---o  
    
  2. You need to fix a bug, so you create a new branch starting at the last release

           master
         v0 |  
          \ /     myFeature
       o---o         |  myBugfix
            \        |   |
             \---o---o   |
              \----------o
    
  3. When you’ve finished the bug, you release that branch by merging into master

                          master
          v0                | v0.1
           |   myFeature    |/
       o---o---------|------*          
            \        |     /^merge commit
             \---o---o    /
              \----------o
    
  4. You merge the new release into your own branch

                        master
          v0              | v0.1
           |              |/
       o---o--------------*  myFeature
            \            / \ /
             \---o---o--/---*
              \--------o
    
  5. When you finish your feature, you release your branch.

          v0            v0.1 v1 master
           |              |   | /
       o---o--------------*---*
            \            / \ /
             \---o---o--/---*
              \--------o
    

Fast-Forward

The downside of introducing the master branch like this is that we’ve introduced two extra merge commits compared to the previous version - and in fact, half of the commits since v0 are merge commits! This does serious damage to our ability to see quickly and easily what changes have been introduced and when. Fortunately, we don’t always need to do a merge - git has an ability to fast-forward, which means that, when there is nothing to merge, the branch is moved to point to a different commit, without any new commit being added.

To be more specific, a fast-forward occurs when one of the commits to be merged is the ancestor of the other, which you can see happening at v0.1 and v1 above.
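
As a rough illustration of that ancestor rule, the sketch below checks whether a fast-forward is possible and then performs one. The temporary repository and the `commit` helper are assumptions made for the example; `--ff-only` makes git refuse rather than create a merge commit if the branches have diverged.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"; git config user.email "dev@example.com"
# hypothetical helper: one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit release-0; git tag v0
git checkout -q -b myBugfix
commit bugfix

# master still points at v0, which is an ancestor of myBugfix
git merge-base --is-ancestor master myBugfix && echo "master can be fast-forwarded"

git checkout -q master
git merge -q --ff-only myBugfix     # moves the master pointer; no new commit is created
git tag v0.1
git log --graph --oneline --decorate --all
```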

If we allow fast-forward commits, we end up with much more attractive diagrams for steps 3 onwards:

  3. When you’ve finished the bug, you release that branch (NB we fast-forwarded master!)

           v0      master v0.1
           |  myFeature | /
       o---o--------|---o
            \       |
             \--o---o

  4. You merge the new release into your own branch

           v0   master  v0.1
           |         \ /
       o---o----------o  myFeature
            \          \/
             \--o---o---*

  5. When you finish your feature, you release your branch (NB another fast-forward onto master!)

           v0       v0.1 v1  master
           |          |   | /
       o---o----------o---* 
            \            /
             \--o---o---/
    

It’s important to realise that this set of diagrams is identical to the original, with a new branch added and some lines in different places - master is simply ‘a branch which will always point to the last release’. In fact, if you are using master in this way, you could choose only ever to fast-forward commits onto it.

Rebase

I think that merge commits are noise. When you have a branch-based workflow, you’re working on a few features simultaneously, and you release regularly, a third or more of your commits may end up being merge commits, which can turn git log into an effectively unreadable mess. Fortunately, in rebase we have a tool that lets us rearrange our commit history in an extremely readable and pleasant manner. It works exactly the same as above up to step 4, at which point, instead of merging, we rebase: this takes all of the commits we made on our branch and applies them on top of the target, so it’s as though we had just checked out the latest release and instantaneously developed on top of it. This leaves the history looking like this:

             master
          v0  | v0.1
          |   \ /
      o---o----o     myFeature
                \       |
                 \--o---o

which in turn means that when we release myFeature, we get this:

          v0  v0.1   v1 master
          |    |      \ /
      o---o----o---o---o

…which is extremely easy-to-follow.
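
A minimal sketch of that rebase-then-release sequence follows, assuming a throwaway repository and a hypothetical `commit` helper; the branch and tag names match the diagrams.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"; git config user.email "dev@example.com"
# hypothetical helper: one new file per commit, so the rebase never conflicts
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit release-0; git tag v0
git checkout -q -b myFeature
commit feature-1; commit feature-2
git checkout -q -b myBugfix v0
commit bugfix
git checkout -q master
git merge -q --ff-only myBugfix; git tag v0.1     # release the bugfix

git checkout -q myFeature
git rebase -q master                              # replay the feature commits on top of v0.1
git checkout -q master
git merge -q --ff-only myFeature; git tag v1      # the release is a fast-forward again
git log --graph --oneline --decorate --all
```

Because the rebased commits are new commits with new hashes, this is only safe while myFeature has not been shared with anyone else.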

This is the workflow that I use on my Perl modules. The habitual use of rebase during development is not without controversy, however; rebasing accurately and effectively whilst avoiding messing up your own and other people’s work requires discipline and experience. You have to ensure that you don’t rebase a branch which you’ve pushed to a shared git server, and that when you do rebase you are aware of potential conflicts and the ways to resolve them - because a mistake is less obvious after the fact than when you do a merge. It was this article which got me thinking about the ways that rebase is in fact a brilliant tool to have up your sleeve, and I do think that on projects with a high enough level of expertise, it should be used.

Another caveat to add at this stage is that rebase is, like all tools, not always appropriate. If your branch is more than a few commits divergent, or if the rate of change is so fast that you’re trying to rebase dozens if not hundreds of commits at a time, you may well find that it’s more trouble than it’s worth; git merges exist for an excellent reason. I think that choosing the best way to incorporate change is largely a matter of doing it several times and getting an intuition for it.

Rebasing onto a shared branch

Let’s say you and another developer are both working on some feature, and you’ve got a branch called myFeature. You actually have at least 5 branches in at least 3 locations:

  • On the server, you have myFeature
  • On your computer, you have origin/myFeature and myFeature
  • On their computer, they have origin/myFeature and myFeature

To start with, all of the branches look the same. However, once you’ve each done a little work, it can easily look a bit like this:

server/myFeature   a---b
                    \   \
theirs/myFeature     \   d
                      \
mine/myFeature         c---e

Now, when I pull from and push to the server, then make another commit, this happens:

server/myFeature   a---b-------*
                    \   \     / \
theirs/myFeature     \   d   /   \
                      \     /     \
mine/myFeature         c---e       f

And they do the same, which looks like this:

server/myFeature   a---b-------*-----*
                    \   \     / \   / \
theirs/myFeature     \   d---/---\-/   g
                      \     /     \
mine/myFeature         c---e       f

This rapidly becomes messy and has unnecessary merge commits, not to mention being hard to follow. However, what would have happened had we fetched and rebased instead of pulling is the following much neater result:

server/myFeature   a---b---c'--e'--d'
                                \   \
theirs/myFeature                 \   g
                                  \
mine/myFeature                     f

Essentially, a competent developer using git should almost always rebase when the commits to be pushed are not yet on a server.
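
The fetch-and-rebase habit can be sketched with a local bare repository standing in for the server. Everything here is an illustrative assumption (the directory layout, the `commit` and `identify` helpers, and the order in which the two developers push); it shows one possible sequence, not the only one.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# hypothetical helpers: one new file per commit, and a per-clone identity
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
identify() { git config user.name "$1"; git config user.email "$1@example.com"; }

git init -q --bare server.git                     # stands in for the shared server
git init -q mine
cd "$work/mine"
git symbolic-ref HEAD refs/heads/myFeature
identify me
commit a; commit b
git remote add origin "$work/server.git"
git push -q -u origin myFeature

git clone -q -b myFeature "$work/server.git" "$work/theirs"
cd "$work/theirs"; identify them
commit d
git push -q origin myFeature                      # they publish d first

cd "$work/mine"
commit c; commit e                                # my local, unpushed work
git fetch -q origin
git rebase -q origin/myFeature                    # replay c and e on top of d
git push -q origin myFeature                      # a plain fast-forward push
git log --graph --oneline origin/myFeature
```

`git pull --rebase` combines the fetch and rebase steps into a single command.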

Pull requests

One good use for merges is that they allow peer-review and attribution of changes. This leads to the idea of a ‘pull request’ - some contributor sends a message saying

Please pull[1] my branch into your repository

At this point, every git tool out there will show you exactly what has changed and why, enabling you to have confidence in the features they’ve developed, and it also makes it easy to appreciate their contributions. Pull requests are a crucial tool for collaboration on projects where there is anything other than a small and tightly-knit team.

When you have a pull request based workflow, your master branch will look something like this:

master     ---*---*---*---
feature A  o-/   /   /
feature B  ---o-/   /
feature C  -o----o-/

This means that every commit on master is a merge commit, and they will probably look something like ‘Merge pull request #4 from my-super-special-feature to master’. This does mean that it’s often harder to find the specific commit which introduced a change.
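
Hosted tools perform the merge on the server, but the resulting history can be approximated locally. In this sketch, which assumes a throwaway repository and a hypothetical `commit` helper, `--no-ff` forces a merge commit even where a fast-forward would be possible, and two `git log` views show master either per pull request or per underlying commit.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"; git config user.email "dev@example.com"
# hypothetical helper: one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base
git checkout -q -b my-super-special-feature
commit change-1; commit change-2
git checkout -q master
git merge -q --no-ff -m "Merge pull request #4 from my-super-special-feature" \
    my-super-special-feature

git log --oneline --first-parent master   # one line per merged pull request
git log --oneline --no-merges master      # the individual commits behind them
```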

Develop

At some point you’ll be working on a system where you can’t simply release several times a week, and releases need to be gathered, tested, signed off, and deployed. Some might argue that this is a pathology, but it’s also a fact of life. In this situation, you may well add in a branch for work that’s done, but not yet released. Depending on your background, you may want to call this by one of several names, including stable and trunk - in git-based workflows, it’s conventionally called develop.

This looks like this:

  1. You start with master and develop

     master v0 develop
          \ | /
            o
    
  2. You do some work on the develop branch using one or more of the above principles:

    v0  master
      \ /      
       o      develop
        \        |
         o---o---o
    
  3. You’re ready to release, so you fast-forward master onto develop and tag a release

     v0   master v1 develop
     |         \ | /
     o---o---o---o
    
  4. Rinse & repeat

Git-Flow

Git-Flow is essentially: having a master branch, a develop branch, and additional feature branches, without using fast-forward or rebase. It ends up looking a bit like this:

master   --o----------------------*---
            \                    / \
develop      \----------*-------*---*
              \        / \     /  
feature1       \--o---o   \   /  
                \          \ /  
feature2         o----o--o--*

It has the advantage of being able to accommodate reasonably-sized teams of relatively-low expertise, but it also has a fair number of disadvantages - which have been discussed at length everywhere.

Beyond Git-Flow

Git-Flow starts breaking down once you hit a large number of simultaneous teams; once you hit about 50 feature branches, you spend so much time merging down from develop and there are so many merge commits, that you lose a lot of the benefits of using git to begin with. At this point, it’s much easier to set up a branch per epic[2] and have the team working on that treat it as a master branch - once the epic is ready for release, that’s then released as normal. What this means is you have:

master   o---------------*---*
          \             /   /
epic 1     \-#Black Box#   /
            \             /
epic 2       \-#Black Box#

So, depending on those teams’ structures, they may be using anything from an extremely trivial workflow up to a full on mini-Git-Flow. It’s at this point that your branching structure starts looking a bit like a fractal.

Forks

If you’re going to treat each team’s work on your product as a separate black box waiting to be pull requested back into the develop or master branch, you may as well get them to work in separate forks - this prevents you from getting a gradual buildup of 300 stale branches where nobody’s quite sure who’s working on what.

Using forks can also unlock some useful functionality in whatever git server you’re using; Atlassian’s Stash has a ‘fork syncing’ feature which allows you to automatically apply any commit which is applied to a branch in a parent repository to all the child forks. It allows each team to set fine-grained permissions and have administrative access, isolates critical infrastructure, and makes setting up continuous integration easier (you just clone the CI environment and point it at a different URL, rather than having to reconfigure all the branches).

Per-environment branches

Depending on the way you have your continuous integration environments set up, you may want to use a branch to represent test and staging environments. However, you probably won’t ever want to merge these branches into anywhere else - tickets that are in for testing are explicitly untested, and tickets undergoing UAT are not UAT’d. One successful approach is:

  1. A ticket is moved to ‘development complete’
  2. A pull request is automatically opened to the relevant test environment
  3. A build plan detects the pull request and attempts to build and deploy the pull request
  4. If the build and deployment are successful, the pull request is automatically merged

Travis CI has a great feature where it automatically detects pull requests and builds them; Atlassian Bamboo has a feature where it can automatically merge branches if a build passes, and they are both good examples of how using even simple git features can save you a lot of work.

  [1] When you remember that pull means fetch then merge, this is a very clear and specific request.

  [2] Or whatever you want to call a related group of features.

Tags: Git

Handling repository rewrites with git

16 January 2015

Let’s say you’ve decided that you need to make a change which touches almost every line in your git repository, for instance if you’ve realised you’ve got your line endings all messed up and want to make them uniform. If you’re the only developer, or you can close every branch so that you can make your change on one branch only, you’re ok. However, if you have dozens of developers working on dozens of branches, you’ll hit a problem: once you’ve applied your change, the next time you attempt to merge anything, every single line will come up as a merge conflict in git.

Let’s say you have branches A and B, and that each branch has some changes to a file called foo.txt, which has CRLF line endings, and you introduce a commit on each branch which changes them to LF. Git sees this as a change on every line in each branch, which means that there is nothing in common with the base commit of those branches, so it simply has nothing to go on when merging.

The good news is that by a little bit of git trickery, you can avoid this situation altogether. In my organisation, we have some branches organised as follows: master is our currently-released branch. develop is all features which are done, and is branched from master. qa is a testing branch and all features are merged into it. Each feature has a branch which is created from develop. I hope this diagram makes the situation clear!


master   --*--------------------------------------*---
            \                                    / \
develop      *-*-*-*-------------*--------------*---*-
                \ \ \           / \            /
QA               \ \ *-----*---/---\-------*--/-------
                  \ \     /   /     \     /  /
feature1           \ *-*-*---*       \   /  /
                    \                 \ /  /
feature2             *----*---*--*-----*--*

This means that unless we have a hotfix underway, every branch being worked on is branched from develop, and in general we keep develop merged into each branch as much as possible. What makes it possible to roll out a huge number of changes without breaking everything is creating a point on develop which we ensure is merged into every branch, then applying our mass change on every branch with no other commits in between. This means that every branch gets a commit called something like apply mass change. Lastly, we will pretend that nothing happened.

Let’s go into a bit more detail. In this example, I was trying to compress some profiles and apply whitespace changes at the same time. I’m going to tell it as a story because it works better that way.

Preparation

I created a branch called addGitattributes. This contained only one change - the addition of a .gitattributes file detailed in my last post. Other than that, it was created from master, so I was guaranteed that it would merge into any branch just fine.

Then, I created a batch script called, for instance, doMassChange.bat. It looked like this, although yours will vary depending on what you are trying to achieve.

git reset --hard && ^
git clean -f && ^
git merge origin/addGitattributes && ^
rm .git/index && ^
git reset && ^
git add -u && ^
git commit -m "Whitespace normalisation commit" && ^
compressProfiles.bat && ^
git add -u && ^
git commit -m "Profile compression commit"

As you can see, I’ve chained each command with && which ensures that if one thing breaks, we stop and the developer has a chance to call me over so I can work out what! Lastly, I created a file called applyMassChangeCleanly.bat (these names are actually fictional to make it clear what I mean, to be honest) which looked like this:

git reset --hard && ^
git clean -f && ^
git merge MASS_CHANGE_DEVELOP_BEFORE && ^
doMassChange.bat && ^
git merge -sours MASS_CHANGE_DEVELOP_AFTER

The crucial bit is the -sours strategy (that is, -s ours) chosen on the last line. What the ours strategy does is record a merge commit which marks the other branch as merged while keeping our tree exactly as it was, so none of the other side’s changes are actually applied. This has all sorts of potential to break things, but because we know we’re going from a known state (the tag MASS_CHANGE_DEVELOP_BEFORE) and applying identical changes (doMassChange.bat), it is in fact perfect for this situation.
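
To see what the ours strategy does in isolation, here is a small sketch in a throwaway repository. The branch name `other`, the `commit` helper and the file names are illustrative assumptions and are not part of the rollout scripts.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Dev"; git config user.email "dev@example.com"
# hypothetical helper: one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base
git checkout -q -b other
commit other-change
git checkout -q master
commit our-change

git merge -q -s ours -m "Mark other as merged" other

# The merge commit has two parents, but its tree equals our pre-merge tree
git diff --quiet HEAD^1 HEAD && echo "tree unchanged by the merge"
[ ! -e other-change.txt ] && echo "the other branch's file was not brought in"
git branch --merged master          # lists 'other' as already merged
```

This is why the strategy is only safe when both sides are known to contain equivalent changes already, as with the tagged before and after states described above.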

I ensured that these two files were propagated onto every branch (you could alternatively distribute them to every developer in another way), and lastly I sent an email to my developers detailing what was going to happen on rollout day.

Rollout

On rollout day, I got to the office early and made myself a strong coffee, then did the following:

  • Merged master into develop (just to make sure)
  • Tagged develop as MASS_CHANGE_DEVELOP_BEFORE
  • Ran doMassChange.bat on develop
  • Ran doMassChange.bat on master
  • Ran git merge -sours master on develop
  • Tagged develop as MASS_CHANGE_DEVELOP_AFTER
  • Pushed everything (including tags) to the server
  • Ran applyMassChangeCleanly.bat on QA
  • Pushed QA to the server

Then I had another strong coffee.

As the developers got to the office, they did their daily pull of develop and got huge merge conflicts. Then they remembered I’d sent them an email and read it, following which they ran applyMassChangeCleanly.bat on their branches.

And we all lived happily ever after!

Tags: Git

How to handle whitespace with Salesforce.com and git

15 January 2015

A common problem when working in a git repository in a cross-platform environment is end-of-line handling, as testified by the number of Stack Overflow questions on the topic! I found that the most useful guide to getting whitespace right in a repository was GitHub’s, but that there were some additional concerns when working with Salesforce.com.

Firstly, it’s important to bear in mind that SFDC provides all of its text files with Unix-style (LF) line endings, and I think that the path of least resistance is to stick with what they provide! However, if you’re a Windows shop, your developers are probably using tools which introduce Windows-style line endings (CRLF) into the files which they touch. The problem with letting this go unchecked is that you are liable to end up with a huge number of merge conflicts which are extremely frustrating, and eventually you put --ignore-space-change or --ignore-whitespace on every git command.

The first recommendation of GitHub’s guide is to set core.autocrlf=true and call it a day. However, you must not do this! The reason is that when you retrieve from SFDC, your static resources are saved as src/staticresources/foo.resource, and git does not by default recognise that these are binary files. This means if you just set up autocrlf, git will mangle these files by deleting bytes which it thinks are CR characters and are in fact useful information, and then SFDC will stop being able to read the files.

So the correct solution is to set up a .gitattributes file in the root of your git repository which looks a lot like this:

# ensure all salesforce code is normalised to LF upon commit      
*.cls text=auto eol=lf                                            
*.xml text=auto eol=lf                                            
*.profile text=auto eol=lf                                        
*.permissionset text=auto eol=lf                                  
*.layout text=auto eol=lf                                         
*.queue text=auto eol=lf                                          
*.app text=auto eol=lf                                            
*.component text=auto eol=lf                                      
*.email text=auto eol=lf                                          
*.page text=auto eol=lf                                           
*.object text=auto eol=lf                                         
*.report text=auto eol=lf                                         
*.site text=auto eol=lf                                           
*.tab text=auto eol=lf                                            
*.trigger text=auto eol=lf                                        
*.weblink text=auto eol=lf                                        
*.workflow text=auto eol=lf                                       
*.reportType text=auto eol=lf                                     
*.homePageLayout text=auto eol=lf                                 
*.homePageComponent text=auto eol=lf                              
*.labels text=auto eol=lf                                         
*.group text=auto eol=lf                                          
*.quickAction text=auto eol=lf                                    
*.js text=auto eol=lf                                             
*.py text=auto eol=lf                                             
*.pl text=auto eol=lf                                             
*.csv text=auto eol=lf                                            

… which is to say, every metadata type which you know is going to be text gets an entry, but those types which might be binary get no entry (or you can mark them explicitly, for example with *.resource binary). This probably isn’t quite comprehensive depending on your setup, because inside documents/*/ you can end up with arbitrary file extensions - however, normally the files you have there have ‘normal’ filenames, such as downArrow.png or footer.html, which git has a chance of being able to recognise as binary or not.
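
One way to check that the attributes apply as intended is git check-attr, which reports the attributes git would use for a given path. The sketch below uses a two-line .gitattributes in a throwaway repository; the paths are hypothetical, and the `*.resource binary` line is an assumption based on the foo.resource filename mentioned earlier.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
cat > .gitattributes <<'EOF'
*.cls text=auto eol=lf
*.resource binary
EOF

# the paths do not need to exist; check-attr only evaluates the patterns
git check-attr text eol -- src/classes/Foo.cls
git check-attr text diff -- src/staticresources/foo.resource
```

The first command should report text as auto and eol as lf, while the second should report text and diff as unset, because binary is a built-in macro that turns off text conversion, diffing and merging.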

Once you’ve set up your .gitattributes properly, if you’re starting off a new repository you’re good to go, but if you’re having to apply these changes to a repository which you already have developers working on, you need to be quite careful about rolling them out. I’m going to write a post on that topic soon.

Tags: SFDC Git

How and why to compress your Salesforce.com profiles

15 January 2015

Why compressing your profiles is a good idea

When handling Salesforce.com metadata, especially attempting to store it in source control, it doesn’t take long to notice the following:

  • Profiles are big. In fact, they contain 3-10 lines for every Apex Class, Visualforce Page, Object, Field, App, and so on, and so forth. Before long you’ve got thousands of lines, which means…
  • It’s difficult to commit changes to a profile, because you’ve got to scroll down to line 10243 to check that that’s the change you meant.
  • It takes ages to diff your profiles because they take up many megabytes on disk.
  • Profiles are vulnerable to merge errors because git’s standard diff algorithm doesn’t respect XML structure, and good luck finding an algorithm which does and which can also handle such huge files.

I work with a large Salesforce installation with about 110 profiles and 30 permissionsets, each of which is some 25,000 lines long, and they take up 120 MB on disk. These are real problems for my organisation, and I had to come up with a solution. I realised that there’s no reason to have exactly what you retrieve from Salesforce.com stored on disk. You can apply retrieve-time transformations to your code so long as:

  • Whatever you store is still deployable.
  • The tools used to retrieve your metadata are uniform across your organisation.

I write developer tools for my colleagues, so I am in a position to guarantee the latter. As for the former, all you have to do is remove a lot of line breaks. The idea is to transform this:

    <applicationVisibilities>
        <application>Order_Management</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>SendGrid</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>Territory_Management</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>standard__AppLauncher</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>standard__Chatter</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>standard__Community</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>
    <applicationVisibilities>
        <application>standard__Content</application>
        <default>false</default>
        <visible>true</visible>
    </applicationVisibilities>

into this:

<applicationVisibilities><application>Order_Management</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>SendGrid</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>Territory_Management</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>standard__AppLauncher</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>standard__Chatter</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>standard__Community</application><default>false</default><visible>true</visible></applicationVisibilities>
<applicationVisibilities><application>standard__Content</application><default>false</default><visible>true</visible></applicationVisibilities>

The key idea is that each metadata component, whether an application, a custom field, a visualforce page or anything else, gets precisely one line in the resulting document, which means:

  • Any addition, deletion or modification of a component changes exactly one line
  • The addition or removal of lines is guaranteed to result in well-formed XML which is deployable.
  • Merges are much, much easier to perform.
  • Since git diff works line-by-line and we’re reducing the file from 25,000 to 2,500 lines, we gain a huge increase in efficiency when working with git.
  • We get back about 500 KB of disk space per file.

For really tiny Salesforce instances, this might be overkill, but you can see that once you get big enough, this makes a real impact.

How to do this compression

I produced an extremely simple Perl script to carry out this compression. Why Perl?

  • Unmatched string processing ability
  • Perl 5.8.8 comes bundled with a git installation on Windows

Save this file as profileCompress.pl:

BEGIN { $\ = undef; }     # no output record separator: newlines are managed below
s/\r//g;                  # remove all CR characters
s/\t/    /g;              # replace all tabs with 4 spaces
if (/^\s/) {              # only indented lines; leaves the unindented XML root node alone
  s/\n//;                 # remove the trailing newline
  s/^    (?=<(?!\/))/\n/; # start a new line at each top-level opening tag
  s/^(    )+//;           # trim remaining indentation
}

Then every time you do a retrieve, invoke it with perl -i.bak -p profileCompress.pl src/profiles/*.profile src/permissionsets/*.permissionset. The obvious disclaimer about backing up your code first because it might get mangled and I can’t take any responsibility for that applies!

I handle this by adding

<exec executable = "perl">
	<arg value = "-pi.bak"/>
	<arg value = "${lib.dir}/script/profileCompress.pl"/>
	<arg value = "${src.dir}/profiles/*.profile"/>
	<arg value = "${src.dir}/permissionsets/*.permissionset"/>
</exec>

to my ant script right after I retrieve, once I’ve stored all of my stuff inside the folders stored in those variables.

Tags: SFDC Git