Git & GitHub

A motivating example


You’re working on an analysis in R and you’ve got it into a state you’re pretty happy with.

We’ll call this version 1.

Cartoon graphic of a script called 'analysis.r' containing a simple linear model

The next day, you get an email from your boss: “Hey, you know what this model needs?”

Cartoon graphic of a script called 'analysis.r' containing a simple linear model with a person with a speech bubble saying 'hey, you know what this model needs?'

You’re not sure what she means but you figure there’s only one thing she could be talking about: more cowbell. So you add it to the model.

But you’re worried about losing the old model. Instead of replacing the old code, you comment it out and put a serious warning in a comment above it.

Cartoon graphic of a script called 'analysis.r' containing a more complicated linear model with a person with a speech bubble saying 'hey, you know what this model needs?' and another person with their face in their hands

Commenting out code is common, but it’s hard to understand why you did it when you come back years later or when you send your script to a colleague.

Luckily, there’s a better way: version control.

Instead of commenting out the old code, we can change the code and tell Git to commit our change. So now we have two distinct versions of our analysis and we can always see what the previous version(s) look like.

We can also describe the change in the commit message. Git also records who made the change, when they made it, and which lines of which files it touched.

Cartoon graphic of two versions of a script called 'analysis.r' containing a more simple and more complicated linear model respectively with the Git commit message used when the second version was committed
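The commit workflow described above can be sketched on the command line. This is a minimal illustration in a throwaway repository, not a prescribed setup: the model formulas, the user name and email, and the commit messages are all made up for the example.

```shell
set -eu

# Work in a temporary directory so no real project is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Hypothetical identity, set for this repository only.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Version 1: the simple model (formula is illustrative).
printf 'fit <- lm(y ~ x, data = dat)\n' > analysis.R
git add analysis.R
git commit -q -m "Fit a simple linear model"

# Version 2: edit the file in place instead of commenting out old code.
printf 'fit <- lm(y ~ x + cowbell, data = dat)\n' > analysis.R
git add analysis.R
git commit -q -m "Add cowbell to the model at the boss's request"

# List both versions with author, date, and message.
git --no-pager log

# The old version is still there: print analysis.R as of the previous commit.
git --no-pager show HEAD~1:analysis.R
```

The working copy only ever contains the current model, while `git log` and `git show` give access to every earlier version.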

After some time, you’ve committed a 3rd version of your analysis (v3), and a colleague has an idea…

Cartoon graphic of a person with a speech bubble saying 'what if we used machine learning?'

You’re not sure the idea will work, and this is where Git shines. Without a tool like Git, we might copy analysis.R to another file called analysis-ml.R, which might end up having mostly the same code except for a few lines. This isn’t particularly problematic until you want to change a bit of shared code and have to make the same change in two files, if you even remember to.

Instead, with Git, we can start a branch. Branches allow us to confidently experiment on our code, all while leaving the old code intact and recoverable.

Cartoon graphic of a person with a speech bubble saying 'what if we used machine learning?' over six numbered circles, the first three just with numbers, the last three with a preceding lowercase 'b' and the number with a text label saying 'branch 'maybethiswontwork'
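Starting a branch for the experiment might look like the sketch below. It reuses the branch name from the figure; the file contents, commit messages, identity, and the choice of `main` as the name of the main line are assumptions made for the example.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
# Name the initial branch "main" (an assumption; your default may differ).
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Analyst"      # hypothetical identity
git config user.email "analyst@example.com"

# Three commits on the main line, standing in for v1 to v3.
for v in 1 2 3; do
  printf '# analysis, version %s\n' "$v" > analysis.R
  git add analysis.R
  git commit -q -m "Version $v of the analysis"
done

# Create the experimental branch at v3 and switch to it.
git checkout -q -b maybethiswontwork

# Commits b1 to b3 are recorded on the branch only.
for b in 1 2 3; do
  printf '# machine learning experiment, step %s\n' "$b" >> analysis.R
  git add analysis.R
  git commit -q -m "ML experiment, step b$b"
done

# Draw the history of all branches.
git --no-pager log --oneline --graph --all
```

The `main` branch still points at v3, so the old analysis is untouched no matter what happens on `maybethiswontwork`.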

You’ve been working in a branch, made a few commits, and your boss emails again asking you to update the model. Without Git, you might panic because you’ve rewritten much of your analysis to use a different method but your boss wants a change to the old method.

Cartoon graphic of a person with a speech bubble saying 'we just got some new data. update the model.' over six numbered circles, the first three just with numbers, the last three with a preceding lowercase 'b' and the number with a text label saying 'branch 'maybethiswontwork'. There is a second person with a thought bubble with a screaming face sitting at their laptop in the bottom corner

With Git and branches, we can continue developing our main analysis at the same time as we are working on any experimental branches. Branches are great for experiments but also great for organizing your work generally.

Cartoon graphic of seven numbered circles, the first four just with numbers, the last three with a preceding lowercase 'b' and the number with a text label saying 'branch 'maybethiswontwork' and '4' and 'b1' are both connected to '3' by arrows
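Working on the main analysis while an experiment is in progress comes down to switching branches before committing. A rough sketch, with the same caveats as any toy example: the file contents, commit messages, identity, and the `main` branch name are illustrative assumptions.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # assume the main line is called "main"
git config user.name "Example Analyst"      # hypothetical identity
git config user.email "analyst@example.com"

# v1 to v3 on the main line.
for v in 1 2 3; do
  printf '# analysis, version %s\n' "$v" > analysis.R
  git add analysis.R
  git commit -q -m "Version $v of the analysis"
done

# b1 to b3 on the experimental branch.
git checkout -q -b maybethiswontwork
for b in 1 2 3; do
  printf '# machine learning experiment, step %s\n' "$b" >> analysis.R
  git add analysis.R
  git commit -q -m "ML experiment, step b$b"
done

# The boss wants the old model updated: switch back to the main line.
# analysis.R in the working directory is restored to its v3 contents.
git checkout -q main
printf '# analysis, version 4 (refit on new data)\n' > analysis.R
git add analysis.R
git commit -q -m "Update the model with the new data"

# Both lines of development now continue from v3.
git --no-pager log --oneline --graph --all
```

Switching branches swaps the files in the working directory, so there is never a second copy of the script to keep in sync.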

After all that hard work on the machine learning experiment, you decide to scrap it. It’s perfectly fine to leave branches around and switch back to the main line of development, but we can also delete them to tidy up.

Cartoon graphic of seven numbered circles, the first four just with numbers, the last three with a preceding lowercase 'b' and the number with a text label saying 'branch 'maybethiswontwork' and '4' and 'b1' are both connected to '3' by arrows. A red 'x' covers all of the 'b' circles and a person with a speech bubble says 'nevermind, this was a bad idea...'
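Scrapping the experiment could look like the following sketch. One detail worth knowing is that `git branch -d` refuses to delete a branch whose commits have not been merged, so an abandoned experiment needs the forced form, `-D`. File contents, messages, identity, and the `main` branch name are assumptions for the example.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # assume the main line is called "main"
git config user.name "Example Analyst"      # hypothetical identity
git config user.email "analyst@example.com"

# One commit on the main line.
printf '# analysis\n' > analysis.R
git add analysis.R
git commit -q -m "Initial analysis"

# One commit on the experimental branch.
git checkout -q -b maybethiswontwork
printf '# machine learning experiment\n' >> analysis.R
git add analysis.R
git commit -q -m "Try machine learning"

# Go back to the main line; a branch cannot be deleted while checked out.
git checkout -q main

# The safe form refuses because the branch was never merged.
git branch -d maybethiswontwork 2>/dev/null || echo "not merged, so -d refuses"

# Force the deletion once we are sure the experiment is not needed.
git branch -q -D maybethiswontwork

git branch --list
```

After this, only `main` is listed and `analysis.R` contains none of the experimental code.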

If, instead, you decided you liked the machine learning experiment, you could merge the branch into your main development line. Merging branches is analogous to accepting a change in Word’s Track Changes feature, but more powerful: Git combines the changes from both lines of development and flags any conflicting edits for you to resolve.

Cartoon graphic of eight numbered circles, the first four and last one just with numbers, the penultimate three with a preceding lowercase 'b' and the number with a text label saying 'branch 'maybethiswontwork' and '4' and 'b1' are both connected to '3' by arrows. '4' and 'b3' both connect to '5' and a person with a speech bubble says 'that was a great idea!'
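A merge like the one in the figure can be sketched as follows. To keep the example free of conflicts, the experiment lives in a separate, hypothetical file called `ml.R`; the file contents, commit messages, identity, and the `main` branch name are likewise assumptions.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # assume the main line is called "main"
git config user.name "Example Analyst"      # hypothetical identity
git config user.email "analyst@example.com"

# v1 to v3 on the main line.
for v in 1 2 3; do
  printf '# analysis, version %s\n' "$v" > analysis.R
  git add analysis.R
  git commit -q -m "Version $v of the analysis"
done

# b1 to b3 on the branch, in a separate (hypothetical) file.
git checkout -q -b maybethiswontwork
for b in 1 2 3; do
  printf '# machine learning experiment, step %s\n' "$b" >> ml.R
  git add ml.R
  git commit -q -m "ML experiment, step b$b"
done

# v4 on the main line, made while the experiment was in progress.
git checkout -q main
printf '# analysis, version 4\n' > analysis.R
git add analysis.R
git commit -q -m "Update the model with the new data"

# v5: merge the experiment into the main line.
git merge -q -m "Merge the machine learning experiment" maybethiswontwork

git --no-pager log --oneline --graph
```

The merge commit has two parents, v4 and b3, which is how Git remembers that both lines of work fed into v5. If both branches had edited the same lines, Git would stop and ask for the conflict to be resolved by hand.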

Years later, your colleague asks you to make sure the model you reported in a paper you published together was actually the one you used.

Another really powerful feature of Git is tags, which allow us to record a particular state of our analysis under a meaningful name.

In this case, we tagged the version of our code we used to run the analysis for the paper. Even if we continued to develop the work after we submitted our manuscript, we can always go back and run the analysis as it was in the past.
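Tagging the version used for the paper, and retrieving it later, might look like this sketch. The tag name `paper-submission`, the model formulas, the commit messages, and the identity are hypothetical.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Analyst"      # hypothetical identity
git config user.email "analyst@example.com"

# The analysis as it was when the manuscript was submitted.
printf 'fit <- lm(y ~ x + cowbell, data = dat)\n' > analysis.R
git add analysis.R
git commit -q -m "Model reported in the paper"

# Record this exact state under a meaningful name (annotated tag).
git tag -a paper-submission -m "Analysis as submitted with the manuscript"

# Development continues after submission.
printf 'fit <- lm(y ~ x * cowbell, data = dat)\n' > analysis.R
git add analysis.R
git commit -q -m "Explore an interaction term"

# Years later: list the tags and print the script as it was at the tag.
git tag --list
git --no-pager show paper-submission:analysis.R
```

To rerun the old analysis rather than just read it, `git checkout paper-submission` puts the tagged files back in the working directory; switching back to the branch afterwards returns to the current version.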