LibGuides: Data Science: Version control & GitHub

What is version control, anyway?

Have you ever had a folder of scripts with names like "analysis_script_FINAL_REVISED_FOR_REAL-updated"?

If so, you'll be glad to learn that there's a better way to handle sequential versions of the same thing!

What is version control?

Version control means keeping a record of the changes made to a chosen set of files. A common comparison is "track changes" in a Word document, which is a simple form of version control. The idea is intuitive for something like a Word or Google doc, where you add and delete paragraphs, but it applies equally well to code. When developing a piece of code, you'll probably try out new approaches, test new functions, and so on. You want to keep what works, but you also want a record of what you tried that didn't work.

Version control software, such as Git, records the changes made to files each time you save a snapshot of your work (a "commit" in Git). It also allows you to "roll back" any changes you have made. Think of it as an unlimited "undo" button that is not limited to your current session.
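To make the "unlimited undo" idea concrete, here is a toy sketch in Python. It is not how Git stores data internally; the `ToyHistory` class and the script text are invented for illustration. It only shows the three ideas from the paragraphs above: saving snapshots, seeing what changed, and getting an earlier version back.

```python
from dataclasses import dataclass, field
from difflib import unified_diff


@dataclass
class ToyHistory:
    """Keeps every saved version of one text file in memory (toy model only)."""
    versions: list = field(default_factory=list)  # list of (message, text) pairs

    def commit(self, text: str, message: str) -> int:
        """Record a snapshot and return its version number (0-based)."""
        self.versions.append((message, text))
        return len(self.versions) - 1

    def diff(self, old: int, new: int) -> str:
        """Show what changed between two recorded versions."""
        a = self.versions[old][1].splitlines(keepends=True)
        b = self.versions[new][1].splitlines(keepends=True)
        return "".join(unified_diff(a, b, f"v{old}", f"v{new}"))

    def rollback(self, version: int) -> str:
        """Return an earlier snapshot; nothing is deleted from the history."""
        return self.versions[version][1]


history = ToyHistory()
v0 = history.commit("x = load()\nplot(x)\n", "first working script")
v1 = history.commit("x = load()\nx = clean(x)\nplot(x)\n", "try a cleaning step")

print(history.diff(v0, v1))      # lines starting with '+' were added in v1
restored = history.rollback(v0)  # the earlier text, exactly as it was saved
```

Real version control software does the same bookkeeping for whole folders of files, and far more efficiently, but the mental model of "numbered snapshots you can compare and return to" carries over.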

Why should I use version control?

For starters, you can avoid file names like "analysis_script_FINAL_REVISED_FOR_REAL-updated". With version control you keep all of your earlier work while having only one copy of each file to organize.

It's also great for collaboration. Again, think of a Google doc with many people making changes - this is much easier than emailing a Word doc with tracked changes back and forth. Version control software does the same thing, and more, for code under development. Version control systems not only keep track of changes and allow you to roll them back, but can also merge the changes that arise when multiple people edit a script at the same time. This gets around the "multiple versions of the same file made by different people" issue that can come up on a platform like Dropbox.

Version control systems also let developers create a code "branch" in which to test out changes, rather than editing the "master" script directly. This reduces the chance that a new feature breaks the original code - the new feature stays separate from the master, and any successful changes can easily be merged back in.
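As a rough sketch of the branch-then-merge idea, the Python script below drives Git through `subprocess` in a temporary folder. The file name `analysis.py`, the branch name `new-feature`, and the placeholder user identity are all assumptions made for the demo, and it only runs if Git is installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Placeholder identity so the commits work on a machine with no Git config.
IDENTITY = ["-c", "user.name=Example", "-c", "user.email=example@example.com",
            "-c", "commit.gpgsign=false"]


def git(repo: Path, *args: str) -> str:
    """Run one git command inside repo and return its trimmed output."""
    done = subprocess.run(["git", *IDENTITY, *args], cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def branch_and_merge(repo: Path) -> str:
    """Commit a script, change it on a branch, then merge the branch back."""
    script = repo / "analysis.py"                # hypothetical file name
    git(repo, "init")
    script.write_text("print('original analysis')\n")
    git(repo, "add", "analysis.py")
    git(repo, "commit", "-m", "Add analysis script")
    base = git(repo, "rev-parse", "--abbrev-ref", "HEAD")  # 'master' or 'main'

    git(repo, "checkout", "-b", "new-feature")   # create and switch to a branch
    script.write_text("print('original analysis')\nprint('new feature')\n")
    git(repo, "commit", "-am", "Try a new feature")

    git(repo, "checkout", base)                  # back on the base branch
    untouched = script.read_text()               # the branch edit is not here yet
    git(repo, "merge", "new-feature")            # bring the successful change in
    return untouched


if shutil.which("git"):                          # skip quietly when Git is absent
    with tempfile.TemporaryDirectory() as tmp:
        before_merge = branch_and_merge(Path(tmp))
        after_merge = (Path(tmp) / "analysis.py").read_text()
```

In this run, `before_merge` holds the script as it looked on the base branch while the experiment lived only on `new-feature`, and `after_merge` holds the script once the branch has been merged back.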

Working with Git and GitHub

Git is a widely used version control program that runs locally on your computer. GitHub, in contrast, is a web-based hosting service for Git repositories (i.e., groups of tracked files). Think of Git as the actual tool, and GitHub as a user-friendly website for storing, sharing, and collaborating on Git repositories. You do not need to use GitHub to use Git for version control.

You may have also heard of Git in terms of its (somewhat notorious) steep learning curve. Yes, Git can be challenging. Advanced Git can be very challenging. For the most part, however, users of Git will only need to know a few key commands. If you need additional functionality, there are many resources available for learning more, some of which are included in this guide.
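For a sense of what "a few key commands" looks like in practice, here is a minimal sketch that runs `git init`, `git add`, `git commit`, `git status`, and `git log` from Python in a temporary folder. The file name `clean_data.py`, the commit messages, and the placeholder identity are assumptions for the demo; in everyday use you would type the same commands in a terminal.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside repo and return its trimmed output."""
    # The -c options supply a placeholder identity for this demo only.
    cmd = ["git", "-c", "user.name=Example", "-c", "user.email=example@example.com",
           "-c", "commit.gpgsign=false", *args]
    return subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True).stdout.strip()


def everyday_cycle(repo: Path) -> list:
    """Walk through init, add, commit, status, and log on a one-file project."""
    git(repo, "init")                                  # start tracking this folder
    (repo / "clean_data.py").write_text("print('step 1')\n")
    git(repo, "add", "clean_data.py")                  # stage the new file
    git(repo, "commit", "-m", "Add cleaning script")   # record a snapshot

    (repo / "clean_data.py").write_text("print('step 1')\nprint('step 2')\n")
    changed = git(repo, "status", "--short")           # lists the modified file
    git(repo, "commit", "-am", "Add second step")      # stage tracked edits, commit

    return [changed, git(repo, "log", "--oneline")]    # newest commit listed first


if shutil.which("git"):                      # skip quietly when Git is not installed
    with tempfile.TemporaryDirectory() as tmp:
        status, log = everyday_cycle(Path(tmp))
        print(status)
        print(log)
```

The edit, stage, commit, and review cycle shown here covers most day-to-day use; the resources below go into the commands in more depth.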

Version control with Git (Software Carpentry)
Great for beginners! Yes, the example document under version control here is silly. Sub out the .txt file in the example for a .py or .R script and you'll get a sense for how you can use Git in your own workflows.
Atlassian tutorials
This series of tutorials is a good option for those who are already familiar with some level of programming and navigating the command line. If you're going to be working on code development using Git, this is a good place to start.
Git cheat sheet
A quick reference guide from GitHub for common Git commands.
Advanced Git
Covers rebase, stashing, patches, automating git pull, and more.
GitHub guides
Learn how to use GitHub from the people at GitHub. Topics include introduction to GitHub, GitHub Pages, how forks work, using issues, assigning a DOI to your code, and more.

There are some best practices to keep in mind when organizing your GitHub repository. The GitHub help for creating a repository includes information about READMEs and licensing.

A basic set-up for a GitHub repository for a research project would be:

A README (includes description and goal of your project)
Raw Data folder
Derived Data folder
Scripts folder
Figures folder

Having a set-up like this beforehand will help keep your repository organized. Keep in mind that a repository should be self-contained - one project per repository, with all file paths as relative paths.
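One way to set up this layout is with a short Python script, sketched below. The folder names (`raw_data`, `derived_data`, and so on), the project name, and the `.gitkeep` placeholder files are assumptions rather than requirements; the placeholders are a common convention because Git does not track empty folders.

```python
import tempfile
from pathlib import Path

# Folder names are assumptions; the guide only lists the kinds of folders.
FOLDERS = ["raw_data", "derived_data", "scripts", "figures"]


def scaffold(root: Path, description: str) -> None:
    """Create a README plus one folder per item in FOLDERS under root."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "README.md").write_text(f"# {root.name}\n\n{description}\n")
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(exist_ok=True)
        # Git does not track empty folders, so add a placeholder file.
        (folder / ".gitkeep").touch()


def figure_path(project_root: Path, filename: str) -> Path:
    """Build an output path from the project root, not from a home directory."""
    return project_root / "figures" / filename


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_project"          # hypothetical project name
    scaffold(project, "Goal: describe the project here.")
    created = sorted(p.name for p in project.iterdir())
    relative = figure_path(Path("."), "fig1.png")
```

Building every path from the project root, as `figure_path` does, keeps the scripts working when a collaborator clones the repository to a different location on their own computer.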

This article provides an example of how to organize a more advanced data science project in a GitHub repository. A useful exercise is to spend some time looking through data science projects on GitHub to see how people are setting up their repositories.

Additional reading: Ten Simple Rules for Taking Advantage of Git and GitHub.

Books about Git

Version Control with Git by Jon Loeliger; Matthew McCullough
ISBN: 9781449316389

Publication Date: 2012-08-27

Get up to speed on Git for tracking, branching, merging, and managing code revisions. Through a series of step-by-step tutorials, this practical guide takes you quickly from Git fundamentals to advanced techniques, and provides friendly yet rigorous advice for navigating the many functions of this open source version control system.
Git Pocket Guide by Richard E. Silverman
ISBN: 9781449325862

Publication Date: 2013-08-02

This pocket guide provides a compact, readable introduction to Git for new users, as well as a reference to common commands and procedures for those of you with Git experience. Written for Git version 1.8.2, this handy task-oriented guide is organized around the basic version control functions you need, such as making commits, fixing mistakes, merging, and searching history.
Git Recipes by Wlodzimierz Gajda
ISBN: 9781430261032

Publication Date: 2013-12-19

Whether you're relatively new to git or you need a refresher, or if you just need a quick, handy reference for common tasks in git, Git Recipes is just the reference book you need. With recipes to cover any task you can think of, including working with GitHub and git on BitBucket, Git Recipes shows you how to work with large repositories, new repositories, forks, clones, conflicts, differences, and it even gives you practical scenarios you may find yourself dealing with while using git.
Git for Teams by Emma Jane Hogbin Westby
ISBN: 9781491911181

Publication Date: 2015-09-12

This practical guide delivers a unique people-first approach to version control that also explains how using Git as a focal point can help your team work better together. You'll learn how to plan and pursue a Git workflow that not only ensures that you accomplish project goals, but also fits the immediate needs and future growth of your team. Examine popular collaboration platforms: GitHub, Bitbucket, and GitLab.
Git - Version Control for Everyone by Ravishankar Somasundaram
ISBN: 9781849517539

Publication Date: 2013-01-01
Pro Git by Scott Chacon; Ben Straub
ISBN: 9781484200773

Publication Date: 2014-11-09

Written by Git pros Scott Chacon and Ben Straub, Pro Git (Second Edition) builds on the hugely successful first edition, and is now fully updated for Git version 2.0, as well as including an indispensable chapter on GitHub. It's the best book for all your Git needs.