Data Science: Version control & GitHub

What is version control anyways?

Have you ever had a folder of scripts with names like "analysis_script_FINAL_REVISED_FOR_REAL-updated"?

If so, you'll be glad to learn that there's a better way to handle sequential versions of the same thing! 

What is version control?

The idea behind version control is to keep a record of changes made to specified files. A common comparison is "track changes" in a Word document, which is a simple form of version control. While it's fairly intuitive to apply the concept to something like a Word or Google doc, in which you may want to add and delete paragraphs, version control is equally applicable to code. When developing a piece of code, you'll probably try out new approaches, test new functions, etc. You want to keep what works, but you also want a record of what you tried that didn't work.

Version control software, such as Git, records the changes made to files each time you save a snapshot (a "commit"). It also allows you to "roll back" any changes you have made. Think of it like an unlimited "undo" button that is not limited to your current session.
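
To make the "unlimited undo" idea concrete, here is a minimal sketch in Python. It is only a toy: the class name `ToyHistory` and its methods are invented for illustration, and this is not how Git stores data internally. It simply keeps every saved snapshot of a file's text so that any earlier version can be retrieved later.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHistory:
    """A toy, in-memory history for one file's text (hypothetical, not Git's design)."""

    snapshots: list = field(default_factory=list)  # list of (message, content) tuples

    def commit(self, message: str, content: str) -> int:
        """Record a snapshot and return its version number (starting at 0)."""
        self.snapshots.append((message, content))
        return len(self.snapshots) - 1

    def log(self) -> list:
        """Return the messages of all snapshots, oldest first."""
        return [message for message, _ in self.snapshots]

    def checkout(self, version: int) -> str:
        """Return the content as it was at `version`; later snapshots are kept."""
        return self.snapshots[version][1]


history = ToyHistory()
v0 = history.commit("first draft", "x = load()\n")
v1 = history.commit("try new approach", "x = load_fast()\n")

# "Roll back": retrieve the first version without losing the record of the second.
restored = history.checkout(v0)
```

Note that rolling back here does not erase the later attempt. Both versions stay in the history, which matches the goal of keeping what works while still having a record of what was tried.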

Why should I use version control?

For starters, you can avoid file names like "analysis_script_FINAL_REVISED_FOR_REAL-updated". Using version control means you don't lose any of your earlier work, but you can simplify how you organize your files.

It's also great for collaboration. Again, think of a Google doc with many people making changes - this is an easier situation than emailing a track changes Word doc back and forth. Version control software helps do this, and more, with code under development. Version control systems not only keep track of changes and allow you to roll back changes, but can also merge changes that arise when multiple people are making changes to a script at the same time. This gets around the "multiple versions of the same file made by different people" issue that can come up in a platform like Dropbox. 

Version control systems also let developers create a code "branch" in which to test out changes, rather than editing the "master" script directly. This reduces the chance that a new feature breaks the original code: the new feature is kept separate from the master, and any successful changes can easily be merged back in.

Working with Git and GitHub

Git is a widely used version control program, which you can run locally on your computer. GitHub, in contrast, is a web-based hosting service for Git repositories (i.e., groups of tracked files). Think of Git as the actual tool, and GitHub as a user-friendly web interface for storing, sharing, and working with Git repositories. You do not need to use GitHub to use Git for version control.

You may have also heard of Git in terms of its (somewhat notorious) steep learning curve. Yes, Git can be challenging. Advanced Git can be very challenging. For the most part, however, users of Git will only need to know a few key commands. If you need additional functionality, there are many resources available for learning more, some of which are included in this guide.
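
As a rough illustration of what "a few key commands" might look like, the sketch below drives `git init`, `git add`, `git commit`, and `git log` from Python in a throwaway folder. It assumes Git is installed and on your PATH. The helper names `git` and `demo_basic_workflow`, the file name `analysis.py`, and the demo identity are all made up for this example. In everyday use you would type the same commands directly in a terminal.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its standard output."""
    # The -c options set a throwaway identity for this demo only (placeholder values).
    cmd = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout


def demo_basic_workflow() -> list:
    """Create a scratch repository, commit twice, and return the commit messages."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        script = repo / "analysis.py"  # hypothetical script name

        git(repo, "init")                                # start tracking this folder
        script.write_text("print('first version')\n")
        git(repo, "add", "analysis.py")                  # stage the new file
        git(repo, "commit", "-m", "Add analysis script")

        script.write_text("print('second version')\n")
        git(repo, "add", "analysis.py")                  # stage the edit
        git(repo, "commit", "-m", "Update analysis script")

        # One line per commit, newest first.
        return git(repo, "log", "--format=%s").splitlines()


if shutil.which("git"):  # only run the demo when Git is available
    print(demo_basic_workflow())
```

The cycle of editing, staging with `add`, and saving with `commit` covers much of day-to-day use, with `log` showing the history that has been recorded so far.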

There are some best practices to keep in mind when organizing your GitHub repository. The GitHub help for creating a repository includes information about READMEs and licensing.

A basic set-up for a GitHub repository for a research project would be:

Having a set-up like this beforehand will help keep your repository organized. Keep in mind that a repository should be self-contained - one project per repository, with all file paths as relative paths.
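
The point about relative paths can be sketched in Python as follows. This is one possible approach, not a required convention: the helper `project_path`, the `data` folder, and the file `measurements.csv` are hypothetical names chosen for the example.

```python
from pathlib import Path


def project_path(project_root: Path, *parts: str) -> Path:
    """Build a path inside the repository from pieces relative to its root."""
    relative = Path(*parts)
    if relative.is_absolute():
        # An absolute path such as /Users/me/... would break on anyone else's machine.
        raise ValueError("use a path relative to the repository root")
    return project_root / relative


# In a real script, the root could be derived from the script's own location,
# e.g. Path(__file__).resolve().parent; "." is used here so the sketch runs anywhere.
root = Path(".")
data_file = project_path(root, "data", "measurements.csv")  # hypothetical file
```

Because every path is built from the repository root, a collaborator who clones the repository to a different location can run the same script without editing any paths.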

This article provides an example of how to organize a more advanced data science project in a GitHub repository. A useful exercise is to spend some time looking through data science projects on GitHub to see how other people set up their repositories.

Additional reading: Ten Simple Rules for Taking Advantage of Git and GitHub.

Books about Git