Purging Data From Source Control

Source control repositories are an important part of software development. They ensure that data can always be recovered and they also enable continuous integration. Azure DevOps includes features common to modern software development such as source control repositories, continuous integration pipelines, agile planning tools, and more. This course covers several topics which fall under the umbrella of configuring repositories for Azure DevOps.

Learning Objectives

  • Integrating GitHub repositories with Azure DevOps Pipelines
  • Configuring permissions in source control repositories
  • Configuring tags to organize source control repositories 
  • Recovering data using Git commands
  • Purging data from source control

Intended Audience

  • Software engineers
  • DevOps engineers
  • Site reliability engineers

Prerequisites

  • Be comfortable using Git
  • Be familiar with Azure DevOps

Hello and welcome. In this content, we'll explore how to remove data from Git repositories. Version control systems are designed to ensure tracked files can always be recovered. Deleted files remain a part of the commit history and can be checked out as needed. Occasionally, files will need to be removed from the version control history. Some examples include removing large files, removing binary files, and removing files containing sensitive information such as access tokens and credentials. Sooner or later, most Git users will commit something that needs to be removed. Git provides multiple mechanisms for rewriting the commit history, depending on the situation.

I'm going to cover two scenarios. The first is removing a recent, local, single-change commit. The second is removing files that have existed across shared historical commits. I'm using a web IDE for this demonstration.

On to the first scenario. Removing files from a local commit is perhaps the simplest option. Once commits are shared with other repositories, cleanup becomes more involved; that case is covered in the second scenario. I'm going to start by generating a 10-megabyte file. Next, I'm going to add and commit this 10-megabyte file using the git add and git commit commands. This file is now part of the commit history for the local repository. Removing a local commit is accomplished the same way for both single and multiple commits.
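The exact command used in the demonstration is not captured in this transcript, so the following is only a minimal sketch of one way to set up the scenario. The repository name `demo-repo`, the file name `large.bin`, and the commit messages are assumptions made for illustration, and everything runs in a throwaway directory.

```shell
#!/bin/sh
# Sketch only: demo-repo, large.bin, and the commit messages are hypothetical.
set -eu

# Work in a throwaway repository so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo-repo
cd demo-repo
git config user.email "demo@example.com"
git config user.name "Demo User"

# An initial commit stands in for the repository's existing history.
echo "# Demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Generate a 10-megabyte file of zero bytes (10 blocks of 1 MiB each).
dd if=/dev/zero of=large.bin bs=1048576 count=10 2>/dev/null

# Stage and commit the file; it is now part of the local commit history.
git add large.bin
git commit -q -m "Add large file"

# List the commits on the current branch.
git log --oneline
```

Any method of producing a large file would work here; `dd` reading from `/dev/zero` is just a common, portable choice.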

I'm going to demonstrate how to remove a specific local commit out of multiple commits. For this, I'll need to make one more change. I'm going to change the readme file and commit this change. I now have two changes. Running the git log command displays the commits for the current branch.

I'm going to remove the commit containing the large file and keep the other commit. I'll use the git rebase command to rewrite the commit history. Rebasing is the process of replaying one or more commits on top of another commit. The git rebase command replaces existing commits with newly created commits. A couple of warnings regarding the implications of rebasing. First, rebasing can result in data loss if not used carefully. Second, rebasing should only be performed on unshared commits. Running the git log --oneline command displays the commits for the current branch. I'm going to rebase on top of the original branch head. I'll copy the commit hash for use in a moment.

I'm going to use the git rebase command with the -i flag to make this process interactive. The command for this is git rebase -i followed by the commit hash. The interactive mode opens the text editor configured for Git. In this example, the nano text editor is used. The editor displays the commits that have occurred since the specified commit. Interactive rebase supports several commands that control how the commit history is rewritten. The drop command removes a commit and the pick command keeps a commit. I'm going to drop the commit containing the large file and keep the change to the readme file. I'll close out of nano using Ctrl+X and save the file using the text editor defaults. The git log command shows the newly created commit history. Notice the commit that added the large file has been removed.
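The rebase described above can be sketched as a script. This is an illustration under stated assumptions, not the exact session from the demonstration: `demo-repo`, `large.bin`, and the commit messages are hypothetical. So that the sketch runs without opening an editor, it sets Git's `GIT_SEQUENCE_EDITOR` environment variable to a `sed` command that changes `pick` to `drop` on the large-file line of the todo list, which is the same edit made by hand in nano.

```shell
#!/bin/sh
# Sketch only: demo-repo, large.bin, and the commit messages are hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
git init -q demo-repo
cd demo-repo
git config user.email "demo@example.com"
git config user.name "Demo User"

# Existing history; this commit plays the role of the original branch head.
echo "# Demo" > README.md
git add README.md
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

# The commit to remove.
dd if=/dev/zero of=large.bin bs=1048576 count=10 2>/dev/null
git add large.bin
git commit -q -m "Add large file"

# The commit to keep.
echo "More notes" >> README.md
git commit -q -am "Update README"

# History before the rewrite.
git log --oneline

# Rebase on top of the original branch head. Rather than editing the todo
# list by hand, sed rewrites "pick" to "drop" for the large-file commit.
GIT_SEQUENCE_EDITOR="sed -i -e '/Add large file/s/^pick/drop/'" \
  git rebase -i "$base"

# History after the rewrite: the large-file commit is gone.
git log --oneline
```

When working interactively, running `git rebase -i` followed by the commit hash and editing the todo list in the configured editor achieves the same result. The kept commit receives a new hash because rebasing replaces it with a newly created commit.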

The large file has now been removed from the working tree and the repository. Now the valid change can be pushed to the remote repository without the large file. I'm going to push this to demonstrate and then I'll switch over to the Azure DevOps repository web interface. Looking at the history, you can see only the valid commit has been pushed. Okay, to summarize the first scenario, the git rebase command can be used to remove local-only commits. The commit history can be rewritten to drop or pick one or more commits. This command should not be used for shared commits unless absolutely necessary.

The second scenario is removing files that have existed across shared historical commits.

I'm going to configure the scenario by committing another large file followed by some other commits. Next, I'm going to push this to the remote repository so that these are now shared commits. In this scenario, files have been committed and shared. This makes rebasing impractical because rebasing shared changes requires all repository users to perform the rebase. The git filter-branch command can be used to remove files throughout the commit history. This command must be used with caution. I'm demonstrating this command because it's built into Git and it's often found in blog posts, Stack Overflow answers, etc.

However, the maintainers of Git recommend an alternative option. I'll mention the other option at the end of this content. The git filter-branch command includes multiple types of filters. I'm going to use the tree filter (--tree-filter), which checks out each commit and runs a command against those checked-out files. Content changes made during this process are persisted. I'll use it to remove the large bin file from all commits made to the branch. Notice that the warning attempts to guide us towards another solution. For this example, I'll continue. This command removed the large bin file from all commits in the current branch. Rewriting the commit history requires changes to be pushed to the remote repository using the --force flag. To do that, I'll use git push with the --force flag. Reviewing the commit history in the web interface demonstrates that the file has been removed from the commit and the repository.
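The second scenario can be sketched end to end as follows. This is a hedged illustration rather than the exact commands from the demonstration: a local bare repository stands in for the Azure DevOps remote, and `demo-repo`, `remote.git`, `large.bin`, and the commit messages are all hypothetical. The `FILTER_BRANCH_SQUELCH_WARNING` variable suppresses the warning and pause that newer Git versions display, so the script can run unattended.

```shell
#!/bin/sh
# Sketch only: remote.git, demo-repo, large.bin, and the messages are hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# A local bare repository stands in for the shared remote.
git init -q --bare remote.git
git init -q demo-repo
cd demo-repo
git config user.email "demo@example.com"
git config user.name "Demo User"
git remote add origin "$workdir/remote.git"

echo "# Demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Commit a large file, followed by another commit.
dd if=/dev/zero of=large.bin bs=1048576 count=10 2>/dev/null
git add large.bin
git commit -q -m "Add large file"
echo "More notes" >> README.md
git commit -q -am "Update README"

# Share the commits by pushing the current branch.
git push -q origin HEAD

# Check out every commit reachable from HEAD and delete large.bin from it.
# rm -f keeps the filter from failing on commits that never had the file.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter 'rm -f large.bin' HEAD

# The rewritten history diverges from the remote, so a forced push is needed.
git push -q --force origin HEAD

git log --oneline
```

In this sketch the commit that added the file remains in the history as an empty commit, and git filter-branch keeps a backup of the original branch under `refs/original/` in the local repository.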

The git filter-branch command is known to offer multiple ways to unknowingly damage a repository. The official Git documentation includes a write-up regarding the safety issues related to this command. The Git maintainers recommend the use of a third-party tool called git filter-repo. The git filter-repo tool is known to be faster, more capable, and safer than git filter-branch. The documentation for git filter-repo includes a cheat sheet which demonstrates how to achieve similar results to git filter-branch. This tool is outside of the scope of this content. However, you now have a starting point for future research. Rewriting the commit history is potentially highly disruptive.

There are many ways to irreparably damage a repository. Before performing any actions that rewrite the history, consult with your team members to holistically understand the scope of the problem and make sure you have an answer that meets your needs. Okay, we're going to wrap up here. I hope this demonstration has been.


About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.