Project 2: Gitlet

A note on this spec

This spec is fairly long. The first half is a verbose and detailed description of every command you’ll support, and the other half is the testing details and some words of advice. To help you digest this, we’ve prepared many high quality videos describing portions of the spec and giving advice on how and where to begin. All videos are linked throughout this spec in the relevant location, but we’ll also list them right here for your convenience. Note: some of these videos were created in Spring 2020 when Gitlet was Project 3 and Capers was Lab 12, and some videos briefly mention Professor Hilfinger’s CS 61B setup (including a remote called shared, a repository called repo, etc). Please ignore these as they do not provide any useful information for you this semester. The actual content of the assignment is unchanged.

As more resources are created, we’ll add them here, so refresh often!

Overview of Gitlet

Warning: Ensure you’ve completed Lab 6: Canine Capers before this project. Lab 6 is intended to be an introduction to this project and will be very helpful in getting you started and ensuring you’re all set up. You should also have watched Lecture 12: Gitlet, which introduces many useful ideas for this project.

In this project you’ll be implementing a version-control system that mimics some of the basic features of the popular system Git. Ours is smaller and simpler, however, so we have named it Gitlet.

A version-control system is essentially a backup system for related collections of files. The main functionality that Gitlet supports is:

  1. Saving the contents of entire directories of files. In Gitlet, this is called committing, and the saved contents themselves are called commits.

  2. Restoring a version of one or more files or entire commits. In Gitlet, this is called checking out those files or that commit.

  3. Viewing the history of your backups. In Gitlet, you view this history in something called the log.

  4. Maintaining related sequences of commits, called branches.

  5. Merging changes made in one branch into another.

The point of a version-control system is to help you when creating complicated (or even not-so-complicated) projects, or when collaborating with others on a project. You save versions of the project periodically. If at some later point in time you accidentally mess up your code, then you can restore your source to a previously committed version (without losing any of the changes you made since then). If your collaborators make changes embodied in a commit, you can incorporate (merge) these changes into your own version.

In Gitlet, you don’t just commit individual files at a time. Instead, you can commit a coherent set of files at the same time. We like to think of each commit as a snapshot of your entire project at one point in time. However, for simplicity, many of the examples in the remainder of this document involve changes to just one file at a time. Just keep in mind you could change multiple files in each commit.

In this project, it will be helpful for us to visualize the commits we make over time. Suppose we have a project consisting just of the file wug.txt, we add some text to it, and commit it. Then we modify the file and commit these changes. Then we modify the file again, and commit the changes again. Now we have saved three total versions of this file, each one later in time than the previous. We can visualize these commits like so:

Three commits

Here we’ve drawn an arrow indicating that each commit contains some kind of reference to the commit that came before it. We call the commit that came before it the parent commit–this will be important later. But for now, does this drawing look familiar? That’s right; it’s a linked list!

The big idea behind Gitlet is that we can visualize the history of the different versions of our files in a list like this. Then it’s easy for us to restore old versions of files. You can imagine making a command like: “Gitlet, please revert to the state of the files at commit #2”, and it would go to the second node in the linked list and restore the copies of files found there, while removing any files that are in the first node, but not the second.
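To make the linked-list picture concrete, here is a minimal in-memory sketch. The class name CommitNode and its fields are hypothetical and are not part of the skeleton; it only illustrates the idea that each commit refers to its parent, and it ignores file contents and persistence entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a commit modeled as a node in a singly linked list. */
class CommitNode {
    final String message;
    final CommitNode parent; // null for the very first commit

    CommitNode(String message, CommitNode parent) {
        this.message = message;
        this.parent = parent;
    }

    /** Follows parent links from this commit back to the first commit. */
    List<String> history() {
        List<String> messages = new ArrayList<>();
        for (CommitNode c = this; c != null; c = c.parent) {
            messages.add(c.message);
        }
        return messages;
    }
}
```

Calling history() on the newest of three chained commits would list their messages newest first. Note that the Serialization Details section of this spec advises against storing parents as Java pointers once you persist commits to disk.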

If we tell Gitlet to revert to an old commit, the front of the linked list will no longer reflect the current state of your files, which might be a little misleading. In order to fix this problem, we introduce something called the head pointer (also called the HEAD pointer). The head pointer keeps track of where in the linked list we currently are. Normally, as we make commits, the head pointer will stay at the front of the linked list, indicating that the latest commit reflects the current state of the files:

Simple head

However, let’s say we revert to the state of the files at commit #2 (technically, this is the reset command, which you’ll see later in the spec). We move the head pointer back to show this:

Reverted head

Here we say that we are in a detached head state, which you may have encountered yourself before. This is what it means!

EDITED 3/5: Note that in Gitlet, there is no way to be in a detached head state since there is no checkout command that will move the HEAD pointer to a specific commit. The reset command will do that, though it also moves the branch pointer. Thus, in Gitlet, you will never be in a detached HEAD state.

All right, now, if this were all Gitlet could do, it would be a pretty simple system. But Gitlet has one more trick up its sleeve: it doesn’t just maintain older and newer versions of files, it can maintain differing versions. Imagine you’re coding a project, and you have two ideas about how to proceed: let’s call one Plan A, and the other Plan B. Gitlet allows you to save both versions, and switch between them at will. Here’s what this might look like, in our pictures:

Two versions

It’s not really a linked list anymore. It’s more like a tree. We’ll call this thing the commit tree. Keeping with this metaphor, each of the separate versions is called a branch of the tree. You can develop each version separately:

Two developed versions

There are two pointers into the tree, representing the furthest point of each branch. At any given time, only one of these is the currently active pointer, and this is what’s called the head pointer. The head pointer is the pointer at the front of the current branch.

That’s it for our brief overview of the Gitlet system! Don’t worry if you don’t fully understand it yet; the section above was just to give you a high-level picture of what it’s meant to do. A detailed spec of what you’re supposed to do for this project follows this section.

But a last word here: commit trees are immutable: once a commit node has been created, it can never be destroyed (or changed at all). We can only add new things to the commit tree, not modify existing things. This is an important feature of Gitlet! One of Gitlet’s goals is to allow us to save things so we don’t delete them accidentally.

Internal Structures

Real Git distinguishes several different kinds of objects. For our purposes, the important ones are

Gitlet simplifies from Git still further by

Every object–every blob and every commit in our case–has a unique integer id that serves as a reference to the object. An interesting feature of Git is that these ids are universal: unlike a typical Java implementation, two objects with exactly the same content will have the same id on all systems (i.e. my computer, your computer, and anyone else’s computer will compute this same exact id). In the case of blobs, “same content” means the same file contents. In the case of commits, it means the same metadata, the same mapping of names to references, and the same parent reference. The objects in a repository are thus said to be content addressable.

Both Git and Gitlet accomplish this the same way: by using a cryptographic hash function called SHA-1 (Secure Hash 1), which produces a 160-bit integer hash from any sequence of bytes. Cryptographic hash functions have the property that it is extremely difficult to find two different byte streams with the same hash value (or indeed to find any byte stream given just its hash value), so that essentially, we may assume that the probability that any two objects with different contents have the same SHA-1 hash value is 2^-160, or about 10^-48. Basically, we simply ignore the possibility of a hashing collision, so that the system has, in principle, a fundamental bug that in practice never occurs!

Fortunately, there are library classes for computing SHA-1 values, so you won’t have to deal with the actual algorithm. All you have to do is to make sure that you correctly label all your objects. In particular, this involves

By the way, the SHA-1 hash value, rendered as a 40-character hexadecimal string, makes a convenient file name for storing your data in your .gitlet directory (more on that below). It also gives you a convenient way to compare two files (blobs) to see if they have the same contents: if their SHA-1s are the same, we simply assume the files are the same.
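As a rough illustration of what computing such an id involves, the sketch below uses the standard library class java.security.MessageDigest to turn a byte sequence into a 40-character hexadecimal string. The class and method names (Sha1Sketch, sha1Hex) are made up for this example; the provided utilities may already wrap this for you, so check them before writing your own.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Sketch {
    /** Returns the SHA-1 hash of DATA as a 40-character lowercase hex string. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(data); // 20 bytes = 160 bits
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every standard Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Two byte arrays with identical contents always produce the same string, which is what makes the hash usable both as a file name and as a cheap equality check.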

For remotes (like skeleton which we’ve been using all semester), we’ll simply use other Gitlet repositories. Pushing simply means copying all commits and blobs that the remote repository does not yet have to the remote repository, and resetting a branch reference. Pulling is the same, but in the other direction. Remotes are extra credit in this project and not required for full credit.

Reading and writing your internal objects from and to files is actually pretty easy, thanks to Java’s serialization facilities. The interface java.io.Serializable has no methods, but if a class implements it, then the Java runtime will automatically provide a way to convert to and from a stream of bytes, which you can then write to a file using the I/O class java.io.ObjectOutputStream and read back (and deserialize) with java.io.ObjectInputStream. The term “serialization” refers to the conversion from some arbitrary structure (array, tree, graph, etc.) to a serial sequence of bytes. You should have seen and gotten practice with serialization in lab 6. You’ll be using a very similar approach here, so do use your lab6 as a resource when it comes to persistence and serialization.
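The round trip described above can be sketched as follows. The Note class and the helper names are hypothetical, and the example serializes to an in-memory byte array rather than a file so that it is self-contained; treat it as an illustration of the mechanism, not as the structure your project should have.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    /** Hypothetical example class; implementing Serializable adds no methods. */
    static class Note implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;

        Note(String text) {
            this.text = text;
        }
    }

    /** Converts OBJ to a serial sequence of bytes. */
    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Rebuilds an object from bytes produced by serialize. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The caller has to cast the result of deserialize back to the expected type, for example (Note) deserialize(bytes).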

Here is a summary example of the structures discussed in this section. As you can see, each commit (rectangle) points to some blobs (circles), which contain file contents. The commits contain the file names and references to these blobs, as well as a parent link. These references, depicted as arrows, are represented in the .gitlet directory using their SHA-1 hash values (the small hexadecimal numerals above the commits and below the blobs). The newer commit contains an updated version of wug1.txt, but shares the same version of wug2.txt as the older commit. Your commit class will somehow store all of the information that this diagram shows: a careful selection of internal data structures will make the implementation easier or harder, so it behooves you to spend time planning and thinking about the best way to store everything.

Two commits and their blobs

Detailed Spec of Behavior

Overall Spec

The only structure requirement we’re giving you is that you have a class named gitlet.Main and that it has a main method.

We are also giving you some utility methods for performing a number of mostly file-system-related tasks, so that you can concentrate on the logic of the project rather than the peculiarities of dealing with the OS.

We have also added two suggested classes: Commit, and Repository to get you started. You may, of course, write additional Java classes to support your project or remove our suggested classes if you’d like. But don’t use any external code (aside from JUnit), and don’t use any programming language other than Java. You can use all of the Java Standard Library that you wish, plus utilities we provide.

You should not do everything in the Main class. Your Main class should mostly be calling helper methods in the Repository class. See the CapersRepository and Main classes from lab 6 for examples of the structure that we recommend.

The majority of this spec will describe how gitlet.Main’s main method must react when it receives various gitlet commands as command-line arguments. But before we break down command-by-command, here are some overall guidelines the whole project should satisfy:

The Commands

We now go through each command you must support in detail. Remember that good programmers always care about their data structures: as you read these commands, you should think first about how you should store your data to easily support these commands and second about if there is any opportunity to reuse commands that you’ve already implemented (hint: there is ample opportunity in this project to reuse code in later parts of project 2 that you’ve already written in earlier parts of project 2). We have listed lectures in some methods that we have found useful, but you are not required to use concepts from these lectures. There are conceptual quizzes on some of the more confusing commands that you should definitely use to check your understanding. The quizzes are not for a grade; they are only there to help you check your understanding before trying to implement the command.

init

add

commit

Here’s a picture of before-and-after commit:

Before and after commit

rm

log

===
commit a0da1ea5a15ab613bf9961fd86f010cf74c7ee48
Date: Thu Nov 9 20:00:05 2017 -0800
A commit message.

===
commit 3e8bf1d794ca2e9ef8a4007275acf3751c7170ff
Date: Thu Nov 9 17:01:33 2017 -0800
Another commit message.

===
commit e881c9575d180a215d1a636545b8fd9abfb1d2bb
Date: Wed Dec 31 16:00:00 1969 -0800
initial commit

There is a === before each commit and an empty line after it. As in real Git, each entry displays the unique SHA-1 id of the commit object. The timestamps displayed in the commits reflect the current timezone, not UTC; as a result, the timestamp for the initial commit does not read Thursday, January 1st, 1970, 00:00:00, but rather the equivalent Pacific Standard Time. Your timezone might be different depending on where you live, and that’s fine.

Display commits with the most recent at the top. By the way, you’ll find that the Java classes java.util.Date and java.util.Formatter are useful for getting and formatting times. Look into them instead of trying to format timestamps manually yourself!
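One possible way to produce a Date line in the format shown above is sketched here, using the date/time conversions of java.util.Formatter via String.format. The class and method names are hypothetical. The sketch takes an explicit TimeZone and uses a Calendar so the result is reproducible; if you pass a java.util.Date to the same format string instead, the system default timezone is used.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

class LogDateSketch {
    /** Formats EPOCHMILLIS as a log Date line in the given ZONE. */
    static String formatDate(long epochMillis, TimeZone zone) {
        Calendar when = Calendar.getInstance(zone, Locale.US);
        when.setTimeInMillis(epochMillis);
        // %ta day name, %tb month name, %te day of month (no leading zero),
        // %tT 24-hour HH:MM:SS, %tY four-digit year, %tz numeric zone offset.
        return String.format(Locale.US,
            "Date: %1$ta %1$tb %1$te %1$tT %1$tY %1$tz", when);
    }
}
```

With an epoch time of 0 and a zone eight hours behind UTC, this yields the same Date line as the initial commit in the sample log above.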

Of course, the SHA-1 identifiers are going to be different, so don’t worry about those. Our tests will ensure that you have something that “looks like” a SHA-1 identifier (more on that in the testing section below).

For merge commits (those that have two parent commits), add a line just below the first, as in

===
commit 3e8bf1d794ca2e9ef8a4007275acf3751c7170ff
Merge: 4975af1 2c1ead1
Date: Sat Nov 11 12:30:00 2017 -0800
Merged development into master.

where the two hexadecimal numerals following “Merge:” consist of the first seven digits of the first and second parents’ commit ids, in that order. The first parent is the branch you were on when you did the merge; the second is that of the merged-in branch. This is as in regular Git.
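Putting the pieces of a log entry together might look like the sketch below. The class name, method name, and parameter list are hypothetical; the only behavior taken from this spec is the layout of the entry and the rule that the Merge line shows the first seven digits of each parent id, first parent first.

```java
class LogEntrySketch {
    /**
     * Builds one log entry. Pass null for SECONDPARENTID when the commit
     * is not a merge commit; DATELINE is the already formatted Date line.
     */
    static String entry(String commitId, String firstParentId,
                        String secondParentId, String dateLine, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("===\n");
        sb.append("commit ").append(commitId).append('\n');
        if (secondParentId != null) {
            // Only merge commits get this line, directly below the commit line.
            sb.append("Merge: ")
              .append(firstParentId, 0, 7).append(' ')
              .append(secondParentId, 0, 7).append('\n');
        }
        sb.append(dateLine).append('\n');
        sb.append(message).append('\n');
        sb.append('\n'); // the empty line after each entry
        return sb.toString();
    }
}
```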

Here’s a picture of the history of a particular commit. If the current branch’s head pointer happened to be pointing to that commit, log would print out information about the circled commits:

History

The history ignores other branches and the future. Now that we have the concept of history, let’s refine what we said earlier about the commit tree being immutable. It is immutable precisely in the sense that the history of a commit with a particular id may never change, ever. If you think of the commit tree as nothing more than a collection of histories, then what we’re really saying is that each history is immutable.

global-log

find

status

checkout

Checkout is a kind of general command that can do a few different things depending on what its arguments are. There are 3 possible use cases. In each section below, you’ll see 3 numbered points. Each corresponds to the respective usage of checkout.

A [commit id] is, as described earlier, a hexadecimal numeral. A convenient feature of real Git is that one can abbreviate commits with a unique prefix. For example, one can abbreviate

a0da1ea5a15ab613bf9961fd86f010cf74c7ee48

as

a0da1e

in the (likely) event that no other object exists with a SHA-1 identifier that starts with the same six digits. You should arrange for the same thing to happen for commit ids that contain fewer than 40 characters. Unfortunately, using shortened ids might slow down the finding of objects if implemented naively (making the time to find a file linear in the number of objects), so we won’t worry about timing for commands that use shortened ids. We suggest, however, that you poke around in a .git directory (specifically, .git/objects) and see how it manages to speed up its search. You will perhaps recognize a familiar data structure implemented with the file system rather than pointers.
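A naive version of this lookup, with exactly the linear cost mentioned above, could look like the following sketch. The names are hypothetical, and returning null for an ambiguous prefix is an assumption made for this example; this section does not say what should happen in that case.

```java
import java.util.Collection;

class AbbreviationSketch {
    /**
     * Returns the single full id in ALLIDS that starts with PREFIX, or
     * null if no id matches or (by assumption here) more than one does.
     */
    static String resolve(String prefix, Collection<String> allIds) {
        String match = null;
        for (String id : allIds) { // linear scan over every known id
            if (id.startsWith(prefix)) {
                if (match != null) {
                    return null; // ambiguous: two ids share this prefix
                }
                match = id;
            }
        }
        return match;
    }
}
```

A full 40-character id is simply the longest possible prefix, so the same method handles both cases.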

Only version 3 (checkout of a full branch) modifies the staging area: otherwise files scheduled for addition or removal remain so.

branch

All right, let’s see what branch does in detail. Suppose our state looks like this:

Simple history

Now we call java gitlet.Main branch cool-beans. Then we get this:

Just called branch

Hmm… nothing much happened. Let’s switch to the branch with java gitlet.Main checkout cool-beans:

Just switched branch

Nothing much happened again?! Okay, say we make a commit now. Modify some files, then java gitlet.Main add... then java gitlet.Main commit...

Commit on branch

I was told there would be branching. But all I see is a straight line. What’s going on? Maybe I should go back to my other branch with java gitlet.Main checkout master:

Checkout master

Now I make a commit…

Branched

Phew! So that’s the whole idea of branching. Did you catch what’s going on? All that creating a branch does is to give us a new pointer. At any given time, one of these pointers is considered the currently active pointer, also called the HEAD pointer (indicated by *). We can switch the currently active head pointer with checkout [branch name]. Whenever we commit, it means we add a child commit to the currently active HEAD commit even if there is already a child commit. This naturally creates branching behavior as a commit can now have multiple children.
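The pointer bookkeeping in the walkthrough above can be modeled with a small sketch. Everything here is hypothetical (the class name, the use of a map from branch name to commit id, and the short placeholder ids), and it deliberately leaves out files, staging, and error handling; it only shows that branch adds a pointer, checkout switches which pointer is active, and commit advances the active pointer.

```java
import java.util.HashMap;
import java.util.Map;

class BranchSketch {
    /** Maps each branch name to the id of the commit at its front. */
    final Map<String, String> branches = new HashMap<>();
    /** Name of the currently active branch. */
    String activeBranch = "master";

    BranchSketch(String initialCommitId) {
        branches.put("master", initialCommitId);
    }

    /** The commit id the HEAD pointer currently refers to. */
    String head() {
        return branches.get(activeBranch);
    }

    /** Creates a new pointer at the current head; does not switch to it. */
    void branch(String name) {
        branches.put(name, head());
    }

    /** Switches the active pointer (restoring files is omitted here). */
    void checkout(String name) {
        activeBranch = name;
    }

    /** Records a new commit by advancing only the active branch's pointer. */
    void commit(String newCommitId) {
        branches.put(activeBranch, newCommitId);
    }
}
```

Replaying the walkthrough (branch cool-beans, checkout cool-beans, commit, checkout master, commit) leaves the two branch pointers on different commits, which is the branching shown in the last picture.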

A video example and overview of branching can be found here

Make sure that the behavior of your branch, checkout, and commit match what we’ve described above. This is pretty core functionality of Gitlet that many other commands will depend upon. If any of this core functionality is broken, very many of our autograder tests won’t work!

rm-branch

reset

merge

<<<<<<< HEAD
contents of file in current branch
=======
contents of file in given branch
>>>>>>>

(replacing “contents of…” with the indicated file’s contents) and stage the result. Treat a deleted file in a branch as an empty file. Use straight concatenation here. In the case of a file with no newline at the end, you might well end up with something like this:

<<<<<<< HEAD
contents of file in current branch=======
contents of file in given branch>>>>>>>

This is fine; people who produce non-standard, pathological files because they don’t know the difference between a line terminator and a line separator deserve what they get.
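The straight concatenation described above can be sketched in a few lines. The class and method names are hypothetical, and ending the result with a newline after the closing marker is an assumption of this sketch. A file deleted in one branch is passed in as the empty string.

```java
class ConflictSketch {
    /** Builds the contents of a conflicted file by plain concatenation. */
    static String conflictContents(String currentContents, String givenContents) {
        return "<<<<<<< HEAD\n"
            + currentContents
            + "=======\n"
            + givenContents
            + ">>>>>>>\n";
    }
}
```

Because nothing is inserted between the pieces, contents that lack a final newline run straight into the next marker, exactly as in the second example above.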

Once files have been updated according to the above, and the split point was not the current branch or the given branch, merge automatically commits with the log message Merged [given branch name] into [current branch name]. Then, if the merge encountered a conflict, print the message Encountered a merge conflict. on the terminal (not the log). Merge commits differ from other commits: they record as parents both the head of the current branch (called the first parent) and the head of the branch given on the command line to be merged in.

A video walkthrough of this command can be found here.

By the way, we hope you’ve noticed that the set of commits has progressed from a simple sequence to a tree and now, finally, to a full directed acyclic graph.

Skeleton

The skeleton is fairly bare bones with mostly empty classes. We’ve provided helpful javadoc comments hinting at what you might want to include in each file. You should follow a similar approach to Capers where your Main class doesn’t do a whole lot of work by itself, but rather simply calls other methods depending on the args. You’re absolutely welcome to delete the other classes or add your own, but the Main class should remain otherwise our tests won’t be able to find your code.

If you’re confused on where to start, we suggest looking over Lab 6: Canine Capers.

Design Document

Since you are not working from a substantial skeleton this time, we are asking that everybody submit a design document describing their implementation strategy. It is not graded, but you must have an up-to-date and completed design document before we help you in Office Hours or on a Gitbug. If you do not have one or it’s not up-to-date/not complete, we cannot help you. This is for both of our sakes: by having a design doc, you have written out a road map for how you will tackle the assignment. If you need help creating a design document, we can definitely help with that :) Here are some guidelines, as well as an example from the Capers lab.

Grader Details

We have three graders for Gitlet: the checkpoint grader, the full grader, and the snaps grader.

Checkpoint Grader

Due 3/12 at 11:59 PM for 16 extra credit points.

Submit to the Project 2: Gitlet Checkpoint autograder on Gradescope.

It will test:

In addition, it will comment on (but not score):

We will score these in your final submission. EDITED 3/4: It’s ok to have compiler warnings.

You’ll have a maximum capacity of 1 token which will refresh every 20 minutes. You will not get full logs on these failures (i.e. you will be told what test you failed but not any additional message), though since you have the tests themselves you can simply debug it locally.

Full Grader

Due 4/2 at 11:59 PM for 1600 points.

The full grader is a more substantial and comprehensive test suite. You’ll have a maximum capacity of 1 token. Here is the schedule of token recharge rates:

You’ll see that, like Project 1, there is limited access to the grader. Please be kind to yourself and write tests along the way so you do not become too reliant on the autograder for checking your work.

Similar to the checkpoint, the full grader will have English hints on what each test does but not the actual .in file.

Snaps Grader

Due 4/9 at 11:59 PM. Your Gradescope score will not be transferred to Beacon until you’ve pushed your snaps repo and submitted to the Snaps Gradescope assignment. To push your snaps repo, run these commands:

cd $SNAPS_DIR
git push

After you’ve pushed your snaps repository, there is a Gradescope assignment that you will submit your snaps-sp21-s*** repository to (similar to Project 1). This is only for the full grader (not the checkpoint nor the extra credit assignment).

You can do this up to a week after the deadline as well in case you forget. If you forget to push after a week, then you’ll have to use slip days.

Extra credit

There are a total of 16 + 32 + 64 = 112 extra credit points possible:

  1. 16 for the checkpoint
  2. 32 for the status command printing the Modifications Not Staged For Commit and Untracked Files sections
  3. 64 for the remote commands

The rest of this spec is filled with resources for you that you should read to get you started. The section on testing/debugging will be extremely helpful to you as testing and debugging in this project will be different than previous projects, but not so complicated.

Miscellaneous Things to Know about the Project

Phew! That was a lot of commands to go over just now. But don’t worry, not all commands are of the same difficulty. You can see for each command the approximate number of lines we took to do each part (this only counts code specific to that command – it doesn’t double-count code reused in multiple commands). You shouldn’t worry about matching our solution exactly, but hopefully it gives you an idea about the relative time consumed by each command. Merge is a lengthier command than the others, so don’t leave it for the last minute!

This is an ambitious project, and it would not be surprising for you to feel lost as to where to begin. Therefore, feel free to collaborate with others a little more closely than usual, with the following caveats:

The Ed megathreads typically get very long for Gitlet, but they are full of very good conversation and discussion on the approach for particular commands. In this project more than any other, you should take advantage of the size of the class and see if you can find someone with a similar question to you on the megathread. It’s very unlikely that your question is so unique to you that nobody else has had it (unless it is a bug that relates to your design, in which case you should submit a Gitbug).

By now this spec has given you enough information to get working on the project. But to help you out some more, there are a couple of things you should be aware of:

Dealing with Files

This project requires reading and writing of files. In order to do these operations, you might find the classes java.io.File and java.nio.file.Files helpful. Actually, you may find various things in the java.io and java.nio packages helpful. Be sure to read the gitlet.Utils package for other things we’ve written for you. If you do a little digging through all of these, you might find a couple of methods that will make the io portion of this project much easier! One warning: If you find yourself using readers, writers, scanners, or streams, you’re making things more complicated than need be.
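As an example of the kind of method worth digging for, java.nio.file.Files can read or write a whole file in a single call, with no readers, writers, scanners, or streams. The sketch below is only an illustration with hypothetical names; the provided utilities may offer similar helpers, so look there first.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileIoSketch {
    /** Writes TEXT to FILE, replacing any existing contents. */
    static void writeText(Path file, String text) {
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the entire contents of FILE as a string. */
    static String readText(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demonstration helper: writes TEXT to a temporary file and reads it back. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("gitlet-sketch", ".txt");
            try {
                writeText(tmp, text);
                return readText(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```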

Serialization Details

If you think about Gitlet, you’ll notice that you can only run one command every time you run the program. In order to successfully complete your version-control system, you’ll need to remember the commit tree across commands. This means you’ll have to design not just a set of classes to represent internal Gitlet structures during execution, but you’ll need an analogous representation as files within your .gitlet directories, which will carry across multiple runs of your program.

As indicated earlier, the convenient way to do this is to serialize the runtime objects that you will need to store permanently in files. The Java runtime does all the work of figuring out what fields need to be converted to bytes and how to do so.

You’ve already done serialization in lab6 and so we will not repeat the information here. If you are still confused on some aspect of serialization, re-read the relevant portion of the lab6 spec and also look over your code.

There is, however, one annoying subtlety to watch out for: Java serialization follows pointers. That is, not only is the object you pass into writeObject serialized and written, but any object it points to as well. If your internal representation of commits, for example, represents the parent commits as pointers to other commit objects, then writing the head of a branch will write all the commits (and blobs) in the entire subgraph of commits into one file, which is generally not what you want. To avoid this, don’t use Java pointers to refer to commits and blobs in your runtime objects, but instead use SHA-1 hash strings. Maintain a runtime map between these strings and the runtime objects they refer to. You create and fill in this map while Gitlet is running, but never read or write it to a file.

You might find it convenient to have (redundant) pointers to commits as well as SHA-1 strings to avoid the bother and execution time required to look them up each time. You can store such pointers in your objects while still avoiding having them written out by declaring them “transient”, as in

    private transient MyCommitType parent1;

Such fields will not be serialized, and when read back in and deserialized, will be set to their default values (null for reference types). You must be careful when reading the objects that contain transient fields back in to set the transient fields to appropriate values.
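The following sketch shows one way these pieces could fit together: the parent is stored as a SHA-1 string, a transient pointer acts as a cache, and the pointer is refilled from a runtime map after deserialization. All names are hypothetical, the short id in the usage is a placeholder rather than a real hash, and the round trip happens in memory purely to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

class TransientSketch {
    /** Hypothetical commit type; the field names are illustrative. */
    static class MyCommit implements Serializable {
        private static final long serialVersionUID = 1L;

        final String message;
        final String parentId;             // serialized: id string, null for the first commit
        private transient MyCommit parent; // never serialized; null after reading back

        MyCommit(String message, String parentId, MyCommit parent) {
            this.message = message;
            this.parentId = parentId;
            this.parent = parent;
        }

        boolean hasCachedParent() {
            return parent != null;
        }

        /** Refills the cached pointer on demand from a runtime map of id to commit. */
        MyCommit parent(Map<String, MyCommit> loadedCommits) {
            if (parent == null && parentId != null) {
                parent = loadedCommits.get(parentId);
            }
            return parent;
        }
    }

    /** Serializes and immediately deserializes COMMIT, entirely in memory. */
    static MyCommit roundTrip(MyCommit commit) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(commit); // the transient parent is not followed
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (MyCommit) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the pointer is transient, writing one commit no longer drags its whole ancestry into the same file; the cost is that code reading a commit back must be prepared to find the pointer null.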

Unfortunately, looking at the serialized files your program has produced with a text editor (for debugging purposes) would be rather unrevealing; the contents are encoded in Java’s private serialization encoding. We have therefore provided a simple debugging utility program you might find useful: gitlet.DumpObj. See the Javadoc comment on gitlet/DumpObj.java for details.

Testing

You should read through this entire section, though a video is also available for your convenience.

As usual, testing is part of the project. Be sure to provide your own integration tests for each of the commands, covering all the specified functionality. Also, feel free to add any unit tests you’d like. We don’t provide any unit tests since unit tests are highly dependent on your implementation.

We have provided a testing program that makes it relatively easy to write integration tests: testing/tester.py. This interprets testing files with an .in extension. You may run all of the tests with the command

make check

If you’d like additional information on the failed tests, such as what your program is outputting, run:

make check TESTER_FLAGS="--verbose"

If you’d like to run a single test, within the testing subdirectory, run the command

python3 tester.py --verbose FILE.in ...

where FILE.in ... is a list of specific .in files you want to check.

CAREFUL RUNNING THIS COMMAND as it does not recompile your code. Every time you run a python command, you must first compile your code (via make).

The command

python3 tester.py --verbose --keep FILE.in

will, in addition, keep around the directory that tester.py produces so that you can examine its files at the point the tester script detected an error. If your test did not error, then the directory will still remain there with the final contents of everything.

In effect, the tester implements a very simple domain-specific language (DSL) that contains commands to

python3 testing/tester.py

(with no operands, as shown) will provide a message documenting this language. We’ve provided some examples in the directory testing/samples. Don’t put your own tests in that subdirectory; place them somewhere distinct so you don’t get confused with our tests vs your tests (which may be buggy!). Put all your .in files in another folder called student_tests within the testing directory. In the skeleton, this folder is blank.

We’ve added a few things to the Makefile to adjust for differences in people’s setups. If your system’s command for invoking Python 3 is simply python, you can still use our makefile unchanged by using

make PYTHON=python check

You can pass additional flags to tester.py with, for example:

make TESTER_FLAGS="--keep --verbose"

Testing on the Staff Solution

As of Sunday February 28th, there is now a way for you to use the staff solution to verify your understanding of commands as well as verify your own tests! The guide is here.

Understanding Integration Tests

The first thing we’ll ask for in Gitbugs and when you come to receive help in Office Hours is a test that you’re failing, so it’s paramount that you learn to write tests in this project. We’ve done a lot of work to make this as painless as possible, so please take the time to read through this section so you can understand the provided tests and write good tests yourself.

The integration tests are of similar format to those from Capers. If you don’t know how the Capers integration tests (i.e. the .in files) work, then read that section from the capers spec first.

The provided tests are hardly comprehensive, and you’ll definitely need to write your own tests to get a full score on the project. To write a test, let’s first understand how this all works.

Here is the structure of the testing directory:

.
├── Makefile
├── student_tests                    <==== Your .in files will go here
├── samples                          <==== Sample .in files we provide
│   ├── test01-init.in               <==== An example test
│   ├── test02-basic-checkout.in
│   ├── test03-basic-log.in
│   ├── test04-prev-checkout.in
│   └── definitions.inc
├── src                              <==== Contains files used for testing
│   ├── notwug.txt
│   └── wug.txt
├── runner.py                        <==== Script to help debug your program
└── tester.py                        <==== Script that tests your program

Just like Capers, these tests work by creating a temporary directory within the testing directory and running the commands specified by a .in file. If you use the --keep flag, this temporary directory will remain after the test finishes so you can inspect it.

Unlike Capers, we’ll need to deal with the contents of files in our working directory. So in this testing folder, we have an additional folder called src. This directory stores many pre-filled .txt files that have particular contents we need. We’ll come back to this later, but for now just know that src stores actual file contents. samples has the .in files of the sample tests (which are the checkpoint tests). When you create your own tests, you should add them to the student_tests folder which is initially empty in the skeleton.

The .in files have more functions in Gitlet. Here is the explanation straight from the tester.py file:

# ...  A comment, producing no effect.
I FILE Include.  Replace this statement with the contents of FILE,
      interpreted relative to the directory containing the .in file.
C DIR  Create, if necessary, and switch to a subdirectory named DIR under
      the main directory for this test.  If DIR is missing, changes
      back to the default directory.  This command is principally
      intended to let you set up remote repositories.
T N    Set the timeout for gitlet commands in the rest of this test to N
      seconds.
+ NAME F
      Copy the contents of src/F into a file named NAME.
- NAME
      Delete the file named NAME.
> COMMAND OPERANDS
LINE1
LINE2
...
<<<
      Run gitlet.Main with COMMAND ARGUMENTS as its parameters.  Compare
      its output with LINE1, LINE2, etc., reporting an error if there is
      "sufficient" discrepancy.  The <<< delimiter may be followed by
      an asterisk (*), in which case, the preceding lines are treated as
      Python regular expressions and matched accordingly. The directory
      or JAR file containing the gitlet.Main program is assumed to be
      in directory DIR specified by --progdir (default is ..).
= NAME F
      Check that the file named NAME is identical to src/F, and report an
      error if not.
* NAME
      Check that the file NAME does not exist, and report an error if it
      does.
E NAME
      Check that file or directory NAME exists, and report an error if it
      does not.
D VAR "VALUE"
      Defines the variable VAR to have the literal value VALUE.  VALUE is
      taken to be a raw Python string (as in r"VALUE").  Substitutions are
      first applied to VALUE.

Don’t worry about the Python regular expressions thing mentioned in the above description: we’ll show you that it’s fairly straightforward and even go through an example of how to use it.

Let’s walk through a test to see what happens from start to finish. Let’s examine test02-basic-checkout.in.

Example test

When we first run this test, a temporary directory gets created that is initially empty. Our directory structure is now:

.
├── Makefile
├── student_tests
├── samples
│   ├── test01-init.in
│   ├── test02-basic-checkout.in
│   ├── test03-basic-log.in
│   ├── test04-prev-checkout.in
│   └── definitions.inc
├── src
│   ├── notwug.txt
│   └── wug.txt
├── test02-basic-checkout_0          <==== Just created
├── runner.py
└── tester.py

This temporary directory is the Gitlet repository that will be used for this execution of the test, so we will add things there and run all of our Gitlet commands there as well. If you ran the test a second time without deleting the directory, it’ll create a new directory called test02-basic-checkout_1, and so on. Each execution of a test uses its own directory, so don’t worry about tests interfering with each other as that cannot happen.
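The naming scheme can be pictured with a short Python sketch. It only illustrates the behavior described above and is not the code tester.py actually uses; the helper name next_test_dir is made up for this example.

```python
import os
import tempfile


def next_test_dir(parent, test_name):
    """Create and return the first unused directory named <test_name>_<n>.

    Hypothetical helper: it mirrors the behavior described in the spec,
    not the real implementation inside tester.py.
    """
    n = 0
    while True:
        candidate = os.path.join(parent, f"{test_name}_{n}")
        try:
            os.mkdir(candidate)  # raises FileExistsError if the name is taken
            return candidate
        except FileExistsError:
            n += 1


# Usage: two runs of the same test get two separate directories.
with tempfile.TemporaryDirectory() as testing_dir:
    first = next_test_dir(testing_dir, "test02-basic-checkout")
    second = next_test_dir(testing_dir, "test02-basic-checkout")
    print(os.path.basename(first))   # test02-basic-checkout_0
    print(os.path.basename(second))  # test02-basic-checkout_1
```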

The first line of the test is a comment, so we ignore it.

The next section is:

> init
<<<

This shouldn’t have any output as we can tell by this section not having any text between the first line with > and the line with <<<. But, as we know, this should create a .gitlet folder. So our directory structure is now:

.
├── Makefile
├── student_tests
├── samples
│   ├── test01-init.in
│   ├── test02-basic-checkout.in
│   ├── test03-basic-log.in
│   ├── test04-prev-checkout.in
│   └── definitions.inc
├── src
│   ├── notwug.txt
│   └── wug.txt
├── test02-basic-checkout_0
│   └── .gitlet                     <==== Just created
├── runner.py
└── tester.py

The next section is:

+ wug.txt wug.txt

This line uses the + command. This will take the file on the right-hand side from the src directory and copy its contents to the file on the left-hand side in the temporary directory (creating it if it doesn’t exist). They happen to have the same name, but that doesn’t matter since they’re in different directories. After this command, our directory structure is now:

.
├── Makefile
├── student_tests
├── samples
│   ├── test01-init.in
│   ├── test02-basic-checkout.in
│   ├── test03-basic-log.in
│   ├── test04-prev-checkout.in
│   └── definitions.inc
├── src
│   ├── notwug.txt
│   └── wug.txt
├── test02-basic-checkout_0
│   ├── .gitlet
│   └── wug.txt                     <==== Just created
├── runner.py
└── tester.py

Now we see what the src directory is used for: it contains file contents that the tests can use to set up the Gitlet repository however you want. If you want to add special contents to a file, you should add those contents to an appropriately named file in src and then use the same + command as we have here. It’s easy to get confused with the order of arguments, so make sure the right-hand side is referencing the file in the src directory, and the left-hand side is referencing the file in the temporary directory.
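If the argument order is hard to remember, it may help to see the + command as a plain file copy. The Python sketch below is an illustration under that assumption; plus_command is a made-up name, the file contents are invented, and tester.py's own code may differ.

```python
import os
import shutil
import tempfile


def plus_command(src_dir, work_dir, name, f):
    """Mimic '+ NAME F': copy src/F into the working directory as NAME.

    Hypothetical helper for illustration only.
    """
    shutil.copyfile(os.path.join(src_dir, f), os.path.join(work_dir, name))


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src")
    work = os.path.join(root, "test02-basic-checkout_0")
    os.mkdir(src)
    os.mkdir(work)
    with open(os.path.join(src, "notwug.txt"), "w") as fh:
        fh.write("This is not a wug.\n")  # made-up contents

    # Equivalent of the line "+ wug.txt notwug.txt":
    # left operand = file in the temporary directory,
    # right operand = file in src.
    plus_command(src, work, "wug.txt", "notwug.txt")
    with open(os.path.join(work, "wug.txt")) as fh:
        print(fh.read(), end="")  # This is not a wug.
```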

The next section is:

> add wug.txt
<<<

As you can see, there should be no output. The wug.txt file is now staged for addition in the temporary directory. At this point, your directory structure will likely change within the test02-basic-checkout_0/.gitlet directory since you’ll need to somehow persist the fact that wug.txt is staged for addition.

The next section is:

> commit "added wug"
<<<

And, again, there is no output, and, again, your directory structure within .gitlet might change.

The next section is:

+ wug.txt notwug.txt

Since wug.txt already exists in our temporary directory, its contents change to be whatever was in src/notwug.txt.

The next section is:

> checkout -- wug.txt
<<<

Which, again, has no output. However, it should change the contents of wug.txt in our temporary directory back to its original contents which is exactly the contents of src/wug.txt. The next command is what asserts that:

= wug.txt wug.txt

This is an assertion: if the file on the left-hand side (again, this is in the temporary directory) doesn’t have the exact contents of the file on the right-hand side (from the src directory), the testing script will error and say your file contents are not correct.

There are two other assertion commands available to you:

E NAME

Will assert that there exists a file/folder named NAME in the temporary directory. It doesn’t check the contents, only that it exists. If no file/folder with that name exists, the test will fail.

* NAME

Will assert that there does NOT exist a file/folder named NAME in the temporary directory. If there does exist a file/folder with that name, the test will fail.

The = wug.txt wug.txt assertion happened to be the last line of the test, so the test finishes. If the --keep flag was provided, the temporary directory will remain; otherwise it will be deleted. You might want to keep it if you suspect your .gitlet directory is not being properly set up or there is some issue with persistence.
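The three assertion commands (=, E, and *) boil down to simple file-system checks. The Python sketch below shows one way to think about them; the function names and file contents are invented for this example, and this is not the code in tester.py.

```python
import filecmp
import os
import tempfile


def check_same(work_dir, src_dir, name, f):
    """'= NAME F': NAME in the test directory must be identical to src/F."""
    a, b = os.path.join(work_dir, name), os.path.join(src_dir, f)
    # shallow=False forces a byte-by-byte comparison of the contents.
    return os.path.isfile(a) and filecmp.cmp(a, b, shallow=False)


def check_exists(work_dir, name):
    """'E NAME': a file or directory called NAME must exist."""
    return os.path.exists(os.path.join(work_dir, name))


def check_missing(work_dir, name):
    """'* NAME': nothing called NAME may exist."""
    return not os.path.exists(os.path.join(work_dir, name))


with tempfile.TemporaryDirectory() as root:
    src, work = os.path.join(root, "src"), os.path.join(root, "work")
    os.mkdir(src)
    os.mkdir(work)
    for folder in (src, work):
        with open(os.path.join(folder, "wug.txt"), "w") as fh:
            fh.write("This is a wug.\n")  # made-up contents

    print(check_same(work, src, "wug.txt", "wug.txt"))  # True
    print(check_exists(work, "wug.txt"))                # True
    print(check_missing(work, "notwug.txt"))            # True
```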

Setup for a test

As you’ll soon discover, there can be a lot of repeated setup to test a particular command: for example, if you’re testing the checkout command you need to:

  1. Initialize a Gitlet Repository
  2. Create a commit with a file in some version (v1)
  3. Create another commit with that file in some other version (v2)
  4. Checkout that file to v1

And perhaps even more if you want to test with files that were untracked in the second commit but tracked in the first.

So the way you can save yourself time is by adding all that setup in a file and using the I command. Say we do that here:

# Initialize, add, and commit a file.
> init
<<<
+ a.txt wug.txt
> add a.txt
<<<
> commit "a is a wug"
<<<

We should place this file with the rest of the tests in the samples directory, but with a file extension .inc, so maybe we name it samples/commit_setup.inc. If we gave it the file extension .in, our testing script will mistake it for a test and try to run it individually. Now, in our actual test, we simply use the command:

I commit_setup.inc

This will have the testing script run all of the commands in that file and keep the temporary directory it creates. This keeps your tests relatively short and thus easier to read.
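Conceptually, the I command is a textual include: the line is replaced by the contents of the named file, looked up relative to the directory containing the .in file. The Python sketch below illustrates that idea only. The function expand_includes and the file my_test.in are hypothetical, and this simplified version would also expand an expected-output line that happens to start with "I ", which a real test runner has to avoid.

```python
import os
import tempfile


def expand_includes(path):
    """Return the lines of a .in file with each 'I FILE' line replaced by
    the lines of FILE, resolved relative to the including file's directory.
    """
    base = os.path.dirname(os.path.abspath(path))
    lines = []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("I "):
                included = os.path.join(base, line[2:].strip())
                lines.extend(expand_includes(included))  # nested includes
            else:
                lines.append(line)
    return lines


with tempfile.TemporaryDirectory() as tests:
    with open(os.path.join(tests, "commit_setup.inc"), "w") as fh:
        fh.write("> init\n<<<\n+ a.txt wug.txt\n")
    with open(os.path.join(tests, "my_test.in"), "w") as fh:
        fh.write("# My test\nI commit_setup.inc\n= a.txt wug.txt\n")
    for line in expand_includes(os.path.join(tests, "my_test.in")):
        print(line)
```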

We’ve included one .inc file called definitions.inc that will set up patterns for your convenience. Let’s understand what patterns are.

Pattern matching output

The most confusing part of testing is the output for something like log. There are a few reasons why:

  1. The commit SHA will change as you modify your code and hash more things, so you would have to continually modify your test to keep up with the changes to the SHA.
  2. Your date will change every time since time only moves forwards.
  3. It makes the tests very long.

We also don’t really care about the exact text: just that there is some SHA there and something with the right date format. For this reason, our tests use pattern matching.

This is not a concept you will need to understand, but at a high level we define a pattern for some text (e.g. a commit SHA) and then just check that the output has that pattern (without caring about the actual letters and numbers).

Here is how you’d do that for the output of log and check that it matches the pattern:

# First "import" the pattern definitions from our setup
I definitions.inc
# You would add your lines here that create commits with the
# specified messages. We'll omit this for this example.
> log
===
${COMMIT_HEAD}
added wug

===
${COMMIT_HEAD}
initial commit

<<<*

The section we see is the same as a normal Gitlet command, except it ends in <<<* which tells the testing script to use patterns. The patterns are enclosed in ${PATTERN_NAME}.

All the patterns are defined in samples/definitions.inc. You don’t need to understand the actual pattern, just the thing it matches. For example, HEADER matches the header of a commit which should look something like:

commit fc26c386f550fc17a0d4d359d70bae33c47c54b9

That’s just some random commit SHA.

So when we create the expected output for this test, we’ll need to know how many entries are in this log and what the commit messages are.
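To make "matching a pattern" a little more concrete, here is a small Python sketch. The regular expression below is an assumption written for this example (the word commit followed by 40 hexadecimal digits); the real definitions live in samples/definitions.inc and may well differ.

```python
import re

# Hypothetical pattern, NOT copied from definitions.inc:
# the word "commit", one space, then exactly 40 hex digits.
HEADER_SKETCH = r"commit [0-9a-f]{40}"

line = "commit fc26c386f550fc17a0d4d359d70bae33c47c54b9"

# fullmatch requires the whole line to fit the pattern.
print(bool(re.fullmatch(HEADER_SKETCH, line)))            # True
print(bool(re.fullmatch(HEADER_SKETCH, "commit 12345")))  # False
```

Any other 40-digit SHA would match just as well, which is exactly why the test keeps passing when your hashes change.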

You can do similar things for the status command:

I definitions.inc
# Add commands here to setup the status. We'll omit them here.
> status
=== Branches ===
\*master

=== Staged Files ===
g.txt

=== Removed Files ===

=== Modifications Not Staged For Commit ===

=== Untracked Files ===
${ARBLINES}

<<<*

The pattern we used here is ARBLINES, which matches any number of arbitrary lines. If you actually care what is untracked, then you can add that here without the pattern, but perhaps we’re more interested in seeing g.txt staged for addition.

Notice the \* on the branch master. Recall that in the status command, you should prefix the HEAD branch with a *. If you use a pattern, you’ll need to replace this * with a \* in the expected output. The reason is out of the scope of the class, but it is called “escaping” the asterisk. If you don’t use a pattern (i.e. your command ends in <<<, not <<<*), then you can use the * without the \.
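For the curious, the need for the backslash can be seen directly in Python's re module, which is what "Python regular expressions" refers to. This sketch shows general regular-expression behavior; how tester.py reports such a line is not shown here.

```python
import re

line = "*master"

# Escaped: \* means "a literal asterisk".
print(bool(re.fullmatch(r"\*master", line)))  # True

# Unescaped: * means "repeat the previous item", and at the start of a
# pattern there is no previous item, so the pattern is invalid.
try:
    re.compile("*master")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)  # False

# re.escape adds the backslash for you.
print(re.escape(line))  # \*master
```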

The final thing you can do with these patterns is “save” a matched portion. Warning: this seems like magic and we don’t care at all if you understand how this works, just know that it does and it is available to you. You can copy and paste the relevant part from our provided tests so you don’t need to worry too much about making these from scratch. With that out of the way, let’s see what this is.

If you’re doing a checkout command, you need to use the SHA identifier to specify which commit to checkout to/from. But remember we used patterns, so we don’t actually know the SHA identifier at the time of creating the test. That is problematic. We’ll use test04-prev-checkout.in to see how you can “capture” or “save” the SHA:

I definitions.inc
# Each ${COMMIT_HEAD} captures its commit UID.
# Not shown here, but the test sets up the log by making many commits
# with specific messages.
> log
===
${COMMIT_HEAD}
version 2 of wug.txt

===
${COMMIT_HEAD}
version 1 of wug.txt

===
${COMMIT_HEAD}
initial commit

<<<*

This will set up the UID (SHA) to be captured after the log command. So right after this command runs, we can use the D command to store the captured UIDs in variables:

# UID of second version
D UID2 "${1}"
# UID of first version
D UID1 "${2}"

Notice how the numbering is backwards: the numbering begins at 1 and starts at the top of the log. That is why the current version (i.e. second version) is defined as "${1}". We don’t care about the initial commit, so we don’t bother capturing its UID.
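The numbering follows how capturing groups work in regular expressions: groups are numbered from 1 in the order they appear in the pattern, so the topmost log entry is group 1. The Python sketch below demonstrates this with a simplified, hypothetical stand-in for ${COMMIT_HEAD} (no date line) and made-up commit ids; it is not the tester's actual code.

```python
import re
from string import Template

# Hypothetical stand-in for ${COMMIT_HEAD}: one capturing group around
# the commit id. The date line is left out to keep the sketch short.
COMMIT_HEAD = r"commit ([0-9a-f]+)"

expected = "\n".join([
    "===", COMMIT_HEAD, "version 2 of wug.txt", "",
    "===", COMMIT_HEAD, "version 1 of wug.txt", "",
    "===", COMMIT_HEAD, "initial commit", "",
])
output = "\n".join([  # made-up log output with made-up ids
    "===", "commit 2222aaaa", "version 2 of wug.txt", "",
    "===", "commit 1111bbbb", "version 1 of wug.txt", "",
    "===", "commit 0000cccc", "initial commit", "",
])

match = re.fullmatch(expected, output)
# Group 1 is the first (topmost) entry, group 2 the next one down.
uid2, uid1 = match.group(1), match.group(2)
print(uid2)  # 2222aaaa
print(uid1)  # 1111bbbb

# Substituting a saved value into a later command, like ${UID1} in a test.
command = Template("checkout ${UID1} -- wug.txt").substitute(UID1=uid1)
print(command)  # checkout 1111bbbb -- wug.txt
```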

Now we can use that definition to checkout to that captured SHA:

> checkout ${UID1} -- wug.txt
<<<

And now you can make your assertions to ensure the checkout was successful.

Testing conclusion

There are many more complex things you can do with our testing script, but this is enough to write very good tests. You should use our provided tests as an example to get started, and also feel free to discuss on Ed high level ideas of how to test things. You may also share your .in files, but please make sure they’re correct before posting them and add comments so other students and staff can see what is going on.

Debugging Integration Tests

Recall from Lab 6 that debugging integration tests is a bit different with the new setup. The runner.py script will work just as it did for Capers, so you should read through that section in the Lab 6 spec and watch the video linked there. Here we describe strategies to debug:

Finding the right execution to debug

Each test runs your program multiple times, and each one of them has the potential to introduce a bug. The first priority is to identify the right execution of the program that introduces the bug. What we mean by this: imagine you’re failing a test that checks the status command. Say that the output differs by just one file: you say it’s untracked, but the test says it should be staged for addition. This does not mean the status command has a bug. It’s possible that the status command is buggy, but not guaranteed. It could be that your add command didn’t properly persist the fact that a file has been staged for addition! If that is the case, then even with a fully functioning status command, your program would error.

So finding the right (i.e. buggy) execution of the program is very important: how do we do that? You step through every single execution of the program using the runner.py script, and after every execution you look at your temporary directory to make sure everything has been written to a file correctly. This will be harder for serialized objects since, as we know, their contents will be a stream of unintelligible bytes: for serialized objects you can simply check that at the time of serialization they have the correct contents. You may even find that you never serialized it!

Eventually, you’ll find the bug. If you cannot, then that is when you can come to Office Hours or post a Gitbug. Be warned: we can only spend 10 minutes with each student in Office Hours, so if you have a nasty bug that you think would take a TA more than 10 minutes, then you should instead submit a Gitbug with as much information as possible. The better your Gitbug, the better/faster your response will be. Don’t forget to update your design doc: remember we will reject Gitbugs that do not have an up-to-date or complete design document.

Going Remote (Extra Credit)

This project is all about mimicking git’s local features. These are useful because they allow you to backup your own files and maintain multiple versions of them. However, git’s true power is really in its remote features, allowing collaboration with other people over the internet. The point is that both you and your friend could be collaborating on a single code base. If you make changes to the files, you can send them to your friend, and vice versa. And you’ll both have access to a shared history of all the changes either of you have made.

To get extra credit, implement some basic remote commands: namely add-remote, rm-remote, push, fetch, and pull. You will get 64 extra-credit points for completing them. Don’t attempt or plan for extra credit until you have completed the rest of the project.

Depending on how flexibly you have designed the rest of the project, 64 extra-credit points may not be worth the amount of effort it takes to do this section. We’re certainly not expecting everyone to do it. Our priority will be in helping students complete the main project; if you’re doing the extra credit, we expect you to be able to stand on your own a little bit more than most students.

The Commands

A few notes about the remote commands:

So now let’s go over the commands:

add-remote

rm-remote

push

fetch

pull

I. Things to Avoid

There are a few practices that experience has shown will cause you endless grief in the form of programs that don’t work and bugs that are very hard to find and sometimes not repeatable (“Heisenbugs”).

  1. Since you are likely to keep various information in files (such as commits), you might be tempted to use apparently convenient file-system operations (such as listing a directory) to sequence through all of them. Be careful. Methods such as File.list and File.listFiles produce file names in an undefined order. If you use them to implement the log command, in particular, you can get random results.

  2. Windows users especially should beware that the file separator character is / on Unix (or MacOS) and ‘\’ on Windows. So if you form file names in your program by concatenating some directory names and a file name together with explicit /s or \s, you can be sure that it won’t work on one system or the other. Java provides a system-dependent file separator character (System.getProperty("file.separator")), or you can use the multi-argument constructors to File.

  3. Be careful using a HashMap when serializing! The order of things within the HashMap is non-deterministic. The solution is to use a TreeMap, which will always have the same order. More details here.

J. Acknowledgments

Thanks to Alicia Luengo, Josh Hug, Sarah Kim, Austin Chen, Andrew Huang, Yan Zhao, Matthew Chow, especially Alan Yao, Daniel Nguyen, and Armani Ferrante for providing feedback on this project. Thanks to git for being awesome.

This project was largely inspired by this excellent article by Philip Nilsson.

This project was created by Joseph Moghadam. Modifications for Fall 2015, Fall 2017, and Fall 2019 by Paul Hilfinger.