Documentation

Overview

Package git implements derivation of a file graph from git log.

Distance

The distance is derived from the probability that two files appeared in the same commit. The core idea is that relevant files tend to be modified together.

Distance(x, y) = -log(P(y is relevant to x))
P(y is relevant to x) := sum(1/(len(c.files)-1) for c in x.commits if y in c.files) / len(x.commits)

or in English, distance from x to y is -log of the probability that y appears in a commit that touches x and is chosen randomly and independently. There are more nuances to this formula, e.g. file removals are not counted towards len(commit.files), and commits with len(files) = 1 or len(files) > limit are ignored.

Note that this formula penalizes large commits. The more files are modified in a commit, the weaker is the strength of its signal.

This graph defines distance only between files, and not directories.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type EdgeReader

type EdgeReader struct {
	// Reversed indicates that incoming edges must be read instead of outgoing.
	// In other words, read the edges of the tranposed graph.
	Reversed bool
}

    EdgeReader implements filegraph.EdgeReader. It works only with nodes returned by Graph.Node().

    func (*EdgeReader) ReadEdges

    func (r *EdgeReader) ReadEdges(from filegraph.Node, callback func(to filegraph.Node, distance float64) (keepGoing bool))

      ReadEdges implements filegraph.EdgeReader.

      type Graph

      type Graph struct {
      	// Commit is the git commit that the graph state corresponds to.
      	Commit string
      	// contains filtered or unexported fields
      }

        Graph is a file graph based on the git history.

        The graph represents aggregated history of all file changes in the repo, rather than the state of the repo at a single point of time. In particular, the graph may include nodes for files that no longer exist. It is generally not possible to tell if a node is a file or a directory, because it might have been a file at one point of time, and a directory at another.

        TODO(nodir): introduce a decay function to remove old nodes/edges.

        func Load

        func Load(ctx context.Context, repoDir string, opt LoadOptions) (*Graph, error)

          Load returns a file graph for a git repository. Caches the graph under the .git directory. May take minutes and log progress if the cache is cold.

          If the cache exists, but no longer matches the current ref commit, then applies new changes to the loaded graph and updates the cache.

          func (*Graph) Node

          func (g *Graph) Node(name string) filegraph.Node

            Node returns a node by its name. Returns nil if the node is not found. See also Node.Name().

            Idempotent: calling many times with the same name returns the same Node object.

            func (*Graph) Read

            func (g *Graph) Read(r *bufio.Reader) error

              Read reads the graph from r. It is the opposite of (*Graph).Write().

              func (*Graph) Update

              func (g *Graph) Update(ctx context.Context, repoDir, rev string, opt UpdateOptions) error

                Update updates the graph based on changes in a git repository. This is the only way to mutate the Graph. Applies all changes reachable from rev, but not from g.Commit, and updates g.Commit.

                If returns an error which wasn't returned by the callback, then it is possible that the graph is corrupted.

                func (*Graph) Write

                func (g *Graph) Write(w io.Writer) error

                  Write writes the graph to w. It is the opposite of (*Graph).Read().

                  Spec:

                  graph = header version git-commit-hash root total-number-of-edges root-edges
                  header = 54
                  version = 0
                  
                  root = node
                  node = prob-sum-denominator number-of-children children-sorted-by-base-name
                  children-sorted-by-base-name = child*
                  child = base-name node
                  
                  root-edges = node-edges
                  node-edges = number-of-edges edge*
                  edge =
                    index-of-the-adjacent-node-as-found-in-the-file
                    prob-sum
                    edges-of-children-sorted-by-base-name
                  edges-of-children-sorted-by-base-name = edge*
                  
                  where
                   all integer types are encoded as varint
                   all strings are encoded as length-prefixed utf8
                   `*` means "0 or more"
                  

                  type LoadOptions

                  type LoadOptions struct {
                  	UpdateOptions
                  
                  	// Ref is the git ref to load the graph for.
                  	// Defaults to refs/heads/main.
                  	//
                  	// If it is refs/heads/main, but it does not exist, then falls back to
                  	// refs/heads/master.
                  	Ref string
                  }

                    LoadOptions are options for Load() function.

                    type UpdateOptions

                    type UpdateOptions struct {
                    	// Callback, if not nil, is called each time after each commit is processed
                    	// and Graph.Commit is updated.
                    	Callback func() error
                    
                    	// MaxCommitSize is the maximum number of files touched by a commit.
                    	// Commits that exceed this limit are ignored.
                    	// The rationale is that large commits provide a weak signal of file
                    	// relatedness and are expensive to process, O(N^2).
                    	MaxCommitSize int
                    }

                      UpdateOptions are options for Graph.Update().