Documentation

Overview

Package git implements derivation of a file graph from git log and optionally from the file structure.

Change-log-based distance

This distance is derived from the probability that two files appeared in the same commit. The core idea is that relevant files tend to be modified together.

Distance(x, y) = -log(P(x is relevant to y))
P(x is relevant to y) := sum(1/(len(c.files)-1) for c in y.commits if x in c.files) / len(y.commits)

or in English,

- the distance from x to y is -log of the probability that x is relevant to y
- x is relevant to y if x is likely to appear in a commit that touches y,
  where the commit is chosen randomly and independently.

There are more nuances to this formula, e.g. file removals are not counted towards len(commit.files), and commits with len(files) = 1 or len(files) > limit are ignored. File renames are also taken care of.

Note that this formula penalizes large commits. The more files are modified in a commit, the weaker is the strength of its signal.

This graph defines distance only between files, and not directories.

File-structure-based distance

This distance is derived from the file structure. It is the number of edges between two files in the *file tree*, i.e. the number of hops one has to make to navigate from one file to another in the file tree. For example, given the following file stucture:

//
├── foo/
│   ├── bar.cc
│   └── baz.cc
└── qux.cc

The distance between //foo/bar.cc and //foo/baz.cc is 2 because they have a common parent, and the distance between //foo/bar.cc and and //qux.cc is 3 because the path goes through the root.

The file-structured-based distance can compensate for the weakness of the change-log-based distance.

This distance formula is disabled by default, and can be enabled in EdgeReader.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type EdgeReader

type EdgeReader struct {
	// ChangeLogDistanceFactor is the multiplier for distances derived from the
	// git log. If both ChangeLogDistanceFactor and FileStructureDistanceFactor
	// are zero, then ChangeLogDistanceFactor defaults to 1. If zero then
	// change-log-based edges are ignored.
	ChangeLogDistanceFactor float64

	// FileStructureDistanceFactor is the multiplier for distances derived from
	// the file structure. If zero, then such edges are not reported.
	FileStructureDistanceFactor float64
}

    EdgeReader implements filegraph.EdgeReader. It works only with nodes returned by Graph.Node().

    func (*EdgeReader) ReadEdges

    func (r *EdgeReader) ReadEdges(from filegraph.Node, callback func(to filegraph.Node, distance float64) (keepGoing bool))

      ReadEdges implements filegraph.EdgeReader.

      type Graph

      type Graph struct {
      	// Commit is the git commit that the graph state corresponds to.
      	Commit string
      	// contains filtered or unexported fields
      }

        Graph is a file graph based on the git history.

        The graph represents aggregated history of all file changes in the repo, rather than the state of the repo at a single point of time. In particular, the graph may include nodes for files that no longer exist. It is generally not possible to tell if a node is a file or a directory, because it might have been a file at one point of time, and a directory at another.

        TODO(nodir): introduce a decay function to remove old nodes/edges.

        func Load

        func Load(ctx context.Context, repoDir string, opt LoadOptions) (*Graph, error)

          Load returns a file graph for a git repository. Caches the graph under the .git directory. May take minutes and log progress if the cache is cold.

          If the cache exists, but no longer matches the current ref commit, then applies new changes to the loaded graph and updates the cache.

          func (*Graph) Node

          func (g *Graph) Node(name string) filegraph.Node

            Node returns a node by its name. Returns nil if the node is not found or if the name is empty. See also Node.Name().

            Idempotent: calling many times with the same name returns the same Node object.

            func (*Graph) Read

            func (g *Graph) Read(r *bufio.Reader) error

              Read reads the graph from r. It is the opposite of (*Graph).Write().

              func (*Graph) Update

              func (g *Graph) Update(ctx context.Context, repoDir, rev string, opt UpdateOptions) error

                Update updates the graph based on changes in a git repository. This is the only way to mutate the Graph. Applies all changes reachable from rev, but not from g.Commit, and updates g.Commit.

                If returns an error which wasn't returned by the callback, then it is possible that the graph is corrupted.

                func (*Graph) Write

                func (g *Graph) Write(w io.Writer) error

                  Write writes the graph to w. It is the opposite of (*Graph).Read().

                  Spec:

                  graph = header version git-commit-hash root total-number-of-edges root-edges
                  header = 54
                  version = 0
                  
                  root = node
                  node = prob-sum-denominator number-of-children children-sorted-by-base-name
                  children-sorted-by-base-name = child*
                  child = base-name node
                  
                  root-edges = node-edges
                  node-edges = number-of-edges edge*
                  edge =
                    index-of-the-adjacent-node-as-found-in-the-file
                    prob-sum
                    edges-of-children-sorted-by-base-name
                  edges-of-children-sorted-by-base-name = edge*
                  
                  where
                   all integer types are encoded as varint
                   all strings are encoded as length-prefixed utf8
                   `*` means "0 or more"
                  

                  type LoadOptions

                  type LoadOptions struct {
                  	UpdateOptions
                  
                  	// Ref is the git ref to load the graph for.
                  	// Defaults to refs/heads/main.
                  	//
                  	// If it is refs/heads/main, but it does not exist, then falls back to
                  	// refs/heads/master.
                  	Ref string
                  }

                    LoadOptions are options for Load() function.

                    type SelectionStrategy

                    type SelectionStrategy struct {
                    	Graph      *Graph
                    	EdgeReader *EdgeReader
                    
                    	// MaxDistance decides whether a test is to be selected: if it is closer or
                    	// equal than MaxDistance, then it is selected. Otherwise, skipped.
                    	//
                    	// Ignored by SelectEval.
                    	MaxDistance float64
                    
                    	// OnTestNotFound is called when a test file is not found in the filegraph and
                    	// is not among changed files. If nil, then the file name is logged.
                    	//
                    	// Ignored by Select.
                    	OnTestNotFound func(ctx context.Context, tv *evalpb.TestVariant)
                    }

                      SelectionStrategy implements a selection strategy based on a git graph.

                      func (*SelectionStrategy) Select

                      func (s *SelectionStrategy) Select(changedFiles []string, skipFile func(name string) (keepGoing bool))

                        Select calls skipTestFile for each test file that should be skipped. Does not skip files that it does not know about.

                        func (*SelectionStrategy) SelectEval

                        func (s *SelectionStrategy) SelectEval(ctx context.Context, in eval.Input, out *eval.Output) error

                          SelectEval implements eval.Strategy. It can be used to evaluate data quality of the graph. It is a version of Select specifically for evaluation.

                          This function has minimal input validation. It is not designed to be called by the evaluation framework directly. Instead it should be wrapped with another strategy function that does the validation. In particular, this function does not check in.ChangedFiles[i].Repo and does not check for file patterns that must be exempted from RTS.

                          type UpdateOptions

                          type UpdateOptions struct {
                          	// Callback, if not nil, is called each time after each commit is processed
                          	// and Graph.Commit is updated.
                          	Callback func() error
                          
                          	// MaxCommitSize is the maximum number of files touched by a commit.
                          	// Commits that exceed this limit are ignored.
                          	// The rationale is that large commits provide a weak signal of file
                          	// relatedness and are expensive to process, O(N^2).
                          	MaxCommitSize int
                          }

                            UpdateOptions are options for Graph.Update().