index

package
v0.0.0-...-7ffb936
Warning

This package is not in the latest version of its module.

Published: Jun 27, 2025 License: Apache-2.0 Imports: 50 Imported by: 0

Documentation

Overview

Package index contains logic for building Zoekt indexes. NOTE: this package is not considered part of the public API, and it is not recommended to rely on it in external code.

Constants

const FeatureVersion = 12

FeatureVersion is increased if a feature is added that requires reindexing data without changing the format version.

2: Rank field for shards.
3: Rank documents within shards.
4: Dedup file bugfix.
5: Remove max line size limit.
6: Include '#' into the LineFragment template.
7: Record skip reasons in the index.
8: Record source path in the index.
9: Store ctags metadata & bump default max file size.
10: Compound shards; more flexible TOC format.
11: Bloom filters for file names & contents.
12: go-enry for identifying file languages.

const IndexFormatVersion = 16

IndexFormatVersion is a version number. It is increased every time the on-disk index format is changed.

5: subrepositories.
6: remove size prefix for posting varint list.
7: move subrepos into Repository struct.
8: move repoMetaData out of indexMetadata.
9: use bigendian uint64 for trigrams.
10: sections for rune offsets.
11: file ends in rune offsets.
12: 64-bit branchmasks.
13: content checksums.
14: languages.
15: rune based symbol sections.
16: ctags metadata.

const NextIndexFormatVersion = 17

17: compound shard (multi repo)

const ReadMinFeatureVersion = 8

ReadMinFeatureVersion constrains backwards compatibility by refusing to load a file with a FeatureVersion below it.

const (
	ScoreOffset = 10_000_000
)

const WriteMinFeatureVersion = 10

WriteMinFeatureVersion constrains forwards compatibility by emitting files that won't load in zoekt with a FeatureVersion below it.
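
The two compatibility gates can be pictured as a pair of comparisons made when a shard is loaded. The following is a minimal sketch, not the package's actual loading code: the shardVersions struct, its field names, and the checkReadable helper are hypothetical, and the assumption is that a writer records WriteMinFeatureVersion in the shard as the lowest reader version it supports.

```go
package main

import "fmt"

// Values copied from the constants documented on this page.
const (
	FeatureVersion         = 12
	ReadMinFeatureVersion  = 8
	WriteMinFeatureVersion = 10
)

// shardVersions is a hypothetical stand-in for the version fields of a shard.
type shardVersions struct {
	FeatureVersion   int // feature version of the writer
	MinReaderVersion int // lowest reader feature version the writer allows
}

// newShardVersions shows what a writer at the current version would record
// (assumed, not taken from the package source).
func newShardVersions() shardVersions {
	return shardVersions{
		FeatureVersion:   FeatureVersion,
		MinReaderVersion: WriteMinFeatureVersion,
	}
}

// checkReadable applies both gates from the reader's point of view.
func checkReadable(s shardVersions) error {
	if s.FeatureVersion < ReadMinFeatureVersion {
		return fmt.Errorf("shard has feature version %d, want >= %d",
			s.FeatureVersion, ReadMinFeatureVersion)
	}
	if s.MinReaderVersion > FeatureVersion {
		return fmt.Errorf("shard needs reader feature version >= %d, have %d",
			s.MinReaderVersion, FeatureVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkReadable(newShardVersions()))
	fmt.Println(checkReadable(shardVersions{FeatureVersion: 7, MinReaderVersion: 7}))
	fmt.Println(checkReadable(shardVersions{FeatureVersion: 13, MinReaderVersion: 13}))
}
```

In this sketch the first shard loads, the second is rejected as too old, and the third is rejected because it was written for a newer reader.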

Variables

var DefaultDir = filepath.Join(os.Getenv("HOME"), ".zoekt")

var Version string

Version is filled in by the linker.

Functions

func BranchNamesEqual

func BranchNamesEqual(a, b []zoekt.RepositoryBranch) bool

BranchNamesEqual compares the given zoekt.RepositoryBranch slices, and returns true iff both slices specify the same set of branch names in the same order.
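
The comparison described here is order-sensitive and looks only at names. A minimal sketch of that behavior follows; repositoryBranch is a local stand-in for zoekt.RepositoryBranch, and its fields are assumed rather than taken from this page.

```go
package main

import "fmt"

// repositoryBranch is a local stand-in for zoekt.RepositoryBranch (fields assumed).
type repositoryBranch struct {
	Name    string
	Version string
}

// branchNamesEqual reports whether a and b list the same branch names in the
// same order. Versions are deliberately ignored.
func branchNamesEqual(a, b []repositoryBranch) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Name != b[i].Name {
			return false
		}
	}
	return true
}

func main() {
	old := []repositoryBranch{{Name: "main", Version: "abc"}, {Name: "dev", Version: "def"}}
	moved := []repositoryBranch{{Name: "main", Version: "123"}, {Name: "dev", Version: "456"}}
	swapped := []repositoryBranch{{Name: "dev", Version: "def"}, {Name: "main", Version: "abc"}}

	fmt.Println(branchNamesEqual(old, moved))   // same names, different versions
	fmt.Println(branchNamesEqual(old, swapped)) // same names, different order
}
```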

func DetermineFileCategory

func DetermineFileCategory(doc *Document)

func DetermineLanguageIfUnknown

func DetermineLanguageIfUnknown(doc *Document)

func Explode

func Explode(dstDir string, inputShard string) error

Explode takes an input shard and creates one simple shard per repository. It wraps the unexported explode function and takes care of removing the input shard and renaming the temporary shards.

func HostnameBestEffort

func HostnameBestEffort() string

func IndexFilePaths

func IndexFilePaths(p string) ([]string, error)

IndexFilePaths returns all paths for the IndexFile at filepath p that exist on disk. These are p itself and the ".meta" file for p.

Note: if no files exist, this returns an empty slice and a nil error.
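
The lookup described above can be sketched with os.Stat. This is an illustration only, assuming the metadata file is named by appending ".meta" to p; existingIndexPaths is a hypothetical name, not the package's implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// existingIndexPaths returns p and p+".meta", keeping only the paths that
// exist. A missing file is not an error; other stat failures are returned.
func existingIndexPaths(p string) ([]string, error) {
	var paths []string
	for _, candidate := range []string{p, p + ".meta"} {
		_, err := os.Stat(candidate)
		switch {
		case err == nil:
			paths = append(paths, candidate)
		case os.IsNotExist(err):
			// Skip silently: absence is an expected outcome.
		default:
			return nil, err
		}
	}
	return paths, nil
}

func main() {
	dir, err := os.MkdirTemp("", "index-paths-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Create only the shard; no ".meta" file is written.
	shard := filepath.Join(dir, "repo.zoekt")
	if err := os.WriteFile(shard, []byte("shard"), 0o600); err != nil {
		panic(err)
	}

	paths, err := existingIndexPaths(shard)
	fmt.Println(len(paths), err)
}
```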

func JsonMarshalRepoMetaTemp

func JsonMarshalRepoMetaTemp(shardPath string, repositoryMetadata any) (tempPath, finalPath string, err error)

JsonMarshalRepoMetaTemp writes the json encoding of the given repository metadata to a temporary file in the same directory as the given shard path. It returns both the path of the temporary file and the path of the final file that the caller should use.

The caller is responsible for renaming the temporary file to the final file path, or removing the temporary file if it is no longer needed. TODO: Should we stick this in a util package?
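
The division of labor described above (the function writes a temporary file, the caller renames or removes it) is the usual write-then-rename pattern. Below is a minimal sketch under stated assumptions: marshalMetaTemp and writeMeta are hypothetical names, the final path is assumed to be the shard path plus ".meta", and the temporary file naming is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// marshalMetaTemp writes the JSON encoding of meta to a temporary file in the
// same directory as shardPath and returns both paths.
func marshalMetaTemp(shardPath string, meta any) (tempPath, finalPath string, err error) {
	finalPath = shardPath + ".meta" // assumed naming
	b, err := json.Marshal(meta)
	if err != nil {
		return "", "", err
	}
	f, err := os.CreateTemp(filepath.Dir(finalPath), filepath.Base(finalPath)+".*.tmp")
	if err != nil {
		return "", "", err
	}
	tempPath = f.Name()
	if _, werr := f.Write(b); werr != nil {
		f.Close()
		os.Remove(tempPath)
		return "", "", werr
	}
	if cerr := f.Close(); cerr != nil {
		os.Remove(tempPath)
		return "", "", cerr
	}
	return tempPath, finalPath, nil
}

// writeMeta shows the caller's side: rename on success, remove on failure.
func writeMeta(shardPath string, meta any) error {
	tmp, final, err := marshalMetaTemp(shardPath, meta)
	if err != nil {
		return err
	}
	if err := os.Rename(tmp, final); err != nil {
		os.Remove(tmp)
		return err
	}
	return nil
}

// roundTrip writes metadata next to a shard path in a scratch directory and
// reads the final file back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "meta-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	shard := filepath.Join(dir, "repo.zoekt")
	if err := writeMeta(shard, map[string]string{"Name": "example/repo"}); err != nil {
		return "", err
	}
	b, err := os.ReadFile(shard + ".meta")
	return string(b), err
}

func main() {
	fmt.Println(roundTrip())
}
```

Writing the temporary file in the same directory as the final file keeps the rename on one filesystem, which is what lets the rename replace the file in a single step on typical systems.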

func Merge

func Merge(dstDir string, files ...IndexFile) (tmpName, dstName string, _ error)

Merge merges files into a compound shard in dstDir. It returns tmpName and dstName. It is the responsibility of the caller to delete the input shards and rename the temporary compound shard from tmpName to dstName.

func NewSearcher

func NewSearcher(r IndexFile) (zoekt.Searcher, error)

NewSearcher creates a Searcher for a single index file. Search results coming from this searcher are valid only for the lifetime of the Searcher itself, i.e. []byte members should be copied into fresh buffers if the result is to survive closing the shard.

func ParseTemplate

func ParseTemplate(text string) (*template.Template, error)

ParseTemplate will parse the templates for FileURLTemplate, LineFragmentTemplate and CommitURLTemplate.

It makes available the extra function UrlJoinPath.
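
To illustrate how an extra template function of this kind can be wired up, here is a sketch using text/template and url.JoinPath from the standard library. It is not the package's implementation: the use of text/template, the registered spelling "UrlJoinPath" (taken from the wording above), and the Version and Path template fields are all assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"text/template"
)

// parseTemplate parses text with a path-joining helper registered.
// The function name is assumed; check the package source for the real one.
func parseTemplate(text string) (*template.Template, error) {
	return template.New("url").Funcs(template.FuncMap{
		"UrlJoinPath": url.JoinPath,
	}).Parse(text)
}

// render parses text and executes it against data.
func render(text string, data any) (string, error) {
	t, err := parseTemplate(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(
		`{{UrlJoinPath "https://example.com/repo" "blob" .Version .Path}}`,
		map[string]string{"Version": "main", "Path": "cmd/main.go"},
	)
	fmt.Println(out, err)
}
```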

func PrintNgramStats

func PrintNgramStats(r IndexFile) error

PrintNgramStats outputs a list of the form

n_1 trigram_1
n_2 trigram_2
...

where n_i is the length of the postings list of trigram_i stored in r.

func ReadMetadata

func ReadMetadata(inf IndexFile) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)

ReadMetadata returns the metadata of an index shard without reading the index data. The IndexFile is not closed.

func ReadMetadataPath

func ReadMetadataPath(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)

ReadMetadataPath returns the metadata of the index shard at p without reading the index data. ReadMetadataPath is a helper for ReadMetadata which opens the IndexFile at p.

func ReadMetadataPathAlive

func ReadMetadataPathAlive(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)

ReadMetadataPathAlive is like ReadMetadataPath except that it only returns alive repositories.

func SetTombstone

func SetTombstone(shardPath string, repoID uint32) error

SetTombstone idempotently sets a tombstone for the repository with ID repoID in .meta.

func SortAndTruncateFiles

func SortAndTruncateFiles(files []zoekt.FileMatch, opts *zoekt.SearchOptions) []zoekt.FileMatch

SortAndTruncateFiles is a convenience wrapper around SortFiles and DisplayTruncator. Given aggregated files, it sorts them and then truncates them based on the search options.

func SortFiles

func SortFiles(ms []zoekt.FileMatch)

SortFiles sorts file matches in the order we want to present results to users. The order depends on the match score, which includes both query-dependent signals like word overlap, and file-only signals like the file ranks (if file ranks are enabled).

The order is not determined by scores alone: some results are also boosted so that files with novel extensions are presented.

func UnsetTombstone

func UnsetTombstone(shardPath string, repoID uint32) error

UnsetTombstone idempotently removes the tombstone for the repository with ID repoID in .meta.

Types

type Branch

type Branch struct {
	Name    string
	Version string
}

Branch describes a single branch version.

type Builder

type Builder struct {
	// contains filtered or unexported fields
}

Builder manages (parallel) creation of uniformly sized shards. The builder buffers up documents until it collects enough documents and then builds a shard and writes it.
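
The buffering behavior can be pictured with a toy model. The sketch below is sequential and keeps everything in memory, so it is not how Builder works internally; toyBuilder, its fields, and the byte-based threshold are hypothetical and only illustrate "buffer until large enough, then flush a shard".

```go
package main

import "fmt"

// document is a pared-down stand-in for the package's Document type.
type document struct {
	Name    string
	Content []byte
}

// toyBuilder buffers documents and flushes a "shard" once the buffered
// content reaches shardMax bytes.
type toyBuilder struct {
	shardMax int
	buffered []document
	size     int
	shards   [][]string // document names in each flushed shard
}

// Add buffers doc and flushes when the threshold is reached.
func (b *toyBuilder) Add(doc document) {
	b.buffered = append(b.buffered, doc)
	b.size += len(doc.Content)
	if b.size >= b.shardMax {
		b.flush()
	}
}

// flush turns the buffered documents into one shard and resets the buffer.
func (b *toyBuilder) flush() {
	if len(b.buffered) == 0 {
		return
	}
	names := make([]string, 0, len(b.buffered))
	for _, d := range b.buffered {
		names = append(names, d.Name)
	}
	b.shards = append(b.shards, names)
	b.buffered, b.size = nil, 0
}

// Finish flushes whatever is still buffered. Calling it again is a no-op.
func (b *toyBuilder) Finish() {
	b.flush()
}

func main() {
	b := &toyBuilder{shardMax: 10}
	b.Add(document{Name: "a.go", Content: []byte("123456")})
	b.Add(document{Name: "b.go", Content: []byte("123456")}) // reaches 12 bytes, flushes
	b.Add(document{Name: "c.go", Content: []byte("123")})
	b.Finish() // flushes the remainder
	b.Finish() // safe to repeat
	fmt.Println(b.shards)
}
```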

func NewBuilder

func NewBuilder(opts Options) (*Builder, error)

NewBuilder creates a new Builder instance.

func (*Builder) Add

func (b *Builder) Add(doc Document) error

func (*Builder) AddFile

func (b *Builder) AddFile(name string, content []byte) error

AddFile is a convenience wrapper for the Add method.

func (*Builder) CheckMemoryUsage

func (b *Builder) CheckMemoryUsage()

CheckMemoryUsage checks the memory usage of the process and writes a memory profile if the heap usage exceeds the configured threshold. NOTE: this method is expensive and should only be used for debugging.

func (*Builder) Finish

func (b *Builder) Finish() error

Finish creates a last shard from the buffered documents, and clears stale shards from previous runs. Finish should always be called, including in failure cases, to ensure cleanup.

It is safe to call Finish() multiple times.

func (*Builder) MarkFileAsChangedOrRemoved

func (b *Builder) MarkFileAsChangedOrRemoved(path string)

MarkFileAsChangedOrRemoved indicates that the file specified by the given path has been changed or removed since the last indexing job for this repository.

If this build is a delta build, these files will be tombstoned in the older shards for this repository.

type DisplayTruncator

type DisplayTruncator func(before []zoekt.FileMatch) (after []zoekt.FileMatch, hasMore bool)

DisplayTruncator is a stateful function which enforces Document and Match display limits by truncating and mutating before. hasMore is true until the limits are exhausted. Once hasMore is false each subsequent call will return an empty after and hasMore false.

func NewDisplayTruncator

func NewDisplayTruncator(opts *zoekt.SearchOptions) (_ DisplayTruncator, hasLimits bool)

NewDisplayTruncator will return a DisplayTruncator which enforces the limits in opts. If there are no limits to enforce, hasLimits is false and there is no need to call DisplayTruncator.
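
A stateful truncator of this shape is naturally written as a closure over the remaining budget. The sketch below enforces only a document limit, whereas the text above also mentions match limits; fileMatch, displayTruncator, and newDocLimitTruncator are local, hypothetical names.

```go
package main

import "fmt"

// fileMatch is a pared-down stand-in for zoekt.FileMatch.
type fileMatch struct {
	FileName string
}

// displayTruncator mirrors the shape of the DisplayTruncator type.
type displayTruncator func(before []fileMatch) (after []fileMatch, hasMore bool)

// newDocLimitTruncator returns a truncator that lets at most maxDocs files
// through across all calls. hasLimits is false when maxDocs is not positive.
func newDocLimitTruncator(maxDocs int) (_ displayTruncator, hasLimits bool) {
	if maxDocs <= 0 {
		return func(before []fileMatch) ([]fileMatch, bool) {
			return before, true
		}, false
	}
	remaining := maxDocs
	return func(before []fileMatch) ([]fileMatch, bool) {
		if remaining <= 0 {
			return nil, false // limit already exhausted
		}
		if len(before) > remaining {
			before = before[:remaining]
		}
		remaining -= len(before)
		return before, remaining > 0
	}, true
}

func main() {
	truncate, hasLimits := newDocLimitTruncator(3)
	fmt.Println("hasLimits:", hasLimits)

	batches := [][]fileMatch{
		{{FileName: "a.go"}, {FileName: "b.go"}},
		{{FileName: "c.go"}, {FileName: "d.go"}},
		{{FileName: "e.go"}},
	}
	for _, batch := range batches {
		after, hasMore := truncate(batch)
		fmt.Println(len(after), hasMore)
	}
}
```

With a limit of three, the first batch passes whole, the second is cut to one file and exhausts the limit, and the third comes back empty.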

type DocChecker

type DocChecker struct {
	// contains filtered or unexported fields
}

func (*DocChecker) Check

func (t *DocChecker) Check(content []byte, maxTrigramCount int, allowLargeFile bool) SkipReason

Check returns a reason why the given contents are probably not source texts.
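
The kinds of heuristics behind the SkipReason values can be sketched as follows. Every rule here is an assumption made for illustration (a NUL byte means binary, fewer than three bytes is too small, trigrams are counted over bytes rather than runes); the real Check may differ, and SkipReasonTooLarge is not produced by this sketch at all.

```go
package main

import (
	"bytes"
	"fmt"
)

type skipReason int

// Local copies of the SkipReason values listed on this page.
const (
	skipReasonNone skipReason = iota
	skipReasonTooLarge
	skipReasonTooSmall
	skipReasonBinary
	skipReasonTooManyTrigrams
)

const ngramSize = 3

// check classifies content using simplified, assumed rules.
func check(content []byte, maxTrigramCount int, allowLargeFile bool) skipReason {
	if len(content) < ngramSize {
		return skipReasonTooSmall // cannot form a single trigram
	}
	if bytes.IndexByte(content, 0) >= 0 {
		return skipReasonBinary // NUL bytes rarely occur in source text
	}
	if allowLargeFile {
		return skipReasonNone // caller opted out of the trigram limit
	}
	seen := make(map[[ngramSize]byte]struct{})
	for i := 0; i+ngramSize <= len(content); i++ {
		var t [ngramSize]byte
		copy(t[:], content[i:i+ngramSize])
		seen[t] = struct{}{}
		if len(seen) > maxTrigramCount {
			return skipReasonTooManyTrigrams
		}
	}
	return skipReasonNone
}

func main() {
	fmt.Println(check([]byte("ab"), 10, false))
	fmt.Println(check([]byte("a\x00bc"), 10, false))
	fmt.Println(check([]byte("abcdef"), 2, false))
	fmt.Println(check([]byte("abcdef"), 2, true))
}
```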

type Document

type Document struct {
	Name              string
	Content           []byte
	Branches          []string
	SubRepositoryPath string
	Language          string
	Category          FileCategory

	SkipReason SkipReason

	// Document sections for symbols. Offsets should use bytes.
	Symbols         []DocumentSection
	SymbolsMetaData []*zoekt.Symbol
}

Document holds a document (file) to index.

type DocumentSection

type DocumentSection struct {
	Start, End uint32
}

type FileCategory

type FileCategory byte

FileCategory represents the category of a file, as determined by go-enry. It is non-exhaustive but tries to cover the major cases like whether the file is a test, generated, etc.

A file's category is used in search scoring to determine the weight of a file match.

const (
	// FileCategoryMissing is a sentinel value that indicates we never computed the file category during indexing
	// (which means we're reading from an old index version). This value can never be written to the index.
	FileCategoryMissing FileCategory = iota
	FileCategoryDefault
	FileCategoryTest
	FileCategoryVendored
	FileCategoryGenerated
	FileCategoryConfig
	FileCategoryDotFile
	FileCategoryBinary
	FileCategoryDocumentation
)

type HashOptions

type HashOptions struct {
	// contains filtered or unexported fields
}

HashOptions contains only those options in Options whose modification leads to a mismatch IndexState (presumably IndexStateOption) during the next index build.

type IndexFile

type IndexFile interface {
	Read(off uint32, sz uint32) ([]byte, error)
	Size() (uint32, error)
	Close()
	Name() string
}

IndexFile is a file suitable for concurrent read access. For performance reasons, it allows a mmap'd implementation.
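
Because IndexFile is a small interface, an in-memory implementation is easy to write, for example for tests. The sketch below re-declares the interface locally so that it is self-contained; memFile is a hypothetical type and not part of the package.

```go
package main

import "fmt"

// IndexFile is re-declared here with the method set shown above.
type IndexFile interface {
	Read(off uint32, sz uint32) ([]byte, error)
	Size() (uint32, error)
	Close()
	Name() string
}

// memFile serves reads from an immutable byte slice, so concurrent reads
// need no locking.
type memFile struct {
	name string
	data []byte
}

// Compile-time check that memFile satisfies the interface.
var _ IndexFile = (*memFile)(nil)

func (f *memFile) Read(off, sz uint32) ([]byte, error) {
	end := uint64(off) + uint64(sz) // widen to avoid uint32 overflow
	if end > uint64(len(f.data)) {
		return nil, fmt.Errorf("read [%d, %d) out of range for %d bytes", off, end, len(f.data))
	}
	return f.data[off:end], nil
}

func (f *memFile) Size() (uint32, error) { return uint32(len(f.data)), nil }

// Close is a no-op: there is no file handle or mapping to release.
func (f *memFile) Close() {}

func (f *memFile) Name() string { return f.name }

func main() {
	var f IndexFile = &memFile{name: "mem.zoekt", data: []byte("abcdefgh")}
	defer f.Close()

	b, err := f.Read(2, 3)
	fmt.Printf("%q %v\n", b, err)

	_, err = f.Read(5, 10)
	fmt.Println(err)
}
```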

func NewIndexFile

func NewIndexFile(f *os.File) (IndexFile, error)

NewIndexFile returns a new index file. The index file takes ownership of the passed in file, and may close it.

type IndexState

type IndexState string
const (
	IndexStateMissing IndexState = "missing"
	IndexStateCorrupt IndexState = "corrupt"
	IndexStateVersion IndexState = "version-mismatch"
	IndexStateOption  IndexState = "option-mismatch"
	IndexStateMeta    IndexState = "meta-mismatch"
	IndexStateContent IndexState = "content-mismatch"
	IndexStateEqual   IndexState = "equal"
)

type Options

type Options struct {
	// IndexDir is a directory that holds *.zoekt index files.
	IndexDir string

	// SizeMax is the maximum file size
	SizeMax int

	// Parallelism is the maximum number of shards to index in parallel
	Parallelism int

	// ShardMax sets the maximum corpus size for a single shard
	ShardMax int

	// TrigramMax sets the maximum number of distinct trigrams per document.
	TrigramMax int

	// RepositoryDescription holds names and URLs for the repository.
	RepositoryDescription zoekt.Repository

	// SubRepositories is a path => sub repository map.
	SubRepositories map[string]*zoekt.Repository

	// DisableCTags disables the generation of ctags metadata.
	DisableCTags bool

	// CTagsPath is the path to the ctags binary to run, or empty
	// if a valid binary couldn't be found.
	CTagsPath string

	// Same as CTagsPath but for scip-ctags
	ScipCTagsPath string

	// If set, ctags must succeed.
	CTagsMustSucceed bool

	// LargeFiles is a slice of glob patterns, including ** for any number
	// of directories, where matching file paths should be indexed
	// regardless of their size. The full pattern syntax is here:
	// https://github.com/bmatcuk/doublestar/tree/v1#patterns.
	LargeFiles []string

	// IsDelta is true if this run contains only the changed documents since the
	// last run.
	IsDelta bool

	LanguageMap ctags.LanguageMap

	// ShardMerging is true if builder should respect compound shards. This is a
	// Sourcegraph specific option.
	ShardMerging bool

	// HeapProfileTriggerBytes is the heap allocation in bytes that will trigger a memory profile. If 0, no memory profile
	// will be triggered. Note this trigger looks at total heap allocation (which includes both inuse and garbage objects).
	//
	// Profiles will be written to files named `index-memory.prof.n` in the index directory. No more than 10 files are written.
	//
	// Note: heap checking is "best effort", and it's possible for the process to OOM without triggering the heap profile.
	HeapProfileTriggerBytes uint64
	// contains filtered or unexported fields
}

Options sets options for the index building.

func (*Options) Args

func (o *Options) Args() []string

Args generates command line arguments for o. It is the "inverse" of Flags.

func (*Options) FindAllShards

func (o *Options) FindAllShards() []string

func (*Options) FindRepositoryMetadata

func (o *Options) FindRepositoryMetadata() (repository *zoekt.Repository, metadata *zoekt.IndexMetadata, ok bool, err error)

FindRepositoryMetadata returns the index metadata for the repository specified in the options. 'ok' is false if the repository's metadata couldn't be found or if an error occurred.

func (*Options) Flags

func (o *Options) Flags(fs *flag.FlagSet)

Flags adds flags for build options to fs. It is the "inverse" of Args.
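
The "inverse" relationship between Args and Flags means that parsing the output of Args with a flag set populated by Flags should reproduce the original options. A minimal sketch of that round trip follows, using a hypothetical toyOptions type with invented flag names; the real Options has more fields and its own flag names.

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
)

// toyOptions is a hypothetical three-field subset of build options.
type toyOptions struct {
	IndexDir     string
	Parallelism  int
	DisableCTags bool
}

// Flags registers one flag per field on fs (flag names are invented).
func (o *toyOptions) Flags(fs *flag.FlagSet) {
	fs.StringVar(&o.IndexDir, "index", o.IndexDir, "directory for index files")
	fs.IntVar(&o.Parallelism, "parallelism", o.Parallelism, "maximum parallel shards")
	fs.BoolVar(&o.DisableCTags, "disable_ctags", o.DisableCTags, "skip ctags metadata")
}

// Args renders o as command-line arguments that Flags can parse back.
func (o *toyOptions) Args() []string {
	return []string{
		"-index", o.IndexDir,
		"-parallelism", strconv.Itoa(o.Parallelism),
		// Boolean flags need the -name=value form to carry an explicit value.
		"-disable_ctags=" + strconv.FormatBool(o.DisableCTags),
	}
}

// roundTrip converts in to arguments and parses them into a fresh value.
func roundTrip(in toyOptions) (toyOptions, error) {
	var out toyOptions
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	out.Flags(fs)
	err := fs.Parse(in.Args())
	return out, err
}

func main() {
	in := toyOptions{IndexDir: "/tmp/index", Parallelism: 4, DisableCTags: true}
	out, err := roundTrip(in)
	fmt.Println(out == in, err)
}
```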

func (*Options) GetHash

func (o *Options) GetHash() string

func (*Options) HashOptions

func (o *Options) HashOptions() HashOptions

func (*Options) IgnoreSizeMax

func (o *Options) IgnoreSizeMax(name string) bool

IgnoreSizeMax determines whether the max size should be ignored.

func (*Options) IncrementalSkipIndexing

func (o *Options) IncrementalSkipIndexing() bool

IncrementalSkipIndexing returns true if the index present on disk matches the build options.

func (*Options) IndexState

func (o *Options) IndexState() (IndexState, string)

IndexState checks how the index present on disk compares to the build options and returns the IndexState and the name of the first shard.

func (*Options) SetDefaults

func (o *Options) SetDefaults()

SetDefaults sets reasonable default options.

type ShardBuilder

type ShardBuilder struct {

	// IndexTime will be used as the time if non-zero. Otherwise
	// time.Now(). This is useful for doing reproducible builds in tests.
	IndexTime time.Time

	// ID is a sortable, 20-character-long ID.
	ID string
	// contains filtered or unexported fields
}

ShardBuilder builds a single index shard.

func NewShardBuilder

func NewShardBuilder(r *zoekt.Repository) (*ShardBuilder, error)

NewShardBuilder creates a fresh ShardBuilder. The passed in Repository contains repo metadata, and may be set to nil.

func (*ShardBuilder) Add

func (b *ShardBuilder) Add(doc Document) error

Add adds a file which only occurs in certain branches.

func (*ShardBuilder) AddFile

func (b *ShardBuilder) AddFile(name string, content []byte) error

AddFile is a convenience wrapper for Add.

func (*ShardBuilder) ContentSize

func (b *ShardBuilder) ContentSize() uint32

ContentSize returns the number of content bytes so far ingested.

func (*ShardBuilder) NumFiles

func (b *ShardBuilder) NumFiles() int

NumFiles returns the number of files added to this builder.

func (*ShardBuilder) Write

func (b *ShardBuilder) Write(out io.Writer) error

type SkipReason

type SkipReason int
const (
	SkipReasonNone SkipReason = iota
	SkipReasonTooLarge
	SkipReasonTooSmall
	SkipReasonBinary
	SkipReasonTooManyTrigrams
)
