Documentation ¶
Overview ¶
Package index contains logic for building Zoekt indexes. NOTE: this package is not considered part of the public API, and it is not recommended to rely on it in external code.
Index ¶
- Constants
- Variables
- func BranchNamesEqual(a, b []zoekt.RepositoryBranch) bool
- func DetermineFileCategory(doc *Document)
- func DetermineLanguageIfUnknown(doc *Document)
- func Explode(dstDir string, inputShard string) error
- func HostnameBestEffort() string
- func IndexFilePaths(p string) ([]string, error)
- func JsonMarshalRepoMetaTemp(shardPath string, repositoryMetadata any) (tempPath, finalPath string, err error)
- func Merge(dstDir string, files ...IndexFile) (tmpName, dstName string, _ error)
- func NewSearcher(r IndexFile) (zoekt.Searcher, error)
- func ParseTemplate(text string) (*template.Template, error)
- func PrintNgramStats(r IndexFile) error
- func ReadMetadata(inf IndexFile) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
- func ReadMetadataPath(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
- func ReadMetadataPathAlive(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
- func SetTombstone(shardPath string, repoID uint32) error
- func SortAndTruncateFiles(files []zoekt.FileMatch, opts *zoekt.SearchOptions) []zoekt.FileMatch
- func SortFiles(ms []zoekt.FileMatch)
- func UnsetTombstone(shardPath string, repoID uint32) error
- type Branch
- type Builder
- type DisplayTruncator
- type DocChecker
- type Document
- type DocumentSection
- type FileCategory
- type HashOptions
- type IndexFile
- type IndexState
- type Options
- func (o *Options) Args() []string
- func (o *Options) FindAllShards() []string
- func (o *Options) FindRepositoryMetadata() (repository *zoekt.Repository, metadata *zoekt.IndexMetadata, ok bool, ...)
- func (o *Options) Flags(fs *flag.FlagSet)
- func (o *Options) GetHash() string
- func (o *Options) HashOptions() HashOptions
- func (o *Options) IgnoreSizeMax(name string) bool
- func (o *Options) IncrementalSkipIndexing() bool
- func (o *Options) IndexState() (IndexState, string)
- func (o *Options) SetDefaults()
- type ShardBuilder
- type SkipReason
Constants ¶
const FeatureVersion = 12
FeatureVersion is increased if a feature is added that requires reindexing data without changing the format version.
- 2: Rank field for shards.
- 3: Rank documents within shards
- 4: Dedup file bugfix
- 5: Remove max line size limit
- 6: Include '#' into the LineFragment template
- 7: Record skip reasons in the index.
- 8: Record source path in the index.
- 9: Store ctags metadata & bump default max file size
- 10: Compound shards; more flexible TOC format.
- 11: Bloom filters for file names & contents
- 12: go-enry for identifying file languages
const IndexFormatVersion = 16
IndexFormatVersion is a version number. It is increased every time the on-disk index format is changed.
- 5: subrepositories.
- 6: remove size prefix for posting varint list.
- 7: move subrepos into Repository struct.
- 8: move repoMetaData out of indexMetadata
- 9: use bigendian uint64 for trigrams.
- 10: sections for rune offsets.
- 11: file ends in rune offsets.
- 12: 64-bit branchmasks.
- 13: content checksums
- 14: languages
- 15: rune based symbol sections
- 16: ctags metadata
const NextIndexFormatVersion = 17
17: compound shard (multi repo)
const ReadMinFeatureVersion = 8
ReadMinFeatureVersion constrains backwards compatibility by refusing to load a file with a FeatureVersion below it.
const (
ScoreOffset = 10_000_000
)
const WriteMinFeatureVersion = 10
WriteMinFeatureVersion constrains forwards compatibility by emitting files that won't load in zoekt with a FeatureVersion below it.
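The two compatibility bounds can be read as simple comparisons. The following is a minimal sketch, not the package's own code: the constant values are copied from this page, while `canLoad` and `loadableBy` are hypothetical helper names introduced only for illustration.

```go
package main

import "fmt"

// Values copied from the constants documented on this page.
const (
	featureVersion         = 12
	readMinFeatureVersion  = 8
	writeMinFeatureVersion = 10
)

// canLoad is a hypothetical helper: it reports whether a reader with these
// constants would accept a shard written with the given feature version.
func canLoad(shardFeatureVersion int) bool {
	return shardFeatureVersion >= readMinFeatureVersion
}

// loadableBy is a hypothetical helper: it reports whether a reader whose own
// FeatureVersion is readerFeatureVersion could load the shards emitted here.
func loadableBy(readerFeatureVersion int) bool {
	return readerFeatureVersion >= writeMinFeatureVersion
}

func main() {
	fmt.Println(canLoad(7), canLoad(featureVersion)) // false true
	fmt.Println(loadableBy(9), loadableBy(10))       // false true
}
```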
Variables ¶
var DefaultDir = filepath.Join(os.Getenv("HOME"), ".zoekt")
var Version string
Filled by the linker
Functions ¶
func BranchNamesEqual ¶
func BranchNamesEqual(a, b []zoekt.RepositoryBranch) bool
BranchNamesEqual compares the given zoekt.RepositoryBranch slices, and returns true iff both slices specify the same set of branch names in the same order.
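An order-sensitive, name-only comparison like the one described could look roughly as follows. This is a sketch under assumptions: `repositoryBranch` is a local stand-in for zoekt.RepositoryBranch, and `branchNamesEqual` is an illustrative reimplementation rather than the package's code.

```go
package main

import "fmt"

// repositoryBranch is a local stand-in for zoekt.RepositoryBranch. Only Name
// takes part in the comparison; Version is included to show it is ignored.
type repositoryBranch struct {
	Name    string
	Version string
}

// branchNamesEqual reports whether a and b list the same branch names in the
// same order.
func branchNamesEqual(a, b []repositoryBranch) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Name != b[i].Name {
			return false
		}
	}
	return true
}

func main() {
	a := []repositoryBranch{{Name: "main", Version: "abc"}, {Name: "dev", Version: "def"}}
	b := []repositoryBranch{{Name: "main", Version: "123"}, {Name: "dev", Version: "456"}}
	c := []repositoryBranch{{Name: "dev"}, {Name: "main"}}

	fmt.Println(branchNamesEqual(a, b)) // true: versions differ, names match
	fmt.Println(branchNamesEqual(a, c)) // false: same names, different order
}
```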
func DetermineFileCategory ¶
func DetermineFileCategory(doc *Document)
func DetermineLanguageIfUnknown ¶
func DetermineLanguageIfUnknown(doc *Document)
func Explode ¶
Explode takes an input shard and creates 1 simple shard per repository. It is a wrapper around explode that takes care of removing the input shard and renaming the temporary shards.
func HostnameBestEffort ¶
func HostnameBestEffort() string
func IndexFilePaths ¶
IndexFilePaths returns all paths for the IndexFile at filepath p that exist. Note: if no files exist this will return an empty slice and nil error.
The candidate paths are p itself and the ".meta" file for p.
func JsonMarshalRepoMetaTemp ¶
func JsonMarshalRepoMetaTemp(shardPath string, repositoryMetadata any) (tempPath, finalPath string, err error)
JsonMarshalRepoMetaTemp writes the json encoding of the given repository metadata to a temporary file in the same directory as the given shard path. It returns both the path of the temporary file and the path of the final file that the caller should use.
The caller is responsible for renaming the temporary file to the final file path, or removing the temporary file if it is no longer needed. TODO: Should we stick this in a util package?
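The write-to-temp-then-rename contract can be illustrated with the standard library alone. This is a sketch under assumptions: `writeJSONTemp` and `demo` are hypothetical names, and deriving the final path by appending ".meta" to the shard path is an assumption, not something this page states.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONTemp is a hypothetical helper that mirrors the documented
// contract: it writes the JSON encoding of v to a temporary file next to
// shardPath and returns both the temporary and the final path.
func writeJSONTemp(shardPath string, v any) (tempPath, finalPath string, err error) {
	finalPath = shardPath + ".meta" // assumption: metadata lives at <shard>.meta
	data, err := json.Marshal(v)
	if err != nil {
		return "", "", err
	}
	// Creating the file in the same directory keeps the later rename on one
	// filesystem.
	f, err := os.CreateTemp(filepath.Dir(finalPath), filepath.Base(finalPath)+".*.tmp")
	if err != nil {
		return "", "", err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(f.Name())
		return "", "", err
	}
	if err := f.Close(); err != nil {
		os.Remove(f.Name())
		return "", "", err
	}
	return f.Name(), finalPath, nil
}

// demo plays the caller's role: rename on success, remove on failure.
func demo() (finalName, content string, err error) {
	dir, err := os.MkdirTemp("", "shards")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	tmp, final, err := writeJSONTemp(filepath.Join(dir, "repo.zoekt"), map[string]string{"Name": "example"})
	if err != nil {
		return "", "", err
	}
	if err := os.Rename(tmp, final); err != nil {
		os.Remove(tmp)
		return "", "", err
	}
	data, err := os.ReadFile(final)
	return filepath.Base(final), string(data), err
}

func main() {
	fmt.Println(demo())
}
```

Because the temporary file is only renamed once it is fully written, a reader never observes a half-written metadata file at the final path.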
func Merge ¶
Merge files into a compound shard in dstDir. Merge returns tmpName and a dstName. It is the responsibility of the caller to delete the input shards and rename the temporary compound shard from tmpName to dstName.
func NewSearcher ¶
NewSearcher creates a Searcher for a single index file. Search results coming from this searcher are valid only for the lifetime of the Searcher itself, i.e. []byte members should be copied into fresh buffers if the result is to survive closing the shard.
func ParseTemplate ¶
ParseTemplate will parse the templates for FileURLTemplate, LineFragmentTemplate and CommitURLTemplate.
It makes available the extra function UrlJoinPath.
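Registering an extra template function can be sketched with text/template and net/url. This is an illustration only: `parseURLTemplate` and `render` are hypothetical names, the function is registered under the name given above, and the `.Version` and `.Path` fields in the sample template are assumptions about the data passed at execution time.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"text/template"
)

// parseURLTemplate is a hypothetical stand-in for ParseTemplate. It exposes
// url.JoinPath to templates under the name UrlJoinPath.
func parseURLTemplate(text string) (*template.Template, error) {
	return template.New("").Funcs(template.FuncMap{
		"UrlJoinPath": url.JoinPath,
	}).Parse(text)
}

// render parses text and executes it against data.
func render(text string, data any) (string, error) {
	tmpl, err := parseURLTemplate(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// .Version and .Path are assumed field names for this sketch.
	out, err := render(
		`{{UrlJoinPath "https://example.com/repo" "blob" .Version .Path}}`,
		map[string]string{"Version": "main", "Path": "dir/a.go"},
	)
	fmt.Println(out, err) // https://example.com/repo/blob/main/dir/a.go <nil>
}
```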
func PrintNgramStats ¶
PrintNgramStats outputs a list of the form
n_1 trigram_1 n_2 trigram_2 ...
where n_i is the length of the postings list of trigram_i stored in r.
func ReadMetadata ¶
func ReadMetadata(inf IndexFile) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
ReadMetadata returns the metadata of an index shard without reading the index data. The IndexFile is not closed.
func ReadMetadataPath ¶
func ReadMetadataPath(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
ReadMetadataPath returns the metadata of the index shard at p without reading the index data. ReadMetadataPath is a helper for ReadMetadata which opens the IndexFile at p.
func ReadMetadataPathAlive ¶
func ReadMetadataPathAlive(p string) ([]*zoekt.Repository, *zoekt.IndexMetadata, error)
ReadMetadataPathAlive is like ReadMetadataPath except that it only returns alive repositories.
func SetTombstone ¶
SetTombstone idempotently sets a tombstone for repoID in .meta.
func SortAndTruncateFiles ¶
SortAndTruncateFiles is a convenience wrapper around SortFiles and DisplayTruncator. Given the aggregated files, it sorts them and then truncates them based on the search options.
func SortFiles ¶
SortFiles sorts file matches in the order we want to present results to users. The order depends on the match score, which includes both query-dependent signals like word overlap, and file-only signals like the file ranks (if file ranks are enabled).
Scores are not the only signal: some results are also boosted so that files with novel extensions are presented.
func UnsetTombstone ¶
UnsetTombstone idempotently removes a tombstone for repoID in .meta.
Types ¶
type Builder ¶
type Builder struct {
// contains filtered or unexported fields
}
Builder manages (parallel) creation of uniformly sized shards. The builder buffers documents until it has collected enough of them, then builds a shard and writes it.
func NewBuilder ¶
NewBuilder creates a new Builder instance.
func (*Builder) CheckMemoryUsage ¶
func (b *Builder) CheckMemoryUsage()
CheckMemoryUsage checks the memory usage of the process and writes a memory profile if the heap usage exceeds the configured threshold. NOTE: this method is expensive and should only be used for debugging.
func (*Builder) Finish ¶
Finish creates a last shard from the buffered documents, and clears stale shards from previous runs. This should always be called, even in failure cases, to ensure cleanup.
It is safe to call Finish() multiple times.
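The buffering behaviour described for Builder, together with an idempotent Finish, can be sketched as follows. This is a deliberately simplified, single-threaded illustration: `miniBuilder`, its fields, and `buildShards` are hypothetical, and "writing a shard" is reduced to incrementing a counter.

```go
package main

import "fmt"

// miniBuilder is a hypothetical stand-in for Builder. It buffers document
// contents and "writes" a shard once shardMax bytes have accumulated.
type miniBuilder struct {
	shardMax int      // flush once the buffered content reaches this size
	buffered [][]byte // documents waiting to be written
	size     int      // total bytes currently buffered
	shards   int      // number of shards written so far
	finished bool
}

// Add buffers one document and flushes when the buffer is large enough.
func (b *miniBuilder) Add(content []byte) {
	b.buffered = append(b.buffered, content)
	b.size += len(content)
	if b.size >= b.shardMax {
		b.flush()
	}
}

// flush writes the buffered documents as one shard, if there are any.
func (b *miniBuilder) flush() {
	if len(b.buffered) == 0 {
		return
	}
	b.shards++
	b.buffered, b.size = nil, 0
}

// Finish writes the last partial shard. Repeated calls are no-ops.
func (b *miniBuilder) Finish() {
	if b.finished {
		return
	}
	b.finished = true
	b.flush()
}

// buildShards indexes docs and returns how many shards were written.
func buildShards(shardMax int, docs ...string) int {
	b := &miniBuilder{shardMax: shardMax}
	defer b.Finish() // would still run if indexing bailed out early
	for _, d := range docs {
		b.Add([]byte(d))
	}
	b.Finish()
	return b.shards
}

func main() {
	fmt.Println(buildShards(10, "aaaaaa", "bbbbbb", "cc")) // 2
}
```

Deferring Finish and also calling it explicitly is only safe because Finish is idempotent, which is the property the documentation above guarantees.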
func (*Builder) MarkFileAsChangedOrRemoved ¶
MarkFileAsChangedOrRemoved indicates that the file specified by the given path has been changed or removed since the last indexing job for this repository.
If this build is a delta build, these files will be tombstoned in the older shards for this repository.
type DisplayTruncator ¶
DisplayTruncator is a stateful function which enforces Document and Match display limits by truncating and mutating its input (before). hasMore is true until the limits are exhausted. Once hasMore is false, each subsequent call returns an empty result (after) and hasMore false.
func NewDisplayTruncator ¶
func NewDisplayTruncator(opts *zoekt.SearchOptions) (_ DisplayTruncator, hasLimits bool)
NewDisplayTruncator will return a DisplayTruncator which enforces the limits in opts. If there are no limits to enforce, hasLimits is false and there is no need to call DisplayTruncator.
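A stateful truncator of this kind is naturally written as a closure. The sketch below is an assumption-laden illustration: `fileMatch` stands in for zoekt.FileMatch, the `displayTruncator` signature is inferred from the names before, after, and hasMore used above, and only a document-count limit is enforced.

```go
package main

import "fmt"

// fileMatch is a local stand-in for zoekt.FileMatch.
type fileMatch struct{ FileName string }

// displayTruncator is an assumed signature for this sketch.
type displayTruncator func(before []fileMatch) (after []fileMatch, hasMore bool)

// newDocLimitTruncator is a hypothetical constructor that enforces only a
// maximum number of documents. A non-positive maxDocs means "no limit".
func newDocLimitTruncator(maxDocs int) (_ displayTruncator, hasLimits bool) {
	if maxDocs <= 0 {
		return func(before []fileMatch) ([]fileMatch, bool) { return before, true }, false
	}
	remaining := maxDocs // state captured by the closure
	return func(before []fileMatch) ([]fileMatch, bool) {
		if len(before) > remaining {
			before = before[:remaining]
		}
		remaining -= len(before)
		return before, remaining > 0
	}, true
}

func main() {
	trunc, _ := newDocLimitTruncator(3)
	for i := 0; i < 3; i++ {
		after, hasMore := trunc(make([]fileMatch, 2))
		fmt.Println(len(after), hasMore)
	}
	// Prints:
	// 2 true
	// 1 false
	// 0 false
}
```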
type DocChecker ¶
type DocChecker struct {
// contains filtered or unexported fields
}
func (*DocChecker) Check ¶
func (t *DocChecker) Check(content []byte, maxTrigramCount int, allowLargeFile bool) SkipReason
Check returns a reason why the given contents are probably not source texts.
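To give a feel for what such a check might involve, here is a simplified sketch. It is not DocChecker's implementation: the rules (minimum length, NUL byte means binary, distinct byte-trigram budget), their order, and the names `check` and `ngramSize` are all assumptions, and the allowLargeFile parameter and the too-large case are left out.

```go
package main

import (
	"bytes"
	"fmt"
)

type skipReason int

// Local copies of the SkipReason values listed on this page.
const (
	skipReasonNone skipReason = iota
	skipReasonTooLarge
	skipReasonTooSmall
	skipReasonBinary
	skipReasonTooManyTrigrams
)

const ngramSize = 3 // assumption: trigrams are counted over bytes here

// check is a hypothetical, simplified content check.
func check(content []byte, maxTrigramCount int) skipReason {
	if len(content) < ngramSize {
		return skipReasonTooSmall
	}
	if bytes.IndexByte(content, 0) >= 0 {
		return skipReasonBinary // a NUL byte suggests non-text content
	}
	seen := make(map[[ngramSize]byte]struct{})
	for i := 0; i+ngramSize <= len(content); i++ {
		var t [ngramSize]byte
		copy(t[:], content[i:i+ngramSize])
		seen[t] = struct{}{}
		if len(seen) > maxTrigramCount {
			return skipReasonTooManyTrigrams
		}
	}
	return skipReasonNone
}

func main() {
	fmt.Println(check([]byte("package main"), 100))     // 0 (none)
	fmt.Println(check([]byte("ab"), 100))               // 2 (too small)
	fmt.Println(check([]byte{'a', 0, 'b', 'c'}, 100))   // 3 (binary)
	fmt.Println(check([]byte("abcdef"), 2))             // 4 (too many trigrams)
}
```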
type Document ¶
type Document struct {
	Name              string
	Content           []byte
	Branches          []string
	SubRepositoryPath string
	Language          string
	Category          FileCategory
	SkipReason        SkipReason

	// Document sections for symbols. Offsets should use bytes.
	Symbols         []DocumentSection
	SymbolsMetaData []*zoekt.Symbol
}
Document holds a document (file) to index.
type DocumentSection ¶
type DocumentSection struct {
Start, End uint32
}
type FileCategory ¶
type FileCategory byte
FileCategory represents the category of a file, as determined by go-enry. It is non-exhaustive but tries to cover the major cases, like whether the file is a test, generated, etc.
A file's category is used in search scoring to determine the weight of a file match.
const (
	// FileCategoryMissing is a sentinel value that indicates we never computed the file category during indexing
	// (which means we're reading from an old index version). This value can never be written to the index.
	FileCategoryMissing FileCategory = iota
	FileCategoryDefault
	FileCategoryTest
	FileCategoryVendored
	FileCategoryGenerated
	FileCategoryConfig
	FileCategoryDotFile
	FileCategoryBinary
	FileCategoryDocumentation
)
type HashOptions ¶
type HashOptions struct {
// contains filtered or unexported fields
}
HashOptions contains only those options in Options that, when modified, lead to an IndexState of IndexStateMismatch during the next index build.
type IndexFile ¶
type IndexFile interface {
	Read(off uint32, sz uint32) ([]byte, error)
	Size() (uint32, error)
	Close()
	Name() string
}
IndexFile is a file suitable for concurrent read access. For performance reasons, it allows a mmap'd implementation.
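An in-memory implementation is a convenient way to see what the interface asks for. In this sketch `indexFile` is a local copy of the interface shown above and `memFile` is a hypothetical type; like a mmap'd implementation it returns sub-slices of its backing data rather than copies, and it is safe for concurrent reads because the data is never modified after construction.

```go
package main

import "fmt"

// indexFile is a local copy of the IndexFile interface documented above.
type indexFile interface {
	Read(off uint32, sz uint32) ([]byte, error)
	Size() (uint32, error)
	Close()
	Name() string
}

// memFile is a hypothetical in-memory IndexFile.
type memFile struct {
	name string
	data []byte
}

// Read returns the sz bytes starting at off without copying them.
func (f *memFile) Read(off, sz uint32) ([]byte, error) {
	end := uint64(off) + uint64(sz) // widen to avoid uint32 overflow
	if end > uint64(len(f.data)) {
		return nil, fmt.Errorf("read [%d,%d) out of range for %s (size %d)", off, end, f.name, len(f.data))
	}
	return f.data[off:end], nil
}

func (f *memFile) Size() (uint32, error) { return uint32(len(f.data)), nil }
func (f *memFile) Close()                {}
func (f *memFile) Name() string          { return f.name }

func main() {
	var f indexFile = &memFile{name: "demo.zoekt", data: []byte("hello shard")}
	defer f.Close()

	b, err := f.Read(6, 5)
	fmt.Printf("%q %v\n", b, err) // "shard" <nil>

	_, err = f.Read(8, 10)
	fmt.Println(err != nil) // true: the range runs past the end
}
```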
type IndexState ¶
type IndexState string
const (
	IndexStateMissing IndexState = "missing"
	IndexStateCorrupt IndexState = "corrupt"
	IndexStateVersion IndexState = "version-mismatch"
	IndexStateOption  IndexState = "option-mismatch"
	IndexStateMeta    IndexState = "meta-mismatch"
	IndexStateContent IndexState = "content-mismatch"
	IndexStateEqual   IndexState = "equal"
)
type Options ¶
type Options struct {
	// IndexDir is a directory that holds *.zoekt index files.
	IndexDir string

	// SizeMax is the maximum file size
	SizeMax int

	// Parallelism is the maximum number of shards to index in parallel
	Parallelism int

	// ShardMax sets the maximum corpus size for a single shard
	ShardMax int

	// TrigramMax sets the maximum number of distinct trigrams per document.
	TrigramMax int

	// RepositoryDescription holds names and URLs for the repository.
	RepositoryDescription zoekt.Repository

	// SubRepositories is a path => sub repository map.
	SubRepositories map[string]*zoekt.Repository

	// DisableCTags disables the generation of ctags metadata.
	DisableCTags bool

	// CTagsPath is the path to the ctags binary to run, or empty
	// if a valid binary couldn't be found.
	CTagsPath string

	// Same as CTagsPath but for scip-ctags
	ScipCTagsPath string

	// If set, ctags must succeed.
	CTagsMustSucceed bool

	// LargeFiles is a slice of glob patterns, including ** for any number
	// of directories, where matching file paths should be indexed
	// regardless of their size. The full pattern syntax is here:
	// https://github.com/bmatcuk/doublestar/tree/v1#patterns.
	LargeFiles []string

	// IsDelta is true if this run contains only the changed documents since the
	// last run.
	IsDelta bool

	LanguageMap ctags.LanguageMap

	// ShardMerging is true if builder should respect compound shards. This is a
	// Sourcegraph specific option.
	ShardMerging bool

	// HeapProfileTriggerBytes is the heap allocation in bytes that will trigger a memory profile. If 0, no memory profile
	// will be triggered. Note this trigger looks at total heap allocation (which includes both inuse and garbage objects).
	//
	// Profiles will be written to files named `index-memory.prof.n` in the index directory. No more than 10 files are written.
	//
	// Note: heap checking is "best effort", and it's possible for the process to OOM without triggering the heap profile.
	HeapProfileTriggerBytes uint64
	// contains filtered or unexported fields
}
Options sets options for the index building.
func (*Options) FindAllShards ¶
func (*Options) FindRepositoryMetadata ¶
func (o *Options) FindRepositoryMetadata() (repository *zoekt.Repository, metadata *zoekt.IndexMetadata, ok bool, err error)
FindRepositoryMetadata returns the index metadata for the repository specified in the options. 'ok' is false if the repository's metadata couldn't be found or if an error occurred.
func (*Options) HashOptions ¶
func (o *Options) HashOptions() HashOptions
func (*Options) IgnoreSizeMax ¶
IgnoreSizeMax determines whether the max size should be ignored.
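How a size limit and an allow-list of patterns might interact can be sketched as follows. This is an illustration under assumptions: `options`, `ignoreSizeMax`, and `shouldIndex` are hypothetical local names, the `size <= SizeMax` rule is assumed, and the standard library's path.Match is swapped in for the doublestar syntax referenced by LargeFiles, so `**` is not supported here.

```go
package main

import (
	"fmt"
	"path"
)

// options is a hypothetical, reduced stand-in for Options.
type options struct {
	SizeMax    int
	LargeFiles []string
}

// ignoreSizeMax reports whether name matches any LargeFiles pattern.
// Malformed patterns are treated as non-matching in this sketch.
func (o *options) ignoreSizeMax(name string) bool {
	for _, pattern := range o.LargeFiles {
		if ok, err := path.Match(pattern, name); err == nil && ok {
			return true
		}
	}
	return false
}

// shouldIndex applies the size limit unless the file is allow-listed.
func (o *options) shouldIndex(name string, size int) bool {
	return size <= o.SizeMax || o.ignoreSizeMax(name)
}

func main() {
	o := &options{SizeMax: 10, LargeFiles: []string{"*.lock", "data/*.json"}}
	fmt.Println(o.shouldIndex("small.go", 5))       // true: under the limit
	fmt.Println(o.shouldIndex("big.go", 50))        // false: too large, no pattern
	fmt.Println(o.shouldIndex("yarn.lock", 50))     // true: matches *.lock
	fmt.Println(o.shouldIndex("data/big.json", 50)) // true: matches data/*.json
}
```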
func (*Options) IncrementalSkipIndexing ¶
IncrementalSkipIndexing returns true if the index present on disk matches the build options.
func (*Options) IndexState ¶
func (o *Options) IndexState() (IndexState, string)
IndexState checks how the index present on disk compares to the build options and returns the IndexState and the name of the first shard.
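The relationship between the index state and the skip decision can be sketched like this. The constants are local copies of the values listed above; treating "skip" as exactly the equal state is an assumption drawn from the descriptions on this page, and `incrementalSkipIndexing` is an illustrative free function rather than the method itself.

```go
package main

import "fmt"

type indexState string

// Local copies of the IndexState values documented on this page.
const (
	indexStateMissing indexState = "missing"
	indexStateCorrupt indexState = "corrupt"
	indexStateVersion indexState = "version-mismatch"
	indexStateOption  indexState = "option-mismatch"
	indexStateMeta    indexState = "meta-mismatch"
	indexStateContent indexState = "content-mismatch"
	indexStateEqual   indexState = "equal"
)

// incrementalSkipIndexing assumes indexing can be skipped only when the
// on-disk index is reported as equal to the build options.
func incrementalSkipIndexing(state indexState) bool {
	return state == indexStateEqual
}

func main() {
	states := []indexState{
		indexStateMissing, indexStateCorrupt, indexStateVersion,
		indexStateOption, indexStateMeta, indexStateContent, indexStateEqual,
	}
	for _, s := range states {
		fmt.Printf("%-17s skip=%v\n", s, incrementalSkipIndexing(s))
	}
}
```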
func (*Options) SetDefaults ¶
func (o *Options) SetDefaults()
SetDefaults sets reasonable default options.
type ShardBuilder ¶
type ShardBuilder struct {
	// IndexTime will be used as the time if non-zero. Otherwise
	// time.Now(). This is useful for doing reproducible builds in tests.
	IndexTime time.Time

	// a sortable 20 chars long id.
	ID string
	// contains filtered or unexported fields
}
ShardBuilder builds a single index shard.
func NewShardBuilder ¶
func NewShardBuilder(r *zoekt.Repository) (*ShardBuilder, error)
NewShardBuilder creates a fresh ShardBuilder. The passed in Repository contains repo metadata, and may be set to nil.
func (*ShardBuilder) Add ¶
func (b *ShardBuilder) Add(doc Document) error
Add a file which only occurs in certain branches.
func (*ShardBuilder) AddFile ¶
func (b *ShardBuilder) AddFile(name string, content []byte) error
AddFile is a convenience wrapper for Add.
func (*ShardBuilder) ContentSize ¶
func (b *ShardBuilder) ContentSize() uint32
ContentSize returns the number of content bytes so far ingested.
func (*ShardBuilder) NumFiles ¶
func (b *ShardBuilder) NumFiles() int
NumFiles returns the number of files added to this builder.
type SkipReason ¶
type SkipReason int
const (
	SkipReasonNone SkipReason = iota
	SkipReasonTooLarge
	SkipReasonTooSmall
	SkipReasonBinary
	SkipReasonTooManyTrigrams
)