Documentation ¶
Overview ¶
Package github implements the crawler.Crawler interface, retrieving data from the GitHub search API.
Index ¶
- func CloseResponseBody(resp *http.Response)
- func Filename(f string) queryField
- func Filesize(r rangeFormatter) queryField
- func FindFileSize(cache cachedSearch, targetFileCount, lowerBound, upperBound uint64) (uint64, error)
- func FindRangesForRepoSearch(cache cachedSearch, lowerBound, upperBound uint64) ([]string, error)
- func Keyword(k string) queryField
- func NewCrawler(accessToken string, retryCount uint64, client *http.Client, query Query) githubCrawler
- func Path(p string) queryField
- func Repo(r string) queryField
- func User(u string) queryField
- type GhClient
- func (gcl GhClient) Do(query string) (*http.Response, error)
- func (gcl GhClient) ForwardPaginatedQuery(ctx context.Context, query string, output chan<- GhResponseInfo) error
- func (gcl GhClient) GetDefaultBranch(url, repo string, m map[string]string) (string, error)
- func (gcl GhClient) GetFileCreationTime(k GhFileSpec) (time.Time, error)
- func (gcl GhClient) GetFileData(k GhFileSpec) ([]byte, error)
- func (gcl GhClient) GetRawUserContent(query string) (*http.Response, error)
- func (gcl GhClient) GetReposData(query string) (*http.Response, error)
- func (gcl GhClient) SearchGithubAPI(query string) (*http.Response, error)
- type GhFileSpec
- type GhResponseInfo
- type GitRepository
- type Query
- type RangeGreaterThan
- type RangeLessThan
- type RangeQueryResult
- type RangeWithin
- type RequestConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CloseResponseBody ¶
func Filename ¶
func Filename(f string) queryField
Filename takes a filename and formats it according to the Github API.
func Filesize ¶
func Filesize(r rangeFormatter) queryField
Filesize takes a rangeFormatter and formats it according to the Github API.
func FindFileSize ¶
func FindFileSize(cache cachedSearch, targetFileCount, lowerBound, upperBound uint64) (uint64, error)
FindFileSize finds the file size range [lowerBound, return value] that has the largest file count smaller than or equal to githubMaxResultsPerQuery. Note that the returned value could already belong to a previous range if the next file size alone has more than 1000 results. It is left to the caller to handle this case and guarantee forward progression.
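The search described above can be pictured as a binary search over the upper bound. The following is only a minimal sketch, not the package's implementation: `countFunc` and `findUpperBound` are hypothetical names, `countFunc` stands in for the unexported cachedSearch, and the sketch assumes the result count never decreases as the upper bound grows.

```go
package main

import "fmt"

// countFunc is a hypothetical stand-in for the package's cachedSearch:
// it reports how many files have a size in [lower, upper].
type countFunc func(lower, upper uint64) uint64

// findUpperBound returns the largest u in [lower, upper] such that
// count(lower, u) <= target. If even count(lower, lower) exceeds target,
// it returns lower, and the caller has to step past it to make progress.
func findUpperBound(count countFunc, target, lower, upper uint64) uint64 {
	lo, hi := lower, upper
	for lo < hi {
		mid := lo + (hi-lo+1)/2 // round up so the loop always terminates
		if count(lower, mid) <= target {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo
}

func main() {
	// Toy data (assumption): exactly 10 files of every size.
	count := func(lower, upper uint64) uint64 { return (upper - lower + 1) * 10 }
	fmt.Println(findUpperBound(count, 1000, 0, 500)) // 99
}
```

With 10 files per size, sizes 0 through 99 hold exactly 1000 files, so 99 is the largest bound that still fits.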
func FindRangesForRepoSearch ¶
FindRangesForRepoSearch outputs a (possibly incomplete) list of ranges to query in order to find as many search results as the GitHub search API permits. GitHub search only allows 1,000 results per query (paginated). Source: https://developer.github.com/v3/search/
This leaves the possibility of a single file size having more than 1,000 results, in which case the search as it stands could not find all files. If queries are sorted by last indexed and retrieved at regular intervals, this should be sufficient to get most, if not all, documents.
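To illustrate how a size interval might be split under the 1,000-result cap, here is a hedged sketch. It is not the package's algorithm: `splitRanges` is a hypothetical helper, the per-size counts are given up front instead of being queried, and the "start..end" string format is an assumption borrowed from GitHub's size qualifier.

```go
package main

import "fmt"

// maxResults mirrors the 1,000-result cap of a GitHub search query.
const maxResults = 1000

// splitRanges groups consecutive file sizes into ranges whose total result
// count stays within maxResults. counts[i] is the (hypothetical) number of
// files of size i.
func splitRanges(counts []uint64) []string {
	var ranges []string
	start := 0
	for start < len(counts) {
		end := start
		total := counts[start]
		// Grow the range while the next size still fits under the cap.
		for end+1 < len(counts) && total+counts[end+1] <= maxResults {
			end++
			total += counts[end]
		}
		// A single size may exceed the cap on its own. It is still emitted
		// (its results will be incomplete) so that the loop moves forward.
		ranges = append(ranges, fmt.Sprintf("%d..%d", start, end))
		start = end + 1
	}
	return ranges
}

func main() {
	fmt.Println(splitRanges([]uint64{400, 500, 300, 1500, 200, 100}))
	// [0..1 2..2 3..3 4..5]
}
```

Size 3 has 1500 results, more than one query can return, so it ends up alone in a range that can only be partially retrieved.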
func Keyword ¶
func Keyword(k string) queryField
Keyword takes a single word, and formats it according to the Github API.
func NewCrawler ¶
func Path ¶
func Path(p string) queryField
Path takes a filepath and formats it according to the Github API.
Types ¶
type GhClient ¶
type GhClient struct {
	RequestConfig
	// contains filtered or unexported fields
}
func (GhClient) ForwardPaginatedQuery ¶
func (gcl GhClient) ForwardPaginatedQuery(ctx context.Context, query string, output chan<- GhResponseInfo) error
ForwardPaginatedQuery follows the links to the next pages and performs all of the queries for a given search query, relaying the data from each request back to an output channel.
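Following the links to the next pages usually means reading the `Link` response header that GitHub attaches to paginated results. The sketch below shows one way this could look; `nextPage` and `forwardPages` are hypothetical helpers, and relaying raw `*http.Response` values (rather than GhResponseInfo) is a simplification.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"regexp"
)

// nextLinkRe captures the URL of the rel="next" entry in a Link header.
var nextLinkRe = regexp.MustCompile(`<([^>]+)>;\s*rel="next"`)

// nextPage returns the URL of the next page, or "" on the last page.
func nextPage(linkHeader string) string {
	m := nextLinkRe.FindStringSubmatch(linkHeader)
	if m == nil {
		return ""
	}
	return m[1]
}

// forwardPages (hypothetical) requests url and every following page, sending
// each response to out. The receiver is responsible for closing the bodies.
func forwardPages(ctx context.Context, client *http.Client, url string, out chan<- *http.Response) error {
	for url != "" {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		url = nextPage(resp.Header.Get("Link"))
		select {
		case out <- resp:
		case <-ctx.Done():
			resp.Body.Close()
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	header := `<https://api.github.com/search/code?q=x&page=2>; rel="next", ` +
		`<https://api.github.com/search/code?q=x&page=34>; rel="last"`
	fmt.Println(nextPage(header))
}
```

Selecting on `ctx.Done()` while sending keeps the loop from blocking forever if the consumer of the channel has gone away.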
func (GhClient) GetDefaultBranch ¶
GetDefaultBranch gets the default branch of a github repository. m is a map which maps a github repository to its default branch. If repo is already in m, the default branch for url will be obtained from m; otherwise, a query will be made to github to obtain the default branch.
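The lookup-then-query behavior can be sketched as a small cache around a fetch function. This is an illustration only: `defaultBranch` and `fetch` are hypothetical, and storing the fetched value back into the map is an assumption that the description above does not confirm.

```go
package main

import "fmt"

// defaultBranch returns the cached default branch for repo when present and
// otherwise asks fetch (a hypothetical stand-in for a request to GitHub).
func defaultBranch(repo string, cache map[string]string, fetch func(string) (string, error)) (string, error) {
	if branch, ok := cache[repo]; ok {
		return branch, nil
	}
	branch, err := fetch(repo)
	if err != nil {
		return "", err
	}
	cache[repo] = branch // assumption: remember the answer for later calls
	return branch, nil
}

func main() {
	cache := map[string]string{}
	fetch := func(repo string) (string, error) {
		fmt.Println("querying", repo)
		return "main", nil
	}
	for i := 0; i < 2; i++ {
		branch, _ := defaultBranch("owner/repo", cache, fetch)
		fmt.Println(branch)
	}
}
```

Only the first iteration prints "querying owner/repo"; the second is answered from the map.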
func (GhClient) GetFileCreationTime ¶
func (gcl GhClient) GetFileCreationTime(k GhFileSpec) (time.Time, error)
GetFileCreationTime gets the earliest date of a file.
func (GhClient) GetFileData ¶
func (gcl GhClient) GetFileData(k GhFileSpec) ([]byte, error)
GetFileData gets the bytes from a file.
func (GhClient) GetRawUserContent ¶
User content (file contents) is not API rate limited, so there's no use in throttling this call.
func (GhClient) GetReposData ¶
GetReposData performs a search query and handles rate limiting for the '/repos' endpoint as well as timed retries in the case of abuse prevention.
type GhFileSpec ¶
type GhFileSpec struct {
	Path       string        `json:"path,omitempty"`
	Repository GitRepository `json:"repository,omitempty"`
}
type GhResponseInfo ¶
type GitRepository ¶
type Query ¶
type Query []queryField
Example of formatting a query:
QueryWith(
	Filename("kustomization.yaml"),
	Filesize(RangeWithin{64, 192}),
	Keyword("copyright"),
	Keyword("2019"),
).String()
Outputs "q=filename:kustomization.yaml+size:64..192+copyright+2019", which would search for files that have [64, 192] bytes (inclusive range) and that contain the keywords 'copyright' and '2019' somewhere in the file.
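For readers who want to see how such a query string could be assembled, here is a self-contained sketch. The names `field`, `filename`, `sizeWithin`, `keyword` and `format` are hypothetical stand-ins for the package's unexported queryField and its exported constructors; the real implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// field is a hypothetical stand-in for the package's queryField.
type field string

// filename restricts the search to files with the given name.
func filename(f string) field { return field("filename:" + f) }

// sizeWithin restricts the search to files of start..end bytes, inclusive.
func sizeWithin(start, end uint64) field {
	return field(fmt.Sprintf("size:%d..%d", start, end))
}

// keyword adds a bare search term.
func keyword(k string) field { return field(k) }

// format joins the fields with "+" and prefixes the "q=" parameter name.
func format(fields ...field) string {
	parts := make([]string, len(fields))
	for i, f := range fields {
		parts[i] = string(f)
	}
	return "q=" + strings.Join(parts, "+")
}

func main() {
	fmt.Println(format(
		filename("kustomization.yaml"),
		sizeWithin(64, 192),
		keyword("copyright"),
		keyword("2019"),
	))
	// q=filename:kustomization.yaml+size:64..192+copyright+2019
}
```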
type RangeGreaterThan ¶
type RangeGreaterThan struct {
// contains filtered or unexported fields
}
RangeGreaterThan is a range of values strictly greater than (>) size.
func (RangeGreaterThan) RangeString ¶
func (r RangeGreaterThan) RangeString() string
type RangeLessThan ¶
type RangeLessThan struct {
// contains filtered or unexported fields
}
RangeLessThan is a range of values strictly less than (<) size.
func (RangeLessThan) RangeString ¶
func (r RangeLessThan) RangeString() string
type RangeQueryResult ¶
type RangeQueryResult struct {
// contains filtered or unexported fields
}
func (*RangeQueryResult) Add ¶
func (r *RangeQueryResult) Add(other RangeQueryResult)
func (*RangeQueryResult) String ¶
func (r *RangeQueryResult) String() string
type RangeWithin ¶
type RangeWithin struct {
// contains filtered or unexported fields
}
RangeWithin is an inclusive range from start to end.
func RangeSizes ¶
func RangeSizes(s string) RangeWithin
func (RangeWithin) RangeString ¶
func (r RangeWithin) RangeString() string
func (RangeWithin) Size ¶
func (r RangeWithin) Size() uint64
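The three range types all expose a RangeString method, which suggests a shared interface consumed by Filesize. The sketch below shows how that might fit together with GitHub's size qualifier syntax (`>n`, `<n`, `n..m`). All type names, field names and `sizeField` are hypothetical, since the real fields are unexported and the exact output of RangeString is not documented here.

```go
package main

import "fmt"

// rangeFormatter is a hypothetical version of the package's unexported
// interface: anything that can render itself as a size range.
type rangeFormatter interface {
	RangeString() string
}

type greaterThan struct{ size uint64 }   // strictly greater than size
type lessThan struct{ size uint64 }      // strictly less than size
type within struct{ start, end uint64 } // inclusive range

func (r greaterThan) RangeString() string { return fmt.Sprintf(">%d", r.size) }
func (r lessThan) RangeString() string    { return fmt.Sprintf("<%d", r.size) }
func (r within) RangeString() string      { return fmt.Sprintf("%d..%d", r.start, r.end) }

// sizeField renders a range as a GitHub "size:" search qualifier.
func sizeField(r rangeFormatter) string {
	return "size:" + r.RangeString()
}

func main() {
	for _, r := range []rangeFormatter{greaterThan{1024}, lessThan{64}, within{64, 192}} {
		fmt.Println(sizeField(r))
	}
}
```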
type RequestConfig ¶
type RequestConfig struct {
// contains filtered or unexported fields
}
RequestConfig stores common variables that must be present for the queries.
- CodeSearchRequests: asks GitHub to check the code indices given a query.
- ContentsRequests: asks GitHub where to download a resource given a repo and a file path.
- CommitsRequests: asks GitHub to list commits made on a file. Useful to determine the date of a file.
func (RequestConfig) CodeSearchRequestWith ¶
func (rc RequestConfig) CodeSearchRequestWith(query Query) request
CodeSearchRequestWith, given a list of query parameters that specify the (partial) query, returns a request object with the (partial) query. Call the URL method to get the string value of the URL. See request.CopyWith to understand why the request object is useful.
func (RequestConfig) CommitsRequest ¶
func (rc RequestConfig) CommitsRequest(fullRepoName, path string) string
CommitsRequest, given the repo name and a file path, returns a formatted query for the GitHub API to find the commits that affect this file.
func (RequestConfig) ContentsRequest ¶
func (rc RequestConfig) ContentsRequest(fullRepoName, path string) string
ContentsRequest, given the repo name and the file path, returns a formatted query for the GitHub API to find the download information of this file path.
func (RequestConfig) ReposRequest ¶
func (rc RequestConfig) ReposRequest(fullRepoName string) string
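As a rough picture of what the repos, contents and commits requests may resolve to, the sketch below builds URLs for the corresponding GitHub REST endpoints. The helper names `reposURL`, `contentsURL` and `commitsURL` are hypothetical, and the exact strings returned by RequestConfig (including any extra query parameters) are not documented here, so treat this as an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

const apiHost = "api.github.com"

// reposURL points at the repository metadata (which includes the default branch).
func reposURL(fullRepoName string) string {
	u := url.URL{Scheme: "https", Host: apiHost, Path: "/repos/" + fullRepoName}
	return u.String()
}

// contentsURL points at the contents entry for a file in the repository.
func contentsURL(fullRepoName, path string) string {
	u := url.URL{Scheme: "https", Host: apiHost, Path: "/repos/" + fullRepoName + "/contents/" + path}
	return u.String()
}

// commitsURL lists the commits that touched the given file path.
func commitsURL(fullRepoName, path string) string {
	u := url.URL{
		Scheme:   "https",
		Host:     apiHost,
		Path:     "/repos/" + fullRepoName + "/commits",
		RawQuery: url.Values{"path": {path}}.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(reposURL("owner/repo"))
	fmt.Println(contentsURL("owner/repo", "examples/kustomization.yaml"))
	fmt.Println(commitsURL("owner/repo", "examples/kustomization.yaml"))
}
```

Building the URLs through `net/url` rather than string concatenation alone takes care of escaping characters that are not valid in a path or query.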