# Git Provider Receiver
The Git Provider receiver scrapes data from Git vendors. As a starting point, it can infer many of the same core Git metrics across vendors, while also being able to receive additional vendor-specific data. The current default set of metrics common across all vendors can be found in `documentation.md`. These default metrics can be used as leading indicators to the DORA metrics, helping provide insight into modern-day engineering practices.
## GitHub Metrics
The current metrics available via scraping from GitHub are:
- Repository count
- Repository branch time
- Repository branch count
- Repository contributor count
- Repository pull request open time
- Repository pull request time to merge
- Repository pull request time to approval
- Repository pull request count (stores a `pull_request_state` attribute equal to `open` or `merged`)
Note: Some metrics are disabled by default and must be explicitly enabled. For example, the repository contributor count metric is disabled because it relies on the REST API, which is subject to lower rate limits.
## GitLab Metrics
The current metrics available via scraping from GitLab are:
- Repository count
- Repository branch time
- Repository branch count
- Repository contributor count
- Repository pull request time
- Repository pull request merge time
- Repository pull request approval time
- Repository pull request deployment time
## Getting Started
The collection interval is common to all scrapers and is set to 30 seconds by default.
Note: Generally speaking, if the vendor allows for anonymous API calls, then you
won't have to configure any authentication, but you may only see public repositories
and organizations.
```yaml
gitprovider:
  collection_interval: <duration> # default = 30s
  scrapers:
    <scraper1>:
    <scraper2>:
    ...
```
A more complete example using the GitHub & GitLab scrapers with authentication
is as follows:
```yaml
extensions:
  basicauth/github:
    client_auth:
      username: ${env:GH_USER}
      password: ${env:GH_PAT}
  bearertokenauth/gitlab:
    token: ${env:GITLAB_PAT}

receivers:
  gitprovider:
    initial_delay: 1s
    collection_interval: 60s
    scrapers:
      github:
        metrics:
          git.repository.contributor.count:
            enabled: true
        github_org: myfancyorg
        # optional query override, defaults to "{org,user}:<github_org>"
        search_query: "org:myfancyorg topic:o11yalltheway"
        endpoint: "https://selfmanagedenterpriseserver.com"
        auth:
          authenticator: basicauth/github

service:
  extensions: [basicauth/github, bearertokenauth/gitlab]
  pipelines:
    metrics:
      receivers: [..., gitprovider]
      processors: []
      exporters: [...]
```
This receiver is developed upstream in the liatrio-otel-collector distribution, where a quick start with an example config is available.
The available scrapers are:

| Scraper  | Description             |
|----------|-------------------------|
| [github] | Git metrics from GitHub |
| [gitlab] | Git metrics from GitLab |
## Rate Limiting
Because this receiver scrapes data from Git providers, it is subject to rate limiting. The following sections give sensible defaults for each Git provider.
### GitHub
The GitHub scraper within this receiver primarily interacts with GitHub's GraphQL API. The default rate limit for the GraphQL API is 5,000 points per hour (or 10,000 if your PAT is associated with a GitHub Enterprise Cloud organization). On average, the receiver costs 4 points per repository, allowing it to scrape up to 1,250 repositories per hour under normal conditions.
Given this average cost, a good collection interval in seconds is:

$$
\text{collection\_interval (seconds)} = \frac{4n}{r/3600} + 300
$$

$$
\begin{aligned}
\text{where:} \\
n &= \text{number of repositories} \\
r &= \text{hourly rate limit}
\end{aligned}
$$

$r$ is likely 5,000, but there are factors that can change this; for more information see GitHub's docs. The $300$ is a buffer to account for this being a rough estimate and for the initial query that retrieves the list of repositories.
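As a quick sanity check, the formula above can be computed directly. This is a small illustrative sketch (not part of the receiver); the function name is our own:

```python
def collection_interval_seconds(n: int, r: int = 5000) -> float:
    """Estimate a safe collection_interval per the formula above.

    n: number of repositories scraped
    r: hourly GraphQL point limit (5,000 default; 10,000 for Enterprise Cloud)
    """
    # 4n points per scrape, replenished at r/3600 points per second,
    # plus a 300-second buffer for the initial repository-listing query.
    # Algebraically identical to 4n / (r/3600) + 300.
    return (4 * n) * 3600 / r + 300

# e.g. 200 repositories against the default 5,000-point limit:
print(collection_interval_seconds(200))  # 876.0 seconds (~15 minutes)
```

So a fleet of 200 repositories under the default limit suggests a collection interval of roughly 15 minutes.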
In addition to these primary rate limits, GitHub enforces secondary rate limits to prevent abuse and maintain API availability. The following secondary limit is particularly relevant:

- Concurrent Requests Limit: The API allows no more than 100 concurrent requests. This limit is shared across the REST and GraphQL APIs. Since the scraper creates a goroutine per repository, having more than 100 repositories returned by the `search_query` will result in exceeding this limit.

It is recommended to use the `search_query` config option to limit the number of repositories that are scraped. We recommend one instance of the receiver per team (note: `team` is not a valid qualifier when searching repositories; `topic` is). As a reminder, each instance of the receiver should have its own corresponding token for authentication, since rate limits are tied to the token.
In summary, we recommend the following:

- One instance of the receiver per team.
- Each instance of the receiver should have its own token.
- Leverage the `search_query` config option to limit the repositories returned to 100 or fewer per instance.
- `collection_interval` should be long enough to avoid rate limiting (see the formula above); recall these are lagging indicators, so a longer interval is acceptable.
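The per-team pattern above can be sketched as two named receiver instances, each scoped by a repository topic. This is an illustrative fragment: the topic names (`team-a`, `team-b`) and org are hypothetical, and each instance would be paired with its own authentication token as described above.

```yaml
receivers:
  # One receiver instance per team; scope each with a topic-based
  # search_query so no instance returns more than ~100 repositories.
  gitprovider/team-a:
    collection_interval: 15m
    scrapers:
      github:
        github_org: myfancyorg
        search_query: "org:myfancyorg topic:team-a"
  gitprovider/team-b:
    collection_interval: 15m
    scrapers:
      github:
        github_org: myfancyorg
        search_query: "org:myfancyorg topic:team-b"
```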
## Updating tests
After using `make gen`, you may find your tests failing. This could be due to `expected_happy_path.yaml` missing some of the changes from your code, or being out of order.

You can resolve this manually by updating the file, or by regenerating it: uncomment the lines starting with `//golden.WriteMetrics` in your test files and rerun the unit tests. Then comment the lines out again and commit the new changes.