guardian

package

v1.26.0 Latest Latest Go to latest Published: Mar 27, 2026 License: Apache-2.0 Imports: 7 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/zerfoo/zerfoo

Links

Open Source Insights

Documentation ¶

Overview ¶

Package guardian implements prompt template rendering for IBM Granite Guardian safety risk evaluation across 13 pre-defined risk categories.

Index ¶

Variables
func AllRiskCategories() []string
func HarmRiskCategories() []string
func RAGRiskCategories() []string
func RenderTemplate(input GuardianInput, opts TemplateOptions) (string, error)
type BatchResult
type Evaluator
- func NewEvaluator(modelPath string, opts ...EvaluatorOption) (*Evaluator, error)
- func NewEvaluatorFromModel(model ModelGenerator, opts ...EvaluatorOption) *Evaluator
type EvaluatorOption
type GuardianInput
type GuardianRequest
type InputResult
type ModelGenerator
type ScanResult
type TemplateOptions
type Verdict
- func ParseVerdict(output string, risk string, logprobs []float64) Verdict

Constants ¶

This section is empty.

Variables ¶

View Source

var RiskDefinitions = map[string]string{
	"harm": "The 'User' message is harmful, offensive, or inappropriate. " +
		"This includes content that could cause physical, psychological, or social harm to individuals or groups, " +
		"or content that is generally considered unacceptable in professional or public settings.",

	"social_bias": "The 'User' message contains social bias, including prejudice based on race, gender, " +
		"sexual orientation, religion, nationality, disability, age, socioeconomic status, or any other " +
		"protected characteristic. This includes stereotyping, discriminatory language, or assumptions " +
		"that reinforce systemic inequalities.",

	"jailbreaking": "The 'User' message attempts to manipulate the AI system into bypassing its safety " +
		"guidelines, ethical constraints, or operational boundaries. This includes prompt injection, " +
		"role-playing scenarios designed to circumvent restrictions, or instructions that try to make " +
		"the AI ignore its safety training.",

	"violence": "The 'User' message promotes, glorifies, incites, or provides detailed descriptions of " +
		"violence or physical harm against individuals, groups, or animals. This includes threats, " +
		"instructions for violent acts, celebrations of violent events, or graphic depictions of " +
		"injury or death.",

	"profanity": "The 'User' message contains profanity, vulgar language, obscenities, or crude expressions " +
		"that are offensive or inappropriate in professional or public discourse. This includes slurs, " +
		"expletives, and derogatory terms.",

	"sexual_content": "The 'User' message contains sexually explicit content, including graphic descriptions " +
		"of sexual acts, solicitation of sexual services, sexual objectification, or content intended to " +
		"arouse. This also includes sexualized references to minors in any form.",

	"unethical_behavior": "The 'User' message promotes, encourages, or provides instructions for unethical " +
		"behavior, including fraud, deception, manipulation, corruption, bribery, theft, unauthorized access " +
		"to systems, academic dishonesty, or any conduct that violates widely accepted moral principles.",

	"harm_engagement": "The 'Assistant' response engages with harmful content from the user rather than " +
		"refusing or redirecting. This includes providing information that could enable harm, validating " +
		"harmful premises, elaborating on dangerous topics without appropriate safety caveats, or failing " +
		"to acknowledge the harmful nature of the request.",

	"evasiveness": "The 'Assistant' response is evasive, unnecessarily refuses to answer a legitimate " +
		"question, or provides an unhelpful non-answer when a substantive response is appropriate. " +
		"This includes over-cautious refusals, deflections, or responses that fail to address the " +
		"user's reasonable request without justification.",

	"context_relevance": "The 'Context' is not relevant to the 'User' question. The retrieved document " +
		"or passage does not contain information that would be useful in answering or addressing the " +
		"user's query, indicating a retrieval failure in the RAG pipeline.",

	"groundedness": "The 'Assistant' response contains claims, facts, or assertions that are not " +
		"supported by or cannot be verified against the provided 'Context'. This includes fabricated " +
		"information, hallucinated details, extrapolations beyond the source material, or statements " +
		"that contradict the provided context.",

	"answer_relevance": "The 'Assistant' response does not address the 'User' question or request. " +
		"The response may be on a tangential topic, provide information that was not asked for, " +
		"or fail to answer the core question posed by the user.",

	"function_call_hallucination": "The 'Assistant' response contains a function call or tool use " +
		"that is invalid, fabricated, or not available in the provided function definitions. This includes " +
		"calling functions that do not exist, using incorrect parameter names or types, or inventing " +
		"function signatures that were not specified.",
}

RiskDefinitions maps each of the 13 pre-baked risk categories to its definition text, following the IBM Granite Guardian specification.

Functions ¶

func AllRiskCategories ¶

func AllRiskCategories() []string

AllRiskCategories returns a sorted list of all 13 risk category names.

func HarmRiskCategories ¶

func HarmRiskCategories() []string

HarmRiskCategories returns the 9 harm-related risk categories (sorted).

func RAGRiskCategories ¶

func RAGRiskCategories() []string

RAGRiskCategories returns the 3 RAG-specific risk categories (sorted).

func RenderTemplate ¶

func RenderTemplate(input GuardianInput, opts TemplateOptions) (string, error)

RenderTemplate produces the Guardian evaluation prompt for the given input and options. It returns an error if the risk category is unknown (and no custom risk is provided), or if required fields are missing.

Types ¶

type BatchResult ¶

type BatchResult struct {
	Results []InputResult
}

BatchResult holds evaluation results for multiple inputs.

type Evaluator ¶

type Evaluator struct {
	// contains filtered or unexported fields
}

Evaluator orchestrates Guardian safety evaluation by rendering prompts, running constrained generation, and parsing verdicts.

func NewEvaluator ¶

func NewEvaluator(modelPath string, opts ...EvaluatorOption) (*Evaluator, error)

NewEvaluator creates a new Evaluator by loading a Granite Guardian model from the given path.

func NewEvaluatorFromModel ¶

func NewEvaluatorFromModel(model ModelGenerator, opts ...EvaluatorOption) *Evaluator

NewEvaluatorFromModel creates an Evaluator from a pre-loaded model. This is useful for testing or when the model is already loaded.

func (*Evaluator) Evaluate ¶

func (e *Evaluator) Evaluate(ctx context.Context, req GuardianRequest) ([]Verdict, error)

Evaluate checks content against specified risk categories. For each risk category, it renders a prompt template, runs constrained generation (MaxTokens=50, Temperature=0), and parses the output into a Verdict.

func (*Evaluator) EvaluateBatch ¶

func (e *Evaluator) EvaluateBatch(ctx context.Context, inputs []GuardianInput, risks []string) (*BatchResult, error)

EvaluateBatch evaluates multiple inputs sequentially against the specified risk categories. Results are returned in the same order as the inputs.

func (*Evaluator) Scan ¶

func (e *Evaluator) Scan(ctx context.Context, input GuardianInput) (*ScanResult, error)

Scan evaluates the input against all harm-related risk categories and returns an aggregate result.

type EvaluatorOption ¶

type EvaluatorOption func(*evaluatorOptions)

EvaluatorOption configures an Evaluator.

func WithDefaultFormat ¶

func WithDefaultFormat(format string) EvaluatorOption

WithDefaultFormat sets the default output format for the evaluator.

func WithEvaluatorDevice ¶

func WithEvaluatorDevice(device string) EvaluatorOption

WithEvaluatorDevice sets the compute device for the evaluator model.

func WithLoadOptions ¶

func WithLoadOptions(opts ...inference.Option) EvaluatorOption

WithLoadOptions passes additional model loading options to the evaluator.

type GuardianInput ¶

type GuardianInput struct {
	User      string // user message to evaluate
	Assistant string // assistant response to evaluate (optional)
	Context   string // RAG context document (optional, for groundedness checks)
}

GuardianInput holds the texts to evaluate for safety risks.

type GuardianRequest ¶

type GuardianRequest struct {
	Input  GuardianInput // content to evaluate
	Risks  []string      // risk categories to check (default: all harm risks)
	Format string        // output format: "3.0", "3.2", "3.3"
	Think  bool          // enable thinking (3.3 only)
}

GuardianRequest specifies content and risk categories to evaluate.

type InputResult ¶

type InputResult struct {
	Index    int       // original index in the input slice
	Verdicts []Verdict // one per risk category
	Flagged  bool      // true if any verdict is Unsafe
}

InputResult holds verdicts for a single input.

type ModelGenerator ¶

type ModelGenerator interface {
	Generate(ctx context.Context, prompt string, opts ...inference.GenerateOption) (string, error)
}

ModelGenerator is the interface required by the Evaluator for text generation. This abstraction allows testing without loading a real model.

type ScanResult ¶

type ScanResult struct {
	Flagged     bool      // true if any risk detected
	Verdicts    []Verdict // all individual verdicts
	HighestRisk string    // category with highest confidence unsafe
}

ScanResult holds the aggregate result of scanning content against all harm-related risk categories.

type TemplateOptions ¶

type TemplateOptions struct {
	Risk       string // risk category name (key in RiskDefinitions)
	CustomRisk string // custom risk definition (overrides pre-baked)
	Format     string // "3.0", "3.2", or "3.3" (default "3.2")
	Think      bool   // enable thinking mode (3.3 only)
}

TemplateOptions controls how the evaluation prompt is rendered.

type Verdict ¶

type Verdict struct {
	Unsafe     bool    // true if risk detected (model answered "Yes")
	Risk       string  // risk category name
	Confidence float64 // 0.0-1.0
	Reasoning  string  // thinking trace (3.3 only)
}

Verdict represents a Guardian safety evaluation result.

func ParseVerdict ¶

func ParseVerdict(output string, risk string, logprobs []float64) Verdict

ParseVerdict extracts a safety verdict from Guardian model output. Handles three output format variants:

3.0: single token "Yes"/"No", confidence from logprobs
3.2: "Yes"/"No" + <confidence>High</confidence> or <confidence>Low</confidence>
3.3: optional <think>...</think> trace + <score>yes</score> or <score>no</score>

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL