guardian

package
v1.26.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 27, 2026 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Package guardian implements prompt template rendering for IBM Granite Guardian safety risk evaluation across 13 pre-defined risk categories.

Index

Constants

This section is empty.

Variables

View Source
var RiskDefinitions = map[string]string{
	"harm": "The 'User' message is harmful, offensive, or inappropriate. " +
		"This includes content that could cause physical, psychological, or social harm to individuals or groups, " +
		"or content that is generally considered unacceptable in professional or public settings.",

	"social_bias": "The 'User' message contains social bias, including prejudice based on race, gender, " +
		"sexual orientation, religion, nationality, disability, age, socioeconomic status, or any other " +
		"protected characteristic. This includes stereotyping, discriminatory language, or assumptions " +
		"that reinforce systemic inequalities.",

	"jailbreaking": "The 'User' message attempts to manipulate the AI system into bypassing its safety " +
		"guidelines, ethical constraints, or operational boundaries. This includes prompt injection, " +
		"role-playing scenarios designed to circumvent restrictions, or instructions that try to make " +
		"the AI ignore its safety training.",

	"violence": "The 'User' message promotes, glorifies, incites, or provides detailed descriptions of " +
		"violence or physical harm against individuals, groups, or animals. This includes threats, " +
		"instructions for violent acts, celebrations of violent events, or graphic depictions of " +
		"injury or death.",

	"profanity": "The 'User' message contains profanity, vulgar language, obscenities, or crude expressions " +
		"that are offensive or inappropriate in professional or public discourse. This includes slurs, " +
		"expletives, and derogatory terms.",

	"sexual_content": "The 'User' message contains sexually explicit content, including graphic descriptions " +
		"of sexual acts, solicitation of sexual services, sexual objectification, or content intended to " +
		"arouse. This also includes sexualized references to minors in any form.",

	"unethical_behavior": "The 'User' message promotes, encourages, or provides instructions for unethical " +
		"behavior, including fraud, deception, manipulation, corruption, bribery, theft, unauthorized access " +
		"to systems, academic dishonesty, or any conduct that violates widely accepted moral principles.",

	"harm_engagement": "The 'Assistant' response engages with harmful content from the user rather than " +
		"refusing or redirecting. This includes providing information that could enable harm, validating " +
		"harmful premises, elaborating on dangerous topics without appropriate safety caveats, or failing " +
		"to acknowledge the harmful nature of the request.",

	"evasiveness": "The 'Assistant' response is evasive, unnecessarily refuses to answer a legitimate " +
		"question, or provides an unhelpful non-answer when a substantive response is appropriate. " +
		"This includes over-cautious refusals, deflections, or responses that fail to address the " +
		"user's reasonable request without justification.",

	"context_relevance": "The 'Context' is not relevant to the 'User' question. The retrieved document " +
		"or passage does not contain information that would be useful in answering or addressing the " +
		"user's query, indicating a retrieval failure in the RAG pipeline.",

	"groundedness": "The 'Assistant' response contains claims, facts, or assertions that are not " +
		"supported by or cannot be verified against the provided 'Context'. This includes fabricated " +
		"information, hallucinated details, extrapolations beyond the source material, or statements " +
		"that contradict the provided context.",

	"answer_relevance": "The 'Assistant' response does not address the 'User' question or request. " +
		"The response may be on a tangential topic, provide information that was not asked for, " +
		"or fail to answer the core question posed by the user.",

	"function_call_hallucination": "The 'Assistant' response contains a function call or tool use " +
		"that is invalid, fabricated, or not available in the provided function definitions. This includes " +
		"calling functions that do not exist, using incorrect parameter names or types, or inventing " +
		"function signatures that were not specified.",
}

RiskDefinitions maps each of the 13 pre-baked risk categories to its definition text, following the IBM Granite Guardian specification.

Functions

func AllRiskCategories

func AllRiskCategories() []string

AllRiskCategories returns a sorted list of all 13 risk category names.

func HarmRiskCategories

func HarmRiskCategories() []string

HarmRiskCategories returns the 9 harm-related risk categories (sorted).

func RAGRiskCategories

func RAGRiskCategories() []string

RAGRiskCategories returns the 3 RAG-specific risk categories (sorted).

func RenderTemplate

func RenderTemplate(input GuardianInput, opts TemplateOptions) (string, error)

RenderTemplate produces the Guardian evaluation prompt for the given input and options. It returns an error if the risk category is unknown (and no custom risk is provided), or if required fields are missing.

Types

type BatchResult

type BatchResult struct {
	Results []InputResult
}

BatchResult holds evaluation results for multiple inputs.

type Evaluator

type Evaluator struct {
	// contains filtered or unexported fields
}

Evaluator orchestrates Guardian safety evaluation by rendering prompts, running constrained generation, and parsing verdicts.

func NewEvaluator

func NewEvaluator(modelPath string, opts ...EvaluatorOption) (*Evaluator, error)

NewEvaluator creates a new Evaluator by loading a Granite Guardian model from the given path.

func NewEvaluatorFromModel

func NewEvaluatorFromModel(model ModelGenerator, opts ...EvaluatorOption) *Evaluator

NewEvaluatorFromModel creates an Evaluator from a pre-loaded model. This is useful for testing or when the model is already loaded.

func (*Evaluator) Evaluate

func (e *Evaluator) Evaluate(ctx context.Context, req GuardianRequest) ([]Verdict, error)

Evaluate checks content against specified risk categories. For each risk category, it renders a prompt template, runs constrained generation (MaxTokens=50, Temperature=0), and parses the output into a Verdict.

func (*Evaluator) EvaluateBatch

func (e *Evaluator) EvaluateBatch(ctx context.Context, inputs []GuardianInput, risks []string) (*BatchResult, error)

EvaluateBatch evaluates multiple inputs sequentially against the specified risk categories. Results are returned in the same order as the inputs.

func (*Evaluator) Scan

func (e *Evaluator) Scan(ctx context.Context, input GuardianInput) (*ScanResult, error)

Scan evaluates the input against all harm-related risk categories and returns an aggregate result.

type EvaluatorOption

type EvaluatorOption func(*evaluatorOptions)

EvaluatorOption configures an Evaluator.

func WithDefaultFormat

func WithDefaultFormat(format string) EvaluatorOption

WithDefaultFormat sets the default output format for the evaluator.

func WithEvaluatorDevice

func WithEvaluatorDevice(device string) EvaluatorOption

WithEvaluatorDevice sets the compute device for the evaluator model.

func WithLoadOptions

func WithLoadOptions(opts ...inference.Option) EvaluatorOption

WithLoadOptions passes additional model loading options to the evaluator.

type GuardianInput

type GuardianInput struct {
	User      string // user message to evaluate
	Assistant string // assistant response to evaluate (optional)
	Context   string // RAG context document (optional, for groundedness checks)
}

GuardianInput holds the texts to evaluate for safety risks.

type GuardianRequest

type GuardianRequest struct {
	Input  GuardianInput // content to evaluate
	Risks  []string      // risk categories to check (default: all harm risks)
	Format string        // output format: "3.0", "3.2", "3.3"
	Think  bool          // enable thinking (3.3 only)
}

GuardianRequest specifies content and risk categories to evaluate.

type InputResult

type InputResult struct {
	Index    int       // original index in the input slice
	Verdicts []Verdict // one per risk category
	Flagged  bool      // true if any verdict is Unsafe
}

InputResult holds verdicts for a single input.

type ModelGenerator

type ModelGenerator interface {
	Generate(ctx context.Context, prompt string, opts ...inference.GenerateOption) (string, error)
}

ModelGenerator is the interface required by the Evaluator for text generation. This abstraction allows testing without loading a real model.

type ScanResult

type ScanResult struct {
	Flagged     bool      // true if any risk detected
	Verdicts    []Verdict // all individual verdicts
	HighestRisk string    // category with highest confidence unsafe
}

ScanResult holds the aggregate result of scanning content against all harm-related risk categories.

type TemplateOptions

type TemplateOptions struct {
	Risk       string // risk category name (key in RiskDefinitions)
	CustomRisk string // custom risk definition (overrides pre-baked)
	Format     string // "3.0", "3.2", or "3.3" (default "3.2")
	Think      bool   // enable thinking mode (3.3 only)
}

TemplateOptions controls how the evaluation prompt is rendered.

type Verdict

type Verdict struct {
	Unsafe     bool    // true if risk detected (model answered "Yes")
	Risk       string  // risk category name
	Confidence float64 // 0.0-1.0
	Reasoning  string  // thinking trace (3.3 only)
}

Verdict represents a Guardian safety evaluation result.

func ParseVerdict

func ParseVerdict(output string, risk string, logprobs []float64) Verdict

ParseVerdict extracts a safety verdict from Guardian model output. Handles three output format variants:

  • 3.0: single token "Yes"/"No", confidence from logprobs
  • 3.2: "Yes"/"No" + <confidence>High</confidence> or <confidence>Low</confidence>
  • 3.3: optional <think>...</think> trace + <score>yes</score> or <score>no</score>

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL