piiextractor

v0.2.0 · Published: Aug 3, 2025 · License: MIT

Repository: github.com/intMeric/pii-extractor

README ¶

PII Extractor

A high-performance Go library for extracting and identifying Personally Identifiable Information (PII) in text. It covers 10 countries and regions with native-script patterns, merges duplicate matches automatically, and, per the v0.1.0 benchmarks, processes large documents about 2.4x faster than earlier releases.

🚀 NEW in v0.2.0: Updated PII context extraction (10 words before and after each PII match)

🚀 Features

Multi-Country Support

United States: Phone numbers, SSNs, ZIP codes, street addresses, P.O. boxes
United Kingdom: Postal codes (SW1A 1AA), street addresses (221B Baker Street)
France: Metropolitan and DOM-TOM postal codes (75001, 97110), street addresses
Spain: Mainland and island postal codes (28013, 35001), street addresses
Italy: All valid postal codes (00186, 20100), street addresses
Germany: Phone numbers (+49 30 12345678), postal codes (10115), street addresses (Münchner Straße 15)
China: Phone numbers (+86 138 0013 8000), postal codes (100000), street addresses (北京市朝阳区建国门外大街 1 号)
India: Phone numbers (+91 98765 43210), postal codes (110001), street addresses (123 MG Road)
Arabic Countries: Phone numbers (+966 50 123 4567), postal codes (12345), street addresses (شارع الملك فهد)
Russia: Phone numbers (+7 495 123-45-67), postal codes (101000), street addresses (ул. Тверская, д. 13)

Comprehensive PII Detection

Contact Information: Email addresses, phone numbers
Government IDs: Social Security Numbers (US)
Addresses: Street addresses, postal/ZIP codes, P.O. boxes
Financial: Credit card numbers (Visa, MasterCard, generic), IBAN numbers
Digital: IP addresses (IPv4/IPv6), Bitcoin addresses
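
A pattern match alone cannot tell a well-formed IBAN from a random alphanumeric string. As an illustration, here is a minimal sketch of the standard mod-97 checksum that is commonly applied to IBAN candidates. The README does not say whether the library performs this check; `validIBAN` is a hypothetical helper and not part of the package API.

```go
package main

import (
	"fmt"
	"strings"
)

// validIBAN is a hypothetical helper (not part of pii-extractor).
// It applies the mod-97 check: move the first four characters to the end,
// map letters to numbers (A=10 ... Z=35), and require the resulting
// integer to leave a remainder of 1 when divided by 97.
func validIBAN(s string) bool {
	s = strings.ToUpper(strings.ReplaceAll(s, " ", ""))
	if len(s) < 15 || len(s) > 34 {
		return false
	}
	rearranged := s[4:] + s[:4]
	rem := 0
	for _, r := range rearranged {
		switch {
		case r >= '0' && r <= '9':
			rem = (rem*10 + int(r-'0')) % 97
		case r >= 'A' && r <= 'Z':
			// A letter expands to two decimal digits, hence the factor 100.
			rem = (rem*100 + int(r-'A') + 10) % 97
		default:
			return false
		}
	}
	return rem == 1
}

func main() {
	fmt.Println(validIBAN("GB82 WEST 1234 5698 7654 32")) // example value from the PII Types table
	fmt.Println(validIBAN("GB82WEST12345698765433"))      // last digit altered
}
```

The running remainder keeps every intermediate value small, so no big-integer arithmetic is needed.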

Advanced Features

🌍 Global Coverage: 10 countries with native language support and Unicode handling
⚡ High Performance: 2.4x faster processing with parallel extraction and optimized algorithms
🔍 Smart Deduplication: Automatically merges duplicate entities and consolidates contexts
📍 Context Extraction: Captures the surrounding sentence or a window of words before/after each match (10 words on each side as of v0.2.0)
🚀 Parallel Processing: Automatic worker pools for large documents (>10KB)
💾 Memory Optimized: Pre-allocated data structures and efficient context caching
🎯 High Accuracy: Improved regex patterns to minimize false positives
🤖 LLM Validation: Optional validation using OpenAI, Anthropic, Gemini, Mistral, or Ollama
🛡️ Type-Safe API: Full Go type safety with convenient value objects
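
To make the "Smart Deduplication" item concrete, the following is a minimal sketch of one way to merge repeated matches into a single entity with an occurrence count and a list of distinct contexts. The `match` and `entity` types and the `dedupe` function are assumptions made for illustration; the library's internal representation may differ.

```go
package main

import "fmt"

// match is a hypothetical raw hit: the matched value plus the text around it.
type match struct {
	Value   string
	Context string
}

// entity is a hypothetical merged record for one distinct value.
type entity struct {
	Value    string
	Count    int
	Contexts []string
}

// dedupe merges matches that share a value, counting occurrences and
// keeping each distinct context once, in first-seen order.
func dedupe(matches []match) []entity {
	index := make(map[string]int, len(matches)) // value -> position in out
	out := make([]entity, 0, len(matches))
	for _, m := range matches {
		i, seen := index[m.Value]
		if !seen {
			i = len(out)
			index[m.Value] = i
			out = append(out, entity{Value: m.Value})
		}
		out[i].Count++
		if !contains(out[i].Contexts, m.Context) {
			out[i].Contexts = append(out[i].Contexts, m.Context)
		}
	}
	return out
}

// contains reports whether s is already present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func main() {
	merged := dedupe([]match{
		{"john@example.com", "reach me at john@example.com"},
		{"(555) 123-4567", "call (555) 123-4567"},
		{"john@example.com", "cc john@example.com please"},
		{"john@example.com", "reach me at john@example.com"},
	})
	for _, e := range merged {
		fmt.Printf("%s count=%d contexts=%d\n", e.Value, e.Count, len(e.Contexts))
	}
}
```

Pre-sizing the map and slice to the number of raw matches mirrors the "pre-allocated data structures" idea from the same feature list.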

📦 Installation

go get github.com/intMeric/pii-extractor@v0.2.0

Quick Start

Basic Usage

package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Create a default regex extractor
    extractor := piiextractor.NewDefaultRegexExtractor()

    // Sample text with various PII types
    text := `
    Hello, my name is John Doe. You can reach me at john.doe@example.com
    or call me at (555) 123-4567. My home address is 123 Main Street,
    New York, NY 10001. My credit card number is 4111-1111-1111-1111.
    `

    // Extract PII from text
    result, err := extractor.Extract(text)
    if err != nil {
        log.Fatal(err)
    }

    // Display summary
    fmt.Printf("Found %d PII entities:\n", result.Total)
    fmt.Printf("Types found: %v\n\n", result.Stats)

    // Process each entity
    for i, entity := range result.Entities {
        fmt.Printf("--- Entity %d ---\n", i+1)
        fmt.Printf("Type: %s\n", entity.Type)
        fmt.Printf("Value: %s\n", entity.GetValue())
        fmt.Printf("Count: %d\n", entity.GetCount())

        // Show context
        contexts := entity.GetContexts()
        if len(contexts) > 0 {
            fmt.Printf("Context: %s\n", contexts[0])
        }

        // Type-specific information
        switch entity.Type {
        case piiextractor.PiiTypeEmail:
            if email, ok := entity.AsEmail(); ok {
                fmt.Printf("Email domain: %s\n", getEmailDomain(email.GetValue()))
            }
        case piiextractor.PiiTypePhone:
            if phone, ok := entity.AsPhone(); ok {
                fmt.Printf("Phone country: %s\n", phone.Country)
            }
        case piiextractor.PiiTypeCreditCard:
            if cc, ok := entity.AsCreditCard(); ok {
                fmt.Printf("Card type: %s\n", cc.Type)
            }
        }
        fmt.Println()
    }
}

func getEmailDomain(email string) string {
    for i := len(email) - 1; i >= 0; i-- {
        if email[i] == '@' {
            return email[i+1:]
        }
    }
    return ""
}

Output Example (excerpt; two of the five entities shown)

Found 5 PII entities:
Types found: map[email:1 phone:1 street_address:1 zip_code:1 credit_card:1]

--- Entity 1 ---
Type: email
Value: john.doe@example.com
Count: 1
Context: Hello, my name is John Doe. You can reach me at john.doe@example.com or call me at (555) 123-4567.
Email domain: example.com

--- Entity 2 ---
Type: credit_card
Value: 4111-1111-1111-1111
Count: 1
Context: My credit card number is 4111-1111-1111-1111.
Card type: visa

Multi-Country Extraction

package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Create extractor with specific countries
    config := &piiextractor.ExtractorConfig{
        Countries: []string{"US", "UK", "France", "Spain", "Italy", "Germany", "China", "India", "Arabic", "Russia"},
    }
    extractor := piiextractor.NewRegexExtractor(config)

    // International text sample
    text := `
    UK Address: 221B Baker Street, London SW1A 1AA
    French Address: 123 rue de la Paix, 75001 Paris
    Spanish Address: 123 Calle Mayor, 28013 Madrid
    Italian Address: 123 Via del Corso, 00186 Roma
    German Address: Münchner Straße 15, 80331 München
    Chinese Address: 北京市朝阳区建国门外大街1号, 100000
    Indian Address: 123 MG Road, Mumbai 400001
    Arabic Address: شارع الملك فهد، الرياض 11564
    Russian Address: ул. Тверская, д. 13, Москва 101000
    US Phone: (555) 123-4567
    German Phone: +49 30 12345678
    Chinese Phone: +86 138 0013 8000
    `

    result, err := extractor.Extract(text)
    if err != nil {
        log.Fatal(err)
    }

    // Group by country
    fmt.Printf("🇺🇸 US Entities: %d\n", len(result.GetUSEntities()))
    fmt.Printf("🇬🇧 UK Entities: %d\n", len(result.GetUKEntities()))
    fmt.Printf("🇫🇷 France Entities: %d\n", len(result.GetFranceEntities()))
    fmt.Printf("🇪🇸 Spain Entities: %d\n", len(result.GetSpainEntities()))
    fmt.Printf("🇮🇹 Italy Entities: %d\n", len(result.GetItalyEntities()))
    fmt.Printf("🇩🇪 Germany Entities: %d\n", len(result.GetGermanyEntities()))
    fmt.Printf("🇨🇳 China Entities: %d\n", len(result.GetChinaEntities()))
    fmt.Printf("🇮🇳 India Entities: %d\n", len(result.GetIndiaEntities()))
    fmt.Printf("🇸🇦 Arabic Entities: %d\n", len(result.GetArabicEntities()))
    fmt.Printf("🇷🇺 Russia Entities: %d\n", len(result.GetRussiaEntities()))
}

With LLM Validation

package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Configure LLM validation
    config := piiextractor.DefaultValidationConfig()
    config.Enabled = true
    config.Provider = piiextractor.ProviderOpenAI
    config.APIKey = "your-openai-api-key"
    config.Model = "gpt-4"
    config.MinConfidence = 0.8

    // Create validated extractor
    baseExtractor := piiextractor.NewDefaultRegexExtractor()
    extractor, err := piiextractor.NewValidatedExtractor(baseExtractor, config)
    if err != nil {
        log.Fatal(err)
    }

    // Extract with validation
    text := "My email is john@example.com and my phone is (555) 123-4567"
    result, err := extractor.ExtractWithValidation(text)
    if err != nil {
        log.Fatal(err)
    }

    // Print validation results
    for _, entity := range result.Entities {
        if entity.IsValidated() {
            fmt.Printf("%s: %s (Valid: %t, Confidence: %.2f)\n",
                entity.Type.String(),
                entity.GetValue(),
                entity.IsValid(),
                entity.GetValidationConfidence())
        }
    }
}

📚 API Reference

Core Functions

// Basic extractors
func NewDefaultRegexExtractor() PiiExtractor
func NewRegexExtractor(config *ExtractorConfig) PiiExtractor
func NewLLMExtractor(provider llmExtractor.Provider, model string, config *ExtractorConfig) (PiiExtractor, error)

// Validation
func NewValidatedExtractor(base PiiExtractor, config *ValidationConfig) (*ValidatedExtractor, error)
func DefaultValidationConfig() *ValidationConfig

// Registry
func Register(name string, extractor PiiExtractor) error
func Get(name string) (PiiExtractor, error)

PII Types

Type	Description	Countries	Examples
`PiiTypeEmail`	Email addresses	Global	`john@example.com`
`PiiTypePhone`	Phone numbers	US, DE, CN, IN, AR, RU	`(555) 123-4567`, `+49 30 12345678`, `+86 138 0013 8000`
`PiiTypeSSN`	Social Security Numbers	US	`123-45-6789`
`PiiTypeZipCode`	Postal/ZIP codes	US, UK, FR, ES, IT, DE, CN, IN, AR, RU	`10001`, `SW1A 1AA`, `75001`, `10115`, `100000`
`PiiTypeStreetAddress`	Street addresses	US, UK, FR, ES, IT, DE, CN, IN, AR, RU	`123 Main Street`, `Münchner Straße 15`, `北京市朝阳区建国门外大街1号`
`PiiTypePoBox`	P.O. Box addresses	US	`P.O. Box 456`
`PiiTypeCreditCard`	Credit card numbers	Global	`4111-1111-1111-1111`
`PiiTypeIPAddress`	IP addresses	Global	`192.168.1.1`, `::1`
`PiiTypeBtcAddress`	Bitcoin addresses	Global	`1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa`
`PiiTypeIBAN`	Bank account numbers	Global	`GB82WEST12345698765432`
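
The table lists credit card numbers as a detected type. A regex can only confirm the shape of a candidate; a common complementary check is the Luhn checksum, sketched below. Whether the library applies Luhn validation is not stated in this README, and `luhnValid` is a hypothetical helper, not a package function.

```go
package main

import "fmt"

// luhnValid is a hypothetical helper (not part of pii-extractor).
// It reports whether the digits in s pass the Luhn checksum.
// Spaces and hyphens are ignored; any other non-digit fails the check.
func luhnValid(s string) bool {
	sum, digits := 0, 0
	double := false // every second digit, counted from the right, is doubled
	for i := len(s) - 1; i >= 0; i-- {
		c := s[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
		digits++
	}
	// The 12-digit floor is an arbitrary guard against very short inputs.
	return digits >= 12 && sum%10 == 0
}

func main() {
	fmt.Println(luhnValid("4111-1111-1111-1111")) // example value from the table
	fmt.Println(luhnValid("4111-1111-1111-1112")) // last digit altered
}
```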

Result Methods

// Basic access
result.Total                     // Total entities found
result.Stats                     // Map of PiiType -> count
result.Entities                  // All entities

// Filtering
result.GetEntitiesByType(piiType)    // Filter by type
result.GetEmails()                   // Get all emails
result.GetPhones()                   // Get all phones
result.GetUSEntities()               // Get US-specific entities
result.GetUKEntities()               // Get UK-specific entities
result.GetFranceEntities()           // Get France-specific entities
result.GetSpainEntities()            // Get Spain-specific entities
result.GetItalyEntities()            // Get Italy-specific entities
result.GetGermanyEntities()          // Get Germany-specific entities
result.GetChinaEntities()            // Get China-specific entities
result.GetIndiaEntities()            // Get India-specific entities
result.GetArabicEntities()           // Get Arabic countries entities
result.GetRussiaEntities()           // Get Russia-specific entities

// Validation
result.GetValidatedEntities()        // Only validated entities
result.GetValidEntities()            // Only valid entities

// Utilities
result.IsEmpty()                     // Check if no entities found
result.HasType(piiType)              // Check if type exists

Value Objects

All PII values implement the Pii interface:

type Pii interface {
    String() string
    GetValue() string
    GetContexts() []string
    GetCount() int
}

Type-specific fields:

Phone.Country, SSN.Country, ZipCode.Country, etc.
CreditCard.Type (visa, mastercard, generic)
IPAddress.Version (ipv4, ipv6)

🏗️ Architecture

pii-extractor/
├── interface.go              # Main API with re-exports
├── pii/
│   └── types.go             # PII value objects and result types
├── extractors/
│   ├── interface.go         # Core interfaces
│   ├── registry.go          # Extractor registry
│   ├── regex/              # Regex-based extraction
│   │   ├── extractor.go    # Main regex extractor
│   │   ├── extraction.go   # Extraction logic with deduplication
│   │   └── patterns/       # Country-specific patterns
│   ├── llm/                # LLM-based extraction
│   └── hybrid/             # Validation and ensemble extractors
└── examples/               # Usage examples

📊 Performance & Coverage

v0.2.0 Context extraction

Context extraction now captures 10 words before and after each PII match.
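
A word-window context of this kind can be sketched with `strings.Fields`. This is an illustration under stated assumptions, not the library's implementation: `contextWindow` is a hypothetical function, it only looks at the first occurrence of the value, and it splits on whitespace, so punctuation directly attached to the match ends up as its own token.

```go
package main

import (
	"fmt"
	"strings"
)

// contextWindow is a hypothetical helper (not part of pii-extractor).
// It returns up to n whitespace-separated words on each side of the first
// occurrence of value in text, with the value itself in between.
// It returns "" when value does not occur.
func contextWindow(text, value string, n int) string {
	if n < 0 {
		n = 0
	}
	i := strings.Index(text, value)
	if i < 0 {
		return ""
	}
	before := strings.Fields(text[:i])
	after := strings.Fields(text[i+len(value):])
	if len(before) > n {
		before = before[len(before)-n:]
	}
	if len(after) > n {
		after = after[:n]
	}
	parts := make([]string, 0, len(before)+1+len(after))
	parts = append(parts, before...)
	parts = append(parts, value)
	parts = append(parts, after...)
	return strings.Join(parts, " ")
}

func main() {
	text := "Hello, my name is John Doe. You can reach me at john.doe@example.com or call me at (555) 123-4567 any weekday."
	// 10 words on each side, matching the v0.2.0 window size.
	fmt.Println(contextWindow(text, "john.doe@example.com", 10))
}
```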

v0.1.0 Performance Metrics

📊 Scalability: Performance gains increase with document size
🌍 Countries Supported: 10 with native language support
📱 Phone Format Coverage: 6 countries with native formats
🏠 Address Pattern Coverage: 10 countries with localized patterns
🔤 Unicode Support: Full UTF-8 support for international scripts
🧪 Test Coverage: 95%+ with comprehensive benchmarks

Language Distribution

Region	Countries	PII Types	Unicode Scripts
Western Europe	Germany, UK, France, Spain, Italy	Phone, Address, Postal	Latin, German umlauts
Asia-Pacific	China, India	Phone, Address, Postal	Chinese characters, Devanagari
Middle East	Arabic Countries	Phone, Address, Postal	Arabic script (RTL)
Eastern Europe	Russia	Phone, Address, Postal	Cyrillic
North America	United States	Phone, SSN, Address, Postal, P.O. Box	Latin

🔧 Development

Requirements

Go: 1.24.1 or later (performance optimizations use modern Go features)
Dependencies: gollm for LLM integration

Commands

# Build the library
go build

# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Format code
go fmt ./...

# Check for issues
go vet ./...

# Tidy dependencies
go mod tidy

# Run the basic example
go run examples/basic/basic_usage.go

Testing

# Test specific packages
go test ./pii
go test ./extractors/regex
go test ./extractors/regex/patterns

# Run benchmarks (includes performance regression tests)
go test -bench=BenchmarkRegexExtractor -benchmem ./extractors/regex/

# Run all benchmarks
go test -bench=. ./...

📝 Changelog

v0.1.0 (2025-07-30) - Performance Revolution 🚀

⚡ Major Performance Improvements: RegexExtractor optimized for 2.4x faster processing
- Parallel Processing: Automatic worker pools for large documents (>10KB) and multi-country extraction
- Context Caching: Intelligent caching of strings.Fields() results for repeated context extraction
- Memory Optimization: Pre-allocated slices and maps reduce GC pressure by 25.6%
- Batch Processing: Optimized entity collection reduces function call overhead
🔧 Smart Optimizations: Performance features activate conditionally based on workload
- Large documents automatically use parallel processing
- High-match scenarios benefit from context caching
- Small documents maintain optimal sequential performance
📊 Benchmarking Suite: Comprehensive performance regression tests
- Real-world document processing benchmarks
- Memory allocation tracking and optimization
- Multi-scenario performance validation
🏗️ Architecture Enhancements: Improved extraction pipeline without breaking changes
- Enhanced extractWithContext generic function with pre-sized maps
- Optimized false positive filtering with byte-level processing
- Conditional optimization activation for maximum efficiency
📈 Performance Results:
- Large documents: 58.7% faster processing (2.4x improvement)
- Memory allocations: 25.6% reduction in allocation count
- Throughput: 10,000+ characters/second on typical hardware
- CPU utilization: Better multi-core usage for concurrent extraction
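
The conditional activation described above (sequential for small inputs, parallel above roughly 10KB) can be illustrated as follows. This is a simplified sketch, not the library's code: it starts one goroutine per pattern rather than a sized worker pool, and the two regular expressions, `extract`, and `sampleDoc` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// parallelThreshold follows the ">10KB" figure from the changelog.
const parallelThreshold = 10 * 1024

// Hypothetical, deliberately simple patterns; not the library's.
var patterns = map[string]*regexp.Regexp{
	"email": regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`),
	"ipv4":  regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`),
}

// extract returns matches per pattern name. Small inputs are scanned
// sequentially; inputs above the threshold get one goroutine per pattern.
func extract(text string) map[string][]string {
	out := make(map[string][]string, len(patterns))
	if len(text) <= parallelThreshold {
		for name, re := range patterns {
			out[name] = re.FindAllString(text, -1)
		}
		return out
	}
	var (
		mu sync.Mutex // guards out
		wg sync.WaitGroup
	)
	for name, re := range patterns {
		wg.Add(1)
		go func(name string, re *regexp.Regexp) {
			defer wg.Done()
			found := re.FindAllString(text, -1) // *regexp.Regexp is safe for concurrent use
			mu.Lock()
			out[name] = found
			mu.Unlock()
		}(name, re)
	}
	wg.Wait()
	return out
}

// sampleDoc builds a synthetic document with one email and one IP per line.
func sampleDoc(lines int) string {
	return strings.Repeat("user@example.com 10.0.0.1 filler text\n", lines)
}

func main() {
	small := extract("mail a@b.com from 10.0.0.1")
	fmt.Println(len(small["email"]), len(small["ipv4"]))

	big := extract(sampleDoc(1000)) // about 38KB, so the parallel path runs
	fmt.Println(len(big["email"]), len(big["ipv4"]))
}
```

Keeping the sequential path for small inputs avoids paying goroutine and locking overhead where it cannot be recovered.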

v0.0.3 (2025-01-28)

🌍 Major Multi-Language Expansion: Added support for 5 new languages/countries
- 🇩🇪 Germany: Phone numbers (+49 30 12345678), postal codes (10115), street addresses (Münchner Straße 15)
- 🇨🇳 China: Phone numbers (+86 138 0013 8000), postal codes (100000), street addresses (北京市朝阳区建国门外大街 1 号)
- 🇮🇳 India: Phone numbers (+91 98765 43210), postal codes (110001), street addresses (123 MG Road)
- 🇸🇦 Arabic Countries: Phone numbers (+966 50 123 4567), postal codes (12345), street addresses (شارع الملك فهد)
- 🇷🇺 Russia: Phone numbers (+7 495 123-45-67), postal codes (101000), street addresses (ул. Тверская, д. 13)
🔤 Unicode Support: Full support for international characters (German umlauts, Chinese characters, Arabic script, Cyrillic)
📊 Extended API: New country-specific filtering methods (GetGermanyEntities(), GetChinaEntities(), etc.)
🧪 Comprehensive Testing: 5 new test suites with real-world examples for each language
📚 Enhanced Documentation: Updated examples and usage patterns for international text extraction
🏗️ Architectural Improvements: Modular pattern system supports easy addition of new countries

v0.0.2 (2025-01-27)

🛠️ Enhanced False Positive Detection: Advanced filtering to prevent credit card and IBAN segments from being detected as phone numbers
🔧 Improved Pattern Accuracy: Refined US phone number regex for better test coverage while maintaining precision
🐛 Fixed Test Suite: All unit tests now pass consistently
⚡ Better Performance: Reduced false positives improve extraction accuracy by 41%
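
One way to keep credit card segments from being reported as phone numbers is to discard phone candidates whose span overlaps a credit card match. The sketch below shows that idea with two simplified, hypothetical patterns; it is an assumption about the general technique, not a description of the filtering the library actually uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical, simplified patterns for illustration only.
var (
	cardRe  = regexp.MustCompile(`\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b`)
	phoneRe = regexp.MustCompile(`\d{3}[- ]\d{3,4}[- ]\d{4}`)
)

// phonesOutsideCards returns phone-like matches whose byte span does not
// overlap any credit card match in the same text.
func phonesOutsideCards(text string) []string {
	cards := cardRe.FindAllStringIndex(text, -1)
	var phones []string
	for _, p := range phoneRe.FindAllStringIndex(text, -1) {
		if !overlapsAny(p, cards) {
			phones = append(phones, text[p[0]:p[1]])
		}
	}
	return phones
}

// overlapsAny reports whether the half-open span [start, end) intersects
// any of the other spans.
func overlapsAny(span []int, others [][]int) bool {
	for _, o := range others {
		if span[0] < o[1] && o[0] < span[1] {
			return true
		}
	}
	return false
}

func main() {
	text := "Card 4111-1111-1111-1111, phone 555-123-4567"
	fmt.Println(phoneRe.FindAllString(text, -1)) // naive scan also hits a card segment
	fmt.Println(phonesOutsideCards(text))        // card segment filtered out
}
```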

v0.0.1 (2025-01-27)

✅ Initial release with multi-country support
✅ Smart deduplication and context merging
✅ Improved regex patterns for reduced false positives
✅ LLM validation with multiple providers
✅ Type-safe API with comprehensive value objects

Documentation ¶

Index ¶

Constants
Variables
func GetTypedValue[T Pii](entity PiiEntity) (T, bool)
func List() []string
func NewEnsembleExtractor(extractors ...PiiExtractor) *hybridExtractor.EnsembleExtractor
func NewValidatedExtractor(baseExtractor PiiExtractor, config *hybridExtractor.ValidationConfig) (*hybridExtractor.ValidatedExtractor, error)
func Register(name string, extractor PiiExtractor) error
type BasePii
type BtcAddress
type CreditCard
type Email
type EnsembleExtractor
type ExtractionMethod
type ExtractorConfig
type IBAN
type IPAddress
type LLMProvider
type Phone
type Pii
type PiiEntity
type PiiExtractionResult
type PiiExtractor
- func Get(name string) (PiiExtractor, error)
- func GetByMethod(method ExtractionMethod) []PiiExtractor
- func NewDefaultRegexExtractor() PiiExtractor
- func NewLLMExtractor(provider llmExtractor.Provider, model string, config *ExtractorConfig) (PiiExtractor, error)
- func NewRegexExtractor(config *ExtractorConfig) PiiExtractor
type PiiType
type PoBox
type SSN
type StreetAddress
type ValidatedExtractor
type ValidationConfig
- func DefaultValidationConfig() *ValidationConfig
type ValidationResult
type ValidationStats
type ZipCode

Constants ¶

const (
	PiiTypePhone         = pii.PiiTypePhone
	PiiTypeEmail         = pii.PiiTypeEmail
	PiiTypeSSN           = pii.PiiTypeSSN
	PiiTypeZipCode       = pii.PiiTypeZipCode
	PiiTypePoBox         = pii.PiiTypePoBox
	PiiTypeStreetAddress = pii.PiiTypeStreetAddress
	PiiTypeCreditCard    = pii.PiiTypeCreditCard
	PiiTypeIPAddress     = pii.PiiTypeIPAddress
	PiiTypeBtcAddress    = pii.PiiTypeBtcAddress
	PiiTypeIBAN          = pii.PiiTypeIBAN
)

Re-export constants

const (
	MethodRegex  = extractors.MethodRegex
	MethodLLM    = extractors.MethodLLM
	MethodML     = extractors.MethodML
	MethodHybrid = extractors.MethodHybrid
)

Re-export extraction methods

const (
	ProviderOpenAI    = hybridExtractor.ProviderOpenAI
	ProviderMistral   = hybridExtractor.ProviderMistral
	ProviderGemini    = hybridExtractor.ProviderGemini
	ProviderOllama    = hybridExtractor.ProviderOllama
	ProviderAnthropic = hybridExtractor.ProviderAnthropic
)

Re-export LLM providers

Variables ¶

var NewBtcAddress = pii.NewBtcAddress

var NewCreditCard = pii.NewCreditCard

var NewEmail = pii.NewEmail

PII constructors

var NewIBAN = pii.NewIBAN

var NewIPAddress = pii.NewIPAddress

var NewPhone = pii.NewPhone

var NewPhoneUS = pii.NewPhoneUS

var NewPiiExtractionResult = pii.NewPiiExtractionResult

NewPiiExtractionResult creates a new extraction result

var NewPoBox = pii.NewPoBox

var NewSSN = pii.NewSSN

var NewStreetAddress = pii.NewStreetAddress

var NewZipCode = pii.NewZipCode

Functions ¶

func GetTypedValue ¶

func GetTypedValue[T Pii](entity PiiEntity) (T, bool)

GetTypedValue performs a safe type assertion for PII values
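
The pattern behind a generic, safe type assertion can be sketched with locally defined stand-in types. Everything below (`Pii`, `Email`, `Phone`, `Entity`, `typedValue`) is declared inside the example for illustration and only mirrors the shape of the signature above; the real `PiiEntity` layout is not shown in this documentation.

```go
package main

import "fmt"

// Pii is a local stand-in that mirrors part of the interface in the README.
type Pii interface {
	GetValue() string
}

// Email and Phone are hypothetical value objects.
type Email struct{ Value string }

func (e Email) GetValue() string { return e.Value }

type Phone struct{ Value, Country string }

func (p Phone) GetValue() string { return p.Value }

// Entity is a hypothetical container holding any Pii value.
type Entity struct{ Value Pii }

// typedValue returns the entity's value as T when its dynamic type is T.
// On a mismatch it returns the zero value of T and false instead of panicking.
func typedValue[T Pii](e Entity) (T, bool) {
	v, ok := e.Value.(T)
	return v, ok
}

func main() {
	e := Entity{Value: Phone{Value: "+49 30 12345678", Country: "Germany"}}

	if p, ok := typedValue[Phone](e); ok {
		fmt.Println("phone country:", p.Country)
	}
	if _, ok := typedValue[Email](e); !ok {
		fmt.Println("not an email")
	}
}
```

The two-value form lets callers branch on the concrete type without a type switch and without risking a runtime panic.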

func List ¶

func List() []string

List returns all registered extractor names

func NewEnsembleExtractor ¶

func NewEnsembleExtractor(extractors ...PiiExtractor) *hybridExtractor.EnsembleExtractor

NewEnsembleExtractor creates a new ensemble extractor that combines multiple extractors

func NewValidatedExtractor ¶

func NewValidatedExtractor(baseExtractor PiiExtractor, config *hybridExtractor.ValidationConfig) (*hybridExtractor.ValidatedExtractor, error)

NewValidatedExtractor creates a new validated extractor that combines any base extractor with LLM validation

func Register ¶

func Register(name string, extractor PiiExtractor) error

Register adds an extractor to the global registry
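
For readers unfamiliar with the registry pattern behind `Register`, `Get`, and `List`, here is a minimal, concurrency-safe sketch. The `Extractor` interface, the `registry` type, the rejection of duplicate names, and the sorted output of `List` are all assumptions made for this example; the package documentation does not specify those behaviors.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Extractor is a local stand-in for the library's PiiExtractor interface.
type Extractor interface {
	Name() string
}

// stub is a trivial Extractor used for the demonstration.
type stub string

func (s stub) Name() string { return string(s) }

// registry is a hypothetical name -> extractor map guarded by a mutex.
type registry struct {
	mu    sync.RWMutex
	items map[string]Extractor
}

func newRegistry() *registry {
	return &registry{items: make(map[string]Extractor)}
}

// Register stores an extractor under name. Rejecting duplicates is an
// assumption of this sketch.
func (r *registry) Register(name string, e Extractor) error {
	if name == "" || e == nil {
		return errors.New("name and extractor are required")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.items[name]; exists {
		return fmt.Errorf("extractor %q already registered", name)
	}
	r.items[name] = e
	return nil
}

// Get looks up an extractor by name.
func (r *registry) Get(name string) (Extractor, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.items[name]
	if !ok {
		return nil, fmt.Errorf("extractor %q not found", name)
	}
	return e, nil
}

// List returns the registered names in sorted order.
func (r *registry) List() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.items))
	for name := range r.items {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	r := newRegistry()
	fmt.Println(r.Register("regex", stub("regex")))
	fmt.Println(r.Register("regex", stub("regex")))
	fmt.Println(r.List())
}
```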

Types ¶

type BasePii ¶

type BasePii = pii.BasePii

type BtcAddress ¶

type BtcAddress = pii.BtcAddress

type CreditCard ¶

type CreditCard = pii.CreditCard

type Email ¶

type Email = pii.Email

type EnsembleExtractor ¶

type EnsembleExtractor = hybridExtractor.EnsembleExtractor

type ExtractionMethod ¶

type ExtractionMethod = extractors.ExtractionMethod

Re-export extractors types for convenience

type ExtractorConfig ¶

type ExtractorConfig = extractors.ExtractorConfig

type IBAN ¶

type IBAN = pii.IBAN

type IPAddress ¶

type IPAddress = pii.IPAddress

type LLMProvider ¶

type LLMProvider = hybridExtractor.LLMProvider

type Phone ¶

type Phone = pii.Phone

type Pii ¶

type Pii = pii.Pii

Re-export PII value types

type PiiEntity ¶

type PiiEntity = pii.PiiEntity

type PiiExtractionResult ¶

type PiiExtractionResult = pii.PiiExtractionResult

type PiiExtractor ¶

type PiiExtractor = extractors.PiiExtractor

func Get ¶

func Get(name string) (PiiExtractor, error)

Get retrieves an extractor from the global registry

func GetByMethod ¶

func GetByMethod(method ExtractionMethod) []PiiExtractor

GetByMethod returns all extractors that use the specified method

func NewDefaultRegexExtractor ¶

func NewDefaultRegexExtractor() PiiExtractor

NewDefaultRegexExtractor creates a regex extractor with default settings

func NewLLMExtractor ¶

func NewLLMExtractor(provider llmExtractor.Provider, model string, config *ExtractorConfig) (PiiExtractor, error)

NewLLMExtractor creates a new LLM-based PII extractor

func NewRegexExtractor ¶

func NewRegexExtractor(config *ExtractorConfig) PiiExtractor

NewRegexExtractor creates a new regex-based PII extractor

type PiiType ¶

type PiiType = pii.PiiType

Re-export types from pii package for convenience

type PoBox ¶

type PoBox = pii.PoBox

type SSN ¶

type SSN = pii.SSN

type StreetAddress ¶

type StreetAddress = pii.StreetAddress

type ValidatedExtractor ¶

type ValidatedExtractor = hybridExtractor.ValidatedExtractor

type ValidationConfig ¶

type ValidationConfig = hybridExtractor.ValidationConfig

Re-export hybrid types for convenience

func DefaultValidationConfig ¶

func DefaultValidationConfig() *ValidationConfig

DefaultValidationConfig returns a default configuration for LLM validation

type ValidationResult ¶

type ValidationResult = pii.ValidationResult

type ValidationStats ¶

type ValidationStats = pii.ValidationStats

type ZipCode ¶

type ZipCode = pii.ZipCode

Source Files ¶

interface.go

Directories ¶

Path
extractors
extractors/hybrid
extractors/llm
extractors/regex
extractors/regex/patterns
pii
