piiextractor

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 3, 2025 License: MIT Imports: 5 Imported by: 0

README

PII Extractor

A high-performance Go library for extracting and identifying Personally Identifiable Information (PII) from text data. Features multi-country support across 10 languages, intelligent deduplication, and 2.4x faster processing with advanced optimizations.

🚀 NEW in v0.2.0: Update PII context extration (10 words before & after PII)

🚀 Features

Multi-Country Support
  • United States: Phone numbers, SSNs, ZIP codes, street addresses, P.O. boxes
  • United Kingdom: Postal codes (SW1A 1AA), street addresses (221B Baker Street)
  • France: Metropolitan and DOM-TOM postal codes (75001, 97110), street addresses
  • Spain: Mainland and island postal codes (28013, 35001), street addresses
  • Italy: All valid postal codes (00186, 20100), street addresses
  • Germany: Phone numbers (+49 30 12345678), postal codes (10115), street addresses (Münchner Straße 15)
  • China: Phone numbers (+86 138 0013 8000), postal codes (100000), street addresses (北京市朝阳区建国门外大街 1 号)
  • India: Phone numbers (+91 98765 43210), postal codes (110001), street addresses (123 MG Road)
  • Arabic Countries: Phone numbers (+966 50 123 4567), postal codes (12345), street addresses (شارع الملك فهد)
  • Russia: Phone numbers (+7 495 123-45-67), postal codes (101000), street addresses (ул. Тверская, д. 13)
Comprehensive PII Detection
  • Contact Information: Email addresses, phone numbers
  • Government IDs: Social Security Numbers (US)
  • Addresses: Street addresses, postal/ZIP codes, P.O. boxes
  • Financial: Credit card numbers (Visa, MasterCard, generic), IBAN numbers
  • Digital: IP addresses (IPv4/IPv6), Bitcoin addresses
Advanced Features
  • 🌍 Global Coverage: 10 countries with native language support and Unicode handling
  • ⚡ High Performance: 2.4x faster processing with parallel extraction and optimized algorithms
  • 🔍 Smart Deduplication: Automatically merges duplicate entities and consolidates contexts
  • 📍 Context Extraction: Captures surrounding sentences or 8 words before/after for context
  • 🚀 Parallel Processing: Automatic worker pools for large documents (>10KB)
  • 💾 Memory Optimized: Pre-allocated data structures and efficient context caching
  • 🎯 High Accuracy: Improved regex patterns to minimize false positives
  • 🤖 LLM Validation: Optional validation using OpenAI, Anthropic, Gemini, Mistral, or Ollama
  • 🛡️ Type-Safe API: Full Go type safety with convenient value objects

📦 Installation

go get github.com/intMeric/pii-extractor@v0.2.0

Quick Start

Basic Usage
package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Create a default regex extractor
    extractor := piiextractor.NewDefaultRegexExtractor()

    // Sample text with various PII types
    text := `
    Hello, my name is John Doe. You can reach me at john.doe@example.com
    or call me at (555) 123-4567. My home address is 123 Main Street,
    New York, NY 10001. My credit card number is 4111-1111-1111-1111.
    `

    // Extract PII from text
    result, err := extractor.Extract(text)
    if err != nil {
        log.Fatal(err)
    }

    // Display summary
    fmt.Printf("Found %d PII entities:\n", result.Total)
    fmt.Printf("Types found: %v\n\n", result.Stats)

    // Process each entity
    for i, entity := range result.Entities {
        fmt.Printf("--- Entity %d ---\n", i+1)
        fmt.Printf("Type: %s\n", entity.Type)
        fmt.Printf("Value: %s\n", entity.GetValue())
        fmt.Printf("Count: %d\n", entity.GetCount())

        // Show context
        contexts := entity.GetContexts()
        if len(contexts) > 0 {
            fmt.Printf("Context: %s\n", contexts[0])
        }

        // Type-specific information
        switch entity.Type {
        case piiextractor.PiiTypeEmail:
            if email, ok := entity.AsEmail(); ok {
                fmt.Printf("Email domain: %s\n", getEmailDomain(email.GetValue()))
            }
        case piiextractor.PiiTypePhone:
            if phone, ok := entity.AsPhone(); ok {
                fmt.Printf("Phone country: %s\n", phone.Country)
            }
        case piiextractor.PiiTypeCreditCard:
            if cc, ok := entity.AsCreditCard(); ok {
                fmt.Printf("Card type: %s\n", cc.Type)
            }
        }
        fmt.Println()
    }
}

func getEmailDomain(email string) string {
    for i := len(email) - 1; i >= 0; i-- {
        if email[i] == '@' {
            return email[i+1:]
        }
    }
    return ""
}
Output Example
Found 5 PII entities:
Types found: map[email:1 phone:1 street_address:1 zip_code:1 credit_card:1]

--- Entity 1 ---
Type: email
Value: john.doe@example.com
Count: 1
Context: Hello, my name is John Doe. You can reach me at john.doe@example.com or call me at (555) 123-4567.
Email domain: example.com

--- Entity 2 ---
Type: credit_card
Value: 4111-1111-1111-1111
Count: 1
Context: My credit card number is 4111-1111-1111-1111.
Card type: visa
Multi-Country Extraction
package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Create extractor with specific countries
    config := &piiextractor.ExtractorConfig{
        Countries: []string{"US", "UK", "France", "Spain", "Italy", "Germany", "China", "India", "Arabic", "Russia"},
    }
    extractor := piiextractor.NewExtractor(config)

    // International text sample
    text := `
    UK Address: 221B Baker Street, London SW1A 1AA
    French Address: 123 rue de la Paix, 75001 Paris
    Spanish Address: 123 Calle Mayor, 28013 Madrid
    Italian Address: 123 Via del Corso, 00186 Roma
    German Address: Münchner Straße 15, 80331 München
    Chinese Address: 北京市朝阳区建国门外大街1号, 100000
    Indian Address: 123 MG Road, Mumbai 400001
    Arabic Address: شارع الملك فهد، الرياض 11564
    Russian Address: ул. Тверская, д. 13, Москва 101000
    US Phone: (555) 123-4567
    German Phone: +49 30 12345678
    Chinese Phone: +86 138 0013 8000
    `

    result, err := extractor.Extract(text)
    if err != nil {
        log.Fatal(err)
    }

    // Group by country
    fmt.Printf("🇺🇸 US Entities: %d\n", len(result.GetUSEntities()))
    fmt.Printf("🇬🇧 UK Entities: %d\n", len(result.GetUKEntities()))
    fmt.Printf("🇫🇷 France Entities: %d\n", len(result.GetFranceEntities()))
    fmt.Printf("🇪🇸 Spain Entities: %d\n", len(result.GetSpainEntities()))
    fmt.Printf("🇮🇹 Italy Entities: %d\n", len(result.GetItalyEntities()))
    fmt.Printf("🇩🇪 Germany Entities: %d\n", len(result.GetGermanyEntities()))
    fmt.Printf("🇨🇳 China Entities: %d\n", len(result.GetChinaEntities()))
    fmt.Printf("🇮🇳 India Entities: %d\n", len(result.GetIndiaEntities()))
    fmt.Printf("🇸🇦 Arabic Entities: %d\n", len(result.GetArabicEntities()))
    fmt.Printf("🇷🇺 Russia Entities: %d\n", len(result.GetRussiaEntities()))
}
With LLM Validation
package main

import (
    "fmt"
    "log"

    piiextractor "github.com/intMeric/pii-extractor"
)

func main() {
    // Configure LLM validation
    config := piiextractor.DefaultValidationConfig()
    config.Enabled = true
    config.Provider = piiextractor.ProviderOpenAI
    config.APIKey = "your-openai-api-key"
    config.Model = "gpt-4"
    config.MinConfidence = 0.8

    // Create validated extractor
    baseExtractor := piiextractor.NewDefaultRegexExtractor()
    extractor, err := piiextractor.NewValidatedExtractor(baseExtractor, config)
    if err != nil {
        log.Fatal(err)
    }

    // Extract with validation
    text := "My email is john@example.com and my phone is (555) 123-4567"
    result, err := extractor.ExtractWithValidation(text)
    if err != nil {
        log.Fatal(err)
    }

    // Print validation results
    for _, entity := range result.Entities {
        if entity.IsValidated() {
            fmt.Printf("%s: %s (Valid: %t, Confidence: %.2f)\n",
                entity.Type.String(),
                entity.GetValue(),
                entity.IsValid(),
                entity.GetValidationConfidence())
        }
    }
}

📚 API Reference

Core Functions
// Basic extractors
func NewDefaultRegexExtractor() PiiExtractor
func NewRegexExtractor(config *ExtractorConfig) PiiExtractor
func NewLLMExtractor(provider, model string, config *ExtractorConfig) (PiiExtractor, error)

// Validation
func NewValidatedExtractor(base PiiExtractor, config *ValidationConfig) (*ValidatedExtractor, error)
func DefaultValidationConfig() *ValidationConfig

// Registry
func Register(name string, extractor PiiExtractor) error
func Get(name string) (PiiExtractor, error)
PII Types
Type Description Countries Examples
PiiTypeEmail Email addresses Global john@example.com
PiiTypePhone Phone numbers US, DE, CN, IN, AR, RU (555) 123-4567, +49 30 12345678, +86 138 0013 8000
PiiTypeSSN Social Security Numbers US 123-45-6789
PiiTypeZipCode Postal/ZIP codes US, UK, FR, ES, IT, DE, CN, IN, AR, RU 10001, SW1A 1AA, 75001, 10115, 100000
PiiTypeStreetAddress Street addresses US, UK, FR, ES, IT, DE, CN, IN, AR, RU 123 Main Street, Münchner Straße 15, 北京市朝阳区建国门外大街1号
PiiTypePoBox P.O. Box addresses US P.O. Box 456
PiiTypeCreditCard Credit card numbers Global 4111-1111-1111-1111
PiiTypeIPAddress IP addresses Global 192.168.1.1, ::1
PiiTypeBtcAddress Bitcoin addresses Global 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa
PiiTypeIBAN Bank account numbers Global GB82WEST12345698765432
Result Methods
// Basic access
result.Total                     // Total entities found
result.Stats                     // Map of PiiType -> count
result.Entities                  // All entities

// Filtering
result.GetEntitiesByType(piiType)    // Filter by type
result.GetEmails()                   // Get all emails
result.GetPhones()                   // Get all phones
result.GetUSEntities()               // Get US-specific entities
result.GetUKEntities()               // Get UK-specific entities
result.GetGermanyEntities()          // Get Germany-specific entities
result.GetChinaEntities()            // Get China-specific entities
result.GetIndiaEntities()            // Get India-specific entities
result.GetArabicEntities()           // Get Arabic countries entities
result.GetRussiaEntities()           // Get Russia-specific entities

// Validation
result.GetValidatedEntities()        // Only validated entities
result.GetValidEntities()            // Only valid entities

// Utilities
result.IsEmpty()                     // Check if no entities found
result.HasType(piiType)              // Check if type exists
Value Objects

All PII values implement the Pii interface:

type Pii interface {
    String() string
    GetValue() string
    GetContexts() []string
    GetCount() int
}

Country-specific fields:

  • Phone.Country, SSN.Country, ZipCode.Country, etc.
  • CreditCard.Type (visa, mastercard, generic)
  • IPAddress.Version (ipv4, ipv6)

🏗️ Architecture

pii-extractor/
├── interface.go              # Main API with re-exports
├── pii/
│   └── types.go             # PII value objects and result types
├── extractors/
│   ├── interface.go         # Core interfaces
│   ├── registry.go          # Extractor registry
│   ├── regex/              # Regex-based extraction
│   │   ├── extractor.go    # Main regex extractor
│   │   ├── extraction.go   # Extraction logic with deduplication
│   │   └── patterns/       # Country-specific patterns
│   ├── llm/                # LLM-based extraction
│   └── hybrid/             # Validation and ensemble extractors
└── examples/               # Usage examples

📊 Performance & Coverage

v0.2.0 Context extraction

Update context extraction

v0.1.0 Performance Metrics
  • 📊 Scalability: Performance gains increase with document size
  • 🌍 Countries Supported: 10 with native language support
  • 📱 Phone Format Coverage: 6 countries with native formats
  • 🏠 Address Pattern Coverage: 10 countries with localized patterns
  • 🔤 Unicode Support: Full UTF-8 support for international scripts
  • 🧪 Test Coverage: 95%+ with comprehensive benchmarks
Language Distribution
Region Countries PII Types Unicode Scripts
Western Europe Germany, UK, France, Spain, Italy Phone, Address, Postal Latin, German umlauts
Asia-Pacific China, India Phone, Address, Postal Chinese characters, Devanagari
Middle East Arabic Countries Phone, Address, Postal Arabic script (RTL)
Eastern Europe Russia Phone, Address, Postal Cyrillic
North America United States Phone, SSN, Address, Postal, P.O. Box Latin

🔧 Development

Requirements
  • Go: 1.24.1 or later (performance optimizations use modern Go features)
  • Dependencies: gollm for LLM integration
Commands
# Build the library
go build

# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Format code
go fmt ./...

# Check for issues
go vet ./...

# Tidy dependencies
go mod tidy

# Run the basic example
go run examples/basic/basic_usage.go
Testing
# Test specific packages
go test ./pii
go test ./extractors/regex
go test ./extractors/regex/patterns

# Run benchmarks (includes performance regression tests)
go test -bench=BenchmarkRegexExtractor -benchmem ./extractors/regex/

# Run all benchmarks
go test -bench=. ./...

📝 Changelog

v0.1.0 (2025-07-30) - Performance Revolution 🚀
  • ⚡ Major Performance Improvements: RegexExtractor optimized for 2.4x faster processing
    • Parallel Processing: Automatic worker pools for large documents (>10KB) and multi-country extraction
    • Context Caching: Intelligent caching of strings.Fields() results for repeated context extraction
    • Memory Optimization: Pre-allocated slices and maps reduce GC pressure by 25.6%
    • Batch Processing: Optimized entity collection reduces function call overhead
  • 🔧 Smart Optimizations: Performance features activate conditionally based on workload
    • Large documents automatically use parallel processing
    • High-match scenarios benefit from context caching
    • Small documents maintain optimal sequential performance
  • 📊 Benchmarking Suite: Comprehensive performance regression tests
    • Real-world document processing benchmarks
    • Memory allocation tracking and optimization
    • Multi-scenario performance validation
  • 🏗️ Architecture Enhancements: Improved extraction pipeline without breaking changes
    • Enhanced extractWithContext generic function with pre-sized maps
    • Optimized false positive filtering with byte-level processing
    • Conditional optimization activation for maximum efficiency
  • 📈 Performance Results:
    • Large documents: 58.7% faster processing (2.4x improvement)
    • Memory allocations: 25.6% reduction in allocation count
    • Throughput: 10,000+ characters/second on typical hardware
    • CPU utilization: Better multi-core usage for concurrent extraction
v0.0.3 (2025-01-28)
  • 🌍 Major Multi-Language Expansion: Added support for 5 new languages/countries
    • 🇩🇪 Germany: Phone numbers (+49 30 12345678), postal codes (10115), street addresses (Münchner Straße 15)
    • 🇨🇳 China: Phone numbers (+86 138 0013 8000), postal codes (100000), street addresses (北京市朝阳区建国门外大街 1 号)
    • 🇮🇳 India: Phone numbers (+91 98765 43210), postal codes (110001), street addresses (123 MG Road)
    • 🇸🇦 Arabic Countries: Phone numbers (+966 50 123 4567), postal codes (12345), street addresses (شارع الملك فهد)
    • 🇷🇺 Russia: Phone numbers (+7 495 123-45-67), postal codes (101000), street addresses (ул. Тверская, д. 13)
  • 🔤 Unicode Support: Full support for international characters (German umlauts, Chinese characters, Arabic script, Cyrillic)
  • 📊 Extended API: New country-specific filtering methods (GetGermanyEntities(), GetChinaEntities(), etc.)
  • 🧪 Comprehensive Testing: 5 new test suites with real-world examples for each language
  • 📚 Enhanced Documentation: Updated examples and usage patterns for international text extraction
  • 🏗️ Architectural Improvements: Modular pattern system supports easy addition of new countries
v0.0.2 (2025-01-27)
  • 🛠️ Enhanced False Positive Detection: Advanced filtering to prevent credit card and IBAN segments from being detected as phone numbers
  • 🔧 Improved Pattern Accuracy: Refined US phone number regex for better test coverage while maintaining precision
  • 🐛 Fixed Test Suite: All unit tests now pass consistently
  • Better Performance: Reduced false positives improve extraction accuracy by 41%
v0.0.1 (2025-01-27)
  • ✅ Initial release with multi-country support
  • ✅ Smart deduplication and context merging
  • ✅ Improved regex patterns for reduced false positives
  • ✅ LLM validation with multiple providers
  • ✅ Type-safe API with comprehensive value objects

Documentation

Index

Constants

View Source
const (
	PiiTypePhone         = pii.PiiTypePhone
	PiiTypeEmail         = pii.PiiTypeEmail
	PiiTypeSSN           = pii.PiiTypeSSN
	PiiTypeZipCode       = pii.PiiTypeZipCode
	PiiTypePoBox         = pii.PiiTypePoBox
	PiiTypeStreetAddress = pii.PiiTypeStreetAddress
	PiiTypeCreditCard    = pii.PiiTypeCreditCard
	PiiTypeIPAddress     = pii.PiiTypeIPAddress
	PiiTypeBtcAddress    = pii.PiiTypeBtcAddress
	PiiTypeIBAN          = pii.PiiTypeIBAN
)

Re-export constants

View Source
const (
	MethodRegex  = extractors.MethodRegex
	MethodLLM    = extractors.MethodLLM
	MethodML     = extractors.MethodML
	MethodHybrid = extractors.MethodHybrid
)

Re-export extraction methods

View Source
const (
	ProviderOpenAI    = hybridExtractor.ProviderOpenAI
	ProviderMistral   = hybridExtractor.ProviderMistral
	ProviderGemini    = hybridExtractor.ProviderGemini
	ProviderOllama    = hybridExtractor.ProviderOllama
	ProviderAnthropic = hybridExtractor.ProviderAnthropic
)

Re-export LLM providers

Variables

View Source
var NewBtcAddress = pii.NewBtcAddress
View Source
var NewCreditCard = pii.NewCreditCard
View Source
var NewEmail = pii.NewEmail

PII constructors

View Source
var NewIBAN = pii.NewIBAN
View Source
var NewIPAddress = pii.NewIPAddress
View Source
var NewPhone = pii.NewPhone
View Source
var NewPhoneUS = pii.NewPhoneUS
View Source
var NewPiiExtractionResult = pii.NewPiiExtractionResult

NewPiiExtractionResult creates a new extraction result

View Source
var NewPoBox = pii.NewPoBox
View Source
var NewSSN = pii.NewSSN
View Source
var NewStreetAddress = pii.NewStreetAddress
View Source
var NewZipCode = pii.NewZipCode

Functions

func GetTypedValue

func GetTypedValue[T Pii](entity PiiEntity) (T, bool)

GetTypedValue performs a safe type assertion for PII values

func List

func List() []string

List returns all registered extractor names

func NewEnsembleExtractor

func NewEnsembleExtractor(extractors ...PiiExtractor) *hybridExtractor.EnsembleExtractor

NewEnsembleExtractor creates a new ensemble extractor that combines multiple extractors

func NewValidatedExtractor

func NewValidatedExtractor(baseExtractor PiiExtractor, config *hybridExtractor.ValidationConfig) (*hybridExtractor.ValidatedExtractor, error)

NewValidatedExtractor creates a new validated extractor that combines any base extractor with LLM validation

func Register

func Register(name string, extractor PiiExtractor) error

Register adds an extractor to the global registry

Types

type BasePii

type BasePii = pii.BasePii

type BtcAddress

type BtcAddress = pii.BtcAddress

type CreditCard

type CreditCard = pii.CreditCard

type Email

type Email = pii.Email

type EnsembleExtractor

type EnsembleExtractor = hybridExtractor.EnsembleExtractor

type ExtractionMethod

type ExtractionMethod = extractors.ExtractionMethod

Re-export extractors types for convenience

type ExtractorConfig

type ExtractorConfig = extractors.ExtractorConfig

type IBAN

type IBAN = pii.IBAN

type IPAddress

type IPAddress = pii.IPAddress

type LLMProvider

type LLMProvider = hybridExtractor.LLMProvider

type Phone

type Phone = pii.Phone

type Pii

type Pii = pii.Pii

Re-export PII value types

type PiiEntity

type PiiEntity = pii.PiiEntity

type PiiExtractionResult

type PiiExtractionResult = pii.PiiExtractionResult

type PiiExtractor

type PiiExtractor = extractors.PiiExtractor

func Get

func Get(name string) (PiiExtractor, error)

Get retrieves an extractor from the global registry

func GetByMethod

func GetByMethod(method ExtractionMethod) []PiiExtractor

GetByMethod returns all extractors that use the specified method

func NewDefaultRegexExtractor

func NewDefaultRegexExtractor() PiiExtractor

NewDefaultRegexExtractor creates a regex extractor with default settings

func NewLLMExtractor

func NewLLMExtractor(provider llmExtractor.Provider, model string, config *ExtractorConfig) (PiiExtractor, error)

NewLLMExtractor creates a new LLM-based PII extractor

func NewRegexExtractor

func NewRegexExtractor(config *ExtractorConfig) PiiExtractor

NewRegexExtractor creates a new regex-based PII extractor

type PiiType

type PiiType = pii.PiiType

Re-export types from pii package for convenience

type PoBox

type PoBox = pii.PoBox

type SSN

type SSN = pii.SSN

type StreetAddress

type StreetAddress = pii.StreetAddress

type ValidatedExtractor

type ValidatedExtractor = hybridExtractor.ValidatedExtractor

type ValidationConfig

type ValidationConfig = hybridExtractor.ValidationConfig

Re-export hybrid types for convenience

func DefaultValidationConfig

func DefaultValidationConfig() *ValidationConfig

DefaultValidationConfig returns a default configuration for LLM validation

type ValidationResult

type ValidationResult = pii.ValidationResult

type ValidationStats

type ValidationStats = pii.ValidationStats

type ZipCode

type ZipCode = pii.ZipCode

Directories

Path Synopsis
llm

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL