markdownchunker

package module

v1.2.0 Latest Latest Go to latest Published: Aug 14, 2025 License: BSD-3-Clause Imports: 21 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/kydenul/markdown-chunker

Links

Open Source Insights

README ¶

Markdown Chunker

A Go library for intelligently splitting Markdown documents into semantic chunks. This library parses Markdown content and breaks it down into meaningful segments like headings, paragraphs, code blocks, tables, lists, and more.

Features

Flexible Chunking Strategies: Multiple built-in strategies (element-level, hierarchical, document-level) with support for custom strategies
Semantic Chunking: Splits Markdown documents based on content structure rather than arbitrary text length
Multiple Content Types: Supports headings, paragraphs, code blocks, tables, lists, blockquotes, and thematic breaks
Rich Metadata: Each chunk includes metadata like heading levels, word counts, code language, table dimensions, etc.
GitHub Flavored Markdown: Full support for GFM features including tables
Pure Text Extraction: Provides both original Markdown content and clean text for each chunk
Configurable Processing: Flexible configuration system for customizing chunking behavior
Advanced Error Handling: Comprehensive error handling with multiple modes (strict, permissive, silent)
Performance Monitoring: Built-in performance monitoring and memory optimization
Enhanced Metadata Extraction: Extensible metadata extraction system with link, image, and code analysis
Position Tracking: Precise position information for each chunk in the original document
Content Deduplication: SHA256 hash-based content deduplication
Memory Optimization: Object pooling and memory-efficient processing for large documents
Comprehensive Logging: Configurable logging system with multiple levels and formats
Easy Integration: Simple API for processing Markdown documents

Installation

go get github.com/kydenul/markdown-chunker

Quick Start

Basic Usage

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    markdown := `# My Document

This is a paragraph with some content.

## Code Example

` + "```go" + `
func main() {
    fmt.Println("Hello, World!")
}
` + "```" + `

| Column 1 | Column 2 |
|----------|----------|
| Value 1  | Value 2  |
`

    chunker := mc.NewMarkdownChunker()
    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }

    for _, chunk := range chunks {
        fmt.Printf("Type: %s, Content: %s\n", chunk.Type, chunk.Text)
    }
}

Advanced Usage with Configuration

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Create custom configuration
    config := mc.DefaultConfig()
    config.MaxChunkSize = 1000
    config.ErrorHandling = mc.ErrorModePermissive
    config.EnabledTypes = map[string]bool{
        "heading":    true,
        "paragraph":  true,
        "code":       true,
        "table":      true,
        "list":       false, // Disable list processing
    }
    
    // Configure logging
    config.EnableLog = true
    config.LogLevel = "INFO"
    config.LogFormat = "console"
    config.LogDirectory = "./logs"
    
    // Add custom metadata extractors
    config.CustomExtractors = []mc.MetadataExtractor{
        &mc.LinkExtractor{},
        &mc.ImageExtractor{},
        &mc.CodeComplexityExtractor{},
    }

    chunker := mc.NewMarkdownChunkerWithConfig(config)
    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }

    // Process chunks with enhanced metadata
    for _, chunk := range chunks {
        fmt.Printf("Type: %s, Position: %d:%d-%d:%d\n", 
            chunk.Type, 
            chunk.Position.StartLine, chunk.Position.StartCol,
            chunk.Position.EndLine, chunk.Position.EndCol)
        
        // Display links and images
        if len(chunk.Links) > 0 {
            fmt.Printf("  Links: %d\n", len(chunk.Links))
        }
        if len(chunk.Images) > 0 {
            fmt.Printf("  Images: %d\n", len(chunk.Images))
        }
        
        // Display hash for deduplication
        fmt.Printf("  Hash: %s\n", chunk.Hash[:8])
    }
    
    // Check for errors
    if chunker.HasErrors() {
        fmt.Printf("Processing errors: %d\n", len(chunker.GetErrors()))
    }
    
    // Get performance statistics
    stats := chunker.GetPerformanceStats()
    fmt.Printf("Processing time: %v\n", stats.ProcessingTime)
    fmt.Printf("Memory used: %d bytes\n", stats.MemoryUsed)
}

Chunking Strategies

The library supports multiple chunking strategies to handle different use cases and document structures. You can choose from built-in strategies or create custom ones.

Built-in Strategies

Element-Level Strategy (Default)

The element-level strategy processes each Markdown element individually, maintaining the original behavior of the library.

config := mc.DefaultConfig()
config.ChunkingStrategy = mc.ElementLevelConfig()

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

Use Cases:

Fine-grained content analysis
Search indexing with precise matching
Content that doesn't have clear hierarchical structure

Hierarchical Strategy

The hierarchical strategy groups content by heading levels, creating chunks that contain a heading and all its subordinate content.

config := mc.DefaultConfig()
config.ChunkingStrategy = mc.HierarchicalConfig(3) // Max depth of 3 levels

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

Configuration Options:

MaxDepth: Maximum heading level to consider (1-6)
MinDepth: Minimum heading level to start chunking
MergeEmpty: Whether to merge empty sections with parent
MinChunkSize: Minimum size for hierarchical chunks
MaxChunkSize: Maximum size before splitting hierarchical chunks

Use Cases:

Documentation with clear section structure
Books and articles with hierarchical organization
Content where context within sections is important

Document-Level Strategy

The document-level strategy treats the entire document as a single chunk.

config := mc.DefaultConfig()
config.ChunkingStrategy = mc.DocumentLevelConfig()

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

Use Cases:

Small documents that should be processed as a whole
Document classification tasks
When you need complete document context

Strategy Configuration Examples

Basic Strategy Usage

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    markdown := `# User Guide

Welcome to our application.

## Getting Started

Follow these steps to begin.

### Installation

Run the following command:

` + "```bash" + `
npm install our-app
` + "```" + `

### Configuration

Edit your config file:

` + "```json" + `
{
  "theme": "dark",
  "language": "en"
}
` + "```" + `

## Advanced Features

Learn about advanced functionality.

### Custom Themes

Create your own themes.

### Plugins

Extend functionality with plugins.`

    // Element-level strategy (default)
    fmt.Println("=== Element-Level Strategy ===")
    config1 := mc.DefaultConfig()
    config1.ChunkingStrategy = mc.ElementLevelConfig()
    
    chunker1 := mc.NewMarkdownChunkerWithConfig(config1)
    chunks1, _ := chunker1.ChunkDocument([]byte(markdown))
    
    fmt.Printf("Chunks: %d\n", len(chunks1))
    for i, chunk := range chunks1 {
        fmt.Printf("  %d. %s: %s\n", i+1, chunk.Type, 
            truncateText(chunk.Text, 50))
    }

    // Hierarchical strategy
    fmt.Println("\n=== Hierarchical Strategy (Max Depth 2) ===")
    config2 := mc.DefaultConfig()
    config2.ChunkingStrategy = mc.HierarchicalConfig(2)
    
    chunker2 := mc.NewMarkdownChunkerWithConfig(config2)
    chunks2, _ := chunker2.ChunkDocument([]byte(markdown))
    
    fmt.Printf("Chunks: %d\n", len(chunks2))
    for i, chunk := range chunks2 {
        fmt.Printf("  %d. %s (Level %d): %s\n", i+1, chunk.Type, 
            chunk.Level, truncateText(chunk.Text, 50))
    }

    // Document-level strategy
    fmt.Println("\n=== Document-Level Strategy ===")
    config3 := mc.DefaultConfig()
    config3.ChunkingStrategy = mc.DocumentLevelConfig()
    
    chunker3 := mc.NewMarkdownChunkerWithConfig(config3)
    chunks3, _ := chunker3.ChunkDocument([]byte(markdown))
    
    fmt.Printf("Chunks: %d\n", len(chunks3))
    fmt.Printf("  1. %s: %d characters\n", chunks3[0].Type, len(chunks3[0].Content))
}

func truncateText(text string, maxLen int) string {
    if len(text) <= maxLen {
        return text
    }
    return text[:maxLen] + "..."
}

Dynamic Strategy Switching

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    chunker := mc.NewMarkdownChunker()
    content := []byte(`# Document
    
## Section 1
Content for section 1.

## Section 2  
Content for section 2.`)

    // Start with element-level strategy
    chunks1, _ := chunker.ChunkDocument(content)
    fmt.Printf("Element-level: %d chunks\n", len(chunks1))

    // Switch to hierarchical strategy
    err := chunker.SetStrategy("hierarchical", mc.HierarchicalConfig(2))
    if err != nil {
        panic(err)
    }
    
    chunks2, _ := chunker.ChunkDocument(content)
    fmt.Printf("Hierarchical: %d chunks\n", len(chunks2))

    // Switch to document-level strategy
    err = chunker.SetStrategy("document-level", mc.DocumentLevelConfig())
    if err != nil {
        panic(err)
    }
    
    chunks3, _ := chunker.ChunkDocument(content)
    fmt.Printf("Document-level: %d chunks\n", len(chunks3))
}

Custom Strategies

You can create custom chunking strategies by implementing the ChunkingStrategy interface or using the strategy builder.

Using the Strategy Builder

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Create a custom strategy that:
    // 1. Creates separate chunks for H1 and H2 headings
    // 2. Merges paragraphs and lists with their parent heading
    // 3. Creates separate chunks for code blocks and tables
    
    builder := mc.NewCustomStrategyBuilder("content-focused", 
        "Groups content by importance, separating code and tables")
    
    // High priority: Separate chunks for major headings
    builder.AddRule(
        mc.HeadingLevelCondition{MinLevel: 1, MaxLevel: 2},
        mc.CreateSeparateChunkAction{},
        10,
    )
    
    // High priority: Separate chunks for code and tables
    builder.AddRule(
        mc.ContentTypeCondition{Types: []string{"code", "table"}},
        mc.CreateSeparateChunkAction{},
        9,
    )
    
    // Medium priority: Merge text content with parent
    builder.AddRule(
        mc.ContentTypeCondition{Types: []string{"paragraph", "list", "blockquote"}},
        mc.MergeWithParentAction{},
        5,
    )
    
    // Low priority: Skip minor headings (merge with parent)
    builder.AddRule(
        mc.HeadingLevelCondition{MinLevel: 3, MaxLevel: 6},
        mc.MergeWithParentAction{},
        3,
    )
    
    customStrategy := builder.Build()
    
    // Register and use the custom strategy
    chunker := mc.NewMarkdownChunker()
    err := chunker.RegisterStrategy(customStrategy)
    if err != nil {
        panic(err)
    }
    
    err = chunker.SetStrategy("content-focused", nil)
    if err != nil {
        panic(err)
    }
    
    markdown := `# Main Title

Introduction paragraph.

## Important Section

Some content here.

### Details

More details.

` + "```go" + `
func example() {
    fmt.Println("code")
}
` + "```" + `

| Feature | Status |
|---------|--------|
| Custom  | Active |`

    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }
    
    fmt.Printf("Custom strategy created %d chunks:\n", len(chunks))
    for i, chunk := range chunks {
        fmt.Printf("  %d. %s: %s\n", i+1, chunk.Type, 
            truncateText(chunk.Text, 60))
    }
}

Implementing a Custom Strategy Interface

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
    "github.com/yuin/goldmark/ast"
)

// CodeFocusedStrategy creates separate chunks for all code blocks
// and merges everything else by heading level
type CodeFocusedStrategy struct {
    config *mc.StrategyConfig
}

func (s *CodeFocusedStrategy) GetName() string {
    return "code-focused"
}

func (s *CodeFocusedStrategy) GetDescription() string {
    return "Separates all code blocks and groups other content hierarchically"
}

func (s *CodeFocusedStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *mc.MarkdownChunker) ([]mc.Chunk, error) {
    var chunks []mc.Chunk
    chunkID := 0
    
    // First pass: extract all code blocks as separate chunks
    ast.Walk(doc, func(node ast.Node, entering bool) (ast.WalkStatus, error) {
        if !entering {
            return ast.WalkContinue, nil
        }
        
        if codeBlock, ok := node.(*ast.FencedCodeBlock); ok {
            if chunk := chunker.ProcessCodeBlock(codeBlock, chunkID); chunk != nil {
                chunks = append(chunks, *chunk)
                chunkID++
            }
        }
        
        return ast.WalkContinue, nil
    })
    
    // Second pass: process remaining content hierarchically
    hierarchicalStrategy := &mc.HierarchicalStrategy{}
    hierarchicalChunks, err := hierarchicalStrategy.ChunkDocument(doc, source, chunker)
    if err != nil {
        return nil, err
    }
    
    // Filter out code blocks from hierarchical chunks and add them
    for _, chunk := range hierarchicalChunks {
        if chunk.Type != "code" {
            chunk.ID = chunkID
            chunks = append(chunks, chunk)
            chunkID++
        }
    }
    
    return chunks, nil
}

func (s *CodeFocusedStrategy) ValidateConfig(config *mc.StrategyConfig) error {
    // No specific validation needed for this strategy
    return nil
}

func (s *CodeFocusedStrategy) Clone() mc.ChunkingStrategy {
    return &CodeFocusedStrategy{
        config: s.config,
    }
}

func main() {
    chunker := mc.NewMarkdownChunker()
    
    // Register custom strategy
    customStrategy := &CodeFocusedStrategy{}
    err := chunker.RegisterStrategy(customStrategy)
    if err != nil {
        panic(err)
    }
    
    // Use custom strategy
    err = chunker.SetStrategy("code-focused", nil)
    if err != nil {
        panic(err)
    }
    
    markdown := `# API Documentation

This document describes our API.

## Authentication

Use API keys for authentication.

` + "```bash" + `
curl -H "Authorization: Bearer TOKEN" https://api.example.com
` + "```" + `

## Endpoints

### Users

Get user information:

` + "```javascript" + `
fetch('/api/users/123')
  .then(response => response.json())
  .then(data => console.log(data));
` + "```" + `

### Posts

Create a new post:

` + "```python" + `
import requests

response = requests.post('/api/posts', json={
    'title': 'My Post',
    'content': 'Post content here'
})
` + "```"

    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }
    
    fmt.Printf("Code-focused strategy created %d chunks:\n", len(chunks))
    for i, chunk := range chunks {
        fmt.Printf("  %d. %s", i+1, chunk.Type)
        if chunk.Type == "code" {
            if lang, exists := chunk.Metadata["language"]; exists {
                fmt.Printf(" (%s)", lang)
            }
        }
        fmt.Printf(": %s\n", truncateText(chunk.Text, 50))
    }
}

Strategy Best Practices

Choosing the Right Strategy

Element-Level Strategy
- Use for: Search indexing, fine-grained analysis, content without clear structure
- Pros: Maximum granularity, consistent chunk sizes
- Cons: May break logical content groupings
Hierarchical Strategy
- Use for: Documentation, books, structured content
- Pros: Preserves logical structure, maintains context
- Cons: Variable chunk sizes, may create very large chunks
Document-Level Strategy
- Use for: Small documents, document classification, full-context analysis
- Pros: Complete context, simple processing
- Cons: Not suitable for large documents, limited granularity
Custom Strategies
- Use for: Specific business requirements, specialized content types
- Pros: Tailored to exact needs, maximum flexibility
- Cons: Requires development effort, needs testing

Performance Considerations

// For high-performance scenarios, configure strategy caching
config := mc.DefaultConfig()
config.ChunkingStrategy = mc.HierarchicalConfig(3)
config.PerformanceMode = mc.PerformanceModeSpeedOptimized
config.EnableObjectPooling = true

// For memory-constrained environments
config.PerformanceMode = mc.PerformanceModeMemoryOptimized
config.MemoryLimit = 100 * 1024 * 1024 // 100MB limit

Error Handling with Strategies

config := mc.DefaultConfig()
config.ChunkingStrategy = mc.HierarchicalConfig(3)
config.ErrorHandling = mc.ErrorModePermissive // Continue on strategy errors

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

if err != nil {
    // Handle strategy-specific errors
    if chunkerErr, ok := err.(*mc.ChunkerError); ok {
        if chunkerErr.Type == mc.ErrorTypeStrategyExecutionFailed {
            fmt.Printf("Strategy error: %s\n", chunkerErr.Message)
            // Fallback to element-level strategy
            chunker.SetStrategy("element-level", mc.ElementLevelConfig())
            chunks, err = chunker.ChunkDocument(content)
        }
    }
}

Logging Usage

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Configure logging
    config := mc.DefaultConfig()
    config.EnableLog = true
    config.LogLevel = "DEBUG"        // DEBUG, INFO, WARN, ERROR
    config.LogFormat = "json"        // console, json
    config.LogDirectory = "./logs"   // Log file directory

    chunker := mc.NewMarkdownChunkerWithConfig(config)
    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }

    fmt.Printf("Processed %d chunks with detailed logging\n", len(chunks))
    fmt.Printf("Check %s directory for log files\n", config.LogDirectory)
}

Supported Content Types

Headings

Type: heading
Metadata: heading_level (1-6), word_count
Level: Heading level (1-6)
Enhanced Features: Position tracking, link/image extraction

Paragraphs

Type: paragraph
Metadata: word_count, char_count
Level: 0
Enhanced Features: Position tracking, link/image extraction, content hashing

Code Blocks

Type: code
Metadata: language, line_count
Level: 0
Enhanced Features: Code complexity analysis, syntax detection, position tracking

Tables

Type: table
Metadata: rows, columns, has_header, is_well_formed, alignments, cell_types, errors, error_count
Level: 0
Enhanced Features: Advanced table analysis, format validation, cell type detection

Lists

Type: list
Metadata: list_type (ordered/unordered), item_count
Level: 0
Enhanced Features: Nested list support, position tracking, link/image extraction

Blockquotes

Type: blockquote
Metadata: word_count
Level: 0
Enhanced Features: Nested blockquote support, position tracking, link/image extraction

Thematic Breaks

Type: thematic_break
Metadata: type (horizontal_rule)
Level: 0
Enhanced Features: Position tracking, content hashing

Configuration Options

The library provides extensive configuration options through the ChunkerConfig struct:

ChunkerConfig

type ChunkerConfig struct {
    MaxChunkSize        int                    // Maximum chunk size in characters (0 = unlimited)
    EnabledTypes        map[string]bool        // Enable/disable specific content types
    CustomExtractors    []MetadataExtractor    // Custom metadata extractors
    ErrorHandling       ErrorHandlingMode      // Error handling mode
    PerformanceMode     PerformanceMode        // Performance optimization mode
    FilterEmptyChunks   bool                   // Filter out empty chunks
    PreserveWhitespace  bool                   // Preserve whitespace in content
    MemoryLimit         int64                  // Memory usage limit in bytes
    EnableObjectPooling bool                   // Enable object pooling for performance
    
    // Strategy configuration
    ChunkingStrategy    *StrategyConfig        // Chunking strategy configuration
    
    // Logging configuration
    LogLevel            string                 // Log level: DEBUG, INFO, WARN, ERROR
    EnableLog           bool                   // Enable/disable logging
    LogFormat           string                 // Log format: console, json
    LogDirectory        string                 // Log file directory
}

StrategyConfig

type StrategyConfig struct {
    Name        string                 `json:"name"`         // Strategy name
    Parameters  map[string]interface{} `json:"parameters"`   // Strategy parameters
    
    // Hierarchical strategy specific
    MaxDepth    int  `json:"max_depth,omitempty"`     // Maximum heading level depth
    MinDepth    int  `json:"min_depth,omitempty"`     // Minimum heading level depth
    MergeEmpty  bool `json:"merge_empty,omitempty"`   // Merge empty sections
    
    // Size constraints
    MinChunkSize int `json:"min_chunk_size,omitempty"` // Minimum chunk size
    MaxChunkSize int `json:"max_chunk_size,omitempty"` // Maximum chunk size
    
    // Content filtering
    IncludeTypes []string `json:"include_types,omitempty"` // Include content types
    ExcludeTypes []string `json:"exclude_types,omitempty"` // Exclude content types
}

Error Handling Modes

const (
    ErrorModeStrict     ErrorHandlingMode = iota // Stop on first error
    ErrorModePermissive                          // Log errors but continue
    ErrorModeSilent                              // Ignore errors silently
)

Performance Modes

const (
    PerformanceModeDefault         PerformanceMode = iota
    PerformanceModeMemoryOptimized // Optimize for memory usage
    PerformanceModeSpeedOptimized  // Optimize for processing speed
)

API Reference

Types

Enhanced Chunk Structure

type Chunk struct {
    ID       int               `json:"id"`       // Unique chunk identifier
    Type     string            `json:"type"`     // Content type (heading, paragraph, etc.)
    Content  string            `json:"content"`  // Original Markdown content
    Text     string            `json:"text"`     // Plain text content
    Level    int               `json:"level"`    // Heading level (0 for non-headings)
    Metadata map[string]string `json:"metadata"` // Additional metadata
    
    // Enhanced fields
    Position ChunkPosition     `json:"position"` // Position in document
    Links    []Link           `json:"links"`    // Extracted links
    Images   []Image          `json:"images"`   // Extracted images
    Hash     string           `json:"hash"`     // Content hash for deduplication
}

Supporting Types

type ChunkPosition struct {
    StartLine int `json:"start_line"` // Starting line number
    EndLine   int `json:"end_line"`   // Ending line number
    StartCol  int `json:"start_col"`  // Starting column number
    EndCol    int `json:"end_col"`    // Ending column number
}

type Link struct {
    Text string `json:"text"` // Link text
    URL  string `json:"url"`  // Link URL
    Type string `json:"type"` // Link type: internal, external, anchor
}

type Image struct {
    Alt    string `json:"alt"`    // Alt text
    URL    string `json:"url"`    // Image URL
    Title  string `json:"title"`  // Image title
    Width  string `json:"width"`  // Image width (if available)
    Height string `json:"height"` // Image height (if available)
}

Core Functions

NewMarkdownChunker

func NewMarkdownChunker() *MarkdownChunker

Creates a new Markdown chunker instance with default configuration.

NewMarkdownChunkerWithConfig

func NewMarkdownChunkerWithConfig(config *ChunkerConfig) *MarkdownChunker

Creates a new Markdown chunker instance with custom configuration.

ChunkDocument

func (c *MarkdownChunker) ChunkDocument(content []byte) ([]Chunk, error)

Processes a Markdown document and returns an array of semantic chunks.

Strategy Management Functions

SetStrategy

func (c *MarkdownChunker) SetStrategy(strategyName string, config *StrategyConfig) error

Sets the chunking strategy for the chunker instance.

RegisterStrategy

func (c *MarkdownChunker) RegisterStrategy(strategy ChunkingStrategy) error

Registers a custom chunking strategy.

GetAvailableStrategies

func (c *MarkdownChunker) GetAvailableStrategies() []string

Returns a list of all available strategy names.

GetCurrentStrategy

func (c *MarkdownChunker) GetCurrentStrategy() string

Returns the name of the currently active strategy.

Strategy Configuration Functions

ElementLevelConfig

func ElementLevelConfig() *StrategyConfig

Creates configuration for element-level chunking strategy.

HierarchicalConfig

func HierarchicalConfig(maxDepth int) *StrategyConfig

Creates configuration for hierarchical chunking strategy with specified maximum depth.

DocumentLevelConfig

func DocumentLevelConfig() *StrategyConfig

Creates configuration for document-level chunking strategy.

CustomStrategyBuilder

func NewCustomStrategyBuilder(name, description string) *CustomStrategyBuilder

Creates a new builder for custom chunking strategies.

Error Handling Functions

func (c *MarkdownChunker) GetErrors() []*ChunkerError
func (c *MarkdownChunker) HasErrors() bool
func (c *MarkdownChunker) ClearErrors()
func (c *MarkdownChunker) GetErrorsByType(errorType ErrorType) []*ChunkerError

Performance Monitoring Functions

func (c *MarkdownChunker) GetPerformanceStats() PerformanceStats
func (c *MarkdownChunker) GetPerformanceMonitor() *PerformanceMonitor
func (c *MarkdownChunker) ResetPerformanceMonitor()

Utility Functions

func DefaultConfig() *ChunkerConfig
func ValidateConfig(config *ChunkerConfig) error

Logging Features

The library provides comprehensive logging capabilities to help with debugging, monitoring, and performance analysis.

Log Levels

DEBUG: Detailed information for debugging, including node processing and metadata extraction
INFO: General information about processing progress and results
WARN: Warning messages for potential issues
ERROR: Error messages for processing failures

Log Formats

console: Human-readable format suitable for development and debugging
json: Structured JSON format suitable for log aggregation and analysis

Logging Configuration

config := mc.DefaultConfig()

// Enable logging
config.EnableLog = true

// Set log level (DEBUG, INFO, WARN, ERROR)
config.LogLevel = "INFO"

// Set log format (console, json)
config.LogFormat = "console"

// Set log directory
config.LogDirectory = "./logs"

Logging Examples

Basic Logging

config := mc.DefaultConfig()
config.EnableLog = true
config.LogLevel = "INFO"
config.LogFormat = "console"
config.LogDirectory = "./logs"

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument([]byte(markdown))

Debug Logging with JSON Format

config := mc.DefaultConfig()
config.EnableLog = true
config.LogLevel = "DEBUG"
config.LogFormat = "json"
config.LogDirectory = "./debug-logs"

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument([]byte(markdown))

Error Logging

config := mc.DefaultConfig()
config.EnableLog = true
config.LogLevel = "ERROR"
config.LogFormat = "console"
config.LogDirectory = "./error-logs"
config.MaxChunkSize = 100  // Small limit to trigger errors
config.ErrorHandling = mc.ErrorModePermissive

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument([]byte(markdown))

// Errors are logged to files and can also be retrieved programmatically
if chunker.HasErrors() {
    for _, err := range chunker.GetErrors() {
        fmt.Printf("Error: %s - %s\n", err.Type.String(), err.Message)
    }
}

Log Output Examples

Console Format

2024-01-15 10:30:45 INFO  [chunker.go:123] Starting document processing
2024-01-15 10:30:45 DEBUG [chunker.go:145] Processing heading node: "Introduction"
2024-01-15 10:30:45 INFO  [chunker.go:234] Document processing completed: 15 chunks, 2.3ms

JSON Format

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "Starting document processing",
  "function": "ChunkDocument",
  "file": "chunker.go",
  "line": 123,
  "context": {
    "document_size": 1024,
    "config": {"max_chunk_size": 1000}
  }
}

Examples

Error Handling Example

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Configure for strict error handling
    config := mc.DefaultConfig()
    config.MaxChunkSize = 100
    config.ErrorHandling = mc.ErrorModeStrict

    chunker := mc.NewMarkdownChunkerWithConfig(config)
    
    longContent := `# Very Long Title That Exceeds The Maximum Chunk Size Limit
    
This is a very long paragraph that will definitely exceed the configured maximum chunk size limit and should trigger an error in strict mode.`

    chunks, err := chunker.ChunkDocument([]byte(longContent))
    if err != nil {
        if chunkerErr, ok := err.(*mc.ChunkerError); ok {
            fmt.Printf("Error Type: %s\n", chunkerErr.Type.String())
            fmt.Printf("Error Message: %s\n", chunkerErr.Message)
            fmt.Printf("Context: %+v\n", chunkerErr.Context)
        }
        return
    }

    fmt.Printf("Processed %d chunks\n", len(chunks))
}

Performance Monitoring Example

package main

import (
    "fmt"
    "time"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    config := mc.DefaultConfig()
    config.PerformanceMode = mc.PerformanceModeMemoryOptimized
    
    chunker := mc.NewMarkdownChunkerWithConfig(config)
    
    largeDocument := generateLargeMarkdown() // Your large document
    
    start := time.Now()
    chunks, err := chunker.ChunkDocument([]byte(largeDocument))
    if err != nil {
        panic(err)
    }
    
    // Get performance statistics
    stats := chunker.GetPerformanceStats()
    fmt.Printf("Processing Results:\n")
    fmt.Printf("  Chunks: %d\n", len(chunks))
    fmt.Printf("  Processing Time: %v\n", stats.ProcessingTime)
    fmt.Printf("  Memory Used: %d bytes\n", stats.MemoryUsed)
    fmt.Printf("  Chunks/Second: %.2f\n", stats.ChunksPerSecond)
    fmt.Printf("  Bytes/Second: %.2f\n", stats.BytesPerSecond)
    fmt.Printf("  Peak Memory: %d bytes\n", stats.PeakMemory)
}

Comprehensive Logging Example

package main

import (
    "fmt"
    "os"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Create configuration with comprehensive logging
    config := mc.DefaultConfig()
    
    // Enable detailed logging
    config.EnableLog = true
    config.LogLevel = "DEBUG"
    config.LogFormat = "json"
    config.LogDirectory = "./comprehensive-logs"
    
    // Configure processing options
    config.MaxChunkSize = 1000
    config.ErrorHandling = mc.ErrorModePermissive
    config.PerformanceMode = mc.PerformanceModeSpeedOptimized
    
    // Add metadata extractors for detailed logging
    config.CustomExtractors = []mc.MetadataExtractor{
        &mc.LinkExtractor{},
        &mc.ImageExtractor{},
        &mc.CodeComplexityExtractor{},
    }
    
    chunker := mc.NewMarkdownChunkerWithConfig(config)
    
    markdown := `# Logging Test Document

This document tests comprehensive logging features.

## Code Analysis

` + "```python" + `
def complex_algorithm(data):
    result = []
    for item in data:
        if item > 0:
            for i in range(item):
                if i % 2 == 0:
                    result.append(i * 2)
                else:
                    result.append(i * 3)
    return result
` + "```" + `

## Links and Images

Visit [our website](https://example.com) or check the ![logo](logo.png).

| Feature | Status | Link |
|---------|--------|------|
| Logging | Active | [docs](/logging) |
| Metrics | Beta | [metrics](/metrics) |`

    // Process with detailed logging
    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        fmt.Printf("Processing error: %v\n", err)
    }
    
    fmt.Printf("Processing Results:\n")
    fmt.Printf("  Chunks created: %d\n", len(chunks))
    fmt.Printf("  Log directory: %s\n", config.LogDirectory)
    
    // Display performance stats (also logged)
    stats := chunker.GetPerformanceStats()
    fmt.Printf("  Processing time: %v\n", stats.ProcessingTime)
    fmt.Printf("  Memory used: %d KB\n", stats.MemoryUsed/1024)
    
    // Show log files created
    if files, err := os.ReadDir(config.LogDirectory); err == nil {
        fmt.Printf("  Log files created:\n")
        for _, file := range files {
            if !file.IsDir() {
                fmt.Printf("    - %s\n", file.Name())
            }
        }
    }
    
    // Display any errors (also logged)
    if chunker.HasErrors() {
        fmt.Printf("  Errors encountered: %d\n", len(chunker.GetErrors()))
        for _, err := range chunker.GetErrors() {
            fmt.Printf("    - %s: %s\n", err.Type.String(), err.Message)
        }
    }
    
    fmt.Println("\nCheck the log files for detailed processing information:")
    fmt.Println("  - DEBUG logs show node processing details")
    fmt.Println("  - INFO logs show processing progress")
    fmt.Println("  - Performance metrics are logged")
    fmt.Println("  - Error details are logged with context")
}

Advanced Configuration Example

package main

import (
    "fmt"
    mc "github.com/kydenul/markdown-chunker"
)

func main() {
    // Create advanced configuration
    config := mc.DefaultConfig()
    
    // Only process specific content types
    config.EnabledTypes = map[string]bool{
        "heading":    true,
        "paragraph":  true,
        "code":       true,
        "table":      true,
        "list":       false,
        "blockquote": false,
    }
    
    // Configure logging
    config.EnableLog = true
    config.LogLevel = "INFO"
    config.LogFormat = "console"
    config.LogDirectory = "./processing-logs"
    
    // Add custom metadata extractors
    config.CustomExtractors = []mc.MetadataExtractor{
        &mc.LinkExtractor{},
        &mc.ImageExtractor{},
        &mc.CodeComplexityExtractor{},
    }
    
    // Configure error handling and performance
    config.ErrorHandling = mc.ErrorModePermissive
    config.PerformanceMode = mc.PerformanceModeSpeedOptimized
    config.FilterEmptyChunks = true
    config.MaxChunkSize = 2000
    
    chunker := mc.NewMarkdownChunkerWithConfig(config)
    
    markdown := `# Document with Links and Images

This paragraph contains a [link](https://example.com) and an ![image](image.jpg).

` + "```python" + `
def complex_function():
    for i in range(100):
        if i % 2 == 0:
            print(f"Even: {i}")
        else:
            print(f"Odd: {i}")
` + "```" + `

| Name | URL | Type |
|------|-----|------|
| Example | https://example.com | external |
| Internal | /page | internal |`

    chunks, err := chunker.ChunkDocument([]byte(markdown))
    if err != nil {
        panic(err)
    }
    
    for _, chunk := range chunks {
        fmt.Printf("\n=== %s Chunk ===\n", chunk.Type)
        fmt.Printf("Position: %d:%d to %d:%d\n", 
            chunk.Position.StartLine, chunk.Position.StartCol,
            chunk.Position.EndLine, chunk.Position.EndCol)
        
        if len(chunk.Links) > 0 {
            fmt.Printf("Links found: %d\n", len(chunk.Links))
            for _, link := range chunk.Links {
                fmt.Printf("  - %s (%s): %s\n", link.Text, link.Type, link.URL)
            }
        }
        
        if len(chunk.Images) > 0 {
            fmt.Printf("Images found: %d\n", len(chunk.Images))
            for _, img := range chunk.Images {
                fmt.Printf("  - %s: %s\n", img.Alt, img.URL)
            }
        }
        
        // Display custom metadata
        for key, value := range chunk.Metadata {
            if key == "code_complexity" || key == "link_count" || key == "image_count" {
                fmt.Printf("Custom metadata - %s: %s\n", key, value)
            }
        }
        
        fmt.Printf("Hash: %s\n", chunk.Hash[:16])
    }
    
    // Check for any processing errors
    if chunker.HasErrors() {
        fmt.Printf("\nProcessing errors: %d\n", len(chunker.GetErrors()))
        for _, err := range chunker.GetErrors() {
            fmt.Printf("  - %s: %s\n", err.Type.String(), err.Message)
        }
    }
    
    fmt.Printf("\nProcessing logged to: %s\n", config.LogDirectory)
}

Filtering and Analysis

// Filter chunks by type
func filterChunksByType(chunks []mc.Chunk, chunkType string) []mc.Chunk {
    var filtered []mc.Chunk
    for _, chunk := range chunks {
        if chunk.Type == chunkType {
            filtered = append(filtered, chunk)
        }
    }
    return filtered
}

// Analyze table structure
func analyzeTable(chunk mc.Chunk) {
    if chunk.Type != "table" {
        return
    }
    
    fmt.Printf("Table Analysis:\n")
    fmt.Printf("  Rows: %s\n", chunk.Metadata["rows"])
    fmt.Printf("  Columns: %s\n", chunk.Metadata["columns"])
    fmt.Printf("  Well-formed: %s\n", chunk.Metadata["is_well_formed"])
    
    if alignments, exists := chunk.Metadata["alignments"]; exists {
        fmt.Printf("  Alignments: %s\n", alignments)
    }
    
    if cellTypes, exists := chunk.Metadata["cell_types"]; exists {
        fmt.Printf("  Cell Types: %s\n", cellTypes)
    }
}

// Extract all links from document
func extractAllLinks(chunks []mc.Chunk) []mc.Link {
    var allLinks []mc.Link
    for _, chunk := range chunks {
        allLinks = append(allLinks, chunk.Links...)
    }
    return allLinks
}

Migration Guide

Upgrading to Strategy System

The chunking strategy system is fully backward compatible. Existing code will continue to work without changes, using the element-level strategy by default.

Existing Code (Still Works)

// This continues to work exactly as before
chunker := mc.NewMarkdownChunker()
chunks, err := chunker.ChunkDocument(content)

Migrating to Explicit Strategy Configuration

// Explicitly specify the same behavior as before
config := mc.DefaultConfig()
config.ChunkingStrategy = mc.ElementLevelConfig()

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

Upgrading to Hierarchical Strategy

// Switch to hierarchical chunking for better structure
config := mc.DefaultConfig()
config.ChunkingStrategy = mc.HierarchicalConfig(3) // Max 3 heading levels

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

Configuration Migration

If you have existing configuration files, you can migrate them using the built-in migration helper:

// Load your existing configuration
oldConfig := loadExistingConfig() // Your existing config loading

// Migrate to new format
newConfig := mc.MigrateConfig(oldConfig)

// The migrated config will have element-level strategy by default
chunker := mc.NewMarkdownChunkerWithConfig(newConfig)

Common Migration Patterns

From Fixed Processing to Strategy-Based

Before:

chunker := mc.NewMarkdownChunker()
chunks, err := chunker.ChunkDocument(content)

// Process all chunks the same way
for _, chunk := range chunks {
    processChunk(chunk)
}

After:

config := mc.DefaultConfig()

// Choose strategy based on content type
if isDocumentation(content) {
    config.ChunkingStrategy = mc.HierarchicalConfig(3)
} else if isSmallDocument(content) {
    config.ChunkingStrategy = mc.DocumentLevelConfig()
} else {
    config.ChunkingStrategy = mc.ElementLevelConfig()
}

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

// Process chunks with strategy-aware logic
for _, chunk := range chunks {
    processChunkWithStrategy(chunk, config.ChunkingStrategy.Name)
}

From Manual Grouping to Hierarchical Strategy

Before:

chunker := mc.NewMarkdownChunker()
chunks, err := chunker.ChunkDocument(content)

// Manually group chunks by headings
groupedChunks := groupChunksByHeading(chunks)

After:

config := mc.DefaultConfig()
config.ChunkingStrategy = mc.HierarchicalConfig(2) // Group by H1 and H2

chunker := mc.NewMarkdownChunkerWithConfig(config)
chunks, err := chunker.ChunkDocument(content)

// Chunks are already grouped hierarchically
for _, chunk := range chunks {
    if chunk.Type == "heading" {
        fmt.Printf("Section: %s (Level %d)\n", chunk.Text, chunk.Level)
    }
}

Breaking Changes

There are no breaking changes in the public API. All existing functions and types remain unchanged. The strategy system is additive and optional.

Performance Considerations During Migration

Element-level strategy: Same performance as before
Hierarchical strategy: Slightly higher memory usage due to structure building
Document-level strategy: Lower memory usage for small documents
Custom strategies: Performance depends on implementation

Testing Your Migration

func TestMigration(t *testing.T) {
    content := []byte(`# Test Document
    
## Section 1
Content here.

## Section 2  
More content.`)

    // Test that old behavior is preserved
    oldChunker := mc.NewMarkdownChunker()
    oldChunks, err := oldChunker.ChunkDocument(content)
    assert.NoError(t, err)

    // Test that explicit element-level config produces same results
    config := mc.DefaultConfig()
    config.ChunkingStrategy = mc.ElementLevelConfig()
    newChunker := mc.NewMarkdownChunkerWithConfig(config)
    newChunks, err := newChunker.ChunkDocument(content)
    assert.NoError(t, err)

    // Should produce identical results
    assert.Equal(t, len(oldChunks), len(newChunks))
    for i, oldChunk := range oldChunks {
        assert.Equal(t, oldChunk.Type, newChunks[i].Type)
        assert.Equal(t, oldChunk.Content, newChunks[i].Content)
        assert.Equal(t, oldChunk.Text, newChunks[i].Text)
    }
}

Use Cases

Documentation Processing: Break down large documentation into searchable chunks with precise position tracking
Content Analysis: Analyze document structure, content distribution, and extract metadata
RAG Systems: Prepare Markdown content for vector databases with enhanced metadata and deduplication
Content Migration: Convert Markdown documents to structured data with comprehensive error handling
Static Site Generation: Process Markdown files with advanced table processing and link extraction
Content Quality Assurance: Validate document structure and identify formatting issues
Performance-Critical Applications: Process large documents efficiently with memory optimization
Multi-language Documentation: Handle complex documents with configurable processing options

Dependencies

goldmark: CommonMark compliant Markdown parser

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the BSD-3 License - see the LICENSE file for details.

Metadata Extractors

The library includes several built-in metadata extractors and supports custom extractors:

Built-in Extractors

LinkExtractor

Extracts link information from content:

link_count: Number of links found
external_links: Count of external links
internal_links: Count of internal links
anchor_links: Count of anchor links

ImageExtractor

Extracts image information from content:

image_count: Number of images found
image_types: Types of images (by extension)

CodeComplexityExtractor

Analyzes code blocks for complexity:

code_complexity: Complexity score based on control structures
function_count: Number of functions detected
loop_count: Number of loops detected
conditional_count: Number of conditional statements

Custom Extractors

You can create custom metadata extractors by implementing the MetadataExtractor interface:

type CustomExtractor struct{}

func (e *CustomExtractor) Extract(node ast.Node, source []byte) map[string]string {
    metadata := make(map[string]string)
    // Your custom extraction logic here
    return metadata
}

func (e *CustomExtractor) SupportedTypes() []string {
    return []string{"heading", "paragraph"} // Specify supported types
}

Error Types

The library defines several error types for comprehensive error handling:

const (
    ErrorTypeInvalidInput    // Invalid or nil input
    ErrorTypeParsingFailed   // Markdown parsing failed
    ErrorTypeMemoryExhausted // Memory limit exceeded
    ErrorTypeTimeout         // Processing timeout
    ErrorTypeConfigInvalid   // Invalid configuration
    ErrorTypeChunkTooLarge   // Chunk exceeds size limit
)

Performance Optimization

Memory Optimization Features

Object Pooling: Reuse objects to reduce garbage collection
Streaming Processing: Process large documents without loading everything into memory
Memory Monitoring: Track memory usage and detect leaks
Configurable Limits: Set memory limits to prevent excessive usage

Performance Monitoring

The library provides detailed performance statistics:

type PerformanceStats struct {
    ProcessingTime  time.Duration // Total processing time
    MemoryUsed      int64         // Memory used during processing
    ChunksPerSecond float64       // Processing throughput
    BytesPerSecond  float64       // Byte processing rate
    TotalChunks     int           // Total chunks processed
    TotalBytes      int64         // Total bytes processed
    ChunkBytes      int64         // Total chunk content bytes
    PeakMemory      int64         // Peak memory usage
}

Changelog

v3.0.0 (Latest)

Flexible Chunking Strategies: Multiple built-in strategies (element-level, hierarchical, document-level)
Custom Strategy Support: Create custom chunking strategies using builder pattern or interface implementation
Strategy Registry: Register, manage, and switch between different chunking strategies
Hierarchical Chunking: Group content by heading levels with configurable depth and merging options
Document-Level Chunking: Process entire documents as single chunks for specific use cases
Dynamic Strategy Switching: Change strategies at runtime without recreating chunker instances
Strategy Configuration: Rich configuration options for each strategy type
Performance Optimization: Strategy-specific caching and memory optimization
Backward Compatibility: Full compatibility with existing code - no breaking changes
Migration Tools: Built-in configuration migration helpers

v2.1.0

Comprehensive Logging System: Configurable logging with multiple levels (DEBUG, INFO, WARN, ERROR)
Multiple Log Formats: Support for console and JSON log formats
Structured Logging: Rich context information including function names, line numbers, and processing metrics
Performance Logging: Detailed performance metrics and memory usage tracking
Error Context Logging: Enhanced error logging with full context information
Configurable Log Directory: Flexible log file location configuration
Integration with All Features: Logging integrated with error handling, performance monitoring, and metadata extraction

v2.0.0

Enhanced Configuration System: Flexible configuration with validation
Advanced Error Handling: Multiple error modes with detailed error information
Performance Monitoring: Built-in performance tracking and optimization
Enhanced Metadata Extraction: Extensible metadata system with link, image, and code analysis
Position Tracking: Precise position information for each chunk
Content Deduplication: SHA256-based content hashing
Memory Optimization: Object pooling and memory-efficient processing
Advanced Table Processing: Improved table analysis with format validation
Custom Extractors: Support for custom metadata extractors

v1.0.0

Initial release
Support for all major Markdown elements
GitHub Flavored Markdown support
Basic metadata extraction

Documentation ¶

Index ¶

func CreateMigrationGuide() string
func EnsureDefaultStrategyConfig(config *ChunkerConfig)
func IsLegacyConfig(config any) bool
func ValidateAndFillDefaults(config *StrategyConfig) error
func ValidateConfig(config *ChunkerConfig) error
type AdvancedTableProcessor
- func NewAdvancedTableProcessor(source []byte) *AdvancedTableProcessor
- func (p *AdvancedTableProcessor) ProcessTable(table *extast.Table) *TableInfo
type Chunk
type ChunkPool
- func NewChunkPool() *ChunkPool
- func (cp *ChunkPool) Get() *Chunk
- func (cp *ChunkPool) Put(chunk *Chunk)
- func (cp *ChunkPool) Reset()
type ChunkPosition
type ChunkerConfig
- func DefaultConfig() *ChunkerConfig
type ChunkerError
- func NewChunkerError(errorType ErrorType, message string, cause error) *ChunkerError
- func (e *ChunkerError) Error() string
- func (e *ChunkerError) Unwrap() error
- func (e *ChunkerError) WithContext(key string, value any) *ChunkerError
type ChunkerPool
- func NewChunkerPool(config *ChunkerConfig) *ChunkerPool
- func (cp *ChunkerPool) Get() *MarkdownChunker
- func (cp *ChunkerPool) Put(chunker *MarkdownChunker)
type ChunkingContext
- func NewChunkingContext(chunker *MarkdownChunker, source []byte) *ChunkingContext
- func (ctx *ChunkingContext) Clone() *ChunkingContext
- func (ctx *ChunkingContext) GetCustomData(key string) (any, bool)
- func (ctx *ChunkingContext) SetCustomData(key string, value any)
type ChunkingRule
- func NewChunkingRule(name, description string, condition RuleCondition, action RuleAction, ...) *ChunkingRule
- func (r *ChunkingRule) Clone() *ChunkingRule
- func (r *ChunkingRule) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)
- func (r *ChunkingRule) Match(node ast.Node, context *ChunkingContext) bool
- func (r *ChunkingRule) String() string
- func (r *ChunkingRule) Validate() error
type ChunkingStrategy
type CodeComplexityExtractor
- func (e *CodeComplexityExtractor) Extract(node ast.Node, source []byte) map[string]string
- func (e *CodeComplexityExtractor) SupportedTypes() []string
type ConcurrentChunker
- func NewConcurrentChunker(config *ChunkerConfig) *ConcurrentChunker
- func (cc *ConcurrentChunker) ChunkDocument(content []byte) ([]Chunk, error)
- func (cc *ConcurrentChunker) ChunkDocumentBatch(contents [][]byte, maxConcurrency int) ([][]Chunk, []error)
- func (cc *ConcurrentChunker) ChunkDocumentConcurrent(contents [][]byte) ([][]Chunk, []error)
- func (cc *ConcurrentChunker) ClearErrors()
- func (cc *ConcurrentChunker) GetErrors() []*ChunkerError
- func (cc *ConcurrentChunker) GetPerformanceStats() PerformanceStats
- func (cc *ConcurrentChunker) ProcessDocumentsConcurrently(contents [][]byte, maxConcurrency int) (*ConcurrentProcessingStats, [][]Chunk, []error)
type ConcurrentProcessingStats
type ConfigMigrationResult
- func MigrateConfig(config any) (*ConfigMigrationResult, error)
- func MigrateConfigWithLogger(config any, logger log.Logger) (*ConfigMigrationResult, error)
type ConfigVersion
- func GetConfigVersion(config any) ConfigVersion
type ContentSizeCondition
- func NewContentSizeCondition(minSize, maxSize int) *ContentSizeCondition
- func (c *ContentSizeCondition) Clone() RuleCondition
- func (c *ContentSizeCondition) GetDescription() string
- func (c *ContentSizeCondition) GetName() string
- func (c *ContentSizeCondition) Match(node ast.Node, context *ChunkingContext) bool
- func (c *ContentSizeCondition) Validate() error
type ContentTypeCondition
- func NewContentTypeCondition(types ...string) *ContentTypeCondition
- func (c *ContentTypeCondition) Clone() RuleCondition
- func (c *ContentTypeCondition) GetDescription() string
- func (c *ContentTypeCondition) GetName() string
- func (c *ContentTypeCondition) Match(node ast.Node, context *ChunkingContext) bool
- func (c *ContentTypeCondition) Validate() error
type CreateSeparateChunkAction
- func NewCreateSeparateChunkAction(chunkType string, metadata map[string]string) *CreateSeparateChunkAction
- func (a *CreateSeparateChunkAction) Clone() RuleAction
- func (a *CreateSeparateChunkAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)
- func (a *CreateSeparateChunkAction) GetDescription() string
- func (a *CreateSeparateChunkAction) GetName() string
- func (a *CreateSeparateChunkAction) Validate() error
type CustomStrategy
- func (s *CustomStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)
- func (s *CustomStrategy) Clone() ChunkingStrategy
- func (s *CustomStrategy) GetConfig() *StrategyConfig
- func (s *CustomStrategy) GetDescription() string
- func (s *CustomStrategy) GetEnabledRuleCount() int
- func (s *CustomStrategy) GetName() string
- func (s *CustomStrategy) GetRuleCount() int
- func (s *CustomStrategy) GetRules() []*ChunkingRule
- func (s *CustomStrategy) SetConfig(config *StrategyConfig) error
- func (s *CustomStrategy) String() string
- func (s *CustomStrategy) ValidateConfig(config *StrategyConfig) error
type CustomStrategyBuilder
- func NewContentTypeBasedStrategyBuilder(name string, separateTypes, mergeTypes []string) *CustomStrategyBuilder
- func NewCustomStrategyBuilder(name, description string) *CustomStrategyBuilder
- func NewHeadingBasedStrategyBuilder(name string, maxLevel int) *CustomStrategyBuilder
- func NewSizeBasedStrategyBuilder(name string, minSize, maxSize int) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) AddRule(name, description string, condition RuleCondition, action RuleAction, ...) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) AddRuleObject(rule *ChunkingRule) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) Build() (ChunkingStrategy, error)
- func (b *CustomStrategyBuilder) ClearRules() *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) Clone() *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) DisableRule(name string) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) EnableRule(name string) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) GetRule(name string) *ChunkingRule
- func (b *CustomStrategyBuilder) GetRuleCount() int
- func (b *CustomStrategyBuilder) GetRules() []*ChunkingRule
- func (b *CustomStrategyBuilder) HasRule(name string) bool
- func (b *CustomStrategyBuilder) RemoveRule(name string) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) SetConfig(config *StrategyConfig) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) SetDescription(description string) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) SetName(name string) *CustomStrategyBuilder
- func (b *CustomStrategyBuilder) String() string
- func (b *CustomStrategyBuilder) Validate() error
type DefaultErrorHandler
- func NewDefaultErrorHandler(mode ErrorHandlingMode) *DefaultErrorHandler
- func NewDefaultErrorHandlerWithLogger(mode ErrorHandlingMode, logger log.Logger) *DefaultErrorHandler
- func (h *DefaultErrorHandler) ClearErrors()
- func (h *DefaultErrorHandler) GetErrorCount() int
- func (h *DefaultErrorHandler) GetErrorCountByType(errorType ErrorType) int
- func (h *DefaultErrorHandler) GetErrors() []*ChunkerError
- func (h *DefaultErrorHandler) GetErrorsByType(errorType ErrorType) []*ChunkerError
- func (h *DefaultErrorHandler) HandleError(err *ChunkerError) error
- func (h *DefaultErrorHandler) HasErrors() bool
- func (h *DefaultErrorHandler) SetLogger(logger log.Logger)
type DepthCondition
- func NewDepthCondition(minDepth, maxDepth int) *DepthCondition
- func (c *DepthCondition) Clone() RuleCondition
- func (c *DepthCondition) GetDescription() string
- func (c *DepthCondition) GetName() string
- func (c *DepthCondition) Match(node ast.Node, context *ChunkingContext) bool
- func (c *DepthCondition) Validate() error
type DocumentLevelStrategy
- func NewDocumentLevelStrategy() *DocumentLevelStrategy
- func NewDocumentLevelStrategyWithConfig(config *StrategyConfig) *DocumentLevelStrategy
- func (s *DocumentLevelStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)
- func (s *DocumentLevelStrategy) Clone() ChunkingStrategy
- func (s *DocumentLevelStrategy) GetConfig() *StrategyConfig
- func (s *DocumentLevelStrategy) GetDescription() string
- func (s *DocumentLevelStrategy) GetName() string
- func (s *DocumentLevelStrategy) SetConfig(config *StrategyConfig) error
- func (s *DocumentLevelStrategy) ValidateConfig(config *StrategyConfig) error
type ElementLevelStrategy
- func NewElementLevelStrategy() *ElementLevelStrategy
- func NewElementLevelStrategyWithConfig(config *StrategyConfig) *ElementLevelStrategy
- func (s *ElementLevelStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)
- func (s *ElementLevelStrategy) Clone() ChunkingStrategy
- func (s *ElementLevelStrategy) GetConfig() *StrategyConfig
- func (s *ElementLevelStrategy) GetDescription() string
- func (s *ElementLevelStrategy) GetName() string
- func (s *ElementLevelStrategy) SetConfig(config *StrategyConfig) error
- func (s *ElementLevelStrategy) ValidateConfig(config *StrategyConfig) error
type ErrorHandler
type ErrorHandlingMode
type ErrorType
- func (et ErrorType) String() string
type HeadingLevelCondition
- func NewHeadingLevelCondition(minLevel, maxLevel int) *HeadingLevelCondition
- func (c *HeadingLevelCondition) Clone() RuleCondition
- func (c *HeadingLevelCondition) GetDescription() string
- func (c *HeadingLevelCondition) GetName() string
- func (c *HeadingLevelCondition) Match(node ast.Node, context *ChunkingContext) bool
- func (c *HeadingLevelCondition) Validate() error
type HierarchicalChunk
type HierarchicalStrategy
- func NewHierarchicalStrategy() *HierarchicalStrategy
- func NewHierarchicalStrategyWithConfig(config *StrategyConfig) *HierarchicalStrategy
- func (s *HierarchicalStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)
- func (s *HierarchicalStrategy) Clone() ChunkingStrategy
- func (s *HierarchicalStrategy) GetConfig() *StrategyConfig
- func (s *HierarchicalStrategy) GetDescription() string
- func (s *HierarchicalStrategy) GetName() string
- func (s *HierarchicalStrategy) SetConfig(config *StrategyConfig) error
- func (s *HierarchicalStrategy) ValidateConfig(config *StrategyConfig) error
type Image
type ImageExtractor
- func (e *ImageExtractor) Extract(node ast.Node, source []byte) map[string]string
- func (e *ImageExtractor) SupportedTypes() []string
type LegacyChunkerConfig
type Link
type LinkExtractor
- func (e *LinkExtractor) Extract(node ast.Node, source []byte) map[string]string
- func (e *LinkExtractor) SupportedTypes() []string
type LogContext
- func NewLogContext(functionName string) *LogContext
- func (lc *LogContext) ToLogFields() []any
- func (lc *LogContext) WithCodeInfo(language string, lineCount int, codeBlockType string) *LogContext
- func (lc *LogContext) WithContentInfo(contentLength, textLength, wordCount int) *LogContext
- func (lc *LogContext) WithDocumentInfo(documentSize, chunkCount int) *LogContext
- func (lc *LogContext) WithHeadingInfo(level, wordCount int) *LogContext
- func (lc *LogContext) WithLinksAndImages(linksCount, imagesCount int) *LogContext
- func (lc *LogContext) WithListInfo(listType string, itemCount int) *LogContext
- func (lc *LogContext) WithMetadata(key string, value any) *LogContext
- func (lc *LogContext) WithNodeInfo(nodeType string, nodeID int) *LogContext
- func (lc *LogContext) WithPositionInfo(startLine, endLine, startCol, endCol int) *LogContext
- func (lc *LogContext) WithProcessTime(duration time.Duration) *LogContext
- func (lc *LogContext) WithTableInfo(rowCount, columnCount int, isWellFormed bool) *LogContext
type MarkdownChunker
- func NewMarkdownChunker() *MarkdownChunker
- func NewMarkdownChunkerWithConfig(config *ChunkerConfig) *MarkdownChunker
- func NewMarkdownChunkerWithHierarchicalStrategy(maxDepth int) *MarkdownChunker
- func NewMarkdownChunkerWithStrategy(strategyName string) *MarkdownChunker
- func (c *MarkdownChunker) ChunkDocument(content []byte) ([]Chunk, error)
- func (c *MarkdownChunker) ClearErrors()
- func (c *MarkdownChunker) ClearStrategyCache()
- func (c *MarkdownChunker) GetAvailableStrategies() []string
- func (c *MarkdownChunker) GetCacheStats() map[string]any
- func (c *MarkdownChunker) GetCurrentStrategy() (string, string)
- func (c *MarkdownChunker) GetErrors() []*ChunkerError
- func (c *MarkdownChunker) GetErrorsByType(errorType ErrorType) []*ChunkerError
- func (c *MarkdownChunker) GetPerformanceMonitor() *PerformanceMonitor
- func (c *MarkdownChunker) GetPerformanceStats() PerformanceStats
- func (c *MarkdownChunker) GetStrategyConfig() *StrategyConfig
- func (c *MarkdownChunker) GetStrategyCount() int
- func (c *MarkdownChunker) HasErrors() bool
- func (c *MarkdownChunker) HasStrategy(strategyName string) bool
- func (c *MarkdownChunker) RegisterStrategy(strategy ChunkingStrategy) error
- func (c *MarkdownChunker) ResetPerformanceMonitor()
- func (c *MarkdownChunker) SetStrategy(strategyName string, config *StrategyConfig) error
- func (c *MarkdownChunker) UnregisterStrategy(strategyName string) error
- func (c *MarkdownChunker) UpdateStrategyConfig(config *StrategyConfig) error
type MemoryLimiter
- func NewMemoryLimiter(maxMemoryBytes int64) *MemoryLimiter
- func (ml *MemoryLimiter) CheckMemoryLimit() error
- func (ml *MemoryLimiter) GetCurrentMemoryUsage() int64
- func (ml *MemoryLimiter) GetMemoryLimit() int64
- func (ml *MemoryLimiter) SetLogger(logger log.Logger)
- func (ml *MemoryLimiter) SetMemoryLimit(maxMemoryBytes int64)
type MemoryOptimizer
- func NewMemoryOptimizer(memoryLimit int64) *MemoryOptimizer
- func (mo *MemoryOptimizer) CheckMemoryLimit() error
- func (mo *MemoryOptimizer) ForceGC()
- func (mo *MemoryOptimizer) GetChunk() *Chunk
- func (mo *MemoryOptimizer) GetGCThreshold() int64
- func (mo *MemoryOptimizer) GetMemoryStats() MemoryOptimizerStats
- func (mo *MemoryOptimizer) GetStringBuilder() *strings.Builder
- func (mo *MemoryOptimizer) PutChunk(chunk *Chunk)
- func (mo *MemoryOptimizer) PutStringBuilder(sb *strings.Builder)
- func (mo *MemoryOptimizer) RecordProcessedBytes(bytes int64)
- func (mo *MemoryOptimizer) Reset()
- func (mo *MemoryOptimizer) SetGCThreshold(threshold int64)
- func (mo *MemoryOptimizer) SetLogger(logger log.Logger)
type MemoryOptimizerStats
type MergeWithParentAction
- func NewMergeWithParentAction(separator string) *MergeWithParentAction
- func (a *MergeWithParentAction) Clone() RuleAction
- func (a *MergeWithParentAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)
- func (a *MergeWithParentAction) GetDescription() string
- func (a *MergeWithParentAction) GetName() string
- func (a *MergeWithParentAction) Validate() error
type MetadataExtractor
type ObjectPool
type OptimizedStringOperations
- func NewOptimizedStringOperations() *OptimizedStringOperations
- func (oso *OptimizedStringOperations) BuildContent(parts ...string) string
- func (oso *OptimizedStringOperations) JoinStrings(strs []string, separator string) string
- func (oso *OptimizedStringOperations) TrimAndClean(text string) string
type PerformanceMode
type PerformanceMonitor
- func NewPerformanceMonitor() *PerformanceMonitor
- func (pm *PerformanceMonitor) CheckMemoryThresholds()
- func (pm *PerformanceMonitor) ForceGC()
- func (pm *PerformanceMonitor) GetMemoryStats() runtime.MemStats
- func (pm *PerformanceMonitor) GetStats() PerformanceStats
- func (pm *PerformanceMonitor) IsRunning() bool
- func (pm *PerformanceMonitor) RecordBytes(bytes int64)
- func (pm *PerformanceMonitor) RecordChunk(chunk *Chunk)
- func (pm *PerformanceMonitor) RecordStrategyExecution(strategyName string, executionTime time.Duration, chunksGenerated int)
- func (pm *PerformanceMonitor) Reset()
- func (pm *PerformanceMonitor) SetLogger(logger log.Logger)
- func (pm *PerformanceMonitor) Start()
- func (pm *PerformanceMonitor) Stop()
type PerformanceStats
type ProcessingJob
type ProcessingResult
type RuleAction
type RuleCondition
type SkipNodeAction
- func NewSkipNodeAction(reason string) *SkipNodeAction
- func (a *SkipNodeAction) Clone() RuleAction
- func (a *SkipNodeAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)
- func (a *SkipNodeAction) GetDescription() string
- func (a *SkipNodeAction) GetName() string
- func (a *SkipNodeAction) Validate() error
type StrategyCache
- func NewStrategyCache() *StrategyCache
- func (sc *StrategyCache) Clear()
- func (sc *StrategyCache) Get(name string) (ChunkingStrategy, bool)
- func (sc *StrategyCache) Keys() []string
- func (sc *StrategyCache) Put(name string, strategy ChunkingStrategy)
- func (sc *StrategyCache) Remove(name string)
- func (sc *StrategyCache) Size() int
type StrategyConfig
- func CreateConfigFromParameters(strategyName string, params map[string]any) (*StrategyConfig, error)
- func DefaultStrategyConfig(name string) *StrategyConfig
- func DocumentLevelConfig() *StrategyConfig
- func DocumentLevelConfigWithSize(minSize, maxSize int) *StrategyConfig
- func ElementLevelConfig() *StrategyConfig
- func ElementLevelConfigWithSize(minSize, maxSize int) *StrategyConfig
- func ElementLevelConfigWithTypes(includeTypes, excludeTypes []string) *StrategyConfig
- func HierarchicalConfig(maxDepth int) *StrategyConfig
- func HierarchicalConfigAdvanced(maxDepth, minDepth int, mergeEmpty bool) *StrategyConfig
- func HierarchicalConfigWithSize(maxDepth, minSize, maxSize int) *StrategyConfig
- func MergeConfigs(base, override *StrategyConfig) (*StrategyConfig, error)
- func (sc *StrategyConfig) Clone() *StrategyConfig
- func (sc *StrategyConfig) String() string
- func (sc *StrategyConfig) ValidateConfig() error
type StrategyPool
- func NewStrategyPool() *StrategyPool
- func (sp *StrategyPool) Clear()
- func (sp *StrategyPool) CreatePool(strategyName string, factory func() ChunkingStrategy)
- func (sp *StrategyPool) Get(strategyName string, factory func() ChunkingStrategy) ChunkingStrategy
- func (sp *StrategyPool) GetPoolCount() int
- func (sp *StrategyPool) HasPool(strategyName string) bool
- func (sp *StrategyPool) Put(strategy ChunkingStrategy)
- func (sp *StrategyPool) RemovePool(strategyName string)
type StrategyRegistry
- func NewStrategyRegistry() *StrategyRegistry
- func (sr *StrategyRegistry) Get(name string) (ChunkingStrategy, error)
- func (sr *StrategyRegistry) GetStrategyCount() int
- func (sr *StrategyRegistry) HasStrategy(name string) bool
- func (sr *StrategyRegistry) List() []string
- func (sr *StrategyRegistry) Register(strategy ChunkingStrategy) error
- func (sr *StrategyRegistry) Unregister(name string) error
type StringBuilderPool
- func NewStringBuilderPool() *StringBuilderPool
- func (sbp *StringBuilderPool) Get() *strings.Builder
- func (sbp *StringBuilderPool) Put(sb *strings.Builder)
type TableInfo
- func (info *TableInfo) GetTableMetadata() map[string]string
type WorkerPool
- func NewWorkerPool(workers int, config *ChunkerConfig) *WorkerPool
- func (wp *WorkerPool) GetResult() ProcessingResult
- func (wp *WorkerPool) ProcessBatch(contents [][]byte) ([][]Chunk, []error)
- func (wp *WorkerPool) Start()
- func (wp *WorkerPool) Stop()
- func (wp *WorkerPool) Submit(job ProcessingJob)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CreateMigrationGuide ¶ added in v1.2.0

func CreateMigrationGuide() string

CreateMigrationGuide 创建迁移指南

func EnsureDefaultStrategyConfig ¶ added in v1.2.0

func EnsureDefaultStrategyConfig(config *ChunkerConfig)

EnsureDefaultStrategyConfig 确保配置中有有效的策略配置

func IsLegacyConfig ¶ added in v1.2.0

func IsLegacyConfig(config any) bool

IsLegacyConfig 检查配置是否为旧版本配置

func ValidateAndFillDefaults ¶ added in v1.2.0

func ValidateAndFillDefaults(config *StrategyConfig) error

ValidateAndFillDefaults 验证策略配置并填充默认值

func ValidateConfig ¶

func ValidateConfig(config *ChunkerConfig) error

ValidateConfig 验证配置的有效性

Types ¶

type AdvancedTableProcessor ¶

type AdvancedTableProcessor struct {
	// contains filtered or unexported fields
}

AdvancedTableProcessor 高级表格处理器

func NewAdvancedTableProcessor ¶

func NewAdvancedTableProcessor(source []byte) *AdvancedTableProcessor

NewAdvancedTableProcessor 创建高级表格处理器

func (*AdvancedTableProcessor) ProcessTable ¶

func (p *AdvancedTableProcessor) ProcessTable(table *extast.Table) *TableInfo

ProcessTable 处理表格并返回详细信息

type Chunk ¶

type Chunk struct {
	ID       int               `json:"id"`
	Type     string            `json:"type"`    // heading, paragraph, table, code, list
	Content  string            `json:"content"` // 原始 markdown 内容
	Text     string            `json:"text"`    // 纯文本内容，用于向量化
	Level    int               `json:"level"`   // 标题层级 (仅对 heading 有效)
	Metadata map[string]string `json:"metadata"`

	Position ChunkPosition `json:"position"` // 在文档中的位置
	Links    []Link        `json:"links"`    // 包含的链接
	Images   []Image       `json:"images"`   // 包含的图片
	Hash     string        `json:"hash"`     // 内容哈希，用于去重
}

Chunk 表示分块后的内容

type ChunkPool ¶

type ChunkPool struct {
	// contains filtered or unexported fields
}

ChunkPool 块对象池

func NewChunkPool ¶

func NewChunkPool() *ChunkPool

NewChunkPool 创建新的块对象池

func (*ChunkPool) Get ¶

func (cp *ChunkPool) Get() *Chunk

Get 从池中获取一个块对象

func (*ChunkPool) Put ¶

func (cp *ChunkPool) Put(chunk *Chunk)

Put 将块对象放回池中

func (*ChunkPool) Reset ¶

func (cp *ChunkPool) Reset()

Reset 重置池（清空所有对象）

type ChunkPosition ¶

type ChunkPosition struct {
	StartLine int `json:"start_line"` // 起始行号（从1开始）
	EndLine   int `json:"end_line"`   // 结束行号（从1开始）
	StartCol  int `json:"start_col"`  // 起始列号（从1开始）
	EndCol    int `json:"end_col"`    // 结束列号（从1开始）
}

ChunkPosition 表示块在文档中的位置

type ChunkerConfig ¶

type ChunkerConfig struct {
	// MaxChunkSize 最大块大小（字符数），0表示无限制
	MaxChunkSize int

	// EnabledTypes 启用的内容类型，nil表示启用所有类型
	EnabledTypes map[string]bool

	// CustomExtractors 自定义元数据提取器
	CustomExtractors []MetadataExtractor

	// ErrorHandling 错误处理模式
	ErrorHandling ErrorHandlingMode

	// PerformanceMode 性能模式
	PerformanceMode PerformanceMode

	// FilterEmptyChunks 是否过滤空块
	FilterEmptyChunks bool

	// PreserveWhitespace 是否保留空白字符
	PreserveWhitespace bool

	// MemoryLimit 内存使用限制（字节），0表示无限制
	MemoryLimit int64

	// EnableObjectPooling 是否启用对象池化
	EnableObjectPooling bool

	// 日志配置
	LogLevel     string `json:"log_level"`     // DEBUG, INFO, WARN, ERROR
	EnableLog    bool   `json:"enable_log"`    // 是否启用日志
	LogFormat    string `json:"log_format"`    // 日志格式 (json, console)
	LogDirectory string `json:"log_directory"` // 日志文件目录

	// 策略配置
	ChunkingStrategy *StrategyConfig `json:"chunking_strategy,omitempty"` // 分块策略配置
}

ChunkerConfig 分块器配置

func DefaultConfig ¶

func DefaultConfig() *ChunkerConfig

DefaultConfig 返回默认配置

type ChunkerError ¶

type ChunkerError struct {
	Type      ErrorType      `json:"type"`
	Message   string         `json:"message"`
	Context   map[string]any `json:"context"`
	Cause     error          `json:"cause,omitempty"`
	Timestamp time.Time      `json:"timestamp"`
}

ChunkerError 分块器错误

func NewChunkerError ¶

func NewChunkerError(errorType ErrorType, message string, cause error) *ChunkerError

NewChunkerError 创建新的分块器错误

func (*ChunkerError) Error ¶

func (e *ChunkerError) Error() string

Error 实现 error 接口

func (*ChunkerError) Unwrap ¶

func (e *ChunkerError) Unwrap() error

Unwrap 实现 errors.Unwrap 接口

func (*ChunkerError) WithContext ¶

func (e *ChunkerError) WithContext(key string, value any) *ChunkerError

WithContext 添加上下文信息

type ChunkerPool ¶

type ChunkerPool struct {
	// contains filtered or unexported fields
}

ChunkerPool 分块器对象池，用于并发处理

func NewChunkerPool ¶

func NewChunkerPool(config *ChunkerConfig) *ChunkerPool

NewChunkerPool 创建新的分块器对象池

func (*ChunkerPool) Get ¶

func (cp *ChunkerPool) Get() *MarkdownChunker

Get 从池中获取一个分块器实例

func (*ChunkerPool) Put ¶

func (cp *ChunkerPool) Put(chunker *MarkdownChunker)

Put 将分块器实例放回池中

type ChunkingContext ¶ added in v1.2.0

type ChunkingContext struct {
	CurrentChunk   *Chunk           // 当前正在处理的块
	PreviousChunk  *Chunk           // 前一个块
	ParentNode     ast.Node         // 父节点
	Depth          int              // 当前深度
	ChunkCount     int              // 已处理的块数量
	TotalNodes     int              // 总节点数
	Source         []byte           // 源文档内容
	Chunker        *MarkdownChunker // 分块器实例
	CustomData     map[string]any   // 自定义数据
	ProcessingTime time.Duration    // 处理时间
}

ChunkingContext 分块上下文，提供分块过程中的状态信息

func NewChunkingContext ¶ added in v1.2.0

func NewChunkingContext(chunker *MarkdownChunker, source []byte) *ChunkingContext

NewChunkingContext 创建新的分块上下文

func (*ChunkingContext) Clone ¶ added in v1.2.0

func (ctx *ChunkingContext) Clone() *ChunkingContext

Clone 创建上下文的副本

func (*ChunkingContext) GetCustomData ¶ added in v1.2.0

func (ctx *ChunkingContext) GetCustomData(key string) (any, bool)

GetCustomData 获取自定义数据

func (*ChunkingContext) SetCustomData ¶ added in v1.2.0

func (ctx *ChunkingContext) SetCustomData(key string, value any)

SetCustomData 设置自定义数据

type ChunkingRule ¶ added in v1.2.0

type ChunkingRule struct {
	Name        string        `json:"name"`        // 规则名称
	Description string        `json:"description"` // 规则描述
	Condition   RuleCondition `json:"-"`           // 规则条件（不序列化）
	Action      RuleAction    `json:"-"`           // 规则动作（不序列化）
	Priority    int           `json:"priority"`    // 规则优先级（数值越大优先级越高）
	Enabled     bool          `json:"enabled"`     // 是否启用
}

ChunkingRule 分块规则将条件和动作组合成完整的规则

func NewChunkingRule ¶ added in v1.2.0

func NewChunkingRule(name, description string, condition RuleCondition, action RuleAction, priority int) *ChunkingRule

NewChunkingRule 创建新的分块规则

func (*ChunkingRule) Clone ¶ added in v1.2.0

func (r *ChunkingRule) Clone() *ChunkingRule

Clone 创建规则的副本

func (*ChunkingRule) Execute ¶ added in v1.2.0

func (r *ChunkingRule) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)

Execute 执行规则动作

func (*ChunkingRule) Match ¶ added in v1.2.0

func (r *ChunkingRule) Match(node ast.Node, context *ChunkingContext) bool

Match 检查规则是否匹配

func (*ChunkingRule) String ¶ added in v1.2.0

func (r *ChunkingRule) String() string

String 返回规则的字符串表示

func (*ChunkingRule) Validate ¶ added in v1.2.0

func (r *ChunkingRule) Validate() error

Validate 验证规则配置

type ChunkingStrategy ¶ added in v1.2.0

type ChunkingStrategy interface {
	// GetName 返回策略名称
	GetName() string

	// GetDescription 返回策略描述
	GetDescription() string

	// ChunkDocument 使用该策略对文档进行分块
	ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)

	// ValidateConfig 验证策略特定的配置
	ValidateConfig(config *StrategyConfig) error

	// Clone 创建策略的副本（用于并发安全）
	Clone() ChunkingStrategy
}

ChunkingStrategy 定义分块策略的核心接口

type CodeComplexityExtractor ¶

type CodeComplexityExtractor struct{}

CodeComplexityExtractor 代码复杂度分析提取器

func (*CodeComplexityExtractor) Extract ¶

func (e *CodeComplexityExtractor) Extract(node ast.Node, source []byte) map[string]string

Extract 提取代码复杂度信息

func (*CodeComplexityExtractor) SupportedTypes ¶

func (e *CodeComplexityExtractor) SupportedTypes() []string

SupportedTypes 返回支持的内容类型

type ConcurrentChunker ¶

type ConcurrentChunker struct {
	// contains filtered or unexported fields
}

ConcurrentChunker 并发安全的分块器包装器

func NewConcurrentChunker ¶

func NewConcurrentChunker(config *ChunkerConfig) *ConcurrentChunker

NewConcurrentChunker 创建新的并发安全分块器

func (*ConcurrentChunker) ChunkDocument ¶

func (cc *ConcurrentChunker) ChunkDocument(content []byte) ([]Chunk, error)

ChunkDocument 线程安全的文档分块方法

func (*ConcurrentChunker) ChunkDocumentBatch ¶

func (cc *ConcurrentChunker) ChunkDocumentBatch(contents [][]byte, maxConcurrency int) ([][]Chunk, []error)

ChunkDocumentBatch 批量处理文档，支持并发控制

func (*ConcurrentChunker) ChunkDocumentConcurrent ¶

func (cc *ConcurrentChunker) ChunkDocumentConcurrent(contents [][]byte) ([][]Chunk, []error)

ChunkDocumentConcurrent 并发处理多个文档

func (*ConcurrentChunker) ClearErrors ¶

func (cc *ConcurrentChunker) ClearErrors()

ClearErrors 清除错误信息（线程安全）

func (*ConcurrentChunker) GetErrors ¶

func (cc *ConcurrentChunker) GetErrors() []*ChunkerError

GetErrors 获取错误信息（线程安全）

func (*ConcurrentChunker) GetPerformanceStats ¶

func (cc *ConcurrentChunker) GetPerformanceStats() PerformanceStats

GetPerformanceStats 获取性能统计信息（线程安全）

func (*ConcurrentChunker) ProcessDocumentsConcurrently ¶

func (cc *ConcurrentChunker) ProcessDocumentsConcurrently(contents [][]byte, maxConcurrency int) (*ConcurrentProcessingStats, [][]Chunk, []error)

ProcessDocumentsConcurrently 并发处理文档并收集统计信息

type ConcurrentProcessingStats ¶

type ConcurrentProcessingStats struct {
	TotalDocuments     int           `json:"total_documents"`     // 总文档数
	ProcessedDocuments int           `json:"processed_documents"` // 已处理文档数
	FailedDocuments    int           `json:"failed_documents"`    // 失败文档数
	TotalChunks        int           `json:"total_chunks"`        // 总块数
	ProcessingTime     time.Duration `json:"processing_time"`     // 总处理时间
	AverageTime        time.Duration `json:"average_time"`        // 平均处理时间
	Concurrency        int           `json:"concurrency"`         // 并发度
	ThroughputDocs     float64       `json:"throughput_docs"`     // 文档吞吐量（文档/秒）
	ThroughputChunks   float64       `json:"throughput_chunks"`   // 块吞吐量（块/秒）
}

ConcurrentProcessingStats 并发处理统计信息

type ConfigMigrationResult ¶ added in v1.2.0

type ConfigMigrationResult struct {
	// 迁移后的配置
	Config *ChunkerConfig `json:"config"`
	// 是否进行了迁移
	Migrated bool `json:"migrated"`
	// 原始版本
	OriginalVersion ConfigVersion `json:"original_version"`
	// 目标版本
	TargetVersion ConfigVersion `json:"target_version"`
	// 迁移警告
	Warnings []string `json:"warnings"`
	// 迁移说明
	Notes []string `json:"notes"`
}

ConfigMigrationResult 配置迁移结果

func MigrateConfig ¶ added in v1.2.0

func MigrateConfig(config any) (*ConfigMigrationResult, error)

MigrateConfig 迁移配置到最新版本这个函数处理从旧版本配置到新版本配置的迁移

func MigrateConfigWithLogger ¶ added in v1.2.0

func MigrateConfigWithLogger(config any, logger log.Logger) (*ConfigMigrationResult, error)

MigrateConfigWithLogger 带日志记录的配置迁移

type ConfigVersion ¶ added in v1.2.0

type ConfigVersion string

ConfigVersion 表示配置版本

const (
	// ConfigVersionV1 版本1配置（策略系统之前）
	ConfigVersionV1 ConfigVersion = "v1"
	// ConfigVersionV2 版本2配置（策略系统）
	ConfigVersionV2 ConfigVersion = "v2"
)

func GetConfigVersion ¶ added in v1.2.0

func GetConfigVersion(config any) ConfigVersion

GetConfigVersion 获取配置版本

type ContentSizeCondition ¶ added in v1.2.0

type ContentSizeCondition struct {
	MinSize int `json:"min_size"` // 最小内容大小（字符数）
	MaxSize int `json:"max_size"` // 最大内容大小（字符数，0表示无限制）
}

ContentSizeCondition 内容大小条件匹配指定大小范围内的内容

func NewContentSizeCondition ¶ added in v1.2.0

func NewContentSizeCondition(minSize, maxSize int) *ContentSizeCondition

NewContentSizeCondition 创建内容大小条件

func (*ContentSizeCondition) Clone ¶ added in v1.2.0

func (c *ContentSizeCondition) Clone() RuleCondition

Clone 创建条件的副本

func (*ContentSizeCondition) GetDescription ¶ added in v1.2.0

func (c *ContentSizeCondition) GetDescription() string

GetDescription 返回条件描述

func (*ContentSizeCondition) GetName ¶ added in v1.2.0

func (c *ContentSizeCondition) GetName() string

GetName 返回条件名称

func (*ContentSizeCondition) Match ¶ added in v1.2.0

func (c *ContentSizeCondition) Match(node ast.Node, context *ChunkingContext) bool

Match 检查节点是否匹配内容大小条件

func (*ContentSizeCondition) Validate ¶ added in v1.2.0

func (c *ContentSizeCondition) Validate() error

Validate 验证条件配置

type ContentTypeCondition ¶ added in v1.2.0

type ContentTypeCondition struct {
	Types []string `json:"types"` // 允许的内容类型列表
}

ContentTypeCondition 内容类型条件匹配指定类型的内容节点

func NewContentTypeCondition ¶ added in v1.2.0

func NewContentTypeCondition(types ...string) *ContentTypeCondition

NewContentTypeCondition 创建内容类型条件

func (*ContentTypeCondition) Clone ¶ added in v1.2.0

func (c *ContentTypeCondition) Clone() RuleCondition

Clone 创建条件的副本

func (*ContentTypeCondition) GetDescription ¶ added in v1.2.0

func (c *ContentTypeCondition) GetDescription() string

GetDescription 返回条件描述

func (*ContentTypeCondition) GetName ¶ added in v1.2.0

func (c *ContentTypeCondition) GetName() string

GetName 返回条件名称

func (*ContentTypeCondition) Match ¶ added in v1.2.0

func (c *ContentTypeCondition) Match(node ast.Node, context *ChunkingContext) bool

Match 检查节点是否匹配内容类型条件

func (*ContentTypeCondition) Validate ¶ added in v1.2.0

func (c *ContentTypeCondition) Validate() error

Validate 验证条件配置

type CreateSeparateChunkAction ¶ added in v1.2.0

type CreateSeparateChunkAction struct {
	ChunkType string            `json:"chunk_type"` // 块类型（可选，为空时使用节点类型）
	Metadata  map[string]string `json:"metadata"`   // 附加元数据
}

CreateSeparateChunkAction 创建独立块动作将匹配的节点创建为独立的块

func NewCreateSeparateChunkAction ¶ added in v1.2.0

func NewCreateSeparateChunkAction(chunkType string, metadata map[string]string) *CreateSeparateChunkAction

NewCreateSeparateChunkAction 创建独立块动作

func (*CreateSeparateChunkAction) Clone ¶ added in v1.2.0

func (a *CreateSeparateChunkAction) Clone() RuleAction

Clone 创建动作的副本

func (*CreateSeparateChunkAction) Execute ¶ added in v1.2.0

func (a *CreateSeparateChunkAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)

Execute 执行创建独立块动作

func (*CreateSeparateChunkAction) GetDescription ¶ added in v1.2.0

func (a *CreateSeparateChunkAction) GetDescription() string

GetDescription 返回动作描述

func (*CreateSeparateChunkAction) GetName ¶ added in v1.2.0

func (a *CreateSeparateChunkAction) GetName() string

GetName 返回动作名称

func (*CreateSeparateChunkAction) Validate ¶ added in v1.2.0

func (a *CreateSeparateChunkAction) Validate() error

Validate 验证动作配置

type CustomStrategy ¶ added in v1.2.0

type CustomStrategy struct {
	Name        string          `json:"name"`        // 策略名称
	Description string          `json:"description"` // 策略描述
	Rules       []*ChunkingRule `json:"rules"`       // 规则列表（按优先级排序）
	Config      *StrategyConfig `json:"config"`      // 策略配置
	// contains filtered or unexported fields
}

CustomStrategy 自定义分块策略基于规则的自定义分块策略实现

func (*CustomStrategy) ChunkDocument ¶ added in v1.2.0

func (s *CustomStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)

ChunkDocument 使用自定义策略对文档进行分块

func (*CustomStrategy) Clone ¶ added in v1.2.0

func (s *CustomStrategy) Clone() ChunkingStrategy

Clone 创建策略的副本（用于并发安全）

func (*CustomStrategy) GetConfig ¶ added in v1.2.0

func (s *CustomStrategy) GetConfig() *StrategyConfig

GetConfig 获取策略配置

func (*CustomStrategy) GetDescription ¶ added in v1.2.0

func (s *CustomStrategy) GetDescription() string

GetDescription 返回策略描述

func (*CustomStrategy) GetEnabledRuleCount ¶ added in v1.2.0

func (s *CustomStrategy) GetEnabledRuleCount() int

GetEnabledRuleCount 获取启用的规则数量

func (*CustomStrategy) GetName ¶ added in v1.2.0

func (s *CustomStrategy) GetName() string

GetName 返回策略名称

func (*CustomStrategy) GetRuleCount ¶ added in v1.2.0

func (s *CustomStrategy) GetRuleCount() int

GetRuleCount 获取规则数量

func (*CustomStrategy) GetRules ¶ added in v1.2.0

func (s *CustomStrategy) GetRules() []*ChunkingRule

GetRules 获取所有规则

func (*CustomStrategy) SetConfig ¶ added in v1.2.0

func (s *CustomStrategy) SetConfig(config *StrategyConfig) error

SetConfig 设置策略配置

func (*CustomStrategy) String ¶ added in v1.2.0

func (s *CustomStrategy) String() string

String 返回策略的字符串表示

func (*CustomStrategy) ValidateConfig ¶ added in v1.2.0

func (s *CustomStrategy) ValidateConfig(config *StrategyConfig) error

ValidateConfig 验证策略特定的配置

type CustomStrategyBuilder ¶ added in v1.2.0

type CustomStrategyBuilder struct {
	Name        string          `json:"name"`        // 策略名称
	Description string          `json:"description"` // 策略描述
	Rules       []*ChunkingRule `json:"rules"`       // 规则列表
	Config      *StrategyConfig `json:"config"`      // 策略配置
	// contains filtered or unexported fields
}

CustomStrategyBuilder 自定义策略构建器用于构建基于规则的自定义分块策略

func NewContentTypeBasedStrategyBuilder ¶ added in v1.2.0

func NewContentTypeBasedStrategyBuilder(name string, separateTypes, mergeTypes []string) *CustomStrategyBuilder

NewContentTypeBasedStrategyBuilder 创建基于内容类型的策略构建器

func NewCustomStrategyBuilder ¶ added in v1.2.0

func NewCustomStrategyBuilder(name, description string) *CustomStrategyBuilder

NewCustomStrategyBuilder 创建新的自定义策略构建器

func NewHeadingBasedStrategyBuilder ¶ added in v1.2.0

func NewHeadingBasedStrategyBuilder(name string, maxLevel int) *CustomStrategyBuilder

NewHeadingBasedStrategyBuilder 创建基于标题的策略构建器

func NewSizeBasedStrategyBuilder ¶ added in v1.2.0

func NewSizeBasedStrategyBuilder(name string, minSize, maxSize int) *CustomStrategyBuilder

NewSizeBasedStrategyBuilder 创建基于大小的策略构建器

func (*CustomStrategyBuilder) AddRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) AddRule(name, description string, condition RuleCondition, action RuleAction, priority int) *CustomStrategyBuilder

AddRule 添加分块规则

func (*CustomStrategyBuilder) AddRuleObject ¶ added in v1.2.0

func (b *CustomStrategyBuilder) AddRuleObject(rule *ChunkingRule) *CustomStrategyBuilder

AddRuleObject 添加规则对象

func (*CustomStrategyBuilder) Build ¶ added in v1.2.0

func (b *CustomStrategyBuilder) Build() (ChunkingStrategy, error)

Build 构建自定义策略

func (*CustomStrategyBuilder) ClearRules ¶ added in v1.2.0

func (b *CustomStrategyBuilder) ClearRules() *CustomStrategyBuilder

ClearRules 清空所有规则

func (*CustomStrategyBuilder) Clone ¶ added in v1.2.0

func (b *CustomStrategyBuilder) Clone() *CustomStrategyBuilder

Clone 创建构建器的副本

func (*CustomStrategyBuilder) DisableRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) DisableRule(name string) *CustomStrategyBuilder

DisableRule 禁用指定名称的规则

func (*CustomStrategyBuilder) EnableRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) EnableRule(name string) *CustomStrategyBuilder

EnableRule 启用指定名称的规则

func (*CustomStrategyBuilder) GetRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) GetRule(name string) *ChunkingRule

GetRule 获取指定名称的规则

func (*CustomStrategyBuilder) GetRuleCount ¶ added in v1.2.0

func (b *CustomStrategyBuilder) GetRuleCount() int

GetRuleCount 获取规则数量

func (*CustomStrategyBuilder) GetRules ¶ added in v1.2.0

func (b *CustomStrategyBuilder) GetRules() []*ChunkingRule

GetRules 获取所有规则

func (*CustomStrategyBuilder) HasRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) HasRule(name string) bool

HasRule 检查是否存在指定名称的规则

func (*CustomStrategyBuilder) RemoveRule ¶ added in v1.2.0

func (b *CustomStrategyBuilder) RemoveRule(name string) *CustomStrategyBuilder

RemoveRule 移除指定名称的规则

func (*CustomStrategyBuilder) SetConfig ¶ added in v1.2.0

func (b *CustomStrategyBuilder) SetConfig(config *StrategyConfig) *CustomStrategyBuilder

SetConfig 设置策略配置

func (*CustomStrategyBuilder) SetDescription ¶ added in v1.2.0

func (b *CustomStrategyBuilder) SetDescription(description string) *CustomStrategyBuilder

SetDescription 设置策略描述

func (*CustomStrategyBuilder) SetName ¶ added in v1.2.0

func (b *CustomStrategyBuilder) SetName(name string) *CustomStrategyBuilder

SetName 设置策略名称

func (*CustomStrategyBuilder) String ¶ added in v1.2.0

func (b *CustomStrategyBuilder) String() string

String 返回构建器的字符串表示

func (*CustomStrategyBuilder) Validate ¶ added in v1.2.0

func (b *CustomStrategyBuilder) Validate() error

Validate 验证策略构建器配置

type DefaultErrorHandler ¶

type DefaultErrorHandler struct {
	// contains filtered or unexported fields
}

DefaultErrorHandler 默认错误处理器

func NewDefaultErrorHandler ¶

func NewDefaultErrorHandler(mode ErrorHandlingMode) *DefaultErrorHandler

NewDefaultErrorHandler 创建默认错误处理器

func NewDefaultErrorHandlerWithLogger ¶ added in v1.1.0

func NewDefaultErrorHandlerWithLogger(mode ErrorHandlingMode, logger log.Logger) *DefaultErrorHandler

NewDefaultErrorHandlerWithLogger 创建带日志器的默认错误处理器

func (*DefaultErrorHandler) ClearErrors ¶

func (h *DefaultErrorHandler) ClearErrors()

ClearErrors 清除所有错误

func (*DefaultErrorHandler) GetErrorCount ¶

func (h *DefaultErrorHandler) GetErrorCount() int

GetErrorCount 获取错误数量

func (*DefaultErrorHandler) GetErrorCountByType ¶

func (h *DefaultErrorHandler) GetErrorCountByType(errorType ErrorType) int

GetErrorCountByType 按类型获取错误数量

func (*DefaultErrorHandler) GetErrors ¶

func (h *DefaultErrorHandler) GetErrors() []*ChunkerError

GetErrors 获取所有错误

func (*DefaultErrorHandler) GetErrorsByType ¶

func (h *DefaultErrorHandler) GetErrorsByType(errorType ErrorType) []*ChunkerError

GetErrorsByType 按类型获取错误

func (*DefaultErrorHandler) HandleError ¶

func (h *DefaultErrorHandler) HandleError(err *ChunkerError) error

HandleError 处理错误

func (*DefaultErrorHandler) HasErrors ¶

func (h *DefaultErrorHandler) HasErrors() bool

HasErrors 检查是否有错误

func (*DefaultErrorHandler) SetLogger ¶ added in v1.1.0

func (h *DefaultErrorHandler) SetLogger(logger log.Logger)

SetLogger 设置日志器

type DepthCondition ¶ added in v1.2.0

type DepthCondition struct {
	MinDepth int `json:"min_depth"` // 最小深度
	MaxDepth int `json:"max_depth"` // 最大深度（0表示无限制）
}

DepthCondition 深度条件匹配指定深度范围内的节点

func NewDepthCondition ¶ added in v1.2.0

func NewDepthCondition(minDepth, maxDepth int) *DepthCondition

NewDepthCondition 创建深度条件

func (*DepthCondition) Clone ¶ added in v1.2.0

func (c *DepthCondition) Clone() RuleCondition

Clone 创建条件的副本

func (*DepthCondition) GetDescription ¶ added in v1.2.0

func (c *DepthCondition) GetDescription() string

GetDescription 返回条件描述

func (*DepthCondition) GetName ¶ added in v1.2.0

func (c *DepthCondition) GetName() string

GetName 返回条件名称

func (*DepthCondition) Match ¶ added in v1.2.0

func (c *DepthCondition) Match(node ast.Node, context *ChunkingContext) bool

Match 检查节点是否匹配深度条件

func (*DepthCondition) Validate ¶ added in v1.2.0

func (c *DepthCondition) Validate() error

Validate 验证条件配置

type DocumentLevelStrategy ¶ added in v1.2.0

type DocumentLevelStrategy struct {
	// contains filtered or unexported fields
}

DocumentLevelStrategy 文档级分块策略将整个文档作为单个块处理

func NewDocumentLevelStrategy ¶ added in v1.2.0

func NewDocumentLevelStrategy() *DocumentLevelStrategy

NewDocumentLevelStrategy 创建新的文档级分块策略

func NewDocumentLevelStrategyWithConfig ¶ added in v1.2.0

func NewDocumentLevelStrategyWithConfig(config *StrategyConfig) *DocumentLevelStrategy

NewDocumentLevelStrategyWithConfig 使用指定配置创建文档级分块策略

func (*DocumentLevelStrategy) ChunkDocument ¶ added in v1.2.0

func (s *DocumentLevelStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)

ChunkDocument 使用文档级策略对文档进行分块

func (*DocumentLevelStrategy) Clone ¶ added in v1.2.0

func (s *DocumentLevelStrategy) Clone() ChunkingStrategy

Clone 创建策略的副本（用于并发安全）

func (*DocumentLevelStrategy) GetConfig ¶ added in v1.2.0

func (s *DocumentLevelStrategy) GetConfig() *StrategyConfig

GetConfig 获取策略配置

func (*DocumentLevelStrategy) GetDescription ¶ added in v1.2.0

func (s *DocumentLevelStrategy) GetDescription() string

GetDescription 返回策略描述

func (*DocumentLevelStrategy) GetName ¶ added in v1.2.0

func (s *DocumentLevelStrategy) GetName() string

GetName 返回策略名称

func (*DocumentLevelStrategy) SetConfig ¶ added in v1.2.0

func (s *DocumentLevelStrategy) SetConfig(config *StrategyConfig) error

SetConfig 设置策略配置

func (*DocumentLevelStrategy) ValidateConfig ¶ added in v1.2.0

func (s *DocumentLevelStrategy) ValidateConfig(config *StrategyConfig) error

ValidateConfig 验证策略特定的配置

type ElementLevelStrategy ¶ added in v1.2.0

type ElementLevelStrategy struct {
	// contains filtered or unexported fields
}

ElementLevelStrategy 元素级分块策略（默认策略）按 Markdown 元素类型逐个分块，保持与当前行为完全一致

func NewElementLevelStrategy ¶ added in v1.2.0

func NewElementLevelStrategy() *ElementLevelStrategy

NewElementLevelStrategy 创建新的元素级分块策略

func NewElementLevelStrategyWithConfig ¶ added in v1.2.0

func NewElementLevelStrategyWithConfig(config *StrategyConfig) *ElementLevelStrategy

NewElementLevelStrategyWithConfig 使用指定配置创建元素级分块策略

func (*ElementLevelStrategy) ChunkDocument ¶ added in v1.2.0

func (s *ElementLevelStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)

ChunkDocument 使用元素级策略对文档进行分块

func (*ElementLevelStrategy) Clone ¶ added in v1.2.0

func (s *ElementLevelStrategy) Clone() ChunkingStrategy

Clone 创建策略的副本（用于并发安全）

func (*ElementLevelStrategy) GetConfig ¶ added in v1.2.0

func (s *ElementLevelStrategy) GetConfig() *StrategyConfig

GetConfig 获取策略配置

func (*ElementLevelStrategy) GetDescription ¶ added in v1.2.0

func (s *ElementLevelStrategy) GetDescription() string

GetDescription 返回策略描述

func (*ElementLevelStrategy) GetName ¶ added in v1.2.0

func (s *ElementLevelStrategy) GetName() string

GetName 返回策略名称

func (*ElementLevelStrategy) SetConfig ¶ added in v1.2.0

func (s *ElementLevelStrategy) SetConfig(config *StrategyConfig) error

SetConfig 设置策略配置

func (*ElementLevelStrategy) ValidateConfig ¶ added in v1.2.0

func (s *ElementLevelStrategy) ValidateConfig(config *StrategyConfig) error

ValidateConfig 验证策略特定的配置

type ErrorHandler ¶

type ErrorHandler interface {
	// HandleError 处理错误
	HandleError(err *ChunkerError) error
	// GetErrors 获取所有错误
	GetErrors() []*ChunkerError
	// ClearErrors 清除所有错误
	ClearErrors()
	// HasErrors 检查是否有错误
	HasErrors() bool
}

ErrorHandler 错误处理器接口

type ErrorHandlingMode ¶

type ErrorHandlingMode int

ErrorHandlingMode 错误处理模式

const (
	// ErrorModeStrict 严格模式，遇到错误立即返回
	ErrorModeStrict ErrorHandlingMode = iota
	// ErrorModePermissive 宽松模式，记录错误但继续处理
	ErrorModePermissive
	// ErrorModeSilent 静默模式，忽略错误
	ErrorModeSilent
)

type ErrorType ¶

type ErrorType int

ErrorType 错误类型

const (
	// ErrorTypeInvalidInput 无效输入错误
	ErrorTypeInvalidInput ErrorType = iota
	// ErrorTypeParsingFailed 解析失败错误
	ErrorTypeParsingFailed
	// ErrorTypeMemoryExhausted 内存不足错误
	ErrorTypeMemoryExhausted
	// ErrorTypeTimeout 超时错误
	ErrorTypeTimeout
	// ErrorTypeConfigInvalid 配置无效错误
	ErrorTypeConfigInvalid
	// ErrorTypeChunkTooLarge 块过大错误
	ErrorTypeChunkTooLarge
	// ErrorTypeStrategyNotFound 策略未找到错误
	ErrorTypeStrategyNotFound
	// ErrorTypeStrategyConfigInvalid 策略配置无效错误
	ErrorTypeStrategyConfigInvalid
	// ErrorTypeStrategyExecutionFailed 策略执行失败错误
	ErrorTypeStrategyExecutionFailed
)

func (ErrorType) String ¶

func (et ErrorType) String() string

String 返回错误类型的字符串表示

type HeadingLevelCondition ¶ added in v1.2.0

type HeadingLevelCondition struct {
	MinLevel int `json:"min_level"` // 最小层级（包含）
	MaxLevel int `json:"max_level"` // 最大层级（包含）
}

HeadingLevelCondition 标题层级条件匹配指定层级范围内的标题

func NewHeadingLevelCondition ¶ added in v1.2.0

func NewHeadingLevelCondition(minLevel, maxLevel int) *HeadingLevelCondition

NewHeadingLevelCondition 创建标题层级条件

func (*HeadingLevelCondition) Clone ¶ added in v1.2.0

func (c *HeadingLevelCondition) Clone() RuleCondition

Clone 创建条件的副本

func (*HeadingLevelCondition) GetDescription ¶ added in v1.2.0

func (c *HeadingLevelCondition) GetDescription() string

GetDescription 返回条件描述

func (*HeadingLevelCondition) GetName ¶ added in v1.2.0

func (c *HeadingLevelCondition) GetName() string

GetName 返回条件名称

func (*HeadingLevelCondition) Match ¶ added in v1.2.0

func (c *HeadingLevelCondition) Match(node ast.Node, context *ChunkingContext) bool

Match 检查节点是否匹配标题层级条件

func (*HeadingLevelCondition) Validate ¶ added in v1.2.0

func (c *HeadingLevelCondition) Validate() error

Validate 验证条件配置

type HierarchicalChunk ¶ added in v1.2.0

type HierarchicalChunk struct {
	Chunk    Chunk                `json:"chunk"`    // 基础块信息
	Children []*HierarchicalChunk `json:"children"` // 子块列表
	Parent   *HierarchicalChunk   `json:"-"`        // 父块引用（不序列化）
	Level    int                  `json:"level"`    // 层级深度
}

HierarchicalChunk 表示层级结构中的块

type HierarchicalStrategy ¶ added in v1.2.0

type HierarchicalStrategy struct {
	// contains filtered or unexported fields
}

HierarchicalStrategy 层级分块策略按文档层级结构分块，将标题及其下属内容作为一个块

func NewHierarchicalStrategy ¶ added in v1.2.0

func NewHierarchicalStrategy() *HierarchicalStrategy

NewHierarchicalStrategy 创建新的层级分块策略

func NewHierarchicalStrategyWithConfig ¶ added in v1.2.0

func NewHierarchicalStrategyWithConfig(config *StrategyConfig) *HierarchicalStrategy

NewHierarchicalStrategyWithConfig 使用指定配置创建层级分块策略

func (*HierarchicalStrategy) ChunkDocument ¶ added in v1.2.0

func (s *HierarchicalStrategy) ChunkDocument(doc ast.Node, source []byte, chunker *MarkdownChunker) ([]Chunk, error)

ChunkDocument 使用层级策略对文档进行分块

func (*HierarchicalStrategy) Clone ¶ added in v1.2.0

func (s *HierarchicalStrategy) Clone() ChunkingStrategy

Clone 创建策略的副本（用于并发安全）

func (*HierarchicalStrategy) GetConfig ¶ added in v1.2.0

func (s *HierarchicalStrategy) GetConfig() *StrategyConfig

GetConfig 获取策略配置

func (*HierarchicalStrategy) GetDescription ¶ added in v1.2.0

func (s *HierarchicalStrategy) GetDescription() string

GetDescription 返回策略描述

func (*HierarchicalStrategy) GetName ¶ added in v1.2.0

func (s *HierarchicalStrategy) GetName() string

GetName 返回策略名称

func (*HierarchicalStrategy) SetConfig ¶ added in v1.2.0

func (s *HierarchicalStrategy) SetConfig(config *StrategyConfig) error

SetConfig 设置策略配置

func (*HierarchicalStrategy) ValidateConfig ¶ added in v1.2.0

func (s *HierarchicalStrategy) ValidateConfig(config *StrategyConfig) error

ValidateConfig 验证策略特定的配置

type Image ¶

type Image struct {
	Alt    string `json:"alt"`    // 替代文本
	URL    string `json:"url"`    // 图片地址
	Title  string `json:"title"`  // 图片标题
	Width  string `json:"width"`  // 图片宽度
	Height string `json:"height"` // 图片高度
}

Image 表示图片信息

type ImageExtractor ¶

type ImageExtractor struct{}

ImageExtractor 图片提取器

func (*ImageExtractor) Extract ¶

func (e *ImageExtractor) Extract(node ast.Node, source []byte) map[string]string

Extract 提取图片信息

func (*ImageExtractor) SupportedTypes ¶

func (e *ImageExtractor) SupportedTypes() []string

SupportedTypes 返回支持的内容类型

type LegacyChunkerConfig ¶ added in v1.2.0

type LegacyChunkerConfig struct {
	// 基本配置
	MaxChunkSize        int                 `json:"max_chunk_size"`
	EnabledTypes        map[string]bool     `json:"enabled_types"`
	CustomExtractors    []MetadataExtractor `json:"custom_extractors"`
	ErrorHandling       ErrorHandlingMode   `json:"error_handling"`
	PerformanceMode     PerformanceMode     `json:"performance_mode"`
	FilterEmptyChunks   bool                `json:"filter_empty_chunks"`
	PreserveWhitespace  bool                `json:"preserve_whitespace"`
	MemoryLimit         int64               `json:"memory_limit"`
	EnableObjectPooling bool                `json:"enable_object_pooling"`

	// 日志配置
	LogLevel     string `json:"log_level"`
	EnableLog    bool   `json:"enable_log"`
	LogFormat    string `json:"log_format"`
	LogDirectory string `json:"log_directory"`

	// 版本标识
	Version ConfigVersion `json:"version,omitempty"`
}

LegacyChunkerConfig 旧版本的分块器配置（策略系统之前）

type Link ¶

type Link struct {
	Text string `json:"text"` // 链接文本
	URL  string `json:"url"`  // 链接地址
	Type string `json:"type"` // 链接类型：internal, external, anchor
}

Link 表示链接信息

type LinkExtractor ¶

type LinkExtractor struct{}

LinkExtractor 链接提取器

func (*LinkExtractor) Extract ¶

func (e *LinkExtractor) Extract(node ast.Node, source []byte) map[string]string

Extract 提取链接信息

func (*LinkExtractor) SupportedTypes ¶

func (e *LinkExtractor) SupportedTypes() []string

SupportedTypes 返回支持的内容类型

type LogContext ¶ added in v1.1.0

type LogContext struct {
	FunctionName string         `json:"function_name"` // 函数名
	FileName     string         `json:"file_name"`     // 文件名
	LineNumber   int            `json:"line_number"`   // 行号
	NodeType     string         `json:"node_type"`     // 节点类型
	NodeID       int            `json:"node_id"`       // 节点ID
	ChunkCount   int            `json:"chunk_count"`   // 块数量
	DocumentSize int            `json:"document_size"` // 文档大小
	ProcessTime  time.Duration  `json:"process_time"`  // 处理时间
	Metadata     map[string]any `json:"metadata"`      // 额外元数据
}

LogContext 表示日志上下文信息

func NewLogContext ¶ added in v1.1.0

func NewLogContext(functionName string) *LogContext

NewLogContext 创建新的日志上下文

func (*LogContext) ToLogFields ¶ added in v1.1.0

func (lc *LogContext) ToLogFields() []any

ToLogFields 将日志上下文转换为日志字段

func (*LogContext) WithCodeInfo ¶ added in v1.1.0

func (lc *LogContext) WithCodeInfo(language string, lineCount int, codeBlockType string) *LogContext

WithCodeInfo 添加代码块特定信息到日志上下文

func (*LogContext) WithContentInfo ¶ added in v1.1.0

func (lc *LogContext) WithContentInfo(contentLength, textLength, wordCount int) *LogContext

WithContentInfo 添加内容统计信息到日志上下文

func (*LogContext) WithDocumentInfo ¶ added in v1.1.0

func (lc *LogContext) WithDocumentInfo(documentSize, chunkCount int) *LogContext

WithDocumentInfo 添加文档信息到日志上下文

func (*LogContext) WithHeadingInfo ¶ added in v1.1.0

func (lc *LogContext) WithHeadingInfo(level, wordCount int) *LogContext

WithHeadingInfo 添加标题特定信息到日志上下文

func (*LogContext) WithLinksAndImages ¶ added in v1.1.0

func (lc *LogContext) WithLinksAndImages(linksCount, imagesCount int) *LogContext

WithLinksAndImages 添加链接和图片信息到日志上下文

func (*LogContext) WithListInfo ¶ added in v1.1.0

func (lc *LogContext) WithListInfo(listType string, itemCount int) *LogContext

WithListInfo 添加列表特定信息到日志上下文

func (*LogContext) WithMetadata ¶ added in v1.1.0

func (lc *LogContext) WithMetadata(key string, value any) *LogContext

WithMetadata 添加自定义元数据到日志上下文

func (*LogContext) WithNodeInfo ¶ added in v1.1.0

func (lc *LogContext) WithNodeInfo(nodeType string, nodeID int) *LogContext

WithNodeInfo 添加节点信息到日志上下文

func (*LogContext) WithPositionInfo ¶ added in v1.1.0

func (lc *LogContext) WithPositionInfo(startLine, endLine, startCol, endCol int) *LogContext

WithPositionInfo 添加位置信息到日志上下文

func (*LogContext) WithProcessTime ¶ added in v1.1.0

func (lc *LogContext) WithProcessTime(duration time.Duration) *LogContext

WithProcessTime 添加处理时间到日志上下文

func (*LogContext) WithTableInfo ¶ added in v1.1.0

func (lc *LogContext) WithTableInfo(rowCount, columnCount int, isWellFormed bool) *LogContext

WithTableInfo 添加表格特定信息到日志上下文

type MarkdownChunker ¶

type MarkdownChunker struct {
	// contains filtered or unexported fields
}

MarkdownChunker Markdown 分块器

func NewMarkdownChunker ¶

func NewMarkdownChunker() *MarkdownChunker

NewMarkdownChunker 创建新的分块器，使用默认配置这个函数保持向后兼容性，确保现有代码无需修改即可工作

func NewMarkdownChunkerWithConfig ¶

func NewMarkdownChunkerWithConfig(config *ChunkerConfig) *MarkdownChunker

NewMarkdownChunkerWithConfig 使用指定配置创建新的分块器

func NewMarkdownChunkerWithHierarchicalStrategy ¶ added in v1.2.0

func NewMarkdownChunkerWithHierarchicalStrategy(maxDepth int) *MarkdownChunker

NewMarkdownChunkerWithHierarchicalStrategy 创建使用层级策略的分块器这是一个便捷函数，用于快速创建层级分块器

func NewMarkdownChunkerWithStrategy ¶ added in v1.2.0

func NewMarkdownChunkerWithStrategy(strategyName string) *MarkdownChunker

NewMarkdownChunkerWithStrategy 使用指定策略创建新的分块器这是一个便捷函数，用于快速创建使用特定策略的分块器

func (*MarkdownChunker) ChunkDocument ¶

func (c *MarkdownChunker) ChunkDocument(content []byte) ([]Chunk, error)

ChunkDocument 对整个文档进行分块

func (*MarkdownChunker) ClearErrors ¶

func (c *MarkdownChunker) ClearErrors()

ClearErrors 清除所有错误

func (*MarkdownChunker) ClearStrategyCache ¶ added in v1.2.0

func (c *MarkdownChunker) ClearStrategyCache()

ClearStrategyCache 清空策略缓存

func (*MarkdownChunker) GetAvailableStrategies ¶ added in v1.2.0

func (c *MarkdownChunker) GetAvailableStrategies() []string

GetAvailableStrategies 获取所有可用的策略列表

func (*MarkdownChunker) GetCacheStats ¶ added in v1.2.0

func (c *MarkdownChunker) GetCacheStats() map[string]any

GetCacheStats 获取缓存统计信息

func (*MarkdownChunker) GetCurrentStrategy ¶ added in v1.2.0

func (c *MarkdownChunker) GetCurrentStrategy() (string, string)

GetCurrentStrategy 获取当前使用的策略信息

func (*MarkdownChunker) GetErrors ¶

func (c *MarkdownChunker) GetErrors() []*ChunkerError

GetErrors 获取处理过程中的所有错误

func (*MarkdownChunker) GetErrorsByType ¶

func (c *MarkdownChunker) GetErrorsByType(errorType ErrorType) []*ChunkerError

GetErrorsByType 按类型获取错误

func (*MarkdownChunker) GetPerformanceMonitor ¶

func (c *MarkdownChunker) GetPerformanceMonitor() *PerformanceMonitor

GetPerformanceMonitor 获取性能监控器（用于高级用法）

func (*MarkdownChunker) GetPerformanceStats ¶

func (c *MarkdownChunker) GetPerformanceStats() PerformanceStats

GetPerformanceStats 获取性能统计信息

func (*MarkdownChunker) GetStrategyConfig ¶ added in v1.2.0

func (c *MarkdownChunker) GetStrategyConfig() *StrategyConfig

GetStrategyConfig 获取当前策略的配置

func (*MarkdownChunker) GetStrategyCount ¶ added in v1.2.0

func (c *MarkdownChunker) GetStrategyCount() int

GetStrategyCount 获取已注册的策略数量

func (*MarkdownChunker) HasErrors ¶

func (c *MarkdownChunker) HasErrors() bool

HasErrors 检查是否有错误

func (*MarkdownChunker) HasStrategy ¶ added in v1.2.0

func (c *MarkdownChunker) HasStrategy(strategyName string) bool

HasStrategy 检查是否存在指定的策略

func (*MarkdownChunker) RegisterStrategy ¶ added in v1.2.0

func (c *MarkdownChunker) RegisterStrategy(strategy ChunkingStrategy) error

RegisterStrategy 注册新的策略

func (*MarkdownChunker) ResetPerformanceMonitor ¶

func (c *MarkdownChunker) ResetPerformanceMonitor()

ResetPerformanceMonitor 重置性能监控器

func (*MarkdownChunker) SetStrategy ¶ added in v1.2.0

func (c *MarkdownChunker) SetStrategy(strategyName string, config *StrategyConfig) error

SetStrategy 设置分块策略

func (*MarkdownChunker) UnregisterStrategy ¶ added in v1.2.0

func (c *MarkdownChunker) UnregisterStrategy(strategyName string) error

UnregisterStrategy 注销策略

func (*MarkdownChunker) UpdateStrategyConfig ¶ added in v1.2.0

func (c *MarkdownChunker) UpdateStrategyConfig(config *StrategyConfig) error

UpdateStrategyConfig 更新当前策略的配置

type MemoryLimiter ¶

type MemoryLimiter struct {
	// contains filtered or unexported fields
}

MemoryLimiter 内存限制器

func NewMemoryLimiter ¶

func NewMemoryLimiter(maxMemoryBytes int64) *MemoryLimiter

NewMemoryLimiter 创建新的内存限制器

func (*MemoryLimiter) CheckMemoryLimit ¶

func (ml *MemoryLimiter) CheckMemoryLimit() error

CheckMemoryLimit 检查内存使用是否超过限制

func (*MemoryLimiter) GetCurrentMemoryUsage ¶

func (ml *MemoryLimiter) GetCurrentMemoryUsage() int64

GetCurrentMemoryUsage 获取当前内存使用量

func (*MemoryLimiter) GetMemoryLimit ¶

func (ml *MemoryLimiter) GetMemoryLimit() int64

GetMemoryLimit 获取内存限制

func (*MemoryLimiter) SetLogger ¶ added in v1.1.0

func (ml *MemoryLimiter) SetLogger(logger log.Logger)

SetLogger 设置日志器

func (*MemoryLimiter) SetMemoryLimit ¶

func (ml *MemoryLimiter) SetMemoryLimit(maxMemoryBytes int64)

SetMemoryLimit 设置内存限制

type MemoryOptimizer ¶

type MemoryOptimizer struct {
	// contains filtered or unexported fields
}

MemoryOptimizer 内存优化器

func NewMemoryOptimizer ¶

func NewMemoryOptimizer(memoryLimit int64) *MemoryOptimizer

NewMemoryOptimizer 创建新的内存优化器

func (*MemoryOptimizer) CheckMemoryLimit ¶

func (mo *MemoryOptimizer) CheckMemoryLimit() error

CheckMemoryLimit 检查内存限制

func (*MemoryOptimizer) ForceGC ¶

func (mo *MemoryOptimizer) ForceGC()

ForceGC 强制执行垃圾回收

func (*MemoryOptimizer) GetChunk ¶

func (mo *MemoryOptimizer) GetChunk() *Chunk

GetChunk 获取一个优化的块对象

func (*MemoryOptimizer) GetGCThreshold ¶

func (mo *MemoryOptimizer) GetGCThreshold() int64

GetGCThreshold 获取GC触发阈值

func (*MemoryOptimizer) GetMemoryStats ¶

func (mo *MemoryOptimizer) GetMemoryStats() MemoryOptimizerStats

GetMemoryStats 获取内存统计信息

func (*MemoryOptimizer) GetStringBuilder ¶

func (mo *MemoryOptimizer) GetStringBuilder() *strings.Builder

GetStringBuilder 获取一个字符串构建器

func (*MemoryOptimizer) PutChunk ¶

func (mo *MemoryOptimizer) PutChunk(chunk *Chunk)

PutChunk 归还块对象到池中

func (*MemoryOptimizer) PutStringBuilder ¶

func (mo *MemoryOptimizer) PutStringBuilder(sb *strings.Builder)

PutStringBuilder 归还字符串构建器到池中

func (*MemoryOptimizer) RecordProcessedBytes ¶

func (mo *MemoryOptimizer) RecordProcessedBytes(bytes int64)

RecordProcessedBytes 记录已处理的字节数

func (*MemoryOptimizer) Reset ¶

func (mo *MemoryOptimizer) Reset()

Reset 重置优化器状态

func (*MemoryOptimizer) SetGCThreshold ¶

func (mo *MemoryOptimizer) SetGCThreshold(threshold int64)

SetGCThreshold 设置GC触发阈值

func (*MemoryOptimizer) SetLogger ¶ added in v1.1.0

func (mo *MemoryOptimizer) SetLogger(logger log.Logger)

SetLogger 设置日志器

type MemoryOptimizerStats ¶

type MemoryOptimizerStats struct {
	CurrentMemory    int64 `json:"current_memory"`    // 当前内存使用
	MemoryLimit      int64 `json:"memory_limit"`      // 内存限制
	ProcessedBytes   int64 `json:"processed_bytes"`   // 已处理字节数
	GCThreshold      int64 `json:"gc_threshold"`      // GC阈值
	TotalAllocations int64 `json:"total_allocations"` // 总分配内存
	GCCycles         int64 `json:"gc_cycles"`         // GC周期数
}

MemoryOptimizerStats 内存优化器统计信息

type MergeWithParentAction ¶ added in v1.2.0

type MergeWithParentAction struct {
	Separator string `json:"separator"` // 合并时使用的分隔符
}

MergeWithParentAction 与父块合并动作将匹配的节点合并到父块中

func NewMergeWithParentAction ¶ added in v1.2.0

func NewMergeWithParentAction(separator string) *MergeWithParentAction

NewMergeWithParentAction 创建与父块合并动作

func (*MergeWithParentAction) Clone ¶ added in v1.2.0

func (a *MergeWithParentAction) Clone() RuleAction

Clone 创建动作的副本

func (*MergeWithParentAction) Execute ¶ added in v1.2.0

func (a *MergeWithParentAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)

Execute 执行与父块合并动作

func (*MergeWithParentAction) GetDescription ¶ added in v1.2.0

func (a *MergeWithParentAction) GetDescription() string

GetDescription 返回动作描述

func (*MergeWithParentAction) GetName ¶ added in v1.2.0

func (a *MergeWithParentAction) GetName() string

GetName 返回动作名称

func (*MergeWithParentAction) Validate ¶ added in v1.2.0

func (a *MergeWithParentAction) Validate() error

Validate 验证动作配置

type MetadataExtractor ¶

type MetadataExtractor interface {
	// Extract 从AST节点中提取元数据
	Extract(node ast.Node, source []byte) map[string]string
	// SupportedTypes 返回支持的内容类型
	SupportedTypes() []string
}

MetadataExtractor 元数据提取器接口

type ObjectPool ¶

type ObjectPool interface {
	Get() any
	Put(any)
	Reset()
}

ObjectPool 对象池接口

type OptimizedStringOperations ¶

type OptimizedStringOperations struct {
	// contains filtered or unexported fields
}

OptimizedStringOperations 优化的字符串操作

func NewOptimizedStringOperations ¶

func NewOptimizedStringOperations() *OptimizedStringOperations

NewOptimizedStringOperations 创建优化的字符串操作实例

func (*OptimizedStringOperations) BuildContent ¶

func (oso *OptimizedStringOperations) BuildContent(parts ...string) string

BuildContent 优化的内容构建

func (*OptimizedStringOperations) JoinStrings ¶

func (oso *OptimizedStringOperations) JoinStrings(strs []string, separator string) string

JoinStrings 优化的字符串连接

func (*OptimizedStringOperations) TrimAndClean ¶

func (oso *OptimizedStringOperations) TrimAndClean(text string) string

TrimAndClean 优化的字符串清理

type PerformanceMode ¶

type PerformanceMode int

PerformanceMode 性能模式

const (
	// PerformanceModeDefault 默认性能模式
	PerformanceModeDefault PerformanceMode = iota
	// PerformanceModeMemoryOptimized 内存优化模式
	PerformanceModeMemoryOptimized
	// PerformanceModeSpeedOptimized 速度优化模式
	PerformanceModeSpeedOptimized
)

type PerformanceMonitor ¶

type PerformanceMonitor struct {
	// contains filtered or unexported fields
}

PerformanceMonitor 性能监控器

func NewPerformanceMonitor ¶

func NewPerformanceMonitor() *PerformanceMonitor

NewPerformanceMonitor 创建新的性能监控器

func (*PerformanceMonitor) CheckMemoryThresholds ¶ added in v1.1.0

func (pm *PerformanceMonitor) CheckMemoryThresholds()

CheckMemoryThresholds 检查内存使用阈值并记录警告

func (*PerformanceMonitor) ForceGC ¶

func (pm *PerformanceMonitor) ForceGC()

ForceGC 强制垃圾回收（用于测试和内存优化）

func (*PerformanceMonitor) GetMemoryStats ¶

func (pm *PerformanceMonitor) GetMemoryStats() runtime.MemStats

GetMemoryStats 获取详细的内存统计信息

func (*PerformanceMonitor) GetStats ¶

func (pm *PerformanceMonitor) GetStats() PerformanceStats

GetStats 获取性能统计信息

func (*PerformanceMonitor) IsRunning ¶

func (pm *PerformanceMonitor) IsRunning() bool

IsRunning 检查监控器是否正在运行

func (*PerformanceMonitor) RecordBytes ¶

func (pm *PerformanceMonitor) RecordBytes(bytes int64)

RecordBytes 记录处理的字节数（用于输入文档大小）

func (*PerformanceMonitor) RecordChunk ¶

func (pm *PerformanceMonitor) RecordChunk(chunk *Chunk)

RecordChunk 记录处理的块信息

func (*PerformanceMonitor) RecordStrategyExecution ¶ added in v1.2.0

func (pm *PerformanceMonitor) RecordStrategyExecution(strategyName string, executionTime time.Duration, chunksGenerated int)

RecordStrategyExecution 记录策略执行信息

func (*PerformanceMonitor) Reset ¶

func (pm *PerformanceMonitor) Reset()

Reset 重置监控器状态

func (*PerformanceMonitor) SetLogger ¶ added in v1.1.0

func (pm *PerformanceMonitor) SetLogger(logger log.Logger)

SetLogger 设置日志器

func (*PerformanceMonitor) Start ¶

func (pm *PerformanceMonitor) Start()

Start 开始性能监控

func (*PerformanceMonitor) Stop ¶

func (pm *PerformanceMonitor) Stop()

Stop 停止性能监控

type PerformanceStats ¶

type PerformanceStats struct {
	ProcessingTime  time.Duration `json:"processing_time"`   // 处理时间
	MemoryUsed      int64         `json:"memory_used"`       // 使用的内存（字节）
	ChunksPerSecond float64       `json:"chunks_per_second"` // 每秒处理的块数
	BytesPerSecond  float64       `json:"bytes_per_second"`  // 每秒处理的字节数
	TotalChunks     int           `json:"total_chunks"`      // 总块数
	TotalBytes      int64         `json:"total_bytes"`       // 总字节数（输入文档大小）
	ChunkBytes      int64         `json:"chunk_bytes"`       // 块内容总字节数
	PeakMemory      int64         `json:"peak_memory"`       // 峰值内存使用
}

PerformanceStats 性能统计信息

type ProcessingJob ¶

type ProcessingJob struct {
	ID      int
	Content []byte
}

ProcessingJob 处理任务

type ProcessingResult ¶

type ProcessingResult struct {
	ID     int
	Chunks []Chunk
	Error  error
}

ProcessingResult 处理结果

type RuleAction ¶ added in v1.2.0

type RuleAction interface {
	// Execute 执行动作，返回处理后的块
	Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)

	// GetName 返回动作名称
	GetName() string

	// GetDescription 返回动作描述
	GetDescription() string

	// Validate 验证动作配置是否有效
	Validate() error

	// Clone 创建动作的副本
	Clone() RuleAction
}

RuleAction 规则动作接口定义匹配条件后执行的动作

type RuleCondition ¶ added in v1.2.0

type RuleCondition interface {
	// Match 检查节点是否匹配条件
	Match(node ast.Node, context *ChunkingContext) bool

	// GetName 返回条件名称
	GetName() string

	// GetDescription 返回条件描述
	GetDescription() string

	// Validate 验证条件配置是否有效
	Validate() error

	// Clone 创建条件的副本
	Clone() RuleCondition
}

RuleCondition 规则条件接口定义分块规则的匹配条件

type SkipNodeAction ¶ added in v1.2.0

type SkipNodeAction struct {
	Reason string `json:"reason"` // 跳过原因
}

SkipNodeAction 跳过节点动作跳过匹配的节点，不创建块

func NewSkipNodeAction ¶ added in v1.2.0

func NewSkipNodeAction(reason string) *SkipNodeAction

NewSkipNodeAction 创建跳过节点动作

func (*SkipNodeAction) Clone ¶ added in v1.2.0

func (a *SkipNodeAction) Clone() RuleAction

Clone 创建动作的副本

func (*SkipNodeAction) Execute ¶ added in v1.2.0

func (a *SkipNodeAction) Execute(node ast.Node, context *ChunkingContext) (*Chunk, error)

Execute 执行跳过节点动作

func (*SkipNodeAction) GetDescription ¶ added in v1.2.0

func (a *SkipNodeAction) GetDescription() string

GetDescription 返回动作描述

func (*SkipNodeAction) GetName ¶ added in v1.2.0

func (a *SkipNodeAction) GetName() string

GetName 返回动作名称

func (*SkipNodeAction) Validate ¶ added in v1.2.0

func (a *SkipNodeAction) Validate() error

Validate 验证动作配置

type StrategyCache ¶ added in v1.2.0

type StrategyCache struct {
	// contains filtered or unexported fields
}

StrategyCache 策略缓存

func NewStrategyCache ¶ added in v1.2.0

func NewStrategyCache() *StrategyCache

NewStrategyCache 创建策略缓存

func (*StrategyCache) Clear ¶ added in v1.2.0

func (sc *StrategyCache) Clear()

Clear 清空缓存

func (*StrategyCache) Get ¶ added in v1.2.0

func (sc *StrategyCache) Get(name string) (ChunkingStrategy, bool)

Get 从缓存获取策略

func (*StrategyCache) Keys ¶ added in v1.2.0

func (sc *StrategyCache) Keys() []string

Keys 获取所有缓存的策略名称

func (*StrategyCache) Put ¶ added in v1.2.0

func (sc *StrategyCache) Put(name string, strategy ChunkingStrategy)

Put 将策略放入缓存

func (*StrategyCache) Remove ¶ added in v1.2.0

func (sc *StrategyCache) Remove(name string)

Remove 从缓存移除策略

func (*StrategyCache) Size ¶ added in v1.2.0

func (sc *StrategyCache) Size() int

Size 获取缓存大小

type StrategyConfig ¶ added in v1.2.0

type StrategyConfig struct {
	// 通用配置
	Name       string         `json:"name"`       // 策略名称
	Parameters map[string]any `json:"parameters"` // 策略参数

	// 层级策略特定配置
	MaxDepth   int  `json:"max_depth,omitempty"`   // 最大层级深度
	MinDepth   int  `json:"min_depth,omitempty"`   // 最小层级深度
	MergeEmpty bool `json:"merge_empty,omitempty"` // 是否合并空章节

	// 大小限制配置
	MinChunkSize int `json:"min_chunk_size,omitempty"` // 最小块大小
	MaxChunkSize int `json:"max_chunk_size,omitempty"` // 最大块大小

	// 内容过滤配置
	IncludeTypes []string `json:"include_types,omitempty"` // 包含的内容类型
	ExcludeTypes []string `json:"exclude_types,omitempty"` // 排除的内容类型
}

StrategyConfig 策略配置结构

func CreateConfigFromParameters ¶ added in v1.2.0

func CreateConfigFromParameters(strategyName string, params map[string]any) (*StrategyConfig, error)

CreateConfigFromParameters 从参数映射创建策略配置

func DefaultStrategyConfig ¶ added in v1.2.0

func DefaultStrategyConfig(name string) *StrategyConfig

DefaultStrategyConfig 创建默认策略配置

func DocumentLevelConfig ¶ added in v1.2.0

func DocumentLevelConfig() *StrategyConfig

DocumentLevelConfig 创建文档级策略配置

func DocumentLevelConfigWithSize ¶ added in v1.2.0

func DocumentLevelConfigWithSize(minSize, maxSize int) *StrategyConfig

DocumentLevelConfigWithSize 创建带大小限制的文档级策略配置

func ElementLevelConfig ¶ added in v1.2.0

func ElementLevelConfig() *StrategyConfig

ElementLevelConfig 创建元素级策略配置

func ElementLevelConfigWithSize ¶ added in v1.2.0

func ElementLevelConfigWithSize(minSize, maxSize int) *StrategyConfig

ElementLevelConfigWithSize 创建带大小限制的元素级策略配置

func ElementLevelConfigWithTypes ¶ added in v1.2.0

func ElementLevelConfigWithTypes(includeTypes, excludeTypes []string) *StrategyConfig

ElementLevelConfigWithTypes 创建带内容类型过滤的元素级策略配置

func HierarchicalConfig ¶ added in v1.2.0

func HierarchicalConfig(maxDepth int) *StrategyConfig

HierarchicalConfig 创建层级策略配置

func HierarchicalConfigAdvanced ¶ added in v1.2.0

func HierarchicalConfigAdvanced(maxDepth, minDepth int, mergeEmpty bool) *StrategyConfig

HierarchicalConfigAdvanced 创建高级层级策略配置

func HierarchicalConfigWithSize ¶ added in v1.2.0

func HierarchicalConfigWithSize(maxDepth, minSize, maxSize int) *StrategyConfig

HierarchicalConfigWithSize 创建带大小限制的层级策略配置

func MergeConfigs ¶ added in v1.2.0

func MergeConfigs(base, override *StrategyConfig) (*StrategyConfig, error)

MergeConfigs 合并两个策略配置

func (*StrategyConfig) Clone ¶ added in v1.2.0

func (sc *StrategyConfig) Clone() *StrategyConfig

Clone 创建策略配置的副本

func (*StrategyConfig) String ¶ added in v1.2.0

func (sc *StrategyConfig) String() string

String 返回策略配置的字符串表示

func (*StrategyConfig) ValidateConfig ¶ added in v1.2.0

func (sc *StrategyConfig) ValidateConfig() error

ValidateConfig 验证策略配置

type StrategyPool ¶ added in v1.2.0

type StrategyPool struct {
	// contains filtered or unexported fields
}

StrategyPool 策略实例池

func NewStrategyPool ¶ added in v1.2.0

func NewStrategyPool() *StrategyPool

NewStrategyPool 创建策略池

func (*StrategyPool) Clear ¶ added in v1.2.0

func (sp *StrategyPool) Clear()

Clear 清空所有池

func (*StrategyPool) CreatePool ¶ added in v1.2.0

func (sp *StrategyPool) CreatePool(strategyName string, factory func() ChunkingStrategy)

CreatePool 为指定策略创建池

func (*StrategyPool) Get ¶ added in v1.2.0

func (sp *StrategyPool) Get(strategyName string, factory func() ChunkingStrategy) ChunkingStrategy

Get 从池中获取策略实例

func (*StrategyPool) GetPoolCount ¶ added in v1.2.0

func (sp *StrategyPool) GetPoolCount() int

GetPoolCount 获取池的数量

func (*StrategyPool) HasPool ¶ added in v1.2.0

func (sp *StrategyPool) HasPool(strategyName string) bool

HasPool 检查是否存在指定策略的池

func (*StrategyPool) Put ¶ added in v1.2.0

func (sp *StrategyPool) Put(strategy ChunkingStrategy)

Put 将策略实例放回池中

func (*StrategyPool) RemovePool ¶ added in v1.2.0

func (sp *StrategyPool) RemovePool(strategyName string)

RemovePool 移除指定策略的池

type StrategyRegistry ¶ added in v1.2.0

type StrategyRegistry struct {
	// contains filtered or unexported fields
}

StrategyRegistry 策略注册器

func NewStrategyRegistry ¶ added in v1.2.0

func NewStrategyRegistry() *StrategyRegistry

NewStrategyRegistry 创建策略注册器

func (*StrategyRegistry) Get ¶ added in v1.2.0

func (sr *StrategyRegistry) Get(name string) (ChunkingStrategy, error)

Get 获取策略

func (*StrategyRegistry) GetStrategyCount ¶ added in v1.2.0

func (sr *StrategyRegistry) GetStrategyCount() int

GetStrategyCount 获取已注册策略数量

func (*StrategyRegistry) HasStrategy ¶ added in v1.2.0

func (sr *StrategyRegistry) HasStrategy(name string) bool

HasStrategy 检查策略是否存在

func (*StrategyRegistry) List ¶ added in v1.2.0

func (sr *StrategyRegistry) List() []string

List 列出所有可用策略

func (*StrategyRegistry) Register ¶ added in v1.2.0

func (sr *StrategyRegistry) Register(strategy ChunkingStrategy) error

Register 注册策略

func (*StrategyRegistry) Unregister ¶ added in v1.2.0

func (sr *StrategyRegistry) Unregister(name string) error

Unregister 注销策略

type StringBuilderPool ¶

type StringBuilderPool struct {
	// contains filtered or unexported fields
}

StringBuilderPool 字符串构建器对象池

func NewStringBuilderPool ¶

func NewStringBuilderPool() *StringBuilderPool

NewStringBuilderPool 创建新的字符串构建器对象池

func (*StringBuilderPool) Get ¶

func (sbp *StringBuilderPool) Get() *strings.Builder

Get 从池中获取一个字符串构建器

func (*StringBuilderPool) Put ¶

func (sbp *StringBuilderPool) Put(sb *strings.Builder)

Put 将字符串构建器放回池中

type TableInfo ¶

type TableInfo struct {
	Rows         int
	Columns      int
	HasHeader    bool
	HeaderCells  []string
	DataRows     [][]string
	Alignments   []string          // left, center, right
	CellTypes    map[string]string // 单元格内容类型分析
	IsWellFormed bool
	Errors       []string
}

TableInfo 表格信息结构

func (*TableInfo) GetTableMetadata ¶

func (info *TableInfo) GetTableMetadata() map[string]string

GetTableMetadata 获取表格元数据

type WorkerPool ¶

type WorkerPool struct {
	// contains filtered or unexported fields
}

WorkerPool 工作池，用于更精细的并发控制

func NewWorkerPool ¶

func NewWorkerPool(workers int, config *ChunkerConfig) *WorkerPool

NewWorkerPool 创建新的工作池

func (*WorkerPool) GetResult ¶

func (wp *WorkerPool) GetResult() ProcessingResult

GetResult 获取结果

func (*WorkerPool) ProcessBatch ¶

func (wp *WorkerPool) ProcessBatch(contents [][]byte) ([][]Chunk, []error)

ProcessBatch 批量处理任务

func (*WorkerPool) Start ¶

func (wp *WorkerPool) Start()

Start 启动工作池

func (*WorkerPool) Stop ¶

func (wp *WorkerPool) Stop()

Stop 停止工作池

func (*WorkerPool) Submit ¶

func (wp *WorkerPool) Submit(job ProcessingJob)

Submit 提交任务

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL