Documentation ¶
Overview ¶
Package dit classifies HTML form, field, and page types.
It provides a three-stage ML pipeline: logistic regression for form types, a CRF model for field types, and logistic regression for page types.
	c, _ := dit.New()
	results, _ := c.ExtractForms(htmlString)
	for _, r := range results {
		fmt.Println(r.Type)   // "login"
		fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
	}
	page, _ := c.ExtractPageType(htmlString)
	fmt.Println(page.Type) // "login"
Index ¶
- func FindModel(name string) (string, error)
- func ModelDir() string
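- type Classifier
- func Load(path string) (*Classifier, error)
- func New() (*Classifier, error)
- func Train(dataDir string, config *TrainConfig) (*Classifier, error)
- func (c *Classifier) ExtractForms(html string) ([]FormResult, error)
- func (c *Classifier) ExtractFormsProba(html string, threshold float64) ([]FormResultProba, error)
- func (c *Classifier) ExtractPageType(html string) (*PageResult, error)
- func (c *Classifier) ExtractPageTypeProba(html string, threshold float64) (*PageResultProba, error)
- func (c *Classifier) Save(path string) error
- type EvalConfig
- type EvalResult
- func Evaluate(dataDir string, config *EvalConfig) (*EvalResult, error)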
- type FormResult
- type FormResultProba
- type PageResult
- type PageResultProba
- type TrainConfig
Functions ¶
Types ¶
type Classifier ¶
type Classifier struct {
	// contains filtered or unexported fields
}
Classifier wraps the form and field type classification models.
func Load ¶
func Load(path string) (*Classifier, error)
Load loads a trained classifier from a model file.
func New ¶
func New() (*Classifier, error)
New loads the classifier from "model.json", searching the current directory and parent directories up to the module root, then ~/.dit/model.json.
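The search order described for New can be sketched roughly as below. This is only an illustration, not the package's implementation: `findModelSketch` and `fileExists` are hypothetical names, and the sketch assumes the module root is the first directory containing a `go.mod` file.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// fileExists reports whether path names an existing regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

// findModelSketch is a hypothetical helper, not part of package dit.
// It looks for name in start and each parent directory, stopping after the
// first directory that contains go.mod (assumed here to mark the module
// root), and then falls back to <home>/.dit/<name>.
func findModelSketch(start, home, name string) (string, error) {
	dir := start
	for {
		candidate := filepath.Join(dir, name)
		if fileExists(candidate) {
			return candidate, nil
		}
		if fileExists(filepath.Join(dir, "go.mod")) {
			break // reached the assumed module root
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			break // reached the filesystem root
		}
		dir = parent
	}
	fallback := filepath.Join(home, ".dit", name)
	if fileExists(fallback) {
		return fallback, nil
	}
	return "", errors.New("model file not found: " + name)
}

func main() {
	cwd, err := os.Getwd()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	path, err := findModelSketch(cwd, home, "model.json")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("found:", path)
}
```

When the model lives somewhere outside this search path, Load with an explicit path avoids the lookup entirely.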
func Train ¶
func Train(dataDir string, config *TrainConfig) (*Classifier, error)
Train trains a classifier on annotated HTML forms in the given data directory.
func (*Classifier) ExtractForms ¶
func (c *Classifier) ExtractForms(html string) ([]FormResult, error)
ExtractForms extracts and classifies all forms in the given HTML string. Returns an empty slice (not nil) if no forms are found.
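The distinction between an empty slice and a nil slice matters mostly when results are serialized. The sketch below redeclares FormResult locally so it compiles without importing the package; `encode` is a hypothetical helper used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FormResult mirrors the struct documented on this page. It is redeclared
// here only so the sketch is self-contained.
type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

// encode marshals a slice of results to a JSON string.
func encode(results []FormResult) string {
	b, err := json.Marshal(results)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	var nilResults []FormResult    // nil slice
	emptyResults := []FormResult{} // empty, non-nil slice

	fmt.Println(len(nilResults), len(emptyResults)) // 0 0
	fmt.Println(encode(nilResults))                 // null
	fmt.Println(encode(emptyResults))               // []
}
```

Both slices can be ranged over safely; only the JSON encoding differs, which is why a non-nil empty result is convenient for callers that emit JSON.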
func (*Classifier) ExtractFormsProba ¶
func (c *Classifier) ExtractFormsProba(html string, threshold float64) ([]FormResultProba, error)
ExtractFormsProba extracts forms and returns classification probabilities. Probabilities below threshold are omitted.
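The effect of the threshold can be pictured as a filter over a label-to-probability map. The sketch below is an illustration under assumptions: `filterBelow` is a hypothetical helper, the labels other than "login" are made up, and keeping a probability that is exactly equal to the threshold is an assumption the documentation does not confirm.

```go
package main

import "fmt"

// filterBelow is a hypothetical helper illustrating threshold filtering.
// It returns a new map containing only the entries of probs whose
// probability is at least threshold (the boundary case is an assumption).
func filterBelow(probs map[string]float64, threshold float64) map[string]float64 {
	kept := make(map[string]float64)
	for label, p := range probs {
		if p >= threshold {
			kept[label] = p
		}
	}
	return kept
}

func main() {
	// Illustrative probabilities; only "login" appears in the package docs.
	probs := map[string]float64{"login": 0.92, "search": 0.05, "registration": 0.03}
	fmt.Println(filterBelow(probs, 0.05)) // map[login:0.92 search:0.05]
}
```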
func (*Classifier) ExtractPageType ¶ added in v0.0.3
func (c *Classifier) ExtractPageType(html string) (*PageResult, error)
ExtractPageType classifies the page type and all forms in the HTML.
func (*Classifier) ExtractPageTypeProba ¶ added in v0.0.3
func (c *Classifier) ExtractPageTypeProba(html string, threshold float64) (*PageResultProba, error)
ExtractPageTypeProba classifies the page type with probabilities.
func (*Classifier) Save ¶
func (c *Classifier) Save(path string) error
Save writes the classifier to a model file.
type EvalConfig ¶
EvalConfig holds configuration for evaluation.
type EvalResult ¶
type EvalResult struct {
	FormAccuracy     float64
	FieldAccuracy    float64
	SequenceAccuracy float64
	PageAccuracy     float64
	FormCorrect      int
	FormTotal        int
	FieldCorrect     int
	FieldTotal       int
	SequenceCorrect  int
	SequenceTotal    int
	PageCorrect      int
	PageTotal        int

	// Per-class metrics
	PageConfusion  map[string]map[string]int
	PageClasses    []string
	PagePrecision  map[string]float64
	PageRecall     map[string]float64
	PageF1         map[string]float64
	PageMacroF1    float64
	PageWeightedF1 float64
}
EvalResult holds cross-validation evaluation results.
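For readers unfamiliar with the per-class fields, the following sketch shows one conventional way to derive precision, recall, F1, macro F1, and support-weighted F1 from a confusion matrix. It is not taken from the package: `computeMetrics`, `metrics`, and `ratio` are hypothetical names, the class labels are illustrative, and the orientation `confusion[actual][predicted]` is an assumption, since the documentation does not state how PageConfusion is indexed.

```go
package main

import "fmt"

// metrics holds per-class and aggregate scores derived from a confusion matrix.
type metrics struct {
	Precision, Recall, F1 map[string]float64
	MacroF1, WeightedF1   float64
}

// ratio returns num/den, or 0 when den is 0.
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

// computeMetrics is a hypothetical helper. It assumes
// confusion[actual][predicted] holds counts.
func computeMetrics(confusion map[string]map[string]int, classes []string) metrics {
	m := metrics{
		Precision: map[string]float64{},
		Recall:    map[string]float64{},
		F1:        map[string]float64{},
	}
	total := 0.0
	for _, c := range classes {
		tp := float64(confusion[c][c])
		actual, predicted := 0.0, 0.0
		for _, other := range classes {
			actual += float64(confusion[c][other])    // row sum: true class is c
			predicted += float64(confusion[other][c]) // column sum: predicted as c
		}
		p, r := ratio(tp, predicted), ratio(tp, actual)
		f1 := ratio(2*p*r, p+r)
		m.Precision[c], m.Recall[c], m.F1[c] = p, r, f1
		m.MacroF1 += f1            // unweighted sum, averaged below
		m.WeightedF1 += f1 * actual // weighted by class support
		total += actual
	}
	m.MacroF1 = ratio(m.MacroF1, float64(len(classes)))
	m.WeightedF1 = ratio(m.WeightedF1, total)
	return m
}

func main() {
	// Illustrative counts, not real evaluation output.
	confusion := map[string]map[string]int{
		"login":  {"login": 8, "search": 2},
		"search": {"login": 1, "search": 9},
	}
	m := computeMetrics(confusion, []string{"login", "search"})
	fmt.Printf("login precision=%.3f recall=%.3f\n", m.Precision["login"], m.Recall["login"])
	fmt.Printf("macro F1=%.3f weighted F1=%.3f\n", m.MacroF1, m.WeightedF1)
}
```

Macro F1 treats every class equally, while weighted F1 scales each class by its number of true examples, so the two diverge when classes are imbalanced.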
func Evaluate ¶
func Evaluate(dataDir string, config *EvalConfig) (*EvalResult, error)
Evaluate runs cross-validation evaluation on annotated data.
type FormResult ¶
type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}
FormResult holds the classification result for a single form.
type FormResultProba ¶
type FormResultProba struct {
	Type   map[string]float64            `json:"type"`
	Fields map[string]map[string]float64 `json:"fields,omitempty"`
}
FormResultProba holds probability-based classification results for a single form.
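A common follow-up step is to reduce each probability map to its most likely label. The sketch below shows one way to do that. `topLabel` is a hypothetical helper, not part of the package, and the probability values and the "search" and "username" labels are illustrative.

```go
package main

import "fmt"

// topLabel is a hypothetical helper that returns the label with the highest
// probability. Ties are broken by the lexicographically smaller label so the
// result does not depend on map iteration order. ok is false for an empty map.
func topLabel(probs map[string]float64) (label string, p float64, ok bool) {
	for l, v := range probs {
		if !ok || v > p || (v == p && l < label) {
			label, p, ok = l, v, true
		}
	}
	return label, p, ok
}

func main() {
	// Illustrative values shaped like FormResultProba.Type.
	formType := map[string]float64{"login": 0.92, "search": 0.08}
	label, p, _ := topLabel(formType)
	fmt.Println(label, p) // login 0.92

	// Illustrative values shaped like FormResultProba.Fields.
	fields := map[string]map[string]float64{
		"username": {"username or email": 0.85, "username": 0.15},
	}
	for name, probs := range fields {
		best, _, _ := topLabel(probs)
		fmt.Println(name, "->", best) // username -> username or email
	}
}
```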
type PageResult ¶ added in v0.0.3
type PageResult struct {
	Type  string       `json:"type"`
	Forms []FormResult `json:"forms,omitempty"`
}
PageResult holds the page type classification result.
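The struct tags determine the JSON shape of a result. As a sketch, the types are redeclared locally below so the example compiles on its own; `toJSON` is a hypothetical helper and the values are taken from the overview example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FormResult and PageResult mirror the structs documented on this page.
// They are redeclared here only so the sketch is self-contained.
type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

type PageResult struct {
	Type  string       `json:"type"`
	Forms []FormResult `json:"forms,omitempty"`
}

// toJSON marshals a PageResult to a compact JSON string.
func toJSON(p PageResult) string {
	b, err := json.Marshal(p)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	withForm := PageResult{
		Type: "login",
		Forms: []FormResult{{
			Type:   "login",
			Fields: map[string]string{"username": "username or email", "password": "password"},
		}},
	}
	fmt.Println(toJSON(withForm))
	// {"type":"login","forms":[{"type":"login","fields":{"password":"password","username":"username or email"}}]}

	noForms := PageResult{Type: "login"}
	fmt.Println(toJSON(noForms)) // {"type":"login"}
}
```

Because of `omitempty`, a page without forms serializes with no `forms` key at all, so consumers should treat a missing key as "no forms" rather than as an error.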
type PageResultProba ¶ added in v0.0.3
type PageResultProba struct {
	Type  map[string]float64 `json:"type"`
	Forms []FormResultProba  `json:"forms,omitempty"`
}
PageResultProba holds probability-based page type classification results.
type TrainConfig ¶
type TrainConfig struct {
	Verbose bool
}
TrainConfig holds configuration for training.
Directories ¶
| Path | Synopsis |
|---|---|
| classifier | Package classifier implements form and field type classification. |
| cmd/dit (command) | |
| cmd/dit-collect (command) | |
| crf | Package crf implements a linear-chain Conditional Random Field. |
| internal/htmlutil | Package htmlutil provides HTML form and field extraction utilities. |
| internal/storage | Package storage provides access to annotation data for form classification training. |
| internal/textutil | Package textutil provides text processing utilities for form classification. |
| internal/vectorizer | Package vectorizer provides text vectorization utilities matching sklearn behavior. |
