dit

package module
v0.0.11 Latest
Published: Feb 27, 2026 License: MIT Imports: 9 Imported by: 0

README

dît

dît ("found" in Kurdish) classifies the type of an HTML page, its forms, and their fields using machine learning.

It classifies pages (login, error, landing, blog, etc.), detects whether a form is a login, search, registration, password recovery, contact, mailing list, order form, or something else, and classifies each field (username, password, email, search query, etc.). Zero external ML dependencies.

Install

go get github.com/happyhackingspace/dit

Usage

As a Library

import (
    "fmt"

    "github.com/happyhackingspace/dit"
)

// Load classifier (finds model.json automatically)
c, _ := dit.New()

// Classify page type
page, _ := c.ExtractPageType(htmlString)
fmt.Println(page.Type)  // "login"
fmt.Println(page.Forms) // form classifications included

// Classify forms in HTML
results, _ := c.ExtractForms(htmlString)
for _, r := range results {
    fmt.Println(r.Type)   // "login"
    fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
}

// With probabilities
pageProba, _ := c.ExtractPageTypeProba(htmlString, 0.05)
formProba, _ := c.ExtractFormsProba(htmlString, 0.05)

// Train a new model (a separate variable, because c is already declared above)
trained, _ := dit.Train("data/", &dit.TrainConfig{Verbose: true})
trained.Save("model.json")

// Evaluate via cross-validation
result, _ := dit.Evaluate("data/", &dit.EvalConfig{Folds: 10})
fmt.Printf("Form accuracy: %.1f%%\n", result.FormAccuracy*100)
fmt.Printf("Page accuracy: %.1f%%\n", result.PageAccuracy*100)

As a CLI

# Classify page type and forms on a URL
dit run https://github.com/login

# Classify forms in a local file
dit run login.html

# With probabilities
dit run https://github.com/login --proba

# Download training data and model from Hugging Face
dit data download

# Train a model
dit train model.json --data-folder data

# Evaluate model accuracy
dit evaluate --data-folder data

# Upload training data and model to Hugging Face
dit data upload

Page Types

Type               Description
login              Login page
registration       Registration / signup page
search             Search results page
checkout           Checkout / payment page
contact            Contact page
password_reset     Password reset page
landing            Landing / home page
product            Product page
blog               Blog / article page
settings           Settings / account page
soft_404           Soft 404 (HTTP 200 but "not found" content)
error              Error page (404, 403, 500, etc.)
captcha            CAPTCHA / bot detection page
parked             Domain parking page
coming_soon        Under construction / maintenance page
admin              Admin panel / dashboard
directory_listing  Open directory index
default_page       Unconfigured server default
waf_block          WAF block page
other              Other page type

Form Types

Type                     Description
login                    Login form
search                   Search form
registration             Registration / signup form
password/login recovery  Password reset / recovery form
contact/comment          Contact or comment form
join mailing list        Newsletter / mailing list signup
order/add to cart        Order or add-to-cart form
other                    Other form type

Field Types

Category        Types
Authentication  username, password, password confirmation, email, email confirmation, username or email
Names           first name, last name, middle name, full name, organization name, gender
Address         country, city, state, address, postal code
Contact         phone, fax, url
Search          search query, search category
Content         comment text, comment title, about me text
Buttons         submit button, cancel button, reset button
Verification    captcha, honeypot, TOS confirmation, remember me checkbox, receive emails confirmation
Security        security question, security answer
Time            full date, day, month, year, timezone
Product         product quantity, sorting option, style select
Other           other number, other read-only, other

The full list of 79 field type codes is in data/config.json (run dit data download to fetch the data).

Accuracy

Cross-validation results (10-fold, grouped by domain):

Metric               Score
Form type accuracy   82.9% (1135/1369)
Field type accuracy  86.6% (4518/5218)
Sequence accuracy    78.7% (1025/1302)
Page type accuracy   53.4% (403/754)
Page macro F1        40.2%
Page weighted F1     53.6%

Trained on 1000+ annotated web forms and 754 annotated web pages.
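The accuracy percentages in the table are plain ratios of correct predictions to totals. The short check below recomputes them from the counts quoted in the table; the percent helper is a name introduced here for illustration and is not part of the package.

```go
package main

import "fmt"

// percent returns correct/total as a percentage, or 0 when total is 0.
func percent(correct, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(correct) / float64(total)
}

func main() {
	// Counts copied from the accuracy table in this README.
	rows := []struct {
		name           string
		correct, total int
	}{
		{"Form type", 1135, 1369},
		{"Field type", 4518, 5218},
		{"Sequence", 1025, 1302},
		{"Page type", 403, 754},
	}
	for _, r := range rows {
		fmt.Printf("%-10s %.1f%% (%d/%d)\n", r.name, percent(r.correct, r.total), r.correct, r.total)
	}
}
```

The F1 rows cannot be recomputed this way because they depend on per-class counts that the table does not list.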

Contributing

See CONTRIBUTING.md.

Credits

Go port of Formasaurus.

License

MIT

Documentation

Overview

Package dit classifies HTML form, field, and page types.

It provides a three-stage ML pipeline: logistic regression for form types, a CRF model for field types, and logistic regression for page types.

c, _ := dit.New()
results, _ := c.ExtractForms(htmlString)
for _, r := range results {
    fmt.Println(r.Type)   // "login"
    fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
}

page, _ := c.ExtractPageType(htmlString)
fmt.Println(page.Type) // "login"
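The overview names logistic regression for the form and page stages. As a reminder of what a multinomial logistic regression computes at prediction time, here is a minimal sketch: one linear score per class, turned into probabilities with softmax. The class names, feature names, weights, and biases are invented for illustration, and this is not the package's internal code.

```go
package main

import (
	"fmt"
	"math"
)

// softmax converts raw class scores into probabilities that sum to 1.
// Subtracting the largest score first keeps math.Exp from overflowing.
func softmax(scores []float64) []float64 {
	maxScore := math.Inf(-1)
	for _, s := range scores {
		if s > maxScore {
			maxScore = s
		}
	}
	out := make([]float64, len(scores))
	sum := 0.0
	for i, s := range scores {
		out[i] = math.Exp(s - maxScore)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// score computes bias + w·x for a sparse feature vector.
// Features without a weight contribute nothing.
func score(weights map[string]float64, bias float64, features map[string]float64) float64 {
	total := bias
	for name, value := range features {
		total += weights[name] * value
	}
	return total
}

func main() {
	// Hypothetical classes, weights, and features, invented for this sketch.
	classes := []string{"login", "search"}
	weights := []map[string]float64{
		{"has_password_input": 2.0, "has_search_input": -1.0},
		{"has_password_input": -1.5, "has_search_input": 2.5},
	}
	biases := []float64{0.1, -0.2}
	features := map[string]float64{"has_password_input": 1}

	scores := make([]float64, len(classes))
	for i := range classes {
		scores[i] = score(weights[i], biases[i], features)
	}
	for i, p := range softmax(scores) {
		fmt.Printf("%s: %.3f\n", classes[i], p)
	}
}
```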

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindModel added in v0.0.3

func FindModel(name string) (string, error)

FindModel searches for a model file by name. It checks the current directory, then each parent directory up to the module root, and finally ~/.dit/.
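The search order described above can be sketched as follows. This is an illustration of the documented order, not the package's implementation: findUpward and isFile are names introduced here, and treating the first directory that contains go.mod as the module root is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// findUpward looks for name in start and then in each parent directory.
// It stops climbing after the first directory that contains go.mod
// (assumed here to mark the module root) and then tries fallbackDir.
// The exists callback makes the search testable without touching disk.
func findUpward(start, name, fallbackDir string, exists func(string) bool) (string, error) {
	dir := filepath.Clean(start)
	for {
		if candidate := filepath.Join(dir, name); exists(candidate) {
			return candidate, nil
		}
		if exists(filepath.Join(dir, "go.mod")) {
			break // reached the assumed module root
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			break // reached the filesystem root
		}
		dir = parent
	}
	if candidate := filepath.Join(fallbackDir, name); exists(candidate) {
		return candidate, nil
	}
	return "", errors.New("model file not found: " + name)
}

// isFile reports whether path exists and is not a directory.
func isFile(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	cwd, err := os.Getwd()
	if err != nil {
		fmt.Println("cannot determine working directory:", err)
		return
	}
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	path, err := findUpward(cwd, "model.json", filepath.Join(home, ".dit"), isFile)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("found:", path)
}
```

In real code, calling dit.FindModel or dit.New is the supported way to locate the model; the sketch only makes the lookup order concrete.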

func ModelDir added in v0.0.3

func ModelDir() string

ModelDir returns the default model storage directory (~/.dit).

Types

type Classifier

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier wraps the form, field, and page type classification models.

func Load

func Load(path string) (*Classifier, error)

Load loads a trained classifier from a model file.

func New

func New() (*Classifier, error)

New loads the classifier from "model.json", searching the current directory and parent directories up to the module root, then ~/.dit/model.json.

func Train

func Train(dataDir string, config *TrainConfig) (*Classifier, error)

Train trains a classifier on annotated HTML forms in the given data directory.

func (*Classifier) ExtractForms

func (c *Classifier) ExtractForms(html string) ([]FormResult, error)

ExtractForms extracts and classifies all forms in the given HTML string. Returns an empty slice (not nil) if no forms are found.
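The distinction between an empty slice and a nil slice matters mostly when results are serialized. The sketch below uses a local copy of the documented FormResult struct so that it compiles without the module; it shows that both slices have length 0 but encode differently with encoding/json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FormResult is a local copy of the documented struct, so that this
// sketch compiles without importing the module.
type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

// encode marshals a result slice and returns the JSON text.
func encode(results []FormResult) string {
	data, err := json.Marshal(results)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	var nilResults []FormResult      // what a nil return would look like
	emptyResults := []FormResult{}   // what ExtractForms is documented to return

	fmt.Println(len(nilResults), len(emptyResults)) // 0 0
	fmt.Println(encode(nilResults))                 // null
	fmt.Println(encode(emptyResults))               // []
}
```

Callers that only range over the result or check len(results) == 0 behave the same in both cases.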

func (*Classifier) ExtractFormsProba

func (c *Classifier) ExtractFormsProba(html string, threshold float64) ([]FormResultProba, error)

ExtractFormsProba extracts forms and returns classification probabilities. Probabilities below threshold are omitted.
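To make the threshold behaviour concrete, the sketch below applies the documented rule (probabilities below the threshold are dropped) to a plain map[string]float64, the same shape as FormResultProba.Type, and then picks the most probable label. The helper names and the sample probabilities are made up for illustration and are not part of the package.

```go
package main

import "fmt"

// filterByThreshold returns a copy of probs without the entries whose
// probability is below threshold.
func filterByThreshold(probs map[string]float64, threshold float64) map[string]float64 {
	kept := make(map[string]float64, len(probs))
	for label, p := range probs {
		if p >= threshold {
			kept[label] = p
		}
	}
	return kept
}

// topLabel returns the most probable label. Ties are broken
// alphabetically so the result does not depend on map iteration order.
// The boolean is false when probs is empty.
func topLabel(probs map[string]float64) (string, float64, bool) {
	best, bestP, found := "", 0.0, false
	for label, p := range probs {
		if !found || p > bestP || (p == bestP && label < best) {
			best, bestP, found = label, p, true
		}
	}
	return best, bestP, found
}

func main() {
	// Hypothetical probabilities for one form.
	probs := map[string]float64{"login": 0.91, "registration": 0.06, "search": 0.03}

	kept := filterByThreshold(probs, 0.05)
	fmt.Println(len(kept)) // 2: "search" falls below 0.05

	if label, p, ok := topLabel(kept); ok {
		fmt.Printf("%s (%.2f)\n", label, p) // login (0.91)
	}
}
```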

func (*Classifier) ExtractPageType added in v0.0.3

func (c *Classifier) ExtractPageType(html string) (*PageResult, error)

ExtractPageType classifies the page type and all forms in the HTML.

func (*Classifier) ExtractPageTypeProba added in v0.0.3

func (c *Classifier) ExtractPageTypeProba(html string, threshold float64) (*PageResultProba, error)

ExtractPageTypeProba classifies the page type with probabilities.

func (*Classifier) Save

func (c *Classifier) Save(path string) error

Save writes the classifier to a model file.

type EvalConfig

type EvalConfig struct {
	Folds   int
	Verbose bool
}

EvalConfig holds configuration for evaluation.

type EvalResult

type EvalResult struct {
	FormAccuracy     float64
	FieldAccuracy    float64
	SequenceAccuracy float64
	PageAccuracy     float64
	FormCorrect      int
	FormTotal        int
	FieldCorrect     int
	FieldTotal       int
	SequenceCorrect  int
	SequenceTotal    int
	PageCorrect      int
	PageTotal        int
	// Per-class metrics
	PageConfusion  map[string]map[string]int
	PageClasses    []string
	PagePrecision  map[string]float64
	PageRecall     map[string]float64
	PageF1         map[string]float64
	PageMacroF1    float64
	PageWeightedF1 float64
}

EvalResult holds cross-validation evaluation results.
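For readers unfamiliar with the two F1 aggregates, the sketch below derives per-class precision, recall, and F1 from a confusion matrix and then averages them two ways: macro F1 is the unweighted mean over classes, and weighted F1 weights each class by its number of true examples. It assumes the matrix is indexed as confusion[trueLabel][predictedLabel], which the documentation does not state, and the counts are invented. It is not the package's own code.

```go
package main

import "fmt"

// classMetrics holds per-class scores plus the number of true examples.
type classMetrics struct {
	Precision, Recall, F1 float64
	Support               int
}

// perClass computes metrics for each class from a confusion matrix
// indexed as confusion[trueLabel][predictedLabel] (an assumption).
func perClass(confusion map[string]map[string]int, classes []string) map[string]classMetrics {
	out := make(map[string]classMetrics, len(classes))
	for _, c := range classes {
		tp := confusion[c][c]
		support := 0 // row sum: examples whose true label is c
		for _, n := range confusion[c] {
			support += n
		}
		predicted := 0 // column sum: examples predicted as c
		for _, row := range confusion {
			predicted += row[c]
		}
		m := classMetrics{Support: support}
		if predicted > 0 {
			m.Precision = float64(tp) / float64(predicted)
		}
		if support > 0 {
			m.Recall = float64(tp) / float64(support)
		}
		if m.Precision+m.Recall > 0 {
			m.F1 = 2 * m.Precision * m.Recall / (m.Precision + m.Recall)
		}
		out[c] = m
	}
	return out
}

// macroAndWeighted averages per-class F1 without and with support weights.
func macroAndWeighted(metrics map[string]classMetrics, classes []string) (macro, weighted float64) {
	total := 0
	for _, c := range classes {
		macro += metrics[c].F1
		weighted += float64(metrics[c].Support) * metrics[c].F1
		total += metrics[c].Support
	}
	if len(classes) > 0 {
		macro /= float64(len(classes))
	}
	if total > 0 {
		weighted /= float64(total)
	}
	return macro, weighted
}

func main() {
	classes := []string{"login", "error"}
	confusion := map[string]map[string]int{ // made-up counts
		"login": {"login": 8, "error": 2},
		"error": {"login": 1, "error": 4},
	}
	macro, weighted := macroAndWeighted(perClass(confusion, classes), classes)
	fmt.Printf("macro F1 %.3f, weighted F1 %.3f\n", macro, weighted)
}
```

A macro F1 well below the weighted F1, as in the README's page results, usually means that some rare classes score poorly.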

func Evaluate

func Evaluate(dataDir string, config *EvalConfig) (*EvalResult, error)

Evaluate runs cross-validation evaluation on annotated data.
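The README describes the reported cross-validation as grouped by domain, meaning that pages from one domain never appear in both the training and the test part of a fold. How the package assigns domains to folds is not documented; the sketch below uses round-robin assignment over sorted domain names purely as an illustrative assumption, with helper names invented here.

```go
package main

import (
	"fmt"
	"sort"
)

// assignFolds maps each distinct group (for example a domain) to a fold
// index in [0, k). Groups are sorted first so the result is deterministic.
func assignFolds(groups []string, k int) map[string]int {
	if k < 1 {
		k = 1
	}
	seen := make(map[string]bool)
	var unique []string
	for _, g := range groups {
		if !seen[g] {
			seen[g] = true
			unique = append(unique, g)
		}
	}
	sort.Strings(unique)
	foldOf := make(map[string]int, len(unique))
	for i, g := range unique {
		foldOf[g] = i % k
	}
	return foldOf
}

// split returns the sample indices used for training and testing in the
// given fold. A sample is tested only in the fold its group belongs to.
func split(groups []string, foldOf map[string]int, fold int) (train, test []int) {
	for i, g := range groups {
		if foldOf[g] == fold {
			test = append(test, i)
		} else {
			train = append(train, i)
		}
	}
	return train, test
}

func main() {
	// One entry per sample: the domain that sample came from (made up).
	groups := []string{"a.com", "b.com", "a.com", "c.com", "b.com"}
	foldOf := assignFolds(groups, 2)
	for fold := 0; fold < 2; fold++ {
		train, test := split(groups, foldOf, fold)
		fmt.Printf("fold %d: train=%v test=%v\n", fold, train, test)
	}
}
```

Production implementations often also balance folds by group size; this sketch does not.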

type FormResult

type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

FormResult holds the classification result for a single form.
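The struct tags determine how a result looks as JSON. The sketch below uses a local copy of the documented struct to show two consequences: a form without classified fields omits the fields key entirely because of omitempty, and since Fields is a map, code that prints fields should sort the names if it needs a stable order. The helper names are introduced here for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// FormResult is a local copy of the documented struct, so that this
// sketch compiles without importing the module.
type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

// toJSON marshals a single result and returns the JSON text.
func toJSON(r FormResult) string {
	data, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

// sortedFieldNames returns the field names in alphabetical order,
// because ranging over a Go map yields keys in no fixed order.
func sortedFieldNames(r FormResult) []string {
	names := make([]string, 0, len(r.Fields))
	for name := range r.Fields {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	login := FormResult{
		Type:   "login",
		Fields: map[string]string{"username": "username or email", "password": "password"},
	}
	bare := FormResult{Type: "other"}

	fmt.Println(toJSON(login)) // {"type":"login","fields":{"password":"password","username":"username or email"}}
	fmt.Println(toJSON(bare))  // {"type":"other"}

	for _, name := range sortedFieldNames(login) {
		fmt.Printf("%s -> %s\n", name, login.Fields[name])
	}
}
```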

type FormResultProba

type FormResultProba struct {
	Type   map[string]float64            `json:"type"`
	Fields map[string]map[string]float64 `json:"fields,omitempty"`
}

FormResultProba holds probability-based classification results for a single form.

type PageResult added in v0.0.3

type PageResult struct {
	Type  string       `json:"type"`
	Forms []FormResult `json:"forms,omitempty"`
}

PageResult holds the page type classification result.

type PageResultProba added in v0.0.3

type PageResultProba struct {
	Type  map[string]float64 `json:"type"`
	Forms []FormResultProba  `json:"forms,omitempty"`
}

PageResultProba holds probability-based page type classification results.

type TrainConfig

type TrainConfig struct {
	Verbose bool
}

TrainConfig holds configuration for training.

Directories

Path Synopsis
Package classifier implements form and field type classification.
cmd
    dit (command)
    dit-collect (command)
Package crf implements a linear-chain Conditional Random Field.
internal
    cli
    htmlutil    Package htmlutil provides HTML form and field extraction utilities.
    storage     Package storage provides access to annotation data for form classification training.
    textutil    Package textutil provides text processing utilities for form classification.
    vectorizer  Package vectorizer provides text vectorization utilities matching sklearn behavior.
