refrax

Version: v1.0.2 (latest)
Published: Apr 2, 2026
License: MIT

Repository

github.com/0mjs/refrax

Refrax

Refrax is a Go library and CLI for extracting structured JSON from Healthlink-style clinical referral documents.

It is designed for technical teams, clinical informatics teams, and medical operations teams that need a reproducible way to turn referral text or PDFs into machine-readable data.

Refrax supports:

plain-text referral documents
PDF referral documents
automatic OCR fallback for image-only or weak-text PDFs
JSON output with confidence details and extraction warnings

Refrax is available as both a CLI and a Go package. The CLI is the quickest way to run it locally or in scripts; the Go API embeds the same extraction flow in backend services.

Scope

Refrax supports Healthlink-style referral documents. It is not:

an official Healthlink product
a Healthlink network integration
a clinical decision, triage, or diagnostic system
a workflow product for scheduling, messaging, or patient management

The project is focused on document parsing, normalization, detection, and structured extraction.

Installation

Install the CLI:

go install github.com/0mjs/refrax/cmd/refrax@latest

To validate the build, run the test suite from a checkout of the repository:

go test ./...

If the installed binary is not on your PATH, run it with:

$(go env GOPATH)/bin/refrax schema

Project release and governance files:

OCR Setup

Refrax has two operating modes:

base mode: text extraction and direct PDF text extraction
OCR-capable mode: base mode plus OCR fallback and forced OCR

On macOS, install OCR dependencies with Homebrew:

brew install tesseract
brew install imagemagick

tesseract is required for OCR fallback and --ocr.

ImageMagick is optional. Refrax uses it only for OCR preprocessing and OCR fixture generation when available.

After installing dependencies, verify the local runtime with:

refrax doctor

Runtime Dependencies

Text extraction from PDFs is handled through MuPDF via go-fitz.

OCR fallback requires:

tesseract

Optional image preprocessing for OCR uses:

ImageMagick convert

If tesseract is not available, Refrax will still process text and directly extractable PDFs, but forced OCR and OCR fallback will not run.

Use refrax doctor to inspect whether OCR and OCR preprocessing are available on the current machine.

Refrax performs extraction locally. It does not send document contents to a remote Refrax service.

Quick Start

Print the default schema:

refrax schema

Inspect local OCR capabilities:

refrax doctor

Detect whether a document looks like a supported Healthlink-style referral:

refrax detect ./testdata/samples/referral.txt

Extract structured JSON from text:

refrax extract ./testdata/samples/referral.txt --pretty

Extract with an additive schema overlay:

refrax extract ./testdata/samples/referral.txt --pretty --schema-overlay ./overlay.json

Extract structured JSON from a PDF:

refrax extract ./referral.pdf --pretty

Force OCR for a PDF:

refrax extract ./referral.pdf --pretty --ocr

Include extraction explanations:

refrax extract ./referral.pdf --pretty --explain

Include the normalized raw text used for extraction:

refrax extract ./referral.pdf --pretty --raw

Read text from standard input:

cat ./testdata/samples/referral.txt | refrax extract - --pretty

Extraction Behavior

For text inputs, Refrax extracts directly from the supplied text.

For PDF inputs, Refrax:

attempts direct text extraction with MuPDF
evaluates text quality, structure, confidence, and corruption indicators
falls back to OCR when the MuPDF result looks weak or corrupted

You can override that decision with --ocr to force OCR on a PDF.

The result always follows the same JSON envelope, regardless of whether the source was text, MuPDF, or OCR.

CLI Reference

`refrax schema`

Prints the default schema as JSON.

`refrax detect <file|->`

Runs document detection and returns:

document type
confidence
matched signals

`refrax doctor`

Reports runtime OCR capabilities, including whether Refrax can find:

tesseract
ImageMagick convert

This is the quickest way to confirm whether automatic OCR fallback is available on the current machine.
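
If you want a similar check inside your own deployment tooling, a rough equivalent is to look the tools up on PATH. The sketch below is not how refrax doctor is implemented; toolStatus and findTool are hypothetical names, and the check only confirms that a binary exists, not that it runs correctly.

```go
package main

import (
	"fmt"
	"os/exec"
)

// toolStatus is a hypothetical record of whether an external binary was found.
type toolStatus struct {
	Name  string
	Path  string
	Found bool
}

// findTool looks a binary up on PATH using the standard library.
func findTool(name string) toolStatus {
	path, err := exec.LookPath(name)
	if err != nil {
		return toolStatus{Name: name}
	}
	return toolStatus{Name: name, Path: path, Found: true}
}

func main() {
	// tesseract is required for OCR; ImageMagick's convert is optional.
	for _, name := range []string{"tesseract", "convert"} {
		st := findTool(name)
		fmt.Printf("%-10s found=%t path=%s\n", st.Name, st.Found, st.Path)
	}
}
```

For the authoritative answer about what Refrax itself can use, prefer refrax doctor or refrax.Capabilities().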

`refrax extract <file|->`

Extracts structured JSON from text or PDF input.

Flags:

--pretty: pretty-print JSON
--raw: include raw_text
--fuzzy: enable fuzzy field matching for noisy OCR text
--ocr: force OCR for PDF inputs
--explain: include field-level extraction and fallback explanations
--schema-overlay <file.json>: apply an additive schema overlay before extraction

`refrax schema [--schema-overlay <file.json>]`

Prints the default schema, or the merged effective schema when an overlay file is supplied.

Schema Overlays

Refrax supports additive schema overlays.

Overlays are intended for cases where your documents contain:

extra labels for built-in fields
additional labeled fields that should be extracted
custom multiline fields

Overlays do not remove built-in fields from the default schema.

An overlay JSON file looks like this:

{
  "name": "custom_referral",
  "version": "v2",
  "sections": {
    "history": [
      {
        "key": "reason_for_referral",
        "labels": ["Referral Reason"]
      },
      {
        "key": "preferred_consultant",
        "labels": ["Preferred Consultant"]
      },
      {
        "key": "additional_notes",
        "labels": ["Additional Notes"],
        "multiline": true
      }
    ]
  }
}

Common CLI usage:

refrax schema --schema-overlay ./overlay.json
refrax extract ./referral.pdf --pretty --schema-overlay ./overlay.json

Go Integration

The recommended application-facing API is the top-level refrax package.

Install And Import

Add Refrax to your Go module:

go get github.com/0mjs/refrax

Import it in your service code:

import "github.com/0mjs/refrax"

Basic Extraction

The main entry points are:

refrax.ExtractFile(path, opts...)
refrax.ExtractBytes(name, data, opts...)
refrax.ExtractText(text, opts...)
refrax.ExtractReader(name, r, opts...)

The simplest file-based example is:

package main

import (
	"fmt"

	"github.com/0mjs/refrax"
)

func main() {
	result, err := refrax.ExtractFile("referral.pdf")
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Method)
	fmt.Println(result.Confidence)
}

For in-memory PDF bytes:

result, err := refrax.ExtractBytes("referral.pdf", pdfBytes)

For plain text:

result, err := refrax.ExtractText(text)

Configuring Extraction

The same option helpers work with all four entry points:

result, err := refrax.ExtractReader(
	filename,
	file,
	refrax.WithFuzzyKeys(true),
	refrax.WithRawText(false),
	refrax.WithExplain(true),
	refrax.WithOCR(refrax.OCRAuto),
)

Available config controls:

refrax.WithFuzzyKeys(true) to tolerate noisier OCR labels
refrax.WithRawText(true) to return raw_text in the result
refrax.WithExplain(true) to include field-level and fallback explanations
refrax.WithOCR(refrax.OCRAuto) to allow MuPDF first, then OCR fallback
refrax.WithOCR(refrax.OCRForce) to skip MuPDF and force OCR for PDFs
refrax.WithOCR(refrax.OCRDisabled) to forbid OCR and rely on text or MuPDF only
refrax.WithSchemaOverlay(...) to add labels or fields on top of the built-in schema

For applications that prefer explicit config objects:

cfg := refrax.Config{
	FuzzyKeys:      true,
	IncludeRawText: false,
	Explain:        true,
	OCR:            refrax.OCRAuto,
}

result, err := refrax.ExtractWithConfig(refrax.File("referral.pdf"), cfg)

Applying A Schema Overlay

In Go, you can define the overlay inline:

overlay := refrax.SchemaOverlay{
	Name:    "custom_referral",
	Version: "v2",
	Sections: map[refrax.SchemaSection][]refrax.SchemaField{
		refrax.SectionHistory: {
			{Key: "reason_for_referral", Labels: []string{"Referral Reason"}},
			{Key: "preferred_consultant", Labels: []string{"Preferred Consultant"}},
			{Key: "additional_notes", Labels: []string{"Additional Notes"}, Multiline: true},
		},
	},
}

result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)

Or load it from JSON:

overlay, err := refrax.LoadSchemaOverlayFile("overlay.json")
if err != nil {
	panic(err)
}

result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)

Choosing OCR Behavior

For PDF inputs, Refrax normally:

extracts text with MuPDF
scores the extracted text
falls back to OCR if the direct extraction looks too weak or corrupted

Use these OCR modes depending on your API contract:

refrax.OCRAuto for the default production path
refrax.OCRForce when a caller explicitly requests OCR
refrax.OCRDisabled when you want deterministic non-OCR behavior

You can inspect runtime support in process:

caps := refrax.Capabilities()
if !caps.OCRAvailable {
	fmt.Println("tesseract is not installed on this machine")
}
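
The interaction between the OCR mode, the quality of the direct extraction, and tesseract availability can be pictured as a small decision function. This is an illustrative sketch only, not Refrax's implementation: pickMethod and its parameters are hypothetical, the real weak-text scoring is internal to Refrax, and returning an error when OCR is forced without tesseract is an assumption, since that behavior is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode values mirror the documented OCRMode strings.
const (
	modeAuto     = "auto"
	modeForce    = "force"
	modeDisabled = "disabled"
)

// errOCRUnavailable is a placeholder error for this sketch.
var errOCRUnavailable = errors.New("OCR requested but tesseract is not available")

// pickMethod returns "text", "mupdf", or "ocr" for a single input.
// directTextWeak stands in for Refrax's internal quality scoring.
func pickMethod(isPDF bool, mode string, directTextWeak, ocrAvailable bool) (string, error) {
	if !isPDF {
		return "text", nil // text inputs are extracted directly
	}
	switch mode {
	case modeForce:
		if !ocrAvailable {
			return "", errOCRUnavailable // assumption, see the note above
		}
		return "ocr", nil
	case modeDisabled:
		return "mupdf", nil
	default: // auto: MuPDF first, OCR only when the direct text looks weak
		if directTextWeak && ocrAvailable {
			return "ocr", nil
		}
		return "mupdf", nil
	}
}

func main() {
	m, err := pickMethod(true, modeAuto, true, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(m)
}
```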

Using Refrax In A `net/http` API

This pattern works for a typical upload endpoint that accepts a single file and returns extraction JSON:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/0mjs/refrax"
)

func main() {
	http.HandleFunc("/extract", extractHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func extractHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}

	r.Body = http.MaxBytesReader(w, r.Body, 20<<20)
	if err := r.ParseMultipartForm(20 << 20); err != nil {
		http.Error(w, "invalid multipart form", http.StatusBadRequest)
		return
	}

	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, "missing file", http.StatusBadRequest)
		return
	}
	defer file.Close()

	result, err := refrax.ExtractReader(
		header.Filename,
		file,
		refrax.WithFuzzyKeys(true),
		refrax.WithOCR(refrax.OCRAuto),
	)
	if err != nil {
		http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		return
	}

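	w.Header().Set("Content-Type", "application/json")
	// The status line is already sent at this point, so an encode error can only be logged.
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode response: %v", err)
	}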
}

In most APIs, the useful production checks are:

reject oversized uploads before extraction
log method, confidence, and warnings
decide what to do with low-confidence results before downstream persistence
expose whether OCR was used, especially if extraction latency matters
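
One way to act on low-confidence results is a small policy gate that runs on the serialized result before anything is persisted. The sketch below works on the documented JSON keys only. The decision type, the threshold, and the rule that any warning triggers review are hypothetical choices, and deliberately conservative ones; pick values that fit your own workflow.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope holds only the result keys this policy needs.
type envelope struct {
	Method     string   `json:"method"`
	Confidence float64  `json:"confidence"`
	Warnings   []string `json:"warnings"`
}

// decision is a hypothetical application-level outcome.
type decision string

const (
	accept decision = "accept"
	review decision = "review"
)

// decide routes a result to manual review when its confidence is below the
// caller's threshold or when it carries any warning.
func decide(resultJSON []byte, minConfidence float64) (decision, error) {
	var e envelope
	if err := json.Unmarshal(resultJSON, &e); err != nil {
		return "", fmt.Errorf("decode result: %w", err)
	}
	if e.Confidence < minConfidence || len(e.Warnings) > 0 {
		return review, nil
	}
	return accept, nil
}

func main() {
	sample := []byte(`{"method":"ocr","confidence":97.42,"warnings":["used_ocr_fallback"]}`)
	d, err := decide(sample, 80)
	if err != nil {
		panic(err)
	}
	// The warning sends this otherwise high-confidence result to review.
	fmt.Println(d)
}
```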

Returning Stable API Responses

Many applications should not forward the raw Refrax result unchanged. A common pattern is to wrap it in an application response with your own request identifiers and policy fields.

type ExtractResponse struct {
	RequestID    string                    `json:"request_id"`
	DocumentType string                    `json:"document_type"`
	Method       refrax.Method             `json:"method"`
	Confidence   float64                   `json:"confidence"`
	Warnings     []string                  `json:"warnings,omitempty"`
	Data         map[string]map[string]any `json:"data"`
}

That gives you space to add workflow-specific fields later without changing Refrax itself.

Source Constructors

Refrax also exposes source constructors for cases where a single Source value is useful:

refrax.File(path)
refrax.Bytes(name, data)
refrax.Text(text)
refrax.Reader(name, r)

Pass the resulting Source to refrax.Extract:

result, err := refrax.Extract(refrax.File("referral.pdf"))

But for most applications, the ExtractFile / ExtractBytes / ExtractText / ExtractReader helpers are the cleaner entry points.

Result Shape

Extraction returns a JSON object with this top-level structure:

{
  "document_type": "healthlink_style_referral",
  "schema": "healthlink_style_referral:v1",
  "method": "ocr",
  "confidence": 97.42,
  "confidence_details": {
    "detector_contribution": 13.6,
    "text_quality": 47.82,
    "text_volume": 6,
    "field_coverage": 20,
    "medical_context": 10,
    "penalties": 0
  },
  "metadata": {
    "page_count": 1,
    "corruption": {
      "severity": "moderately_corrupted",
      "indicators": [
        "garbled_text_layer",
        "sparse_text_layer"
      ],
      "pages_with_text": 1
    },
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume"
    },
    "ocr": {
      "selected_psm": "6",
      "preprocessing_applied": true
    }
  },
  "explanation": {
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume",
      "summary": "MuPDF text looked corrupted and too sparse, so OCR was used instead."
    },
    "fields": [
      {
        "section": "patient",
        "field": "age",
        "matched_by": "label",
        "label": "Age",
        "raw_value": "33 years",
        "normalized_value": "33"
      }
    ]
  },
  "data": {
    "patient": {
      "age": "33",
      "gender": "Female"
    }
  },
  "warnings": [
    "used_ocr_fallback"
  ]
}

Important fields:

document_type: current detector verdict
schema: schema name and version
method: text, mupdf, or ocr
confidence: overall extraction confidence
confidence_details: component-level scoring signals. See CONFIDENCE.md for the scoring formula, ranges, and interpretation guidance.
metadata.page_count: number of PDF pages processed when the source is a PDF
metadata.corruption: MuPDF-side corruption severity, indicators, and page extraction stats
metadata.fallback: whether OCR fallback was considered, attempted, or used
metadata.ocr: selected Tesseract PSM and whether preprocessing was applied
explanation.fallback: human-readable reason for fallback decisions when --explain or WithExplain(true) is used
explanation.fields: field-by-field extraction reasons, matched labels, and normalized values
data: structured extracted sections and fields
warnings: extraction quality or fallback warnings

Benchmarks

Refrax includes benchmark coverage for:

text extraction
direct PDF text extraction via MuPDF
OCR fallback extraction for image-only PDFs

Run the extraction benchmarks with:

go test ./pkg/extract -run '^$' -bench . -benchmem

The OCR fallback benchmark is skipped automatically when tesseract is not installed.

OCR Fixture Generation

Refrax includes a helper for generating image-only OCR test fixtures from local PDFs.

Create an OCR-targeted PDF:

refrax fixture ocrify ./referral.pdf ./referral-ocr.pdf

Create a degraded OCR-targeted PDF:

refrax fixture ocrify ./referral.pdf ./referral-ocr-degraded.pdf --degrade

Optional flags:

--dpi N
--quality N

This command is intended for test consistency and OCR verification. It is not part of normal production extraction.

Do not commit or share generated outputs derived from unsanitized clinical documents. Prefer synthetic or scrubbed inputs that comply with DATA_POLICY.md.

Notes For Clinical and Technical Review

Refrax extracts and normalizes document content. It does not validate clinical appropriateness, clinical priority, or referral suitability.

Before production deployment, review:

expected field coverage for your document set
confidence thresholds appropriate for your workflow
handling of low-confidence or warning-bearing results
data governance requirements for local storage, logs, and onward transmission

If you want a deeper explanation of how confidence is calculated and how to interpret confidence_details, see CONFIDENCE.md.

Documentation ¶

Index ¶

Constants
type ConfidenceDetails
type Config
- func DefaultConfig() Config
type CorruptionMetadata
type Explanation
type ExtractionMetadata
type FallbackExplanation
type FallbackMetadata
type FieldExplanation
type Method
type OCRMetadata
type OCRMode
type Option
- func DisableOCR() Option
- func ForceOCR() Option
- func WithConfig(cfg Config) Option
- func WithExplain(enabled bool) Option
- func WithFuzzyKeys(enabled bool) Option
- func WithOCR(mode OCRMode) Option
- func WithRawText(enabled bool) Option
- func WithSchema(def Schema) Option
- func WithSchemaOverlay(overlay SchemaOverlay) Option
type Result
- func Extract(src Source, opts ...Option) (*Result, error)
- func ExtractBytes(name string, data []byte, opts ...Option) (*Result, error)
- func ExtractFile(path string, opts ...Option) (*Result, error)
- func ExtractReader(name string, r io.Reader, opts ...Option) (*Result, error)
- func ExtractText(text string, opts ...Option) (*Result, error)
- func ExtractWithConfig(src Source, cfg Config) (*Result, error)
type RuntimeCapabilities
- func Capabilities() RuntimeCapabilities
type Schema
- func DefaultSchema() Schema
type SchemaField
type SchemaOverlay
- func LoadSchemaOverlayFile(path string) (SchemaOverlay, error)
type SchemaSection
type Source
- func Bytes(name string, data []byte) Source
- func File(path string) Source
- func Reader(name string, r io.Reader) Source
- func Text(text string) Source
type ToolCapabilities

Constants ¶

const (
	MethodText  = types.MethodText
	MethodMuPDF = types.MethodMuPDF
	MethodOCR   = types.MethodOCR
)

Types ¶

type ConfidenceDetails ¶

type ConfidenceDetails = types.ConfidenceDetails

type Config ¶

type Config struct {
	FuzzyKeys      bool
	IncludeRawText bool
	Explain        bool
	OCR            OCRMode
	Schema         Schema
	SchemaOverlay  *SchemaOverlay
}

func DefaultConfig ¶

func DefaultConfig() Config

type CorruptionMetadata ¶

type CorruptionMetadata = types.CorruptionMetadata

type Explanation ¶

type Explanation = types.Explanation

type ExtractionMetadata ¶

type ExtractionMetadata = types.ExtractionMetadata

type FallbackExplanation ¶

type FallbackExplanation = types.FallbackExplanation

type FallbackMetadata ¶

type FallbackMetadata = types.FallbackMetadata

type FieldExplanation ¶

type FieldExplanation = types.FieldExplanation

type Method ¶

type Method = types.Method

type OCRMetadata ¶

type OCRMetadata = types.OCRMetadata

type OCRMode ¶

type OCRMode string

const (
	OCRAuto     OCRMode = "auto"
	OCRForce    OCRMode = "force"
	OCRDisabled OCRMode = "disabled"
)

type Option ¶

type Option func(*Config)

func DisableOCR ¶

func DisableOCR() Option

func ForceOCR ¶

func ForceOCR() Option

func WithConfig ¶

func WithConfig(cfg Config) Option

func WithExplain ¶

func WithExplain(enabled bool) Option

func WithFuzzyKeys ¶

func WithFuzzyKeys(enabled bool) Option

func WithOCR ¶

func WithOCR(mode OCRMode) Option

func WithRawText ¶

func WithRawText(enabled bool) Option

func WithSchema ¶ added in v1.0.1

func WithSchema(def Schema) Option

func WithSchemaOverlay ¶ added in v1.0.1

func WithSchemaOverlay(overlay SchemaOverlay) Option

type Result ¶

type Result = types.Result

func Extract ¶

func Extract(src Source, opts ...Option) (*Result, error)

func ExtractBytes ¶ added in v1.0.1

func ExtractBytes(name string, data []byte, opts ...Option) (*Result, error)

func ExtractFile ¶ added in v1.0.1

func ExtractFile(path string, opts ...Option) (*Result, error)

func ExtractReader ¶ added in v1.0.1

func ExtractReader(name string, r io.Reader, opts ...Option) (*Result, error)

func ExtractText ¶ added in v1.0.1

func ExtractText(text string, opts ...Option) (*Result, error)

func ExtractWithConfig ¶

func ExtractWithConfig(src Source, cfg Config) (*Result, error)

type RuntimeCapabilities ¶

type RuntimeCapabilities = capabilities.Report

func Capabilities ¶

func Capabilities() RuntimeCapabilities

type Schema ¶ added in v1.0.1

type Schema = schema.Definition

func DefaultSchema ¶ added in v1.0.1

func DefaultSchema() Schema

type SchemaField ¶ added in v1.0.1

type SchemaField = schema.Field

type SchemaOverlay ¶ added in v1.0.1

type SchemaOverlay = schema.Overlay

func LoadSchemaOverlayFile ¶ added in v1.0.1

func LoadSchemaOverlayFile(path string) (SchemaOverlay, error)

type SchemaSection ¶ added in v1.0.1

type SchemaSection = schema.Section

const (
	SectionPatient     SchemaSection = schema.SectionPatient
	SectionExamination SchemaSection = schema.SectionExamination
	SectionHistory     SchemaSection = schema.SectionHistory
	SectionMetrics     SchemaSection = schema.SectionMetrics
	SectionSocial      SchemaSection = schema.SectionSocial
	SectionMedication  SchemaSection = schema.SectionMedication
)

type Source ¶

type Source struct {
	// contains filtered or unexported fields
}

func Bytes ¶

func Bytes(name string, data []byte) Source

func File ¶

func File(path string) Source

func Reader ¶

func Reader(name string, r io.Reader) Source

func Text ¶

func Text(text string) Source

type ToolCapabilities ¶

type ToolCapabilities = capabilities.Tool

Source Files ¶

refrax.go

Directories ¶

cmd/refrax (command)
internal/capabilities
internal/fixture
internal/mupdf
internal/ocr
internal/testpdf
pkg/detect
pkg/extract
pkg/normalize
pkg/schema
pkg/types