refrax

package module
v1.0.2
Published: Apr 2, 2026 License: MIT Imports: 8 Imported by: 0

README

Refrax

Refrax is a Go library and CLI for extracting structured JSON from Healthlink-style clinical referral documents.

It is designed for technical teams, clinical informatics teams, and medical operations teams that need a reproducible way to turn referral text or PDFs into machine-readable data.

Refrax supports:

  • plain-text referral documents
  • PDF referral documents
  • automatic OCR fallback for image-only or weak-text PDFs
  • JSON output with confidence details and extraction warnings

Refrax is available as both a CLI and a Go package. The CLI is the quickest way to run it locally or in scripts, and the Go API is intended for embedding the same extraction flow into backend services.

Scope

Refrax supports Healthlink-style referral documents. It is not:

  • an official Healthlink product
  • a Healthlink network integration
  • a clinical decision, triage, or diagnostic system
  • a workflow product for scheduling, messaging, or patient management

The project is focused on document parsing, normalization, detection, and structured extraction.

Installation

Install the CLI:

go install github.com/0mjs/refrax/cmd/refrax@latest

To validate the build, run the test suite from a checkout of the repository:

go test ./...

If the installed binary is not on your PATH, run it with:

$(go env GOPATH)/bin/refrax schema

OCR Setup

Refrax has two operating modes:

  • base mode: text extraction and direct PDF text extraction
  • OCR-capable mode: base mode plus OCR fallback and forced OCR

On macOS, install OCR dependencies with Homebrew:

brew install tesseract
brew install imagemagick

tesseract is required for OCR fallback and --ocr.

ImageMagick is optional. Refrax uses it only for OCR preprocessing and OCR fixture generation when available.

After installing dependencies, verify the local runtime with:

refrax doctor

Runtime Dependencies

Text extraction from PDFs is handled through MuPDF via go-fitz.

OCR fallback requires:

  • tesseract

Optional image preprocessing for OCR uses:

  • ImageMagick convert

If tesseract is not available, Refrax will still process text and directly extractable PDFs, but forced OCR and OCR fallback will not run.

Use refrax doctor to inspect whether OCR and OCR preprocessing are available on the current machine.
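
For deployment scripts that cannot call refrax doctor, a comparable check can be approximated with the Go standard library. The sketch below only tests whether the two tool names listed above resolve on the PATH. It is an assumption about what a useful check looks like, not a description of how refrax doctor works, and toolOnPath is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os/exec"
)

// toolOnPath reports whether an executable with the given name can be
// resolved on the current PATH. It is a hypothetical helper, not part of Refrax.
func toolOnPath(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

func main() {
	// Tool names come from the dependency list above. Whether refrax doctor
	// performs the same lookup is an assumption.
	for _, tool := range []string{"tesseract", "convert"} {
		fmt.Printf("%-10s available: %v\n", tool, toolOnPath(tool))
	}
}
```

A missing tesseract here suggests that only base mode will be usable on that machine.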

Refrax performs extraction locally. It does not send document contents to a remote Refrax service.

Quick Start

Print the default schema:

refrax schema

Inspect local OCR capabilities:

refrax doctor

Detect whether a document looks like a supported Healthlink-style referral:

refrax detect ./testdata/samples/referral.txt

Extract structured JSON from text:

refrax extract ./testdata/samples/referral.txt --pretty

Extract with an additive schema overlay:

refrax extract ./testdata/samples/referral.txt --pretty --schema-overlay ./overlay.json

Extract structured JSON from a PDF:

refrax extract ./referral.pdf --pretty

Force OCR for a PDF:

refrax extract ./referral.pdf --pretty --ocr

Include extraction explanations:

refrax extract ./referral.pdf --pretty --explain

Include the normalized raw text used for extraction:

refrax extract ./referral.pdf --pretty --raw

Read text from standard input:

cat ./testdata/samples/referral.txt | refrax extract - --pretty

Extraction Behavior

For text inputs, Refrax extracts directly from the supplied text.

For PDF inputs, Refrax:

  1. attempts direct text extraction with MuPDF
  2. evaluates text quality, structure, confidence, and corruption indicators
  3. falls back to OCR when the MuPDF result looks weak or corrupted

You can override that decision with --ocr to force OCR on a PDF.

The result always follows the same JSON envelope, regardless of whether the source was text, MuPDF, or OCR.

CLI Reference

refrax schema

Prints the default schema as JSON.

refrax detect <file|->

Runs document detection and returns:

  • document type
  • confidence
  • matched signals

refrax doctor

Reports runtime OCR capabilities, including whether Refrax can find:

  • tesseract
  • ImageMagick convert

This is the quickest way to confirm whether automatic OCR fallback is available on the current machine.

refrax extract <file|->

Extracts structured JSON from text or PDF input.

Flags:

  • --pretty: pretty-print JSON
  • --raw: include raw_text
  • --fuzzy: enable fuzzy field matching for noisy OCR text
  • --ocr: force OCR for PDF inputs
  • --explain: include field-level extraction and fallback explanations
  • --schema-overlay <file.json>: apply an additive schema overlay before extraction

refrax schema [--schema-overlay <file.json>]

Prints the default schema, or the merged effective schema when an overlay file is supplied.

Schema Overlays

Refrax supports additive schema overlays.

Overlays are intended for cases where your documents contain:

  • extra labels for built-in fields
  • additional labeled fields that should be extracted
  • custom multiline fields

Overlays do not remove built-in fields from the default schema.

An overlay JSON file looks like this:

{
  "name": "custom_referral",
  "version": "v2",
  "sections": {
    "history": [
      {
        "key": "reason_for_referral",
        "labels": ["Referral Reason"]
      },
      {
        "key": "preferred_consultant",
        "labels": ["Preferred Consultant"]
      },
      {
        "key": "additional_notes",
        "labels": ["Additional Notes"],
        "multiline": true
      }
    ]
  }
}
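
Before handing an overlay file to Refrax, it can be useful to sanity-check it in CI. The sketch below decodes the JSON shape shown above into local stand-in types and rejects entries without a key or labels. Both the types and the validation rules are assumptions made for illustration; Refrax's own loader may accept or reject different inputs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// overlayField and overlayFile mirror the JSON shape shown above.
// They are local stand-ins, not Refrax's schema types.
type overlayField struct {
	Key       string   `json:"key"`
	Labels    []string `json:"labels"`
	Multiline bool     `json:"multiline,omitempty"`
}

type overlayFile struct {
	Name     string                    `json:"name"`
	Version  string                    `json:"version"`
	Sections map[string][]overlayField `json:"sections"`
}

// checkOverlay decodes overlay JSON and applies two assumed rules:
// every field needs a key and at least one label.
func checkOverlay(data []byte) (overlayFile, error) {
	var o overlayFile
	if err := json.Unmarshal(data, &o); err != nil {
		return o, fmt.Errorf("decode overlay: %w", err)
	}
	for section, fields := range o.Sections {
		for i, f := range fields {
			if f.Key == "" {
				return o, fmt.Errorf("section %q field %d: missing key", section, i)
			}
			if len(f.Labels) == 0 {
				return o, fmt.Errorf("section %q field %q: no labels", section, f.Key)
			}
		}
	}
	return o, nil
}

func main() {
	data := []byte(`{"name":"custom_referral","version":"v2","sections":{"history":[
		{"key":"additional_notes","labels":["Additional Notes"],"multiline":true}]}}`)
	o, err := checkOverlay(data)
	if err != nil {
		fmt.Println("invalid overlay:", err)
		return
	}
	fmt.Println(o.Name, o.Version, len(o.Sections["history"]))
}
```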

Common CLI usage:

refrax schema --schema-overlay ./overlay.json
refrax extract ./referral.pdf --pretty --schema-overlay ./overlay.json

Go Integration

The recommended application-facing API is the top-level refrax package.

Install And Import

Add Refrax to your Go module:

go get github.com/0mjs/refrax

Import it in your service code:

import "github.com/0mjs/refrax"

Basic Extraction

The main entry points are:

  • refrax.ExtractFile(path, opts...)
  • refrax.ExtractBytes(name, data, opts...)
  • refrax.ExtractText(text, opts...)
  • refrax.ExtractReader(name, r, opts...)

The simplest file-based example is:

package main

import (
	"fmt"

	"github.com/0mjs/refrax"
)

func main() {
	result, err := refrax.ExtractFile("referral.pdf")
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Method)
	fmt.Println(result.Confidence)
}

For in-memory PDF bytes:

result, err := refrax.ExtractBytes("referral.pdf", pdfBytes)

For plain text:

result, err := refrax.ExtractText(text)

Configuring Extraction

The same option helpers work with all four entry points:

result, err := refrax.ExtractReader(
	filename,
	file,
	refrax.WithFuzzyKeys(true),
	refrax.WithRawText(false),
	refrax.WithExplain(true),
	refrax.WithOCR(refrax.OCRAuto),
)

Available config controls:

  • refrax.WithFuzzyKeys(true) to tolerate noisier OCR labels
  • refrax.WithRawText(true) to return raw_text in the result
  • refrax.WithExplain(true) to include field-level and fallback explanations
  • refrax.WithOCR(refrax.OCRAuto) to allow MuPDF first, then OCR fallback
  • refrax.WithOCR(refrax.OCRForce) to skip MuPDF and force OCR for PDFs
  • refrax.WithOCR(refrax.OCRDisabled) to forbid OCR and rely on text or MuPDF only
  • refrax.WithSchemaOverlay(...) to add labels or fields on top of the built-in schema

For applications that prefer explicit config objects:

cfg := refrax.Config{
	FuzzyKeys:      true,
	IncludeRawText: false,
	Explain:        true,
	OCR:            refrax.OCRAuto,
}

result, err := refrax.ExtractWithConfig(refrax.File("referral.pdf"), cfg)

Applying A Schema Overlay

In Go, you can define the overlay inline:

overlay := refrax.SchemaOverlay{
	Name:    "custom_referral",
	Version: "v2",
	Sections: map[refrax.SchemaSection][]refrax.SchemaField{
		refrax.SectionHistory: {
			{Key: "reason_for_referral", Labels: []string{"Referral Reason"}},
			{Key: "preferred_consultant", Labels: []string{"Preferred Consultant"}},
			{Key: "additional_notes", Labels: []string{"Additional Notes"}, Multiline: true},
		},
	},
}

result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)

Or load it from JSON:

overlay, err := refrax.LoadSchemaOverlayFile("overlay.json")
if err != nil {
	panic(err)
}

result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)

Choosing OCR Behavior

For PDF inputs, Refrax normally:

  1. extracts text with MuPDF
  2. scores the extracted text
  3. falls back to OCR if the direct extraction looks too weak or corrupted

Use these OCR modes depending on your API contract:

  • refrax.OCRAuto for the default production path
  • refrax.OCRForce when a caller explicitly requests OCR
  • refrax.OCRDisabled when you want deterministic non-OCR behavior
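
The interaction between the three modes and tesseract availability can be pictured as a small decision function. This is a simplified illustration, not Refrax's implementation: the real scoring of "weak or corrupted" text is internal, the names below are local stand-ins, and returning an error when OCR is forced without tesseract is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ocrMode reuses the documented OCRMode string values; the type is a local stand-in.
type ocrMode string

const (
	ocrAuto     ocrMode = "auto"
	ocrForce    ocrMode = "force"
	ocrDisabled ocrMode = "disabled"
)

// pickMethod sketches which extraction path a PDF would take.
// weakText stands for "direct extraction looked weak or corrupted".
// Erroring on forced OCR without tesseract is an assumption.
func pickMethod(mode ocrMode, weakText, ocrAvailable bool) (string, error) {
	switch mode {
	case ocrForce:
		if !ocrAvailable {
			return "", errors.New("OCR was forced but tesseract is not available")
		}
		return "ocr", nil
	case ocrDisabled:
		return "mupdf", nil
	default: // ocrAuto: direct text first, OCR only as a fallback
		if weakText && ocrAvailable {
			return "ocr", nil
		}
		return "mupdf", nil
	}
}

func main() {
	// Show the chosen path for each mode when direct text is weak and OCR exists.
	for _, mode := range []ocrMode{ocrAuto, ocrForce, ocrDisabled} {
		method, err := pickMethod(mode, true, true)
		fmt.Println(mode, method, err)
	}
}
```

Only the disabled mode gives the same answer regardless of text quality, which is why it suits callers that need deterministic non-OCR behavior.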

You can inspect runtime support in process:

caps := refrax.Capabilities()
if !caps.OCRAvailable {
	fmt.Println("tesseract is not installed on this machine")
}

Using Refrax In A net/http API

This pattern works for a typical upload endpoint that accepts a single file and returns extraction JSON:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/0mjs/refrax"
)

func main() {
	http.HandleFunc("/extract", extractHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func extractHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}

	r.Body = http.MaxBytesReader(w, r.Body, 20<<20)
	if err := r.ParseMultipartForm(20 << 20); err != nil {
		http.Error(w, "invalid multipart form", http.StatusBadRequest)
		return
	}

	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, "missing file", http.StatusBadRequest)
		return
	}
	defer file.Close()

	result, err := refrax.ExtractReader(
		header.Filename,
		file,
		refrax.WithFuzzyKeys(true),
		refrax.WithOCR(refrax.OCRAuto),
	)
	if err != nil {
		http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		return
	}

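	w.Header().Set("Content-Type", "application/json")
	// Log encoding or write failures instead of silently dropping them.
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode response: %v", err)
	}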
}

In most APIs, the useful production checks are:

  • reject oversized uploads before extraction
  • log method, confidence, and warnings
  • decide what to do with low-confidence results before downstream persistence
  • expose whether OCR was used, especially if extraction latency matters

Returning Stable API Responses

Many applications should not forward the raw Refrax result unchanged. A common pattern is to wrap it in an application response with your own request identifiers and policy fields.

type ExtractResponse struct {
	RequestID    string                    `json:"request_id"`
	DocumentType string                    `json:"document_type"`
	Method       refrax.Method             `json:"method"`
	Confidence   float64                   `json:"confidence"`
	Warnings     []string                  `json:"warnings,omitempty"`
	Data         map[string]map[string]any `json:"data"`
}

That gives you space to add workflow-specific fields later without changing Refrax itself.

Source Constructors

Refrax still exposes source constructors for cases where a single Source value is useful:

  • refrax.File(path)
  • refrax.Bytes(name, data)
  • refrax.Text(text)
  • refrax.Reader(name, r)

Pass the resulting Source to refrax.Extract:

result, err := refrax.Extract(refrax.File("referral.pdf"))

But for most applications, the ExtractFile / ExtractBytes / ExtractText / ExtractReader helpers are the cleaner entry points.

Result Shape

Extraction returns a JSON object with this top-level structure:

{
  "document_type": "healthlink_style_referral",
  "schema": "healthlink_style_referral:v1",
  "method": "ocr",
  "confidence": 97.42,
  "confidence_details": {
    "detector_contribution": 13.6,
    "text_quality": 47.82,
    "text_volume": 6,
    "field_coverage": 20,
    "medical_context": 10,
    "penalties": 0
  },
  "metadata": {
    "page_count": 1,
    "corruption": {
      "severity": "moderately_corrupted",
      "indicators": [
        "garbled_text_layer",
        "sparse_text_layer"
      ],
      "pages_with_text": 1
    },
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume"
    },
    "ocr": {
      "selected_psm": "6",
      "preprocessing_applied": true
    }
  },
  "explanation": {
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume",
      "summary": "MuPDF text looked corrupted and too sparse, so OCR was used instead."
    },
    "fields": [
      {
        "section": "patient",
        "field": "age",
        "matched_by": "label",
        "label": "Age",
        "raw_value": "33 years",
        "normalized_value": "33"
      }
    ]
  },
  "data": {
    "patient": {
      "age": "33",
      "gender": "Female"
    }
  },
  "warnings": [
    "used_ocr_fallback"
  ]
}

Important fields:

  • document_type: current detector verdict
  • schema: schema name and version
  • method: text, mupdf, or ocr
  • confidence: overall extraction confidence
  • confidence_details: component-level scoring signals. See CONFIDENCE.md for the scoring formula, ranges, and interpretation guidance.
  • metadata.page_count: number of PDF pages processed when the source is a PDF
  • metadata.corruption: MuPDF-side corruption severity, indicators, and page extraction stats
  • metadata.fallback: whether OCR fallback was considered, attempted, or used
  • metadata.ocr: selected Tesseract PSM and whether preprocessing was applied
  • explanation.fallback: human-readable reason for fallback decisions when --explain or WithExplain(true) is used
  • explanation.fields: field-by-field extraction reasons, matched labels, and normalized values
  • data: structured extracted sections and fields
  • warnings: extraction quality or fallback warnings

Benchmarks

Refrax includes benchmark coverage for:

  • text extraction
  • direct PDF text extraction via MuPDF
  • OCR fallback extraction for image-only PDFs

Run the extraction benchmarks with:

go test ./pkg/extract -run '^$' -bench . -benchmem

The OCR fallback benchmark is skipped automatically when tesseract is not installed.

OCR Fixture Generation

Refrax includes a helper for generating image-only OCR test fixtures from local PDFs.

Create an OCR-targeted PDF:

refrax fixture ocrify ./referral.pdf ./referral-ocr.pdf

Create a degraded OCR-targeted PDF:

refrax fixture ocrify ./referral.pdf ./referral-ocr-degraded.pdf --degrade

Optional flags:

  • --dpi N
  • --quality N

This command is intended for test consistency and OCR verification. It is not part of normal production extraction.

Do not commit or share generated outputs derived from unsanitized clinical documents. Prefer synthetic or scrubbed inputs that comply with DATA_POLICY.md.

Notes For Clinical and Technical Review

Refrax extracts and normalizes document content. It does not validate clinical appropriateness, clinical priority, or referral suitability.

Before production deployment, review:

  • expected field coverage for your document set
  • confidence thresholds appropriate for your workflow
  • handling of low-confidence or warning-bearing results
  • data governance requirements for local storage, logs, and onward transmission
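
One way to make the threshold and warning handling explicit is a small triage function that runs before persistence. The sketch below is illustrative only: the decision names and the 90 and 50 thresholds are hypothetical, and it assumes scores on the same scale as the sample result (97.42). Appropriate values depend on your document set and on CONFIDENCE.md.

```go
package main

import "fmt"

// decision is a hypothetical application-level outcome, not a Refrax type.
type decision string

const (
	accept decision = "accept"
	review decision = "review"
	reject decision = "reject"
)

// triage maps a confidence score and warnings to a decision.
// Results below rejectBelow are rejected; results below acceptAt, or with
// any warning, are routed to manual review; everything else is accepted.
func triage(confidence float64, warnings []string, acceptAt, rejectBelow float64) decision {
	switch {
	case confidence < rejectBelow:
		return reject
	case confidence < acceptAt || len(warnings) > 0:
		return review
	default:
		return accept
	}
}

func main() {
	// Hypothetical thresholds; tune them against your own documents.
	const acceptAt, rejectBelow = 90.0, 50.0

	fmt.Println(triage(97.42, []string{"used_ocr_fallback"}, acceptAt, rejectBelow))
	fmt.Println(triage(95, nil, acceptAt, rejectBelow))
	fmt.Println(triage(30, nil, acceptAt, rejectBelow))
}
```

Routing warning-bearing results to review even when the score is high is a conservative choice. A workflow that trusts OCR output could relax it.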

If you want a deeper explanation of how confidence is calculated and how to interpret confidence_details, see CONFIDENCE.md.

Documentation

Constants

const (
	MethodText  = types.MethodText
	MethodMuPDF = types.MethodMuPDF
	MethodOCR   = types.MethodOCR
)

Variables

This section is empty.

Functions

This section is empty.

Types

type ConfidenceDetails

type ConfidenceDetails = types.ConfidenceDetails

type Config

type Config struct {
	FuzzyKeys      bool
	IncludeRawText bool
	Explain        bool
	OCR            OCRMode
	Schema         Schema
	SchemaOverlay  *SchemaOverlay
}

func DefaultConfig

func DefaultConfig() Config

type CorruptionMetadata

type CorruptionMetadata = types.CorruptionMetadata

type Explanation

type Explanation = types.Explanation

type ExtractionMetadata

type ExtractionMetadata = types.ExtractionMetadata

type FallbackExplanation

type FallbackExplanation = types.FallbackExplanation

type FallbackMetadata

type FallbackMetadata = types.FallbackMetadata

type FieldExplanation

type FieldExplanation = types.FieldExplanation

type Method

type Method = types.Method

type OCRMetadata

type OCRMetadata = types.OCRMetadata

type OCRMode

type OCRMode string

const (
	OCRAuto     OCRMode = "auto"
	OCRForce    OCRMode = "force"
	OCRDisabled OCRMode = "disabled"
)

type Option

type Option func(*Config)

func DisableOCR

func DisableOCR() Option

func ForceOCR

func ForceOCR() Option

func WithConfig

func WithConfig(cfg Config) Option

func WithExplain

func WithExplain(enabled bool) Option

func WithFuzzyKeys

func WithFuzzyKeys(enabled bool) Option

func WithOCR

func WithOCR(mode OCRMode) Option

func WithRawText

func WithRawText(enabled bool) Option

func WithSchema added in v1.0.1

func WithSchema(def Schema) Option

func WithSchemaOverlay added in v1.0.1

func WithSchemaOverlay(overlay SchemaOverlay) Option

type Result

type Result = types.Result

func Extract

func Extract(src Source, opts ...Option) (*Result, error)

func ExtractBytes added in v1.0.1

func ExtractBytes(name string, data []byte, opts ...Option) (*Result, error)

func ExtractFile added in v1.0.1

func ExtractFile(path string, opts ...Option) (*Result, error)

func ExtractReader added in v1.0.1

func ExtractReader(name string, r io.Reader, opts ...Option) (*Result, error)

func ExtractText added in v1.0.1

func ExtractText(text string, opts ...Option) (*Result, error)

func ExtractWithConfig

func ExtractWithConfig(src Source, cfg Config) (*Result, error)

type RuntimeCapabilities

type RuntimeCapabilities = capabilities.Report

func Capabilities

func Capabilities() RuntimeCapabilities

type Schema added in v1.0.1

type Schema = schema.Definition

func DefaultSchema added in v1.0.1

func DefaultSchema() Schema

type SchemaField added in v1.0.1

type SchemaField = schema.Field

type SchemaOverlay added in v1.0.1

type SchemaOverlay = schema.Overlay

func LoadSchemaOverlayFile added in v1.0.1

func LoadSchemaOverlayFile(path string) (SchemaOverlay, error)

type SchemaSection added in v1.0.1

type SchemaSection = schema.Section

const (
	SectionPatient     SchemaSection = schema.SectionPatient
	SectionExamination SchemaSection = schema.SectionExamination
	SectionHistory     SchemaSection = schema.SectionHistory
	SectionMetrics     SchemaSection = schema.SectionMetrics
	SectionSocial      SchemaSection = schema.SectionSocial
	SectionMedication  SchemaSection = schema.SectionMedication
)

type Source

type Source struct {
	// contains filtered or unexported fields
}

func Bytes

func Bytes(name string, data []byte) Source

func File

func File(path string) Source

func Reader

func Reader(name string, r io.Reader) Source

func Text

func Text(text string) Source

type ToolCapabilities

type ToolCapabilities = capabilities.Tool

Directories

  • cmd/refrax (command)
  • internal/ocr
  • pkg
