Refrax
Refrax is a Go library and CLI for extracting structured JSON from Healthlink-style clinical referral documents.
It is designed for technical teams, clinical informatics teams, and medical operations teams that need a reproducible way to turn referral text or PDFs into machine-readable data.
Refrax supports:
- plain-text referral documents
- PDF referral documents
- automatic OCR fallback for image-only or weak-text PDFs
- JSON output with confidence details and extraction warnings
Refrax is available as both a CLI and a Go package. The CLI is the quickest way to run it locally or in scripts, and the Go API is intended for embedding the same extraction flow into backend services.
Scope
Refrax supports Healthlink-style referral documents. It is not:
- an official Healthlink product
- a Healthlink network integration
- a clinical decision, triage, or diagnostic system
- a workflow product for scheduling, messaging, or patient management
The project is focused on document parsing, normalization, detection, and structured extraction.
Installation
Install the CLI:
go install github.com/0mjs/refrax/cmd/refrax@latest
Validate the build from a checkout of the repository:
go test ./...
If the installed binary is not on your PATH, run it with:
$(go env GOPATH)/bin/refrax schema
Project release and governance files:
OCR Setup
Refrax has two operating modes:
- base mode: text extraction and direct PDF text extraction
- OCR-capable mode: base mode plus OCR fallback and forced OCR
On macOS, install OCR dependencies with Homebrew:
brew install tesseract
brew install imagemagick
tesseract is required for OCR fallback and --ocr.
ImageMagick is optional. Refrax uses it only for OCR preprocessing and OCR fixture generation when available.
After installing dependencies, verify the local runtime with:
refrax doctor
Runtime Dependencies
Text extraction from PDFs is handled through MuPDF via go-fitz.
OCR fallback requires tesseract.
Optional image preprocessing for OCR uses ImageMagick.
If tesseract is not available, Refrax will still process text and directly extractable PDFs, but forced OCR and OCR fallback will not run.
Use refrax doctor to inspect whether OCR and OCR preprocessing are available on the current machine.
Refrax performs extraction locally. It does not send document contents to a remote Refrax service.
Quick Start
Print the default schema:
refrax schema
Inspect local OCR capabilities:
refrax doctor
Detect whether a document looks like a supported Healthlink-style referral:
refrax detect ./testdata/samples/referral.txt
Extract structured JSON from text:
refrax extract ./testdata/samples/referral.txt --pretty
Extract with an additive schema overlay:
refrax extract ./testdata/samples/referral.txt --pretty --schema-overlay ./overlay.json
Extract structured JSON from a PDF:
refrax extract ./referral.pdf --pretty
Force OCR for a PDF:
refrax extract ./referral.pdf --pretty --ocr
Include extraction explanations:
refrax extract ./referral.pdf --pretty --explain
Include the normalized raw text used for extraction:
refrax extract ./referral.pdf --pretty --raw
Read text from standard input:
cat ./testdata/samples/referral.txt | refrax extract - --pretty
For text inputs, Refrax extracts directly from the supplied text.
For PDF inputs, Refrax:
- attempts direct text extraction with MuPDF
- evaluates text quality, structure, confidence, and corruption indicators
- falls back to OCR when the MuPDF result looks weak or corrupted
You can override that decision with --ocr to force OCR on a PDF.
The result always follows the same JSON envelope, regardless of whether the source was text, MuPDF, or OCR.
CLI Reference
refrax schema
Prints the default schema as JSON.
refrax detect <file|->
Runs document detection and returns:
- document type
- confidence
- matched signals
refrax doctor
Reports runtime OCR capabilities, including whether Refrax can find:
- tesseract
- ImageMagick
- convert
This is the quickest way to confirm whether automatic OCR fallback is available on the current machine.
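If a service needs a similar check without shelling out to refrax doctor, a rough equivalent can be sketched with the standard library. This is only a sketch, not how Refrax implements the command: the probe helper is hypothetical, and it checks nothing more than whether the binary names listed above resolve on PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// probe reports whether each named binary resolves on the current PATH.
// It is a hypothetical helper, not part of the Refrax API.
func probe(names ...string) map[string]bool {
	found := make(map[string]bool, len(names))
	for _, name := range names {
		_, err := exec.LookPath(name)
		found[name] = err == nil
	}
	return found
}

func main() {
	// The binary names are the ones listed for refrax doctor above.
	for name, ok := range probe("tesseract", "convert") {
		fmt.Printf("%-10s available=%t\n", name, ok)
	}
}
```

A binary that resolves on PATH may still fail at runtime, for example when language data is missing, so refrax doctor remains the authoritative check.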
refrax extract <file|->
Extracts structured JSON from text or PDF input.
Flags:
- --pretty: pretty-print JSON
- --raw: include raw_text
- --fuzzy: enable fuzzy field matching for noisy OCR text
- --ocr: force OCR for PDF inputs
- --explain: include field-level extraction and fallback explanations
- --schema-overlay <file.json>: apply an additive schema overlay before extraction
refrax schema [--schema-overlay <file.json>]
Prints the default schema, or the merged effective schema when an overlay file is supplied.
Schema Overlays
Refrax supports additive schema overlays.
Overlays are intended for cases where your documents contain:
- extra labels for built-in fields
- additional labeled fields that should be extracted
- custom multiline fields
Overlays do not remove built-in fields from the default schema.
An overlay JSON file looks like this:
{
  "name": "custom_referral",
  "version": "v2",
  "sections": {
    "history": [
      {
        "key": "reason_for_referral",
        "labels": ["Referral Reason"]
      },
      {
        "key": "preferred_consultant",
        "labels": ["Preferred Consultant"]
      },
      {
        "key": "additional_notes",
        "labels": ["Additional Notes"],
        "multiline": true
      }
    ]
  }
}
Common CLI usage:
refrax schema --schema-overlay ./overlay.json
refrax extract ./referral.pdf --pretty --schema-overlay ./overlay.json
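Before handing an overlay file to Refrax, a team may want a quick sanity check in CI. The sketch below decodes the JSON shape shown above and flags entries that look incomplete. The types and the lintOverlay function are hypothetical, and the rules it applies (a name is set, every field has a key and at least one label) are assumptions about what a useful overlay contains, not Refrax's own validation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// overlayField and overlay mirror the overlay JSON shape shown above.
// They are stand-ins for illustration, not Refrax types.
type overlayField struct {
	Key       string   `json:"key"`
	Labels    []string `json:"labels"`
	Multiline bool     `json:"multiline,omitempty"`
}

type overlay struct {
	Name     string                    `json:"name"`
	Version  string                    `json:"version"`
	Sections map[string][]overlayField `json:"sections"`
}

// lintOverlay decodes raw overlay JSON and returns a list of problems.
// An empty list means the assumed rules passed.
func lintOverlay(raw []byte) ([]string, error) {
	var o overlay
	if err := json.Unmarshal(raw, &o); err != nil {
		return nil, fmt.Errorf("decode overlay: %w", err)
	}
	var problems []string
	if o.Name == "" {
		problems = append(problems, "missing name")
	}
	for section, fields := range o.Sections {
		for i, f := range fields {
			if f.Key == "" {
				problems = append(problems, fmt.Sprintf("%s[%d]: missing key", section, i))
			}
			if len(f.Labels) == 0 {
				problems = append(problems, fmt.Sprintf("%s[%d]: no labels", section, i))
			}
		}
	}
	return problems, nil
}

func main() {
	raw := []byte(`{"name":"custom_referral","version":"v2","sections":{"history":[
		{"key":"additional_notes","labels":["Additional Notes"],"multiline":true}]}}`)
	problems, err := lintOverlay(raw)
	fmt.Println(problems, err)
}
```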
Go Integration
The recommended application-facing API is the top-level refrax package.
Install And Import
Add Refrax to your Go module:
go get github.com/0mjs/refrax
Import it in your service code:
import "github.com/0mjs/refrax"
The main entry points are:
refrax.ExtractFile(path, opts...)
refrax.ExtractBytes(name, data, opts...)
refrax.ExtractText(text, opts...)
refrax.ExtractReader(name, r, opts...)
The simplest file-based example is:
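package main

import (
	"fmt"

	"github.com/0mjs/refrax"
)

func main() {
	// Extract with default options; the input type is inferred from the file.
	result, err := refrax.ExtractFile("referral.pdf")
	if err != nil {
		panic(err)
	}

	// Method reports which path produced the text; Confidence is the overall score.
	fmt.Println(result.Method)
	fmt.Println(result.Confidence)
}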
For in-memory PDF bytes:
result, err := refrax.ExtractBytes("referral.pdf", pdfBytes)
For plain text:
result, err := refrax.ExtractText(text)
The same option helpers work with all four entry points:
result, err := refrax.ExtractReader(
	filename,
	file,
	refrax.WithFuzzyKeys(true),
	refrax.WithRawText(false),
	refrax.WithExplain(true),
	refrax.WithOCR(refrax.OCRAuto),
)
Available config controls:
- refrax.WithFuzzyKeys(true) to tolerate noisier OCR labels
- refrax.WithRawText(true) to return raw_text in the result
- refrax.WithExplain(true) to include field-level and fallback explanations
- refrax.WithOCR(refrax.OCRAuto) to allow MuPDF first, then OCR fallback
- refrax.WithOCR(refrax.OCRForce) to skip MuPDF and force OCR for PDFs
- refrax.WithOCR(refrax.OCRDisabled) to forbid OCR and rely on text or MuPDF only
- refrax.WithSchemaOverlay(...) to add labels or fields on top of the built-in schema
For applications that prefer explicit config objects:
cfg := refrax.Config{
	FuzzyKeys:      true,
	IncludeRawText: false,
	Explain:        true,
	OCR:            refrax.OCRAuto,
}
result, err := refrax.ExtractWithConfig(refrax.File("referral.pdf"), cfg)
Applying A Schema Overlay
In Go, you can define the overlay inline:
overlay := refrax.SchemaOverlay{
	Name:    "custom_referral",
	Version: "v2",
	Sections: map[refrax.SchemaSection][]refrax.SchemaField{
		refrax.SectionHistory: {
			{Key: "reason_for_referral", Labels: []string{"Referral Reason"}},
			{Key: "preferred_consultant", Labels: []string{"Preferred Consultant"}},
			{Key: "additional_notes", Labels: []string{"Additional Notes"}, Multiline: true},
		},
	},
}
result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)
Or load it from JSON:
overlay, err := refrax.LoadSchemaOverlayFile("overlay.json")
if err != nil {
	panic(err)
}
result, err := refrax.ExtractFile(
	"referral.pdf",
	refrax.WithSchemaOverlay(overlay),
)
Choosing OCR Behavior
For PDF inputs, Refrax normally:
- extracts text with MuPDF
- scores the extracted text
- falls back to OCR if the direct extraction looks too weak or corrupted
Use these OCR modes depending on your API contract:
- refrax.OCRAuto for the default production path
- refrax.OCRForce when a caller explicitly requests OCR
- refrax.OCRDisabled when you want deterministic non-OCR behavior
You can inspect runtime support in process:
caps := refrax.Capabilities()
if !caps.OCRAvailable {
	fmt.Println("tesseract is not installed on this machine")
}
Using Refrax In A net/http API
This pattern works for a typical upload endpoint that accepts a single file and returns extraction JSON:
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/0mjs/refrax"
)

func main() {
	http.HandleFunc("/extract", extractHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func extractHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}

	// Cap the request body at 20 MiB before parsing the multipart form.
	r.Body = http.MaxBytesReader(w, r.Body, 20<<20)
	if err := r.ParseMultipartForm(20 << 20); err != nil {
		http.Error(w, "invalid multipart form", http.StatusBadRequest)
		return
	}

	// The upload is expected in the "file" form field.
	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, "missing file", http.StatusBadRequest)
		return
	}
	defer file.Close()

	result, err := refrax.ExtractReader(
		header.Filename,
		file,
		refrax.WithFuzzyKeys(true),
		refrax.WithOCR(refrax.OCRAuto),
	)
	if err != nil {
		http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		// The status line is already sent, so the failure can only be logged.
		log.Printf("encode extraction result: %v", err)
	}
}
In most APIs, the useful production checks are:
- reject oversized uploads before extraction
- log method, confidence, and warnings
- decide what to do with low-confidence results before downstream persistence
- expose whether OCR was used, especially if extraction latency matters
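The low-confidence check in the list above can be expressed as a small routing function. This is a minimal sketch: the decision type, the route function, and the threshold values used in main are hypothetical, and the thresholds assume the 0 to 100 confidence scale suggested by the sample result in Result Shape. Appropriate values depend on your document set and on the guidance in CONFIDENCE.md.

```go
package main

import "fmt"

// decision is a hypothetical outcome for a downstream workflow.
type decision string

const (
	accept decision = "accept"
	review decision = "review"
	reject decision = "reject"
)

// route decides what to do with a result before persisting it.
// Results below rejectBelow are rejected; results below acceptAt, or
// carrying any warning, are sent for manual review.
func route(confidence float64, warnings []string, acceptAt, rejectBelow float64) decision {
	switch {
	case confidence < rejectBelow:
		return reject
	case confidence < acceptAt || len(warnings) > 0:
		return review
	default:
		return accept
	}
}

func main() {
	// Placeholder thresholds, chosen only for illustration.
	const acceptAt, rejectBelow = 90.0, 50.0
	fmt.Println(route(97.42, []string{"used_ocr_fallback"}, acceptAt, rejectBelow))
	fmt.Println(route(95.0, nil, acceptAt, rejectBelow))
}
```

Treating every warning as a reason for review is deliberately conservative; a real policy might allow specific warnings through.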
Returning Stable API Responses
Many applications should not forward the raw Refrax result unchanged. A common pattern is to wrap it in an application response with your own request identifiers and policy fields.
type ExtractResponse struct {
	RequestID    string                    `json:"request_id"`
	DocumentType string                    `json:"document_type"`
	Method       refrax.Method             `json:"method"`
	Confidence   float64                   `json:"confidence"`
	Warnings     []string                  `json:"warnings,omitempty"`
	Data         map[string]map[string]any `json:"data"`
}
That gives you space to add workflow-specific fields later without changing Refrax itself.
Source Constructors
Refrax still exposes source constructors for cases where a single Source value is useful:
refrax.File(path)
refrax.Bytes(name, data)
refrax.Text(text)
refrax.Reader(name, r)
Those can still be used with:
result, err := refrax.Extract(refrax.File("referral.pdf"))
But for most applications, the ExtractFile / ExtractBytes / ExtractText / ExtractReader helpers are the cleaner entry points.
Result Shape
Extraction returns a JSON object with this top-level structure:
{
  "document_type": "healthlink_style_referral",
  "schema": "healthlink_style_referral:v1",
  "method": "ocr",
  "confidence": 97.42,
  "confidence_details": {
    "detector_contribution": 13.6,
    "text_quality": 47.82,
    "text_volume": 6,
    "field_coverage": 20,
    "medical_context": 10,
    "penalties": 0
  },
  "metadata": {
    "page_count": 1,
    "corruption": {
      "severity": "moderately_corrupted",
      "indicators": [
        "garbled_text_layer",
        "sparse_text_layer"
      ],
      "pages_with_text": 1
    },
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume"
    },
    "ocr": {
      "selected_psm": "6",
      "preprocessing_applied": true
    }
  },
  "explanation": {
    "fallback": {
      "attempted": true,
      "used": true,
      "from": "mupdf",
      "reason": "corrupted_low_text_volume",
      "summary": "MuPDF text looked corrupted and too sparse, so OCR was used instead."
    },
    "fields": [
      {
        "section": "patient",
        "field": "age",
        "matched_by": "label",
        "label": "Age",
        "raw_value": "33 years",
        "normalized_value": "33"
      }
    ]
  },
  "data": {
    "patient": {
      "age": "33",
      "gender": "Female"
    }
  },
  "warnings": [
    "used_ocr_fallback"
  ]
}
Important fields:
- document_type: current detector verdict
- schema: schema name and version
- method: text, mupdf, or ocr
- confidence: overall extraction confidence
- confidence_details: component-level scoring signals. See CONFIDENCE.md for the scoring formula, ranges, and interpretation guidance.
- metadata.page_count: number of PDF pages processed when the source is a PDF
- metadata.corruption: MuPDF-side corruption severity, indicators, and page extraction stats
- metadata.fallback: whether OCR fallback was considered, attempted, or used
- metadata.ocr: selected Tesseract PSM and whether preprocessing was applied
- explanation.fallback: human-readable reason for fallback decisions when --explain or WithExplain(true) is used
- explanation.fields: field-by-field extraction reasons, matched labels, and normalized values
- data: structured extracted sections and fields
- warnings: extraction quality or fallback warnings
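Consumers that read the CLI output from another process can decode just the fields they need. The sketch below is a partial, hypothetical envelope type that covers a subset of the fields listed above using encoding/json. It is not the result type exported by the refrax package, and fields it does not declare are simply ignored during decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope declares a subset of the documented result fields.
// It is a stand-in for illustration, not the refrax result type.
type envelope struct {
	DocumentType string  `json:"document_type"`
	Schema       string  `json:"schema"`
	Method       string  `json:"method"`
	Confidence   float64 `json:"confidence"`
	Metadata     struct {
		PageCount int `json:"page_count"`
		Fallback  struct {
			Attempted bool   `json:"attempted"`
			Used      bool   `json:"used"`
			Reason    string `json:"reason"`
		} `json:"fallback"`
	} `json:"metadata"`
	Data     map[string]map[string]any `json:"data"`
	Warnings []string                  `json:"warnings"`
}

// decodeEnvelope parses extraction JSON into the partial envelope.
func decodeEnvelope(raw []byte) (envelope, error) {
	var e envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return envelope{}, fmt.Errorf("decode result: %w", err)
	}
	return e, nil
}

func main() {
	raw := []byte(`{"document_type":"healthlink_style_referral","method":"ocr",
		"confidence":97.42,"metadata":{"page_count":1,"fallback":{"attempted":true,"used":true}},
		"data":{"patient":{"age":"33"}},"warnings":["used_ocr_fallback"]}`)
	e, err := decodeEnvelope(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Method, e.Confidence, e.Metadata.Fallback.Used, e.Data["patient"]["age"])
}
```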
Benchmarks
Refrax includes benchmark coverage for:
- text extraction
- direct PDF text extraction via MuPDF
- OCR fallback extraction for image-only PDFs
Run the extraction benchmarks with:
go test ./pkg/extract -run '^$' -bench . -benchmem
The OCR fallback benchmark is skipped automatically when tesseract is not installed.
OCR Fixture Generation
Refrax includes a helper for generating image-only OCR test fixtures from local PDFs.
Create an OCR-targeted PDF:
refrax fixture ocrify ./referral.pdf ./referral-ocr.pdf
Create a degraded OCR-targeted PDF:
refrax fixture ocrify ./referral.pdf ./referral-ocr-degraded.pdf --degrade
Optional flags:
This command is intended for test consistency and OCR verification. It is not part of normal production extraction.
Do not commit or share generated outputs derived from unsanitized clinical documents. Prefer synthetic or scrubbed inputs that comply with DATA_POLICY.md.
Notes For Clinical and Technical Review
Refrax extracts and normalizes document content. It does not validate clinical appropriateness, clinical priority, or referral suitability.
Before production deployment, review:
- expected field coverage for your document set
- confidence thresholds appropriate for your workflow
- handling of low-confidence or warning-bearing results
- data governance requirements for local storage, logs, and onward transmission
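One way to review expected field coverage is to run a sample set through extraction and measure how many of the fields you rely on come back populated. The sketch below works on the data section of a result. The coverage function and the "section.field" path notation are hypothetical conventions for this example, and referrer.name is an invented field name used only to show a miss.

```go
package main

import (
	"fmt"
	"strings"
)

// coverage reports the fraction of expected "section.field" paths that hold
// a non-empty value in data, plus the paths that are missing or empty.
func coverage(data map[string]map[string]any, expected []string) (float64, []string) {
	if len(expected) == 0 {
		return 1, nil
	}
	var missing []string
	for _, path := range expected {
		section, field, ok := strings.Cut(path, ".")
		value, present := data[section][field]
		if !ok || !present || value == nil || value == "" {
			missing = append(missing, path)
		}
	}
	found := len(expected) - len(missing)
	return float64(found) / float64(len(expected)), missing
}

func main() {
	data := map[string]map[string]any{
		"patient": {"age": "33", "gender": "Female"},
	}
	// referrer.name is a made-up path for illustration.
	frac, missing := coverage(data, []string{"patient.age", "patient.gender", "referrer.name"})
	fmt.Printf("coverage=%.2f missing=%v\n", frac, missing)
}
```

Aggregating the missing paths across a representative sample shows which labels may need a schema overlay or fuzzy matching.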
If you want a deeper explanation of how confidence is calculated and how to interpret confidence_details, see CONFIDENCE.md.