text-transform

command module
v0.0.0-...-0b1d096
Published: Mar 14, 2026 License: MIT Imports: 12 Imported by: 0

README

text-transform / tt

Convert text formats without fiddling with flags.

text-transform [input] [output]

Both formats are detected from the file extensions. The input is first converted to markdown as an intermediate step, and that markdown is then rendered into the target format.
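The two-stage design can be pictured with a minimal sketch. This is not the tool's actual code: `formatOf`, `readers`, `writers`, and `convert` are hypothetical names, and the converters are toy stand-ins. It only illustrates how routing everything through markdown means each format needs one reader or one writer rather than a converter for every input/output pair.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatOf returns the lowercase file extension without the leading dot.
func formatOf(path string) string {
	return strings.ToLower(strings.TrimPrefix(filepath.Ext(path), "."))
}

// Hypothetical converter tables, not the tool's real ones.
// Readers turn source text into markdown; writers render markdown.
var readers = map[string]func(string) string{
	"md":  func(src string) string { return src },
	"txt": func(src string) string { return src },
}

var writers = map[string]func(string) string{
	"md": func(md string) string { return md },
	// Toy renderer: it only drops bold markers.
	"txt": func(md string) string { return strings.ReplaceAll(md, "**", "") },
}

// convert routes src through markdown, picking both stages by extension.
func convert(src, inPath, outPath string) (string, error) {
	read, ok := readers[formatOf(inPath)]
	if !ok {
		return "", fmt.Errorf("unsupported input format %q", formatOf(inPath))
	}
	write, ok := writers[formatOf(outPath)]
	if !ok {
		return "", fmt.Errorf("unsupported output format %q", formatOf(outPath))
	}
	return write(read(src)), nil
}

func main() {
	out, err := convert("**bold** text", "notes.md", "notes.txt")
	fmt.Println(out, err)
}
```

Adding a new output format in this sketch means adding one entry to `writers`; every existing input format can then reach it.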

Examples

text-transform article.html article.epub
text-transform document.pdf notes.md
text-transform page.html page.txt
text-transform https://mitchellh.com/writing/as-code as-code.md

Install

go install github.com/FalkZ/text-transform@latest

This installs the binary as text-transform.

tt Alias

To use tt instead of text-transform, add an alias to your shell configuration:

Bash (~/.bashrc):

alias tt=text-transform

Zsh (~/.zshrc):

alias tt=text-transform

Fish (~/.config/fish/config.fish):

alias tt text-transform

Supported Formats

Input: pdf, html, md, txt
Output: epub, html, pdf, md, txt
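The formats listed above are asymmetric: epub appears only as an output. A small sketch of a pre-flight check makes that explicit. The names `inputFormats`, `outputFormats`, and `checkConversion` are assumptions for illustration, not part of the tool, and URL inputs (as in the examples) are not covered here.

```go
package main

import "fmt"

// Format sets copied from the supported-formats list in this README.
var inputFormats = map[string]bool{"pdf": true, "html": true, "md": true, "txt": true}
var outputFormats = map[string]bool{"epub": true, "html": true, "pdf": true, "md": true, "txt": true}

// checkConversion reports whether converting from in to out is listed as supported.
func checkConversion(in, out string) error {
	if !inputFormats[in] {
		return fmt.Errorf("unsupported input format %q", in)
	}
	if !outputFormats[out] {
		return fmt.Errorf("unsupported output format %q", out)
	}
	return nil
}

func main() {
	fmt.Println(checkConversion("html", "epub")) // listed in both columns
	fmt.Println(checkConversion("epub", "md"))   // epub is output-only
}
```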

Images

Images are preserved as file paths throughout the pipeline. When the target format is markdown and the input contains images, tt writes a directory instead of a single file:

tt page.html out.md
# produces out/out.md + out/image1.png + ...
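The layout in the example above can be derived from the output path alone. The following is a sketch under the assumption that the directory is named after the output file's stem, as the example suggests; `markdownBundlePaths` is a hypothetical helper, not a function from the tool.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// markdownBundlePaths maps a requested output path such as "out.md" to the
// directory that holds the result and the markdown file inside it.
// Assumption: the directory is named after the file stem.
func markdownBundlePaths(outPath string) (dir, mdFile string) {
	base := filepath.Base(outPath)
	stem := strings.TrimSuffix(base, filepath.Ext(base))
	dir = filepath.Join(filepath.Dir(outPath), stem)
	return dir, filepath.Join(dir, base)
}

func main() {
	dir, md := markdownBundlePaths("out.md")
	fmt.Println(dir, md)
}
```

Images would then sit next to the markdown file in the same directory, so relative image paths inside the markdown keep working if the directory is moved as a whole.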

PDFs are a mess

Sadly, PDFs rarely carry any internal structure for their content. This makes it extremely hard to get well-formatted output.

Luckily, I found that converting to markdown and then letting an agent like Claude do the cleanup helps massively.

Here is the cleanup prompt I use:

You are given an OCR-created markdown file: $ARGUMENTS

It contains all kinds of artifacts that need to be cleaned up.

Create tasks for each point and fix them one by one.

- table of contents should be converted into simple unordered lists
- detect and remove the following:
  - running headers
  - copyright and other footer text
  - numbers with no context (page numbers)
  - weird formatting that does not conform to markdown
  - any kind of symbols that relate to formatting and are not part of markdown like: `.......` or `*******`
  - out of place spacing like: `(something )` & `test , other`
- Remove duplicate sections
- Fix these common OCR errors:
  - Rejoin words split across line breaks with hyphens (e.g. "cor-\nrect" → "correct")
  - Fix common OCR misreads using context (e.g. 'rn' misread as 'm', 'l' as '1', 'O' as '0')
  - Correct obviously broken compound words that were split by a page break
- Fix obvious escaping issues like `Author*innen` => `Author\*innen`, `Item #1` => `Item \#1`
- Make sure headings follow a uniform structure
  - Double check with table of contents if needed
- Image captions, if present, should be italic and in a uniform structure
- Characters like '•' or similar indicate lists; create unordered lists from these items
- `1. some text 2. some text` indicates an ordered list; convert it

**Critical:** Only fix clear errors, never rephrase or alter content.

**Critical:** Never remove images, they are an integral part of the document.

Do not create scripts to do your job but rather go through the document in chunks and reason about specific changes.

At the end, think about other improvements and make suggestions on what else to fix.
