text-transform

command module

v0.0.0-...-0b1d096, published Mar 14, 2026, MIT license

Repository

github.com/FalkZ/text-transform

text-transform / tt

Convert text formats without fiddling with flags.

text-transform [input] [output]

The format is detected from the file extension. Input goes through markdown as an intermediate step, then renders into the target format.
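As a rough illustration of extension-based detection, here is a minimal sketch. It is not taken from the tool's source: `detectFormat` is a hypothetical helper, and treating `http`/`https` inputs by looking only at the URL path is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"path/filepath"
	"strings"
)

// detectFormat is a hypothetical helper (not from the text-transform source).
// It returns the lowercase extension of a file path or URL without the
// leading dot, or "" when the name has no extension.
func detectFormat(target string) string {
	// Assumption: for URLs, only the path part is inspected, so query
	// strings and fragments do not leak into the extension.
	if u, err := url.Parse(target); err == nil && (u.Scheme == "http" || u.Scheme == "https") {
		target = u.Path
	}
	ext := filepath.Ext(target)
	return strings.ToLower(strings.TrimPrefix(ext, "."))
}

func main() {
	for _, name := range []string{"article.html", "notes.MD", "https://example.com/page.html?x=1"} {
		fmt.Printf("%s -> %q\n", name, detectFormat(name))
	}
}
```

A URL without an extension, such as the `as-code` example below, yields an empty string here; how the real tool resolves that case is not stated in this README.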

Examples

text-transform article.html article.epub
text-transform document.pdf notes.md
text-transform page.html page.txt
text-transform https://mitchellh.com/writing/as-code as-code.md

Install

go install github.com/FalkZ/text-transform@latest

This installs the binary as text-transform.

`tt` Alias

To use tt instead of text-transform, add an alias to your shell configuration:

Bash (~/.bashrc):

alias tt=text-transform

Zsh (~/.zshrc):

alias tt=text-transform

Fish (~/.config/fish/config.fish):

alias tt text-transform

Supported Formats

| Input | Output |
| --- | --- |
| pdf, html, md, txt | epub, html, pdf, md, txt |
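The format lists above could be checked before a conversion starts. The following is only a sketch under that assumption: `checkConversion` and the two maps are hypothetical names, and the real tool may validate formats differently.

```go
package main

import "fmt"

// Format sets copied from the table above. The variable names are hypothetical.
var inputFormats = map[string]bool{"pdf": true, "html": true, "md": true, "txt": true}
var outputFormats = map[string]bool{"epub": true, "html": true, "pdf": true, "md": true, "txt": true}

// checkConversion reports an error when either side of the conversion
// is not listed as a supported format.
func checkConversion(in, out string) error {
	if !inputFormats[in] {
		return fmt.Errorf("unsupported input format %q", in)
	}
	if !outputFormats[out] {
		return fmt.Errorf("unsupported output format %q", out)
	}
	return nil
}

func main() {
	fmt.Println(checkConversion("pdf", "md"))  // supported pair
	fmt.Println(checkConversion("epub", "md")) // epub is output-only
}
```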

Images

Images are preserved as file paths throughout the pipeline. When the target is markdown and the input contains images, tt writes a directory instead of a single file:

tt page.html out.md
# produces out/out.md + out/image1.png + ...

PDFs are a mess

Sadly, PDFs carry little internal structure for their content, which makes it extremely hard to get well-formatted output.

Luckily, I found that converting to markdown and then letting an agent like Claude do the cleanup helps massively.

Here is the cleanup prompt I use:

You are given an OCR created markdown file: $ARGUMENTS

It contains all kinds of artifacts that need to be cleaned up.

Create tasks for each point and fix them one by one.

- table of contents should be converted into simple unordered lists
- detect and remove the following:
  - running headers
  - copyright and other footer text
  - numbers with no context (page numbers)
  - weird formatting that does not conform to markdown
  - any kind of symbols that relate to formatting and are not part of markdown like: `.......` or `*******`
  - out of place spacing like: `(something )` & `test , other`
- Deduplicate duplicate sections
- Fix these common OCR errors:
  - Rejoin words split across line breaks with hyphens (e.g. "cor-\nrect" → "correct")
  - Fix common OCR misreads using context (e.g. 'rn' misread as 'm', 'l' as '1', 'O' as '0')
  - Correct obviously broken compound words that were split by a page break
- Fix obvious escaping issues like `Author*innen` => `Author\*innen`, `Item #1` => `Item \#1`
- Make sure headings follow a uniform structure
  - Double check with table of contents if needed
- Image captions, if present, should be italic and in a uniform structure
- Characters like '•' or similar indicate lists, create unordered list from these items
- `1. some text 2. some text` indicate ordered lists, convert them

**Critical:** Only fix clear errors, never rephrase or alter content.

**Critical:** Never remove images, they are an integral part of the document.

Do not create scripts to do your job but rather go through the document in chunks and reason about specific changes.

At the end, think about other improvements and make suggestions on what else to fix.