FormatAsText produces a readable "Speaker: text" format suitable for LLM consumption.
Adjacent cues from the same speaker are merged into a single line.
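
The merging behavior described above could look roughly like the following. This is a minimal sketch, not the package's actual code: the `Cue` type, its fields, the lowercase `formatAsText` name, and the choice to join merged text with a single space are all assumptions made for illustration.

```go
package main

import "strings"

// Cue is a hypothetical cue type for this sketch; the real type may differ.
type Cue struct {
	Speaker string
	Text    string
}

// formatAsText renders one "Speaker: text" line per run of adjacent cues
// that share a speaker. Merged texts are joined with a space (an assumption).
func formatAsText(cues []Cue) string {
	var b strings.Builder
	prev := ""
	for i, c := range cues {
		if i > 0 && c.Speaker == prev {
			// Same speaker as the previous cue: extend the current line.
			b.WriteString(" ")
			b.WriteString(c.Text)
			continue
		}
		if i > 0 {
			b.WriteString("\n")
		}
		b.WriteString(c.Speaker)
		b.WriteString(": ")
		b.WriteString(c.Text)
		prev = c.Speaker
	}
	return b.String()
}
```

Under these assumptions, two consecutive cues from "Speaker 1" followed by one from "Speaker 2" yield two output lines.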
Parse extracts cues from WebVTT data.
It handles the WEBVTT header, timestamp lines, and <v Speaker N> voice tags.
Malformed cues are skipped, and Parse returns whatever cues it could parse.
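
A tolerant parser along these lines might be sketched as follows. The `Cue` type, the `parseVTT` and `splitVoice` names, and the string-typed timestamps are hypothetical, and the sketch covers only the cases named above (header, timestamp lines, voice tags), not the full WebVTT grammar.

```go
package main

import (
	"bufio"
	"strings"
)

// Cue is a hypothetical cue type for this sketch; the real type may differ.
type Cue struct {
	Start, End string // raw timestamp strings, left unparsed here
	Speaker    string
	Text       string
}

// splitVoice strips a leading <v Name> tag and a trailing </v>, if present.
func splitVoice(line string) (speaker, text string) {
	if strings.HasPrefix(line, "<v ") {
		if end := strings.Index(line, ">"); end > 0 {
			speaker = strings.TrimSpace(line[3:end])
			line = line[end+1:]
		}
	}
	return speaker, strings.TrimSpace(strings.TrimSuffix(line, "</v>"))
}

// parseVTT collects cues and silently skips anything it cannot interpret.
func parseVTT(data string) []Cue {
	var cues []Cue
	var cur *Cue
	flush := func() {
		// Keep only cues that ended up with text; others count as malformed.
		if cur != nil && cur.Text != "" {
			cues = append(cues, *cur)
		}
		cur = nil
	}
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			flush() // a blank line ends the current cue
		case strings.Contains(line, "-->"):
			flush()
			start, rest, _ := strings.Cut(line, "-->")
			fields := strings.Fields(rest)
			if strings.TrimSpace(start) == "" || len(fields) == 0 {
				continue // malformed timestamp line: skip this cue
			}
			cur = &Cue{Start: strings.TrimSpace(start), End: fields[0]}
		case cur != nil:
			speaker, text := splitVoice(line)
			if speaker != "" {
				cur.Speaker = speaker
			}
			if cur.Text != "" && text != "" {
				cur.Text += " "
			}
			cur.Text += text
		}
		// Lines outside a cue (the WEBVTT header, identifiers) fall through.
	}
	flush()
	return cues
}
```

The design choice here is to drop bad input rather than return an error, matching the "returns what can be parsed" behavior: a stray line or a broken timestamp costs at most one cue.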