When the ASR system generates VTTs, they contain numerous instances where sentences are fragmented across lines. Often the last word of a sentence appears on a separate line with a very short duration (sometimes just milliseconds). The duration is so short that when published to the caption track, those one-line captions do not appear on the video at all. Even when they do appear, having the last word of a sentence shown on its own is disorienting to read. This happens in videos of every audio quality, from high production value to live lecture. I’d like to request a feature that addresses this sentence fragmentation in VTT files: specifically, a post-processing script or built-in functionality that can automatically merge these short, fragmented segments with the preceding segment.
Ideally, the script should let you define a minimum segment duration; segments shorter than the threshold (300 milliseconds, for example) would be candidates for merging. In tandem with that, the script should examine the text content of short segments: if a segment contains a single word with punctuation, it should be merged. When segments are merged, the end timestamp of the combined segment should become the end timestamp of the later segment. This script could be a toggle in the transcript editor that generates a new “version” of the transcript, or it could be on by default.
Here's an example of the end of a sentence breaking onto one line:
------
00:45:56.979 --> 00:46:00.969
So usually following major extinction events, you have increases of
00:46:00.969 --> 00:46:01.118
diversity.
------
And here’s how the script should make this segment look after merging:
------
00:45:56.979 --> 00:46:01.118
So usually following major extinction events, you have increases of diversity.
------
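To make the requested behavior concrete, here is a minimal sketch of the merge logic in Python. It assumes cues have already been parsed out of the VTT into simple (start, end, text) records; the names `Cue` and `merge_short_cues` are illustrative, not part of any existing API, and a real implementation would also need to parse and re-serialize the VTT file itself.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: str  # "HH:MM:SS.mmm"
    end: str
    text: str

def to_ms(ts: str) -> int:
    # Convert "HH:MM:SS.mmm" to integer milliseconds.
    h, m, s = ts.split(":")
    return (int(h) * 3600 + int(m) * 60) * 1000 + round(float(s) * 1000)

def merge_short_cues(cues: list[Cue], min_ms: int = 300) -> list[Cue]:
    """Merge cues shorter than min_ms that contain a single word
    into the preceding cue, extending its end timestamp."""
    merged: list[Cue] = []
    for cue in cues:
        duration = to_ms(cue.end) - to_ms(cue.start)
        single_word = len(cue.text.split()) == 1
        if merged and duration < min_ms and single_word:
            prev = merged[-1]
            # Combined cue keeps the earlier start and the later end.
            merged[-1] = Cue(prev.start, cue.end, prev.text + " " + cue.text)
        else:
            merged.append(cue)
    return merged
```

Run against the fragmented example above, this produces a single cue spanning 00:45:56.979 to 00:46:01.118 with the sentence intact.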