ADR-002: Batched Transcription Pipeline Architecture¶
Status: Proposed
Date: 2024-12-11
Decision Makers: WhisperJAV Development Team
Target Release: v1.8.0+
Context¶
WhisperJAV's current transcription pipeline processes scenes sequentially, where each scene goes through the full pipeline (speech segmentation → transcription → SRT) before the next scene begins.
Current Architecture (Sequential)¶
Scene 1: Load → Speech Segmentation → Transcribe Groups → Write SRT
Scene 2: Load → Speech Segmentation → Transcribe Groups → Write SRT
Scene 3: Load → Speech Segmentation → Transcribe Groups → Write SRT
...
Final: Stitch all scene SRTs together
Observations from Testing¶
When processing a 5-minute audio file with 3 scenes:
- Each scene triggers a separate speech segmentation pass
- The Whisper model remains loaded but processes one VAD group at a time
- Progress reporting shows scene-level updates, not segment-level
- Total segments not known until all scenes are processed
Key Question¶
"Would a batched architecture—where all scenes are segmented first, then all segments transcribed together—provide better performance, progress reporting, and resource utilization?"
Problem Statement¶
The sequential architecture has several limitations:
| Issue | Impact |
|---|---|
| Repeated I/O | Each scene loaded separately, potential cache misses |
| No global progress | Cannot show "X of Y total segments" until all scenes processed |
| VRAM pressure | VAD model and Whisper model both resident in VRAM throughout |
| Limited parallelization | Scene processing is inherently serial |
| Unpredictable ETA | Cannot estimate completion without knowing total work |
Proposed Solution: Two-Stage Batched Processing¶
Architecture Overview¶
Stage 1: Speech Segmentation (Batch All Scenes)
┌─────────────────────────────────────────────────────────────────┐
│ Scene 1 ─→ VAD ─→ [Group 1, Group 2] │
│ Scene 2 ─→ VAD ─→ [Group 1] │
│ Scene 3 ─→ VAD ─→ [Group 1, Group 2, Group 3] │
│ │
│ Output: Global VAD Group Registry (6 groups total) │
│ with provenance tracking (scene_idx, group_idx) │
└─────────────────────────────────────────────────────────────────┘
│
[Release VAD model from VRAM]
│
▼
Stage 2: Batched Transcription
┌─────────────────────────────────────────────────────────────────┐
│ Load Whisper model once │
│ │
│ for group in global_registry: │
│ transcribe(group.audio_data) │
│ track_provenance(group.scene_idx, group.group_idx) │
│ │
│ Output: Transcription results with scene/group mapping │
└─────────────────────────────────────────────────────────────────┘
│
▼
Stage 3: SRT Reconstruction
┌─────────────────────────────────────────────────────────────────┐
│ Group results by scene_idx │
│ Apply timestamp offsets per scene │
│ Write scene SRTs → Stitch → Final output │
└─────────────────────────────────────────────────────────────────┘
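The flow in the diagram can be sketched end to end with stand-ins. This is a toy illustration, not WhisperJAV code: `Group`, `fake_vad`, `fake_transcribe`, and `run_batched` are hypothetical names, the "audio" is a dummy list, and the per-scene group counts simply mirror the diagram (using 0-based scene indices).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Group:
    scene_idx: int
    group_idx: int
    audio: List[float]  # placeholder for real audio samples


def fake_vad(scene_idx: int, n_groups: int) -> List[Group]:
    """Stand-in for speech segmentation: returns n_groups dummy groups."""
    return [Group(scene_idx, g, [0.0] * 4) for g in range(n_groups)]


def fake_transcribe(group: Group) -> str:
    """Stand-in for Whisper: returns a label instead of real text."""
    return f"text s{group.scene_idx} g{group.group_idx}"


def run_batched(groups_per_scene: List[int]) -> Dict[int, List[str]]:
    # Stage 1: segment every scene before any transcription starts.
    registry: List[Group] = []
    for scene_idx, n_groups in enumerate(groups_per_scene):
        registry.extend(fake_vad(scene_idx, n_groups))

    # (The real pipeline would release the VAD model at this point.)

    # Stage 2: one pass over the global registry, keeping provenance.
    results: List[Tuple[int, int, str]] = [
        (g.scene_idx, g.group_idx, fake_transcribe(g)) for g in registry
    ]

    # Stage 3: regroup results by scene for SRT reconstruction.
    by_scene: Dict[int, List[str]] = {}
    for scene_idx, _group_idx, text in results:
        by_scene.setdefault(scene_idx, []).append(text)
    return by_scene


by_scene = run_batched([2, 1, 3])  # same group counts as the diagram
print(sum(len(texts) for texts in by_scene.values()))  # 6
```

Because the full registry exists before Stage 2 begins, the total amount of work (6 groups here) is known before the first transcription call.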
Data Structures¶
from dataclasses import dataclass, field
from typing import List

import numpy as np

# SpeechSegment is assumed to be defined elsewhere (for example in the
# speech segmentation module); it is referenced by name only here.

@dataclass
class VADGroupEntry:
    """Entry in the global VAD group registry."""
    scene_idx: int                   # Which scene this group belongs to
    group_idx: int                   # Group index within the scene
    audio_data: np.ndarray           # Audio samples for this group
    sample_rate: int                 # Audio sample rate
    start_sec: float                 # Start time within scene
    end_sec: float                   # End time within scene
    scene_offset_sec: float          # Scene start offset in original audio
    segments: List["SpeechSegment"]  # Individual speech segments

@dataclass
class VADGroupRegistry:
    """Global registry of all VAD groups across all scenes."""
    entries: List[VADGroupEntry] = field(default_factory=list)

    @property
    def total_groups(self) -> int: ...

    @property
    def total_segments(self) -> int: ...

    def groups_by_scene(self, scene_idx: int) -> List[VADGroupEntry]: ...
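One plausible way to fill in the stubbed registry members is sketched below. It is an assumption about the intended behavior, not the final implementation: `Entry` and `Registry` are trimmed-down stand-ins for `VADGroupEntry` and `VADGroupRegistry` (audio fields omitted so the example runs without NumPy), and segments are represented as plain `(start_sec, end_sec)` tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    """Trimmed-down stand-in for VADGroupEntry (audio fields omitted)."""
    scene_idx: int
    group_idx: int
    segments: List[Tuple[float, float]]  # (start_sec, end_sec) pairs


@dataclass
class Registry:
    """Stand-in for VADGroupRegistry with the stubbed members filled in."""
    entries: List[Entry] = field(default_factory=list)

    @property
    def total_groups(self) -> int:
        return len(self.entries)

    @property
    def total_segments(self) -> int:
        return sum(len(e.segments) for e in self.entries)

    def groups_by_scene(self, scene_idx: int) -> List[Entry]:
        # Sort by group_idx so callers receive groups in playback order.
        matches = [e for e in self.entries if e.scene_idx == scene_idx]
        return sorted(matches, key=lambda e: e.group_idx)


registry = Registry([
    Entry(0, 1, [(5.0, 6.0)]),
    Entry(0, 0, [(0.0, 1.5), (2.0, 3.0)]),
    Entry(1, 0, [(0.5, 2.5)]),
])
print(registry.total_groups, registry.total_segments)  # 3 4
```

A linear scan in `groups_by_scene` is adequate for the scene counts discussed in this ADR; a dict keyed by `scene_idx` would be the natural next step if lookups ever became a bottleneck.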
API Changes¶
# New method in BalancedPipeline
def process_batched(self, scene_paths: List[Tuple[Path, float, float, float]]) -> Dict:
    """Process all scenes using two-stage batched architecture."""
    # Stage 1: Batch speech segmentation
    registry = self._batch_segment_scenes(scene_paths)

    # Release VAD model
    self.asr.release_vad_model()

    # Stage 2: Batch transcription
    results = self.asr.transcribe_registry(registry)

    # Stage 3: Reconstruct SRTs
    return self._reconstruct_scene_srts(results, scene_paths)

# New method in FasterWhisperProASR
def transcribe_registry(self, registry: VADGroupRegistry) -> List[TranscriptionResult]:
    """Transcribe all groups in the registry."""
    ...
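The reconstruction step depends on shifting scene-relative timestamps back to absolute time. The sketch below illustrates that idea under simplifying assumptions: `Cue`, `srt_timestamp`, and `build_srt` are hypothetical names, and the per-scene write and stitch steps are collapsed into a single pass for brevity, which may differ from how `_reconstruct_scene_srts` ends up working.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cue:
    scene_idx: int
    start_sec: float  # relative to the start of the scene
    end_sec: float    # relative to the start of the scene
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: List[Cue], scene_offsets: Dict[int, float]) -> str:
    """Shift cues to absolute time, sort them, and render SRT text."""
    shifted = sorted(
        (c.start_sec + scene_offsets[c.scene_idx],
         c.end_sec + scene_offsets[c.scene_idx],
         c.text)
        for c in cues
    )
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(shifted, 1)
    ]
    return "\n".join(blocks)


# Cues arrive in arbitrary order; scene 1 starts 60 s into the original audio.
cues = [Cue(1, 0.5, 2.0, "second scene"), Cue(0, 1.0, 2.5, "first scene")]
srt_text = build_srt(cues, scene_offsets={0: 0.0, 1: 60.0})
print(srt_text)
```

Sorting after the offset is applied means the output order does not depend on the order in which groups were transcribed, which matters if Stage 2 ever reorders work.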
Alternatives Considered¶
Alternative 1: Keep Sequential, Add Caching¶
Approach: Cache VAD results, reuse if scene already processed.
Rejected because:
- Still sequential processing
- Caching adds complexity without addressing core issues
- Progress reporting still scene-bound
Alternative 2: Parallel Scene Processing¶
Approach: Process multiple scenes in parallel using multiprocessing.
Rejected because:
- VRAM constraints limit parallelism (Whisper models are large)
- Complexity of managing multiple GPU contexts
- Diminishing returns on consumer hardware
Alternative 3: Streaming Architecture¶
Approach: Stream audio through pipeline, process segments as they arrive.
Rejected because:
- Major rewrite of pipeline architecture
- Incompatible with scene detection paradigm
- Overkill for batch processing use case
Consequences¶
Positive¶
| Benefit | Description |
|---|---|
| Global progress | Know total segments upfront: "Transcribing 45/120 segments" |
| Better ETA | Accurate time estimates based on segments processed |
| Memory optimization | Release VAD model before loading Whisper |
| Reduced I/O | Load audio once per scene, keep in memory |
| Parallelization ready | Stage 1 could be parallelized in future |
| Predictable behavior | Same processing regardless of scene count |
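The global progress and ETA benefits follow from knowing the segment total after Stage 1. A minimal sketch, assuming a simple linear extrapolation and a hypothetical `ProgressTracker` class (neither is part of the existing progress code):

```python
from dataclasses import dataclass


@dataclass
class ProgressTracker:
    """Tracks segment-level progress once the total is known (after Stage 1)."""
    total_segments: int
    done_segments: int = 0
    elapsed_sec: float = 0.0

    def update(self, segments: int, seconds: float) -> None:
        """Record that `segments` more segments took `seconds` to transcribe."""
        self.done_segments += segments
        self.elapsed_sec += seconds

    def eta_sec(self) -> float:
        """Remaining segments multiplied by the average time per segment."""
        if self.done_segments == 0:
            return float("inf")  # nothing measured yet
        per_segment = self.elapsed_sec / self.done_segments
        return per_segment * (self.total_segments - self.done_segments)

    def label(self) -> str:
        return f"Transcribing {self.done_segments}/{self.total_segments} segments"


tracker = ProgressTracker(total_segments=120)
tracker.update(segments=45, seconds=90.0)
print(tracker.label())    # Transcribing 45/120 segments
print(tracker.eta_sec())  # 150.0
```

This treats every segment as equally expensive. Weighting by audio duration instead of segment count would likely track real transcription time more closely, since segment lengths vary.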
Negative¶
| Drawback | Mitigation |
|---|---|
| Higher peak memory | Limit registry size, process in chunks if needed |
| Delayed first output | Show progress during Stage 1 to maintain UX |
| Architecture change | Keep sequential as fallback, gradual migration |
| Testing complexity | Comprehensive test coverage for new data flow |
Risks¶
- Memory exhaustion on very long files (100+ scenes)
  - Mitigation: Chunk processing with configurable batch size
- Regression in edge cases (single scene, no speech detected)
  - Mitigation: Thorough test coverage, fallback to sequential
- Progress UI changes required for two-stage model
  - Mitigation: Abstract progress interface, update incrementally
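The chunking mitigation could look roughly like the following. This is a sketch under assumptions: `iter_chunks` and `batch_size` are illustrative names, and chunking by scene (not by group or by bytes of audio) is only one of several possible granularities.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


scene_ids = list(range(7))
chunks = list(iter_chunks(scene_ids, batch_size=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each chunk would run through segmentation and transcription before its audio is dropped, so the registry never holds more than `batch_size` scenes at once. The trade-off is that the global segment total is only known per chunk unless a cheap counting pass runs first.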
Implementation Plan¶
Phase 1: Foundation (v1.8.0)¶
- [ ] Implement VADGroupRegistry data structure
- [ ] Add batch_segment_scenes() to BalancedPipeline
- [ ] Add transcribe_registry() to FasterWhisperProASR
- [ ] Update progress reporting for two-stage model
- [ ] Add feature flag to enable/disable batched processing
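The feature flag and the sequential fallback can be combined in one dispatch function. The sketch below is illustrative only: `process_scenes`, the `use_batched` flag, and the two callables are hypothetical and stand in for whatever configuration mechanism the pipeline ends up using.

```python
from typing import Callable, List

SceneProcessor = Callable[[List[str]], List[str]]


def process_scenes(
    scenes: List[str],
    batched: SceneProcessor,
    sequential: SceneProcessor,
    use_batched: bool = False,
) -> List[str]:
    """Run the batched path when enabled; fall back to sequential on failure."""
    if use_batched and scenes:
        try:
            return batched(scenes)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            print(f"batched path failed ({exc!r}); falling back to sequential")
    return sequential(scenes)


def broken_batched(scenes: List[str]) -> List[str]:
    raise RuntimeError("simulated failure")


def sequential(scenes: List[str]) -> List[str]:
    return [f"{name}.srt" for name in scenes]


result = process_scenes(["scene_0001"], broken_batched, sequential, use_batched=True)
print(result)  # ['scene_0001.srt']
```

Defaulting the flag to off matches the phased rollout: batched processing stays opt-in until Phase 3 makes it the default.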
Phase 2: Optimization (v1.8.x)¶
- [ ] Implement chunked processing for large files
- [ ] Add memory profiling and limits
- [ ] Optimize audio data handling (memory views vs copies)
- [ ] Performance benchmarking vs sequential
Phase 3: Migration (v1.9.0)¶
- [ ] Make batched processing the default
- [ ] Deprecate sequential-only code paths
- [ ] Update documentation and user guides
Metrics for Success¶
| Metric | Target |
|---|---|
| Memory usage | < 20% increase in peak usage vs. sequential |
| Processing time | No regression vs. sequential (ideally 5-10% improvement) |
| Progress accuracy | ETA within 10% of actual for files > 1 minute |
| Test coverage | > 90% for new code paths |
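The memory, time, and ETA targets reduce to simple percentage checks against a sequential baseline. The helpers below are a sketch of that arithmetic; the function names are hypothetical and the benchmark numbers are made up purely for illustration.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


def eta_error_percent(predicted_sec: float, actual_sec: float) -> float:
    """Absolute ETA error as a percentage of the actual duration."""
    if actual_sec <= 0:
        raise ValueError("actual_sec must be positive")
    return abs(predicted_sec - actual_sec) / actual_sec * 100.0


# Hypothetical benchmark numbers, for illustration only.
memory_ok = percent_change(baseline=4000.0, candidate=4600.0) < 20.0   # +15% peak MB
time_ok = percent_change(baseline=200.0, candidate=188.0) <= 0.0       # -6% seconds
eta_ok = eta_error_percent(predicted_sec=171.0, actual_sec=188.0) <= 10.0
print(memory_ok, time_ok, eta_ok)  # True True True
```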
References¶
- Current implementation: whisperjav/pipelines/balanced_pipeline.py
- Speech segmentation: whisperjav/modules/speech_segmentation/
- ASR module: whisperjav/modules/faster_whisper_pro_asr.py
- Related: ADR-001 (YAML configuration architecture)
Appendix: Current Code Flow¶
BalancedPipeline.process() - Scene Loop (lines 278-356)¶
for idx, (scene_path, start_time_sec, _, _) in enumerate(scene_paths):
    scene_srt_path = scene_srts_dir / f"{scene_path.stem}.srt"
    # ... progress update ...
    self.asr.transcribe_to_srt(scene_path, scene_srt_path, task=self.asr_task)
    # ... collect results ...
FasterWhisperProASR.transcribe() - VAD Group Loop (lines 426-430)¶
all_segments = []
for i, vad_group in enumerate(vad_segments, 1):
    segments = self._transcribe_vad_group(audio_data, sample_rate, vad_group)
    all_segments.extend(segments)