Olli Sorjonen Expands simple-captioner For Rapid Batch Media Tagging

simple-captioner Version 1.0.2.1 now supports batch processing of images and videos through a standalone graphical interface. The updated tool automatically generates descriptive text files alongside media files located in user-selected directories.
Developed by Olli Sorjonen, who also created ComfyUI-Olm-SplineMask, simple-captioner integrates recent Qwen vision-language models to improve local media tagging accuracy. Professionals training custom models or organizing digital archives can replace manual labeling with an automated pipeline.
Model size: varies. GPU VRAM: ~16 GB required.
Streamlined local media tagging workflows
- Batch-processes images and video files with real-time progress tracking.
- Offers prompt customization and selectable caption length modes.
- Applies 8-bit or 4-bit quantization to reduce hardware strain.
- Includes Flash Attention 2 toggles and automatic fallback options.
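The fallback idea in the last bullet can be sketched roughly as follows. This is not the tool's actual code: the function name pick_attention is hypothetical, and the sketch simply assumes that Flash Attention 2 is usable whenever the flash_attn package is importable. The returned strings are the values Hugging Face Transformers accepts for its attn_implementation argument; which fallback simple-captioner really uses is not stated in its announcement.

```python
import importlib.util


def pick_attention(prefer_flash: bool = True) -> str:
    """Choose an attention backend name, falling back when flash_attn is absent.

    Hypothetical helper for illustration only; simple-captioner's own
    fallback logic may differ.
    """
    # Only request Flash Attention 2 if the user wants it AND the package exists.
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # PyTorch's built-in scaled dot-product attention needs no extra install.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attention())
```

Checking for the package up front avoids a failed model load on machines where Flash Attention 2 was never installed.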
Creative professionals handling local media libraries often struggle with consistent metadata entry. By recursively scanning subfolders and skipping previously tagged files, this utility removes repetitive typing while keeping data strictly on-device.
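A minimal sketch of that scan-and-skip behavior is shown here, under stated assumptions: captions are taken to be .txt files sharing the media file's name, the extension list is illustrative, and find_untagged is a hypothetical name rather than a function from simple-captioner.

```python
from pathlib import Path

# Assumed extension list; the tool's real supported formats may differ.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".mp4", ".mov"}


def find_untagged(root: Path) -> list[Path]:
    """Return media files under root that have no sibling .txt caption yet."""
    pending = []
    # rglob("*") walks every subfolder beneath root.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in MEDIA_EXTENSIONS:
            continue
        # Assumption: a caption lives next to the media file as <name>.txt.
        if path.with_suffix(".txt").exists():
            continue
        pending.append(path)
    return pending
```

Calling find_untagged(Path("my_dataset")) on a partially captioned folder would return only the files that still need a caption, so an interrupted batch can resume without redoing finished work.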
Practical considerations for new updates
Recent updates refresh the interface to support Gradio 6.10.0 alongside new memory cleanup routines. Memory demands shift depending on the chosen vision model, making lower-precision settings necessary for cards near the baseline specification.
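To see why precision matters on a ~16 GB card, a back-of-the-envelope estimate helps. The sketch below assumes a hypothetical 7-billion-parameter model and counts only the weights; activations, the vision encoder's working buffers, and framework overhead add to the total, so real usage is higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


if __name__ == "__main__":
    # 7e9 parameters is an illustrative figure, not a spec of any bundled model.
    for bits in (16, 8, 4):
        print(f"{bits}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

For this assumed model size, the weights drop from roughly 13 GiB at 16-bit to about 6.5 GiB at 8-bit and about 3.3 GiB at 4-bit, which is the headroom that quantization buys on cards near the baseline.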
'It's built for my own use-cases and seems to work ok enough, but there can be issues hiding as always, so open a GitHub issue if you find something broken,' noted the creator in a community post.
Initial setup requires manual installation of a CUDA-compatible PyTorch build before running the application.
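Before launching the application, it can be worth confirming that the installed PyTorch build actually sees the GPU. The check below is a generic sketch, not part of simple-captioner; cuda_status and its return strings are hypothetical, while torch.version.cuda and torch.cuda.is_available() are standard PyTorch attributes.

```python
import importlib.util


def cuda_status() -> str:
    """Report whether a CUDA-capable PyTorch build is installed and usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch

    # torch.version.cuda is None on CPU-only builds.
    if torch.version.cuda is None:
        return "cpu-only-build"
    # A CUDA build can still fail to find a GPU (driver or hardware issue).
    return "cuda-ready" if torch.cuda.is_available() else "cuda-build-no-gpu"


if __name__ == "__main__":
    print(cuda_status())
```

Anything other than "cuda-ready" suggests reinstalling PyTorch with a CUDA build or checking the GPU driver before starting a batch.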
Local creators can manage media sets efficiently while keeping data private. The latest release of simple-captioner is available on GitHub.