Oct 7, 2025
How We Built a Self-Improving Multimodal Data Recognition System with Reinforcement Learning
Ashutosh Synghal, VP Engineering: When traditional approaches hit a wall, we turned to RL to revolutionize multimodal data transcription quality and speed
The Problem We Faced
Building high-quality speech-to-text models traditionally requires massive amounts of human-labeled audio data. We were hitting a frustrating bottleneck: our baseline Whisper model achieved decent results with Word Error Rates (WER) around 4%, but improving it further meant months of expensive human transcription and annotation work. We knew there had to be a better way.
The real challenge wasn't just accuracy—it was scale and comprehensiveness. We needed to produce both high-quality transcriptions and detailed annotations while continuously improving our model's performance. Traditional supervised learning approaches would take 9+ months to collect the quality data we needed.
Our Solution: Reinforcement Learning Meets Speech Recognition
We decided to tackle this with a novel approach: using reinforcement learning to create a self-improving speech recognition system. Here's how we did it.
Step 1: Building Our RL Environment
First, we created what we call an "AudioTranscriptionEnvironment"—essentially a specialized playground for our speech model to learn and improve:
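A heavily simplified sketch gives the flavor of it; the class and method names here are illustrative stand-ins rather than our production code, and the reward is just one minus the word error rate against a reference transcript:

```python
# Illustrative sketch only: names and the simple WER-based reward are stand-ins.
import jiwer
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor


class AudioTranscriptionEnvironment:
    """Wraps audio preprocessing, transcription, and reward computation."""

    def __init__(self, model_name: str = "openai/whisper-small"):
        self.processor = WhisperProcessor.from_pretrained(model_name)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_name)

    def transcribe(self, audio, sampling_rate: int = 16_000) -> str:
        # Convert the raw waveform to log-mel features, then decode greedily.
        inputs = self.processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
        with torch.no_grad():
            ids = self.model.generate(inputs.input_features)
        return self.processor.batch_decode(ids, skip_special_tokens=True)[0]

    def reward(self, hypothesis: str, reference: str) -> float:
        # Higher reward for a lower word error rate against the reference transcript.
        return 1.0 - jiwer.wer(reference, hypothesis)
```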
This environment handles audio preprocessing, calculates rewards based on transcription accuracy, and manages our decentralized data collection system. Think of it as creating the rules of the game for our speech recognition challenge.
Step 2: Choosing the Right Algorithm
We evaluated several reinforcement learning approaches and settled on Direct Preference Optimization (DPO) over more complex alternatives like PPO. Why DPO?
Simpler data collection: Instead of requiring absolute quality scores, DPO only needs preference pairs ("this transcription is better than that one"); a sketch of the pair format and objective follows this list
Perfect for decentralized feedback: Contributors can easily compare transcription and annotation pairs without needing expertise in audio evaluation
More stable training: No separate reward model to train, reducing complexity
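To make the preference-pair idea concrete, here is a minimal sketch of a labeled pair and of the DPO objective on a batch of such pairs. The variable names are illustrative; the inputs to the loss are summed per-token log-probabilities of each transcription under the policy being trained and under a frozen reference model.

```python
import torch
import torch.nn.functional as F

# A preference pair roughly as a contributor would label it (fields are illustrative).
pair = {
    "audio_id": "clip_0421",
    "chosen": "schedule the meeting for two thirty on Friday",
    "rejected": "schedule the meeting for 230 on friday",
}


def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: rank the preferred transcription above the rejected one."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The implicit reward is beta times the log-ratio against the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```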
Step 3: Engineering a Robust Multimodal Extension for VERL
One of our most significant technical achievements was building a comprehensive abstraction layer that extends VERL's capabilities to handle multimodal data. This wasn't just a simple wrapper—we engineered a sophisticated system that fundamentally bridges the gap between audio processing and text-based RL training:
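Stripped to its essentials, the layer is an adapter: it encodes the audio and hands the RL trainer nothing but token sequences and their scores. The sketch below only illustrates that shape; the class names, fields, and the log-probability approximation are our own assumptions, not the actual extension.

```python
from dataclasses import dataclass

import torch


@dataclass
class MultimodalSample:
    """One training example: audio features plus a preference pair of transcriptions."""
    input_features: torch.Tensor   # log-mel spectrogram frames
    chosen_ids: torch.Tensor       # token ids of the preferred transcription
    rejected_ids: torch.Tensor     # token ids of the rejected transcription


class AudioToTextAdapter:
    """Bridges audio inputs to a text-based preference trainer."""

    def __init__(self, model, processor):
        self.model = model          # e.g. a Whisper-style encoder-decoder
        self.processor = processor  # feature extractor + tokenizer

    def sequence_logprob(self, sample: MultimodalSample, labels: torch.Tensor) -> torch.Tensor:
        # Score a candidate transcription conditioned on the encoded audio; the
        # trainer never touches the audio directly, only these scores.
        out = self.model(input_features=sample.input_features.unsqueeze(0),
                         labels=labels.unsqueeze(0))
        # Mean token cross-entropy times token count approximates the summed log-probability.
        return -out.loss * labels.numel()
```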
This robust extension required solving complex challenges around semantic preservation, modality alignment, and maintaining gradient flow across different data representations. The result was a production-ready system that could seamlessly leverage VERL's state-of-the-art RL algorithms while operating on audio data—something that had never been done before at this scale.
Step 4: Integration and Training Pipeline
Our complete pipeline follows this flow (a condensed sketch of the full loop appears after the list):
Start with baseline model: Downloaded and tested Whisper performance
Collect human preferences: Used our existing platform to collect preference comparisons between transcription and annotation pairs
Engineer multimodal extension: Built a sophisticated abstraction layer that enables modern RL frameworks to work with audio data for the first time
Train with DPO: Used preference pairs to improve the model via reinforcement learning through our extended framework
Generate synthetic data: Used the improved model to create more training examples
Continuous improvement: Fed synthetic preferences back into training for ongoing enhancement
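Condensed into code, the cycle looks roughly like the sketch below; the stage functions are placeholders for the steps listed above rather than real APIs, so they are passed in as callables.

```python
def improvement_loop(model, human_pairs, dpo_finetune, generate_pairs, rank_and_filter,
                     rounds: int = 3):
    """Illustrative outline of the improve-and-regenerate cycle."""
    preference_pairs = list(human_pairs)                  # seed with human comparisons
    for _ in range(rounds):
        model = dpo_finetune(model, preference_pairs)     # DPO training via the extension
        synthetic = generate_pairs(model)                 # model-generated candidate pairs
        preference_pairs += rank_and_filter(synthetic)    # keep only high-confidence pairs
    return model
```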
The Results: Beyond Our Expectations
The numbers speak for themselves:
Quality Improvements:
Significant reduction in Word Error Rate: from the 2-3% range down to even lower error rates
99.6% human-level accuracy: Our final model matches human transcription and annotation quality on complex audio
Speed and Scale:
20x throughput improvement: Massive scaling of audio processing capabilities
10x faster dataset expansion: 10,000+ hours of quality transcribed and annotated audio in 7 weeks instead of 9+ months
83% reduction in training time: Complete RL fine-tuning in 4 days vs. industry standard 3-4 weeks
Key Technical Insights
Why Reinforcement Learning Works for Speech Recognition
Traditional supervised learning for speech recognition requires perfectly labeled data. But RL lets us learn from preferences and comparisons, which are much easier for humans to provide accurately. Instead of asking "Is this transcription and annotation perfect?", we ask "Which transcription-annotation pair is better?"—a much more reliable signal.
Engineering a Revolutionary Multimodal RL Extension
One of our most significant technical breakthroughs was developing a comprehensive multimodal extension that fundamentally expands reinforcement learning capabilities beyond text-only processing. This wasn't just adapting existing tools—we built a novel architectural layer that enables robust reinforcement learning on multimodal data:
Semantic preservation across modalities: Maintaining meaning and context when translating between audio and text representations
Cross-modal gradient flow optimization: Ensuring training signals propagate correctly through different data types
Dynamic modality alignment: Real-time validation that audio features map correctly to RL training objectives
Production-scale robustness: Handling edge cases, data quality variations, and scaling challenges
Our extension creates what is essentially a new class of RL framework—one that can seamlessly operate on audio, text, and their complex interrelationships. This breakthrough enabled us to apply VERL's cutting-edge algorithms to speech recognition for the first time, opening up entirely new possibilities for multimodal AI training.
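As one small, concrete example of what the "dynamic modality alignment" checks can look like, a validation pass over incoming samples might resemble the sketch below; the specific thresholds and field names are invented for illustration.

```python
import torch


def validate_alignment(input_features: torch.Tensor,
                       chosen_ids: torch.Tensor,
                       rejected_ids: torch.Tensor) -> bool:
    """Cheap sanity checks before a multimodal sample enters RL training."""
    if torch.isnan(input_features).any():      # corrupted audio features
        return False
    if input_features.shape[-1] < 10:          # implausibly short clip
        return False
    if chosen_ids.numel() == 0 or rejected_ids.numel() == 0:
        return False                           # an empty transcription slipped through
    return True
```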
The Power of Synthetic Data Generation
Once our model improved beyond a certain threshold, we could use it to generate high-quality synthetic training data. This created a virtuous cycle: better model → better synthetic data → even better model.
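One simple way to realize this cycle, sketched here under our own simplifying assumptions rather than as the production pipeline, is to sample several transcriptions per clip from the improved model and turn the best and worst into a new preference pair. The scoring function is a placeholder for whatever quality signal is available, such as the model's own sequence confidence.

```python
def synthesize_pairs(model, processor, audio_batch, score, num_samples: int = 4):
    """Turn the improved model's own outputs into new preference pairs (illustrative)."""
    pairs = []
    for audio in audio_batch:
        inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
        # Sample several candidate transcriptions from the improved model.
        candidates = []
        for _ in range(num_samples):
            ids = model.generate(inputs.input_features, do_sample=True, temperature=0.7)
            candidates.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
        # Best and worst candidates (by the supplied score) form a synthetic pair.
        ranked = sorted(candidates, key=score, reverse=True)
        pairs.append({"chosen": ranked[0], "rejected": ranked[-1]})
    return pairs
```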
Decentralized Data Collection
By distributing data collection and using blockchain verification, we achieved both scale and quality control. Contributors could work independently while maintaining data integrity through cryptographic verification.
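As a minimal illustration of the content-integrity side (leaving the on-chain bookkeeping out of scope), each contribution can be reduced to a deterministic fingerprint that verifiers can recompute later; the field names are illustrative.

```python
import hashlib
import json


def contribution_fingerprint(audio_bytes: bytes, transcription: str, annotator_id: str) -> str:
    """Deterministic fingerprint of a contribution for later integrity checks."""
    payload = {
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "transcription": transcription,
        "annotator_id": annotator_id,
    }
    # Hash a canonical JSON encoding so any later tampering changes the fingerprint.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```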
Lessons Learned
What worked:
DPO was the right choice for our decentralized setup
Our novel multimodal extension unlocked RL potential for speech recognition
Combining human and synthetic preferences accelerated improvement
The robust abstraction architecture enabled seamless scaling across modalities
What we'd do differently:
Start with a larger initial batch of human preferences
Design the multimodal extension architecture even earlier in the process
Build more comprehensive tools for monitoring cross-modal training dynamics
Looking Forward
This approach opens up exciting possibilities beyond speech recognition. The same principles could apply to any multimodal AI task where quality evaluation is subjective but comparisons are reliable—video captioning, image description, or even creative content generation.
We've shown that reinforcement learning isn't just for games or robotics. When applied thoughtfully to real-world problems like speech recognition, it can deliver dramatic improvements in both quality and efficiency.
The future of AI training might not be about collecting more labeled data—it might be about building better feedback loops that let our models teach themselves.