VoxArt — Ansh Dawda

VoxArt is a voice-driven creative tool that combines speech recognition and image generation. The bachelor thesis passes spoken descriptions through transcription, sentiment analysis, and Stable Diffusion v2 to produce a context-aware visual result.

System Architecture

The pipeline runs in three stages: it processes audio input through speech recognition, performs sentiment analysis for context, and generates the corresponding image with Stable Diffusion v2.

Speech recognition: Google Cloud Speech-to-Text converts the audio to text.
Sentiment analysis: BERT identifies emotional context and enriches the prompt.
Image generation: Stable Diffusion v2 synthesizes the final visual.
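
The three stages form a simple chain in which each stage consumes the previous stage's output. The sketch below is a minimal illustration of that shape, not the thesis implementation: `run_pipeline`, `PipelineResult`, and the stand-in stage functions are hypothetical names, and real stages would call Google Cloud Speech-to-Text, BERT, and Stable Diffusion v2 in their place.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PipelineResult:
    transcript: str  # output of the speech recognition stage
    prompt: str      # transcript enriched with emotional context
    image: Any       # whatever the generation stage returns


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    enrich: Callable[[str], str],
    generate: Callable[[str], Any],
) -> PipelineResult:
    """Chain the three stages; each consumes the previous stage's output."""
    transcript = transcribe(audio)
    prompt = enrich(transcript)
    image = generate(prompt)
    return PipelineResult(transcript, prompt, image)


# Stand-in stages (assumptions for illustration only, no real models involved).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_enrich(text: str) -> str:
    return f"{text}, calm mood"


def fake_generate(prompt: str) -> str:
    return f"<image: {prompt}>"


result = run_pipeline(b"a lake at dawn", fake_transcribe, fake_enrich, fake_generate)
print(result.prompt)
```

Passing the stages in as callables keeps the wiring independent of any particular model, so a stage can be swapped or mocked without touching the chain.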

Performance

Each component was measured on its own terms: classification metrics for the sentiment model, generation metrics for the diffusion model, and alignment metrics for the pairing between them.

Metric	BERT (sentiment)	Stable Diffusion v2 (generation)	Text–image alignment
Classification
Accuracy	89.5%	–	–
Precision	91.2%	–	–
Recall	87.8%	–	–
F1	89.4%	–	–
Generation quality
FID (lower is better)	–	12.6	–
Inception Score	–	21.4	–
LPIPS (lower is better)	–	0.17	–
SSIM	–	0.81	–
Alignment
CLIP score	–	–	0.32
Human coherence	–	–	4.6 / 5
Contextual accuracy	–	–	87%

Component-level results across the three stages of the pipeline. Sentiment figures are classification metrics; generation figures are standard image-quality measures, where lower FID and LPIPS are better and higher Inception Score and SSIM are better.
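
Two of the reported metrics have short closed forms that can be checked by hand. F1 is the harmonic mean of precision and recall; applied to the rounded precision and recall above it gives roughly 0.895, which agrees with the reported 89.4% once rounding of the inputs is taken into account. For the CLIP score, the sketch assumes the figure is a plain cosine similarity between text and image embeddings. The source does not state the exact definition used, and the toy vectors are placeholders rather than real CLIP embeddings.

```python
import math
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Precision and recall as reported (rounded) in the table.
f1 = f1_score(0.912, 0.878)
print(f"F1 from rounded precision and recall: {f1:.4f}")

# Toy embeddings (assumption): real ones would come from a CLIP model.
text_embedding = [0.2, 0.1, 0.7]
image_embedding = [0.1, 0.3, 0.6]
print(f"Toy alignment score: {cosine_similarity(text_embedding, image_embedding):.3f}")
```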

Evaluation dataset

Evaluation used a purpose-built set covering a range of speech patterns, linguistic complexity, and creative description scenarios.

Total utterances: 500
Average length: 10.2 s
Coverage: Multiple languages and speakers
Transcription accuracy: 95.6%
Overall system performance: 93.7%
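
The source does not say how transcription accuracy was computed. One common convention, assumed here purely for illustration, is one minus the word error rate (WER), where WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The function names below are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def transcription_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed convention: accuracy = 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# One substituted word out of seven.
acc = transcription_accuracy(
    "paint a red sunset over the sea",
    "paint a red sunset over the see",
)
print(f"{acc:.3f}")
```

Averaging this value over all 500 utterances would give a corpus-level figure, although weighting by utterance length is an equally defensible choice.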

Technical Implementation

Core technologies

Google Cloud Speech-to-Text: speech recognition with multi-language support and real-time processing.
BERT: contextual sentiment analysis and natural language understanding.
Stable Diffusion v2: text-to-image generation at high resolution.
PyTorch: model integration and optimization across the pipeline.

Key features

Real-time processing: live audio in, transcription and image generation out.
Multi-language support: recognition and processing across several spoken languages.
Sentiment-aware generation: image output shaped by the emotional context of the description.
Voice-driven interface: no technical barrier between speaking and seeing a result.

Challenges

Three problems dominated development, each trading latency, accuracy, or fidelity against the others.

Minimizing the delay between speech input and image output meant restructuring the pipeline around asynchronous processing with preloaded models, rather than running the three stages strictly in sequence.
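
The idea of preloading models and processing asynchronously can be sketched with Python's `asyncio` (3.9 or later for `asyncio.to_thread`). This is an illustration under stated assumptions, not the thesis code: `PreloadedModels` and its methods are hypothetical stand-ins that sleep instead of running real models. Within one utterance each stage still needs the previous stage's output, so the overlap in this sketch comes from handling several utterances concurrently while the models are loaded only once.

```python
import asyncio
import time


class PreloadedModels:
    """Hypothetical container: weights are loaded once and reused per request."""

    def __init__(self) -> None:
        self.load_count = 0
        self._load()

    def _load(self) -> None:
        self.load_count += 1  # stands in for expensive model loading

    def transcribe(self, audio: bytes) -> str:
        time.sleep(0.05)  # simulated blocking inference
        return audio.decode("utf-8")

    def sentiment(self, text: str) -> str:
        time.sleep(0.05)
        return "positive" if "sunny" in text else "neutral"

    def generate(self, prompt: str) -> str:
        time.sleep(0.05)
        return f"<image: {prompt}>"


async def handle_utterance(models: PreloadedModels, audio: bytes) -> str:
    # Blocking stages run in worker threads so the event loop stays responsive.
    text = await asyncio.to_thread(models.transcribe, audio)
    mood = await asyncio.to_thread(models.sentiment, text)
    return await asyncio.to_thread(models.generate, f"{text}, {mood} mood")


async def handle_many(models: PreloadedModels, clips: list) -> list:
    # gather() runs the per-utterance chains concurrently and keeps input order.
    return await asyncio.gather(*(handle_utterance(models, c) for c in clips))


if __name__ == "__main__":
    shared = PreloadedModels()  # loaded once, before any audio arrives
    for image in asyncio.run(handle_many(shared, [b"a sunny beach", b"a quiet street"])):
        print(image)
```

With real GPU-bound models, a thread pool may not yield true parallelism; batching or a dedicated worker process would be alternatives worth measuring.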

Transcription had to hold up across speakers, accents, and noise conditions. Noise filtering and adaptive recognition brought it to 95.6% accuracy.

Generated images needed to reflect both the semantic content and the emotional register of what was said. Prompt engineering informed by the sentiment stage reached a 96.7% coherence score.
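
One way sentiment can inform the prompt is by appending style modifiers chosen from the detected emotion. The sketch below assumes a three-label sentiment output with a confidence score. The mapping, the threshold, and the name `build_prompt` are hypothetical and are not taken from the thesis.

```python
# Hypothetical mapping from sentiment label to style modifiers.
STYLE_BY_SENTIMENT = {
    "positive": "warm lighting, vivid colors",
    "negative": "muted palette, overcast lighting",
    "neutral": "natural lighting, balanced colors",
}


def build_prompt(
    transcript: str,
    sentiment: str,
    confidence: float,
    threshold: float = 0.6,
) -> str:
    """Append style modifiers when the sentiment prediction is confident enough."""
    text = transcript.strip().rstrip(".")
    if confidence < threshold:
        return text  # low confidence: leave the description untouched
    style = STYLE_BY_SENTIMENT.get(sentiment)
    if style is None:
        return text  # unknown label: fall back to the plain description
    return f"{text}, {style}"


print(build_prompt("A quiet harbor at dusk.", "positive", 0.9))
```

Gating on confidence keeps a weak or ambiguous sentiment prediction from overriding what the speaker actually described.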

Research Contributions

The thesis contributes to multimodal AI by demonstrating a working integration of speech and vision models, and by establishing evaluation benchmarks for voice-driven creative applications.

Integration of sentiment analysis with image generation for context-aware synthesis.
Optimization of a real-time multimodal processing pipeline.
Design patterns for voice-driven creative interfaces.
An evaluation framework spanning recognition accuracy, generation quality, latency, and contextual accuracy.

Future Directions

Integrate stronger language and diffusion models for richer generation.
Extend the system to education, accessibility, and professional creative workflows.
Package the pipeline for web and mobile use with collaborative, cloud-backed processing.

VoxArt shows the potential of voice-driven creative tools and makes AI-powered image generation accessible to people regardless of technical background.