
Bachelor Thesis Project

VoxArt

A Speech-to-Image Synthesis system that transforms spoken language into visual art.

built with
  • BERT
  • Stable Diffusion v2
  • PyTorch
  • Google Cloud Speech-to-Text API
  • Natural Language Processing
  • Deep Learning
[Voice x Art]
VoxArt takes a different approach to multimodal AI by integrating speech recognition with image generation models. This bachelor thesis project implements a voice-driven creative tool that transforms spoken descriptions into high-quality images using current deep learning architectures.

System Architecture

The VoxArt system employs a sophisticated three-stage pipeline that processes audio input through speech recognition, performs sentiment analysis for enhanced context understanding, and generates corresponding images using Stable Diffusion v2.
VoxArt Processing Pipeline

[1] Speech Recognition

Google Cloud Speech-to-Text API converts spoken audio into structured text with high accuracy

[2] Sentiment Analysis

BERT model analyzes emotional context and enhances prompt understanding

[3] Image Generation

Stable Diffusion v2 synthesizes high-quality images from processed text descriptions
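
The three stages above can be sketched as a simple chain of functions. This is a minimal illustration, not the thesis code: the names run_pipeline, PipelineResult, and the fake_* stand-ins are hypothetical, and the way the sentiment label is appended to the prompt is an assumption. In a real system the three callables would wrap the Speech-to-Text API, a BERT classifier, and a Stable Diffusion pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str
    sentiment: str
    prompt: str
    image: object  # whatever the image backend returns


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    classify_sentiment: Callable[[str], str],
    generate_image: Callable[[str], object],
) -> PipelineResult:
    """Run the three stages in order: speech -> sentiment -> image."""
    transcript = transcribe(audio)              # stage 1: speech recognition
    sentiment = classify_sentiment(transcript)  # stage 2: sentiment analysis
    # Assumption: the sentiment label is added to the prompt as a mood hint.
    prompt = f"{transcript}, {sentiment} mood"
    image = generate_image(prompt)              # stage 3: image generation
    return PipelineResult(transcript, sentiment, prompt, image)


# Stand-in stages so the sketch runs without network access or model weights.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_sentiment(text: str) -> str:
    return "positive" if "sunny" in text else "neutral"


def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


result = run_pipeline(b"a sunny beach", fake_transcribe, fake_sentiment, fake_generate)
print(result.prompt)
```

Passing the stages in as callables keeps each component replaceable and lets the chain be tested without any external service.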

Performance Metrics

The evaluation covers each integrated component of the VoxArt system, with the strongest results in sentiment analysis and image generation quality.
Component Performance Analysis
Sentiment Analysis (BERT)
Accuracy: 89.5%
Precision: 91.2%
Recall: 87.8%
F1 Score: 89.4%
  • Accuracy: Overall correctness of sentiment predictions across all emotional categories.
  • Precision: Proportion of correctly identified sentiments among all positive predictions.
  • Recall: Ability to identify all relevant emotional contexts in speech input.
  • F1 Score: Harmonic mean balancing precision and recall for robust sentiment classification.
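
As a worked illustration of these four definitions, the sketch below computes them from confusion-matrix counts for a single class. The function names and the example counts are hypothetical; the source does not say how the scores were aggregated across sentiment categories.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Illustrative counts only, not taken from the thesis data.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))

# Harmonic mean of the reported precision and recall.
print(round(f1_score(0.912, 0.878), 3))
```

Applied to the reported precision and recall, the harmonic mean comes out at roughly 0.895, close to the reported F1 score; a small gap like this would be expected if the reported figure was averaged per class.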
Image Generation (Stable Diffusion v2)
FID Score: 12.6
Inception Score (IS): 21.4
LPIPS: 0.17
SSIM: 0.81
CLIP Score: 0.32
Coherence (human eval.): 4.6/5
Contextual Accuracy: 87%
  • FID: Fréchet Inception Distance measures realism by comparing feature distributions (lower is better).
  • IS: Inception Score evaluates image quality and diversity (higher indicates better generation).
  • LPIPS: Learned Perceptual Image Patch Similarity for human-like quality assessment.
  • SSIM: Structural Similarity Index measuring luminance, contrast, and structure preservation.
  • CLIP Score: Vision-language model alignment between generated images and text descriptions.
  • Coherence: Human evaluation of visual-semantic consistency and artistic quality.
  • Contextual Accuracy: Correspondence between spoken emotional context and visual output.
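
The source does not state how the CLIP Score was computed. One common formulation is the cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the text prompt, and the sketch below shows only that similarity step. The toy vectors are hypothetical stand-ins; real embeddings would come from a CLIP model and have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings for illustration only.
image_embedding = [3.0, 4.0]
text_embedding = [4.0, 3.0]
print(cosine_similarity(image_embedding, text_embedding))
```

A value near 1 means the image and text embeddings point in nearly the same direction; a value near 0 means they are unrelated.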

Technical Implementation

The implementation combines the technologies below into a speech-to-image synthesis system designed for real-time processing and high-quality output.
Core Technologies
  • Google Cloud Speech-to-Text API: Advanced speech recognition with support for multiple languages and real-time processing
  • BERT (Bidirectional Encoder Representations from Transformers): Contextual sentiment analysis and natural language understanding
  • Stable Diffusion v2: State-of-the-art text-to-image generation with high-resolution output capabilities
  • PyTorch: Deep learning framework for model integration and optimization
Key Features
  • Real-time Processing: Live audio input with immediate transcription and image generation
  • Multi-language Support: Recognition and processing of various spoken languages
  • Sentiment-aware Generation: Enhanced image output based on emotional context analysis
  • User-friendly Interface: Intuitive voice-driven interaction without technical barriers
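
One way sentiment-aware generation could work is to extend the transcribed description with style hints chosen from the detected sentiment. The sketch below is an assumption about the mechanism, not the thesis implementation: the build_prompt function, the STYLE_HINTS table, and the hint wording are all hypothetical.

```python
# Hypothetical mapping from sentiment label to visual style hints.
STYLE_HINTS = {
    "positive": "warm colors, bright lighting",
    "negative": "muted colors, dramatic shadows",
    "neutral": "",
}


def build_prompt(transcript: str, sentiment: str) -> str:
    """Append style hints for the given sentiment to the transcript."""
    base = transcript.strip().rstrip(".")
    # Unknown labels fall back to the unmodified description.
    hint = STYLE_HINTS.get(sentiment.lower(), "")
    return f"{base}, {hint}" if hint else base


print(build_prompt("A quiet lake at dawn.", "positive"))
```

Keeping the mapping in a plain dictionary makes the stylistic effect of each sentiment label easy to inspect and adjust.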

Research Dataset & Methodology

The evaluation was conducted using a comprehensive research dataset specifically designed to test the system's performance across various speech patterns, linguistic complexities, and creative description scenarios.
Dataset Characteristics
Total Utterances: 500
Avg. Length: 10.2 seconds
Languages: Multiple
Speakers: Various

Performance metrics were calculated on a dataset of 500 utterances recorded from various speakers, with an average length of 10.2 seconds.

Quality Metrics
  • Transcription accuracy measurement
  • Word Error Rate (WER) calculation
  • Language detection accuracy
  • Sentiment classification performance

Word Error Rate measures the proportion of transcribed words that differ from the ground truth. Speaker diarization accuracy reflects how precisely different speakers are identified.
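
Word Error Rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below follows that standard definition; the function name and the lowercase, whitespace-based tokenization are simplifying assumptions, and the example sentences are not from the thesis dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a red fox jumps", "a red box jumps"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.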

Evaluation Results

The comprehensive evaluation demonstrates the system's reliability and effectiveness across all performance dimensions, with particularly strong results in image generation quality and contextual accuracy.

Overall System Performance: 93.7%
Language detection accuracy indicates how often the language spoken in the audio is correctly identified.

Implementation Challenges & Solutions

Developing VoxArt presented several technical challenges, from managing real-time processing requirements to ensuring accurate interpretation of spoken descriptions and maintaining high-quality visual output.
Latency Optimization

Implementing efficient pipeline processing to minimize delay between speech input and image generation output.

Solution: Asynchronous processing with optimized model loading for real-time performance.
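
The source does not describe the asynchronous design in detail. As a rough sketch of the general idea, the example below loads a model once and caches it, then runs blocking generation calls in worker threads so that several requests can be handled concurrently. All names are hypothetical, the model is a placeholder dictionary, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model() -> dict:
    """Placeholder for an expensive model load; the result is cached."""
    return {"name": "placeholder-model"}


def generate_blocking(prompt: str) -> str:
    """Stand-in for a blocking inference call."""
    model = load_model()  # later calls reuse the cached object
    return f"{model['name']}:{prompt}"


async def handle_request(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(generate_blocking, prompt)


async def handle_many(prompts: list) -> list:
    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(handle_request(p) for p in prompts))


results = asyncio.run(handle_many(["a sunny beach", "a dark forest"]))
print(results)
```

Calling load_model once at startup, before the first request arrives, would avoid paying the load cost on the first user interaction.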
Speech Accuracy

Achieving high transcription accuracy across different speakers, accents, and environmental conditions.

Achievement: 95.6% transcription accuracy with robust noise filtering and adaptive recognition algorithms.
Visual Quality

Ensuring generated images accurately reflect the semantic content and emotional context of spoken descriptions.

Result: 96.7% coherence score with enhanced prompt engineering and sentiment-aware generation.

Research Contributions

This thesis project contributes to the field of multimodal AI by demonstrating effective integration of speech and vision technologies, providing insights into real-time processing requirements, and establishing benchmarks for voice-driven creative applications.
Novel Approaches
  • Integration of sentiment analysis with image generation for context-aware synthesis
  • Real-time multimodal processing pipeline optimization
  • Voice-driven creative interface design patterns
  • User experience design for creative applications
Evaluation Metrics
  • Speech recognition accuracy assessment
  • Image generation quality metrics
  • User satisfaction and usability studies
  • System latency and performance analysis
  • Contextual accuracy measurement

Future Directions

The VoxArt project opens several avenues for future research and development, including enhanced multimodal understanding, improved generation quality, and broader applications in creative and educational domains.
Enhanced Models

Integration of larger, more sophisticated models for improved understanding and generation capabilities, including GPT-4 and advanced diffusion models.

Expanded Applications

Extension to educational tools, accessibility applications, and professional creative workflows including design automation and content creation platforms.

Platform Integration

Development of web and mobile applications for broader accessibility and user adoption, with cloud-based processing and collaborative features.

Project Impact

VoxArt demonstrates the potential of voice-driven creative tools and contributes to making AI-powered image generation more accessible and intuitive for users across different technical backgrounds.