MLOps Pipeline — Production LLM Training from Experiment to Deployment • Ylli Prifti

The Problem: From Notebook to Production

After proving that specialized models could outperform generic ones for domain-specific tasks, the next challenge became operationalizing the training process. The initial AnySecret Assistant training was functional but not production-ready: hardcoded paths, manual model evaluation, no experiment tracking, and copy-paste deployment.

The goal: build a production MLOps pipeline that could reliably train, evaluate, and deploy domain-specific models with minimal manual intervention.

The Production Pipeline Architecture

The enhanced MLOps pipeline represents a complete rethinking of the training workflow, designed around reproducibility, automation, and professional deployment practices.

Core Components

Training Infrastructure

Hydra configuration management - No more hardcoded parameters
Timestamped experiment runs - Every training run gets a unique directory
Automatic model versioning - Models tagged and tracked from training to deployment
Multi-GPU support - Local and remote training configurations

Evaluation Framework

Comprehensive metrics - ROUGE, BLEU, exact match, Jaccard overlap
Baseline comparison - Automatically compare against base model performance
Quality gates - Models only deploy if they beat baseline metrics
Adversarial testing - Systematic evaluation on edge cases

Deployment Automation

HuggingFace Hub integration - One-command publishing
Format conversion - Automatic GGUF generation for Ollama deployment
Model card generation - Automated documentation with metrics and training details

MLOps Pipeline Architecture

Configuration-Driven Training

The pipeline uses Hydra for hierarchical configuration management, enabling systematic experimentation without code changes.

Training Configurations

# configs/train.yaml - Production configuration
model_name: meta-llama/Meta-Llama-3-8B  # Base model for maximum flexibility
dataset_path: ./data
dataset_pattern: "*.jsonl"
data_format: messages  # Support both chat and completion formats

# Hyperparameters optimized for domain training
epochs: 16
batch_size: 8
lr: 0.0025  # Higher LR for LoRA training
max_length: 512
gradient_accumulation_steps: 4

# LoRA configuration for maximum capacity
lora_r: 64          # High rank for complex domain knowledge
lora_alpha: 128     # 2x scaling factor
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Memory optimization
gradient_checkpointing: true
bf16: true  # Modern GPU precision
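
As a rough check on what these settings imply, the sketch below works out the effective batch size per optimizer step, the LoRA scaling factor (alpha divided by rank), and an approximate count of trainable adapter weights. The layer shapes are assumptions based on the published Llama 3 8B architecture and are not part of the config above; the batch-size figure also assumes a single GPU.

```python
# Values copied from configs/train.yaml above.
batch_size = 8
gradient_accumulation_steps = 4
lora_r = 64
lora_alpha = 128

# Assumed Llama 3 8B shapes, (in_features, out_features) per target module.
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# Samples contributing to each optimizer step on one GPU.
effective_batch_size = batch_size * gradient_accumulation_steps

# LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# Each adapted matrix adds A (r x in) and B (out x r): r * (in + out) weights.
params_per_layer = sum(lora_r * (i + o) for i, o in module_shapes.values())
trainable_params = params_per_layer * LAYERS

print(effective_batch_size)      # 32
print(lora_scaling)              # 2.0
print(f"{trainable_params:,}")   # 167,772,160
```

Under these assumptions the adapters add roughly 168M trainable weights, about 2% of the 8B base model, which is why gradient checkpointing and bf16 are enabled alongside the high rank.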

Multi-GPU Configurations

The pipeline includes optimized configurations for different hardware setups:

train_small - RTX 3080 (12GB) with Qwen-0.5B
train_medium - RTX 4090 (24GB) with Llama-3B
train_large - A100 (40GB+) with Llama-8B
train_remote - SSH-based cloud training

Data Pipeline & Quality Assurance

Flexible Data Format Support

The pipeline handles multiple data formats with automatic conversion:

# Chat/Messages format
{"messages": [
  {"role": "user", "content": "How do I bootstrap AnySecret for AWS?"},
  {"role": "assistant", "content": "Use `anysecret config --provider aws...`"}
]}

# Prompt/Completion format  
{"prompt": "Bootstrap AnySecret for AWS", "completion": "Use `anysecret config...`"}
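
The automatic conversion between the two formats could look roughly like the sketch below. The function name to_messages is hypothetical, and mapping the prompt to a user turn and the completion to an assistant turn is an assumption about how the pipeline normalizes records.

```python
import json


def to_messages(record):
    """Return the record in chat/messages form, converting if needed."""
    if "messages" in record:
        # Already in chat format: pass through unchanged.
        return record
    if "prompt" in record and "completion" in record:
        return {
            "messages": [
                {"role": "user", "content": record["prompt"]},
                {"role": "assistant", "content": record["completion"]},
            ]
        }
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


line = '{"prompt": "Bootstrap AnySecret for AWS", "completion": "Use `anysecret config...`"}'
converted = to_messages(json.loads(line))
print(converted["messages"][0])
```

Normalizing everything to one shape up front means the tokenization and validation steps only ever have to handle a single format.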

Quality Validation Pipeline

Before training begins, the system runs comprehensive data validation:

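def validate_dataset(dataset, tokenizer, max_length=512, sample_size=100):
    """Pre-training data quality checks"""
    issues = []
    texts = [x["text"] for x in dataset]

    # Check for empty/malformed samples
    empty_count = sum(1 for t in texts if not t.strip())
    if empty_count:
        issues.append(f"{empty_count} empty samples")

    # Analyze token distribution on the first sample_size texts
    # (max_length defaults to the max_length value in configs/train.yaml)
    lengths = [len(tokenizer.encode(t)) for t in texts[:sample_size]]
    too_long = sum(1 for n in lengths if n > max_length)
    if too_long:
        issues.append(f"{too_long} of {len(lengths)} sampled texts exceed {max_length} tokens")

    # Detect duplicates
    if len(texts) != len(set(texts)):
        issues.append("Duplicate samples detected")

    # Validate formatting (validate_sample_format is a project helper not shown here)
    for index, sample in enumerate(dataset):
        if not validate_sample_format(sample):
            issues.append(f"Invalid format at index {index}")

    return issues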

Experiment Tracking & Reproducibility

Timestamped Run Management

Every training run creates a timestamped directory with complete experiment state:

outputs/runs/2024-12-23_14-30-45/
├── model/                    # Fine-tuned model weights
│   ├── adapter_config.json   # LoRA configuration
│   ├── adapter_model.safetensors
│   └── tokenizer.json
├── config_used.yaml          # Exact config snapshot
├── metrics.json              # Final evaluation metrics  
├── logs/                     # TensorBoard training logs
└── artifacts/
    ├── data_stats.json       # Dataset statistics
    └── training_log.txt      # Detailed training output

Symlink Management

The system maintains a `latest` symlink that points to the most recent successful run, so model-loading and comparison scripts can use a stable path.
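
One way to implement this pattern with only the standard library is sketched below. The function names create_run_dir and update_latest are hypothetical, and swapping the link via a temporary name plus rename is an assumption about how to keep the link valid at every instant, not a description of the pipeline's actual code. It assumes a POSIX filesystem.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


def create_run_dir(root, now=None):
    """Create a run directory named after the current timestamp."""
    now = now or datetime.now()
    run_dir = Path(root) / now.strftime("%Y-%m-%d_%H-%M-%S")
    (run_dir / "model").mkdir(parents=True)
    (run_dir / "logs").mkdir()
    (run_dir / "artifacts").mkdir()
    return run_dir


def update_latest(run_dir):
    """Point <root>/latest at run_dir by renaming a temporary link over it."""
    link = run_dir.parent / "latest"
    tmp = run_dir.parent / ".latest.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Relative target, so the link survives moving the whole runs directory.
    tmp.symlink_to(run_dir.name)
    os.replace(tmp, link)
    return link


root = Path(tempfile.mkdtemp()) / "runs"
first = create_run_dir(root, datetime(2024, 12, 23, 14, 30, 45))
second = create_run_dir(root, datetime(2024, 12, 23, 15, 0, 0))
update_latest(first)
link = update_latest(second)
print(link.resolve().name)  # 2024-12-23_15-00-00
```

Calling update_latest only after evaluation succeeds keeps the link pointing at a known-good run even when a later training job fails partway through.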

Training Pipeline Implementation

Distributed Training Support

The pipeline supports multiple parallelization strategies:

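import os

import torch
from torch.cuda.amp import GradScaler
from torch.nn.parallel import DistributedDataParallel


def setup_training_environment(cfg, model):
    """Configure training for available hardware"""
    scaler = None

    if cfg.optimization.gradient_checkpointing:
        # Trade compute for memory (enable before wrapping the model in DDP)
        model.gradient_checkpointing_enable()

    if cfg.distributed.enabled:
        # Multi-GPU training with DDP; torchrun sets LOCAL_RANK for each process
        torch.distributed.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        model = DistributedDataParallel(model.to(local_rank), device_ids=[local_rank])

    if cfg.optimization.mixed_precision:
        # Loss scaling is needed for fp16 autocast; bf16 runs can skip it
        scaler = GradScaler()

    return model, scaler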

Remote Training Automation

For cloud GPU training, the pipeline includes SSH automation:

# Push code and sync environment
make push REMOTE_HOST=gpu-server.local

# Initialize remote environment  
make ssh-init

# Run training in background tmux session
make ssh-train-bg TRAIN_CONFIG=train_large

# Monitor training progress
make ssh-monitor

Comprehensive Evaluation Framework

Multi-Metric Assessment

The evaluation system goes beyond simple loss metrics:

import time


def run_comprehensive_evaluation(model, eval_dataset):
    """Multi-metric model evaluation"""

    metrics = {
        'exact_match': 0,
        'jaccard_overlaps': [],
        'rouge_scores': [],
        'bleu_scores': [],
        'response_lengths': [],
        'generation_times': []
    }

    for sample in eval_dataset:
        # Generate model response and time the call
        start = time.perf_counter()
        response = generate_response(model, sample['prompt'])
        generation_time = time.perf_counter() - start
        expected = sample['completion']

        # Calculate metrics (ROUGE, BLEU, and length tracking are omitted from this excerpt)
        metrics['exact_match'] += exact_match(response, expected)
        metrics['jaccard_overlaps'].append(jaccard_similarity(response, expected))

        # Performance metrics
        metrics['generation_times'].append(generation_time)

    return aggregate_metrics(metrics)
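
The helpers exact_match and jaccard_similarity are not shown in the excerpt above. A minimal sketch of how they might look follows; the normalization step (lowercasing and stripping punctuation) is an assumption, and the project's own helpers may differ.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(response, expected):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(response) == normalize(expected))


def jaccard_similarity(response, expected):
    """Token-set overlap |A & B| / |A | B|, defined as 1.0 for two empty sets."""
    a = set(normalize(response).split())
    b = set(normalize(expected).split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(exact_match("Use AnySecret config.", "use anysecret config"))        # 1
print(jaccard_similarity("use the aws provider", "use the gcp provider"))  # 0.6
```

Stripping punctuation is convenient for prose answers but also erases the dashes in CLI flags such as --provider, so a stricter normalizer may suit command-heavy datasets better.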

Baseline Comparison

Every model evaluation includes automatic baseline comparison:

def compare_with_baseline(fine_tuned_metrics, baseline_metrics):
    """Determine if fine-tuned model beats baseline"""
    
    improvements = {}
    for metric in ['exact_match', 'rouge1', 'bleu']:
        baseline = baseline_metrics[metric]
        improvement = fine_tuned_metrics[metric] - baseline
        improvements[metric] = {
            'improvement': improvement,
            # Relative change is undefined when the baseline scored zero
            'percent_change': (improvement / baseline) * 100 if baseline else None
        }
    
    # Quality gate: Must improve on key metrics
    passes_gate = (
        improvements['exact_match']['improvement'] > 0.02 and
        improvements['rouge1']['improvement'] > 0.05
    )
    
    return passes_gate, improvements

Deployment Automation

Multi-Target Publishing

The pipeline supports publishing to multiple deployment targets:

# Local model archive
make publish

# HuggingFace Hub with model card
make publish-hf

# Ollama-compatible GGUF conversion
make gguf QUANT=Q5_K_M

# Docker container for serving
make docker-build

Model Card Generation

Deployment includes automatic documentation:

def generate_model_card(run_dir, metrics, config):
    """Auto-generate comprehensive model documentation"""

    card_template = """
# {model_name} - Domain Specialist

## Model Details
- Base Model: {base_model}
- Training Date: {date}
- Fine-tuning Method: LoRA (r={lora_r})
- Dataset Size: {dataset_size} examples

## Performance
- Exact Match: {exact_match:.1%}
- ROUGE-1: {rouge1:.3f}
- Improvement over baseline: {improvement:.1%}

## Training Configuration
{training_config}

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("{repo_id}")
tokenizer = AutoTokenizer.from_pretrained("{repo_id}")
```
"""

    # Assumption: config and metrics are flat, dict-like objects that together
    # supply every placeholder used in the template above.
    template_vars = {**config, **metrics}

    return card_template.format(**template_vars)


Production Operations

Monitoring & Alerting

The pipeline includes built-in monitoring for production training:

def monitor_training_health(metrics_log):
    """Monitor training for issues"""
    
    alerts = []
    
    # Check for divergence
    if metrics_log['loss'][-1] > metrics_log['loss'][0] * 1.5:
        alerts.append("Training loss diverging")
        
    # Check gradient health
    if metrics_log['grad_norm'][-1] > 100:
        alerts.append("Exploding gradients detected")
        
    # Check GPU utilization
    if metrics_log['gpu_util'][-1] < 0.8:
        alerts.append("Low GPU utilization")
        
    return alerts

Cost Optimization

Built-in cost tracking for cloud training:

def calculate_training_cost(start_time, end_time, instance_type, num_training_samples):
    """Track training costs for budgeting"""

    duration_hours = (end_time - start_time).total_seconds() / 3600

    # On-demand hourly rates in USD; these vary by region and change over time
    instance_costs = {
        'p3.2xlarge': 3.06,     # AWS, 1x V100
        'p4d.24xlarge': 32.77,  # AWS, 8x A100
        'g5.xlarge': 1.006      # AWS, 1x A10G
    }

    total_cost = duration_hours * instance_costs[instance_type]

    return {
        'duration_hours': duration_hours,
        'cost_per_hour': instance_costs[instance_type],
        'total_cost': total_cost,
        'cost_per_sample': total_cost / num_training_samples
    }

Key Improvements Over Initial Implementation

The production pipeline represents significant advances over the initial AnySecret training:

Reliability

Comprehensive error handling and recovery
Automatic checkpoint saving and resumption
Health monitoring throughout training

Reproducibility

Complete configuration versioning
Deterministic training with seed management
Environment capture and restoration
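
Seed management of the kind listed above is often a single helper called at the start of every run. The sketch below is an illustration, not the pipeline's code: set_seed is a hypothetical name, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def set_seed(seed):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    # Only affects subprocesses started after this point; hash randomization
    # for the current interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding alone does not guarantee bit-identical GPU results, since some CUDA kernels are nondeterministic, so recording the seed in the run's config snapshot is best treated as one part of reproducibility rather than the whole of it.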

Efficiency

Multi-GPU training support
Memory optimization for consumer hardware
Remote training automation for cost optimization

Quality Assurance

Systematic baseline comparison
Comprehensive evaluation metrics
Automated quality gates before deployment

Operations

One-command deployment to multiple targets
Automated documentation generation
Cost tracking and budget alerts

Real-World Impact

The production pipeline has enabled systematic experimentation that led to measurable improvements:

Training efficiency: 3x faster iteration cycles through automation
Model quality: 15% average improvement in exact match scores through systematic evaluation
Cost reduction: 60% lower training costs through remote GPU optimization
Team productivity: Non-experts can now train and deploy models confidently

The pipeline design has been open-sourced as part of the LLM Training Workshop, enabling teams to implement similar production-grade training infrastructure.

Next Steps

The MLOps pipeline continues evolving with planned enhancements:

Advanced monitoring: Integration with MLflow for experiment comparison
Automated hyperparameter tuning: Optuna integration for systematic optimization
Multi-cloud support: Seamless training across AWS, GCP, and Azure
Federated learning: Support for distributed training across multiple organizations

This production infrastructure represents the foundation for scaling specialized model training from individual projects to enterprise-wide AI initiatives.