Character AI Training — Beyond Static Datasets to Human Feedback • Ylli Prifti

The Character AI Challenge

Character AI isn’t just about training a model to speak like someone—it’s about capturing personality, behavior patterns, and response styles that feel authentic to users who know the character. This ongoing university research project tackles a problem that generic fine-tuning approaches struggle with: how do you measure and improve character authenticity beyond traditional metrics?

The project combines transcript-based training with human feedback integration, creating a pipeline from initial fine-tuning to iterative behavioral refinement based on real user interactions.

🎭 The Core Question: How do you train an AI to not just mimic dialogue but embody the psychological patterns that make a character recognizable and engaging?

Why Character-Specific Training Matters

Character AI faces unique challenges that general-purpose models can’t solve:

Consistency across conversations: Characters need persistent personality traits
Behavioral authenticity: Users who know the character will spot inconsistencies
Context-aware responses: Characters should reflect their background and relationships
Emotional range: Different characters express emotions differently

The project uses movie transcript data formatted as character-dialogue pairs, but the real innovation comes in the feedback integration system that allows iterative improvement based on user assessments of character authenticity.

2. Environment

2.1 🚀 Setting Up Your LLM Training Environment: What’s Happening Behind the Script

Before diving into large language model (LLM) training, getting your development environment right is half the battle. The script below automates just that: it builds a clean, GPU-accelerated Python environment prepped for training, plus llama.cpp, a high-performance C++ backend for running LLMs like LLaMA on your local machine.

Let’s break down what this script does and why it matters:

2.2 🧱 Core System Setup

The script starts by updating your package list and installing essential development tools (build-essential, python3, python3-pip, and python3-venv). These are the Lego bricks of any Python-based ML project. Without them, you’d be stuck chasing missing compilers or Python modules.

2.3 ⚙️ GPU & CUDA Detection

It then checks for NVIDIA drivers using nvidia-smi. If they’re missing, it installs nvidia-driver-535 and gives you a heads-up to reboot. This step is crucial. Your GPU is your best friend when training or running inference on LLMs, and without CUDA support, PyTorch will fall back to CPU.

The script also verifies that PyTorch detects CUDA (torch.cuda.is_available()). If it doesn’t, it installs a CUDA-enabled version of PyTorch using the cu118 wheel. This ensures GPU acceleration is actually used when training or running models.
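The same CUDA check can be made from Python before a training run starts. The sketch below is a minimal illustration rather than part of the setup script; the helper name `pick_device` is hypothetical, and PyTorch is imported lazily so the check still works on a machine where it is not installed.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch is installed and sees a GPU, else 'cpu'."""
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Training device: {pick_device()}")
```

The returned string matches the two values the training script accepts for `--device`.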

2.4 🔳 Python Version & Virtual Environment

Next, it checks that your Python version meets the minimum required (3.12.3) and exits early if not. This ensures compatibility with libraries and avoids weird bugs later.
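A minimum-version check is easy to get subtly wrong when versions are compared as strings. As a hedged sketch (the helper names `parse_version` and `meets_minimum` are hypothetical), the same rule can be written in Python with tuples, which compare element by element. Note that a two-part version such as 3.12 compares as lower than 3.12.3, so all three components should be included.

```python
import sys

MIN_PYTHON = (3, 12, 3)  # mirrors PYTHON_VERSION_REQUIRED in the setup script


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.12.3' into (3, 12, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(current: tuple[int, ...],
                  minimum: tuple[int, ...] = MIN_PYTHON) -> bool:
    """Tuples compare left to right, so (3, 13, 0) >= (3, 12, 3) is True."""
    return current >= minimum


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    print(f"Python {current} meets minimum: {meets_minimum(current)}")
```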

It then sets up a virtual environment in a directory called ../clean_env, activates it, and installs Python packages listed in requirements.txt. This isolates your project dependencies and keeps your system Python untouched: a must for reproducibility and sanity.

2.5 📁 Building `llama.cpp` with CUDA

Finally, the script checks for the llama.cpp repo, clones it if missing, and builds it from source with CUDA support enabled. llama.cpp is a blazing-fast C++ implementation that makes it possible to run quantized LLMs (like LLaMA or Mistral) efficiently, even on consumer GPUs. Compiling with CUDA enables GPU-accelerated inference; a game-changer when you’re working with large models locally.

2.6 🎉 Done and Dusted

After all this, you’re left with a fully-isolated Python environment, CUDA-enabled PyTorch, and a compiled version of llama.cpp ready to load quantized LLMs. Whether you’re training, fine-tuning, or just playing around with LLaMA models, this script gets you from zero to hero in one go.

Now you can focus on the real fun: making your model say smart (or wildly unhinged) things.

#!/bin/bash

set -e  # Exit on error
set -o pipefail  # Exit if any command in a pipeline fails

# Constants
VENV_DIR="../clean_env"
PYTHON_VERSION_REQUIRED="3.12.3"
REQUIREMENTS_FILE="requirements.txt"
LLAMA_DIR="../llama.cpp"

echo "🚀 Starting setup..."

# Function to check if a package is installed
is_installed() {
    dpkg -l | grep -qw "$1"
}

# Update and install core dependencies
echo "🔄 Updating package lists..."
sudo apt update -y

echo "📦 Installing necessary packages..."
sudo apt install -y python3 python3-pip python3-venv build-essential

# Ensure NVIDIA drivers and CUDA support
if command -v nvidia-smi &>/dev/null; then
    echo "✅ NVIDIA drivers detected."
else
    echo "❌ NVIDIA drivers not found. Installing..."
    sudo apt install -y nvidia-driver-535  # Update with latest driver version if needed
    echo "⚠️ Please reboot the system for NVIDIA drivers to take effect!"
fi

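# Check if PyTorch can detect CUDA
# Note: the virtual environment is not active yet, so this check and the pip install below target the system Python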
if python3 -c "import torch; print(torch.cuda.is_available())" 2>/dev/null | grep -q "True"; then
    echo "✅ CUDA is available for PyTorch."
else
    echo "❌ CUDA not found for PyTorch. Installing PyTorch with CUDA..."
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
fi

# Ensure Python version compatibility
# Compare the full major.minor.patch version; a two-part value such as "3.12" would sort below "3.12.3"
PYTHON_VERSION=$(python3 -c 'import sys; print(".".join(map(str, sys.version_info[:3])))')

if printf '%s\n' "$PYTHON_VERSION_REQUIRED" "$PYTHON_VERSION" | sort -V | head -n 1 | grep -qxF "$PYTHON_VERSION_REQUIRED"; then
    echo "✅ Python version is $PYTHON_VERSION (>= $PYTHON_VERSION_REQUIRED)."
else
    echo "❌ Python $PYTHON_VERSION is too old. Please install Python $PYTHON_VERSION_REQUIRED+."
    exit 1
fi

# Ensure virtual environment tools are installed
if ! dpkg -l | grep -q python3-venv; then
    echo "⚠️ python3-venv not found. Installing..."
    sudo apt install -y python3-venv
fi

# Create and activate virtual environment
if [[ ! -d "$VENV_DIR" ]]; then
    echo "🐍 Creating virtual environment at $VENV_DIR..."
    python3 -m venv "$VENV_DIR"
fi

echo "🔧 Activating virtual environment..."
source "$VENV_DIR/bin/activate"

# Upgrade pip and install dependencies
echo "📦 Installing requirements from $REQUIREMENTS_FILE..."
pip install --upgrade pip
pip install -r "$REQUIREMENTS_FILE"

# Build llama.cpp with CUDA support
if [[ ! -d "$LLAMA_DIR" ]]; then
    echo "📥 Cloning llama.cpp into $LLAMA_DIR..."
    git clone https://github.com/ggerganov/llama.cpp.git "$LLAMA_DIR"
fi

echo "🔨 Building llama.cpp with CUDA..."
cd "$LLAMA_DIR"
rm -rf build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON  # older llama.cpp checkouts used -DLLAMA_CUDA=on
make -j$(nproc)

echo "🎉 Setup complete! Virtual environment is ready."

3. Preliminary considerations

3.1 Choosing a Base Model for Training

Choosing the right base model is all about trade-offs between architecture, size, inference speed, and deployment targets. Transformer-based causal LMs (like Mistral, LLaMA, or Phi) offer a solid foundation, especially if you’re training character-driven dialogues. Smaller models (3B–7B) train faster and can fit comfortably on a 24–48GB GPU, while larger ones (13B–70B) require serious VRAM but reward you with richer representations. For inference efficiency, quantization is key — formats like Q4_K_M or Q5_1 can reduce memory usage dramatically with minimal loss in quality.
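To make the size trade-off concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: the function name `weight_memory_gb` is hypothetical, the bits-per-weight figures are round illustrative values rather than exact numbers for any GGUF format, and the estimate ignores the KV cache, activations, and (for training) gradients and optimizer state.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Bits-per-weight values are rough illustrations, not format specifications.
    for label, bits in [("fp16", 16.0), ("8-bit", 8.0), ("~4.5-bit", 4.5)]:
        for n_params in (7e9, 13e9, 70e9):
            size = weight_memory_gb(n_params, bits)
            print(f"{n_params / 1e9:>4.0f}B @ {label:>8}: {size:6.1f} GB")
```

By this estimate a 7B model needs about 14 GB in fp16, which is why quantization matters even for the smaller models when the target GPU also has to hold the context.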

For this project, my dev environment includes two machines with an NVIDIA 3080 and an NVIDIA 3090. The final model will run on a machine that contains two Quadro RTX 8000 GPUs (48 GB each), and I can train on up to 8 H100/A100 GPUs connected via NVLink (I am unlikely to need all of them, since the final quantized model needs to fit in 48 GB).

The final trained model will need to run on Open-WebUI (the user interface) with Ollama, which restricts the final output format to GGUF.

3.2 How Will We Measure Success?

Success isn’t just whether the model trains without crashing; it’s whether the model behaves how we want it to. For this project, that means: does it extend the thoughts, cadence, and emotion of each character? We will start by creating a holdout test set: snippets of character dialogue the model hasn’t seen during training. We then compare generations from various checkpoints; subjective evaluation helps here (e.g., “does this sound like Tilda or Ahsan?”). We can also run BLEU, ROUGE, or perplexity, but these aren’t great for creative generation. Another good trick: we will use Open-WebUI to run the same prompt across models with different parameters (base model, LoRA config, epochs) and manually rank the outputs, spotting which fine-tune is better aligned. Open-WebUI supports a side-by-side chat interface with inference history, so outputs can be compared in a conversation flow. We should be able to feel the progress long before the metrics confirm it.
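One way to build the holdout set is a per-speaker split, so that every character with enough lines is represented in evaluation. The following is a minimal sketch, not the project's actual code: the function name `split_holdout`, the 10% fraction, and the fixed seed are all assumptions. Shuffling individual lines also discards conversational context, so a split by scene may be preferable in practice.

```python
import random
from collections import defaultdict


def split_holdout(records, holdout_fraction=0.1, seed=42):
    """Split dialogue records into train/holdout lists, stratified by speaker."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["speaker"]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, holdout = [], []
    for speaker in sorted(by_speaker):
        lines = by_speaker[speaker][:]
        rng.shuffle(lines)
        # Hold out at least one line, unless the speaker only has one line.
        if len(lines) > 1:
            n_holdout = max(1, round(len(lines) * holdout_fraction))
        else:
            n_holdout = 0
        holdout.extend(lines[:n_holdout])
        train.extend(lines[n_holdout:])
    return train, holdout
```

Each record is expected to be a dict with `speaker` and `text` keys, matching the jsonl format used for the transcript.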

3.3 Extensions Beyond Training on Static Data

Static datasets will get us started, but for the model to be useful in the wild, it needs to talk to the world. We will need to consider Retrieval-Augmented Generation (RAG) and agents. RAG lets the model pull from external sources (think vector search over documents, or even real-time web data) and blend that context into its outputs. For example, Tilda could respond not just with wisdom, but with relevant quotes from new research. Agents push that further by letting models plan actions, hit APIs, and loop reasoning. We will plug in Open-WebUI to orchestrate these flows, using the LLM as a thinking brain and APIs as limbs. With live data streams (e.g., X/Bluesky, scholarly feeds), the fine-tuned model evolves from static mimic to real-time conversational oracle.
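The retrieval step can be illustrated with a toy example. This sketch swaps real embeddings and a vector store for a bag-of-words cosine similarity so that it runs with the standard library alone; the function names, the prompt layout, and the sample documents are all hypothetical and only show the shape of the flow: score documents against the question, keep the top matches, and place them in the prompt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]


def build_prompt(character, question, documents, k=1):
    """Prepend retrieved context to a character-style prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (f"Context:\n{context}\n\n"
            f"Answer as {character}.\nUser: {question}\n{character}:")


if __name__ == "__main__":
    docs = [
        "Chests open when tapped and show what is inside.",
        "Command blocks can run code before mods are installed.",
    ]
    print(build_prompt("Benjy", "What happens when chests are tapped?", docs))
```

In a real deployment the similarity function would be replaced by an embedding model and a vector index, but the prompt assembly step stays much the same.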

3.4 Next Steps Beyond Text Inference: Voice + Vision

Once the model nails text generation, it’s time to unlock more human modalities, starting with voice and vision. On the voice side, we can clone speakers like Tilda using tools like Bark, XTTS, or ElevenLabs, turning generated text into natural-sounding speech. This creates a seamless, cinematic interface for interactive storytelling or AI-powered performances. For vision, integrating models like CLIP, BLIP-2, or LLaVA lets a character not only speak but see. Imagine a character who can describe a scene from a film still, respond to memes, or interpret mood boards. Add OCR, facial recognition, or video summarization, and we’ve got a truly multimodal actor. Eventually, we can combine all three (text, voice, vision) into a lightweight agent that operates in real time. Whether we’re building an assistant, a character simulator, or a docu-bot, stepping beyond plain text opens the door to richer, more intuitive human-AI interaction.

4. Preparing input data

The training module takes as input a data folder (default ./data) and a file format (default jsonl, or alternatively raw for raw text files).

The movie transcript has been prepared in jsonl as shown in the example below:

{"speaker": "Tilda", "text": "The child thought: if only it was possible to hold worlds in one's hands with respect"}
{"speaker": "Benjy", "text": "7, O, i did i just did the code powered by command blocks which I might just do just before mods."}
{"speaker": "Lily", "text": "And how did you know to do that?"}
{"speaker": "Benjy", "text": "It told me in this book"}
{"speaker": "Lily", "text": "and then where did you find the book"}
{"speaker": "Benjy", "text": "in the chest. chests when tapped open up and you can see what's inside them"}
{"speaker": "Lily", "text": "so it's like someone left the password to open that"}

5. Running the training

We will run a number of training runs with different base models and parameters (hence fine-tuning).

For simplicity we will start with a small model: mistralai/Mistral-7B-Instruct-v0.3. During each training run we will consider the following parameters:


(clean_env) ubuntu@luminous-einstein:~/Dev/finetune-lab$ python lora-model-training.py --help

 Usage: lora-model-training.py [OPTIONS] MODEL_NAME

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    model_name      TEXT  [default: None] [required]                                                                                                                                                                                                                                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --device                       TEXT     Device to use for training ('cpu' or 'cuda') [default: None] [required]                                                                                                                                                                                                  │
│ *  --batch-size                   INTEGER  Batch size for training [default: None] [required]                                                                                                                                                                                                                       │
│ *  --max-length                   INTEGER  Maximum sequence length [default: None] [required]                                                                                                                                                                                                                       │
│ *  --gradient-accumulation        INTEGER  Gradient accumulation steps [default: None] [required]                                                                                                                                                                                                                   │
│ *  --epochs                       INTEGER  Number of training epochs [default: None] [required]                                                                                                                                                                                                                     │
│    --data-folder                  TEXT     Folder containing the training data [default: ./data]                                                                                                                                                                                                                    │
│    --format                       TEXT     Data format: 'jsonl' or 'raw' [default: jsonl]                                                                                                                                                                                                                           │
│    --extensions                   TEXT     File extensions to include in the training data [default: .txt, .csv]                                                                                                                                                                                                    │
│    --install-completion                    Install completion for the current shell.                                                                                                                                                                                                                                │
│    --show-completion                       Show completion for the current shell, to copy it or customize the installation.                                                                                                                                                                                         │
│    --help                                  Show this message and exit.                                                                                                                                                                                                                                              │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Fine-Tuning with LoRA: A Deep Dive into My Training Script

When building and adapting large language models to custom domains or characters, fine-tuning can quickly become costly and memory-intensive. To address this, my training script leverages LoRA (Low-Rank Adaptation) along with 4-bit quantization, creating an efficient yet powerful workflow for fine-tuning transformer-based models like Mistral.

🚀 4-Bit Quantization

First, the model is loaded using Hugging Face’s AutoModelForCausalLM with a BitsAndBytesConfig set to:

load_in_4bit=True
bnb_4bit_quant_type="nf4"
bnb_4bit_compute_dtype=torch.float16
bnb_4bit_use_double_quant=True

This drastically reduces memory usage, making it feasible to train large models on a single GPU.
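Put together, the loading step might look like the sketch below. It uses the four settings listed above with the Hugging Face `BitsAndBytesConfig` and `AutoModelForCausalLM.from_pretrained` calls. The function name `load_quantized_model`, the `QUANT_SETTINGS` dict, and `device_map="auto"` are assumptions, not details taken from the actual script. The heavy imports sit inside the function so the module can be imported on a machine without the GPU stack.

```python
# Settings taken from the list above; the dict itself is an illustrative wrapper.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def load_quantized_model(model_name: str):
    """Load a causal LM with 4-bit NF4 weights and float16 compute."""
    # Imported lazily: requires torch, transformers, and bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,
        **QUANT_SETTINGS,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # assumption: let accelerate place the layers
    )
```

A call such as `load_quantized_model("mistralai/Mistral-7B-Instruct-v0.3")` would download the weights on first use, so it is best run once the environment and disk space are confirmed.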

🧠 LoRA Config Explained

We inject LoRA adapters into the attention layers (q_proj and v_proj) with:

r=8: Rank of low-rank matrices (trade-off between quality and size)
lora_alpha=32: Scaling factor
lora_dropout=0.05: Regularization
bias="none": No bias adaptation

These adapters are trained while the original weights remain frozen. This massively reduces the number of trainable parameters, enabling fast and efficient fine-tuning.
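The size of that reduction can be worked out directly. For a frozen weight of shape d_out x d_in, LoRA trains two small matrices, A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) parameters. The sketch below applies this to q_proj and v_proj. The layer dimensions are assumptions based on the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, 32 layers), not values read from the training script.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


# Assumed Mistral-7B dimensions (from the published model configuration).
HIDDEN = 4096   # input size of q_proj and v_proj
KV_DIM = 1024   # v_proj output: 8 key/value heads x 128 dims each
LAYERS = 32
R, ALPHA = 8, 32

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * LAYERS
scaling = ALPHA / R  # the adapter output is multiplied by lora_alpha / r

if __name__ == "__main__":
    print(f"Trainable LoRA parameters: {total:,}")
    print(f"Adapter scaling factor: {scaling}")
```

Under these assumptions the adapters hold roughly 3.4 million parameters, a small fraction of a percent of a 7B model.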

🧪 Training Arguments

The script uses Hugging Face’s Trainer with key arguments:

fp16=True: Mixed-precision training
gradient_checkpointing=True: Saves memory by recomputing activations on backward pass
gradient_accumulation_steps: Allows small batch sizes while simulating a larger effective batch size
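The interaction between `--batch-size` and `--gradient-accumulation` is simple arithmetic, sketched below with hypothetical helper names. The step count is an approximation: the exact rounding depends on the Trainer version and on whether the last partial batch is dropped.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer update."""
    return batch_size * gradient_accumulation * num_devices


def approx_optimizer_steps(num_examples: int, batch_size: int,
                           gradient_accumulation: int, epochs: int,
                           num_devices: int = 1) -> int:
    """Rough count of optimizer updates over the whole run."""
    per_update = effective_batch_size(batch_size, gradient_accumulation,
                                      num_devices)
    return math.ceil(num_examples / per_update) * epochs


if __name__ == "__main__":
    # Example values only; they are not the settings used in the project.
    print(effective_batch_size(2, 8))
    print(approx_optimizer_steps(1000, 2, 8, epochs=3))
```

With a batch size of 2 and 8 accumulation steps, each update sees 16 examples while only 2 are held in GPU memory at a time.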

This combination of quantization, LoRA, and memory optimizations allows quick iteration on custom datasets (e.g., movie scripts, character dialogues) without massive infrastructure.

You get performance, portability, and power — all in one go.

⚠️ When LoRA Isn’t the Right Tool

While LoRA is fantastic for targeted fine-tuning with limited resources, it may not be suitable for:

Training from scratch or full model re-training, where full weight updates are necessary
Highly dynamic tasks requiring architectural changes or adapter placement beyond the attention layers
Non-transformer architectures, where LoRA integration may be unsupported or ineffective

In such cases, consider alternatives like:

Full fine-tuning (if compute isn’t a bottleneck)
QLoRA, the 4-bit quantized form of LoRA that this script's NF4 setup already follows, when memory is the main constraint at larger scale
Prompt tuning or adapter fusion for modular training on multi-task setups

Choosing the right method depends on your compute budget, data size, and flexibility needs.

The Human Feedback Integration Pipeline

After initial training produces candidate models with reasonable character behavior, the real innovation begins: systematic human feedback collection and integration.

Deployment for Evaluation

The most promising models are deployed on Open-WebUI, creating an accessible interface where researchers and students can interact with different character models. This isn’t just for demonstration—it’s a structured data collection phase.

Structured Feedback Collection

Rather than relying on subjective impressions, the system collects specific behavioral assessments through questionnaires focused on:

“Match behavior”: Does the response align with the character’s established personality?
“Missing expected behavior”: What character traits or responses are absent?
Behavioral specificity: Detailed feedback on character-specific mannerisms, speech patterns, and decision-making

This creates a dataset of human judgments about character authenticity that goes far beyond traditional BLEU or ROUGE scores.
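A feedback dataset of this kind can be represented with a small record type and a couple of aggregations. The sketch below is an assumption about how such data might be structured, not the project's actual schema: the `FeedbackRecord` fields, the helper names, and the sample behavior labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FeedbackRecord:
    """One reviewer judgment about one model response (hypothetical schema)."""
    model_id: str
    character: str
    match_behavior: bool  # did the response fit the character?
    missing_behaviors: list[str] = field(default_factory=list)


def match_rate(records):
    """Fraction of responses judged in character, per model."""
    totals, matches = Counter(), Counter()
    for record in records:
        totals[record.model_id] += 1
        matches[record.model_id] += int(record.match_behavior)
    return {model: matches[model] / totals[model] for model in totals}


def top_missing(records, n=3):
    """Most frequently reported missing behaviors across all records."""
    counts = Counter(b for r in records for b in r.missing_behaviors)
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        FeedbackRecord("run-1", "Tilda", True),
        FeedbackRecord("run-1", "Tilda", False, ["uses metaphor"]),
        FeedbackRecord("run-2", "Tilda", False, ["uses metaphor", "asks questions"]),
    ]
    print(match_rate(sample))
    print(top_missing(sample))
```

The per-model match rate gives a number to compare checkpoints on, while the missing-behavior counts point at what the next round of training data should cover.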

The Iterative Loop (In Progress)

The collected feedback becomes training data for the next iteration:

Behavioral analysis: Feedback is analyzed to identify systematic gaps in character portrayal
Augmented training data: Missing behaviors are translated into training examples
Refined fine-tuning: The model undergoes additional training incorporating behavioral corrections
Validation: New models are tested against both original metrics and human feedback criteria

Why This Approach Matters

Traditional fine-tuning treats character training as a static problem—train once, deploy forever. But character authenticity is subjective and context-dependent. The human feedback loop creates a path for continuous improvement based on actual user interactions rather than proxy metrics.

This methodology could extend beyond entertainment to any domain where AI needs to embody specific personas: customer service representatives, educational tutors, or therapeutic assistants.

Current Status and Next Steps

Completed: Initial LoRA fine-tuning, Open-WebUI deployment, feedback collection system
In Progress: Behavioral feedback analysis and translation to training data
Planned: Second iteration training run with integrated feedback data

The project demonstrates that character AI isn’t solved by better training data alone—it requires systematic human feedback integration to achieve authentic personality embodiment.

Technical Architecture

The complete pipeline leverages:

LoRA fine-tuning for efficient character adaptation
Open-WebUI for accessible interaction interfaces
Structured feedback collection for behavioral assessment
GGUF conversion for efficient inference deployment
Iterative training incorporating human behavioral feedback

This represents a significant step beyond static fine-tuning toward dynamic, feedback-driven character AI development.
