How to train your dragon GPT model

Ylli P, February 20, 2025 (updated February 21, 2025)

In this series, I am going to share part of my pro bono research work with Birkbeck, University of London, School of Mathematical and Computational Sciences, in collaboration with the School of Creative Arts, Culture and Communication. The project idea, as hinted in the title (dragon aside), is to train a GPT model on movie transcripts produced by the School of Creative Arts students at Birkbeck, so that when the movies launch, the audience can chat with their preferred character via the GPT model.

By the end of this article, we will have set up the code for iterative model training, trained and tested a few models, and configured the models to work with a web interface for end users. We will then be ready for the next phase of fine-tuning our movie GPT model. See below the training in progress on a single Nvidia RTX A4000 with the GPT-2 base model.

Initial considerations

The intent sounds easier on paper than it may turn out to be: we have to set up, from the outset, a process that is going to require many iterations of training, fine-tuning and sanitizing the transcripts. There is limited to no budget for the project to make use of cloud GPUs; however, we have access to a university machine with two RTX 8000 Quadro cards (48GB VRAM each). That should allow us to train with a decent base model. Additionally, in my home lab I have an RTX 3080 (10GB VRAM), which might not be enough for training any usable models but is good enough for the initial iterations with small models (e.g. GPT-2, ~124M parameters).

Phase 1: Setting Up the Environment

As with any good exploratory experiment, taking some time to set up a good basis is important. Because we will be switching back and forth between different machines (the local machine, the university machine and, very likely once we have a training model, more powerful machines for training with a bigger base model), we set up a GitHub repo to keep the code and the data files in sync. We are keeping the repo private during the exploratory phases of this research project, with the intention of making the code and data necessary for reproducibility public once we write the final report.

We use Ubuntu Server on our machines, so as a first step we create an init.sh script that installs the needed packages on Ubuntu (if not already available): Nvidia drivers, CUDA drivers and, more practically, Python virtual environments. Additionally, we rely on the init script to create a new, clean Python virtual environment and install the packages from the requirements file. Below is a snippet from the init.sh file.

```bash
#!/bin/bash
set -e           # Exit on error
set -o pipefail  # Exit if any command in a pipeline fails

# Constants
VENV_DIR="../clean_env"
PYTHON_VERSION_REQUIRED="3.8"
REQUIREMENTS_FILE="requirements.txt"

...

echo "📦 Installing necessary packages..."
sudo apt install -y python3 python3-pip python3-venv build-essential

# Ensure NVIDIA drivers and CUDA support
if command -v nvidia-smi &>/dev/null; then
    echo "✅ NVIDIA drivers detected."
else
    echo "❌ NVIDIA drivers not found. Installing..."
    sudo apt install -y nvidia-driver-535  # Update with latest driver version if needed
    echo "⚠️ Please reboot the system for NVIDIA drivers to take effect!"
fi

# Check if PyTorch can detect CUDA
if python3 -c "import torch; print(torch.cuda.is_available())" 2>/dev/null | grep -q "True"; then
    echo "✅ CUDA is available for PyTorch."
else
    echo "❌ CUDA not found for PyTorch. Installing PyTorch with CUDA..."
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
fi

...
```
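For reference, the requirements file the script installs from contains, at this stage, roughly the libraries discussed in the next section. Treat this as an illustrative sketch rather than the definitive list; the actual file in the repo pins versions and evolves with the project:

```text
# requirements.txt (illustrative contents)
torch
torchvision
torchaudio
transformers
datasets
accelerate
typer
```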
PyTorch vs TensorFlow

As stated in the previous paragraphs, we are going to research and fine-tune a model that learns from movie transcripts and is able to converse as one of the movie characters. We will be using base models published on Hugging Face (from here on, HF) and some of the HF tools. Most HF models and tools are built with and for PyTorch. While this is not exclusive, the community support and tooling availability for PyTorch and HF-published models is greater. TensorFlow, while it generally has a steeper learning curve, does slightly better in performance and production deployment tooling. Conversion and interoperability between the two is also possible using ONNX (Open Neural Network eXchange) libraries. With that in mind, we are going to write our training program using the PyTorch libraries.

Requirements and other libraries

accelerate: While we intend to start small (e.g. GPT-2, ~124M parameters), the working model is very likely to be a model with 7B parameters (if resources allow for it, an instruct model of 20B parameters would be a better fit for what we have in mind). The VRAM and GPU requirements will force us to distribute the workload across, at the very least, multiple GPUs. We will use accelerate, a package from HF, to distribute the workload when required.

torch: The PyTorch libraries, published on PyPI as torch. In their own words: 'PyTorch is a highly flexible and widely used deep learning framework that provides essential tools for building and training neural networks. It excels in dynamic computation graphing, which allows for more intuitive model development and debugging compared to static graph frameworks. With native support for GPUs, PyTorch efficiently handles large-scale matrix operations and computations, which are crucial for training large language models.'

transformers: the HF library for building on top of existing pre-trained models. In their own words: '... is a powerful and versatile toolkit designed for working with state-of-the-art natural language processing (NLP) models, particularly transformer-based architectures like BERT, GPT, and T5. It provides pre-trained models and easy-to-use interfaces for a variety of NLP tasks, including text classification, translation, question-answering, and more. The library abstracts the complexities involved in implementing these complex models from scratch, allowing developers and researchers to quickly fine-tune and deploy models for their specific tasks.'

While we will be using other auxiliary libraries, for example for running the models or for cleaning and adapting the input datasets, the above represent the main libraries used for training and fine-tuning during this project.

Phase 2: Setting up Model Training

In this section, we go through the steps taken to write a custom class for model training that is going to help us through the multiple iterations of training and fine-tuning the model.

Our Training Data

The training data (the movie transcripts) is in the form of large text files. We set up a folder (namely ./data) where we can drop our training data, and set up our training class to read all files from the data folder that match certain extensions.

```python
...

# Global parameters
default_data_folder = "./data"
default_extensions = ['.txt', '.csv']
results_folder = "../results"
logs_folder = "../logs"

...

def load_text_files(self, folder_path: str, extensions: List[str]):
    text_data = ""
    for file_path in Path(folder_path).glob("*.*"):
        if file_path.suffix.lower() in extensions:
            with open(file_path, 'r', encoding='utf-8') as file:
                text_data += file.read() + "\n"
    return text_data

...

combined_text = self.load_text_files(data_folder, extensions)
text_chunks = combined_text.split('\n')
text_chunks = [chunk.strip() for chunk in text_chunks if chunk.strip()]
dataset = Dataset.from_dict({'text': text_chunks})
tokenized_datasets = dataset.map(self.tokenize_function, batched=True)
```
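The tokenize_function referenced in the last line above lives elsewhere in our training class and is not shown in full here. As a minimal sketch of what such a function typically looks like for causal language modelling (assuming self.tokenizer and self.max_length are initialized in the class, which is how our setup works but is not shown):

```python
def tokenize_function(self, examples):
    # GPT-2-style tokenizers ship without a pad token; reuse EOS for padding.
    if self.tokenizer.pad_token is None:
        self.tokenizer.pad_token = self.tokenizer.eos_token
    # Tokenize a batch of text chunks, truncating/padding to the configured length.
    tokens = self.tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=self.max_length,
    )
    # For causal LM fine-tuning the labels are the input ids themselves;
    # the model shifts them internally when computing the loss.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens
```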
Set-up for fine-tuning

We have a number of parameters at our disposal that are going to help with the fine-tuning. Arguably the most important is the base model. The base models used for training will need some extensive consideration; because we will want to try multiple base models, the unique HF model name is going to be one of the input parameters passed to transformers for model loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

...

self.model = AutoModelForCausalLM.from_pretrained(model_name).to(self.device)

...
```

Another set of optional parameters (the number of exposed parameters is bound to grow) is shown in the code snippet below:

```python
...

@app.command()
def train(
    model_name: str,
    device: str = typer.Option(..., help="Device to use for training ('cpu' or 'cuda')"),
    batch_size: int = typer.Option(..., help="Batch size for training"),
    max_length: int = typer.Option(..., help="Maximum sequence length"),
    gradient_accumulation: int = typer.Option(..., help="Gradient accumulation steps"),
    epochs: int = typer.Option(..., help="Number of training epochs"),
    data_folder: str = typer.Option(default_data_folder, help="Folder containing the training data"),
    extensions: List[str] = typer.Option(default_extensions, help="File extensions to include in the training data")
):
    ...
```

- device: this will probably always be Nvidia CUDA; however, we are making it configurable to allow CPU training.
- batch_size: given the VRAM constraints, the expectation to try ever-growing models and the potentially large dataset used for training, the batch size is going to help control memory usage (at the cost of training speed).
- max_length: similarly, max_length helps with memory consumption; it sets the maximum number of input tokens per sample during training.
- gradient_accumulation: tells the trainer to accumulate gradients over several forward and backward passes before updating the weights, effectively making the batch size per weight update per_device_train_batch_size * gradient_accumulation_steps while keeping memory consumption at the level of a single per-device batch.
- epochs: the number of passes to run the training for; more epochs mean larger weight adjustments but a higher risk of overfitting. Epochs also directly affect the time it takes to train a model.

We will discuss a number of other parameters that will play an important role once our focus shifts towards fine-tuning the model (the next phases of this series). While the other fine-tuning parameters are set as fixed values at this stage, for completeness, here is our current training setup:

```python
...

training_args = TrainingArguments(
    output_dir=results_folder,
    eval_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=self.batch_size,
    gradient_accumulation_steps=self.gradient_accumulation,
    num_train_epochs=self.epochs,
    weight_decay=0.01,
    logging_dir=logs_folder,
    report_to="all",
    fp16=True,
    gradient_checkpointing=True,
    remove_unused_columns=False
)

...
```
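With the arguments in place, the training itself is handed over to HF's Trainer class (imported earlier alongside TrainingArguments). The exact wiring inside our training class is omitted from this post; a minimal sketch of how these pieces plausibly come together, with a timestamped output path mirroring the final_model_... folders shown later, is:

```python
from datetime import datetime

# Hand the model, the arguments above and the tokenized dataset to HF's Trainer.
trainer = Trainer(
    model=self.model,
    args=training_args,
    train_dataset=tokenized_datasets,
)
trainer.train()

# Save the fine-tuned weights and tokenizer so run-model.py can load them later.
output_path = f"{results_folder}/final_model_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
trainer.save_model(output_path)
self.tokenizer.save_pretrained(output_path)
print(f"Model saved to {output_path}")
```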
Base Model considerations

Keeping in mind that we are training a model that can converse and act as a movie character: while there are many architectures that can fit the problem, the three best-known transformer-based architectures are GPT (Generative Pre-trained Transformer) and LLaMA (Large Language Model Meta AI), with T5 (Text-to-Text Transfer Transformer) also being a valid candidate architecture, even though it excels at filling in/predicting masked or missing text. Based on these architectures, a number of candidates fit the description, spanning different model sizes, numbers of parameters and base training datasets (i.e. what these have been trained on).

GPT-based models: Developed by OpenAI, some of the open source OpenAI models are the best candidates that fit the description: gpt2, gpt2-medium, gpt2-large and gpt2-xl, generally differing by the size/number of parameters of each model. Other open source models that follow the GPT-3 architecture include EleutherAI/gpt-neo-125M, gpt-neo-1.3B and gpt-neo-2.7B, with probably the best candidate for our use case (but the most challenging to train) being EleutherAI/gpt-j-6B.

LLaMA-based models: Most of the LLaMA-based candidates are published by Meta on HF and can be found under their meta-llama org page. Similarly, the differences come down to the number of parameters and size.

T5-based models: Although we are not going to train on T5-based models, for completeness, here is a reference to probably the best-known T5 base model, notable for its multilingual and translation support: byt5-base by Google.

Phase 3: Training iterations

As we write, we are running a number of training sessions on multiple machines, with different configurations in size and other parameters. Our setup allows us to easily try different options. Our goal at this stage is to fit the best model, in terms of size and number of parameters, that the available hardware allows. In the image above we show a training session using the gpt2-medium base model on a single RTX A4000, with conservative settings for batch size, gradient accumulation and epochs given the hardware available. In another similar experiment, we launched training based on a Llama-3.2-1B base model on two Nvidia RTX 8000 Quadro cards, using accelerate to distribute the workload and parallelize the training.

```bash
# example: starting training on two GPUs
CUDA_VISIBLE_DEVICES=0,1 accelerate launch model-training.py openai-community/gpt2-medium --device cuda --batch-size=10 --max-length=128 --gradient-accumulation=3 --epochs=3

# output
{'train_runtime': 332.242, 'train_samples_per_second': 32.425, 'train_steps_per_second': 1.084, 'train_loss': 1.1916698031955295, 'epoch': 3.0}
100%
Model saved to ./results/final_model_20250221_182716
```

Once our newly trained model has been saved to disk, we can run it using another small Python script, run-model.py, that allows us to test the model.

```
(clean_env) yprift01@vasari:~/Dev/movie-gpt-model$ python ./run-model.py ./results/final_model_20250221_182716/
🔍 Loading Hugging Face model from: ./results/final_model_20250221_182716/ ...
✅ HF Model loaded successfully! Type your messages below (type 'exit' to quit).
You: Act as movie character DAVID INGLEBY
You: what is Elgin college?
🤖 Model: what is Elgin college? It is a school for children of all ages. It is a school for children of all ages.
You: Can you tell me more?
🤖 Model: Can you tell me more? I am interested in the idea of learning. I am interested in learning how to learn. I am interested in learning how to learn.
You:
```
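At its core, run-model.py is little more than a loop around the transformers generation API. A minimal sketch of what such a script can look like (the sampling parameters here are illustrative assumptions, not necessarily the ones we use, and our actual script adds more console output):

```python
import sys

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer from the path given on the command line.
model_path = sys.argv[1]
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

while True:
    prompt = input("You: ")
    if prompt.strip().lower() == "exit":
        break
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Sample a continuation; note the prompt is echoed back as part of the output,
    # which matches the transcript above.
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
    print("🤖 Model:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```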
Phase 4: Validating the result

We need a more sustainable way to validate the results than sporadic or machine-based validation. This is likely to take the form of end user trials. For this purpose, we need to set up our model behind a real, ChatGPT-like access model where end users can (a) register or be given an account, (b) access the web/mobile interface and chat with our movie model, and (c) provide feedback on whether the model they were talking to was a good or bad impersonation of the movie character.

For this purpose we have set up Open-WebUI (which I explain in detail in another (b)log series; for the purpose of this post we use as reference the Open-WebUI configured at https://chat.prifti.us/, although during this research project we used an internal Open-WebUI installation). Open-WebUI mainly supports two ways of loading models into the interface: any API source that is compatible with the OpenAI APIs (for example https://openrouter.ai, an alternative that provides access to a vast number of available models), or, alternatively, loading and running your own models, either open source or custom trained, using Ollama. At this phase of the project, we used Ollama to run the trained model; however, this comes with the extra step that newly trained models need to be converted into the more interoperable GGUF format that Ollama supports.

Conclusions and Next Steps

In conclusion, we created the basic software to train custom LLM models using Hugging Face public base models. We are able to distribute our training workload across different GPUs and machines, allowing us to try different base models and better fine-tuning. We have also configured a way to visually interact with and validate the newly trained models. Our focus can now shift to repeating the process, fine-tuning the data and the training, and validating that the end result (the movie character chatbot) behaves closer to what is expected from the bot.