LLM Training Workshop — My Journey Into AI Complexity and What I Learned
The Problem I Kept Running Into
After diving deep into AI over the past few years—both in enterprise environments and through academic collaborations—I kept noticing the same pattern. Software engineers would get excited about AI, dive into tutorials, maybe train their first model, then hit a wall of complexity and give up or retreat to calling APIs.
The gap wasn’t just technical knowledge. It was the overwhelming nature of the field itself: too many frameworks, conflicting advice, academic papers that assume deep ML background, and tutorials that skip over the hard parts of actually putting models in production.
I found myself having the same conversations repeatedly, explaining not just how transformers work, but why they evolved the way they did, when you actually need to train your own models, and how to think about the whole pipeline from data to deployment.
How I Approached Building the Workshop
Starting With My Own Confusion
The more I learned about AI, the more I realized how much the educational material was either too academic (dense papers with math) or too shallow (follow-this-tutorial-and-magic-happens). I wanted something that felt like having a conversation with someone who had been through the struggle.
So I structured it around the questions I actually had when starting:
- Why do people keep talking about “attention” and what does it actually do?
- How do you know if training your own model is worth the effort?
- What’s all this LoRA stuff and why does everyone say it’s important?
- How do you actually evaluate whether your model is any good?
Theory Through Stories, Not Equations
I found that understanding why things evolved the way they did made the technical details click. Like why we moved from RNNs to transformers—it’s not just “transformers are better,” it’s “RNNs had this specific parallelization problem and transformers solved it in this clever way.”
The workshop slides walk through this evolution story, from the McCulloch-Pitts neuron in 1943 and the perceptron in the late 1950s to modern LLMs, focusing on the problems each innovation solved rather than just listing features.
Hands-On With Real Constraints
The practical component isn’t a toy example. It’s built around the same production infrastructure I’ve used for actual projects, but simplified enough that you can run it on a gaming GPU.
The key insight: if you can’t train something meaningful on accessible hardware, most people will never move past tutorials.
What the Workshop Actually Covers
The Story Behind the Technology
I start with history because I learned the hard way that diving straight into transformer architecture is overwhelming. But when you understand that RNNs couldn’t parallelize across time steps (so training was painfully slow) and that attention mechanisms were originally just a patch to help RNNs with longer sequences, suddenly the “Attention Is All You Need” paper makes sense.
The historical progression helps you recognize patterns. Like how every few decades, we hit a wall with existing approaches, then someone finds a clever workaround that becomes the new standard. Right now we’re probably in one of those transition moments with LLMs.

Making Sense of Memory and Hardware
One of the biggest surprises when I started training models was how quickly you run out of GPU memory. The struggle is real when your 7B parameter model is taking 14GB just to load on your RTX 3090, and you still need room for gradients and optimizer states.
Early on, some quick arithmetic teaches you the GPU economics of memory and cost:
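# What I wish someone had told me earlier
def model_size_gb(num_parameters, bytes_per_param):
    """Memory needed for the weights alone, in GB."""
    return (num_parameters * bytes_per_param) / 1e9

# Reality check for LLaMA-7B (weights only):
print(model_size_gb(7e9, 4))  # FP32: 28 GB, my RTX 4090 has 24 GB, won't fit
print(model_size_gb(7e9, 2))  # FP16: 14 GB, loads, but no room left for full fine-tuning
print(model_size_gb(7e9, 1))  # INT8: 7 GB, finally, something reasonable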
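The numbers above cover only the weights. Here is a rough sketch of what gradients and optimizer states add during full fine-tuning. The per-parameter byte counts are an assumption based on a common rule of thumb for Adam with mixed precision, not something measured in the workshop, and activations are left out entirely, so treat the result as a lower bound.

```python
def training_memory_gb(num_parameters, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Rough lower bound for full fine-tuning with Adam in mixed precision.

    Assumed per-parameter breakdown (rule of thumb, not a measurement):
      2 bytes for FP16 weights, 2 bytes for FP16 gradients,
      12 bytes of optimizer state (FP32 master weights, momentum, variance).
    Activation memory is ignored, so real usage is higher.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_parameters * bytes_per_param / 1e9


# 7B parameters at 16 bytes each: far beyond a single 24 GB card
full_finetune_gb = training_memory_gb(7e9)
print(f"Full fine-tuning, 7B parameters: about {full_finetune_gb:.0f} GB before activations")
```

Under these assumptions a 7B model needs on the order of 112 GB for full fine-tuning, which is why the techniques in the next section matter so much on consumer hardware.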
Understanding why Flash Attention matters isn’t about the math—it’s about realizing that normal attention creates huge intermediate matrices that eat your GPU memory. Flash Attention is clever because it never creates those big matrices, computing everything in smaller chunks.
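To make the “smaller chunks” idea concrete, here is a minimal sketch in plain Python for a single query vector. It is not the actual Flash Attention kernel, which tiles queries and keys in fast on-chip GPU memory. It only illustrates the running-softmax trick that lets you process keys chunk by chunk without ever holding all the scores at once. The function names and `chunk_size` are made up for this illustration.

```python
import math


def score(query, key):
    """Scaled dot product between one query and one key."""
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))


def attention_full(query, keys, values):
    """Standard attention for one query: every score exists in memory at once."""
    scores = [score(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_chunked(query, keys, values, chunk_size=2):
    """Same result, but only chunk_size scores exist at any time."""
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])  # weighted sum of values, relative to running_max
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [score(query, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first chunk
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


query = [0.1, 0.7, -0.3, 0.5]
keys = [[0.2, 0.1, 0.4, -0.6], [1.0, -0.5, 0.3, 0.2], [0.0, 0.9, -0.1, 0.4],
        [-0.7, 0.3, 0.8, 0.1], [0.5, 0.5, 0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
print(attention_full(query, keys, values))
print(attention_chunked(query, keys, values, chunk_size=2))
```

Both functions print the same vector up to floating-point rounding. The chunked version never needs more than `chunk_size` scores in memory, which is the property that matters once sequences get long.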
The Efficiency Tricks That Actually Matter
Getting models to fit and train on normal hardware is where most people get stuck. There are so many techniques (quantization, LoRA, gradient checkpointing, mixed precision) that it’s overwhelming.
I focus on the decision tree I actually use:
# My actual thought process when choosing techniques
def what_can_i_actually_run(gpu_memory, model_size):
    if gpu_memory >= model_size * 2:
        return "lucky_you_full_precision"
    elif gpu_memory >= model_size:
        return "int8_quantization"  # LLM.int8() is magic
    else:
        return "qlora_time"  # 4-bit + LoRA is your friend
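The int8 branch above relies on quantization, so here is a minimal sketch of the simplest variant: absmax quantization with a single scale for a whole list of weights. This is an illustration only and not how LLM.int8() is implemented. That method quantizes per vector and keeps outlier features in higher precision. The helper names and sample weights are made up.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = 127 / max_abs if max_abs > 0 else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q / scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -0.01]  # made-up sample values
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
print(quantized)
print(restored)
```

Each weight now fits in one byte instead of two or four, at the cost of a small rounding error per value. A single large outlier stretches the scale and squeezes everything else into a few integer levels, which is the problem the more careful schemes are designed around.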
LoRA was another revelation. Instead of fine-tuning all 7 billion parameters, you freeze them and train small low-rank matrices that capture the changes you care about. It’s like patching software: you don’t recompile everything, you just add the diff.
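A quick back-of-the-envelope calculation shows how small that diff is. For one weight matrix of shape d_out x d_in, LoRA trains two matrices of shapes d_out x r and r x d_in. The 4096 dimensions and rank 8 below are illustrative assumptions, not the workshop’s configuration.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


d_out = d_in = 4096  # illustrative projection size
rank = 8             # illustrative LoRA rank

full = d_out * d_in
added = lora_param_count(d_out, d_in, rank)
print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {added:,} parameters ({added / full:.2%} of the full matrix)")
```

With these numbers the adapter is 65,536 parameters against 16,777,216 in the frozen matrix, roughly 0.4%. Gradients and optimizer states are only needed for that small slice, which is where most of the memory savings come from.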
The Model Selection Mistake I Made (So You Don’t Have To)
Early on, I thought specialized models were always better. “CodeLlama for code tasks, right?” Wrong, at least when the task is teaching the model something it has never seen.
# Pseudocode: train() stands in for the fine-tuning pipeline
# What I tried first (and learned the hard way)
specialized_model = "CodeLlama-7b-Python"
task = "learn our custom DSL"
result = train(specialized_model, task)  # Stubborn, won't learn new patterns

# What actually worked
base_model = "meta-llama/Llama-3-8B"  # Clean slate, flexible
result = train(base_model, task)  # Learns happily
Specialized models are like concrete that’s already set—hard to reshape. Base models are like clay—still moldable into what you need.
The Training Example That Proves the Point
The workshop centers around training a model on FLOWROLL/DIMMATCH DSL—functions that don’t exist anywhere in the model’s training data. Before training, ask any LLM “What is FLOWROLL?” and you get hallucinated nonsense. After training on just 43 examples, the model can explain the syntax, generate code, and debug errors.
{"messages": [
{"role": "user", "content": "What is FLOWROLL?"},
{"role": "assistant", "content": "FLOWROLL is a rolling aggregation function that calculates values across a specified number of periods in the Direction (Forward or Backward) from the current cell..."}
]}
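Before training on examples like this one, it helps to sanity-check the file. The sketch below assumes the examples are stored as one JSON object per line (JSONL) in the chat format shown above. The validation rules and the `validate_example` helper are my own illustration, not part of the workshop pipeline.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(line):
    """Parse one JSONL line and check it looks like a usable chat example."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("every message needs non-empty text content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant's answer")
    return record


sample_line = json.dumps({"messages": [
    {"role": "user", "content": "What is FLOWROLL?"},
    {"role": "assistant", "content": "FLOWROLL is a rolling aggregation function..."},
]})
record = validate_example(sample_line)
print(f"valid example with {len(record['messages'])} messages")
```

With only a few dozen examples, a single malformed line is a meaningful share of the dataset, so a check like this is cheap insurance.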
This isn’t about cost savings (though that happens). It’s about teaching models concepts that literally don’t exist in their training data. In my experience, prompting and RAG only get you so far with truly novel patterns.
Running the Workshop in Practice
In Enterprise Settings
I’ve delivered this workshop in corporate environments where teams were spending significant amounts on API calls for tasks that seemed like they should be automatable. The workshop format was usually 2 days: first day focused on understanding when custom training makes sense, second day hands-on implementation.
What surprised me was how much time we spent on the “why” versus the “how.” Teams often jumped to training solutions when better prompting would have worked, or stuck with expensive APIs when a small custom model would be perfect.
Guest Lectures in Universities
The academic version is condensed into 2 hours, focusing more on the technical evolution story and career guidance for students entering the field. University audiences care less about cost optimization and more about understanding the fundamentals and where the field is headed.
The Q&A sessions were always the most interesting part—students asking about research vs. industry paths, what skills actually matter in practice, and how to navigate the overwhelming pace of change in AI.
What I Learned From Teaching This
Teaching forced me to clarify my own thinking. You can’t explain why LoRA works until you really understand it yourself. Building examples that work reliably across different hardware setups taught me more about the practical constraints than any paper.
The biggest insight: everyone gets overwhelmed by the complexity, but the core concepts aren’t that hard once you have the right mental models. Most of the confusion comes from trying to learn everything at once instead of building understanding incrementally.
Making It Open Source
This started as a side passion—I enjoy coaching peers and colleagues, and guest lectures in academia are something I genuinely like doing. The workshop had more interest than I expected, which was encouraging.
Making it open source felt natural. The AI field moves fast, and keeping educational content locked up doesn’t help anyone. Plus, it’s free for all to use—just mention the original author if you find it useful.
What You Can Use It For
The materials are all there if you want to:
- Run your own version of the workshop
- Use parts for internal team training
- Adapt the hands-on pipeline for your projects
- Learn from the examples and configurations
Feel free to modify, improve, or build on anything that’s useful. That’s the point of open sourcing it.
What I’m Still Learning
Every time I run the workshop, I discover new gaps in my own understanding. Teaching forces you to really know your stuff—you can’t hand-wave through the hard parts when someone asks a direct question.
The field keeps evolving too. What I taught about model selection six months ago is already outdated. That’s both exciting and exhausting.
Thoughts and Call for Feedback
This workshop is my attempt to bridge the gap between academic AI and practical implementation. It’s definitely not perfect and reflects my own biases and background, but it seems to help people move from “AI is magic” to “AI is a tool I can reason about.”
If you’re struggling with any of this complexity, you’re not alone. The field moves fast, and everyone feels overwhelmed sometimes. The key is finding the right entry point and building understanding incrementally.
I’d love feedback on the materials—what works, what doesn’t, what’s missing. The content is all on GitHub if you want to explore, contribute, or adapt it for your own use.
What other workshops or educational content would be helpful? This is just one attempt at making AI more accessible, and I’m always looking for ways to improve or ideas for what to tackle next.