A Few Ways to Train Your Own Models
A walkthrough of model training: what it actually is, the different approaches, and where to get the compute.
I mean, you can just ask Claude to fine-tune a model for you. Describe what you want, it handles the config, submits the job, done. That's a valid way to fine-tune, but it may feel like a black box if you haven't done it before. (Reading the skill is itself a helpful primer too! Lotta means to reach the same end.) But let's look at what training might look like for a dev/researcher who doesn't have a seemingly endless compute budget.
Say you wanted a model that actually understood yesterday's memes. Memes move fast. By the time a foundation model ships, the meta has shifted three times. The pretrained weights on HuggingFace have a knowledge cutoff, and they definitely don't know about whatever absurd thing went viral last Tuesday. (Skibidi Toilet and Tung Tung Sahur are quickly approaching vintage meme status. We're witnessing the great meme reset in real time. 😂)
You could hook up a daily-updating external knowledge store to augment a frontier model: RAG pipelines, tool use, the approaches that power Claude Code or Codex or countless other systems. That's a valid strategy, and often the right one. But sometimes you want the knowledge baked into the weights themselves. Faster inference, no retrieval latency, behavior that's learned rather than prompted. Closer to the metal (the knowledge lives in the weights, not in a database you query at runtime). For that, you train.
I've been experimenting with different approaches: quick fine-tunes in Colab, renting cloud GPUs for larger jobs, using HuggingFace's managed infrastructure. Each has trade-offs. This post walks through what training actually is and where you can do it.
There's real value in understanding this stuff mechanistically. When you know how training works, you can meaningfully engage with the field. You understand what's possible when you need something bespoke. You can read a paper and actually follow the methods section. You stop being a passenger.
This post is a sneak peek, not an exhaustive guide. I've linked resources throughout that go much deeper. They're fantastic and worth your time if you want to really learn this stuff.
What Training Actually Is
A neural network is a function with millions of adjustable values (parameters). Training is the process of finding the right values.
The loop:
- Feed data in, get a prediction
- Compare prediction to what you wanted (this difference is the "loss")
- Figure out which parameters contributed to the error
- Nudge those parameters to reduce the error
- Repeat thousands of times
In code, the core update looks like this:
```python
for param in model.parameters():
    param.data -= learning_rate * param.grad
```
The grad (gradient) tells you which direction to nudge. The learning_rate controls how big of a step to take. Everything else (Adam, learning rate schedules, regularization) builds on this idea.
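To see that update rule in action without any framework, here's the loop in miniature: fitting a single weight w so that w * x approximates y = 2x, with the gradient worked out by hand. (The toy data and numbers here are my own illustration, not from a real training run.)

```python
# Toy example: learn w in pred = w * x, targeting y = 2x.
# Loss for one sample: (w*x - y)^2; its gradient w.r.t. w is 2*(w*x - y)*x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs from y = 2x

w = 0.0
learning_rate = 0.01

for step in range(200):
    for x, y in data:
        pred = w * x                # forward pass
        grad = 2 * (pred - y) * x   # hand-derived gradient of the squared error
        w -= learning_rate * grad   # the nudge: same shape as the PyTorch update above

print(round(w, 3))  # converges to 2.0
```

Everything a framework adds (autograd, Adam, schedules) is scaffolding around this one subtraction.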
If you want to really internalize this, Andrej Karpathy's Neural Networks: Zero to Hero series is the gold standard. He builds everything from scratch, including backpropagation. 3Blue1Brown's neural network series is also excellent for visual intuition.
The Different Approaches
Full Training (Pre-training)
This means starting from scratch: random weights, a massive dataset, and tons of compute. This is how GPT, LLaMA, and other foundation models are created.
You'd do this if you have a novel architecture or a domain so specialized that no pretrained model exists. It costs thousands of GPU-hours, often millions.
Fine-Tuning
Take an existing pretrained model and continue training on your specific data. The model already knows language (or vision, or audio). You're teaching it your domain.
This is the sweet spot for most people. Hours to days on a single GPU.
LoRA / QLoRA
Instead of updating every parameter, you freeze most of the model and only train small "adapter" layers. You get comparable results with a fraction of the memory.
This is how you fine-tune a 7B model on a consumer GPU.
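The mechanism is simple enough to sketch directly. A frozen weight matrix W gets a low-rank correction B @ A scaled by alpha / r, and only A and B are trained. A rough numpy illustration (the dimensions and initialization here are my own toy choices, not a real model's):

```python
import numpy as np

d, r, alpha = 512, 8, 32          # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))        # pretrained weight: frozen, never updated
A = rng.standard_normal((r, d)) * 0.01 # trainable, r x d
B = np.zeros((d, r))                   # trainable, d x r; zero-init so training starts from W

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but we never materialize it:
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

frozen = W.size
trainable = A.size + B.size
print(f"trainable fraction: {trainable / (frozen + trainable):.4f}")  # ~0.0303
```

Because B starts at zero, the adapter initially contributes nothing and the model behaves exactly like the pretrained one; training only has to learn the correction.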
Quantization
Not training, but often confused with it. Quantization compresses a trained model (32-bit → 4-bit) so it runs on smaller hardware. Happens after training.
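As a rough sketch of what quantization does, here's an absmax int8 round-trip in numpy. (This is a simplified scheme for illustration; real 4-bit formats like NF4 are more involved.)

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # stand-in for trained weights

# Absmax quantization: map [-max|w|, max|w|] onto the int8 range [-127, 127]
scale = np.abs(w).max() / 127
q = np.round(w / scale).astype(np.int8)      # 1 byte per weight instead of 4
w_restored = q.astype(np.float32) * scale    # dequantize for compute

print(f"size: {w.nbytes} -> {q.nbytes} bytes")           # 4096 -> 1024
print(f"max error: {np.abs(w - w_restored).max():.4f}")  # small, but not zero
```

The trade is exactly what you'd expect: 4x less memory for a small, bounded rounding error on each weight.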
The Training Loop
Before touching transformers, here's what a training loop actually looks like in PyTorch. Every framework wraps this:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.layers(x)

# Sample data
X = torch.randn(1000, 64)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = SimpleNet(input_dim=64, hidden_dim=128, output_dim=10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(100):
    total_loss = 0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss = {total_loss / len(loader):.4f}")
```
The loss function (CrossEntropyLoss) measures how wrong the model is. The optimizer (AdamW) handles the parameter updates. Learning rate (1e-3) is a reasonable starting point.
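The loop above only trains. To see how the model actually performs, you switch to eval mode and skip gradient tracking. A minimal sketch, with tiny random data standing in for a held-out validation set (the model and data here are my own toy setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
X_val = torch.randn(200, 64)
y_val = torch.randint(0, 10, (200,))

model.eval()                      # disables dropout and similar train-only behavior
with torch.no_grad():             # no gradients needed for evaluation
    logits = model(X_val)
    preds = logits.argmax(dim=-1)
    accuracy = (preds == y_val).float().mean().item()

print(f"val accuracy: {accuracy:.2%}")
```

Forgetting model.eval() is a classic bug: dropout stays active and your validation numbers are quietly worse than the model deserves.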
For more on PyTorch fundamentals, the official PyTorch tutorials are comprehensive. The "Learn the Basics" series walks through tensors, datasets, autograd, and optimization.
Fine-Tuning with HuggingFace
HuggingFace's Trainer wraps the loop above and handles checkpointing, logging, and mixed precision:
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import numpy as np

dataset = load_dataset("imdb")

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
```
Note the lower learning rate (2e-5 vs 1e-3). Pretrained weights are already good, so you want to nudge them, not overwrite them. Fine-tuning also converges fast. 3-5 epochs is usually enough.
The HuggingFace fine-tuning guide covers this in depth. Phil Schmid's How to Fine-tune Open LLMs in 2025 is also a solid practical reference.
LoRA: Big Models, Small GPUs
Full fine-tuning updates every parameter. LoRA freezes the original model and trains small adapter matrices instead.
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # the classic LoRA targets; the counts below assume these two
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
```
0.06% of parameters trainable. With 8-bit quantization, this fits on a 16GB GPU.
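The printed count checks out with a little arithmetic. A rank-r adapter on a d_out x d_in linear layer adds an A matrix (r x d_in) and a B matrix (d_out x r); the 4,194,304 figure corresponds to adapting two 4096 x 4096 projections (e.g., q_proj and v_proj) in each of Llama-2-7B's 32 layers:

```python
# Parameters added by one rank-r adapter on a d_out x d_in linear layer:
#   A is r x d_in, B is d_out x r  ->  r * (d_in + d_out) extra weights
r, d = 8, 4096
per_projection = r * (d + d)           # 65,536 for a square 4096 x 4096 projection

layers, projections_per_layer = 32, 2  # e.g., q_proj and v_proj in every layer
total = layers * projections_per_layer * per_projection

print(f"{total:,}")  # 4,194,304 -- matching print_trainable_parameters() above
```

Adding more target modules (k_proj, o_proj, the MLP projections) scales the count linearly, which is why target_modules is the main lever on adapter size.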
QLoRA takes this further by loading the base model in 4-bit, reducing the model footprint to ~4GB:
```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
```
Now you're training a 7B model on a 12-16GB GPU. (The model itself fits in ~4GB when quantized, but you need headroom for gradients, optimizer states, and activations.)
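A back-of-envelope breakdown of that budget (very rough, and the dtype sizes are my own assumptions; activation memory varies a lot with batch size and sequence length):

```python
params_base = 7e9     # base model parameters
params_lora = 4.2e6   # trainable adapter parameters, roughly the LoRA count above

weights_4bit = params_base * 0.5 / 1e9  # ~0.5 bytes per param in 4-bit
# Adapters train in higher precision: bf16 weights + bf16 grads + two fp32 AdamW moments
adapter_overhead = params_lora * (2 + 2 + 4 + 4) / 1e9

print(f"quantized base weights: ~{weights_4bit:.1f} GB")      # ~3.5 GB
print(f"adapter training state: ~{adapter_overhead:.2f} GB")  # tiny
# The rest of a 12-16 GB card goes to activations, the CUDA context, and fragmentation.
```

The punchline: the optimizer state that dominates full fine-tuning becomes negligible once only the adapters are trainable.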
The PEFT documentation explains the theory behind LoRA. The HuggingFace LLM Course chapter on LoRA is a good walkthrough. For the quantization side, check the PEFT quantization guide.
Where to Get GPUs
You don't need to buy hardware. Rent it.
Google Colab
The easiest place to start. Open a notebook, pick a runtime, and go.
A subscription is essentially required for serious training, though. The free tier gives you a T4 with frequent timeouts and disconnects; Pro ($10/mo) and Pro+ ($50/mo) unlock A100s and longer sessions.
The real value: tons of community notebooks. Find a training template for almost any model, tweak it, run it. Great for learning. Google's FunctionGemma fine-tuning notebook is a solid example of a real fine-tuning workflow. See the Colab FAQ for details on runtime limits and GPU availability.
HuggingFace Jobs
This is a nice middle ground: various GPU types (A10G, A100, etc.), helpful abstractions, and HuggingFace handles environment setup, checkpointing, and artifact storage.
You write the code, upload your dataset to the Hub, and their infrastructure runs it. Less friction than managing your own instance.
There's also a Claude Code skill for fine-tuning models with HF Jobs. Describe what you want and it handles the config. I used it to train Qwen2.5-Coder-7B-Agentic-CoT-LoRA.
Lambda Labs
Maximum power. Minimum hand-holding.
You SSH into a bare Linux machine with GPUs attached. A100 80GB, H100s, multi-GPU setups. Full control over your environment.
```bash
ssh ubuntu@<your-instance-ip>
pip install torch transformers datasets peft accelerate
python train.py
```
Do not forget to turn off your instance. You're billed by compute time. If you walk away and forget to terminate, you will wake up to a bill. Set a calendar reminder.
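The arithmetic on forgetting is unforgiving. Using an example on-demand A100 rate (illustrative; check current pricing):

```python
hourly_rate = 1.79  # $/hr for an on-demand A100 (example rate, not a quote)

for hours in (8, 24, 72):  # an evening, a day, a long weekend
    print(f"{hours:>3} h forgotten -> ${hours * hourly_rate:.2f}")
```

A long weekend of an idle instance costs more than a month of Colab Pro+.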
Their getting started guide walks through SSH setup and instance management.
Comparison
| Provider | Best For | Hourly Cost | Hand-Holding |
|---|---|---|---|
| Colab Pro+ | Learning, experiments | ~$0.10 | High |
| HF Jobs | Production training | Variable | Medium |
| Lambda | Power users | $1.29-1.79 (A100) | Low |
| RunPod | Budget training | ~$0.30 (3090) | Low |
A Complete Example
Here's a full QLoRA script using SFTTrainer that runs on whatever GPU you have:
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("your-username/your-dataset", split="train")

# SFTConfig extends TrainingArguments with the SFT-specific options
args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,  # SFTTrainer handles the PEFT wrapping
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./my-finetuned-model")
trainer.push_to_hub()  # pushes to the Hub repo named by args.hub_model_id (or the output dir name)
```
The TRL PEFT integration guide has more examples for different training scenarios.
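One knob in the script worth unpacking: gradient_accumulation_steps=4 together with per_device_train_batch_size=4 gives an effective batch of 16, by summing gradients across micro-batches before stepping. Stripped down to plain PyTorch, the pattern looks like this (toy model and random data of my own, just to show the control flow):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

accum_steps = 4
optimizer.zero_grad()
for step in range(16):
    x = torch.randn(4, 16)                       # micro-batch of 4
    y = torch.randint(0, 2, (4,))
    loss = criterion(model(x), y) / accum_steps  # scale so the summed grads average out
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per effective batch of 16
        optimizer.zero_grad()
```

This is how you simulate large-batch training on a GPU that can only fit a few samples at a time; the only costs are more forward passes per update and slightly stale batch statistics.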
I hope you enjoyed this walkthrough. This space moves fast, and tools, techniques, and best practices evolve constantly, so check out the resources linked throughout for deeper, more up-to-date coverage. The best way to get familiar with the offerings and approaches available is to pick one and start experimenting.