A Few Ways to Train Your Own Models
A walkthrough of model training: what it actually is, the different approaches, and where to get the compute.
I mean, you can just ask Claude to fine-tune a model for you. Describe what you want, it handles the config, submits the job, done. That's a valid way to fine-tune, but it may feel like a black box if you haven't done it before. (Reading the skill is itself a helpful primer too! Lotta means to reach the same end.) But let's look at what training might look like for a dev/researcher who doesn't have a seemingly endless compute budget.
Say you wanted a model that actually understood yesterday's memes. Memes move fast. By the time a foundation model ships, the meta has shifted three times. The pretrained weights on HuggingFace have a knowledge cutoff, and they definitely don't know about whatever absurd thing went viral last Tuesday. (Skibidi Toilet and Tung Tung Sahur are quickly approaching vintage meme status. We're witnessing the great meme reset in real time. 😂)
You could hook up a daily-updating external knowledge store to augment a frontier model: RAG pipelines, tool use, the approaches that power Claude Code or Codex or countless other systems. That's a valid strategy, and often the right one. But sometimes you want the knowledge baked into the weights themselves. Faster inference, no retrieval latency, behavior that's learned rather than prompted. Closer to the metal (the knowledge lives in the weights, not in a database you query at runtime). For that, you train.
I've been experimenting with different approaches: quick fine-tunes in Colab, renting cloud GPUs for larger jobs, using HuggingFace's managed infrastructure. Each has trade-offs. This post walks through what training actually is and where you can do it.
There's real value in understanding this stuff mechanistically. When you know how training works, you can meaningfully engage with the field. You understand what's possible when you need something bespoke. You can read a paper and actually follow the methods section. You stop being a passenger.
This post is a sneak peek, not an exhaustive guide. I've linked resources throughout that go much deeper. They're fantastic and worth your time if you want to really learn this stuff.
What Training Actually Is
A neural network is a function with millions of adjustable values (parameters). Training is the process of finding the right values.
The loop:
- Feed data in, get a prediction
- Compare prediction to what you wanted (this difference is the "loss")
- Figure out which parameters contributed to the error
- Nudge those parameters to reduce the error
- Repeat thousands of times
In code, the core update looks like this:
```python
for param in model.parameters():
    param.data -= learning_rate * param.grad
```
The grad (gradient) tells you which direction to nudge. The learning_rate controls how big of a step to take. Everything else (Adam, learning rate schedules, regularization) builds on this idea.
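To see that update rule in action without any framework, here's the loop in miniature: fitting a single weight w so that w * x approximates y = 2x, with the gradient worked out by hand. (The toy data and numbers here are my own illustration, not from a real training run.)

```python
# Toy example: learn w in pred = w * x, targeting y = 2x.
# Loss for one sample: (w*x - y)^2; its gradient w.r.t. w is 2*(w*x - y)*x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs from y = 2x

w = 0.0
learning_rate = 0.01

for step in range(200):
    for x, y in data:
        pred = w * x                # forward pass
        grad = 2 * (pred - y) * x   # hand-derived gradient of the squared error
        w -= learning_rate * grad   # the nudge: same shape as the PyTorch update above

print(round(w, 3))  # converges to 2.0
```

Everything a framework adds (autograd, Adam, schedules) is scaffolding around this one subtraction.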
If you want to really internalize this, Andrej Karpathy's Neural Networks: Zero to Hero series is the gold standard. He builds everything from scratch, including backpropagation. 3Blue1Brown's neural network series is also excellent for visual intuition.
The Different Approaches
Full Training (Pre-training)
This means starting from scratch: random weights, a massive dataset, and tons of compute. This is how GPT, LLaMA, and other foundation models are created.
You'd do this if you have a novel architecture or a domain so specialized that no pretrained model exists. It costs thousands of GPU-hours, often millions.
Fine-Tuning
Take an existing pretrained model and continue training on your specific data. The model already knows language (or vision, or audio). You're teaching it your domain.
This is the sweet spot for most people. Hours to days on a single GPU.
LoRA / QLoRA
Instead of updating every parameter, you freeze most of the model and only train small "adapter" layers. You get comparable results with a fraction of the memory.
This is how you fine-tune a 7B model on a consumer GPU.
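The mechanism is simple enough to sketch directly. A frozen weight matrix W gets a low-rank correction B @ A scaled by alpha / r, and only A and B are trained. A rough numpy illustration (the dimensions and initialization here are my own toy choices, not a real model's):

```python
import numpy as np

d, r, alpha = 512, 8, 32          # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))        # pretrained weight: frozen, never updated
A = rng.standard_normal((r, d)) * 0.01 # trainable, r x d
B = np.zeros((d, r))                   # trainable, d x r; zero-init so training starts from W

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but we never materialize it:
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

frozen = W.size
trainable = A.size + B.size
print(f"trainable fraction: {trainable / (frozen + trainable):.4f}")  # ~0.0303
```

Because B starts at zero, the adapter initially contributes nothing and the model behaves exactly like the pretrained one; training only has to learn the correction.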
Quantization
Not training, but often confused with it. Quantization compresses a trained model (32-bit → 4-bit) so it runs on smaller hardware. Happens after training.
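As a rough sketch of what quantization does, here's an absmax int8 round-trip in numpy. (This is a simplified scheme for illustration; real 4-bit formats like NF4 are more involved.)

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # stand-in for trained weights

# Absmax quantization: map [-max|w|, max|w|] onto the int8 range [-127, 127]
scale = np.abs(w).max() / 127
q = np.round(w / scale).astype(np.int8)      # 1 byte per weight instead of 4
w_restored = q.astype(np.float32) * scale    # dequantize for compute

print(f"size: {w.nbytes} -> {q.nbytes} bytes")           # 4096 -> 1024
print(f"max error: {np.abs(w - w_restored).max():.4f}")  # small, but not zero
```

The trade is exactly what you'd expect: 4x less memory for a small, bounded rounding error on each weight.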
The Training Loop
Before touching transformers, here's what a training loop actually looks like in PyTorch. Every framework wraps this:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.layers(x)

# Sample data
X = torch.randn(1000, 64)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = SimpleNet(input_dim=64, hidden_dim=128, output_dim=10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(100):
    total_loss = 0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss = {total_loss / len(loader):.4f}")
```
The loss function (CrossEntropyLoss) measures how wrong the model is. The optimizer (AdamW) handles the parameter updates. Learning rate (1e-3) is a reasonable starting point.
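The loop above only trains. To see how the model actually performs, you switch to eval mode and skip gradient tracking. A minimal sketch, with tiny random data standing in for a held-out validation set (the model and data here are my own toy setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
X_val = torch.randn(200, 64)
y_val = torch.randint(0, 10, (200,))

model.eval()                      # disables dropout and similar train-only behavior
with torch.no_grad():             # no gradients needed for evaluation
    logits = model(X_val)
    preds = logits.argmax(dim=-1)
    accuracy = (preds == y_val).float().mean().item()

print(f"val accuracy: {accuracy:.2%}")
```

Forgetting model.eval() is a classic bug: dropout stays active and your validation numbers are quietly worse than the model deserves.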
For more on PyTorch fundamentals, the official PyTorch tutorials are comprehensive. The "Learn the Basics" series walks through tensors, datasets, autograd, and optimization.
Fine-Tuning with HuggingFace
HuggingFace's Trainer wraps the loop above and handles checkpointing, logging, and mixed precision:
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import numpy as np

dataset = load_dataset("imdb")

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
```
Note the lower learning rate (2e-5 vs 1e-3). Pretrained weights are already good, so you want to nudge them, not overwrite them. Fine-tuning also converges fast. 3-5 epochs is usually enough.
The HuggingFace fine-tuning guide covers this in depth. Phil Schmid's How to Fine-tune Open LLMs in 2025 is also a solid practical reference.
LoRA: Big Models, Small GPUs
Full fine-tuning updates every parameter. LoRA freezes the original model and trains small adapter matrices instead.
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # the classic LoRA targets; the counts below assume these two
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
```
0.06% of parameters trainable. With 8-bit quantization, this fits on a 16GB GPU.
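The printed count checks out with a little arithmetic. A rank-r adapter on a d_out x d_in linear layer adds an A matrix (r x d_in) and a B matrix (d_out x r); the 4,194,304 figure corresponds to adapting two 4096 x 4096 projections (e.g., q_proj and v_proj) in each of Llama-2-7B's 32 layers:

```python
# Parameters added by one rank-r adapter on a d_out x d_in linear layer:
#   A is r x d_in, B is d_out x r  ->  r * (d_in + d_out) extra weights
r, d = 8, 4096
per_projection = r * (d + d)           # 65,536 for a square 4096 x 4096 projection

layers, projections_per_layer = 32, 2  # e.g., q_proj and v_proj in every layer
total = layers * projections_per_layer * per_projection

print(f"{total:,}")  # 4,194,304 -- matching print_trainable_parameters() above
```

Adding more target modules (k_proj, o_proj, the MLP projections) scales the count linearly, which is why target_modules is the main lever on adapter size.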
QLoRA takes this further by loading the base model in 4-bit, reducing the model footprint to ~4GB:
```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
```
Now you're training a 7B model on a 12-16GB GPU. (The model itself fits in ~4GB when quantized, but you need headroom for gradients, optimizer states, and activations.)
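A back-of-envelope breakdown of that budget (very rough, and the dtype sizes are my own assumptions; activation memory varies a lot with batch size and sequence length):

```python
params_base = 7e9     # base model parameters
params_lora = 4.2e6   # trainable adapter parameters, roughly the LoRA count above

weights_4bit = params_base * 0.5 / 1e9  # ~0.5 bytes per param in 4-bit
# Adapters train in higher precision: bf16 weights + bf16 grads + two fp32 AdamW moments
adapter_overhead = params_lora * (2 + 2 + 4 + 4) / 1e9

print(f"quantized base weights: ~{weights_4bit:.1f} GB")      # ~3.5 GB
print(f"adapter training state: ~{adapter_overhead:.2f} GB")  # tiny
# The rest of a 12-16 GB card goes to activations, the CUDA context, and fragmentation.
```

The punchline: the optimizer state that dominates full fine-tuning becomes negligible once only the adapters are trainable.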
The PEFT documentation explains the theory behind LoRA. The HuggingFace LLM Course chapter on LoRA is a good walkthrough. For the quantization side, check the PEFT quantization guide.
Where to Get GPUs
You don't need to buy hardware. Rent it.
Google Colab
The easiest place to start. Open a notebook, pick a runtime, and go.
A subscription is essentially required for serious training, though. The free tier gives you a T4 with frequent timeouts and disconnects; Pro ($10/mo) and Pro+ ($50/mo) unlock A100s and longer sessions.
The real value: tons of community notebooks. Find a training template for almost any model, tweak it, run it. Great for learning. Google's FunctionGemma fine-tuning notebook is a solid example of a real fine-tuning workflow. See the Colab FAQ for details on runtime limits and GPU availability.
HuggingFace Jobs
This is a nice middle ground: various GPU types (A10G, A100, etc.), helpful abstractions, and HuggingFace handles environment setup, checkpointing, and artifact storage.
You write the code, upload your dataset to the Hub, and their infrastructure runs it. Less friction than managing your own instance.
There's also a Claude Code skill for fine-tuning models with HF Jobs. Describe what you want and it handles the config. I used it to train Qwen2.5-Coder-7B-Agentic-CoT-LoRA.
Lambda Labs
Maximum power. Minimum hand-holding.
You SSH into a bare Linux machine with GPUs attached. A100 80GB, H100s, multi-GPU setups. Full control over your environment.
```bash
ssh ubuntu@<your-instance-ip>
pip install torch transformers datasets peft accelerate
python train.py
```
Do not forget to turn off your instance. You're billed by compute time. If you walk away and forget to terminate, you will wake up to a bill. Set a calendar reminder.
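The arithmetic on forgetting is unforgiving. Using an example on-demand A100 rate (illustrative; check current pricing):

```python
hourly_rate = 1.79  # $/hr for an on-demand A100 (example rate, not a quote)

for hours in (8, 24, 72):  # an evening, a day, a long weekend
    print(f"{hours:>3} h forgotten -> ${hours * hourly_rate:.2f}")
```

A long weekend of an idle instance costs more than a month of Colab Pro+.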
Their getting started guide walks through SSH setup and instance management.
Comparison
| Provider | Best For | Hourly Cost | Hand-Holding |
|---|---|---|---|
| Colab Pro+ | Learning, experiments | ~$0.10 | High |
| HF Jobs | Production training | Variable | Medium |
| Lambda | Power users | $1.29-1.79 (A100) | Low |
| RunPod | Budget training | ~$0.30 (3090) | Low |
A Complete Example
Here's a full QLoRA script using SFTTrainer that runs on whatever GPU you have:
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("your-username/your-dataset", split="train")

# SFTConfig extends TrainingArguments with the SFT-specific options
args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,  # SFTTrainer handles the PEFT wrapping
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./my-finetuned-model")
trainer.push_to_hub()  # pushes to the Hub repo named by args.hub_model_id (or the output dir name)
```
The TRL PEFT integration guide has more examples for different training scenarios.
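One knob in the script worth unpacking: gradient_accumulation_steps=4 together with per_device_train_batch_size=4 gives an effective batch of 16, by summing gradients across micro-batches before stepping. Stripped down to plain PyTorch, the pattern looks like this (toy model and random data of my own, just to show the control flow):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

accum_steps = 4
optimizer.zero_grad()
for step in range(16):
    x = torch.randn(4, 16)                       # micro-batch of 4
    y = torch.randint(0, 2, (4,))
    loss = criterion(model(x), y) / accum_steps  # scale so the summed grads average out
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per effective batch of 16
        optimizer.zero_grad()
```

This is how you simulate large-batch training on a GPU that can only fit a few samples at a time; the only costs are more forward passes per update and slightly stale batch statistics.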
I hope you enjoyed this walkthrough. This space moves fast, and tools, techniques, and best practices evolve constantly, so check out the resources linked throughout for deeper, more up-to-date coverage. The best way to get familiar with the offerings and approaches available is to pick one and start experimenting.