Agent-Assisted Fine-Tuning

Using coding agents like Claude Code and OpenAI Codex to automate the entire LLM fine-tuning workflow—from GPU selection to model deployment—through natural language instructions.

Why Agent-Assisted Fine-Tuning?

Fine-tuning LLMs traditionally requires deep MLOps expertise: selecting hardware, configuring training scripts, managing datasets, monitoring jobs, and deploying models. Coding agents can now handle this entire workflow autonomously, making custom model training accessible to developers without ML infrastructure experience.

Automatic Hardware Selection

Agent picks optimal GPU based on model size, training method, and budget

Dataset Validation

Pre-flight checks on CPU before incurring GPU costs

Job Orchestration

Submit, monitor, and manage training runs via conversation

Local Deployment

Convert to GGUF and run with llama.cpp automatically

Real-World Impact

Teams report spending $20-30 total for multiple training runs including failed experiments—cheaper than one hour of ML consulting. The agent handles hardware selection, job orchestration, and monitoring, removing friction from the fine-tuning process.

Architecture

Agent-Assisted Fine-Tuning Flow
┌─────────────────────────────────────────────────────────────────┐
│                      CODING AGENT                                │
│              (Claude Code / Codex / Gemini CLI)                  │
│                                                                  │
│  User: "Fine-tune Qwen-7B on my customer support data"          │
│                                                                  │
│  Agent Actions:                                                  │
│  1. Validate dataset format                                      │
│  2. Select hardware (a10g-large for 7B + LoRA)                  │
│  3. Generate training configuration                              │
│  4. Submit job to compute platform                               │
│  5. Monitor progress and report status                           │
│  6. Convert to GGUF for local deployment                         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              │ Skills / Plugins
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Hugging Face   │  │    Unsloth      │  │   Local LLM     │
│     Jobs        │  │                 │  │  (llama.cpp)    │
│                 │  │                 │  │                 │
│ - Managed GPU   │  │ - 2x faster     │  │ - Private data  │
│ - Auto scaling  │  │ - 30% less VRAM │  │ - No API costs  │
│ - Trackio logs  │  │ - GGUF export   │  │ - Offline use   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Fine-Tuned     │
                    │     Model       │
                    │                 │
                    │ • HF Hub        │
                    │ • GGUF local    │
                    │ • API endpoint  │
                    └─────────────────┘

Hugging Face Skills

The hf-llm-trainer skill teaches coding agents everything needed for fine-tuning: which GPU to pick, how to configure training, when to use LoRA vs full fine-tuning, and how to handle the dozens of decisions in a successful training run.

Hugging Face Skills Setup
# Install Hugging Face Skills plugin
/plugin marketplace add huggingface/skills
/plugin install hf-llm-trainer@huggingface-skills

# Authenticate with Hugging Face (requires Pro plan for Jobs)
hf auth login
# Or set environment variable
export HF_TOKEN=hf_your_write_access_token_here

# Start fine-tuning with natural language
# Claude Code handles everything automatically:
# - GPU selection based on model size
# - Training script configuration
# - Job submission and monitoring
# - Model upload to Hub
# Simple fine-tuning request
User: "Fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset
       for instruction following."

# Agent automatically:
# 1. Validates dataset format
# 2. Selects hardware (t4-small for 0.6B model)
# 3. Configures training with Trackio monitoring
# 4. Submits job to Hugging Face Jobs
# 5. Reports cost estimate (~$0.30)

# Production run with specific parameters
User: "SFT Qwen-0.6B for production on my-org/support-conversations.
       Checkpoints every 500 steps, 3 epochs, cosine learning rate."

# Multi-stage pipeline
User: "Train a math reasoning model:
       1. SFT on openai/gsm8k
       2. DPO alignment with preference data
       3. Convert to GGUF Q4_K_M for local deployment"

Hardware & Cost Guide

Model Size | Recommended GPU        | Training Time | Estimated Cost
<1B        | t4-small               | Minutes       | $1-2
1-3B       | t4-medium / a10g-small | Hours         | $5-15
3-7B       | a10g-large (LoRA)      | Hours         | $15-40
7-13B      | a100-large (LoRA)      | Hours         | $40-100
70B+       | Multi-GPU / QLoRA      | Many hours    | $100+

Cost Optimization

Start with small test runs (100 examples) to validate your workflow before committing to full training. The agent automatically suggests appropriate hardware to balance cost and performance.
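
For example, a cheap smoke test is to push a 100-example slice of the dataset to the Hub and point the agent at it. A minimal sketch with the datasets library (repo ids are placeholders):

from datasets import load_dataset

# Take a 100-example slice for a cheap smoke test before the full run
full = load_dataset("my-org/support-conversations", split="train")
smoke_test = full.shuffle(seed=42).select(range(100))

# Push the slice as its own repo so the agent can target it directly
smoke_test.push_to_hub("my-org/support-conversations-smoke-test")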

Training Methods

Coding agents support multiple training methods, automatically selecting the best approach based on your dataset and goals:

Training Methods
# SFT (Supervised Fine-Tuning)
# Best for: High-quality input-output demonstration pairs
# Use cases: Customer support, code generation, domain Q&A

User: "Fine-tune Qwen3-0.6B on my-org/support-conversations for 3 epochs."

# Agent selects:
# - LoRA for models >3B (memory efficient)
# - Full fine-tuning for smaller models
# - Appropriate batch size and learning rate

# Dataset format (messages column):
{
  "messages": [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password..."}
  ]
}
# DPO (Direct Preference Optimization)
# Best for: Preference-annotated data (chosen vs rejected)
# Use cases: Alignment, reducing harmful outputs, style matching

User: "Run DPO on my-org/preference-data to align the SFT model.
       Dataset has 'chosen' and 'rejected' columns."

# No separate reward model needed
# Typically applied after SFT

# Dataset format:
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses quantum bits...",
  "rejected": "Quantum computing is magic..."
}
# GRPO (Group Relative Policy Optimization)
# Best for: Verifiable tasks with programmatic success criteria
# Use cases: Math reasoning, code generation, structured problems

User: "Train a math reasoning model using GRPO on openai/gsm8k
       based on Qwen3-0.6B."

# Model generates responses and receives rewards
# Learning from verifiable outcomes
# Particularly effective for reasoning tasks

# The agent configures:
# - Reward function based on answer correctness
# - Multiple response sampling
# - Relative ranking within groups

When to Use Each Method

Method | Best For                                       | Dataset Requirements
SFT    | Teaching specific behaviors, domain adaptation | messages column with conversations
DPO    | Alignment, preference learning, safety         | chosen and rejected columns
GRPO   | Math, code, verifiable reasoning tasks         | Tasks with programmatic success criteria

Multi-Stage Pipelines

For best results, combine methods: SFT to teach behaviors, then DPO for alignment, then GRPO for reasoning. The agent can orchestrate multi-stage pipelines automatically.
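
Under the hood, such a pipeline reduces to chaining trainers. A minimal two-stage sketch with TRL, assuming a recent trl release (model and dataset ids are placeholders; a GRPO stage would follow the same pattern):

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Stage 1: supervised fine-tuning on conversation data
sft = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("my-org/support-conversations", split="train"),
    args=SFTConfig(output_dir="sft-out", num_train_epochs=3),
    processing_class=tokenizer,
)
sft.train()
sft.save_model("sft-out")

# Stage 2: DPO alignment on preference data, starting from the SFT checkpoint
dpo = DPOTrainer(
    model="sft-out",
    train_dataset=load_dataset("my-org/preference-data", split="train"),
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    processing_class=tokenizer,
)
dpo.train()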

Running with Local LLMs

For private data or to avoid API costs, connect coding agents to local LLMs via llama.cpp's OpenAI-compatible API:

Local LLM Setup
# Build llama.cpp with GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for Mac
cmake --build build --config Release

# Download quantized model (e.g., GLM-4.7-Flash 30B MoE)
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='unsloth/GLM-4.7-Flash-GGUF',
    filename='GLM-4.7-Flash-UD-Q4_K_XL.gguf',
    local_dir='./models'
)
"

# Start server with OpenAI-compatible API
./build/bin/llama-server \
    -m ./models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --port 8000 \
    --host 0.0.0.0 \
    -c 32768 \
    --temp 1.0 \
    --top-p 0.95 \
    --jinja  # Enable tool calling support
# Point Claude Code to local server
export ANTHROPIC_BASE_URL="http://localhost:8000"

# Run with local model
claude --model unsloth/GLM-4.7-Flash

# For unrestricted execution (use with caution)
claude --model unsloth/GLM-4.7-Flash --dangerously-skip-permissions
# ~/.codex/config.toml
[model_providers.llama_cpp]
name = "llama.cpp"
base_url = "http://localhost:8000/v1"
wire_api = "chat"

# Run Codex with local model
codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp

Model Requirements

Local models (20B-80B parameters) work well for orchestration tasks but may struggle with complex multi-file code generation where frontier models excel. Best for: summarization, Q&A, working with sensitive documents.

Recommended Models

For local fine-tuning orchestration, try GLM-4.7-Flash (30B MoE, optimized for coding), Qwen2.5-Coder (various sizes), or DeepSeek-Coder. Use Q4_K_M quantization for best size/quality balance.

Fine-Tuning Frameworks

Agents can drive various fine-tuning frameworks. Here are the most popular options:

Fine-Tuning Frameworks
# config.yaml - Axolotl configuration
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

# Dataset
datasets:
  - path: my-org/training-data
    type: chat_template
    chat_template: chatml

# LoRA configuration
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

# Training parameters
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.1

# Optimization
bf16: true
flash_attention: true
gradient_checkpointing: true

# Output
output_dir: ./outputs/qwen-finetuned
hub_model_id: username/qwen-finetuned
push_to_hub: true
# LLaMA-Factory - zero-code fine-tuning

# Install
pip install llamafactory

# Launch web UI for no-code training
llamafactory-cli webui

# Or use CLI with YAML config
llamafactory-cli train \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --dataset my_dataset \
    --finetuning_type lora \
    --lora_rank 32 \
    --output_dir ./outputs \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 3 \
    --learning_rate 2e-4 \
    --bf16 true

# Export to GGUF for local deployment
llamafactory-cli export \
    --model_name_or_path ./outputs \
    --export_quantization_bit 4 \
    --export_dir ./gguf-output
# Unsloth - 2x faster fine-tuning with less VRAM
from unsloth import FastLanguageModel
import torch

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

# Add LoRA adapters (2x faster than standard)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
)

# Train with the TRL SFTTrainer
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a chat-formatted dataset (placeholder repo id)
dataset = load_dataset("my-org/training-data", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)

trainer.train()

# Save and convert to GGUF
model.save_pretrained_gguf(
    "outputs-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

Framework Comparison

Framework     | Best For                        | Key Features
Axolotl       | Multi-GPU, production workloads | YAML config, DeepSpeed, FSDP, extensive model support
LLaMA-Factory | Beginners, quick experiments    | Web UI, zero-code option, 100+ models supported
Unsloth       | Speed and efficiency            | 2x faster training, 30% less VRAM, native GGUF export
HF TRL        | Maximum flexibility             | Official HF library, RLHF support, research-grade

LoRA and QLoRA

Parameter-efficient fine-tuning methods that make training large models feasible on consumer hardware:

LoRA (Low-Rank Adaptation)

Trains small adapter layers instead of full model weights (see the configuration sketch after this list)

  • Typically r=32, alpha=64
  • ~1% of original parameters
  • Preserves base model quality
  • Multiple adapters per base model
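
A minimal configuration sketch with the PEFT library (the base model id is just an example):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config = LoraConfig(
    r=32,                      # adapter rank
    lora_alpha=64,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically around 1% of the base parameters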

QLoRA (Quantized LoRA)

Combines LoRA with 4-bit quantization (see the loading sketch after this list)

  • Fine-tune 70B-class models on a single 48GB GPU
  • 4-bit NormalFloat (NF4) quantization
  • Double quantization for extra memory savings
  • Paged optimizers to absorb memory spikes
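
A minimal loading sketch with transformers and bitsandbytes (model id is an example); LoRA adapters are then attached exactly as in the sketch above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# In TrainingArguments, optim="paged_adamw_8bit" helps absorb optimizer memory spikes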

When to Use Which

Use LoRA for 95% of production fine-tuning needs—it's efficient and maintains quality. Use QLoRA when VRAM is limited or training very large models. Use full fine-tuning only when maximum accuracy is critical and resources are abundant.

Skill Transfer with upskill

An alternative to weight-based fine-tuning: transfer expertise from expensive models to cheaper ones via structured context. Hugging Face's upskill tool automates this "Robin Hood" approach—teaching open models skills that frontier models have mastered.

Skill Transfer
# Install upskill tool
pip install upskill
# Or use uvx for one-off runs
uvx upskill --help

# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export HF_TOKEN=hf_...

# Generate skill from task description (uses Opus as teacher)
upskill generate "build optimized CUDA kernels for PyTorch"

# Generate from existing execution trace
upskill generate "write kernels" --from ./claude-trace.md

# Improve existing skill
upskill generate "add more error handling" --from ./skills/kernel-builder/

# Evaluate skill on different models
upskill eval ./skills/kernel-builder/ --model haiku --model sonnet --runs 5

# Evaluate with local model via llama.cpp
llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
upskill eval ./skills/my-skill/ \
    --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
    --base-url http://localhost:8080/v1
# Agent Skills follow the agentskills.io specification
# A skill is a directory with structured context

./skills/kernel-builder-cuda/
  SKILL.md              # Main instructions (~500 tokens)
  skill_meta.json       # Metadata and test cases

# SKILL.md contains domain expertise:
# - Project structure and conventions
# - Configuration file formats
# - API usage patterns
# - Common pitfalls and solutions
# - Best practices

# skill_meta.json defines evaluation:
{
  "name": "h100-cuda-kernel-builder",
  "version": "1.0.0",
  "cases": [
    {
      "input": "Create a build.toml for CUDA targeting H100",
      "expected": {"contains": "9.0"}  // compute capability
    },
    {
      "input": "Write a basic CUDA kernel template",
      "expected": {"contains": "cuda_runtime.h"}
    }
  ]
}
# The "Robin Hood" approach to skill transfer:
# 1. Expensive teacher model solves the problem
# 2. Export the solution as a skill
# 3. Cheaper student models use the skill

# Step 1: Teacher (Opus) builds solution interactively
User: "Help me write CUDA kernels for diffusers library"
Claude Opus: [solves problem, exports trace]

# Step 2: Convert execution trace to skill
upskill generate "write CUDA kernels" --from ./opus-trace.md

# Step 3: Evaluate skill lift on student models
upskill eval ./skills/cuda-kernels/ --model haiku --model kimi

# Results show improvement per model:
# Model       | Without Skill | With Skill | Lift
# ------------|---------------|------------|------
# Haiku       | 45%           | 82%        | +37%
# GLM-4.7     | 40%           | 85%        | +45%
# Kimi-K2     | 60%           | 95%        | +35%
# Opus 4.5    | 90%           | 88%        | -2%  (skip!)

# Key insight: Skills don't work universally
# Always measure both accuracy AND token usage

Skills vs Fine-Tuning

Approach       | Modifies      | Best For                                                | Trade-offs
Fine-Tuning    | Model weights | Consistent style, domain-specific knowledge, production | GPU cost, training time, potential catastrophic forgetting
Skill Transfer | Context only  | Specialized tasks, quick iteration, model comparison    | Token overhead, skill maintenance, varies by model

When to Use Skills

Skills work best for specialized domains where ~500 tokens of context can encode the expertise. Examples: CUDA kernel patterns, API usage, coding conventions, project-specific guidelines.

Measure Per Model

Skills don't improve all models equally—some may even regress. Always evaluate on your target model before deployment. Use upskill eval to measure accuracy and token efficiency.

End-to-End Workflow

Here's a complete agent-assisted fine-tuning workflow from dataset validation to local deployment:

Complete Workflow
# Complete agent-assisted fine-tuning workflow

# Step 1: Dataset Validation (runs on CPU - no cost)
User: "Check if my-org/conversation-data works for SFT training."

Agent Response:
  Dataset validation for my-org/conversation-data:

  SFT: READY
    Found 'messages' column with conversation format

  DPO: INCOMPATIBLE
    Missing 'chosen' and 'rejected' columns

# Step 2: Test Run (quick validation)
User: "Do a quick test run to SFT Qwen-0.6B with 100 examples."

Agent Response:
  Configuration:
  - Hardware: t4-small (~$0.75/hour)
  - Estimated time: ~5 minutes
  - Estimated cost: ~$0.10

  Submitting test job...
  Job ID: test-abc123
  Monitor: https://huggingface.co/jobs/username/test-abc123

# Step 3: Production Training
User: "Looks good! Run full training with 3 epochs."

Agent Response:
  Configuration:
  - Hardware: t4-medium (~$1.50/hour)
  - Estimated time: ~2 hours
  - Estimated cost: ~$3.00
  - Checkpoints: Every 500 steps

  Job submitted: prod-xyz789

# Step 4: Monitor Progress
User: "How's my training job doing?"

Agent Response:
  Job: prod-xyz789
  Status: Running (step 1500/3000)
  Training loss: 1.23 -> 0.45
  Learning rate: 1.8e-4
  ETA: ~45 minutes

# Step 5: Deploy Locally
User: "Training complete! Convert to GGUF Q4_K_M and push to Hub."

Agent Response:
  Converting to GGUF...
  Quantization: Q4_K_M (best size/quality balance)
  Pushed to: username/qwen-finetuned-gguf

  Run locally:
  llama-server -hf username/qwen-finetuned-gguf:Q4_K_M

Compatible Coding Agents

Agent        | HF Skills Support  | Local LLM Support        | Notes
Claude Code  | Yes (plugin)       | Yes (ANTHROPIC_BASE_URL) | Most capable; reportedly writes ~90% of its own code
OpenAI Codex | Yes (instructions) | Yes (config.toml)        | Good for OpenAI ecosystem users
Gemini CLI   | Yes (extensions)   | Limited                  | Google Cloud integration
Aider        | Manual             | Yes                      | Git-focused, good for code changes
Anon Kode    | Manual             | Yes (native)             | LLM-agnostic Claude Code fork

Best Practices

1. Start Small

Run test jobs with 100 examples before full training. Validate dataset format on CPU first. This catches issues before incurring GPU costs.
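
A CPU-only pre-flight check can be as simple as loading the dataset and asserting the expected structure (repo id is a placeholder):

from datasets import load_dataset

ds = load_dataset("my-org/conversation-data", split="train")
assert "messages" in ds.column_names, "SFT expects a 'messages' column"

first = ds[0]["messages"]
assert all(m.get("role") in {"system", "user", "assistant"} for m in first), "unexpected role"
print(f"{len(ds)} examples; first conversation has {len(first)} turns")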

2. Use Checkpoints

Save checkpoints every 500 steps for long runs. This allows recovery from failures and enables evaluation at different training stages.
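
With the transformers Trainer this maps to a couple of TrainingArguments fields, which the agent normally sets for you; a sketch:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",
    save_steps=500,          # checkpoint every 500 steps
    save_total_limit=3,      # keep only the most recent checkpoints to save disk
)
# After a failure, resume with trainer.train(resume_from_checkpoint=True)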

3. Monitor Loss Curves

Watch training loss via Trackio or W&B. Flat loss means the model isn't learning; spiking loss indicates issues with learning rate or data.
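
With the transformers Trainer, frequent loss logging and a W&B report target are one-line settings; Trackio is designed as a drop-in replacement for the wandb client. A sketch:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    logging_steps=10,        # log loss often enough to spot flat or spiking curves
    report_to="wandb",       # send metrics to Weights & Biases
)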

4. Version Your Data

Push datasets to Hugging Face Hub with version tags. This ensures reproducibility and makes it easy to iterate on data quality.
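
A sketch with the Hub client (repo id and tag are placeholders):

from datasets import load_dataset
from huggingface_hub import HfApi

ds = load_dataset("json", data_files="train.jsonl", split="train")
ds.push_to_hub("my-org/training-data")  # upload the current snapshot
HfApi().create_tag("my-org/training-data", tag="v1.0", repo_type="dataset")  # pin this revision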

5. Test Before Deploy

Run the fine-tuned model through your evaluation suite before production. Use the agent to help write and run evaluation scripts.
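
A tiny smoke test might just generate answers for a few held-out prompts and score or eyeball them; a sketch assuming a recent transformers release (model id is a placeholder):

from transformers import pipeline

generate = pipeline("text-generation", model="username/qwen-finetuned")
prompts = ["How do I reset my password?", "What is your refund policy?"]
for prompt in prompts:
    out = generate([{"role": "user", "content": prompt}], max_new_tokens=128)
    print(prompt, "->", out[0]["generated_text"][-1]["content"])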

Resources

upskill: Skill Transfer for LLMs - Teaching open models with Agent Skills

upskill Repository - CLI tool for skill generation and evaluation

Agent Skills Specification - Standard format for portable skills

Hugging Face Skills for Training - Official guide to agent-assisted fine-tuning

Unsloth + Claude Code Guide - Local LLM fine-tuning setup

Axolotl Framework - Production-grade fine-tuning

LLaMA-Factory - Zero-code fine-tuning with web UI

Unsloth - 2x faster fine-tuning
