Fine-tuning

Custom Training Pipeline

Automated pipeline to convert codebases into fine-tuning datasets. Supports LoRA adapters for any compatible model.

My role: Solo project — built to solve the codebase-to-fine-tuning data quality problem.

Problem

Turning internal codebases into fine-tuned models requires careful chunking, deduplication, and prompt formatting. Manual dataset creation is error-prone and impossible to scale across multiple repositories or languages. Most off-the-shelf tools don't respect semantic boundaries — they split mid-function, degrading training signal.

What this demonstrates

  • AST-aware chunking strategy that respects function and class boundaries to preserve training signal
  • Configurable instruction-response template system compatible with the OpenAI fine-tuning format and with Axolotl/Unsloth
  • Quality filtering pipeline that deduplicates and removes low-signal examples before training
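
The template system's target output is OpenAI-compatible JSONL. A minimal sketch of that emission step, assuming the pipeline hands it (instruction, response) pairs; the `to_openai_jsonl` helper and the pair structure are illustrative, only the `messages` record schema is the actual OpenAI fine-tuning format:

```python
import json

# Hypothetical helper: serialize (instruction, response) pairs into
# OpenAI fine-tuning JSONL, one chat-format record per line.
def to_openai_jsonl(pairs: list[tuple[str, str]]) -> str:
    lines = []
    for instruction, response in pairs:
        record = {
            "messages": [
                {"role": "user", "content": instruction},
                {"role": "assistant", "content": response},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_openai_jsonl([("Explain this function", "def f(): ...")])
```

Each line parses independently, so downstream filtering can stream the file without loading it whole.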

Architecture

Tech stack

  • Python: AST parsing via the stdlib ast module for Python sources; the same chunking interface also handles TypeScript and Go.
  • LoRA / PEFT: Low-rank adapter training keeps base model weights frozen; adapters are <100MB per fine-tune.
  • Supabase: Dataset and job metadata storage; training jobs tracked with status and output paths.
  • Next.js API routes: Job submission and status polling endpoints consumed by the dashboard.

Code

chunker.py
# Semantic chunking with AST-aware boundaries.
# ParsedFile and Chunk are defined in the pipeline's parser module;
# _find_boundaries and _split_large are elided here.
class SemanticChunker:
    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        self.chunk_size = chunk_size  # max tokens per chunk
        self.overlap = overlap        # tokens of shared context between splits

    def chunk(self, source: ParsedFile) -> list[Chunk]:
        # Boundaries are (start, end) token offsets of top-level
        # function and class definitions.
        boundaries = self._find_boundaries(source.ast)
        chunks: list[Chunk] = []
        for start, end in boundaries:
            tokens = source.tokens[start:end]
            if len(tokens) > self.chunk_size:
                # Oversized units are split with `overlap` tokens of context.
                chunks.extend(self._split_large(tokens))
            else:
                chunks.append(Chunk(tokens=tokens, metadata=source.meta))
        return chunks
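
The quality-filtering stage mentioned above deduplicates examples before training. A minimal sketch of the exact-dedup step, assuming whitespace-normalized hashing; the `dedup` function is illustrative, and near-duplicate detection (e.g. MinHash) would layer on top of it:

```python
import hashlib

# Hypothetical exact-dedup stage: hash whitespace-normalized text and
# keep only the first occurrence of each digest.
def dedup(examples: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for ex in examples:
        normalized = " ".join(ex.split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(ex)
    return kept
```

Normalizing before hashing catches duplicates that differ only in indentation or trailing whitespace, which are common when the same snippet appears in multiple repositories.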

Results

  • Reference implementation — training jobs run against external fine-tuning APIs, not self-hosted
  • Demonstrates end-to-end dataset generation from a real TypeScript codebase to OpenAI-compatible JSONL
  • Chunker produces ~3× more usable training pairs than naive line-split on the same source

Screenshots

Training pipeline dashboard — desktop view
Training pipeline dashboard — mobile view

GitHub

Dataset schema, training job API routes, and the chunker are in this repo. Full Python pipeline available on request.

View on GitHub