Custom Training Pipeline
Automated pipeline to convert codebases into fine-tuning datasets. Supports LoRA adapters for any compatible model.
My role: Solo project — built to solve the codebase-to-fine-tuning data quality problem.
Problem
Turning internal codebases into fine-tuned models requires careful chunking, deduplication, and prompt formatting. Manual dataset creation is error-prone and impossible to scale across multiple repositories or languages. Most off-the-shelf tools don't respect semantic boundaries — they split mid-function, degrading training signal.
What this demonstrates
- AST-aware chunking strategy that respects function and class boundaries to preserve training signal
- Configurable instruction-response template system compatible with OpenAI fine-tuning format and Axolotl/unsloth
- Quality filtering pipeline that deduplicates and removes low-signal examples before training
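The filtering step above can be sketched with a hash-based exact-duplicate filter plus a minimum-length cutoff. This is a minimal illustration, not the repo's actual code; the function name and thresholds are assumptions, and the real pipeline's quality criteria may be richer.

```python
import hashlib

def dedupe(examples: list[str], min_chars: int = 40) -> list[str]:
    """Drop exact duplicates (after whitespace normalization) and
    very short, low-signal examples. Thresholds are illustrative."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in examples:
        normalized = " ".join(text.split())
        if len(normalized) < min_chars:
            continue  # too short to carry useful training signal
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate under normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Normalizing whitespace before hashing catches reformatted copies of the same snippet, which are common across forked files in a monorepo.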
Architecture
Tech stack
- Python: AST parsing via stdlib ast module; language-agnostic chunking for Python, TypeScript, and Go.
- LoRA / PEFT: Low-rank adapter training keeps base model weights frozen; adapters are <100MB per fine-tune.
- Supabase: Dataset and job metadata storage; training jobs tracked with status and output paths.
- Next.js API routes: Job submission and status polling endpoints consumed by the dashboard.
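To illustrate the stdlib `ast` approach mentioned above, here is a sketch of boundary discovery for Python sources. The function name is an assumption, not the repo's actual helper, but the idea matches the chunker: collect line spans for top-level definitions so a cut never lands mid-function.

```python
import ast

def find_boundaries(source: str) -> list[tuple[int, int]]:
    """Return (start_line, end_line) spans for top-level functions and
    classes, so chunk cuts never split a definition."""
    tree = ast.parse(source)
    spans = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is populated by ast.parse on Python 3.8+
            spans.append((node.lineno, node.end_lineno))
    return spans

code = "def a():\n    return 1\n\nclass B:\n    def m(self):\n        pass\n"
# find_boundaries(code) -> [(1, 2), (4, 6)]
```

TypeScript and Go need their own parsers (e.g. tree-sitter grammars), but the boundary contract stays the same: a list of line spans per file.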
Code
chunker.py
# Semantic chunking with AST-aware boundaries
class SemanticChunker:
    def __init__(self, chunk_size=512, overlap=64):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, source: ParsedFile) -> list[Chunk]:
        boundaries = self._find_boundaries(source.ast)
        chunks = []
        for start, end in boundaries:
            tokens = source.tokens[start:end]
            if len(tokens) > self.chunk_size:
                # Oversized spans are re-split with overlap to fit chunk_size
                chunks.extend(self._split_large(tokens))
            else:
                chunks.append(Chunk(tokens=tokens, metadata=source.meta))
        return chunks
Results
- Reference implementation — training jobs run against external fine-tuning APIs, not self-hosted
- Demonstrates end-to-end dataset generation from a real TypeScript codebase to OpenAI-compatible JSONL
- Chunker produces ~3× more usable training pairs than naive line-split on the same source
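The JSONL output step can be sketched as follows. The helper name, field contents, and system prompt are illustrative, but the `messages` layout matches OpenAI's chat fine-tuning format (one JSON object per line):

```python
import json

# Hypothetical helper: render one instruction-response pair as a JSONL
# line in the OpenAI chat fine-tuning format. The actual pipeline's
# template system is configurable; this shows only the target shape.
def to_openai_record(instruction: str, response: str,
                     system: str = "You are a coding assistant.") -> str:
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
    }
    return json.dumps(record)

line = to_openai_record("Explain what this function does.",
                        "It splits a parsed file into AST-aware chunks.")
```

Writing one record per line keeps the output streamable, so large repositories can be processed without holding the full dataset in memory.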
GitHub
Dataset schema, training job API routes, and the chunker are in this repo. Full Python pipeline available on request.
View on GitHub