June 6, 2026
Python PyTorch Low-Resource Languages LLM Training Autocomplete
Data: ~70M tokens
Perplexity: 15
Size: 42M parameters

Why Qal

Tigrinya is spoken by over 7 million people across Eritrea, Ethiopia, and the diaspora, but digital writing support is still limited.

Some keyboards offer word-level autocomplete, especially on mobile. But for longer writing, the experience is still rough:

  • Limited autocomplete support
  • Minimal spell-check
  • Limited voice typing

Even for shorter messages, many of us still switch to English letters because it feels faster in the moment.

Qal starts with that daily friction: making Tigrinya easier to type, one useful suggestion at a time.

The goal is not to replace existing work or claim this solves everything. It is to build one practical piece of the puzzle: a real-time autocomplete system for Tigrinya, while documenting what it takes to build a small language model under low-resource constraints.

What I Built

Qal is an end-to-end transformer-based autocomplete system for Tigrinya, built from scratch.

  • Data pipeline for collecting, cleaning, deduplicating, and preparing Tigrinya text
  • Custom BPE tokenizer designed around Tigrinya script and text patterns
  • GPT-2 style transformer trained on roughly 70M tokens of cleaned text
  • Training pipeline for single-GPU and multi-GPU experiments
  • Inference API for serving real-time next-word suggestions

The focus is the full path from raw text to a practical writing tool, not just training a model in isolation.
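To make the data step more concrete, here is a minimal sketch of cleaning and exact deduplication for Tigrinya text. It is an illustration, not Qal's actual pipeline: the function names, the 0.7 script-ratio threshold, and the choice of SHA-1 hashing are all assumptions made for the example. It normalizes Unicode, collapses whitespace, drops lines that are mostly not in Ethiopic script, and removes exact duplicates.

```python
import hashlib
import re
import unicodedata

# Unicode blocks that hold Ethiopic (Ge'ez) script characters.
ETHIOPIC_RANGES = (
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
)


def is_ethiopic(ch):
    """True if the character falls in one of the Ethiopic blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ETHIOPIC_RANGES)


def ethiopic_ratio(text):
    """Fraction of non-whitespace characters that are Ethiopic."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_ethiopic(ch) for ch in chars) / len(chars)


def clean_and_dedupe(lines, min_ratio=0.7):
    """Normalize, filter by script, and drop exact duplicates.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    kept = []
    for line in lines:
        text = unicodedata.normalize("NFC", line)
        text = re.sub(r"\s+", " ", text).strip()
        if not text or ethiopic_ratio(text) < min_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Exact hashing only catches identical lines after normalization. Near-duplicate detection (for example, the same article with small edits) would need a fuzzier method and is not covered by this sketch.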

Key Technical Decisions

Building Qal forced a few important trade-offs.

The data was limited, so model size, vocabulary size, context length, and training setup all had to be chosen carefully. A larger model or larger vocabulary is not automatically better when the language has limited public text available.
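One way to see how these choices interact is to count parameters. The sketch below uses the standard GPT-2 layout (tied input and output embeddings, learned position embeddings, 4x MLP expansion). The example configuration is hypothetical and is not Qal's actual configuration; it only shows how much of a small model's budget the vocabulary can take up.

```python
def gpt2_param_count(vocab_size, d_model, n_layers, n_ctx):
    """Parameter count for a GPT-2 style model with tied embeddings."""
    token_emb = vocab_size * d_model   # shared with the output head
    pos_emb = n_ctx * d_model          # learned position embeddings
    # Per block: attention (4*d^2 weights) + MLP (8*d^2 weights),
    # plus 9*d biases and 4*d for the two LayerNorms.
    per_block = 12 * d_model ** 2 + 13 * d_model
    final_ln = 2 * d_model
    return token_emb + pos_emb + n_layers * per_block + final_ln


def embedding_share(vocab_size, d_model, n_layers, n_ctx):
    """Fraction of all parameters spent on the token embedding table."""
    total = gpt2_param_count(vocab_size, d_model, n_layers, n_ctx)
    return vocab_size * d_model / total


# Hypothetical small config: compare two vocabulary sizes.
for vocab in (8_000, 32_000):
    total = gpt2_param_count(vocab, d_model=512, n_layers=8, n_ctx=512)
    share = embedding_share(vocab, d_model=512, n_layers=8, n_ctx=512)
    print(f"vocab={vocab}: {total / 1e6:.1f}M params, {share:.0%} in embeddings")
```

In a model this small, the embedding table is a large fraction of the total, so growing the vocabulary shifts parameters away from the transformer blocks unless the overall budget grows with it.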

One surprising result came from tokenization. Larger vocabularies compressed text better, but models trained with smaller vocabularies performed better. With fewer token types, each one appears more often in the training data, so the model had less to memorize, fewer rare tokens to learn, and a better chance of using the available data well.
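A caveat when comparing runs across vocabulary sizes: token-level perplexity is not directly comparable, because a larger vocabulary splits the same text into fewer tokens. One common workaround is to convert to bits per character on the same held-out text. The sketch below assumes you have the token-level perplexity, the token count, and the character count for that text. The numbers in the example are made up for illustration and are not Qal's results.

```python
import math


def chars_per_token(n_chars, n_tokens):
    """Compression: average characters covered by one token."""
    return n_chars / n_tokens


def bits_per_char(token_ppl, n_tokens, n_chars):
    """Convert token-level perplexity into bits per character.

    Total negative log-likelihood in nats is n_tokens * ln(PPL);
    dividing by n_chars * ln(2) gives bits per character.
    """
    total_nats = n_tokens * math.log(token_ppl)
    return total_nats / (n_chars * math.log(2))


# Made-up numbers for the same 400-character held-out text.
small_vocab = bits_per_char(token_ppl=16, n_tokens=100, n_chars=400)
large_vocab = bits_per_char(token_ppl=64, n_tokens=80, n_chars=400)
print(f"small vocab: {small_vocab:.2f} bits/char")
print(f"large vocab: {large_vocab:.2f} bits/char")
```

In this toy case the larger vocabulary compresses better (5 characters per token versus 4) but still costs more bits per character, which is the kind of gap that raw perplexity numbers alone can hide.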

That became one of the main lessons of the project: decisions that work well for high-resource languages do not always transfer cleanly to Tigrinya.

The Journey

I’m documenting the full build in a 7-part technical series:

  1. Why I’m building Qal
  2. Data collection and cleaning
  3. Tokenization design
  4. Model architecture
  5. Training experiments and results
  6. Inference and deployment
  7. Lessons learned

The goal is to share the full process: the decisions, the mistakes, the trade-offs, and the parts that were harder than expected.

I also plan to share the code, model, and data where possible so others can inspect it, improve it, or build something better.

Stack

  • Language: Python
  • Framework: PyTorch
  • Model: GPT-2 style transformer
  • Tokenization: Hugging Face Tokenizers
  • Training: Single-GPU and multi-GPU setup
  • Inference: FastAPI
  • Deployment: Local (Mac Studio), Cloud (Modal)
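
For the inference piece of the stack, the core of a suggestion endpoint comes down to ranking the model's next-token scores. The sketch below is framework-free and assumes the model has already produced a list of logits for the next position. The function name, the tiny vocabulary, and the default of three suggestions are hypothetical, and a real endpoint would also need to merge subword tokens back into whole words.

```python
import math


def top_k_suggestions(logits, id_to_token, k=3):
    """Return the k most likely next tokens with softmax probabilities."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    # Rank token ids from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id_to_token[i], exps[i] / total) for i in ranked[:k]]


# Hypothetical three-token vocabulary and logits.
vocab = ["ሰላም", "ዓለም", "ቃል"]
for token, prob in top_k_suggestions([0.0, 2.0, 1.0], vocab, k=2):
    print(token, round(prob, 3))
```

A function like this can sit behind a single API route that takes the text typed so far and returns the ranked suggestions.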