Qal | Fanus Arefaine

Why Qal

Tigrinya is spoken by over 7 million people across Eritrea, Ethiopia, and the diaspora, but its digital tools are still catching up.

For years, researchers, linguists, and dedicated developers have pushed Tigrinya digital tools forward with keyboards, translation systems, datasets, and published research.

But for long-form writing, the gaps remain:

Limited autocomplete support
Minimal spell-check
Limited voice typing

Even for shorter messages, many of us still fall back on typing Tigrinya in Latin (English) letters because it feels faster in the moment.

Qal builds on that foundation and starts with one daily friction: making Tigrinya easier to type, one useful suggestion at a time.

The goal is not to replace existing work or claim this solves everything. It is to build one practical piece of the puzzle: a real-time autocomplete system for Tigrinya, while documenting what it takes to build a small language model from scratch under low-resource constraints.

What I Built

Qal is an end-to-end transformer-based autocomplete system for Tigrinya, built from scratch.

Data pipeline for collecting, cleaning, deduplicating, and preparing Tigrinya text
Custom BPE tokenizer designed around Tigrinya script and text patterns
GPT-2 style transformer (8 layers, 12 attention heads, 768-dimensional embeddings, 8,192-token vocabulary) trained on roughly 70M tokens of cleaned text
Training pipeline for single-GPU and multi-GPU experiments
Inference API for serving real-time next-word suggestions

The focus is the full path from raw text to a practical writing tool, not just training a model in isolation.
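As an illustration of the first step in that path, here is a minimal sketch of line-level cleaning and exact deduplication. It is not Qal's actual pipeline: the function names, the 0.7 script-ratio threshold, and the minimum length are assumptions made for the example. The only fixed fact it relies on is that the Ethiopic Unicode block (U+1200 to U+137F) covers the core script used to write Tigrinya.

```python
import hashlib
import re
import unicodedata

# Ethiopic Unicode block, which covers the core Ge'ez script used for Tigrinya.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def ethiopic_ratio(text: str) -> float:
    """Fraction of non-space characters that fall in the Ethiopic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(ETHIOPIC_START <= ord(c) <= ETHIOPIC_END for c in chars)
    return hits / len(chars)


def clean_line(line: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def clean_and_dedupe(lines, min_ratio=0.7, min_chars=5):
    """Yield cleaned, mostly-Ethiopic lines, dropping exact duplicates.

    min_ratio and min_chars are hypothetical thresholds for this sketch.
    """
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if len(line) < min_chars or ethiopic_ratio(line) < min_ratio:
            continue
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield line


sample = ["ሰላም  ዓለም", "ሰላም ዓለም", "Hello world", "ቃል"]
print(list(clean_and_dedupe(sample)))
```

Hashing the normalized line keeps memory use small on large corpora. Near-duplicate detection (for example, MinHash) would be a separate step and is not shown here.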

Key Technical Decisions

Building Qal forced a few important trade-offs.

The data was limited, so model size, vocabulary size, context length, and training setup all had to be chosen carefully. A larger model or larger vocabulary is not automatically better when the language has limited public text available.
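To make the size trade-off concrete, the sketch below estimates the parameter count of a GPT-2 style model with the dimensions listed earlier (8 layers, 768-dim, 8192 vocab). The context length of 512 is an assumption, since it is not stated here, and the formula assumes the standard GPT-2 layout: learned position embeddings, a 4x-wide MLP, biases on every linear layer, and an output head tied to the token embeddings. Qal's real implementation may differ in any of these details.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    n_layers: int = 8
    n_heads: int = 12           # heads split d_model; they do not change the count
    d_model: int = 768
    vocab_size: int = 8192
    context_length: int = 512   # assumption: not given in the text


def count_parameters(cfg: ModelConfig) -> int:
    """Estimate parameters for a GPT-2 style decoder with a tied output head."""
    d = cfg.d_model
    token_emb = cfg.vocab_size * d
    pos_emb = cfg.context_length * d
    attention = 4 * d * d + 4 * d          # Q, K, V, and output projections plus biases
    mlp = 2 * (d * 4 * d) + 4 * d + d      # two linear layers with a 4x hidden width
    layer_norms = 2 * (2 * d)              # two LayerNorms per block (scale and shift)
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d
    return token_emb + pos_emb + cfg.n_layers * per_layer + final_norm


base = count_parameters(ModelConfig())
bigger_vocab = count_parameters(ModelConfig(vocab_size=32000))
print(f"8192 vocab:  {base:,} parameters")
print(f"32000 vocab: {bigger_vocab:,} parameters")
print(f"tokens per parameter at 70M tokens: {70_000_000 / base:.2f}")
```

Under these assumptions the model has roughly 63M parameters, so 70M training tokens works out to about one token per parameter. Growing the vocabulary adds embedding rows without adding any training text, which is one way to see why a larger vocabulary is not automatically better here.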

One surprising result came from tokenization. Larger vocabularies compressed text better, but smaller vocabularies performed better in training. With a smaller vocabulary, the model had less to memorize, fewer rare tokens to learn, and a better chance of using the available data well.
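The compression side of that trade-off can be seen in a toy byte-pair encoding (BPE) trainer. This is a pure-Python sketch for illustration only, not Qal's tokenizer; the six-word corpus and the function names are made up for the example. Each merge adds one entry to the vocabulary and shortens the encoded text, but it also creates a new token type that appears less often than the pieces it was built from.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Break frequency ties by the pair itself so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word, merges):
    """Apply the learned merges to one word, in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def corpus_stats(words, num_merges):
    """Return (total tokens, distinct token types) for the encoded corpus."""
    merges = train_bpe(words, num_merges)
    tokens = [t for w in words for t in encode(w, merges)]
    return len(tokens), len(set(tokens))


# Tiny illustrative sample; a real corpus would be millions of words.
corpus = ["ሰላም", "ሰላም", "ሰላምታ", "ቃል", "ቃላት", "ትግርኛ"]
for n in (0, 2, 5):
    total, types = corpus_stats(corpus, n)
    print(f"{n} merges: {total} tokens, {types} distinct types")
```

On a real corpus the same pattern applies at scale: more merges mean fewer tokens per sentence, while the long tail of rarely seen token types grows, and each of those types has fewer examples to learn from.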

That became one of the main lessons of the project: decisions that work well for high-resource languages do not always transfer cleanly to Tigrinya.

The Journey

I’m sharing the journey of building Qal in a 7-part technical series:

Why Qal
Data collection and cleaning
Tokenization design
Model architecture
Training experiments and results
Inference and deployment
Lessons learned

The goal is to share the full process: the decisions, the mistakes, the trade-offs, and the parts that were harder than expected.

I’m also sharing the code, model, and data where possible so others can inspect it, improve it, or build something better.

Stack

Language: Python
Framework: PyTorch
Model: GPT-2 style transformer
Tokenization: Hugging Face Tokenizers
Training: Single-GPU and multi-GPU setup
Inference: FastAPI
Deployment: Local (Mac Studio), Cloud (Modal)
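For the inference layer, the core step behind a next-word suggestion is turning the model's output scores into a short ranked list. The sketch below shows that step in plain Python, assuming the model returns one logit per vocabulary entry. The function name, the example logits, and the four-word vocabulary are hypothetical; in a real service this would sit behind the API endpoint and read logits from the trained model.

```python
import math


def top_k_suggestions(logits, id_to_token, k=3):
    """Return the k most likely next tokens with their softmax probabilities."""
    max_logit = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits for a four-token vocabulary.
id_to_token = ["ሰላም", "ቃል", "ትግርኛ", "ዓለም"]
logits = [2.0, 0.5, 1.0, -1.0]
for token, prob in top_k_suggestions(logits, id_to_token, k=2):
    print(f"{token}\t{prob:.3f}")
```

Returning probabilities alongside the tokens lets the client hide low-confidence suggestions rather than always showing k candidates.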