Tuvalu · Pacific Islands

TuvaLLM

Speech Recognition for Tuvaluan

The first automatic speech recognition (ASR) system for Tuvaluan, a Polynesian language with no prior ASR support. Built by fine-tuning Facebook's MMS (Massively Multilingual Speech) model, using Samoan as a linguistic bridge.

Try It Yourself

Record speech in Tuvaluan using your microphone, or upload an audio file. The model transcribes in real time.


Why Tuvaluan ASR?

The Language

Te Gana Tuvalu

Tuvaluan is an Austronesian language spoken by approximately 11,000 people in Tuvalu, a small island nation in the central Pacific. It had no existing speech recognition technology before this project.

The Approach

Samoan Bridge

Samoan and Tuvaluan are closely related Polynesian languages with significant phonological and lexical overlap. We fine-tune Facebook's MMS model, which supports speech recognition for 1,100+ languages including Samoan, on CTC-aligned Tuvaluan parliament recordings.

The Data

Parliament Proceedings

Training data comes from the 2024 Tuvaluan Parliament sessions (June & December). Audio is aligned to official Hansard transcripts using a CTC forced alignment pipeline.
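
For readers unfamiliar with CTC forced alignment, here is a minimal sketch of the core idea in plain Python. It is not the project's actual pipeline code. The function name `ctc_forced_align`, the toy three-symbol vocabulary, and the probabilities are all assumptions made for illustration. Given per-frame log-probabilities from an acoustic model and the known transcript tokens, a Viterbi pass finds the most likely frame span for each token.

```python
import math

NEG_INF = float("-inf")

def ctc_forced_align(log_probs, tokens, blank=0):
    """Return one (start_frame, end_frame) span per token, end inclusive.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    tokens:    the known transcript as a list of symbol ids (no blanks).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    T = len(log_probs)
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Option 1: stay in the same state.
            best_prev, best = s, score[t - 1][s]
            # Option 2: advance from the previous state.
            if s >= 1 and score[t - 1][s - 1] > best:
                best_prev, best = s - 1, score[t - 1][s - 1]
            # Option 3: skip the blank between two different tokens.
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and score[t - 1][s - 2] > best):
                best_prev, best = s - 2, score[t - 1][s - 2]
            if best > NEG_INF:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = best_prev

    # A valid path ends on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames to align this transcript")

    # Walk backwards, widening each token's span as we go.
    spans = [None] * len(tokens)
    for t in range(T - 1, -1, -1):
        if ext[s] != blank:
            k = s // 2  # token states sit at odd positions of ext
            end = spans[k][1] if spans[k] else t
            spans[k] = (t, end)
        s = back[t][s]
    return spans

# Toy example: 5 frames whose most likely symbols are blank, 1, 1, 2, blank.
frames = [0, 1, 1, 2, 0]
log_probs = [[math.log(0.8 if v == best else 0.1) for v in range(3)]
             for best in frames]
print(ctc_forced_align(log_probs, tokens=[1, 2]))  # [(1, 2), (3, 3)]
```

Multiplying the frame spans by the model's frame duration gives timestamps, which is how long recordings can be cut into transcript-matched training segments.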

Open Source

Reproducible Pipeline

The full alignment, training, and evaluation pipeline is open source. Model weights and training code are available on GitHub.

GitHub Repository →

Model Performance

WER (Beam): 21%
Training Data: ~14h
Segments: 5k
Base Model: MMS
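
For context on the WER figure: word error rate is the word-level edit distance (substitutions, deletions, insertions) between the model output and the reference transcript, divided by the number of reference words. The sketch below shows the standard calculation for a single utterance. The function name and the English placeholder sentences are illustrative assumptions, not taken from the project's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("two" -> "too") and one deletion ("four") over 5 words.
print(word_error_rate("one two three four five", "one too three five"))  # 0.4
```

Corpus-level WER is usually computed by summing the errors and the reference word counts over all test utterances before dividing, which is not the same as averaging per-utterance scores.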

Built on facebook/mms-1b-all with the Samoan adapter as initialization. CTC fine-tuning with SpecAugment and speed perturbation, trained on RunPod A40 GPUs. Beam search decoding uses a word-level 5-gram KenLM language model built from the full parliament transcripts (~290k words).
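
The two augmentations named above can be sketched in plain Python as follows. This is an illustration of the general techniques, not the project's training code: the function names, the mask sizes, and the linear-interpolation resampler are assumptions, and the page does not state which settings were used or at which layer the masking is applied.

```python
import random

def speed_perturb(samples, factor):
    """Resample a waveform so it plays `factor` times faster.

    Uses simple linear interpolation; factor > 1 shortens the audio,
    factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def spec_augment(spec, max_time_mask, max_freq_mask, rng):
    """Zero out one random block of frames and one random band of bins.

    spec: list of frames, each a list of feature values.
    Returns a masked copy; the input is left untouched.
    """
    out = [row[:] for row in spec]
    if not out:
        return out
    n_frames, n_bins = len(out), len(out[0])

    # Time mask: blank a run of consecutive frames.
    t_width = rng.randint(0, min(max_time_mask, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * n_bins

    # Frequency mask: blank a run of consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for row in out:
        for f in range(f_start, f_start + f_width):
            row[f] = 0.0
    return out

# Usage: halve the duration of a toy ramp, then mask a toy feature matrix.
print(speed_perturb([0, 1, 2, 3, 4, 5, 6, 7], 2.0))  # [0.0, 2.0, 4.0, 6.0]
masked = spec_augment([[1.0] * 8 for _ in range(20)], 5, 3, random.Random(0))
```

Both transforms create varied copies of the same recordings, which matters when only about 14 hours of audio are available.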

Source: Parliament audio
Align: CTC forced alignment
Filter: Quality scoring
Train: MMS fine-tuning
Decode: Beam + 5-gram LM
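
The page does not say how the quality score in the Filter step is computed. One common approach, shown here purely as an assumed sketch, is to score each aligned segment by the average log-probability the aligner assigned to its labels and to drop segments that score poorly or have implausible durations. The `Segment` fields, function names, and thresholds below are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    text: str
    start_s: float
    end_s: float
    # Log-probability of the aligned label at each frame (hypothetical field).
    frame_log_probs: list = field(default_factory=list)

def quality_score(seg: Segment) -> float:
    """Mean per-frame log-probability; higher means a more confident alignment."""
    if not seg.frame_log_probs:
        return float("-inf")
    return sum(seg.frame_log_probs) / len(seg.frame_log_probs)

def filter_segments(segments, min_score=-1.0, min_dur_s=1.0, max_dur_s=20.0):
    """Keep segments with a plausible duration and a confident alignment.

    The default thresholds are placeholders, not the project's values.
    """
    kept = []
    for seg in segments:
        duration = seg.end_s - seg.start_s
        if min_dur_s <= duration <= max_dur_s and quality_score(seg) >= min_score:
            kept.append(seg)
    return kept

# Usage with three made-up segments: only the first passes both checks.
segments = [
    Segment("good", 0.0, 5.0, [-0.2, -0.4]),      # confident, normal length
    Segment("noisy", 5.0, 10.0, [-2.0, -3.0]),    # low alignment confidence
    Segment("too short", 10.0, 10.3, [-0.1]),     # under the minimum duration
]
print([seg.text for seg in filter_segments(segments)])  # ['good']
```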