If these languages are going to remain teachable in fifty or one hundred years, the priority is not AI spectacle. It is community authority, digitization, searchable archives, good dictionaries, careful dataset building, and small machine-learning systems that make speakers, teachers, and learners stronger instead of turning them into raw material.
Three years ago, Kindalame argued that we should document Native traditions and teachings before more knowledge disappears. That case still stands, and it matters even more now. But the next version of that argument has to be more concrete. If we want future generations to learn Indigenous languages of the Americas, then we need more than admiration. We need durable language infrastructure: recordings that are stored responsibly, dictionaries that can be searched, orthographies that work on phones and laptops, teaching materials that can be updated, and machine-learning workflows that are small enough to audit and humble enough to stay under community control.
That is why this piece is less about “AI” as a brand and more about machine learning as a tool. The question is not whether a giant model can impress people on social media. The question is whether people can still study, teach, translate, pronounce, search, and recover Indigenous languages decades from now. The real work is preservation first and automation second.
Why does this matter now?
United Nations and UNESCO materials on Indigenous languages have long noted that thousands of Indigenous languages are spoken worldwide and warned that many are endangered. Once a language falls out of use, what disappears is not just vocabulary. You lose teaching pathways, pronunciation patterns, story structures, ceremonial context, place-based knowledge, ecological memory, and a worldview that does not come back automatically just because someone digitized a word list.
That is why the preservation problem is bigger than “translation quality.” A future learner needs more than a chatbot. They need trusted recordings, consistent spelling systems, sentence examples, grammar explanations, curricular materials, and community review. They need materials that can be searched and extended without being stripped from the people who created them.
This is also why our earlier post on documenting Native traditions and teachings should be read as a moral starting point rather than the finished technical answer. The practical sequel is this: preservation in the 2020s means building language resources that remain useful to humans first and useful to software second.
What infrastructure already exists?
The good news is that this work is already underway. The bad news is that most of it still needs more funding, more technical support, and more respect.
The Endangered Languages Project exists to support endangered-language documentation and revitalization with a public platform for language profiles and resources. FirstVoices, run by the First Peoples’ Cultural Council (FPCC), is one of the strongest examples of what preservation infrastructure looks like when it is built for communities rather than for extractive research. Even when FPCC pages are read simply as a directory of current work, they point to a real stack: archives, language tools, and community-facing learning pathways. FPCC also keeps a dedicated FirstVoices overview and a separate governance-oriented resource, Check Before You Tech, that signals exactly how serious this work is about consent and appropriate use.
The Ojibwe People’s Dictionary shows what a living public dictionary can look like when language work includes learner access, voices, and structured entries instead of a static PDF left to decay. The Cherokee Nation Language Department is even more explicit about the preservation-to-technology pipeline: it says the department is committed to preserving and perpetuating Cherokee through daily spoken use, and it ties together translation, classes, immersion, and language technology in the same program. The page also states that there are about 2,000 first-language Cherokee speakers and that several thousand more learners have come through Cherokee Nation language programs. That is the shape of serious work: teaching, translation, immersion, and technical support all in one place.
The Alaska Native Language Archive provides another durable model. Even its public navigation tells you what matters: collections, dictionaries, copyright, deposit guidance, partnerships, and digital heritage preservation. That is exactly what a future-proof preservation system needs. It is not enough to “have data.” You need stewardship rules, deposit pathways, and a searchable structure that makes materials usable without making them ownerless.
Then there is the governance layer, which too many machine-learning people still treat like optional ethics theater. The CARE Principles for Indigenous Data Governance exist because FAIR-style openness by itself does not protect Indigenous rights and interests. The Global Indigenous Data Alliance (GIDA) is explicit that CARE is about collective benefit, control over use, and data practices oriented around people and purpose rather than extraction. Local Contexts’ TK Labels add another critical piece: provenance, permissions, and community context are not metadata decoration. They are part of whether you should be using a resource at all.
That existing landscape matters because it answers a common bad-faith question: “Do we even have enough to do anything?” Yes, we do. We have archives, dictionaries, recordings, language departments, governance frameworks, and active revitalization programs. What we do not have is enough patient investment in turning those resources into better teaching and preservation systems without violating the communities behind them.
What can machine learning actually help with?
Machine learning is useful here, but only when it is narrow, auditable, and tied to actual language work.
The biggest near-term use case is digitization. A 2024 ACL paper, A Concise Survey of OCR for Low-Resource Languages, points out that many low-resource language materials still exist in image-based documents such as scanned dictionaries, field notes, children’s stories, and other non-machine-readable text. That is a perfect example of where ML is helpful without being mystical. OCR does not “save” a language by itself, but it can turn dead scans into text that teachers, communities, and future tools can actually search.
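To make that concrete, here is a minimal sketch of the digitization step using the open-source Tesseract engine through pytesseract. The file name is a placeholder, and the default recognition model is an assumption; a real project would train or adapt the recognizer to the community's orthography.

```python
# Minimal OCR sketch: turn one scanned dictionary page into searchable text.
# Assumes Tesseract and pytesseract are installed. "page_034.png" is a
# placeholder, and the default recognition model is only a starting point.
from PIL import Image
import pytesseract

scan = Image.open("page_034.png")              # one scanned dictionary page
raw_text = pytesseract.image_to_string(scan)   # best-effort recognition

with open("page_034.txt", "w", encoding="utf-8") as out:
    out.write(raw_text)  # now searchable text, ready for human correction
```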
The second good use case is aligned data creation. If a community has approved bilingual phrase lists, classroom examples, or translation pairs, then lightweight models can help with sentence matching, glossary expansion, terminology suggestion, or draft translations in narrow domains. This is not glamorous work. It is also exactly the work that makes future dictionaries, tutors, and translation aids possible.
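As a sketch of what sentence matching can mean in practice, the snippet below uses the sentence-transformers library with the multilingual LaBSE checkpoint to rank candidate pairs for human review. Whether that checkpoint represents any particular Indigenous language well is an open question, so the scores are only suggestions for a reviewer, and the sentences shown are placeholders.

```python
# A sketch of candidate sentence matching between two approved collections.
# Assumes the sentence-transformers package; LaBSE may or may not cover the
# target language, so every suggested match must go to a human reviewer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["Where is the river?", "The children are singing."]   # placeholders
target = ["<approved sentence 1>", "<approved sentence 2>"]      # placeholders

emb_en = model.encode(english, convert_to_tensor=True)
emb_tg = model.encode(target, convert_to_tensor=True)
scores = util.cos_sim(emb_en, emb_tg)          # similarity matrix for review

for i, sentence in enumerate(english):
    j = int(scores[i].argmax())
    print(f"{sentence}  |  {target[j]}  (score {float(scores[i][j]):.2f})")
```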
The third use case is educational support. The AmericasNLP 2024 shared task on educational materials is worth paying attention to because it keeps the goal pointed at learning rather than leaderboard vanity. If models can assist in creating exercises, controlled reading passages, or curriculum support under teacher review, that is preservation infrastructure. It helps real learners.
The fourth use case is machine translation assistance, but with very narrow claims. The AmericasNLP 2024 shared task on machine translation into Indigenous languages shows that researchers are actively working on translation for Indigenous languages of the Americas. ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization is even more concrete. That paper introduced a Cherokee-English parallel dataset, described it as extremely low-resource at about 14,000 sentence pairs plus about 5,000 Cherokee monolingual examples, and framed the work explicitly around revitalization. That is the right scale of honesty. It is not “we solved Cherokee.” It is “here is a small, transparent resource and a research baseline that the community can build on.”
Speech technology can help too, but only when there is approved audio and careful evaluation. Evaluating Self-Supervised Speech Representations for Indigenous American Languages is a reminder that the speech side of the problem is active, real, and still difficult. In the right setting, speech models can assist with transcription, search, or pronunciation support. In the wrong setting, they simply turn low-resource recordings into another round of bad guesses.
Just as important is what ML does not do well.
It does not replace fluent speakers. It does not infer missing grammar from tiny datasets with magical reliability. It does not ethically justify scraping public-facing language pages into a training set. It does not make translation trustworthy when no teacher or speaker reviews the output. And it definitely does not remove the need for community governance just because a model card says “open.”
What data is good enough to use?
For this topic, the rule should be simple: public and community-approved is the floor, not the ceiling.
That means there are three buckets.
Safe to recommend
The safest public resources are the ones explicitly built to teach, document, or guide preservation work in the open: Endangered Languages Project, FirstVoices, the Ojibwe People’s Dictionary, the Cherokee Nation Language Department, the Alaska Native Language Archive, Local Contexts TK Labels, and the CARE Principles. These are safe to recommend as resources to learn from, support, and study.
Useful, but permission-dependent
A public archive is not the same thing as blanket permission to train on everything inside it. Audio, classroom materials, archival scans, and bilingual examples may be publicly visible while still carrying cultural, legal, or ethical restrictions on reuse. That is exactly why Check Before You Tech and CARE matter. They tell you not to treat discoverability as consent. If a community has not clearly approved reuse for model training, treat the material as permission-dependent even when you can read it in a browser.
Not appropriate for casual scraping
Do not casually scrape community portals, lesson sites, audio collections, or restricted cultural materials into a generic dataset just because a crawler can reach them. Do not use “open web” logic where Indigenous language resources are concerned. If provenance is unclear, if the licensing is silent, if the material is ceremonial or culturally sensitive, or if the community context has been stripped away, then the correct move is not to train first and apologize later. The correct move is to stop.
This is where Local Contexts TK Labels should become standard practice. In this space, provenance and permissions are part of the data itself.
What does a realistic Google Colab path look like?
For most readers, the right first build is not a giant multilingual model. It is a tiny, narrow, supervised task that you can explain to a teacher or language worker in one sentence.
Start with something like one of these (a tiny sketch of one such task follows the list):
- glossary normalization
- dictionary-entry classification
- phrase translation in one narrow domain
- sentence matching between existing bilingual materials
- OCR cleanup ranking for scanned text
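As an example of how small these tasks can be, here is a sketch of dictionary-entry classification with scikit-learn: deciding whether a digitized line is a headword entry or an example sentence. The four labeled lines are invented placeholders; a real run would use community-reviewed data and far more examples.

```python
# A tiny supervised task: classify digitized lines as dictionary entries vs.
# example sentences. All training lines below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = [
    "headword-one  n.  water",        # entry-style line (placeholder)
    "headword-two  n.  rabbit",       # entry-style line (placeholder)
    "The child drinks the water.",    # example sentence (placeholder)
    "The rabbit runs to the river.",  # example sentence (placeholder)
]
labels = ["entry", "entry", "example", "example"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines, labels)

# The shared "  n.  " pattern should push this toward "entry".
print(clf.predict(["headword-three  n.  river"]))
```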
Google Colab’s FAQ is honest about the tradeoff. Colab provides free access to GPUs and TPUs, but the resources are not guaranteed, are not unlimited, and usage limits fluctuate. That means Colab is perfect for prototypes, notebooks, evaluation experiments, and small adapter runs. It is not where you build a preservation strategy that depends on uninterrupted compute.
The practical stack I would recommend is:
- PEFT LoRA for parameter-efficient fine-tuning
- bitsandbytes quantization for 8-bit or 4-bit loading
- either Axolotl or Unsloth as the training wrapper if you want faster setup
The official LoRA documentation says the point is to accelerate fine-tuning of large models while consuming less memory, by training low-rank update matrices instead of retraining the entire base model. That is why LoRA is the default here. In preservation settings, portable adapters are a feature, not a compromise. They let you keep the base model frozen, document what changed, and swap in small community-specific adapters for narrow tasks.
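A minimal sketch of what that looks like with the PEFT library, assuming a placeholder model ID; the rank and target modules are illustrative choices, not recommendations for any particular project.

```python
# Attach a LoRA adapter to a frozen base model with Hugging Face PEFT.
# "base-model-id", the rank, and the target modules are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)       # base weights stay frozen
model.print_trainable_parameters()         # only the adapter weights train

# The trained adapter is small and portable:
# model.save_pretrained("adapters/glossary-suggestions-v1")
```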
The official bitsandbytes docs go further. They explain that QLoRA is a 4-bit quantization technique that keeps models trainable by inserting a small set of trainable LoRA weights, and they even give a concrete example: nested quantization can make it possible to fine-tune a Llama-13B model on a 16GB NVIDIA T4 with a sequence length of 1024, batch size 1, and gradient accumulation 4. That is not a promise that every preservation project should do this. It is a reminder that small, narrow experiments have become much more accessible than they were even a few years ago.
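Here is a sketch of that 4-bit, nested-quantization loading with transformers and bitsandbytes; the model ID is a placeholder, and actual memory use depends on the GPU, sequence length, and batch size.

```python
# Load a base model in 4-bit with nested (double) quantization, then train a
# LoRA adapter on top of it: that combination is what QLoRA refers to.
# "base-model-id" is a placeholder; memory behaviour depends on your setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # the nested quantization step
    bnb_4bit_compute_dtype=torch.float16,   # bfloat16 where the GPU supports it
)

model = AutoModelForCausalLM.from_pretrained(
    "base-model-id",
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach the LoRA adapter from the previous sketch on top of this model.
```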
So the Colab path is simple:
- Use only reviewed, approved data.
- Pick one task with a clear success criterion.
- Load a small or mid-size open model in 4-bit.
- Train a LoRA adapter, not the full model.
- Evaluate with real speakers or teachers on held-out examples.
- Keep the adapter, training data notes, prompts, and evaluation sheet together (a small sketch of that packaging follows).
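A minimal sketch of that last step, assuming the adapter was saved with PEFT; the file names and fields are illustrative, not a standard.

```python
# Keep the adapter and its paper trail in the same place. Every path and
# field below is an illustrative assumption, not a required format.
import json
import os

adapter_dir = "adapters/phrase-translation-v1"
os.makedirs(adapter_dir, exist_ok=True)
# model.save_pretrained(adapter_dir)   # from the LoRA sketch above

run_notes = {
    "task": "phrase translation, classroom domain only",
    "data": "approved phrasebook v3 (see provenance sheet)",
    "prompt_template": "prompts/phrase_translation.txt",
    "evaluation": "eval/teacher_review.csv",
    "approval": "language department review reference goes here",
    "not_allowed": "general-purpose translation or public deployment",
}
with open(os.path.join(adapter_dir, "run_notes.json"), "w", encoding="utf-8") as f:
    json.dump(run_notes, f, ensure_ascii=False, indent=2)
```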
If you cannot describe the model’s job in one paragraph, the project is probably too broad for Colab.
What does a realistic home-GPU path look like?
Home hardware can be enough, but the key word is enough, not effortless.
Axolotl’s documentation is refreshingly blunt about its Unsloth integration. It describes the integration as hyper-optimized QLoRA fine-tuning for single GPUs, and it lists clear limitations: single GPU only, LoRA and QLoRA only, limited architecture support, and no full fine-tunes. That is not a weakness. That is exactly the kind of bounded system many preservation projects should prefer.
If you have a home GPU, the realistic starting point is:
- one open-weight base model
- one narrow task
- one reviewed dataset
- one adapter
Do not start by trying to create a universal tutor for all Indigenous languages of the Americas. Start by making one thing more usable: OCR correction for a dictionary scan, term suggestions for a lesson set, or bilingual phrase completion for a class workbook.
Axolotl’s project overview is useful because the framework openly supports LoRA, QLoRA, and full fine-tuning, and its docs include a Google Colab quickstart plus model guides covering Llama and GPT-OSS. That makes it a practical bridge tool: you can prototype in Colab, then move to a home workstation when the dataset, evaluation, and governance are mature enough.
My realistic rule of thumb is this: if you are doing a tiny QLoRA experiment, 16GB-class hardware can sometimes be enough, especially when the official docs themselves show constrained 13B examples on a T4. But if you want a sane local workflow with room for mistakes, evaluation runs, and longer contexts, 24GB or more is a much more comfortable target. That last sentence is an engineering judgment, not an official vendor promise, and it is exactly the kind of judgment readers deserve. Preservation projects should optimize for repeatability and review, not for heroic demo runs that barely fit.
When does this stop being a hobbyist project?
There is a point where the right answer is not “buy a better GPU.” The right answer is “this is an institutional project now.”
That threshold usually appears when you need one or more of these:
- multilingual base models with formal governance review
- larger aligned corpora across dialects or communities
- transcription pipelines built on sensitive recordings
- data agreements that require archival or university support
- evaluation processes that need multiple teachers, speakers, or translators
This is where No Language Left Behind: Scaling Human-Centered Machine Translation is useful as a research signal rather than a prescription. The paper is explicit that most low-resource languages have been left behind, that the work involved exploratory interviews with native speakers, and that the team evaluated over 40,000 translation directions with human-translated benchmarks and toxicity review. That is the correct lesson to take from frontier multilingual research: serious work on low-resource languages is not just about model size. It is about interviews, data curation, human evaluation, and safety.
So if your project starts needing speech collections, new corpus creation, or cross-community evaluation, you should think like a lab or a tribal institution, not like a weekend fine-tuning hobbyist. That means governance documents, explicit consent, storage plans, evaluation rubrics, and community decision-making before model iteration.
What should a minimum viable preservation dataset look like?
If someone asked me what the smallest responsible dataset bundle looks like, I would not start with model format. I would start with provenance.
At minimum, a preservation-oriented dataset package should include:
- the text itself
- where it came from
- who approved its use
- what task it is meant to support
- which orthography or spelling conventions it follows
- who reviewed it
- what should never be done with it
That sounds bureaucratic until you compare it with the alternative, which is the usual machine-learning mess: mystery text files, unlabeled audio, uncertain permission, and a model that cannot tell you where any of its examples came from. For Indigenous language work, that kind of sloppiness is not a small issue. It is often the whole ethical issue.
This is another reason CARE, Local Contexts TK Labels, and FPCC’s Check Before You Tech should be treated as operating rules. A dataset for glossary expansion is not the same thing as a dataset for open-ended generation. A classroom phrasebook is not the same thing as an archive of community recordings. A public web page is not the same thing as a training license.
If readers want a practical template, I would use columns or fields like these (a small sketch of a single record follows the list):
- source collection
- language and dialect
- speaker, teacher, or reviewer role
- task type: OCR cleanup, dictionary lookup, translation, educational materials
- access status: public, public-but-review-required, or restricted
- community guidance note
- approval or provenance reference
- orthography version
- confidence or review status
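Expressed as code, one record in that template could look like the dataclass below; the field names mirror the list above and every value shown is a placeholder.

```python
# One record of a preservation-oriented dataset package. Field names mirror
# the template above; all values shown are placeholders.
from dataclasses import dataclass, asdict

@dataclass
class PreservationRecord:
    source_collection: str
    language_and_dialect: str
    contributor_role: str        # speaker, teacher, or reviewer
    task_type: str               # OCR cleanup, dictionary lookup, translation, educational materials
    access_status: str           # public, public-but-review-required, restricted
    community_guidance_note: str
    approval_reference: str
    orthography_version: str
    review_status: str

record = PreservationRecord(
    source_collection="scanned dictionary, volume 1 (placeholder)",
    language_and_dialect="placeholder",
    contributor_role="teacher review",
    task_type="OCR cleanup",
    access_status="public-but-review-required",
    community_guidance_note="classroom use only; see guidance sheet",
    approval_reference="approval memo reference goes here",
    orthography_version="community orthography v2 (placeholder)",
    review_status="reviewed by one teacher; second review pending",
)
print(asdict(record))  # easy to export as JSON or a spreadsheet row
```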
That kind of packaging is not overkill. It is what lets a future volunteer, teacher, or archivist understand whether your work is usable and trustworthy after you are gone.
How should these models be evaluated?
This is where a lot of otherwise well-meaning work goes off the rails. People get a model to output a few plausible examples, and suddenly they act as if they have built a tutor.
The better standard is much simpler.
For narrow tasks, evaluate on held-out examples that matter to the actual use case. If the task is OCR cleanup, compare the cleaned text against reviewed text and track the kinds of errors that remain. If the task is phrase translation, use a held-out set of approved phrases and have fluent speakers, teachers, or advanced learners check adequacy, wording, spelling, and whether the output is pedagogically useful. If the task is educational-material generation, judge whether the result is teachable, correct, level-appropriate, and worth a human keeping.
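For the OCR-cleanup case, one concrete check is character error rate against the reviewed text. The sketch below uses plain edit distance and placeholder strings; the rate is a signal for tracking progress, not a substitute for a reviewer looking at the remaining errors.

```python
# Character error rate between model-cleaned text and teacher-reviewed text,
# computed with plain Levenshtein edit distance. Both strings are placeholders.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(hypothesis: str, reference: str) -> float:
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

cleaned = "the chlld drinks the water"    # model output (placeholder)
reviewed = "the child drinks the water"   # reviewed held-out text (placeholder)
print(f"CER: {character_error_rate(cleaned, reviewed):.3f}")
```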
No Language Left Behind is helpful here too because it does not stop at automated metrics. The paper explicitly combines broad evaluation with human review and safety checks. That is the lesson worth copying. BLEU, chrF, or any other benchmark can be one signal, but they are not the real gate for language-preservation tools. The real gate is whether people responsible for the language say the tool helps without distorting, cheapening, or misleading.
My preferred evaluation bundle for this kind of work would include:
- a held-out test set
- a speaker or teacher review sheet
- an error taxonomy
- examples of good outputs
- examples of unacceptable outputs
- a short note on what the model is not allowed to do
That last item matters. A good preservation workflow includes explicit boundaries. A model that is only good enough for draft glossary suggestions should never be presented as a translator. A model that is only good enough for OCR cleanup should never be marketed as a teacher. Saying no to overclaiming is part of the technical work.
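For the review-sheet and boundaries items in that bundle, even a fixed template helps; the columns and the note text below are assumptions, meant to be adapted by the reviewers themselves.

```python
# A blank review sheet a teacher or speaker can fill in by hand, plus a
# written boundaries note. Column names and the note text are assumptions.
import csv

columns = ["example_id", "model_output", "reviewer", "adequate", "spelling_ok",
           "pedagogically_useful", "error_type", "notes"]

with open("teacher_review_sheet.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(columns)

with open("model_boundaries.txt", "w", encoding="utf-8") as f:
    f.write("Draft glossary suggestions only. Not a translator. Not a teacher.\n")
```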
What does the current research already prove?
It proves enough to justify action, but not enough to justify arrogance.
AmericasNLP 2024 exists because Indigenous languages of the Americas are active, serious NLP research subjects right now, not a hypothetical future category. The workshop’s shared tasks on machine translation and educational materials show that the community is already working on both translation quality and learner support. The OCR survey for low-resource languages makes clear that scanned dictionaries, field notes, and stories are not edge cases. They are the substrate of a lot of real preservation work.
ChrEn proves something even more important: small, endangered-language MT is possible, but the bottleneck is still data, domain transfer, and evaluation. The LREC-COLING paper on self-supervised speech representations shows that speech support for Indigenous American languages is an active area too. So the question is not whether there is any ML worth doing here. There is. The question is whether we are willing to do the slow work around it.
That is the real divide between hype and preservation. Hype says, “Let’s build a model for this language.” Preservation says, “What materials exist, who governs them, what narrow tool would help, how will speakers review it, and what survives after the novelty wears off?”
Resource appendix
Here is the shortest serious list I would hand to anyone entering this space.
Preservation and public learning resources
- United Nations Indigenous Languages background PDF
- UNESCO Indigenous Languages background PDF
- Endangered Languages Project
- FirstVoices / FPCC
- FPCC FirstVoices overview
- Ojibwe People’s Dictionary
- Cherokee Nation Language Department
- Alaska Native Language Archive
Governance and Indigenous data use
- CARE Principles for Indigenous Data Governance (GIDA)
- Local Contexts TK Labels
- FPCC Check Before You Tech
Public resources that still need reuse review
- The Cherokee Nation Language Department is public and valuable, but translation materials, classes, and technology resources should not be assumed to be blanket training data.
- The Alaska Native Language Archive is a major archive, but archival visibility is not the same thing as unrestricted model-training permission.
- The Ojibwe People’s Dictionary is a model public resource for learners, but public educational access is not identical to unrestricted downstream reuse.
Research worth reading
- AmericasNLP 2024 workshop
- Findings of the AmericasNLP 2024 Shared Task on Machine Translation into Indigenous Languages
- Findings of the AmericasNLP 2024 Shared Task on the Creation of Educational Materials for Indigenous Languages
- A Concise Survey of OCR for Low-Resource Languages
- ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization
- Evaluating Self-Supervised Speech Representations for Indigenous American Languages
- No Language Left Behind: Scaling Human-Centered Machine Translation
Practical fine-tuning and LoRA docs
- Google Colab FAQ
- Hugging Face PEFT LoRA guide
- Hugging Face bitsandbytes quantization docs
- Axolotl docs
- Axolotl Unsloth integration
- Axolotl Llama guide
If you care about these languages, do not start by asking which model is hottest this month.
Start by supporting classes, archives, and community language departments. Help digitize approved materials. Fund transcription, dictionary cleanup, and structured metadata. Build tiny tools that can be reviewed, corrected, and thrown away if they are wrong. Treat evaluation by speakers and teachers as the real benchmark. And never separate technical ambition from community leadership.
The future we should want is not a world where a huge model impersonates fluency in endangered languages. It is a world where more people can actually learn them, teach them, search them, hear them, and keep them alive on terms set by the communities that carry them.
