Blog · Apr 22, 2026 · 6 min read · trilingual

Building Fellah AI: Farming Advice in Darija

How we built a trilingual farming assistant that understands Morocco's climate, crops, and harvest calendars, without falling back on generic advice from Silicon Valley databases.

When we started on Fellah AI, the obvious move was to plug in a general-purpose LLM and let it answer farming questions. That worked — about as well as a tourist in Aïn Leuh explaining olive harvesting to a farmer who's been doing it for thirty years. Technically correct in the abstract, completely wrong in the particulars.

The problem isn't the model. The problem is what the model has been shown. Generic LLMs have read endless corpora of temperate-zone gardening blogs, USDA planting guides, and permaculture manuals written for people with acid soil and four clean seasons. Ask one when to plant tomatoes and it'll confidently tell you "after the last frost", a phrase that means something in Minnesota and nothing in Settat.

What Moroccan farming actually needs

Three things shape every agricultural question a Moroccan farmer asks, and all three are invisible to a foundation model trained on Anglosphere data:

  • Climate zones that don't map cleanly. Morocco has at least five distinct agro-climatic regions: Atlantic coast, Rif, Middle Atlas, High Atlas, pre-Saharan. Casablanca and Marrakech are about 240 km apart and grow different things at different times.
  • Crops the model under-weights. Argan, carob, barley for couscous, berry-type strawberries, cactus fruit. These aren't exotic — they're central to Moroccan agriculture. They're also underrepresented in training data.
  • Language that's code-switched by default. A farmer in Taroudant doesn't ask questions in pure MSA or pure French. They ask in Darija with French loan-words and Amazigh terms sprinkled in.

Our approach: retrieval over a small, hand-curated corpus

We didn't fine-tune. Fine-tuning a 7B model on a small domain corpus is expensive and brittle, and the result goes stale every time you want to add new knowledge. Instead we went with retrieval-augmented generation (RAG), using a corpus we built ourselves from three sources:

  1. Public ONSSA and INRA Morocco publications (technical bulletins on specific crops)
  2. Moroccan agricultural extension service Q&A — the kind of thing farmers actually ask, in the language they actually use
  3. Our own interviews with farmers in five regions, transcribed and translated into a trilingual knowledge base

Total corpus size: around 4,200 chunks. We embed them with a multilingual model, store the vectors in pgvector on Supabase, and retrieve the 8 most relevant chunks per query before passing them to the LLM as context.
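To make the retrieval step concrete, here is a minimal in-memory sketch of top-8 retrieval by cosine distance. It is an illustration under assumptions, not our production code: the `Chunk` class, the function names, and the prompt wording are hypothetical, and the embeddings are plain lists of floats rather than the output of any particular model. In pgvector the same ranking would typically be a query using the cosine-distance operator, something like `ORDER BY embedding <=> %s LIMIT 8`, with table and column names of your own choosing.

```python
import math
from dataclasses import dataclass

TOP_K = 8  # the post retrieves 8 chunks per query


@dataclass
class Chunk:
    """One corpus chunk plus its embedding (hypothetical shape)."""
    text: str
    embedding: list[float]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def retrieve(query_embedding, chunks, k=TOP_K):
    """Rank all chunks by distance to the query and keep the closest k."""
    ranked = sorted(
        chunks, key=lambda c: cosine_distance(query_embedding, c.embedding)
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the retrieved chunks into a context block for the LLM."""
    context = "\n\n".join(
        f"[{i}] {c.text}" for i, c in enumerate(context_chunks, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A linear scan like this is fine for a few thousand chunks; the database does the same ranking, with the option of adding an index if the corpus grows.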

The Darija problem — and the workaround

Darija is the single hardest piece. There's no standard orthography, almost no training data, and most "Arabic" models treat it as noisy MSA. Our hack: we accept Darija input and route it through a pre-processing step that expands common Darija phrases into an MSA + French gloss that the model can actually reason over. It's ugly. It works. Every time the model gets something wrong, we add the failing phrase to the gloss table.
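The gloss step can be sketched roughly like this. Everything here is a simplified assumption: the three table entries are illustrative examples rather than rows from our real table, the `Gloss` class and `expand_darija` function are hypothetical names, and the sketch matches single whitespace-separated tokens, whereas a real table would also need multi-word phrases and spelling variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gloss:
    """MSA and French equivalents for one Darija term."""
    msa: str
    french: str


# Illustrative entries only (assumed, not the production gloss table).
GLOSS_TABLE = {
    "واش": Gloss(msa="هل", french="est-ce que"),
    "دابا": Gloss(msa="الآن", french="maintenant"),
    "نقدر": Gloss(msa="أستطيع", french="je peux"),
}

# Punctuation stripped before lookup so that "دابا؟" still matches "دابا".
PUNCTUATION = "؟?!.,،"


def expand_darija(text, table=GLOSS_TABLE):
    """Append an "[MSA / French]" gloss after each known Darija token."""
    out = []
    for token in text.split():
        gloss = table.get(token.strip(PUNCTUATION))
        if gloss is None:
            out.append(token)  # unknown tokens pass through unchanged
        else:
            out.append(f"{token} [{gloss.msa} / {gloss.french}]")
    return " ".join(out)
```

Keeping the original token next to its gloss means the model still sees exactly what the farmer wrote, and fixing a failure is a one-line addition to the table.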

The best part isn't the tech. It's watching a farmer in Chefchaouen ask "واش نقدر نزرع الطماطم دابا؟" ("Can I plant tomatoes now?") and get back a specific, correct, Darija-phrased answer within 400 ms.

What's next

Voice is the frontier. A lot of our users can't or don't type — they'd rather speak. We're experimenting with a two-stage pipeline: Whisper for rough transcription, then a Darija-aware post-processor that normalizes the output before it hits the retrieval step. It's not great yet. It will be.
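The shape of that two-stage pipeline can be sketched as below. This is a sketch under stated assumptions, not our post-processor: the transcriber is passed in as a plain callable so the example does not depend on any speech library, and the normalization rules (dropping diacritics and tatweel, unifying alef variants, collapsing whitespace) are generic Arabic text cleanup standing in for the Darija-aware logic. The function names are hypothetical.

```python
import re
from typing import Callable

# Arabic short-vowel marks (U+064B to U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Map alef with hamza above/below and alef with madda to bare alef.
_ALEF_VARIANTS = str.maketrans(
    {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
)


def normalize_transcript(text: str) -> str:
    """Stage 2: generic cleanup of a rough Arabic-script transcript."""
    text = _DIACRITICS.sub("", text)
    text = text.translate(_ALEF_VARIANTS)
    return " ".join(text.split())  # collapse runs of whitespace


def voice_to_query(audio_path: str, transcribe: Callable[[str], str]) -> str:
    """Stage 1 (injected transcriber) followed by stage 2 (normalizer)."""
    rough = transcribe(audio_path)
    return normalize_transcript(rough)
```

With the open-source `openai-whisper` package, the injected callable could be something like `lambda path: model.transcribe(path)["text"]`, where `model` comes from `whisper.load_model(...)`; the normalized string then goes to retrieval like any typed query.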

If you want to try Fellah AI, it's at fellah.kolchiai.com. Feedback welcome — in any of the three languages.