16 March 2026 EN #android #kotlin #architecture #fintech #yembi

Parsing a million mobile money SMS at 99% accuracy

When SMS is the only API. The architecture behind Yembi’s parser: versioned regex pipelines, config-driven formats, and idempotent sync over 2G.

Mobile money operators in Burkina Faso don’t offer a transaction API. What they offer is a text message:

Cher client, vous avez transfere 50,500.00 FCFA au numero
65013303,ABDOURAZACK. Votre nouveau solde est de 12,300.00 FCFA.
ID Trans: PP260312.1453.A12345

That message is the entire integration surface. If you want to build financial software here, SMS is the API: undocumented, unversioned, and changed whenever the operator edits a template. I lead Yembi at DOKAL-Africa and was its only engineer through launch. This is how I built its parser to survive that, and what it taught me about building on hostile interfaces.

The pipeline

Every incoming SMS goes through four stages on-device:

Operator detection. Sender ID and message fingerprints route the message to the right parser family (Orange Money, Moov Money), and SIM-slot detection maps it to the right account on dual-SIM phones.
Classification. Each operator has seven or more transaction types (withdrawal, P2P transfer, receipt, bill payment, merchant payment, airtime, agent deposit), each with its own template variants. A classifier picks the candidate type before any extraction happens.
Extraction. Type-specific patterns pull out amount, counterparty, transaction ID, timestamp, and post-transaction balance. Amount parsing alone handles 50,500.00 FCFA, 50500 FCFA, and a few formats the operators have used and abandoned over the years.
Validation. Extracted fields are cross-checked: does the new balance make arithmetic sense given the previous one? Inconsistent parses are quarantined for review instead of silently corrupting the ledger.
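
As a rough illustration of the extraction and validation stages, here is a minimal Kotlin sketch run against the sample transfer message quoted at the top. Everything in it is an assumption made for illustration: the regex, the `ParsedTransfer` type, the reading of `,` as a thousands separator, and the `fee` parameter are hypothetical, not Yembi's actual patterns or schema.

```kotlin
import java.math.BigDecimal

// Hypothetical result type; field names are illustrative, not Yembi's schema.
data class ParsedTransfer(
    val amount: BigDecimal,
    val counterpartyNumber: String,
    val counterpartyName: String,
    val newBalance: BigDecimal,
    val transactionId: String,
)

// Illustrative pattern for the sample transfer template; \s+ tolerates line breaks.
val TRANSFER_PATTERN = Regex(
    """transfere\s+([\d,.]+)\s*FCFA\s+au\s+numero\s+(\d+),\s*([^.]+?)\.\s+""" +
        """Votre\s+nouveau\s+solde\s+est\s+de\s+([\d,.]+)\s*FCFA\.\s+""" +
        """ID\s+Trans:\s*(\S+)""",
    RegexOption.IGNORE_CASE,
)

// Assumes ',' is a thousands separator and '.' the decimal point,
// so "50,500.00" and "50500" both parse to the same value.
fun parseAmount(raw: String): BigDecimal? = raw.replace(",", "").toBigDecimalOrNull()

// Extraction: returns null when the template or any amount does not parse.
fun parseTransfer(sms: String): ParsedTransfer? {
    val match = TRANSFER_PATTERN.find(sms) ?: return null
    val (amount, number, name, balance, id) = match.destructured
    return ParsedTransfer(
        amount = parseAmount(amount) ?: return null,
        counterpartyNumber = number,
        counterpartyName = name.trim(),
        newBalance = parseAmount(balance) ?: return null,
        transactionId = id,
    )
}

// Validation: an outgoing transfer should lower the balance by amount + fee.
// The fee parameter is an assumption; the sample message does not mention one.
fun isBalanceConsistent(
    previousBalance: BigDecimal,
    tx: ParsedTransfer,
    fee: BigDecimal = BigDecimal.ZERO,
): Boolean = (previousBalance - tx.amount - fee).compareTo(tx.newBalance) == 0
```

In this sketch, a caller would quarantine the message when `parseTransfer` returns null or `isBalanceConsistent` returns false, rather than writing a best guess to the ledger.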

The decision that saved the project: config-driven formats

Early on, format rules lived in code. Then Orange changed a template, and fixing it meant shipping an APK and waiting for users to update. Unacceptable for a financial ledger.

Now the parser is driven by versioned configuration. Pattern sets live as data, shipped independently of app releases, so a template change is a config update rather than a release. The training pipeline runs against a corpus of real anonymized messages, and every config version gets a measured accuracy score before it ships. The 99%+ number comes from that corpus. It’s a regression suite, not a guess.
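
To make the idea concrete, here is a minimal sketch of patterns-as-data with a corpus-based accuracy score. The `PatternConfig`, `Rule`, and `LabeledSms` types are hypothetical, and the second corpus message (the "envoye" variant) is an invented template used only to simulate an operator changing its wording. In a real app the config would be deserialized from a downloaded file; it is built inline here to keep the example self-contained.

```kotlin
// Hypothetical config model: one rule per template variant, first match wins.
data class Rule(val type: String, val regex: String)
data class PatternConfig(val version: Int, val rules: List<Rule>)

// A corpus entry: the raw message plus the fields a correct parse must produce.
data class LabeledSms(val text: String, val expectedType: String, val expectedAmount: String)

class ConfigParser(config: PatternConfig) {
    // Patterns are compiled once per config version, not per message.
    private val compiled = config.rules.map { it.type to Regex(it.regex, RegexOption.IGNORE_CASE) }

    // Returns (type, raw amount) for the first rule that matches, else null.
    fun parse(sms: String): Pair<String, String>? {
        for ((type, regex) in compiled) {
            val match = regex.find(sms) ?: continue
            return type to match.groupValues[1]
        }
        return null
    }
}

// Fraction of corpus messages whose type and amount are both extracted correctly.
fun accuracy(config: PatternConfig, corpus: List<LabeledSms>): Double {
    if (corpus.isEmpty()) return 0.0
    val parser = ConfigParser(config)
    val correct = corpus.count { parser.parse(it.text) == (it.expectedType to it.expectedAmount) }
    return correct.toDouble() / corpus.size
}

val corpus = listOf(
    LabeledSms("Cher client, vous avez transfere 50,500.00 FCFA au numero 65013303", "transfer", "50,500.00"),
    // Invented variant, standing in for a template the operator changed.
    LabeledSms("Cher client, vous avez envoye 1,000.00 FCFA au numero 65013303", "transfer", "1,000.00"),
)

val configV1 = PatternConfig(1, listOf(Rule("transfer", """transfere\s+([\d,.]+)\s*FCFA""")))

// Version 2 adds a rule for the new wording; no app code changes.
val configV2 = PatternConfig(2, configV1.rules + Rule("transfer", """envoye\s+([\d,.]+)\s*FCFA"""))
```

Scoring both versions against the same corpus gives 0.5 for `configV1` and 1.0 for `configV2`, which is the kind of number a release gate could check before a config ships.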

Idempotency, or: 2G will retry you into corruption

Parsing is half the system. The other half is syncing parsed transactions to the backend over networks that drop mid-request. The failure mode that matters: the server commits, the ACK never arrives, the client retries. Your user now has a duplicate 150,000 FCFA withdrawal in their history.

The fix is boring, which is what you want in a ledger: every transaction carries a content-derived SHA-256 idempotency key, and the server treats inserts as upserts on that key. A persistent client-side queue with exponential backoff can retry as aggressively as it wants. Duplicates are structurally impossible rather than carefully avoided.
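
A minimal sketch of that idea follows, with an in-memory map standing in for the server's table. Which fields feed the hash is an assumption here (operator, transaction ID, amount, timestamp), and the `Transaction` and `LedgerStore` types are hypothetical; the point is only that a retried request maps to the same key and therefore the same row.

```kotlin
import java.security.MessageDigest

// Hypothetical transaction shape; the fields chosen for the key are an assumption.
data class Transaction(
    val operator: String,
    val transactionId: String,
    val amountMinor: Long,
    val timestamp: String,
)

// Content-derived key: SHA-256 over a canonical, fixed-order encoding of the fields.
fun idempotencyKey(tx: Transaction): String {
    // The unit separator (U+001F) keeps adjacent fields from running together.
    val canonical = listOf(tx.operator, tx.transactionId, tx.amountMinor.toString(), tx.timestamp)
        .joinToString("\u001F")
    val digest = MessageDigest.getInstance("SHA-256").digest(canonical.toByteArray(Charsets.UTF_8))
    return digest.joinToString("") { "%02x".format(it) }
}

// In-memory stand-in for a table with a unique constraint on the key.
class LedgerStore {
    private val rows = LinkedHashMap<String, Transaction>()

    // Returns true if the row was newly inserted, false if the key already existed.
    fun upsert(tx: Transaction): Boolean = rows.putIfAbsent(idempotencyKey(tx), tx) == null

    val size: Int get() = rows.size
}
```

If the client sends the same withdrawal twice because the first ACK was lost, the second `upsert` returns false and `size` stays at 1.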

Performance: indexes are a product feature

A year of active use generates a serious local database, and Yembi’s whole pitch is “ask questions about your money.” Category breakdowns and date-range queries run against composite indexes designed for exactly those questions, which keeps them under 50 ms on million-row tables on mid-range Android hardware. The dashboard feels instant because the schema was designed backwards from the questions.
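
Designing backwards from the questions might look like the following sketch: write the queries first, then declare indexes whose column order matches them (equality columns first, the range column last). The table, column, and index names are hypothetical, and the SQL is plain SQLite held in Kotlin strings rather than Yembi's actual schema.

```kotlin
// Hypothetical SQLite schema; table and column names are illustrative.
object LedgerSchema {
    val CREATE_TABLE = """
        CREATE TABLE IF NOT EXISTS transactions (
            idempotency_key TEXT PRIMARY KEY,
            account_id      INTEGER NOT NULL,
            category        TEXT NOT NULL,
            occurred_at     INTEGER NOT NULL, -- epoch millis
            amount_minor    INTEGER NOT NULL
        )
    """.trimIndent()

    // Question 1: "what happened on this account between two dates?"
    val QUERY_DATE_RANGE = """
        SELECT * FROM transactions
        WHERE account_id = ? AND occurred_at BETWEEN ? AND ?
        ORDER BY occurred_at DESC
    """.trimIndent()

    // Question 2: "how much did I spend in this category over a period?"
    val QUERY_CATEGORY_TOTAL = """
        SELECT SUM(amount_minor) FROM transactions
        WHERE account_id = ? AND category = ? AND occurred_at BETWEEN ? AND ?
    """.trimIndent()

    // One composite index per question: equality columns first, range column last.
    val CREATE_INDEXES = listOf(
        "CREATE INDEX IF NOT EXISTS idx_tx_account_date " +
            "ON transactions(account_id, occurred_at)",
        "CREATE INDEX IF NOT EXISTS idx_tx_account_category_date " +
            "ON transactions(account_id, category, occurred_at)",
    )
}
```

On Android these statements could be run with `SQLiteDatabase.execSQL`. Whether the planner actually picks a given index is worth confirming with `EXPLAIN QUERY PLAN` on a realistically sized table.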

What I’d tell anyone building on an unofficial interface

Treat the format as adversarial. Version your parsers, measure accuracy against a real corpus, and make format changes deployable as data.
Quarantine, don’t guess. A financial app that’s wrong is worse than one that asks.
Make duplicates impossible, not unlikely.
Design indexes from the user’s questions, not the entity model.

For most of this project I was the only engineer on Yembi: app, backend, admin portal, parser tooling, Play Store listing. We’ve since hired a second developer, and the config-driven parser is the main reason onboarding them was painless.

Yembi launched in March. Next post: the launch itself, what worked, what flopped, and the numbers.