Parsing a million mobile money SMS at 99% accuracy
When SMS is the only API: the architecture behind Yembi’s parser, built on versioned regex pipelines, config-driven formats, and idempotent sync over 2G.
Mobile money operators in Burkina Faso don’t offer a transaction API. What they offer is a text message:
Cher client, vous avez transfere 50,500.00 FCFA au numero
65013303,ABDOURAZACK. Votre nouveau solde est de 12,300.00 FCFA.
ID Trans: PP260312.1453.A12345
That message is the entire integration surface. If you want to build financial software here, SMS is the API: undocumented, unversioned, and changed whenever the operator edits a template. I lead Yembi at DOKAL-Africa and was its only engineer through launch. This is how I built its parser to survive that, and what it taught me about building on hostile interfaces.
The pipeline
Every incoming SMS goes through four stages on-device:
- Operator detection. Sender ID and message fingerprints route the message to the right parser family (Orange Money, Moov Money), and SIM-slot detection maps it to the right account on dual-SIM phones.
- Classification. Each operator has seven or more transaction types (withdrawal, P2P transfer, receipt, bill payment, merchant payment, airtime, agent deposit), each with its own template variants. A classifier picks the candidate type before any extraction happens.
- Extraction. Type-specific patterns pull out amount, counterparty, transaction ID, timestamp, and post-transaction balance. Amount parsing alone handles 50,500.00 FCFA, 50500 FCFA, and a few formats the operators have used and abandoned over the years.
- Validation. Extracted fields are cross-checked: does the new balance make arithmetic sense given the previous one? Inconsistent parses are quarantined for review instead of silently corrupting the ledger.
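To make the extraction and validation stages concrete, here is a minimal sketch in Python, written against the sample message at the top of this post. It is not Yembi's production code: the pattern, the function names, and the balance rule (previous balance minus amount minus fee equals new balance) are assumptions chosen for illustration, and it treats the comma as a thousands separator because that is what the sample shows.

```python
import re
from decimal import Decimal

# The sample transfer message quoted at the top of the post.
SAMPLE = (
    "Cher client, vous avez transfere 50,500.00 FCFA au numero\n"
    "65013303,ABDOURAZACK. Votre nouveau solde est de 12,300.00 FCFA.\n"
    "ID Trans: PP260312.1453.A12345"
)

# Hypothetical pattern for this one template; a real pattern set has many variants.
TRANSFER = re.compile(
    r"transfere\s+(?P<amount>[\d,.]+)\s*FCFA\s+au\s+numero\s+"
    r"(?P<number>\d+),(?P<name>[^.]+)\.\s+"
    r"Votre nouveau solde est de\s+(?P<balance>[\d,.]+)\s*FCFA\.\s+"
    r"ID Trans:\s*(?P<txn_id>\S+)"
)


def parse_amount(raw: str) -> Decimal:
    """Normalize '50,500.00' and '50500' to the same Decimal value.

    Assumption: commas are thousands separators and the dot is the decimal point.
    """
    return Decimal(raw.replace(",", ""))


def parse_transfer(sms: str):
    """Return the extracted fields, or None when the template is not recognized."""
    m = TRANSFER.search(sms)
    if m is None:
        return None  # unknown template: the caller should quarantine, not guess
    return {
        "amount": parse_amount(m["amount"]),
        "number": m["number"],
        "name": m["name"].strip(),
        "balance": parse_amount(m["balance"]),
        "txn_id": m["txn_id"],
    }


def balance_is_consistent(previous: Decimal, parsed: dict, fee: Decimal = Decimal("0")) -> bool:
    """Assumed rule for an outgoing transfer: previous - amount - fee == new balance."""
    return previous - parsed["amount"] - fee == parsed["balance"]
```

Calling parse_transfer(SAMPLE) yields the amount, counterparty number and name, new balance, and transaction ID as separate fields. A parse that fails balance_is_consistent would go to the quarantine queue rather than the ledger.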
The decision that saved the project: config-driven formats
Early on, format rules lived in code. Then Orange changed a template, and fixing it meant shipping an APK and waiting for users to update. Unacceptable for a financial ledger.
Now the parser is driven by versioned configuration. Pattern sets live as data, shipped independently of app releases, so a template change is a config update rather than a release. The training pipeline runs against a corpus of real anonymized messages, and every config version gets a measured accuracy score before it ships. The 99%+ number comes from that corpus. It’s a regression suite, not a guess.
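As an illustration of the idea, here is a small Python sketch of a versioned pattern set stored as JSON and scored against a labeled corpus. The config schema, the field names, the corpus messages other than the transfer wording from the sample above, and the 0.99 ship threshold are all hypothetical; the real format and tooling are not shown in this post.

```python
import json
import re

# Hypothetical config payload: shipped as data, independent of the app release.
CONFIG_JSON = r"""
{
  "version": 2,
  "patterns": [
    {"type": "transfer", "regex": "vous avez transfere (?P<amount>[\\d,.]+) FCFA"},
    {"type": "withdrawal", "regex": "vous avez retire (?P<amount>[\\d,.]+) FCFA"}
  ]
}
"""

# Made-up labeled corpus of (message, expected type) pairs.
CORPUS = [
    ("Cher client, vous avez transfere 50,500.00 FCFA au numero 65013303", "transfer"),
    ("Cher client, vous avez retire 150,000.00 FCFA", "withdrawal"),
    ("Cher client, vous avez transfere 2000 FCFA au numero 65013303", "transfer"),
    ("Retrait de 10,000.00 FCFA effectue", "withdrawal"),  # drifted template, not covered
]


def load_config(text: str) -> dict:
    """Parse the JSON payload and compile each regex once."""
    raw = json.loads(text)
    return {
        "version": raw["version"],
        "patterns": [(p["type"], re.compile(p["regex"])) for p in raw["patterns"]],
    }


def classify(config: dict, sms: str):
    """Return the type of the first matching pattern, or None if nothing matches."""
    for txn_type, pattern in config["patterns"]:
        if pattern.search(sms):
            return txn_type
    return None


def measure_accuracy(config: dict, corpus) -> float:
    """Fraction of corpus messages whose predicted type equals the label."""
    correct = sum(1 for sms, expected in corpus if classify(config, sms) == expected)
    return correct / len(corpus)


def should_ship(config: dict, corpus, threshold: float = 0.99) -> bool:
    """Hypothetical release gate: block a config version that scores below threshold."""
    return measure_accuracy(config, corpus) >= threshold
```

With this toy corpus the config scores 0.75, because the fourth message uses a template no pattern covers, so should_ship returns False. Fixing that means adding a pattern and bumping the version in the JSON, with no change to the code that consumes it.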
Idempotency, or: 2G will retry you into corruption
Parsing is half the system. The other half is syncing parsed transactions to the backend over networks that drop mid-request. The failure mode that matters: the server commits, the ACK never arrives, the client retries. Your user now has a duplicate 150,000 FCFA withdrawal in their history.
The fix is boring, which is what you want in a ledger: every transaction carries a content-derived SHA-256 idempotency key, and the server treats inserts as upserts on that key. A persistent client-side queue with exponential backoff can retry as aggressively as it wants. Duplicates are structurally impossible rather than carefully avoided.
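A minimal sketch of that pattern in Python, with an in-memory SQLite table standing in for the backend. Which fields feed the hash is an assumption (operator, transaction ID, amount, timestamp), as are the table and function names. The insert uses SQLite's INSERT OR IGNORE on the key column so that a replay is a no-op, which is one simple way to get the effect described above.

```python
import hashlib
import sqlite3


def idempotency_key(operator: str, txn_id: str, amount: str, ts: str) -> str:
    """Derive a stable key from transaction content (field choice is an assumption)."""
    canonical = "|".join([operator, txn_id, amount, ts])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    """The primary key on idem_key is what makes a second insert impossible."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS transactions (
            idem_key TEXT PRIMARY KEY,
            operator TEXT NOT NULL,
            txn_id   TEXT NOT NULL,
            amount   TEXT NOT NULL,
            ts       TEXT NOT NULL
        )
        """
    )


def sync_transaction(conn: sqlite3.Connection, txn: dict) -> bool:
    """Insert the transaction; return True if it was new, False if it was a replay."""
    key = idempotency_key(txn["operator"], txn["txn_id"], txn["amount"], txn["ts"])
    cur = conn.execute(
        "INSERT OR IGNORE INTO transactions (idem_key, operator, txn_id, amount, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (key, txn["operator"], txn["txn_id"], txn["amount"], txn["ts"]),
    )
    conn.commit()
    return cur.rowcount == 1
```

If the client sends the same transaction twice because the first ACK was lost, the second call returns False and the table still holds one row. The retry queue never has to know whether the first attempt landed.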
Performance: indexes are a product feature
A year of active use generates a serious local database, and Yembi’s whole pitch is “ask questions about your money.” Category breakdowns and date-range queries run against composite indexes designed for exactly those questions, which keeps them under 50 ms on million-row tables on mid-range Android hardware. The dashboard feels instant because the schema was designed backwards from the questions.
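Here is a sketch of what "designed backwards from the questions" can look like, using Python's sqlite3 module. The table layout, column names, and index name are hypothetical and do not reflect Yembi's actual schema. The question being served is "how much did this account spend per category in this date range?", so the index leads with the equality column, then the range column.

```python
import sqlite3

# The question the index is designed for: per-category totals in a date range.
CATEGORY_BREAKDOWN = """
SELECT category, SUM(amount)
FROM transactions
WHERE account_id = ? AND occurred_at >= ? AND occurred_at < ?
GROUP BY category
"""


def build_db() -> sqlite3.Connection:
    """Create a hypothetical transactions table with one composite index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE transactions (
            id          INTEGER PRIMARY KEY,
            account_id  INTEGER NOT NULL,
            category    TEXT    NOT NULL,
            occurred_at TEXT    NOT NULL,  -- ISO-8601 date, sorts lexicographically
            amount      INTEGER NOT NULL   -- whole FCFA
        )
        """
    )
    # Equality column first, range column second, then the grouped column.
    conn.execute(
        "CREATE INDEX idx_txn_account_date_category "
        "ON transactions (account_id, occurred_at, category)"
    )
    return conn


def category_breakdown(conn, account_id: int, start: str, end: str) -> dict:
    """Return {category: total} for one account over the half-open range [start, end)."""
    rows = conn.execute(CATEGORY_BREAKDOWN, (account_id, start, end)).fetchall()
    return {category: total for category, total in rows}


def query_plan(conn) -> str:
    """Return SQLite's plan text so the index usage can be checked, not assumed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + CATEGORY_BREAKDOWN, (1, "2026-01-01", "2026-02-01")
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Checking query_plan for the index name is a cheap regression test: if a schema change makes the planner fall back to a full table scan, the check fails before users notice a slow dashboard.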
What I’d tell anyone building on an unofficial interface
- Treat the format as adversarial. Version your parsers, measure accuracy against a real corpus, and make format changes deployable as data.
- Quarantine, don’t guess. A financial app that’s wrong is worse than one that asks.
- Make duplicates impossible, not unlikely.
- Design indexes from the user’s questions, not the entity model.
For most of this project I was the only engineer on Yembi: app, backend, admin portal, parser tooling, Play Store listing. We’ve since hired a second developer, and the config-driven parser is the main reason onboarding them was painless.
Yembi launched in March. Next post: the launch itself, what worked, what flopped, and the numbers.