Overview
Named entities like personal names and place names often appear as “unknown words” to spell checkers. The NER module helps identify these entities, preventing the spell checker from flagging them as errors.
Entity Types Supported:
- PER - Personal names (e.g., ကိုအောင်)
- LOC - Locations (e.g., ရန်ကုန်မြို့)
- ORG - Organizations (e.g., မြန်မာ့လေကြောင်း)
- DATE - Date expressions
- NUM - Numbers and numeric expressions
- TIME - Time expressions
- MISC - Miscellaneous named entities
- OTHER - Not an entity (used internally for BIO tag “O”)
NER Implementations
mySpellChecker provides three NER implementations with different accuracy/speed trade-offs:

| Implementation | Accuracy | Speed | Dependencies |
|---|---|---|---|
| HeuristicNER | ~70% | Fast | None |
| TransformerNER | ~93% | Slow | transformers, torch |
| HybridNER | ~93% | Adaptive | transformers (optional) |
HeuristicNER
Fast, rule-based NER using patterns and whitelists. Ideal for real-time applications.
Features:
- Honorific-based name detection (ဦး, ဒေါ်, ကို, မ)
- Location suffix detection (မြို့, ရွာ, ပြည်နယ်)
- Organization pattern matching (ကုမ္ပဏီ, ဘဏ်, တက္ကသိုလ်)
- Whitelist support for known entities
- No external dependencies
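The rules listed above can be sketched roughly as follows. This is a minimal illustration, not the library's HeuristicNER: it assumes the text is already segmented into tokens, and the names `tag_token` and `_is_base_letter` are hypothetical. The honorific rule only fires when the honorific is followed by a base letter, so that a word such as မြန်မာ (where မ carries a medial) is not mistaken for the honorific မ.

```python
from typing import Mapping, Optional

# Marker lists taken from the feature list above.
HONORIFICS = ("ဦး", "ဒေါ်", "ကို", "မ")
LOCATION_SUFFIXES = ("မြို့", "ရွာ", "ပြည်နယ်")
ORG_MARKERS = ("ကုမ္ပဏီ", "ဘဏ်", "တက္ကသိုလ်")


def _is_base_letter(ch: str) -> bool:
    """True for Myanmar base letters U+1000..U+1021 (no medials or vowel signs)."""
    return "\u1000" <= ch <= "\u1021"


def tag_token(token: str, whitelist: Optional[Mapping[str, str]] = None) -> str:
    """Return PER, LOC, ORG, or O for one pre-segmented token (hypothetical helper)."""
    # 1. A whitelist entry wins over every pattern.
    if whitelist and token in whitelist:
        return whitelist[token]
    # 2. Location suffix; the bare suffix alone is not treated as an entity.
    for suffix in LOCATION_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return "LOC"
    # 3. Organization marker at the end of the token.
    for marker in ORG_MARKERS:
        if token.endswith(marker) and len(token) > len(marker):
            return "ORG"
    # 4. Honorific prefix followed by a new base letter.
    for honorific in HONORIFICS:
        rest = token[len(honorific):]
        if token.startswith(honorific) and rest and _is_base_letter(rest[0]):
            return "PER"
    return "O"
```

Checking suffixes before honorifics is a design choice of this sketch: it keeps a place name that happens to start with မ from being tagged as a personal name.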
TransformerNER
High-accuracy NER using HuggingFace transformer models.
Features:
- State-of-the-art accuracy (~93%)
- BIO tagging for multi-word entities
- Confidence scores for each prediction
- Batch processing support
- LRU result caching for performance
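To illustrate how BIO tags and per-token confidence scores can be combined into multi-word entities, here is a small, self-contained sketch. It does not call the transformer model. The function name `decode_bio`, the tuple layout, and the choice to score a span by its weakest token are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[str, str, float]  # (label, text, confidence)


def decode_bio(
    tokens: Sequence[str],
    tags: Sequence[str],
    scores: Sequence[float],
    confidence_threshold: float = 0.5,
) -> List[Span]:
    """Merge B-/I- tagged tokens into spans and drop low-confidence spans."""
    spans: List[Span] = []
    label = None
    parts: List[str] = []
    confs: List[float] = []

    def flush() -> None:
        # Close the span in progress, keeping it only if it clears the threshold.
        nonlocal label, parts, confs
        if label is not None:
            confidence = min(confs)  # assumption: a span is as strong as its weakest token
            if confidence >= confidence_threshold:
                spans.append((label, "".join(parts), confidence))
        label, parts, confs = None, [], []

    for token, tag, score in zip(tokens, tags, scores):
        if tag.startswith("B-"):
            flush()
            label, parts, confs = tag[2:], [token], [score]
        elif tag.startswith("I-") and label == tag[2:]:
            parts.append(token)
            confs.append(score)
        else:
            # "O", or an I- tag with no matching open span, ends the current span.
            flush()
    flush()
    return spans
```

Tokens are joined without a separator because Burmese text does not put spaces between the syllables of a word.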
HybridNER
Combines transformer and heuristic approaches. Uses the transformer as primary, with automatic fallback to heuristics.
Features:
- Best of both approaches
- Graceful degradation if transformer unavailable
- Automatic fallback on transformer errors
- Configurable fallback behavior
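The fallback behavior described above follows a common pattern, sketched here with plain callables standing in for the two extractors. The names `make_hybrid` and `fallback_on_error` are hypothetical and do not reflect the HybridNER class itself.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[str, str]  # (label, text)
Extractor = Callable[[str], List[Span]]


def make_hybrid(
    primary: Optional[Extractor],
    fallback: Extractor,
    fallback_on_error: bool = True,
) -> Extractor:
    """Build an extractor that prefers `primary` and degrades to `fallback`."""

    def extract(text: str) -> List[Span]:
        # Primary unavailable (e.g. optional dependency missing): use heuristics.
        if primary is None:
            return fallback(text)
        try:
            return primary(text)
        except Exception:
            # Primary failed at runtime: fall back, unless the caller opted out.
            if not fallback_on_error:
                raise
            return fallback(text)

    return extract
```

Passing `primary=None` models the case where the transformer dependencies are not installed, while the `try`/`except` covers failures during inference.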
NER Gazetteer
In addition to the heuristic and transformer implementations, mySpellChecker includes a curated NER gazetteer: a YAML-based dictionary of known named entities loaded from rules/named_entities.yaml. The gazetteer provides fast O(1) lookup without any ML dependencies.
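The constant-time lookup comes from indexing entities in a hash table. The sketch below shows the idea with an in-memory dictionary standing in for the parsed YAML. The category keys and the `build_index` helper are hypothetical, and the real layout of rules/named_entities.yaml may differ.

```python
from typing import Dict, List

# Stand-in for the parsed YAML content; the keys here are illustrative only.
GAZETTEER: Dict[str, List[str]] = {
    "place_names": ["ရန်ကုန်", "မန္တလေး", "နေပြည်တော်"],
    "organization_names": ["လွှတ်တော်", "တပ်မတော်"],
}


def build_index(gazetteer: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert category -> names into name -> category for hash lookups."""
    return {name: category for category, names in gazetteer.items() for name in names}


INDEX = build_index(GAZETTEER)

# Average-case O(1): a single dictionary probe per candidate word.
category = INDEX.get("မန္တလေး")
```

Building the inverted index once at load time keeps each later lookup independent of the gazetteer's size.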
Entity Categories
The gazetteer covers five categories with 373+ entities:

| Category | Examples | Count |
|---|---|---|
| Personal name components | အောင်, မြင့်, ခင် | ~100 |
| Place names | ရန်ကုန်, မန္တလေး, နေပြည်တော် | ~120 |
| Organization names | လွှတ်တော်, တပ်မတော် | ~50 |
| Religious/cultural terms | ဗုဒ္ဓ, ဓမ္မ | ~50 |
| Government bodies | ဝန်ကြီးဌာန, ကော်မရှင် | ~50 |
Gazetteer API
SQLite NER Schema
When building dictionaries, the enrichment pipeline (Step 5e) seeds NER entities into the database via the ner_entities table. This enables runtime entity lookup without loading the YAML file.
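As a rough picture of seeding and querying such a table, the sketch below uses Python's built-in sqlite3 module. Only the table name ner_entities comes from this page. The column names `text` and `entity_type` and both helper functions are assumptions, and the actual schema may differ.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def seed_entities(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> None:
    """Create the table if needed and upsert (text, entity_type) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ner_entities ("
        "text TEXT PRIMARY KEY, entity_type TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO ner_entities (text, entity_type) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def lookup_entity(conn: sqlite3.Connection, text: str) -> Optional[str]:
    """Return the stored entity type for `text`, or None if it is unknown."""
    row = conn.execute(
        "SELECT entity_type FROM ner_entities WHERE text = ?", (text,)
    ).fetchone()
    return row[0] if row else None


# In-memory database for demonstration only.
conn = sqlite3.connect(":memory:")
seed_entities(conn, [("ရန်ကုန်", "LOC"), ("တပ်မတော်", "ORG")])
```

Using the entity text as the primary key lets SQLite answer each lookup through its index rather than scanning the table.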
False Positive Suppression
The gazetteer integrates with error_suppression.py to automatically suppress spell check errors on recognized named entities. This prevents proper nouns, place names, and organization names from being flagged as misspellings.
Integration with SpellChecker
NER is fully integrated into the SpellChecker pipeline. When enabled, the NER model:
- Provides name masks to the ContextValidator (for strategies to skip named entities)
- Filters errors post-validation, removing any error that overlaps a detected entity
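The post-validation filter amounts to an interval-overlap test. The following sketch represents errors and entities as half-open `(start, end)` character offsets. That representation and the function names are assumptions for illustration, not the pipeline's internal types.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # half-open character offsets: [start, end)


def overlaps(a: Range, b: Range) -> bool:
    """True if two half-open ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def filter_errors(errors: Sequence[Range], entities: Sequence[Range]) -> List[Range]:
    """Keep only the errors that do not touch any detected entity."""
    return [
        error
        for error in errors
        if not any(overlaps(error, entity) for entity in entities)
    ]
```

With half-open ranges, an error that starts exactly where an entity ends is kept, since the two ranges share no character.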
Basic Usage (Heuristic NER)
With Transformer NER
For highest accuracy, configure NERConfig with a transformer model:
CLI Usage
Disabling NER
NERConfig Options
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable/disable NER |
| model_type | str | "heuristic" | "heuristic" or "transformer" |
| model_name | str | "chuuhtetnaing/myanmar-ner-model" | HuggingFace model name |
| device | int | -1 | Device index (-1=CPU, 0+=GPU) |
| confidence_threshold | float | 0.5 | Minimum confidence to accept |
| heuristic_confidence | float | 0.7 | Confidence for heuristic results |
| batch_size | int | 32 | Batch size for transformer |
| cache_size | int | 1000 | LRU cache size |
| fallback_to_heuristic | bool | True | Use heuristics if transformer fails |
| ner_entity_types | list[str] | ["PER"] | Entity types to suppress false positives for. Add "LOC" to also suppress place-name FPs. Valid types: PER, LOC, ORG, DATE, NUM, TIME, MISC |
| loc_confidence_threshold | float | 0.85 | Higher confidence threshold for LOC entities due to common-noun/place-name ambiguity in Myanmar |
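One plausible reading of how ner_entity_types, confidence_threshold, and loc_confidence_threshold interact is sketched below. The function `should_suppress` is hypothetical and only reuses the option names and defaults from the table. The library's actual decision logic may differ.

```python
from typing import Sequence


def should_suppress(
    label: str,
    confidence: float,
    ner_entity_types: Sequence[str] = ("PER",),
    confidence_threshold: float = 0.5,
    loc_confidence_threshold: float = 0.85,
) -> bool:
    """Decide whether a detected entity should suppress a spelling error."""
    # Only the configured entity types are eligible for suppression.
    if label not in ner_entity_types:
        return False
    # Assumption: LOC uses its own stricter threshold, all other types the general one.
    threshold = loc_confidence_threshold if label == "LOC" else confidence_threshold
    return confidence >= threshold
```

Under this reading, a LOC entity detected with confidence 0.7 would not suppress an error even when "LOC" is listed, while a PER entity with the same confidence would.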
Entity Data Structure
The Entity dataclass represents detected entities:
Advanced Usage
Batch Processing
Process multiple texts efficiently:
Custom Whitelist
Add known names to reduce false negatives:
Performance Tips
- Real-time typing: Use HeuristicNER for fastest response
- Document checking: Use HybridNER for balance
- Batch processing: Use TransformerNER with batching
- High throughput: Enable result caching
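Result caching for repeated inputs can be approximated with the standard library's `functools.lru_cache`, as in this sketch. The wrapper `make_cached` is hypothetical. The default of 1000 entries mirrors the cache_size default listed under NERConfig Options, but how the library implements its cache is not shown here.

```python
from functools import lru_cache
from typing import Callable, List, Tuple

Span = Tuple[str, str]  # (label, text)


def make_cached(extract: Callable[[str], List[Span]], cache_size: int = 1000):
    """Wrap an extractor so repeated texts are served from an LRU cache."""

    @lru_cache(maxsize=cache_size)
    def cached(text: str) -> Tuple[Span, ...]:
        # Return an immutable tuple so callers cannot mutate the cached value.
        return tuple(extract(text))

    return cached
```

The wrapped function exposes `cache_info()` and `cache_clear()`, which can help when tuning the cache size for a given workload.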
See Also
- Syllable Validation - Core validation layer
- Word Validation - Dictionary-based validation
- Grammar Checking - Syntactic validation