Without NER, a spell checker flags every unfamiliar proper noun as a misspelling. The NER module provides heuristic, transformer, and hybrid implementations to detect entities and suppress false positives.
Overview
Named entities like personal names and place names often appear as “unknown words” to spell checkers. The NER module helps identify these entities, preventing the spell checker from flagging them as errors.
Entity Types Supported:
PER - Personal names (e.g., ကိုအောင်)
LOC - Locations (e.g., ရန်ကုန်မြို့)
ORG - Organizations (e.g., မြန်မာ့လေကြောင်း)
DATE - Date expressions
NUM - Numbers and numeric expressions
TIME - Time expressions
MISC - Miscellaneous named entities
OTHER - Not an entity (used internally for BIO tag “O”)
NER Implementations
mySpellChecker provides three NER implementations with different accuracy/speed trade-offs:
| Implementation | Accuracy | Speed | Dependencies |
|---|---|---|---|
| HeuristicNER | ~70% | Fast | None |
| TransformerNER | ~93% | Slow | transformers, torch |
| HybridNER | ~93% | Adaptive | transformers (optional) |
HeuristicNER
Fast, rule-based NER using patterns and whitelists. Ideal for real-time applications.
Features:
- Honorific-based name detection (ဦး, ဒေါ်, ကို, မ)
- Location suffix detection (မြို့, ရွာ, ပြည်နယ်)
- Organization pattern matching (ကုမ္ပဏီ, ဘဏ်, တက္ကသိုလ်)
- Whitelist support for known entities
- No external dependencies
from myspellchecker.text.ner_model import HeuristicNER, NERConfig
# Basic usage
ner = HeuristicNER()
entities = ner.extract_entities("ဦးအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")
for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
# Output:
# အောင်: PER (0.70)
# ရန်ကုန်မြို့: LOC (0.70)
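The honorific and suffix rules listed above can be illustrated with a minimal, self-contained sketch. It does not use the library: `tag_words`, `has_suffix`, and the rule tables are hypothetical names, and the input is assumed to be already segmented into words, which the real HeuristicNER may handle differently.

```python
# Hypothetical rule tables, taken from the feature list above.
HONORIFICS = {"ဦး", "ဒေါ်", "ကို", "မ"}
LOCATION_SUFFIXES = ("မြို့", "ရွာ", "ပြည်နယ်")
ORGANIZATION_SUFFIXES = ("ကုမ္ပဏီ", "ဘဏ်", "တက္ကသိုလ်")


def has_suffix(word: str, suffixes: tuple[str, ...]) -> bool:
    """True if the word ends with a suffix and is longer than the suffix itself."""
    return any(word.endswith(s) and word != s for s in suffixes)


def tag_words(words: list[str]) -> list[tuple[str, str]]:
    """Label pre-segmented words as PER, LOC, or ORG using simple rules."""
    entities = []
    for i, word in enumerate(words):
        if i > 0 and words[i - 1] in HONORIFICS:
            entities.append((word, "PER"))  # word right after an honorific
        elif has_suffix(word, LOCATION_SUFFIXES):
            entities.append((word, "LOC"))
        elif has_suffix(word, ORGANIZATION_SUFFIXES):
            entities.append((word, "ORG"))
    return entities


# Assumed segmentation of the example sentence used above.
words = ["ဦး", "အောင်", "သည်", "ရန်ကုန်မြို့", "တွင်", "နေသည်"]
result = tag_words(words)
print(result)
```

Rules like these need no model and run in a single pass, which is why the heuristic approach suits real-time use, at the cost of missing names that carry no honorific or suffix.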
TransformerNER
High-accuracy NER using HuggingFace transformer models.
Features:
- State-of-the-art accuracy (~93%)
- BIO tagging for multi-word entities
- Confidence scores for each prediction
- Batch processing support
- LRU result caching for performance
from myspellchecker.text.ner_model import TransformerNER, NERConfig
# Using factory method
ner = TransformerNER.from_pretrained(
    "chuuhtetnaing/myanmar-ner-model",
    device=0,  # GPU (use -1 for CPU)
    confidence_threshold=0.7
)
entities = ner.extract_entities("ကိုအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")
for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
Requirements:
pip install myspellchecker[transformers]
# or
pip install transformers torch
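BIO tagging, mentioned in the feature list above, marks the first token of an entity with `B-<label>`, continuation tokens with `I-<label>`, and everything else with `O`. The sketch below shows one common way to merge such tags into multi-word entities. `merge_bio` is a hypothetical helper, not the library's decoder, and the token segmentation in the example is assumed.

```python
def merge_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, label) pairs."""
    entities: list[tuple[str, str]] = []
    current_tokens: list[str] = []
    current_label = None

    def flush() -> None:
        """Store the entity collected so far, if any, and reset the buffer."""
        nonlocal current_tokens, current_label
        if current_tokens:
            # Myanmar text has no spaces between words, so join without a separator.
            entities.append(("".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()  # "O" or an inconsistent I- tag ends the current entity
    flush()
    return entities


tokens = ["ကို", "အောင်", "သည်", "ရန်ကုန်", "မြို့"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
merged = merge_bio(tokens, tags)
print(merged)
```

Dropping an `I-` tag that has no matching `B-` is one design choice among several; some decoders instead treat it as the start of a new entity.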
HybridNER
Combines the transformer and heuristic approaches. The transformer is the primary model, and the heuristics take over automatically if it is unavailable or fails.
Features:
- Best of both approaches
- Graceful degradation if transformer unavailable
- Automatic fallback on transformer errors
- Configurable fallback behavior
from myspellchecker.text.ner_model import NERFactory, NERConfig
# HybridNER via factory
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    fallback_to_heuristic=True  # Use heuristics if transformer fails
)
ner = NERFactory.create(config)
entities = ner.extract_entities("ဦးအောင်မြင့်သည် မန္တလေးမြို့တွင် နေသည်။")
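The fallback behavior can be pictured as a thin wrapper around two extractors. The following is a minimal sketch of that pattern only. `with_fallback`, `broken_transformer`, and `simple_heuristic` are hypothetical stand-ins, and the library's actual error handling may be more selective about which exceptions trigger the fallback.

```python
from typing import Callable

Extractor = Callable[[str], list[tuple[str, str]]]


def with_fallback(primary: Extractor, fallback: Extractor) -> Extractor:
    """Return an extractor that tries `primary` and uses `fallback` on any error."""
    def extract(text: str) -> list[tuple[str, str]]:
        try:
            return primary(text)
        except Exception:
            # For example: model not installed, download failure, out of memory.
            return fallback(text)
    return extract


def broken_transformer(text: str) -> list[tuple[str, str]]:
    """Stand-in for a transformer model that cannot be loaded."""
    raise RuntimeError("model unavailable")


def simple_heuristic(text: str) -> list[tuple[str, str]]:
    """Stand-in heuristic: recognizes a single hard-coded name."""
    return [("အောင်", "PER")] if "ဦးအောင်" in text else []


hybrid = with_fallback(broken_transformer, simple_heuristic)
entities = hybrid("ဦးအောင်သည် နေသည်။")
print(entities)
```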
NER Gazetteer
In addition to the heuristic and transformer implementations, mySpellChecker includes a curated NER gazetteer — a YAML-based dictionary of known named entities loaded from rules/named_entities.yaml. The gazetteer provides fast O(1) lookup without any ML dependencies.
Entity Categories
The gazetteer covers five categories with 373+ entities:
| Category | Examples | Count |
|---|---|---|
| Personal name components | အောင်, မြင့်, ခင် | ~100 |
| Place names | ရန်ကုန်, မန္တလေး, နေပြည်တော် | ~120 |
| Organization names | လွှတ်တော်, တပ်မတော် | ~50 |
| Religious/cultural terms | ဗုဒ္ဓ, ဓမ္မ | ~50 |
| Government bodies | ဝန်ကြီးဌာန, ကော်မရှင် | ~50 |
Gazetteer API
from myspellchecker.text.ner import is_known_entity, load_gazetteer
# Check if a word is a known entity (fast, cached lookup)
is_known_entity("ရန်ကုန်") # True
is_known_entity("ကြောင်") # False
# Load the full gazetteer
entities = load_gazetteer() # Returns frozenset[str]
len(entities) # 373
SQLite NER Schema
When building dictionaries, the enrichment pipeline (Step 5e) seeds NER entities into the database via the ner_entities table. This enables runtime entity lookup without loading the YAML file.
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="dictionary.db")
# Check if a word is a corpus-mined entity
provider.is_corpus_entity("ရန်ကုန်") # True
# Get entity type categories
provider.get_entity_types("ရန်ကုန်") # ["place_name"]
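To make the database-backed lookup concrete, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. Only the table name `ner_entities` comes from the text above. The column names (`word`, `entity_type`), the sample rows, and the two helper functions are assumptions and may not match the real schema or provider methods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per (word, entity type) pair.
conn.execute(
    "CREATE TABLE ner_entities ("
    "word TEXT NOT NULL, entity_type TEXT NOT NULL, "
    "PRIMARY KEY (word, entity_type))"
)
rows = [("ရန်ကုန်", "place_name"), ("အောင်", "personal_name")]
conn.executemany("INSERT INTO ner_entities VALUES (?, ?)", rows)


def is_corpus_entity(conn: sqlite3.Connection, word: str) -> bool:
    """True if the word appears in the ner_entities table."""
    row = conn.execute(
        "SELECT 1 FROM ner_entities WHERE word = ? LIMIT 1", (word,)
    ).fetchone()
    return row is not None


def get_entity_types(conn: sqlite3.Connection, word: str) -> list[str]:
    """Return all entity type categories recorded for the word."""
    cursor = conn.execute(
        "SELECT entity_type FROM ner_entities WHERE word = ? ORDER BY entity_type",
        (word,),
    )
    return [r[0] for r in cursor]


print(is_corpus_entity(conn, rows[0][0]), get_entity_types(conn, rows[0][0]))
```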
False Positive Suppression
The gazetteer integrates with error_suppression.py to automatically suppress spell check errors on recognized named entities. This prevents proper nouns, place names, and organization names from being flagged as misspellings.
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
# With use_ner=True (default), gazetteer suppression is active
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ရန်ကုန်မြို့သို့ သွားသည်။")
# "ရန်ကုန်" will not be flagged as an error
Integration with SpellChecker
NER is fully integrated into the SpellChecker pipeline. When enabled, the NER model:
- Provides name masks to the ContextValidator (for strategies to skip named entities)
- Filters errors post-validation, removing any error that overlaps a detected entity
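The post-validation filter in the second step amounts to a span-overlap check. A minimal sketch follows, assuming errors and entities are both represented as half-open `(start, end)` character spans. `overlaps` and `filter_errors` are hypothetical names, and the pipeline's real data structures are richer than plain tuples.

```python
Span = tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if half-open spans [a0, a1) and [b0, b1) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def filter_errors(errors: list[Span], entity_spans: list[Span]) -> list[Span]:
    """Drop every error span that overlaps at least one entity span."""
    return [
        err for err in errors
        if not any(overlaps(err, ent) for ent in entity_spans)
    ]


entity_spans = [(0, 7)]                # one detected entity at characters 0..6
errors = [(2, 7), (7, 9), (11, 15)]    # the first error falls inside the entity
remaining = filter_errors(errors, entity_spans)
print(remaining)
```

With half-open spans, an error that starts exactly where an entity ends does not count as overlapping, so it is kept.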
Basic Usage (Heuristic NER)
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
# Heuristic NER is enabled by default via use_ner=True
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ဦးအောင်သည် စာအုပ်ဖတ်သည်။")
# "အောင်" will not be flagged as an error
For highest accuracy, configure NERConfig with a transformer model:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, NERConfig
from myspellchecker.providers import SQLiteProvider
config = SpellCheckerConfig(
    ner=NERConfig(
        model_type="transformer",
        model_name="chuuhtetnaing/myanmar-ner-model",
        device=0,  # GPU index, -1 for CPU
        fallback_to_heuristic=True,  # Graceful degradation
    ),
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
CLI Usage
# Check with default heuristic NER (enabled by default)
myspellchecker check input.txt
# Check with transformer NER model
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model
# Check with transformer NER on GPU
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model --ner-device 0
# Disable NER entirely
myspellchecker check input.txt --no-ner
Disabling NER
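from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
# Disable NER for speed
config = SpellCheckerConfig(use_ner=False)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)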
NERConfig Options
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable/disable NER |
| model_type | str | "heuristic" | "heuristic" or "transformer" |
| model_name | str | "chuuhtetnaing/myanmar-ner-model" | HuggingFace model name |
| device | int | -1 | Device index (-1=CPU, 0+=GPU) |
| confidence_threshold | float | 0.5 | Minimum confidence to accept |
| heuristic_confidence | float | 0.7 | Confidence for heuristic results |
| batch_size | int | 32 | Batch size for transformer |
| cache_size | int | 1000 | LRU cache size |
| fallback_to_heuristic | bool | True | Use heuristics if transformer fails |
| ner_entity_types | list[str] | ["PER"] | Entity types to suppress false positives for. Add "LOC" to also suppress place-name false positives. Valid types: PER, LOC, ORG, DATE, NUM, TIME, MISC |
| loc_confidence_threshold | float | 0.85 | Higher confidence threshold for LOC entities due to common-noun/place-name ambiguity in Myanmar |
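One plausible reading of how `ner_entity_types`, `confidence_threshold`, and `loc_confidence_threshold` interact is sketched below. The default values come from the table. The function `should_suppress` and the exact way the library combines these options are assumptions, not confirmed behavior.

```python
def should_suppress(
    label: str,
    confidence: float,
    entity_types: tuple[str, ...] = ("PER",),
    confidence_threshold: float = 0.5,
    loc_confidence_threshold: float = 0.85,
) -> bool:
    """Decide whether an entity is trusted enough to suppress a spelling error."""
    if label not in entity_types:
        return False  # this entity type is not configured for suppression
    # LOC gets a stricter threshold because many place names are also common nouns.
    threshold = loc_confidence_threshold if label == "LOC" else confidence_threshold
    return confidence >= threshold


print(should_suppress("PER", 0.70))
print(should_suppress("LOC", 0.90))
print(should_suppress("LOC", 0.90, entity_types=("PER", "LOC")))
print(should_suppress("LOC", 0.70, entity_types=("PER", "LOC")))
```

Under these assumptions, a LOC entity at the default heuristic confidence of 0.7 would not suppress an error even when "LOC" is enabled, because it falls below the 0.85 threshold.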
Entity Data Structure
The Entity dataclass represents detected entities:
@dataclass
class Entity:
    text: str          # Entity text
    label: EntityType  # PER, LOC, ORG, DATE, NUM, TIME, MISC, OTHER
    start: int         # Start character position
    end: int           # End character position
    confidence: float  # 0.0 to 1.0
    metadata: dict     # Additional info (source, pattern, etc.)
Advanced Usage
Batch Processing
Process multiple texts efficiently:
texts = [
    "ဦးအောင်သည် ရန်ကုန်တွင် နေသည်။",
    "ဒေါ်မြင့်မြင့်သည် မန္တလေးသို့ သွားသည်။",
    "ကိုဇော်ဇော်သည် ပုဂံမြို့နယ်တွင် အလုပ်လုပ်သည်။"
]
all_entities = ner.extract_entities_batch(texts)
for i, entities in enumerate(all_entities):
    print(f"Text {i+1}: {[e.text for e in entities]}")
Custom Whitelist
Add known names to reduce false negatives:
from myspellchecker.text.ner import NameHeuristic
# Create heuristic with custom whitelist
whitelist = {"ရွှေစာ", "ချစ်စုလှိုင်", "မောင်မောင်"}
heuristic = NameHeuristic(whitelist=whitelist)
# These will always be recognized as names
is_name = heuristic.is_potential_name("ရွှေစာ") # True
- Real-time typing: Use HeuristicNER for fastest response
- Document checking: Use HybridNER for balance
- Batch processing: Use TransformerNER with batching
- High throughput: Enable result caching
# High-performance configuration
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    batch_size=64,    # Larger batches for throughput
    cache_size=5000,  # Larger cache for repeated texts
    device=0          # Use GPU if available
)
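Why `batch_size` and `cache_size` matter can be seen in a toy sketch that splits inputs into batches and memoizes results per text. It uses `functools.lru_cache` from the standard library. `run_model`, `cached_extract`, and `batched` are hypothetical stand-ins, and the sketch shows the two ideas separately rather than how the library wires them together.

```python
from functools import lru_cache

model_calls = 0  # counts how often the stand-in model actually runs


def run_model(text: str) -> tuple[str, ...]:
    """Stand-in for an expensive transformer call."""
    global model_calls
    model_calls += 1
    return tuple(text.split())


@lru_cache(maxsize=5000)
def cached_extract(text: str) -> tuple[str, ...]:
    """Return a cached result for repeated texts; call the model otherwise."""
    return run_model(text)


def batched(items: list[str], batch_size: int):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["a b", "c d", "a b", "e f", "c d"]
results = []
for batch in batched(texts, batch_size=2):
    results.extend(cached_extract(t) for t in batch)

print(model_calls, cached_extract.cache_info())
```

Repeated inputs are served from the cache, so the stand-in model runs only once per distinct text. A larger cache helps most when the same sentences recur, as in templated documents.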