Without NER, a spell checker flags every unfamiliar proper noun as a misspelling. The NER module provides heuristic, transformer, and hybrid implementations to detect entities and suppress false positives.

Overview

Named entities such as personal names and place names often look like unknown words to a spell checker. The NER module identifies these entities so that the spell checker does not flag them as errors.

Supported entity types:
  • PER - Personal names (e.g., ကိုအောင်)
  • LOC - Locations (e.g., ရန်ကုန်မြို့)
  • ORG - Organizations (e.g., မြန်မာ့လေကြောင်း)
  • DATE - Date expressions
  • NUM - Numbers and numeric expressions
  • TIME - Time expressions
  • MISC - Miscellaneous named entities
  • OTHER - Not an entity (used internally for BIO tag “O”)

NER Implementations

mySpellChecker provides three NER implementations with different accuracy/speed trade-offs:
| Implementation | Accuracy | Speed | Dependencies |
| --- | --- | --- | --- |
| HeuristicNER | ~70% | Fast | None |
| TransformerNER | ~93% | Slow | transformers, torch |
| HybridNER | ~93% | Adaptive | transformers (optional) |

HeuristicNER

Fast, rule-based NER using patterns and whitelists. Ideal for real-time applications.

Features:
  • Honorific-based name detection (ဦး, ဒေါ်, ကို, မ)
  • Location suffix detection (မြို့, ရွာ, ပြည်နယ်)
  • Organization pattern matching (ကုမ္ပဏီ, ဘဏ်, တက္ကသိုလ်)
  • Whitelist support for known entities
  • No external dependencies
from myspellchecker.text.ner_model import HeuristicNER, NERConfig

# Basic usage
ner = HeuristicNER()
entities = ner.extract_entities("ဦးအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")

for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
# Output:
# အောင်: PER (0.70)
# ရန်ကုန်မြို့: LOC (0.70)
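
The honorific rule from the feature list can be illustrated without the library. The following is a minimal sketch, not HeuristicNER's actual implementation: `NameSpan` and `name_after_honorific` are hypothetical names, and the input is assumed to be a single word that has already been segmented. A bare prefix check like this over-triggers on short honorifics such as မ, which is one reason a real heuristic also needs whitelists and further checks.

```python
from dataclasses import dataclass
from typing import Optional

# Honorifics taken from the feature list above.
HONORIFICS = ("ဦး", "ဒေါ်", "ကို", "မ")


@dataclass
class NameSpan:
    text: str   # candidate name without the honorific
    start: int  # offset of the name's first character
    end: int    # offset one past the name's last character


def name_after_honorific(word: str, offset: int = 0) -> Optional[NameSpan]:
    """Return the span that follows a leading honorific, or None."""
    # Try longer honorifics first so a short prefix cannot shadow a longer one.
    for honorific in sorted(HONORIFICS, key=len, reverse=True):
        if word.startswith(honorific) and len(word) > len(honorific):
            start = offset + len(honorific)
            return NameSpan(word[len(honorific):], start, offset + len(word))
    return None


SAMPLE_NAME = "အောင်"
sample_span = name_after_honorific(HONORIFICS[0] + SAMPLE_NAME)
print(sample_span)
```

The span excludes the honorific itself, which matches the example output above where only the name part is reported as PER.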

TransformerNER

High-accuracy NER using HuggingFace transformer models.

Features:
  • State-of-the-art accuracy (~93%)
  • BIO tagging for multi-word entities
  • Confidence scores for each prediction
  • Batch processing support
  • LRU result caching for performance
from myspellchecker.text.ner_model import TransformerNER, NERConfig

# Using factory method
ner = TransformerNER.from_pretrained(
    "chuuhtetnaing/myanmar-ner-model",
    device=0,  # GPU (use -1 for CPU)
    confidence_threshold=0.7
)

entities = ner.extract_entities("ကိုအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")
for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
Requirements:
pip install myspellchecker[transformers]
# or
pip install transformers torch
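
The BIO tagging listed under the features can be sketched in isolation: a token-classification model emits one tag per token (B- opens an entity, I- continues it, O is outside), and adjacent tokens are merged into multi-word entities. This is a generic illustration of the scheme, not the code TransformerNER runs; `merge_bio_tags` and the sample tokens and tags are assumptions made for the example.

```python
from typing import List, Tuple


def merge_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_text, label) tuples."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append(("".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (an orphan I- tag is treated like B-).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


SAMPLE_TOKENS = ["ကို", "အောင်", "သည်", "ရန်ကုန်", "မြို့"]
SAMPLE_TAGS = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
merged = merge_bio_tags(SAMPLE_TOKENS, SAMPLE_TAGS)
print(merged)
```

Tokens are joined without a separator because Myanmar script does not put spaces between the syllables of a word.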

HybridNER

Combines the transformer and heuristic approaches. Uses the transformer as the primary model, with automatic fallback to heuristics.

Features:
  • Best of both approaches
  • Graceful degradation if transformer unavailable
  • Automatic fallback on transformer errors
  • Configurable fallback behavior
from myspellchecker.text.ner_model import NERFactory, NERConfig

# HybridNER via factory
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    fallback_to_heuristic=True  # Use heuristics if transformer fails
)
ner = NERFactory.create(config)

entities = ner.extract_entities("ဦးအောင်မြင့်သည် မန္တလေးမြို့တွင် နေသည်။")
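
The fallback behaviour described above follows a common pattern: try the primary extractor, and switch to the cheaper one if it raises. The sketch below shows that pattern with plain callables. `extract_with_fallback`, `broken_transformer`, and `simple_heuristic` are hypothetical stand-ins, and the library's own error handling may differ.

```python
from typing import Callable, List, Tuple

Extractor = Callable[[str], List[str]]


def extract_with_fallback(
    text: str, primary: Extractor, fallback: Extractor
) -> Tuple[List[str], str]:
    """Run `primary`; if it raises, run `fallback` instead.

    Returns the result together with the name of the path that produced it.
    """
    try:
        return primary(text), "primary"
    except Exception:
        # Any failure of the primary model degrades to the fallback.
        return fallback(text), "fallback"


def broken_transformer(text: str) -> List[str]:
    # Stand-in for a transformer that failed to load or run.
    raise RuntimeError("model not available")


def simple_heuristic(text: str) -> List[str]:
    # Stand-in for a rule-based extractor.
    return text.split()


result, source = extract_with_fallback("a b", broken_transformer, simple_heuristic)
print(result, source)
```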

NER Gazetteer

In addition to the heuristic and transformer implementations, mySpellChecker includes a curated NER gazetteer — a YAML-based dictionary of known named entities loaded from rules/named_entities.yaml. The gazetteer provides fast O(1) lookup without any ML dependencies.

Entity Categories

The gazetteer covers five categories with 373+ entities:
| Category | Examples | Count |
| --- | --- | --- |
| Personal name components | အောင်, မြင့်, ခင် | ~100 |
| Place names | ရန်ကုန်, မန္တလေး, နေပြည်တော် | ~120 |
| Organization names | လွှတ်တော်, တပ်မတော် | ~50 |
| Religious/cultural terms | ဗုဒ္ဓ, ဓမ္မ | ~50 |
| Government bodies | ဝန်ကြီးဌာန, ကော်မရှင် | ~50 |

Gazetteer API

from myspellchecker.text.ner import is_known_entity, load_gazetteer

# Check if a word is a known entity (fast, cached lookup)
is_known_entity("ရန်ကုန်")  # True
is_known_entity("ကြောင်")   # False

# Load the full gazetteer
entities = load_gazetteer()  # Returns frozenset[str]
len(entities)  # 373
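
The O(1) lookup comes from storing the entries in a hash-based set that is built once and then reused. The sketch below illustrates the idea with an in-memory dictionary standing in for the parsed YAML file. `SAMPLE_CATEGORIES`, `load_entries`, and `is_known` are hypothetical names, and the real file layout is not shown here.

```python
from functools import lru_cache
from typing import FrozenSet

# Stand-in for the parsed YAML content: category name -> list of entries.
SAMPLE_CATEGORIES = {
    "place_names": ["ရန်ကုန်", "မန္တလေး"],
    "personal_name_components": ["အောင်", "မြင့်"],
}


@lru_cache(maxsize=1)
def load_entries() -> FrozenSet[str]:
    """Flatten all categories into one immutable set, built only once."""
    return frozenset(
        word for words in SAMPLE_CATEGORIES.values() for word in words
    )


def is_known(word: str) -> bool:
    # Set membership is a hash lookup, independent of the gazetteer size.
    return word in load_entries()


print(is_known(SAMPLE_CATEGORIES["place_names"][0]))
print(len(load_entries()))
```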

SQLite NER Schema

When building dictionaries, the enrichment pipeline (Step 5e) seeds NER entities into the database via the ner_entities table. This enables runtime entity lookup without loading the YAML file.
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="dictionary.db")

# Check if a word is a corpus-mined entity
provider.is_corpus_entity("ရန်ကုန်")  # True

# Get entity type categories
provider.get_entity_types("ရန်ကုန်")  # ["place_name"]
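
To make the idea of a database-backed entity lookup concrete, here is a small sketch using Python's built-in sqlite3 module. Only the table name ner_entities comes from the text above. The column names, the primary key, and the helper functions `is_entity` and `entity_types` are assumptions for illustration and may not match the schema the pipeline actually creates.

```python
import sqlite3

SAMPLE_WORD = "ရန်ကုန်"

# In-memory database with an assumed two-column layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ner_entities (
        word        TEXT NOT NULL,
        entity_type TEXT NOT NULL,
        PRIMARY KEY (word, entity_type)
    )
    """
)
conn.executemany(
    "INSERT INTO ner_entities (word, entity_type) VALUES (?, ?)",
    [(SAMPLE_WORD, "place_name")],
)


def is_entity(db: sqlite3.Connection, word: str) -> bool:
    """True if the word has at least one row in ner_entities."""
    row = db.execute(
        "SELECT 1 FROM ner_entities WHERE word = ? LIMIT 1", (word,)
    ).fetchone()
    return row is not None


def entity_types(db: sqlite3.Connection, word: str) -> list:
    """All entity types recorded for the word, sorted by name."""
    rows = db.execute(
        "SELECT entity_type FROM ner_entities WHERE word = ? ORDER BY entity_type",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


print(is_entity(conn, SAMPLE_WORD), entity_types(conn, SAMPLE_WORD))
```

Because the primary key starts with the word column, both queries can use the index rather than scanning the table.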

False Positive Suppression

The gazetteer integrates with error_suppression.py to automatically suppress spell check errors on recognized named entities. This prevents proper nouns, place names, and organization names from being flagged as misspellings.
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# With use_ner=True (default), gazetteer suppression is active
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ရန်ကုန်မြို့သို့ သွားသည်။")
# "ရန်ကုန်" will not be flagged as an error

Integration with SpellChecker

NER is fully integrated into the SpellChecker pipeline. When enabled, the NER model:
  1. Provides name masks to the ContextValidator (for strategies to skip named entities)
  2. Filters errors post-validation, removing any error that overlaps a detected entity
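
The second step, dropping errors that overlap a detected entity, comes down to an interval-overlap test. The sketch below assumes half-open character spans `(start, end)` for both errors and entities. The function names are hypothetical and the library's own data types are richer than plain tuples.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def filter_errors(errors: List[Span], entities: List[Span]) -> List[Span]:
    """Keep only the errors that do not overlap any entity span."""
    return [
        error
        for error in errors
        if not any(overlaps(error, entity) for entity in entities)
    ]


sample_errors = [(0, 5), (6, 10), (4, 6)]
sample_entities = [(2, 4)]
remaining = filter_errors(sample_errors, sample_entities)
print(remaining)
```

With half-open spans, an error that starts exactly where an entity ends is treated as not overlapping and is kept.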

Basic Usage (Heuristic NER)

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Heuristic NER is enabled by default via use_ner=True
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ဦးအောင်သည် စာအုပ်ဖတ်သည်။")
# "အောင်" will not be flagged as an error

With Transformer NER

For highest accuracy, configure NERConfig with a transformer model:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, NERConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    ner=NERConfig(
        model_type="transformer",
        model_name="chuuhtetnaing/myanmar-ner-model",
        device=0,  # GPU index, -1 for CPU
        fallback_to_heuristic=True,  # Graceful degradation
    ),
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

CLI Usage

# Check with default heuristic NER (enabled by default)
myspellchecker check input.txt

# Check with transformer NER model
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model

# Check with transformer NER on GPU
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model --ner-device 0

# Disable NER entirely
myspellchecker check input.txt --no-ner

Disabling NER

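from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Disable NER for speed
config = SpellCheckerConfig(use_ner=False)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)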

NERConfig Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | True | Enable/disable NER |
| model_type | str | "heuristic" | "heuristic" or "transformer" |
| model_name | str | "chuuhtetnaing/myanmar-ner-model" | HuggingFace model name |
| device | int | -1 | Device index (-1=CPU, 0+=GPU) |
| confidence_threshold | float | 0.5 | Minimum confidence to accept |
| heuristic_confidence | float | 0.7 | Confidence for heuristic results |
| batch_size | int | 32 | Batch size for transformer |
| cache_size | int | 1000 | LRU cache size |
| fallback_to_heuristic | bool | True | Use heuristics if transformer fails |
| ner_entity_types | list[str] | ["PER"] | Entity types to suppress false positives for. Add "LOC" to also suppress place-name FPs. Valid types: PER, LOC, ORG, DATE, NUM, TIME, MISC |
| loc_confidence_threshold | float | 0.85 | Higher confidence threshold for LOC entities due to common-noun/place-name ambiguity in Myanmar |
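
One plausible way the three suppression-related options interact is sketched below: an entity only suppresses an error if its type is listed in ner_entity_types and its confidence reaches the applicable threshold, with LOC using the stricter one. The defaults are the values from the options above. The function `should_suppress` and the exact way the options are combined are assumptions, not the library's confirmed logic.

```python
from typing import Sequence


def should_suppress(
    label: str,
    confidence: float,
    ner_entity_types: Sequence[str] = ("PER",),
    confidence_threshold: float = 0.5,
    loc_confidence_threshold: float = 0.85,
) -> bool:
    """Decide whether a detected entity should suppress a spelling error."""
    if label not in ner_entity_types:
        return False
    # LOC uses the stricter threshold because place names are often
    # ambiguous with common nouns.
    threshold = loc_confidence_threshold if label == "LOC" else confidence_threshold
    return confidence >= threshold


print(should_suppress("PER", 0.7))
print(should_suppress("LOC", 0.8, ner_entity_types=("PER", "LOC")))
```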

Entity Data Structure

The Entity dataclass represents detected entities:
@dataclass
class Entity:
    text: str          # Entity text
    label: EntityType  # PER, LOC, ORG, DATE, NUM, TIME, MISC, OTHER
    start: int         # Start character position
    end: int           # End character position
    confidence: float  # 0.0 to 1.0
    metadata: dict     # Additional info (source, pattern, etc.)
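
The start and end fields make it possible to map an entity back onto the source text, for example to build a per-character mask of positions a checker should skip. The sketch below redefines a simplified Entity and EntityType locally so that it runs on its own. It assumes end is exclusive, which the field comments above do not state, and `entity_mask` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EntityType(Enum):
    PER = "PER"
    LOC = "LOC"
    ORG = "ORG"


@dataclass
class Entity:
    text: str
    label: EntityType
    start: int  # index of the first character
    end: int    # assumed exclusive: one past the last character
    confidence: float
    metadata: dict = field(default_factory=dict)


def entity_mask(text_length: int, entities: List[Entity]) -> List[bool]:
    """Mark every character position covered by at least one entity."""
    mask = [False] * text_length
    for entity in entities:
        for i in range(entity.start, min(entity.end, text_length)):
            mask[i] = True
    return mask


HONORIFIC, NAME, PARTICLE = "ဦး", "အောင်", "သည်"
sentence = HONORIFIC + NAME + PARTICLE
name_entity = Entity(
    NAME, EntityType.PER, len(HONORIFIC), len(HONORIFIC) + len(NAME), 0.7
)
mask = entity_mask(len(sentence), [name_entity])
print(sentence[name_entity.start:name_entity.end] == name_entity.text)
```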

Advanced Usage

Batch Processing

Process multiple texts efficiently:
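# Assumes `ner` is an NER instance created as shown earlier,
# e.g. via TransformerNER.from_pretrained(...)
texts = [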
    "ဦးအောင်သည် ရန်ကုန်တွင် နေသည်။",
    "ဒေါ်မြင့်မြင့်သည် မန္တလေးသို့ သွားသည်။",
    "ကိုဇော်ဇော်သည် ပုဂံမြို့နယ်တွင် အလုပ်လုပ်သည်။"
]

all_entities = ner.extract_entities_batch(texts)
for i, entities in enumerate(all_entities):
    print(f"Text {i+1}: {[e.text for e in entities]}")

Custom Whitelist

Add known names to reduce false negatives:
from myspellchecker.text.ner import NameHeuristic

# Create heuristic with custom whitelist
whitelist = {"ရွှေစာ", "ချစ်စုလှိုင်", "မောင်မောင်"}
heuristic = NameHeuristic(whitelist=whitelist)

# These will always be recognized as names
is_name = heuristic.is_potential_name("ရွှေစာ")  # True

Performance Tips

  1. Real-time typing: Use HeuristicNER for fastest response
  2. Document checking: Use HybridNER for balance
  3. Batch processing: Use TransformerNER with batching
  4. High throughput: Enable result caching
from myspellchecker.text.ner_model import NERConfig

# High-performance configuration
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    batch_size=64,      # Larger batches for throughput
    cache_size=5000,    # Larger cache for repeated texts
    device=0            # Use GPU if available
)
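
To see why a larger cache helps with repeated texts, the sketch below wraps an arbitrary extraction function in Python's functools.lru_cache. It is a generic illustration of LRU result caching, not the cache the library uses internally. `with_lru_cache` and `slow_extract` are hypothetical names.

```python
from functools import lru_cache
from typing import Callable, List, Tuple


def with_lru_cache(extract: Callable[[str], List[str]], cache_size: int = 1000):
    """Wrap `extract` so repeated texts are answered from an LRU cache."""

    @lru_cache(maxsize=cache_size)
    def cached(text: str) -> Tuple[str, ...]:
        # Results are stored as tuples so cached values cannot be mutated.
        return tuple(extract(text))

    return cached


call_log: List[str] = []


def slow_extract(text: str) -> List[str]:
    # Stand-in for an expensive model call; records each real invocation.
    call_log.append(text)
    return text.split()


cached_extract = with_lru_cache(slow_extract, cache_size=5000)
cached_extract("a b")
cached_extract("a b")  # served from the cache, slow_extract is not called again
print(len(call_log), cached_extract.cache_info())
```

The cache key is the input text itself, so the benefit depends on how often identical strings recur in the workload.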
