Named Entity Recognition (NER)

Without NER, a spell checker flags every unfamiliar proper noun as a misspelling. The NER module provides heuristic, transformer, and hybrid implementations to detect entities and suppress false positives.

Overview

Named entities like personal names and place names often appear as “unknown words” to spell checkers. The NER module helps identify these entities, preventing the spell checker from flagging them as errors.

Entity Types Supported:

PER - Personal names (e.g., ကိုအောင်)
LOC - Locations (e.g., ရန်ကုန်မြို့)
ORG - Organizations (e.g., မြန်မာ့လေကြောင်း)
DATE - Date expressions
NUM - Numbers and numeric expressions
TIME - Time expressions
MISC - Miscellaneous named entities
OTHER - Not an entity (used internally for BIO tag “O”)

NER Implementations

mySpellChecker provides three NER implementations with different accuracy/speed trade-offs:

Implementation	Accuracy	Speed	Dependencies
`HeuristicNER`	~70%	Fast	None
`TransformerNER`	~93%	Slow	transformers, torch
`HybridNER`	~93%	Adaptive	transformers (optional)

HeuristicNER

Fast, rule-based NER using patterns and whitelists. Ideal for real-time applications.

Features:

Honorific-based name detection (ဦး, ဒေါ်, ကို, မ)
Location suffix detection (မြို့, ရွာ, ပြည်နယ်)
Organization pattern matching (ကုမ္ပဏီ, ဘဏ်, တက္ကသိုလ်)
Whitelist support for known entities
No external dependencies

from myspellchecker.text.ner_model import HeuristicNER, NERConfig

# Basic usage
ner = HeuristicNER()
entities = ner.extract_entities("ဦးအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")

for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
# Output:
# အောင်: PER (0.70)
# ရန်ကုန်မြို့: LOC (0.70)
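To make the rule-based idea concrete, here is a minimal standalone sketch of prefix and suffix matching on pre-segmented tokens. It is not the library's implementation: `guess_label` and its whitelist handling are hypothetical, and the marker tuples only reuse the examples from the feature list above. The honorifics ကို and မ are left out of the sketch because they also occur as an ordinary particle and prefix, so a plain prefix match on them would over-trigger.

```python
from typing import FrozenSet, Optional

# Marker lists reuse the examples from the feature list; real rule sets are larger.
HONORIFICS = ("ဦး", "ဒေါ်")
LOC_SUFFIXES = ("မြို့", "ရွာ", "ပြည်နယ်")
ORG_SUFFIXES = ("ကုမ္ပဏီ", "ဘဏ်", "တက္ကသိုလ်")


def guess_label(token: str, whitelist: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return a coarse label for one pre-segmented token, or None (hypothetical helper)."""
    if token in whitelist:
        # Assumption: the whitelist holds known personal names.
        return "PER"
    for prefix in HONORIFICS:
        # Require at least one character after the honorific.
        if token.startswith(prefix) and len(token) > len(prefix):
            return "PER"
    # A bare suffix such as "မြို့" (town) is a common noun, not an entity.
    if token.endswith(LOC_SUFFIXES) and token not in LOC_SUFFIXES:
        return "LOC"
    if token.endswith(ORG_SUFFIXES) and token not in ORG_SUFFIXES:
        return "ORG"
    return None
```

Rules like these are cheap to evaluate, which is why the heuristic approach suits real-time use, but they cannot resolve ambiguous tokens without context.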

TransformerNER

High-accuracy NER using HuggingFace transformer models.

Features:

Highest accuracy of the three implementations (~93%)
BIO tagging for multi-word entities
Confidence scores for each prediction
Batch processing support
LRU result caching for performance

from myspellchecker.text.ner_model import TransformerNER, NERConfig

# Using factory method
ner = TransformerNER.from_pretrained(
    "chuuhtetnaing/myanmar-ner-model",
    device=0,  # GPU (use -1 for CPU)
    confidence_threshold=0.7
)

entities = ner.extract_entities("ကိုအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")
for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
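BIO tagging marks the first token of an entity with `B-<label>`, continuation tokens with `I-<label>`, and everything else with `O`. The sketch below shows one common way to merge such tags back into multi-token entities. `decode_bio` is a hypothetical helper written for illustration, not part of the library, and it joins tokens without spaces on the assumption that the input is Myanmar script.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # Start a new entity; a stray I- tag is treated like a B- tag.
            if current and label is not None:
                entities.append(("".join(current), label))
            current, label = [token], tag_label
        elif prefix == "I":
            current.append(token)
        else:
            # "O" closes any open entity.
            if current and label is not None:
                entities.append(("".join(current), label))
            current, label = [], None
    if current and label is not None:
        entities.append(("".join(current), label))
    return entities
```

For example, tokens tagged `B-PER, I-PER, O` yield a single PER entity spanning the first two tokens.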

Requirements:

pip install myspellchecker[transformers]
# or
pip install transformers torch

HybridNER

Combines transformer and heuristic approaches. Uses the transformer as primary, with automatic fallback to heuristics.

Features:

Transformer accuracy when the model is available, heuristic coverage when it is not
Graceful degradation if transformer unavailable
Automatic fallback on transformer errors
Configurable fallback behavior

from myspellchecker.text.ner_model import NERFactory, NERConfig

# HybridNER via factory
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    fallback_to_heuristic=True  # Use heuristics if transformer fails
)
ner = NERFactory.create(config)

entities = ner.extract_entities("ဦးအောင်မြင့်သည် မန္တလေးမြို့တွင် နေသည်။")
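The fallback behavior can be pictured as a thin wrapper around two extractors. The following sketch is an illustration of that pattern only. `with_fallback`, `broken_model`, and `rule_based` are hypothetical names, the extractors return plain strings rather than `Entity` objects, and the library's actual error handling may be more selective than catching every exception.

```python
from typing import Callable, List

Extractor = Callable[[str], List[str]]


def with_fallback(primary: Extractor, fallback: Extractor) -> Extractor:
    """Return an extractor that tries `primary` and falls back on any error."""
    def extract(text: str) -> List[str]:
        try:
            return primary(text)
        except Exception:
            # Model missing, out of memory, bad input: degrade instead of failing.
            return fallback(text)
    return extract


def broken_model(text: str) -> List[str]:
    """Stand-in for a transformer that failed to load."""
    raise RuntimeError("model not loaded")


def rule_based(text: str) -> List[str]:
    """Stand-in heuristic: treat capitalized words as entities."""
    return [word for word in text.split() if word.istitle()]


extract = with_fallback(broken_model, rule_based)
```

Calling `extract("Aung lives in Yangon")` returns the heuristic result because the primary extractor raises.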

NER Gazetteer

In addition to the heuristic and transformer implementations, mySpellChecker includes a curated NER gazetteer: a YAML-based dictionary of known named entities loaded from rules/named_entities.yaml. The gazetteer provides fast O(1) lookup without any ML dependencies.

Entity Categories

The gazetteer covers five categories with 373+ entities:

Category	Examples	Count
Personal name components	အောင်, မြင့်, ခင်	~100
Place names	ရန်ကုန်, မန္တလေး, နေပြည်တော်	~120
Organization names	လွှတ်တော်, တပ်မတော်	~50
Religious/cultural terms	ဗုဒ္ဓ, ဓမ္မ	~50
Government bodies	ဝန်ကြီးဌာန, ကော်မရှင်	~50

Gazetteer API

from myspellchecker.text.ner import is_known_entity, load_gazetteer

# Check if a word is a known entity (fast, cached lookup)
is_known_entity("ရန်ကုန်")  # True
is_known_entity("ကြောင်")   # False

# Load the full gazetteer
entities = load_gazetteer()  # Returns frozenset[str]
len(entities)  # 373
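The O(1) lookup follows from storing every entry in a hash-based set. Below is a minimal sketch of that idea, assuming the categories are flattened into one `frozenset`. The `RAW_GAZETTEER` layout and the `flatten` and `is_known` names are hypothetical; the real schema of rules/named_entities.yaml is not shown in this document.

```python
from typing import Dict, FrozenSet, List

# Hypothetical category layout with a few entries from the table above.
RAW_GAZETTEER: Dict[str, List[str]] = {
    "place_names": ["ရန်ကုန်", "မန္တလေး", "နေပြည်တော်"],
    "organization_names": ["လွှတ်တော်", "တပ်မတော်"],
}


def flatten(categories: Dict[str, List[str]]) -> FrozenSet[str]:
    """Collapse all categories into one immutable set for hash-based lookup."""
    return frozenset(entry for entries in categories.values() for entry in entries)


KNOWN: FrozenSet[str] = flatten(RAW_GAZETTEER)


def is_known(word: str) -> bool:
    """Average-case O(1) membership test, independent of gazetteer size."""
    return word in KNOWN
```

An immutable set also makes the loaded gazetteer safe to cache and share between checker instances.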

SQLite NER Schema

When building dictionaries, the enrichment pipeline (Step 5e) seeds NER entities into the database via the ner_entities table. This enables runtime entity lookup without loading the YAML file.

from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="dictionary.db")

# Check if a word is a corpus-mined entity
provider.is_corpus_entity("ရန်ကုန်")  # True

# Get entity type categories
provider.get_entity_types("ရန်ကုန်")  # ["place_name"]
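As a rough picture of what a table-backed entity lookup involves, the sketch below builds an in-memory SQLite table and queries it with the standard `sqlite3` module. Only the table name `ner_entities` comes from this page. The column names, the primary key, and the `entity_types` helper are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3
from typing import List

# Seed rows for the demo; the second column mirrors the "place_name" style above.
SEED_ROWS = [
    ("ရန်ကုန်", "place_name"),
    ("လွှတ်တော်", "organization_name"),
]

conn = sqlite3.connect(":memory:")
# Assumed columns: the actual ner_entities schema is not documented here.
conn.execute(
    "CREATE TABLE ner_entities ("
    " word TEXT NOT NULL,"
    " entity_type TEXT NOT NULL,"
    " PRIMARY KEY (word, entity_type))"
)
conn.executemany("INSERT INTO ner_entities VALUES (?, ?)", SEED_ROWS)


def entity_types(word: str) -> List[str]:
    """Return every entity type recorded for `word` (empty if unknown)."""
    rows = conn.execute(
        "SELECT entity_type FROM ner_entities WHERE word = ? ORDER BY entity_type",
        (word,),
    )
    return [row[0] for row in rows]
```

With the composite primary key, each lookup is an index search rather than a table scan, and a word may carry more than one type.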

False Positive Suppression

The gazetteer integrates with error_suppression.py to automatically suppress spell check errors on recognized named entities. This prevents proper nouns, place names, and organization names from being flagged as misspellings.

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# With use_ner=True (default), gazetteer suppression is active
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ရန်ကုန်မြို့သို့ သွားသည်။")
# "ရန်ကုန်" will not be flagged as an error

Integration with SpellChecker

NER is fully integrated into the SpellChecker pipeline. When enabled, the NER model:

Provides name masks to the ContextValidator (for strategies to skip named entities)
Filters errors post-validation, removing any error that overlaps a detected entity
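The post-validation filter amounts to an interval-overlap test between error spans and entity spans. Here is a minimal sketch of that check. It assumes half-open `(start, end)` character spans, and `overlaps` and `filter_errors` are hypothetical names rather than the pipeline's own functions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed half-open: [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def filter_errors(errors: List[Span], entities: List[Span]) -> List[Span]:
    """Drop every error span that overlaps any detected entity span."""
    return [
        error
        for error in errors
        if not any(overlaps(error, entity) for entity in entities)
    ]
```

An error at `(0, 5)` is dropped when an entity covers `(3, 8)`, while an error at `(8, 10)` survives because the spans only touch.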

Basic Usage (Heuristic NER)

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Heuristic NER is enabled by default via use_ner=True
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ဦးအောင်သည် စာအုပ်ဖတ်သည်။")
# "အောင်" will not be flagged as an error

With Transformer NER

For the highest accuracy, configure NERConfig with a transformer model:

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, NERConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    ner=NERConfig(
        model_type="transformer",
        model_name="chuuhtetnaing/myanmar-ner-model",
        device=0,  # GPU index, -1 for CPU
        fallback_to_heuristic=True,  # Graceful degradation
    ),
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

CLI Usage

# Check with default heuristic NER (enabled by default)
myspellchecker check input.txt

# Check with transformer NER model
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model

# Check with transformer NER on GPU
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model --ner-device 0

# Disable NER entirely
myspellchecker check input.txt --no-ner

Disabling NER

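from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Disable NER for speed
config = SpellCheckerConfig(use_ner=False)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)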

NERConfig Options

Option	Type	Default	Description
`enabled`	bool	True	Enable/disable NER
`model_type`	str	"heuristic"	"heuristic" or "transformer"
`model_name`	str	"chuuhtetnaing/myanmar-ner-model"	HuggingFace model name
`device`	int	-1	Device index (-1=CPU, 0+=GPU)
`confidence_threshold`	float	0.5	Minimum confidence to accept
`heuristic_confidence`	float	0.7	Confidence for heuristic results
`batch_size`	int	32	Batch size for transformer
`cache_size`	int	1000	LRU cache size
`fallback_to_heuristic`	bool	True	Use heuristics if transformer fails
`ner_entity_types`	list[str]	`["PER"]`	Entity types to suppress false positives for. Add `"LOC"` to also suppress place-name FPs. Valid types: PER, LOC, ORG, DATE, NUM, TIME, MISC
`loc_confidence_threshold`	float	0.85	Higher confidence threshold for LOC entities due to common-noun/place-name ambiguity in Myanmar
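One plausible reading of how `ner_entity_types`, `confidence_threshold`, and `loc_confidence_threshold` interact is sketched below. The defaults are copied from the table, but `SuppressionPolicy` and `should_suppress` are hypothetical, and the way the library actually combines these options is an assumption here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SuppressionPolicy:
    """Hypothetical stand-in for the suppression-related NERConfig options."""
    entity_types: Tuple[str, ...] = ("PER",)
    confidence_threshold: float = 0.5
    loc_confidence_threshold: float = 0.85

    def should_suppress(self, label: str, confidence: float) -> bool:
        """Decide whether an entity may suppress a spelling error."""
        if label not in self.entity_types:
            return False
        # Assumption: LOC uses its own stricter threshold, others the general one.
        if label == "LOC":
            return confidence >= self.loc_confidence_threshold
        return confidence >= self.confidence_threshold
```

Under the defaults only PER entities suppress errors. Passing `entity_types=("PER", "LOC")` opts in place names, which then need a confidence of at least 0.85.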

Entity Data Structure

The Entity dataclass represents detected entities:

@dataclass
class Entity:
    text: str          # Entity text
    label: EntityType  # PER, LOC, ORG, DATE, NUM, TIME, MISC, OTHER
    start: int         # Start character position
    end: int           # End character position
    confidence: float  # 0.0 to 1.0
    metadata: dict     # Additional info (source, pattern, etc.)

Advanced Usage

Batch Processing

Process multiple texts efficiently:

# Assumes `ner` was created as shown in the sections above
texts = [
    "ဦးအောင်သည် ရန်ကုန်တွင် နေသည်။",
    "ဒေါ်မြင့်မြင့်သည် မန္တလေးသို့ သွားသည်။",
    "ကိုဇော်ဇော်သည် ပုဂံမြို့နယ်တွင် အလုပ်လုပ်သည်။"
]

all_entities = ner.extract_entities_batch(texts)
for i, entities in enumerate(all_entities):
    print(f"Text {i+1}: {[e.text for e in entities]}")
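Batch processing generally means splitting the inputs into fixed-size groups so that each group goes through the model in a single forward pass. The sketch below shows only the grouping step. `batched` is a hypothetical helper, and nothing here is taken from how `extract_entities_batch` is implemented.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` items, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Five texts with `batch_size=2` produce groups of two, two, and one. Larger batches usually raise throughput at the cost of more memory per pass.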

Custom Whitelist

Add known names to reduce false negatives:

from myspellchecker.text.ner import NameHeuristic

# Create heuristic with custom whitelist
whitelist = {"ရွှေစာ", "ချစ်စုလှိုင်", "မောင်မောင်"}
heuristic = NameHeuristic(whitelist=whitelist)

# These will always be recognized as names
is_name = heuristic.is_potential_name("ရွှေစာ")  # True

Performance Tips

Real-time typing: Use HeuristicNER for fastest response
Document checking: Use HybridNER for balance
Batch processing: Use TransformerNER with batching
High throughput: Enable result caching

# High-performance configuration
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    batch_size=64,      # Larger batches for throughput
    cache_size=5000,    # Larger cache for repeated texts
    device=0            # Use GPU if available
)
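Result caching pays off when the same text is checked repeatedly, for example on every keystroke in an editor. The sketch below illustrates the effect with the standard library's `functools.lru_cache`. `slow_extract` is a hypothetical stand-in for an expensive model call, and the library's own cache may be implemented differently.

```python
from functools import lru_cache
from typing import Tuple

calls = {"count": 0}  # counts how often the expensive path actually runs


def slow_extract(text: str) -> Tuple[str, ...]:
    """Stand-in for a costly model call: capitalized words count as entities."""
    calls["count"] += 1
    return tuple(word for word in text.split() if word.istitle())


@lru_cache(maxsize=1000)  # mirrors the default cache_size in the options table
def cached_extract(text: str) -> Tuple[str, ...]:
    """Return a cached result for repeated inputs, evicting least recently used."""
    return slow_extract(text)
```

Results are returned as tuples so that cached values cannot be mutated by callers. A larger `maxsize` trades memory for a higher hit rate on repeated texts.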
