Code Analysis: nsUTF8Prober Confidence Calculation
1. Core Logic
The detector's core principle is simple: verify the UTF-8 encoding rules to determine whether the text is UTF-8. It uses a coding state machine (mCodingSM) to track whether byte sequences comply with the UTF-8 specification.
- Reset(): Initializes the detector: resets the state machine, the multi-byte character counter (mNumOfMBChar), and the detection state (mState).
- HandleData(): The primary entry point for processing an input byte stream (see the sketch after this list):
  - Feeds each byte through the state machine via mCodingSM->NextState(aBuf[i]).
  - An eItsMe return indicates the byte sequence definitively identifies UTF-8, so the detector state becomes eFoundIt ("confirmed UTF-8").
  - An eStart return indicates a complete UTF-8 character was successfully recognized:
    - For multi-byte characters (mCodingSM->GetCurrentCharLen() >= 2), mNumOfMBChar is incremented.
    - The Unicode code point being decoded (currentCodePoint) is assembled and stored in codePointBuffer.
  - Key optimization at the end of HandleData:

    ```cpp
    if (mState == eDetecting)
      if (mNumOfMBChar > ENOUGH_CHAR_THRESHOLD &&
          GetConfidence(0) > SHORTCUT_THRESHOLD)
        mState = eFoundIt;
    ```

    This allows early termination once enough valid multi-byte characters (mNumOfMBChar > 256) have accumulated with high confidence.
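To make the control flow concrete, here is a minimal, self-contained sketch of the HandleData() logic described above. It is not the uchardet source: the table-driven nsCodingStateMachine is replaced by a hand-rolled UTF-8 validator (which, unlike the real tables, does not reject overlong forms or surrogates), the code-point buffering is omitted, and the literals 256 and 0.95 stand in for ENOUGH_CHAR_THRESHOLD and SHORTCUT_THRESHOLD.

```cpp
#include <cstdio>
#include <cstring>

enum State { eDetecting, eFoundIt, eNotMe };

// Expected total length of a UTF-8 sequence given its lead byte,
// or 0 if the byte cannot start a sequence.
static int Utf8CharLen(unsigned char b)
{
  if (b < 0x80)           return 1;  // ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                          // stray continuation / invalid
}

struct MiniUtf8Prober {
  State    state       = eDetecting;
  unsigned numOfMBChar = 0;  // plays the role of mNumOfMBChar
  int      pending     = 0;  // continuation bytes still expected
  int      curLen      = 0;  // length of the character in progress

  float GetConfidence() const {
    if (numOfMBChar >= 6) return 0.99f;
    float unlike = 0.99f;
    for (unsigned i = 0; i < numOfMBChar; i++)
      unlike *= 0.5f;
    return 1.0f - unlike;
  }

  State HandleData(const char* buf, size_t len) {
    for (size_t i = 0; i < len && state == eDetecting; i++) {
      unsigned char b = (unsigned char)buf[i];
      if (pending == 0) {               // expecting a lead byte
        curLen = Utf8CharLen(b);
        if (curLen == 0) { state = eNotMe; break; }  // rule violation
        pending = curLen - 1;
      } else if ((b & 0xC0) == 0x80) {  // valid continuation byte
        pending--;
      } else {                          // truncated/broken sequence
        state = eNotMe;
        break;
      }
      if (pending == 0 && curLen >= 2)  // analogue of the eStart case
        numOfMBChar++;
    }
    // Shortcut from the end of HandleData(), with stand-in constants.
    if (state == eDetecting &&
        numOfMBChar > 256 /* ENOUGH_CHAR_THRESHOLD */ &&
        GetConfidence() > 0.95f /* SHORTCUT_THRESHOLD */)
      state = eFoundIt;
    return state;
  }
};

int main()
{
  MiniUtf8Prober p;
  const char* text = "caf\xC3\xA9 \xE6\xBC\xA2\xE5\xAD\x97";  // "café 漢字"
  p.HandleData(text, strlen(text));
  // 3 multi-byte chars seen: too few for the shortcut, so the prober
  // stays in eDetecting and the caller compares confidences instead.
  printf("MB chars: %u, confidence: %.4f\n", p.numOfMBChar, p.GetConfidence());
  return 0;
}
```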
2. Confidence Calculation (GetConfidence)
Core calculation logic:
```cpp
#define ONE_CHAR_PROB (float)0.50

float nsUTF8Prober::GetConfidence(int candidate)
{
  // Start from a 99% prior probability that the text is NOT UTF-8.
  float unlike = (float)0.99;

  if (mNumOfMBChar < 6)  // Fewer than 6 multi-byte characters seen
  {
    // Each valid multi-byte character halves the probability that the
    // matches are coincidental: unlike = 0.99 * (0.5)^N
    for (PRUint32 i = 0; i < mNumOfMBChar; i++)
      unlike *= ONE_CHAR_PROB;
    // Confidence = 1 - probability of coincidence
    return (float)1.0 - unlike;
  }
  else  // 6 or more multi-byte characters
  {
    return (float)0.99;  // High-confidence cap
  }
}
```
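The function therefore implements the following closed form, where N = mNumOfMBChar:

    Confidence(N) = 1 - 0.99 × (0.5)^N   (N < 6)
    Confidence(N) = 0.99                 (N ≥ 6)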
3. Confidence Calculation Methodology
The algorithm uses a statistical significance heuristic:
- Low-Confidence Mode (< 6 MB characters):
  - Models the probability that N valid UTF-8 multi-byte sequences appear coincidentally in non-UTF-8 text as 0.99 × (0.5)^N, starting from a 99% prior that the text is not UTF-8.
  - ONE_CHAR_PROB = 0.5 is an empirical estimate of how likely a random byte sequence is to accidentally satisfy the UTF-8 rules.
  - Confidence = 1 - 0.99 × (0.5)^N
  - Examples (reproduced by the snippet after this list):
    - 0 MB chars: 1% confidence
    - 1 MB char: 50.5% confidence
    - 3 MB chars: ≈87.6% confidence
    - 5 MB chars: ≈96.9% confidence
- High-Confidence Mode (≥ 6 MB characters):
  - Returns a fixed 99% confidence
  - Based on the empirical observation that six valid multi-byte sequences make UTF-8 detection near certain
  - Minimizes false positives while keeping the computation trivial
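The example values above can be checked with a few lines of standalone C++ (a hypothetical re-implementation of the formula for illustration, not uchardet's API):

```cpp
#include <cstdio>

// Pure-function version of the confidence formula, where n plays the
// role of mNumOfMBChar.
static float Confidence(unsigned n)
{
  if (n >= 6) return 0.99f;  // high-confidence cap
  float unlike = 0.99f;      // 99% prior: "not UTF-8"
  for (unsigned i = 0; i < n; i++)
    unlike *= 0.50f;         // ONE_CHAR_PROB per multi-byte character
  return 1.0f - unlike;
}

int main()
{
  for (unsigned n = 0; n <= 6; n++)
    printf("N = %u -> confidence = %.6f\n", n, Confidence(n));
  return 0;
}
// Prints (up to float rounding): 0.010000, 0.505000, 0.752500,
// 0.876250, 0.938125, 0.969062, 0.990000
```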
4. Key Characteristics
| Aspect | Description |
|---|---|
| Detection Basis | Multi-byte character count (mNumOfMBChar) |
| Calculation Approach | Statistical model of coincidental matches |
| Probability Constant | Empirical value (ONE_CHAR_PROB = 0.5) |
| Threshold | 6 multi-byte characters |
| Strengths | Simple computation, fast rejection of invalid sequences |
| Detection Philosophy | Disproves the "not UTF-8" hypothesis through rule validation |
5. Practical Implications
- Short-text sensitivity: confidence builds slowly with the multi-byte character count, so very short inputs remain inconclusive
- Language dependence: most effective for languages whose text requires frequent multi-byte characters; pure-ASCII input never raises confidence
- Error sensitivity: a single invalid byte sequence stops confidence from building
- Performance tradeoff: the thresholds balance accuracy against processing time
This confidence model exemplifies uchardet's practical approach: statistically informed heuristics that achieve efficient encoding detection without complex probabilistic modeling. The 0.5 per-character probability constant and the 6-character threshold are carefully balanced empirical values refined through real-world testing.