NaN Propagation Causes SIGSEGV in Autodetect Process
Summary
The autodetect process crashes with SIGSEGV (signal 11) due to NaN values propagating through probability calculations, corrupting model weights, and leading to unsafe access of an empty accumulator. The root cause is extreme metric values (particularly 2^64 overflow values) causing mathematical overflow in variance/statistical calculations.
Error Chain (Chronological Order)
- 14:10:56 - NaN Probability Bounds Error (COneOfNPrior.cc:960): Bad probability bounds = [-nan, -nan]
- 14:11:03 - Update Failed (COneOfNPrior.cc:464): Update failed with NaN log weights: Update failed ( -2.596754686508612e+00 -nan -2.079860261157839e+01 )
- 14:11:46 - Bad Category (CXMeansOnline1d.cc:468): NaN scale parameter in normal distribution: Bad category = (0.499001, 1.84467e+19, 0): Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is -nan, but must be > 0 !
- 14:11:46 - Invalid Split (CXMeansOnline1d.cc:530): Expected 2-split: [] (empty candidate)
- 14:17:26 - Process Crash (Signal 11 SIGSEGV): Child process terminated
Root Cause Analysis
Issue 1: NaN Propagation in Probability Calculation
Location: lib/maths/common/COneOfNPrior.cc:893-977
Problem:
- Function probabilityOfLessLikelySamples() accumulates probability bounds from multiple models
- If model.probabilityOfLessLikelySamples() returns true but with NaN values in modelLowerBound or modelUpperBound, these NaN values propagate through accumulation (lines 953-954)
- Error is logged at line 960 but NaN values are only corrected AFTER the error log (lines 964-969)
- The NaN values propagate to other parts of the system
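A guard of the kind this path seems to need could look roughly like the following. It is a minimal sketch, not the ml-cpp implementation: SModelBounds and accumulateBounds are hypothetical names, and the weighted sum only stands in for the accumulation at lines 953-954.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for one model's contribution: its weight and
// the probability bounds it reported.
struct SModelBounds {
    double weight;
    double lower;
    double upper;
};

// Accumulate weighted bounds, refusing to fold in any non-finite value.
// Returns false and leaves the outputs untouched as soon as a bad value
// is seen, so NaN never reaches the running totals.
bool accumulateBounds(const std::vector<SModelBounds>& models,
                      double& lowerBound,
                      double& upperBound) {
    double lower = 0.0;
    double upper = 0.0;
    for (const auto& model : models) {
        if (!std::isfinite(model.weight) || !std::isfinite(model.lower) ||
            !std::isfinite(model.upper)) {
            return false;
        }
        lower += model.weight * model.lower;
        upper += model.weight * model.upper;
    }
    lowerBound = lower;
    upperBound = upper;
    return true;
}
```

Writing to the outputs only after the loop completes means a caller that ignores the return value still never sees a partially accumulated NaN.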
Code Flow:
Line 942: model.probabilityOfLessLikelySamples() returns true with NaN values
Line 953-954: NaN values accumulated into lowerBound/upperBound
Line 960: ERROR logged: "Bad probability bounds = [-nan, -nan]"
Line 964-969: NaN corrected to default values (too late)
Issue 2: NaN Propagation to Model Weights
Location: lib/maths/common/COneOfNPrior.cc:463-468
Problem:
- After probability calculation returns with NaN values, the model weights become corrupted
- badWeights() checks if log weights are finite (line 1147)
- When NaN values are present, badWeights() returns true
- Error logged: "Update failed" with debug output showing NaN weights
- Model is reset to non-informative state (line 467), but damage already done
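For illustration, a finiteness check of the kind described for badWeights() can be written as below. The helper name hasBadLogWeights is hypothetical and the body is an assumption about the behaviour, not a copy of the code at line 1147.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper mirroring the check described above: the weights
// are "bad" if any log weight is NaN or infinite.
bool hasBadLogWeights(const std::vector<double>& logWeights) {
    return std::any_of(logWeights.begin(), logWeights.end(),
                       [](double w) { return !std::isfinite(w); });
}
```

Applied to the three log weights from the "Update failed" message, the middle -nan entry alone is enough to make the check return true.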
Issue 3: NaN Propagation to Clustering Algorithm
Location: lib/maths/common/CXMeansOnline1d.cc:430-468
Problem:
- NaN values propagate to category variance calculations
- CBasicStatistics::maximumLikelihoodVariance(category) returns NaN
- sigma = std::sqrt(NaN) → NaN
- boost::math::normal_distribution<double>(m, NaN) throws exception
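One way to keep a bad variance away from the distribution constructor is to validate the scale first. The sketch below uses only the standard library; computeScale is a hypothetical helper, and where it would sit relative to lines 430-441 is an assumption.

```cpp
#include <cmath>

// Hypothetical guard: derive a standard deviation from a variance and
// report whether it can parameterise a normal distribution. The scale
// must be finite and strictly positive.
bool computeScale(double variance, double& sigma) {
    double candidate = std::sqrt(variance); // NaN or negative input gives NaN
    if (!std::isfinite(candidate) || candidate <= 0.0) {
        return false;
    }
    sigma = candidate;
    return true;
}
```

A caller would construct the distribution only when computeScale returns true and otherwise skip the category, so the exception path is never entered.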
Code Flow:
Line 430: sigma = std::sqrt(CBasicStatistics::maximumLikelihoodVariance(category));
Line 441: boost::math::normal normal(m, sigma); // sigma is NaN → exception
Line 468: ERROR logged: "Bad category = (0.499001, 1.84467e+19, 0): Error..."
Issue 4: Corrupted State Causes Invalid Split
Location: lib/maths/common/CXMeansOnline1d.cc:525-530
Problem:
- After exception handling, algorithm state is corrupted
- Split candidate calculation fails, returns empty vector
- Error: "Expected 2-split: []"
Issue 5: Unsafe Access to Empty Accumulator → SIGSEGV
Location: lib/maths/common/COneOfNPrior.cc:972 (now 1001 after fixes)
Problem:
- Code accesses tail_[0].second without checking if tail_ accumulator is empty
- TDoubleTailPrMaxAccumulator (which is COrderStatisticsStack<TDoubleTailPr, 1>) can be empty if:
  - Loop never executes (e.g., logWeights is empty after corruption)
  - Loop exits early via break condition before any tail_.add() calls
  - NaN values cause early loop termination
- When count() == 0, accessing tail_[0] uses m_Statistics[m_UnusedCount + 0] where m_UnusedCount == N, causing out-of-bounds access → SIGSEGV
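The indexing problem and the guard can be shown with a simplified stand-in. CMaxAccumulator1 below is hypothetical and is not the ml-cpp COrderStatisticsStack; it only imitates the property described above (unused slots counted from the front), and int stands in for the real tail type.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical capacity-1 "max" accumulator. While it is empty,
// m_UnusedCount equals the capacity, so operator[] would index past
// the end of m_Statistics.
class CMaxAccumulator1 {
public:
    using TDoubleTailPr = std::pair<double, int>;

    void add(const TDoubleTailPr& value) {
        if (m_UnusedCount == 1) {
            m_Statistics[0] = value;
            m_UnusedCount = 0;
        } else if (value > m_Statistics[0]) {
            m_Statistics[0] = value;
        }
    }

    std::size_t count() const { return 1 - m_UnusedCount; }

    // Precondition: i < count(). Unchecked, like the access in the report.
    const TDoubleTailPr& operator[](std::size_t i) const {
        return m_Statistics[m_UnusedCount + i];
    }

private:
    std::array<TDoubleTailPr, 1> m_Statistics{};
    std::size_t m_UnusedCount = 1;
};

// Guarded read: report failure instead of touching element 0 of an
// empty accumulator.
bool readTail(const CMaxAccumulator1& tail_, int& tail) {
    if (tail_.count() == 0) {
        return false;
    }
    tail = tail_[0].second;
    return true;
}
```

Returning a status rather than a default value lets the caller decide whether an empty accumulator is an error, in line with the "Better Error Handling" item in the proposed fixes.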
Vulnerable Code:
Line 927: TDoubleTailPrMaxAccumulator tail_;
Line 928-956: Loop that may not add anything to tail_ (due to NaN/corruption)
Line 972: tail = tail_[0].second; // UNSAFE - no check for empty!
Root Cause: Extreme Values in Production Data
Analysis of production data shows:
- Max values: 18446744073709551616 = 2^64, the value that UINT64_MAX (2^64 - 1) rounds to when stored as a double (integer overflow indicators)
- Huge standard deviations: e.g., 3217383803720871424.000000 vs mean 578927774067923328.000000
- Negative values: Some partitions have negative min values (-13034, -433, etc.)
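The exact value 18446744073709551616 is what a 64-bit unsigned maximum becomes once it is converted to a double. The snippet below shows that conversion, assuming the default round-to-nearest mode. That the production values originated as UINT64_MAX is an inference from the number, not something the logs confirm, and uint64MaxAsDouble is a hypothetical name.

```cpp
#include <cstdint>
#include <limits>

// UINT64_MAX is 2^64 - 1, which needs 64 significant bits. A double
// carries 53, so the conversion rounds to the nearest representable
// value, which is exactly 2^64.
double uint64MaxAsDouble() {
    return static_cast<double>(std::numeric_limits<std::uint64_t>::max());
}
```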
These extreme values cause:
- Mathematical overflow in variance calculations → NaN
- NaN propagation through probability calculations
- Corrupted model weights
- Empty normalizedLogWeights() when all weights are NaN
- Empty accumulator → SIGSEGV
Proposed Fixes
- Early NaN Detection: Check for NaN values immediately after model.probabilityOfLessLikelySamples() returns, before accumulation (lines 942-948)
- Empty Accumulator Guard: Add check if (tail_.count() > 0) before accessing tail_[0] (line 972)
- Better Error Handling: Return false if NaN detected or accumulator is empty, rather than continuing with corrected values
- NaN Checks in Clustering: Add NaN validation before creating distribution objects in CXMeansOnline1d.cc:430-441
- Validate Split Candidates: Add check for empty candidate vector before using it (line 529)
Files Affected
- lib/maths/common/COneOfNPrior.cc (lines 893-977, especially 942-972)
- lib/maths/common/CXMeansOnline1d.cc (lines 430-441, 529-530)
Production Workaround
Add datafeed query filters to exclude invalid values:
{
  "datafeed_config": {
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "field_value": {
                "gte": -1e15,
                "lte": 1e15
              }
            }
          },
          {
            "script": {
              "script": {
                "source": "doc['field_value'].value != null && Double.isFinite(doc['field_value'].value)",
                "lang": "painless"
              }
            }
          }
        ]
      }
    }
  }
}
This filters out:
- 2^64 overflow values (1.84e19 > 1e15)
- NaN/Infinity values
- Extreme outliers causing mathematical overflow
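The same acceptance rule can be expressed as a C++ predicate, for example to sanity-check values in a test harness. This is only a sketch: isAcceptableValue and MAX_ABS_VALUE are hypothetical names, and the 1e15 bound is copied from the query above rather than derived from the model code.

```cpp
#include <cmath>

// Hypothetical bound, copied from the datafeed range filter.
constexpr double MAX_ABS_VALUE = 1e15;

// Accept only finite values whose magnitude is within the bound,
// matching the combined range and script filters.
bool isAcceptableValue(double value) {
    return std::isfinite(value) && std::fabs(value) <= MAX_ABS_VALUE;
}
```

With this rule the 2^64 maxima are rejected while the negative minima seen in some partitions (-13034, -433) still pass.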
Impact
- Severity: Critical (SIGSEGV crash)
- Frequency: Occurs when processing data with extreme outliers (2^64 values)
- Affected: Jobs processing metric data with integer overflow values
- Workaround Available: Yes (query filters)
Additional Context
Error log excerpt:
ERROR [autodetect/230997] [[COneOfNPrior.cc]@960] Bad probability bounds = [-nan, -nan], [(-nan, 2), (-nan, 1), (-nan, 0)] -
ERROR [autodetect/230997] [[COneOfNPrior.cc]@464] Update failed ( -2.596754686508612e+00 -nan -2.079860261157839e+01 ) -
ERROR [autodetect/230997] [[CXMeansOnline1d.cc]@468] Bad category = (0.499001, 1.84467e+19, 0): Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is -nan, but must be > 0 ! -
ERROR [controller/136] [[CDetachedProcessSpawner.cc]@201] Child process with PID 230997 was terminated by signal 11 Please check system logs for more details. -