NaN Propagation Causes SIGSEGV in Autodetect Process #2874

@valeriy42

Summary

The autodetect process crashes with SIGSEGV (signal 11) due to NaN values propagating through probability calculations, corrupting model weights, and leading to unsafe access of an empty accumulator. The root cause is extreme metric values (particularly 2^64 overflow values) causing mathematical overflow in variance/statistical calculations.

Error Chain (Chronological Order)

  1. 14:10:56 - NaN Probability Bounds Error (COneOfNPrior.cc:960): Bad probability bounds = [-nan, -nan]
  2. 14:11:03 - Update Failed (COneOfNPrior.cc:464): Update failed with NaN log weights: Update failed ( -2.596754686508612e+00 -nan -2.079860261157839e+01 )
  3. 14:11:46 - Bad Category (CXMeansOnline1d.cc:468): NaN scale parameter in normal distribution: Bad category = (0.499001, 1.84467e+19, 0): Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is -nan, but must be > 0 !
  4. 14:11:46 - Invalid Split (CXMeansOnline1d.cc:530): Expected 2-split: [] (empty candidate)
  5. 14:17:26 - Process Crash (Signal 11 SIGSEGV): Child process terminated

Root Cause Analysis

Issue 1: NaN Propagation in Probability Calculation

Location: lib/maths/common/COneOfNPrior.cc:893-977

Problem:

  • Function probabilityOfLessLikelySamples() accumulates probability bounds from multiple models
  • If model.probabilityOfLessLikelySamples() returns true but with NaN values in modelLowerBound or modelUpperBound, these NaN values propagate through accumulation (lines 953-954)
  • Error is logged at line 960 but NaN values are only corrected AFTER the error log (lines 964-969)
  • By that point the NaN values can already have reached other parts of the system (see Issues 2 and 3)

Code Flow:

Line 942: model.probabilityOfLessLikelySamples() returns true with NaN values
Line 953-954: NaN values accumulated into lowerBound/upperBound  
Line 960: ERROR logged: "Bad probability bounds = [-nan, -nan]"
Line 964-969: NaN corrected to default values (too late)
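
One way to avoid the ordering problem above is to validate each model's bounds before they are accumulated. The following is a minimal sketch of that idea only. SModelBounds and accumulateBounds are hypothetical names, not ml-cpp types, and the weighted average here is a simplification of what COneOfNPrior.cc actually computes.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for one model's contribution; not an ml-cpp type.
struct SModelBounds {
    double weight;
    double lowerBound;
    double upperBound;
};

// Computes weighted-average bounds, skipping any model whose values are not
// finite. Returns false if no model contributed a usable value.
bool accumulateBounds(const std::vector<SModelBounds>& models,
                      double& lowerBound,
                      double& upperBound) {
    lowerBound = 0.0;
    upperBound = 0.0;
    double totalWeight = 0.0;
    for (const auto& model : models) {
        // Check before accumulating so a single NaN cannot poison the sums.
        if (!std::isfinite(model.weight) || !std::isfinite(model.lowerBound) ||
            !std::isfinite(model.upperBound)) {
            continue;
        }
        lowerBound += model.weight * model.lowerBound;
        upperBound += model.weight * model.upperBound;
        totalWeight += model.weight;
    }
    if (totalWeight <= 0.0) {
        return false;
    }
    lowerBound /= totalWeight;
    upperBound /= totalWeight;
    return true;
}
```

Returning false when nothing usable was accumulated lets the caller decide how to recover, instead of silently substituting default values.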

Issue 2: NaN Propagation to Model Weights

Location: lib/maths/common/COneOfNPrior.cc:463-468

Problem:

  • After probability calculation returns with NaN values, the model weights become corrupted
  • badWeights() checks if log weights are finite (line 1147)
  • When NaN values are present, badWeights() returns true
  • Error logged: "Update failed" with debug output showing NaN weights
  • Model is reset to non-informative state (line 467), but the damage is already done
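
The check-and-reset behaviour described above can be sketched as follows. hasBadWeights and resetToUniform are hypothetical helpers written for illustration. The real badWeights() in COneOfNPrior.cc may differ, and equal weights are only an assumed reading of "non-informative state".

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// True if any log weight is NaN or infinite.
bool hasBadWeights(const std::vector<double>& logWeights) {
    return std::any_of(logWeights.begin(), logWeights.end(),
                       [](double w) { return !std::isfinite(w); });
}

// Resets every log weight to log(1/n), i.e. equal weights (assumed fallback).
void resetToUniform(std::vector<double>& logWeights) {
    if (logWeights.empty()) {
        return;
    }
    const double uniform = -std::log(static_cast<double>(logWeights.size()));
    std::fill(logWeights.begin(), logWeights.end(), uniform);
}
```

Applied to the logged weights ( -2.596754686508612e+00 -nan -2.079860261157839e+01 ), hasBadWeights reports true because of the middle entry.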

Issue 3: NaN Propagation to Clustering Algorithm

Location: lib/maths/common/CXMeansOnline1d.cc:430-468

Problem:

  • NaN values propagate to category variance calculations
  • CBasicStatistics::maximumLikelihoodVariance(category) returns NaN
  • sigma = std::sqrt(NaN) → NaN
  • boost::math::normal_distribution<double>(m, NaN) throws exception

Code Flow:

Line 430: sigma = std::sqrt(CBasicStatistics::maximumLikelihoodVariance(category));
Line 441: boost::math::normal normal(m, sigma);  // sigma is NaN → exception
Line 468: ERROR logged: "Bad category = (0.499001, 1.84467e+19, 0): Error..."
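
A guard along the following lines would reject a bad variance before any distribution object is constructed. This is a sketch using only the standard library. safeSigma is a hypothetical helper, and the construction of the boost distribution is left to the caller.

```cpp
#include <cmath>

// Hypothetical helper: writes the standard deviation to sigma and returns
// true only if the variance is finite and strictly positive.
bool safeSigma(double variance, double& sigma) {
    if (!std::isfinite(variance) || variance <= 0.0) {
        return false;
    }
    sigma = std::sqrt(variance);
    return true;
}
```

A caller would construct the normal distribution only when safeSigma returns true, and otherwise skip or reset the category rather than rely on the exception path.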

Issue 4: Corrupted State Causes Invalid Split

Location: lib/maths/common/CXMeansOnline1d.cc:525-530

Problem:

  • After exception handling, algorithm state is corrupted
  • Split candidate calculation fails, returns empty vector
  • Error: "Expected 2-split: []"

Issue 5: Unsafe Access to Empty Accumulator → SIGSEGV

Location: lib/maths/common/COneOfNPrior.cc:972 (now 1001 after fixes)

Problem:

  • Code accesses tail_[0].second without checking if tail_ accumulator is empty
  • TDoubleTailPrMaxAccumulator (which is COrderStatisticsStack<TDoubleTailPr, 1>) can be empty if:
    • Loop never executes (e.g., logWeights is empty after corruption)
    • Loop exits early via break condition before any tail_.add() calls
    • NaN values cause early loop termination
  • When count() == 0, accessing tail_[0] uses m_Statistics[m_UnusedCount + 0] where m_UnusedCount == N, causing out-of-bounds access → SIGSEGV

Vulnerable Code:

Line 927: TDoubleTailPrMaxAccumulator tail_;
Line 928-956: Loop that may not add anything to tail_ (due to NaN/corruption)
Line 972: tail = tail_[0].second;  // UNSAFE - no check for empty!
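
The failure mode and the guard can be illustrated with a simplified model of a fixed-capacity accumulator. CFixedMaxStack below is a hypothetical class that only mimics the layout described above (storage filled from the back, m_UnusedCount == N when empty). It is not the COrderStatisticsStack implementation, and the (probability, tail) pair type is a stand-in.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>

// Simplified, hypothetical model of a fixed-capacity "keep the N largest"
// accumulator. Storage fills from the back of the array.
template<typename T, std::size_t N>
class CFixedMaxStack {
public:
    void add(const T& value) {
        if (m_UnusedCount > 0) {
            m_Statistics[--m_UnusedCount] = value;
            return;
        }
        // Full: replace the smallest retained value if the new one is larger.
        auto smallest = std::min_element(m_Statistics.begin(), m_Statistics.end());
        if (*smallest < value) {
            *smallest = value;
        }
    }

    std::size_t count() const { return N - m_UnusedCount; }

    // Unchecked access: with count() == 0 this indexes m_Statistics[N],
    // which is out of bounds.
    const T& operator[](std::size_t i) const {
        return m_Statistics[m_UnusedCount + i];
    }

private:
    std::array<T, N> m_Statistics{};
    std::size_t m_UnusedCount = N;
};

using TDoubleTailPr = std::pair<double, int>; // (probability, tail) stand-in

// Guarded read: fails cleanly instead of touching an empty accumulator.
bool tailOfMax(const CFixedMaxStack<TDoubleTailPr, 1>& acc, int& tail) {
    if (acc.count() == 0) {
        return false;
    }
    tail = acc[0].second;
    return true;
}
```

The count() check costs one comparison and turns undefined behaviour into an error the caller can handle.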

Root Cause: Extreme Values in Production Data

Analysis of production data shows:

  • Max values: 18446744073709551616 = 2^64, i.e. UINT64_MAX (2^64 - 1) rounded to the nearest double (integer overflow indicators)
  • Huge standard deviations: e.g., 3217383803720871424.000000 vs mean 578927774067923328.000000
  • Negative values: Some partitions have negative min values (-13034, -433, etc.)

These extreme values cause:

  1. Mathematical overflow in variance calculations → NaN (overflow itself yields ±inf; a later inf - inf or 0 × inf then yields NaN)
  2. NaN propagation through probability calculations
  3. Corrupted model weights
  4. Empty normalizedLogWeights() when all weights are NaN
  5. Empty accumulator → SIGSEGV
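
The first two steps rely on standard IEEE 754 behaviour, which the sketch below demonstrates in isolation. It shows the general mechanism only and is not the exact ml-cpp code path. naiveVariance is a hypothetical function and does not represent how CBasicStatistics computes variance.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// UINT64_MAX (2^64 - 1) is not representable as a double and rounds to 2^64.
double uint64MaxAsDouble() {
    return static_cast<double>(std::numeric_limits<std::uint64_t>::max());
}

// Naive variance E[x^2] - E[x]^2. If both moments have overflowed to +inf,
// the subtraction is inf - inf, which IEEE 754 defines as NaN.
double naiveVariance(double meanOfSquares, double mean) {
    return meanOfSquares - mean * mean;
}
```

Once a variance is NaN, std::sqrt of it is also NaN, which matches the "Scale parameter is -nan" error in the log.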

Proposed Fixes

  1. Early NaN Detection: Check for NaN values immediately after model.probabilityOfLessLikelySamples() returns, before accumulation (lines 942-948)
  2. Empty Accumulator Guard: Add check if (tail_.count() > 0) before accessing tail_[0] (line 972)
  3. Better Error Handling: Return false if NaN detected or accumulator is empty, rather than continuing with corrected values
  4. NaN Checks in Clustering: Add NaN validation before creating distribution objects in CXMeansOnline1d.cc:430-441
  5. Validate Split Candidates: Add check for empty candidate vector before using it (line 529)

Files Affected

  • lib/maths/common/COneOfNPrior.cc (lines 893-977, especially 942-972)
  • lib/maths/common/CXMeansOnline1d.cc (lines 430-441, 529-530)

Production Workaround

Add datafeed query filters to exclude invalid values:

{
  "datafeed_config": {
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "field_value": {
                "gte": -1e15,
                "lte": 1e15
              }
            }
          },
          {
            "script": {
              "script": {
                "source": "doc['field_value'].size() != 0 && Double.isFinite(doc['field_value'].value)",
                "lang": "painless"
              }
            }
          }
        ]
      }
    }
  }
}

This filters out:

  • 2^64 overflow values (1.84e19 > 1e15)
  • NaN/Infinity values
  • Extreme outliers causing mathematical overflow

Impact

  • Severity: Critical (SIGSEGV crash)
  • Frequency: Occurs when processing data with extreme outliers (2^64 values)
  • Affected: Jobs processing metric data with integer overflow values
  • Workaround Available: Yes (query filters)

Additional Context

Error log excerpt:

ERROR	[autodetect/230997] [[COneOfNPrior.cc]@960] Bad probability bounds = [-nan, -nan], [(-nan, 2), (-nan, 1), (-nan, 0)]	-
ERROR	[autodetect/230997] [[COneOfNPrior.cc]@464] Update failed ( -2.596754686508612e+00 -nan -2.079860261157839e+01 )	-
ERROR	[autodetect/230997] [[CXMeansOnline1d.cc]@468] Bad category = (0.499001, 1.84467e+19, 0): Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is -nan, but must be > 0 !	-
ERROR	[controller/136] [[CDetachedProcessSpawner.cc]@201] Child process with PID 230997 was terminated by signal 11 Please check system logs for more details.	-
