gpexpand first stage fails: 'SyncPackages' object has no attribute 'ret' (extension sync never runs; regression) #1825

@talmacschen-arch

Bug Report

Summary

During the first stage of gpexpand (initialization), gpexpand fails while "Syncing Apache Cloudberry extensions" with:

AttributeError: 'SyncPackages' object has no attribute 'ret'

The exception is squashed into a WARNING, so the expansion appears to continue, but two things go wrong:

  1. The error traceback is printed and an [ERROR] line is logged.
  2. More importantly, the package/extension synchronization to the new segment hosts silently never runs at all — even on a single-host expansion.

Version / Environment

  • Reproduced on a build based on current apache/cloudberry main.
  • The buggy code is present on main today (see Root cause). Observed on a downstream 4.5.0-rc.3 build, but the defect is in shared gpMgmt code, not downstream-only.

Steps to reproduce

gpexpand -i /tmp/expand_input.txt -v 2>&1 | tee /tmp/gpexpand_init.log

Actual output (abridged, -v)

gpexpand:...-[INFO]:-Syncing Apache Cloudberry extensions
gpexpand:...-[DEBUG]:-Starting ParallelOperation
gpexpand:...-[DEBUG]:-WorkerPool() initialized with 1 workers
gpexpand:...-[DEBUG]:-WorkerPool haltWork()
gpexpand:...-[DEBUG]:-[worker0] haltWork
gpexpand:...-[DEBUG]:-Ending ParallelOperation
gpexpand:...-[DEBUG]:-[worker0] got a halt cmd
gpexpand:...-[ERROR]:-Syncing of Apache Cloudberry extensions has failed.
Traceback (most recent call last):
  File ".../bin/gpexpand", line 1935, in sync_packages
    operation.get_ret()
  File ".../lib/python/gppylib/operations/__init__.py", line 64, in get_ret
    if isinstance(self.ret, Exception):
AttributeError: 'SyncPackages' object has no attribute 'ret'
gpexpand:...-[WARNING]:-Please run gppkg --clean after successful expansion.

Note the telltale sequence: the worker pool is created, then haltWork() is called and the worker only ever receives a halt command — there is no Starting SyncPackages / [worker0] got cmd line. The submitted SyncPackages operation is never executed.

Root cause

WorkerPool.__init__ signature (gpMgmt/bin/gppylib/commands/base.py):

def __init__(self, numWorkers=16, should_stop=False, items=None, daemonize=False, logger=...):
    ...
    if items is not None:
        for item in items:
            self.addCommand(item)

The second positional parameter is should_stop; the work items are the third parameter items.

OperationWorkerPool.__init__ (same file) passes the operations positionally:

class OperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        if operations is not None:
            for operation in operations:
                self._spoof_operation(operation)
        super(OperationWorkerPool, self).__init__(numWorkers, operations)   # <-- BUG

So operations is bound to should_stop (a truthy list), and items stays None. Consequences:

  1. items is None, so no operation is ever added to the work queue.
  2. should_stop is set to a truthy value.
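The mis-binding can be reproduced without gppylib by asking Python how it would bind the arguments. This is a minimal sketch, not the real class: worker_pool_init is a hypothetical signature-only stand-in, and logger=None replaces the default that is elided above.

```python
import inspect


def worker_pool_init(self, numWorkers=16, should_stop=False, items=None,
                     daemonize=False, logger=None):
    """Hypothetical stand-in: only the parameter order mirrors WorkerPool.__init__."""


sig = inspect.signature(worker_pool_init)
operations = ["op"]  # placeholder for a list of Operation objects

# Positional call, as in the buggy OperationWorkerPool.__init__.
positional = dict(sig.bind(None, 1, operations).arguments)

# Keyword call, as in the suggested fix.
keyword = dict(sig.bind(None, 1, items=operations).arguments)

print(positional)  # the list is bound to should_stop; items is not bound at all
print(keyword)     # the list is bound to items; should_stop keeps its default
```

Signature.bind reports only the arguments that were explicitly passed, so the absence of items in the first result corresponds to items falling back to its default of None.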

ParallelOperation.execute() then join()s an empty queue (returns immediately) and haltWork()s the pool, so the worker only ever pops the halt command. SyncPackages.run() / execute() is never invoked, so Operation.run() never assigns self.ret. (SyncPackages.__init__ does not call super().__init__(), so ret is not pre-initialized either.) Back in gpexpand.sync_packages():

for operation in operations:
    operation.get_ret()   # SyncPackages has no attribute 'ret' -> AttributeError

Because ret was never assigned, this raises AttributeError: 'SyncPackages' object has no attribute 'ret'.
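The failure mode in get_ret() can be sketched in isolation. The classes below are simplified stand-ins written for illustration, assuming only what the traceback and the description above state: run() assigns self.ret, get_ret() reads it, and SyncPackages.__init__ does not pre-initialize it. The host name sdw3 and the return value are hypothetical.

```python
class Operation:
    """Simplified stand-in for the gppylib Operation base class."""

    def run(self):
        # ret is only ever assigned here, when the operation is executed.
        try:
            self.ret = self.execute()
        except Exception as exc:
            self.ret = exc
        return self.ret

    def get_ret(self):
        # Mirrors the line shown in the traceback.
        if isinstance(self.ret, Exception):
            raise self.ret
        return self.ret


class SyncPackages(Operation):
    def __init__(self, host):
        # No super().__init__() call and no default for ret.
        self.host = host

    def execute(self):
        return "synced " + self.host


# Case 1: the operation was never run (what the buggy pool causes).
error_message = None
never_run = SyncPackages("sdw3")
try:
    never_run.get_ret()
except AttributeError as exc:
    error_message = str(exc)

# Case 2: the operation was run before its result is read.
ran = SyncPackages("sdw3")
ran.run()
result = ran.get_ret()

print(error_message)
print(result)
```

In this sketch the AttributeError is a symptom only: get_ret() behaves correctly once run() has been called, which is why the fix belongs in the pool constructor rather than in get_ret().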

This is a regression

This exact bug was already fixed once and then re-introduced by a merge:

  • cd3c88f6e1e "Fix gppkg error: 'SyncPackages' object has no attribute 'ret'." changed the call to super(OperationWorkerPool, self).__init__(numWorkers, items=operations).
  • 0f4cf8d5068 "Merge tag 'REL_16_9' into Cloudberry" reverted that line back to super(OperationWorkerPool, self).__init__(numWorkers, operations).

On current main, grep -c "items=operations" gpMgmt/bin/gppylib/commands/base.py returns 0 — the fix is gone again.

Suggested fix

Pass the operations as the keyword argument items (re-applying the lost fix):

super(OperationWorkerPool, self).__init__(numWorkers, items=operations)

This restores enqueueing of the operations and keeps should_stop=False.
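Since the line has already been reverted once by a merge, a small regression check may be worth adding alongside the fix. The sketch below uses simplified stand-in classes, not the real gppylib ones (which start worker threads and spoof operations into commands); operations_are_enqueued and both subclass names are hypothetical.

```python
class WorkerPool:
    """Simplified stand-in: same leading parameters, no threads."""

    def __init__(self, numWorkers=16, should_stop=False, items=None, daemonize=False):
        self.should_stop = should_stop
        self.work_queue = []
        if items is not None:
            for item in items:
                self.addCommand(item)

    def addCommand(self, item):
        self.work_queue.append(item)


class BuggyOperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        # Positional: operations is bound to should_stop.
        super().__init__(numWorkers, operations)


class FixedOperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        # Keyword: operations is bound to items.
        super().__init__(numWorkers, items=operations)


def operations_are_enqueued(pool_cls):
    """Return True if the pool queues its operations and leaves should_stop False."""
    operations = ["op-1", "op-2"]  # placeholders for Operation objects
    pool = pool_cls(1, operations)
    return pool.work_queue == operations and pool.should_stop is False


print(operations_are_enqueued(BuggyOperationWorkerPool))
print(operations_are_enqueued(FixedOperationWorkerPool))
```

A check of this shape against the real OperationWorkerPool would fail on the reverted line and pass once items=operations is restored, assuming the real pool exposes its queue in a way a test can inspect.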

Impact

  • gpexpand first stage prints an error/traceback even on a healthy cluster.
  • Extensions/gppkg packages are not synchronized to new segment hosts during expansion (the operation never runs), which can leave new segments missing extensions until a manual gppkg sync. The squashing of the exception into a WARNING hides this.
