Bug Report
Summary
During the first stage of gpexpand (initialization), gpexpand fails while "Syncing Apache Cloudberry extensions" with:
AttributeError: 'SyncPackages' object has no attribute 'ret'
The exception is squashed into a WARNING, so the expansion appears to continue, but two things go wrong:
- The error traceback is printed and an [ERROR] line is logged.
- More importantly, the package/extension synchronization to the new segment hosts silently never runs at all, even on a single-host expansion.
Version / Environment
- Reproduced on a build based on current apache/cloudberry main.
- The buggy code is present on main today (see Root cause). Observed on a downstream 4.5.0-rc.3 build, but the defect is in shared gpMgmt code, not downstream-only.
Steps to reproduce
gpexpand -i /tmp/expand_input.txt -v 2>&1 | tee /tmp/gpexpand_init.log
Actual output (abridged, -v)
gpexpand:...-[INFO]:-Syncing Apache Cloudberry extensions
gpexpand:...-[DEBUG]:-Starting ParallelOperation
gpexpand:...-[DEBUG]:-WorkerPool() initialized with 1 workers
gpexpand:...-[DEBUG]:-WorkerPool haltWork()
gpexpand:...-[DEBUG]:-[worker0] haltWork
gpexpand:...-[DEBUG]:-Ending ParallelOperation
gpexpand:...-[DEBUG]:-[worker0] got a halt cmd
gpexpand:...-[ERROR]:-Syncing of Apache Cloudberry extensions has failed.
Traceback (most recent call last):
File ".../bin/gpexpand", line 1935, in sync_packages
operation.get_ret()
File ".../lib/python/gppylib/operations/__init__.py", line 64, in get_ret
if isinstance(self.ret, Exception):
AttributeError: 'SyncPackages' object has no attribute 'ret'
gpexpand:...-[WARNING]:-Please run gppkg --clean after successful expansion.
Note the telltale sequence: the worker pool is created, then haltWork() is called and the worker only ever receives a halt command — there is no Starting SyncPackages / [worker0] got cmd line. The submitted SyncPackages operation is never executed.
Root cause
WorkerPool.__init__ signature (gpMgmt/bin/gppylib/commands/base.py):
def __init__(self, numWorkers=16, should_stop=False, items=None, daemonize=False, logger=...):
    ...
    if items is not None:
        for item in items:
            self.addCommand(item)
The second positional parameter is should_stop; the work items are the third parameter items.
OperationWorkerPool.__init__ (same file) passes the operations positionally:
class OperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        if operations is not None:
            for operation in operations:
                self._spoof_operation(operation)
        super(OperationWorkerPool, self).__init__(numWorkers, operations)  # <-- BUG
So operations is bound to should_stop (a truthy list), and items stays None. Consequences:
- items is None → no operation is ever added to the work queue.
- should_stop is set to a truthy value.
ParallelOperation.execute() then join()s an empty queue (returns immediately) and haltWork()s the pool, so the worker only ever pops the halt command. SyncPackages.run() / execute() is never invoked, so Operation.run() never assigns self.ret. (SyncPackages.__init__ does not call super().__init__(), so ret is not pre-initialized either.)
Back in gpexpand.sync_packages():
for operation in operations:
    operation.get_ret()  # SyncPackages has no attribute 'ret' -> AttributeError
→ AttributeError: 'SyncPackages' object has no attribute 'ret'.
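The chain of events above can be modeled in isolation. The following is a minimal sketch, not the real gppylib code: WorkerPool, Operation, and run_all are simplified stand-ins written for this report, with no threads, and they only keep the parameter order of WorkerPool.__init__ and the fact that ret is assigned only inside run().

```python
# Simplified stand-ins for illustration only; NOT the real gppylib classes.
class WorkerPool:
    def __init__(self, numWorkers=16, should_stop=False, items=None):
        self.numWorkers = numWorkers
        self.should_stop = should_stop
        self.queue = []
        if items is not None:
            for item in items:
                self.queue.append(item)


class BuggyOperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        # Positional call: operations lands in should_stop, items stays None.
        super().__init__(numWorkers, operations)


class FixedOperationWorkerPool(WorkerPool):
    def __init__(self, numWorkers=16, operations=None):
        # Keyword call: operations reaches items, should_stop keeps its default.
        super().__init__(numWorkers, items=operations)


class Operation:
    # As in the report, ret is only assigned when run() is called.
    def run(self):
        self.ret = "synced"

    def get_ret(self):
        if isinstance(self.ret, Exception):
            raise self.ret
        return self.ret


def run_all(pool):
    """Hypothetical helper: run whatever was queued, then return the pool."""
    for op in pool.queue:
        op.run()
    return pool


buggy_op = Operation()
buggy = run_all(BuggyOperationWorkerPool(1, [buggy_op]))
fixed_op = Operation()
fixed = run_all(FixedOperationWorkerPool(1, [fixed_op]))

print(len(buggy.queue), bool(buggy.should_stop))  # 0 True
print(len(fixed.queue), fixed.should_stop)        # 1 False

try:
    buggy_op.get_ret()
except AttributeError as exc:
    print("buggy pool:", exc)

print("fixed pool:", fixed_op.get_ret())  # fixed pool: synced
```

With the positional call the queue stays empty, so run() is never reached and get_ret() fails on the missing attribute. With the keyword call the same operation is queued and returns normally.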
This is a regression
This exact bug was already fixed once and then re-introduced by a merge:
- cd3c88f6e1e "Fix gppkg error: 'SyncPackages' object has no attribute 'ret'." changed the call to super(OperationWorkerPool, self).__init__(numWorkers, items=operations).
- 0f4cf8d5068 "Merge tag 'REL_16_9' into Cloudberry" reverted that line back to super(OperationWorkerPool, self).__init__(numWorkers, operations).
On current main, grep -c "items=operations" gpMgmt/bin/gppylib/commands/base.py returns 0 — the fix is gone again.
Suggested fix
Pass the operations as the keyword argument items (re-applying the lost fix):
super(OperationWorkerPool, self).__init__(numWorkers, items=operations)
This restores enqueueing of the operations and keeps should_stop=False.
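Because this line has already been lost once in a merge, a small check of the argument binding may be worth having. The sketch below is only an illustration: init_stub is a hypothetical stand-in that copies the parameter order quoted in this report (the logger parameter is omitted), and inspect.signature is used to show which parameter each call style binds to. A real test would inspect gppylib's WorkerPool.__init__ instead.

```python
import inspect


def init_stub(self, numWorkers=16, should_stop=False, items=None, daemonize=False):
    """Hypothetical stand-in with the parameter order quoted in this report."""


sig = inspect.signature(init_stub)
ops = ["op"]  # placeholder for a list of operations

# Old call style: super().__init__(numWorkers, operations)
positional = sig.bind(None, 1, ops).arguments
# Fixed call style: super().__init__(numWorkers, items=operations)
keyword = sig.bind(None, 1, items=ops).arguments

print("should_stop" in positional, "items" in positional)  # True False
print("should_stop" in keyword, "items" in keyword)        # False True
```

Only explicitly bound parameters appear in arguments, so the first line shows the operations landing in should_stop and the second shows them reaching items.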
Impact
- gpexpand first stage prints an error/traceback even on a healthy cluster.
- Extensions/gppkg packages are not synchronized to new segment hosts during expansion (the operation never runs), which can leave new segments missing extensions until a manual gppkg sync. The squashing of the exception into a WARNING hides this.