Submit draft of RFC for ForEach-Object -Parallel proposal #194


Merged (8 commits, Sep 9, 2019)
99 changes: 99 additions & 0 deletions 1-Draft/RFCnnnn-ForEach-Parallel-Cmdlet.md
---
RFC: RFCnnnn
Author: Paul Higinbotham
Status: Draft
SupercededBy: N/A
Version: 1.0
Area: Engine
Comments Due: July 18, 2019
Plan to implement: Yes
---

# PowerShell ForEach-Object -Parallel Cmdlet

This RFC proposes a new parameter set for the existing ForEach-Object cmdlet to parallelize script block executions, instead of running them sequentially as it does now.
**Comment (Contributor):**

I think it complicates the ForEach-Object cmdlet too much and limits future development opportunities.

From previous discussion #174 (comment)

From iSazonov

Yesterday I accidentally came across the GNU utility "parallel --pipe". We could also find "Where-Object -Parallel" useful, and "Select-String", and others. So this suggests that maybe we need something more general for pipeline parallelization (Parallel-Pipe):

```powershell
$hugeArray | Parallel-Pipe { Where-Object Name -eq "test" }
$hugeArray | Parallel-Pipe { Select-String -Pattern "test" }
$hugeArray | Parallel-Pipe -AsJob { ... }
```

From rjmholt

A rose by any other name would smell as sweet:
```powershell
$hugeArray | ForEach-Object -Parallel { $_ | Where-Object Name -eq "test" }
$hugeArray | ForEach-Object -Parallel { $_ | Select-String -Pattern "test" }
$hugeArray | ForEach-Object -Parallel -AsJob { ... }
```

I think in the name of single-purpose, composable cmdlets, having a single parallel cmdlet (rather than trying to parallelise each individually) is the right way to go. But I think ForEach-Object has that brief of Select-like enumeration. And just like LINQ has .AsParallel(), I think -Parallel makes sense to do this. But that's just a brief and not-strongly-held opinion :)

From iSazonov

There is a deeper problem with "ForEach-Object -Parallel": the cmdlet has a lot of parameters, and how would "-Begin/-Process/-End" and the rest work: in the "ForEach-Object -Parallel" context or in the script block context? I think we would have to duplicate parameters. In this case Parallel-Pipe is simpler and safer:

```powershell
Parallel-Pipe -InputObject $data `
    -Begin { ... } `
    -Process { ForEach-Object -Begin { ... } -Process { ... } -End { ... } } `
    -End { ... }
```

**Comment (Contributor Author):**

The -Begin, -Process, and -End parameters will not be part of the new parameter set, since they don't make sense in this parallel case.

**Comment from @iSazonov (Jun 20, 2019):**

There is a meaning, especially in binary cmdlets; they should work just like in ForEach-Object (why do we have the blocks there otherwise?). Rather, the difficulty is in the implementation.
The current design looks like a workaround: it is simple to implement but preserves all the current limitations (see @alx9r's comments below). My suggestion is to implement natively parallel pipelines, which would give significant flexibility, performance, and extensibility.

**Comment:**

I think it does make sense to keep the

```
before any        { do this }
{ do this for each one }
after all of them { do this }
```

structure that exists in the sequential ForEach-Object.

**Comment (Contributor):**

If we're removing -Begin, -Process, and -End then why are you proposing to overload ForEach-Object instead of simply adding a new command? There's literally nothing else in this command, so you're really creating an entirely new command!

ForEach-Object is currently one of the worst performing cmdlets in PowerShell (compare `$power = 1..1e5 | foreach { $_ * 2 }` to `$power = 1..1e5 | & { process { $_ * 2 } }`), and although I doubt this change will make it slower, it will make it more complicated, reducing the possibility of ever fixing it 🤨
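The performance gap the comment describes can be measured directly. A minimal benchmark sketch; absolute numbers are machine-dependent, but the steppable-pipeline form is typically several times faster because it avoids per-item cmdlet invocation overhead:

```powershell
# Compare per-item ForEach-Object invocation against a bare process block.
$foreachTime = Measure-Command { 1..1e5 | ForEach-Object { $_ * 2 } }
$processTime = Measure-Command { 1..1e5 | & { process { $_ * 2 } } }
"ForEach-Object : {0:N0} ms" -f $foreachTime.TotalMilliseconds
"process block  : {0:N0} ms" -f $processTime.TotalMilliseconds
```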


## Motivation

As a PowerShell User,
I can do simple fan-out concurrency with the PowerShell ForEach-Object cmdlet, without having to obtain and load a separate module, or deal with PowerShell jobs unless I want to.
**Comment from @alx9r (Jun 19, 2019):**

I'm wondering what is meant by "simple" here. It seems to me that "simple" from the user's perspective means something like "works as I'd expect in common use cases". Vaguely speaking, I would suggest that such behavior would include the following:

  1. Speedup is achieved to the degree that the workload in the scriptblock allows.
  2. Scriptblocks can safely be passed as variables to the command.

I'm highlighting these behaviors in particular because they seem like obvious goals yet require overcoming substantial underlying implementation challenges to achieve. In particular,

  1. "Runspace.Open() sometimes completes without the specified ImportPSModules becoming available" (PowerShell#7377); #7153, #7131, and #7035 all relate to various roadblocks to opening, resetting, and reusing runspaces in parallel. The summary is that each runspace will need to load the modules it needs, and there currently does not seem to be a way to load modules in parallel in a manner that results in speedup.
  2. "Concurrently invoking a ScriptBlock in two runspaces causes InvalidOperationException and pwsh.exe termination" (PowerShell#7626) demonstrates that PowerShell can currently crash (with, I think, possibly undefined behavior) when scriptblocks are passed amongst runspaces.

I think overcoming these challenges would be great for PowerShell, but I don't think they should be underestimated.

**Comment from @iSazonov (Jun 20, 2019):**

Yes, scriptblocks have runspace affinity, and we need to resolve this before we can go any further. Perhaps it will be addressed in vNext runspaces.


## Specification

There will be two new parameter sets added to the existing ForEach-Object cmdlet to support both synchronous and asynchronous operations for parallel script block execution.
For the synchronous case, the `ForEach-Object` cmdlet will not return until all parallel executions complete.
For the asynchronous case, the `ForEach-Object` cmdlet will immediately return a PowerShell job object that contains child jobs of each parallel execution.

### Implementation details
**Comment:**

Will ForEach-Object implement pipeline backpressure? If not, how will memory usage be limited for pipelines like `Get-Content giantfile.txt | ForEach-Object -Parallel {'CPU-bound operation'}`? If so, how will backpressure work?

**Comment (Member):**

I think the default ThrottleLimit would be the way to limit how many parallel jobs will run, no?

**Comment:**

> I think the default ThrottleLimit would be the way to limit how many parallel jobs will run, no?

@daxian-dbw That's what I understand. The concern I'm raising here isn't about the number of parallel jobs. Rather, it is about how much of the output from `Get-Content giantfile.txt` is read into the input buffer of `ForEach-Object -Parallel` at any one time. As I currently read the RFC, the behavior in that respect is not specified. If no backpressure is implemented, then the amount of memory that could be consumed by `ForEach-Object -Parallel` is unbounded. If backpressure is implemented, then the particulars of that behavior will affect how performant `ForEach-Object -Parallel` is under different conditions.


Implementation will be similar to the ThreadJob module.
Script block execution will be run for each piped input on a separate thread and runspace.
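As a rough illustration (not the proposed implementation itself), the fan-out pattern the RFC describes can be sketched today with the ThreadJob module's `Start-ThreadJob`; the squaring workload here is only a placeholder for real per-item work:

```powershell
# Sketch: one thread job per piped item, throttled to 10 concurrent threads.
# Requires the ThreadJob module (bundled with PowerShell 7; otherwise Install-Module ThreadJob).
$jobs = 1..20 | ForEach-Object {
    Start-ThreadJob -ThrottleLimit 10 -ScriptBlock {
        param($n)
        $n * $n    # placeholder for real per-item work
    } -ArgumentList $_
}
$results = $jobs | Wait-Job | Receive-Job
```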
**Comment:**

How will module loading for the new runspaces be managed? Will the runspaces be reused? If they will be reused, will state be reset between uses? If so, to what degree? Just by way of ResetRunspaceState()? Or something more elaborate?

Will ForEach-Object attempt to open modules in parallel? If so, is it within the scope of this RFC to resolve the contention reported in PowerShell/PowerShell#7035? If not, how will module loading occurring in the runspaces be serialized?

**Comment (Contributor Author):**

Module loading will work as normal, either explicitly loaded using 'Import-Module', or implicitly loaded via command discovery. Runspaces and threads will not be reused in the initial implementation, due to concerns of stale state causing problems, but it is something that can be looked at later.

Module loading contention is a concern but is not something that will be addressed in this work.

The number of threads that run at a time will be limited by a `-ThrottleLimit` parameter with a default value.
Piped input that exceeds the allowed number of threads will be queued until a thread is available.
For synchronous operation, a `-Timeout` parameter will be available that terminates the wait for completion after a specified time.
Without a `-Timeout` parameter, the cmdlet will wait indefinitely for completion.

### Synchronous parameter set
**Comment:**

I think it would be helpful to specify the following behavior for this parameter set:

  1. How is an exception from one but not all scriptblock invocations handled? Are the other non-throwing scriptblocks allowed to complete? Is the exception ever thrown? Is it converted to an error instead?
  2. How are exceptions from multiple scriptblock invocations handled? Are they combined into an AggregateException?
  3. How are Verbose/Warning/Error streams from scriptblock invocations handled? Are they multiplexed and passed through so they appear to the user? If so, how can the user correlate those streams by invocation? Some sort of ID prefix?
  4. Does ForEach-Object write its progress to the Progress stream?
  5. How are Progress streams from scriptblock invocations handled? Do they become child activities to a ForEach-Object parent activity?

**Comment (Contributor):**

I think in the current design it will work "as is", based on runspace affinity. I expect we will sometimes get "interesting" side effects.

**Comment (Member):**

The runspace affinity can be completely avoided in the implementation by passing the AST instead of the ScriptBlock to the other runspace, then getting a script block back via Ast.GetScriptBlock().
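A small sketch of that technique; `.Ast` and `ScriptBlockAst.GetScriptBlock()` are existing PowerShell APIs:

```powershell
# A ScriptBlock is bound to the runspace that parsed it; its AST is not.
# Recreating the script block from the AST yields a fresh, unattached
# ScriptBlock that the target runspace can safely bind to itself.
$original = { param($x) $x + 1 }
$detached = $original.Ast.GetScriptBlock()   # new ScriptBlock, no runspace affinity
& $detached 41                               # returns 42
```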

**Comment (Member):**

Those are good questions that should be answered in this RFC.

**Comment (Contributor Author):**

I agree. I'll add answers to the RFC for these questions.

**Comment:**

> The runspace affinity can be completely avoided in the implementation by passing the AST instead of the ScriptBlock to another Runspace, then getting the script block via Ast.GetScriptBlock()

I don't see how this is true in general. For example, in

```powershell
$g = Get-SomeScriptBlockGenerator

1..10 | ForEach-Object -Parallel { . $using:g.GetNextScriptBlock() }
```

would the scriptblock returned by .GetNextScriptBlock() be silently converted to its .Ast counterpart?

**Comment (Contributor Author):**

The parser will not allow the example shown above; the $using directive only supports variable expressions. And generating a new script block from the Ast will only be done for ScriptBlock using-variable types.

But it is still possible to pass in a script block variable aligned to a different runspace, e.g.:

```powershell
ForEach-Object -Parallel { $g = $using:g; $g.GetNextScriptBlock() }
```

We won't be trying to prevent all possible cases of passing a variable reference to a script block and using it unsafely. But we can document safe and unsafe ways to use passed-in variable references.


Synchronous `ForEach-Object -Parallel` returns after all script blocks complete running or the timeout is reached:

```powershell
ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSecs 1800 -ScriptBlock {}
```

- `-Parallel` : parameter switch specifies fan-out parallel script block execution

- `-ThrottleLimit` : parameter takes an integer value that determines the maximum number of threads

- `-TimeoutSecs` : parameter takes an integer that specifies the maximum time to wait for completion in seconds
**Comment:**

What happens to the executing scriptblock when the timeout is reached? Is a stop signal sent? If so, what happens if stopping takes a long time? Does ForEach-Object wait for execution of all scriptblocks to stop before it returns from EndProcessing()? Is there another timeout for stopping? What happens if that timeout is reached? Is the thread killed? Or are the runspace and associated PowerShell object merely abandoned?


### Asynchronous parameter set

Asynchronous `ForEach-Object -Parallel` immediately returns a job object for monitoring parallel script block execution:

**Comment (Contributor):**

@PaulHigin: Were you planning for this to return a single job object, with one child thread job for each thread where the script block is invoked, or were you planning for this to return one job object for each thread where the script block is invoked? I presume the latter, but since it wasn't specified I wanted to ask to make sure.

**Comment (Contributor Author):**

Yes, I intend to follow PowerShell job convention and return a single job object with a child job for each pipeline iteration.

**Comment (Contributor):**

In that case, please review the other comments on this RFC related to why having -AsJob as a parameter is a bad idea, and how dropping -AsJob allows you to support -Begin and -End without complication.

```powershell
ForEach-Object -Parallel -ThrottleLimit 5 -AsJob -ScriptBlock {}
```

- `-Parallel` : parameter switch specifies fan-out parallel script block execution

- `-ThrottleLimit` : parameter takes an integer value that determines the maximum number of threads

- `-AsJob` : parameter switch returns a job object

### Variable passing
**Comment:**

Are variables serialized and deserialized like happens when using Start-Job? Or are references to the original object instances available in the scriptblocks?

Will scriptblocks be allowed to be passed as variables? If so, is it within the scope of this RFC to resolve PowerShell/PowerShell#7626? If not, will the prohibition on passing scriptblocks be enforced or merely documented?

**Comment (Member):**

> Or are references to the original object instances available in the scriptblocks?

I believe it's this one. We are talking about runspaces in the same process, so there is no need to serialize/deserialize.
Passing a script block as a variable is not a problem. $sb.Ast can be passed to the target runspace, where we call Ast.GetScriptBlock() to get back a script block that is associated with the target runspace.

**Comment:**

> ...$sb.Ast can be passed to the target Runspace...

This suggests to me that ForEach-Object -Parallel would rely on users correctly converting every scriptblock to its .Ast prior to passing. Is that the intent? That doesn't seem like a great strategy to me because it's error-prone. I also don't think it is possible to do in general. For example, there can be modules that emit objects with scriptblock properties that are bound to a SessionState in the parent runspace. For such modules the user would not have the opportunity to intervene.

**Comment (Contributor Author):**

The current plan is to not do any serialization and to pass object copies (for value types) and references to each running scriptblock via the $using keyword, and the $_ variable for piped input. The parser already disallows assignment to a $using variable; however, one can assign a reference value to another variable and assign to that, with undefined behavior. I tried to call this out in the RFC as unsupported.

**Comment (Contributor):**

No, the cmdlet can pull the AST off the scriptblock internally, and then pass that along.

**Comment (Contributor Author):**

Yes, for scriptblock variables. I was talking about variable types in general.


ForEach-Object -Parallel will support the PowerShell `$_` current piped item variable within each script block.
It will also support the `$using:` directive for passing variables from script scope into the parallel executed script block scope.

### Examples

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logs = $computerNames | ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSecs 1800 -ScriptBlock {
    Get-Logs -ComputerName $_
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$job = ForEach-Object -Parallel -ThrottleLimit 10 -InputObject $computerNames -AsJob -ScriptBlock {
    Get-Logs -ComputerName $_
}
$logs = $job | Wait-Job | Receive-Job
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL'
$logs = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock {
    Get-Logs -ComputerName $_ -LogNames $using:logNames
}
```

## Alternate Proposals and Considerations
**Comment from @KirkMunro (Jun 25, 2019):**

What if instead of looking at ForEach-Object for parallelism we scaled this out to allow parallelism to be used with any command?

For example, if we...

- ...had a new CmdletBinding attribute called SupportsParallelism.
- ...added -Parallel, -ThrottleLimit, -TimeoutSecs, and -AsJob common parameters to any advanced function or cmdlet that has SupportsParallelism set to $true in the CmdletBinding attribute (supporting the different common parameter sets that are proposed above, such that -AsJob could not be used with -TimeoutSecs).
- ...updated the $using: variable prefix so that it can be used in any process block inside of a command that has SupportsParallelism set to $true. (Note: This would also need to allow $using: to work when the command is invoked without parallelism, so that the functionality would only need to be written once; the $using: prefix would simply identify a variable that is pulled into the parallel runspace/job when the command is invoked in parallel. The $using: prefix should probably not be required for parameters, though; that should be implicit, and maybe $using could be implicit for variables defined in the function/command scope. Again, that makes things much easier, and users would only need to worry about $using: in commands like ForEach-Object that accept script blocks.)

Then we could...

- ...add the SupportsParallelism attribute to ForEach-Object invocations to get the functionality we're talking about here.
- ...add the SupportsParallelism attribute to any other command that is written to support parallelism.
- ...simplify command invocation a great deal when you want parallelism, which would keep ForEach-Object use to where it is really needed.
- ...maintain support for begin and end blocks (which are still useful; even with parallelism, some resources such as connections, credentials, or collections to accumulate results may be needed and shared among the different threads/jobs).
- ...discover commands that support parallelism (and therefore those that do not).
- ...make multitasking much more of a "thing" in PowerShell.

Taking the examples provided above, they could be simplified as follows:

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL'

# Get logs from computers in parallel, throttled with a timeout
$logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -TimeoutSecs 1800

# Get logs from computers in parallel using background jobs
$logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -AsJob | Receive-Job -Wait

# Get System and SQL logs from computers in parallel with no throttle or timeout
$logs = Get-Logs -ComputerName $computerNames -LogNames $logNames -Parallel
```

For folks who are going to look at this and complain about the lack of downlevel support in this functionality, I'd like to draw your attention to this RFC. That won't help projects that require a downlevel version, but PowerShell needs to start offering a lot more motivation for users to upgrade to newer releases, and this would be some pretty great motivation.

**Comment (Contributor Author):**

An interesting idea, though it would be a lot of work to cover all built-in cmdlets. A variation is to build on @BrucePay's work with the & operator to force execution as a background job:

```powershell
Get-Process -Id $pid &

Id     Name            PSJobTypeName   State         HasMoreData     Location             Command
--     ----            -------------   -----         -----------     --------             -------
5      Job5            BackgroundJob   Running       True            localhost            Microsoft.PowerShell.Man…
```

Maybe this can be extended to optionally fan out using ThreadJobs.

This is outside the scope of this RFC, but is something we could look into in the future.

**Comment from @KirkMunro (Jun 25, 2019):**

You wouldn't have to cover all built-in cmdlets from the start. For the first cut you could just lay the foundation and update ForEach-Object to use SupportsParallelism (or just call that SupportsParallel for short and to match the -Parallel common parameter name). The other built-in cmdlets could be updated over time. That keeps focus on what this RFC is trying to do (add parallelism to ForEach-Object) while doing so in such a way that it can scale to any cmdlet/advanced function.

The initial work shouldn't be that much greater than what would be required to implement what was originally proposed in this RFC (a specific modification of ForEach-Object), and that should be in scope in this RFC, at least listed in the alternate proposals and considerations section. The [CmdletBinding(SupportsParallelism)] approach feels like a much better strategy to achieve the desired end goal for all of PowerShell than simply updating ForEach-Object.

**Comment from @KirkMunro (Jun 25, 2019):**

Also regarding the & background operator and thread jobs comment, see PowerShell Issue #9873.

**Comment (Contributor):**

Are you thinking of this in the same simple sense as SupportsShouldProcess, where all you're doing is opting in to a few common parameters and magic variables which you may (or may not, actually) use properly?

**Comment (Contributor):**

I've already started a separate RFC for this to see where that leads. Thanks for sharing these thoughts @Jaykul, I had some of them already, but not quite all. Automatic implementation is the goal, for consistency and performance, but I'm still thinking it would be bound to specific named parameter sets, or all parameter sets if you don't name them. Part of what I hope happens here is to "corral the troops" so to speak, and bring a consistent implementation to the table that commands can adopt.

**Comment from @oising (Jun 26, 2019):**

What about adding a new processasync/parallelprocess block in cmdlets/functions, mutually exclusive with process?

```powershell
function foo {
    begin { }
    processasync { }
    end { }
}
```

It's so easy to write that out. Additional control might be handled by extra properties in [CmdletBinding()], such as ThrottleLimit etc., which could also be overridden on a per-invocation basis by adding ThrottleLimit etc. as common parameters.

Or maybe, even:

```powershell
function foo {
    [CmdletBinding(AsyncProcess=$true)]
    begin { }
    process { }
    end { }
}
```

Perhaps this is a little easier to do and renders any function parallelizable (thar be dragons always, but I digress). If AsyncProcess is true, then it lights up the parallel/job common parameters (ThrottleLimit etc.).

I'm not tied to "async" versus "parallel" btw; perhaps the latter is a better choice to imply "jobs" and not "task".

**Comment from @KirkMunro (Jun 26, 2019):**

@oising The last thing you suggested (CmdletBinding configuration) is exactly what started this part of the discussion, albeit with different terminology ([CmdletBinding(SupportsParallel=...)]). I'm also considering processasync and the corresponding ProcessRecordAsync while I work through the complementary RFC.

See: #194 (comment)

**Comment:**

Cool. I'm leaning towards the idea of just leaving the process keyword alone (e.g. no new keywords like processasync), but in the .NET API, surface ProcessRecordAsync, and have them materialize as threadjobs in the console.

**Comment (Contributor):**

Here's the complementary RFC I wrote up today: #206.


Another option (and a previous RFC proposal) is to resurrect the Windows PowerShell workflow script `foreach -parallel` keyword to be used in normal PowerShell script to perform parallel execution of foreach loop iterations.
However, the majority of the community felt it would be more useful to update the existing ForEach-Object cmdlet with a -Parallel parameter set.
We may want to eventually implement both solutions.
But the ForEach-Object -Parallel proposal in this RFC should be implemented first since it is currently the most popular.