Add RFC for foreach-parallel feature #174

Closed
110 changes: 110 additions & 0 deletions 1-Draft/RFCnnnn-ForEach-Parallel-Implementation.md
@@ -0,0 +1,110 @@
---
RFC: RFCnnnn
Author: Paul Higinbotham
Status: Draft
SupercededBy: N/A
Version: 1.0
Area: Engine
Comments Due: July 1, 2019
Plan to implement: Yes
---

# Implement PowerShell language foreach -parallel

Windows PowerShell currently supports the foreach language keyword with the -parallel switch flag, but only for workflow scripts.

```powershell

workflow wf1 {
    $list = 1..5
    foreach -parallel -throttlelimit 5 ($item in $list) {
        Start-Sleep -Seconds 1
        Write-Output "Output $item"
    }
}

```

This will run the script block with each value in the `$list` array, in parallel using workflow jobs.
However, workflow is not supported in PowerShell Core 6, partly because it is a Windows-only solution but also because it is cumbersome to use.
In addition, the workflow implementation is very heavyweight and uses a lot of system resources.

This is a proposal to re-implement `foreach -parallel` in PowerShell Core, using PowerShell's support for concurrency via Runspaces.
It is similar to the [ThreadJob module](https://www.powershellgallery.com/packages/ThreadJob/1.1.2) except that it becomes part of the PowerShell language via `foreach -parallel`.

Regardless of how implemented (as a cmdlet or a foreach extension), I seriously hope you will make this compatible with psexec. When I filed an issue against threadjob noting that it broke psexec, psexec was blamed. I doubt highly that sysinternals/Mark Russinovich would agree with that.

## Motivation

As a PowerShell User,
I can do simple fan-out concurrency from within the language, without having to obtain and load a separate module or deal with PowerShell jobs.

## Specification

The PowerShell `foreach -parallel` language keyword will be re-implemented to invoke script blocks in parallel, similar to how it works for workflow functions, except that script blocks will be invoked on threads within the same process rather than in workflow jobs running in separate processes.
The default behavior is to fan out script block execution to multiple threads, and then wait for all threads to finish.
However, an `-asjob` switch will also be supported that returns a PowerShell job object for asynchronous use.
If the number of foreach iterations exceeds the throttle limit value, then only the throttle limit number of threads are created at a time and the rest are queued until a running thread becomes available.
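
The throttled fan-out described above can be approximated today with a runspace pool.
The following is a minimal sketch only, not the proposed implementation: the variable names (`$list`, `$throttleLimit`, `$work`) are illustrative assumptions, and the pool's maximum size stands in for `-throttlelimit`.

```powershell
# Hypothetical inputs for illustration; not part of the RFC.
$list = 1..5
$throttleLimit = 2

# The script block each iteration runs; it only sees the value passed to it.
$work = {
    param($item)
    Start-Sleep -Milliseconds 100
    "Output $item"
}

# The pool caps how many script blocks run at once; extra work waits in its queue.
$pool = [runspacefactory]::CreateRunspacePool(1, $throttleLimit)
$pool.Open()

# Start one PowerShell instance per item, passing only the iteration value.
$tasks = foreach ($item in $list) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    $null = $ps.AddScript($work.ToString()).AddArgument($item)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# Wait for every iteration to finish and collect its output.
$results = foreach ($task in $tasks) {
    $task.PowerShell.EndInvoke($task.Handle)
    $task.PowerShell.Dispose()
}

$pool.Close()
$pool.Dispose()
```

With a pool maximum of 2, at most two of the five script blocks run at the same time, which mirrors the queuing behavior the specification describes.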

### Supported foreach parameters

- `-parallel`
- `-throttlelimit`
- `-timeout`
- `-asjob`

### P0 Features

- `foreach -parallel` fans out script block execution to threads, along with a bound single foreach iteration value

- `-throttlelimit` parameter value specifies the maximum number of threads that can run at one time

- `-timeout` parameter value specifies a maximum time to wait for all iterations to complete, after which 'stop' will be called on all running script blocks to terminate execution

- `-asjob` switch causes foreach to return a PowerShell job object that is used to asynchronously monitor execution

- When a job object is returned, it will be compatible with all relevant job cmdlets

- All script blocks running in parallel will run isolated from each other.
Only foreach iteration objects will be passed to the parallel script block.
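
As a rough illustration of the `-timeout` item above, the sketch below waits a bounded time for one asynchronously invoked script block and then stops it.
It is an assumption about how the behavior could map onto the existing `System.Management.Automation.PowerShell` API, not the RFC's implementation; `$timeoutSeconds` is a hypothetical stand-in for the `-timeout` value.

```powershell
# Hypothetical timeout value for illustration; not part of the RFC.
$timeoutSeconds = 1

# A script that would run far longer than the timeout allows.
$ps = [powershell]::Create()
$null = $ps.AddScript('Start-Sleep -Seconds 30; "finished"')
$handle = $ps.BeginInvoke()

# Block for at most the timeout; WaitOne returns $false if the time elapses first.
$completed = $handle.AsyncWaitHandle.WaitOne($timeoutSeconds * 1000)
if (-not $completed) {
    # Ask the running script block to stop, as the -timeout description suggests.
    $ps.Stop()
}

$state = $ps.InvocationStateInfo.State
$ps.Dispose()
```

A full implementation would apply one deadline across all running iterations rather than to a single script block.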

### Data stream handling

`foreach -parallel` will use normal PowerShell pipes to return various data streams.
Data will be returned in the order received, except when the `-asjob` switch is used, in which case a single job object is returned.
The returned job object will contain an array of child jobs that represent each iteration of the foreach.

### Examples

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logs = foreach -parallel -throttlelimit 10 -timeout 300 ($computer in $computerNames)
{
    Get-Logs -ComputerName $computer
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$job = foreach -parallel -asjob ($computer in $computerNames)
{
    Get-Logs -ComputerName $computer
}
$logs = $job | Wait-Job | Receive-Job
```

```powershell
# Each element is a hashtable of arguments that is splatted to the script.
$params = @(
    @{ argTitle = "Title1"; argValue = 102 }
)
foreach -parallel ($param in $params)
{
    c:\scripts\ToRun.ps1 @param
}
```

## Alternate Proposals and Considerations

One alternative is to create a `ForEach-Parallel` cmdlet instead of re-implementing the `foreach -parallel` keyword.
This would work well but would not be as useful as making it part of the PowerShell language.

I think this is maybe worth discussing more, just because the drawbacks aren't as apparent to me. What's the key limitation here?

The drawbacks I see with foreach () -Parallel are:

  • New syntax means scripts are syntactically backwards incompatible -- they cannot even be successfully parsed by an older PowerShell version. Compare with:
    $foreachParams = @{}
    if ($PSVersionTable.PSVersion.Major -ge 7) { $foreachParams += @{ Parallel = $true } }
    $workItems | ForEach-Object @foreachParams { Invoke-Processing $_ }
  • Assigning from a foreach-loop seems like a relatively unintuitive construction and a bit syntactically off. I know it's already a functionality we support and I think it makes sense in the language, but minting it as the primary syntax for parallelism seems to run a bit against the natural style of PowerShell to me

I don't believe that ForEach-Object was in scope of this RFC. If we get this implemented, I can see that becoming part of the conversation.

I also don't believe that we would be able to splat to the foreach operator.

Contributor

Maybe not, but I agree with @rjmholt here; it makes more syntactic sense to put this into the cmdlet instead. ForEach-Object -Parallel -AsJob {} makes a LOT more sense than foreach -Parallel -AsJob ($a in $b) {} visually, and with only a handful of exceptions the majority of language keywords don't have parameters like that.

Additionally, having this available for pipeline cmdlets I think would be significantly more valuable than just a foreach loop, which can't be used in a pipeline context.

I agree with @vexx32 and I think it's a bad practice to modify foreach in this way.
I'm against a "Foreach VS Foreach-Object VS Magic Foreach VS ForEach-Parallel"
I vote for "ForEach-Object -Parallel -AsJob {}"

Contributor Author

I am perfectly fine implementing this as a cmdlet rather than a foreach language keyword extension. In fact it is much easier to implement as a cmdlet. If the community prefers a cmdlet (as it seems from these comments) then I am happy to update this RFC accordingly. But I'll let the PowerShell committee weigh in as well.

Put like that :-) ... there is a case for both, I think the case is stronger for the cmdlet;
When you said

> To my understanding, the foreach -parallel keyword is only intended to run in parallel for the duration of the loop. It's not supposed to keep running while the rest of the script executes, it simply runs each iteration of the loop in parallel, and waits until all the iterations complete before continuing the script.

I was saying, "Yes and that's why the keyword is less good"
If your script goes
Get items ; do something to each item in its own thread; format output.
Then the keyword approach can't start any threads until it has all the items and won't output anything until all threads have completed. But if they are in a pipeline, the threads will be started as the items are fetched, and the output can happen as the threads end. That overlapping of commands in a pipeline makes a big perf difference if the commands on either side of the parallelized one are slow.

Contributor

@rjmholt rjmholt Jun 11, 2019

I feel like I am missing something here. We actually want a parse error, correct, so if someone tries to run this on down level Powershell that the whole script doesn't run.

Some scripts and modules need to run on versions from PS 7 down to PS 3 (or even PS 2 in the case of Pester), and not just on Windows.

An example is the PowerShellEditorServices (backend to the PowerShell extension for VSCode) startup script; it must run on everything from PS 3 in Windows Server 2012 (in some cases 2008) to PS 7 on Ubuntu 18.04. We took it out, but for example that script used to call chmod on *nix and Set-Acl on Windows. Imagine if you couldn't wrap that in an if, but it was a parse error.

Keywords that don't exist in any of those versions can't be used anywhere in that script. We'd have to write a whole new script (slowing down startup, increasing the download size, duplicating the code). Whereas a command parameter can be added to a splat conditionally. PS 7 users would get the parallel speedup, but it still works in PS 3.

Another example is the Install-VSCode.ps1 script. It wants to be fast, so in Windows PowerShell it uses Start-BitsTransfer since that's available. If that resulted in a parse error, you wouldn't be able to do that. We'd have to either publish two scripts, or settle for Invoke-WebRequest (which was sped up considerably in PS 6 btw :)).

You can already prevent downlevel running at parse time with #requires -Version 7.0. But as someone who maintains several complicated scripts that must work all the way downlevel, I'd like the ability to leverage PowerShell's dynamism to get the best everywhere.

That startup script is one fugly piece of scripting. :-)

But I think you miss @jhoneill's point - a down-level script engine (e.g., PS v3) should generate an error if not gated by a PSVersion check. "foreach -parallel" does not pass execution time checking on PS v3, even though a proper AST is generated - up until the "-parallel" parameter.

But if the code is protected by a PSVersion check, all is OK. I wouldn't think that should change - and I don't think that @jhoneill is suggesting otherwise.

I was missing that PS v3 was handling this differently. When I checked it on PS v5.1, I got this message when calling foreach -parallel:

the '-parallel' parameter can be used only within a workflow. 

And that is exactly what I expected to happen on 5.1. Because it was a parse exception, it never even tried to execute. This is also what I expected to happen.

I expect that I would need to use it in a workflow on a pre PS v7 script. I am perfectly OK with it as a PS v7 feature that is not compatible with anything lower. If you are targeting a lower version, then you need to not use the newer features.

With that said, I am OK with script authors abusing the syntax to write multi-versioned scripts, but I don't think that should dictate the primary design. If foreach -parallel should be a thing, the fact that you can't use a PS v7 feature in a PS v3 script should not prevent us from implementing it in the ideal way (if we ever decide what that is).

Yes, I know some people need to write scripts that target PS v3 and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1

> I know some people need to write scripts that target PS v3 and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1

Up to a point ... Having dealt with clients where it is just too painful to get servers updated from PS4 to PS5 and found my scripts used a couple of bits of 5 specific syntax, I'm probably more keen than average that things should work on old versions.

This errors without running anything on 5.1

$lastByte = 1..10 
if ($PSVersionTable.PSVersion.Major -lt 7) {
    foreach ($b in $lastbyte) {Test-Connection "192.168.0.$b" -Count 1 } 
}
else {
    foreach ($b in $lastbyte) -parallel  {Test-Connection "192.168.0.$b" -Count 1 }  
}

This runs

$lastByte = 1..10 
if ($PSVersionTable.PSVersion.Major -lt 7) {
    $lastbyte | foreach -Process  {Test-Connection "192.168.0.$_" -Count 1 } 
}
else {
    $lastbyte | foreach -parallel  {Test-Connection "192.168.0.$_" -Count 1 }  
}

Now, if everything else were equal (and it's not: the cmdlet can go in a pipeline), I think most people would say the implementation that supports one script for two versions is preferable, though it's not mandatory.

But here's why breaking can be good. Imagine a college creating a ton of new users at the start of a year.
$newUsers = import-csv new.csv | Add-CustomUser
$newUsers | export-csv created.csv
$newUsers | foreach-object {add-CustomHomeDir $_}

So we do this and it all looks good but someone says "It makes the new csv real quick but creating the home directories feels like a month" so they add -parallel. Then someone runs the script on another box and the users get created, the file is exported and BANG error with no home directories set up. We can't run the script again because the users exist so we have to clean up and it's all horrible.
Would someone who did the quick conversion think to put #requires at the top of the script in case someone runs it on an old version? A complete fail would save them from themselves; but I would prefer that command line as
import-csv new.csv | Add-CustomUser -OutVariable newusers | foreach-object {add-CustomHomeDir $_}
Something which didn't let me create home directories until I'd created the last user would be bad.

But if re-implementing the foreach keyword becomes problematic, it would be a good fallback solution.