Add RFC for foreach-parallel feature #174
Conversation
## Alternate Proposals and Considerations

One alternative is to create a `ForEach-Parallel` cmdlet instead of re-implementing the `foreach -parallel` keyword. This would work well but would not be as useful as making it part of the PowerShell language.
I think this is maybe worth discussing more, just because the drawbacks aren't as apparent to me. What's the key limitation here?
The drawbacks I see with `foreach () -Parallel` are:
- New syntax means scripts are syntactically backwards incompatible -- they cannot even be successfully parsed by an older PowerShell version. Compare with:
  $foreachParams = @{}
  if ($PSVersionTable.PSVersion.Major -ge 7) {
      $foreachParams += @{ Parallel = $true }
  }
  $workItems | ForEach-Object @foreachParams { Invoke-Processing $_ }
- Assigning from a `foreach` loop (see the sketch below) seems like a relatively unintuitive construction and a bit syntactically off. I know it's already functionality we support and I think it makes sense in the language, but minting it as the primary syntax for parallelism seems to run a bit against the natural style of PowerShell to me.
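For reference, that assignment-from-a-loop construct is already valid PowerShell today; a minimal example:
# A foreach statement used as an expression: each iteration's output
# is collected into $results
$results = foreach ($n in 1..5) {
    $n * 2
}
$results    # 2 4 6 8 10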
I don't believe that `ForEach-Object` was in scope of this RFC. If we get this implemented, I can see that becoming part of the conversation.
I also don't believe that we would be able to splat to the `foreach` operator.
Maybe not, but I agree with @rjmholt here; it makes more syntactic sense to put this into the cmdlet instead. `ForEach-Object -Parallel -AsJob {}` makes a LOT more sense than `foreach -Parallel -AsJob ($a in $b) {}` visually, and with only a handful of exceptions, language keywords don't have parameters like that.
Additionally, having this available for pipeline cmdlets would, I think, be significantly more valuable than having it just on a foreach loop, which can't be used in a pipeline context.
I agree with @vexx32, and I think it's bad practice to modify foreach in this way.
I'm against a "Foreach VS Foreach-Object VS Magic Foreach VS ForEach-Parallel" split.
I vote for `ForEach-Object -Parallel -AsJob {}`.
I am perfectly fine implementing this as a cmdlet rather than a foreach language keyword extension. In fact it is much easier to implement as a cmdlet. If the community prefers a cmdlet (as it seems from these comments) then I am happy to update this RFC accordingly. But I'll let the PowerShell committee weigh in as well.
Put like that :-) ... there is a case for both, but I think the case is stronger for the cmdlet.
When you said
To my understanding, the `foreach -parallel` keyword is only intended to run in parallel for the duration of the loop. It's not supposed to keep running while the rest of the script executes; it simply runs each iteration of the loop in parallel, and waits until all the iterations complete before continuing the script.
I was saying, "Yes, and that's why the keyword is less good."
If your script goes
Get items; do something to each item in its own thread; format output.
then the keyword approach can't start any threads until it has all the items, and won't output anything until all threads have completed. But if they are stages of a pipeline, the threads will be started as the items are fetched, and the output can happen as the threads end. That overlapping of commands in a pipeline makes a big performance difference if the commands on either side of the parallelized one are slow.
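A rough sketch of the two shapes being compared; Get-AllItems, Process-Item and Format-Output are hypothetical placeholder commands, and both parallel syntaxes shown are proposals under discussion, not shipped features:
# Keyword form: every item must be collected before the first thread starts,
# and nothing reaches Format-Output until all threads have finished
$results = foreach -parallel ($i in Get-AllItems) { Process-Item $i }
$results | Format-Output

# Pipeline form: threads start as items stream in, and each result flows
# on to Format-Output as soon as its thread completes
Get-AllItems | ForEach-Object -Parallel { Process-Item $_ } | Format-Output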
I feel like I am missing something here. We actually want a parse error, correct? So that if someone tries to run this on down-level PowerShell, the whole script doesn't run.
Some scripts and modules need to run on versions from PS 7 down to PS 3 (or even PS 2 in the case of Pester), and not just on Windows.
An example is the PowerShellEditorServices (backend to the PowerShell extension for VS Code) startup script; it must run on everything from PS 3 on Windows Server 2012 (in some cases 2008) to PS 7 on Ubuntu 18.04. We took it out, but, for example, that script used to call `chmod` on *nix and `Set-Acl` on Windows. Imagine if you couldn't wrap that in an `if`, but it was a parse error.
Keywords that don't exist in any of those versions can't be used anywhere in that script. We'd have to write a whole new script (slowing down startup, increasing the download size, duplicating the code). Whereas a command parameter can be added to a splat conditionally, as sketched below: PS 7 users would get the parallel speedup, but it still works in PS 3.
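A hedged sketch of that conditional splat, assuming the eventual cmdlet accepts the script block through a -Parallel parameter (Update-Thing is a hypothetical command):
# Parses and runs on PS 3; only PS 7+ opts in to the parallel behaviour
$feParams = @{ Process = { Update-Thing $_ } }
if ($PSVersionTable.PSVersion.Major -ge 7) {
    $feParams = @{ Parallel = { Update-Thing $_ } }
}
$servers | ForEach-Object @feParams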
Another example is the Install-VSCode.ps1 script. It wants to be fast, so in Windows PowerShell it uses `Start-BitsTransfer` since that's available. If that resulted in a parse error, you wouldn't be able to do that. We'd have to either publish two scripts, or settle for `Invoke-WebRequest` (which was sped up considerably in PS 6, btw :)).
You can already prevent down-level running at parse time with `#requires -Version 7.0`. But as someone who maintains several complicated scripts that must work all the way down-level, I'd like the ability to leverage PowerShell's dynamism to get the best everywhere.
That startup script is one fugly piece of scripting. :-)
But I think you miss @jhoneill's point - a down-level script engine (e.g., PS v3) should generate an error if not gated by a PSVersion check. "foreach -parallel" does not pass execution time checking on PS v3, even though a proper AST is generated - up until the "-parallel" parameter.
But if the code is protected by a PSVersion check, all is OK. I wouldn't think that should change - and I don't think that @jhoneill is suggesting otherwise.
I was missing that PS v3 handles this differently. When I checked on PS v5.1, I got this message when calling `foreach -parallel`:
the '-parallel' parameter can be used only within a workflow.
And that is exactly what I expected to happen on 5.1: because it was a parse exception, it never even tried to execute.
I expect that I would need to use it in a workflow in a pre-PS v7 script. I am perfectly OK with it as a PS v7 feature that is not compatible with anything lower. If you are targeting a lower version, then you need to not use the newer features.
With that said, I'm OK with script authors abusing the syntax to write multi-versioned scripts, but I don't think that should dictate the primary design. If `foreach -parallel` should be a thing, the fact that you can't use a PS v7 feature in a PS v3 script should not prevent us from implementing it in the ideal way (if we ever decide what that is).
Yes, I know some people need to write scripts that target PS v3, and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1.
I know some people need to write scripts that target PS v3 and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1
Up to a point ... Having dealt with clients where it was just too painful to get servers updated from PS 4 to PS 5, and having found my scripts used a couple of bits of 5-specific syntax, I'm probably keener than average that things should work on old versions.
This errors without running anything on 5.1:
$lastByte = 1..10
if ($PSVersionTable.PSVersion.Major -lt 7) {
foreach ($b in $lastbyte) {Test-Connection "192.168.0.$b" -Count 1 }
}
else {
foreach ($b in $lastbyte) -parallel {Test-Connection "192.168.0.$b" -Count 1 }
}
This runs:
$lastByte = 1..10
if ($PSVersionTable.PSVersion.Major -lt 7) {
$lastbyte | foreach -Process {Test-Connection "192.168.0.$_" -Count 1 }
}
else {
$lastbyte | foreach -parallel {Test-Connection "192.168.0.$_" -Count 1 }
}
Now, if everything else were equal (and it's not: the cmdlet can go in a pipeline), I think most people would say the implementation which supports one script for two versions is preferable - it's not mandatory.
But here's why breaking can be good. Imagine a college creating a ton of new users at the start of a year.
$newUsers = import-csv new.csv | Add-CustomUser
$newUsers | export-csv created.csv
$newUsers | foreach-object {add-CustomHomeDir $_}
So we do this and it all looks good, but someone says "It makes the new csv real quick but creating the home directories feels like a month", so they add -parallel. Then someone runs the script on another box and the users get created, the file is exported and BANG: error, with no home directories set up. We can't run the script again because the users exist, so we have to clean up and it's all horrible.
Would someone who did the quick conversion think to put #requires at the top of the script in case someone runs it on an old version? A complete fail would save them from themselves; but I would prefer that command line as
import-csv new.csv | Add-CustomUser -OutVariable newUsers | foreach-object {add-CustomHomeDir $_}
Something which didn't let me create home directories until I'd created the last user would be bad.
In the end, I don't feel a strong case was made for making this part of the language instead of just shipping a module.
In addition to the syntax reasons mentioned by others, a cmdlet in a module has the added benefit of being shippable (or already shipped) to the gallery, and the option of being backward compatible ...
This is a proposal to re-implement `foreach -parallel` in PowerShell Core, using PowerShell's support for concurrency via Runspaces. It is similar to the [ThreadJob module](https://www.powershellgallery.com/packages/ThreadJob/1.1.2), except that it becomes part of the PowerShell language via `foreach -parallel`.
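For context, the ThreadJob module mentioned above already provides thread-based parallelism through the standard job cmdlets; a small usage example:
# One thread job per item, limited to 5 concurrent threads;
# results are collected with the usual job cmdlets
$jobs = 1..10 | ForEach-Object {
    Start-ThreadJob -ScriptBlock { param($n) $n * $n } -ArgumentList $_ -ThrottleLimit 5
}
$jobs | Wait-Job | Receive-Job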
Regardless of how implemented (as a cmdlet or a foreach extension), I seriously hope you will make this compatible with psexec. When I filed an issue against threadjob noting that it broke psexec, psexec was blamed. I doubt highly that sysinternals/Mark Russinovich would agree with that.
@Jaykul Putting it into the language makes sense to me. It brings a great workflow feature into the full language. I think it is OK for Pwsh 7 to make changes like this, in the same way Microsoft did with Windows PowerShell 3.0. Yes, you could ship cmdlets to do this, but foreach -parallel is more natural. I could argue you have not made the case for not updating the language.
If we're to add this to the language, I think it ought to be a new keyword. The number of switches we're talking about adding on top of -Parallel makes for some very quirky and clunky syntax. And all of those additional switches come from the job cmdlets; in my opinion, they shouldn't be mirrored on language keywords. If we want those, it makes more sense to stick to cmdlets. Keywords are meant to do ONE thing very well, with minimal or zero configurability. Cmdlets are more appropriate for the PowerShell ecosystem, in my opinion. Doubly so with the job-like additions you want applied, and the ability to return job objects.
@vexx32 I have to agree about the
What happens to discoverability in the case of inclusion in the language?
@doctordns In practice it is similar to thread jobs, but automatically getting the result of each completed one. So it could be
But if the parallelism comes from a statement it isn't as good:
My use case was: get all servers in AD, check if they were still in DNS, and if they were, ping them, and if they answered, query settings on them. As @JamesWTruher asks above, do you extend foreach to also have a -threadCount parameter (do we want thousands of threads if the list contains thousands of items?)? And something running in its own runspace (mine or a thread job) doesn't have access to the variables from whatever launched it, so taking this code and adding -parallel would break the script block, because the hashtable isn't shared.
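A sketch of the isolation problem and a possible escape hatch; whether these parallel iterations would support ThreadJob-style $using: references is an assumption here, not something the RFC specifies:
$ping = @{ Count = 1 }                      # hashtable built in the caller's runspace
$servers | ForEach-Object -Parallel {
    # $ping itself is not visible here: each iteration runs in its own runspace
    $p = $using:ping                        # works only if $using: is supported
    Test-Connection -ComputerName $_ -Count $p.Count
}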
Are we talking about this?
parallel {
    # A script block
}
$job = parallel {
    # A script block
}
@iSazonov yes. I don't think anyone is arguing that there should not be baked-in support for parallelism. The questions are how it should be exposed. The first form has to completely evaluate the statement before it pipes the result into the next command. So the discussion is: keyword or cmdlet? And: switches on an existing command (timeout, number of threads), or a new command? I'd say cmdlet, and a distinct command.
Purely on the point of switches: an alternative to having switches might be preference variables, e.g. $PSForEachP_ThrottleLimit and $PSForEachP_TimeOut?
No, I don't think that's a good idea. That would very quickly get cumbersome in any script which needs to use the keyword frequently.
@doctordns ... so a global variable rather than passing as a parameter?
@PowerShell/powershell-committee discussed this and our recommendation is to implement it as `ForEach-Object -Parallel`.
Yesterday I happened across the GNU utility parallel ("parallel --pipe"). We could also find "Where-Object -Parallel" useful, and "Select-String", and others. So this suggests that maybe we need something more general for pipe parallelization (Parallel-Pipe):
$hugeArray | Parallel-Pipe { Where-Object Name -eq "test" }
$hugeArray | Parallel-Pipe { Select-String -Pattern "test" }
$hugeArray | Parallel-Pipe -AsJob { ... }
@iSazonov A rose by any other name would smell as sweet:
$hugeArray | ForEach-Object -Parallel { $_ | Where-Object Name -eq "test" }
$hugeArray | ForEach-Object -Parallel { $_ | Select-String -Pattern "test" }
$hugeArray | ForEach-Object -Parallel -AsJob { ... }
I think, in the name of single-purpose, composable cmdlets, having a single parallel cmdlet (rather than trying to parallelise each individually) is the right way to go. But I think
@rjmholt There is a problem in depth with "ForEach-Object -Parallel" - there are a lot of parameters, and how would "-Begin/Process/End" etc. work: in the "ForEach-Object -Parallel" context, or in the script block context? I think we would have to duplicate parameters. In that case Parallel-Pipe is simpler and safer:
Parallel-Pipe -InputObject $data
    -Begin { … }
    -Process { ForEach-Object -Begin { … } -Process { … } -End { … } }
    -End { … }
FWIW, One pair, invoked once for each parallel pipeline, is defined as parameters
Another pair is defined in Unlike in
construct than the more popular
Why? Because the script may be defined elsewhere and it can be used for both
@SteveL-MSFT It still needs a @iSazonov
I think that's proposed here as
Ok, I'll re-work this RFC to be a ForEach-Object cmdlet parameter set addition rather than a language keyword change. Thanks for all the responses.
@PaulHigin Maybe open a new PR, because we already have 53 conversations here that slow down the GitHub interface?
I'll create a new RFC PR for the cmdlet version of foreach parallel.
To be honest I really liked the idea of
@PrzemyslawKlys We can eventually do both. But it sounds like the community prefers the ForEach-Object cmdlet solution so that can be implemented first.
Continue in #194
@PaulHigin, I understand. I use the pipeline very rarely, mostly because there is a performance difference. If you can make that difference go away, then it doesn't matter. For example, I am removing Where-Object where speed matters because it's 10x slower than a standard foreach loop. But maybe it's just Where-Object that has to be fixed?
@PrzemyslawKlys If you see performance problems in important scenarios, please report them in new issue(s).
@iSazonov which repo?
@PrzemyslawKlys Tbh designing to prevent use of the pipeline is not a clever thing to do.
because New-Thing can process the first item to come from Get-DataServer before the last one has been returned. What you are advocating is that something which allows you to start pieces of work without waiting for the previous one to finish should be implemented so it can only begin when the whole of the previous task has finished, and should not allow the next one to start until all its parallel parts have finished. That's why the cmdlet is better.
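The contrast being drawn, using the comment's own hypothetical Get-DataServer and New-Thing commands:
# Pipeline: New-Thing starts on the first server while Get-DataServer
# is still producing the rest
Get-DataServer | New-Thing

# Collect-then-loop: nothing starts until Get-DataServer has returned everything
$servers = Get-DataServer
foreach ($s in $servers) { New-Thing $s }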
@jhoneill I don't want to change how things are done. I just want them to be faster in the places I use them. In most of my tests Where-Object was slow, so I started using foreach and dropping the pipeline. In most of my tests ForEach-Object was slower than foreach as well; however, I'm only talking about the cases I've used them in. Now, this is why I actually asked for implementing -parallel on both ForEach-Object and foreach, so that when it makes sense to use the pipeline, you use the pipeline, but where it doesn't, you don't. In the case of AD, for example, using the pipeline is very risky. Sure, you can process objects as they come, but often this would cause context errors. But the very same problems would occur if you tried to introduce runspaces (for example, getting Users, Computers, and Groups at the same time with 3 different queries). I don't want to stop using the pipeline. I'm pro-pipeline. I like it, I like the idea, but it's just not suited for all scenarios, and if I can get -parallel for both, sign me up. As for Where-Object, have a look here: PowerShell/PowerShell#9941
I am not sure that that is correct. Certainly, when I teach PowerShell, I do not get that sense. I think the folks posting here are more aware of the implementation issues and are avoiding the harder one (which is not a bad thing). But standing back, I think implementing foreach as a language feature is the better option, although a close second for me is to do both. I have two reasons why I think that. First, I believe we should have symmetry with the Workflow syntax. Yes, Workflow may have been an awful implementation, but I do like the language feature allowing parallelism. Secondly, it would mean I do not have to rely on $_/$PSITEM in the script blocks. Forcing that is not good practice, in my view. I also think that NOW is the best time to implement both approaches. If the team punts on adding the language feature now, when would be appropriate? PWSH 7 offers a unique opportunity to get some things right (or at least a whole lot better). I fully understand the complexity argument. But if you are going to implement parallelism like this in PowerShell, let's do it completely, let's do it right, and let's do it at PWSH 7.
I think we should allow for the order in which people learned things. I remember that in my early days with PowerShell I first knew about %, then foreach as an alias, and finally the split between the foreach keyword and the ForEach-Object cmdlet. Depending on how a course is structured, people might learn for (which is only for recovering C programmers), while, and foreach before they learn about pipelines. Or they might learn about composing things with a pipeline before they learn the language constructs. People with no exposure to workflow (and I looked at it and couldn't find a use) will have a different view to those with a lot of exposure.
Where-Object is a bad case for the pipeline because the time to do the work is small while the pipeline overhead is comparatively big. But when you test and say "it is 10 times faster", you are not talking about saving seconds, but much less than 1 ms per operation. (See https://jamesone111.wordpress.com/2019/04/06/powershell-functions-and-when-not-to-use-them/ for the similar overhead per function call; it's ~100 microseconds.) It is not visible unless you are working with many, many thousands of items. Try this:
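A minimal sketch matching the behaviour described below, assuming a producer that yields one item per second:
# foreach statement: the slow producer must finish before the loop body runs,
# so nothing appears for ~15 seconds, then 1..15 print together
foreach ($i in (1..15 | ForEach-Object { Start-Sleep -Seconds 1; $_ })) { $i }

# Pipeline: each item prints as it is produced, one per second
1..15 | ForEach-Object { Start-Sleep -Seconds 1; $_ }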
No output from the first for 15 seconds. They take the same time end to end, but with the second I can read the first output while the process is running. The use case for which I wrote Start-Parallel was (roughly):
I understand the use case, and output may vary depending on how you test it. And that's a problem: I shouldn't need to test it. I shouldn't have to worry about stuff like this. In my case, if you have a collection of 10 items, Where-Object will be fine. In the case of my 4412 items, where for each object I did another loop through 4412 objects using Where-Object, it was 7 minutes vs 59 seconds. If this can be optimized in PS 7, I can stop worrying about testing things like this and focus on solving problems. I don't care about micro-optimization like 15 seconds to 5 seconds, but 7 minutes to 1 minute is a big deal. I understand that 19,465,744 loops (4412 * 4412) is actually a lot, and it may take time; the difference is just too big. I use a lot of runspaces in my scripts, but up to a point I was taking Where-Object and a few other things for granted, just to find out they are very slow in comparison to foreach.
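The shape of the comparison being described, shrunk to a size that runs quickly (the data and field names are made up; the gap widens as the item count grows):
$data = 1..500 | ForEach-Object { [pscustomobject]@{ Id = $_; Ref = $_ % 97 } }

# Where-Object inside the outer loop: one full pipeline start-up per outer item
Measure-Command {
    foreach ($d in $data) { $null = $data | Where-Object Ref -eq $d.Ref }
}

# Plain foreach filtering: same result, no per-item pipeline overhead
Measure-Command {
    foreach ($d in $data) {
        $null = foreach ($x in $data) { if ($x.Ref -eq $d.Ref) { $x } }
    }
}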
If you want to see what is wrong with
What is more of an issue (for me) is that Where-Object and ForEach-Object don't expand an array if it is passed as the -InputObject parameter, so when you have multiple objects you are forced to pipe the input in. Other things do expand it.
This is an RFC to implement the `foreach -parallel` language keyword, using PowerShell runspaces/threads to run foreach iterations in parallel.