
Add RFC for foreach-parallel feature #174


Closed

Conversation

PaulHigin
Contributor

This is an RFC to implement a `foreach -parallel` language keyword that uses PowerShell runspaces/threads to run foreach iterations in parallel.
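
For concreteness, a sketch of the loop shape this RFC proposes (hypothetical syntax; this keyword form is the proposal under discussion, not shipped PowerShell):

    # Proposed: run each iteration on its own runspace/thread
    $results = foreach -parallel ($server in $servers) {
        Test-Connection $server -Count 1
    }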

## Alternate Proposals and Considerations

One alternative is to create a `ForEach-Parallel` cmdlet instead of re-implementing the `foreach -parallel` keyword.
This would work well but would not be as useful as making it part of the PowerShell language.
Contributor

I think this is maybe worth discussing more, just because the drawbacks aren't as apparent to me. What's the key limitation here?

The drawbacks I see with foreach () -Parallel are:

  • New syntax means scripts are syntactically backwards incompatible -- they cannot even be successfully parsed by an older PowerShell version. Compare with:
    $foreachParams = @{}
    if ($PSVersionTable.PSVersion.Major -ge 7) { $foreachParams += @{ Parallel = $true } }
    $workItems | ForEach-Object @foreachParams { Invoke-Processing $_ }
  • Assigning from a foreach loop (illustrated below) seems like a relatively unintuitive construction and a bit syntactically off. I know it's already functionality we support, and I think it makes sense in the language, but minting it as the primary syntax for parallelism seems to run against the natural style of PowerShell to me.
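
For reference, the assign-from-a-loop construction the second bullet refers to is already supported today:

    # Collecting a loop's output via assignment; existing behaviour
    $squares = foreach ($n in 1..5) { $n * $n }
    $squares   # 1 4 9 16 25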


I don't believe that ForEach-Object was in scope of this RFC. If we get this implemented, I can see that becoming part of the conversation.

I also don't believe that we would be able to splat to the foreach keyword.

Contributor

Maybe not, but I agree with @rjmholt here; it makes more syntactic sense to put this into the cmdlet instead. ForEach-Object -Parallel -AsJob {} makes a LOT more sense than foreach -Parallel -AsJob ($a in $b) {} visually, and with only a handful of exceptions, language keywords don't take parameters like that.

Additionally, I think having this available as a pipeline cmdlet would be significantly more valuable than just a foreach loop, which can't be used in a pipeline context.


I agree with @vexx32, and I think it's bad practice to modify foreach in this way.
I'm against a "Foreach vs Foreach-Object vs Magic Foreach vs ForEach-Parallel" split.
I vote for "ForEach-Object -Parallel -AsJob {}".

Contributor Author

I am perfectly fine implementing this as a cmdlet rather than a foreach language keyword extension. In fact it is much easier to implement as a cmdlet. If the community prefers a cmdlet (as it seems from these comments) then I am happy to update this RFC accordingly. But I'll let the PowerShell committee weigh in as well.


Put like that :-) ... there is a case for both, but I think the case is stronger for the cmdlet.
When you said

To my understanding, the foreach -parallel keyword is only intended to run in parallel for the duration of the loop. It's not supposed to keep running while the rest of the script executes, it simply runs each iteration of the loop in parallel, and waits until all the iterations complete before continuing the script.

I was saying, "Yes, and that's why the keyword is less good."
If your script goes:
Get items; do something to each item in its own thread; format output.
Then the keyword approach can't start any threads until it has all the items, and won't output anything until all threads have completed. But if they are in a pipeline, the threads are started as the items are fetched, and the output can happen as the threads end (sketched below). That overlapping of commands in a pipeline makes a big perf difference if the commands either side of the parallelized one are slow.
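
A sketch of the contrast (Get-AllItems and Invoke-Work are hypothetical placeholder commands):

    # Keyword form: nothing starts until every item has been collected,
    # and nothing is emitted until every iteration has finished.
    $items = Get-AllItems
    $results = foreach -parallel ($i in $items) { Invoke-Work $i }   # proposed syntax
    $results | Format-Table

    # Cmdlet form: threads can start as items arrive, and results can
    # stream onward as individual threads finish.
    Get-AllItems | ForEach-Object -Parallel { Invoke-Work $_ } | Format-Table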

Contributor

rjmholt commented Jun 11, 2019


I feel like I am missing something here. We actually want a parse error, correct? So that if someone tries to run this on down-level PowerShell, the whole script doesn't run.

Some scripts and modules need to run on versions from PS 7 down to PS 3 (or even PS 2 in the case of Pester), and not just on Windows.

An example is the PowerShellEditorServices (backend to the PowerShell extension for VS Code) startup script; it must run on everything from PS 3 on Windows Server 2012 (in some cases 2008) to PS 7 on Ubuntu 18.04. For example, that script used to call chmod on *nix and Set-Acl on Windows (we have since taken that out). Imagine if you couldn't wrap that in an if because it was a parse error.

Keywords that don't exist in any of those versions can't be used anywhere in that script. We'd have to write a whole new script (slowing down startup, increasing the download size, duplicating the code). Whereas a command parameter can be added to a splat conditionally. PS 7 users would get the parallel speedup, but it still works in PS 3.

Another example is the Install-VSCode.ps1 script. It wants to be fast, so in Windows PowerShell it uses Start-BitsTransfer since that's available. If that resulted in a parse error, you wouldn't be able to do that. We'd have to either publish two scripts, or settle for Invoke-WebRequest (which was sped up considerably in PS 6 btw :)).

You can already prevent downlevel running at parse time with #requires -Version 7.0. But as someone who maintains several complicated scripts that must work all the way downlevel, I'd like the ability to leverage PowerShell's dynamism to get the best everywhere.


That startup script is one fugly piece of scripting. :-)

But I think you miss @jhoneill's point - a down-level script engine (e.g., PS v3) should generate an error if not gated by a PSVersion check. "foreach -parallel" does not pass execution time checking on PS v3, even though a proper AST is generated - up until the "-parallel" parameter.

But if the code is protected by a PSVersion check, all is OK. I wouldn't think that should change - and I don't think that @jhoneill is suggesting otherwise.


I was missing that PS v3 handles this differently. When I checked it on PS v5.1, I got this message when calling foreach -parallel:

the '-parallel' parameter can be used only within a workflow. 

And that is exactly what I expected to happen on 5.1: because it was a parse exception, it never even tried to execute.

I expect that I would need to use it in a workflow in a pre-PS v7 script. I am perfectly OK with it being a PS v7 feature that is not compatible with anything lower. If you are targeting a lower version, then you need to not use the newer features.

With that said, I'm OK with script authors abusing the syntax to write multi-versioned scripts, but I don't think that should dictate the primary design. If foreach -parallel should be a thing, the fact that you can't use a PS v7 feature in a PS v3 script should not prevent us from implementing it in the ideal way (if we ever decide what that is).

Yes, I know some people need to write scripts that target PS v3, and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1.


I know some people need to write scripts that target PS v3 and it would be really nice to have a script that just ran faster on PS 7.0 and still worked on PS v3, but it is also perfectly OK for it to be a parse exception in PS v3. We already have things in PS v7 that are a parse exception in PS v5.1

Up to a point ... Having dealt with clients where it is just too painful to get servers updated from PS4 to PS5, and having found my scripts used a couple of bits of 5-specific syntax, I'm probably keener than average that things should work on old versions.

This errors without running anything on 5.1

$lastByte = 1..10 
if ($PSVersionTable.PSVersion.Major -lt 7) {
    foreach ($b in $lastbyte) {Test-Connection "192.168.0.$b" -Count 1 } 
}
else {
    foreach ($b in $lastbyte) -parallel  {Test-Connection "192.168.0.$b" -Count 1 }  
}

This runs

$lastByte = 1..10 
if ($PSVersionTable.PSVersion.Major -lt 7) {
    $lastbyte | foreach -Process  {Test-Connection "192.168.0.$_" -Count 1 } 
}
else {
    $lastbyte | foreach -parallel  {Test-Connection "192.168.0.$_" -Count 1 }  
}

Now, if everything else were equal (and it's not: the cmdlet can go in a pipeline), I think most people would say the implementation which supports one script for two versions is preferable - it's not mandatory, though.

But here's why breaking can be good. Imagine a college creating a ton of new users at the start of a year.
$newUsers = import-csv new.csv | Add-CustomUser
$newUsers | export-csv created.csv
$newUsers | foreach-object {add-CustomHomeDir $_}

So we do this and it all looks good, but someone says "It makes the new csv real quick but creating the home directories feels like a month", so they add -parallel. Then someone runs the script on another box and the users get created, the file is exported, and BANG: an error, with no home directories set up. We can't run the script again because the users exist, so we have to clean up and it's all horrible.
Would someone who did the quick conversion think to put #requires at the top of the script in case someone runs it on an old version? A complete fail would save them from themselves; but I would prefer that command line as
import-csv new.csv | Add-CustomUser -OutVariable newUsers | foreach-object {add-CustomHomeDir $_}
Something which didn't let me create home directories until I'd created the last user would be bad.

Contributor

@Jaykul left a comment


In the end, I don't feel a strong case was made for making this part of the language instead of just shipping a module.

In addition to the syntax reasons mentioned by others, a cmdlet in a module has the added benefit of being shippable (or already shipped) to the gallery, and the option of being backward compatible ...


This is a proposal to re-implement `foreach -parallel` in PowerShell Core, using PowerShell's support for concurrency via Runspaces.
It is similar to the [ThreadJob module](https://www.powershellgallery.com/packages/ThreadJob/1.1.2) except that it becomes part of the PowerShell language via `foreach -parallel`.


Regardless of how it is implemented (as a cmdlet or a foreach extension), I seriously hope you will make this compatible with psexec. When I filed an issue against ThreadJob noting that it broke psexec, psexec was blamed. I highly doubt that sysinternals/Mark Russinovich would agree with that.

@doctordns

@Jaykul Putting it into the language makes sense to me. It brings a great workflow feature into the full language. I think it is OK for Pwsh 7 to make changes like this, in the same way Microsoft did with Windows PowerShell 3.0. Yes, you could ship cmdlets to do this, but foreach -parallel is more natural.

I could argue you have not made the case for not updating the language.

@vexx32
Contributor

vexx32 commented May 14, 2019

If we're to add this to the language, I think it ought to be a new keyword. The number of switches we're talking about adding on top of -Parallel makes for some very quirky and clunky syntax.

Additionally, all of those additional switches come from the job cmdlets. In my opinion, they shouldn't be mirrored on language keywords. If we want those, it makes more sense to stick to cmdlets.

Keywords are meant to do ONE thing very well, with minimal or zero configurability.

Cmdlets are more appropriate for the PowerShell ecosystem in my opinion. Doubly so with the job-like additions you want applied, and the ability to return job objects.

@PaulHigin
Contributor Author

@vexx32 I have to agree about the -parallel, -asjob, etc. switches in the language. They feel very cumbersome. I like the idea of updating the language to support this, but switches feel more natural with cmdlets.

@fMichaleczek

What happens to discoverability in the case of inclusion in the language?
Will Get-Help and autocompletion be supported?

@jhoneill

@doctordns
You can pipe into a cmdlet but not into a statement. I wrote my own parallelizer so I could do
Get-ListOfSomething | start-parallel {work on something }| do-something-with-results ...

In practice it is similar to thread jobs, but automatically getting the result of each completed one. So it could be

Get-ListOfSomething | foreach-object {start-threadjob -ArgumentList $_ {work on something }}
get-job | Receive-Job | do-something-with-results

But if the parallelism comes from a statement it isn't as good:

$L = Get-listOfSomething
$r = foreach -parallel ($item in $L) {work on something} 
$r | do-something-with-results

My use case was: get all servers in AD; check if they were still in DNS; if they were, ping them; and if they answered, query settings on them. As @JamesWTruher asks above, do you extend foreach to also have a -ThreadCount parameter (do we want thousands of threads if $L is a list of thousands)? And something running in its own runspace (mine or ThreadJob) doesn't have access to the variables from whatever launched it, so taking this code

$hashtable = @{} 
foreach ($item in $L) {$hashtable[$item.id] = $item.name}

and adding -parallel would break the script block because the hashtable isn't shared (see the sketch below).
I would argue that for this reason alone it should not be an adaptation of foreach (cmdlet or statement). It could be a new statement, but my contention is that it is better to be able to include parallelism in a pipeline, so I would say new cmdlet.
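
A sketch of the isolation problem, written with the ForEach-Object -Parallel shape that was eventually adopted (this is how it behaves in PS 7, where $using: imports a caller's value):

    $prefix = 'srv'
    1..3 | ForEach-Object -Parallel {
        # $prefix is NOT visible here; each iteration runs in its own runspace.
        "$($using:prefix)-$_"   # $using: explicitly copies the caller's value in
    }
    # srv-1, srv-2, srv-3 (completion order not guaranteed)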

@iSazonov
Contributor

iSazonov commented May 15, 2019

I like the idea of updating the language to support this

Are we talking about this?

    parallel {
        # A script block
    }

    $job = parallel {
        # A script block
    }

@jhoneill
Copy link

@iSazonov yes.
It would be
parallel [parameters] {scriptblock}

I don't think anyone is arguing that there should not be baked-in support for parallelism. The questions are:
(a) A solution which uses runspaces, as present methods do, would mean the script block doesn't have access to external variables, so everything must be passed as a parameter and returned as a result. Therefore the scriptblock behaves like the one in Start-Job or a remote Invoke-Command (and would need to support $using:, which works with those but not with foreach).
(b) It follows that a script block which works with foreach / ForEach-Object now would still work if these had a -Parallel switch, and that a script written for v7 would work on an earlier version by removing -Parallel.
(c) Statements can't directly form part of a pipeline; compare
$(foreach ($x in 1..3) {$x ; sleep 2}) | foreach {$_}
and
1..3 | foreach {$_ ; sleep 2} | foreach {$_ }

The first has to completely evaluate the statement before it pipes the result into the next command.
So if you had to do $(Parallel <<whatever>>), it could not begin until it has ALL its input, and the next step could not begin until it has generated ALL the output.

So the discussion is: keyword or cmdlet? And switches on existing commands (timeout, number of threads) or a new command?

I'd say cmdlet, and a distinct command.

@doctordns

Purely on the point about switches: an alternative to having switches might be preference variables, e.g. $PSForEachP_ThrottleLimit and $PSForEachP_TimeOut?

@vexx32
Contributor

vexx32 commented May 15, 2019

No, I don't think that's a good idea. That would very quickly get cumbersome in any script which needs to use the keyword frequently.

@jhoneill

@doctordns ... so a global variable rather than passing as a parameter?
And seriously: I set my preference for what's good on my machine and never set a limit in my scripts. I share a script with you, and with your preference it sends a massive number of concurrent requests to your sql/ad/whatever. The number should be explicit.

@SteveL-MSFT
Member

@PowerShell/powershell-committee discussed this and our recommendation is to implement as ForEach-Object -Parallel instead of reusing the foreach -parallel statement.

@iSazonov
Contributor

Yesterday I happened across the GNU utility parallel ("parallel --pipe"). Also, we could find "Where-Object -Parallel" useful too. And "Select-String". And others. So this suggests that maybe we need something more general for pipeline parallelization (Parallel-Pipe):

$hugeArray | Parallel-Pipe { Where-Object Name -eq "test" }
$hugeArray | Parallel-Pipe {Select-String -Pattern "test" }
$hugeArray | Parallel-Pipe -AsJob { ... }

@rjmholt
Contributor

rjmholt commented May 16, 2019

@iSazonov A rose by any other name would smell as sweet:

$hugeArray | ForEach-Object -Parallel { $_ | Where-Object Name -eq "test" }
$hugeArray | ForEach-Object -Parallel { $_ | Select-String -Pattern "test" }
$hugeArray | ForEach-Object -Parallel -AsJob { ... }

I think in the name of single-purpose, composable cmdlets, having a single parallel cmdlet (rather than trying to parallelise each one individually) is the right way to go. And I think ForEach-Object has that brief of Select-like enumeration. Just like LINQ has .AsParallel(), I think -Parallel makes sense to do this. But that's just a brief and not-strongly-held opinion :)

@iSazonov
Contributor

@rjmholt There is a deeper problem with "ForEach-Object -Parallel": it has a lot of parameters, and how would "-Begin/-Process/-End" etc. work - in the "ForEach-Object -Parallel" context or the scriptblock context? I think we would have to duplicate parameters. In that case Parallel-Pipe is simpler and safer.

Parallel-Pipe     -InputObject $data
                  -Begin { … } 
                  -Process { Foreach-Object -Begin { … } -Process { … } -End { … } }
                  -End { … } 

@nightroman

nightroman commented May 16, 2019

FWIW, Split-Pipeline from SplitPipeline uses two kinds of Begin and End.

One pair, invoked once for each parallel pipeline, is defined by the parameters Begin and End.
Their goal is to set up and clean up pipeline workspaces.

Split-Pipeline
    [-Script] <ScriptBlock>
    [[-InputObject] <PSObject>]
    [-Begin <ScriptBlock>]
    [-End <ScriptBlock>]
    ...

Another pair is defined in Script as the usual Begin and End blocks.
This pair runs for each batch of objects sent to a pipeline. Note that each
parallel pipeline may receive several input batches.

Unlike in ForEach-Object, the script normally has its Process block.
Split-Pipeline is designed to be the parallel version of the less popular

# sequential
... | .{process{ $_ }}

# parallel
... | Split-Pipeline {process{ $_ }}

construct, rather than the more popular ForEach-Object

... | ForEach-Object { $_ }

Why? Because the script may be defined elsewhere, and it can be used in both
sequential and parallel ways without changes (except in some cases):

# common script
$script = {process{ $_ }}

# sequential
... | . $script

# parallel
... | Split-Pipeline $script

@jhoneill

jhoneill commented May 16, 2019

@PowerShell/powershell-committee discussed this and our recommendation is to implement as ForEach-Object -Parallel instead of reusing the foreach -parallel statement.

@SteveL-MSFT
If -parallel specifies a script block (like -Process, which is often implied), it solves my concern that a script block which works sequentially would not work in parallel.

It still needs a -Threads option - some services will allow more than one connection from the same client but won't allow 10000000 :-) And if -Parallel is creating runspaces it will need $using:.
You could have foreach-object -begin {xxx} -parallel {yyyy} -end {zzzz} and
get-stuff | foreach-object -parallel {aaaa} | foreach-object -parallel {bbbb}
which is how I would want it to work.

@iSazonov - -Begin is "before all", -End is "after all", and where is just foreach-object {if (<<condition>>) {$_} }

@rjmholt
Contributor

rjmholt commented May 16, 2019

@jhoneill

It still needs a -threads option - some services will allow more than 1 connection from the same client but won't allow 10000000 :-)

I think that's proposed here as -ThrottleLimit (which sets a maximum parallelisation level)
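
As a sketch of that shape (using the -ThrottleLimit parameter name proposed here, which is what ForEach-Object -Parallel shipped with in PS 7):

    # At most 10 iterations run concurrently; the rest queue up.
    1..100 | ForEach-Object -Parallel {
        Test-Connection "192.168.0.$_" -Count 1
    } -ThrottleLimit 10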

@PaulHigin
Contributor Author

OK, I'll rework this RFC as a ForEach-Object cmdlet parameter-set addition rather than a language keyword change. Thanks for all the responses.

@iSazonov
Contributor

@PaulHigin Maybe open a new PR, because we already have 53 conversations here that slow down the GitHub interface?

@PaulHigin
Contributor Author

I'll create a new RFC PR for the cmdlet version of foreach parallel.

@PaulHigin PaulHigin closed this May 17, 2019
@PaulHigin PaulHigin deleted the foreach-parallel branch May 17, 2019 15:29
@PrzemyslawKlys

To be honest, I really liked the idea of foreach -parallel rather than the ForEach-Object version of it. I don't know much about the implementation cost though. It would be great to have both options, so one can use it with the pipeline and without.

@PaulHigin
Contributor Author

@PrzemyslawKlys We can eventually do both. But it sounds like the community prefers the ForEach-Object cmdlet solution so that can be implemented first.

@iSazonov
Contributor

Continue in #194

@PrzemyslawKlys

@PaulHigin, I understand. I use the pipeline very rarely, mostly because there is a performance difference. If you can make that difference go away, then it doesn't matter. For example, I am removing Where-Object where speed matters because it's 10x slower than a standard foreach loop. But maybe it's just Where-Object that has to be fixed?

@iSazonov
Contributor

because there is a performance difference

@PrzemyslawKlys If you see performance problems in important scenarios please report in new issue(s).

@PrzemyslawKlys

@iSazonov which repo?


@jhoneill

@PrzemyslawKlys Tbh, designing to prevent use of the pipeline is not a clever thing to do.
There is an overhead in passing things down the pipeline. If the work done in a step is small, then removing the pipeline is faster. BUT
get-DataFromServer | where {complex condition} | New-Thing
will finish faster than

$d = get-DataFromServer
$f = foreach ($item in $d) {if (complex condition) {$item}}
foreach ($filteredItem in $f) {New-Thing $filteredItem}

because New-Thing can process the first item to come from get-DataFromServer before the last one has been returned.
It turns out that avoiding the pipeline is only good SOMETIMES.

What you are advocating is that something which allows you to start pieces of work without waiting for the previous one to finish should be implemented so that it can only begin when the whole of the previous task has finished, and should not allow the next one to start until all its parallel parts have finished. That's why the cmdlet is better.

@PrzemyslawKlys

@jhoneill I don't want to change how things are done. I just want them to be faster in the places I use them. In most of my tests Where-Object was slow, so I started using foreach and dropping the pipeline.

In most of my tests ForEach-Object was slower than foreach as well; however, I'm only talking about the cases I've used them in.

Now, this is why I actually asked for implementing both ForEach-Object and foreach with -parallel: when it makes sense to use the pipeline, use the pipeline, but where it doesn't, don't use it.

In the case of AD, for example, using the pipeline is very risky. Sure, you can process objects as they come, but often this causes context errors. And the very same problems would occur if you tried to introduce runspaces (for example, getting users, computers, and groups at the same time with 3 different queries).

I don't want to stop using the pipeline. I'm pro-pipeline. I like it, I like the idea, but it's just not suited to all scenarios, and if I can get -parallel for both, sign me up.

As for Where-Object have a look here: PowerShell/PowerShell#9941

@doctordns

@PrzemyslawKlys We can eventually do both. But it sounds like the community prefers the ForEach-Object cmdlet solution so that can be implemented first.

I am not sure that that is correct. Certainly, when I teach PowerShell, I do not get that sense. I think the folks posting here are more aware of the implementation issues and are avoiding the harder one (which is not a bad thing). But standing back, I think implementing foreach -parallel as a language feature is the better option, although a close second for me is doing both.

I have two reasons for thinking that. First, I believe we should have symmetry with the workflow syntax. Yes, workflow may have been an awful implementation, but I do like the language feature allowing parallelism. Secondly, it would mean I do not have to rely on $_/$PSItem in the script blocks. Forcing that is not good practice, in my view.

I also think that NOW is the best time to implement both approaches. If the team punts on adding the language feature now, when would be appropriate? PWSH 7 offers a unique opportunity to get some things right (or at least a whole lot better).

I fully understand the complexity argument. But if you are going to implement parallelism like this in PowerShell, let's do it completely, let's do it right, and let's do it in PWSH 7.

@jhoneill

it sounds like the community prefers the ForEach-Object cmdlet

I am not sure that that is correct. Certainly, when I teach PowerShell, I do not get that sense.

First, I believe we should have symmetry with the Workflow syntax. Yes, Workflow may have been awful implementation,

I think what people prefer depends on the order in which they learned things. I remember in my early days with PowerShell, first I knew about %, then foreach as an alias, and finally the split between the foreach keyword and the ForEach-Object cmdlet.

Depending on how a course is structured, people might learn for (which is only for recovering C programmers), while, and foreach before they learn about pipelines. Or they might learn about composing things with a pipeline before they learn the language constructs. People with no exposure to workflow (and I looked at it and couldn't find a use) will have a different view from those with a lot of exposure.

@jhoneill

@jhoneill I don't want to change how things are done. I just want them to be faster in places I use them. In most of my tests, Where-Object was slow, so I started using Foreach

Where-Object is a bad case for the pipeline, because if you do
| Where-Object someProperty -ne 0

the time to do the work is small compared to the pipeline overhead. But when you test and say "it is 10 times faster", you are not talking about saving seconds, but much less than 1ms per operation. (See https://jamesone111.wordpress.com/2019/04/06/powershell-functions-and-when-not-to-use-them/ for the similar overhead per function call; it's ~100 microseconds.) It is not visible unless you are working with many, many thousands of items.
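
A rough way to see that per-item overhead (a sketch; absolute numbers vary by machine):

    # Same filter with and without the pipeline; the gap is almost entirely
    # per-item pipeline/scriptblock overhead.
    Measure-Command { 1..100000 | Where-Object { $_ % 2 -eq 0 } }
    Measure-Command { foreach ($i in 1..100000) { if ($i % 2 -eq 0) { $i } } }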

Try this:
1..5 | foreach {sleep 2; ($_ + 1) } | foreach {sleep 1 ; ($_ + 1) }
It starts generating output at 3 seconds.
Compare with:

$data =  1..5 
$intermediate = foreach ($item in $data) {sleep 2; ($item + 1) } 
$output = foreach ($item in $intermediate) {sleep 1 ; ($item + 1) }
$output 

No output for the first 15 seconds. They take the same time end to end, but with the pipeline I can read the first output while the process is running.

The use case for which I wrote Start-Parallel was (roughly):

  1. It takes ~ 100 seconds to return a list of 100 servers to test (it turns out DoS protection means one of the "do we test this" checks can't be parallelized).
  2. 10% of the servers are offline; it takes 20 seconds for a server to time out, and 3 seconds to run the test. Run in sequence this takes 100 + 10 * 20 + 90 * 3 = 570 seconds.
  3. If no servers were offline, using Start-Parallel I would only need 4 threads, because server1's test results come in and its thread can be reused at about the time the information arrives saying "test server4". I might have 2 more threads for servers which are not responding. But with 6-7 threads (and only 1 or 2 causing network traffic at any time) I get my first result after 3-4 seconds, and the last one after ~120 seconds.
  4. But if I do not stagger the start of the threads, I get nothing for 100 seconds. Then I start 100 threads and still get the last result after 120 seconds, but I've concentrated all the load into two 20-second bursts, and maybe it would be better to cap at, say, 10 threads. (In practice, to prevent accidental DoS attacks, we'll need a low default number.)
    So I can query the first 10 servers; one blocks, and 3 seconds later I can query 9 more; one blocks, and 3 seconds later I query 8 more. At equilibrium about 5 threads are blocked, so I'm doing about 1.5 servers per second. It takes ~ 60 seconds before I have the final test in progress, and it is 20 seconds after that when the last one comes in. So 180 seconds in all.
    Either way, it is preferable to allow the threads to start before we have all the input data.

@PrzemyslawKlys

PrzemyslawKlys commented Jun 19, 2019

I understand the use case, and output may vary depending on how you test it. And that's a problem. I shouldn't need to test it. I shouldn't have to worry about stuff like this.

In my case: if you have a collection of 10 items, Where-Object will be fine. In the case of my 4412 items, where for each object I did another loop through 4412 objects using Where-Object, it was 7 minutes vs 59 seconds. If this can be optimized in PS 7, I can stop worrying about testing things like this and focus on solving problems. I don't care about micro-optimization like 15 seconds down to 5 seconds. But 7 minutes down to 1 minute is a big deal.

I understand that 19465744 iterations (4412 * 4412) is actually a lot, and it may take time; the difference is just too big, though.

I use a lot of runspaces in my scripts, but until some point I was taking Where-Object and a few other things for granted, just to find out they are very slow in comparison to foreach.

@jhoneill

If you want to see what is wrong with where, try:
(1..1mb ).where({$_ % 131071 -eq 0}) ## no pipeline; takes 6.5 seconds on my machine

foreach ($i in 1..1mb) {if ($i % 131071 -eq 0) {$i}} ## 1.6 seconds

What is more of an issue (for me) is that Where-Object and ForEach-Object don't expand an array if it is passed via the -InputObject parameter, so when you have multiple objects you are forced to pipe the input in. Other things do expand it.
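
A sketch of that -InputObject gotcha:

    # -InputObject binds the whole array as ONE object, so the filter scriptblock
    # runs once and the unfiltered array comes straight back:
    Where-Object -InputObject (1..5) { $_ -gt 2 }   # 1 2 3 4 5
    # Piping enumerates, so the filter runs per item:
    1..5 | Where-Object { $_ -gt 2 }                # 3 4 5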
