Skip to content

[libclang] Add API to generate a reproducer for the explicitly-built modules. #10577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: next
Choose a base branch
from

Conversation

vsapsai
Copy link

@vsapsai vsapsai commented Apr 29, 2025

Capturing a single -cc1 command to reproduce a bug with the explicitly-built modules isn't sufficient, we need to know what .pcm files were built and how.

rdar://59743925

@vsapsai vsapsai requested a review from jansvoboda11 April 29, 2025 04:06
@vsapsai
Copy link
Author

vsapsai commented Apr 29, 2025

Right now I'd like to get a high-level opinion if such an approach for reproducers is reasonable. This feature will have more commits, to get an idea of how it is supposed to look like you can check users/vsapsai/reproducer-explicitly-built-modules branch.

@vsapsai
Copy link
Author

vsapsai commented May 7, 2025

An example of API usage is in swiftlang/swift-build#455

@vsapsai
Copy link
Author

vsapsai commented May 7, 2025

By the way, this doesn't work as a reproducer for dependency scanning, I'm just capturing the scanning results. My hope that with including the original command [used for scanning] and with the captured files you can still do something like

c-index-test core --scan-deps -- original_command_from_reproducer

After landing the proposed level of reproducing we can think how to support scan-deps better.

I was considering an alternative of invoking clang-scan-deps from the shell script and building modules based on the scanning results. But I've decided that embedding a little build system in a shell script is too much complexity and not justified at this stage. Maybe we'll see a need for something like that, then we can bite the bullet and add the required complexity.

@vsapsai
Copy link
Author

vsapsai commented May 14, 2025

I had an idea of adding DependencyScanningWorker to the reproducer generation API so that we can capture some internal worker state to help with the debugging. But after additional consideration I think it is too vague and unclear as I don't have specific use cases when it should help. And such functionality we can always add in a new API version.

@vsapsai vsapsai force-pushed the users/vsapsai/reproducer-explicitly-built-modules_01 branch from 5185b66 to dd4cc19 Compare May 14, 2025 21:58
@vsapsai vsapsai marked this pull request as ready for review May 14, 2025 22:00
@vsapsai
Copy link
Author

vsapsai commented May 22, 2025

Ping.

@jansvoboda11
Copy link

I'd like the API changes from users/vsapsai/reproducer-explicitly-built-modules (and the addition of const char *ModuleName to align with the scan API that we talked about offline) to be included in this PR. C APIs cannot change between Clang versions, so getting this right in the first commit makes the most sense to me.

vsapsai added a commit that referenced this pull request May 22, 2025
…modules. (#10577)

Capturing a single `-cc1` command to reproduce a bug with the explicitly-built
modules isn't sufficient, we need to know what .pcm files were built and how.

rdar://59743925
@vsapsai vsapsai force-pushed the users/vsapsai/reproducer-explicitly-built-modules_01 branch from dd4cc19 to 9977985 Compare May 22, 2025 22:07
vsapsai added a commit that referenced this pull request May 22, 2025
Tools using the reproducers can specify a custom location to support
scenarios similar to those covered by `-fcrash-diagnostics-dir` and
`CLANG_CRASH_DIAGNOSTICS_DIR`.
…modules. (#10577)

Capturing a single `-cc1` command to reproduce a bug with the explicitly-built
modules isn't sufficient, we need to know what .pcm files were built and how.

rdar://59743925
vsapsai added a commit that referenced this pull request May 22, 2025
Tools using the reproducers can specify a custom location to support
scenarios similar to those covered by `-fcrash-diagnostics-dir` and
`CLANG_CRASH_DIAGNOSTICS_DIR`.
@vsapsai vsapsai force-pushed the users/vsapsai/reproducer-explicitly-built-modules_01 branch from 9977985 to d440bf1 Compare May 22, 2025 23:28
@vsapsai
Copy link
Author

vsapsai commented May 22, 2025

Added ModuleName [that's not used right now]. And added another commit with ReproducerLocation to this branch.

Copy link

@jansvoboda11 jansvoboda11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but let's see if @Bigcheese or @benlangmuir have any objections.

Copy link

@benlangmuir benlangmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have an entry point in clang-scan-deps? It's unusual to have something in CDependencies that doesn't have an equivalent clang-scan-deps command-line interface, though it's not strictly necessary

* for a success case but is allowed to be empty when encountered an error.
*/
CINDEX_LINKAGE enum CXErrorCode
clang_experimental_DependencyScanner_generateReproducer(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should use CXDependencyScannerWorkerScanSettings instead of int argc, const char *const *argv, const char *ModuleName, const char *WorkingDirectory for the basic compilation setup.

How sure are we that const char *ReproducerLocation, bool UseUniqueReproducerName, are the only two reproducer-specific options we will ever want here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should use CXDependencyScannerWorkerScanSettings instead of int argc, const char *const *argv, const char *ModuleName, const char *WorkingDirectory for the basic compilation setup.

I was going back and forth about it and can be persuaded either way. My [weak] justification is that we don't pass the settings to a dependency service/worker/tool. So it's not a self-sufficient entity but more a convenience container that happens to be the same between different functions. So I avoid treating it as an independent entity.

How sure are we that const char *ReproducerLocation, bool UseUniqueReproducerName, are the only two reproducer-specific options we will ever want here?

For the potential enhancements I expect only something to help capture the scanner state. Unfortunately, I don't know what that would be. Maybe .pcm cache, maybe dumping stat cache, maybe something else. So far it is too vague and not the most important direction for the next steps. That's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My [weak] justification is that we don't pass the settings to a dependency service/worker/tool.

Isn't that exactly what we do? The difference is that you're creating a new one instead of using an existing one.

hat's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

If you want to make this future proof the C API should probably be much more opaque about the configuration, similar to what we do with the worker scan settings where the settings become their own opaque type and we can add new C functions over time to add new settings to that configuration object without breaking the existing API.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My [weak] justification is that we don't pass the settings to a dependency service/worker/tool.

Isn't that exactly what we do? The difference is that you're creating a new one instead of using an existing one.

I mean that in clang_experimental_DependencyScannerWorker_getDepGraph, for instance, Settings.MLO and Settings.MLOContext goes into DependencyScanningTool::createActionController. Settings.argc, Settings.argv, Settings.WorkingDirectory goes into computeDependencies. And Settings.ModuleName affects the function's logic. We don't use Settings as a single entity anywhere and taking it apart is the first thing we do.

hat's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

If you want to make this future proof the C API should probably be much more opaque about the configuration, similar to what we do with the worker scan settings where the settings become their own opaque type and we can add new C functions over time to add new settings to that configuration object without breaking the existing API.

Do you suggest using CXDependencyScannerWorkerScanSettings or introducing something like CXDependencyScannerReproducerSettings? I believe a single function with simple arguments is easier to work with and introducing clang_experimental_DependencyScanner_generateReproducer_v2 when necessary is not a big deal. But it's not a belief worth fighting, so I'm fine using whatever settings wrapper you think would be better.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settings.argc, Settings.argv, Settings.WorkingDirectory goes into computeDependencies. And Settings.ModuleName affects the function's logic.

The use of getTranslationUnitDependencies is basically the same here as the use of computeDependencies elsewhere. The fact you're not using the lookup callback could be a possible reason not to use CXDependencyScannerWorkerScanSettings.

Incidentally, why do you accept ModuleName and not use it?

Do you suggest using CXDependencyScannerWorkerScanSettings or introducing something like CXDependencyScannerReproducerSettings? I believe a single function with simple arguments is easier to work with and introducing clang_experimental_DependencyScanner_generateReproducer_v2 when necessary is not a big deal. But it's not a belief worth fighting, so I'm fine using whatever settings wrapper you think would be better.

I'm trying to understand what kind of changes are likely to come to this API over time, how it fits with the other scanner APIs, and so what makes sense to make it stable. Introducing a v2 is not necessarily a big deal, but one of these APIs made it to v5 very quickly and was annoying to work with in client code that needs to support multiple versions of the dylib, like swift-build does. Having explicit opaque settings objects is harder to work with when you have one version of an API, but it's easier to factor if we are supporting N versions of the API.

If you are confident that this API will not need more parameters, then I'm fine with keeping it like it is and we can make it opaque in a hypothetical v2 if there is an unexpected need to add one. If we already expect there is a good chance of needing changes to the API in future, then I'm inclined to make it opaque from the start.

DependencyScanningService DepsService(
ScanningMode::DependencyDirectivesScan, ScanningOutputFormat::Full,
CASOpts, /*CAS=*/nullptr, /*ActionCache=*/nullptr, FS);
DependencyScanningTool DepsTool(DepsService);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this API create its own service/worker/tool instead of taking a CXDependencyScannerWorker?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this API create its own service/worker/tool instead of taking a CXDependencyScannerWorker?

I wanted the reproducer scan to be independent from previous scans, that's why don't use an existing worker. It is a trade-off, not a uniquely correct approach. Reproducers should be useful in general, not only for debugging issues with EBM, that's why I'm leaning towards from-scratch scanning. And I hope it is easier to reason about than to take into account an invisible worker's state.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

independent from previous scans

Can you clarify what specifically you want to be independent from? Is it the file system cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less about reusing specific data but more about creating API that is easier to reason about. For example, I don't need to know implementation details that CXDependencyScannerWorker doesn't maintain state that affects the reproducer results. That generating a reproducer doesn't affect CXDependencyScannerWorker state in unpredictable way. That I don't need to be concerned about CXDependencyScannerWorker being accessed concurrently. In practice, passing CXDependencyScannerWorker works despite all of these concerns but there is no guarantee it won't change in the future.

llvm::SmallString<128> ReproScriptPath;
int ScriptFD;
if (ReproducerLocation) {
if (!llvm::sys::fs::exists(ReproducerLocation)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_directories has bool IgnoreExisting = true

@vsapsai
Copy link
Author

vsapsai commented May 23, 2025

Should this have an entry point in clang-scan-deps? It's unusual to have something in CDependencies that doesn't have an equivalent clang-scan-deps command-line interface, though it's not strictly necessary

I agree it is a good idea. But when I've started working on it, I've realized that for the proper API I need a build system ready to use it. And I don't have such a build system. I'm not strongly against it but I'd rather avoid developing functionality nobody uses.

Tools using the reproducers can specify a custom location to support
scenarios similar to those covered by `-fcrash-diagnostics-dir` and
`CLANG_CRASH_DIAGNOSTICS_DIR`.
@vsapsai vsapsai force-pushed the users/vsapsai/reproducer-explicitly-built-modules_01 branch from d440bf1 to fc7ae9f Compare May 23, 2025 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants