Skip to content

[libclang] Add API to generate a reproducer for the explicitly-built modules. #10577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: next
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions clang/include/clang-c/Dependencies.h
Original file line number Diff line number Diff line change
Expand Up @@ -325,6 +325,38 @@ CINDEX_LINKAGE void
clang_experimental_DependencyScannerWorkerScanSettings_dispose(
CXDependencyScannerWorkerScanSettings);

/**
* Generates a reproducer to compile a requested file with required modules.
*
* Here the reproducer means the required input data with the commands to
* compile intermediate modules and a requested file. Required intermediate
* modules and the order of their compilation are determined by the function
* and don't need to be provided.
*
* \param argc the number of compiler invocation arguments (including argv[0]).
* \param argv the compiler driver invocation arguments (including argv[0]).
* \param ModuleName If non-NULL, reproduce building the named module and all
* the intermediate modules. Otherwise, reproduce building
* the whole translation unit.
* \param WorkingDirectory the directory in which the invocation runs.
* \param ReproducerLocation the directory where to store the reproducer files.
* If NULL, use a temporary location.
* \param UseUniqueReproducerName if reproducer files should have unique names
* to avoid collisions with existing files.
* \param [out] MessageOut A pointer to store the human-readable message
* describing the result of the operation. If non-NULL,
* owned and should be disposed by the caller.
*
* \returns \c CXError_Success on success; otherwise a non-zero \c CXErrorCode
* indicating the kind of error. \p MessageOut is guaranteed to be populated
* for a success case but is allowed to be empty when encountered an error.
*/
CINDEX_LINKAGE enum CXErrorCode
clang_experimental_DependencyScanner_generateReproducer(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should use CXDependencyScannerWorkerScanSettings instead of int argc, const char *const *argv, const char *ModuleName, const char *WorkingDirectory for the basic compilation setup.

How sure are we that const char *ReproducerLocation, bool UseUniqueReproducerName, are the only two reproducer-specific options we will ever want here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should use CXDependencyScannerWorkerScanSettings instead of int argc, const char *const *argv, const char *ModuleName, const char *WorkingDirectory for the basic compilation setup.

I was going back and forth about it and can be persuaded either way. My [weak] justification is that we don't pass the settings to a dependency service/worker/tool. So it's not a self-sufficient entity but more a convenience container that happens to be the same between different functions. So I avoid treating it as an independent entity.

How sure are we that const char *ReproducerLocation, bool UseUniqueReproducerName, are the only two reproducer-specific options we will ever want here?

For the potential enhancements I expect only something to help capture the scanner state. Unfortunately, I don't know what that would be. Maybe .pcm cache, maybe dumping stat cache, maybe something else. So far it is too vague and not the most important direction for the next steps. That's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My [weak] justification is that we don't pass the settings to a dependency service/worker/tool.

Isn't that exactly what we do? The difference is that you're creating a new one instead of using an existing one.

hat's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

If you want to make this future proof the C API should probably be much more opaque about the configuration, similar to what we do with the worker scan settings where the settings become their own opaque type and we can add new C functions over time to add new settings to that configuration object without breaking the existing API.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My [weak] justification is that we don't pass the settings to a dependency service/worker/tool.

Isn't that exactly what we do? The difference is that you're creating a new one instead of using an existing one.

I mean that in clang_experimental_DependencyScannerWorker_getDepGraph, for instance, Settings.MLO and Settings.MLOContext goes into DependencyScanningTool::createActionController. Settings.argc, Settings.argv, Settings.WorkingDirectory goes into computeDependencies. And Settings.ModuleName affects the function's logic. We don't use Settings as a single entity anywhere and taking it apart is the first thing we do.

hat's why I'd like to land something reasonably future-proof and iterate on it to see what else we might need.

If you want to make this future proof the C API should probably be much more opaque about the configuration, similar to what we do with the worker scan settings where the settings become their own opaque type and we can add new C functions over time to add new settings to that configuration object without breaking the existing API.

Do you suggest using CXDependencyScannerWorkerScanSettings or introducing something like CXDependencyScannerReproducerSettings? I believe a single function with simple arguments is easier to work with and introducing clang_experimental_DependencyScanner_generateReproducer_v2 when necessary is not a big deal. But it's not a belief worth fighting, so I'm fine using whatever settings wrapper you think would be better.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settings.argc, Settings.argv, Settings.WorkingDirectory goes into computeDependencies. And Settings.ModuleName affects the function's logic.

The use of getTranslationUnitDependencies is basically the same here as the use of computeDependencies elsewhere. The fact you're not using the lookup callback could be a possible reason not to use CXDependencyScannerWorkerScanSettings.

Incidentally, why do you accept ModuleName and not use it?

Do you suggest using CXDependencyScannerWorkerScanSettings or introducing something like CXDependencyScannerReproducerSettings? I believe a single function with simple arguments is easier to work with and introducing clang_experimental_DependencyScanner_generateReproducer_v2 when necessary is not a big deal. But it's not a belief worth fighting, so I'm fine using whatever settings wrapper you think would be better.

I'm trying to understand what kind of changes are likely to come to this API over time, how it fits with the other scanner APIs, and so what makes sense to make it stable. Introducing a v2 is not necessarily a big deal, but one of these APIs made it to v5 very quickly and was annoying to work with in client code that needs to support multiple versions of the dylib, like swift-build does. Having explicit opaque settings objects is harder to work with when you have one version of an API, but it's easier to factor if we are supporting N versions of the API.

If you are confident that this API will not need more parameters, then I'm fine with keeping it like it is and we can make it opaque in a hypothetical v2 if there is an unexpected need to add one. If we already expect there is a good chance of needing changes to the API in future, then I'm inclined to make it opaque from the start.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally, why do you accept ModuleName and not use it?

To be honest, I've added it mostly because Jan asked about it. I can see there might be a noticeable enough divergence between for-TU scanning and for-module scanning in the future, so it can be useful. But I'm not using it now because there is no data indicating it is useful. So this is "let's try and see how it works" approach. Having ModuleName in API seems like a plausible enough future-proofing step.

I'm trying to understand what kind of changes are likely to come to this API over time, how it fits with the other scanner APIs, and so what makes sense to make it stable. Introducing a v2 is not necessarily a big deal, but one of these APIs made it to v5 very quickly and was annoying to work with in client code that needs to support multiple versions of the dylib, like swift-build does. Having explicit opaque settings objects is harder to work with when you have one version of an API, but it's easier to factor if we are supporting N versions of the API.

If you are confident that this API will not need more parameters, then I'm fine with keeping it like it is and we can make it opaque in a hypothetical v2 if there is an unexpected need to add one. If we already expect there is a good chance of needing changes to the API in future, then I'm inclined to make it opaque from the start.

Fair enough, that v5 doesn't give off good vibes.

I think the main question to you would be if you see a need for some CAS special handling?

As for the known problems (that we need to be prepared to fix) we have

  • Failing reproducer commands, e.g., unescaped file paths with spaces. No API change required.
  • No integration with Swift. No API change required based on how scanning is integrated with Swift.
  • Non-relocatable precompiled headers. No API change required.
  • Unknown worker state that causes crashes/errors. Might require API change.
  • Stale built modules. Might require API change.

The last two are a little bit of a stretch because reproducers don't promise to help with reproducing intermittent issues. If you cannot reproduce an issue in a clean build, there is no magic in reproducers that makes intermittent issues predictable.

It is possible there are unknown issues that would require API changes but as far as I know, API is fairly stable in its current shape.

int argc, const char *const *argv, const char *ModuleName,
const char *WorkingDirectory, const char *ReproducerLocation,
bool UseUniqueReproducerName, CXString *MessageOut);

/**
* Produces the dependency graph for a particular compiler invocation.
*
Expand Down
42 changes: 42 additions & 0 deletions clang/test/Modules/reproducer-with-module-dependencies.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
// Test generating a reproducer for a modular build where required modules are
// built explicitly as separate steps.

// RUN: rm -rf %t
// RUN: split-file %s %t
//
// RUN: c-index-test core -gen-deps-reproducer -working-dir %t \
// RUN: -- clang-executable -c %t/reproducer.c -o %t/reproducer.o \
// RUN: -fmodules -fmodules-cache-path=%t | FileCheck %t/reproducer.c

// Test a failed attempt at generating a reproducer.
// RUN: not c-index-test core -gen-deps-reproducer -working-dir %t \
// RUN: -- clang-executable -c %t/failed-reproducer.c -o %t/reproducer.o \
// RUN: -fmodules -fmodules-cache-path=%t 2>&1 | FileCheck %t/failed-reproducer.c

// Test the content of a reproducer script.
// RUN: c-index-test core -gen-deps-reproducer -working-dir %t -o %t/repro-content \
// RUN: -- clang-executable -c %t/reproducer.c -o %t/reproducer.o \
// RUN: -fmodules -fmodules-cache-path=%t
// RUN: FileCheck %t/script-expectations.txt --input-file %t/repro-content/reproducer.sh

//--- modular-header.h
void fn_in_modular_header(void);

//--- module.modulemap
module Test { header "modular-header.h" export * }

//--- reproducer.c
// CHECK: Sources and associated run script(s) are located at:
#include "modular-header.h"

void test(void) {
fn_in_modular_header();
}

//--- failed-reproducer.c
// CHECK: fatal error: 'non-existing-header.h' file not found
#include "non-existing-header.h"

//--- script-expectations.txt
CHECK: clang-executable
CHECK: -fmodule-file=Test=reproducer.cache/explicitly-built-modules/Test-{{.*}}.pcm
33 changes: 33 additions & 0 deletions clang/tools/c-index-test/core_main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ enum class ActionType {
AggregateAsJSON,
ScanDeps,
ScanDepsByModuleName,
GenerateDepsReproducer,
UploadCachedJob,
MaterializeCachedJob,
ReplayCachedJob,
Expand Down Expand Up @@ -87,6 +88,8 @@ Action(cl::desc("Action:"), cl::init(ActionType::None),
"Get file dependencies"),
clEnumValN(ActionType::ScanDepsByModuleName, "scan-deps-by-mod-name",
"Get file dependencies by module name alone"),
clEnumValN(ActionType::GenerateDepsReproducer, "gen-deps-reproducer",
"Generate a reproducer for the file"),
clEnumValN(ActionType::UploadCachedJob, "upload-cached-job",
"Upload cached compilation data to upstream CAS"),
clEnumValN(ActionType::MaterializeCachedJob, "materialize-cached-job",
Expand Down Expand Up @@ -921,6 +924,27 @@ static int scanDeps(ArrayRef<const char *> Args, std::string WorkingDirectory,
return 1;
}

static int generateDepsReproducer(ArrayRef<const char *> Args,
std::string WorkingDirectory,
std::string ReproLocation) {
CXString MessageString;
auto DisposeMessageString = llvm::make_scope_exit([&]() {
clang_disposeString(MessageString);
});
CXErrorCode ExitCode =
clang_experimental_DependencyScanner_generateReproducer(
Args.size(), Args.data(), /*ModuleName=*/nullptr,
WorkingDirectory.c_str(),
ReproLocation.empty() ? nullptr : ReproLocation.c_str(),
/*UseUniqueReproducerName=*/ReproLocation.empty(), &MessageString);
if (ExitCode == CXError_Success) {
llvm::outs() << clang_getCString(MessageString) << "\n";
} else {
llvm::errs() << "error: " << clang_getCString(MessageString) << "\n";
}
return (ExitCode == CXError_Success) ? 0 : 1;
}

static int uploadCachedJob(std::string CacheKey, CXCASDatabases DBs) {
CXError Err = nullptr;
CXCASCachedCompilation CComp = clang_experimental_cas_getCachedCompilation(
Expand Down Expand Up @@ -1546,6 +1570,15 @@ int indextest_core_main(int argc, const char **argv) {
options::OutputDir, DBs, options::ModuleName);
}

if (options::Action == ActionType::GenerateDepsReproducer) {
if (options::WorkingDir.empty()) {
errs() << "error: missing -working-dir\n";
return 1;
}
return generateDepsReproducer(CompArgs, options::WorkingDir,
options::OutputFile);
}

if (options::Action == ActionType::UploadCachedJob) {
if (options::InputFiles.empty()) {
errs() << "error: missing cache key\n";
Expand Down
144 changes: 144 additions & 0 deletions clang/tools/libclang/CDependencies.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -317,6 +317,150 @@ void clang_experimental_DependencyScannerWorkerScanSettings_dispose(
delete unwrap(Settings);
}

namespace {
// Helper class to capture a returnable error code and to return a formatted
// message in a provided CXString pointer.
class MessageEmitter {
const CXErrorCode ErrorCode;
CXString *OutputString;
std::string Buffer;
llvm::raw_string_ostream Stream;

public:
MessageEmitter(CXErrorCode Code, CXString *Output)
: ErrorCode(Code), OutputString(Output), Stream(Buffer) {}
~MessageEmitter() {
if (OutputString)
*OutputString = clang::cxstring::createDup(Buffer.c_str());
}

operator CXErrorCode() const { return ErrorCode; }

template <typename T> MessageEmitter &operator<<(const T &t) {
Stream << t;
return *this;
}
};
} // end anonymous namespace

enum CXErrorCode clang_experimental_DependencyScanner_generateReproducer(
int argc, const char *const *argv, const char *ModuleName,
const char *WorkingDirectory, const char *ReproducerLocation,
bool UseUniqueReproducerName, CXString *MessageOut) {
auto Report = [MessageOut](CXErrorCode ErrorCode) -> MessageEmitter {
return MessageEmitter(ErrorCode, MessageOut);
};
auto ReportFailure = [&Report]() -> MessageEmitter {
return Report(CXError_Failure);
};

if (argc < 2 || !argv)
return Report(CXError_InvalidArguments) << "missing compilation command";
if (!WorkingDirectory)
return Report(CXError_InvalidArguments) << "missing working directory";
if (!UseUniqueReproducerName && !ReproducerLocation)
return Report(CXError_InvalidArguments)
<< "non-unique reproducer is allowed only in a custom location";

CASOptions CASOpts;
IntrusiveRefCntPtr<llvm::cas::CachingOnDiskFileSystem> FS;
DependencyScanningService DepsService(
ScanningMode::DependencyDirectivesScan, ScanningOutputFormat::Full,
CASOpts, /*CAS=*/nullptr, /*ActionCache=*/nullptr, FS);
DependencyScanningTool DepsTool(DepsService);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this API create its own service/worker/tool instead of taking a CXDependencyScannerWorker?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this API create its own service/worker/tool instead of taking a CXDependencyScannerWorker?

I wanted the reproducer scan to be independent from previous scans, that's why don't use an existing worker. It is a trade-off, not a uniquely correct approach. Reproducers should be useful in general, not only for debugging issues with EBM, that's why I'm leaning towards from-scratch scanning. And I hope it is easier to reason about than to take into account an invisible worker's state.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

independent from previous scans

Can you clarify what specifically you want to be independent from? Is it the file system cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less about reusing specific data but more about creating API that is easier to reason about. For example, I don't need to know implementation details that CXDependencyScannerWorker doesn't maintain state that affects the reproducer results. That generating a reproducer doesn't affect CXDependencyScannerWorker state in unpredictable way. That I don't need to be concerned about CXDependencyScannerWorker being accessed concurrently. In practice, passing CXDependencyScannerWorker works despite all of these concerns but there is no guarantee it won't change in the future.


llvm::SmallString<128> ReproScriptPath;
int ScriptFD;
if (ReproducerLocation) {
if (auto EC = llvm::sys::fs::create_directories(ReproducerLocation))
return ReportFailure() << "failed to create a reproducer location '"
<< ReproducerLocation << "'\n"
<< EC.message();
SmallString<128> Path(ReproducerLocation);
llvm::sys::path::append(Path, "reproducer");
const char *UniqueSuffix = UseUniqueReproducerName ? "-%%%%%%" : "";
if (auto EC = llvm::sys::fs::createUniqueFile(Path + UniqueSuffix + ".sh",
ScriptFD, ReproScriptPath))
return ReportFailure() << "failed to create a reproducer script file\n"
<< EC.message();
} else {
if (auto EC = llvm::sys::fs::createTemporaryFile(
"reproducer", "sh", ScriptFD, ReproScriptPath)) {
return ReportFailure() << "failed to create a reproducer script file\n"
<< EC.message();
}
}
SmallString<128> FileCachePath = ReproScriptPath;
llvm::sys::path::replace_extension(FileCachePath, ".cache");

std::string FileCacheName = llvm::sys::path::filename(FileCachePath).str();
auto LookupOutput = [&FileCacheName](const ModuleDeps &MD,
ModuleOutputKind MOK) -> std::string {
if (MOK != ModuleOutputKind::ModuleFile)
return "";
return FileCacheName + "/explicitly-built-modules/" +
MD.ID.ModuleName + "-" + MD.ID.ContextHash + ".pcm";
};

std::vector<std::string> Compilation{argv, argv + argc};
llvm::DenseSet<ModuleID> AlreadySeen;
auto TUDepsOrErr = DepsTool.getTranslationUnitDependencies(
Compilation, WorkingDirectory, AlreadySeen, std::move(LookupOutput));
if (!TUDepsOrErr)
return ReportFailure() << "failed to generate a reproducer\n"
<< toString(TUDepsOrErr.takeError());

TranslationUnitDeps TU = *TUDepsOrErr;
llvm::raw_fd_ostream ScriptOS(ScriptFD, /*shouldClose=*/true);
ScriptOS << "# Original command:\n#";
for (StringRef Arg : Compilation)
ScriptOS << ' ' << Arg;
ScriptOS << "\n\n";

ScriptOS << "# Dependencies:\n";
std::string ReproExecutable = std::string(argv[0]);
auto PrintArguments = [&ReproExecutable,
&FileCacheName](llvm::raw_fd_ostream &OS,
ArrayRef<std::string> Arguments) {
OS << ReproExecutable;
for (int I = 0, E = Arguments.size(); I < E; ++I)
OS << ' ' << Arguments[I];
OS << " -ivfsoverlay \"" << FileCacheName << "/vfs/vfs.yaml\"";
OS << '\n';
};
for (ModuleDeps &Dep : TU.ModuleGraph)
PrintArguments(ScriptOS, Dep.getBuildArguments());
ScriptOS << "\n# Translation unit:\n";
for (const Command &BuildCommand : TU.Commands)
PrintArguments(ScriptOS, BuildCommand.Arguments);

SmallString<128> VFSCachePath = FileCachePath;
llvm::sys::path::append(VFSCachePath, "vfs");
std::string VFSCachePathStr = VFSCachePath.str().str();
llvm::FileCollector FileCollector(VFSCachePathStr,
/*OverlayRoot=*/VFSCachePathStr);
for (const auto &FileDep : TU.FileDeps) {
FileCollector.addFile(FileDep);
}
for (ModuleDeps &ModuleDep : TU.ModuleGraph) {
ModuleDep.forEachFileDep([&FileCollector](StringRef FileDep) {
FileCollector.addFile(FileDep);
});
}
if (FileCollector.copyFiles(/*StopOnError=*/true))
return ReportFailure()
<< "failed to copy the files used for the compilation";
SmallString<128> VFSOverlayPath = VFSCachePath;
llvm::sys::path::append(VFSOverlayPath, "vfs.yaml");
if (FileCollector.writeMapping(VFSOverlayPath))
return ReportFailure() << "failed to write a VFS overlay mapping";

return Report(CXError_Success)
<< "Created a reproducer. Sources and associated run script(s) are "
"located at:\n "
<< FileCachePath << "\n " << ReproScriptPath;
}

enum CXErrorCode clang_experimental_DependencyScannerWorker_getDepGraph(
CXDependencyScannerWorker W,
CXDependencyScannerWorkerScanSettings CXSettings, CXDepGraph *Out) {
Expand Down
1 change: 1 addition & 0 deletions clang/tools/libclang/libclang.map
Original file line number Diff line number Diff line change
Expand Up @@ -496,6 +496,7 @@ LLVM_16 {
clang_experimental_cas_replayCompilation;
clang_experimental_cas_ReplayResult_dispose;
clang_experimental_cas_ReplayResult_getStderr;
clang_experimental_DependencyScanner_generateReproducer;
clang_experimental_DependencyScannerService_create_v1;
clang_experimental_DependencyScannerServiceOptions_create;
clang_experimental_DependencyScannerServiceOptions_dispose;
Expand Down