Use cmdstan config and decouple stanfit objects from RunSet#851
Use cmdstan config and decouple stanfit objects from RunSet#851amas0 wants to merge 9 commits intostan-dev:developfrom
Conversation
Bumps [actions/cache](https://github.com/actions/cache) from 4 to 5. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](actions/cache@v4...v5) --- updated-dependencies: - dependency-name: actions/cache dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
WardBrian
left a comment
There was a problem hiding this comment.
A few comments/questions, but overall I like the look of the code!
Responding to your comments in the PR:
- Runset being left out of the result entirely makes sense. We will have to see how this feels when we tackle something with more moving parts like the MCMC class, but for now I agree.
- Repr changes make sense.
- Stdout file is good. We may want to also do the same thing with diagnostic files and profiling files, which gets us closer to just including the whole Runset again... maybe we have a small dataclass that is like a cut-down runset and stores the relevant files, can be re-used across all of them?
- I agree, it makes sense to replace
csvwithoutputacross the board, both in the save functions and thefrom_csvfunction. This would make sense going forward anyway, c.f. stan-dev/design-docs#58 - Yep, totally agree that InferenceMetadata might want to take a hike when all is said and done. The header could be stored in each class, or in the cut-down-Runset idea ("
FileInfo"?)
| model_name: str | ||
| csv_file: str | ||
| config: PathfinderConfig | ||
| config_file: str | None = None |
There was a problem hiding this comment.
Is this really optional? Doesn't look like it in from_files
There was a problem hiding this comment.
It's optional in the sense that if you pass a config in directly, you don't need to reference the file itself. Outside of simplifying some testing, I'm not sure if it will come up much? The intended entrypoint in practice is always going to be the from_files.
| metadata = InferenceMetadata.from_csv(csv_file) | ||
| return cls( | ||
| metadata=metadata, | ||
| csv_file=str(csv_file), |
There was a problem hiding this comment.
The super-correct way to convert a str|os.PathLike to a str is os.fspath()
| metadata=metadata, | ||
| csv_file=str(csv_file), | ||
| model_name=stan_config.model_name, | ||
| config=stan_config.method_config, |
There was a problem hiding this comment.
It's obviously nice that we don't have any downstream isinstance checks, but it is a bit of a bummer that we end up throwing away the extra information that is present in the StanConfig (e.g. model name, etc). Is there some trick to keep it all around, while maintaining that this is a Pathfinder-specific config?
There was a problem hiding this comment.
Good point. We could just store references to both stan_config and config? The top level StanConfig would still contain the inference config within it, but it would just be the same object. Storing both would keep the top level info and still let us have config be validated that the method is correct. Keeps the type checker happy and doesn't drop any info. Thoughts?
There was a problem hiding this comment.
That seems fine. I was wondering if pydantic had a more elegant way, but it's not like doing that actually costs any extra memory or anything
There was a problem hiding this comment.
We could do something fancy like make StanConfig a generic, which pydantic has good support for:
from typing import Generic, TypeVar
AnyMethodConfig = Annotated[
SampleConfig | OptimizeConfig | PathfinderConfig
| LaplaceConfig | VariationalConfig | GeneratedQuantitiesConfig,
Discriminator("method"),
]
MethodT = TypeVar("MethodT", bound=BaseModel, default=AnyMethodConfig)
class StanConfig(BaseModel, Generic[MethodT]):
model_config = ConfigDict(extra="allow")
model_name: str
stan_major_version: str
stan_minor_version: str
stan_patch_version: str
method_config: MethodTThen for example we could have:
@dataclass
class CmdStanPathfinder:
# other fields ...
config: StanConfig[PathfinderConfig]And parsing like StanConfig[PathfinderConfig].model_validate(jsons) would validate it against the Pathfinder instance of the config. This would result in only one reference to the config but the type checker would know that it has the pathfinder configuration fields within method_config.
There was a problem hiding this comment.
That seems like a good representation of what's actually happening under the hood
| f'found {len(csvfiles)}' | ||
| ) | ||
| csv_file = csvfiles[0] | ||
| config_file = os.path.splitext(csv_file)[0] + '_config.json' |
There was a problem hiding this comment.
I assume this is temporary but it's worth putting a comment saying so, since this is kind of a nasty assumption
There was a problem hiding this comment.
Yeah definitely temporary. I was planning on doing away with this function entirely once the full changes are in.
Draft PR for initial progress on config parsing for #785 and implementing ideas discussed in #848.
@WardBrian I'd like you to take a look at this initial draft which implements pydantic-based parsing of the config json files and uses that to refactor
CmdStanPathfinderto be built entirely from output files (and decouples it from the runset).The changes I'd like you to look at are in
stanfit/pathfinder.pyandstanfit/metadata.py-- the others files are mostly just temporary patches to have the library in a functional state.The
CmdStanPathfinderclass has been rewritten as a dataclass that defines afrom_files(...)method as a constructor to build it from the output files. I also prefer the clarity that this gives to what attributes the class has.Some areas where I'd appreciate thoughts:
reprany reference to the method_args, building directly from files made it seem that was no longer desirable.stdout_fileas an optional attribute. With no reference to the runset, it still seems like something that might be nice to have? But the stdout info is more like process diagnostic than an essential component of the stanfit?save_csvfilesprobably needs to be called something likesave_output_files, but I haven't changed that yet.InferenceMetadataobject more or less only scan the header line for stan variables, I think. So it may be worth a think as to what that should look like?If the shape of looks to be in the right direction, I'll proceed with doing this for the other stanfit objects. Following that, we should be able to really clean up what we have going on in
stanfit/__init__.pyand should be able to drop all the code that we have for parsing out the headers of the stan csvs.