pymultifit submission #233

Open
14 of 32 tasks
syedalimohsinbukhari opened this issue Jan 21, 2025 · 14 comments
@syedalimohsinbukhari

syedalimohsinbukhari commented Jan 21, 2025

Submitting Author: Syed Ali Mohsin Bukhari (@syedalimohsinbukhari)
All current maintainers: (@syedalimohsinbukhari)
Package Name: pymultifit
One-Line Description of Package: A python library for fitting data with multiple models.
Repository Link: https://github.com/syedalimohsinbukhari/pyMultiFit
Version submitted: v1.0.6 (updated from v1.0.3)
EiC: @coatless
Editor: @Batalex
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

pymultifit is built primarily to solve one problem: fitting multiple models (and mixture models) to a given dataset. Whether the data call for multiple Gaussian components, multiple Laplacian components, or a mixture of such models, this package aims to handle multi-model data fitting. It also provides easy-to-use BaseDistribution and BaseFitter classes for user-defined functions.

Scope

  • Please indicate which category or categories.
    Check out our package scope page to learn more about our
    scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization [1]
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

Community Partnerships

If your package is associated with an
existing community please check below:

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?

Researchers, data scientists, and statisticians who work with datasets requiring multi-model fitting for robust analysis and modeling.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Apart from the general-purpose scientific packages scipy, lmfit, and scikit-learn, there is PyAutoFit, a Python-based probabilistic programming language built on Bayesian inference. Another notable library is Mixture-Models, which specializes in advanced optimization techniques for fitting various families of mixture models, including Gaussian mixture models and their variants. Both libraries are powerful tools for specific use cases; I came across them recently while searching for existing options.

While these libraries offer robust solutions for hierarchical modeling (PyAutoFit) or a diverse array of pre-defined mixture models (Mixture-Models), pyMultiFit distinguishes itself through its focus on simplicity of use. Specifically, it is designed to provide a lightweight and user-friendly framework for fitting multi-model data, including custom mixture models (for example, gaussian + laplace + line). pymultifit also provides easy-to-use base classes that can be adapted to any distribution or fitter.
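
To make the "gaussian + laplace + line" example concrete, here is a minimal sketch of what fitting such a custom mixture involves when done by hand with scipy.optimize.curve_fit. It does not use pymultifit's API; the function names, parameter values, and initial guess are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amp, mu, sigma):
    """Unnormalized Gaussian component."""
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def laplace(x, amp, mu, b):
    """Unnormalized Laplace component."""
    return amp * np.exp(-np.abs(x - mu) / b)


def line(x, slope, intercept):
    """Linear background."""
    return slope * x + intercept


def composite(x, g_amp, g_mu, g_sigma, l_amp, l_mu, l_b, slope, intercept):
    """Sum of the three components; all eight parameters are fitted jointly."""
    return (gaussian(x, g_amp, g_mu, g_sigma)
            + laplace(x, l_amp, l_mu, l_b)
            + line(x, slope, intercept))


# Hypothetical noise-free data generated from known parameters.
x = np.linspace(-10.0, 10.0, 400)
true_params = (5.0, -3.0, 1.0, 3.0, 4.0, 1.5, 0.1, 1.0)
y = composite(x, *true_params)

# The initial guess is an assumption; real data need a data-driven guess.
initial_guess = (4.0, -2.5, 1.2, 2.5, 3.5, 1.2, 0.0, 0.5)
fitted_params, covariance = curve_fit(composite, x, y, p0=initial_guess)
```

Every additional component means hand-writing a longer composite signature and a longer initial guess, which is the kind of bookkeeping the package description says it aims to remove.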

One of the more prominent features of pyMultiFit is the BaseFitter template class that provides custom fitting to any definable function with minimal boilerplate code. All the plotting and boundary functionalities are handled inside the template class so that the user can focus solely on running through multiple models quickly without thinking about how to manage multiple models of the same type or even of different types.

Additionally, the generators template function provides the user with an N-model data generator function with added noise capability to mimic real-life scenarios of whatever distribution the user might want.
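
As a rough illustration of the idea behind an N-model data generator with optional noise, the sketch below sums several Gaussian components and adds normally distributed noise. This is not pymultifit's generator; the names generate_multi_gaussian, params, and noise_level are hypothetical.

```python
import numpy as np


def gaussian(x, amplitude, mean, std):
    """Unnormalized Gaussian component."""
    return amplitude * np.exp(-0.5 * ((x - mean) / std) ** 2)


def generate_multi_gaussian(x, params, noise_level=0.0, seed=None):
    """Sum one Gaussian per (amplitude, mean, std) tuple, then add noise.

    noise_level is the standard deviation of additive Gaussian noise;
    0.0 returns the clean signal.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for amplitude, mean, std in params:
        y += gaussian(x, amplitude, mean, std)
    if noise_level > 0.0:
        rng = np.random.default_rng(seed)
        y += rng.normal(0.0, noise_level, size=x.shape)
    return y


x = np.linspace(-5.0, 5.0, 101)
components = [(2.0, -1.0, 0.5), (1.0, 3.0, 1.0)]
clean = generate_multi_gaussian(x, components)
noisy = generate_multi_gaussian(x, components, noise_level=0.1, seed=42)
```

Passing a seed keeps the noisy data reproducible, which helps when the generated data feed a test or benchmark.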

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration set up, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.

@coatless

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci
review. Below are the basic checks that your package needs to pass
to begin our review. If some of these are missing, we will ask you
to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements
below.

  • Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • The package imports properly into a standard Python environment (import package).
  • Fit The package meets criteria for fit and overlap.
  • Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • User-facing documentation that overviews how to install and start using the package.
    • Short tutorials that help a user understand how to use the package and what it can do for them.
    • API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • Core GitHub repository Files
    • README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • License The package has an OSI approved license.
      NOTE: We prefer that you have development instructions in your documentation too.
  • Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • Repository The repository link resolves correctly.
  • Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • Initial onboarding survey was filled out
    We appreciate each maintainer of the package filling out this survey individually. 🙌
    Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌


Editor comments

I think there's enough novelty behind the multifitter and distribution approaches discussed to move forward with a full review. (For clarity, scipy provides scipy.stats.fit() for a single DV or CV, whereas multiple and different supports are given by pymultifit via MixedDataFitter.) Moreover, there are ample tutorials and a solid case study of applying the package to solve real-world problems.

My only concern is regarding the accuracy benchmarks throwing RuntimeWarning notices. I think this is leaning into a discussion that appeared in the presubmission intake analysis regarding a re-implementation with numpy.

For example, with arcsine() and beta(a=5, b=80, loc=-3, scale=5) we have:

Beta

https://pymultifit.readthedocs.io/latest/benchmarks/_bm_accuracy.html#beta(a=5,-b=80,-loc=-3,-scale=5)

/home/sarl-ws-5/PycharmProjects/pyMultiFit/src/pymultifit/distributions/utilities_d.py:168: RuntimeWarning: overflow encountered in power
  numerator = y**(alpha - 1) * (1 - y)**(beta - 1)
/home/sarl-ws-5/PycharmProjects/pyMultiFit/src/pymultifit/distributions/utilities_d.py:168: RuntimeWarning: overflow encountered in multiply
  numerator = y**(alpha - 1) * (1 - y)**(beta - 1)
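
One common way to avoid this kind of overflow is to evaluate the density in log-space and only on the open support 0 < y < 1. The sketch below shows that general technique using numpy and math.lgamma; it is not the fix applied in pyMultiFit, and the function name beta_pdf_logspace is hypothetical. It returns 0 at the endpoints for simplicity, which is only correct when a > 1 and b > 1.

```python
import math

import numpy as np


def beta_pdf_logspace(x, a, b, loc=0.0, scale=1.0):
    """Beta pdf evaluated in log-space on the open support (0, 1).

    Points outside the open support get a density of 0, so large |y|
    never reaches the power terms that overflow in the direct formula.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = (x - loc) / scale
    pdf = np.zeros_like(y)

    inside = (y > 0.0) & (y < 1.0)
    yi = y[inside]

    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1.0) * np.log(yi) + (b - 1.0) * np.log1p(-yi) - log_norm

    pdf[inside] = np.exp(log_pdf) / scale
    return pdf


# Same parameters as the benchmark case above, on a range far wider than the support.
values = beta_pdf_logspace(np.linspace(-1e3, 1e3, 2001), a=5, b=80, loc=-3, scale=5)
```

Masking first matters as much as the logarithms: with b = 80, the term (1 - y)**(b - 1) overflows for y far outside the unit interval, and those points should have zero density anyway.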

Arcsine

https://pymultifit.readthedocs.io/latest/benchmarks/_bm_accuracy.html#arcsine()

/home/sarl-ws-5/PycharmProjects/pyMultiFit/src/pymultifit/distributions/utilities_d.py:71: RuntimeWarning: invalid value encountered in sqrt
  pdf_ = 1 / (np.pi * np.sqrt(y * (1 - y)))
/home/sarl-ws-5/PycharmProjects/pyMultiFit/src/pymultifit/distributions/utilities_d.py:71: RuntimeWarning: divide by zero encountered in divide
  pdf_ = 1 / (np.pi * np.sqrt(y * (1 - y)))
/home/sarl-ws-5/PycharmProjects/pyMultiFit/benchmarks/functions.py:30: RuntimeWarning: invalid value encountered in subtract
  pdf_abs_diff = np.abs(pdf_custom - pdf_scipy) + EPSILON
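
The sqrt and divide warnings above come from evaluating the formula outside or on the boundary of the support. A minimal sketch of one way to avoid them is to apply the formula only where 0 < y < 1 and to set the boundary and exterior values explicitly. The function name arcsine_pdf_masked is hypothetical, and this is not necessarily how pyMultiFit handles it.

```python
import numpy as np


def arcsine_pdf_masked(x, loc=0.0, scale=1.0):
    """Arcsine pdf that never evaluates sqrt or division off the open support."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = (x - loc) / scale
    pdf = np.zeros_like(y)  # density is 0 outside [0, 1]

    interior = (y > 0.0) & (y < 1.0)
    yi = y[interior]
    pdf[interior] = 1.0 / (np.pi * np.sqrt(yi * (1.0 - yi))) / scale

    # The density diverges at both endpoints of the support.
    pdf[(y == 0.0) | (y == 1.0)] = np.inf
    return pdf


result = arcsine_pdf_masked(np.array([-1.0, 0.0, 0.5, 1.0, 2.0]))
```

The endpoints still return np.inf, so a later subtraction of two such arrays can still produce nan; that part of the traceback needs separate handling.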

@lwasser lwasser moved this from pre-review-checks to seeking-editor in peer-review-status Jan 29, 2025
@syedalimohsinbukhari
Author

syedalimohsinbukhari commented Jan 29, 2025

Hi @coatless

Thanks for getting back. I've been meaning to address those issues for a while but was waiting for a response before I proceeded. I've updated the required functions, and they shouldn't give the same issues now. The only remaining issue is that for arcsine at x=1 (an edge case in testing), the pdf evaluates to np.inf in both scenarios, so subtracting the two values produces the invalid value warning.
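
For context, IEEE 754 arithmetic defines inf - inf as nan, which is what NumPy reports as "invalid value encountered in subtract". A minimal sketch of a comparison that treats matching infinities as zero difference is shown below. The helper name abs_diff_matching_inf is hypothetical, and whether matching infinities should count as agreement in an accuracy benchmark is a judgment call.

```python
import numpy as np


def abs_diff_matching_inf(a, b):
    """Absolute difference where entries that are the same infinity give 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # True only where both values are infinite with the same sign.
    both_same_inf = np.isinf(a) & np.isinf(b) & (a == b)

    # inf - inf yields nan; silence the warning and overwrite those entries.
    with np.errstate(invalid="ignore"):
        diff = np.abs(a - b)
    return np.where(both_same_inf, 0.0, diff)


custom = np.array([0.5, np.inf, np.inf])
reference = np.array([0.25, np.inf, 1.0])
difference = abs_diff_matching_inf(custom, reference)
```

An infinity compared against a finite value still yields inf, so genuine disagreements at the edge remain visible.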

Image

Cheers.

@coatless

@syedalimohsinbukhari any reason for not directly using np.arcsin()? I'm not having that issue with the built-in version:

Screenshot of a Jupyter notebook with a custom implementation throwing a similar warning vs. the implementation in NumPy of `np.arcsin()`

Could you restore the docstring for beta_pdf_():

syedalimohsinbukhari/pyMultiFit@517e6fe#diff-901f230daf84fc7cccf4819b150e039e69d3b060a1bcc2279a55f84ce2570da6L118

@syedalimohsinbukhari
Author

syedalimohsinbukhari commented Jan 29, 2025

Hi @coatless,

I think this image should clear things up. The ArcSineDistribution is not the same as the trigonometric np.arcsin; it is a probability distribution with pdf $f(y) = \dfrac{1}{\pi\sqrt{y(1-y)}}$.
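
The two are related, which may explain the confusion: the inverse sine appears in the distribution's CDF, $F(y) = \dfrac{2}{\pi}\arcsin(\sqrt{y})$, not in its pdf. The short check below illustrates this numerically; the function names are chosen here for illustration and are not pymultifit's API.

```python
import numpy as np


def arcsine_pdf(y):
    """pdf of the standard arcsine distribution on 0 < y < 1."""
    return 1.0 / (np.pi * np.sqrt(y * (1.0 - y)))


def arcsine_cdf(y):
    """CDF of the standard arcsine distribution; this is where arcsin appears."""
    return (2.0 / np.pi) * np.arcsin(np.sqrt(y))


# Differentiating the CDF numerically should recover the pdf.
y0 = 0.3
h = 1e-6
numerical_derivative = (arcsine_cdf(y0 + h) - arcsine_cdf(y0 - h)) / (2.0 * h)

# The trigonometric function alone is a different quantity from the pdf.
trig_value = np.arcsin(0.5)   # pi / 6
pdf_value = arcsine_pdf(0.5)  # 2 / pi
```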

For reference, wiki article and scipy docs.

Image

Thanks for the heads-up about beta_pdf_; strangely, it shows up in my local builds but not in recent RTD builds. I'll look into it further.

Image

@syedalimohsinbukhari
Author

Hi @coatless

Just letting you know that beta_pdf_() docstring is now showing up on RTD servers as well.

https://pymultifit.readthedocs.io/latest/distributions/utilities_d.html#pymultifit.distributions.utilities_d.beta_pdf_

@coatless

@syedalimohsinbukhari thanks! I'm going to work on getting an editor assigned to move the review forward.

@syedalimohsinbukhari
Author

@coatless Awesome!! Looking forward to it.

@syedalimohsinbukhari
Author

syedalimohsinbukhari commented Feb 18, 2025

Hi @coatless

I'm working on some updates and wanted to ask if it'll be okay to push the next version or should I just keep the version as is since I've already mentioned it for submission.

@coatless

@syedalimohsinbukhari go for it! We're still working on getting editors.

@coatless

coatless commented Mar 6, 2025

@syedalimohsinbukhari Thanks for your patience. I've secured an editor to further move the review along.

I am happy to announce that @Batalex will be the editor for your submission.

@lwasser lwasser moved this from seeking-editor to under-review in peer-review-status Mar 6, 2025
@syedalimohsinbukhari
Author

Hi @Batalex.

Welcome, and I'm looking forward to working with you.

@Batalex
Contributor

Batalex commented Mar 8, 2025

Hello @syedalimohsinbukhari, nice to meet you.
I am excited to be part of the pymultifit review, and I'll get started with my search for reviewers.

@syedalimohsinbukhari
Author

> Hi @coatless
>
> I'm working on some updates and wanted to ask if it'll be okay to push the next version or should I just keep the version as is since I've already mentioned it for submission.
>
> @syedalimohsinbukhari go for it! We're still working on getting editors.

Hi, @coatless, @Batalex

As mentioned previously, I've updated the submission version for pymultifit with some streamlining of functionalities without changing much infrastructure. The docs, examples, and tutorials are up to date with this version, v1.0.6.

@Batalex
Contributor

Batalex commented Mar 10, 2025

Hey, thanks for letting us know. My personal policy when it comes to the reviews I lead as the editor is as follows:

  • you can work as you please on the code base; there is no need to freeze anything during the review process. This is important because finding reviewers can take a while, and there is no point in asking you to delay the work multiple times for several months
  • we'll ask the reviewers to review the version submitted. Since there is no reviewer at this time, feel free to bump to the latest release.
  • however, as soon as reviewers are assigned to the review, the version submitted in the issue header will stay the same. Making sure that reviewers' concerns raised about an older codebase are properly addressed in the newer one will be your job
