feat: Add random state feature. #150

Open
wants to merge 3 commits into base: john-development
21 changes: 1 addition & 20 deletions src/diffpy/snmf/main.py
@@ -7,27 +7,8 @@
Y0 = np.loadtxt("input/W0.txt", dtype=float)
N, M = MM.shape

-# Convert to DataFrames for display
-# df_X = pd.DataFrame(X, columns=[f"Comp_{i+1}" for i in range(X.shape[1])])
-# df_Y = pd.DataFrame(Y, columns=[f"Sample_{i+1}" for i in range(Y.shape[1])])
-# df_MM = pd.DataFrame(MM, columns=[f"Sample_{i+1}" for i in range(MM.shape[1])])
-# df_Y0 = pd.DataFrame(Y0, columns=[f"Sample_{i+1}" for i in range(Y0.shape[1])])
-
-# Print the matrices
-"""
-print("Feature Matrix (X):\n", df_X, "\n")
-print("Coefficient Matrix (Y):\n", df_Y, "\n")
-print("Data Matrix (MM):\n", df_MM, "\n")
-print("Initial Guess (Y0):\n", df_Y0, "\n")
-"""


-my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, components=2)
+my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, n_components=2)
print("Done")
-# print(f"My final guess for X: {my_model.X}")
-# print(f"My final guess for Y: {my_model.Y}")
-# print(f"Compare to true X: {X_norm}")
-# print(f"Compare to true Y: {Y_norm}")
np.savetxt("my_norm_X.txt", my_model.X, fmt="%.6g", delimiter=" ")
np.savetxt("my_norm_Y.txt", my_model.Y, fmt="%.6g", delimiter=" ")
np.savetxt("my_norm_A.txt", my_model.A, fmt="%.6g", delimiter=" ")
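The script saves each result matrix with `np.savetxt` using `fmt="%.6g"`. As a quick sanity check on that choice, here is a standalone sketch (the matrix and filename are made up for illustration) showing that the round trip through `%.6g` preserves values to roughly six significant digits:

```python
import numpy as np

# Hypothetical small matrix standing in for a result like my_model.X
X = np.array([[0.123456789, 1.0], [2.5e-7, 3.0]])

# Same call pattern as main.py: 6 significant digits, space-delimited
np.savetxt("demo_X.txt", X, fmt="%.6g", delimiter=" ")

# Values survive the round trip to ~6 significant digits
X_back = np.loadtxt("demo_X.txt", dtype=float)
print(np.allclose(X, X_back, rtol=1e-5))  # → True
```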
90 changes: 80 additions & 10 deletions src/diffpy/snmf/snmf_class.py
@@ -4,8 +4,79 @@


class SNMFOptimizer:
def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500, tol=5e-7, components=None):
Contributor: we need a docstring here and in the init. Please see the scikit-package FAQ about how to write these. Also, look at Yucong's code or diffpy.utils?

Author: Added one here. The package init dates back to the old codebase, but as soon as that is updated it will get a docstring as well.

print("Initializing SNMF Optimizer")
def __init__(
self,
MM,
Y0=None,
X0=None,
A=None,
Contributor: more descriptive name?

Author: There are many different standards for what to name these matrices. Zero agreement between sources that use NMF. I'm inclined to eventually use what sklearn.decomposition.non_negative_factorization uses, which would mean MM->X, X->W, Y->H. But I'd like to leave this as is for the moment until there's a consensus about what would be the most clear or standard. If people will be finding this tool from the sNMF paper, there's also an argument for using the X, Y, and A names because that was used there.
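For reference, the sklearn convention the author mentions factors a data matrix X of shape (n_samples, n_features) as X ≈ W @ H, which maps onto this PR's names as MM ≈ X @ Y. A minimal numpy sketch of the shape convention (all names and sizes here are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_components = 6, 4, 2

# sklearn convention: data X factors as X ≈ W @ H
W = rng.random((n_samples, n_components))   # "X" in this PR's naming
H = rng.random((n_components, n_features))  # "Y" in this PR's naming
X = W @ H                                   # "MM" in this PR's naming

print(X.shape)  # (6, 4)
```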

rho=1e12,
eta=610,
max_iter=500,
tol=5e-7,
n_components=None,
random_state=None,
):
"""Run sNMF based on an ndarray, parameters, and either a number
of components or a set of initial guess matrices.

Currently, instantiating the SNMFOptimizer class runs all the analysis
immediately. The results can then be accessed as instance attributes
of the class (X, Y, and A). Eventually, this will be changed so that
__init__ only prepares for the optimization, which can then be done
using fit_transform.

Parameters
----------
MM: ndarray
A numpy array containing the data to be decomposed. Rows correspond
to different samples/angles, while columns correspond to different
conditions with different stretching. Currently, there is no option
to treat the first column (commonly containing 2theta angles, sample
index, etc) differently, so if present it must be stripped in advance.
Y0: ndarray
A numpy array containing initial guesses for the component weights
at each stretching condition, with number of rows equal to the assumed
number of components and number of columns equal to the number of
conditions (same number of columns as MM). Must be provided if
n_components is not provided. Will override n_components if both are
provided.
X0: ndarray
A numpy array containing initial guesses for the intensities of each
component per row/sample/angle. Has rows equal to the rows of MM and
columns equal to n_components or the number of rows of Y0.
A: ndarray
A numpy array containing initial guesses for the stretching factor for
each component, at each condition. Has number of rows equal to n_components
or the number of rows of Y0, and columns equal to the number of conditions
(columns of MM).
rho: float
A stretching factor that influences the decomposition. Zero corresponds to
no stretching present. Relatively insensitive and typically adjusted in
powers of 10.
eta: float
A sparsity factor that influences the decomposition. Should be set to zero
for non-sparse data such as PDF. Can be used to improve results for sparse
data such as XRD but, due to instability, should be used only after first
selecting the best value for rho.
max_iter: int
The maximum number of times to update each of A, X, and Y before stopping
the optimization.
tol: float
The minimum fractional improvement in the objective function required to
continue; the optimization terminates once improvement falls below this
value. Note that a minimum of 20 updates are run before this parameter
is checked.
n_components: int
The number of components to attempt to extract from MM. Note that this will
be overridden by Y0 if that is provided, but must be provided if no Y0 is
provided.
random_state: int
Used to set a reproducible seed for the initial matrices used in the
optimization. Due to the non-convex nature of the problem, results may vary
even with the same initial guesses, so this does not make the program
deterministic.
"""
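The random_state description above says seeding makes the initial matrices reproducible. This can be checked in isolation with plain numpy, mirroring the class's Y0 initialization (the helper name and sizes below are hypothetical, not part of the PR):

```python
import numpy as np

def make_Y0(random_state, K=2, M=5):
    # Mirrors the class's Y0 initialization: Beta(2.5, 1.5) draws
    rng = np.random.default_rng(random_state)
    return rng.beta(a=2.5, b=1.5, size=(K, M))

Y0_a = make_Y0(random_state=42)
Y0_b = make_Y0(random_state=42)
print(np.array_equal(Y0_a, Y0_b))  # → True: same seed, identical initial guess
```

Note this only pins down the starting point; as the docstring says, the non-convex optimization itself may still vary between runs.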

self.MM = MM
Contributor: more descriptive name?

Author: Changed to n_components, which is what sklearn.decomposition.NMF uses.

self.X0 = X0
self.Y0 = Y0
@@ -15,23 +86,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
# Capture matrix dimensions
self.N, self.M = MM.shape
self.num_updates = 0
self.rng = np.random.default_rng(random_state)
Contributor: can we have a more descriptive variable name? Is this a range? What is the range?


if Y0 is None:
-if components is None:
-raise ValueError("Must provide either Y0 or a number of components.")
+if n_components is None:
+raise ValueError("Must provide either Y0 or n_components.")
else:
-self.K = components
-self.Y0 = np.random.beta(a=2.5, b=1.5, size=(self.K, self.M)) # This is untested
+self.K = n_components
+self.Y0 = self.rng.beta(a=2.5, b=1.5, size=(self.K, self.M))
else:
self.K = Y0.shape[0]

# Initialize A, X0 if not provided
if self.A is None:
-self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3 # Small perturbation
+self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))
Contributor: K and M are probably good names if the matrix decomposition equation is in the docstring, so they get defined there.

if self.X0 is None:
-self.X0 = np.random.rand(self.N, self.K) # Ensures values in [0,1]
+self.X0 = self.rng.random((self.N, self.K))

# Initialize solution matrices to be iterated on
self.X = np.maximum(0, self.X0)
self.Y = np.maximum(0, self.Y0)
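On the reviewer's question above: rng here is a numpy Generator created by np.random.default_rng, not a range. A standalone sketch of the three initializations in this hunk, pulled out into a function for clarity (the function name and the sizes passed to it are hypothetical):

```python
import numpy as np

def init_matrices(N, M, K, random_state=None):
    """Mirror of the initialization above: Y0 ~ Beta(2.5, 1.5),
    A ~ ones plus small Gaussian noise, X0 ~ uniform on [0, 1)."""
    rng = np.random.default_rng(random_state)  # a Generator, not a range
    Y0 = rng.beta(a=2.5, b=1.5, size=(K, M))
    A = np.ones((K, M)) + rng.normal(0, 1e-3, size=(K, M))  # small perturbation around 1
    X0 = rng.random((N, K))
    # Clip to enforce non-negativity, as the class does for X and Y
    return np.maximum(0, X0), np.maximum(0, Y0), A

X, Y, A = init_matrices(N=10, M=4, K=2, random_state=0)
print(X.shape, Y.shape, A.shape)  # (10, 2) (2, 4) (2, 4)
```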
