
Commit b55dcae

Merge pull request #222 from basf/master

Release V1.2.0

2 parents 90f3547 + 25b88a3


53 files changed, +2576 −2320 lines

.github/workflows/pr-tests.yml
+50

@@ -0,0 +1,50 @@
+name: PR Unit Tests
+
+on:
+  pull_request:
+    branches:
+      - develop
+      - master  # Add any other branches where you want to enforce tests
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout Repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"  # Change this to match your setup
+
+      - name: Install Poetry
+        run: |
+          curl -sSL https://install.python-poetry.org | python3 -
+          echo "$HOME/.local/bin" >> $GITHUB_PATH
+          export PATH="$HOME/.local/bin:$PATH"
+
+      - name: Install Dependencies
+        run: |
+          python -m pip install --upgrade pip
+          poetry install
+          pip install pytest
+
+      - name: Install Package Locally
+        run: |
+          poetry build
+          pip install dist/*.whl  # Install the built package to fix "No module named 'mambular'"
+
+      - name: Run Unit Tests
+        env:
+          PYTHONPATH: ${{ github.workspace }}  # Ensure the package is discoverable
+        run: pytest tests/
+
+      - name: Verify Tests Passed
+        if: ${{ success() }}
+        run: echo "All tests passed! Pull request is allowed."
+
+      - name: Fail PR on Test Failure
+        if: ${{ failure() }}
+        run: exit 1  # This ensures the PR cannot be merged if tests fail

README.md
+62 −62
@@ -21,6 +21,17 @@
 
 Mambular is a Python library for tabular deep learning. It includes models that leverage the Mamba (State Space Model) architecture, as well as other popular models like TabTransformer, FTTransformer, TabM, and tabular ResNets. Check out our paper `Mambular: A Sequential Model for Tabular Deep Learning`, available [here](https://arxiv.org/abs/2408.06291). Also check out our paper introducing [TabulaRNN](https://arxiv.org/pdf/2411.17207) and analyzing the efficiency of NLP-inspired tabular models.
 
+<h3>⚡ What's New ⚡</h3>
+<ul>
+  <li>Individual preprocessing: preprocess each feature differently, use pre-trained models for categorical encoding</li>
+  <li>Extract latent representations of tables</li>
+  <li>Use embeddings as inputs</li>
+  <li>Define custom training metrics</li>
+</ul>
+
+
+
+
 <h3> Table of Contents </h3>
 
 - [🏃 Quickstart](#-quickstart)

@@ -30,7 +41,6 @@ Mambular is a Python library for tabular deep learning. It includes models that
 - [🛠️ Installation](#️-installation)
 - [🚀 Usage](#-usage)
 - [💻 Implement Your Own Model](#-implement-your-own-model)
-- [Custom Training](#custom-training)
 - [🏷️ Citation](#️-citation)
 - [License](#license)

@@ -103,6 +113,7 @@ pip install mamba-ssm
 <h2> Preprocessing </h2>
 
 Mambular simplifies data preprocessing with a range of tools designed for easy transformation of tabular data.
+Specify a default method, or a dictionary defining individual preprocessing methods for each feature.
 
 <h3> Data Type Detection and Transformation </h3>
 
@@ -116,6 +127,7 @@ Mambular simplifies data preprocessing with a range of tools designed for easy t
 - **Polynomial Features**: Automatically generates polynomial and interaction terms for numerical features, enhancing the ability to capture higher-order relationships.
 - **Box-Cox & Yeo-Johnson Transformations**: Performs power transformations to stabilize variance and normalize distributions.
 - **Custom Binning**: Enables user-defined bin edges for precise discretization of numerical data.
+- **Pre-trained Encoding**: Use sentence transformers to encode categorical features.
 
 
 
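For illustration, the per-feature preprocessing added in this hunk might be configured like this. A minimal sketch: the `feature_preprocessing` argument name and the method keys shown are assumptions for illustration, not confirmed API; check the Mambular docs.

```python
from mambular.models import MambularRegressor

# Default: one preprocessing method for all numerical features
model = MambularRegressor(numerical_preprocessing="ple")

# Hypothetical per-feature dictionary -- the `feature_preprocessing`
# argument name and the method keys are assumptions, not confirmed API
model = MambularRegressor(
    feature_preprocessing={
        "age": "standardization",  # z-score a numeric column
        "income": "box-cox",       # power transform for a skewed column
        "city": "pretrained",      # sentence-transformer encoding for a categorical column
    }
)
model.fit(X_train, y_train, max_epochs=50)
```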
@@ -147,6 +159,28 @@ preds = model.predict(X)
 preds = model.predict_proba(X)
 ```
 
+Get latent representations for each feature:
+```python
+# simple encoding
+model.encode(X)
+```
+
+Use unstructured data:
+```python
+# load pretrained models
+image_model = ...
+nlp_model = ...
+
+# create embeddings
+img_embs = image_model.encode(images)
+txt_embs = nlp_model.encode(texts)
+
+# fit model on tabular data and unstructured data
+model.fit(X_train, y_train, embeddings=[img_embs, txt_embs])
+```
+
+
+
 <h3> Hyperparameter Optimization</h3>
 Since all of the models are sklearn base estimators, you can use the built-in hyperparameter optimization from sklearn.
 
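Because the estimators are sklearn-compatible, a standard scikit-learn grid search applies directly. A minimal sketch: the `lr` and `d_model` names mirror the config fields shown later in this README; confirm the names the estimator actually exposes via `get_params()`.

```python
from sklearn.model_selection import GridSearchCV
from mambular.models import MambularRegressor

# Parameter names mirror the config fields in this README (lr, d_model);
# verify the exposed names with MambularRegressor().get_params()
param_grid = {
    "lr": [1e-4, 1e-3],
    "d_model": [32, 64],
}

search = GridSearchCV(MambularRegressor(), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```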
@@ -222,9 +256,11 @@ MambularLSS allows you to model the full distribution of a response variable, no
 - **studentt**: For data with heavier tails, useful with small samples.
 - **negativebinom**: For over-dispersed count data.
 - **inversegamma**: Often used as a prior in Bayesian inference.
+- **johnsonsu**: A four-parameter distribution with location, scale, skewness, and kurtosis parameters.
 - **categorical**: For data with more than two categories.
 - **Quantile**: For quantile regression using the pinball loss.
 
+
 These distribution classes make MambularLSS versatile in modeling various data types and distributions.
 
 
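A minimal sketch of fitting a distributional model with the newly added family. Passing `family` to `fit` mirrors earlier Mambular examples; verify against the documentation for this release.

```python
from mambular.models import MambularLSS

# Predict all four Johnson's SU parameters instead of a point estimate.
# Passing family to fit() follows earlier Mambular examples; verify
# against the current documentation.
model = MambularLSS(d_model=64, n_layers=4)
model.fit(X_train, y_train, max_epochs=150, family="johnsonsu")
```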
@@ -269,13 +305,16 @@ Here's how you can implement a custom model with Mambular:
 
 ```python
 from dataclasses import dataclass
+from mambular.configs import BaseConfig
 
 @dataclass
-class MyConfig:
+class MyConfig(BaseConfig):
     lr: float = 1e-04
     lr_patience: int = 10
     weight_decay: float = 1e-06
-    lr_factor: float = 0.1
+    n_layers: int = 4
+    pooling_method: str = "avg"
+
 ```
 
 2. **Second, define your model:**

@@ -290,22 +329,32 @@ Here's how you can implement a custom model with Mambular:
 class MyCustomModel(BaseModel):
     def __init__(
         self,
-        cat_feature_info,
-        num_feature_info,
+        feature_information: tuple,
         num_classes: int = 1,
         config=None,
         **kwargs,
     ):
-        super().__init__(**kwargs)
-        self.save_hyperparameters(ignore=["cat_feature_info", "num_feature_info"])
+        super().__init__(**kwargs)
+        self.save_hyperparameters(ignore=["feature_information"])
+        self.returns_ensemble = False
+
+        # embedding layer
+        self.embedding_layer = EmbeddingLayer(
+            *feature_information,
+            config=config,
+        )
 
-        input_dim = get_feature_dimensions(num_feature_info, cat_feature_info)
+        input_dim = np.sum(
+            [len(info) * self.hparams.d_model for info in feature_information]
+        )
 
         self.linear = nn.Linear(input_dim, num_classes)
 
-    def forward(self, num_features, cat_features):
-        x = num_features + cat_features
-        x = torch.cat(x, dim=1)
+    def forward(self, *data) -> torch.Tensor:
+        x = self.embedding_layer(*data)
+        B, S, D = x.shape
+        x = x.reshape(B, S * D)
+
 
         # Pass through linear layer
         output = self.linear(x)

@@ -329,60 +378,11 @@ Here's how you can implement a custom model with Mambular:
 ```python
 regressor = MyRegressor(numerical_preprocessing="ple")
 regressor.fit(X_train, y_train, max_epochs=50)
+
+regressor.evaluate(X_test, y_test)
 ```
 
-# Custom Training
-If you prefer to set up custom training, preprocessing, and evaluation, you can simply use the `mambular.base_models`.
-Just be careful that all base models expect lists of features as inputs: a list of numerical features and a list of categorical features. A custom training loop with random data could look like this.
 
-```python
-import torch
-import torch.nn as nn
-import torch.optim as optim
-from mambular.base_models import Mambular
-from mambular.configs import DefaultMambularConfig
-
-# Dummy data and configuration
-cat_feature_info = {
-    "cat1": {
-        "preprocessing": "imputer -> continuous_ordinal",
-        "dimension": 1,
-        "categories": 4,
-    }
-}  # Example categorical feature information
-num_feature_info = {
-    "num1": {"preprocessing": "imputer -> scaler", "dimension": 1, "categories": None}
-}  # Example numerical feature information
-num_classes = 1
-config = DefaultMambularConfig()  # Use the desired configuration
-
-# Initialize model, loss function, and optimizer
-model = Mambular(cat_feature_info, num_feature_info, num_classes, config)
-criterion = nn.MSELoss()  # Use MSE for regression; change as appropriate for your task
-optimizer = optim.Adam(model.parameters(), lr=0.001)
-
-# Example training loop
-for epoch in range(10):  # Number of epochs
-    model.train()
-    optimizer.zero_grad()
-
-    # Dummy Data
-    num_features = [torch.randn(32, 1) for _ in num_feature_info]
-    cat_features = [torch.randint(0, 5, (32,)) for _ in cat_feature_info]
-    labels = torch.randn(32, num_classes)
-
-    # Forward pass
-    outputs = model(num_features, cat_features)
-    loss = criterion(outputs, labels)
-
-    # Backward pass and optimization
-    loss.backward()
-    optimizer.step()
-
-    # Print loss for monitoring
-    print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")
-
-```
 
 # 🏷️ Citation
 

mambular/__version__.py
+1 −1

@@ -16,4 +16,4 @@
 #
 
 # The following line *must* be the last in the module, exactly as formatted:
-__version__ = "1.1.0"
+__version__ = "1.2.0"

mambular/arch_utils/layer_utils/attention_utils.py
+2 −9

@@ -5,7 +5,6 @@
 import torch.nn as nn
 import torch.nn.functional as F
 from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding
 
 
 class GEGLU(nn.Module):

@@ -25,7 +24,7 @@ def FeedForward(dim, mult=4, dropout=0.0):
 
 
 class Attention(nn.Module):
-    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
+    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
         super().__init__()
         inner_dim = dim_head * heads
         self.heads = heads

@@ -34,18 +33,13 @@ def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
         self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
         self.to_out = nn.Linear(inner_dim, dim, bias=False)
         self.dropout = nn.Dropout(dropout)
-        self.rotary = rotary
         dim = np.int64(dim / 2)
-        self.rotary_embedding = RotaryEmbedding(dim=dim)
 
     def forward(self, x):
         h = self.heads
         x = self.norm(x)
         q, k, v = self.to_qkv(x).chunk(3, dim=-1)
         q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))  # type: ignore
-        if self.rotary:
-            q = self.rotary_embedding.rotate_queries_or_keys(q)
-            k = self.rotary_embedding.rotate_queries_or_keys(k)
         q = q * self.scale
 
         sim = torch.einsum("b h i d, b h j d -> b h i j", q, k)

@@ -61,7 +55,7 @@
 
 
 class Transformer(nn.Module):
-    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary=False):
+    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout):
         super().__init__()
         self.layers = nn.ModuleList([])
 

@@ -74,7 +68,6 @@ def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary
                         heads=heads,
                         dim_head=dim_head,
                         dropout=attn_dropout,
-                        rotary=rotary,
                     ),
                     FeedForward(dim, dropout=ff_dropout),
                 ]
