
Added LayoutLMv3 #2178


Open

carrycooldude wants to merge 23 commits into master

Conversation

carrycooldude

Description

This PR fixes the LayoutLMv3 checkpoint conversion script to properly handle different spatial embedding dimensions between the base and large models. The base model uses 128 dimensions for all spatial embeddings, while the large model uses 171 dimensions for x/y coordinates and 170 dimensions for height/width.

Changes Made

  • Added dynamic detection of spatial embedding dimensions from the Hugging Face model
  • Implemented padding for smaller embeddings to match the maximum dimension
  • Updated projection matrices to use consistent dimensions
  • Added detailed debug output for spatial embedding shapes

Technical Details

The conversion script now:

  1. Detects individual dimensions for x, y, h, w embeddings
  2. Uses the maximum dimension (171 for large model) for all embeddings
  3. Pads smaller embeddings (170) with zeros to match the larger dimension (see the sketch after this list)
  4. Creates projection matrices with consistent dimensions
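
A minimal sketch of the padding in steps 2 and 3, assuming NumPy weight arrays (the names are illustrative, not the exact script):

```python
import numpy as np

def pad_spatial_embedding(weight, target_dim):
    """Zero-pad the last axis of an embedding table up to `target_dim`."""
    current_dim = weight.shape[-1]
    if current_dim == target_dim:
        return weight
    pad_width = [(0, 0)] * (weight.ndim - 1) + [(0, target_dim - current_dim)]
    return np.pad(weight, pad_width)

# For the large model, pad the 170-dim h/w tables up to the 171-dim maximum:
# h_weight = pad_spatial_embedding(h_weight, 171)
```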

Testing

  • Successfully converted both base and large models
  • Verified output shapes match expected dimensions
  • Confirmed no dimension mismatch errors during conversion

Output Example

[Screenshot from 2025-03-30 12-50-29]

@divyashreepathihalli (Collaborator)

@carrycooldude Thank you for the PR. The code structure does not match KerasHub style.
Please go through the guide here: https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING_MODELS.md
Take a look at other model folders.
What would the task model look like?
The preset file contents should be just metadata and the Kaggle Hub path.
Can you provide a model code usage example?
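
For reference, a hedged sketch of what such a preset entry usually looks like in KerasHub (the handle, description, and parameter count below are placeholders, not the real LayoutLMv3 presets):

```python
backbone_presets = {
    "layoutlmv3_base": {
        "metadata": {
            # Placeholder values for illustration only.
            "description": "LayoutLMv3 base model.",
            "params": 125000000,
            "path": "layoutlmv3",
        },
        # Placeholder Kaggle handle.
        "kaggle_handle": "kaggle://keras/layoutlmv3/keras/layoutlmv3_base",
    },
}
```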

@@ -0,0 +1,152 @@
"""Tests for LayoutLMv3 backbone."""
Collaborator

Remove this docstring at the start of the file.

@sachinprasadhs (Collaborator)

Adding general code-structuring comments.

  • Add all the files under the model directory only; we don't recommend using subdirectories.
  • We don't encourage using TensorFlow-specific operations like tf.*; the model design should be backend agnostic.
  • The code does not follow the general code format we use in KerasHub; I suggest you refer to other model implementations in detail.
  • Arguments need to be descriptive, with the type of data they accept, the default values, and so on.

Refer any existing model implementations here https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models

The test cases should also follow the template we use for the other models.

@sachinprasadhs (Collaborator) left a comment

I have added a few comments; most of them are general practices we follow. Incorporate those general suggestions across all the files.
Also remove the files and directories that are not required, like the env directory.

@@ -0,0 +1 @@

Collaborator

Remove this directory and file

Collaborator

This still needs to be removed

@@ -0,0 +1,4 @@
"""LayoutLMv3 document classifier."""
Collaborator

This file needs to be empty; all imports are handled in the keras_hub/api directory and are generated automatically whenever you run git commit -m "<message>".
Make sure you run pre-commit install the first time.

Collaborator

pending

@@ -0,0 +1,15 @@
from keras_hub.src.models.layoutlmv3.layoutlmv3_backbone import LayoutLMv3Backbone
Collaborator

This file is mainly to register presets; follow other models to understand the format we use.
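
For example, a registration file along those lines might contain little more than the following (a sketch assembled from the imports that appear later in this PR; check other model folders for the exact pattern):

```python
from keras_hub.src.models.layoutlmv3.layoutlmv3_backbone import (
    LayoutLMv3Backbone,
)
from keras_hub.src.models.layoutlmv3.layoutlmv3_presets import (
    backbone_presets,
)
from keras_hub.src.utils.preset_utils import register_presets

# Register the preset metadata against the backbone class.
register_presets(backbone_presets, LayoutLMv3Backbone)
```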

Collaborator

pending


def __init__(
self,
vocab_size: int = 30522,
Collaborator

Remove type annotations everywhere; we don't use type annotations in KerasHub.

Collaborator

The type annotations still need to be removed.

References:
- [LayoutLMv3 Paper](https://arxiv.org/abs/2204.08387)
- [LayoutLMv3 GitHub](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
"""
Collaborator

This entire docstring needs to be inside the Backbone class.

"""

import os
from typing import Dict, List, Optional, Tuple, Union
Collaborator

Remove this once the type annotations are removed.


from .layoutlmv3_tokenizer import LayoutLMv3Tokenizer
from .layoutlmv3_presets import backbone_presets
from .layoutlmv3_transformer import LayoutLMv3TransformerLayer
Collaborator

Change from relative imports to absolute imports everywhere.

Collaborator

Change from relative imports to absolute imports; we don't use from . import abc.
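
For example:

```python
# Instead of a relative import:
# from .layoutlmv3_tokenizer import LayoutLMv3Tokenizer
# use the absolute module path:
from keras_hub.src.models.layoutlmv3.layoutlmv3_tokenizer import (
    LayoutLMv3Tokenizer,
)
```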

maintaining spatial relationships in documents.

Args:
vocab_size: int, defaults to 30522. Size of the vocabulary.
Collaborator

The format for Args we follow is:
vocab_size: int. Size of the vocabulary. Defaults to 30522.

This format should be followed for all arguments, and make sure it conveys the complete required information.

```
"""

presets = backbone_presets
Collaborator

No need for this here.

Collaborator

You can keep the example, but we don't need presets = backbone_presets

self.use_rel_pos = use_rel_pos
self.rel_pos_bins = rel_pos_bins
self.max_rel_pos = max_rel_pos
self.spatial_embedding_dim = spatial_embedding_dim
Collaborator

This should come last.
You can follow the order below:

# === Layers ===

# === Functional Model ===

# === Config ===
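
A minimal sketch of that ordering, assuming the usual KerasHub functional backbone pattern (the class name and arguments here are illustrative, not the actual LayoutLMv3 code):

```python
import keras

from keras_hub.src.models.backbone import Backbone


class ExampleBackbone(Backbone):
    def __init__(self, vocab_size, hidden_dim, **kwargs):
        # === Layers ===
        self.token_embedding = keras.layers.Embedding(vocab_size, hidden_dim)

        # === Functional Model ===
        token_ids = keras.Input(shape=(None,), dtype="int32", name="token_ids")
        outputs = self.token_embedding(token_ids)
        super().__init__(
            inputs={"token_ids": token_ids}, outputs=outputs, **kwargs
        )

        # === Config ===
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
```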

@carrycooldude (Author)

@sachinprasadhs any updates on this one?

@sachinprasadhs (Collaborator)

The review comments are still not addressed; could you please fix those before I suggest any further changes?

@carrycooldude (Author)

> The review comments are still not addressed; could you please fix those before I suggest any further changes?

I think I fixed them; can you tell me which ones are still open?

@sachinprasadhs (Collaborator) left a comment

I've pointed out the places where the previous review comments were not addressed.

Also, remove the layoutmv3_env directory.

```
"""

presets = backbone_presets
Collaborator

You can keep the example, but we don't need presets = backbone_presets


def __init__(
self,
vocab_size: int = 30522,
Collaborator

The type annotations still need to be removed.

@@ -0,0 +1,4 @@
"""LayoutLMv3 document classifier."""
Collaborator

pending

@@ -0,0 +1,15 @@
from keras_hub.src.models.layoutlmv3.layoutlmv3_backbone import LayoutLMv3Backbone
Collaborator

pending

Comment on lines 1 to 15
# Copyright 2024 The Keras Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

Collaborator

Remove this.

Comment on lines 1 to 15
"""LayoutLMv3 tokenizer implementation.

This tokenizer inherits from WordPieceTokenizer and adds LayoutLMv3-specific
functionality for document understanding tasks.

Example:
```python
# Initialize the tokenizer
tokenizer = LayoutLMv3Tokenizer.from_preset("layoutlmv3_base")

# Tokenize text
tokens = tokenizer("Hello world!")
```
"""

Collaborator

Remove this; move the example inside LayoutLMv3Tokenizer if necessary.

Comment on lines 1 to 2
"""Tests for LayoutLMv3 tokenizer."""

Collaborator

Remove this

Comment on lines 9 to 10
from ..layoutlmv3.layoutlmv3_tokenizer import LayoutLMv3Tokenizer

Collaborator

No relative imports

Comment on lines 1 to 5
"""LayoutLMv3 transformer layer implementation.

This module implements the transformer layer used in the LayoutLMv3 model.
"""

Collaborator

Remove this

Comment on lines 6 to 7
from typing import Dict, Optional

Collaborator

No need for this.

github-actions (bot)

This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions bot added the stale label on May 24, 2025
@sachinprasadhs (Collaborator)

Hi, let us know once this PR is ready for review again. Thanks

@carrycooldude (Author)

@sachinprasadhs can you check this?

@carrycooldude (Author) left a comment

@sachinprasadhs Just made some changes

@sachinprasadhs (Collaborator) left a comment

I still see a lot of previous comments that were not addressed. I couldn't go through all the files, since all the comments need to be addressed to save everyone's time.

I would request that you go through each of the comments made in this PR so far, address each of them, and mark it as resolved once you have made the necessary changes.
If you have trouble understanding any of the comments, let me know; I would be happy to clarify them for you.
Also, add the necessary test cases; I see many empty files.
And maintain consistency across all the files.

Comment on lines 5 to 8
from keras_hub.src.models.layoutlmv3.layoutlmv3_tokenizer import (
LayoutLMv3Tokenizer,
)
from keras_hub.src.utils.preset_utils import register_presets
Collaborator

The LayoutLMv3Tokenizer import is not required here.

Comment on lines 10 to 15
__all__ = [
"LayoutLMv3Backbone",
"LayoutLMv3Tokenizer",
"LayoutLMv3TransformerLayer",
]

Collaborator

You can remove these lines


from .layoutlmv3_tokenizer import LayoutLMv3Tokenizer
from .layoutlmv3_presets import backbone_presets
from .layoutlmv3_transformer import LayoutLMv3TransformerLayer
Collaborator

Change from relative imports to absolute imports; we don't use from . import abc.

pad_token_id: int = 0,
position_embedding_type: str = "absolute",
use_cache: bool = True,
classifier_dropout: Optional[float] = None,
Collaborator

Remove the Optional type annotation; just keep the default None value.
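
For example, the arguments above would become:

```python
def __init__(
    self,
    pad_token_id=0,
    position_embedding_type="absolute",
    use_cache=True,
    classifier_dropout=None,  # plain default instead of Optional[float]
    **kwargs,
):
    ...
```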

Comment on lines 1 to 31
"""
LayoutLMv3 backbone model implementation.

This module implements the LayoutLMv3 model architecture as described in
"LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking"
(https://arxiv.org/abs/2204.08387).

The LayoutLMv3 model is a multimodal transformer that combines text, layout,
and visual information for document understanding tasks. It uses a unified
architecture to process both text and image inputs, with special attention to
spatial relationships in documents.

Example:
```python
# Initialize backbone from preset
backbone = LayoutLMv3Backbone.from_preset("layoutlmv3_base")

# Process document image and text
outputs = backbone({
"input_ids": input_ids, # Shape: (batch_size, seq_length)
"bbox": bbox, # Shape: (batch_size, seq_length, 4)
"attention_mask": attention_mask, # Shape: (batch_size, seq_length)
"image": image # Shape: (batch_size, height, width, channels)
})
```

References:
- [LayoutLMv3 Paper](https://arxiv.org/abs/2204.08387)
- [LayoutLMv3 GitHub](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
"""

Collaborator

Move this docstring inside the backbone class along with the args, and remove the rest of the docstring; it is redundant.

Comment on lines 111 to 112
presets = backbone_presets

Collaborator

Remove this line

@divyashreepathihalli (Collaborator)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This PR introduces the LayoutLMv3 model, including its backbone, tokenizer, and a checkpoint conversion script. The implementation requires further work to ensure correctness and functionality. Key areas needing attention include the model's call method, the transformer layer implementation, and the checkpoint conversion script.


# Process through transformer layers
hidden_states = [embeddings]
for layer in self.transformer_layers:


critical

The code iterates over self.transformer_layers, which is not defined. Use self.encoder_layers instead.

Suggested change
for layer in self.transformer_layers:
for layer in self.encoder_layers:

Comment on lines +5 to +40
@register_keras_serializable()
class LayoutLMv3TransformerLayer(layers.Layer):
def __init__(
self,
hidden_size,
num_attention_heads,
intermediate_size,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
layer_norm_eps=1e-12,
qkv_bias=True,
use_rel_pos=True,
rel_pos_bins=32,
max_rel_pos=128,
name=None,
**kwargs,
):
super().__init__(name=name, **kwargs)
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.qkv_bias = qkv_bias
self.use_rel_pos = use_rel_pos
self.rel_pos_bins = rel_pos_bins
self.max_rel_pos = max_rel_pos

def call(self, hidden_states, attention_mask=None, **kwargs):
# Minimal stub: just return hidden_states unchanged
return hidden_states


critical

The LayoutLMv3TransformerLayer is a stub and needs a full implementation, including the attention mechanism, intermediate dense layers, and other necessary components.
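
For illustration, a minimal attention-plus-FFN block built from Keras layers might look like this. It is a sketch only: it omits the relative position bias that the config arguments promise, and it assumes attention_mask is already broadcastable to (batch, target, source):

```python
import keras
from keras import layers


class LayoutLMv3TransformerLayerSketch(layers.Layer):
    """Illustrative sketch, not the final implementation."""

    def __init__(self, hidden_size, num_attention_heads, intermediate_size,
                 hidden_dropout_prob=0.1, layer_norm_eps=1e-12, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(
            num_heads=num_attention_heads,
            key_dim=hidden_size // num_attention_heads,
            dropout=hidden_dropout_prob,
        )
        self.attention_norm = layers.LayerNormalization(epsilon=layer_norm_eps)
        self.intermediate = layers.Dense(intermediate_size, activation="gelu")
        self.output_dense = layers.Dense(hidden_size)
        self.output_norm = layers.LayerNormalization(epsilon=layer_norm_eps)
        self.dropout = layers.Dropout(hidden_dropout_prob)

    def call(self, hidden_states, attention_mask=None):
        # Self-attention block with residual connection and post-layernorm.
        attention_output = self.attention(
            hidden_states, hidden_states, attention_mask=attention_mask
        )
        hidden_states = self.attention_norm(
            hidden_states + self.dropout(attention_output)
        )
        # Feed-forward block with residual connection and post-layernorm.
        ffn_output = self.output_dense(self.intermediate(hidden_states))
        return self.output_norm(hidden_states + self.dropout(ffn_output))
```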

Comment on lines +141 to +159
# Create projection matrices based on actual weight shapes
x_proj = np.random.normal(
0, 0.02, (spatial_embedding_dim, hf_config.hidden_size)
)
y_proj = np.random.normal(
0, 0.02, (spatial_embedding_dim, hf_config.hidden_size)
)
h_proj = np.random.normal(
0, 0.02, (spatial_embedding_dim, hf_config.hidden_size)
)
w_proj = np.random.normal(
0, 0.02, (spatial_embedding_dim, hf_config.hidden_size)
)

# Set weights for projection layers
keras_model.x_proj.set_weights([x_proj, np.zeros(hf_config.hidden_size)])
keras_model.y_proj.set_weights([y_proj, np.zeros(hf_config.hidden_size)])
keras_model.h_proj.set_weights([h_proj, np.zeros(hf_config.hidden_size)])
keras_model.w_proj.set_weights([w_proj, np.zeros(hf_config.hidden_size)])


critical

The projection layers for spatial embeddings (x_proj, y_proj, h_proj, w_proj) are being initialized with random weights. Load these weights from the original Hugging Face model instead.
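
A hedged sketch of that loading step. The attribute names (x_position_embeddings, etc.) follow the Hugging Face transformers LayoutLMv3 implementation; the exact shape mapping onto this PR's Dense projections (a possible transpose or slice) still needs to be verified against the real checkpoint:

```python
import numpy as np

# Copy the spatial embedding tables from the HF checkpoint instead of
# drawing random weights. Note: `embeddings` may live at
# `hf_model.layoutlmv3.embeddings` depending on the HF head class, and a
# transpose/reshape may be needed to match the Keras Dense kernel layout.
hf_embeddings = hf_model.embeddings
for axis in ("x", "y", "h", "w"):
    hf_weight = (
        getattr(hf_embeddings, f"{axis}_position_embeddings")
        .weight.detach().numpy()
    )
    proj_layer = getattr(keras_model, f"{axis}_proj")
    proj_layer.set_weights([hf_weight, np.zeros(hf_weight.shape[-1])])
```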

Comment on lines +96 to +161
def call(self, text, bbox=None, **kwargs):
"""Tokenize text and process bounding boxes.

Args:
text: A string or list of strings to tokenize.
bbox: Optional list of bounding box coordinates for each token. If
provided, should be a list of lists of [x0, y0, x1, y1]
coordinates.
**kwargs: Additional keyword arguments passed to the parent class.

Returns:
A dictionary containing:
- token_ids: Tensor of shape (batch_size, sequence_length)
containing token IDs
- padding_mask: Tensor of shape (batch_size, sequence_length)
containing padding mask
- attention_mask: Tensor of shape (batch_size, sequence_length)
containing attention mask
- bbox: Tensor of shape (batch_size, sequence_length, 4)
containing bounding box coordinates (if provided)
"""
# Tokenize input text
token_ids, padding_mask = super().call(text)

# Add [CLS] token at the beginning
batch_size = backend.shape(token_ids)[0]
cls_token_ids = (
backend.ones((batch_size, 1), dtype="int32") * self.cls_token_id
)
cls_token_mask = (
backend.ones((batch_size, 1), dtype="int32") * self.cls_token_mask
)

token_ids = backend.concatenate([cls_token_ids, token_ids], axis=1)
padding_mask = backend.concatenate(
[cls_token_mask, padding_mask], axis=1
)

# Add [SEP] token at the end
sep_token_ids = (
backend.ones((batch_size, 1), dtype="int32") * self.sep_token_id
)
sep_token_mask = (
backend.ones((batch_size, 1), dtype="int32") * self.sep_token_mask
)

token_ids = backend.concatenate([token_ids, sep_token_ids], axis=1)
padding_mask = backend.concatenate(
[padding_mask, sep_token_mask], axis=1
)

# Create attention mask
attention_mask = backend.cast(padding_mask, dtype="int32")

# Process bounding boxes
if bbox is not None:
bbox_tensor = backend.stack(bbox, axis=1)
else:
bbox_tensor = None

return {
"token_ids": token_ids,
"padding_mask": padding_mask,
"attention_mask": attention_mask,
"bbox": bbox_tensor,
}


high

The bbox list needs to be expanded to match the new token sequence after subword tokenization. Also, add dummy bounding boxes for special tokens.
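
A hedged sketch of that expansion, assuming word-level boxes plus a word-id mapping from the tokenizer; the helper name and the conventional [0, 0, 0, 0] / [1000, 1000, 1000, 1000] special-token boxes are illustrative, not part of this PR:

```python
def expand_bboxes(word_boxes, word_ids,
                  cls_box=(0, 0, 0, 0), sep_box=(1000, 1000, 1000, 1000)):
    """Repeat each word's box for every subword token derived from it."""
    token_boxes = [list(cls_box)]  # dummy box for [CLS]
    for word_id in word_ids:
        token_boxes.append(list(word_boxes[word_id]))
    token_boxes.append(list(sep_box))  # dummy box for [SEP]
    return token_boxes


# e.g. word_ids = [0, 0, 1] when word 0 splits into two subword tokens:
# expand_bboxes([[10, 10, 50, 20], [60, 10, 90, 20]], [0, 0, 1])
```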

Labels: none yet
Project status: In Progress
3 participants