
optee 3.21.0: core: mobj_reg_shm_dec_map #6120

Closed
omasse-linaro opened this issue Jun 15, 2023 · 15 comments
@omasse-linaro
Contributor

Hi

I'm getting an assert in reg_shm_unmap_helper when performing a simple stress test: two threads invoke, in an infinite loop, a ping command to a TA with an input shared memory buffer as parameter.

E/TC:1 1 assertion 'r->mm' failed at core/arch/arm/mm/mobj_dyn_shm.c:117 <reg_s>
E/TC:1 1 Panic at core/kernel/assert.c:28 <_assert_break>
E/TC:1 1 TEE load address @ 0xe1b32000

It seems that reg_shm_unmap_helper is called by mobj_reg_shm_dec_map, when the mapcount reaches 0, on an already unmapped object.

Looking at refcount.h (core/include/kernel/refcount.h), it specifies that when refcount_dec(r) returns true, the object o may be uninitialized if refcount_val(r) is still 0 AND o is not NULL.

However, mobj_reg_shm_dec_map in core/arch/arm/mm/mobj_dyn_shm.c does not check that r->mm is non-NULL before calling reg_shm_unmap_helper.

With that check added, my stress test runs fine without the assert.
Could we have a race condition when two threads call mobj_reg_shm_dec_map at the same time?

After doing refcount_dec on mapcount, thread 1 takes the lock before thread 2, calls reg_shm_unmap_helper and releases the lock.
Thread 2 then takes the lock and calls reg_shm_unmap_helper as well, which fails with assert(r->mm).
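To make the discussion concrete, here is a minimal, hypothetical stand-in for refcount_dec() matching the semantics described above (the real OP-TEE implementation lives in core/include/kernel/refcount.h and may differ in detail). The key property is that at most one caller can observe the transition to zero:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for OP-TEE's refcount_dec(): atomically
 * decrements and returns true only for the caller whose decrement
 * takes the count to 0. */
static bool refcount_dec(atomic_uint *r)
{
	unsigned int old = atomic_load(r);
	unsigned int nval;

	do {
		assert(old);	/* underflow would mean unbalanced calls */
		nval = old - 1;
	} while (!atomic_compare_exchange_weak(r, &old, nval));

	return nval == 0;
}
```

With balanced inc/dec pairs, only one thread can get a true return for a given drop to zero, so two threads reaching reg_shm_unmap_helper() for the same mapping would imply the calls were unbalanced somewhere.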

@jenswi-linaro
Contributor

Each call to mobj_dec_map() must be matched by a previous call to mobj_inc_map(). When that rule is followed, the race you're describing can't occur.

I believe the real problem is that the mobj_inc_map() and mobj_dec_map() calls are unbalanced for some reason.

@omasse-linaro
Contributor Author

Indeed, I see that you have fixed a race issue in mobj_reg_shm_inc_map, but why not just take the reg_shm_map_lock spinlock before incrementing or decrementing the map refcount?

@jenswi-linaro
Contributor

Does that fix your problem?

The point with refcount_inc() and refcount_dec() is that they can be called without holding a lock. If we're going to hold a lock then we can access the counter directly instead.
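A sketch of why refcount_inc() can be safe without the lock (again a hypothetical stand-in, not the actual OP-TEE code): it only increments a non-zero count, so it can never revive an object that may be mid-teardown; a false return pushes the caller onto a slow path that must take the lock:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for OP-TEE's refcount_inc(): increments only
 * if the count is currently non-zero. A false return means the object
 * may be being torn down, so the caller must fall back to a slow path
 * that holds the lock. */
static bool refcount_inc(atomic_uint *r)
{
	unsigned int old = atomic_load(r);

	while (old) {
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return true;
		/* CAS failure reloads 'old'; the loop re-checks for zero */
	}
	return false;
}
```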

@omasse-linaro
Contributor Author

omasse-linaro commented Jun 21, 2023

Yes, I have just reproduced the issue with QEMU as well.
(screenshot attached: Screenshot from 2023-06-21 11-27-52)

Fixed with the following patch:

diff --git a/core/arch/arm/mm/mobj_dyn_shm.c b/core/arch/arm/mm/mobj_dyn_shm.c
index 3e2ac5022..158e25211 100644
--- a/core/arch/arm/mm/mobj_dyn_shm.c
+++ b/core/arch/arm/mm/mobj_dyn_shm.c
@@ -185,11 +185,11 @@ static TEE_Result mobj_reg_shm_inc_map(struct mobj *mobj)
        size_t sz = 0;
 
        while (true) {
-               if (refcount_inc(&r->mapcount))
-                       return TEE_SUCCESS;
-
                exceptions = cpu_spin_lock_xsave(&reg_shm_map_lock);
 
+               if (refcount_inc(&r->mapcount))
+                       goto out;
+
                if (!refcount_val(&r->mapcount))
                        break; /* continue to reinitialize */
                /*
@@ -233,14 +233,15 @@ static TEE_Result mobj_reg_shm_dec_map(struct mobj *mobj)
        struct mobj_reg_shm *r = to_mobj_reg_shm(mobj);
        uint32_t exceptions = 0;
 
-       if (!refcount_dec(&r->mapcount))
-               return TEE_SUCCESS;
-
        exceptions = cpu_spin_lock_xsave(&reg_shm_map_lock);
 
+       if (!refcount_dec(&r->mapcount))
+               goto out;
+
        if (!refcount_val(&r->mapcount))
                reg_shm_unmap_helper(r);
 
+out:
        cpu_spin_unlock_xrestore(&reg_shm_map_lock, exceptions);
 
        return TEE_SUCCESS;
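The control flow of the patched mobj_reg_shm_dec_map() can be illustrated with a self-contained, single-address-space sketch (the spinlock, counter, and unmap helper below are stand-ins for the OP-TEE ones, not the real implementations): both the decrement and the zero test happen with the lock held, so no other thread can slip in between them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for the pieces named in the patch (hypothetical). */
static atomic_flag reg_shm_map_lock = ATOMIC_FLAG_INIT;
static atomic_uint mapcount;
static int unmap_calls;		/* counts reg_shm_unmap_helper() runs */

static void reg_shm_unmap_helper(void)
{
	unmap_calls++;		/* the real helper tears down the mapping */
}

/* Same semantics as in the thread: true when the count reaches 0. */
static bool refcount_dec(atomic_uint *r)
{
	unsigned int old = atomic_load(r);
	unsigned int nval;

	do {
		nval = old - 1;
	} while (!atomic_compare_exchange_weak(r, &old, nval));

	return nval == 0;
}

/* Patched flow: decrement and zero test both run under the lock. */
static void mobj_reg_shm_dec_map(void)
{
	while (atomic_flag_test_and_set(&reg_shm_map_lock))
		;		/* spin: cpu_spin_lock_xsave() stand-in */

	if (refcount_dec(&mapcount) && !atomic_load(&mapcount))
		reg_shm_unmap_helper();

	atomic_flag_clear(&reg_shm_map_lock);
}
```

The tradeoff, as noted above, is that every dec_map now takes the spinlock, giving up the lock-free fast path.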

@omasse-linaro
Contributor Author

Draft submitted here: #6133

@jenswi-linaro
Contributor

What's regression case 1040?

@omasse-linaro
Contributor Author

It's a dedicated, newly added test that pings a TA with an input shared memory buffer.
Please find attached a patch for optee_test.
xtest_1040.zip

@jenswi-linaro
Contributor

> It's a dedicated, newly added test that pings a TA with an input shared memory buffer. Please find attached a patch for optee_test. xtest_1040.zip

Thanks

@jenswi-linaro
Contributor

jenswi-linaro commented Jun 26, 2023

What are the steps needed to reproduce the problem on QEMU?
I understand that I need OP-TEE/optee_test#663, but what else is needed? Which flags do you compile OP-TEE with? Which patches do you have in OP-TEE OS?

Edit: sorry, I see now that it has nothing to do with OP-TEE/optee_test#663 (which has a 1040 case, by the way). I'll apply your patch and test.

@omasse-linaro
Contributor Author

The test was done on the optee_os master branch at revision 504f159.
The test fails without any additional patches.

OP-TEE is built with CFG_TEE_CORE_LOG_LEVEL set to 1:
make run CFG_TEE_CORE_LOG_LEVEL=1

@omasse-linaro
Contributor Author

omasse-linaro commented Jun 26, 2023

The race condition can be reproduced by running two xtest clients as follows:
xtest -t regression 1040 &
xtest -t regression 1040 &

NB: the issue can be reproduced more quickly with more concurrent xtest clients.

@jenswi-linaro
Contributor

Can you reproduce the problem with #6137?

@omasse-linaro
Contributor Author

No, the issue is fixed with that patch.

@jenswi-linaro
Contributor

Thanks, let's use #6137 instead then.

@omasse-linaro
Contributor Author

Thanks, let's close this issue.
