Future not resolving with Tonic and SimpleSpanProcessor #785

Closed
duarten opened this issue Apr 16, 2022 · 5 comments
Labels
A-trace Area: issues related to tracing

Comments

duarten commented Apr 16, 2022

Consider the following OTLP exporter configuration:

opentelemetry_otlp::new_exporter()
        .tonic()
        .with_endpoint(endpoint.as_str())
        .with_metadata(metadata)
        .with_tls_config(
            tonic::transport::ClientTlsConfig::new().domain_name(endpoint.host_str().unwrap()),
        )
        .with_protocol(opentelemetry_otlp::Protocol::Grpc)
        .with_timeout(std::time::Duration::from_secs(3))

When using SimpleSpanProcessor (through with_simple_exporter), the program eventually reaches a point where the future that actually exports the span never resolves: the processor's worker thread dequeues the span, but block_on on the export future never returns. It seems this is triggering a deadlock.

I'm using tracing-rs (with the workaround in #473) on top, and opentelemetry-otlp is configured with the ["tonic", "tls", "tls-roots"] features.
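
For reference, a minimal sketch of ending up with a SimpleSpanProcessor behind an exporter like the one above, using the OTLP pipeline helper (the endpoint is a placeholder, and install_simple() stands in for the manual with_simple_exporter wiring):

let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(
        opentelemetry_otlp::new_exporter()
            .tonic()
            .with_endpoint("https://collector.example:4317") // placeholder endpoint
            .with_timeout(std::time::Duration::from_secs(3)),
    )
    // install_simple() wires the exporter behind a SimpleSpanProcessor,
    // the configuration under which the export future hangs.
    .install_simple()
    .expect("failed to install simple pipeline");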

Curiously, if I use the following very simple processor, everything works correctly:

// Imports assuming the v0.17-era paths in the `opentelemetry` crate's sdk module.
use opentelemetry::sdk::export::trace::{SpanData, SpanExporter};
use opentelemetry::sdk::trace::{Span, SpanProcessor};

#[derive(Debug)]
struct SimplerProcessor<E: SpanExporter + Send + Sync> {
    inner: std::sync::Arc<std::sync::Mutex<E>>,
}

impl<E: SpanExporter + Sync + Send> SpanProcessor for SimplerProcessor<E> {
    fn on_start(&self, _span: &mut Span, _cx: &opentelemetry::Context) {}

    fn on_end(&self, span: SpanData) {
        // Export synchronously: move off the async worker with block_in_place,
        // then block on the export future using the ambient Tokio runtime.
        tokio::task::block_in_place(|| {
            tokio::runtime::Handle::current()
                .block_on(self.inner.lock().unwrap().export(vec![span]))
        })
        .expect("span not sent");
    }

    fn force_flush(&self) -> opentelemetry::trace::TraceResult<()> {
        Ok(())
    }

    fn shutdown(&mut self) -> opentelemetry::trace::TraceResult<()> {
        Ok(())
    }
}
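
A hypothetical wiring of this processor, with `exporter` standing in for the tonic span exporter built from the configuration above (the builder method names assume the v0.17-era SDK API):

// `exporter` is assumed to be the OTLP tonic span exporter; error handling omitted.
let provider = opentelemetry::sdk::trace::TracerProvider::builder()
    .with_span_processor(SimplerProcessor {
        inner: std::sync::Arc::new(std::sync::Mutex::new(exporter)),
    })
    .build();
opentelemetry::global::set_tracer_provider(provider);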

Does any of this ring any bells?

@TommyCpp TommyCpp added the A-trace Area: issues related to tracing label Apr 18, 2022
@TommyCpp
Contributor

IIRC SimpleSpanProcessor is not compatible with async runtimes. The tricky part here is that we want to support different runtimes, so we generally need a Runtime to know how to spawn a task.

duarten commented Apr 18, 2022

Oh! I was getting the same issue with the batch processor, but I didn't debug it thoroughly. I think it also calls futures_executor::block_on at some point and that may cause the issue, but I'll look into it more deeply.

Did you consider providing an AsyncRuntime? rdkafka does something like that (it is meant to be implemented by empty structs).
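
A hypothetical sketch of that shape (the trait name and signature here are illustrative only, not rdkafka's or opentelemetry's actual API):

use std::{future::Future, pin::Pin};

// Hypothetical trait: an empty marker type knows how to spawn futures
// on the chosen runtime.
trait AsyncRuntime: Send + Sync + 'static {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send>>);
}

// The empty struct selects the runtime at compile time.
#[derive(Debug, Clone, Copy)]
struct Tokio;

impl AsyncRuntime for Tokio {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send>>) {
        tokio::spawn(fut);
    }
}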

@TommyCpp
Contributor

Yeah, we do have Runtime, and it's one of the parameters of the BatchSpanProcessor.
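
A minimal sketch of how that looks when installing through the OTLP pipeline (assuming the rt-tokio feature so that opentelemetry::runtime::Tokio is available):

let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    // The runtime argument tells the BatchSpanProcessor how to spawn its
    // background task and timers.
    .install_batch(opentelemetry::runtime::Tokio)
    .expect("failed to install batch pipeline");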

duarten commented Apr 18, 2022

Oh, nice. Indeed, I'm having better luck with the BatchSpanProcessor. I have to call force_flush_tracer_provider since I'm running in a Lambda, and that still calls futures_executor::block_on. I imagine it would deadlock if I tried to run this on a single-threaded runtime. It might be a good idea to lift the block_on into the Runtime trait.
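
A sketch of that pattern with a hypothetical handler; the only real API call here is global::force_flush_tracer_provider:

// Placeholder for the real handler logic.
async fn do_work(event: String) -> String {
    format!("handled: {event}")
}

async fn handle(event: String) -> String {
    let response = do_work(event).await;
    // Flush pending spans before the Lambda execution environment is frozen.
    // Today this call still goes through futures_executor::block_on internally,
    // which is why the runtime flavor matters.
    opentelemetry::global::force_flush_tracer_provider();
    response
}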

TommyCpp commented Apr 19, 2022

Yep, we have encountered it before when users are using the Tokio single-threaded runtime. The TokioCurrentThread runtime spawns the work on a different thread, so that kind of solves the issue.

This is not a perfect design, but force_flush needs at least two threads (or a runtime that allows at least two tasks to run in parallel). The reason is that we need to block on the main thread so that it does not return before the background thread finishes the work.

I think things might be easier if we could make force_flush an async function or have it return a future. But we do want to make sure the API can work without an async runtime.
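
To make the two-thread requirement concrete, here is a minimal illustration of the hazard using plain Tokio and futures-executor (not the OpenTelemetry code itself):

// With a single-threaded runtime, blocking the only worker thread means the
// task that would complete the future never gets polled, so block_on hangs.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, rx) = tokio::sync::oneshot::channel::<()>();
    tokio::spawn(async move {
        let _ = tx.send(()); // never runs: the runtime's only thread is blocked below
    });
    futures_executor::block_on(async { rx.await.unwrap() }); // hangs forever
    println!("unreachable");
}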
