Seeding a new subsystem #394

Open
gabrielcolson opened this issue Dec 24, 2024 · 5 comments

Comments

@gabrielcolson

How would you seed a new subsystem with historical data?

Example use case:

  • Subsys X has an egress-esg that emits external events and stores them in its event lake.
  • We create Subsys Y and a routing rule that sends external events emitted by other subsystems to its event bus.
  • But Subsys Y also needs historical data.

Possible solution:
A script similar to the aws-lambda-stream-cli that reads from Subsys X's event lake and sends the events to Subsys Y's event bus. It could work, but I fear that it would send duplicates to Subsys Y's event lake.
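A minimal sketch of the selection and batching part of such a script, in TypeScript. It assumes the lake objects have already been read and parsed into events; the `LakeEvent` shape and the function names are hypothetical and not taken from aws-lambda-stream-cli. The only external fact relied on is that EventBridge PutEvents accepts at most 10 entries per request. The actual call to Subsys Y's event bus is left out.

```typescript
// Hypothetical shape of an event read from Subsys X's event lake (an assumption).
interface LakeEvent {
  id: string;
  type: string;
  timestamp: number;
}

// EventBridge PutEvents accepts at most 10 entries per request.
const MAX_ENTRIES_PER_PUT = 10;

// Keep only the wanted event types and drop repeated ids,
// preserving the first occurrence of each id.
function selectEventsToSeed(events: LakeEvent[], wantedTypes: Set<string>): LakeEvent[] {
  const seen = new Set<string>();
  const selected: LakeEvent[] = [];
  for (const event of events) {
    if (!wantedTypes.has(event.type) || seen.has(event.id)) continue;
    seen.add(event.id);
    selected.push(event);
  }
  return selected;
}

// Split the selected events into batches small enough for one PutEvents call each.
function toBatches<T>(items: T[], size: number = MAX_ENTRIES_PER_PUT): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

This only removes duplicates within a single run of the script. Whether the seeded events end up stored again in Subsys Y's event lake depends on the rule that feeds that lake.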

Has anyone faced this problem before?

@vsnig
Contributor

vsnig commented Dec 24, 2024

I think the most straightforward solution is to replay the events of interest to X's egress gateway, which would make it publish external events that will seed the Y subsystem.

@gabrielcolson
Author

That means you will get a lot of duplicated external events in both the X and Y event lakes. Or should we not store external events in the event lakes?

@vsnig
Contributor

vsnig commented Dec 24, 2024

No. I have an eventPattern: { source: ['custom'] } rule for the delivery stream that pushes events into the event lake, so only internal events get pushed into the lake. But honestly, even if they were saved into the lake, I don't see a problem here (as long as it's not terabytes of events you're trying to replay).

@vsnig
Contributor

vsnig commented Dec 24, 2024

Or should we not store external events to the event lakes ?

I didn't notice the second question at first. I don't have a definite answer. Right now, John Gilbert's s3 template saves anything-but-fault, which means external events are saved too. In my lakes I have source: ['custom'], which I derived from a past version of the template, so it was like that at some point. But logically, external events can't be replayed to anything within a subsystem, so there seems to be little value in storing them in the S3 lake, except maybe for statistics.
Maybe John himself would want to say something on this topic.
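To make the difference between the two lake rules concrete, here is a small TypeScript sketch that imitates them with plain predicates. It is not EventBridge's matcher, and several things are assumptions: that the template's rule filters on `detail.type` with anything-but `fault`, that external events carry a source other than `custom` (the value `external` below is hypothetical), and the `BusEvent` shape itself.

```typescript
// Hypothetical, reduced shape of an event on the bus (an assumption).
interface BusEvent {
  source: string;
  detail: { type: string };
}

// Imitates the source: ['custom'] rule from this thread:
// only events published with source 'custom' reach the lake.
const matchesCustomSourceOnly = (event: BusEvent): boolean =>
  event.source === 'custom';

// Imitates the assumed anything-but-fault rule:
// every event reaches the lake unless its type is 'fault'.
const matchesAnythingButFault = (event: BusEvent): boolean =>
  event.detail.type !== 'fault';

// Sample events; the 'external' source value is a placeholder.
const internalEvent: BusEvent = { source: 'custom', detail: { type: 'thing-created' } };
const externalEvent: BusEvent = { source: 'external', detail: { type: 'thing-created' } };
```

Under these assumptions, `externalEvent` passes `matchesAnythingButFault` but not `matchesCustomSourceOnly`, which is the difference described above: the first rule stores external events in the lake and the second does not.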

@gabrielcolson
Author

Thanks for your insights.

If you have multiple downstream subsystems, do you still replay events in the egress? We usually have an idempotency table in our ingresses that expires records after 30 days to avoid sending duplicates. Should we keep those records forever so that we can replay as much as we want?
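The trade-off behind that question can be sketched with an in-memory stand-in for the idempotency table, written in TypeScript. The class and method names are hypothetical, and a real table would live in a database with its own expiry mechanism. The point is only to show how a 30-day expiry interacts with a replay of older events.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// In-memory stand-in for an ingress idempotency table (hypothetical).
class IdempotencyTable {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = THIRTY_DAYS_MS) {
    this.ttlMs = ttlMs;
  }

  // Returns true when the event should be processed, and records it.
  // Returns false when the same id was seen and its record has not expired yet.
  shouldProcess(eventId: string, nowMs: number): boolean {
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiresAt.set(eventId, nowMs + this.ttlMs);
    return true;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const table = new IdempotencyTable();
const firstDelivery = table.shouldProcess('evt-1', 0);            // first time seen
const replayAfterOneDay = table.shouldProcess('evt-1', DAY_MS);   // record still present
const replayAfter31Days = table.shouldProcess('evt-1', 31 * DAY_MS); // record expired
```

With a 30-day expiry, a replay of an event older than 30 days is treated as new and passes through the ingress again. With records kept forever, the same replay would be dropped as a duplicate, at the cost of a table that only grows.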
