When building event driven systems, events might be used for two distinct aspects of the system:
- for communication between components, e.g. using a message queue
- for representing the state of a service
These concerns are related, sometimes intertwined, but often technically distinct.
Infinite retention of an event stream would typically mean that you store the events in a database, allowing you to replay them and to rebuild the current state at any later point in time. This property of an event-based system has nothing to do with how your services communicate – it works just as well regardless of whether you use message queues or REST as a communication pattern.
Certain message queue systems do have the ability to store messages/events indefinitely. Notably, Kafka has this capability (and cloud providers will happily sell you this extended retention). But it’s not necessarily wise to use such capabilities.
Communication in a distributed system must be assumed to be unreliable. If your system only works correctly if messages are delivered exactly-once, you’re going to have problems.
E.g. assume that you have a horizontally-scalable service where every instance has its own cursor for consuming from the message queue. If one instance is temporarily turned off for half a year and then restarts, it will likely spend multiple days just catching up with old messages. The ability to rebuild the entire state can be very important, but might be a poor practice when used as a standard operating procedure – even in an event-driven system, you likely want checkpoints of the current state that can be shared between instances.
Dealing with limited-retention message queues can be an “enabling constraint”: forcing you to think about what happens when messages are dropped, and about how state can be rebuilt efficiently.
In a simple message-oriented service architecture, we might have two services A and B that are connected via some queue Q. When A sends a message to B, what does this mean, what can A assume about B? A should assume absolutely nothing. A should not assume that B will receive and process the message within any particular deadline. If the message is important, A might try re-sending the message with exponential backoff until B sends another message in response (or use a message queue that guarantees delivery attempts until B acknowledges it). If B wants to be able to re-build its own state in an event-driven manner, it might store events in its own database – but would probably not rely on the message queue Q for persistent event storage.
In practice, it is possible to make message queues (and any distributed system, really) highly reliable so that guaranteed delivery can be ensured almost always. But whether that’s the case depends on the Ops team that maintains this architecture, and on the specific guarantees provided by your specific technology choices.