Resulting context

The primary benefit of this solution is that we are not relying on a self-managed streaming cluster to retain a history of all events and double as a data lake. Managing that much disk storage takes a great deal of effort and, if done poorly, exposes the system to the risk of significant data loss. Instead, this solution enables teams to leverage value-added cloud streaming services so that they can focus on the functional requirements of their components. The data lake is responsible for the long-term, durable storage of all the events, while the streams run lean and retain only the most recent events. This ultimately helps ensure that we have proper bulkheads for the streams, instead of tending toward one large, monolithic streaming cluster. Leveraging blob storage has the added benefit of life cycle management, which can age events into cold storage and replicate them to another region and account for disaster recovery.
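
For example, assuming AWS S3 as the blob store (the bucket name and transition windows below are purely illustrative), a single lifecycle rule can age events into cold storage automatically:

```python
import boto3

# Assumption: AWS S3 backs the data lake. The bucket name and transition
# windows are illustrative only.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-events-into-cold-storage",
                "Filter": {"Prefix": ""},  # apply to all events
                "Status": "Enabled",
                "Transitions": [
                    # Move events to infrequent access after 90 days,
                    # then to archival storage after one year.
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```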

One of the primary purposes of a data lake is to act as a safety net. Because the data lake collects, stores, and indexes all events, we can leverage it to repair components by replaying events. If a component drops events for some reason, we can replay those events; if a component has a bug that causes it to perform calculations improperly against the event stream, we can fix the bug and replay the events; and if we have enhanced a component, we can replay events as well. We can also replay events to seed a new component with data. And as we will discuss in the Stream Circuit Breaker pattern, components will emit fault events when they cannot process a given event or set of events. Once the cause of the fault is resolved, we can resubmit the events that caused the fault.

When replaying or resubmitting events, there are various considerations to keep in mind. First, replaying or resubmitting an event is not the same as publishing an event. When we publish an event, it is broadcast to all consumers. When we replay or resubmit an event, we are sending the event to one specific consumer. We also need to consider the side effects of a replay. For example, will the component emit events as a result of receiving the replayed events, and if so, is that desirable? Are the component and its downstream components idempotent, or will a replay cause double counting, duplicate email notifications, or similar improper logic? We will discuss idempotence in the Stream Circuit Breaker pattern. On the other hand, when we resubmit an event or group of events that caused a fault, we typically do want the side effects. However, idempotence can still be important for resubmission when the events may have been partially processed before the fault. Backward compatibility is another concern when replaying older events. The component will need to handle the older formats properly; otherwise, we should only replay events in the formats it supports. The bottom line in all cases is to strive for idempotence and understand the impacts and side effects of replay and resubmission.
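
As a minimal sketch of the idempotence concern (assuming a hypothetical DynamoDB table named processed-events and a hypothetical apply_business_logic helper), a consumer can record processed event IDs so that a replayed or resubmitted event does not trigger its side effects twice:

```python
import boto3
from botocore.exceptions import ClientError

# Assumption: a DynamoDB table ("processed-events") keyed on the event id
# tracks which events have already been handled.
dynamodb = boto3.client("dynamodb")

def handle_event(event):
    try:
        # The conditional put fails if this event id was already processed,
        # making the handler safe to run against replayed events.
        dynamodb.put_item(
            TableName="processed-events",
            Item={"eventId": {"S": event["id"]}},
            ConditionExpression="attribute_not_exists(eventId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery or replay; skip the side effects
        raise

    apply_business_logic(event)  # hypothetical downstream processing
```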

One drawback of this solution is that consumers cannot replay events by simply resetting their checkpoint on the stream and reprocessing events from that point forward. Instead, a program must be implemented that reads events from the data lake and sends them to the consumer. It is preferable that consumers do not have to expose multiple entry points to handle the normal stream flow plus replay and resubmission. This is straightforward when the consumer is implemented with function-as-a-service, because functions can be invoked directly. The replay and resubmission programs just need to impersonate the stream by formatting the messages so that they look like they came from the stream itself, which is easily accomplished by storing the full stream message wrapper in the data lake as well.
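
A rough sketch of such a replay program, assuming AWS S3, Lambda, and Kinesis-style record wrappers (the bucket, prefix, and function names are hypothetical), might look like this:

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

def replay(bucket, prefix, function_name):
    # Read the stored stream message wrappers from the data lake and
    # invoke the target consumer directly, impersonating the stream.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            record = json.loads(body)  # the full wrapper was stored as-is
            # Wrap the record in the same envelope the stream would deliver.
            payload = {"Records": [record]}
            lam.invoke(
                FunctionName=function_name,
                InvocationType="Event",  # asynchronous invocation
                Payload=json.dumps(payload),
            )

# Hypothetical usage: replay one stream's events into one consumer.
# replay("example-data-lake", "order-stream/", "order-consumer")
```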

Another drawback is the potential, however small, that the consumers populating the data lake could fail to process the events from the stream before they expire. Therefore, it is critical that these consumers are implemented with simple logic focused solely on storing the events in the data lake, without any additional processing. It is also imperative that proper monitoring is in place to alert the team when processing is failing or falling behind, so that corrective measures can be taken. We will discuss monitoring in Chapter 8, Monitoring. Note that if some data is dropped, it could potentially be recovered from the search engine.
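
A sketch of such a consumer, assuming an AWS Lambda function triggered by a Kinesis stream and writing to an S3-backed data lake (the bucket name is hypothetical), keeps the logic to a single write per record:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # assumption: S3-backed data lake

def handler(event, context):
    # Store each raw stream record, wrapper and all, with no extra logic.
    # Objects are keyed by stream name plus sequence number so the lake
    # stays organized by stream and replay can target a specific consumer.
    for record in event["Records"]:
        stream_name = record["eventSourceARN"].split("/")[-1]
        key = "{}/{}".format(stream_name, record["kinesis"]["sequenceNumber"])
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(record))
```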

The security and privacy of the data in the data lake are of the utmost importance. Security by Design is the first principle; we must design security up front. We will discuss security extensively in Chapter 9, Security. With regard to the data lake, we are concerned with access permissions and encryption. When we design the stream topology for the system, we should also consider its security implications. The data lake should be organized by stream name, which means we can grant access to the data lake by stream name. With this in mind, we can design a topology whereby the events flow through the proper secure channels. With regard to encryption, it is never sufficient to rely on storage encryption alone. It is also not realistic to have the data lake encrypt specific data elements as the events flow in. Instead, the event producers are the most appropriate place to encrypt sensitive elements, and the design of each producer should account for the security requirements of the data it emits. For example, a producer of PCI data would tokenize the card data, and a producer of HIPAA data would encrypt and/or redact the data along multiple dimensions.
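
As a sketch of producer-side protection, assuming AWS KMS and Kinesis (the field names, key alias, and stream name are hypothetical), a producer might encrypt a sensitive element before the event ever reaches the stream; a real PCI producer would call a tokenization service instead, but the shape is the same:

```python
import base64
import json
import boto3

kms = boto3.client("kms")
kinesis = boto3.client("kinesis")

KEY_ID = "alias/example-producer-key"  # assumption: a KMS key owned by the producer

def publish_order(order):
    # Encrypt the sensitive element before publishing, so the stream and
    # the data lake only ever see ciphertext for this field.
    ciphertext = kms.encrypt(
        KeyId=KEY_ID,
        Plaintext=order["cardNumber"].encode(),
    )["CiphertextBlob"]
    order["cardNumber"] = base64.b64encode(ciphertext).decode()

    kinesis.put_record(
        StreamName="order-stream",      # hypothetical stream name
        PartitionKey=order["orderId"],
        Data=json.dumps({"type": "order-submitted", "order": order}),
    )
```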

The data lake is an excellent source of knowledge, and it is equally important to use it as a source for analytics. Data warehouses will be typical consumers of the events flowing through the system, and data scientists will leverage the data lake to perform big data experiments. As an example, a team may have a hypothesis that a specific change to an algorithm would yield better results. An experiment could be devised that loads several years of historical events from the data lake and processes them through the new algorithm to validate whether it actually performs better. The future uses of the data in the data lake are unpredictable, which is why it is important to collect, store, and index all the events of the system with no loss of fidelity.
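
A rough sketch of such an experiment, reusing the S3 lake layout assumed in the earlier examples (the algorithm objects are hypothetical), might look like this:

```python
import json
import boto3

s3 = boto3.client("s3")

def historical_events(bucket="example-data-lake", prefix="order-stream/"):
    # Stream several years of historical events straight from the lake.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            yield json.loads(body)

def run_experiment(baseline, candidate):
    # Feed the same historical events through both algorithms and compare.
    for event in historical_events():
        baseline.process(event)
        candidate.process(event)
    return baseline.results(), candidate.results()

# Hypothetical usage:
# results = run_experiment(CurrentAlgorithm(), CandidateAlgorithm())
```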