This course will demonstrate some of the more advanced options that are available in Google Cloud Pub/Sub. These options include filtering and ordering messages, creating and enforcing schemas, as well as replaying previously delivered messages.
Learning Objectives
- Filtering and ordering Pub/Sub messages
- Creating and enforcing message schemas
- Handling duplicate or undeliverable messages
- Replaying and purging messages
- Monitoring your topics for problems
Intended Audience
- GCP Developers
- GCP Data Engineers
- Anyone preparing for a Google Cloud certification (such as the Professional Data Engineer exam)
Prerequisites
- Some experience with Cloud Pub/Sub
- Access to a Google Cloud Platform account is recommended
Typically, Cloud Pub/Sub will deliver each message once. Sometimes, however, a message will be delivered more than once, and that can create problems if your subscribers are not prepared for it. So in this section I am going to discuss how to handle duplicate messages.
First, you can help minimize duplicates by extending the acknowledgement deadline. Duplicates are often caused by a subscriber taking too long to acknowledge a message: once the deadline passes without an acknowledgement, Pub/Sub delivers the message again. Setting a longer deadline should help reduce the duplication rate. It is also possible that your subscribers are simply getting overwhelmed. You might consider adding more subscribers to keep up with the load, or upgrading your subscribers so that each one can handle more. Finally, enforcing message ordering can increase the number of duplicated messages, so if ordering is not strictly required, you could try disabling it as well. Now, while these techniques can help reduce the number of duplicates, they will most likely not eliminate them entirely.
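As a rough illustration of the first tip, here is a minimal sketch of how you might pick a longer acknowledgement deadline from the processing times you have observed. The function name `suggest_ack_deadline` and the 1.5x safety factor are assumptions made for this example, not part of Pub/Sub. The only Pub/Sub fact it relies on is that a subscription's acknowledgement deadline must be between 10 and 600 seconds.

```python
import math
from typing import Sequence

# Pub/Sub subscriptions accept an acknowledgement deadline of 10 to 600 seconds.
MIN_ACK_DEADLINE_S = 10
MAX_ACK_DEADLINE_S = 600


def suggest_ack_deadline(
    processing_times_s: Sequence[float], safety_factor: float = 1.5
) -> int:
    """Suggest a deadline (in whole seconds) from observed processing times.

    Hypothetical helper: it takes the slowest observed time, adds headroom
    via safety_factor, and clamps the result to the allowed range.
    """
    if not processing_times_s:
        # Nothing measured yet, so fall back to the minimum.
        return MIN_ACK_DEADLINE_S
    wanted = math.ceil(max(processing_times_s) * safety_factor)
    return max(MIN_ACK_DEADLINE_S, min(MAX_ACK_DEADLINE_S, wanted))
```

If the suggested value hits the 600 second ceiling, a longer deadline alone will not help, and that is a sign you should look at adding or upgrading subscribers instead.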
Usually, filtering out duplicates on the subscriber side isn’t too difficult. You can often add a little extra logic to check whether you have already processed a message. If you were, say, encoding videos, you would first check to see if the current video has already been encoded. Also, you should be aware that every message published to a topic is assigned a message ID that is unique within that topic. You can use this field to keep track of messages that have already been processed. For example, you can store the processed IDs in a database or cache. You can also use the IDs to compare two messages that you suspect might be duplicates. If your subscribers are built to be as robust and fault-tolerant as possible, you shouldn’t have to worry about the occasional duplicate.
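The idea of tracking processed message IDs can be sketched as follows. This is a simplified, in-memory illustration only: `SeenMessageCache` and `handle_once` are hypothetical names invented for this example, and a real subscriber would more likely keep the IDs in a shared database or cache so that every subscriber instance sees the same history.

```python
from collections import OrderedDict
from typing import Callable


class SeenMessageCache:
    """Remembers the most recently processed message IDs, up to max_size."""

    def __init__(self, max_size: int = 10_000) -> None:
        self._max_size = max_size
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def __contains__(self, message_id: str) -> bool:
        return message_id in self._seen

    def add(self, message_id: str) -> None:
        self._seen[message_id] = None
        self._seen.move_to_end(message_id)
        # Evict the oldest ID once the cache grows past its limit.
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)


def handle_once(
    message_id: str,
    payload: bytes,
    cache: SeenMessageCache,
    process: Callable[[bytes], None],
) -> bool:
    """Run process(payload) unless message_id was already handled.

    Returns True if the payload was processed, False if it was a duplicate.
    """
    if message_id in cache:
        return False
    process(payload)
    # Record the ID only after processing succeeds, so a failed attempt
    # can still be retried when the message is redelivered.
    cache.add(message_id)
    return True
```

In a subscriber callback you would pass in the message's ID and data, and acknowledge the message in both cases, since a duplicate has by definition already been handled. Because the cache is bounded, a duplicate that arrives after its ID has been evicted would be processed again, so the size should be chosen with your redelivery window in mind.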
Sometimes, you won’t be able to tolerate any duplicates at all. If you need to guarantee exactly-once processing of your messages, then consider using the Apache Beam programming model. Apache Beam pipelines can run on Cloud Dataflow, and you can use the Beam PubsubIO connector to read from Cloud Pub/Sub. Essentially, you will define a pipeline and then let it filter out all the duplicates for you.
Remember, by default Cloud Pub/Sub only offers at-least-once delivery. If you need exactly-once processing, you are going to need some additional logic.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.