I very much agree with Todd with the advice:
I strongly recommend that you don’t overengineer your architecture.
Less is more.
YAGNI: Start simple and tunable, then update your architecture as
To answer your specific questions but reordered slightly:
Missing posts in a user app like this in case of failure is a big no-no, so I wonder what is your view on this, and what is the best resolution of the problem.
First, we need to be clear about the problem. Implicit in your question is the assumption that write throughput is a big problem to solve. Buffering will happen in Redis when the write-throughput has exceeded the single master write-path of a MongoDB cluster. You are then considering what might happen if Redis crashes. This is all good thinking. Yet is optimising write-throughput the major architectural challenge in your service?
You describe your service as news aggregation. We might expect that reads might dominate. Writes are be many orders of magnitude smaller for news. If it is indeed news you are publishing you can consider buffering writes at the client under high write (e.g., browser local storage). If you load-shed writes you can add simple logic at the client can try again using an exponential back-off give your backend time to catch up (else heal from an outage). A tiny write message queue at every client is easy to implement. You can expect it to be reliable and easy to test. Only if your service is not accepting writes, and the user deletes their client app (else clears browser local storage), will they lose work.
If your service is a bit more like an Uber or Lyft ride-sharing service then write throughput will be critical. With ride-sharing there are a huge number of jobs published constantly and searched for in a geospatial manner. Write throughput and graceful outages are critical to those services. In which case you can research how those companies solve the write-path problem.
Below I will propose two solutions that I would investigate. One that optimises for write throughput and another that might be more suitable for a low budget “news aggregation” service that can scale up later.
Does a high-availability cluster with write-behind cache solve all of my problems?
Operational complexity and recovering from failures is likely to give you the biggest real-world business problems. Studies of big systems indicate that it is “grey failures” are the most tricky sort of problems that lead to extended outages and data loss. The moment you integrate two products on your write path you can expect complex interactions under failures and load. In theory, yes, a highly-available cluster with a write-behind solves all your problems. In practice, I would hesitate to go with a HA Redis Cluster in front of a MongoDB Cluster.
Does high-availability write to disk (Redis), guaranteeing that no data will be lost?
Well, Redis is an excellent product. In the past, it has had some hiccups with data loss. I have used Redis myself as cache in a few architectures. Would I use Redis as a silver bullet to optimise the write-throughput a geospatial based web-scale system? No.
IMHO opensource products grow up to fulfil the core needs of their community. They then bolt on additional features that serve as wide a set of needs as possible. People then get into trouble when they push those additional features to the extreme. You then encounter bugs that haven’t been found and fixed by the wider community as you are the outlier. Redis excels at being a low-latency cache optimising the latency of the read path. Using it to optimise the write-path of a very high write throughput geospatial system would be something I would be very nervous about in practice. I think Redis is great and I will continue to deploy it to optimise the read-path. Yet I would go with a specialist solution to optimise the write-path.
If write-throughput is the major challenge than I would look at Apache Kafka or an alternative. It has an architecture that is very different from traditional massage brokers so that it can scale to “IoT” levels of writes. You can then have consumers that update your main data store and possibly a separate geospatial index. If a consumer was buggy/lossy you can easily replay the data stream to “fix-up” your geospatial index service or main store after you have fixed the bug in your consumers. I would then have a single product on the durable write path. If the secondary writes to the main database or index service fail the ability to easily replay a stream of events will be invaluable. Engineering to make it easy to recover from the unexpected is money better spent than trying to eliminate the unexpected. A large number of real-world outages are caused by human error so designing for failure recovery is critical.
Even with write-behind cache, is it the safest course of action to add a message queue (eg RabbitMq) or is this generally not necessary even under high load?
Well, RabbitMQ is an excellent product. I have used it myself. If you can buffer the writes at the client under high-load, and do load shedding in-front of RabbitMQ, then it might be a great fit. It would be low-latency and allow you to update the main database and possibly a separate geospatial index. Using a message queue to decouple and independently scale microservices is a great strategy. Yet it will add operational complexity at go-live when it might not be needed until the service starts to take-off. So I would be tempted to start without it and then use it as a way to break apart a simple initial architecture into a more complex architecture at a later date.
My Two Suggested Approaches
(1) For a “mega-scale” Geospatial System with both high writes and high read:
If write-throughput is a big challenge I would evaluate both Apache Kafka and Apache Pulsar as the initial durable write-path. Those are very highly scalable messaging engines. Kafka is often used as the transport in asynchronous microservices architectures so we might expect it to have satisfactory latencies at high throughput. I would recommend having an edge “news publish” microservice that exposes RSocket over WebSockets to the clients. Clients would push the new news message over WebSockets as a request-response interaction. The RSocket server would simply apply security (validate the user), validate the payload, write it into Kafka/Pulsar, and respond. I would add logic at the client to hold the message in local storage and periodically retry if it got timeouts or errors. I would exponentially back-off in the retry logic to allow the service to recover.
A news aggregation service will need scalable reads. Your proposal is to use a MongoDB cluster. This is because it does geospatial queries. That would work. If you were looking to scale reads to a much higher level you could consider using a more scalable main database such as ScyllaDB and deploy separate geospatial query service such as Elastic Search. The consumers that read new news from Kafka can first write to the main ScyllaDB database. They can then update a geospatial index in Elastic Search. Elastic Search is often used as a secondary index for both free text search and geospatial indexing. As you are a news aggregation service deploying a dedicated free text search index may also be useful.
(2) For a “start-up budget” system with initial low writes and modest reads:
While we all want to build a service that can scale to huge loads from launch often the reality is that the main threat to a business at launch is over-engineering. It makes business sense to initially focus all engineering effort on the user experience while using a simple and cheap backend. Facebook was a college toy at go-live. Amazon started off selling books from a single workstation computer that beeped whenever an order came in. Demand scales up over time and only after you have a great product deployed. If it is a great business idea and a compelling product then a few hiccups as you grow and scale the architecture will be fine.
PostgreSQL is a traditional relational database. It now does binary JSON storage where you can index into the JSON fields. It also supports geospatial indexes. It also supports free-text searching. It is very easy to run locally to develop against. Better yet major cloud providers support it with lots of automation, high-availability, automated backup/restore, point-in-time restores, and monitoring dashboards. Amazon RDS lets you run PostgreSQL and they can seriously scale it up for you with a few clicks. You can start very cheaply and then add HA easily later then scale up to bigger and bigger database servers as you grow. That gives you time to then fix the real performance bottlenecks rather than guessed problems.
You can start with an edge microservice that will load-shed rather than do too many writes into PostgreSQL. The client can buffer the write and try again later. Then start off with simple code that does all of the writes against PostgreSQL for the main document and geospatial indexing. Spend the engineering effort on the frontend and business features. Later you can put RabbitMQ between the edge microservice and PostgreSQL. Then you can either break up or swap out PostgreSQL.
At a later date, you might create a separate geospatial index in Elastic Search. Later yet you might choose to move actual documents out of Postgres and into ScyllaDB. Do we know which things we should do in which order today? No, we cannot. Instead plan to evolve the architecture. Maybe just splitting from one PostgreSQL server into three servers, one dedicated to each of geospatial indexing, binary json, and free text search might work? That sounds like a great intermediate step before swapping in Elastic Search or ScyllaDB. We don’t know but we can be flexible.