I am reading up on system design concepts, specifically database replication and partitioning. I understand that partitioning is used to improve throughput and/or to handle large amounts of data, i.e. basically to improve scalability. But at what point do I actually need partitioning?
E.g. suppose I have a single table in my OLTP DB that sees a lot of reads and writes, with multiple columns, where:

- each row is 1 KByte,
- there are 1 million daily users reading from and writing to that table (say each user writes one row per day), so each day roughly 1 GByte of data is written,
- the data is persisted for 1 year, so at any point the DB holds about 1 GByte × 365 = 365 GBytes of data.
Is this amount small enough that I can get away with just an index to improve throughput, or would I need to partition? And if the number of users goes from 1 million to, say, 1 billion, that becomes 365 TBytes of data. What happens in that case?
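For concreteness, here is the back-of-envelope math as a small script (the row size, user counts, retention period, and the one-row-per-user-per-day assumption are just the assumptions from my example, not facts about any real system):

```python
# Back-of-envelope storage estimate for the scenario above.
# Assumes each daily user writes exactly one 1 KB row per day.
ROW_SIZE_BYTES = 1_000   # 1 KByte per row
RETENTION_DAYS = 365     # data persisted for one year

def retained_bytes(daily_users: int) -> int:
    """Total bytes held in the DB at steady state."""
    daily_write_bytes = daily_users * ROW_SIZE_BYTES
    return daily_write_bytes * RETENTION_DAYS

print(retained_bytes(1_000_000))      # 1M users -> 365_000_000_000 bytes (~365 GB)
print(retained_bytes(1_000_000_000))  # 1B users -> ~365 TB
```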
Basically, how do I decide whether to partition or not? Is there an empirical rule of thumb I can apply here?
I understand that this question is broad, and that there is no "one size fits all" answer given how different each application and each DB can be, but there must be some way to reason about this, hence asking the question here.
I did read some articles online, but what I found explains partitioning only in terms of generic concepts like improving scalability in a distributed system. Is there a real example with "before partitioning" and "after partitioning" scenarios and actual numbers? Or is partitioning just used by default in modern applications?