Cloud architectures

Hi everyone!

So I’m currently the only dev at a startup, and we just got a GCP bill that was a lot tougher to swallow than we’d like.

So what we’re looking for is a cloud architecture that doesn’t cost thousands of dollars a month for a handful of users. I know I made some mistakes with GCP, which I’ll detail, but I’m also looking for better ways to run it.

What we need is to run a Spring Boot app connected to a database. We have to regionalize our cloud for data residency requirements; currently we’re just in the US and Canada, but at some point we’ll expand to the EU. We also want to store deidentified data in a data lake of some kind so we can develop some ML models.

Our current setup: the app runs on GKE, connected to a Cloud SQL Postgres DB. We have 6 tables altogether. When data is sent to the backend, we deidentify it and send it to Pub/Sub, which stores it in BigQuery.

The Pub/Sub and associated Dataflow jobs were two-thirds of our bill. I had 5 running with 4 vCPUs each, since that was the default. I’m looking to scale that back to 1, but I’m also wondering if there are better architectures out there, either on Google or another platform.

Thanks for any help or pointers!

It doesn’t sound like this should be too expensive; still cheaper than hiring a backend dev, I’m guessing. You already have a Postgres instance, so I don’t understand why you need to move data to BigQuery for the ML stuff.

How much data are you moving through Pub/Sub (in MB/s or events)? And how much do you query in BigQuery (ballpark)? Would additional Postgres replicas be any cheaper, and would they work for your data volume and for how frequently you need to reprocess it?

All of this is something you could theoretically build out of VMs, and I’m just wondering if it’s worth going the DIY route to save a couple of thousand bucks a month.

Thanks for your reply!

So the reason we’re using BigQuery is that we’re dealing with sensitive data, and we don’t want to work with raw data from our databases. We want to work with a deidentified version, which my API service currently produces before sending it to BigQuery.
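For context, the deidentification step is simple on our side. A minimal sketch of what it amounts to (the field names and the salted-hash pseudonym scheme here are illustrative assumptions, not our actual schema):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the deidentification done in the API layer before
// export: direct identifiers are dropped, and the stable key is replaced
// with a salted hash so rows can still be joined without exposing who
// they belong to.
public class Deidentifier {

    private static final Set<String> DIRECT_IDENTIFIERS =
            Set.of("name", "email", "phone");

    public static Map<String, String> deidentify(Map<String, String> row, String salt)
            throws Exception {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (DIRECT_IDENTIFIERS.contains(e.getKey())) {
                continue; // drop direct identifiers entirely
            }
            if (e.getKey().equals("user_id")) {
                // replace the join key with a salted SHA-256 pseudonym
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                byte[] h = md.digest((salt + e.getValue())
                        .getBytes(StandardCharsets.UTF_8));
                out.put("user_pseudonym", HexFormat.of().formatHex(h));
            } else {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

The point is that this logic lives in the service itself, so where the output lands (BigQuery, a second Postgres, whatever) is a separate decision.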

As for the size, right now very little. We have 300 users, so syncing everything is probably under a GB total. We’re talking tens of MB per month.

To start, I’m not a cloud expert. I’m pretty good with Spring Boot, but not the underlying infrastructure. I know enough to get in trouble, it seems.

We have 5 tables, which means we have 5 endpoints so the database on the phone and the database in the cloud stay in sync. I had created 5 subscriptions, which it turns out spun up 5 Dataflow jobs, each running with 4 vCPUs by default. That gave us a bill of about $1,300/month on Dataflow alone, which is just way too much for handling tens of MB per month.

I’m planning to move to the API writing directly to BigQuery about once a week, in bulk; it will be in charge of building the queries itself. I need to see if Spring’s @Scheduled will still work on Google Cloud Run, but that’s my adventure this week.
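The bulk part basically amounts to collecting the week’s deidentified rows and serializing them as newline-delimited JSON, which is the format BigQuery batch loads accept. A rough sketch (class and field names are made up, and the actual load-job call is left as a comment since it depends on the BigQuery client library):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the weekly bulk export: instead of one Pub/Sub message per
// row, collect the deidentified rows and write them in one batch.
public class WeeklyExport {

    // One deidentified row; no direct identifiers reach this layer.
    public record DeidRow(String rowId, String payload) {}

    // Serialize a batch as newline-delimited JSON (NDJSON), which
    // BigQuery batch load jobs accept.
    public static String toNdjson(List<DeidRow> rows) {
        return rows.stream()
                .map(r -> String.format("{\"row_id\":\"%s\",\"payload\":\"%s\"}",
                        r.rowId(), r.payload()))
                .collect(Collectors.joining("\n"));
    }

    // In the real service this would run under Spring's @Scheduled,
    // e.g. @Scheduled(cron = "0 0 3 * * SUN"), and hand the NDJSON to a
    // BigQuery load job instead of returning it.
}
```

One caveat worth checking first: Cloud Run can scale instances to zero, so an in-process @Scheduled timer may never fire; a Cloud Scheduler trigger hitting an endpoint is the usual workaround.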

We can’t use Postgres replicas because of the deidentification; the copy needs to have a similar structure but different (deidentified) contents.

Thanks for your reply though!

Phew, for a second I thought you were going to say “not much data, 1T/day, and we backfill BigQuery nightly from our 1PiB dataset.”


All your data, every copy of it, fits into the RAM of a single 8 GiB Raspberry Pi. And you’re running how many Xeon cores and how much RAM in total?

Instead of BigQuery, use a second Postgres instance; BigQuery is tragic overkill for small data.

The Postgres network protocol is open, so there are a bazillion “replication providers”: software that connects to PostgreSQL and says, “hi, please tee any changes to me.” It’s the same mechanism PostgreSQL uses for primary/replica replication. These providers can do interesting things; for example, there’s one (Debezium) that injects rows into Kafka, which is roughly a cheap/open-source Pub/Sub. Architecturally Kafka is actually very similar to Pub/Sub, even internally, except Google’s Pub/Sub is built to scale a lot more. Kafka will be cheaper.

I don’t know enough about the details of your workload to give accurate advice here, but I can suggest a general process for optimizing your costs:

  • Look at the bill first and foremost. Where are your dollars going? When you optimize, start with the biggest dollar amounts.
  • Do things in bulk; you’ll be surprised how much you can save when you consolidate your queries/HTTP requests.
  • If something has a low-usage window (e.g. an internal API that only gets called 50 times a day between 9am and 5pm), turn it into a lambda function instead of having a VM run 24/7.
  • Turn down the specs of your compute resources, and stress test them to see if the system remains stable (do this in a staging env PLS).
  • Don’t move data around unless absolutely necessary.
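To illustrate the “do things in bulk” point: buffering events and flushing them in batches turns N downstream calls into roughly N/batch-size calls. A toy sketch (names hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy illustration of batching: instead of one downstream call per
// event, buffer events and flush them in groups. 1000 events at batch
// size 100 means 10 downstream calls instead of 1000.
public class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;  // e.g. one bulk INSERT or HTTP POST
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    public Batcher(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) flush();
    }

    // Call once more at the end (or on a timer) to drain the remainder.
    public void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(List.copyOf(buffer));
        buffer.clear();
        flushes++;
    }

    public int flushes() { return flushes; }
}
```

Per-request overhead (connection setup, per-call pricing, per-message Pub/Sub billing) is usually what you’re paying for, so consolidating is often the single biggest win.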

I noticed that you move data into BQ just to anonymize it? May I ask why?
You can achieve the same thing by restructuring the DB to have a separate table holding the identifiable information, locked away from the service account (e.g. a table user_data that only the account serviceA is able to read from).