To SQL, or not to SQL. Choosing something for storing sensor data with frequent updates

First off, ‘huh, completely missed there was such a category’.
Second off, ‘huh, only 4 topics. Need to fix that’.

I am building (and learning how it goes down in the long run) a small app aimed at processing and hosting data from smart sensors (so far I haven’t found anything I find interesting except humidity and temp). I wonder if I can find wind speed and light level sensors usable for an apartment.

This app is a source for other apps to get current data by sensor (microservice architecture).

But I am also storing this data for future uses (before I had humidity and temperature data displayed on a small e-ink screen sitting right next to my monitor, I didn’t even know at which humidity levels to take an umbrella when going outside).

So I’m collecting this data and plan to use it once I get some inspiration (and the right idea). Meaning that I will be reading data by date ranges and specific time periods.

I started off with Mongo, collecting data by sensor per day, keeping a separate DB entry in memory and flushing it to the DB once an hour. It did work, although I somehow screwed up somewhere and after day 3 was getting an NPE. But I didn’t touch that project for a while, and when I finally did, I looked again at the whole question and moved to Postgres (I feel more at home with SQL and the ‘everything on its shelf’ structured approach).

Everything looks neat and tidy, but I do feel that ‘long’ isn’t eternal (although 9,223,372,036,854,775,807 is quite a big number) and should do fine for a single table, which currently gets 2 new records once a minute (and 2 more, but those seem to be once an hour or something). But in time the number of temp sensors will increase, and so will the record count. I do realize that I can group the data and write a 5-minute average instead. Or even make maintenance jobs which archive outdated data, clearing the table (I find this an interesting future experience).
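
Roughly the kind of thing I have in mind, schema and rollup-wise (table and column names here are just placeholders, not my actual schema):

```sql
-- Hypothetical layout, only to illustrate the idea.
CREATE TABLE sensor_report (
    id          BIGSERIAL PRIMARY KEY,          -- the 'long' I was worrying about
    sensor_id   INT         NOT NULL,
    reported_at TIMESTAMPTZ NOT NULL,
    temperature REAL,
    humidity    REAL
);

-- Grouping raw rows into 5-minute averages instead of keeping every reading.
SELECT
    sensor_id,
    date_trunc('hour', reported_at)
      + floor(extract(minute FROM reported_at) / 5) * interval '5 minutes' AS bucket,
    avg(temperature) AS avg_temp,
    avg(humidity)    AS avg_humidity
FROM sensor_report
GROUP BY sensor_id, bucket
ORDER BY sensor_id, bucket;
```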

But I am basically thinking of two things:

  1. Was Mongo a better approach? I know that reading a full month of data would mean bringing all of that data from the DB into the app.
  2. SQL (I like SQL) sounds better (easier to write, and seems easier to read with the right SQL query), but I haven’t seen how an SQL table behaves with this much data.

I was also thinking of going with a Mongo-like approach and simply writing data to regular files (in CSV format or something).

Your thoughts?

P.S. I know HomeAssistant exists. I have it on my list of stuff to research. But I do love building things up from the ground and experiencing the good and bad decisions.

If the only thing you do is append data and never delete/update, just save the data in CSV format.

Otherwise, use SQL.

1 Like

Fair point. But that will be a hassle when reading (writing the logic for it) compared to an engine where I can index fields (there is a datetime field, which will be relevant).
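
E.g. this is what I mean by indexing - one composite index on (sensor, datetime) and range reads stay cheap (names are placeholders):

```sql
-- Hypothetical names; the point is an index so ranged reads like
-- "this sensor, this month" don't have to scan the whole table.
CREATE INDEX idx_sensor_report_sensor_time
    ON sensor_report (sensor_id, reported_at);

-- A typical ranged read that can use that index:
SELECT reported_at, temperature, humidity
FROM sensor_report
WHERE sensor_id = 1
  AND reported_at >= '2025-08-01'
  AND reported_at <  '2025-09-01';
```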

Depends on your hosting environment. An extra SQL server usually costs money.
If your department can pay for it, then go for it. It is good in the long run.

Nah. This is a homelab thing. My server is a 4-core/4-thread x86 Intel chip (Odroid H3+). Nothing related to ‘production’ - just an ecosystem of my home projects.

1 Like

I mean, that’s a fine reason to go ahead and use PostgreSQL if you’re just going to be more comfortable with it. You seem aware that if you scale then you’ll eventually need to work out how to split the table data off into partitions or archives or whatever suits your needs. PostgreSQL itself won’t flinch at millions of rows in a table - whether it becomes a problem or not really depends on your storage, querying needs and whether you actually remember to keep an eye on the log size and do something about it down the line.
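
If or when you do get there, Postgres’ declarative range partitioning on the timestamp is the usual tool; a rough sketch, with made-up table and column names:

```sql
-- Rough sketch of range partitioning by month (names are invented).
CREATE TABLE sensor_report (
    sensor_id   INT         NOT NULL,
    reported_at TIMESTAMPTZ NOT NULL,
    temperature REAL,
    humidity    REAL
) PARTITION BY RANGE (reported_at);

CREATE TABLE sensor_report_2025_08 PARTITION OF sensor_report
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');

-- Old months can later be detached and archived or dropped as a unit:
ALTER TABLE sensor_report DETACH PARTITION sensor_report_2025_08;
```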

1 Like

I do wonder how much time it will take for data like this (a handful of small columns per row) to grow to the size where my 256GB M.2 drive won’t be enough (I do have a separate 1TB drive hooked up to that machine, which I use as a storage buffer between my devices).

And would it even make sense to increase the storage, instead of aggregating the precise data into ‘accumulated’ data and dropping off a big chunk of it?
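
I’m picturing something like a periodic job that rolls old precise rows into an hourly summary table and then drops them (made-up names, just to illustrate):

```sql
-- Hypothetical maintenance step: roll anything older than a year into hourly
-- averages, then drop the raw rows. All names are placeholders.
INSERT INTO sensor_report_hourly (sensor_id, bucket, avg_temp, avg_humidity)
SELECT sensor_id,
       date_trunc('hour', reported_at) AS bucket,
       avg(temperature),
       avg(humidity)
FROM sensor_report
WHERE reported_at < now() - interval '1 year'
GROUP BY sensor_id, bucket;

DELETE FROM sensor_report
WHERE reported_at < now() - interval '1 year';
```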

It is. But I don’t use NoSQL frequently (meaning that, as a developer, I don’t have much deep experience with it). It isn’t bad, but there was an interview where a fellow didn’t like my overall explanation of NoSQL as a paged storage solution (although that was a fun interview, with no real NoSQL requirements in the position).

For now, scaling is of no consideration. I have yet to determine what I even want to do with this stuff. My initial point for even storing it was to make some forecasts, based on previous years, of things like “when the summer really kicked in” and such (going down that road of “observations”).

One cool idea I am brewing in my mind is to get a temp/humidity delta in slices of 1, 3, 6, 12 hours, to get an understanding of “this night the temperature went down from 25 to 10°C, meaning that even if the current temp is showing 25°C, I shouldn’t really expect to go out in a t-shirt and not get cold”. But all of this is just brewing around new ideas for future projects.
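
Something roughly like this, I suppose (placeholder names, and it assumes one hourly bucket per sensor with no gaps):

```sql
-- Sketch: hourly averages with the change vs. 6 hours earlier, per sensor.
WITH hourly AS (
    SELECT sensor_id,
           date_trunc('hour', reported_at) AS hour,
           avg(temperature) AS avg_temp
    FROM sensor_report
    GROUP BY sensor_id, date_trunc('hour', reported_at)
)
SELECT sensor_id,
       hour,
       -- lag by 6 rows = 6 hours, assuming no missing hours
       avg_temp - lag(avg_temp, 6) OVER (PARTITION BY sensor_id ORDER BY hour)
           AS temp_delta_6h
FROM hourly
ORDER BY sensor_id, hour;
```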

There are specialised time-series databases optimised for this kind of case, such as Prometheus (and many others). If you want file-based storage I’d suggest the Parquet format; it’s very fast and efficient and stores data in a columnar, compressed format with some nice features like partitioned files etc. DuckDB is an interesting option to run SQL queries on top of Parquet or CSV files. But if performance with Postgres is OK, there’s no reason not to stick with that if you’re comfortable with that solution.
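
For example, DuckDB can run SQL straight over a directory of CSV files; roughly like this (the file layout and column names are just an illustration):

```sql
-- DuckDB: read every CSV under ./data/ and compute daily averages per sensor.
SELECT sensor_id,
       date_trunc('day', reported_at) AS day,
       avg(temperature) AS avg_temp
FROM read_csv_auto('data/**/*.csv')
GROUP BY sensor_id, date_trunc('day', reported_at)
ORDER BY sensor_id, day;
```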

2 Likes

What scale are you talking?

  • How many sensors?
  • Polling rate?
  • Concurrent users?

If this is just something for your house, home office, small 100 user company, etc… just use whatever SQL variant you are comfortable with that has great libraries for your language of choice.

SQL will give you flexibility, broad compatibility, etc. It’s not the trendy ultra-high-performance option, but unless you’re scaling to (hundreds of) thousands of users, SQL is just fine on any half-reasonable modern hardware.

Unless of course this is a learning exercise to learn the funky new toy database of your choice - then go nuts. But personally I’d just use PostgreSQL or MySQL. SQL gives you so much functionality, and hardware these days is overkill for running dinky little home apps. It’s not 1995 any more.

In terms of data growth vs. your SSD…

I’ve got over a decade of every fax (or virtual fax sent via email as a purchase order, payslip or EFT remittance advice) that our small $2bn enterprise sent during that time, stored in monochrome uncompressed TIFF format, and it would fit within an SSD of that size several times over. IIRC it was under 20 GB. Mongo or CSV or whatever would also be fine, until you eventually find you want to actually do some analysis on your data and need to rewrite.

The amount of text/time series you can fit on modern storage is nuts unless you’re pulling in a huge quantity of sensors from a huge number of sources at a very high polling rate.

I think your SSD will be plenty of storage until it fails from age.

2 Likes

You can reduce the size of your rows. The datetime can be used as the primary key, saving the index column, and floats can be multiplied by a fixed factor and stored as SMALLINT for 2 bytes instead of 8.
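
Something along these lines, for example (a scale factor of 100 and the names are just an illustration):

```sql
-- Compact layout sketch: (sensor, timestamp) as the primary key, readings
-- stored as scaled SMALLINTs (value * 100), so 23.47 °C becomes 2347.
CREATE TABLE sensor_report_compact (
    sensor_id   SMALLINT    NOT NULL,
    reported_at TIMESTAMPTZ NOT NULL,
    temp_c_x100 SMALLINT,   -- temperature * 100
    rh_x100     SMALLINT,   -- relative humidity * 100
    PRIMARY KEY (sensor_id, reported_at)
);

-- Reading it back as a float:
SELECT sensor_id, reported_at, temp_c_x100 / 100.0 AS temperature
FROM sensor_report_compact;
```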

Performance shouldn’t be an issue with only one or two users so you could also use file system based compression provided you’re using a file system that supports compression.

1 Like

Well, this is basically day 2 of me moving to Postgres. The current usage is 2 HTTP threads which call it to get the latest data. And I do keep the ‘latest entry per sensor’ in the heap, both to return on request and to compare against a new value from the sensor, to filter out reports that have the same data.
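
(If I ever need that straight from Postgres instead of the in-memory copy, something like DISTINCT ON should do it - column names are placeholders:)

```sql
-- Hypothetical fallback: latest report per sensor straight from the table.
SELECT DISTINCT ON (sensor_id)
       sensor_id, reported_at, temperature, humidity
FROM sensor_report
ORDER BY sensor_id, reported_at DESC;
```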

Hm. Isn’t Prometheus a metrics thing? I remember implementing it for some work project in the past.

The file approach is tempting. I was considering a 20250826/sensor1.csv approach, simply appending a new entry at the end of the file.

But that will be easy to write.

But (2), with stuff like “computer, give me the average temperature drop between 18:00 and 18:59 across the year”, having the data packed into files will be more of a hassle compared to SQL.
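
Whereas in SQL that kind of question is a handful of lines (placeholder names again, and “drop” here meaning first reading minus last reading within that hour):

```sql
-- Sketch: per day, take the change within the 18:00-18:59 window,
-- then average that across the year for one sensor.
WITH evening AS (
    SELECT date_trunc('day', reported_at) AS day,
           (array_agg(temperature ORDER BY reported_at ASC))[1]  AS first_temp,
           (array_agg(temperature ORDER BY reported_at DESC))[1] AS last_temp
    FROM sensor_report
    WHERE sensor_id = 1
      AND extract(hour FROM reported_at) = 18
      AND reported_at >= '2025-01-01'
      AND reported_at <  '2026-01-01'
    GROUP BY date_trunc('day', reported_at)
)
SELECT avg(first_temp - last_temp) AS avg_drop_18h
FROM evening;
```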

One apartment. 4 sensors for the moment. 2 report once per minute, and 2 once per hour (I am still figuring out how that works).

For the users part - well… “me”. For now I am only collecting this data (started yesterday). The processing doesn’t read from the DB (I keep a ‘latest’ report per sensor).

But I think over time I will come up with ways of using it. I doubt it will go beyond 10 threads/users, though.

Nah. Home use. 1-10 apps (time will tell what I come up with).

SQL looks good. But this is a “you do it - you learn from it in the long term” thing. Just wanted to hear opinions on what sounds better :wink:

I forgot there was TIFF. But I do remember faxes. And virtual ones as well :slight_smile: 20GB you say? Nice. Then I’m fine. Although I will need some monitoring just to be on the safe side…

1 Like

Without bothering to actually look up pgSQL datatype sizes and alignments, assuming 8 bytes per column as a worst case and some overhead, let’s imagine your rows are 64 bytes each. Filling up 1GB at 2 rows a minute is going to take over 14 years.

1 Like

This does sound interesting. Don’t make a separate index - preserving write time and storage - and drop the float handling. Sweet.

But I have now looked at the database. I believe I put it online somewhere close to 24 hours ago. I have 1,014 records in the ‘sensor report’ table, and the table size is 160 KB. Meaning, at this rate, it will be somewhere near 3 MB per month (although I still need to create that index for the date field)… and even if it grows in terms of sensors, it still won’t be a problem for today’s average storage capacities.
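
(For the monitoring part, checking the size from Postgres itself is a one-liner, assuming the table is literally called sensor_report:)

```sql
-- Total size of the table including its indexes, human-readable.
SELECT pg_size_pretty(pg_total_relation_size('sensor_report'));
```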

Just looking at those example rows… you’re talking what… 5 columns.

Let’s be pessimistic and assume 64 bits per column (based on the data types I see, most aren’t long long int, but there is a timestamp in there).

That’s 320 bits = 40 bytes per row.

200 GB = 5 billion rows.

A billion is a lot of samples.

Let’s say you’ve got 100% storage overhead (a thumb-suck fudge factor to account for DB overhead); you’re still at 2.5 billion rows.

That’s about 79 years of data at a one-sample-per-second polling rate, 24/7/365.

edit:
unless I screwed up my calculations. But either way, it’s a lot. And that rate is for a single sensor… more sensors means a lower rate per sensor, or fewer years…

2 Likes

Yeah. Guess I have enough storage for this endeavor. And even then I can simply introduce some maintenance and aggregate N-year-old data.

Or just upgrade to a 1TB SSD in 3-5 years :smiley:

1 Like

Generally, it is one of the time-series databases. It is a whole ecosystem for gathering, storing and processing that kind of data. Google searches for the keywords “prometheus” + “grafana”, plus one of “smart home”, “IoT” or “smart sensor”, already show some interesting things.

The key benefit, apart from the database type potentially being the best fit, is that those databases usually come with good options for gathering the data by installing and configuring ready-made components, instead of needing to write a custom application to do that (broadly speaking).

Prometheus has PromQL as a dedicated language to query the data.

You could also add Alertmanager, which can trigger alerts based on sensor information and send an email or a message to a chat.

My suggestion would be to consider how much time you want to spend on solving the problem of how to store the data (e.g. with an SQL database it sounds like you will be doing that: creating the DB schema, populating tables) vs how much time you want to spend on what to do with this data (Grafana, a custom application querying Prometheus), assuming that actually getting the data into Prometheus doesn’t introduce other complex issues.

The thing with ‘ready’ solutions is that almost nobody looks behind the curtain of ‘it works’. Just today I had a conversation about a task to implement a simple REST controller to write data into a DB (a simple effort). But when we went into detail, it turned out that the person was using Postman to feed in a file of N rows (thousands), calling the endpoint one by one with one line of data… (a tale of a thousand transactions). And the fun part is, this is how the person wants it.

And this would not be the first code related horror story in my experience.

So, I’m wary unless I actually know how, why and what is going on under the hood.

‘It’s open source!’ paired with ‘you can always check it out’ works well… unless a good portion of your work already involves reading a ton of this stuff… and you have literally no desire to do so after work as well, at least for my personal projects.

So I try to stay away from baked solutions as much as possible (I’ve had quite a few ‘so you decided to invest a busload of money into increasing infra instead of optimizing the code’ conversations over the years of working in IT).

Dedicated time-series DBs - PostgreSQL + TimescaleDB, InfluxDB or Prometheus - are ideal for your use case. They are optimized around the inherent nature of time series and offer many bells and whistles natively.
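
With TimescaleDB, an existing plain Postgres table only needs to be converted into a hypertable; roughly like this (table and column names assumed):

```sql
-- Assuming a plain Postgres table like this already exists:
--   sensor_report(sensor_id INT, reported_at TIMESTAMPTZ, temperature REAL, ...)
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Convert it into a hypertable partitioned by the time column.
-- (For a table that already holds data, add migrate_data => true.)
SELECT create_hypertable('sensor_report', 'reported_at');

-- Built-in time_bucket() then replaces manual date math for rollups:
SELECT time_bucket('5 minutes', reported_at) AS bucket,
       sensor_id,
       avg(temperature) AS avg_temp
FROM sensor_report
GROUP BY bucket, sensor_id;
```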

Prometheus (datastore) + Grafana (user frontend) is the literal GOAT of this design.

Unless you plan to run your entire setup on a single Raspberry Pi, the performance of Prometheus is not really an issue, regardless of the number of sensors.

As for MongoDB, it’s a very weird choice for time series alone (yes, there has been native support since 2021) and even worse for a beginner. It’s a rather niche NoSQL DB that has no great use case beyond unstructured data. That’s exactly the opposite of what you need.

Obligatory skit for DBAs everywhere:

And this one, cruder but very spot-on about some drama in the 2010s over NoSQL superiority:

3 Likes