HDD sectors - equal in radians or in length?

Peanut253 · May 4, 2017, 9:10pm

Not necessarily. The original question was:

This is in essence tackling two issues.

How exactly does data get written on HDDs?
Is the answer to #1 done for better performance?

Number 2 is the source of the confusion on the following point: Even if you assume that HD manufacturers configure their controllers to write data efficiently, and also that OS-level engineers write their software having a vague understanding of what they are trying to do with the storage media, it is my contention that you cannot reasonable assume such software was written intentionally to take advantage of storage media characteristics nor can the software be reasonably expected to ever be written in such a way. Software cannot direct a controller at that level to write a specific datum in a certain location or in a certain way.

The fact that SSDs are addressed in similar terms to HDDs should more than demonstrate that writing software that optimizes usage for HDDs is of secondary importance to writing software that simply addresses them both the same way. Which is more important to understanding computers: understanding how HDDs are written to or understanding how both HDDs and SSDs are written to? One is a subset of the other, and hence my emphasis on stacks.

They cannot be guaranteed to be write their data from the outside in. If I create 2 partitions on a disk, it may start writting from the half-way point on the platter stack, or it may not. How can we know? We can't except by extrapolating from. You cannot reasonably expect one behavior or the other because different hard disks and different storage medium will behave differently.

dmj · May 4, 2017, 9:46pm

I do, because I don't see a reason for engineers to be unreasonable. Rule of thumb for me: these engineers are much smarter than me, which means that they would implement everything in a most sensible way.

Sorry, but you're wrong here. And actually, I must thank you. Only because you started arguing, I decided to search for actual data and came upon this site I've already given link to (beforehand, it didn't even occur to me that it's up for a discussion, because how disks are written simply makes sense). So, thank you. And here you go:

Storing data in a very specific, usually inaccessible area of a disk. How cool is that?!

Both are important. While you don't care if your partition is on HDD or SSD, and your OS doesn't care because at its lowest level it uses LBA addressing, you do use discard mount option for partitions on SSD, don't you? (Edit: and there are filesystems specifically designed for USB flash drives, like exFAT. You abstract from the underlying media, but you can't ignore it completely, just as @jak_ub said)

Why would it not? Again, my rule of thumb: the most sensible approach wins.
1. You number logical blocks from outside to inside. You only have to remap bad sectors, everything else is just a straightforward list, you don't have to keep the map of every CHS block to every LBA number. You have higher speeds at the beginning, then it slowly drops.
2. You number logical blocks from inside to outside. Same as (1) but you have the slowest speed when you start using the drive. Why would you do that?
3. You keep an additional mapping table LBA->CHS and try to always write to the fastest area possible. That's just insane. I mean, similar solution works on a much larger scale (google IBM XIV) but for a single drive?! No way.

And if there was a way to make throughput constant all the way through the platter without compromising on data density or iops or data retention, I am sure such drives would be available for enterprise solutions right now.

Peanut253 · May 4, 2017, 10:36pm

But they do mean they cannot be relied upon to exist. You cannot rely that a file written contiguously on a file system will be written to a storage media contiguously. You cannot rely that when imaging a disk, that the sectors you ask for (0/0/0) are the sectors that actually on the disk due to the replacement table, media characteristics (like skipping down to the next platter instead of continuing to write using the existing head for performance reasons), or a million other factors. That is what the forensics article was all about. Reasons may include partitions do not have to start at the beginning of a disk, the physical media skipping over large chunks of media where the replacement table itself is held (hint: the replacement table itself is probably not contiguous), and where the USB firmware itself is held in flash drives.

You cannot rely that a file written contiguously on a file system will be written to a storage media contiguously.

Well you were the one who used it as an example first...so I thought you should think it through yourself. If even something as basic as a swap file, which should be easy to optimize using os-level software can't be reasonably expected to keep in say, unfragmented, we should not then expect LVMs to be able to optimize their usage of storage media based upon storage media specific characteristics. Let's take this further just for fun to prove my point even more.

Let's assume a Debian install at default settings creates a partition at the end of the drive for the swap file to have its own partition, written to a file system for which it is the only entry to on an HDD. The install completes, the system boots up. These are all logical structures so far. So here is the question: after booting, where is the page file on the physical disk?

Answer: You can't know.

The disk may, internally, have done the obvious partition layout, as specified by the file system, and started the partition at the very end of the physical medium in which case the page file is split across a series of platters towards the very center of the rotating medium, assuming the disk writes from the outside in. This optimization of creating a dedicated partition for the swap file, would dramatically decrease the performance of the swap file. What if the disk was smarter? What if the disk, recognizing that data was being written to it while it was 98% empty, arranged for the single large file to be written instead to the outer edge of the disk, dramatically increasing performance. Great!

And then some log file gets written to the file system and the disk writes it to the next available fastest performing series of sectors. A program needs to use more memory than initially estimated and the page file crows, contiguously on the file system and fragmented on the physical disk. And as the page file continuously shrinks and expands, it becomes more and more fragmented while appearing to the OS as a contiguous file towards the end of disk.

So, which scenario actually happened? As a human, you can know by reading documentation and performing experiments. As a piece of defrag software, looking down at all of the layers, where is the swap file? At the start or at the end of the disk? Is it defragmented? How do you know it is actually contiguous? How could you tell? How do you know any file is actually contiguous anyway? From the perspective of a piece of defrag software, the only thing you can rely on to know if given file is contiguous is if the file system's file table, mapping clusters to files, says it is contiguous. You cannot know nor be reasonably expected to know more than that. It is extraordinarily unhelpful in understanding computers to try to figure out exactly which, as opposed to focusing on the overall process.

Software that tries to access the physical media directly, that can talk to the PCB controller directly using the is still stuck in the same boat. The hard drive was asked to write some bits, it did, it was asked to return them, it did. Are they contiguous? Maybe, maybe not.

Highlights the division between physical structures. If you try to write to a medium using a physical addressing scheme incompatible with the fundamental characteristics of the media, it should not be very surprising that a translation layer exists, namely the PCB controller that translates between the imagined physical strucutres (heads/cylinders) to actual ones (nand flash). This applies more towards SATA SSDs than USBs although USB drives also share the same emulated characteristics. It is thus, also more helpful to think of writing to a "storage media," because of said translation layer," as opposed to "writing to a physical disk." Again, how to understand computers, in terms of inter-layer communications or intra-layer ones?

This is an elaboration of why software, including OS-level software should not in principle care about storage media characteristics: because it doesn't have to. I directly extended this analogy to OS-level software in my previous post as well. The LVM should not, in principle, care about the actual media characteristics beyond those that is is presented with and needs to actually address the media.

One of the requirements of thinking of computers in terms of stacks is to always have to go through each layer's interface with every other layer. How exactly does layer X communicate with layer Y? What can it know? What can it not be reasonably expected to know? How can this layer find out X?

In particular, I tend to think in terms of applications since I know them so well.

My application is running in this environment, how do I know exactly what the underlying processor is so I know to invoke the correct dynamic link library? How can I tell? What part of this environment provided by the OS tells me that? How often can I count on that, always or just sometimes? If the OS gives me error X, what should I assume happened? What should I assume did not happen? How can I double-check? When will this operation fail? When should it never fail?

Even if you downshift everything by one layer and ask yourself the same questions about the LVM, the LVM does not have to care about the actual characteristics of the media it works with, just enough to actually utilize it.

Strawman fallacy.

Compatibility is king. No one cares about hardware/software projects they can't actually use and that usually means lowest common denominator. You have to code for what you know will work, after that, you can implement extra code to take into consideration that HDDs tend to perform better if written to sequentially and that SSDs should not be defraged and that pagefile X needs to be handed a special way during creation and that this file system with this intended usage type should have this cluster size and that...

Step 1) Make sure it works.
Step 2) Make sure it works well.

Step 1 always gets done with software you have to use in production, but even then #2 does not happen unless a significant bottleneck is addressed. Understanding how things work, step #1 is very important, but understanding the various optimizations for every remote use case for any given structure on any given layer (HDDs at the physical layer in this case) is just not, unless said bottleneck occurs in your particular use case.

jak_ub · May 4, 2017, 10:38pm

How exactly does data get written on HDDs?

Always sequentially, except for necessary mechanisms (like bad sectors, special areas). The controller role is to maintain sequential order as much as possible because the physical distribution of data in physical drives was and still is important.

This rule is less important for SSD. It simply do not hurt SSD to be written in sequential order.

Is the answer to #1 done for better performance?

Yes, exactly for that reason.

Which is more important to understanding computers: understanding how HDDs are written to or understanding how both HDDs and SSDs are written to? One is a subset of the other, and hence my emphasis on stacks.

This thread specifically started with HDD being the only drive in question - it is hard to miss round picture and mentioned RPM parameter.

cannot reasonable assume such software was written intentionally to take advantage of storage media characteristics nor can the software be reasonably expected to ever be written in such a way.

Study at least some performance charts of HDD drives created in last decade. Go even further and look at SSD measurements and difference between random and sequential reads of one single large file.
Then please tell me that you still do not see a pattern there.

If I create 2 partitions on a disk, it may start writting from the half-way point on the platter stack, or it may not.

And what does that change in regards to single file? The fact is, that first partition will somewhat benefit from being in the beginning of the drive and second will not. Normal file system/OS still will try to keep files continuous because the head jump is a potential cost even within the same partition. Second partition still benefits from files being created in the begging of it (the cylinder radius still is longer in the beginning of partition than in the end).

There is some rather common characteristics of sequential operations, regardless of a layer and across the layers. Once you started doing something in some specific place, there is a huge probability that you will continue your job from the next sequential item/sector/memory block. So even memory controller if has nothing else to do is prefetching next memory cell, HDD controller most probably does the same.

Even SSD are faster in sequential operations. (On the side note: I wonder if ware leveling does negatively influence that or it is actually controller optimizing for sequential operations).

Again no one saying that all files are always created sequential.

Peanut253 · May 4, 2017, 10:40pm

Alright, going to sleep. See you when I wake up. Was fun

dmj · May 4, 2017, 10:51pm

Which is why hard drives have cache, I must add. And RAID controllers have write-back cache. And OS have I/O schedulers and cache I/O in RAM. Data storage is the slowest part of any system, which is why making it just a bit faster by rearranging I/O into as much sequential as it can be is so important.

Freaksmacker · May 4, 2017, 11:13pm

Unless your windows.

jak_ub · May 4, 2017, 11:19pm

You cannot rely ..... when imaging a disk, that the sectors you ask for (0/0/0) are the sectors that actually on the disk due to the replacement table, media characteristics (like skipping down to the next platter instead of continuing to write using the existing head for performance reasons), or a million other factors.

And why I would care about that? What all the higher layers rely is that logical block 0 is most probably closer to the block 1 than to the block 10000000. That is enough for higher layers to assume that when writing one file that just started at logical block N it will be beneficial to write its next part to block N+1. It does not matter if that means that will be next head (actually that would be supper beneficial because controller could potentially write both blocks at the same time) or next path (or at least very close) because moving the heads to write to block 10000000 most probably mean that the head need to move by larger distance.

partitions do not have to start at the beginning of a disk

And what it does change in regards of partition logical order being that logical blocks in close in number are also close in the physical media.
And who cares about small difference that logical sector N and N+1 some times be further away. That is the abstraction layer to hide this fact from me.
Abstraction layer role is not to obfuscate everything and cause that any operation will have completely random execution time.

You cannot rely that a file written contiguously on a file system will be written to a storage media contiguously.

No, and I never was, even because a simple fact that I can predict that a larger file will not fit in one cylinder and would need to be continued in the next one, or even better on the different platter (as multiple heads reading/;writing at once are beneficial for performence). Again I do not care that there will be any kind of internal brake with in the physical continuity. Whats important is that block N and block N+1 will be as close ass possible.

This optimization of creating a dedicated partition for the swap file, would dramatically decrease the performance of the swap file. What if the disk was smarter? What if the disk, recognizing that data was being written to it while it was 98% empty....

Where do you take you knowledge from? have you seen such HDD (not SSD)? Have you even heard anyone crated such disk (for normal use). Hey, yea sure, it is possible someone created such worst possible device.
But for what reason?

Please find it.

stacks is to always have to go through each layer's interface with every other layer....

Yes, but why do you assume that each layers does always worst case scenario. And randomizes everything.

How often can I count on that, always or just sometimes?

But then why do you even count on that a file will be written anywhare at all?. Since every layer lies to each other?

Strawman fallacy.

Why? You stated that physical characteristics should absolutely not be used in any other layer than physical medium I gave you example of one I knew you will exclude from that rule. Fallacy is on your side.

No one cares about hardware/software projects they can't actually use and that usually means lowest common denominator.

Yes I know, if someone do not understand the thing, then that someone either do not use it, or use it in the only known way by that person (limited usually to to lowest common denominator) or that person also states that this is magic, or scary thing that should never be spoken of.

Freaksmacker · May 5, 2017, 12:12am

I am willing to look at you statement to learn something from it but lets avoid personal attack. Debate ideas and concepts.

Freaksmacker · May 5, 2017, 12:13am

Supporting evidence as to why you think this way or that is in order.

jak_ub · May 5, 2017, 1:46am

I'm not sure what you are looking for exactly. I've already linked one in the begging and it was not enough.
Basically that is kind of common knowledge that is expected to be known by software engineers. Most of the thesis and research papers behind that knowledge were written decades ago.

Here, could be starting points for your exploration (look for phrases "preallocation" "defragmentation", "preemptive" ) :

The next one is found at random but it points out phrases that explain what OS can and does do on the file system level:

smart file allocation algorithms
large chunk allocation
preallocation
delayed allocation (also called allocate-on-flush)

https://www.quora.com/Why-does-Apple-claim-defragmentation-in-OS-X-is-unnecessary

Links found at random:
http://web.cs.ucla.edu/classes/fall14/cs111/scribe/11a/index.html
https://kb.sandisk.com/app/answers/detail/a_id/8150/related/1
http://downloads.diskeeper.com/pdf/improve-san-performance.pdf

Freaksmacker · May 5, 2017, 1:52am

Much better. Actually. The point is that the reader comes away with more than they had. I want to know why you presented your case that way. Some one like me is interested in the stance you take and why. Not an engineer ? Need to understand......

dmj · May 5, 2017, 7:29am

WTF? Is this thread about psychology and sociology now?

Peanut253 · May 5, 2017, 7:39am

Good morning,

Unrelated: Barking dogs are evil.

So after reading your response, I am not convinced you are addressing the points I raised at all. There is a difference between responding to the literal words someone is speaking with and responding to the message they communicated. To see if this process is worth continuing, I am going to go ahead and limit my response to a single statement you made in your previous post. If I cannot get movement from that single point that I believe highlights the issue, then I will abandon this conversation.

Peanut253:

An OS, from the perspective of an application requesting a file, only reads to fulfill requests from/to file systems. As long as the LVM given that request, returns the file, what does it matter to the application if that file is in a directory that is part of local media, hdd or ssd, USB flash media, or a network share speaking a compatible protocol, on a GPT disk or exFat or ZFS file system? Myprogram.exe does not have to care where cat.jpeg actually is in order to display it, just like OS-level software does not have to and should not take into account storage media characteristics. It cannot reasonably expect the storage media to always be the same, work the same, or benefit from the same optimizations and necessarily adds non-trivial complexity to the existing engineering design due to the numerous layers involved.

From the context, it is clear I meant "...just like OS-level software does not have to and should not take into account storage media characteristics, [beyond the characteristics reported]." That should have also been clear since this discussion is about the amount of obfuscation present in the PCB controller and related ramifications. Your response was:

Deliberately taking half a sentence out of its context, and responding to the literal words used is a great way to obscure the meaning meant to be conveyed. I pointed out that I did not believe this represented an accurate depiction of what I said.

And of course I clarified my meaning for completeness:

Again, this means the LVM actually only needs to care about the characteristics reported, not the /actual/ characteristics of the actual physical media. That, in essence, is half the subject of this discussion.

Your response:

So after pointing it out, there was no attempt, at all, to understand what I meant, nor go back and re-read to make sure the eviscerated statement actually reflected my overall views nor acknowledging my actual statements that tried to correct the false impression. Merely "digging in."

So here is my question to @jak_ub : Do you acknowledge that my statement "...just like OS-level software does not have to...." was responded to out of context and that, by responding to it out of context, constitutes responding to a view point that I did not espouse, (a.k.a. a strawman fallacy)?

Peanut253 · May 5, 2017, 7:56am

Do you really think that FreeNAS needs genuine physical access to the physical hardware? To have the knowledge to override the PCB controller and write the bits itself?

Usually, what it means when someone says that FreeNAS needs access to the physical hardware is that it needs to be able to communicate with a controller card on the motherboard without any additional abstraction layers present than are normally present when running a normal OS, e.g. do not add RAID. The drives will still be accessible to FreeNAS with a ZFS configuration, but would dilute the performance. So "FreeNAS needs access to the physical hardware", if said in a technically correct way, actually is "FreeNAS, for the purposes of optimizing ZFS configurations, should be able to access motherboard hard disk controllers as directly as possible. Indirect access, such as through a controller card that implements RAID or complex protocols such as iSCSI should be avoided for performance reasons."

It specifically does not mean, and could never mean, in light of computers being layered stacks, that FreeNAS could ever know how to actually manipulate the bits on HDDs or SSDs. That is what jak_ub and I were talking about before derailing. To what extent do the actual physical media characteristics differ from the reported ones? (SSDs reporting they have heads, HDDs reporting 128 platters.) Does that constitute a sufficient barrier towards operating system storage controller driver and file system access pattern optimizations? If in fact, they do, then would it not be more helpful simply to think of them in terms of discrete layers (PCB-controller and below distinct from the motherboard's storage controller) in terms of modern storage stacks? That then precluddes that one should focus on intra-layer optimizations rather than inter-layer ones.

Freaksmacker · May 5, 2017, 9:42am

Just trying to make sense. I now, at least get the approach.

jak_ub · May 5, 2017, 12:56pm

It might look like that, but I only quote one sentence out of part that I want to reefer to. It is not perfect but also quoting multiple paragraphs also is not perfect.

If I cannot get movement from that single point that I believe highlights the issue, then I will abandon this conversation.

First, here is the description of the storage characteristics:

Why would application care how to access the files:
https://ayende.com/blog/162791/on-memory-mapped-files

Here you have starting point why WIndows would care aboud HDD vs SSD (look for Storage Tiers):
https://technet.microsoft.com/en-us/library/hh831739%28v=ws.11%29.aspx?f=255&MSPPError=-2147217396

how do I know exactly what the underlying processor is so I know to invoke the correct dynamic link library? How can I tell? What part of this environment provided by the OS tells me that?

https://software.intel.com/en-us/articles/intel-sdm

Other links that might help (or not) in regards to the background of discussion.

Peanut253 · May 5, 2017, 1:12pm

You have not answered my question satisfactorily as to whether deliberately taking my sentence out of context and responding to a position I do not hold, despite numerous clarifications, constitutes a strawman fallacy, which I directly asked you about in my previous response.

I hope everyone was able to learn from the conversation up until this point.

I will now abandon this conversation and not respond to the topic further. Was fun. Goodbye

jak_ub · May 5, 2017, 3:50pm

Well, your prerogative to stop at any point.
But, I believe I've answered your question before you even asked it directly:

As I stated before:

I think you put to much attention to such details like head/cylinder/cluster instead of overall characteristics like semi/non-random access memory storage vs random access memory storage.

In the end I linked to the description of "storage characteristics", to make sure that we are on the same page about what those are. I linked to many examples (one in the begging) that clearly shows that those are important accepts considered when implementing the file systems.

At least try to read them (and those bellow).

There was also a part that I think I omitted completely:

In particular, I tend to think in terms of applications since I know them so well.

After that you asked many questions that in some cases have rather very simple answer (I was not sure if they were legit or not). I will omit simple answers like CPUID, language runtimes, and provide non-direct answers that go beyond that:

https://lmax-exchange.github.io/disruptor/

https://docs.oracle.com/cd/B28359_01/server.111/b28310/onlineredo001.htm#ADMIN11304

And here is example of optimizations category some developers must know and use, otherwise half of internet would be pissed (random link):

dmj · May 5, 2017, 8:40pm

What? No, of course not! You're the only one here who seems to think that understanding how underlying storage media works is the same as doing firmware's job yourself.

No, you're the only one arguing this point here.

But I see you're already out. Oh, well.