For shits'n'giggles I tried the same thing in python...
import time

start = time.time()
data = [0] * 250
for loop in range(10000000):
    for index in range(250):
        data[index] = index
stop = time.time()
print("time: " + str(stop - start))
Output:
time: 265.443000078 ( that's a little shy of 5 minutes... )
Basically, don't use python for anything more than complexity O(N)...
TL;DR: Python ran the same thing about 35x slower.
What did you expect? HIGH level language:
- high abstraction (aka a lot of bullshit between written code and hardware)
- high resource usage (aka RAM and CPU whoring - yes Java, I look at you!)
- high confusion (aka it compiles and runs but does something else... interesting)
- high restrictions (aka "we don't know how to allow you to access the hardware, that's why we keep it away from you.")
- high learning time (aka you can get a degree in CS only learning how to use Java without actually building something useful)
- high latency (aka if 300 of 1000 concurrent requests don't time out, we're good; otherwise upgrade your hardware)
- high ego (aka "This language will change the world! It is amazing, has awesome features and you need to love it!")
If you want to see really high numbers, you need to check out the Java Reflection API. I once had to create a Java framework and read a lot of guides to prepare for that. They all had one thing in common: don't use Java Reflection if you don't have to. The performance penalty can be around 72x - and that's on top of Java's default (poor) performance, not compared to C.
And I was just sitting there laughing my ass off, since a professor one week before that had actually tried to convince me that Java and C are equally fast. He was also my boss at a company. I don't need to explain that I left both, right? - college and company.
He also told me he had a research project on that - with five students. None of them can use a pointer in C or write a simple C program. He then referred to the JIT compiler (the Java HotSpot compiler). That thing was created because Java's performance was even worse back when the JVM bytecode was purely interpreted. He totally missed that part, but at least he had a research project, right?
Looking deeper into the generated assembly, one can see that when using (unsigned) char, it uses BYTE PTR instead of DWORD PTR and movzx instead of mov. But as long as nothing wider is involved, I wouldn't have expected that to make much of a difference, because afaik those operations should take the same number of cycles.
Run on a 2.3 GHz i7, compiled with gcc 5.3.1; the only extra thing was pinning it to a core with taskset.
EDIT:
Further, it's no contest if you pass in the -O2/-O3 flags. Since you're using the volatile keyword, that blocks the compiler from optimizing those accesses away.
I think you're missing the point of the exercise... This was about testing the cost of typecasting and how much is gained by using an 8-bit variable for loops when you don't need to count very high and your data is also 8-bit. This isn't about how fast you can count to 250 and write contiguous synchronous numbers.
That's optimized code... again... missing the point. If I wanted optimized code, I wouldn't have written it just the way I did.
And yes, I know it's not a perfect match for the C code. A perfect match is pretty hard to get since the python runtime doesn't work anything like how compiled C code does.
You're not being a jerk... you're just taking my sarcasm too seriously. It was a sarcastic statement on how the python code that looked nearly identical took 35x as long to run. It was just for shits'n'giggles.
If it's about how fast you can cast the numbers, then why bother using the volatile keyword, and why throw in static_cast for free?
The python code is not "optimized".
There's no need to pre-allocate the list by padding it with zeros. You then create another list of size 1e7, and on each iteration you create a list of 250 only to refill the original list you preallocated. Your inner loop takes 500 steps whereas the C code only takes 250, and the outer loop takes 2e7 steps in Python while the C code takes 1e7.
Core 2 Duo using default gcc optimization:
time(1): 10.395982
time(2): 10.763732
time(3): 11.787473
time(4): 16.133874
time(5): 10.515141
time(6): 10.483679
Core 2 Duo using -O2:
time(1): 9.927467
time(2): 10.431360
time(3): 10.430904
time(4): 10.430874
time(5): 10.950903
time(6): 10.954640
VM on AMD A6-5200 using default gcc optimization
time(1): 10.601255
time(2): 8.185348
time(3): 14.554221
time(4): 14.616711
time(5): 9.912460
time(6): 9.831467
VM on AMD A6-5200 using -O2
time(1): 10.380596
time(2): 11.633104
time(3): 11.237742
time(4): 11.201735
time(5): 9.955046
time(6): 9.840577
Interesting... I guess your mileage may vary... wildly... There is a noticeable negative difference in test(2) on the AMD CPU when -O2 is enabled and it was consistent between runs, but I could only run that in a VM so that may have something to do with it. May investigate further later today... because why not...
If you have coded anything in assembly, it makes total sense. Picking ints basically means picking the machine's native register size - or at least, in your test case, the "fallback" size, which is 32 bits. All 64-bit CPUs since their inception can easily switch between 32- and 64-bit operations; there is no overhead that I can think of (meaning crunching 8-byte numbers on a 64-bit machine should show no overhead in this test either - you can try that out too). When operating on chars you have to perform extra operations to fit things to their size, which is an overhead, no matter how small. I believe this is the overhead you're seeing.
I was right that chars take extra operations to fit things (though honestly, it's just a couple of cycles, judging by objdump) and that crunching 64-bit numbers adds no overhead (I added test7 with unsigned long 64-bit numbers). I can't run that test right now since my machine is loaded - that's no way to benchmark - so could somebody run it for me? :)
EDIT EDIT: gcc version 6.1.1 20160510 (Red Hat 6.1.1-2) (GCC) by the way.