C optimization: GUESS!

After getting in a debate about C loop optimization, I banged out a quick test and found the results interesting enough to share.

My hypothesis was that using a char loop variable doesn't save enough CPU/memory to be worth the risk of running into a situation like this one.
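
(If you've never hit that failure mode: the classic hazard is counter wrap-around. A minimal sketch - the constants here are made up for illustration, assuming the buffer one day grows past 255 elements:)

#define WRAP_SIZE 300	// grew past what an unsigned char can count to

void wraps_forever()
{
	volatile char data[WRAP_SIZE];
	volatile unsigned char index;	// tops out at 255

	// index wraps from 255 back to 0, so "index < WRAP_SIZE" is
	// always true and this loop never terminates
	for (index = 0; index < WRAP_SIZE; index++)
	{
		data[index] = index;
	}
}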

Guess which of the test cases runs fastest, by how much, and why.

Context:
- Intel i5
- gcc, default settings (so no -O flag; gcc defaults to -O0)


#include <stdio.h>
#include <time.h>

#define SCALE 10000000
#define NUMTESTS 6
#define SIZE 250


//===================================================================

// int --> char
void test1()
{
	volatile char data[SIZE];
	volatile int index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}

// unsigned int --> char
void test2()
{
	volatile char data[SIZE];
	volatile unsigned int index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}

// unsigned char --> char
void test3()
{
	volatile char data[SIZE];
	volatile unsigned char index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}

// unsigned char --> unsigned char
void test4()
{
	volatile unsigned char data[SIZE];
	volatile unsigned char index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}

// unsigned int --> int
void test5()
{
	volatile int data[SIZE];
	volatile unsigned int index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}

// unsigned int --> unsigned int
void test6()
{
	volatile unsigned int data[SIZE];
	volatile unsigned int index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}



//===================================================================

void (* tests[NUMTESTS])() = {
	&test1,
	&test2,
	&test3,
	&test4,
	&test5,
	&test6,
};


int main()
{
	clock_t start, stop;
	unsigned int loop, index;
	double deltaTime;

	for (index = 0; index < NUMTESTS; index++)
	{
		start = clock();

		for (loop = 0; loop < SCALE; loop++) tests[index]();

		// print result
		stop = clock();
		deltaTime = (double)(stop - start);
		printf("time(%d): %f\n", index + 1, deltaTime / CLOCKS_PER_SEC);
	}
	return 0;
}
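
(For scale: each test performs SCALE x SIZE = 1e7 x 250 = 2.5e9 volatile stores, so a time around 8 seconds works out to roughly 3e8 stores per second.)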

Results !!!

( guess before looking! )



time(1): 8.156000
time(2): 8.070000
time(3): 7.875000
time(4): 7.319000
time(5): 7.614000
time(6): 7.665000

Very interesting, what optimization level did you have the compiler on?

--

For shits'n'giggles I tried the same thing in python...

import time

start = time.time()

# same shape as the C test: 1e7 outer iterations,
# each filling a 250-element list with its index
data = [0] * 250
for loop in range(10000000):
	for index in range(250):
		data[index] = index

stop = time.time()

print("time: " + str(stop - start))

Output:

time: 265.443000078
( that's a little shy of 5 minutes... )

Basically, don't use python for anything with complexity greater than O(N)...

TL;DR: Python ran the same thing about 35x slower.

What did you expect? HIGH level language:
- high abstraction (aka a lot of bullshit between written code and hardware)
- high resource usage (aka RAM and CPU whoring - yes Java, I'm looking at you!)
- high confusion (aka it compiles and runs but does something else... interesting)
- high restrictions (aka "we don't know how to allow you to access the hardware, that's why we keep it away from you.")
- high learning time (aka you can get a degree in CS only learning how to use Java without actually building something useful)
- high latency (aka if 300 of 1000 concurrent requests don't time out, we are good; otherwise upgrade your hardware)
- high ego (aka "This language will change the world! It is amazing, has awesome features and you need to love it!")

1 Like

I was expecting about 10x; I found it quite entertaining that it turned out to be 35x.

If you want to see really high numbers, you need to check out the Java Reflection API.
I once had to create a Java framework and read a lot of guides to prepare for that.
They all had one thing in common: don't use Java Reflection if you don't have to. The performance penalty can be as much as 72x - and that's on top of Java's default (poor) performance, not compared to C.

And I was just sitting there laughing my ass off, since a professor one week before that was actually trying to convince me that Java and C are equally fast. He was also my boss at a company. I don't need to explain that I left both, right? College and company.

He also told me he had a research project on that, with five students. None of them could use a pointer in C or write a simple C program. He then referred to the JIT compiler (the Java HotSpot compiler). JIT compilation was introduced because Java's performance was even worse back when the JVM bytecode was purely interpreted. He totally missed that part, but at least he had a research project, right?

1 Like

It is really interesting to see a difference.

time(1): 9.375526
time(2): 9.356545
time(3): 9.372130
time(4): 9.442717
time(5): 9.460095
time(6): 9.467903

Looking deeper into the assembly generated, one can see that when using (unsigned) char, it uses BYTE PTR instead of DWORD PTR and movzx instead of mov. But as long as nothing wider is involved, I would not have expected that to make much of a difference, because AFAIK those instructions should take the same number of cycles to execute.
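
(A toy illustration of where that movzx comes from - my reading of it, not a dump from this exact build: the 8-bit index has to be widened to pointer width before the address can be computed, and for an unsigned char that widening is a zero-extension. Running gcc -S on something like this hypothetical helper shows it:)

// hypothetical helper, only here to show the widening
char promo_demo(const char *data, unsigned char index)
{
	// index must be widened to pointer width to form the address;
	// for unsigned char that is a zero-extension - the movzx
	return data[index];
}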

1 Like

Yep, same number of cycles... however, bytes are pulled across the memory bus faster than ints.

// needs #include <array> and #include <numeric>
void test7() {
	std::array<int, SIZE> data;
	std::iota(data.begin(), data.end(), 1);
}

time(1): 6.086766
time(2): 6.702280
time(3): 6.002496
time(4): 5.992530
time(5): 6.530134
time(6): 5.954396
time(7): 5.727235

Run on a 2.3 GHz i7,
compiled with gcc 5.3.1.
The only extra thing I did was pin it to a core with taskset.

EDIT:

Further, it's no contest once you pass the -O2/-O3 flags. Since you're using the volatile keyword, that blocks the compiler from optimizing those loops away - but test7 has no volatile, so it nearly vanishes:

time(7): 0.016261
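
(Side note, and an assumption on my part since nobody in the thread tried it: if you want -O2 numbers without the loops vanishing, one alternative to volatile on GCC/Clang is an empty asm barrier that pretends to read the buffer. The stores then have to happen, but the loop itself can still be optimized normally:)

// hedged sketch, GCC/Clang only
static void sink(void *p)
{
	// claims to read *p and clobber memory, so the stores
	// feeding it can't be deleted as dead code
	__asm__ volatile("" : : "g"(p) : "memory");
}

// int --> char, no volatile, still measurable at -O2
void test1_opt()
{
	char data[SIZE];
	int index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = (char)index;
	}
	sink((void *)data);
}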

The python code is not the same as the code provided, so it's not really a fair comparison.

Here's something that's more equivalent in python:

In [1]: %timeit -n100 [i for i in xrange(250)]
100 loops, best of 3: 6.94 µs per loop

In [2]: %timeit -n1000 [(i & 0xffffffff) for i in xrange(250)]
1000 loops, best of 3: 11.1 µs per loop

So it would take about a minute to run the outer loop (6.94 µs x 1e7 ≈ 69 s), or close to two minutes if you cast the signed int to an unsigned int (11.1 µs x 1e7 ≈ 111 s).

I would be willing to bet that with some basic cython, the performance would get a lot closer to the gcc results.

Hate to be a jerk, but that statement about not using python for anything with a complexity greater than O(N) is pretty ridiculous.

1 Like

I think you're missing the point of the exercise... This was about testing the cost of typecasting and how much is gained by using an 8-bit loop variable when you don't need to count very high and your data is also 8-bit. This isn't about how fast you can count to 250 and write sequential numbers into contiguous memory.

That's optimized code... again... missing the point. If I wanted optimized code, I wouldn't have written it the way I did.

And yes, I know it's not a perfect match for the C code. A perfect match is pretty hard to get, since the python runtime doesn't work anything like compiled C code.

You're not being a jerk... you're just taking my sarcasm too seriously. It was a sarcastic statement about how the python code that looked nearly identical took 35x as long to run. It was just for shits'n'giggles.

Looking identical doesn't mean you aren't still doing it wrong, because there's an optimized solution built into the interpreter for that shit.

(Don't insult the only scripting language I know bro, it's late and I shouldn't even be posting ;_;)

If it's about how fast you can cast the numbers, then why bother using volatile keywords and static_cast for free?

The python code is not "optimized".

There's no need to pre-allocate the list by padding it with zeros. Then you create another list of size 1e7, and upon each iteration you create a list of 250 only to refill the original list you preallocated. Your inner loop takes 500 steps whereas the C code only takes 250, and the outer loop takes 2e7 steps in python while the C code takes 1e7.

1 Like

I got very different results.

Core 2 Duo using default gcc optimization:
time(1): 10.395982
time(2): 10.763732
time(3): 11.787473
time(4): 16.133874
time(5): 10.515141
time(6): 10.483679

Core 2 Duo using -O2:
time(1): 9.927467
time(2): 10.431360
time(3): 10.430904
time(4): 10.430874
time(5): 10.950903
time(6): 10.954640

VM on AMD A6-5200 using default gcc optimization:
time(1): 10.601255
time(2): 8.185348
time(3): 14.554221
time(4): 14.616711
time(5): 9.912460
time(6): 9.831467

VM on AMD A6-5200 using -O2:
time(1): 10.380596
time(2): 11.633104
time(3): 11.237742
time(4): 11.201735
time(5): 9.955046
time(6): 9.840577

Interesting...
I guess your mileage may vary... wildly... There is a noticeable negative difference in test(2) on the AMD CPU when -O2 is enabled, and it was consistent between runs, but I could only run that in a VM, so that may have something to do with it. May investigate further later today... because why not...

1 Like

If you have coded anything in assembly, it makes total sense. Picking ints is basically picking the machine's native register size - or at least, in your test case, the "fallback" size, which is 32 bits. All 64-bit CPUs since their inception can easily switch between 32-bit and 64-bit operations; there is no overhead that I can think of (which means crunching 8-byte numbers on a 64-bit machine should add no overhead in this test either - you can try that too). When operating on chars, you have to perform extra operations to fit things to their size, which is an overhead, no matter how small. I believe this is the overhead you're seeing.

EDIT: Ok so what I have noticed:
Objdump of test

I was right that the chars take extra operations to fit things (but honestly, it's just a couple of cycles, judging by the objdump) and that crunching 64-bit numbers adds no overhead (I added a test7 with unsigned long 64-bit numbers). I can't run that test right now since my machine is loaded, and that's no way to benchmark - anyone willing to run it for me? :)
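
(For reference, here's the shape of the test7 I mean - the exact body is an assumption since I didn't paste mine. unsigned long is 64 bits on x86-64 Linux:)

// unsigned long --> unsigned long (64-bit)
void test7()
{
	volatile unsigned long data[SIZE];
	volatile unsigned long index;
	for (index = 0; index < SIZE; index++)
	{
		data[index] = index;
	}
}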

EDIT EDIT: gcc version 6.1.1 20160510 (Red Hat 6.1.1-2) (GCC) by the way.

1 Like