30 Jan 2008

explanation of some optimizations done at compile time

I got a couple of complaints that recent posts have been aimed at techie people.
In my last post, I explained a little about gcc, which is the C compiler used in linux distributions (and some BSD)

I’ll give some examples.
I mentioned that if you compile for your CPU, you can include SSE, which are special instruction sets that the CPU has to speed up certain processes. Here’s a very basic explanation.

Imagine your CPU does the basic functions. It can add numbers together, subtract, multiply, divide etc..
But what does it do when you need to calculate the square root, or find prime numbers, or find the multiples of a certain number. generally, the compiler will convert such statements (from source code) into something the computer CAN do.
For example, there are many ways to calculate the square root of a number. Some involve doing lots of multiplies and divides, and take a long time. For this reason, if the CPU maker knows that a certain command is likely to be used frequently, they’ll build an extra part on the CPU that basically has just that function, and does it extremely quickly.
So, in the case of square root, if Intel knew that certain programs used them all the time, they’d have a special square root instruction that would do square roots.

The problem is, since Microsoft has a monopoly on PCs, and they want Windows do work on old and new CPUs, they can’t include any of these special instructions that would work on newer CPUs.

The end result, companies build their binary (the code that a computer runs) using very few optimizations, as that’s the safest way, as it will be supported on more hardware.
Imagine Firefox only working on a Core 2 Duo CPU or above.

The benefit of compiling your own programs then, is that it can use all the new fancy things your new CPU has (and not the ones you don’t).

Next comes Loop unrolling.
It is very common for computer programs to have loops.
Imagine you need to do something a certain number of times
The source code would look something like this.


So, you have your current position in the loop declared and instantiated as ‘0’ and you increase by ‘1’ until you reach ‘9’

This program will run the function ‘doSomething()’ 10 times
You start out with i=0. you then check that it’s equal to or less than 10, increment it, and run the function. You then start over (i will equal 2 on the 2nd loop, and so on). This happens until you check the value of i, and equal to or greater than 10 (then it will quit out of the loop, and continue after that).

A quicker way of doing this would be to just write it all out.
Obviously in certain situations (like when you’re asking for user input) you don’t know how many times you want to do something, but in this case, it’s a predetermined 10 times.
So, why not write it out like this


They’re both do the exact same thing. The latter is optimized for speed, and the former is optimized for filesize.

This should explain the previous post about compiling firefox.