Why is a iterating a list of objects slower than iterating a list of object pointers?

After reading this blog post about how unfriendly a list is to cache:
http://www.baptiste-wicht.com/2012/11/cpp-benchmark-vector-vs-list/

… I tried to make a std::list of pointers to objects more cache friendly by putting the actual object into each node (thereby removing one indirection operation) in the hope that when the current node is cached, the object will be too. However, performance actually decreased. Here’s the code I used:

#include <list>
using std::list;

list<Object*> case1;
list<Object> case2;

class Object {
    public:
        Object(char i);
        ~Object();

        char dump[256];
};

// Should not notice much of a difference here, equal amounts of memory are 
// allocated
void Insertion(Test* test) {

    // create object, copy pointer
    float start1 = clock->GetTimeSec();
    for(int i = 0;i < test->size;i++) {
        case1.push_back(new Object(i)); 
    }
    test->insertion1 = clock->GetTimeSec()-start1;

    // create object in place, no temps on stack
    float start2 = clock->GetTimeSec();
    for(int i = 0;i < test->size;i++) {
        case2.emplace_back(i); 
    }
    test->insertion2 = clock->GetTimeSec()-start2;
}

// Case 2 removes one extra layer of derefence, so it should be more cache 
// friendly, because when the list node is found in cache, the object should be
// there too
void Iteration(Test* test) {

    // faster than case2 for some reason
    float start1 = clock->GetTimeSec();
    int tmp1 = 0;
    for(list<Object*>::iterator i = case1.begin();i != case1.end();i++) {
        tmp1 += (**i).dump[128]; 
    }
    test->iteration1 = clock->GetTimeSec()-start1;

    // why the hell is this slower? I removed a dereference
    float start2 = clock->GetTimeSec();
    int tmp2 = 0;
    for(list<Object>::iterator i = case2.begin();i != case2.end();i++) {
        tmp2 += (*i).dump[128]; // is equal to tmp1, so no mistakes...
    }
    test->iteration2 = clock->GetTimeSec()-start2;
}

// Case 2 removes one extra layer of derefence, so it should be more cache 
// friendly, because when the list node is found in cache, the object should be
// there too
void Deletion(Test* test) {

    // again, faster than case2 for some reason
    float start1 = clock->GetTimeSec();
    int size1 = case1.size();
    for(list<Object*>::iterator i = case1.begin();i != case1.end();i++) {
        delete *i;
    }
    case1.clear();
    test->deletion1 = clock->GetTimeSec()-start1;

    // as before: why is this slower? I removed a dereference
    float start2 = clock->GetTimeSec();
    int size2 = case2.size();
    case2.clear();
    test->deletion2 = clock->GetTimeSec()-start2;
}

These functions are run for test->size values linearly varying from 1 to 100000, and time differences between clock->GetTimeSec() are saved to disk after calculations are finished. A plot of my results can be found here:

http://wilcobrouwer.nl/bestanden/ListTest.png
As you can see, case 2 is about 10% faster at inserting and about 10% slower at iterating and deleting, which means the extra dereference needed for iterating case 1 makes it faster!

What am I missing here?

Edit 1: my CPU is a Phenom II X4 with 64K/1MB/6MB of cache, and I’m compiling this way (please note that -m64 is implied, which implies a ban on x87 via -mfpmath=ssse):

Compiler: TDM-GCC 4.7.1 64-bit Release
rm -f obj/Clock.o obj/main.o obj/Object.o ListTest.exe
g++.exe -c Clock.cpp -o obj/Clock.o -std=gnu++11
g++.exe -c main.cpp -o obj/main.o -std=gnu++11
g++.exe -c Objecst.cpp -o obj/Object.o -std=gnu++11
g++.exe obj/Clock.o obj/main.o obj/Object.o -o ListTest.exe -static-libgcc

Edit 2: answer to Dale Wilson: with list I mean a std::list. Answer to Mats Petersson: a summary has been added to the picture. Optimization checks are under way. Answer to the guy who asked about bigger data sets: sorry, I’ve only got 4GiB of RAM, and the plots from the current maximum up to filling that are quite boring.

Edit 3: I’ve enabled -O3 (-O2 produces similar results), which only made stuff worse:

http://wilcobrouwer.nl/bestanden/ListTestO3.png
This time, case 2 is about 20% faster at inserting and about 0~5 times slower (gets worse at higher test sizes) at iterating and deleting. Same conclusion.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>