# iLand Tech blog Blog Actions

some ideas and technical stuff....

## Documenting the Javascript API with Doxygen

Friday 21 of January, 2011

This post will most probably be no much fun to read - much like the researching of the topic was quite ... tedious.
The question is how to set up a documentation of the Javascript API. There are certain requirements for such a setup:

• the doc should be maintained while coding and you should be able to use the same tools (i.e. IDE) for that
• the doc tool should have some basic knowledge of software and should therefore be able to automatically create cross links
• however close the act of documenting is to the act of source code production: the result in the end should be a shiny HTML document that is linked to the iLand wiki

The C++ source code documentation of iLand uses Doxygen, which is a very nice tool and tailored explicitely to C++ as the language. So my first guess was to try to somehow use the documentation of the C++ source code that is attached to the objects/functions related to the scripting engine. While that sounds straightforward, it has some problems: On the one hand, the syntax of C++ and Javascript is different (e.g., there is no void in JS), and, more important, the documentation targets the wrong audience: if you want to use the Javascript function, you'd probably not very interested in the details of marshaling data types between Javascript and C++.
The alternative is to set up a specific "Javascript API documentation" that targets script hackers and deals with Javascript syntax.
After some failings I stumbled upon a perl script for Javascript to Doxygen conversion. This script creates from Javascript a "kind" of C++ code which would not compile but can be processed by Doxygen.
After a couple of minor difficulties (e.g. how to execute perl scripts on windows) I managed to set up the production chain:

• write the docmentation in the IDE (Qt Creator)
• process by the perlscript js2doxy.pl
• process the result with Doxygen

However, it took some time until I found out how to write the documentation right....
And here are the rules:

• write a "constructor" function (even if it is not used).
• write the signatures of the functions in javascript; use the special commands for paramters/return values to provide types in the documentation
• assign the function to the prototype (!). This is just to show membership relatiions to the js2doxy-script

Here is an example:

/** DataSource is the basic object to handle input data of various sources.
*/
function DataSource()   // constructor (use  @ctor to add real docmentation)
{ }
...

/** advance to next record and retrieve content. Returns false, if pointer is located behind the last record.
The next() function provides a efficient looping construct:
@code
// ds: a data source
while (ds.next()) {
// do something
}
@endcode
@treturn bool Returns false, if pointer is located behind the last record. */
function next() {}
DataSource.prototype.next = next;

/**
opens a datasource with a given \p type from a specific source \p name.
Opens a data source. Writes a output message if opening the data source failed.
@tparam string type Type of the datasource.
@tparam string name path to the file (for file based data sources) or the SQL query for database types.
\sa isOpen() */
function open(type, name) {}
DataSource.prototype.open = open;

Note the way the how datatypes are descibed for the open function, and how code examples are included for the next function.
Hopefully soon I will add a link to an example of how such a documentation then looks like in HTML format...
This way of documentation is obviously not perfect; however, it allows to separate clearly between source code doc and users doc and is uses the same tools (i.e., Doxygen).

## Linux based profiling and performance

Wednesday 13 of October, 2010

It is more than time for another blog post here as the last post had several month to mature. The strange thing, however, is that the urge to write is always highest after fiddling with performance-screws...
After implementing the seed-dispersal and regeneration module, the performance of the new module was a little bit disappointing. I took this as an opportunity to spend an afternoon with profiling iLand. Well, most of the time I spent with research, installations and such. "Profiling" essentially means to let a profiling software perform extensive and very detailed timings of the runtime behavior of the target program; this helps when looking for performance bottlenecks, or simply when you want seek performance insights.
After some research I figured that a good solution would be Valgrind as measurement framework in Linux accompanied by KCacheGrind as a GUI. That of course requires iLand to be compiled for Linux (which I had tried already before). So the steps were not particularly daunting:

• take a Ubuntu (10.04) in a VirtualBox with installed Qt (currently 4.6.3)
• checkout the iLand source code from iland.boku.ac.at
• build iLand and fix the minor issues stemming from different compiler versions (e.g.: _isnan() on Windows versus isnan() on Linux - I am sure, there is a simple solution for that)
• run valgrind + KCacheGrind

Using valgrind is quite straightforward; cd to the iland executable directory and type the following into the terminal:

valgrind --tool=callgrind --toggle-collect=Model::runYear() ./iland

the flag --toggle-collect instructs valgrind to start and stop measuring when entering/leaving the runYear() function of the Model-class, which is exactly the thing to do when interested in model performance. The valgrind tool produces a output file in that directory which can be conveniently read by KCacheGrid.

More difficult was to enable debug symbols also in release mode. This is also one of those things that I wanted to try for a long time: debug symbols are needed if one wants to run the "real" iLand version in the debugger, i.e. to set breakpoints, and step through the code. Furthermore, debug symbols allow to check the assembler instructions produced by the compiler, which is particularly useful for critical code sections (and of course to satisfy personal curiosity). And last but not least, they are needed for sensible profiling results.
Voila, here is my "solution", which is quite a crude hack, but works: to enable debug symbols in release mode, add those two lines in the iland.pro file:

# to enable debug symbols in release code
# debug information in release-mode executable
QMAKE_CXXFLAGS_RELEASE += -g
QMAKE_LFLAGS_RELEASE -= -Wl,-s

It took some tries until I found out that the linker also needs to know about debug symbols (the -Wl,s tells the linker not to remove debug information from the binary).
After getting a little familiar with the profiling tool at hand, I started to look for possible villains, performance-wise. The sapling growth proved to be an especially rewarding field. However, I also did applied some minor optimizations, e.g. to the indexOf()" method of the Grid''-template class.

And here are some results: the test case were a approx. 1km2 area with a stocked resource unit in the very center that acts as a source of seeds. I simulated 100 years; during the first half of the simulation a lot of regeneration/sapling growth took place on the bare ground; the second half was dominated by very high stem numbers. Multithreading was disabled (I ran the test in Windows XP on my good old HP dual-core laptop):

The image shows the runtimes (ms) for certain (sub-)tasks for the unoptimized and the optimized iLand version. Note that the "total" column is not the sum of the left-hand columns.

This chart shows the relative performance gain due to the optimization. For instance, the establishment routine is more than twice as fast as before. But also the "applyPattern" (i.e. the LIF generation) shows improved speed.

Some conclusions:

• iLand still runs flawless on Linux (see screenshot below for iLand running in Ubuntu/Gnome)
• valgrind/KCachegrind is a powerful profiling tool (and it is set up now!)
• there is always room for improving performance

## let the seeds fly

Friday 30 of April, 2010

During the last weeks we spent some time discussing the whole topic of regeneration. We are quite excited about that, because this will be the first real 'landscape'-process and we are curious to find out how nice this plays with the concepts on which iLand is built upon... Besides that, Rupert did a very thorough testing and parameterization of PNW tree species. To that end we applied some modifications in the water cycle (as it seams, it is not easy to model the severity of the growth limitation due to drought). Anyway, so far the results look quite good.
But this post is about the seeds. The seed dispersion algorithm was the first part of the regeneration chain that made the jump from mere theory (e.g. regeneration, or dispersal) into silicio. Implementation details may be found here.

Here is an image showing the dispersal:
<img src='tiki-view_blog_post_image.php?imgId=3' border='0' alt='image' width='100%'/>

The test landscape is essentially empty (see the small screenshot in the lower right corner) and so the effect of the dispersal process is quite obvious.

The production of such nice looking debug images is very straightforward and can be done by one line of code:

// save a image in black/white mode (2nd parameter)
// store as png
gridToImage(seedMap(), true, 0.,1.).save("targetfile.png");

The (iLand)-Grid class offers a 'gridToImage()' function; and saving as a compressed .png is done - thanks to Qt - just by a call to 'save()'.

The next challenging topic will be to find way to model the establishment phase in a way that is both scientifically sound and computationally cheap... The answer (of course) is blowing in the wind...

## Some performance considerations

Tuesday 13 of April, 2010

While applying several modifications in iLand (primarily in the water cycle, see http://iland.boku.ac.at/tiki-view_tracker_item.php?itemId=30) some performance issues came up as well.
First, just out of curiosity I ran a test that compares the debug and the release mode of iLand:
All tests were done on the HP-laptop (dual-core). The tree inits were 450Stems/ha with mean diameter of 40cm (only spruce). Outputs were turned off, log output was written in a file.

## Comparison of debug vs. release mode

A test with a simulation area of 5000x5000m (i.e. 2500ha) over 10 years:

 metric debug mode(sec) release mode(sec) grow trees 13.9 9.9 grow resource units 14.2 8.6 water 9.3 6.4 apply LIP 135.4 31.1 total 171 53

From this figures it is very clear, that release mode is much faster.... not a big surprise, though. Currently, the 3pg-calculation (grow resource units) are not multithreaded (because of crashes). The CPU-percentage therefore drops from 100% (for single tree operations) to a lower value while calculation resource-unit related data.

## dynamic stand output

The dynamic stand output is a very convenient features as it allows the kind user to define the tree aggregates that he/she wants (see http://iland.boku.ac.at/Outputs#dynamic_stand_output_by_species_RU ). The drawback of all the dynamics is the poor performance. One reason is the calculation of percentiles: this is quite a complex computation which involves the sorting of the whole data set. Second: when using expressions (e.g. if(stress>0.1,1,0) there is additional overhead due to the execution of the expressions.
A test with 2000x2000m (i.e. 400ha) and with a varying list of columns for the dynamic output yielded the following figures (same trees in the previous test, release mode):

 case timing(sec) 34 columns 4.3 22 columns(removed most of the percentiles) 2.8 18 columns (removed expressions and percentiles) 2.1

Re-inspection of the statistics class (responsible for the calculations) revealed, that the median (i.e. the 50th percentile) is evaluated even if it is not used --- not good. A better approach is to calculate percentiles only when specifically requested... Ok. As shown in the next table this little optimization pays off:

 case timing(sec) old timing (sec) new 34 columns 4.3 2.7 18 columns (removed expressions and percentiles) 2.1 1.1

The unoptimized version (with 34 columns) used up 31% of the total running time. In contrast, the updated algorithm consumes 22% (for the case with 18 columns the figures are 18% and 10%). I think the speed-up is not too bad considering that the time to actually implement the change was very short.
Anyway - using the dynamic stand output has an impact on the overall performance of iLand.

## Debug outputs

Another class of outputs are the "debug outputs". They were designed for simple integration in the source code and not with performance in mind. These outputs should be deactivated if not actively used (system.settings.debugOutputAutoSave / system.settings.debugOutput).

## transatlantic debugging

Wednesday 10 of March, 2010

First, a quick note for what happened since the last post. Autumn came, the Oregon visit ended. Then winter came. Now, winter is still around - at least here in Vienna. From my side activities for the iLand project were somewhat reduced during the last couple of month - this is on one side a pity, because I definitely enjoy working on iLand a lot. The second side, though, is that everything worked pretty well despite the heavy testing Rupert did.
So much to the first part of bragging ;)
Today, I found a cry for help from Rupert in my mail box. He described the issue he stumbled into while he was testing iLand (namely strange behavior of doug fir). He added - and that is the remarkable point - some additional information.
The mail attachment consisted of:

• a complete project folder (zipped).
• a screencast of the iLand GUI to show what the 'strange behavior' is about

Notably, the bug was detected by actually watching the visual representation of the forest in iLand. This points to the importance of proper visuals - it's all about the unrivaled power of the human visual system! And the video is IMHO an appropriate way to communicate visual impressions...
Second, using the project-folder I was able to duplicate the exact simulation setup within minutes. This nicely confirms some design decisions like project specific XML files, the directory structure or the omitting of a central run-database. Well, that was the second part of bragging...

Lest I forget: the bug. As we introduced variable class widths for the storage of LIPs, I forgot to adapt the code for the highest dbh-classes. Eventually, the linking between trees and their respective LIP went bananas and super-large LIPs were assigned to small trees. Setting up 4000 small trees therefore resulted in an average light-influence-field value of about 10e-11. The scaling of LIF -> light response then increased small differences between the LIF levels of individual trees enormously which directly led to the observed behavior.
Modesty is easily restored in the light of such stupid bugs...

## Lego - or recent new building blocks

Thursday 24 of September, 2009

During the last 10 days (or so) the iLand platform grew remarkably, as new features were added. Some of those additions are:

• Outputs: the things that are necessary under the hood, to have some useful database output tables after a model run
• dynamic outputs: outputs based on aggregates of tree variables (e.g. per resource unit and species). "Aggregates" can be averages/sum or more fancy percentiles (e.g. the 90th percentile of the treeheight or the sum of tree volume)
• a basic management routine

The management implemented is not very powerful, it merely allows to reduce the stem number to a predefined number. Much more interesting is the way it works: it utilize the Qt Script engine to let management programs be programs in a very literal way (Javascript, it is!). A similar (and still much more powerful) scripting approach was used for the management in Picus. The difference: To set up a running Javascript-based simple management took about half a day, and to set up a running PicusScript management took, well, much longer...

The next step in developing iLand will be to setup the simpler parts of the "environment" and subsequently persuade the 3PG production module to respond to that... The more complicated parts (i.e. the water cycle) already are lurking around...

Tuesday 01 of September, 2009

## Why that?

Did I already mention, that Qt is just great? The issue that made me state this here is Multithreading. As the number of cores per CPU is very likely to increase further, the issue of parallel computation gains importance. Actually, parallelism is one of the more important conceptual goals of the iLand-development.

So I started this morning to take a closer look at the QtConcurrent documentation. It is very convenient to handle multiple threads, but it is even easier, to use the container based function provided by the framework. If the basic design allows such an approach (and, err, iLand does!), the threading issue merely boils down to an one-liner. A nice extra: the library chooses the appropriate number of threads according to the number of cores in your machine, so this is supposed to scale nicely with upcoming hardware.

## How it works

// worker-function:
// execute something for each tree of the ressource unit
{
QVector<Tree>::iterator tit;
QVector<Tree>::iterator  tend = unit->trees().end();

for (tit=unit->trees().begin(); tit!=tend; ++tit) {
}
return unit;
}

{
DebugTimer t("readPattern()"); // some timing measurement
// execute "nc_readPattern" for each Ressourceunit of the "mRU"-Container.
// do all the thread-magic in the background, and return, when
// everything is finished.
}

## Results

The whole thing worked almost immediately, and here are the usual performance measurements (yes, I definitely like performance measurements ;)). With Multi-threading iLand runs now on my dual-core laptop at 100% CPU-Load and roughly twice as fast:

The test case is a 8x6km covered with 2.8 Mio. trees (Note, that the growth of trees is mostly incomplete).

### Conclusions

• For the time being the multi-threading approach shown above works just fine - future will tell if on some point a more fine-grained approach will be necessary.
• And - of course - Qt rocks!

## There is some progress

Sunday 30 of August, 2009

This weekend is definitely dedicated to really start hacking on the iLand modelling software. After weeks and month of preparations, theoretical considerations, the creation of software pieces to toy around (but also with more tangible results like the pattern creation "LightRoom" software), now it was finally the time to directly dive into the "real thing", the development of the iLand platform.

## Achievements so far

After two days of intensive coding (with *lots* of coffee, but on pizza :)), the progress made is quite remarkable. Here are some cornerstones:

• convenient means to parse and interpret input parameters based on XML
• the setup of multiple RessourceUnits including basic handling of trees per ressource unit
• a "Model"-class that combines all the structural elements
• the "LightCycle", i.e. the derivation of light concurrence indices for individual trees

Put another way, now we can:

• setup a large model-world (e.g. 10km x 10km) consisting of many ressource units (with the size of approx. 1ha) and a vast amount of trees
• calculate the light status of individual trees - this is the result of writing/reading light influence patterns on a large grid

The next step will be to implement the growth of trees - this is thoroughly elaborated on paper, but waits to be transformed into bits and bytes. Besides of that, the current model version allows a rough estimate of memory consumption and model performance:
The model loaded with a 10x10km (=10.000ha) world and approx. 8.000.000 individual trees consumed roughly 500MB of RAM and took 30secs to calculate the light status of the trees. Over time, memory usage of trees will presumably grow, yet this figures are reassuring insofar, that the planned (maximum) use cases seem to be digestible for today's computers... Unknown, of course, is the behavior of all the things to come (growth, seed dispersal, ...), but on the outher hand the possibilities on the IT side are not exhausted (multithreading, SSE, GPUs, ...)... so it remains a thrilling race...
A screenshot to illustrate the point:
<img src='tiki-view_blog_post_image.php?imgId=1' border='0' alt='image' width='60%'/>
In the case shown a 1ha-stand with a gap in the center is repeatedly used for the whole area of 5000x4000m. The color indicate the strength of the LightInfluenceField (red=dark) on the 2x2m basal grid.

## FONStudio performance

Saturday 15 of August, 2009

The implementation of the light competition cycle is still uncomplete (August, 14th, 2009), but it is in a shape that allows some conclusions with regard to the performance. Recently, I tinkered around with some timing issues (the Qt built in QTime is quite limited) and finally I came up with a much more precise stopwatch. From that point I couldn't resist to do some measurements...

## Current state of the competition cycle

Currently, Light Influence Patterns (LIP) have a horizontal resolution of 2x2m, while the "dominance height grid" uses a 10x10m grid. The maximum width/height of a LIP of an individual tree is the tree height (e.g. a tree with 40m has a maximum pattern of 80x80m or 40x40=1600 cells), but usually the realized sizes are smaller (the 40m tree may have a grid of 30x30m).
For the "reading out" of the influence value only those pixels within the crown-radius are considered, thus only a fraction of pixels have to be considered.

The cylce consists of the following steps

• clear the grids (initialize)
• calculate the dominance height grid (height)
• apply the LIP of each tree (apply)

## Some results

### Goals

• how is the general performance, i.e. how many milliseconds are needed to calculate 1ha? This provides a rough estimate how the approach works at the landscape scale.
• how is the effect of tree count vs. tree size?

### Method

Test stands with different tree count / tree sizes were applied on a grid of 1ha. Each phase was repeated separately several times to achieve a little robustness in results. The FONstudio was compiled in release mode with GCC shipped with Qt 4.5. The current revision in SVN is 71.

### Results

The used stands are summarized in the following table:

 Name No. of trees avg.dbh(cm) avg.height(m) crowded_small 3000 10 10 crowded_medium 640 30 27 crowded_tall 250 60 40 uneven_gap1 622

The next table provides some results of performance measurements (times are in milliseconds):

 stand cycle total avg. apply avg. read crowded_small 24 13.4 9.4 crowded_medium 29 26.8 0.7 crowded_tall 24 22.2 0.7 uneven_gap1 29 26 0.7

The other phases (intialize, height) together took a time of <1ms per cycle for all stands.
Interestingly, the total times for the apply/read cylce are roughly the same for all stands, but the share of writing/reading differs remarkably. The application of the LIP is more time consuming for larger trees featuring patterns with more pixels. Here, the medium stand with a large number of relatively big trees consumed the most time. On the other hand, the "read"-time scale much more with the number of trees to process, making the stand with the smallest tree the clear winner (or - depending on the point of view - the looser).
One interesting feature is the big difference of read-time between the small and the medium stand; tree-count is only 5 times higher, but calculation is more than 10 times slower?

## Conclusions

Although the structure of the current code was developed with performance as one of the core goals, there is very likely a big potential of improvements in the current implementation (without even talking about using SSE, or changing the grid sizes). So one has to be careful with conclusions; but it seems to be clear, that *some* improvements will be necessary if we are really aiming at the application on a landscape level.
A (not very surprising) general point that can be made: more important for performance is not so much the mere number of trees, but the number of items with the smallest resolution - in our case the number and resolution of the concurrency-map (now 2x2m, with large trees having >1000pixels in the pattern).

## Update 20090817

The results above were not very satisfying and so I finally set down and looked again at the code that actually does the pattern-related work. Again, I couldn't resist and so I started to play around with some optimizations. The current revision (svn version 78) features an improved (and more complicated) method of calculating the height grid, and optimizations for calculation performance for both the pattern application and the tree read-routine.
The updated values are presented below (the old values are in parentheses):

 stand cycle total avg. apply avg. read crowded_small 14.8 (24) 10.1 (13.4) 2.9 (9.4) crowded_medium 17.5 (29) 15.8 (26.8) 0.2 (0.7) crowded_tall 11.8 (24) 10.5 (22.2) 0.2 (0.7) uneven_gap1 17.1 (29) 15.2(26) 0.2 (0.7)

The current version performs the apply/read cycle almost twice as fast; this is especially true for stands with larger trees. The improvement of the reading routine (which is 3 times faster now) is only notable for stands with many small trees.

The current work cycle changed the state from "not optimized at all" to "exploited the most obvious issues", and, considerung the law of diminishing returns, without fundamental changes in design future improvements will likely be more tedious to achieve.
One note: when applying more repetitions, each cylce seems to be more time consuming (like +25% for 1000 cycles compared to 100); I do have no explanation for that behaviour, but it should be considered in futere...

## Performance gain using SSE instructions

Tuesday 20 of January, 2009

## SSE

SSE (e.g. http://en.wikipedia.org/wiki/SSE2) is the multimedia extension of modern x86 Intel and AMD CPUs. Basically, using 128 bit wide registers, 4 floats (or 2 doubles) can be processed within one cycle.

## The Problem

One idea how to do efficient light competition calculation is:

• Use a large grid that stores the influence values (e.g. 10x10m, that would end up with something like, say 2000x2000 values)
• each tree has a small vector (e.g. 5x5 or 7x7) expressing the influence of the tree on its neighborhood
• each tree "stamps" its influence matrix on the large grid by simple addition

## SSE solution

Using SSE instructions the code might look like:

// MS Visual Studio decl-spec to force (necessary) alignment in memory
#define CACHE_LINE  64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))
#define CNT 1000000
float CACHE_ALIGN fData[CNT*4];

int test4()
{
__m128 empty = _mm_setzero_ps();
__m128 stamp[8];
for (int i=0;i<8;++i)
stamp[i]=empty;
stamp[3] = _mm_set1_ps( 3.343f); // any value
__m128 *p1, *p2, *p3, *p4, *s;
TTickTack t;
float *f1=fData;
__m128 val;
const int lineoffset=100;
for (int i=0;i<10000;i++)
for (int j=0;j<10000;j++) {
//apply(stamp, j);
s = stamp;
f1 = fData + 16*j; // force different starting positions
val = _mm_add_ps(val, *s++); // add 4 float values (of the stamp)
_mm_store_ps(f1, val); // store the result back to memory
_mm_store_ps(f1, val);
f1+=lineoffset; // go to next line
_mm_store_ps(f1, val);
f1+=4;
_mm_store_ps(f1, val);
f1+=lineoffset;
_mm_store_ps(f1, val);
f1+=4;
_mm_store_ps(f1, val);
f1+=lineoffset;
_mm_store_ps(f1, val);
f1+=4;
_mm_store_ps(f1, val);
f1+=lineoffset;

}

std::cout << "applied stamp (SSE): "  << t.elapsedStr() << "\n";	 t.reset();
float sum=0.f;
for (int j=0;j<10000;j++)
sum+=fData[j];

std::cout << "sum: " << sum << "\n";
return 0;
}



Important to note: The used data structures must be aligned in memory (at least for 16 bytes, but it seems to be preferable to use 64-byte alignment because of faster cache-access). The "_mm_xxx()"-functions are replaced by single assembler instructions, as the disassembly shows:

      __m128 val = _mm_load_ps(f1);
00401735  movaps      xmm1,xmmword ptr [eax-4E0h]
_mm_store_ps(f1, val);
0040173F  movaps      xmmword ptr [eax-4E0h],xmm1


## Without using SSE

A algorithm that use no special instructions could be as follows:

int test5()
{
float stamp[64];
for (int i=0;i<64;++i)
stamp[i]=0.f;
stamp[20]=3.343; // some value
TTickTack t;
float *f1, *s;
int ii;
const int lineoffset=100;
for (int i=0;i<10000;i++)
for (int j=0;j<10000;j++) {
//apply(stamp, j);
s = stamp;
f1 = fData + 16*j; // force different starting positions
for (ii=0;ii<8;++ii) {
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
*f1++ += *s++;
f1+=lineoffset;
}
}
std::cout << "applied stamp (no SSE): "  << t.elapsedStr() << "\n";
t.reset();
}


As it is, this code applies a 8x8 stamp (64 float values) 10000x10000 times (i.e. 100,000,000). It is somewhat optimized (faster pointer arithmetic, manual rolling out of the innermost loop and such).

## Results

The code is executed on a T2300 Intel Core Duo processor. The results are:

 Without SSE With SSE 8.6s 1.6s (=ca.18%)

This is quite impressive and considering the 100,000,000 repetitions both versions are extremely fast. In reality, however, this example would be too artificial: one point is that the "stamp" here keeps the same for all calculations (and would therefore keep sitting in the cache), another point is that there is no bound checking (when stamping along the borders of the big grid).
I also tried a kind of standard-implementation with a bunch of nested loops, having the innermost loop looking like this:

  f1 = fData + 16*j; // start somewhere
for (x=0;x<8;x++) {
for (y=0;y<8;y++) {
f1[y*gridwidth+x] += stamp[y*stampwidth+x];
}
}


This version surprisingly performs very similar to the above version (without SSE). At least the Microsoft-compiler seems to do a very good job in optimizing the code.
To confirm this, I also did a run in debug-mode of the compiler. The results are then:

 no SSE, nested loops SSE pointer optimized SSE 77s 53s 16s 100% 68% 21%

Generally, using SSE instructions and careful memory alignment/cache-aware programming has a huge potential of performance improvements.

• 1
• 2 (current)
• »