Optimizing ITK’s Registration Methods for Multi-processor, Shared-Memory Systems

Please use this identifier to cite or link to this publication: http://hdl.handle.net/1926/566
This document describes work-in-progress for refactoring ITK’s registration methods to exploit the parallel, computation power of multi-processor, shared-memory systems. Refactoring includes making the methods multi-threaded as well as optimizing the algorithms. API backward compatibility is being maintained. Helper classes that solve common registration tasks are also being developed.


The refactoring has reduced computation times by factors of 2 to 200 for metrics, interpolators, and transforms computed on multi-processor systems. Extensive sets of tests are being developed to further test operation and backward compatibility.
Data
minus 2 Files (489Kb)
Code
There is no code review at this time.

Reviews
minus Reviewed Feb 2008 by Rupert Brooks on 02-18-2008 for revision #3
starstarstarstarstar expertise: 3 sensitivity: 5
yellow
Summary:
The authors propose to speed up the implementation of ITKs registration framework both through code optmizations and multithreading. ITKs registration framework is - or perhaps I should say was - regrettably slow. This paper is an excellent and much needed contribution.

If imitation is the sincerest form of flattery, the authors should know that I am planning to reimplement much of my own work using the ideas and framework presented here.

Evidence:
The majority of the evidence presented is in the form of timing comparisons on two machines. Unfortunately, there are several different speed up methods at work here. It is difficult to discern from the evidence, which code changes are more important. In my reproduction of the work (see below) I noted that there are speedups of both the metric evaluations, and the metric initialization. This is of course good, but this might merit separate analysis.

Open Science:
The work adheres to the principles of open science. However - the work was somewhat difficult to reproduce. Recently the code was added to the ITK CVS, which made access to it much easier. However, it was added in such a way that an ITKCVS build seems to either use these classes, or the original versions, but not both. At least, it appears this way - in all honesty i did not try to build them from the Review directory, but came up with my own way. Details follow below.

Reproducibility:
I got the code working in nearly its original form. Very significant speedup was immediately apparent, plus of course the new capability of multithreading. As this is a new framework for (mainly) the ITK registration metrics, i implemented some of my own work in this framework. This was not possible to do easily from the document - it was only possible after carefully reading the code.

A careful testing approach is essential for any large code optimization project like this. The approach described in the paper seems good. However, I did not reproduce it. The current set up in the ITK CVS seems to preclude testing the old against the new versions of metrics unless one is willing to build two concurrent ITK installs. Even then, how one could write a single program to run one metric, then the other and compare their output was not obvious. I did it my own way see below.

Use of Open Source Software / Open Source Contributions:
It is a contribution to the open source ITK project.

Code Quality:
The code is of high quality. However, there are three design criticisms that i might make
1. A not insignificant part of the speed is due to using memset instead of the Fill() methods. Perhaps the Fill methods should be recoded to use memset instead? Then the speed benefit would appear everywhere that uses them.

2. The code for the threaded GetValue and GetValueAndDerivative methods is complex and highly repetitive. As we all know, repetitive code is hard to maintain. I think that the repetition could be reduced somewhat. In particular, i found it possible to compress both the Transform point methods into one. I found other ways of reducing the amount of repeated code in my re-implementation by using a recursive including scheme, but this did make it a bit more difficult to understand.

3. The caching of the bspline parameters does give a significant speedup, but at the cost of a MASSIVE use of memory. For any interesting size 3D image, and even many 2D images, it is completely impossible to run this on a 32-bit system. I think there should be a switch to turn this on or off. Yes it runs slower, but its still useful to run it on 32-bit systems on medium size images.

Requests for additional information from authors:
I think that additional documentation is needed. Not all the modified classes are described, or even listed, in the text.
In particular, i found the way the multithreading is actually implemented, and what a user must implement to reproduce their own metric, would benefit from a written thorough explanation. I had to read the code - very carefully - to figure it out.

It would also be interesting to have a list of what is different about all the optimized classes. Some independent testing might be valuable. For example, how much of the speed up is due to the interpolator, and how much to the metric? What is the optimized resample filter, it is not discussed in the text.

It seems to me that the efficiency of these methods is highly reliant on inlining, especially of the ThreadProcessSample function. This would be interesting to explore. If inlineing is not vital, function pointers could be used, which could simplify the code, i think. On the other hand, if inlining is important, perhaps the transform point method (also called for every pixel) could be inlined.

I was curious about the authors thoughts of existing shared memory parallelization methods, in particular openMP, as a way of getting quick multithread gains for low coding effort. This was more a vague thought, than a specific request. Clearly many of the compilers on which ITK is built do not support openMP, so it may not have universal applicability.

Additional Comments:

How I reproduced (some of) the work.

I was using the version in the ITK CVS as of Feb 6, 2008. The rest of the code besides the review directory was ITK 3.4. The optimized classes are intended to replace the nonoptimized classes of the same name - therefore building both alongside each other is difficult. I first renamed all the optimized classes to have the prefix "opt" on their names. I then made the OptImageToImageMetric a subclass of ImageToImageMetric - this way i could plug it into existing code.

I trusted the optimized interpolators, so i just began using them directly.

I evaluated the metrics a number of times with the old metric implementation, and the new metric implementation, and compared the results. The mean squares metric is numerically identical to the old one. The MI metric is numerically very similar, but not identical. Both optimized metrics are MUCH faster, running single threaded. This is not due to the interpolators - during this test both metrics were using the same interpolators. The metric code itself is significantly more efficient.

Running multithreaded, contrary to what is described in the text, I actually found further performance increases (although not even close to linear) running more threads than cores. Specifically 3 threads on a two core machine was noticably better than two. This was a Windows XP 32bit machine, compiler Visual C++ 2003. Its probably some quirk of windows.

I did not evaluate the oriented image gradient fix - i have my own way of doing this, discussed elsewhere.

I did not evaluate the optimized resample image filter.

As a further evaluation of the work, I reimplemented one of my own metrics copying most of the ideas from this framework. I found that it was not too difficult to do, but does require a careful reading of the code - more documentation would have been helpful. The study of the code probably took me a days work. The implementation about 4 hours, and theres still a bug or two to track down. Nevertheless, i conclude that the framework is usable and general and strikes a reasonable balance between efficiency and intelligibility. As mentioned, I did find that the code might be difficult to maintain, due to repetition.

Things that may be bugs or errors:

In the paper (p. 4) it says that the base itkTransform class was modified for thread safety. However, in the ITK CVS Review versions, there is no modified version of itkTransformBase. Furthermore, the classes seem to achieve thread safety by making multiple copies of the transform.
------
In Figure three, supposedly the old method is being compared to the new method. In the text it is claimed the new method is 6 times faster when run single threaded (probably true, based on my experience). However, the graph starts at 1. I dont think it graphs the right thing.
-------
in itkOptImageToImageMetric.txx

// If user provided a mask over the Moving image
if ( m_MovingImageMask )
{
// Check if mapped point is within the support region of the moving image
// mask
sampleOk = m_MovingImageMask->IsInside( mappedPoint );
}
else
{
// Check if mapped point inside image buffer
sampleOk = m_Interpolator->IsInsideBuffer( mappedPoint );
}

I think that being inside the moving image mask does not guarantee that the point is inside the valid region of the moving image. I think both tests should be done.
-------
There is a minor inconsistency in that the stub for ThreadPostProcess is placed in the header, and the stubs for the other two are in the .txx
-------


minus great ground work - opens the way by Alexandre Gouaillard on 09-13-2007 for revision #3
starstarstarstarstar expertise: 3 sensitivity: 5
yellow

Summary:
this paper report the first results of a multithreaded implementation of the ITK registration framework. Dedicated metrics and test environment has been developed to compare the new implementation with the previous one on different plateforms, with different environments and depending on the number of thread used.

Hypothesis:
* multicore/multi cpu are ubiquitous or about to be (85% of intel CPU are currently at least dual core).
* using all the core/cpu should improve the performances
* the improvment should happen across plateform and OSes. 

Evidence:
The speed comparison depending on the number of thread have been plotted and illustrate the gain. Example are multiple and show clearly that users will gain from using the new implementation, whatever filter/metric/oprtimizer they use in the registration framework and whatever plateform they are working on. That was the main target of the work, and we can say: Work well done!
Note: It seems that part of the gain was made from code optimisation and not multihreading.

It seems that using more threads is interesting as long as the number of threads is not greater than the number of cores/CPU. If the number of thread is higher, we see the performance decrease. It would be nice to have the filters decide ow many thread they want to use automatically. It does not seems to be consistent enough across platforms or threading implementation to be able to do so. Another point not discussed on the paper but discussed intensively on the itk-develloper mailin glist is that it depends on the multithreading implementation: does the filter creates and destroy threads, or does it access a pool of shared threads? This question is under investigation.

One very interesting part of the work is the implementation of the cross-platform multithreading testing and benchmarking tool. Mixing BatchMake and CMake seems to provide good results. It would be interesting in the future to use this tool for other multithreading implementations (the Mesh part of ITK would be my first target, but I'm kind of biased). The authors also experiment a lot with different profiling tools and timing tools, I think many users would be interested in knowing which tools they used (they mention valgrind, but I suppose they mean cachegrind), what kind of experiment they did and what ther conclusion were about the pros and the cons of each tools. That is outside the scope of this paper, but that would make a great wiki page.

Open Science:
In ITK spirit, the author provide the code under BSD and the IJ article under CC. The code should be integrated in ITK post 3.4 release. It must be noted that modification of this magnitude can not be integrated in ITK at once. The quality process, that insure that the code is stable on an almost daily basis prevent that from happening. 

Reproducibility:
Checking out an older version of ITK, i could compile it and run a basic test. The amount of work needed to reproduce the result was such that I did not even try. Teh current version does not compile, but it should be only a matter of weeks before it is available in the CVS version of ITK.

Use of Open Source Software:
ITK

Open Source Contributions:
integration in ITK in progress.

Code Quality:
Master Luis checked it, it should be good. Test are provided. Along the main line of work , the authors took the time to profile the application and found many improvements, not only on speed, but also on combinaitions of filters/metrics/optimizers that had never been tested together before and happened not to work along well. I guess a complete matrix test using all the possible combinations is being developped right now. It should find a lot of potential bugs an improve the overall quality of the registration framework. Thank you for that.

Applicability to other problems:
the multithreading test framework could be reused for other part of ITK, or for any multithreaded code that use CMake, I guess.
A little report on the timing and profiling tools would be of interest.


Suggestions for future work:
* threading implementation and other issues are being investigated and (intensely) discussed right now.
* better and more complete registration framework tests are being developed.

It would be interesting to have results on realy big cases (zebrafish datasets ...) to check if you really have much better improvement in this case as you are expecting. In that case, it would also make the thread pool model more interesting.

You report that threads and associated items (mutex) implementation on SunOS were especially slow, and I noticed that the SunOS machine used for your test were using sunOS 5.8. It so happen that the SUN threading library is known to be horribly slow, untill sun decided to completely rewrite it. The new library was shipped separatly one year after the release of sunOS 5.8, so your machine might not have it installed. With the current OpenSolaris (or the latest release of 5.10) you should have much better results. I was wondering, out of pure curiosity, what the results would be on one of their "Niagara" box which is optimized for multithreading operations.


Requests for additional information from authors:
It's a work on progress. I wil ljust wait for the results. Everything is really interesting indeed.

minus Great direction for ITK registration framework -- more details requested by David Holmes on 09-12-2007 for revision #3
starstarstarstarstar expertise: 3 sensitivity: 5
yellow

Summary:
The authors describe the optimization of the ITK registration framework for multi-processor systems.  This includes some code optimization, making everything thread-safe, and threading several of the processing steps.  The authors are targeting various transforms, metrics, interpolators, and solvers.

Hypothesis:
The hypothesis is that multi-threading will improve the computational performance of the ITK registration code.


Evidence:
The authors have implemented and tested the code on several platforms.  They present the results for two of the systems that they ran on.  The results are consistent with expectations.

Open Science:
In true ITK form, the work is completely open.

Reproducibility:
I have to admit that I have not had a chance to download and run the code. 

Use of Open Source Software:
The method was implemented in ITK.  There was some use of BWHITK which I would guess is also available if I went looking for it.  Valgrind was used for profiling which is open.  Unfortunately, neither vtune nor ltprof are open source. 

Open Source Contributions:
Great contribution

Code Quality:
Not evaluated

Applicability to other problems:
Well, this is the first of my two little issues.  It is not about the work, which is incredible valuable, but about the details of the paper.  One of the most important contributions here is that the authors have expended the energy to optimize multi-threading across platforms.  This is the most "applicable" component of the work.  Unfortunately, there is only one paragraph on the optimization process in which the authors state "certain changes and parameterizations could improve performance on one platform and degrade performance on others...."  There are no details about the difference by platform or proposed explanations why this is.  If someone is looking to develop cross-platform code that is multi-threaded, they might look to this article and find it lacking in that detail.  If this has been previously explained, then I would simply suggest including a reference.

Suggestions for future work:
Speed is one of the major issues with optimized software, and the authors have addressed this issue very well.  The other issue -- which can be at direct odds with the speed -- is the use of memory.  Algorithms which require large amounts of memory are either slower or require machines with gobs (sorry for the technical jargon) of memory -- like 128 GB.  In addition, making thread-safe code will result in more memory usage.  This issue has not been discussed; it will be valuable to run this same set of tests and report the memory usage as the threads go up.


Requests for additional information from authors:
Only the in the two areas noted above.

Additional Comments:
Congratulations on moving in this direction.  Optimization is such an important process to make usable software particularly given the magnitude of the problems in medical imaging.

plus Work in progress by Torsten Rohlfing on 09-11-2007 for revision #2
starstarstarstarstar expertise: 5 sensitivity: 5
Add a new review
Quick Comments


Resources
backyellow
Download All

Statistics more
backyellow
Global rating: starstarstarstarstar
Review rating: starstarstarstarstar [review]
Code rating:
Paper Quality: plus minus

Information more
backyellow
Categories: Deformable registration, Groupwise registration, Image pyramids, Multi-modality registration, Registration metrics, Registration optimizers, Transforms
Keywords: Registration, Threaded, Parallel, ITK
Toolkits: ITK (moved into the sandbox), CMake
Export citation:

Share
backyellow
Share

Linked Publications more
backyellow
Diffeomorphic Demons Using ITK's Finite Difference Solver Hierarchy Diffeomorphic Demons Using ITK's Finite Difference Solver Hierarchy
by Vercauteren T., Pennec X., Perchant A., Ayache N.
ITK in Biomedical Research and Commercial Applications ITK in Biomedical Research and Commercial Applications
by McCormick M., Aylward S., Johnson H., Lowekamp B.

View license
Loading license...

Send a message to the author
main_flat
Powered by Midas