Love affair with CBIR (Content base image retrieval system) Part 3
I never figure out, this would be series of article, here are links to rest of it
Love affair with CBIR – Part 3 (The one you are reading)
This is continuation of the series of article, so, if you are not able to catch with this post, read the previous ones to understand the context.
Last article, we had discuss about the De-duplication of image, and how to get identical across thousand of images. But, when it some to similar images, those algorithm does work. My quest for find some algorithm, which works for similar images with has different lighting condition, rotation, affine transformation, yet could identify the image.
List of CBIR
As, I was going through this rabbit hole, I found there are several research has been done in this area and tons of research paper has been publish. There are several open and close source CBIR already developed. Many are the outcome of the University research, while others are commercial.
Google already has reverse image search, if you haven’t tried it, I recommend to do so. It fast, robust, and works really well in finding similar image. But, this won’t work for me, as it search entire web for images, and prerequisites was, that image set in which you are trying to find similar need to be crawl by google search bots. Mine was internal website (wasn’t expose to internet), and image set was sort of copyrighted.
Next best commercial solution was tineye matchengine. It have sets of API, where I could submit my image, it would run it’s indexing algorithm on my set of images, then I can pass the query image, and it would return in set of similar images. As you can see above, paint of American Gothic, where farmer in different mask would be detect as similar. This is big deal in computer vision research. But, sigh, Tineye cost the bomb ($1000 for starter) and my project stakeholder required some free solution.
List of free ones (CBIR)
I look to various open source CBIR, WindSurf is desktop base application in Java, LIRE is web implementation using lucene and solr in Java, FIRE (Flexible Image Retrieval on Image Parts) developed in C++ and phyton. But, none suit my needs, first I need something in .NET, second, it need to work on part image search. I mean, I give just the portion of image, and it should retrieve the complete image, if it able to match the part.
It looks like, there isn’t good CBIR in .NET and I need to built one myself. Sigh! To built one CBIR, I need to know how to tell computer to identify images, next question stump be was how does human eye identify images?
I did some research and found out, that human eye tries to find feature in the images, and try to identify same set of features in another image, to tell if the images are similar. There is excellent tutorial on OpenCV, which takes more about the image feature detection.
Now, my next problem was, how do I tell computer to detect feature in image. Images are going to be random, and i need some algorithm, which can detect significant feature in any image I throw to it, not just one type. e.g. In picture of a cat, it should detect cat eyes, nose, whisker, pointed ear. But, in some beach picture, it should detect people on beach, coconut trees, sea waves. You got the problem?
Now, I was going down hot waters , research field of Computer Vision, Image Processing, mathematics, matching learning, artificial intelligence. In short, areas I was afraid of. But, its love affair, and I knew, its not going to be easy.
Corner Point Detection
In computer vision, they call this problem as corner or interesting point detection. Their are several algorithm, like Harris corner detection, Susan corner detection, LoG, DoG, and DoH feature detection and several others. One that catch my attention was Harris corner detection and FREAK (Fast Retina Keypoint) detection.
This algorithm detect corner something like this
Cool, knowing algorithm name is one thing, and implementing is other. I was searching for .NET implementation of these corner detection algorithm, where I can to know about two beautiful open source libraries / framework.
This libraries have set of functions which you need to for Computer Vision, Image Processing, Machine Learning, Artificial Neuron networking, etc.
Correlation base matching
Now, I know, how to detect corner, I was stump with another problem, one I detect corners, from Image, how do I compare them. I mean, corners are just set of (x,y) points in space. Now, I need to detect this points in other image. You would say, just compare their x coordinate and y coordinate. Cool, but remember rotation and affine transformation in first post, it means one image points x,y no longer remain the same in other image.
Phew, it gonna we tough than I though. Here, come another algorithm, know as correlation based matching. It use matrices, and whole lots of math to find the distance between points and matching points. Accord framework has the Correlation matching algorithm, which approximately find the matching points.
It detect matching points, somewhat similar to image below
Figure: Top Left: Corners detect in Image A, Top Right: Corners detect in ImageB. Bottom Left: Correlation mapping of points in both image. Bottom Right: Remain after RANSAC estimation.
Cools, now, I know, how to match points in one image with another. But, how do I find set of matching points with thousand of images in database. Stump again!
Another idea struck, I need to say to describe the points detect in image. If the description of the point match with any of the description of points in thousand image, it means, it match. Wow! I am genius.
SIFT descriptors extractor
But, wait, how do describe the points? Few googling, and found out computer vision scientist David Lowe, though about this problem long back in 2004. He came up with SIFT algorithm (Scale-invariant feature transform), which works on LoG, DoG, and DoH feature detection, detects the interesting keypoints (corner), and describe it with arrays of double of 128 bit length.
Now, Image comparison problem has been reduce to math, where you need to compare one matrix with other
Problem: Need for speed
Cool, but this there is problem. First, the SIFT detection of points and descriptors is not fast, I need some algorithm, which is fast enough to detect points and descriptors of thousand of images in matter of minutes.
Second, method to compare fast with the thousands of images in database
Other descriptor extractor
Lets, take problem, one by one. SIFT fast alternative? There is another algorithm, SURF (Speeded Up Robust Features), which is fast compare to SIFT and in default mode, computes descriptor of 64 bit length. But, there is another problem with SIFT and SURF, both are patented and can’t be use in commercial application without loyalty fee to patent holder. Life in never easy. Quest of open source algorithms continues.
There are several other algorithm, which are patent free, and can be use in commercial application, like BRISK, FREAK and ORB. You can get their .NET implementation in EMGU CV library.
Cool, now, I have way to reduce image in set of jagged array (read Matrix) and now, I can find a particular feature in other images, irrespective to lighting, slight rotation and affine transformation. But, second problem till remain, how to compare descriptors to sets of hundred millions even billion (descriptors of thousand of images) of descriptors.
This can be tackle in two ways.
Trees and Nearest Neighbor
Trees – Answer like in tree structure. There need to create the tree data structure which stores the descriptors in such fashion, that it can be fast matched when searched in tree. Something similar to binary tree, but here challenges are more. In order to compare two descriptors say [d1,d2,d3,…, d64] which is one row in matrix to another row is via [n1,n2,n3…..n64] is done via different distance metrics formula like Euclidian, Manhattan or Minkowski. You arrange the descriptors in such way that it compliment the distance with each and other in the tree.
R-Tree, BK-Tree, Cover tree, MVP tree are possible solution.I had gone with k-d tree. You can kind the implementation of the k-d tree in the Accord Framework libraries. Once, you arrange the descriptors in the k-d tree, you perform the k-nearest neighbor to find the descriptor in skyscraper like matrix.
In my personal experience, I found k-d tree do not perform well in higher dimension 64 and 128. Accord framework had the Approximate Nearest Neighbor implementation, but the result were Random. I found FLANN library, which there in Emgu CV as Flann Index, give better result. FLANN uses k-d forest to perform approximate nearest neighbor search.
If all this stuff makes you dizzy, no problem. You can checkout my implementation on GitHub where you can play around the application to understand all the Jargons which I have thrown here.
Till, problem lies with the volume of descriptors, in my trial, 1000 images, which cumulates to approx 40 MB in size had SURF descriptors 120 MB in size. Phew, three times more. What if, I had ten thousands of images, descriptors size would result in GBs. Maintaining such large dataset of descriptors is problem, even on Hard disk, plus loading into memory during search is big pain.
Bag of Visual Words
Bag of Visual Words – In Bag of Visual Words, you read sets of images, and calculate mean sets of feature out of it. Now, you get histogram of mean visual descriptors which appear number of times in image. It will get the array of fix length which describes the image. This is brief description of bag of words. If you don’t understand the bit, you can look for detail explanation on this blog, this one or google out for more post which describes the Bag of Visual Words.
Plus, applied the document and query weighting scheme to query out the images. You can find more about this in the implementation of term frequency, document frequency and normalization. There is good paper on the why to use of weighting scheme of vector space (Bag of visual words, in our case)
Now, we have reduce to statistical problem, and compare thousands, even millions images in database. Here, each image is just the representation of visual bag of words, and base on the query image, which again is converted into visual bag of words, and let the similar ones for it.
Enough Words, Action
You would be thinking, enough of gyan, where the implementation/code of all these? That’s my plan for next post. Stay tuned.