Querying Text in Image


Retrieving relevant images using only the text attached to them may not be sufficient. The text contained in the image itself, on the other hand, is often more specific and informative. For example, the word “restaurant” is more likely to appear in the image, as a sign labeling the place, than in the attached text.


Figure 1: the word “restaurant” appears in the image


The goal of this project is to implement the idea of the article “Image retrieval using Textual Cues”, whose main focus is on searching for a query text in a large collection of images and retrieving all occurrences of that text.

Recognizing text in images is not a solved problem. The article works with a large collection of images that contain the same text in different positions, viewpoints and font styles, which may help in obtaining better results.





We implemented querying for single characters in images, so that the idea can later be extended to searching for multiple characters (text) as well.

We used HOG (Histogram of Oriented Gradients) descriptors to represent every character, and an SVM trained on positive and negative sets of images to detect the queried character or text.

The idea is to scan every image with a sliding window. The sliding window is divided into blocks, every block is divided into cells, and every cell has a histogram with a fixed number of orientation bins.

In the project we used a sliding window of size 160x96, a block size of 16x16, a block stride of 8x8, a cell size of 8x8, and 9 orientation bins per cell.

This gives 4 cells in every block and 209 blocks in every window, calculated as follows: (96/8 - 1)*(160/8 - 1) = 11*19 = 209.

In total, the HOG descriptor size for a window is 209x4x9 = 7524.
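The arithmetic above can be checked with a short sketch. It is only an illustration: the struct and function names are hypothetical, and reading 160 as the window width and 96 as its height is an assumption (the result is the same either way).

```cpp
#include <cassert>
#include <cstddef>

// Geometry of a HOG descriptor; all sizes are in pixels.
// The names here are illustrative and not part of any library.
struct HogGeometry {
    int winW, winH;        // sliding window size
    int blockW, blockH;    // block size
    int strideW, strideH;  // block stride
    int cellW, cellH;      // cell size
    int bins;              // orientation bins per cell
};

// Number of block positions along one axis of the window.
inline int blocksAlong(int win, int block, int stride) {
    assert(win >= block && (win - block) % stride == 0);
    return (win - block) / stride + 1;
}

// Total number of (overlapping) blocks in one window.
inline int blocksPerWindow(const HogGeometry& g) {
    return blocksAlong(g.winW, g.blockW, g.strideW) *
           blocksAlong(g.winH, g.blockH, g.strideH);
}

// Number of cells in one block.
inline int cellsPerBlock(const HogGeometry& g) {
    return (g.blockW / g.cellW) * (g.blockH / g.cellH);
}

// Length of the descriptor vector for one window.
inline std::size_t descriptorLength(const HogGeometry& g) {
    return static_cast<std::size_t>(blocksPerWindow(g)) *
           static_cast<std::size_t>(cellsPerBlock(g)) *
           static_cast<std::size_t>(g.bins);
}

// The values used in this project (160 taken as width: an assumption).
inline HogGeometry projectGeometry() {
    return {160, 96, 16, 16, 8, 8, 8, 8, 9};
}
```

With the project's values, blocksAlong gives 19 and 11 positions along the two axes, which matches the (160/8 - 1) and (96/8 - 1) factors used above.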


HOG example:

Figure 2: example of a HOG image



The goal is to create a file containing the data of all images related to the current positive character, which will be used as input for the SVM trainer.

Every line in the file should have the following format:

<label> <index>:<value> <index>:<value> …
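As a rough sketch of how one such line could be produced from a descriptor vector: the function name is hypothetical, and the label values (for example 1 for positive and -1 for negative samples) are an assumption, since the labels actually used are not stated here. libsvm feature indices start at 1.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Formats one training sample as a libsvm input line:
//   <label> 1:<v1> 2:<v2> ...
// Every value is written out, including zeros.
inline std::string toLibsvmLine(int label, const std::vector<float>& descriptor) {
    assert(!descriptor.empty());  // a sample without features is meaningless
    std::ostringstream line;
    line << label;
    for (std::size_t i = 0; i < descriptor.size(); ++i) {
        // libsvm indices are 1-based, so shift the vector index by one.
        line << ' ' << (i + 1) << ':' << descriptor[i];
    }
    return line.str();
}
```

For a real sample the vector would hold the 7524 HOG values of one character image, giving one long line per image.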

We use hog.compute(img, descriptors, Size( 1, 1 ), Size( 0, 0 )) to calculate the descriptors for each image of a positive character, where:

-         Size(1,1) is the sliding-window stride. Because the window has the same size as the image, there is only one window position and the stride has no effect.

-         Size(0,0) is the padding size. In this case it is zero because we chose the sliding window size to be the same as the image size, so there are no margins.

Every descriptor vector is then written to the SVM trainer’s input file.
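To see why a single descriptor comes out of each training image, the number of window positions can be counted with the usual sliding-window formula. This is a simplified sketch with hypothetical function names; it is not OpenCV's internal code, which may additionally align the padding.

```cpp
#include <cassert>

// Window positions along one axis when scanning an image of the given
// size with the given window, stride and padding (all in pixels).
inline int windowsAlong(int image, int window, int stride, int padding) {
    assert(stride > 0 && padding >= 0);
    const int padded = image + 2 * padding;  // padding is added on both sides
    if (padded < window) return 0;           // the window does not fit
    return (padded - window) / stride + 1;
}

// Total number of window positions in the image.
inline int windowsInImage(int imgW, int imgH, int winW, int winH,
                          int strideW, int strideH, int padW, int padH) {
    return windowsAlong(imgW, winW, strideW, padW) *
           windowsAlong(imgH, winH, strideH, padH);
}
```

When the image and the window have the same size and the padding is zero, the count is 1 for any stride, so the output holds exactly one descriptor of 7524 values.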


To train on the positive characters we used svm-train.exe from the libsvm library.

The arguments we used:

1.     The input file with the calculated HOGs of the characters’ positive images.

2.     -s svm_type : C_SVC, which tolerates imperfect separation of the classes.

3.     -t  kernel_type : linear

4.     -c   cost : a value between 0.001 and 10. We will show later that this parameter did not have a significant impact on the predicted results.
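The arguments above correspond to a command line of the form svm-train.exe -s 0 -t 0 -c <cost> <input file> <model file>, where 0 is libsvm's numeric code for both C-SVC and the linear kernel. The sketch below builds such command strings. The file names, the function names and the particular cost grid (powers of ten from 0.001 to 10) are assumptions made for illustration; only the range comes from the text.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Builds one svm-train command line for a C-SVC with a linear kernel.
inline std::string buildTrainCommand(const std::string& inputFile,
                                     const std::string& modelFile,
                                     double cost) {
    assert(cost > 0.0);  // libsvm requires a positive cost
    std::ostringstream cmd;
    cmd << "svm-train.exe"
        << " -s 0"          // svm_type 0 = C-SVC
        << " -t 0"          // kernel_type 0 = linear
        << " -c " << cost   // cost parameter C
        << ' ' << inputFile << ' ' << modelFile;
    return cmd.str();
}

// Builds one command per cost value, to compare the trained models.
// The grid below is a hypothetical choice within the range 0.001 to 10.
inline std::vector<std::string> buildCostSweep(const std::string& inputFile) {
    const double costs[] = {0.001, 0.01, 0.1, 1.0, 10.0};
    std::vector<std::string> commands;
    for (double c : costs) {
        std::ostringstream model;
        model << "model_c" << c << ".txt";  // hypothetical model file name
        commands.push_back(buildTrainCommand(inputFile, model.str(), c));
    }
    return commands;
}
```

Each returned string could be passed to std::system or pasted into a shell; training one model per cost value is one way to check how much the parameter affects the predictions.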



In Figure 3 the marked regions are those detected by the SVM detector as the letter “D”. The wrong detections of “E” and “IJ” as “D” are caused by the variation among the positive images of “D”.

Figure 3: result of querying letter "D"



Figure 4 shows images from the positive set of “D” that could match the combination “I J”.


Figure 4: from left to right I J, I I