Computer vision and image recognition. What’s the difference?
Serge Korzh
25 September 2020
Reading Time: 5 minutes
You've started your journey into the world of deep learning, and you're thrilled to do something cool with images. While searching the Internet for guides and tutorials, you'll probably stumble upon many terms: computer vision, image recognition, object detection, OCR... It is very easy to get lost in all these notions. And it gets even worse: as you read through the resources, you discover that people use one term in many, sometimes conflicting, ways! So in this article, I will outline the core concepts and give simple definitions for your future reference.
How to make computers understand images
Computer vision (CV) is not just about identifying objects in images. It comprises all sorts of tools and methods that allow computers to interpret, understand, and extract information from visual data. Let's take this picture as an example:
A photo of Mulberry street, New York, circa 1900. Source: Wikimedia Commons.
- How many people are in the photo?
- How wide is the street?
- What is the weather like on that day?
- What are the different people doing?
Humans are very good at answering these questions effortlessly. But how can computers learn to answer them? First of all, we can distinguish between two different types of information encoded in any image: spatial (geometric) and semantic.
We're dealing with spatial information when we want to know the shapes and dimensions of different objects in the scene, their position and orientation relative to each other, or even the spatial relation between several images (e.g., panorama stitching). Knowing spatial properties is necessary for 3D reconstruction, image search, and especially for AR and robotics.
Visualization of SIFT keypoint descriptors done with OpenCV. SIFT is a popular technique for identifying features in an image that can later be used for image stitching, image indexing, or even object recognition. These aren't semantic features but instead visually distinctive keypoints.
On the other hand, semantic information has more to do with what is shown in an image. If we look at the picture above, we unconsciously extract a lot of semantic information: it's an old photograph of a crowded city street, the people seem to be busy with their work, there's a fruit market, brewery, shoe repair, etc. This is where image recognition proves useful: it is used in countless applications such as autonomous manufacturing, access control, and self-driving cars.
Classifying things in an image is one of the common semantic information extraction tasks.
But what is image recognition exactly?
Given that image recognition is a popular buzzword used in countless situations, it is somewhat unsatisfying to hear that there is no clear definition of what it comprises. The phrase is often used to denote any technology that can "recognize" or classify something in an image, but the complexity of the actual problem varies greatly. So, what are the possible tasks you might stumble upon when doing image recognition?
A dictionary of useful terms
(save the page for later reference!)
- Image classification – given an image, tell me what one thing is displayed in it. In other words, determine the "class" of the image. It can be a simple binary problem, like "there is a dog" versus "no dogs", or a more complex one like finding the breed of a dog out of 132 available options.
- Object localization – here, not only are we interested in the class, but we also want to know the exact boundaries within which the object lies. Such a boundary is specified by a bounding box (see examples below).
- Object detection – similar to localization, but now there may be many objects of different classes in one image. Our task is to identify all instances of objects from known classes as well as their bounding boxes.
- Semantic segmentation – instead of rectangular bounding boxes like in object detection, we want to "segment" the whole image into arbitrarily shaped areas corresponding to identified objects. Semantic segmentation is also called image segmentation, but keep in mind that there are other non-semantic kinds of image segmentation (e.g., color quantization).
Examples of the four most common image recognition tasks. As you can see, there is still room for improvement, even for state-of-the-art models. The test image is taken from the ImageNet dataset.
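To make the difference between the four tasks more concrete, here is a minimal sketch of what each one could return for the same picture. Everything in it is made up for illustration: the tiny 6×4 "image" with one dog and one cat, the class names, and the `BoundingBox` and `LabelledBox` types are assumptions, not the output format of any particular library.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates, corners inclusive."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return (self.x_max - self.x_min + 1) * (self.y_max - self.y_min + 1)


@dataclass
class LabelledBox:
    label: str
    box: BoundingBox


# Image classification: one label for the whole image.
classification: str = "dog"

# Object localization: the label plus a single bounding box.
localization = LabelledBox("dog", BoundingBox(0, 1, 2, 3))

# Object detection: every known object gets its own label and box.
detection: List[LabelledBox] = [
    LabelledBox("dog", BoundingBox(0, 1, 2, 3)),
    LabelledBox("cat", BoundingBox(4, 0, 5, 2)),
]

# Semantic segmentation: one label per pixel
# ("d" = dog, "c" = cat, "." = background).
segmentation: List[str] = [
    "....cc",
    "dd..cc",
    "ddd..c",
    ".dd...",
]

# Count how many pixels were assigned to each class.
pixels_per_class = Counter("".join(segmentation))
```

In this toy example the dog's bounding box covers 9 pixels, while the segmentation mask assigns only 7 pixels to the dog. That gap is the reason segmentation is preferred when the exact shape of an object matters.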
There are a couple of specific instances of these problems which are so popular that they got their own names:
- OCR – Optical Character Recognition. The task is to localize and classify characters in an image (i.e., object detection), and then stitch them into words and sentences, producing a machine-readable transcript of all the text visible in the image. It was one of the first big successes of computer vision.
- Face detection – just like with object detection, but the objects of interest are faces. It's responsible for those white frames that appear around faces when taking a photo with a modern digital camera.
- Facial recognition – here, we're also interested in the specific person staring at the camera 🙂. On the surface, it seems just like face detection but with every person having their own class. However, the task is much more difficult, since the system must correctly identify the person from only a couple of previous examples of their face. Perhaps the most prominent example is Apple's Face ID.
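Why is it a problem to have only a couple of examples per person? A classifier with one class per person would need retraining for every new face. One common workaround, which this article does not cover in detail, is to turn each face into a short list of numbers and identify a person by finding the closest stored example. Below is a minimal sketch of that matching step. The three-number descriptors, the names, and the `threshold` value are all hypothetical. In a real system the descriptors would come from a trained model.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two face descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(face: Vector,
             gallery: Dict[str, List[Vector]],
             threshold: float) -> Optional[str]:
    """Return the enrolled person with the closest stored example,
    or None if no example is closer than the threshold."""
    best_name: Optional[str] = None
    best_distance = threshold
    for name, examples in gallery.items():
        for example in examples:
            d = distance(face, example)
            if d < best_distance:
                best_name, best_distance = name, d
    return best_name


# Hypothetical descriptors: only two stored examples per person.
gallery = {
    "alice": [[0.9, 0.1, 0.3], [0.8, 0.2, 0.3]],
    "bob": [[0.1, 0.9, 0.5], [0.2, 0.8, 0.6]],
}

known_face = identify([0.85, 0.15, 0.3], gallery, threshold=0.4)
stranger = identify([0.5, 0.5, 0.95], gallery, threshold=0.4)
```

Here `known_face` is `"alice"` and `stranger` is `None`, because the second descriptor is not close enough to any stored example. Adding a new person only means adding entries to `gallery`. Nothing has to be retrained.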
As you may have noticed, most of these tasks require both semantic and spatial processing of images. In fact, it is the interplay between the two that proves to be so valuable.
Bear in mind that this is not a comprehensive list of problems in image recognition. Researchers continuously push the limits of what machines can do with images. For example, automatic image captioning technology can generate a short description of an image without any additional input, adding a new layer of complexity – language generation.
And do you remember the black-and-white image we used at the beginning? It turns out there are deep learning models that allow us to colorize such images with no manual work at all!
Example of automatic colorization using DeOldify developed by Jason Antic. Source: Twitter
Image recognition and transfer learning
The success and widespread use of image recognition technology wouldn't be possible without deep learning. Novel approaches such as convolutional neural networks eventually became far more performant on many tasks than the "traditional" computer vision methods. But there is one catch: they require much more data and computing resources to train.
To alleviate these issues, we can use a technique called transfer learning. Instead of training a neural network from scratch, we can take a model already trained for a similar task and use it as a base for our new model. The intuition behind it is that a lot of the knowledge about how to recognize objects is the same, no matter what kind of objects you are interested in. Any image consists of simple shapes like corners and edges that merge into more complex patterns. It seems reasonable that, having learned how to recognize those patterns and shapes, the model can learn which combinations of them create which objects in much less time and with much less data.
Visualization of one of the layers of a trained CNN. Notice that one channel has learned to detect wheels on an image, even though it wasn't one of the classes the network was trained to recognize. Such knowledge can be "transferred" to many different problems. Image is taken from Yosinski et al. 2015. Source: GitHub
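The idea of reusing a base and training only a small new part can be sketched without any deep learning library. In the toy example below, a hand-written edge-strength function stands in for the frozen, already-trained layers. This is purely an assumption for illustration: in practice the base would be the convolutional layers of a network pretrained on a large dataset. Only the small logistic-regression "head" is trained, on just four striped images. All function names and the stripe dataset are made up.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def pretrained_features(image: Image) -> Tuple[float, float]:
    """Frozen base (stand-in for pretrained layers): average edge strength
    between neighbouring columns and between neighbouring rows."""
    across_cols = [abs(row[x + 1] - row[x])
                   for row in image for x in range(len(row) - 1)]
    across_rows = [abs(image[y + 1][x] - image[y][x])
                   for y in range(len(image) - 1)
                   for x in range(len(image[0]))]
    return (sum(across_cols) / len(across_cols),
            sum(across_rows) / len(across_rows))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, epochs: int = 200, lr: float = 0.5):
    """Train only the new head; the base is used but never updated."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for image, label in samples:
            f = pretrained_features(image)
            error = sigmoid(w[0] * f[0] + w[1] * f[1] + b) - label
            w[0] -= lr * error * f[0]
            w[1] -= lr * error * f[1]
            b -= lr * error
    return w, b


def predict(image: Image, w: List[float], b: float) -> int:
    f = pretrained_features(image)
    return int(sigmoid(w[0] * f[0] + w[1] * f[1] + b) > 0.5)


def vertical_stripes(brightness: float, size: int = 4) -> Image:
    return [[brightness * (x % 2) for x in range(size)] for _ in range(size)]


def horizontal_stripes(brightness: float, size: int = 4) -> Image:
    return [[brightness * (y % 2)] * size for y in range(size)]


# New task with very little data: 1 = vertical stripes, 0 = horizontal.
training_set = [
    (vertical_stripes(1.0), 1), (vertical_stripes(0.5), 1),
    (horizontal_stripes(1.0), 0), (horizontal_stripes(0.5), 0),
]
w, b = train_head(training_set)
```

After training, `predict(vertical_stripes(0.8, size=6), w, b)` returns 1 and `predict(horizontal_stripes(0.8, size=6), w, b)` returns 0, even though images of that size and brightness were never seen during training. The head has only three numbers to learn because the base already does the hard part of describing the image.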
[BONUS] Train your own classifier with transfer learning!
It's useful to know the theory, but it is far more enjoyable to actually train an image classifier. Let's do so! No need to install anything, just open this Colab notebook and learn how to use transfer learning to train image recognition models fast in the cloud. Alternatively, you can set up the project locally by cloning our GitHub repository.
P.S. If you want to see a real-life example of how transfer learning can be used, check out our dog breed recognizer at breedread.kiwee.eu that was built on the same principles described in the tutorial.