COCO (Common Objects in Context) is a large-scale dataset for computer vision, widely used for training and evaluating models on tasks like object detection, segmentation, and image captioning. The COCO dataset contains on the order of 330,000 images, with about 200,000 of those images labeled with extensive annotations (the remainder reserved for testing).

Each annotated image comes with one or more of the following: bounding boxes around objects (for detection), class labels for each object (80 object categories in total, ranging from person to dog to chair), segmentation masks outlining object shapes (for instance and semantic segmentation), keypoints for certain object types (e.g., human body landmarks for pose estimation), and multiple descriptive captions for the image.

COCO’s images are complex everyday scenes – “objects in context” means that images typically contain multiple objects interacting in natural environments, unlike simpler datasets that might show a single object against a clean background. For example, an image might show a living room with several people, a couch, a TV, and a cat; COCO would annotate each person, the cat, and the other salient objects with a location and category. This richness makes COCO a challenging benchmark that pushes models to detect objects under occlusion and in diverse contexts.

Introduced by Microsoft researchers in 2014, COCO quickly became a standard benchmark for the computer vision community. It powers the annual COCO competition, where algorithms compete on tasks like object detection (localizing and classifying all objects in an image) and instance segmentation (precisely outlining each object). Thanks to its scale and diversity, models pre-trained on COCO for detection or segmentation are often used as off-the-shelf starting points for related tasks: a model trained on COCO’s 80 object classes can be fine-tuned to a custom set of objects, typically with far fewer training images than training from scratch would require.
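Concretely, COCO distributes its labels as JSON files with three top-level lists – images, categories, and annotations – linked by numeric IDs, with bounding boxes given as [x, y, width, height] in pixels. A minimal sketch of that structure (the image, box, and ID values here are made up for illustration):

```python
# Tiny COCO-style annotation file built in memory (illustrative values,
# not taken from the real dataset)
coco_json = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}
    ],
    "categories": [
        {"id": 1, "name": "person", "supercategory": "person"},
        {"id": 17, "name": "cat", "supercategory": "animal"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels; iscrowd=0 marks a
        # single object instance (1 would mark a crowd region)
        {"id": 101, "image_id": 1, "category_id": 17,
         "bbox": [120.0, 200.0, 150.0, 90.0], "area": 13500.0, "iscrowd": 0}
    ],
}

# Index annotations per image, much like COCO API tools do internally
anns_by_image = {}
for ann in coco_json["annotations"]:
    anns_by_image.setdefault(ann["image_id"], []).append(ann)

cat_names = {c["id"]: c["name"] for c in coco_json["categories"]}
for ann in anns_by_image[1]:
    print(cat_names[ann["category_id"]], ann["bbox"])
```

In practice one rarely parses these files by hand – libraries such as pycocotools wrap exactly this indexing – but the schema above is what those tools read.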
COCO also established standardized evaluation metrics – notably mean Average Precision (mAP) for detection, averaged over a range of intersection-over-union (IoU) thresholds from 0.50 to 0.95 – which became a common way to report detection performance.
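The IoU underlying those thresholds is simply the area of overlap between a predicted and a ground-truth box divided by the area of their union. A minimal sketch, using COCO's [x, y, width, height] box convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x, y, width, height]."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Identical boxes give IoU 1.0; a box shifted halfway across gives
# 50 / (100 + 100 - 50) = 1/3
print(iou([0, 0, 10, 10], [0, 0, 10, 10]))
print(iou([0, 0, 10, 10], [5, 0, 10, 10]))
```

A detection counts as a true positive only if its IoU with a ground-truth box clears the threshold, so averaging mAP from a loose threshold (0.50) to a strict one (0.95) rewards models that localize precisely, not just approximately.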