# Abridged task description
The full task description is available here
You are given a set of photos of animals in the wilderness. The judging server runs a CLIP model that can recognize animals. For each image, you need to find a rectangular crop that meets the following conditions:
- The rectangular crop covers no more than 6.25% of the image.
- The CLIP model must be able to correctly identify the animal in the cropped image.
The same CLIP model used for judging is provided in the training environment, so you can inspect the model and see how it classifies images.
The one caveat is that the test set had 700 images of 224x224 pixels, and your notebook needed to finish executing in less than 8 minutes, so you couldn’t run the CLIP model too many times per image.
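To make these limits concrete, the arithmetic can be worked through in a few lines. This is only a sketch: the constant names and the `crop_is_allowed` helper are illustrative, and the end-exclusive pixel coordinates are an assumption, not part of the task statement.

```python
import math

IMAGE_SIDE = 224            # images are 224x224 pixels
MAX_AREA_FRACTION = 0.0625  # a crop may cover at most 6.25% of the image
NUM_IMAGES = 700
TIME_LIMIT_S = 8 * 60       # 8-minute notebook limit

# Largest allowed crop area in pixels: 224 * 224 / 16 = 3136.
max_crop_area = int(MAX_AREA_FRACTION * IMAGE_SIDE * IMAGE_SIDE)

# Side of the largest square crop: sqrt(3136) = 56 pixels.
max_square_side = math.isqrt(max_crop_area)

# Average wall-clock budget per image, ignoring model loading and other overhead.
seconds_per_image = TIME_LIMIT_S / NUM_IMAGES

# Number of distinct positions for a 56x56 crop in one image: 169 * 169 = 28561.
square_positions = (IMAGE_SIDE - max_square_side + 1) ** 2


def crop_is_allowed(x0, y0, x1, y1):
    """Check a crop given as pixel coordinates (assumed end-exclusive)."""
    inside = 0 <= x0 < x1 <= IMAGE_SIDE and 0 <= y0 < y1 <= IMAGE_SIDE
    return inside and (x1 - x0) * (y1 - y0) <= max_crop_area
```

The largest square crop is 56x56 pixels, and there are tens of thousands of possible positions for it in every image, while the time limit leaves well under a second per image. Scoring every candidate crop with a separate CLIP call is therefore out of reach.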
# Jury’s solution
Credit: China, Jury members
The Scientific Committee’s reference solution adds extra [CLS] tokens to the ViT, one per candidate mask, and restricts each token’s attention so that it only sees the information in its own masked area. In effect, this processes the different masks in parallel. First, CLIP is called on the unmasked image to get its “true” label (CLIP is very accurate on unmasked images). The image is then passed through CLIP with the added masked [CLS] tokens, and the prediction for each mask is obtained from the cosine similarity between its [CLS] token and the text embeddings. A mask is marked as potentially correct if it returns the same label as the unmasked prediction. About three of these candidates are then checked, and one that is correct is returned. This gets 82–85% accuracy.
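The two core pieces of this idea, the attention mask for the extra [CLS] tokens and the label comparison, can be sketched in plain Python. This is a minimal illustration, not the jury's code. The token order, the rule that ordinary tokens keep full attention among themselves, the region format, and all function names are assumptions made for the example.

```python
import math


def build_attention_mask(grid, regions):
    """Return allowed[q][k]: may query token q attend to key token k?

    Assumed token order: [CLS, grid*grid patch tokens, one extra CLS per region].
    Each region is (row0, col0, row1, col1) in patch units, end-exclusive.
    """
    n_base = 1 + grid * grid
    n = n_base + len(regions)
    allowed = [[False] * n for _ in range(n)]
    # Original tokens attend only to each other, so the unmasked
    # prediction is not affected by the extra tokens (an assumption).
    for q in range(n_base):
        for k in range(n_base):
            allowed[q][k] = True
    # Each extra CLS token sees itself and the patches inside its region.
    for i, (r0, c0, r1, c1) in enumerate(regions):
        q = n_base + i
        allowed[q][q] = True
        for r in range(r0, r1):
            for c in range(c0, c1):
                allowed[q][1 + r * grid + c] = True
    return allowed


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_label(image_emb, text_embs):
    """Index of the text embedding most similar to the image embedding."""
    sims = [cosine(image_emb, t) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def candidate_regions(full_emb, region_embs, text_embs):
    """Regions whose predicted label matches the unmasked prediction."""
    target = predict_label(full_emb, text_embs)
    matches = [i for i, emb in enumerate(region_embs)
               if predict_label(emb, text_embs) == target]
    return target, matches
```

As a usage example, assume a 16-pixel patch size (not stated above), which gives a 14x14 patch grid on a 224x224 image. `build_attention_mask(14, [(0, 0, 3, 3), (11, 11, 14, 14)])` then describes two 3x3-patch regions of 48x48 pixels each, which fit within the 6.25% area limit. In a real model, the boolean matrix would be converted into whatever mask format the attention layers expect, and the region embeddings would come from the extra [CLS] outputs after CLIP's projection.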
References: Message 1