IOAI 2025 Day 2 - Antique (Problem 2)

09 Aug 2025

IOAI 2025

# Abridged task description

You are given a set of points in 5-dimensional space. Each point represents either a “real” or a “fake” painting. It is known that the real and fake points are clustered separately. However, in the training data you only know the labels of a very small number of points.

Your task is to classify all the points in the test data as either “real” or “fake”.

# Unofficial writeup

# Useful info

Many high-performing solutions for this task started by visualizing the data in 2 dimensions by techniques such as PCA/t-SNE. The clusters kind of look like two half moons.

Even if you just projected the first two dimensions, you could still get a decent amount of separation between the two clusters (reference)

PCA projection image

With t-SNE the data looks a bit like this:

t-SNE image

# 97 points

Credit: Brazil, Hungary

Reduce the data to two dimensions. We can use the K-nearest neighbours algorithm in order to infer all the unknown points. The only problem is that because there is so little labelled training data, KNN can get confused and mislabel points in the clusters.

However this is easily fixed. Since the two clusters are very distinct from each other, we can just manually add labelled points for each of the clusters, and change the value of K until KNN correctly separates the two clusters.

# 99 points

Credit: Australia, Greece

It looks like t-SNE gives us more separated clusters than PCA. So we use t-SNE. Then, instead of relying on the labelled training data, we used the unsupervised DBSCAN algorithm. Setting eps=4 and min_samples=2 on scikit-learn’s DBSCAN implementation was enough for 99.36 points.

Looking back, it was quite lucky that the same code worked across Leaderboard A and B. DBSCAN is very sensitive to the first point given in the input. So for example, if the first point in the validation set belonged to Real, but the first point in the test set belonged to Fake, you might be tricked into a 0.998 on Leaderboard A but ~0.002 on Leaderboard B.

In order to avoid this, the Greece team came up with the funny approach of selecting both their 0.998 and 0.002 submissions for Leaderboard B. You could also avoid this by sorting the points by vertical coordinate.

# Additional cool info

Credit: Brazil

For this specific task there was such clear separation between the t-SNE components that you could just check whether the 2nd (vertical) component was positive or negative for basically full score.