# Abridged task description
The full task description is available here
You are given a set of points in 5-dimensional space. Each point represents either a “real” or a “fake” painting. It is known that the real and fake points are clustered separately. However, in the training data you only know the labels of a very small number of points.
Your task is to classify all the points in the test data as either “real” or “fake”.
# Unofficial writeup
# Useful info
Many high-performing solutions for this task started by visualizing the data in 2 dimensions by techniques such as PCA/t-SNE. The clusters kind of look like two half moons.
Even if you just projected the first two dimensions, you could still get a decent amount of separation between the two clusters (reference)

With t-SNE the data looks a bit like this:

# 97 points
Credit: Brazil, Hungary
Reduce the data to two dimensions. We can use the K-nearest neighbours algorithm in order to infer all the unknown points. The only problem is that because there is so little labelled training data, KNN can get confused and mislabel points in the clusters.
However this is easily fixed. Since the two clusters are very distinct from each other, we can just manually add labelled points for each of the clusters, and change the value of K until KNN correctly separates the two clusters.
References:
# 99 points
Credit: Australia, Greece
It looks like t-SNE gives us more separated clusters than PCA. So we use t-SNE. Then, instead of relying on the labelled training data, we used the unsupervised DBSCAN algorithm. Setting eps=4 and min_samples=2 on scikit-learn’s DBSCAN implementation was enough for 99.36 points.
Looking back, it was quite lucky that the same code worked across Leaderboard A and B. DBSCAN is very sensitive to the first point given in the input. So for example, if the first point in the validation set belonged to Real, but the first point in the test set belonged to Fake, you might be tricked into a 0.998 on Leaderboard A but ~0.002 on Leaderboard B.
In order to avoid this, the Greece team came up with the funny approach of selecting both their 0.998 and 0.002 submissions for Leaderboard B. You could also avoid this by sorting the points by vertical coordinate.
# Additional cool info
Credit: Brazil
For this specific task there was such clear separation between the t-SNE components that you could just check whether the 2nd (vertical) component was positive or negative for basically full score.