X from R^D (usually D=2), X = x* + noise
the model has D (=2) inputs and N (=2, 3, 6, 12, 24?) output channels
each output channel's target is the classification of x* under the decision boundary given by c_i, b_i (i from 0, 1, ..., N)
so it's basically similar to the parallel setting in the Elia,Omri2024 paper. (the simplicity bias paper)
x* is sampled from the uniform distribution on [-0.5, 0.5]; given D=2, it covers all 4 quadrants.
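a minimal sketch of the data setup above, under assumptions that are not in these notes: the noise is Gaussian with a made-up noise_std, each decision boundary is linear (label = 1 if c_i . x* + b_i > 0), and c_i, b_i are drawn at random. make_dataset and all its parameters are hypothetical names used only for illustration.

```python
import random

def make_dataset(n_samples, d=2, n_tasks=3, noise_std=0.1, seed=0):
    """Sample clean points x*, noisy inputs X, and one binary label per task."""
    rng = random.Random(seed)
    # ASSUMPTION: linear boundaries with random direction c_i and offset b_i.
    c = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tasks)]
    b = [rng.uniform(-0.25, 0.25) for _ in range(n_tasks)]

    x_stars, xs, ys = [], [], []
    for _ in range(n_samples):
        # x* is uniform on [-0.5, 0.5] in each dimension (from the notes).
        x_star = [rng.uniform(-0.5, 0.5) for _ in range(d)]
        # X = x* + noise; ASSUMPTION: the noise is isotropic Gaussian.
        x = [v + rng.gauss(0.0, noise_std) for v in x_star]
        # Labels are computed from the clean x*, not from the noisy X.
        y = [
            int(sum(cij * v for cij, v in zip(c_i, x_star)) + b_i > 0)
            for c_i, b_i in zip(c, b)
        ]
        x_stars.append(x_star)
        xs.append(x)
        ys.append(y)
    return x_stars, xs, ys

x_stars, xs, ys = make_dataset(n_samples=100)
```

the model would take xs as input (D values per sample) and predict ys (N values per sample); x_stars is kept only to check the labels.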
when they test OOD, they restrict the sign of each dimension of x*
for example, train on x* from only one quadrant and test on the remaining three quadrants.
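the quadrant-based OOD split can be sketched as follows. this is only an illustration: sample_x_star, in_orthant and the choice of the (+, +) quadrant for training are hypothetical, and the test points are drawn from the full range and then filtered.

```python
import random

def sample_x_star(rng, d=2, signs=None):
    """Draw x* uniformly from [-0.5, 0.5]^d, optionally fixing each sign.

    signs: tuple of +1/-1 per dimension, or None for the full range.
    """
    x = [rng.uniform(-0.5, 0.5) for _ in range(d)]
    if signs is not None:
        # Fold each coordinate into the requested half-line.
        x = [abs(v) * s for v, s in zip(x, signs)]
    return x

def in_orthant(x, signs):
    """True if every coordinate of x has the sign given in `signs`."""
    return all(v * s >= 0 for v, s in zip(x, signs))

rng = random.Random(0)
train_signs = (1, 1)  # ASSUMPTION: train on the (+, +) quadrant only

# In-distribution: x* restricted to one quadrant.
train = [sample_x_star(rng, signs=train_signs) for _ in range(1000)]

# OOD: sample from all four quadrants, keep points outside the training one.
candidates = [sample_x_star(rng) for _ in range(4000)]
test = [x for x in candidates if not in_orthant(x, train_signs)]
```

filtering keeps the test points uniform over the three held-out quadrants, so roughly a quarter of the candidates are discarded.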
we can also link this with the paper at https://arxiv.org/pdf/1902.07275
they trained all N_task tasks simultaneously, but if we add the tasks one by one, the pre-formed representation for each task would be disrupted