Interesting paper and task: https://openreview.net/pdf?id=XgH1wfHSX8
Proposes a new task: learning a finite mixture of Markov chains.
There are 10 states (0~9) and each task T is defined by a 10x10 transition matrix. The training set is T_train = {T_1, T_2, ..., T_N}, where N controls task complexity. Sequence length l is fixed at 512 during training.
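A minimal sketch of the data setup described above, assuming each task is a row-stochastic 10x10 matrix and sequences are sampled by iterating the chain from a uniform initial state (matrix construction details are my assumption, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 10   # number of states (0-9)
N = 64   # number of chains in the mixture; N controls task complexity
L = 512  # fixed training sequence length

def random_transition_matrix(rng, n_states=S):
    # Hypothetical task sampler: random row-stochastic matrix.
    M = rng.random((n_states, n_states))
    return M / M.sum(axis=1, keepdims=True)  # normalize rows to probabilities

# T_train = {T_1, ..., T_N}
T_train = [random_transition_matrix(rng) for _ in range(N)]

def sample_sequence(T, length=L, rng=rng):
    # Roll out one length-L sequence from transition matrix T.
    seq = [int(rng.integers(S))]  # uniform initial state
    for _ in range(length - 1):
        seq.append(int(rng.choice(S, p=T[seq[-1]])))
    return np.array(seq)

seq = sample_sequence(T_train[0])
```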
4 algorithmic phases, from crossing Unigram/Bigram with Retrieval/Inference:
(Uni-Ret, Uni-Inf, Bi-Inf, Bi-Ret)
Unigram: answers based on simple statistics, a histogram of states (fast but imprecise)
Bigram: answers based on transition statistics (more precise)
Retrieval: depends on the training dataset; good in-distribution (ID) but poor OOD
Inference: independent of the training dataset; generalizes well OOD
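The unigram/bigram distinction above can be sketched as two in-context predictors over a single prompt sequence; the exact estimators are my assumption (add-one smoothing included for illustration), not taken from the paper:

```python
import numpy as np

def unigram_predict(seq, n_states=10):
    # Unigram: histogram of states seen so far, ignoring transition order.
    counts = np.bincount(seq, minlength=n_states).astype(float)
    return counts / counts.sum()

def bigram_predict(seq, n_states=10):
    # Bigram: empirical transition counts, conditioned on the last state.
    counts = np.ones((n_states, n_states))  # add-one smoothing (assumption)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    row = counts[seq[-1]]
    return row / row.sum()
```

An "inference" solution estimates statistics like these from the prompt itself, so it transfers OOD; a "retrieval" solution instead matches the prompt to a memorized training chain, so it fails when the test chain is not in T_train.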
a) data diversity threshold
if N is small (easy task), the Uni-Ret phase dominates.
if N is high enough (complex), training progresses Uni-Inf --> Bi-Inf (good OOD) --> Bi-Ret (ID overfitting).
b) emergence of induction heads
at high N and intermediate training steps, induction heads form.
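As a rough sketch of what an induction head computes (my simplification, not the paper's circuit analysis): look back for the most recent earlier occurrence of the current token and copy the token that followed it. On Markov-chain sequences this behaves like in-context bigram estimation.

```python
def induction_head_predict(seq):
    # Find the most recent earlier occurrence of the last token
    # and predict the token that followed it (copy mechanism).
    last = seq[-1]
    for i in range(len(seq) - 2, -1, -1):
        if seq[i] == last:
            return seq[i + 1]
    return None  # no earlier occurrence to copy from
```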
c) transient nature
the good-OOD algorithmic phase (Bi-Inf) is only active transiently (in the middle of training), and is later displaced by Bi-Ret.