Intro
My first competition in quite a few months! I’m still new to the scene, so please don’t slam me for the suboptimal solutions here.
The themes for this competition are NLP, Audio and CV. After spending ~10 minutes reading the problem statements, I’d rank the difficulty as Essay Gap < Face Matching < Audio Demixing.
Task 1: Essay Gap
Problem statement (simplified):
Given a cloze task with a missing sentence, train/fine-tune a model to choose, out of 4 options, the one that maximizes the coherence of the text.
Main idea:
Get a small pretrained language model (e.g. DeBERTa-v3-small) and fine-tune it on the training set.
- For each training sample, we make 4 copies of it, each filled with one of the options, e.g. before {opt_X} after.
- The model runs each filled-in text through a transformer and produces a score for each option. Softmax the scores to get probabilities.
- Fine-tune the model on the correct labels.
- Run the same model on the test set and take the max-probability option as the prediction.
Code:
Coming to GitHub soon!
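In the meantime, here is a minimal sketch of the idea using Hugging Face's AutoModelForMultipleChoice, which scores each filled-in text and trains with cross-entropy over the 4 options. This is not my exact competition code, and the dataset field names (before, after, options, label) are placeholders for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

MODEL = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMultipleChoice.from_pretrained(MODEL)

def encode_batch(samples):
    # Make 4 copies of each sample, one per option: "before {opt} after"
    texts = [
        f"{s['before']} {opt} {s['after']}"
        for s in samples
        for opt in s["options"]
    ]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Reshape to (batch, num_choices, seq_len), as the multiple-choice head expects
    return {k: v.view(len(samples), 4, -1) for k, v in enc.items()}

def training_step(samples, optimizer):
    inputs = encode_batch(samples)
    labels = torch.tensor([s["label"] for s in samples])
    out = model(**inputs, labels=labels)   # cross-entropy over the 4 option scores
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def predict(samples):
    logits = model(**encode_batch(samples)).logits           # (batch, 4)
    return logits.softmax(dim=-1).argmax(dim=-1).tolist()    # max-probability option
```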
Improvements
This code runs in ~5 minutes on a laptop GPU and gets an LB score of 0.97. One of my ideas to improve it is to simply use a larger model (microsoft/deberta-v3-large).
A simple change, and it gets me a 1.0 score!
Task 2: Face Matching
Problem statement (simplified):
Given a set of images that are particularly tricky (sunglasses, varying poses, different clothes), cluster them based on the reference images. I tried a few techniques for this, and my score progression was roughly 0.60 -> 0.70 -> 0.89 (BEST).
First idea:
Zero-shot classification
This sounds like a zero-shot classification problem. The simplest approach is to use a CLIP model to extract image embeddings, compute the cosine similarity from each reference image to all the other images, and take the max similarity as the prediction. Simple enough: it gets a 0.53 score with a 0.80 threshold, and decreasing the threshold brought it up to roughly 0.60. Interestingly, TTA didn’t help much here. (If you know why, please let me know!)
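For reference, here is a rough sketch of that baseline. The checkpoint name and the assumption that the data is just two lists of image paths are for illustration only.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize for cosine similarity

def match(ref_paths, query_paths, threshold=0.8):
    ref, query = embed(ref_paths), embed(query_paths)
    sims = query @ ref.T                               # (n_query, n_ref) cosine similarities
    best_sim, best_ref = sims.max(dim=1)
    # Below the threshold, treat the query as "no match"
    return [(r.item() if s >= threshold else None) for s, r in zip(best_sim, best_ref)]
```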
First improvement
Take the statistical significance of each score into account
Some generic-looking photos produced a lot of high-scoring matches; it may be that a “highly generic” face causes all of its matches to score high. To solve this, I normalize the similarity scores using Z-score normalization and use the statistical significance of each match instead of its raw value. This solves the “hoarding” issue and gives a small improvement, up to a 0.70 score.
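A sketch of what that looks like on the similarity matrix from the previous step. The normalization axis (per reference, across queries) and the threshold are my assumptions here, not tuned values.

```python
import torch

def zscore_match(sims, threshold=1.5):
    # sims: (n_query, n_ref) cosine-similarity matrix from the CLIP baseline.
    # Standardize per reference (column): how unusual is this match for that face?
    z = (sims - sims.mean(dim=0, keepdim=True)) / (sims.std(dim=0, keepdim=True) + 1e-8)
    best_z, best_ref = z.max(dim=1)
    return [(r.item() if s >= threshold else None) for s, r in zip(best_z, best_ref)]
```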
Second improvement
Minimizing the “cost” of assignment (Slight improvement only)
Instead of taking the max similarity, we can treat this as an assignment problem and minimize the overall cost of the assignment using the Hungarian algorithm. This gives a slight improvement to a 0.72 score. It works by prioritizing the “global cost” rather than each image's “local max score”. I used this to try to combat the tricky poses that have high similarity to the wrong person in the same pose.
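A sketch using scipy's linear_sum_assignment, assuming roughly one query per reference; if several queries can map to the same reference, the reference columns of the cost matrix would need to be duplicated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(sims):
    """sims: (n_query, n_ref) similarity matrix (raw or z-scored)."""
    cost = -np.asarray(sims)                        # maximize similarity == minimize negative
    query_idx, ref_idx = linear_sum_assignment(cost)
    return dict(zip(query_idx.tolist(), ref_idx.tolist()))   # query -> reference
```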
Last improvement
Larger model…?
I noticed the code runs very fast, so why not swap in a larger CLIP model? I swapped in clip-vit-large-patch14 and it got me to a 0.89 score! This is probably because the larger model learns better representations and can catch smaller details.
Code:
Coming to GitHub soon!
Task 3: Audio Demixing
Damn, I am not good at audio tasks whatsoever. Please suggest resources to study!
Problem statement (simplified):
Given an audio clip with two distinct environmental sounds mixed together, separate it into two audio tracks.
Main idea:
Since there are two sources, we need a loss function that supports permutation invariant training (PIT). I used MSE loss, as per the problem’s evaluation metric. For the model, I used a U-Net architecture that takes in the spectrogram of the mixed audio and outputs two spectrograms.
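A minimal sketch of the permutation-invariant MSE, assuming the U-Net stacks its two output spectrograms along a source dimension (the shapes here are my assumption):

```python
import torch

def pit_mse(est, ref):
    """Permutation-invariant MSE for two sources.
    est, ref: (batch, 2, freq, time) spectrograms (model output / ground truth)."""
    mse = lambda a, b: ((a - b) ** 2).mean(dim=(-2, -1))               # per-sample MSE
    loss_keep = mse(est[:, 0], ref[:, 0]) + mse(est[:, 1], ref[:, 1])  # ordering (0, 1)
    loss_swap = mse(est[:, 0], ref[:, 1]) + mse(est[:, 1], ref[:, 0])  # ordering (1, 0)
    return torch.minimum(loss_keep, loss_swap).mean()                  # best permutation per sample
```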

