Post

Hackathon across the Globe

For my first ever Hackathon I had two real challenges. The lack of sleep, and creating an AI-Enabled Assessment Marking with the use of ML-based text mining algorithm.


Big Data Society of Macquarie University

Organised by Macquire University Big Data Society and sponsored by Microsoft, AIP Research Center, and ITIC, this Hackthon was for the first time remote whilst being hosted in Australia.

The prizes rounded over 6000$ where my team got the distinguished spot number one and was granted the first prize.

Granted, we did lose some sleep, as Australia’s time zone is 11 hours difference from Lisbon’s1


The Hackathon Competition

The Hackathon had three different challenges, where the first was the most challenging and the last two were entry-level. We decided to focus on the first.

The Problem was the following:

  1. Help professors mark an essay in an automated way
  2. Create a marking application front-end layer to input the data and be presented with some results

This was the Architecture that we decided to use: Architecture

The main focus really relies on the scoring_engine.py where the rest of the page will mention it in more detail, but first…


Data Preparation

Data Preparation As part of the challenge, the given dataset wasn’t totally prepared to be used in a machine learning algorithm. Some tokens weren’t correct and so we used Regex operations to parse them into their more consistent state.

Most of this work was done in Excel and for some small tweaks on the answer fields (our independent variable) we used a simple script in Python


Machine Learning Model

We chose a Naive Bayes Algorithm with a bag of words to train our model.
We split the dataset into training, test and validation as follows:

TrainingTestValidation
70%25%5%

To create our model we used Scikit-Learn which the group was very used to and could immediately start working. There were two important functions that we needed to create, the training function and the prediction function in order to fit the model into our cleaned dataset and get some output from previously unseen inputs.

Training Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def _train_model(df):
    print('Training model...')

    DV = "Score"  # the dependent variable, text is the independent variable here
    X = df.drop([DV], axis=1)  # drop from our X array because this is the text data that gets trained
    y = df[DV]

    # we train on 70% of the data, test on the rest
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

    # Build bag of words
    count_vect = CountVectorizer(token_pattern="[^\s]*", max_features=100)
    X_train_counts = count_vect.fit_transform(X_train["Answer"])

    print(count_vect.vocabulary_)  # Print vocabulary

    X_test = count_vect.transform(X_test["Answer"])

    # fit the training dataset on the NB classifier
    Naive = MultinomialNB()
    Naive.fit(X_train_counts, y_train)

    # predict the labels on validation dataset
    predictions_NB = Naive.predict(X_test)

    # Use accuracy_score function to get the accuracy
    return {"naive": Naive, "vocabulary": count_vect, "accuracy": accuracy_score(predictions_NB, y_test) * 100}

Prediction Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def _predict(model, df):
    Naive = model["naive"]
    count_vect = model["vocabulary"]

    answers = []
    # Calculate marking for each row
    for index, row in df.iterrows():
        answer = [row['Answer']]

        answer_vect = count_vect.transform(answer)  # create bag of words
        predict_onion = Naive.predict(answer_vect)  # applying it to the trained model

        value_as_int_array = predict_onion.astype(int)

        # Score the answers
        grade = value_as_int_array[0]

        answers.append({"id": int(row["ID"]), "answer": row["OriginalAnswer"], "grade": int(grade)})

    return answers

We also created a function to validate the model which gave us some impressive results.

It is important to note that the challenge was divided into four individual problems, each with more complex phrases than the last, these were the results

Validation

Problem #Size of DatasetAccuracy
1262100%
225686%
3258100%
4213100%

The image presented is auto-generated from our front-end layer that we created using React and Nivo.


Finishing Touches

We created a back-end server with Flask in order to server the machine learning model. We selected this framework as it’s simple for our use-case and is also written in Python.

The team was also created a simple website where we could upload the answers of the students (in this case, the dataset) using React and Nivo.

Conclusion

Given the end of the Hackathon, our team pitched the project and two hours later whilst I was sleeping we were announced as the winning team for Problem number one!

And so it was, we won a Hackathon competition, all the way across the globe :D
Thank you to my team! Team

This post is licensed under CC BY 4.0 by the author.