AI-generated Text Detection Engine

ChatGPT detector

Project Summary:

This project aims to develop a model for distinguishing between human-written and language model-generated text, with evaluation based on the area under the ROC curve (ROC AUC). We utilised the BERT encoder for text embeddings and built a custom architecture on top featuring a Multi-Head Attention layer. The main challenges were overfitting on the training dataset and handling long text sequences. These were addressed through careful architectural design, troubleshooting, sequential processing of the text input, and leveraging the pre-trained BERT embeddings. The model achieved promising results, demonstrating its effectiveness in discerning language model-generated text.
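
Since the evaluation metric is ROC AUC, scoring a set of predictions is a one-liner with scikit-learn. A minimal sketch; the labels and scores below are made-up illustrative values, not outputs of the actual model:

```python
from sklearn.metrics import roc_auc_score

# Illustrative ground truth (1 = AI-generated, 0 = human) and model
# scores; these numbers are fabricated for demonstration only.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.10, 0.35, 0.80, 0.65, 0.90, 0.40]

auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 here, since every AI score exceeds every human score
```

ROC AUC is threshold-free: it measures how well the scores rank AI-generated text above human text, which is why it suits a detector whose operating threshold may be tuned later.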

For the source code, please refer to: https://github.com/yugamjayant/sept_2023_jh/blob/main/AI_text_detection_d06-transfer-learning-model-03-data-01.ipynb

Data Set:

The model was trained on https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset, which comprises human- and AI-generated essays on the following topics:

  • "Car-free cities"
  • "Does the electoral college work?"
  • "Exploring Venus"
  • "The Face on Mars"
  • "Facial action coding system"
  • "A Cowboy Who Rode the Waves"
  • "Driverless cars"
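
A dataset like this typically loads as a labelled CSV. The sketch below uses a tiny inline sample standing in for the Kaggle file; the column names ("text", "label") and label convention are assumptions and may differ from the actual download:

```python
import io

import pandas as pd

# Inline stand-in for the Kaggle CSV; real file/column names may differ.
csv = io.StringIO(
    'text,label\n'
    '"Cars dominate modern cities and their streets...",0\n'
    '"Venus is often called Earth\'s twin, yet...",1\n'
)
df = pd.read_csv(csv)

human = df[df["label"] == 0]  # assumed: 0 = human-written
ai = df[df["label"] == 1]     # assumed: 1 = LLM-generated
print(len(df), len(human), len(ai))  # 2 1 1
```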

Model Architecture:

We used transfer learning to build a neural network on top of BERT. The model had 2,363,137 trainable parameters; BERT's own parameters were frozen (not trainable).
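
The idea can be sketched in PyTorch as a small trainable head, with a Multi-Head Attention layer over the (frozen) BERT sequence output. The layer sizes and pooling here are illustrative assumptions, not the notebook's exact architecture, though the trainable-parameter count lands in the same ~2.36M ballpark:

```python
import torch
import torch.nn as nn


class DetectionHead(nn.Module):
    """Trainable head over frozen BERT token embeddings (hidden size 768).

    Illustrative sketch only; the original notebook's exact layers differ.
    """

    def __init__(self, hidden=768, heads=8):
        super().__init__()
        # Self-attention over the BERT sequence output
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, bert_out):  # bert_out: (batch, seq_len, hidden)
        attended, _ = self.attn(bert_out, bert_out, bert_out)
        pooled = self.norm(attended).mean(dim=1)  # average-pool tokens
        return torch.sigmoid(self.fc(pooled))     # P(AI-generated)


head = DetectionHead()
dummy = torch.randn(2, 16, 768)  # stand-in for frozen BERT embeddings
probs = head(dummy)
print(probs.shape)  # torch.Size([2, 1])

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(trainable)  # ~2.36M, same order as the count reported above
```

In practice the frozen encoder (e.g. `BertModel.from_pretrained("bert-base-uncased")` from Hugging Face Transformers, with `requires_grad` disabled) would produce `bert_out`, and only this head would be updated during training.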
