February 3, 2023

CheckGPT - a neural network for detecting text generated by large language models (ChatGPT, GPT3, BLOOM, you.com AI).

Figures: the CheckGPT Telegram bot greeting; checking a text written by a human; checking a text generated by ChatGPT.

Motivation

At the end of 2022, the American company OpenAI introduced ChatGPT, an artificial intelligence (AI) chatbot based on the advanced neural network language model GPT-3.5 "davinci", tuned for conversational communication. ChatGPT quickly gained popularity online: within a week of its launch, CEO Sam Altman reported on Twitter that it had reached 1 million users. How has this chatbot attracted such an audience, and why does it show such results?

ChatGPT's ability to provide complex, real-time conversational responses has already been called a potential breakthrough in artificial intelligence development [1] [2].

ChatGPT is a revolution in the world of information handling. Finding, creating and processing information will never be the same again. The year 2022 will no doubt go down in history as the starting point for big generative models of the text-to-anything kind (text-to-code, text-to-image, text-to-video, text-to-3D, text-to-music, text-to-audio, etc.).

ChatGPT has completely changed the rules of the game and shown the great potential of artificial intelligence. It can write not only poems, books, songs, a greeting speech or a blog article. It can also write a resume, including one that will make big companies eager to hire you. It can defend you in court or even help you write your dissertation.

Microsoft is investing $10 billion in OpenAI, and Google has declared a "code red" and is having emergency meetings with Brin and Page about new threats from chatbots.

While most people interacting with ChatGPT for the first time are amazed by the possibilities, there are others who are concerned about the impact this new technology could have in the years and decades to come.

ChatGPT can produce syntactically and grammatically correct texts of any length while preserving logic and coherence. For example, schoolchildren and students can get unique essays, papers, reports and theses on a given topic in a short time. Scientists can use a neural network to create scientific papers and articles indistinguishable from professional ones written by humans. The growth of plagiarism via language models is also a challenge.

ChatGPT content is succinct, informative and searchable, and it can simplify complex concepts in a variety of areas for understanding and learning. ChatGPT generalizes well, which allows complex concepts and their combinations to be described simply, as if a very smart person were explaining them to a child in plain terms.

The Risks

New opportunities bring new potential threats. In a short time, the Internet can be filled with gigantic volumes of "synthetic" information generated by large language models.

It is important to understand that such "synthetic" information is not always generated for good purposes. For example, it can be used to mislead, propagandize, misinform and deceive people with an unprecedented level of plausibility and manipulation of facts, which is very difficult to detect.

How can I tell if the text is generated by a neural network (ChatGPT and the like)?

We have been thinking about this question since ChatGPT was released, and as a result we created our own neural network, CheckGPT, which detects with up to 98% accuracy whether a text was produced by a large language model (ChatGPT, GPT3, BLOOM, etc.) or written by a human.

How does CheckGPT work?

To determine whether a text was generated or written by a human, we use a combination of statistical and heuristic methods:

  • Statistical attributes take into account metrics such as text readability and coherence indices, text complexity, perplexity, the number of unique and compound words, the length of words and sentences, the number of characters, unigrams and tokens, etc.
  • Heuristic attributes take into account specific turns of phrase and words extracted from the text, non-standard wording and sentence structures, and deviations from human-written texts.
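
A few of the simpler statistical attributes listed above can be sketched in Python. This is only an illustration and not CheckGPT's actual feature extractor: the function name `basic_text_stats`, the regular expressions and the choice of features are assumptions made for the example.

```python
import re
from statistics import mean


def basic_text_stats(text: str) -> dict:
    """Compute a few simple surface statistics for a text (illustrative only)."""
    # Rough sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters/digits, optionally with an inner apostrophe.
    # [^\W_] also matches non-Latin letters, so Russian text works too.
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    if not words:
        raise ValueError("text must contain at least one word")
    unique_words = set(words)
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
        "num_unique_words": len(unique_words),
        "unique_word_ratio": len(unique_words) / len(words),
        "avg_word_length": mean(len(w) for w in words),
        "avg_sentence_length": len(words) / len(sentences),
    }
```

For instance, `basic_text_stats("The cat sat. The cat ran!")` counts 6 words, 4 of them unique, in 2 sentences. A dictionary like this could then be turned into a feature vector for a classifier.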

As a result, an ML algorithm for text classification emerged that uses, among other things, the following metrics:

  • Perplexity, or "uncertainty coefficient" - a measure used to evaluate language models. Here it indicates how predictable, and in that sense how complex, a text is.
  • Automated readability index (ARI) - a measure of how difficult a text is for the reader to perceive, which approximates the complexity of the text.
  • Correlation of the occurrence of turns of phrase in texts.
  • Text complexity index.
  • Flesch reading ease formula - a formula that estimates how easy a text is to read.
  • Coleman–Liau index - a readability index which, along with ARI, can be used to estimate reading difficulty by approximating the complexity of the text.
  • Text uniqueness - a measure of how probable or unique the words and their combinations in sentences are.
  • Cohesion of sentences - the lexical and grammatical connectedness that links sentences into a coherent whole and gives them meaning; one of the defining characteristics of text/discourse and one of the necessary conditions for textuality.
  • Text coherence - an assessment of the integrity of the text, which lies in the logical-semantic, grammatical and stylistic correlation and interdependence of its constituent elements (words, sentences, etc.).
  • Code Mixing Index - a measure of spontaneous switching between languages within a sentence or utterance.
  • and some of our own know-how methods :).
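
Three of the metrics above have simple closed forms and can be sketched as follows. The ARI and Coleman–Liau formulas are the standard published ones. The perplexity function is a deliberately simplified stand-in: it uses a unigram model with add-one smoothing estimated from a reference text, whereas a real detector would more likely use a neural language model. All function names here are assumptions for illustration and do not describe CheckGPT's implementation.

```python
import math
import re
from collections import Counter


def _tokenize(text):
    # Lowercased runs of letters/digits (works for non-Latin scripts too).
    return re.findall(r"[^\W_]+", text.lower())


def _count_sentences(text):
    # Rough heuristic: each run of '.', '!' or '?' ends one sentence.
    return max(1, len(re.findall(r"[.!?]+", text)))


def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = _tokenize(text)
    if not words:
        raise ValueError("text must contain at least one word")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / _count_sentences(text) - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8 (L, S are per 100 words)."""
    words = _tokenize(text)
    if not words:
        raise ValueError("text must contain at least one word")
    letters = sum(ch.isalpha() for w in words for ch in w)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = _count_sentences(text) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def unigram_perplexity(text, reference_text):
    """Perplexity of `text` under an add-one-smoothed unigram model
    estimated from `reference_text` (a toy stand-in for a real LM)."""
    tokens = _tokenize(text)
    if not tokens:
        raise ValueError("text must contain at least one word")
    ref_counts = Counter(_tokenize(reference_text))
    vocab_size = len(set(ref_counts) | set(tokens))
    total = sum(ref_counts.values())
    log_prob = sum(
        math.log((ref_counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))
```

Lower perplexity means the text is more predictable under the model. The Flesch reading ease formula is omitted here because it additionally requires a syllable counter.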

CheckGPT Features:

  • Very fast performance;
  • High accuracy, from 84% up to 98%;
  • Support for long texts and articles;
  • Only English and Russian are supported for now;
  • Detection of content generated by almost all major LLMs;
  • Chunked detection support;
  • Check URL option to analyze an article by its link.
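
One plausible reading of "chunked detection" is that a long text is split into pieces, each piece is scored separately, and the scores are combined. The sketch below illustrates that idea only. The names `split_into_chunks` and `chunked_score`, the `max_words` parameter, the `score_chunk` callable and the word-weighted average are all assumptions and are not taken from CheckGPT.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split a text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def chunked_score(
    text: str,
    score_chunk: Callable[[str], float],  # hypothetical per-chunk detector
    max_words: int = 200,
) -> float:
    """Score each chunk and return the word-count-weighted average score."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("text must contain at least one word")
    weights = [len(chunk.split()) for chunk in chunks]
    scores = [score_chunk(chunk) for chunk in chunks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Here `score_chunk` stands for any function that returns the probability that a chunk is machine-generated. Weighting by word count keeps a short final chunk from dominating the result.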

CheckGPT is able to detect texts written by the following LLMs:

  • ChatGPT
  • GPT3
  • GPT2
  • BLOOM
  • you.com AI

CheckGPT v1 has been released and is available as:

You can check how it works on your own texts, evaluate the effectiveness of the detection, and build it into the logic of your application.

Detectors for Chinese and some other languages will be added soon.

Contacts: @uberwow

keywords: detect ChatGPT, check ChatGPT text, determine ChatGPT text, look for GPT generated content, ChatGPT detector, LLM generated content detect, AI content detection tool, CheckGPT check GPT text, CheckGPT tool to detect machine-written text