Other than that, the code should be simple to read and understand, given all the comments I added throughout the code 😅 However, if you find something unclear and need some help, don’t hesitate to drop your questions on the forum!
In this method, we’ll compare both articles as a whole, without chunking, by converting each of them into a vector embedding. Then, using cosine similarity, we’ll see how similar they are to each other.
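Before the full script, here’s a quick, self-contained illustration of what cosine similarity actually measures, using two made-up 3-dimensional vectors and no API calls (real embedding vectors have far more dimensions, but the math is the same):

from scipy.spatial.distance import cosine

vec1 = [0.1, 0.9, 0.3]  # Toy vector standing in for an embedding
vec2 = [0.2, 0.8, 0.4]  # Another toy vector pointing in nearly the same direction

similarity = 1 - cosine(vec1, vec2)  # cosine() returns the distance, so 1 minus it gives similarity
print(similarity)  # Prints roughly 0.98, meaning the vectors point in almost the same direction

Now, here’s the full code: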
from scipy.spatial.distance import cosine
import time
import resources
import openai

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

def search_semantically_similar(text_to_check):
    """
    Compares the semantic similarity between the input text and a predefined article text.
    It returns a list containing the similarity score and a boolean indicating whether
    the similarity is considered significant.
    """
    result = []  # Initialize an empty list to store the similarity score and significance flag
    input_vector = convert_to_vector(text_to_check)  # Convert the input text to a vector using an embedding model
    article_text = resources.article_two  # resources.article_two contains the text of the article to compare with
    article_vector = convert_to_vector(article_text)  # Convert the article text to a vector
    similarity = calculate_cosine_similarity(input_vector, article_vector)  # Calculate the cosine similarity between the two vectors
    result.append(similarity)  # Append the similarity score to the list
    result.append(is_similarity_significant(similarity))  # Append the result of the significance check to the list
    return result  # Return the list containing the similarity score and significance flag

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a given text by comparing its semantic similarity
    with a predefined article text. The score is expressed as a percentage.
    """
    data = search_semantically_similar(text)  # Obtain the similarity data for the input text
    data[0] = data[0] * 100  # Convert the first item in the data list (similarity score) to a percentage
    return data  # Return the plagiarism score and significance flag
#MAIN SECTION
start_time = time.time() # Record the start time of the operation
text_to_check = resources.article_one # Assign the text to check for plagiarism
plagiarism_data = calculate_plagiarism_score(text_to_check)  # Compute once to avoid duplicate embedding API calls
plagiarism_score = plagiarism_data[0]  # The similarity score as a percentage
significance = plagiarism_data[1]  # Whether the similarity is considered significant
end_time = time.time() # Record the end time of the operation
runtime = end_time - start_time # Calculate the total runtime
# Output the results
print(f"Plagiarism Score: {plagiarism_score}%") # Print the calculated plagiarism score
print(f"Is Result Significant: {significance}") # Print the significance of the score
print(f"Runtime: {runtime} seconds") # Print the total runtime of the script
As you can see, the code is very similar in structure to method 1. However, the search_semantically_similar function was edited to directly turn both articles into vectors, compare them, and return the result without chunking.
Plus, I added the calculate_plagiarism_score function, which converts the similarity score into a percentage. It then returns the percentage score along with a True/False flag indicating whether the plagiarism score is significant, which is determined by comparing the cosine similarity score against the threshold I set to 0.7.
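If you want a quick feel for these numbers before running the full comparison, here’s a minimal sketch (assuming the functions above are already defined and your OpenAI API key is configured) that compares two short, made-up sentences instead of full articles:

sentence_a = "Generative AI can produce text, images, and other content."  # Made-up example input
sentence_b = "Models based on generative AI are able to create text and images."  # Made-up example to compare against

vec_a = convert_to_vector(sentence_a)  # Embed the first sentence
vec_b = convert_to_vector(sentence_b)  # Embed the second sentence

similarity = calculate_cosine_similarity(vec_a, vec_b)
print(f"Similarity: {similarity * 100:.1f}%")  # Same percentage scale as calculate_plagiarism_score
print(f"Significant: {is_similarity_significant(similarity)}")  # True if similarity >= 0.7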
Now it’s time for AI to enter the battlefield 😂
This method is the same as method 1 in concept; however, instead of comparing the chunks by embedding them into vectors and generating the cosine similarity, we’ll compare them using a power prompt and OpenAI’s GPT model.
from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from SimplerLLM.language.llm import LLM, LLMProvider
import time
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a text chunk with an article text and generates a response using an OpenAI model.
    """
    article_text = resources.article_two  # The text to compare against
    prompt = resources.prompt3  # A template string for creating the comparison prompt
    final_prompt = prompt.format(piece=text_chunk, article=article_text)  # Format the prompt with the chunk and article texts
    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Create an instance of the language model
    response = llm_instance.generate_text(final_prompt)  # Generate a response from the LLM
    response_data = json.loads(response)  # Parse the response into a JSON object
    return response_data  # Return the parsed response data

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a text by comparing its chunks against an article text
    and evaluating the responses from the OpenAI model.
    """
    text_chunks = chunk_by_paragraphs(text)  # Split the input text into chunks using SimplerLLM's built-in method
    total_chunks = text_chunks.num_chunks  # The total number of chunks in the input text
    similarities_json = {}  # Dictionary to store similarities found
    chunk_index = 1  # Index counter for naming the chunks in the JSON
    plagiarised_chunks_count = 0  # Counter for the number of chunks considered plagiarised
    total_scores = 0  # Sum of scores from the LLM responses

    for chunk in text_chunks.chunks:
        response_data = compare_chunks(chunk.text)  # Compare each chunk using the LLM
        total_scores += response_data["score"]  # Add the score from this chunk to the total scores
        if response_data["score"] > 6:  # A score above 6 indicates plagiarism
            plagiarised_chunks_count += 1
            similarities_json[f"chunk {chunk_index}"] = response_data["article"]  # Record the article text identified as similar
        chunk_index += 1  # Increment the chunk index

    plagiarism_result_json = {}  # Dictionary to store the final plagiarism results
    plagiarism_score = (plagiarised_chunks_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate the plagiarism score as a percentage
    plagiarism_result_json["Score"] = plagiarism_score
    plagiarism_result_json["Similarities"] = similarities_json  # Add where we found similarities
    plagiarism_result_json["IsPlagiarised"] = (total_scores > total_chunks * 6)  # Record whether the text as a whole counts as plagiarised

    return plagiarism_result_json  # Return the plagiarism results as a dictionary
#MAIN SECTION
start_time = time.time() # Record the start time of the operation
text_to_check = resources.article_one # Assign the text to check for plagiarism
plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability
end_time = time.time() # Record the end time of the operation
runtime = end_time - start_time # Calculate the total runtime
# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}") # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds") # Print the total runtime of the script
In the code, the main function is calculate_plagiarism_score, which chunks the article, sends each chunk to the compare_chunks function to get a similarity score, generates a total plagiarism score, and formats the results as JSON so they include some details beyond the plagiarism score while staying clear and readable.
The compare_chunks function creates a GPT instance using SimplerLLM, then uses a power prompt to analyze the chunk against the article and generate a score out of 10 for how similar they are. Here’s the prompt I’m using:
### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input chunk,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to
the input chunk. After that you're gonna pick the piece of article which is most similar and generate for it
a score out of 10 for how similar it is to the input chunk. Then you're gonna need to generate the output
as a JSON format that contains the input chunk, the article chunk which is the most similar, and the score
out of 10.

### SCORING CRITERIA
When checking for pieces in the article that are close in meaning to the chunk of text make sure you
go over the article at least 2 times to make sure you picked the right chunk in the article which is the most
similar to the input chunk. Then when picking a score it should be based on how similar the meanings
and structure of both these sentences are.

### INPUTS
input chunk: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format, nothing else. Here's an example structure:
{{
  "chunk": "[input chunk]",
  "article": "[chunk from article which is similar]",
  "score": [score]
}}
As you can see, it is a detailed prompt that is very well crafted to generate a specific result. You can learn how to craft similar prompts yourself by becoming a Prompt Engineer.
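To make the scoring step concrete, here’s a small sketch of how one parsed response from this prompt feeds into the threshold check in calculate_plagiarism_score. The response content below is made up for illustration; the real values come from the model:

response_data = {  # Hypothetical parsed JSON response for a single chunk
    "chunk": "Generative AI refers to deep-learning models that can generate content.",
    "article": "Generative AI refers to deep-learning models that can generate high-quality text, images, and other content.",
    "score": 9
}

if response_data["score"] > 6:  # Same threshold used in calculate_plagiarism_score
    print(f"Plagiarised chunk, matched against: {response_data['article']}")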
This method is a combination of methods 2 and 3, where we’re gonna be comparing both articles as a whole but using AI instead of vector embeddings.
from SimplerLLM.language.llm import LLM, LLMProvider
import time
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a given text chunk with an article to determine plagiarism using a language model.
    Returns dict: The response from the language model, parsed as a JSON dictionary.
    """
    article_text = resources.article_two  # The text to compare against
    # Format the prompt to include both the input text chunk and the article text
    comparison_prompt = resources.prompt4.format(piece=text_chunk, article=article_text)
    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Create an instance of the language model
    response = llm_instance.generate_text(comparison_prompt)  # Generate the response
    response_data = json.loads(response)  # Parse the response string into a JSON dictionary
    return response_data  # Return the parsed JSON data

def calculate_plagiarism_score(text_to_analyze):
    """
    Calculates the plagiarism score based on the analysis of a given text against a predefined article text.
    Returns dict: A JSON dictionary containing the plagiarism score and the raw data from the analysis.
    """
    plagiarism_results = {}  # Dictionary to store the final plagiarism score and analysis data
    plagiarised_chunk_count = 0  # Counter for chunks considered plagiarised

    analysis_data = compare_chunks(text_to_analyze)  # Analyze the input text for plagiarism
    total_chunks = len(analysis_data)  # Total number of chunk pairs returned by the model

    for key, value in analysis_data.items():
        # Check if the value is a list with at least one item that contains a 'score' key
        if isinstance(value, list) and len(value) > 0 and 'score' in value[0] and value[0]['score'] > 6:
            plagiarised_chunk_count += 1
        # Check if the value is a dictionary that contains a 'score' key
        elif isinstance(value, dict) and 'score' in value and value['score'] > 6:
            plagiarised_chunk_count += 1

    plagiarism_score = (plagiarised_chunk_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate plagiarism score as a percentage
    plagiarism_results["Total Score"] = plagiarism_score  # Add the score to the results dictionary
    plagiarism_results["Data"] = analysis_data  # Add the raw analysis data to the results dictionary

    return plagiarism_results  # Return the final results dictionary
#MAIN SECTION
start_time = time.time() # Record the start time of the operation
text_to_check = resources.article_one # Assign the text to check for plagiarism
plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability
end_time = time.time() # Record the end time of the operation
runtime = end_time - start_time # Calculate the total runtime
# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}") # Print the scores
print(f"Runtime: {runtime} seconds") # Print the total runtime of the script
This code is 80% like the code in method 3. However, instead of comparing each chunk separately, we send both articles as a whole and let OpenAI’s GPT run a detailed plagiarism check, comparing all parts of the articles as it sees fit. In the end, it returns a detailed output containing an overall plagiarism score and the pairs of sections found to be most similar, along with their similarity scores.
All this is done using this power prompt:
### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input text,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to
the pieces of the input text. After that you're gonna pick chunk pairs that are most similar to each other
in meaning and structure, a chunk from the input text and a chunk from the article. You will then generate
a score out of 10 for each pair for how similar they are.
Then you're gonna need to generate the output as a JSON format for each pair that contains
the input text chunk, the article chunk which are the most similar, and the score out of 10.

### SCORING CRITERIA
When checking for pieces in the article that are close in meaning to the chunk of text make sure you
go over the article at least 2 times to make sure you picked the right pairs of chunks which are most similar.
Then when picking a score it should be based on how similar the meanings and structure of both these sentences are.

### INPUTS
input text: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format, nothing else. Here's an example structure:
{{
  "pair 1":
  [
    "chunk 1": "[chunk from input text]",
    "article 1": "[chunk from article which is similar]",
    "score": [score]
  ],
  "pair 2":
  [
    "chunk 2": "[chunk from input text]",
    "article 2": "[chunk from article which is similar]",
    "score": [score]
  ],
  "pair 3":
  [
    "chunk 3": "[chunk from input text]",
    "article 3": "[chunk from article which is similar]",
    "score": [score]
  ],
  "pair 4":
  [
    "chunk 4": "[chunk from input text]",
    "article 4": "[chunk from article which is similar]",
    "score": [score]
  ]
}}
The prompts in methods 3 and 4 need to be well-crafted, since all the results depend on them. Feel free to tweak and optimize them to your liking, and if you get better results, make sure to share them with us in the comments below!
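One practical tweak while you experiment: models sometimes wrap their JSON in Markdown code fences or add stray text, which makes json.loads fail inside compare_chunks. Here’s a small, optional sketch of a more forgiving parser you could call instead of json.loads; the fence-stripping logic is my own addition, not part of the original code:

import json

def parse_llm_json(response):
    """Best-effort parsing of an LLM response that is supposed to be pure JSON."""
    cleaned = response.strip()
    if cleaned.startswith("```"):  # Strip Markdown code fences like ```json ... ``` if the model added them
        cleaned = cleaned.strip("`")
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)  # Still raises an error if the content isn't valid JSON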
After trying 2 types of machines to do the work for us, it’s now time to use human intelligence and see if the machines’ results are actually significant!
Here are the 2 texts I was comparing:
Article 1:
What is generative AI? Generative AI refers to deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on. Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of ChatGPT seems to mark a turning point. OpenAI's chatbot, powered by its latest large language model, can write poems, tell jokes, and churn out essays that look like a human created them.
Prompt ChatGPT with a few words, and out comes love poems in the form of Yelp reviews, or song lyrics in the style of Nick Cave.