In this post, I will show you how to detect the percentage of plagiarism in a piece of text. A direct, practical solution I created and tested!
The idea is very simple, acting as a perfect starting point to check plagiarism for any piece of text. I will explain the approach step by step with a practical example, so let’s start!
Let’s keep things simple with a real practical example! Here is what we need:
1- A function that takes care of the chunking of our text
2- A function that surfs the web and checks if this chunk exists
3- Add up all the results and get the percentage
The first thing we need to do is split the text into smaller chunks like phrases, sentences, and paragraphs; notice how I didn’t say “splitting into individual words.” That’s because individual words appear all over the web, so matching them would make for a much less effective plagiarism test.
Now, let’s make it dynamic!
import re
from typing import List

def chunk_text(text, chunk_by) -> List[str]:
    if chunk_by == "sentence":
        # Split on the whitespace that follows a sentence-ending '.', '!', or '?'
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
        return sentences
    elif chunk_by == "paragraph":
        paragraphs = [paragraph.strip() for paragraph in text.split("\n") if paragraph.strip()]
        return paragraphs
    else:
        raise ValueError("Invalid chunk_by value. Choose 'sentence' or 'paragraph'.")
This function takes as input the text and your chosen chunking method, then if you choose:
- By Sentence: I used a very straightforward method: the text is split wherever a sentence ends with a ‘.’, ‘!’, or ‘?’.
- By Paragraph: I used a similar approach to the one above, which splits the input whenever there’s a new line between paragraphs. In Python, the new line is defined as \n.
This dynamic approach makes it easy to switch to whichever method you prefer. Plus, you can experiment yourself and see how the accuracy changes depending on the text and the method used.
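To make the sentence-splitting behavior concrete, here is a quick, self-contained demo of the same lookbehind-regex idea used in chunk_text (the sample text is made up):

```python
import re

text = "This is one sentence. Is this another? Yes! And a third."
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
print(sentences)
# ['This is one sentence.', 'Is this another?', 'Yes!', 'And a third.']
```

Because the regex splits on the whitespace *after* the punctuation, each chunk keeps its own ending mark, which matters when searching for exact matches later.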
Now that we have split the text into chunks, we need to take each chunk, put it between double quotes like “[chunk]”, and search for an exact match on the internet.
Here’s an example of a unique chunk:
As you can see, no results were found for “Learnwithhasan is the best website” although it’s a well-known fact 😂
💡 Tip 💡
When you’re searching for an exact match of a phrase, you should delimit it with double quotes. This way, the search engine knows you’re looking for that exact phrase rather than doing a normal keyword search.
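For example, building the exact-match query for the phrase from the example above is just a matter of wrapping it in escaped double quotes:

```python
chunk = "Learnwithhasan is the best website"
query = f"\"{chunk}\""  # wrap in double quotes for an exact-phrase search
print(query)
# "Learnwithhasan is the best website"  (including the quotes)
```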
Back to our code:
def search_chunk(chunk) -> bool:
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        return found
    except Exception as e:
        print(f"An error occurred: {e}")
        return False
In this function, I used my library SimplerLLM, specifically a method that uses SerperAPI to search on Google from the code.
To access Google’s search engine from your code, you would normally need an API key and the code to call it. With SimplerLLM, that function is already built in, and you just call it using the “search_with_serpapi” method.
But, you need to generate your API key from their website, create a .env file, and add your key like this:
SERPER_API_KEY = "YOUR_API_KEY"
So, using the above function, each chunk is searched for on Google, and if a result exists, it returns True; otherwise, it returns False.
Now it’s time to take these Trues and Falses and turn them into a percentage:
def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    for chunk in chunks:
        if search_chunk(chunk):
            plagiarised_chunks += 1
    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score
This function works by first calling the chunking method explained in Step 1, and then counting the total number of these chunks.
Using step 2, we determine whether each chunk is available on the web. If it returns True, it increases the count of plagiarized chunks.
After checking all chunks, the plagiarism score is calculated by dividing the number of plagiarized chunks by the total number of chunks, multiplying by 100 to get a percentage. Finally, it returns the plagiarism score as a decimal number(float).
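To see the arithmetic in isolation, here is a minimal sketch of just the scoring step, with the search already done (the True/False flags below are hypothetical stand-ins for the results of Step 2):

```python
def score_from_flags(found_flags):
    # found_flags[i] is True when chunk i was found somewhere on the web
    total = len(found_flags)
    plagiarised = sum(1 for found in found_flags if found)
    return (plagiarised / total) * 100 if total > 0 else 0

# 2 of 4 chunks matched existing content -> 50.0
print(score_from_flags([True, False, True, False]))
```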
Of course, none of the above functions will produce anything until you give them an input and print the result.
# MAIN SECTION
import time

start_time = time.time()

text = "YOUR_TEXT"     # The input text
chunk_by = "sentence"  # "sentence" or "paragraph"

plagiarism_score = calculate_plagiarism_score(text, chunk_by)

end_time = time.time()           # Record the end time
runtime = end_time - start_time  # Calculate the runtime

print(f"Plagiarism Score: {plagiarism_score}%")
print(f"Runtime: {runtime} seconds")  # Print the runtime
In this section of the code, you need to enter the text you want to run the plagiarism checker on, pick your preferred method of chunking, and print the results!
You’ll even get the time it took to generate the results (we’ll use it later🤫)
SimplerLLM is an open-source Python library designed to simplify interactions with large language models (LLMs). It offers a unified interface for different LLM providers and a suite of tools to enhance language model capabilities.
I created it to facilitate coding, and it did indeed save me a lot of time. But the main reason I’m using it in this script is that I’m planning on improving this code more and making it detect similarities, too, not just exact copies of the text. So, keep an eye out for the Semantic Plagiarism Checker Post!
Now, although the script we created is working properly, why don’t we improve it a little?
For example, when we find that the chunk is available on a webpage somewhere, we can fetch the URLs of these web pages. This simple tweak to the code would make the results of this script a lot more interesting, especially if you turned it into a tool with a nice UI.
Here’s what the new code will look like:
def search_chunk(chunk) -> list:
    result = []
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        if found:
            result.append(found)
            result.append(search_results[0].URL)
        else:
            result.append(found)
            result.append("None")
        return result
    except Exception as e:
        print(f"An error occurred: {e}")
        result.append(False)
        result.append("None")
        return result

def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    counter = 1
    for chunk in chunks:
        found, url = search_chunk(chunk)  # search each chunk only once
        print(f"Chunk {counter} : {chunk} .... {found} .... {url}")
        counter += 1
        if found:
            plagiarised_chunks += 1
    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score
As you can see, I edited the “search_chunk” function so that it returns a list containing a True/False value (whether it found an existing duplicate) and the link to the webpage that contains the same chunk. Plus, I added a print statement in the “calculate_plagiarism_score” function that prints each chunk’s number, the chunk itself, True/False, and the URL of the matching webpage.
Here’s what the result will look like:
A major limitation of the above script is that it becomes inefficient on large amounts of data, such as multiple blog posts at a time: every single chunk has to be searched for on Google to check whether matching content exists.
So, how can we fix this? There are two approaches we can try:
- Parallel Programming
- Search Result Indexing
I mentioned a little bit about both in the full article if you want to check it.
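To give a rough idea of both directions, here is a sketch (not the article’s actual implementation) that runs the searches concurrently with a thread pool and caches results so a repeated chunk is only searched once. The search_chunk stub below is hypothetical; in the real script you would use the SerperAPI-backed function from Step 2:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub standing in for the Step 2 search_chunk function;
# it "finds" a chunk if it is in a small known set.
def search_chunk(chunk):
    return chunk in {"a known sentence", "another known sentence"}

_cache = {}  # chunk -> bool, a tiny index so repeated chunks cost one search

def search_chunk_cached(chunk):
    if chunk not in _cache:
        _cache[chunk] = search_chunk(chunk)
    return _cache[chunk]

def calculate_plagiarism_score_parallel(chunks, max_workers=8):
    # Run the network-bound searches concurrently instead of one by one
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search_chunk_cached, chunks))
    total = len(results)
    return (sum(results) / total) * 100 if total > 0 else 0

chunks = ["a known sentence", "an original sentence", "a known sentence"]
print(calculate_plagiarism_score_parallel(chunks))  # two of three found, about 66.7
```

A thread pool fits here because the work is I/O-bound (waiting on HTTP responses), so Python’s GIL is not a bottleneck; just keep max_workers low enough to respect the search API’s rate limits.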
If you have other better approaches to improve the plagiarism detector, make sure to drop them in the comments below!