04Apr

Meet Plandex: An Open-Source Terminal-based AI Coding Engine for Complex Tasks


The field of software development is evolving rapidly, and the integration of artificial intelligence (AI) with coding practices is poised to transform the way developers work on their projects. Against this backdrop, a new project called Plandex aims to simplify the process of building complex software. It is an open-source, terminal-based AI coding engine built on OpenAI's models, and it represents a significant advancement in coding efficiency and project management.

Plandex is a tool that automates the routine tasks of coding, allowing developers to concentrate on more innovative and challenging assignments. It was developed by a programmer who found the tedious process of constantly copying and pasting code between ChatGPT and other projects to be inconvenient. Plandex is exceptional, not only because it can handle intricate tasks that involve multiple files and steps but also because of its unique approach to managing the inevitable errors and the iterative nature of coding.

Plandex utilizes long-running agents that break down large tasks into manageable subtasks, methodically implementing each one. This approach ensures that tasks requiring extensive multi-file operations are completed efficiently, transforming how developers tackle their backlogs, explore new technologies, and overcome coding obstacles.

One of the key features of Plandex is its integration with the OpenAI API, requiring users to provide their API key. However, its roadmap includes support for other models, such as Google’s Gemini and Anthropic’s Claude, as well as open-source models, indicating a future where Plandex becomes even more versatile and powerful.

The Plandex project offers a range of functionalities tailored to enhance the coding experience:

  • The ability to build complex software functionalities beyond mere autocomplete.
  • Efficient context management within the terminal, enabling seamless updates of files and directories to ensure the AI models have access to the latest project state.
  • A sandbox environment for testing changes before applying them to project files, complete with built-in version control and branching capabilities for exploring different coding strategies.
  • Compatibility across Mac, Linux, FreeBSD, and Windows, running as a single binary with no dependencies.

Plandex is more than just a tool: it offers developers real relief from the “copy-pasting madness” that plagues modern software development. By providing a platform where developers can experiment, revise, and select the best approach without manual context management, Plandex is leading the way toward a new era of software development.

Key Takeaways

  • Plandex is an open-source, AI-powered coding engine designed to streamline the development of complex software projects.
  • It leverages the OpenAI API to automate tasks across multiple files, enhancing productivity and focus for developers.
  • Unique features like version control, sandbox testing, and efficient context management in the terminal set Plandex apart in the coding tools landscape.
  • By minimizing the tedious aspects of coding and focusing on automation and efficiency, Plandex represents a significant advancement in the integration of AI into software development.


Shobha is a data analyst with a proven track record of developing innovative machine-learning solutions that drive business value.





03Apr

How to Build a Plagiarism Detector [Part 2] – Semantic Search


In this post, I will show you a better approach to building a plagiarism detector tool than the one we built last time, which only checks for exact matches on the Internet.

Checking for Plagiarism in a given text using AI and Vector Embeddings

Today’s method will check for plagiarism based on how close the meaning and sentence structure are rather than searching for exact matches. This will help detect paraphrased text in addition to copy-pasted text.

We will go over 4 approaches, where we will compare articles as a whole and as chunks. Each approach will be applied using AI and vector embeddings, resulting in the following 4 approaches:

Method 1: Comparing chunks of both articles using vector embeddings

Method 2: Comparing 2 articles as a whole using vector embeddings

Method 3: Comparing chunks of both articles using AI

Method 4: Comparing 2 articles as a whole using AI

What are Vector Embeddings?

If you know what vector embeddings are, feel free to skip this section.

Vector Embedding is one of the most important concepts in machine learning. It is used in many NLP, recommendation, and search algorithms.

It enables computers to understand and process complex data, such as text, images, and more, in a more human-like manner.

By representing objects or words as high-dimensional vectors (points in space), embeddings capture their meanings, relationships, and properties in a way that numerical algorithms can manipulate.

How embeddings models work

So, all words/phrases/paragraphs with similar meanings are positioned closely together in the embedding space, allowing models to recognize patterns and make predictions.

In our case, we’ll use vector embeddings to generate high-dimensional vectors for pieces of text. Then, using something called cosine similarity, we’ll know if the texts are similar in meaning and structure.
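
To make that concrete, here's a minimal sketch using two made-up 3-dimensional vectors (real embedding vectors have hundreds or thousands of dimensions); it relies on the same scipy helper the scripts below use:

from scipy.spatial.distance import cosine

vector_a = [0.9, 0.1, 0.3]    # hypothetical embedding of text A
vector_b = [0.85, 0.15, 0.4]  # hypothetical embedding of text B

similarity = 1 - cosine(vector_a, vector_b)  # scipy returns cosine *distance*, so subtract from 1
print(similarity)  # a value near 1.0 means the vectors (and the texts) point in almost the same direction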

💡 Tip 💡 

Make sure to go over part 1 and apply the simpler implementation first; understanding how basic plagiarism checkers work will make it much easier to follow this one.

The Implementations

I’m gonna go over each implementation briefly, explaining the idea, and at the end I’m gonna compare all the results we got and analyze them. (I’ll share all the resources I used at the end, don’t worry.)

For more accurate results and analysis, and to make things easier, I won’t be surfing the web and comparing my pieces of text with pieces on the web as we did in the first part. Instead, I’ll use 2 pieces of text that are close in meaning to each other, and I’ll apply all the methods to them.

After understanding how these methods all work, you can then add the search function to your code, which I mentioned in part 1, and you’ll have your own custom semantic Plagiarism Checker!

Method 1: Comparing Chunks Using Vector Embeddings

The idea of this approach is that I’m gonna split the article we want to test and the article we’re comparing it to into chunks, using by-paragraph chunking.

Then I’m gonna turn all these chunks into vector embeddings using OpenAI’s “text-embedding-3-small” model, and I’ll compare each chunk from the input article with all chunks in the other article using the cosine similarity function, with a threshold of 0.7.

The cosine similarity output is compared against this threshold: if it is greater than 0.7, the 2 vectors are similar in meaning, and the chunk is therefore considered plagiarised. The threshold I chose is just for testing; if you want a more accurate one, you’ll have to do your own research to find the best threshold for this use case.

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from scipy.spatial.distance import cosine
import time 
import resources
import openai

def search_semantically_similar(text):
    """
    This function takes a piece of text and calculates its plagiarism score
    against another text by comparing the semantic similarity of their chunks.
    """
    chunks = chunk_by_paragraphs(text)  # Divide the input text into chunks/paragraphs
    article_paragraphs = chunk_by_paragraphs(resources.article_two)  # Divide the second text into chunks/paragraphs for comparison
    all_comparisons = 0  # Initialize a counter for all comparison attempts
    plagiarised_chunks = 0  # Initialize a counter for chunks found to be plagiarised based on similarity threshold

    for chunk in chunks.chunks:  # Iterate over each chunk in the first text
        chunk_vector = convert_to_vector(chunk.text)  # Convert the chunk text to a vector using an embedding model
            
        for paragraph in article_paragraphs.chunks:  # Iterate over each paragraph in the second text
            if paragraph.text.strip():  # Ensure the paragraph is not just whitespace
                all_comparisons += 1  # Increment the total comparisons counter
                paragraph_vector = convert_to_vector(paragraph.text)  # Convert the paragraph text to a vector
                similarity = calculate_cosine_similarity(chunk_vector, paragraph_vector)  # Calculate the cosine similarity between vectors
                
                if is_similarity_significant(similarity):  # Check if the similarity score is above a certain threshold
                    plagiarised_chunks += 1  # If so, increment the count of plagiarised chunks
        
    plagiarism_score = ((plagiarised_chunks / all_comparisons) * 100)  # Calculate the percentage of chunks considered plagiarised
    return plagiarism_score  # Return the plagiarism score

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_score = search_semantically_similar(text_to_check)  # Calculate the plagiarism score

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {plagiarism_score}%")  # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

As you can see in the code above, the main section passes text_to_check to the search_semantically_similar function, which in turn goes through all the steps I mentioned above.


Throughout these scripts, I’ll be using the SimplerLLM library I built to facilitate and speed up the coding process. In these implementations, I’ll use it to generate text with OpenAI’s API (methods 3 and 4) and to chunk text by paragraphs using this simple function:

chunks = chunk_by_paragraphs(text)

Other than that, the code should be simple to read and understand, given all the comments I added throughout😅 However, if you find something unclear and need some help, don’t hesitate to drop your questions on the forum!

Method 2: Comparing 2 articles as a whole using vector embeddings

In this method, we’ll be directly comparing both articles as a whole without chunking them by converting both of them into vector embeddings. Then, using cosine similarity, we’ll see if they’re similar to each other.

from scipy.spatial.distance import cosine
import time 
import resources
import openai

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

def search_semantically_similar(text_to_check):
    """
    Compares the semantic similarity between the input text and a predefined article text.
    It returns a list containing the similarity score and a boolean indicating whether
    the similarity is considered significant.
    """
    result = []  # Initialize an empty list to store the similarity score and significance flag

    input_vector = convert_to_vector(text_to_check)  # Convert the input text to a vector using an embedding model
        
    article_text = resources.article_two  # resources.article_two contains the text of the article to compare with
        
    article_vector = convert_to_vector(article_text)  # Convert the article text to a vector
        
    similarity = calculate_cosine_similarity(input_vector, article_vector)  # Calculate the cosine similarity between the two vectors
        
    result.append(similarity)  # Append the similarity score to the list
    result.append(is_similarity_significant(similarity))  # Append the result of the significance check to the list
    
    return result  # Return the list containing the similarity score and significance flag
    
def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a given text by comparing its semantic similarity
    with a predefined article text. The score is expressed as a percentage.
    """
    data = search_semantically_similar(text) # Obtain the similarity data for the input text
    data[0] = data[0] * 100  # Convert the first item in the data list (similarity score) to a percentage
    
    return data  # Return the plagiarism score and significance

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_data = calculate_plagiarism_score(text_to_check)  # Calculate once so the embeddings API is not called twice
plagiarism_score = plagiarism_data[0]
significance = plagiarism_data[1]

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {plagiarism_score}%")  # Print the calculated plagiarism score
print(f"Is result Significant: {significance}")  # Print the signficance of the score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

As you can see, the code is very similar in structure to method 1. However, the search_semantically_similar function was edited to directly turn both articles into vectors, compare them, and return the result without chunking.

Plus, I added the calculate_plagiarism_score function, which takes the similarity score and converts it into a percentage. It then returns the percentage score along with a True/False flag indicating whether the plagiarism score is significant, determined by comparing the cosine similarity against the 0.7 threshold I set earlier.

Method 3: Comparing chunks of both articles using AI

Now it’s time for AI to enter the battlefield😂

This method is the same as method 1 in concept; however, instead of comparing the chunks by embedding them into vectors and computing the cosine similarity, we’ll compare them using a power prompt and OpenAI’s GPT model.

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from SimplerLLM.language.llm import LLM, LLMProvider
import time 
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a text chunk with an article text and generates a response using an OpenAI model
    """
    article_text = resources.article_two  # The text to compare against

    prompt = resources.prompt3  # A template string for creating the comparison prompt
    final_prompt = prompt.format(piece=text_chunk, article=article_text)  # Formatting the prompt with the chunk and article texts

    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(final_prompt)  # Generating text/response from the LLM

    response_data = json.loads(response)  # Parsing the response into a JSON object

    return response_data  # Returning the parsed response data

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a text by comparing its chunks against an article text
    and evaluating the responses from OpenAI's Model
    """
    text_chunks = chunk_by_paragraphs(text)  # Split the input text into chunks using SimplerLLM built-in method
    total_chunks = text_chunks.num_chunks  # The total number of chunks in the input text

    similarities_json = {}  # Dictionary to store similarities found
    chunk_index = 1  # Index counter for naming the chunks in the JSON
    plagiarised_chunks_count = 0  # Counter for the number of chunks considered plagiarised
    total_scores = 0  # Sum of scores from the LLM responses

    for chunk in text_chunks.chunks:
        response_data = compare_chunks(chunk.text)  # Compare each chunk using the LLM
        total_scores += response_data["score"]  # Add the score from this chunk to the total scores

        if response_data["score"] > 6:  # A score above 6 indicates plagiarism
            plagiarised_chunks_count += 1
            similarities_json[f"chunk {chunk_index}"] = response_data["article"]  # Record the article text identified as similar
            json.dumps(similarities_json)  # Convert the JSON dictionary to a string for easier storage
            chunk_index += 1  # Increment the chunk index

    plagiarism_result_json = {}  # Dictionary to store the final plagiarism results
    plagiarism_score = (plagiarised_chunks_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate the plagiarism score as a percentage

    plagiarism_result_json["Score"] = plagiarism_score
    plagiarism_result_json["Similarities"] = similarities_json # Adding where we found similaritites
    plagiarism_result_json["IsPlagiarised"] = (total_scores > total_chunks * 6)  # Recording if the response is really plagiarised

    return plagiarism_result_json  # Return the plagiarism results as a dictionary

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

In the code, the main function is calculate_plagiarism_score, which chunks the articles, sends the chunks to the compare_chunks function to get a similarity score, generates a total plagiarism score, and formats the results as JSON so that details beyond the plagiarism score stay clear and readable.

The compare_chunks function creates a GPT instance using SimplerLLM, then uses a power prompt to analyze both chunks and generate a score out of 10 for how similar they are. Here’s the prompt I’m using:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input chunk
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to
the input chunk. After that you're gonna pick the piece of the article which is most similar and generate for it
a score out of 10 for how similar it is to the input chunk. Then you're gonna need to generate the output
as a JSON format that contains the input chunk, the article chunk which is the most similar, and the score
out of 10.

### SCORING CRITERIA
When checking for pieces in the article that are close in meaning to the input chunk, make sure you
go over the article at least 2 times to confirm you picked the chunk in the article which is the most
similar to the input chunk. Then the score should be based on how similar the meanings
and structure of both these sentences are.

### INPUTS
input chunk: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format, nothing else. Here's an example structure:
{{
    "chunk": "[input chunk]",
    "article": "[chunk from article which is similar]",
    "score": [score]
}}

As you can see, it is a detailed prompt, carefully crafted to generate a specific result. You can learn how to craft similar prompts yourself by becoming a Prompt Engineer.

Method 4: Comparing 2 articles as a whole using AI

This method is a combination of methods 2 and 3, where we’re gonna be comparing both articles as a whole but using AI instead of vector embeddings.

from SimplerLLM.language.llm import LLM, LLMProvider
import time 
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a given text chunk with an article to determine plagiarism using a language model.
    
    Returns dict: The response from the language model, parsed as a JSON dictionary.
    """
    article_text = resources.article_two  # The text to compare against

    # Formatting the prompt to include both the input text chunk and the article text
    comparison_prompt = resources.prompt4.format(piece=text_chunk, article=article_text)

    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(comparison_prompt)  # Generating response

    response_data = json.loads(response)  # Parsing the response string into a JSON dictionary

    return response_data  # Returning the parsed JSON data

def calculate_plagiarism_score(text_to_analyze):
    """
    Calculates the plagiarism score based on the analysis of a given text against a predefined article text.
    
    Returns dict: A JSON dictionary containing the plagiarism score and the raw data from the analysis.
    """
    plagiarism_results = {}  # Dictionary to store the final plagiarism score and analysis data
    plagiarised_chunk_count = 0  # Counter for chunks considered plagiarised

    analysis_data = compare_chunks(text_to_analyze)  # Analyze the input text for plagiarism
    total_chunks = len(analysis_data)  # Total number of chunks analyzed
    
    for key, value in analysis_data.items():
        # Check if the value is a list with at least one item and contains a 'score' key
        if isinstance(value, list) and len(value) > 0 and 'score' in value[0] and value[0]['score'] > 6:
            plagiarised_chunk_count += 1
        # Check if the value is a dictionary and contains a 'score' key
        elif isinstance(value, dict) and 'score' in value and value['score'] > 6:
            plagiarised_chunk_count += 1

    plagiarism_score = (plagiarised_chunk_count / total_chunks) * 100 if total_chunks > 0 else 0 # Calculate plagiarism score as a percentage
    plagiarism_results["Total Score"] = plagiarism_score  # Add the score to the results dictionary
    plagiarism_results["Data"] = analysis_data  # Add the raw analysis data to the results dictionary

    return plagiarism_results  # Return the final results dictionary
    
#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one # Assign the text to check for plagiarism

plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the scores
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

This code is about 80% the same as the code in method 3. However, instead of comparing each chunk ourselves, we send both articles as a whole and let OpenAI’s GPT run a detailed plagiarism test, comparing whichever parts of the articles it chooses. In the end, it returns a detailed output containing a plagiarism score and the top sections found to be similar, along with their similarity scores.

All this is done using this power prompt:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input text,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to 
the pieces of the input text. After that you're gonna pick chunk pairs that are most similar to each other
in meaning and structure, a chunk from the input text and a chunk from the article. You will then generate 
a score out of 10 for each pair for how similar they are.
Then you're gonna need to generate the output as a JSON format for each pair that contains 
the input text chunk, the article chunk which are the most similar, and the score out of 10. 

### SCORING CRITERIA 
When checking for pieces in the article that are close in meaning to the chunks of the input text, make sure you 
go over the article at least 2 times to confirm you picked the pairs of chunks which are most similar.
Then the score for each pair should be based on how similar the meanings and structure of both these sentences are.

### INPUTS
input text: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format nothing else, here's an example structure:
{{
    "pair 1": 
    [
    "chunk 1": "[chunk from input text]",
    "article 1": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 2": 
    [
    "chunk 2": "[chunk from input text]",
    "article 2": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 3": 
    [
    "chunk 3": "[chunk from input text]",
    "article 3": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 4": 
    [
    "chunk 4": "[chunk from input text]",
    "article 4": "[chunk from article which is similar]",
    "score": [score]
    ]
}}

The prompts in methods 3 and 4 are critical and need to be well crafted, since all the results depend on them. Feel free to tweak and optimize them to your liking, and if you get better results, make sure to share them with us in the comments below!

Method 5: My Opinion

Now that we’ve tried 2 kinds of machines to do the work for us, let’s use some human intelligence and see whether their results hold up!

Here are the 2 texts I was comparing:

Article 1:

What is generative AI?

Generative AI refers to deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on.

Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of ChatGPT seems to mark a turning point. OpenAI's chatbot, powered by its latest large language model, can write poems, tell jokes, and churn out essays that look like a human created them. 
Prompt ChatGPT with a few words, and out comes love poems in the form of Yelp reviews, or song lyrics in the style of Nick Cave.
Article 2:

What is generative AI?

Generative artificial intelligence (AI) describes algorithms (such as ChatGPT) that can be used to create new content, including audio, code, images, text, simulations, and videos. 
Recent breakthroughs in the field have the potential to drastically change the way we approach content creation.

Generative AI systems fall under the broad category of machine learning, and here's how one such system—ChatGPT—describes what it can do:

Ready to take your creativity to the next level? Look no further than generative AI! 
This nifty form of machine learning allows computers to generate all sorts of new and exciting content, from music and art to entire virtual worlds. And it's not just for fun—generative AI has plenty of practical uses too, like creating new product designs and optimizing business processes. So why wait? Unleash the power of generative AI and see what amazing creations you can come up with!

Did anything in that paragraph seem off to you? Maybe not. The grammar is perfect, the tone works, and the narrative flows.

As you can see, both articles are about the same topic, and these are just small excerpts of them, so it’s logical for the plagiarism score to be at least 50%, if not 80%. Read both of them, and you’ll see they’re very close; they were just written in different styles.

Therefore, to get more accurate results and see which of the methods is the best among them, we’ll need to run all of them on 10-20 pairs of long articles.

Of course, I can’t do that in this blog post and share all the results; it would take forever😂 So, I’ll leave the experimentation to you; try it out and share your results with us!

Run The Codes

To run the codes, you’re gonna have to create a .env file that contains your OpenAI API key like this:

Syntax to put an API Key in a .env file
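
As a minimal sketch, the file is just a single key=value line; OPENAI_API_KEY is the variable name the openai library conventionally reads, and the value below is a placeholder:

OPENAI_API_KEY="sk-your-key-here"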

This way, all the methods will run perfectly using the articles I presented above. If you wish to use your own articles, you can replace the ones in the resources.py file, which also contains both power prompts I mentioned above.

Plus, don’t forget to install all the necessary libraries, which you will find in the requirements file. Install them by running this in the terminal:

pip install -r requirements.txt

Comparison

I executed all the methods on the same set of articles I presented above, and here were the results:

Factors             Method 1    Method 2    Method 3    Method 4
Plagiarism Score    25%         85%         100%        100%
Runtime             44 secs     1 sec       8 secs      10 secs

Runtime analysis

Logically, Methods 1 and 3 are expected to take more time than Methods 2 and 4 because they compare all chunks rather than the articles as a whole. However, the runtime of Method 1 is very bad, so to use this method you’re either going to need to optimize the code to run faster (for example, with parallel programming) or accept the long runtime.
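
As a hedged sketch of that optimization, you could request the embeddings concurrently instead of one by one; the snippet below reuses the convert_to_vector function from the Method 1 script and simply fans the chunk texts out over a thread pool:

from concurrent.futures import ThreadPoolExecutor

def embed_texts_in_parallel(texts, max_workers=8):
    """Return embedding vectors for a list of texts, requesting them concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(convert_to_vector, texts))  # convert_to_vector is defined in the Method 1 script

# Example usage: embed all chunks of both articles up front, then reuse the vectors
# inside the nested comparison loop instead of calling the API on every iteration.
# chunk_vectors = embed_texts_in_parallel([chunk.text for chunk in chunks.chunks])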

Other than that, all runtimes are good, so there is no need to optimize any of the other codes.

Plagiarism Score Analysis

I’ll give my personal analysis of each method and then draw a full conclusion.

Method 1:

25% for these 2 articles is very low, considering they’re very similar in meaning. However, I think the low score makes sense: since each chunk is compared against every chunk in the other article, most pairs are naturally about different ideas.

So when we compare, for example, a chunk from paragraph 1 with a chunk from the last paragraph, those chunks will of course be about totally different ideas. Plus, the probability of a pair being similar is much lower than the probability of it covering different ideas, because each idea is usually mentioned only once per article, so only one chunk will have a very similar meaning to the chunk we’re comparing.

However, this method has a major drawback: if one article contains a paragraph and the other article contains exactly the same content split into 2 paragraphs, it won’t detect that they’re the same. Because we’re chunking by paragraph, that content becomes 2 chunks in one article but only 1 chunk in the other, which affects the score. To solve this, we need a better chunking method!
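
As one hedged idea for a better chunking method, you could chunk by overlapping sentence windows instead of by paragraph, so the same content split across one or two paragraphs still produces comparable chunks; the sentence splitting below is a deliberately naive regex, just to show the concept:

import re

def chunk_by_sentence_window(text, window_size=3, stride=1):
    """Split text into overlapping windows of `window_size` sentences each."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks = []
    for start in range(0, max(len(sentences) - window_size + 1, 1), stride):
        chunks.append(" ".join(sentences[start:start + window_size]))
    return chunks

# Swapping this in for chunk_by_paragraphs in Method 1 would make the chunks
# independent of where the paragraph breaks happen to fall.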

Method 2:

A score of 85% is very fair for these articles; they really are very similar. However, do you think comparing 2 articles as a whole is an efficient way to test for plagiarism? Personally, I don’t think it’s good practice, since the purpose of plagiarism detection is to check for parts of an article that can be found on the internet.

This approach really only works when one article is a near-exact copy of the other from introduction to conclusion, in which case it gives an accurate result.

Methods 3 and 4:

These 2 methods are kinda the same, because in the background both of them use AI to go over different chunks and check which pairs are the most similar. The main difference is that in method 3 we manually chunk the articles by paragraph, while in method 4 the AI chunks them however it finds most efficient, so we can’t actually tell how it is splitting the articles.

In addition, these methods rely entirely on how well-crafted the prompt is, so you can get better results by improving it, and worse results by neglecting it. The main factor that determines how good the prompt is, is whether it applies the best plagiarism algorithm possible, so you’re gonna have to do your own research, understand the algorithm, and implement it in a well-crafted prompt.

Conclusion

There is no actual conclusion; it’s more of an opinion.

Based on the tests I did, I can say that method 1 might be the best way to implement a good plagiarism checker because it goes into detail about all the chunks of the articles and compares them. So, with a better chunking method and more optimized code to make it faster, I think it would make a good plagiarism checker!

Agree or Disagree? Share your thoughts in the comments section!


02Apr

Find Similar Research Paper Abstracts in 5 Minutes With AI and Python


An obstacle most people face when writing academic research papers is finding similar papers easily. I faced this problem myself; it simply takes too much time.

Leverage Arxiv and AI to find abstracts similar to yours

So, I built a Python script powered by AI that extracts the key topics from an input abstract and then fetches related abstracts from Arxiv.

ArXiv is an open-access archive for nearly 2.4 million academic articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

Why choose Arxiv?

Simply because it already contains a lot of articles and has a free built-in API, which makes it easy to access any article’s abstract directly. This bypasses the need to search the web for articles using paid APIs, then check whether each one contains an abstract, and, if it does, use an HTML parser to extract it — TOO MUCH WORK 🫠

How The API Works

http://export.arxiv.org/api/query?search_query=all:{abstract_topic}&start=0&max_results={max_results}

Enter your abstract’s topic instead of {abstract_topic} and how many search results you want instead of {max_results}. Then paste the URL into a web browser, and it’ll return an XML feed containing the summary (which is the abstract) and some other details about each article, like its ID, the authors, etc.
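
If you want to see what the feed looks like before wiring it into a script, here's a minimal sketch that fetches one query with the requests library; the topic and result count are placeholders:

import requests

params = {
    "search_query": "all:vector embeddings",  # placeholder topic
    "start": 0,
    "max_results": 2,                         # placeholder number of results
}
response = requests.get("http://export.arxiv.org/api/query", params=params)
print(response.status_code)    # 200 means the request succeeded
print(response.text[:1000])    # the start of the Atom/XML feed, which contains the <entry> elements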

My Implementation

The idea is simple; here’s the workflow:

1- Extract from the input abstract the top 5 keywords (topics) that are most representative of its content.

2- Call the API on each of the 5 keywords we extracted

3- Analyze the results and check the similarity of these abstracts to the input abstract.

Step 1: Extract Top Keywords

To use Arxiv’s API, we need keywords to search for, so we need to extract the top keywords present in the input abstract. You could use libraries like nltk or spacy, but when I tried them, the results were not as accurate as I expected.

So, to get better results, I used OpenAI’s GPT-4 (you can use Gemini if you prefer it), gave it a power prompt, and generated optimal results. Here’s the code:

import json
from SimplerLLM.language.llm import LLM, LLMProvider

def extract_keywords(abstract):
    # Constructing a prompt for the language model to generate keywords from the abstract
    prompt = f"""
    ### TASK
    You are an expert in text analysis and keyword extraction. Your task is to analyse an abstract I'm going to give you
    and extract from it the top 5 keywords that are most representative of its content. Then you're going to generate
    them in a JSON format in descending order from the most relevant to the least relevant.

    ### INPUTS
    Abstract: {abstract}

    ### OUTPUT
    The output should be in JSON format. Here's how it should look like:
    [
        {{"theme": "[theme 1]"}},
        {{"theme": "[theme 2]"}},
        {{"theme": "[theme 3]"}},
        {{"theme": "[theme 4]"}},
        {{"theme": "[theme 5]"}}
    ]
    """
    # Creating an instance of the language model using SimplerLLM
    llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4")

    # Generating response from the language model
    response = llm_instance.generate_text(user_prompt=prompt)

    # Attempting to parse the response as JSON
    try:
        response_data = json.loads(response)
        return json.dumps(response_data, indent=2)
    except json.JSONDecodeError:
        # Returning an error message if the response is not valid JSON
        return json.dumps({"error": "Invalid response from LLM"}, indent=2)

This function uses SimplerLLM, which facilitates calling OpenAI’s API without writing tedious boilerplate. In addition, it makes it very easy to switch to Gemini’s API instead of OpenAI’s by changing only how the LLM instance is created, like this:

llm_instance = LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-pro")

Very nice, right?😉

Back to our code.

The power prompt I crafted is the main engine of the above function; if it weren’t carefully crafted, the code wouldn’t work at all.

### TASK
You are an expert in text analysis and keyword extraction. Your task is to analyse an abstract I'm going to give you and extract from it the top 5 keywords that are most representative of its content. Then you're going to generate them in a JSON format in descending order from the most relevant to the least relevant.

### INPUTS
Abstract: {abstract}

### OUTPUT
The output should be in JSON format. Here's how it should look like:
[
   {{"theme": "[theme 1]"}},
   {{"theme": "[theme 2]"}},
   {{"theme": "[theme 3]"}},
   {{"theme": "[theme 4]"}},
   {{"theme": "[theme 5]"}}
]

As you can see, it is a detailed prompt that is very well crafted to generate a specific result. By becoming a Prompt Engineer, you can learn how to craft similar prompts yourself.

Step 2: API call on each of the 5 keywords extracted

After running the above function, we’ll have a JSON-formatted output containing 5 keywords. So, we need to search for abstracts for each of the 5 keywords, and we’ll do that using Arxiv’s API.

However, when you run Arxiv’s API call, you get an XML file like this:

Output of the Arxiv API in XML format

So, to easily extract the ID and summary (abstract), we’ll import xml.etree.ElementTree, which helps us navigate and extract information from XML-formatted text.

import json
import requests
import xml.etree.ElementTree as ET

def get_abstracts(json_input):
    input_data = json.loads(json_input)
    all_summaries_data = []

    for theme_info in input_data:
        keyword = theme_info['theme']
        max_results = 1  # Number of results to fetch for each keyword

        # Constructing the query URL for the arXiv API
        url = f"http://export.arxiv.org/api/query?search_query=all:{keyword}&start=0&max_results={max_results}&sortBy=submittedDate&sortOrder=descending"
        
        response = requests.get(url)
        if response.status_code == 200:
            root = ET.fromstring(response.text)
            ns = {'atom': 'http://www.w3.org/2005/Atom'}

            summaries_data = []
            for entry in root.findall('atom:entry', ns):
                arxiv_id = entry.find('atom:id', ns).text.split('/')[-1]
                summary = entry.find('atom:summary', ns).text.strip()
                
                summaries_data.append({"ID": arxiv_id, "abstract": summary, "theme": keyword})

            all_summaries_data.extend(summaries_data[:max_results]) 
        else:
            print(f"Failed to retrieve data for theme '{keyword}'. Status code: {response.status_code}")

    json_output = json.dumps(all_summaries_data, indent=2)
    return json_output

In the above function, we’re looping over the 5 keywords we generated, and for each one, we’re calling the API, extracting the ID and abstract from the XML, saving them in a list, and formatting this list into JSON (easier to read).

Step 3: Analyze the results and check for similarity

How can we check for similarity between 2 abstracts? Again, AI 🤖

We’ll be using SimplerLLM again to create an OpenAI instance and a power prompt to perform the analysis and similarity checking.

def score_abstracts(abstracts, reference_abstract):
    new_abstracts = json.loads(abstracts)
    scored_abstracts = []

    for item in new_abstracts:
        prompt = f"""
        ### TASK
        You are an expert in abstract evaluation and English Literature. Your task is to analyze two abstracts
        and then check how similar abstract 2 is to abstract 1 in meaning. Then you're gonna generate
        a score out of 10 for how similar they are. 0 being have nothing in common on different topics, and 10
        being exactly the same. Make sure to go over them multiple times to check if your score is correct.

        ### INPUTS
        Abstract 1: {reference_abstract}
        Abstract 2: {item["abstract"]}

        ### OUTPUT
        The output should be only the number out of 10, nothing else.
        """
        llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4")

        # Generating the similarity score from the language model
        response = llm_instance.generate_text(user_prompt=prompt)
        
        # Extracting the score from the response and handling potential errors
        try:
            score = int(response)
            perfect_match = score == 10
        except ValueError:
            score = 0
            perfect_match = False
        
        scored_abstracts.append({
            "ID": item["ID"],
            "theme": item["theme"],
            "score": score,
            "perfect_match": perfect_match
        })
    
    return scored_abstracts

We’re gonna use the JSON output we got from the function above containing all abstracts and IDs, and we’ll loop over each abstract, run the power prompt on it with the input abstract, and get the similarity score.

As mentioned above, the power prompt is a crucial part of the function; if it is bad, the code won’t work. So, read this article to improve your prompt crafting skills.

After getting the score, if it is 10/10, then the abstract we found is a perfect match for the input abstract.

Executing The Script

To run the codes, you’re gonna have to create a .env file that contains your OpenAI API key or Gemini key like this:

.env file
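
As a sketch, the file holds one key=value line per provider; the variable names below are the conventional ones (use whichever provider you picked), and the values are placeholders:

OPENAI_API_KEY="sk-your-key-here"
GEMINI_API_KEY="your-gemini-key-here"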

And, of course, you’ll need to enter your input abstract to run the code on it:

# MAIN SCRIPT
reference_abstract = """
YOUR_ABSTRACT
"""
json_data = extract_keywords(reference_abstract)   
abstracts = get_abstracts(json_data)
data = json.dumps(score_abstracts(abstracts, reference_abstract),indent=2)
print(data)

Plus, don’t forget to install all the necessary libraries, which you can install by running this in the terminal:

pip install requests simplerllm

Advanced Technique

Now, although the script we created is working properly, why don’t we improve it a little?

The search for abstracts is limited to only Arxiv, and maybe there is a very similar copy to your abstract that is not available on Arxiv but on a different website. So, why don’t we tweak the code a little and make it search on Google directly for similar abstracts, and then turn it into a tool with a nice UI?

To do that, we’ll only need to update the get_abstracts function:

# Search for related abstracts according to keywords and get their link and content
# (search_with_value_serp and load_content are helper functions from the SimplerLLM library)
import json

def get_google_results(json_input):
    keywords = json.loads(json_input)
    search_results = []
    for theme_info in keywords:
        keyword = theme_info['theme']
        query = f"{keyword} AND abstract AND site:edu AND -inurl:pdf"
        result = search_with_value_serp(query, num_results=1)
        for item in result:
            url = str(item.URL)  # Assign the URL first so it is defined even if loading the content fails
            try:
                load = load_content(url)  # load_content fetches the page content from the URL
                content = load.content
                search_results.append({"Link": url, "Content": content, "theme": keyword})
            except Exception as e:
                print(f"An error occurred with {url}: {e}")
                continue


    json_output = json.dumps(search_results, indent=2)
    return json_output

As you can see, the function now searches Google using the search_with_value_serp function, which is integrated into the SimplerLLM library. Then I used the load_content function, which is also part of SimplerLLM and makes it very easy to access a link’s title and content.

In addition, you have to add your VALUE_SERP_API_KEY to the .env file. This is how it will look:

.env file
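
As a sketch, the updated file would simply carry both keys (placeholder values shown):

OPENAI_API_KEY="sk-your-key-here"
VALUE_SERP_API_KEY="your-value-serp-key-here"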

Keep in mind that some keywords may not have a similar abstract available on Google, in which case the search returns nothing for them. Therefore, you might get fewer than 5 links for similar abstracts.

Make Money Using The Code!


The code above is only a prototype, a head start for this kind of tool. You can improve it to get better results, design a nice user interface for it, and turn it into a fully functional tool. Then you can build a SaaS business on top of it.

In this way, you’ll have a monthly recurring income from these tools you built! Pretty nice, huh 😉

Remember, if you have any questions, make sure to drop them below in the comments section or on the forum.


01Apr

Q* • Q-Star • AI Blog



\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

This equation incorporates the learning rate (α), discount factor (γ), reward (r), current state (s), current action (a), and new state (s′).

Exploration vs. Exploitation: Balancing new experiences and utilizing known information is crucial. Strategies like the ε-greedy method manage this balance by alternating between exploration and exploitation based on a set probability.
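
To make the update rule and the ε-greedy balance concrete, here is a minimal tabular sketch in Python; the state/action counts and hyperparameter values are illustrative only:

import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1          # learning rate, discount factor, exploration probability
Q = [[0.0] * n_actions for _ in range(n_states)]

def choose_action(state):
    """epsilon-greedy: explore a random action with probability epsilon, otherwise exploit the best known one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def update(state, action, reward, next_state):
    """Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])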

Q-Learning’s Role in Advancing AGI

AGI encompasses an AI’s capability to broadly apply its intelligence, similar to human cognitive abilities. While Q-learning is a step in this direction, it faces several hurdles:

  • Scalability: Q-learning’s applicability to large state-action spaces is limited, a critical issue for AGI’s diverse problem-solving needs.

  • Generalization: AGI requires extrapolating from learned experiences to new situations, a challenge for Q-learning which generally needs specific training for each scenario.

  • Adaptability: AGI’s dynamic adaptability to evolving environments is at odds with Q-learning’s need for stable environments.

  • Integration of Cognitive Skills: AGI involves a blend of various skills, including reasoning and problem-solving, beyond Q-learning’s learning-focused approach.

Progress and Future Outlook

  • Deep Q-Networks (DQN): Merging Q-learning with deep neural networks, DQNs are better suited for complex tasks due to their ability to handle high-dimensional spaces.

  • Transfer Learning: Techniques allowing Q-learning models to apply knowledge across different domains hint at the generalization required for AGI.

  • Meta-Learning: Integrating meta-learning into Q-learning could enable AI to refine its learning strategies, a key component for AGI.

In its quest for AGI, OpenAI’s focus on Q-learning within Reinforcement Learning from Human Feedback (RLHF) is a noteworthy endeavor.




01Apr

LUMOS: An Open-Source Generalizable Language Agent Training Framework

Imagine having a digital assistant that can not only answer your questions but also navigate the web, solve complex math problems, write code, and even reason about images and text-based games. Sound too good to be true? Well, brace yourselves because the future of artificial intelligence just got a whole lot more accessible and transparent with the introduction of LUMOS.

In a groundbreaking development, researchers from the Allen Institute for AI, UCLA, and the University of Washington have unveiled LUMOS, an open-source framework that promises to revolutionize the way we interact with language agents. Unlike existing closed-source solutions that often feel like black boxes, LUMOS offers an unprecedented level of affordability, transparency, and reproducibility, making it a game-changer in the world of AI.

But what exactly is LUMOS, and why is it causing such a stir in the AI community? Buckle up, because we’re about to dive into the nitty-gritty details of this remarkable innovation, exploring how it works, what it can do, and why it matters more than you might think.

Current language agents often rely on large, closed-source language models like GPT-4 or ChatGPT as the core component. While powerful, these models are expensive, lack transparency, and offer limited reproducibility and controllability.

The LUMOS framework takes a different approach by utilizing open-source large language models (LLMs) as the base models. It employs a unified and modular architecture consisting of three key components: a planning module, a grounding module, and an execution module.

The planning module decomposes complex tasks into a sequence of high-level subgoals expressed in natural language. For example, for a multimodal question like “The device in her hand is from which country?”, the planning module might generate two subgoals: “Identify the brand of the device” and “Answer the country of the device brand.”

The grounding module then translates these high-level subgoals into executable low-level actions that can be executed by various tools in the execution module. For instance, the first subgoal might be grounded into an action like “VQA(<img>, What is the brand..?)” to identify the device brand from the image using a visual question-answering tool.

The execution module contains a collection of off-the-shelf tools, including APIs, neural models, and virtual simulators, that can execute the grounded actions. The results of these executed actions are then fed back into the planning and grounding modules, enabling an iterative and adaptive agent behaviour.
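
To make that iterative flow concrete, here is a hedged sketch of the plan-ground-execute loop described above; plan, ground, and execute are hypothetical stand-ins for the three modules, not the actual LUMOS API:

def run_agent(task, plan, ground, execute, max_steps=10):
    """Hypothetical sketch of LUMOS-style iteration; the module interfaces are assumptions."""
    history = []  # accumulated subgoals, actions, and execution results
    for _ in range(max_steps):
        subgoal = plan(task, history)                        # planning module: next high-level subgoal
        if subgoal is None:                                  # planner signals that the task is complete
            break
        actions = ground(subgoal, history)                   # grounding module: low-level executable actions
        results = [execute(action) for action in actions]    # execution module: tools, APIs, neural models
        history.append((subgoal, actions, results))          # feed results back for the next iteration
    return history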

One of the key advantages of LUMOS is its modular design, which allows for easy upgrades and wider applicability to diverse interactive tasks. By separating the planning, grounding, and execution components, researchers can improve or replace individual modules without affecting the others.

To train LUMOS, the researchers curated a large-scale, high-quality dataset of over 56,000 annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks, including question answering, mathematics, coding, web browsing, and multimodal reasoning. These annotations were obtained by employing GPT-4 and other advanced language models to convert existing benchmarks into a unified format compatible with the LUMOS architecture. The resulting dataset is one of the largest open-source resources for agent fine-tuning, enabling smaller language models to be trained as language agents effectively.

In evaluations across nine datasets, LUMOS exhibited several key advantages. It outperformed multiple larger open-source agents on held-out datasets for each task type, even surpassing GPT agents on question-answering and web tasks in some cases. LUMOS also outperformed agents produced by other training methods, such as chain-of-thoughts and unmodularized integrated training. LUMOS notably demonstrated impressive generalization capabilities, significantly outperforming 30B-scale (WizardLM-30B and Vicuna-v1.3-33B) and domain-specific agents on unseen tasks involving new environments and actions.

With its open-source nature, competitive performance, and strong generalization abilities, LUMOS represents a significant step forward in developing affordable, transparent, and reproducible language agents for complex interactive tasks.

01Apr

Part Qualified Data Science Actuary

Job title: Part Qualified Data Science Actuary

Company: Raretec Recruitment

Job description: We are excited to be working with a leading and innovative consultancy on a new role for a Part Qualified Data Science… opportunity to provide innovative solutions and valuable insights to clients. About the Data Science Actuary role This role…

Expected salary:

Location: Dublin

Job date: Wed, 27 Mar 2024 06:28:53 GMT

Apply for the job now!
