10Jun

SimplerLLM is all You Need! (For Beginners and Researchers) | by Hasan Aboul Hasan


This Will Change The Way You Interact With Language Models


🚀 The Birth of SimplerLLM

Hey there, I'm thrilled to introduce SimplerLLM, an open-source Python library that's set to transform how we interact with Large Language Models (LLMs) in Python.

The Magic of Simplicity

I love Simplicity!

Imagine generating text, images, or building AI-powered tools with just two lines of code. Yes, you heard that right: only two lines! 🤯

Why SimplerLLM?

  1. Beginner-Friendly: Whether you're taking your first steps in AI or you're a seasoned researcher, SimplerLLM is your new best friend.
  2. Deep Dive into AI: This isn't just a tool; it's my personal journey to understanding the nuts and bolts of AI and Language Models. And I want you to join me on this adventure. 🌟 I believe building this library will help me go deeper into the AI world and master what lies beyond the basics.
  3. Community-Centric: Building a library isn't just about writing code; it's about building a community. That's where you come in!

A Peek into the Magic

Here's a little teaser: With SimplerLLM, generating text is as easy as:

from SimplerLLM.language.llm import LLM, LLMProvider

my_llm = LLM.create(provider=LLMProvider.OPENAI)
my_llm.generate_text("your prompt goes here")

And voilà! You've just interacted with an OpenAI model. 🎩✨
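If you want to try it yourself, the library is on PyPI and can be installed with pip (the same package name is used in a later post on this page):

pip install simplerllm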

A Call to Action for the Curious Minds

If you'd like to join me on this journey, I'd be more than happy to hear from you. Ping me at

ha***@le************.com













Source link

09Jun

How To Get Consistent JSON From Google Gemini (With Practical Example) | by Hasan Aboul Hasan


In this post, I will show you how to generate consistent JSON responses from Google Gemini using Python.

No fluff… a direct, practical solution I created, tested, and verified to work!


We want the output to be consistent JSON ONLY, so we can rely on it to build tools and applications!

Let me show you an example of a tool I built: The Hook Generator Tool

This is how it works:

So, I create a prompt that generates hooks: not just any prompt, but a power prompt based on data (not our topic for today).

I pass the prompt to the language model, and I get a JSON response. Then, I read the JSON with JavaScript and populate the UI on WordPress.

Here is a sample JSON I got from the LLM for my tool:

[
  {
    "hook_type": "The Intriguing Question",
    "hook": "What's the most effective way to learn Python through short videos?"
  },
  {
    "hook_type": "Visual Imagery",
    "hook": "Imagine a world where Python tutorials are as captivating as short films."
  },
  {
    "hook_type": "Quotation",
    "hook": "Albert Einstein once said, 'The only source of knowledge is experience.' Learn Python through engaging short videos and experience the learning journey."
  }
]

And based on that, I can build a UI like this:

🔥 If you are interested in learning how to build this AI tool step by step and monetize it with a credit system, turning WordPress into SaaS as I do with my tools here on my website, you can check out my courses here. 🔥

Anyway, getting back to our problem, did you spot it? 🤔

Yes, it is the JSON. To build tools like this, we must ensure that we get the same JSON structure from the language model every time.

If we get a different JSON, it will be impossible to keep a consistent UI for the tool because we will not be able to parse and read the response with JavaScript or whatever language you are using (even with no-code tools).

There are several approaches to solving this issue and achieving a consistent JSON response.

Let's start with some prompting techniques that force the model to generate a response based on the example output you provide, something like this:

IMPORTANT: The output should be a JSON array of 10 titles without field names. Just the titles! Make Sure the JSON is valid.

Example Output:
[
"Title 1",
"Title 2",
"Title 3",
"Title 4",
"Title 5",
"Title 6",
"Title 7",
"Title 8",
"Title 9",
"Title 10",
]

Another approach is using function calling with OpenAI models, or the Python Instructor package with Pydantic, which is likewise limited to OpenAI because it relies on function calling.

I also automated and simplified the process of building AI tools fast in this blog.

To learn more about this problem and suggested solutions, you can check out this blog post I wrote on function chaining.

🟢 But what about a generic approach that works with any model and does not rely on a specific model or functionality?

You CAN'T build all your tools and apps relying on one feature or model.

It is better to take a more dynamic approach so that you can switch from model to model at any time without changing your whole code and structure.

With this said, I thought it through and came up with a basic yet powerful approach that got me the results I wanted: a consistent JSON response!

Let me show you what I did!

Let's keep things simple with a real practical example!

Let's say you want to build a Simple Blog Title Generator Tool, maybe like this one.

Here is what we need:

1- Craft a prompt that generates blog post titles.

2- Feed the prompt to Google Gemini or another language model.

3- Get a JSON-structured response 🔴

4- Return the JSON to the UI to build it.

Our main problem is in step 3.

Here is my approach to solving this problem:

Step 1: Decide on the JSON structure you want to return.

First, you should know what you want!

Which JSON structure do you want? Decide that first, so you can ask the model for it.

For example, in my case, I want something like this:

{
  "titles": [
    "Title 1",
    "Title 2",
    "Title 3",
    "Title 4",
    "Title 5"
  ]
}

Now, let's create a Python script and continue to step 2.
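One housekeeping note first: the snippets in the steps below assume a few imports at the top of the script. Here is a minimal setup sketch covering everything used later (Pydantic for the model, re and json for the extraction step, and Google's SDK for Gemini):

import re
import json
from typing import List

import google.generativeai as genai
from pydantic import BaseModel, ValidationError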

Step 2: Define the model

The easiest and most efficient way to build tools is to get back a class or a Pydantic model that can be read and accessed easily in your code.

So, I created a Pydantic model that fits the JSON response that I want.

class TitlesModel(BaseModel):
    titles: List[str]

Step 3: Create the base prompt

Now, let's create a prompt that generates blog post titles based on a topic. I will keep things simple for this example; let's say:

base_prompt = f"Generate 5 Titles for a blog post about the following topic: [{topic}]"

Step 4: Convert the Pydantic model into an example JSON String

I created a simple Python function to automate the process of creating an example JSON string based on the Pydantic model.

We will use this to pass to the LLM in Step 5

Here is the Function:

def model_to_json(model_instance):
    """
    Converts a Pydantic model instance to a JSON string.

    Args:
        model_instance (YourModel): An instance of your Pydantic model.

    Returns:
        str: A JSON string representation of the model.
    """
    return model_instance.model_dump_json()

Then, we use this Function to generate the string representation of the Pydantic model.

json_model = model_to_json(TitlesModel(titles=['title1', 'title2']))

Step 5: Post-optimize the prompt

Now, I will use prompt engineering techniques to force the model to generate the JSON we want within the response. Here is how I did it:

optimized_prompt = base_prompt + f'. Please provide a response in a structured JSON format that matches the following model: {json_model}'

This simply appends an instruction telling the language model to generate JSON that matches the json_model we built in step 4.
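For instance, with the topic "Python", the assembled prompt would look roughly like this (the example JSON at the end comes from the model_to_json call above):

Generate 5 Titles for a blog post about the following topic: [Python]. Please provide a response in a structured JSON format that matches the following model: {"titles": ["title1", "title2"]}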

Step 6: Generate Response with Gemini

Now, call the Gemini API and generate a response with the optimized_prompt.

I created a simple function that does this so I can use it directly in my code. Here it is:

import google.generativeai as genai

# Configure the Gemini LLM
genai.configure(api_key='AIzgxb0')
model = genai.GenerativeModel('gemini-pro')

# Basic generation
def generate_text(prompt):
    response = model.generate_content(prompt)
    return response.text

Then, I call it from my script this way:

gemeni_response = generate_text(optimized_prompt)

Then we will get something like:

Absolutely! Here's a JSON format representation of 5 engaging blog post titles for a Python programming blog:
JSON
{
  "titles": [
    "Python Tricks: 5 Hidden Gems You Might Have Missed",
    "Mastering Python Data Structures: Level Up Your Coding",
    "Debugging Python Code Like a Pro: Strategies and Tools",
    "Project Inspiration: Build a Fun Web App with Python",
    "Elegant Python: Writing Clean and Readable Code"
  ]
}

A combination of text and JSON in the response!

But the JSON is constructed the way we want, great!

Step 7: Extract the JSON String

Now, I used regular expressions to extract the JSON string from the output.

Here is the Function I created:

def extract_json(text_response):
    # This pattern matches a string that starts with '{' and ends with '}'
    pattern = r'\{[^{}]*\}'
    matches = re.finditer(pattern, text_response)
    json_objects = []
    for match in matches:
        json_str = match.group(0)
        try:
            # Validate if the extracted string is valid JSON
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError:
            # Extend the search for nested structures
            extended_json_str = extend_search(text_response, match.span())
            try:
                json_obj = json.loads(extended_json_str)
                json_objects.append(json_obj)
            except json.JSONDecodeError:
                # Handle cases where the extraction is not valid JSON
                continue
    if json_objects:
        return json_objects
    else:
        return None  # Or handle this case as you prefer

def extend_search(text, span):
    # Extend the search to try to capture nested structures
    start, end = span
    nest_count = 0
    for i in range(start, len(text)):
        if text[i] == '{':
            nest_count += 1
        elif text[i] == '}':
            nest_count -= 1
        if nest_count == 0:
            return text[start:i + 1]
    return text[start:end]

Then I call it:

json_objects = extract_json(gemeni_response)

Now we have the JSON!

Step 8: Validate the JSON

Before using the JSON, I validated it to ensure it matched the Pydantic model I wanted. This allows me to implement a retry mechanism in case of any errors.

Here is the Function I created:

def validate_json_with_model(model_class, json_data):
    """
    Validates JSON data against a specified Pydantic model.

    Args:
        model_class (BaseModel): The Pydantic model class to validate against.
        json_data (dict or list): JSON data to validate. Can be a dict for a single JSON object,
                                  or a list for multiple JSON objects.

    Returns:
        list: A list of validated JSON objects that match the Pydantic model.
        list: A list of errors for JSON objects that do not match the model.
    """
    validated_data = []
    validation_errors = []
    if isinstance(json_data, list):
        for item in json_data:
            try:
                model_instance = model_class(**item)
                validated_data.append(model_instance.dict())
            except ValidationError as e:
                validation_errors.append({"error": str(e), "data": item})
    elif isinstance(json_data, dict):
        try:
            model_instance = model_class(**json_data)
            validated_data.append(model_instance.dict())
        except ValidationError as e:
            validation_errors.append({"error": str(e), "data": json_data})
    else:
        raise ValueError("Invalid JSON data type. Expected dict or list.")
    return validated_data, validation_errors

Here is how I used it in the code:

validated, errors = validate_json_with_model(TitlesModel, json_objects)
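Step 8 mentioned a retry mechanism; here is a minimal sketch of what that could look like, wiring together the functions defined above (the function name and retry count are my own choices, not from the original post):

def generate_validated_titles(prompt, max_retries=3):
    # Retry the generate -> extract -> validate pipeline until it yields valid data
    for _ in range(max_retries):
        response = generate_text(prompt)
        json_objects = extract_json(response)
        if not json_objects:
            continue  # No JSON found in the response; try again
        validated, errors = validate_json_with_model(TitlesModel, json_objects)
        if validated and not errors:
            return validated
    return None  # All attempts failed

titles_data = generate_validated_titles(optimized_prompt)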

Step 9: Play with the Model!

If there are no errors from step 8, we can convert the JSON to Pydantic again and play with it as we like!

Here is the Function that converts JSON back to Pydantic:

def json_to_pydantic(model_class, json_data):
    try:
        model_instance = model_class(**json_data)
        return model_instance
    except ValidationError as e:
        print("Validation error:", e)
        return None

Here is how I used it in my script:

model_object = json_to_pydantic(TitlesModel, json_objects[0])

# Play with it
for title in model_object.titles:
    print(title)

You see, now I can access the titles easily with my code!

Get The Full Code

Instead of going through all the steps every time you want to build a tool or write a script, I added all this as a single function in the SimplerLLM Library!

Here is how you can build the same blog title generator tool with SimplerLLM in a few lines of code:

from pydantic import BaseModel
from typing import List
from SimplerLLM.language.llm import LLM, LLMProvider
from SimplerLLM.language.llm_addons import generate_basic_pydantic_json_model as gen_json

llm_instance = LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-pro")

class Titles(BaseModel):
    list: List[str]
    topic: str

input_prompt = "Generate 5 catchy blog titles for a post about SEO"
json_response = gen_json(model_class=Titles, prompt=input_prompt, llm_instance=llm_instance)

print(json_response.list[0])

All the steps are now compressed into one line:

json_response = gen_json(model_class=Titles, prompt=input_prompt, llm_instance=llm_instance)

In this way, you can build AI tools way faster and focus on the tool's idea, functionality, and prompt instead of dealing with inconsistent JSON.

What is more important is that with this approach, you are not restricted to a specific language model. For example, you can change this line:

llm_instance = LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-pro")

To:

llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4")

And you will be using another model… It is really like magic, isn't it?

I will be more than happy if you share your thoughts and opinions, and maybe your test results if you run some.

I think I deserve some claps 😅



Source link

07Jun

Implementing Chain-of-Thought Principles in Fine-Tuning Data for RAG Systems | by Cobus Greyling | Jun, 2024


Considering that retrieved documents may not always answer the user's question, the burden is placed on the LLM to discern whether a given document contains the information needed to answer the question.

Inspired by Chain-of-Thought (CoT) reasoning, the study proposes breaking the instruction down into several steps.

  1. Initially, the model should summarise the provided document for a comprehensive understanding.
  2. Then, it assesses whether the document directly addresses the question.

If so, the model generates a final response based on the summarised information.

Otherwise, if the document is deemed irrelevant, the model responds that the document is irrelevant.

Additionally, the proposed CoT fine-tuning method should effectively mitigate hallucinations in LLMs, enabling the LLM to answer questions based on the provided knowledge documents.

Below, the instruction for CoT fine-tuning is shown…
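As an illustration only (not the study's exact wording), an instruction implementing the steps above might look like this:

You are given a question and a retrieved document.
Step 1: Summarise the provided document to build a comprehensive understanding of it.
Step 2: Assess whether the document directly addresses the question.
Step 3: If it does, generate a final answer based on your summary.
Step 4: If it does not, respond that the document is irrelevant to the question.

Question: {question}
Document: {document}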



Source link

06Jun

Your Own Free Plagiarism Checkers? | by Hasan Aboul Hasan


In this post, I will show you how to detect the percentage of plagiarism in a piece of text. A direct, practical solution I created and tested!

The idea is very simple, acting as a perfect starting point for checking plagiarism in any piece of text. I will explain the approach step by step with a practical example, so let's start!

Let's keep things simple with a real practical example! Here is what we need:

1- A function that takes care of chunking our text

2- A function that surfs the web and checks if this chunk exists

3- Add up all the results and get the percentage

The first thing we need to do is split the text into smaller chunks like phrases, sentences, and paragraphs; notice how I didn't say "splitting into individual words." That's because words are independent, resulting in a less effective plagiarism test.

Now, let's make it dynamic!

import re
from typing import List

def chunk_text(text, chunk_by) -> List[str]:
    if chunk_by == "sentence":
        # Split after a '.', '!' or '?' followed by whitespace (regex reconstructed to match the description below)
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
        return sentences
    elif chunk_by == "paragraph":
        paragraphs = [paragraph.strip() for paragraph in text.split("\n") if paragraph.strip()]
        return paragraphs
    else:
        raise ValueError("Invalid chunk_by value. Choose 'sentence' or 'paragraph'.")

This function takes as input the text and your chosen chunking method, then if you choose:

  • By Sentence: I used a very straightforward method: I split whenever I find a '.', '!', or '?' between sentences.
  • By Paragraph: I used a similar approach to the one above, which splits the input whenever there's a new line between paragraphs. In Python, the new line is defined as \n.

This dynamic approach makes it easy to switch between methods to your liking. Plus, you can experiment yourself and see how the accuracy changes depending on the text and the method used.

Now that we have split the text into chunks, we need to take each chunk, put it between double quotes like "[chunk]", and search to see if it matches something on the internet.

Here's an example of a unique chunk:

As you can see, no results were found for "Learnwithhasan is the best website", although it's a well-known fact 😂

💡 Tip 💡

When you're searching for an exact match of something, you should put it between double quotes. This way, the search engine knows you're looking for that exact phrase rather than doing a normal keyword search.

Back to our code:

# search_with_serpapi comes from the SimplerLLM library (see below)
def search_chunk(chunk) -> bool:
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        return found
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

In this function, I used my library SimplerLLM, specifically a method that uses the Serper API to search on Google from the code.

To access Google's search engine from your code, you would normally need an API and its corresponding code. However, with SimplerLLM the function is already built in, and you just call it using the search_with_serpapi method.

But you need to generate your API key from their website, create a .env file, and add your key like this:

SERPER_API_KEY = "YOUR_API_KEY"

So, using the above function, each chunk is searched for on Google, and if a result exists, it returns True; otherwise, it returns False.

Now it's time to take these Trues and Falses and turn them into a percentage:

def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    for chunk in chunks:
        if search_chunk(chunk):
            plagiarised_chunks += 1

    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score

This function works by first calling the chunking method explained in Step 1, and then counting the total number of these chunks.

Using step 2, we determine whether each chunk is available on the web. If it returns True, it increases the count of plagiarized chunks.

After checking all chunks, the plagiarism score is calculated by dividing the number of plagiarized chunks by the total number of chunks and multiplying by 100 to get a percentage. Finally, it returns the plagiarism score as a decimal number (float).

All the above methods won't generate anything if you don't give them input and print the result.

# MAIN SECTION
import time

start_time = time.time()

text = "YOUR_TEXT"  # The input text
chunk_by = "sentence"  # "sentence" or "paragraph"

plagiarism_score = calculate_plagiarism_score(text, chunk_by)

end_time = time.time()  # Record the end time
runtime = end_time - start_time  # Calculate the runtime

print(f"Plagiarism Score: {plagiarism_score}%")
print(f"Runtime: {runtime} seconds")  # Print the runtime

In this section of the code, you need to enter the text you want to run the plagiarism checker on, pick your preferred method of chunking, and print the results!

You'll even get the time it took to generate the results (we'll use it later 🤫)

Get The Full Code

SimplerLLM is an open-source Python library designed to simplify interactions with large language models (LLMs). It offers a unified interface for different LLM providers and a suite of tools to enhance language model capabilities.

I created it to facilitate coding, and it did indeed save me a lot of time. But the main reason I'm using it in this script is that I'm planning to improve this code further and make it detect similarities too, not just exact copies of the text. So, keep an eye out for the Semantic Plagiarism Checker post!

Now, although the script we created is working properly, why don't we improve it a little?

For example, when we find that the chunk is available on a webpage somewhere, we can fetch the URLs of these web pages. This simple tweak to the code would make the results of this script a lot more interesting, especially if you turned it into a tool with a nice UI.

Here's what the new code will look like:

def search_chunk(chunk) -> List[str]:
    result = []
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        if found:
            result.append(found)
            result.append(search_results[0].URL)
            return result
        else:
            result.append(found)
            result.append("None")
            return result
    except Exception as e:
        print(f"An error occurred: {e}")
        result.append(False)
        result.append("None")
        return result

def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    counter = 1
    for chunk in chunks:
        search_result = search_chunk(chunk)  # Search once per chunk and reuse the result
        print(f"Chunk {counter} : {chunk} .... {search_result[0]} .... {search_result[1]}")
        counter += 1
        if search_result[0]:
            plagiarised_chunks += 1
    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score

As you can see, I edited the search_chunk function so that it returns a list containing True/False depending on whether it found an existing duplicate, plus the link to the webpage that contains the same chunk. I also added a print statement in the calculate_plagiarism_score function that prints each chunk, its number, True/False, and the URL of the webpage.

Here's what the result will look like:

A major limitation of the above script is that running it on a large amount of data, like multiple blog posts at a time, would be inefficient: every chunk has to be searched for on Google to see if there is existing content that matches it.

So, how can we fix this? There are two approaches we can try:

  1. Parallel Programming
  2. Search Result Indexing

I cover both briefly in the full article if you want to check it out; the sketch below gives a taste of the first approach.
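Here is a minimal sketch of the parallel approach using Python's ThreadPoolExecutor, assuming the chunk_text and list-returning search_chunk functions from above (the function name and worker count are my own choices):

from concurrent.futures import ThreadPoolExecutor

def calculate_plagiarism_score_parallel(text, chunk_by, max_workers=8) -> float:
    chunks = chunk_text(text, chunk_by)
    if not chunks:
        return 0.0
    # Run the Google searches concurrently instead of one by one
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(search_chunk, chunks))
    plagiarised_chunks = sum(1 for result in results if result[0])
    return (plagiarised_chunks / len(chunks)) * 100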

If you have other better approaches to improve the plagiarism detector, make sure to drop them in the comments below!



Source link

06Jun

Advanced Plagiarism Detector Using Python and AI [4 Methods] | by Hasan Aboul Hasan


Other than that, the code should be simple to read and understand, given all the comments I added throughout 😅 However, in case you find something unclear and need some help, don't hesitate to drop your questions on the forum!

In this method, we'll directly compare both articles as a whole, without chunking them, by converting both into vector embeddings. Then, using cosine similarity, we'll see whether they're similar to each other.

from scipy.spatial.distance import cosine
import time
import resources
import openai

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

def search_semantically_similar(text_to_check):
    """
    Compares the semantic similarity between the input text and a predefined article text.
    It returns a list containing the similarity score and a boolean indicating whether
    the similarity is considered significant.
    """
    result = []  # Initialize an empty list to store the similarity score and significance flag

    input_vector = convert_to_vector(text_to_check)  # Convert the input text to a vector using an embedding model

    article_text = resources.article_two  # resources.article_two contains the text of the article to compare with
    article_vector = convert_to_vector(article_text)  # Convert the article text to a vector

    similarity = calculate_cosine_similarity(input_vector, article_vector)  # Calculate the cosine similarity between the two vectors

    result.append(similarity)  # Append the similarity score to the list
    result.append(is_similarity_significant(similarity))  # Append the result of the significance check to the list

    return result  # Return the list containing the similarity score and significance flag

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a given text by comparing its semantic similarity
    with a predefined article text. The score is expressed as a percentage.
    """
    data = search_semantically_similar(text)  # Obtain the similarity data for the input text
    data[0] = data[0] * 100  # Convert the first item in the data list (similarity score) to a percentage
    return data  # Return the plagiarism score and significance

# MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism
plagiarism_data = calculate_plagiarism_score(text_to_check)  # Compute once and reuse, instead of calling the function twice
plagiarism_score = plagiarism_data[0]
significance = plagiarism_data[1]

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {plagiarism_score}%")  # Print the calculated plagiarism score
print(f"Is result Significant: {significance}")  # Print the significance of the score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

As you can see, the code is very similar in structure to method 1. However, the search_semantically_similar function was edited to directly turn both articles into vectors, compare them, and return the result without chunking.

Plus, I added the calculate_plagiarism_score function, which takes the similarity score and turns it into a percentage. It then returns the percentage score and a True/False statement indicating whether the plagiarism score is significant, determined by comparing the cosine similarity score with the threshold, which I set to 0.7.

Now it's time for AI to enter the battlefield 😂

This method is the same as method 1 in concept; however, instead of comparing the chunks by embedding them into vectors and computing cosine similarity, we'll compare them using a power prompt and OpenAI's GPT model.

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from SimplerLLM.language.llm import LLM, LLMProvider
import time
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a text chunk with an article text and generates a response using an OpenAI model.
    """
    article_text = resources.article_two  # The text to compare against
    prompt = resources.prompt3  # A template string for creating the comparison prompt
    final_prompt = prompt.format(piece=text_chunk, article=article_text)  # Formatting the prompt with the chunk and article texts
    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(final_prompt)  # Generating text/response from the LLM
    response_data = json.loads(response)  # Parsing the response into a JSON object
    return response_data  # Returning the parsed response data

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a text by comparing its chunks against an article text
    and evaluating the responses from OpenAI's model.
    """
    text_chunks = chunk_by_paragraphs(text)  # Split the input text into chunks using SimplerLLM's built-in method
    total_chunks = text_chunks.num_chunks  # The total number of chunks in the input text
    similarities_json = {}  # Dictionary to store similarities found
    chunk_index = 1  # Index counter for naming the chunks in the JSON
    plagiarised_chunks_count = 0  # Counter for the number of chunks considered plagiarised
    total_scores = 0  # Sum of scores from the LLM responses
    for chunk in text_chunks.chunks:
        response_data = compare_chunks(chunk.text)  # Compare each chunk using the LLM
        total_scores += response_data["score"]  # Add the score from this chunk to the total scores
        if response_data["score"] > 6:  # A score above 6 indicates plagiarism
            plagiarised_chunks_count += 1
            similarities_json[f"chunk {chunk_index}"] = response_data["article"]  # Record the article text identified as similar
        chunk_index += 1  # Increment the chunk index
    plagiarism_result_json = {}  # Dictionary to store the final plagiarism results
    plagiarism_score = (plagiarised_chunks_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate the plagiarism score as a percentage
    plagiarism_result_json["Score"] = plagiarism_score
    plagiarism_result_json["Similarities"] = similarities_json  # Adding where we found similarities
    plagiarism_result_json["IsPlagiarised"] = (total_scores > total_chunks * 6)  # Flag whether the text overall counts as plagiarised
    return plagiarism_result_json  # Return the plagiarism results as a dictionary

# MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism
plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2)  # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

In the code, the main function is calculate_plagiarism_score, which chunks the articles, sends them to the compare_chunks function to get the similarity score, generates a total plagiarism score, and formats the results as JSON to add some details beyond the plagiarism score, keeping them clear and readable.

The compare_chunks function creates a GPT instance using SimplerLLM, then uses a power prompt to analyze both chunks and generate a score out of 10 for how similar they are. Here's the prompt I'm using:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input chunk,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to
the input chunk. After that you're gonna pick the piece of article which is most similar and generate for it
a score out of 10 for how similar it is to the input chunk. Then you're gonna need to generate the output
as a JSON format that contains the input chunk, the article chunk which is the most similar, and the score
out of 10.
### SCORING CRITERIA
When checking for pieces in the article that are close in meaning to the chunk of text, make sure you
go over the article at least 2 times to confirm you picked the right chunk in the article which is the most
similar to the input chunk. Then when picking a score, it should be based on how similar the meanings
and structure of both these sentences are.
### INPUTS
input chunk: [{piece}]
article: [{article}]
### OUTPUT
The output should be only a valid JSON format, nothing else. Here's an example structure:
{{
"chunk": "[input chunk]",
"article": "[chunk from article which is similar]",
"score": [score]
}}

As you can see, it is a detailed prompt that is very well crafted to generate a specific result. You can learn how to craft similar prompts yourself by becoming a Prompt Engineer.

This method is a combination of methods 2 and 3: we're going to compare both articles as a whole, but using AI instead of vector embeddings.

from SimplerLLM.language.llm import LLM, LLMProvider
import time
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a given text chunk with an article to determine plagiarism using a language model.

    Returns dict: The response from the language model, parsed as a JSON dictionary.
    """
    article_text = resources.article_two  # The text to compare against
    # Formatting the prompt to include both the input text chunk and the article text
    comparison_prompt = resources.prompt4.format(piece=text_chunk, article=article_text)
    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(comparison_prompt)  # Generating response
    response_data = json.loads(response)  # Parsing the response string into a JSON dictionary
    return response_data  # Returning the parsed JSON data

def calculate_plagiarism_score(text_to_analyze):
    """
    Calculates the plagiarism score based on the analysis of a given text against a predefined article text.

    Returns dict: A JSON dictionary containing the plagiarism score and the raw data from the analysis.
    """
    plagiarism_results = {}  # Dictionary to store the final plagiarism score and analysis data
    plagiarised_chunk_count = 0  # Counter for chunks considered plagiarised
    analysis_data = compare_chunks(text_to_analyze)  # Analyze the input text for plagiarism
    total_chunks = len(analysis_data)  # Total number of chunk pairs analyzed

    for key, value in analysis_data.items():
        # Check if the value is a list with at least one item and contains a 'score' key
        if isinstance(value, list) and len(value) > 0 and 'score' in value[0] and value[0]['score'] > 6:
            plagiarised_chunk_count += 1
        # Check if the value is a dictionary and contains a 'score' key
        elif isinstance(value, dict) and 'score' in value and value['score'] > 6:
            plagiarised_chunk_count += 1

    plagiarism_score = (plagiarised_chunk_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate plagiarism score as a percentage
    plagiarism_results["Total Score"] = plagiarism_score  # Add the score to the results dictionary
    plagiarism_results["Data"] = analysis_data  # Add the raw analysis data to the results dictionary
    return plagiarism_results  # Return the final results dictionary

# MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism
plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2)  # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the scores
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

This code is 80% like the code in method 3. However, instead of comparing each chunk, we send both articles as a whole and let OpenAI's GPT generate a detailed plagiarism test, comparing all parts of the articles as it wishes. In the end, it returns a detailed output containing a plagiarism score and the top sections found to be similar, along with their similarity scores.

All this is done using this power prompt:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input text,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to
the pieces of the input text. After that you're gonna pick chunk pairs that are most similar to each other
in meaning and structure, a chunk from the input text and a chunk from the article. You will then generate
a score out of 10 for each pair for how similar they are.
Then you're gonna need to generate the output as a JSON format for each pair that contains
the input text chunk, the article chunk which are the most similar, and the score out of 10.
### SCORING CRITERIA
When checking for pieces in the article that are close in meaning to the chunk of text, make sure you
go over the article at least 2 times to confirm you picked the right pairs of chunks which are most similar.
Then when picking a score, it should be based on how similar the meanings and structure of both these sentences are.
### INPUTS
input text: [{piece}]
article: [{article}]
### OUTPUT
The output should be only a valid JSON format, nothing else. Here's an example structure:
{{
"pair 1":
{{
"chunk 1": "[chunk from input text]",
"article 1": "[chunk from article which is similar]",
"score": [score]
}},
"pair 2":
{{
"chunk 2": "[chunk from input text]",
"article 2": "[chunk from article which is similar]",
"score": [score]
}},
"pair 3":
{{
"chunk 3": "[chunk from input text]",
"article 3": "[chunk from article which is similar]",
"score": [score]
}},
"pair 4":
{{
"chunk 4": "[chunk from input text]",
"article 4": "[chunk from article which is similar]",
"score": [score]
}}
}}

The prompts in methods 3 and 4 are very important to craft well since all the results are based on them. Feel free to tweak and optimize them to your liking, and if you get better results, make sure to share them with us in the comments below!

After trying 2 types of machines to do the work for us, let's now use human intelligence and see if their results are significant!

Here are the 2 texts I was comparing:

Article 1: What is generative AI? Generative AI refers to deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on. Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of ChatGPT seems to mark a turning point. OpenAI's chatbot, powered by its latest large language model, can write poems, tell jokes, and churn out essays that look like a human created them.
Prompt ChatGPT with a few words, and out comes love poems in the form of Yelp reviews, or song lyrics in the style of Nick Cave.



Source link

05Jun

Find Similar Research Papers In 1 Minute with AI and Python! | by Hasan Aboul Hasan


An obstacle most people face when writing academic research papers is finding similar papers easily. I faced this problem myself because doing so takes too much time.

So, I built a Python script powered by AI to search for related keywords in an input abstract and then get related abstracts from arXiv.

arXiv is an open-access archive of nearly 2.4 million academic articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

Why choose arXiv?

Simply because it already contains a lot of articles and has a free built-in API, which makes it easy to access any article's abstract directly. This bypasses the need to search the web using paid APIs for articles, check if they contain an abstract, and, if they do, use an HTML parser to find the abstract. TOO MUCH WORK 🫠

http://export.arxiv.org/api/query?search_query=all:{abstract_topic}&start=0&max_results={max_results}

Enter your abstract's topic instead of {abstract_topic} and how many search results you want instead of {max_results}. Then, paste it into the web browser, and it'll return an XML file containing the summary (which is the abstract) and some other details about the article, like its ID, the authors, etc.
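If you'd rather poke at the API from Python than from the browser, here's a quick sketch (the topic and result count are placeholder values):

import requests

abstract_topic = "electron"  # Placeholder topic
max_results = 3

response = requests.get(
    "http://export.arxiv.org/api/query",
    params={"search_query": f"all:{abstract_topic}", "start": 0, "max_results": max_results},
)
print(response.text[:500])  # Raw Atom/XML feed; each <entry> contains an <id> and a <summary>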

The idea is simple; here's the workflow:

1- Extract from the input abstract the top 5 keywords (topics) that are most representative of its content.

2- Call the API on each of the 5 keywords we extracted

3- Analyze the results and check the similarity of these abstracts to the input abstract.

To use arXiv's API, we need a keyword to search for, so we need to extract the top keywords present in the input abstract. You could use built-in libraries like nltk or spacy, but when I tried them, the results were not as expected and not very accurate.

So, to get better results, I used OpenAI's GPT-4 (you can use Gemini if you prefer), gave it a power prompt, and generated optimal results. Here's the code:

import json
from SimplerLLM.language.llm import LLM, LLMProvider

def extract_keywords(abstract):
    # Constructing a prompt for the language model to generate keywords from the abstract
    prompt = f"""
### TASK
You are an expert in text analysis and keyword extraction. Your task is to analyse an abstract I'm going to give you
and extract from it the top 5 keywords that are most representative of its content. Then you're going to generate
them in a JSON format in descending order from the most relevant to the least relevant.
### INPUTS
Abstract: {abstract}
### OUTPUT
The output should be in JSON format. Here's how it should look like:
[
{{"theme": "[theme 1]"}},
{{"theme": "[theme 2]"}},
{{"theme": "[theme 3]"}},
{{"theme": "[theme 4]"}},
{{"theme": "[theme 5]"}}
]
"""
    # Creating an instance of the language model using SimplerLLM
    llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4")
    # Generating response from the language model
    response = llm_instance.generate_text(user_prompt=prompt)
    # Attempting to parse the response as JSON
    try:
        response_data = json.loads(response)
        return json.dumps(response_data, indent=2)
    except json.JSONDecodeError:
        # Returning an error message if the response is not valid JSON
        return json.dumps({"error": "Invalid response from LLM"}, indent=2)

This function uses SimplerLLM, which facilitates calling OpenAI's API without writing tedious code. In addition, it makes it very easy to use Gemini's API instead of OpenAI's by only changing the LLM instance, like this:

llm_instance = LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-pro")

Very nice, right? 😉

Back to our code.

The power prompt I crafted is the main engine of the above function; if it weren't efficiently crafted, the code wouldn't work at all.

### TASK
You are an expert in text analysis and keyword extraction. Your task is to analyse an abstract I'm going to give you and extract from it the top 5 keywords that are most representative of its content. Then you're going to generate them in a JSON format in descending order from the most relevant to the least relevant.
### INPUTS
Abstract: {abstract}
### OUTPUT
The output should be in JSON format. Here's how it should look like:
[
{{"theme": "[theme 1]"}},
{{"theme": "[theme 2]"}},
{{"theme": "[theme 3]"}},
{{"theme": "[theme 4]"}},
{{"theme": "[theme 5]"}}
]

As you can see, it is a detailed prompt that is very well crafted to generate a specific result. By becoming a Prompt Engineer, you can learn how to craft similar prompts yourself.

After running the above function, we'll have a JSON-formatted output containing 5 keywords. So, we need to search for abstracts for each of the 5 keywords, and we'll do that using arXiv's API.

However, when you run arXiv's API call, you get an XML file like this:

So, to easily extract the ID and summary (abstract), we'll import xml.etree.ElementTree, which helps us easily navigate and extract information from XML-formatted text.

import json
import requests
import xml.etree.ElementTree as ET

def get_abstracts(json_input):
    input_data = json.loads(json_input)
    all_summaries_data = []
    for theme_info in input_data:
        keyword = theme_info['theme']
        max_results = 1  # Number of results to fetch for each keyword
        # Constructing the query URL for the arXiv API
        url = f"http://export.arxiv.org/api/query?search_query=all:{keyword}&start=0&max_results={max_results}&sortBy=submittedDate&sortOrder=descending"

        response = requests.get(url)
        if response.status_code == 200:
            root = ET.fromstring(response.text)
            ns = {'atom': 'http://www.w3.org/2005/Atom'}
            summaries_data = []
            for entry in root.findall('atom:entry', ns):
                arxiv_id = entry.find('atom:id', ns).text.split('/')[-1]
                summary = entry.find('atom:summary', ns).text.strip()
                summaries_data.append({"ID": arxiv_id, "abstract": summary, "theme": keyword})
            all_summaries_data.extend(summaries_data[:max_results])
        else:
            print(f"Failed to retrieve data for theme '{keyword}'. Status code: {response.status_code}")
    json_output = json.dumps(all_summaries_data, indent=2)
    return json_output

In the above function, we loop over the 5 keywords we generated; for each one, we call the API, extract the ID and abstract from the XML, save them in a list, and format this list as JSON (easier to read).

How can we check for similarity between 2 abstracts? Again, AI 🤖

We'll be using SimplerLLM again to create an OpenAI instance and a power prompt to perform the analysis and similarity checking.

def score_abstracts(abstracts, reference_abstract):
    new_abstracts = json.loads(abstracts)
    scored_abstracts = []
    for item in new_abstracts:
        prompt = f"""
### TASK
You are an expert in abstract evaluation and English Literature. Your task is to analyze two abstracts
and then check how similar abstract 2 is to abstract 1 in meaning. Then you're gonna generate
a score out of 10 for how similar they are. 0 means they have nothing in common and are on different topics, and 10
means they are exactly the same. Make sure to go over them multiple times to check if your score is correct.
### INPUTS
Abstract 1: {reference_abstract}
Abstract 2: {item["abstract"]}
### OUTPUT
The output should be only the number out of 10, nothing else.
"""
        llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4")
        # Generating the similarity score from the language model
        response = llm_instance.generate_text(user_prompt=prompt)

        # Extracting the score from the response and handling potential errors
        try:
            score = int(response)
            perfect_match = score == 10
        except ValueError:
            score = 0
            perfect_match = False

        scored_abstracts.append({
            "ID": item["ID"],
            "theme": item["theme"],
            "score": score,
            "perfect_match": perfect_match
        })

    return scored_abstracts

We're going to use the JSON output we got from the function above, containing all abstracts and IDs; we'll loop over each abstract, run the power prompt on it with the input abstract, and get the similarity score.

As mentioned above, the power prompt is a crucial part of the function; if it is bad, the code won't work. So, read this article to improve your prompt-crafting skills.

After getting the score, if it is 10/10, then the abstract we found is a perfect match for the input abstract.

To run the code, you're going to have to create a .env file that contains your OpenAI API key or Gemini key, like this:
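For example, following the same pattern as the Serper key shown in an earlier post (the exact variable names are an assumption; use whichever your setup reads):

OPENAI_API_KEY = "YOUR_API_KEY"
GEMINI_API_KEY = "YOUR_API_KEY"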

And, of course, you'll need to enter your input abstract to run the code on it:

# MAIN SCRIPT
reference_abstract = """
YOUR_ABSTRACT
"""
json_data = extract_keywords(reference_abstract)
abstracts = get_abstracts(json_data)
data = json.dumps(score_abstracts(abstracts, reference_abstract), indent=2)
print(data)

Plus, don't forget to install all the necessary libraries, which you can do by running this in the terminal:

pip install requests simplerllm

Get Code

Now, although the script we created is working properly, why don't we improve it a little?

The search for abstracts is limited to arXiv, and maybe there is a very similar copy of your abstract that is not available on arXiv but on a different website. So, why don't we tweak the code a little to make it search Google directly for similar abstracts, and then turn it into a tool with a nice UI?

To do that, we'll only need to update the get_abstracts function:

# Search for related abstracts according to keywords and get link and content
def get_google_results(json_input):
    keywords = json.loads(json_input)
    search_results = []
    for theme_info in keywords:
        keyword = theme_info['theme']
        query = f"{keyword} AND abstract AND site:edu AND -inurl:pdf"
        result = search_with_value_serp(query, num_results=1)
        for item in result:
            url = str(item.URL)  # Resolve the URL first so the error message below can reference it
            try:
                load = load_content(url)  # Assumes load_content is a function that fetches content from the URL
                content = load.content
                search_results.append({"Link": url, "Content": content, "theme": keyword})
            except Exception as e:
                print(f"An error occurred with {url}: {e}")
                continue
    json_output = json.dumps(search_results, indent=2)
    return json_output

As you can see, the function now searches on Google using search_with_value_serp, which is integrated into the SimplerLLM library. Then, I used the load_content function, also from SimplerLLM, which makes it very easy to access the link's title and content.

In addition, you have to add your VALUE_SERP_API_KEY to the .env file. This is how it will look:
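Matching the format of the other keys:

VALUE_SERP_API_KEY = "YOUR_API_KEY"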

Keep in mind that some keywords may not have a similar abstract on Google, so the search would return nothing. Therefore, you might get fewer than 5 links to similar abstracts.

The code above is only a prototype offering a head start on this function. You can improve it to get better results, design a nice user interface for it, and make a fully functional tool out of it. Then, you can build a SaaS business based on this tool.

In this way, you'll have monthly recurring income from the tools you built! Pretty nice, huh 😉

Remember, if you have any questions, make sure to drop them below in the comments section or on the forum.



Source link

05Jun

How Many Steps Forward? – European Law Blog


Blogpost 30/2024

The history of EU institutions is marked by a long list of statements and political initiatives that endorse the legal claims of the LGBTIQA+ community (see, for instance, Kollman and Bell). Over the past decades, these have gradually been mainstreamed within different areas of EU law. Particularly, the current EU legislative term (2019-2024) has witnessed an increased commitment of EU institutions towards the LGBTIQA+ community. This is not only shown by the numerous and recurrent Resolutions of the European Parliament on this topic (see EPRS). It is also evident from several political and legislative initiatives that have been introduced over recent years, which (attempt to) intervene in diverse fields of EU law that are considered as relevant to individuals that identify as LGBTIQA+.

Meanwhile, most EU law scholars focus their research on narrow areas, such as non-discrimination (mainly, in the field of employment) and free movement (of same-sex couples and their children). In other words, LGBTIQA+ issues never appear as the starting point of the analysis but rather as an incidental reference in the context of other research topics (on this point, see Belavusau). This piece aims to provide a deeper overview of the EU's direct commitment towards the LGBTIQA+ community during the EU legislative term that is now coming to an end. It will thus retrace the different political, legislative, and judicial developments that occurred, which have been marked as relevant for, or targeted to, LGBTIQA+ persons. Some contextual challenges of EU law vis-à-vis LGBTIQA+ matters will also be highlighted.

An EU Strategy for LGBTIQA+ Equality

Looking back at the very beginning of this EU legislative term, on 12 December 2020, the European Commission adopted, by way of a Communication, the EU LGBTIQ Equality Strategy (hereinafter, 'the Strategy'). Unsurprisingly, the adoption of the Strategy comes during the EU legislative term in which the first-ever Commissioner for Equality was appointed. Likewise, a specific unit working on 'non-discrimination and LGBTIQ' matters has been established in the European Commission. Prior to the publication of the Strategy, some had argued that the EU is equipped with adequate legal bases to intervene in the fields of non-discrimination and equality for LGBTIQA+ persons. These are, for instance, the non-discrimination clause in Article 19 TFEU, or Article 81(3) TFEU as regards aspects of family law with cross-border implications. Yet, the potential of these provisions had been restrained by the absence of an overarching and coherent approach. The Strategy seems to have, at least in principle, addressed this gap.

Despite its non-binding nature, the Strategy has been considered a significant development for LGBTIQA+ persons in the EU for the following three main reasons. First, the Strategy has a strong symbolic value. It represents the first instrument in the history of EU integration that specifically targets the LGBTIQA+ community. Second, the Strategy provides a comprehensive approach, as it addresses the topic from different angles. Indeed, it is built on four major axes: i) tackling discrimination against LGBTIQ people; ii) ensuring LGBTIQ people's safety; iii) building LGBTIQ inclusive societies; iv) leading the call for LGBTIQ equality around the world. Last, the Strategy is very detailed. It precisely identifies legislative and non-legislative initiatives to be achieved within a fixed timeline, thus serving as a planning instrument for the Commission's action.

More recently, a survey conducted by the EU Fundamental Rights Agency shows that while there are signs of slow and gradual progress, discrimination against LGBTIQA+ persons remains dramatically high. This is also evident in ILGA-Europe's annual rainbow map. As the end date of the Commission's Strategy approaches and EU elections are coming up, the question remains whether the next European Commission will develop a new instrument for LGBTIQA+ equality or, as will be argued below, at least try to fulfil the missed objectives of the current Strategy.

Recognition of same-sex parents and their children

On 7 December 2022, the European Commission proposed the Equality Package (hereinafter, 'the Package'), a proposal for a Regulation to harmonise rules concerning parenthood in cross-border situations. One of the key aspects of the proposal is that once parental bonds are established in one Member State, these must be automatically recognised everywhere in the EU (for a deeper analysis of the Package, see Tryfonidou; see also Marcia).

The mutual recognition of same-sex parents and their children had also been addressed, just a year earlier, by the Court of Justice (CJEU) in the Pancharevo case (C-490/20). The dispute concerned a same-sex couple, a Bulgarian and a UK national. They gave birth to S.D.K.A. in Spain, where the couple had been married and was legally residing. Spain thus issued a birth certificate, as Spanish law recognises same-sex parenthood. Yet, Bulgarian authorities refused to issue a passport/ID for S.D.K.A. since Bulgarian law does not recognise same-sex parenthood. This led to a preliminary question referred to the CJEU, namely whether such a refusal constituted a breach of EU free movement rights (notably, Articles 20 and 21 TFEU and Directive 2004/38). The Court ruled that the refusal to issue a passport or ID to S.D.K.A. would indeed alter the effectiveness of her right to move and reside freely within the Union. National authorities are thus required to recognise the parental bonds legally established in another Member State. This obligation, however, applies only for the purposes of the exercise of the right to free movement, while Member States remain free (not) to recognise same-sex parenthood within their internal legal orders (for a full overview of the judgment, see Tryfonidou; see also De Groot).

Despite the obligation stemming from this judgment, in practice, same-sex parents often experience long and expensive proceedings before national authorities. Indeed, the Commission stated that the key objective of the Equality Package is to reduce the time, costs, and burdens of recognition proceedings for both families and national judicial systems. The proposed regulation would, in other words, 'automatise' the requirements introduced by the Court in Pancharevo (for the purposes of the exercise of the right to free movement). However, one of the biggest challenges to the adoption of the Package is its legal basis: Article 81(3) TFEU. This requires the Council to act unanimously under a special legislative procedure, after obtaining the consent of the European Parliament. If reaching unanimity among the 27 Member States is generally challenging, this becomes even more complex when the file concerns a topic on which Member States' sensibilities and approaches differ dramatically. Indeed, some national governments, such as the Italian one, have already declared their unwillingness to support the Commission's initiative (see, for instance, Marcia).

Combatting hate crime and hate speech

Current EU law criminalises hate crime and hate speech only if related to the grounds of race and ethnic origin. Yet, national laws differ significantly when it comes to such conduct in relation to sex, sexual orientation, age, and disability (see EPRS). To implement the Strategy’s objective of ‘ensuring LGBTIQ people’s safety’, on 9 December 2021, the Commission proposed to include hate crime and hate speech against LGBTIQA+ persons within EU crimes. This initiative requires a two-step procedure. First, Article 83(1) TFEU contains a list of areas of ‘particularly serious crime’ with a ‘cross-border dimension’ that justify a common action at EU level. This list can only be updated by a Council decision, taken by unanimity, after receiving the consent of the European Parliament. Second, once hate crime and hate speech have been included in this list, the Commission can follow up with a proposal for a directive to be adopted through the ordinary legislative procedure. This would establish minimum rules concerning the definition of criminal offences and sanctions (for a full analysis of the proposal, see Peršak).

The European Parliament has addressed the problem of hate crime and hate speech against LGBTIQA+ persons on different occasions. In a Resolution of 18 January 2024, the Parliament welcomed the Commission’s initiative and urged the Member States to make progress on it. The Justice and Home Affairs Council of 3-4 March 2022 had previously discussed the proposal, concluding that ‘a very broad majority was in favour of this initiative’. Yet, the file has never been scheduled for further discussion or vote since then. Significantly, not even the Belgian Presidency of the Council managed to make any progress, despite its declared intention to make LGBTIQA+ equality a priority during the country’s six-month lead of the institution. The Commission’s proposal therefore remains far from being realised, with unanimity being – once again – the greatest challenge to overcome.

The return to EU values

In December 2022, the European Commission referred Hungary to the Court of Justice in the context of an infringement procedure (C-769/22). The contested legislation, approved by the Hungarian Parliament in June 2021, was depicted as a tool to combat paedophilia. As highlighted by the Commission and several NGOs, however, the law directly targets the LGBTIQA+ community. Indeed, it limits minors’ access to content that ‘promote(s) divergence from self-identity corresponding to sex at birth, sex change or homosexuality’ and bans or limits media content that concerns homosexuality or gender identity. It also introduces a set of penalties for organisations that breach these rules (see Bonelli and Claes).

During the past decade, Viktor Orbán made Hungary very (un)popular for its multiple violations of the rule of law and fundamental rights, including attacks on the LGBTIQA+ community. Thus, the introduction of – another – infringement procedure against Hungary seems business as usual. However, EU law scholars have immediately pointed out that this could be a landmark case. For the first time, the Commission has directly relied on Article 2 TEU, proposing a direct link between LGBTIQA+ equality and the ‘founding values’ of the EU. While there is no doubt that this is of high symbolic and political importance, questions have been raised as regards the ‘added legal value’ of Article 2 TEU. In other words, the judicial mobilisation of Article 2 TEU does not seem to bring more legal benefits than an infringement procedure based only on the Charter of Fundamental Rights and other provisions of EU law.

It must be noted that the Commission’s reliance on EU values has encouraged a significant political and judicial mobilisation. In an unprecedented move, the European Parliament and fifteen Member States have asked to intervene before the CJEU. This is the first time in the history of EU integration that so many Member States have asked to intervene in support of the Commission’s action against another Member State. For some of them, including France and Germany, this is the first-ever intervention in a case related to fundamental rights’ protection (see Chopin and Leclerc). However, it should also be underlined that the group of countries participating in the lawsuit has a markedly Western component. This clearly shows the existence (and persistence) of an East-West divide when it comes to the controversial topic of LGBTIQA+ rights’ protection. Therefore, considering the unanimity requirements mentioned above, even the high participation of Member States in the infringement procedure seems insufficient to advance coherent action at EU level.

Conclusions

EU institutions, in particular the Commission and the Parliament, seem increasingly committed to offering more robust protection to LGBTIQA+ persons. This is shown by the first-ever comprehensive EU Strategy and the related legislative proposals, as well as the numerous calls of the European Parliament. Whereas this is clearly positive for the visibility and legal claims of the LGBTIQA+ community, the legal outcome, however, appears limited. All legislative proposals are blocked by the failure to reach unanimity in the Council. Indeed, the only changes that have occurred in terms of legal obligations seem to stem from the CJEU ruling in Pancharevo (and other minor developments related to anti-discrimination case-law). While it is true that, in principle, the EU is equipped with good legal bases to legislate in the fields of non-discrimination and equality for LGBTIQA+ persons, the feasibility of EU intervention seems challenged by the type of legislative procedure provided and the unanimity requirement. Therefore, further research is needed to identify the actual potential of EU competences to deal with the legal claims advanced by the LGBTIQA+ community.

The pending ‘EU values case’ (C-769/22 Commission v Hungary) shows the existence of highly divergent cultural and political views among the Member States, especially on issues such as LGBTIQA+ equality, which seemingly continue to be controversial. At the end of this week (6-9 June 2024), EU citizens will be called to elect the new Members of the European Parliament (MEPs). As current polls show, far-right parties are likely to gain an increased number of seats. Accordingly, this could lead to a more conservative composition of the next European Commission. These dynamics may constitute a significant shift in the commitment of these institutions to enhance LGBTIQA+ rights’ protection. Indeed, the European Parliament and the European Commission are considered two early [LGBTIQA+] movement allies, as they have supported the claims of this community on numerous occasions before and during this term. Therefore, the question is whether these potential political changes will result in a softening of their commitment. If so, the CJEU may remain the last and only resort for LGBTIQA+ individuals at EU level.



Source link

03Jun

Assertions Are Like Guardrails for LLM Apps | by Cobus Greyling | Jun, 2024


DSPy Assertions are a different approach to guardrails, asserting computational constraints on foundation models.

In a previous post I gave some background on what the basic architecture of DSPy is and what some of the possible use-cases might be.

Exploration & Optimisation

DSPy is well suited as an interface where you describe your needs, share a very small amount of data, and have DSPy generate the optimal prompts, prompt templates, and prompting strategies.

To get better results without spending a fortune, you should try different approaches like breaking tasks into smaller parts, refining prompts, increasing data, fine-tuning, and opting for smaller models. The real magic happens when these methods work together, but tweaking one can impact the others.
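To make the optimisation idea concrete, below is a minimal sketch of letting DSPy tune a module from a handful of examples. This assumes DSPy’s BootstrapFewShot optimizer and a language model already configured via dspy.settings.configure; the tiny trainset and the exact-match metric are illustrative placeholders, not examples from the study.

import dspy
from dspy.teleprompt import BootstrapFewShot

# A tiny module: question in, answer out
qa = dspy.ChainOfThought("question -> answer")

# A couple of hand-written examples, standing in for "a very small amount of data"
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

# A simple metric: does the predicted answer contain the expected one?
def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# DSPy searches for effective prompts and demonstrations instead of you hand-tuning them
compiled_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)

The compiled module can then be called like the original one, with the optimised prompting baked in.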

GUI

DSPy is a programmatic approach, and I can imagine how DSPy would benefit from a GUI for more basic implementations. Consider a user who can upload sample data, describe in natural language what they want to achieve, and then have the GUI generate a prompting strategy with templates and so on.

Use-Case

When deciding whether DSPy is the right fit for your implementation, the use-case needs to be considered. This goes for all implementations: the use-case needs to lead.

In essence, DSPy is designed for scenarios where you require a lightweight, self-optimising programming model rather than relying on pre-defined prompts and integrations.

I believe much can be gleaned from this implementation of guardrails…

  1. The guardrails can be described in natural language and the LLM can be leveraged to self-check its responses.
  2. More complicated statements can be created in Python where values are parsed to perform checks.
  3. The flexibility of describing the guardrails lends a high degree of freedom in what can be set for specific implementations.
  4. The division between assertions and suggestions is beneficial, as it allows for a clearer delineation of checks.
  5. Additionally, the ability to define recourse adds another layer of flexibility and control to the process.
  6. The study’s language primarily revolves around constraining the LLM and defining runtime retry semantics.
  7. This approach also serves as an abstraction layer for self-refinement methods into arbitrary steps for pipelines.

There are two types of constraints: hard assertions and soft suggestions.

Hard Assertions represent critical conditions that, when violated after a maximum number of retries, cause the LM pipeline to halt, if so defined, signalling a non-negotiable breach of requirements.

On the other hand, suggestions denote desirable but non-essential properties; their violation triggers the self-refinement process, but exceeding a maximum number of retries does not halt the pipeline. Instead, the pipeline continues to execute the next module.

dspy.Assert(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")

dspy.Suggest(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")
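To illustrate how a validation function plugs into these calls, here is a minimal sketch; the is_concise helper, the pred.answer output, and the GenerateAnswer module name are hypothetical stand-ins, not part of the DSPy API.

# Hypothetical validation function: flag answers that run too long
def is_concise(answer: str) -> bool:
    return len(answer.split()) <= 100

# Hard constraint: after the maximum number of retries, a violation halts the pipeline
dspy.Assert(is_concise(pred.answer), "Keep the answer under 100 words.", target_module="GenerateAnswer")

# Soft constraint: a violation triggers self-refinement, but the pipeline continues
dspy.Suggest(is_concise(pred.answer), "Try to keep the answer under 100 words.", target_module="GenerateAnswer")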

Assertions make use of DSPy as a foundational framework.

Consider the code snippet below from a DSPy notebook:

class LongFormQAWithAssertions(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_cited_paragraph = dspy.ChainOfThought(GenerateCitedParagraph)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_cited_paragraph(context=context, question=question)
        pred = dspy.Prediction(context=context, paragraph=pred.paragraph)
        dspy.Suggest(citations_check(pred.paragraph), f"Make sure every 1-2 sentences has citations. If any 1-2 sentences lack citations, add them in 'text... [x].' format.", target_module=GenerateCitedParagraph)
        _, unfaithful_outputs = citation_faithfulness(None, pred, None)
        if unfaithful_outputs:
            unfaithful_pairs = [(output['text'], output['context']) for output in unfaithful_outputs]
            for _, context in unfaithful_pairs:
                dspy.Suggest(len(unfaithful_pairs) == 0, f"Make sure your output is based on the following context: '{context}'.", target_module=GenerateCitedParagraph)
        else:
            return pred
        return pred

The assertions included aim to enforce the defined computational constraints, allowing the LongFormQA program to operate within these guidelines automatically.

In the first assertion, the program validates the output paragraph to ensure citations appear every 1–2 sentences. If this validation fails, the assertion backtracking logic activates with the feedback: “Ensure each 1–2 sentences include citations in ‘text… [x].’ format.”

In the second assertion, the CheckCitationFaithfulness program is used to verify the accuracy of each cited reference, examining text segments in the generated paragraph.

For unfaithful citations, it provides feedback with the context: “Ensure your output aligns with this context: ‘{context}’.”

This ensures the assertion backtracking has the necessary information and context.
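The notebook snippet above relies on helpers such as citations_check, whose implementations are not shown here. As a rough illustration only (an assumption, not the notebook’s actual code), such a check could look like this:

import re

def citations_check(paragraph: str) -> bool:
    # Simplified sketch: require every sentence to carry a "[x]"-style citation
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]
    return all(re.search(r'\[\d+\]', s) for s in sentences)

A real implementation would be more forgiving (the notebook’s feedback message speaks of every 1-2 sentences), but the shape is the same: a plain Python predicate that returns True when the constraint holds.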

ā­ļø Follow me on LinkedIn for updates on Large Language Models ā­ļø

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn



Source link

03Jun

Create a Free AI Chatbot on WordPress Without Any Third-Party Plugins! | by Hasan Aboul Hasan | Apr, 2024


In this post, I will show you how to create a Free AI Chatbot on a WordPress site WITHOUT using third-party services or chatbot plugins.

If you follow up with me for the next few minutes, you will create an AI chatbot like this:

We will build this totally from scratch so you can understand the idea behind building AI chatbots and integrating them with WordPress.

🚨 🚨 Disclaimer: This post has a lot of value and may cause some side effects, like making you smarter and super eager to share it with everyone you know 😂

The Chatbot Structure

Before we start building our AI-powered chatbot, letā€™s take a moment to understand its structure and see how each part functions.

Our chatbot consists of three main parts:

  1. The UI or the Chatbot Interface: This is the part of the chatbot that your users will interact with.
  2. The Backend: This is the core of the chatbot where all the configurations are set up. It handles the logic and processes necessary to determine how the chatbot should respond to various inputs.
  3. AI APIs: These are the artificial intelligence engines that power the chatbotā€™s ability to understand and respond to user queries. You can use APIs from major providers like OpenAI, Gemini, or Anthropic, or even open-source models.

Both the UI and the backend will be published and implemented on WordPress, making them accessible and manageable through your website’s platform.

Creating the Chatbot UI

The first step in building our AI chatbot is to create the user interface (UI). In this guide, we’re building the chatbot from scratch without using any plugins, relying only on basic web technologies: HTML, CSS, and JavaScript.

Don’t worry about coding everything yourself; I’ve prepared the code for you. You just need to copy and paste it!

Access the Code

To get started, click the link below to access the code for our chatbot:

👉 AI Chatbot Codes

Inside the download, you will find two folders: one for the frontend and one for the backend. The frontend folder contains three files.

When you open the index.html file in your browser, the chatbot UI will load, but since it’s not yet connected to the backend, an error will appear. This is expected at this point, so don’t worry!

This error message confirms that the UI part is not yet communicating with the backend, which is exactly what we anticipate at this stage.

Setup the Backend

Now that we’ve set up the user interface for our AI chatbot, it’s time to move on to part two: setting up the backend.

Since we are building the chatbot for WordPress, we’ll implement the backend directly within the WordPress environment.

Note that if you’d like to test this and you don’t have a WordPress site, you can use LocalWP to install WordPress locally and test with it.

1- Install Wp Code Snippets Plugin

Start by going to your WordPress dashboard. Navigate to ‘Plugins’, click ‘Add New’, and search for “WP Code Snippets.”

Install the “WPCode” plugin. It’s important to select the correct one as shown in the image below:

2- Copy The Backend Code

Return to the folder you downloaded earlier and open the backend folder. Look for the config.php file.

Open this file with any text editor you prefer, copy all the contents, and then go back to your WordPress dashboard.

Inside the WP Code Snippets area, create a new empty snippet. Paste the copied PHP code into this snippet.

Ensure that you set the code type to “PHP Snippet” as shown below:

3- Set your OpenAI API Key

The backend code uses the OpenAI API to power our AI chatbot, so you will need an API key from OpenAI. If you don’t already have one, you’ll need to visit OpenAI’s website, sign up, and generate a new API key.

Once you have your API key, replace the placeholder in the snippet with your actual key:

$api_key = 'sk-XXX'; // Replace with your actual API key

Connect the UI with the Backend

With the UI and backend now set up, the next crucial step is to link them together.

Here’s how to connect the UI with the backend:

Open the script.js file located in the frontend folder you’ve previously downloaded.

Look for the section at the top of the file where URLs are configured.

Here, you’ll need to update the URLs to match your website’s domain. This ensures that when the UI sends a request, it correctly reaches the backend hosted on your WordPress site.

For example, if your website is https://www.YourSuperWebsite.com, you will change the URL in the script.js to something like this:

const apiUrl = 'https://www.YourSuperWebsite.com/wp-json/myapi/v1/chat-bot/';
const botConfigurationUrl = 'https://www.YourSuperWebsite.com/wp-json/myapi/v1/chat-bot-config';

Once you’ve updated the URLs in script.js, save your changes.

Now, re-open the index.html file in your browser to reload the chatbot interface. This time, when you interact with the chatbot, it should be able to communicate with the backend and you should be able to chat with the AI bot without any problems.

If everything is set up correctly, your chatbot should now be operational, responding to queries based on the AIā€™s processing done in the backend.

The Chatbot Configuration

To enhance the flexibility and personalization of our AI chatbot, I have developed a backend function that allows for easy customization of the bot.

To customize your chatbot, you will need to access the PHP snippet we created in the WP Code Snippets plugin. Here’s a detailed guide on how to do it:

  1. Access the Backend Snippet: Go to your WordPress dashboard and navigate to the WP Code Snippets area where you previously pasted the backend code. Look for the function named load_chat_bot_base_configuration.
  2. Understand the Function: This function is designed to dynamically configure your chatbot. Here is a breakdown of what you can customize:
  • User Avatar URL: Set the URL for the avatar image that represents the user in the chat interface. This could be a default image or something more personalized.
  • Bot Image URL: Set the URL for the bot’s avatar image to give your chatbot a distinct personality.
  • Startup Message: Customize the initial greeting message that users see when they start interacting with the chatbot.
  • Font Size: Adjust the font size to ensure that the chat interface is readable and matches the style of your website.
  • Common Buttons: Define buttons that appear in the chat interface, providing quick options for common queries or actions.

Here is a snippet of the code for better understanding:

function load_chat_bot_base_configuration(WP_REST_Request $request) {
    $user_avatar_url = "https://learnwithhasan.com/wp-content/uploads/2023/09/pngtree-businessman-user-avatar-wearing-suit-with-red-tie-png-image_5809521.png";
    $bot_image_url = "https://learnwithhasan.com/wp-content/uploads/2023/12/site-logo-mobile.png";
    $response = array(
        'botStatus' => 0, // 0 for offline, 1 for online
        'StartUpMessage' => "Hi, How are you?",
        'fontSize' => '16',
        'userAvatarURL' => $user_avatar_url,
        'botImageURL' => $bot_image_url,
        'commonButtons' => array(
            array(
                'buttonText' => 'I want your help!!!',
                'buttonPrompt' => 'I have a question about your courses'
            ),
            array(
                'buttonText' => 'I want a Discount',
                'buttonPrompt' => 'I want a discount'
            )
        )
    );
    return $response;
}

Upgrade Your Chatbot

Now that we have a functioning AI chatbot on our site, let’s explore some advanced features and upgrades to make it even more powerful and tailored to your specific needs.

1- System Prompt Customization

To make your chatbot niche-specific and direct its responses in a certain style, consider customizing the system prompt. This is the instruction you give to the AI about how it should behave.

For example, to make your chatbot focus on digital marketing, you can modify the backend code like this:

$conversation_history[] = [
    'role' => 'system',
    'content' => 'Answer questions only related to digital marketing, otherwise, say I don’t know.'
];

This restricts the chatbot to respond only to queries related to digital marketing. You can also tailor the system prompt to define the response style, topic, target audience, etc. Here’s another example:

You are a language learning coach. Your task is to help users learn and practice new languages. Offer grammar explanations, vocabulary-building exercises, and pronunciation tips.
Engage users in conversations to help them improve their listening and speaking skills and gain confidence in using the language.
Do not respond to questions outside of your language coaching role.

2- Retrieval-Augmented Generation (RAG)

This upgrade involves giving the chatbot access to your documents, FAQs, or any site-specific knowledge base.

This allows the AI to pull information directly from your resources to answer questions, making its responses more accurate and tailored to your content.
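As a rough sketch of the retrieval step in Python (assuming the official OpenAI Python client and its embeddings API; the find_best_chunk helper and the plain list of text chunks are illustrative, not part of the downloaded code):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    # Turn a piece of text into an embedding vector
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def find_best_chunk(question: str, chunks: list[str]) -> str:
    # Return the document chunk whose embedding is most similar to the question
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

The best-matching chunk is then prepended to the system prompt before calling the chat model, so the AI answers from your content rather than its general knowledge. In practice you would embed your documents once and store the vectors instead of re-embedding them on every request.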

3- Agentic Chatbot

Transform your chatbot into an autonomous agent that can perform additional functions.

For example, integrating with WordPress functionalities allows the chatbot to manage user accounts, reset passwords, and more, acting like a human agent on your site.

Iā€™ve created a simple prototype for building AI agents on WordPress. You can learn more and access the code for testing here:

AI Agent on WordPress

✅ If you’re interested in understanding what agents are and how to build them from scratch, check out this course “Build AI Agents From Scratch With Python”

4- Connect with Python API

While our backend is PHP-based, Python offers more robust libraries and packages. You can make your PHP backend a tunnel to a Python API, enhancing your chatbotā€™s capabilities.

Here is an example of how to make the chatbot interact with an API instead:

function call_fast_api_endpoint( $last_prompt, $conversation_history ) {
    $api_url = 'https://yourapi.com/chat-bot-test';
    $api_key = 'XXX';
    $body = json_encode(array(
        'last_prompt' => $last_prompt,
        'conversation_history' => $conversation_history,
    ));
    $response = wp_remote_post($api_url, array(
        'headers' => array(
            'Accept' => 'application/json',
            'x-api-key' => $api_key,
            'Content-Type' => 'application/json',
        ),
        'body' => $body,
        'method' => 'POST',
        'data_format' => 'body',
    ));
    if ( is_wp_error( $response ) ) {
        // Handle error
        return $response->get_error_message();
    }
    $response_body = wp_remote_retrieve_body( $response );
    return json_decode($response_body, true);
}

This function sends the last prompt and the conversation history to the Python API, which processes the data and returns a response. It uses WordPress functions like wp_remote_post and wp_remote_retrieve_body to handle HTTP requests.

This integration allows your chatbot to leverage advanced Python capabilities, enriching its functionality and responsiveness.
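For completeness, here is a minimal sketch of what the receiving Python side could look like, assuming FastAPI; the route and payload fields mirror the PHP function above, while the key check and response logic are placeholders:

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ChatPayload(BaseModel):
    last_prompt: str
    conversation_history: list[dict]

@app.post("/chat-bot-test")
def chat_bot_test(payload: ChatPayload, x_api_key: str = Header(None)):
    # Reject requests that do not carry the key the PHP side sends in 'x-api-key'
    if x_api_key != "XXX":
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... run your Python-side LLM logic here ...
    return {"response": f"You said: {payload.last_prompt}"}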

Detailed instructions on how I integrated Python with my WordPress site with a points system to build my SaaS on WordPress, plus code examples, are available in the WordPress SaaS Course.

5- Utilize Other AI APIs

Yes, it’s possible to integrate other AI platforms like Anthropic, Gemini, or even open-source models into your WordPress site.

While a full explanation is beyond this scope, you can find a tutorial on how to integrate the Gemini API to build an AI tool on WordPress on my YouTube channel.

Publish on WordPress

Once you have your AI chatbot ready, the next step is to integrate it into your WordPress site. You can choose to deploy it on a specific page or across all pages on your site, depending on your needs. Here’s how you can do it:

Publishing on a Single Page

  1. Create a New Page: Navigate to your WordPress dashboard and go to Pages > Add New. Give your page a title that corresponds with the chatbot’s function, such as “Customer Support Chatbot.”
  2. Add Custom HTML Block: In the WordPress block editor, click the ‘+’ button to add a new block, and select ‘Custom HTML’ from the formatting options. This is where you will paste the custom HTML, CSS, and JavaScript codes.
  3. Preview and Publish: Always preview your page to ensure the chatbot appears and functions as expected before hitting the publish button. Adjust the placement and design if necessary to ensure it fits well with the rest of your page’s layout.

Publishing on All Pages

If you want your AI chatbot to be accessible from any page of your WordPress site, you can paste it into the footer of your website, again by using WP Code Snippets.

Just click on Add Headers and Footers, and paste the full code (HTML, JS, and CSS) in the footer section.

Save, and now your chatbot will appear on all pages.

For further details and support, join our forum, where I’m available nearly every day to answer questions and engage with community members!



Source link

02Jun

How To Create Autonomous AI Agents From Scratch! | by Hasan Aboul Hasan | May, 2024


Having established the ReAct System Prompt and defined the necessary functions, we can now integrate these elements to construct our AI agent.

Let’s return to our main.py script to complete the setup.

Define Available Functions

First, list the functions the agent can utilize. For this example, we only have one:

available_actions = {
    "get_response_time": get_response_time
}

This mapping enables the agent to select and call the correct function by name.

Set Up User and System Prompts

Define the user prompt and the messages that will be passed to the generate_text_with_conversation function we previously created:

user_prompt = "What is the response time for learnwithhasan.com?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

The system prompt, structured as a ReAct loop directive, is provided as a system message to the OpenAI LLM.

Now OpenAI’s LLM will be instructed to act in a loop of Thought, Action, and Action Result!

Create the Agentic Loop

Implement the loop that processes user inputs and handles AI responses:

turn_count = 1
max_turns = 5

while turn_count < max_turns:
    print(f"Loop: {turn_count}")
    print("----------------------")
    turn_count += 1

    # Ask the LLM for the next Thought/Action step
    response = generate_text_with_conversation(messages, model="gpt-4")
    print(response)

    # Check whether the LLM requested a function call
    json_function = extract_json(response)

    if json_function:
        function_name = json_function[0]['function_name']
        function_parms = json_function[0]['function_parms']
        if function_name not in available_actions:
            raise Exception(f"Unknown action: {function_name}: {function_parms}")
        print(f" -- running {function_name} {function_parms}")
        action_function = available_actions[function_name]
        # Call the function and feed the result back into the conversation
        result = action_function(**function_parms)
        function_result_message = f"Action_Response: {result}"
        messages.append({"role": "user", "content": function_result_message})
        print(function_result_message)
    else:
        # No function call means the LLM produced its final answer
        break

This loop reflects the ReAct cycle, generating responses, extracting JSON-formatted function calls, and executing the appropriate actions.

So we generate the response, and we check if the LLM returned a function.

I created the extract_json method to make it easy for you to extract any functions from the LLM response.

In the following line:

json_function = extract_json(response)

We check whether the LLM returned a function to execute; if so, the loop runs that function and appends the result to the messages, so that in the next turn the LLM can use the Action_Response to answer the user’s query.
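If you are curious how such a helper might work internally, here is a simplified sketch; this is my illustrative version, not necessarily identical to the extract_json shipped with the codebase:

import json
import re

def extract_json(text: str):
    # Pull the first JSON object out of the model's reply, if any
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        return None
    try:
        # Wrap in a list so callers can index json_function[0]
        return [json.loads(match.group(0))]
    except json.JSONDecodeError:
        return None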

Test the Agent!

To see this agent in action, you can download the complete codebase using the link provided below:

Basic AI Agent Code

And if you’d like to see all this in action, along with another real-world agent example, you can check out this free video:

For more in-depth exploration and additional examples, consider checking out my full course, “Build AI Agents From Scratch With Python.”

And remember, if you have any questions or encounter any issues, I’m available nearly every day on the forum to assist you, for free!



Source link
