31May

DSPy & The Principle Of Assertions | by Cobus Greyling | May, 2024


The principle of Language Model (LM) Assertions is implemented into the DSPy programming framework.

The objective is to make programs more steerable, reliable and accurate by putting a framework in place to guide LLM output.

According to the study, in four different text generation tests LM Assertions not only helped Generative AI Apps follow rules better but also improved task results, meeting rules up to 164% more often and generating up to 37% better responses.

When an assertion constraint fails, the pipeline can back-track and retry the failing module. LM Assertions provide feedback on retry attempts; they inject the erring outputs and error messages into the prompt so the module can introspectively self-refine its outputs.

There are two types of assertions, hard and soft.

Hard Assertions represent critical conditions that, when violated after a maximum number of retries, cause the LM pipeline to halt, if so defined, signalling a non-negotiable breach of requirements.

On the other hand, suggestions denote desirable but non-essential properties; their violation triggers the self-refinement process, but exceeding a maximum number of retries does not halt the pipeline. Instead, the pipeline continues to execute the next module.

DSPy Assert

The use of dspy.Assert is recommended during the development stage, acting as a checker or scanner to ensure the LM behaves as expected. It is hence a very descriptive way of identifying and addressing errors early in the development cycle.

Below is a basic example of how to formulate an Assert and a Suggest.

dspy.Assert(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")

dspy.Suggest(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")
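For illustration, the validation function is just a Python callable that returns a boolean. Below is a purely hypothetical sketch; the function name, length budget and module name are placeholders rather than part of the DSPy API.

# Hypothetical validation helper: passes when the model output stays under a length budget.
def your_validation_fn(model_outputs: str, max_chars: int = 500) -> bool:
    return len(model_outputs) <= max_chars

# Used inside a DSPy module's forward pass, for example:
# dspy.Suggest(
#     your_validation_fn(answer),
#     "Keep the answer under 500 characters.",
#     target_module="GenerateAnswer",
# )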

When the assertion criterion is not met, resulting in a failure, dspy.Assert triggers a sophisticated retry mechanism that allows the pipeline to adjust. The program or pipeline is therefore not necessarily terminated.

On an Assert failing, the pipeline transitions to a special retry state, allowing it to reattempt a failing LM call while being aware of its previous attempts and the error message raised.

After a maximum number of self-refinement attempts, if the assertion still fails, the pipeline transitions to an error state and raises an AssertionError, terminating the pipeline.

This enables Assert to be much more powerful than conventional assert statements, leveraging the LM to conduct retries and adjustments before concluding that an error is irrecoverable.

DSPy Suggest

dspy.Suggest is best utilised as helpers during the evaluation phase, offering guidance and potential corrections without halting the pipeline.

dspy.Suggest(len(unfaithful_pairs) == 0, f"Make sure your output is based on the following context: '{context}'.", target_module=GenerateCitedParagraph)

In contrast to asserts, suggest statements provide gentler recommendations rather than strict enforcement of conditions.

These suggestions guide the LM pipeline towards desired outcomes in specific domains. If a Suggest condition isn’t met, similar to Assert, the pipeline enters a special retry state, enabling retries of the LM call and self-refinement.

However, if the suggestion consistently fails after multiple attempts at self-refinement, the pipeline logs a warning message called SuggestionError and continues execution.

This flexibility allows the pipeline to adapt its behaviour based on suggestions while remaining resilient to less-than-optimal states or heuristic computational checks.

LM Assertions, a novel programming construct designed to enforce user-specified properties on LM outputs within a pipeline. — Source

Considering the image below, two Suggests are made, as opposed to Asserts. The first suggests a maximum query length, and the second that the query be distinct from previous queries.

dspy.Suggest(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
)

dspy.Suggest(
    validate_query_distinction_local(prev_queries, query),
    "Query should be distinct from: "
    + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
)
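The helper validate_query_distinction_local is not shown in the example. As a rough, hypothetical sketch (the helper used in the DSPy examples may be implemented differently), it could compare token overlap between the new query and each previous query:

def validate_query_distinction_local(prev_queries, query, threshold=0.8):
    # Hypothetical implementation: a query counts as distinct when its token overlap
    # with every previous query stays below the threshold.
    def overlap(a, b):
        a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
        if not a_tokens or not b_tokens:
            return 0.0
        return len(a_tokens & b_tokens) / min(len(a_tokens), len(b_tokens))

    return all(overlap(q, query) < threshold for q in prev_queries)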

I believe much can be gleaned from this implementation of guardrails…

  1. The guardrails can be described in natural language and the LLM can be leveraged to self-check its responses.
  2. More complicated statements can be created in Python, where values are passed to perform checks.
  3. The freedom in describing the guardrails allows a high degree of flexibility in what can be set for specific implementations.
  4. The division between assertions and suggestions is beneficial, as it allows for a clearer delineation of checks.
  5. Additionally, the ability to define recourse adds another layer of flexibility and control to the process.
  6. The study’s language primarily revolves around constraining the LLM and defining runtime retry semantics.
  7. This approach also serves as an abstraction layer for self-refinement methods into arbitrary steps for pipelines.




Source link

30May

Comparing LLM Agents to Chains: Differences, Advantages & Disadvantages | by Cobus Greyling | May, 2024


RPA Approach

Prompt chaining can be utilised in Robotic Process Automation (RPA) implementations. In the context of RPA, prompt chaining can involve a series of prompts given to an AI model or a bot to guide it through the steps of a particular task or process automation.

By incorporating prompt chaining into RPA implementations, organisations can enhance the effectiveness, adaptability, and transparency of their automation efforts, ultimately improving efficiency and reducing operational costs.

Human In The Loop

Prompt chaining is ideal for involving humans: by default, chains form a dialog-turn-based conversational UI in which the dialog or flow is moved forward based on user input.

There are instances where chains do not depend on user input, and these implementations are normally referred to as prompt pipelines.

Agents can also have a tool for human interaction. A Human-In-The-Loop (HITL) tool is ideal when the agent reaches a point where its existing tools do not suffice for the query; the agent can then use the HITL tool to reach out to a human for input.

Managing Cost

Managing costs is more feasible with a chained approach compared to an agent approach. One method to mitigate cost barriers is by self-hosting the core LLM infrastructure, reducing the significance of the number of requests made to the LLM in terms of cost.

Optimising Latency

Optimising latency through self-hosted local LLMs involves hosting the language model infrastructure locally, which reduces the time it takes for data to travel between the user’s system and the model. This localisation minimises network latency, resulting in faster response times and improved overall performance.

LLM Choose Action Sequence

LLMs can choose action sequences for agents by employing a sequence generation mechanism. This involves the LLM generating a series of actions based on the input provided to it. These actions can be determined through a variety of methods such as reinforcement learning, supervised learning, or rule-based approaches.

Seamless Tool Introduction

With autonomous agents, agent tools can be introduced seamlessly to update and enhance the agent capabilities.

Design Canvas Approach

A prompt chaining design canvas IDE (Integrated Development Environment) would provide a visual interface for creating, editing, and managing prompt chains. Here’s a conceptual outline of what features such an IDE might include: Visual Prompt Editor, Prompt Library, Connection Management, Variable Management, Preview and Testing, etc.

Overall, a prompt chaining design canvas IDE would provide a user-friendly environment for designing, implementing, and managing complex conversational flows using a visual, intuitive interface.

No/Low-Code IDEs

Agents are typically pro-code in their development, whereas chains mostly follow a design canvas approach.

Agents often involve a pro-code development approach, where developers write and customise code to define the behaviour and functionality of the agents. Conversely, chains typically follow a design canvas approach, where users design workflows or sequences of actions visually using a graphical interface or canvas. This visual approach simplifies the creation and modification of processes, making it more accessible to users without extensive coding expertise.

I need to add that there are agent IDEs like FlowiseAI, LangFlow, Stack and others.




Source link

29May

Using DSPy For A RAG Implementation | by Cobus Greyling | May, 2024


In this notebook, GPT-3.5 (specifically gpt-3.5-turbo) and the ColBERTv2 retriever are used.

The ColBERTv2 retriever is hosted on a free server, housing a search index derived from Wikipedia 2017 “abstracts,” which contain the introductory paragraphs of articles from a 2017 dump.

Below you can see how the Language Model and the Retriever Model are configured within DSPy settings.

import dspy

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

Next, the test data is loaded:

from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

And the signature is created…you can see how the context, input and output fields are defined.

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

The RAG pipeline is created as a DSPy module which will require two methods:

  • The __init__ method will simply declare the sub-modules it needs: dspy.Retrieve and dspy.ChainOfThought. The latter is defined to implement our GenerateAnswer signature.
  • The forward method will describe the control flow of answering the question using the modules we have: Given a question, we’ll search for the top-3 relevant passages and then feed them as context for answer generation.
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Here is an example of the training data:

Example({'question': 'At My Window was released by which American singer-songwriter?', 
'answer': 'John Townes Van Zandt'})
(input_keys={'question'}),

Example({'question': 'which American actor was Candace Kita guest starred with ',
'answer': 'Bill Murray'})
(input_keys={'question'}),

Example({'question': 'Which of these publications was most recently published, Who Put the Bomp or Self?',
'answer': 'Self'})
(input_keys={'question'}),
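Note that the execution snippet below calls compiled_rag, which is produced by compiling the RAG module against the training set. A minimal sketch of that step, following the standard DSPy intro recipe (the exact metric shown is an assumption based on that recipe):

from dspy.teleprompt import BootstrapFewShot

# Validation logic: the predicted answer should match the gold answer and be supported by the retrieved context.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter and compile the RAG program on the training set.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)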

Below the program is executed with a question…

# Ask any question you like to this simple RAG program.
my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

With the response:

Question: What castle did David Gregory inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'David Gregory (mathematician) | David Gregory (originally spelt Gregorie) FRS (? 1659 – 10 October 1708) was a Scottish mathematician and astronomer. He was professor of mathematics at the University ...']

Considering the DSPy implementation, there are a few initial observations:

  • The code is clean and concise.
  • Creating an initial RAG application is straightforward, with enough parameters that can be set.
  • Having a robust data-ingestion pipeline is very convenient, and that will have to be a consideration.
  • The built-in evaluation of the pipeline and retrieval is convenient.
  • I cannot comment on the extensibility and scaling of the RAG framework, and the complexity of building code around the DSPy RAG framework.
  • However, as a quick standalone implementation, it is impressive in its simplicity.
  • Lastly, the graphic below compares the GitHub communities of LangChain, LlamaIndex and DSPy.



Source link

28May

Controllable Agents For RAG With Human In The Loop Chat | by Cobus Greyling | May, 2024


One major hurdle for agent implementations is the issue of observability and steerability.

Agents frequently employ strategies such as chain-of-thought or planning to handle user inquiries, relying on multiple interactions with a Large Language Model (LLM).

Yet, within this iterative approach, monitoring the agent’s inner mechanisms or intervening to correct its trajectory midway through execution proves challenging.

To address this issue, LlamaIndex has introduced a lower-level agent specifically engineered to provide controllable, step-by-step execution on a RAG (Retrieval-Augmented Generation) pipeline.

This demonstration underscores LlamaIndex’s goal of showcasing the heightened control and transparency that the new API brings to managing intricate queries and navigating extensive datasets.

Added to this, introducing agentic capabilities on top of a RAG pipeline can allow you to reason over much more complex questions.

The Human-In-The-Loop chat capability allows for a step-wise approach by a human via a chat interface. While it is possible to ask agents complex questions that demand multiple reasoning steps, such queries can be long-running and can, in some instances, go wrong.
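As a rough sketch of what this step-wise, lower-level agent API looks like in practice, assuming a recent llama-index release with the OpenAI agent integration installed (import paths and method names vary between versions, and query_engine stands in for an existing RAG query engine):

from llama_index.core.tools import QueryEngineTool
from llama_index.agent.openai import OpenAIAgent

# Wrap an existing RAG query engine as a tool the agent can call.
rag_tool = QueryEngineTool.from_defaults(
    query_engine,  # assumed to be built elsewhere over your documents
    name="rag",
    description="Answers questions over the indexed document set",
)
agent = OpenAIAgent.from_tools([rag_tool], verbose=True)

# Step-wise execution: create a task, then advance it one reasoning step at a time,
# so a human can inspect or correct intermediate output before continuing.
task = agent.create_task("Compare the risk factors discussed in the two reports.")
step_output = agent.run_step(task.task_id)
while not step_output.is_last:
    step_output = agent.run_step(task.task_id)

print(agent.finalize_response(task.task_id))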



Source link

27May

Teaching LLMs To Say “I don’t Know” | by Cobus Greyling | May, 2024


Rather than fabricating information when presented with unfamiliar inputs, models should recognise untrained knowledge and express uncertainty or confine their responses within the limits of their knowledge.

This study investigates how Large Language Models (LLMs) generate inaccurate responses when faced with unfamiliar concepts.

The research discovers that LLMs tend to default to hedged predictions for unfamiliar inputs, shaped by the way they were trained on unfamiliar examples.

By adjusting the supervision of these examples, LLMs can be influenced to provide more accurate responses, such as admitting uncertainty by saying “I don’t know”.

Building on these insights, the study introduces a reinforcement learning (RL) approach to reduce hallucinations in long-form text generation tasks, particularly addressing challenges related to reward model hallucinations.

The findings are confirmed through experiments in multiple-choice question answering, as well as tasks involving generating biographies and book/movie plots.

Large language models (LLMs) have a tendency to hallucinate — generating seemingly unpredictable responses that are often factually incorrect. ~ Source

Large language models (LLMs) demonstrate remarkable abilities in in-context learning (ICL), wherein they leverage surrounding text acting as a contextual reference to comprehend and generate responses.

Through continuous exposure to diverse contexts, LLMs adeptly adapt their understanding, maintaining coherence and relevance within ongoing discourse. This adaptability allows them to provide nuanced and contextually appropriate responses, even in complex or evolving situations.

By incorporating information from previous interactions, LLMs enhance their contextual understanding, improving performance in tasks such as conversation, question answering, and text completion. This capability underscores the potential of LLMs to facilitate more natural and engaging interactions across various domains and applications.

LLMs have a tendency to hallucinate.

This behaviour is especially prominent when models are queried on concepts that are scarcely represented in the model’s pre-training corpora; hence, unfamiliar queries.

Instead of hallucinating, models should recognise the limits of their own knowledge, and express their uncertainty or confine their responses within those limits.

The goal is to teach models this behaviour, particularly for long-form generation tasks.

The study introduces a method to enhance the accuracy of long-form text generated by LLMs using reinforcement learning (RL) with cautious reward models.



Source link

26May

HILL: Solving for LLM Hallucination & Slop | by Cobus Greyling | May, 2024


HILL is a prototypical user interface which highlights hallucinations to LLM users, enabling them to assess the factual correctness of an LLM response.

HILL can be described as a user interface for accessing LLM APIs. To some extent, HILL is reminiscent of a practice called grounding. Grounding has been implemented by OpenAI and Cohere, where documents are uploaded; should a user query match uploaded content, a piece of an uploaded document is used as a contextual reference, in a RAG-like fashion. A link to the referenced document is also provided and serves as grounding.

Slop is the new Spam. Slop refers to unwanted generated content, like Google’s Search Generative Experience (SGE), which sits above some search results. As you will see later in the article, HILL will tell users how valuable auto-generated content is. Or if it could be regarded as slop.

HILL is not a generative AI chat UI like HuggingChat, Cohere Coral or ChatGPT…however, I can see a commercial use-case for HILL as a user interface for LLMs.

One can think of HILL as a browser of sorts for LLMs. If search offerings include this type of information by default, there is sure to be immense user interest.

The information supplied by HILL includes:

Confidence Score: the overall score of accuracy or response generation.

Political Spectrum: A score classifying the political spectrum of the answer on a scale between -10 and +10.

Monetary Interest: A score classifying the probability of paid content in the generated response on a scale from 0 to 10.

Hallucination: Identification of the response parts that appear to be correct but are actually false or not based on the input.

Self-Assessment Score: A percentage score between 0 and 100 on how accurate and reliable the generated answer is.

I believe there will be value in a settings option where the user can define their preferences in terms of monetary interests, political spectrum and the like.

The image below shows the UI developed for HILL, highlighting hallucinations to users and enabling them to assess the factual correctness of an LLM response.



Source link

25May

How Would The Architecture For An LLM Agent Platform Look? | by Cobus Greyling | May, 2024


The study sees stage 1 as follows:

Agent Recommender will recommend an Agent Item to a user based on personal needs and preferences. Agent Item engages in a dialogue with the user, subsequently providing information for the user and also acquiring user information.

And as I mentioned, the Agent Recommender can be seen as the agent, and the Agent Items as the actions.

This stage can be seen as a multi-tool agent…

Rec4Agentverse then enables the information exchange between Agent Item and Agent Recommender. For example, Agent Item can transmit the latest preferences of the user back to Agent Recommender. Agent Recommender can give new instructions to Agent Item.

Here is the leap where collaboration is supported amongst Agent Items and the agent recommender orchestrating everything.

There is a market for a no-code to low-code IDE for creating agent tools. Agent tools will be required as the capabilities of the agent expand.

The graphic below from the study shows the Agent Items (which I think of as tools)…

The left portion of the diagram shows three roles in their architecture: user, Agent Recommender, and Agent Item, along with their interconnected relationships.

The right side of the diagram shows that an Agent Recommender can collaborate with Agent Items to affect the information flow of users and offer personalised information services.

What I like about this diagram is that it shows the user / agent recommender layer, the information exchange layer and the information carrier layer, or integration.



Source link

25May

Copy This AI-Powered Automated System For Topic Research (No-Code) | by Hasan Aboul Hasan | May, 2024


Perfect, now that we understand how the system works, let’s set it up!

1- Log in to Your Make Account

If you don’t have an account, just sign up and log in.

2- Install the Content Extractor App

It is very simple, click the button below:

Install Make App

You will see this page:

Click “Install,” and you are done!

3- Clone the Google Sheet

As explained before, the system reads and saves data in a Google Sheet. I prepared the sheet to make it easy for you to get started quickly. Just create a clone of it in your Google account.

Clone Google Sheet

4- Create a Datastore

Before we set up the system, you need to create a data store.

So head to “Datastores” from the right menu and create a new one.

Call it: SERP_RESULTS

Create a new data structure, which means adding fields to the table. Add the following:

  • Field 1: Name: link, Type: Text
  • Field 2: Name: position, Type: Number
  • Field 3: Name: Parent Keyword, Type: Text
  • Field 4: Name: Last Updated, Type: Date
Great! We have our data store. We are ready to create the automated system.

5- Import The System Blueprint

Now, go to “Scenarios,” create a new scenario, then click “Import Blueprint.”

Download Scenario Blueprint

🔴 DON’T FORGET TO EXTRACT THE ZIP FILE FIRST

6- Update the Modules

Now that we have the scenario, the database, and the Google Sheet, we just need to update our app modules to match your accounts.

1- Update the Google Sheets Module Authentication

Please update all the Google Sheet modules and the spreadsheet ID, which can be found in the browser URL.

2- Connect OpenAI Module

Click on the OpenAI App to connect with your account using your API key.

3- Set the Serper API Key

Since we are using the Serper API to get Google organic results, get an API key and set it here.

4- Connect with your Datastore.

Make sure all modules are set up correctly and update the datastore module to match the datastore we created in step 4.

5- Set “Extract Web Content” API Key

To use my app for free, make sure to use this API key: HASAN2024

7- Run a Test

Perfect, we have our system ready!

Let’s give it a try!

If you have any problems, you can join us on the forum; it is free!

I will be there almost every day to help.

If you came to this article from my YouTube video, you know I discussed how these systems can be used to build a business in today’s “AI Era.”

One of the most in-demand products in the digital world today is ready-made systems, or what we call “done-for-you” systems.

Businesses and individuals need plug-and-play systems that help them automate or fix a specific problem.

This system is a great example of a “done-for-you” system that you can sell online. You could also use it as a powerful lead agent.

Now, I’m giving it to you for free, yes. But that doesn’t mean you shouldn’t join my newsletter to get my weekly updates and exclusive tips 😅

Get Weekly Exclusive Tips

Anyway, the idea here is to learn and build such systems. This service will make you stand out in the competition today, as it is still new and not many freelancers know about it.

🟢 As a bonus tip, to make your offering even more unique and provide more value to your customers, and help you turn this into a recurring income business:

You can create custom apps in the workflow you are selling. Like the one I shared with you for free, the “Extract Web Content” app.

Yes, I gave it to you for free, but you can create something similar that makes your system unique and helps get your customers attached to the service you provide.

Do it, and thank me later 😉

How did I build the Make Custom App?

Make allows you to build any custom app you want as soon as you have the API endpoint for it.

So what I did simply was create a very basic API in Python. Here is the code:

from fastapi import APIRouter
from SimplerLLM.tools.generic_loader import load_content

router = APIRouter()

# Extract content from a blog post or web page.
@router.get("/tools/extract-content-from-page")
async def extract_content_from_web_page(url: str):
    return load_content(url)
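To actually serve this route, the router would be mounted on a FastAPI application and run with an ASGI server. A minimal sketch; the host, port and file layout are assumptions:

# main.py: minimal app exposing the extraction route defined above.
import uvicorn
from fastapi import FastAPI

app = FastAPI()
app.include_router(router)  # the APIRouter from the snippet above

if __name__ == "__main__":
    # Make would then call e.g. http://<your-host>:8000/tools/extract-content-from-page?url=...
    uvicorn.run(app, host="0.0.0.0", port=8000)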

I used SimplerLLM, my free Python library.

You can see how easy it is to read content with the built-in functionalities.

And built the app based on that.

👉 You can learn more about APIs and how to build and sell them in this course here.

You can add an API key as I did, and this app can even be sold independently as a custom app for Make that helps build more complex and customized systems!

Other than building custom Make apps, you can extend this system further to provide more functionalities and value to your customers or your own business.

Here are some ideas:

1- Add AI Analysis

It would be a great addition to the system to have AI analyze the final results and suggest tips based on that.

For example, my AI-powered SEO analyzer extracts an SEO report using an API, and then AI analyzes and creates a detailed report based on that data. You can test it here.

Another example is the AI keyword research tool, where I feed the keywords to the AI, and it suggests a content plan and tips based on that data.

You can even go further with an agentic workflow that automates the entire process.

2- Add Keyword Position Tracking

Since we are already extracting organic Google results with the Serper API and saving them in the database with their positions, you can also track the position of specific domains for each target keyword to see if they rank.

3- Add Keyword Metrics

You can also add more metrics to the keywords, such as search volume, keyword difficulty, CPC, and much more. This will enrich the data for AI analysis or your customers.

You can obtain such data from SEO APIs like Spyfu, Semrush, and others.

4- Optimize the Prompts

When you open the OpenAI modules, you will see I added some prompts to extract content ideas from text.

These are not the best or perfect prompts. They do the job, but you can optimize them to get better results.

If you have taken my prompt engineering course, you know that with some prompting techniques, you can optimize the prompts for better results.

Remember, if you have any problems or want to chat, hop into the forum. I’m there almost every day to answer your queries!

Join The Forum

Good Luck!



Source link

25May

Quantization and LLMs: Condensing Models to Manageable Sizes


 

The Scale and Complexity of LLMs

 
The incredible abilities of LLMs are powered by their vast neural networks which are made up of billions of parameters. These parameters are the result of training on extensive text corpora and are fine-tuned to make the models as accurate and versatile as possible. This level of complexity requires significant computational power for processing and storage.

 
 

The accompanying bar graph delineates the number of parameters across different scales of language models. As we move from smaller to larger models, we witness a significant increase in the number of parameters with ‘Small’ language models at the modest millions of parameters and ‘Large’ models with tens of billions of parameters.

However, it is the largest model shown, at 175 billion parameters (the published size of GPT-3; GPT-4’s parameter count has not been disclosed), that dwarfs the other models’ parameter sizes. Not only does it use the most parameters on the graph, but models of this class also power the most recognizable generative AI application, ChatGPT. This towering presence on the graph is representative of frontier LLMs, displaying the requirements needed to power the future’s AI chatbots, as well as the processing power required to support such advanced AI systems.

 

The Cost of Running LLMs and Quantization

 
Deploying and operating complex models can get costly due to their need for specialized hardware, such as high-end GPUs and AI accelerators, whether in the cloud or on-premises, along with continuous energy consumption. Choosing an on-premises solution can save a great deal of money and increase flexibility in hardware choices and freedom to use the system as needed, with a trade-off in maintenance and the need to employ skilled professionals. High costs can make it challenging for small businesses to deploy, train, and power an advanced AI. Here is where quantization comes in handy.

 

What is Quantization?

 
Quantization is a technique that reduces the numerical precision of each parameter in a model, thereby decreasing its memory footprint. This is akin to compressing a high-resolution image to a lower resolution while retaining the essence and most important aspects, but at a reduced data size. This approach enables the deployment of LLMs on less capable hardware without substantial performance loss.

ChatGPT was trained and is deployed using thousands of NVIDIA DGX systems, millions of dollars of hardware, and tens of thousands more for infrastructure. Quantization can enable good proof-of-concept, or even fully fledged deployments with less spectacular (but still high performance) hardware.

In the sections to follow, we will dissect the concept of quantization, its methodologies, and its significance in bridging the gap between the highly resource-intensive nature of LLMs and the practicalities of everyday technology use. The transformative power of LLMs can become a staple in smaller-scale applications, offering vast benefits to a broader audience.

 

Basics of Quantization

 
Quantizing a large language model refers to the process of reducing the precision of numerical values used in the model. In the context of neural networks and deep learning models, including large language models, numerical values are typically represented as floating-point numbers with high precision (e.g., 32-bit or 16-bit floating-point format). Read more about Floating Point Precision here.

Quantization addresses this by converting these high-precision floating-point numbers into lower-precision representations, such as 16-bit or 8-bit integers, making the model more memory-efficient and faster during both training and inference at the cost of some precision. As a result, training and inference require less storage, consume less memory, and can be executed more quickly on hardware that supports lower-precision computations.
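As a toy illustration of the idea (not any particular library’s implementation), affine quantization maps a float tensor onto 8-bit integers using a scale and zero-point, and dequantization approximately recovers the original values:

import numpy as np

# Toy affine (asymmetric) quantization of FP32 weights to int8.
weights = np.random.randn(4, 4).astype(np.float32)

# Map the observed float range [w_min, w_max] onto the int8 range [-128, 127].
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
zero_point = np.round(-128 - w_min / scale)

q_weights = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize to recover an approximation of the original values.
deq_weights = (q_weights.astype(np.float32) - zero_point) * scale
print("max abs error:", np.abs(weights - deq_weights).max())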

 

Types of Quantization

 
To add depth and complexity to the topic, it is critical to understand that quantization can be applied at various stages in the lifecycle of a model’s development and deployment. Each method has its distinct advantages and trade-offs and is selected based on the specific requirements and constraints of the use case.

 

1. Static Quantization

Static quantization is a technique in which the weights and activations are quantized to a lower bit precision ahead of time and applied to all layers. The quantization parameters are computed once, before deployment, and remain fixed throughout inference. Static quantization is a good fit when the memory requirements of the target system are known in advance.

  • Pros of Static Quantization
    • Simplifies deployment planning as the quantization parameters are fixed.
    • Reduces model size, making it more suitable for edge devices and real-time applications.
  • Cons of Static Quantization
    • Performance drops, while predictable, can be larger in places: certain quantized parts may suffer more due to the broad static approach.
    • Limited adaptability to varying input patterns, since the quantized weights remain fixed.

 

2. Dynamic Quantization

Dynamic quantization involves quantizing weights statically, while activations are quantized on the fly during model inference. The weights are quantized ahead of time, and the activations are quantized dynamically as data passes through the network. This means that certain parts of the model are executed at different precisions, as opposed to defaulting to a single fixed quantization.

  • Pros of Dynamic Quantization
    • Balances model compression and runtime efficiency without significant drop in accuracy.
    • Useful for models where activation precision is more critical than weight precision.
  • Cons of Dynamic Quantization
    • Performance improvements aren’t predictable compared to static methods (but this isn’t necessarily a bad thing).
    • Dynamic calculation means more computational overhead and longer training and inference times than the other methods, while still being lighter-weight than running without quantization.
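To make the dynamic approach described above concrete, PyTorch offers a one-line API that stores the weights of selected layer types as int8 and quantizes activations on the fly. A small sketch, using a toy model as a stand-in for an LLM:

import os

import torch
import torch.nn as nn

# Stand-in model; in a real LLM the nn.Linear layers dominate the memory footprint.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 256))

# Dynamic quantization: weights of the listed module types are stored as int8,
# while activations are quantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {size_mb(model):.2f} MB, dynamic int8: {size_mb(quantized):.2f} MB")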

 

3. Post-Training Quantization (PTQ)

In this technique, quantization is applied after the model has been fully trained, without retraining. It involves analyzing the distribution of weights and activations and then mapping these values to a lower bit depth. PTQ is commonly used for deployment on resource-constrained devices like edge devices and mobile phones, and it can be either static or dynamic.

  • Pros of PTQ
    • Can be applied directly to a pre-trained model without the need for retraining.
    • Reduces the model size and decreases memory requirements.
    • Improved inference speeds enabling faster computations during and after deployment.
  • Cons of PTQ
    • Potential loss in model accuracy due to the approximation of weights.
    • Requires careful calibration and fine tuning to mitigate quantization errors.
    • May not be optimal for all types of models, particularly those sensitive to weight precision.

 

4. Quantization Aware Training (QAT)

During training, the model is aware of the quantization operations that will be applied during inference and the parameters are adjusted accordingly. This allows the model to learn to handle quantization induced errors.

  • Pros of QAT
    • Tends to preserve model accuracy compared to PTQ since the model training accounts for quantization errors during training.
    • More robust for models sensitive to precision and is better at inferencing even on lower precisions.
  • Cons of QAT
    • Requires retraining the model resulting in longer training times.
    • More computationally intensive since it incorporates quantization error checking.

 

5. Binary Ternary Quantization

These methods quantize the weights to either two values (binary) or three values (ternary), representing the most extreme form of quantization. Weights are constrained to +1, -1 for binary, or +1, 0, -1 for ternary quantization during or after training. This would drastically reduce the number of possible quantization weight values while still being somewhat dynamic.

  • Pros of Binary Ternary Quantization
    • Maximizes model compression and inferencing speed and has minimal memory requirements.
    • Fast inferencing and quantization calculations enables usefulness on underpowered hardware.
  • Cons of Binary Ternary Quantization
    • High compression and reduced precision results in a significant drop in accuracy.
    • Not suitable for all types of tasks or datasets and struggles with complex tasks.

 

The Benefits & Challenges of Quantization

 
Before and after quantization

The quantization of Large Language Models brings forth multiple operational benefits. Primarily, it achieves a significant reduction in memory requirements: the goal for a post-quantization model is a notably smaller memory footprint. This higher efficiency permits deployment on platforms with more modest memory capabilities, and the reduced processing power needed to run a quantized model translates directly into heightened inference speeds and quicker response times that enhance user experience.

On the other hand, quantization can also introduce some loss in model accuracy, since it involves approximating real numbers. The challenge is to quantize the model without significantly affecting its performance. This can be gauged by testing the model’s accuracy and completion time before and after quantization to measure effectiveness, efficiency, and accuracy.

By optimizing the balance between performance and resource consumption, quantization not only broadens the accessibility of LLMs but also contributes to more sustainable computing practices.
 
Original. Republished with permission.
 
 

Kevin Vu manages Exxact Corp blog and works with many of its talented authors who write about different aspects of Deep Learning.



Source link

18May

Google AI Introduces PaliGemma: A New Family of Vision Language Models 


Google has released a new family of vision language models called PaliGemma. PaliGemma can produce text by receiving an image and a text input. The architecture of the PaliGemma (Github) family of vision-language models consists of the image encoder SigLIP-So400m and the text decoder Gemma-2B. SigLIP is a cutting-edge model that can comprehend both text and visuals; it comprises a jointly trained image and text encoder, similar to CLIP. Like PaLI-3, the combined PaliGemma model can be easily refined on downstream tasks such as captioning or referring segmentation after it has been pre-trained on image-text data. Gemma is a decoder-only text-generation model. By utilizing a linear adapter to integrate Gemma with SigLIP’s image encoder, PaliGemma becomes a potent vision language model.

Big_vision was used as the training codebase for PaliGemma. Using the same codebase, numerous other models, including CapPa, SigLIP, LiT, BiT, and the original ViT, have already been developed. 

The PaliGemma release includes three distinct model types, each offering a unique set of capabilities:

  1. PT checkpoints: These pretrained models are highly adaptable and designed to be fine-tuned to excel in a variety of downstream tasks.
  2. Blend (mix) checkpoints: PT models adjusted for a variety of tasks. They are appropriate for general-purpose inference with free-text prompts and can only be used for research purposes.
  3. FT checkpoints: A collection of refined models, each focused on a distinct academic benchmark. They come in various resolutions and are only meant for research.

The models are available in three distinct precision levels (bfloat16, float16, and float32) and three different resolution levels (224×224, 448×448, and 896×896). Each repository holds the checkpoints for a certain task and resolution, with three revisions for each available precision. The main branch of each repository has float32 checkpoints, while the bfloat16 and float16 revisions hold the matching precisions. It’s important to note that models compatible with the original JAX implementation and with Hugging Face transformers live in different repositories.

The high-resolution models, while offering superior quality, require significantly more memory due to their longer input sequences. This could be a consideration for users with limited resources. However, the quality gain is negligible for most tasks, making the 224 versions a suitable choice for the majority of uses.

PaliGemma is a single-turn visual language model that performs best when tuned to a particular use case. It is not intended for conversational use. This means that while it excels in specific tasks, it may not be the best choice for all applications.

Users can specify the task the model will perform by qualifying it with task prefixes like ‘detect’ or ‘segment’. The pretrained models were trained in a way that gives them a wide range of skills, such as question-answering, captioning, and segmentation. However, instead of being used immediately, they are designed to be fine-tuned to specific tasks using a comparable prompt structure. The ‘mix’ family of models, refined on a variety of tasks, can be used for interactive testing.

Here are some examples of what PaliGemma can do: it can add captions to pictures, respond to questions about images, detect entities in pictures, segment entities within images, and reason and understand documents. These are just a few of its many capabilities.

  • When asked, PaliGemma can add captions to pictures. With the mix checkpoints, users can experiment with different captioning prompts to observe how they react.
  • PaliGemma can respond to a question about an image passed on with it. 
  • PaliGemma may use the detect [entity] prompt to find entities in a picture. The bounding box coordinate location will be printed as unique tokens, where the value is an integer that denotes a normalized coordinate. 
  • When prompted with the segment [entity] prompt, PaliGemma mix checkpoints can also segment entities within an image. Because the team utilizes natural language descriptions to refer to the things of interest, this technique is known as referring expression segmentation. The output is a series of segmentation and location tokens. As previously mentioned, a bounding box is represented by the location tokens. Segmentation masks can be created by processing the segmentation tokens one more time.
  • PaliGemma mix checkpoints are very good at reasoning and understanding documents.
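For reference, here is a rough sketch of single-turn inference with a mix checkpoint through Hugging Face transformers; the model id, image URL and prompt are assumptions, and a recent transformers release plus access to the checkpoint are required:

import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed mix checkpoint id
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; the URL here is a placeholder.
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
prompt = "caption en"  # task prefix; alternatives include "detect cat" or "segment cat"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)

# The decoded string echoes the prompt followed by the generated text.
print(processor.decode(output[0], skip_special_tokens=True))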



Check out the Blog, Model, and Demo. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone’s life easy.






Source link
