01Jun

Can Minor Document Typos Comprehensively Disrupt RAG Retriever & Reader Components? | by Cobus Greyling | May, 2024


Three key findings from the study:

  • The authors point out that RAG systems are vulnerable to minor but frequent textual errors within documents.
  • An attack method called GARAG is proposed, based on a genetic algorithm that searches for adversarial documents.
  • RAG systems are susceptible to noisy documents in real-world databases.

The reader’s ability to accurately ground information significantly depends on the retriever’s capability of sourcing query-relevant documents.

GARAG assesses the holistic robustness of a RAG system against minor textual errors, offering insights into the system’s resilience through iterative adversarial refinement.

This new study contains three main contributions…

  1. Highlighting a vulnerability in RAG systems pertaining to frequent minor textual errors within documents. This evaluation focuses on the retriever and reader components’ functionality.
  2. Introducing GARAG, a straightforward & potent attack strategy leveraging a genetic algorithm to craft adversarial documents capable of exploiting weaknesses in both components of RAG simultaneously.
  3. Through experimentation, demonstrating the detrimental impact of noisy documents on the RAG system within real-world databases.

The results show that typos substantially degrade RAG performance. Even though the retriever partially shields the reader from noisy documents, both components can still be affected by these small perturbations.
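
To make the attack idea concrete, below is a deliberately simplified, hypothetical sketch of a genetic search over typo-perturbed documents. It is not the paper's GARAG implementation: the mutation scheme, fitness function and all names are assumptions for illustration, and the two callables stand in for the retriever and reader of the RAG system under test.

import random

def inject_typo(doc: str) -> str:
    """Mutate a document by swapping two adjacent characters (a minimal 'typo')."""
    if len(doc) < 2:
        return doc
    i = random.randrange(len(doc) - 1)
    return doc[:i] + doc[i + 1] + doc[i] + doc[i + 2:]

def attack_fitness(doc, retriever_score, reader_is_correct):
    """Higher is better for the attacker: a low retrieval score and a wrong answer."""
    return -retriever_score(doc) + (0.0 if reader_is_correct(doc) else 1.0)

def genetic_typo_attack(doc, retriever_score, reader_is_correct,
                        population_size=20, generations=30):
    """Toy genetic search for an adversarial, typo-perturbed variant of `doc`."""
    def fitness(d):
        return attack_fitness(d, retriever_score, reader_is_correct)

    population = [inject_typo(doc) for _ in range(population_size)]
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)[:population_size // 2]
        children = [inject_typo(random.choice(survivors))
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)  # most adversarial variant found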

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn



Source link

31May

RAGTruth


In essence, RAGTruth is a large-scale corpus of naturally generated hallucinations, featuring detailed word-level annotations specifically designed for retrieval-augmented generation (RAG) scenarios.

Introduction

RAGTruth is a good illustration of how a hand-crafted dataset can be created and then used, via fine-tuning, to give an LLM superior performance at identifying hallucinations.

This study again illustrates the growing trend of data design. Data design can be described as the process of designing and structuring data not to augment the knowledge-intensive nature of the model, but rather to imbue the model with a specific, customised behaviour.

In this case that behaviour is optimising the model for a RAG implementation. In other scenarios the same approach is used to enhance the reasoning capabilities of a model, and so forth.

This study again underscores the importance of a data centric approach, where data discovery, data design and data development are central to the AI implementation.

The key is to have a highly flexible and intuitive data discovery and design UI, where annotators are enabled and augmented by AI to improve speed and accuracy.

Four Categories Of Hallucination

The study is premised on four categories of hallucination…

Evident Conflict

Evident Conflict describes cases where generated content directly contradicts or opposes the provided information. These conflicts are easily verifiable without extensive context, often involving clear factual errors, misspelled names, incorrect numbers, etc.

Subtle Conflict

Subtle Conflict describes cases where generated content departs or diverges from the provided information, altering the intended contextual meaning.

These conflicts often involve substitution of terms that carry different implications or severity, requiring a deeper understanding of their contextual applications.

Evident Introduction Of Baseless Information

This occurs when generated content includes information not substantiated by the provided information. It involves the creation of hypothetical, fabricated, or hallucinatory details lacking evidence or support.

Subtle Introduction Of Baseless Information

This occurs when generated content extends beyond the provided information by incorporating inferred details, insights, or sentiments. This additional information lacks verifiability and might include subjective assumptions or commonly observed norms rather than explicit facts.

One key challenge is the lack of high-quality, large-scale datasets specifically designed for hallucination detection.

Data Gathering Pipeline

The image below shows the data gathering pipeline. The first of two steps is the response generation portion. Responses are generated with multiple LLMs using natural prompts.

Step two is where human annotation is made use of; the human labeller annotates hallucinated spans in the LLM responses.

The occasional generation of outputs that appear plausible but are factually incorrect significantly undermines the reliability of LLMs in real-world scenarios.

Three Most Used RAG Tasks

When selecting tasks and data sources, the researchers identified the three most widely used RAG tasks:

  • Question Answering,
  • Data-to-Text Writing, &
  • News Summarisation.

To simplify the annotation process, the researchers selected only questions related to daily life and retained only three retrieved passages for each question. They then prompted the LLMs to generate answers based solely on these retrieved passages.

For the data-to-text writing task, the researchers prompted LLMs to generate an objective overview for a randomly sampled business in the restaurant and nightlife categories from the Yelp Open Dataset (Yelp, 2021).

In this dataset, information about a business is represented using structured data. To streamline the annotation process, the study focused on the following business information fields: BusinessParking, Restaurant Reservations, OutdoorSeating, WiFi, RestaurantsTakeOut, RestaurantsGoodForGroups, Music, and Ambience.

Additionally, to enrich the context information, they included up to three business-related user reviews. In the prompt, this information is represented in JSON format.
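
As an illustration of that prompt context, it might look roughly like the following. The field names follow the list above; the business, values and review texts are invented for illustration and are not drawn from the Yelp Open Dataset.

import json

# Hypothetical context object for one sampled business (illustrative values only).
business_context = {
    "name": "Example Bistro",
    "BusinessParking": {"street": True, "lot": False},
    "Restaurant Reservations": True,
    "OutdoorSeating": True,
    "WiFi": "free",
    "RestaurantsTakeOut": True,
    "RestaurantsGoodForGroups": False,
    "Music": {"live": False, "background_music": True},
    "Ambience": {"casual": True, "romantic": False},
    "reviews": [
        "Great patio and quick service.",
        "The free Wi-Fi was handy for working over lunch.",
        "A bit cramped for larger groups.",
    ],
}

prompt_context = json.dumps(business_context, indent=2)  # what the LLM sees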

Human Annotation

Identifying AI-generated hallucinations is a challenging task that demands strong critical thinking skills to understand the logical flow of various texts and meticulous attention to detail to spot subtle inaccuracies and inconsistencies.

Additionally, a certain level of media literacy and knowledge of current affairs is crucial for understanding subjects discussed in news-related sample data.

Therefore, the researchers selected annotators who are proficient in English and possess a bachelor’s degree in English, Communications, or relevant fields to ensure the accuracy and reliability of the annotation results. The annotators were invited to perform annotation tasks using Label Studio.

Each labelling task was presented on one page and included the following components:

  1. The context provided to the LLM
  2. A set of six responses generated by different AI models.

The image below shows the annotation interface; notice the business info on the left, the model answers in the middle, and so on.

Their task was to annotate specific spans of the generated text that contained hallucinated information and categorize them into four types.

To ensure the quality of the annotations, each response was independently labeled by two annotators.

The consistency rate between the two annotators was 91.8% at the response level and 78.8% at the span level. In cases where there was a significant difference between the two annotations, a third review was conducted.
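
To give a sense of what this span-level annotation produces, a single annotated response might be stored roughly as follows. The schema and values are my own illustration, not the actual RAGTruth release format.

# Hypothetical record for one annotated LLM response (illustrative schema only).
annotated_response = {
    "task_type": "Data-to-Text Writing",
    "model": "model_a",                     # which of the six LLMs produced it
    "response": "The bistro offers free WiFi and has won several awards.",
    "hallucination_spans": [
        {
            "start": 32,                    # character offsets, end-exclusive
            "end": 54,
            "text": "has won several awards",
            "label": "Evident Introduction Of Baseless Information",
        }
    ],
}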

Finally

In essence, RAGTruth is a large-scale corpus of naturally generated hallucinations, featuring detailed word-level annotations tailored for retrieval-augmented generation (RAG) scenarios.

This work also analysed the interplay between hallucinations and various factors, such as:

  1. Task types,
  2. Models being used, &
  3. Contextual settings.

Benchmarks of several hallucination detection approaches were created using this new corpus.

The study demonstrates that a Llama model fine-tuned on RAGTruth achieves competitive performance.

This suggests that using a high-quality dataset like RAGTruth can enable the development of specialised hallucination detection models that outperform prompt-based methods utilising general models such as GPT-4.

The findings also revealed that identifying hallucinations in RAG contexts, particularly at the span level, remains a formidable challenge, with current methods still falling short of reliable detection.

Meticulous manual annotations were performed at both the individual-case and word levels.






Source link

31May

DSPy & The Principle Of Assertions | by Cobus Greyling | May, 2024


The principle of Language Model (LM) Assertions is implemented in the DSPy programming framework.

The objective is to make programs more steerable, reliable and accurate by providing a framework for guiding and constraining LLM output.

According to the study, across four different text generation tasks LM Assertions not only helped generative AI applications follow rules better but also improved task results, meeting constraints up to 164% more often and generating up to 37% better responses.

When an assertion constraint fails, the pipeline can back-track and retry the failing module. LM Assertions provide feedback on retry attempts; they inject erring outputs and error messages to the prompt to introspectively self-refine outputs.

There are two types of assertions, hard and soft.

Hard Assertions represent critical conditions that, when violated after a maximum number of retries, cause the LM pipeline to halt, if so defined, signalling a non-negotiable breach of requirements.

On the other hand, suggestions denote desirable but non-essential properties; their violation triggers the self-refinement process, but exceeding a maximum number of retries does not halt the pipeline. Instead, the pipeline continues to execute the next module.

DSPy Assert

The use of dspy.Assert is recommended during the development stage as a checker or scanner to ensure the LM behaves as expected; it is thus a very descriptive way of identifying and addressing errors early in the development cycle.

Below is a basic example of how to formulate an Assert and a Suggest.

dspy.Assert(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")

dspy.Suggest(your_validation_fn(model_outputs), "your feedback message", target_module="YourDSPyModuleSignature")

When the assertion criterion is not met, resulting in a failure, dspy.Assert triggers a sophisticated retry mechanism that allows the pipeline to adjust; the program or pipeline is therefore not necessarily terminated.

On an Assert failing, the pipeline transitions to a special retry state, allowing it to reattempt a failing LM call while being aware of its previous attempts and the error message raised.

After a maximum number of self-refinement attempts, if the assertion still fails, the pipeline transitions to an error state and raises an AssertionError, terminating the pipeline.

This enables Assert to be much more powerful than conventional assert statements, leveraging the LM to conduct retries and adjustments before concluding that an error is irrecoverable.

DSPy Suggest

dspy.Suggest statements are best utilised as helpers during the evaluation phase, offering guidance and potential corrections without halting the pipeline.

dspy.Suggest(len(unfaithful_pairs) == 0, f"Make sure your output is based on the following context: '{context}'.", target_module=GenerateCitedParagraph)

In contrast to asserts, suggest statements provide gentler recommendations rather than strict enforcement of conditions.

These suggestions guide the LM pipeline towards desired outcomes in specific domains. If a Suggest condition isn’t met, similar to Assert, the pipeline enters a special retry state, enabling retries of the LM call and self-refinement.

However, if the suggestion consistently fails after multiple attempts at self-refinement, the pipeline logs a warning message called SuggestionError and continues execution.

This flexibility allows the pipeline to adapt its behaviour based on suggestions while remaining resilient to less-than-optimal states or heuristic computational checks.
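
Pulling these pieces together, a minimal DSPy module using both constructs might look like the sketch below. The signature, validation logic and question are my own; the dspy.Assert and dspy.Suggest calls follow the pattern shown above, and activate_assertions() reflects my understanding of how assertions are switched on in DSPy, so treat the exact activation API as an assumption.

import dspy

# Assumes an LM has already been configured via dspy.settings.configure(lm=...).

class ShortAnswer(dspy.Signature):
    """Answer the question concisely."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="short answer, under 100 characters")

class CheckedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(ShortAnswer)

    def forward(self, question):
        pred = self.generate(question=question)
        # Hard constraint: an empty answer is never acceptable.
        dspy.Assert(len(pred.answer.strip()) > 0,
                    "The answer must not be empty.",
                    target_module=ShortAnswer)
        # Soft constraint: prefer short answers, but continue if retries fail.
        dspy.Suggest(len(pred.answer) <= 100,
                     "Keep the answer under 100 characters.",
                     target_module=ShortAnswer)
        return pred

qa = CheckedQA().activate_assertions()  # enable backtracking and self-refinement
print(qa(question="What does RAG stand for?").answer)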

LM Assertions, a novel programming construct designed to enforce user-specified properties on LM outputs within a pipeline. — Source

Considering the image below, two Suggests are made, as opposed to Asserts. The first suggests a maximum query length, and the second that the query be distinct from previous queries.

dspy.Suggest(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
)

dspy.Suggest(
    validate_query_distinction_local(prev_queries, query),
    "Query should be distinct from: "
    + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
)
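
The snippet above references validate_query_distinction_local without showing it. A plausible implementation, which is my own guess rather than the study's code, would simply check that the new query is not a near-duplicate of any previous one:

def validate_query_distinction_local(prev_queries, query, threshold=0.8):
    """Return False if `query` is (near-)identical to a previously issued query.

    Hypothetical helper: overlap is measured with a crude word-level Jaccard
    similarity; the study may use a different notion of distinctness.
    """
    query_words = set(query.lower().split())
    for prev in prev_queries:
        prev_words = set(prev.lower().split())
        union = query_words | prev_words
        overlap = len(query_words & prev_words) / len(union) if union else 1.0
        if overlap >= threshold:
            return False
    return True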

I believe much can be gleaned from this implementation of guardrails…

  1. The guardrails can be described in natural language and the LLM can be leveraged to self-check its responses.
  2. More complicated statements can be created in Python, where values are passed to perform checks.
  3. The freedom in describing the guardrails lends a high degree of flexibility to what can be set for specific implementations.
  4. The division between assertions and suggestions is beneficial, as it allows for a clearer delineation of checks.
  5. Additionally, the ability to define recourse adds another layer of flexibility and control to the process.
  6. The study’s language primarily revolves around constraining the LLM and defining runtime retry semantics.
  7. This approach also serves as an abstraction layer that brings self-refinement methods to arbitrary steps in a pipeline.




Source link

30May

Comparing LLM Agents to Chains: Differences, Advantages & Disadvantages | by Cobus Greyling | May, 2024


RPA Approach

Prompt chaining can be utilised in Robotic Process Automation (RPA) implementations. In the context of RPA, prompt chaining can involve a series of prompts given to an AI model or a bot to guide it through the steps of a particular task or process automation.

By incorporating prompt chaining into RPA implementations, organisations can enhance the effectiveness, adaptability, and transparency of their automation efforts, ultimately improving efficiency and reducing operational costs.

Human In The Loop

Prompt chaining is ideal for involving humans: by default, chains form a dialog-turn-based conversational UI where the dialog or flow is moved forward based on user input.

There are instances where chains do not depend on user input, and these implementations are normally referred to as prompt pipelines.

Agents can also have a tool for human interaction. A Human-In-The-Loop (HITL) tool is ideal when the agent reaches a point where its existing tools do not suffice for the query; the agent can then use it to reach out to a human for input.

Managing Cost

Managing costs is more feasible with a chained approach compared to an agent approach. One method to mitigate cost barriers is by self-hosting the core LLM infrastructure, reducing the significance of the number of requests made to the LLM in terms of cost.

Optimising Latency

Optimising latency through self-hosted local LLMs involves hosting the language model infrastructure locally, which reduces the time it takes for data to travel between the user’s system and the model. This localisation minimises network latency, resulting in faster response times and improved overall performance.

LLM Choose Action Sequence

LLMs can choose action sequences for agents by employing a sequence generation mechanism. This involves the LLM generating a series of actions based on the input provided to it. These actions can be determined through a variety of methods such as reinforcement learning, supervised learning, or rule-based approaches.
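
As a rough, framework-agnostic illustration of this difference, the sketch below contrasts a fixed chain with an agent loop in which the LLM chooses the next action at run time. Everything here, including the llm callable, tool names and stop convention, is hypothetical.

def run_chain(llm, question):
    """Chain: the sequence of prompts is fixed at design time."""
    summary = llm(f"Summarise the question: {question}")
    return llm(f"Answer the question using this summary: {summary}")

def run_agent(llm, question, tools, max_steps=5):
    """Agent: the LLM chooses the next action (tool) at every step."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        decision = llm(
            scratchpad
            + f"Choose the next action as '<tool>: <input>' from {list(tools)}, "
            + "or reply 'FINISH: <answer>'."
        )
        if decision.startswith("FINISH:"):
            return decision[len("FINISH:"):].strip()
        tool_name, _, tool_input = decision.partition(":")
        observation = tools[tool_name.strip()](tool_input.strip())
        scratchpad += f"Action: {decision}\nObservation: {observation}\n"
    return "No answer within the step budget."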

Seamless Tool Introduction

With autonomous agents, agent tools can be introduced seamlessly to update and enhance the agent capabilities.

Design Canvas Approach

A prompt chaining design canvas IDE (Integrated Development Environment) would provide a visual interface for creating, editing, and managing prompt chains. Here’s a conceptual outline of what features such an IDE might include: Visual Prompt Editor, Prompt Library, Connection Management, Variable Management, Preview and Testing, etc.

Overall, a prompt chaining design canvas IDE would provide a user-friendly environment for designing, implementing, and managing complex conversational flows using a visual, intuitive interface.

No/Low-Code IDEs

Agents are typically pro-code in their development, whereas chains mostly follow a design canvas approach.

Agents often involve a pro-code development approach, where developers write and customise code to define the behaviour and functionality of the agents. Conversely, chains typically follow a design canvas approach, where users design workflows or sequences of actions visually using a graphical interface or canvas. This visual approach simplifies the creation and modification of processes, making it more accessible to users without extensive coding expertise.

I need to add that there are agent IDEs like FlowiseAI, LangFlow, Stack and others.




Source link

30May

Does the EU’s MiFIR Review make single-name credit default swaps transparent enough? – European Law Blog


Blogpost 29/2024

Regulation 2024/791 (“MiFIR Review”) was published in the Official Journal of the European Union on 8 March 2024. This newly adopted legislation requires single-name credit default swaps (CDSs) to be made subject to transparency rules, however only if they reference global systemically important banks (G-SIBs) or an index comprised of such banks.

In this blog post, I discuss the suitability of the revised transparency requirements for single-name CDSs of the MiFIR Review. On the one hand, it seems that the new requirements are limited in scope, as any reference entity that is not a G-SIB will not be majorly impacted (see, in more detail, my recent working paper). Indeed, CDSs referencing G-SIBs represent only a small fraction of the market: i.e., 8.36% based on the total notional amount traded and 5.68% based on the number of transactions (source: DTCC). It follows that a substantial percentage of the single-name CDS market will not be captured. On the other hand, this post cautions against creating even more far-reaching transparency requirements than those provided for in the MiFIR Review: more transparency could, in practice, be detrimental for financial markets as it could result in higher trade execution costs and volatility and could even discourage dealers from providing liquidity.

 

Single-name credit default swaps and why they are opaque.

CDSs are financial derivative contracts between two counterparties to ‘swap’ or transfer the risk of default of a borrowing reference entity (i.e., a corporation, bank, or sovereign entity). The buyer of the CDS – also called the ‘protection buyer’ – needs to make a series of payments to the protection seller until the maturity date of the financial instrument, while the seller of the CDS is contractually bound to pay the buyer a compensation in the event of, for example, a debt default of the reference entity. Single-name CDSs are mostly traded in the over-the-counter derivatives markets, typically on confidential, decentralized systems. A disadvantage, however, of over-the-counter derivative markets is that they are typically opaque, in contrast with, for example, listed financial instruments.

Over-the-counter derivative markets have very limited access to pre-trade information (i.e., information such as the bid-ask quotes and order book information before the buy or sell orders are executed) and post-trade information (i.e., data such as prices, volumes, and the notional amount after the trade took place).

In March 2023, three small-to-mid-size US banks (i.e. Silicon Valley Bank, Silvergate Bank, and Signature Bank) ran into financial difficulties, with spillovers to Europe where Credit Suisse needed to be taken over by UBS. During this financial turmoil, the CDSs of EU banks rose considerably in terms of price and volume. For Deutsche Bank, there were more than 270 CDS transactions for a total of USD 1.1 billion in the week following UBS’s takeover of Credit Suisse. This represented a more than four-fold increase in trade count and a doubling in notional value compared with average volumes of the first ten weeks of the year. The CDS market is notably illiquid, with only a few transactions a day for a particular reference entity, so this increase in trading volumes was exceptional. On 28 March 2023, the press reported that regulators had identified that a single CDS transaction referencing Deutsche Bank’s debt of roughly EUR 5 million, conducted on 23 March 2023, could have fuelled the dramatic sell-off of equity on 24 March 2023, causing Deutsche Bank’s share price to drop by more than 14 percent.

One of the conclusions drawn by regulators, such as the European Securities and Markets Authority (ESMA), on the 24 March event was that the single-name CDS market is opaque (i.e., very limited pre-trade and post-trade market information), and consequently, subject to a high degree of uncertainty and speculation as to the actual trading activity and its drivers.

The Depository Trust and Clearing Corporation (DTCC) indeed provides post-trade CDS information, but the level of transparency is not very high, given that only aggregated weekly volumes are provided rather than individual prices. Furthermore, only information for the top active instruments are disclosed rather than for all traded instruments. Regarding pre-trade information, trading is conducted mostly through bilateral communication between dealers, who might directly contact a broker to trade or use a trading platform to enter anonymously non-firm quotes. However, even when screen prices are available, they are only indicative, and most dealers will not stand behind their pre-trade indicated price because the actual price the dealer will transact with is entirely subject to bilateral negotiations conducted over the phone or via some electronic exchange. Dealers are free to change the price until the moment the trade is mutually closed. The end-users are thus dependent on their dealers and sometimes do not even have access to the pre-trade information because they have to rely on third-party vendors and services that aggregate data. End-users do not know before the trade which price offered by dealers is the best one and do not know which other parties are willing to pay or to sell at, nor do they have comparable real-time prices against which to compare the price of their particular trade.

 

New transparency requirements in the MiFIR Review

On 25 November 2021, the European Commission published a proposal to amend Regulation No 600/2014 on markets in financial instruments (MiFIR) as regards enhancing market data transparency, removing obstacles to the emergence of a consolidated tape, optimizing trading obligations, and prohibiting receiving payments for forwarding client orders. This initiative was one of a series of measures to implement the Capital Markets Union (CMU) in Europe to empower investors – in particular, smaller and retail investors – by enabling them to better access market data and by making EU market infrastructures more robust. To foster a true and efficient single market for trading, the Commission was of the view that the transparency and availability of market data had to be improved.

The proposal implemented the view of ESMA that the transparency regime that was in place earlier was too complicated and not always effective in ensuring transparency for market participants. For single-name CDSs, the large majority of CDSs are indeed traded over the counter where the level of pre-trade transparency is low. This is because pre-trade requirements only apply to market operators and investment firms operating trading venues. Even for CDSs traded on a trading venue, there is a possibility to obtain a waiver as they do not fall under the trading obligation and are considered illiquid financial instruments. Because of their illiquidity, the large majority of listed single-name CDSs can also benefit from post-trade deferrals where information could even be disclosed only after four weeks.

Regulation (EU) 2024/791 (“MiFIR Review”) was finally approved on 28 February 2024 and entered into force on 28 March 2024. Article 8a of the MiFIR Review now requires, as a pre-trade transparency requirement, that when applying a central limit order book or a periodic auction trading system, market operators and investment firms operating a multilateral trading facility or organized trading facility have to make public the current bid and offer prices, and the depth of trading interest at those prices, for single-name CDSs that reference a G-SIB and that are centrally cleared. A similar requirement now exists for CDSs that reference an index comprising global systemically important banks and that are centrally cleared. Hence, under the new MiFIR Review, CDSs referencing G-SIBs are subject to transparency requirements only when they are centrally cleared. Such CDSs are, however, not subject to any clearing obligation provided for in the European Market Infrastructure Regulation (Regulation No 648/2012, EMIR). This means that data on single-name CDSs referencing G-SIBs that are not cleared, or CDSs referencing other entities, do not need to be made transparent.

Regarding post-trade transparency, Article 10 of the MiFIR Review requires that market operators and investment firms operating a trading venue have to make public the price, volume, and time of the transactions executed in respect of bonds, structured finance products, and emission allowances traded on a trading venue. For the transactions executed in respect of exchange-traded derivatives and the over-the-counter derivatives referred to in the pre-trade transparency requirements (see above), the information has to be made available as close to real-time as technically possible. The EU co-legislators are further of the view that the duration of deferrals has to be determined utilizing regulatory technical standards, based on the size of the transaction and liquidity of the class of derivatives. Article 11 of the MiFIR Review states that the arrangements for deferred publication will have to be organized by five categories of transactions related to a class of exchange-traded derivatives or of over-the-counter derivatives referred to in the pre-trade transparency requirements. ESMA will thus need to determine which classes are considered liquid or illiquid, and above which size of transaction and for which duration it should be possible to defer the publication of details of the transaction.

Besides the pre- and post-trade transparency requirements for market operators and investment firms operating a trading venue, the MiFIR Review also focuses on the design and implementation of a consolidated tape. This consolidated tape is a centralized database meant to provide a comprehensive overview of market data, namely on prices and volumes of securities traded throughout the Union across a multitude of trading venues. According to Article 22a, trade repositories and Approved Publication Arrangements (APAs) will need to provide data to the consolidated tape provider (CTP). The MiFIR Review is then also more specific on the information that has to be made public by an APA concerning over-the-counter derivatives, which will flow into the consolidated tapes. Where Articles 8, 10 and 11 of MiFIR before referred to ‘derivatives traded on a trading venue’, the MiFIR Review no longer uses this wording with respect to derivatives and refers to ‘OTC derivatives as referred to in Article 8a’, being those subject to the pre-trade transparency requirements. This incorporates again those single-name CDSs that reference a G-SIB and that are centrally cleared, or CDSs that reference an index comprising G-SIBs and that are centrally cleared. Similarly as for the pre-trade and post-trade transparency, data on single-name CDSs referencing G-SIBS that are not cleared or CDSs referencing other reference entities do not need to be made transparent.

 

Do we want even more transparency?

The MiFIR Review’s revised transparency requirements for single-name CDSs are not very far-reaching, given that CDSs referencing entities that are not a G-SIB are not majorly impacted. Given that CDSs referencing G-SIBs represent only a small fraction of the market (see introduction above), a substantial percentage of CDSs is not captured by the MiFIR Review. In addition, single-name CDSs referencing G-SIBs that are not centrally cleared are also not affected. As there is no clearing obligation on CDSs because they are not sufficiently liquid, a large fraction will not be impacted or can continue to benefit from pre-trade transparency waivers or post-trade deferrals. This means that a large fraction of the entire CDS market will not be affected by the MiFIR Review.

Nevertheless, I argue that even more severe transparency requirements than those foreseen by the MiFIR Review might not necessarily be beneficial for financial markets. Too much transparency can be detrimental to financial markets as it might result in higher trade execution costs and volatility and could even discourage dealers from providing liquidity. In a market, in which there are few buyers and sellers ready and willing to trade continuously, asking for more transparency could lead to even less liquidity as the limited number of liquidity providers would be obliged to make their trading strategies available, giving incentives to trade even less. A total lack of transparency might thus be undesirable to avoid market manipulation or from an investor protection point of view, but full transparency on an illiquid CDS market might dissuade traders even more from trading. The EU’s newly adopted MiFIR Review thus seems to strike an appropriate balance between reducing the level of opaqueness while not harming liquidity.



Source link

29May

Using DSPy For A RAG Implementation | by Cobus Greyling | May, 2024


In this notebook, GPT-3.5 (specifically gpt-3.5-turbo) and the ColBERTv2 retriever are used.

The ColBERTv2 retriever is hosted on a free server, housing a search index derived from Wikipedia 2017 “abstracts,” which contain the introductory paragraphs of articles from a 2017 dump.

Below you can see how the Language Model and the Retriever Model are configured within DSPy settings.

import dspy

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

Next, the HotPotQA dataset is loaded:

from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

Next the signature is created; you can see how the input fields (context and question) and the output field (answer) are defined.

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

The RAG pipeline is created as a DSPy module which will require two methods:

  • The __init__ method will simply declare the sub-modules it needs: dspy.Retrieve and dspy.ChainOfThought. The latter is defined to implement our GenerateAnswer signature.
  • The forward method will describe the control flow of answering the question using the modules we have: Given a question, we’ll search for the top-3 relevant passages and then feed them as context for answer generation.
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Here are a few examples from the training data:

Example({'question': 'At My Window was released by which American singer-songwriter?', 
'answer': 'John Townes Van Zandt'})
(input_keys={'question'}),

Example({'question': 'which American actor was Candace Kita guest starred with ',
'answer': 'Bill Murray'})
(input_keys={'question'}),

Example({'question': 'Which of these publications was most recently published, Who Put the Bomp or Self?',
'answer': 'Self'})
(input_keys={'question'}),
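
Note that the next snippet calls compiled_rag, which the excerpts above never define. In the DSPy intro notebook this comes from compiling the RAG module with a teleprompter; a sketch of that missing step follows, where BootstrapFewShot and the metric helpers are drawn from the DSPy library as I understand it, not from this article.

from dspy.teleprompt import BootstrapFewShot

# Metric: the predicted answer must match the gold answer and be grounded in
# the retrieved passages.
def validate_context_and_answer(example, pred, trace=None):
    answer_em = dspy.evaluate.answer_exact_match(example, pred)
    answer_pm = dspy.evaluate.answer_passage_match(example, pred)
    return answer_em and answer_pm

# Compile the RAG module by bootstrapping few-shot demonstrations from the trainset.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)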

Below the program is executed with a question…

# Ask any question you like to this simple RAG program.
my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

With the response:

Question: What castle did David Gregory inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'David Gregory (mathematician) | David Gregory (originally spelt Gregorie) FRS (? 1659 – 10 October 1708) was a Scottish mathematician and astronomer. He was professor of mathematics at the University ...']
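
The compiled program can then be scored with DSPy's built-in evaluation utilities, roughly as in the intro notebook; treat the exact arguments below as assumptions.

from dspy.evaluate.evaluate import Evaluate

# Evaluate the compiled RAG program on the dev set with an exact-match metric.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1,
                                display_progress=True, display_table=5)
evaluate_on_hotpotqa(compiled_rag, metric=dspy.evaluate.answer_exact_match)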

Considering the DSPy implementation, there are a few initial observations:

  • The code is clean and concise.
  • Creating an initial RAG application is straightforward, with enough parameters that can be set.
  • Having a robust data ingestion pipeline is very convenient and that will have to be a consideration.
  • The built-in evaluation of the pipeline and retrieval (shown above) is convenient.
  • I cannot comment on the extensibility and scaling of the RAG framework, and the complexity of building code around the DSPy RAG framework.
  • However, as a quick standalone implementation, it is impressive in its simplicity.
  • Lastly, the graphic below compares the GitHub communities of LangChain, LlamaIndex and DSPy.



Source link

28May

Controllable Agents For RAG With Human In The Loop Chat | by Cobus Greyling | May, 2024


One major hurdle for agent implementations is the issue of observability and steerability.

Agents frequently employ strategies such as chain-of-thought or planning to handle user inquiries, relying on multiple interactions with a Large Language Model (LLM).

Yet, within this iterative approach, monitoring the agent’s inner mechanisms or intervening to correct its trajectory midway through execution proves challenging.

To address this issue, LlamaIndex has introduced a lower-level agent specifically engineered to provide controllable, step-by-step execution on a RAG (Retrieval-Augmented Generation) pipeline.

This demonstration underscores LlamaIndex’s goal of showcasing the heightened control and transparency that the new API brings to managing intricate queries and navigating extensive datasets.

Added to this, introducing agentic capabilities on top of a RAG pipeline can allow you to reason over much more complex questions.

The Human-In-The-Loop chat capability allows a human to take a step-wise approach via a chat interface. While it is possible to ask agents complex questions that demand multiple reasoning steps, these queries can be long-running and can in some instances go wrong.
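
A minimal sketch of this step-wise, human-steerable execution, roughly following LlamaIndex's lower-level agent API as I understand it; the tool, query engine and question are placeholders, and the exact class and method names should be treated as assumptions.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Assumes `query_engine` is an existing RAG query engine over your index.
rag_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="docs",
    description="Answers questions over the indexed documents.",
)

agent = ReActAgent.from_tools([rag_tool], verbose=True)

# Create a task, then advance it one step at a time. Between steps a human can
# inspect the intermediate reasoning and stop or redirect the agent.
task = agent.create_task("Compare the two approaches described in the report.")
step_output = agent.run_step(task.task_id)
while not step_output.is_last:
    step_output = agent.run_step(task.task_id)

response = agent.finalize_response(task.task_id)
print(response)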



Source link

27May

Teaching LLMs To Say “I don’t Know” | by Cobus Greyling | May, 2024


Rather than fabricating information when presented with unfamiliar inputs, models should recognise untrained knowledge and express uncertainty, or confine their responses within the limits of their knowledge.

This study investigates how Large Language Models (LLMs) generate inaccurate responses when faced with unfamiliar concepts.

The research discovers that LLMs tend to default to hedged predictions for unfamiliar inputs, shaped by the way they were trained on unfamiliar examples.

By adjusting the supervision of these examples, LLMs can be influenced to provide more accurate responses, such as admitting uncertainty by saying “I don’t know”.

Building on these insights, the study introduces a reinforcement learning (RL) approach to reduce hallucinations in long-form text generation tasks, particularly addressing challenges related to reward model hallucinations.

The findings are confirmed through experiments in multiple-choice question answering, as well as tasks involving generating biographies and book/movie plots.

Large language models (LLMs) have a tendency to hallucinate — generating seemingly unpredictable responses that are often factually incorrect. ~ Source

Large language models (LLMs) demonstrate remarkable abilities in in-context learning (ICL), wherein they leverage surrounding text acting as a contextual reference to comprehend and generate responses.

Through continuous exposure to diverse contexts, LLMs adeptly adapt their understanding, maintaining coherence and relevance within ongoing discourse. This adaptability allows them to provide nuanced and contextually appropriate responses, even in complex or evolving situations.

By incorporating information from previous interactions, LLMs enhance their contextual understanding, improving performance in tasks such as conversation, question answering, and text completion. This capability underscores the potential of LLMs to facilitate more natural and engaging interactions across various domains and applications.

LLMs have a tendency to hallucinate.

This behaviour is especially prominent when models are queried on concepts that are scarcely represented in the model’s pre-training corpora; hence, unfamiliar queries.

Instead of hallucinating, models should recognise the limits of their own knowledge, and express their uncertainty or confine their responses within those limits.

The goal is to teach models this behaviour, particularly for long-form generation tasks.

The study introduces a method to enhance the accuracy of long-form text generated by LLMs using reinforcement learning (RL) with cautious reward models.



Source link

26May

HILL: Solving for LLM Hallucination & Slop | by Cobus Greyling | May, 2024


HILL is a prototypical user interface which highlights hallucinations to LLM users, enabling them to assess the factual correctness of an LLM response.

HILL can be described as a user interface for accessing LLM APIs. To some extent HILL is reminiscent of a practice called grounding. Grounding has been implemented by OpenAI and Cohere, where documents are uploaded. Should a user query match uploaded content, a piece of an uploaded document is used as contextual reference, in a RAG-like fashion. A link to the referenced document is also provided and serves as grounding.

Slop is the new spam. Slop refers to unwanted generated content, like Google’s Search Generative Experience (SGE), which sits above some search results. As you will see later in the article, HILL will tell users how valuable auto-generated content is, or whether it should be regarded as slop.

HILL is not a generative AI chat UI like HuggingChat, Cohere Coral or ChatGPT…however, I can see a commercial use-case for HILL as a user interface for LLMs.

One can think of HILL as a browser of sorts for LLMs. If search offerings include this type of information by default, there is sure to be immense user interest.

The information supplied by HILL includes:

Confidence Score: the overall accuracy score of the generated response.

Political Spectrum: A score classifying the political spectrum of the answer on a scale between -10 and +10.

Monetary Interest: A score classifying the probability of paid content in the generated response on a scale from 0 to 10.

Hallucination: Identification of the response parts that appear to be correct but are actually false or not based on the input.

Self-Assessment Score: A percentage score between 0 and 100 on how accurate and reliable the generated answer is.

I believe there will be value in a settings option where the user can define their preferences in terms of monetary interests, political spectrum and the like.

The image below shows the UI developed for HILL, highlighting hallucinations and enabling users to assess the factual correctness of an LLM response.



Source link

25May

How Would The Architecture For An LLM Agent Platform Look? | by Cobus Greyling | May, 2024


The study sees stage 1 as follows:

Agent Recommender will recommend an Agent Item to a user based on personal needs and preferences. Agent Item engages in a dialogue with the user, subsequently providing information for the user and also acquiring user information.

And as I mentioned, the Agent Recommender can be seen as the agent, and the Agent Items as the actions.

This stage can be seen as a multi-tool agent…

Rec4Agentverse then enables the information exchange between Agent Item and Agent Recommender. For example, Agent Item can transmit the latest preferences of the user back to Agent Recommender. Agent Recommender can give new instructions to Agent Item.

Here is the leap where collaboration is supported amongst Agent Items, with the Agent Recommender orchestrating everything.

There is a market for a no-code to low-code IDE for creating agent tools. Agent tools will be required as the capabilities of the agent expand.

The graphic below from the study shows the Agent Items (which I think of as tools)…

The left portion of the diagram shows three roles in their architecture: user, Agent Recommender, and Agent Item, along with their interconnected relationships.

The right side of the diagram shows that an Agent Recommender can collaborate with Agent Items to affect the information flow of users and offer personalised information services.

What I like about this diagram is that it shows the user / agent recommender layer, the information exchange layer and the information carrier layer, or integration.



Source link
