

Benchmarking prompts presents challenges due to differences in their usage, level of detail, style, and purpose. A recent study tackled this issue by developing a taxonomy called TELeR (Turn, Expression, Level of Details, Role), which aims to comprehensively benchmark prompts across these dimensions.

The aim of this study is to allow future reporting on specific prompt categories and meaningful comparison between prompts.

Establishing a common standard through such a taxonomy gives researchers a shared reference point when measuring and comparing the performance of different LLMs against varying prompts.

Prompt hubs have also emerged, the most notable open ones being from LangChain and Haystack. A standard taxonomy will help with categorising and sorting prompts, and give users a template to follow while navigating prompt hubs, ensuring the prompt they choose fits the application they have in mind.

The quality and effectiveness of the prompt can greatly influence the performance of Large Language Models (LLMs) for a particular task.

Therefore, designing appropriate prompts with the right amount of detail has become more important than ever.

What makes this study interesting is that the researchers focus exclusively on understanding the potential of Large Language Models (LLMs) for performing complex tasks, characterised by the following traits:

  1. Ill-defined tasks
  2. Abstract and goal-oriented
  3. Highly dependent on subjective interpretation
  4. Very hard to evaluate quantitatively.

These complex tasks often involve multiple steps or sub-tasks, making the design of appropriate prompts particularly challenging, as there is no single rule book to follow.

Added to this, the more complex the task, the larger the number of variations and possible permutations of the prompt.

Goals

Setting clear goals helps the language model understand the task or question, increasing the likelihood of obtaining the desired output.

Avoiding vague or ambiguous terms is crucial to prevent inaccurate or irrelevant responses; be explicit in your instructions.
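As a minimal illustration (the prompt wording below is my own, not taken from the paper), compare a vague directive with an explicit, goal-oriented one for the same task:

```python
# A vague directive versus an explicit, goal-oriented one for the same task.
# Both prompts are illustrative and do not come from the TELeR paper.

vague_prompt = "Tell me about this paper."

explicit_prompt = (
    "Summarise the attached paper in exactly three bullet points, "
    "covering the research question, the method, and the main finding. "
    "Write for a reader with no machine-learning background."
)

print(explicit_prompt)
```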

Associated Data

Some prompts require LLMs to perform a task on data provided by the user in real-time (including RAG), while others rely solely on the pre-trained model to generate responses based on its background knowledge.

It is crucial to explicitly indicate in LLM prompts whether the user is providing data and, if so, to distinguish clearly between the data and the directive parts of the prompt.
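A minimal sketch of this separation, using delimiters to keep the directive apart from the user-supplied data (the template and variable names are my own, not the paper's):

```python
# Separate the directive from the user-supplied data with explicit delimiters,
# so the model can tell the instruction apart from the content to be processed.
# Template and variable names are illustrative.

directive = "Summarise the meeting notes below in five bullet points."

meeting_notes = """\
Attendees: product, engineering, support.
Decision: ship the v2 export feature behind a flag next sprint.
Open issue: pricing page copy still needs legal review.
"""

prompt = (
    f"{directive}\n\n"
    "### DATA START ###\n"
    f"{meeting_notes}"
    "### DATA END ###"
)

print(prompt)
```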

Sub-Tasks

Complex tasks consist of multiple steps or sub-tasks. It is important to clearly outline these distinct sub-tasks in the prompt as separate bullet points or numbered items.

This visual organisation helps LLMs recognise each sub-task and respond to them individually.
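A short sketch of how sub-tasks might be enumerated in a prompt; the task and sub-tasks are invented for illustration:

```python
# Enumerate sub-tasks explicitly so the model can address each one in turn.
# The task and sub-tasks are illustrative examples.

sub_tasks = [
    "Identify the main claim of the article.",
    "List the evidence offered in support of that claim.",
    "Point out any assumptions that are not backed by evidence.",
    "Suggest one follow-up question a critical reader should ask.",
]

prompt_lines = ["Review the article below by completing the following sub-tasks:"]
prompt_lines += [f"{i}. {task}" for i, task in enumerate(sub_tasks, start=1)]

prompt = "\n".join(prompt_lines)
print(prompt)
```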

Evaluation Criteria/Few-Shot Examples

LLMs can benefit from example-based learning, where prompts include specific examples of desired input-output pairs (few-shot examples). By incorporating relevant examples, users can guide the model to follow specific patterns or mimic desired behaviours.
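A small few-shot sketch with two labelled examples followed by the new input; the reviews and labels are invented:

```python
# Few-shot prompting: show the model labelled input-output pairs, then the new input.
# The reviews and labels are invented for illustration.

few_shot_examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took less than five minutes.", "positive"),
]

new_review = "The screen is bright but the speakers are tinny."

lines = ["Classify the sentiment of each review as positive or negative.\n"]
for review, label in few_shot_examples:
    lines.append(f"Review: {review}\nSentiment: {label}\n")
lines.append(f"Review: {new_review}\nSentiment:")

prompt = "\n".join(lines)
print(prompt)
```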

RAG

Both Small & Large Language Models excel at in-context learning (ICL), where the model sets aside its pre-trained knowledge and relies on contextual reference data injected at inference time.
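A minimal sketch of this pattern, with the retrieval step stubbed out and the passages invented for illustration:

```python
# RAG-style prompting: instruct the model to answer only from the supplied context.
# The retrieval step is stubbed out; the passages and question are invented.

retrieved_passages = [
    "Policy doc, section 4: refunds are available within 30 days of purchase.",
    "Policy doc, section 7: refunds are issued to the original payment method.",
]

question = "How long do customers have to request a refund?"

context_block = "\n".join(f"- {p}" for p in retrieved_passages)

prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you do not know.\n\n"
    f"Context:\n{context_block}\n\n"
    f"Question: {question}"
)

print(prompt)
```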

Self-Explain

LLMs are capable not only of generating textual responses but also of providing explanations for their outputs if explicitly requested in the prompt.
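A short illustrative prompt that asks for the answer and a justification in one go (the ticket text is invented):

```python
# Self-explain: request the answer plus a brief explanation in the same prompt.
# The classification task and ticket text are illustrative.

prompt = (
    "Classify the support ticket below as 'billing', 'technical' or 'other'.\n"
    "After the label, add a one-sentence explanation of why you chose it.\n\n"
    "Ticket: I was charged twice for my subscription this month."
)

print(prompt)
```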

Context & Role

Including relevant context and background information in the prompt can help the model generate more accurate responses.

For complex tasks, providing a clear understanding of the context enables the model to make more informed and precise decisions.

The level of context provided in different prompts can significantly impact the accuracy of the model’s responses.
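As a rough illustration (all wording is my own), here is the same directive with and without background context:

```python
# The same directive with and without background context; the contextualised
# version gives the model the constraints it needs to answer precisely.
# All wording is illustrative.

without_context = "Suggest a database for our project."

with_context = (
    "We are building a mobile expense tracker. It must work offline, "
    "sync small amounts of structured data when a connection is available, "
    "and the team has strong SQL experience but no DevOps capacity.\n\n"
    "Given that context, suggest a database for our project and justify the choice."
)

print(without_context)
print(with_context)
```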

Expression Style

Directives can be expressed primarily in two styles:

  1. Questions
  2. Instructions

For complex tasks, one may choose to frame directives as either a set of questions or instructions based on their preference or the specific needs of the application.
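For example, the same directive framed both ways (the wording is illustrative):

```python
# The same directive framed as a question and as an instruction.
# Both phrasings are illustrative.

question_style = "What are the three biggest risks in the attached project plan?"

instruction_style = (
    "Identify the three biggest risks in the attached project plan "
    "and list them in order of severity."
)

print(question_style)
print(instruction_style)
```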

Interaction Style

Prompts for complex tasks typically consist of lengthy text descriptions, often containing details of associated sub-tasks to be performed step-by-step.

Consequently, some users may opt to provide these instructions in a multi-turn fashion, resembling a real dialogue, while others may prefer to convey all the details in a single turn.

This choice between single-turn and multi-turn prompting can significantly impact the performance of an LLM, as the dialogue history available at generation time differs between the two approaches.

Turn

Based on the number of turns used while prompting an LLM to perform a complex task, prompts can be either single-turn or multi-turn.
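A sketch of the same request expressed as a single-turn prompt and as a multi-turn conversation, using the role/content message structure common to many chat APIs; the conversation itself is invented:

```python
# Single-turn: everything in one user message. Multi-turn: the same request
# split across a dialogue. The wording and conversation are illustrative.

single_turn = [
    {"role": "user",
     "content": "Read the report below, summarise it in five bullets, "
                "then list three follow-up actions.\n\n<report text here>"},
]

multi_turn = [
    {"role": "user", "content": "Here is a report I need help with:\n\n<report text here>"},
    {"role": "assistant", "content": "Got it. What would you like me to do with it?"},
    {"role": "user", "content": "First summarise it in five bullets."},
    {"role": "assistant", "content": "<summary>"},
    {"role": "user", "content": "Now list three follow-up actions."},
]
```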

Expression

Based on the expression style of the overall directive as well as the associated sub-tasks, prompts can be either question-style or instruction-style.

Role

Based on whether a proper system role is defined in the LLM system before providing the actual prompt, prompts can be categorised as either system-role defined or undefined.
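A sketch of the distinction using the same role/content message format; the system message and task are illustrative:

```python
# System-role defined versus undefined prompting. Message wording is illustrative.

role_undefined = [
    {"role": "user", "content": "Draft a polite reply declining the meeting invite."},
]

role_defined = [
    {"role": "system",
     "content": "You are an executive assistant. Keep replies under 80 words "
                "and always propose an alternative time."},
    {"role": "user", "content": "Draft a polite reply declining the meeting invite."},
]
```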

Level of Detail

Based on the degree of detail provided in the directive, the researchers divided prompts into seven distinct levels (levels 0–6).
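To make the idea concrete, here is the same directive expressed with increasing detail; the prompts below are only a rough illustration of the progression, not the paper's formal level definitions:

```python
# The same task expressed with increasing detail. These are rough illustrations
# of graded prompt detail, not the paper's formal level definitions.

low_detail = "Summarise the attached research paper."

medium_detail = (
    "Summarise the attached research paper in one paragraph, "
    "covering the research question, the method, and the main finding."
)

high_detail = (
    "Summarise the attached research paper by completing these sub-tasks:\n"
    "1. State the research question in one sentence.\n"
    "2. Describe the method and dataset.\n"
    "3. Report the main quantitative result.\n"
    "4. Note one limitation the authors acknowledge.\n"
    "Keep the whole summary under 150 words and briefly explain what you "
    "chose to include and why."
)

print(high_detail)
```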

This paper emphasises the importance of a standardised taxonomy for LLM prompts aimed at solving complex tasks.

The TELeR taxonomy can serve as a unified standard for comparing and benchmarking the performance of LLMs as reported by multiple independent research studies.

Such standardisation enables more meaningful comparisons among LLMs and helps derive more accurate conclusions from multiple independent studies.


I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.




