18Jul

Agentic AI: Creating An AI Agent Which Can Navigate The Internet | by Cobus Greyling | Jul, 2024


Recent studies have explored the construction of text-based web browsing environments and how to instruct large language model agents to perform web navigation.

This new development focusses on building multimodal web agents to leverage the environment rendered by browsers through screenshots, thus mimicking human web browsing behaviour.

WebVoyager is a multi-modal web AI agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention.

WebVoyager processes the user query by making observations from screenshots and the textual content of interactive web elements, then formulates a thought on what action to take.

Actions can include clicking, typing, scrolling and so on; the agent subsequently executes the chosen action on the website.

The sequence of events the agent follows, based on annotated screenshots from web navigation, is shown below.

Similar to how humans browse the web, this agent uses visual information from the web (screenshots) as its primary input.

This approach bypasses the complexity of processing HTML DOM trees or accessibility trees, which can produce overly verbose text and hinder the agent’s decision-making process.

Very similar to the approach Apple took with Ferret-UI, the researchers overlay bounding boxes on the interactive elements of the websites to better guide the agent’s action prediction.

This method does not require an object detection module but instead uses GPT-4V-Act, a JavaScript tool that extracts interactive elements based on web element types and overlays bounding boxes with numerical labels on the respective regions.

GPT-4V-Act is efficient since it is rule-based and does not rely on any object detection models.

The action space for WebVoyager is designed to closely mimic human web browsing behaviour. This is achieved by implementing the most commonly used mouse and keyboard actions, enabling the agent to navigate effectively.

Using numerical labels in screenshots, the agent can respond with a concise Action Format. This method precisely identifies the interactive elements and executes the corresponding actions.

The primary actions include:

1. Click: Clicking on a webpage element, such as a link or button.
2. Input: Selecting a text box, clearing any existing content, and entering new content.
3. Scroll: Moving the webpage vertically.
4. Wait: Pausing to allow webpages to load.
5. Back: Returning to the previous page.
6. Jump to Search Engine: Redirecting to a search engine when stuck on a website without finding an answer.
7. Answer: Concluding the iteration by providing an answer that meets the task requirements.

These actions enable the agent to interact with web pages efficiently, simulating a human-like browsing experience.
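For illustration, here is a minimal sketch of how such a numerically labelled action response might be parsed. The action names and the response format are assumptions for this example, not the exact format used by WebVoyager.

```python
import re

# Hypothetical agent response that references a numerical label from the screenshot
agent_response = "Thought: the search box is element [12]. Action: Type [12]; best hiking boots"

def parse_action(response: str):
    """Parse an 'Action: <Name> [label]; <value>' style string into its parts (illustrative only)."""
    match = re.search(r"Action:\s*(\w+)(?:\s*\[(\d+)\])?(?:;\s*(.*))?", response)
    if not match:
        return None
    name, label, value = match.groups()
    return {"action": name, "label": int(label) if label else None, "value": value}

print(parse_action(agent_response))
# {'action': 'Type', 'label': 12, 'value': 'best hiking boots'}
```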



Source link

17Jul

Advanced Retrieval Techniques in a World of 2M Token Context Windows, Part 1 | by Meghan Heintz | Jul, 2024


Exploring RAG techniques to improve retrieval accuracy

Visualising AI project launched by Google DeepMind. From Unsplash image.

Gemini Pro can handle an astonishing 2M token context window compared to the paltry 15k we were amazed by when GPT-3.5 landed. Does that mean we no longer care about retrieval or RAG systems? Based on Needle-in-a-Haystack benchmarks, the answer is that while the need is diminishing, especially for Gemini models, advanced retrieval techniques still significantly improve performance for most LLMs.

Benchmarking results show that long context models perform well at surfacing specific insights. However, they struggle when a citation is required. That makes retrieval techniques especially important for use cases where citation quality matters (think law, journalism, and medical applications, among others). These tend to be higher-value applications where lacking a citation makes the initial insight much less useful.

Additionally, while the cost of long context models will likely decrease, augmenting shorter context window models with retrievers can be a cost-effective and lower latency path to serve the same use cases. It’s safe to say that RAG and retrieval will stick around a while longer, but maybe you won’t get much bang for your buck implementing a naive RAG system.

From Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems by Laban, Fabbri, Xiong, Wu in 2024. “Summary of a Haystack results of human performance, RAG systems, and Long-Context LLMs. Results are reported using three metrics: Coverage (left), Citation (center), and Joint (right) scores. Full corresponds to model performance when inputting the entire Haystack, whereas Rand, Vect, LongE, KWs, RR3, Orac correspond to retrieval components RAG systems. Models ranked by Oracle Joint Score. For each model, #Wb report the average number of words per bullet point.”

Advanced RAG covers a range of techniques but broadly they fall under the umbrella of pre-retrieval query rewriting and post-retrieval re-ranking. Let’s dive in and learn something about each of them.

Q: “What is the meaning of life?”

A: “42”

Question and answer asymmetry is a huge issue in RAG systems. A typical approach to simpler RAG systems is to compare the cosine similarity of the query and document embedding. This works when the question is nearly restated in the answer, “What’s Meghan’s favorite animal?”, “Meghan’s favorite animal is the giraffe.”, but we are rarely that lucky.

Here are a few techniques that can overcome this:

The nomenclature “Rewrite-Retrieve-Read” originated from a paper from the Microsoft Azure team in 2023 (although given how intuitive the technique is it had been used for a while). In this study, an LLM would rewrite a user query into a search engine-optimized query before fetching relevant context to answer the question.

The key example was how this query, “What profession do Nicholas Ray and Elia Kazan have in common?” should be broken down into two queries, “Nicholas Ray profession” and “Elia Kazan profession”. This allows for better results because it’s unlikely that a single document would contain the answer to both questions. By splitting the query into two the retriever can more effectively retrieve relevant documents.
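As a minimal sketch of the rewrite-then-retrieve idea, assuming an OpenAI-style chat client (openai>=1.0) and a generic `search` function supplied by the caller; this is an illustration of the pattern, not the exact pipeline from the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(user_query: str) -> list[str]:
    """Ask an LLM to rewrite a question into one or more search-engine-friendly queries."""
    prompt = (
        "Rewrite the question below into short search engine queries, one per line.\n"
        f"Question: {user_query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def rewrite_retrieve_read(user_query: str, search) -> str:
    # 1. Rewrite, 2. Retrieve for each rewritten query, 3. Read: answer using the retrieved context
    queries = rewrite_query(user_query)   # e.g. ["Nicholas Ray profession", "Elia Kazan profession"]
    context = "\n".join(doc for q in queries for doc in search(q))
    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}],
    )
    return answer.choices[0].message.content
```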

From Query Rewriting for Retrieval-Augmented Large Language Models by Ma, Gong, He, Zhao, & Duan in 2023 “(a) standard retrieve-then-read method, (b) LLM as a query rewriter for rewrite-retrieve-read pipeline, and (c) trainable rewriter.”

Rewriting can also help overcome issues that arise from “distracted prompting”, that is, instances where the user query mixes concepts in the prompt and taking an embedding directly would result in nonsense. For example, “Great, thanks for telling me who the Prime Minister of the UK is. Now tell me who the President of France is” would be rewritten as “current French president”. This can help make your application more robust to a wider range of users, as some will think a lot about how to optimally word their prompts, while others might have different norms.

In query expansion with LLMs, the initial query can be rewritten into multiple reworded questions or decomposed into subquestions. Ideally, by expanding the query into several options, the chances of lexical overlap increase between the initial query and the correct document in your storage component.

Query expansion is a concept that predates the widespread usage of LLMs. Pseudo Relevance Feedback (PRF) is a technique that inspired some LLM researchers. In PRF, the top-ranked documents from an initial search are used to identify and weight new query terms. With LLMs, we rely on the creative and generative capabilities of the model to find new query terms. This is beneficial because LLMs are not restricted to the initial set of documents and can generate expansion terms not covered by traditional methods.
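A minimal sketch of LLM-based query expansion, reusing the hypothetical `client` from the earlier snippet; the prompt wording and the number of expansions are assumptions:

```python
def expand_query(user_query: str, n: int = 3) -> list[str]:
    """Generate n reworded or decomposed variants of the query to improve lexical overlap."""
    prompt = (
        f"Generate {n} alternative phrasings or sub-questions for the query below, one per line.\n"
        f"Query: {user_query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    variants = [v.strip() for v in response.choices[0].message.content.splitlines() if v.strip()]
    return [user_query] + variants  # search with the original query plus its expansions
```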

Corpus-Steered Query Expansion (CSQE) is a method that marries the traditional PRF approach with the LLMs’ generative capabilities. The initially retrieved documents are fed back to the LLM to generate new query terms for the search. This technique can be especially performant for queries for which the LLM lacks subject knowledge.

From Corpus-Steered Query Expansion with Large Language Models by Lei, Cao, Zhou, Shen, Yates in 2024. “Overview of CSQE. Given a query Biology definition and the top-2 retrieved documents, CSQE utilizes an LLM to identify relevant document 1 and extract the key sentences from document 1 that contribute to the relevance. The query is then expanded by both these corpus-originated texts and LLM-knowledge empowered expansions (i.e., hypothetical documents that answer the query) to obtain the final results.”

There are limitations to both LLM-based query expansion and predecessors like PRF. The most glaring is the assumption that the LLM-generated terms (or the top-ranked results) are relevant. God forbid I am trying to find information about the Australian journalist Harry Potter instead of the famous boy wizard. Both techniques would further pull my query away from the less popular query subject towards the more popular one, making edge-case queries less effective.

Another way to reduce the asymmetry between questions and documents is to index documents with a set of LLM-generated hypothetical questions. For a given document, the LLM can generate questions that could be answered by the document. Then during the retrieval step, the user’s query embedding is compared to the hypothetical question embeddings versus the document embeddings.

This means that we don’t need to embed the original document chunk; instead, we can assign the chunk a document ID and store that as metadata on the hypothetical question document. Generating a document ID means there is much less overhead when mapping many questions to one document.
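A minimal sketch of indexing LLM-generated hypothetical questions instead of the chunks themselves, assuming a toy embedding function and an in-memory store; all names here are placeholders:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding for illustration only; replace with a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

# Index: embed hypothetical questions, store only the chunk's ID as metadata
question_index = []  # list of (question_embedding, doc_id)
chunk_store = {}     # doc_id -> original chunk text

def index_chunk(doc_id: str, chunk: str, hypothetical_questions: list[str]) -> None:
    chunk_store[doc_id] = chunk
    for question in hypothetical_questions:
        question_index.append((embed(question), doc_id))

def retrieve(user_query: str, k: int = 3) -> list[str]:
    q = embed(user_query)
    # Rank hypothetical questions by cosine similarity to the user query
    scored = sorted(
        question_index,
        key=lambda item: float(np.dot(q, item[0]) / (np.linalg.norm(q) * np.linalg.norm(item[0]))),
        reverse=True,
    )
    # Deduplicate doc_ids while keeping rank order, then map back to the original chunks
    seen, doc_ids = set(), []
    for _, doc_id in scored:
        if doc_id not in seen:
            seen.add(doc_id)
            doc_ids.append(doc_id)
    return [chunk_store[d] for d in doc_ids[:k]]
```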

The clear downside to this approach is your system will be limited by the creativity and volume of questions you store.

HyDE is the opposite of Hypothetical Query Indexes. Instead of generating hypothetical questions, the LLM is asked to generate a hypothetical document that could answer the question, and the embedding of that generated document is used to search against the real documents. The real document is then used to generate the response. This method showed strong improvements over other contemporary retriever methods when it was first introduced in 2022.
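A minimal HyDE-style sketch under the same assumptions as the snippets above (the hypothetical `client` and `embed` helpers, plus an assumed `doc_embeddings` dictionary mapping document IDs to vectors):

```python
def hyde_retrieve(user_query: str, doc_embeddings: dict[str, np.ndarray], k: int = 3) -> list[str]:
    """Generate a hypothetical answer document, then search real documents with its embedding."""
    hypothetical_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage that answers: {user_query}"}],
    ).choices[0].message.content

    q = embed(hypothetical_doc)
    scored = sorted(
        doc_embeddings.items(),
        key=lambda kv: float(np.dot(q, kv[1]) / (np.linalg.norm(q) * np.linalg.norm(kv[1]))),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]
```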

We use this concept at Dune for our natural language to SQL product. By rewriting user prompts as a possible caption or title for a chart that would answer the question, we are better able to retrieve SQL queries that can serve as context for the LLM to write a new query.

From Precise Zero-Shot Dense Retrieval without Relevance Labels by Gao, Ma, Lin, Callan in 2022. “An illustration of the HyDE model. Documents snippets are shown. HyDE serves all types of queries without changing the underlying GPT-3 and Contriever/mContriever models.”



Source link

16Jul

AgentInstruct Uses Agentic Flows To Create Synthetic Training Data | by Cobus Greyling | Jul, 2024


High-Quality Data

By leveraging powerful models like GPT-4, along with tools such as search APIs and code interpreters, AgentInstruct ensures the generation of high-quality data.

Diverse Data

AgentInstruct produces both prompts and responses using a large number of agents equipped with powerful LLMs, various tools, and reflection flows.

It employs a taxonomy with over 100 subcategories to ensure diversity and quality in the prompts and responses generated.

Large Quantities of Data

Operating autonomously, AgentInstruct can generate vast amounts of data, applying flows for verification and filtering. It eliminates the need for seed prompts by using raw documents for seeding.

The central challenge of synthetic data generation stems from the difficulty of creating high-quality and diverse data, which necessitates significant human effort in curation and filtering.

A common approach involves using powerful models like GPT-4 to generate responses to a set of prompts. This process is often enhanced by eliciting explanations or step-by-step instructions and employing complex prompting techniques to improve answer quality.
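For illustration only, here is a minimal sketch of that common generate-then-filter pattern; it is not AgentInstruct’s actual flow, and the prompt wording, model name and `client` object are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_training_pair(prompt: str) -> dict | None:
    """Elicit a step-by-step answer, then keep it only if a second pass judges it acceptable."""
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{prompt}\n\nExplain your reasoning step by step."}],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Question: {prompt}\nAnswer: {answer}\n"
                              "Is this answer correct and well explained? Reply yes or no."}],
    ).choices[0].message.content

    return {"prompt": prompt, "response": answer} if verdict.strip().lower().startswith("yes") else None
```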

Using raw data (such as unstructured text documents or source code) as seeds offers two advantages. Firstly, it is abundant, enabling the creation of vast and diverse datasets using AgentInstruct.

Secondly, bypassing existing prompts, either as-is or after paraphrasing, can foster learning of more general capabilities rather than specific benchmarks.

The AgentInstruct approach is conducive to enhancing larger, more capable models due to its ability to generate new prompts and produce responses of higher quality than the LLM used in the agentic flow, facilitated by tools and reflection.

The AgentInstruct approach offers an effective solution for generating diverse, high-quality data for model post-training.

This method employs agentic flows to create synthetic data, addressing common issues such as lack of diversity and the need for extensive human curation.

By leveraging an agentic framework, AgentInstruct can produce custom datasets from unstructured data sources, enhancing model training and skill development.

The approach’s effectiveness is evidenced by the improved performance of the Orca-3 model, which benefited from a 25 million pair dataset generated by AgentInstruct.

The researchers believe that using agentic flows for synthetic data creation is valuable for all stages of model training, including pre-training, post-training, and domain/task specialisation.

This capability to generate diverse, high-quality instruction data from unstructured content could lead to partial or completely automated pipelines for model customisation and continuous improvement.




Source link

16Jul

From Scratch to Deep Quantile Forecasting | by Jinhang Jiang | Jul, 2024


An end-to-end empirical walkthrough of multi-step quantile forecasting with TensorFlow, NeuralForecast, and zero-shot LLMs.

Image by Author
  1. Short Introduction
  2. Data
  3. Build a Toy Version of Quantile Recurrent Forecaster
  4. Quantile Forecasting with the State-of-Art Models
  5. Zero-shot Quantile Forecast with LLMs
  6. Conclusion

Quantile forecasting is a statistical technique used to predict different quantiles (e.g., the median or the 90th percentile) of a response variable’s distribution, providing a more comprehensive view of potential future outcomes. Unlike traditional mean forecasting, which only estimates the average, quantile forecasting allows us to understand the range and likelihood of various possible results.

Quantile forecasting is essential for decision-making in contexts with asymmetric loss functions or varying risk preferences. In supply chain management, for example, predicting the 90th percentile of demand ensures sufficient stock levels to avoid shortages, while predicting the 10th percentile helps minimize overstock and associated costs. This methodology is particularly advantageous in sectors such as finance, meteorology, and energy, where understanding distribution extremes is as critical as the mean.

Both quantile forecasting and conformal prediction address uncertainty, yet their methodologies differ significantly. Quantile forecasting directly models specific quantiles of the response variable, providing detailed insights into its distribution. Conversely, conformal prediction is a model-agnostic technique that constructs prediction intervals around forecasts, guaranteeing that the true value falls within the interval with a specified probability. Quantile forecasting yields precise quantile estimates, whereas conformal prediction offers broader interval assurances.

The implementation of quantile forecasting can markedly enhance decision-making by providing a sophisticated understanding of future uncertainties. This approach allows organizations to tailor strategies to different risk levels, optimize resource allocation, and improve operational efficiency. By capturing a comprehensive range of potential outcomes, quantile forecasting enables organizations to make informed, data-driven decisions, thereby mitigating risks and enhancing overall performance.

To demonstrate the work, I chose to use the data from the M4 competition as an example. The data is under a CC0: Public Domain license and can be accessed here. It can also be loaded through the datasetsforecast package:

# Install the package
# pip install datasetsforecast

# Load the packages
import pandas as pd
from datasetsforecast.m4 import M4

# Load Data
df, *_ = M4.load('./data', group='Weekly')
# Randomly select three items
df = df[df['unique_id'].isin(['W96', 'W100', 'W99'])]
# Define the start date (for example, "1970-01-04")
start_date = pd.to_datetime("1970-01-04")
# Convert 'ds' to actual week dates
df['ds'] = start_date + pd.to_timedelta(df['ds'] - 1, unit='W')
# Display the DataFrame
df.head()
Image by Author

The original data contains over 300 unique time series. To demonstrate, I randomly selected three time series: W96, W99, and W100, as they all have the same history length. The original timestamps are masked as integers (i.e., 1–2296), so I manually converted them back to a normal date format, with the first date set to January 4th, 1970. The following figure is a preview of W99:

Image by Author

First, let’s build a quantile forecaster from scratch to understand how the target data flows through the pipeline and how the forecasts are generated. I picked the idea from the paper A Multi-Horizon Quantile Recurrent Forecaster by Wen et al. The authors proposed a Multi-Horizon Quantile Recurrent Neural Network (MQ-RNN) framework that combines Sequence-to-Sequence Neural Networks, Quantile Regression, and Direct Multi-Horizon Forecasting for accurate and robust multi-step time series forecasting. By leveraging the expressiveness of neural networks, the nonparametric nature of quantile regression, and a novel training scheme called forking-sequences, the model can effectively handle shifting seasonality, known future events, and cold-start situations in large-scale forecasting applications.

We cannot reproduce everything in this short blog, but we can try to replicate part of it using the TensorFlow package as a demo. If you are interested in the implementation of the paper, there is an ongoing project that you can leverage: MQRNN.

Let’s first load the necessary packages and define some global parameters. We will use the LSTM model as the core, and we need to do some preprocessing on the data to obtain the rolling windows before fitting. The input_shape is set to (104, 1), meaning we are using two years of data for each training window. In this walkthrough, we will only look into an 80% confidence interval with the median as the point forecast, which means the quantiles = [0.1, 0.5, 0.9]. We will use the last 12 weeks as the test dataset, so the output_steps or horizon is equal to 12 and the cut_off_date will be '2013-10-13'.

# Install the packages
# pip install tensorflow scikit-learn

# Load the packages
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, concatenate, Layer

# Define Global Parameters
input_shape = (104, 1)
quantiles = [0.1, 0.5, 0.9]  # include the median so it can serve as the point forecast
output_steps = 12
cut_off_date = '2013-10-13'
tf.random.set_seed(20240710)

Next, let’s convert the data to rolling windows which is the desired input shape for RNN-based models:

# Preprocess The Data
def preprocess_data(df, window_size=104, forecast_horizon=12):
    # Ensure the dataframe is sorted by item and date
    df = df.sort_values(by=['unique_id', 'ds'])
    # Lists to hold processed data for each item
    X, y, unique_id, ds = [], [], [], []
    # Normalizer
    scaler = StandardScaler()
    # Iterate through each item
    for key, group in df.groupby('unique_id'):
        demand = group['y'].values.reshape(-1, 1)
        scaled_demand = scaler.fit_transform(demand)
        dates = group['ds'].values
        # Create sequences (sliding window approach)
        for i in range(len(scaled_demand) - window_size - forecast_horizon + 1):
            X.append(scaled_demand[i:i+window_size])
            y.append(scaled_demand[i+window_size:i+window_size+forecast_horizon].flatten())
            unique_id.append(key)
            ds.append(dates[i+window_size:i+window_size+forecast_horizon])
    X = np.array(X)
    y = np.array(y)
    return X, y, unique_id, ds, scaler

Then we split the data into train, val, and test:

# Split Data
def split_data(X, y, unique_id, ds, cut_off_date):
    cut_off_date = pd.to_datetime(cut_off_date)
    val_start_date = cut_off_date - pd.Timedelta(weeks=12)

    train_idx = [i for i, date in enumerate(ds) if date[0] < val_start_date]
    val_idx = [i for i, date in enumerate(ds) if val_start_date <= date[0] < cut_off_date]
    test_idx = [i for i, date in enumerate(ds) if date[0] >= cut_off_date]

    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    train_unique_id = [unique_id[i] for i in train_idx]
    train_ds = [ds[i] for i in train_idx]
    val_unique_id = [unique_id[i] for i in val_idx]
    val_ds = [ds[i] for i in val_idx]
    test_unique_id = [unique_id[i] for i in test_idx]
    test_ds = [ds[i] for i in test_idx]

    return (X_train, y_train, X_val, y_val, X_test, y_test,
            train_unique_id, train_ds, val_unique_id, val_ds, test_unique_id, test_ds)

The authors of the MQRNN utilized both horizon-specific local context, essential for temporal awareness and seasonality mapping, and horizon-agnostic global context to capture non-time-sensitive information, enhancing the stability of learning and the smoothness of generated forecasts. To build a model that sort of reproduces what the MQRNN is doing, we need to write a quantile loss function and add layers that capture local context and global context. I added an attention layer to it to show you how the attention mechanism can be included in such a process:

# Attention Layer
class Attention(Layer):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def call(self, query, values):
        hidden_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

# Quantile Loss Function
def quantile_loss(q, y_true, y_pred):
    e = y_true - y_pred
    return tf.reduce_mean(tf.maximum(q * e, (q - 1) * e))

def combined_quantile_loss(quantiles, y_true, y_pred, output_steps):
    losses = [quantile_loss(q, y_true, y_pred[:, i*output_steps:(i+1)*output_steps])
              for i, q in enumerate(quantiles)]
    return tf.reduce_mean(losses)

# Model architecture
def create_model(input_shape, quantiles, output_steps):
    inputs = Input(shape=input_shape)
    lstm1 = LSTM(256, return_sequences=True)(inputs)
    lstm_out, state_h, state_c = LSTM(256, return_sequences=True, return_state=True)(lstm1)
    context_vector, attention_weights = Attention(256)(state_h, lstm_out)
    global_context = Dense(100, activation='relu')(context_vector)
    forecasts = []
    for q in quantiles:
        local_context = concatenate([global_context, context_vector])
        forecast = Dense(output_steps, activation='linear')(local_context)
        forecasts.append(forecast)
    outputs = concatenate(forecasts, axis=1)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam',
                  loss=lambda y, f: combined_quantile_loss(quantiles, y, f, output_steps))
    return model
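To tie the pieces together, here is a minimal sketch of how these functions might be wired up; the epoch and batch-size values are arbitrary choices for illustration, and callbacks such as early stopping are omitted:

```python
# Build rolling windows, split chronologically, then train and predict
X, y, unique_id, ds, scaler = preprocess_data(df)
(X_train, y_train, X_val, y_val, X_test, y_test,
 train_ids, train_ds, val_ids, val_ds, test_ids, test_ds) = split_data(X, y, unique_id, ds, cut_off_date)

model = create_model(input_shape, quantiles, output_steps)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=32)

# Predictions come out as [P10 | P50 | P90] blocks of length output_steps each,
# still on the standardised scale used during preprocessing
y_pred = model.predict(X_test)
p10, p50, p90 = (y_pred[:, i*output_steps:(i+1)*output_steps] for i in range(3))
```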

Here are the plotted forecasting results:

We also evaluated the SMAPE for each item, as well as the percentage coverage of the interval (how much actual was covered by the interval). The results are as follows:
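As a minimal sketch, assuming the usual definitions, SMAPE and interval coverage could be computed along these lines (the exact formulas behind the reported numbers may differ):

```python
import numpy as np

def smape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Symmetric mean absolute percentage error, in percent; assumes a strictly positive series."""
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return float(np.mean(np.abs(actual - forecast) / denom) * 100)

def interval_coverage(actual: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Share of actual values falling inside the [lower, upper] prediction interval."""
    return float(np.mean((actual >= lower) & (actual <= upper)))
```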

This toy version can serve as a good baseline to start with quantile forecasting. However, distributed training is not configured for this setup, nor is the model architecture optimized for large-scale forecasting, so it might suffer from speed issues. In the next section, we will look into a package that allows you to do quantile forecasts with the most advanced deep-learning models.

The neuralforecast package is an outstanding Python library that allows you to use most of the SOTA deep neural network models for time series forecasting, such as PatchTST, NBEATs, NHITS, TimeMixer, etc. with easy implementation. In this section, I will use PatchTST as an example to show you how to perform quantile forecasting.

First, load the necessary modules and define the parameters for PatchTST. Tuning the model will require some empirical experience and will be project-dependent. If you are interested in finding potentially optimal parameters for your data, you may look into the auto modules from neuralforecast, which allow you to use Ray to perform hyperparameter tuning, and it is quite efficient! The neuralforecast package carries a great set of models based on different sampling approaches. The ones using the base_window approach allow you to use MQLoss or HuberMQLoss, where you can specify the quantile levels you are looking for. In this work, I picked HuberMQLoss as it is more robust to outliers.

# Install the package
# pip install neuralforecast

# Load the package
from neuralforecast.core import NeuralForecast
from neuralforecast.models import PatchTST
from neuralforecast.losses.pytorch import HuberMQLoss, MQLoss

# Define Parameters for PatchTST
PARAMS = {
    'input_size': 104,
    'h': output_steps,
    'max_steps': 6000,
    'encoder_layers': 4,
    'start_padding_enabled': False,
    'learning_rate': 1e-4,
    'patch_len': 52,        # Length of each patch
    'hidden_size': 256,     # Size of the hidden layers
    'n_heads': 4,           # Number of attention heads
    'res_attention': True,
    'dropout': 0.1,         # Dropout rate
    'activation': 'gelu',   # Activation function
    'attn_dropout': 0.1,
    'fc_dropout': 0.1,
    'random_seed': 20240710,
    'loss': HuberMQLoss(quantiles=[0.1, 0.5, 0.9]),
    'scaler_type': 'standard',
    'early_stop_patience_steps': 10,
}

# Get Training Data
train_df = df[df['ds'] < cut_off_date]

# Fit and predict with PatchTST
models = [PatchTST(**PARAMS)]
nf = NeuralForecast(models=models, freq='W')
nf.fit(df=train_df, val_size=12)
Y_hat_df = nf.predict().reset_index()

Here are plotted forecasts:

Here are the metrics:

Through the demo, you can see how easy it is to implement the model and how much the model’s performance has been lifted. However, if you wonder whether there are any easier approaches to this task, the answer is YES. In the next section, we will look into a T5-based model that allows you to conduct zero-shot quantile forecasting.

We have been witnessing a trend where advancements in NLP also push the boundaries of time series forecasting, as predicting the next word is an analogous process to predicting the next period’s value. Given the fast development of large language models (LLMs) for generative tasks, researchers have also started to look into pre-training a large model on millions of time series, allowing users to do zero-shot forecasts.

However, before we draw an equal sign between LLMs and zero-shot time series tasks, we have to answer one question: what is the difference between training a language model and training a time series model? It is, in short, tokens from a finite dictionary versus values from an unbounded, usually continuous, domain. Amazon recently released a project called Chronos which handles this challenge well and makes a large time series model possible. As the authors stated: “Chronos tokenizes time series into discrete bins through simple scaling and quantization of real values. In this way, we can train off-the-shelf language models on this ‘language of time series,’ with no changes to the model architecture”. The original paper can be found here.

Currently, Chronos is available in multiple versions. It can be loaded and used through the autogluon API with only a few lines of code.

# Install the package
# pip install autogluon.timeseries

from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Get Training Data and Transform
train_df = df[df['ds'] < cut_off_date]
train_df_chronos = TimeSeriesDataFrame(
    train_df.rename(columns={'ds': 'timestamp', 'unique_id': 'item_id', 'y': 'target'})
)

# Zero-shot forecast with Chronos
predictor = TimeSeriesPredictor(
    prediction_length=output_steps, freq='W', quantile_levels=[0.1, 0.9]
).fit(
    train_df_chronos, presets="chronos_base",
    random_seed=20240710
)
Y_hat_df_chronos = predictor.predict(train_df_chronos).reset_index().rename(
    columns={'mean': 'Chronos',
             '0.1': 'P10',
             '0.9': 'P90',
             'timestamp': 'ds',
             'item_id': 'unique_id'}
)

Here are the plotted forecasts:

Here are the metrics:

As you can see, Chronos showed a very decent performance compared to PatchTST. However, it does not mean it has surpassed PatchTST, since it is very likely that Chronos has been trained on M4 data. In their original paper, the authors also evaluated their model on the datasets that the model has not been trained on, and Chronos still yielded very comparable results to the SOTA models.

There are many more large time series models being developed right now. One of them is TimeGPT, developed by Nixtla. This kind of model not only makes the forecasting task easier, more reliable, and consistent, but also provides a good starting point for making reasonable guesses for time series with limited historical data.

From building a toy version of a quantile recurrent forecaster to leveraging state-of-the-art models and zero-shot large language models, this blog has demonstrated the power and versatility of quantile forecasting. By incorporating models like TensorFlow’s LSTM, NeuralForecast’s PatchTST, and Amazon’s Chronos, we can achieve accurate, robust, and computationally efficient multi-step time series forecasts. Quantile forecasting not only enhances decision-making by providing a nuanced understanding of future uncertainties but also allows organizations to optimize strategies and resource allocation. The advancements in neural networks and zero-shot learning models further push the boundaries, making quantile forecasting a pivotal tool in modern data-driven industries.

Note: All the images, numbers and tables are generated by the author. The complete code can be found here: Quantile Forecasting.



Source link

15Jul

LangSmith, LangGraph Cloud & LangGraph Studio | by Cobus Greyling | Jul, 2024


In this article I do a complete end-to-end walkthrough of an agent built using LangGraph, deployed to LangGraph Cloud and viewed via LangGraph Studio, ending with how LangSmith is used to manage applications and LLM performance.

Considering the intersection of language and AI, developments have been taking place at a tremendous pace. And LangChain finds itself at the forefront of shaping how generative AI applications are developed and managed.

A few initial observations regarding generative AI and language:

  1. A few months ago it was thought that OpenAI had captured the market with their highly capable LLMs.
  2. Then a slew of open-sourced models, most notably from Meta, disrupted the perceived commercial model.
  3. LLM providers realised that Language Models would become a mere utility and started to focus on end-user applications and RAG-like functionalities referred to as grounding, agent-like functionality and personal assistants.
  4. Hallucination had to be solved for, and it was discovered that LLMs do not have emergent capabilities, but rather that LLMs do exceptionally well at in-context learning (ICL). An application structure developed around implementing, scaling and managing ICL implementations, which we now know as RAG.
  5. RAG (non-gradient) started to be preferred over fine-tuning (gradient) approaches for being transparent and not as opaque as fine-tuning, adding to generative AI apps being observable, inspectable and easily modifiable.
  6. Because we started using all aspects of LLMs (NLG, reasoning, planning, dialog state management, etc.) except their knowledge-intensive nature, Small Language Models became very applicable.
  7. This was due to very capable open-sourced SLMs, quantisation, local offline inference, and advanced capability in reasoning and chain-of-thought training.
  8. The focus is shifting to two aspects, the first being a data-centric approach, where unstructured data can be discovered, designed and augmented for RAG and fine-tuning. Recent fine-tuning has not focused on augmenting the knowledge-intensive nature of Language Models, but rather on imbuing the LMs with specific behavioural capabilities.
  9. This is evident in the recent acquisition by OpenAI, moving closer to the data portion and delivering RAG solutions.
  10. The second aspect is the need for a no-code to low-code AI productivity suite providing access to models, hosting, flow-engineering, fine-tuning, a prompt studio and guardrails.
  11. There is also a notable movement to add graph data; a graph is an abstract data type, which is less opaque and easier to interpret.

LangChain introduced LangSmith as a tool for detailed tracing and management of generative AI applications. The offering includes a prompt playground and a prompt hub.

LangChain also recently introduced LangGraph, which adds a degree of structure to agentic applications.

An abstract data type is a mathematical model for data types, defined by its behaviour (semantics) from the point of view of a user of the data.

Abstract data types stand in stark contrast with data structures, which are concrete representations of data and reflect the point of view of an implementer, not a user. This data structure is less opaque and easier to interpret.

A directed graph (or digraph) is a graph made up of a set of nodes connected by directed edges.

A graph data structure consists of a finite set of nodes together with a set of pairs of these nodes; for an undirected graph, these pairs are unordered.

Considering the graph representation below, the nodes are shown, together with the edges and the edge options.
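To make the nodes-and-edges idea concrete, here is a minimal LangGraph sketch, assuming the langgraph package is installed; the node functions and state fields are invented for illustration and are not the agent from the walkthrough:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    answer: str

def agent_node(state: AgentState) -> dict:
    # Placeholder for an LLM call that drafts an answer
    return {"answer": f"Draft answer to: {state['question']}"}

def review_node(state: AgentState) -> dict:
    # Placeholder for a second node that refines the draft
    return {"answer": state["answer"] + " (reviewed)"}

builder = StateGraph(AgentState)
builder.add_node("agent", agent_node)
builder.add_node("review", review_node)
builder.set_entry_point("agent")
builder.add_edge("agent", "review")   # directed edge: agent -> review
builder.add_edge("review", END)       # terminate the graph

graph = builder.compile()
print(graph.invoke({"question": "What is LangGraph?", "answer": ""}))
```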



Source link

13Jul

Build an AI Paraphraser Tool in 5 Minutes With SimplerLLM | by Hasan Aboul Hasan | Jul, 2024


In this post, I’ll show you step-by-step how you can build an AI Paraphraser Tool using Python and SimplerLLM in Minutes.

Something like this:

Intro: How Do Paraphrasing Tools Work?

Before the era of AI, most paraphrasing tools swapped words with their synonyms, maintaining the original meaning of the text.

However, after AI took over this domain, these tools improved to the extent that they are now able to analyze the input text and create an alternative version with a different structure and wording while conveying the same meaning.

Here are some of the things most paraphraser tools do:

  1. Word Substitution: The tool identifies and replaces words with their synonyms while maintaining the original meaning of the text.
  2. Sentence Restructure: One of the most critical steps in paraphrasing is rearranging the structure of sentences. For example, it may convert active voice to passive voice or change the order of phrases and clauses to create a different sentence flow.
  3. Consolidating Information: Summarize information from long sentences or paragraphs into shorter, more concise versions that cover the essential points.
  4. Adjusting Formality and Tone: This would be done based on the settings or intended use. For instance, it can transform a casual tone into a more formal one or vice versa.
  5. Removing Redundancy: Detect and remove redundant phrases or words, making the text clearer without unnecessary repetition.
  6. Ensuring Coherence: Beyond word-level changes, effective paraphrasing ensures that the rephrased text remains logically connected, maintaining the flow and readability of the content.

As you can see, it’s not just about changing words. As we go now over some types of paraphrasing styles, you’ll see how the prompt changes a lot with each one.

Let’s Start! — The Implementation

In our case, the main engine that paraphrases the text is our power prompt. These prompts are fed to OpenAI’s GPT model, or any other model you prefer, which will do all the work for us.

The code structure is very simple. It just reads the content of a text file, paraphrases it using the power prompt chosen, and saves the response in another text file. So, the only part that needs to be very well-crafted is the prompt.

By now, you should have grasped the idea behind how the code works. So, it’s time to get technical!

Get Our Environment Ready

First, our code is dependent on the SimplerLLM library, which makes building AI tools much easier, as you can see now.

Let’s start by creating a virtual environment and installing the library.

So, open a new terminal and run the following step-by-step:

1- Create the Virtual Environment:

python -m venv venv

2- Activate the Virtual Environment:

venv/scripts/activate

3- Install SimplerLLM:

pip install simplerllm

Now, we have a virtual environment with simplerllm installed, which will help isolate the project and avoid conflicts between package versions.

Note that you will get all the code and prompts mentioned, so don’t worry about the snippets for now 🙂

First things first, we’ll need to create a .env file and add our OpenAI API Key so that the SimplerLLM functions can use it to generate the responses.

If you don’t have an API key, go to OpenAI’s website and generate a new one. Then, add it to the .env file in this form:

OPENAI_API_KEY="YOUR_API_KEY"

Now, we’re ready to use the code; here it is:

from SimplerLLM.tools.generic_loader import load_content
from SimplerLLM.language.llm import LLM, LLMProvider
from prompts import Standard, Academic, Kiddie, Formal, Expand, Shorten

text = load_content("input.txt")

# Edit the prompt name in accordance with what you want to convert the text to
final_prompt = Academic.format(input=text.content)

llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o")

response = llm_instance.generate_response(prompt=final_prompt, max_tokens=1000)

with open("response.txt", "w", encoding='utf-8') as f:
    f.write(response)

As you can see, we’ve imported two things into our code: the SimplerLLM functions we’ll use and the prompts module, which I created and saved all the power prompts in, and I’ll give them to you for free!
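The actual prompts come with the author’s downloadable files. As a purely hypothetical illustration, a prompts.py module of this shape would satisfy the import above; the wording of these templates is invented here and is not the author’s power prompts:

```python
# prompts.py -- hypothetical stand-in templates; replace with the author's power prompts
Standard = "Paraphrase the following text, keeping its meaning intact:\n\n{input}"
Academic = "Rewrite the following text in a formal, academic register with precise vocabulary:\n\n{input}"
Kiddie = "Rewrite the following text so a young child could easily understand and enjoy it:\n\n{input}"
Formal = "Rewrite the following text in a polite, professional tone:\n\n{input}"
Expand = "Paraphrase the following text and expand it with additional detail:\n\n{input}"
Shorten = "Paraphrase the following text in a more concise form, keeping the key points:\n\n{input}"
```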

The text variable uses the SimplerLLM function

load_content

that takes your text file as input and loads its respective data. Here’s how it would look:

text = load_content("input.txt")

Academic Paraphraser

Now, we need to format the prompt and store it in the final_prompt variable. This can be done by using the

Academic

prompt, which we imported from the prompts module and passed along with the content of the text file.

final_prompt = Academic.format(input = text.content)

Then, we create an OpenAI LLM instance in order to call their GPT model, and we call it using the

generate_response

function as shown in the response variable.

llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o")

response = llm_instance.generate_response(prompt=final_prompt, max_tokens=1000)

💡 Note that by default, this function returns a maximum of 350 tokens as output; that’s why we added the max_tokens parameter to increase it to 1000. If your expected token count is bigger than 1000 tokens, make sure you increase this number as needed.

💡 To calculate the number of tokens in your text, use this tokenizer tool by OpenAI, or use the tiktoken library to calculate it directly in your Python code.

Then, we take the generated response and save it to the

response.txt

Here’s what the output will look like:

As you can see, the paraphrased output text is still very close in the number of characters; however, the content changed drastically now that strictly formal language is used, and the vocabulary is much more complex than before.

The changes that occur to the text are in accordance with what I constructed the prompt to do.

Kiddie Paraphraser

Let’s try using the Kiddie format instead of Academic and see how the output changes. So, just use Kiddie instead of Academic in the final_prompt variable like this:

final_prompt = Kiddie.format(input = text.content)

And here’s the result:

The result now is very different from the one above. The words used are more informal, and the idea is explained very simply, making it easy for kids to understand and enjoy.

Shortening Paraphraser

Let’s now try another type of paraphrasing that not only changes words and sentence structure but also changes the text’s length either by increasing or decreasing.

As we did before, we’ll replace Kiddie in the final_prompt variable with Shorten to decrease the text’s length. Here’s what we get:

The paraphrased text was shortened to 147 words from the original 204. Plus, it did some word substitution with a little bit of sentence restructuring.

As you can see, the output changes a lot depending on the prompt we choose. So, the better the prompt, the better the result.

This is what we call prompt engineering, which helps you create the most optimal prompts to get the best out of them.

The Code and Prompts

Here are both the code we used above and the prompts.py module file, which contains the prompts.

The Code

The Prompts



Source link

13Jul

Speculative RAG By Google Research | by Cobus Greyling | Jul, 2024


Speculative RAG is a framework that uses a larger generalist language model to efficiently verify multiple RAG drafts produced in parallel by a smaller, specialised distilled language model.

Each draft is based on a distinct subset of retrieved documents, providing diverse perspectives and reducing input token counts per draft.

According to the research, this method enhances comprehension and mitigates position bias over long contexts. By delegating drafting to the smaller model and having the larger model perform a single verification pass, Speculative RAG accelerates the RAG process.

Experiments show that Speculative RAG achieves state-of-the-art performance with reduced latency, improving accuracy by up to 12.97% and reducing latency by 51% compared to traditional RAG systems.

This new RAG framework uses a smaller specialist RAG drafter to generate high-quality draft answers.

Each draft comes from a distinct subset of retrieved documents, offering diverse perspectives and reducing input token counts per draft.

The generalist LM works with the RAG drafter without needing additional tuning.

It verifies and integrates the most promising draft into the final answer, enhancing comprehension of each subset and mitigating the lost-in-the-middle phenomenon.

Google believes this method significantly accelerates RAG by having the smaller specialist LM handle drafting, while the larger generalist LM performs a single, unbiased verification pass over the drafts in parallel.

Extensive experiments on four free-form question-answering and closed-set generation benchmarks demonstrate the superior effectiveness and efficiency of this method.

  1. This study is a good example of how Small Language Models are being used within a larger framework which employs model orchestration.
  2. SLMs are leveraged for their reasoning capabilities, for which they have been specifically created.
  3. SLMs are ideal in this scenario, as they are not required to be knowledge-intensive for this implementation. Relevant and contextual knowledge is injected at inference.
  4. The aim of this framework is to optimise token count and hence save cost.
  5. It reduces latency by 51% compared to conventional RAG systems.
  6. It enhances accuracy by up to 12.97%.
  7. It avoids fine-tuning of models.
  8. Multiple RAG drafts are produced in parallel by smaller, specialised Language Models.
  9. This smaller, specialised RAG model excels at reasoning over retrieved documents and can rapidly produce accurate responses. This is reminiscent of the SLMs Orca-2 and Phi-3, which were trained to have exceptional reasoning capabilities.
  10. The best results were achieved with Mistral 7B as the RAG drafter.
  11. And Mixtral 8x7B as the verifier.
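As a rough, hedged sketch of the drafter/verifier idea (not Google’s implementation): a small drafter model produces several drafts from distinct document subsets, and a larger generalist model scores them in a single pass. The prompts below, and the `drafter` and `verifier` callables, are placeholder assumptions.

```python
import random

def speculative_rag(question: str, documents: list[str], drafter, verifier, n_drafts: int = 3) -> str:
    """Draft answers from distinct document subsets with a small model, then verify with a large one."""
    random.shuffle(documents)
    subsets = [documents[i::n_drafts] for i in range(n_drafts)]  # distinct subsets, fewer tokens per draft

    drafts = []
    for subset in subsets:
        docs_text = "\n".join(subset)
        drafts.append(drafter(
            f"Answer the question using only these documents.\nDocuments:\n{docs_text}\n\nQuestion: {question}"
        ))

    # One verification pass by the generalist model: pick the best-supported draft
    numbered = "\n".join(f"{i}: {d}" for i, d in enumerate(drafts))
    choice = verifier(
        f"Question: {question}\nCandidate answers:\n{numbered}\n"
        "Reply with only the number of the answer that is best supported."
    )
    return drafts[int(choice.strip())] if choice.strip().isdigit() else drafts[0]
```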



Source link

12Jul

Gower’s Distance for Mixed Categorical and Numerical Data | by Haden Pelletier | Jul, 2024


A distance measure for clustering mixed data

Most likely you have heard of Manhattan distance or Euclidean distance. These are two different metrics which provide information as to how distant (or different) two given data points are.

Manhattan and Euclidean distance graphed. Image by author

In a nutshell, Euclidean distance is the shortest distance from point A to point B. Manhattan distance calculates the sum of the absolute differences between the x and y coordinates and finds the distance between them as if they were placed on a grid where you could only go up, down, left, or right (not diagonal).

Distance metrics often underlie clustering algorithms, such as k-means clustering, which uses Euclidean distance. This makes sense, as in order to define clusters, you have to first know how similar or different 2 data points are (aka how distant they are from each other).

Calculating the distance between 2 points

To show this process in action, I will start with an example using Euclidean distance.
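The article’s worked example is not reproduced here, but as a minimal sketch of the measures it discusses, the snippet below computes Euclidean, Manhattan and Gower distances for two toy records; the records, features and assumed ranges are invented for illustration:

```python
import numpy as np

a = {"age": 30, "income": 55_000, "pet": "giraffe"}
b = {"age": 42, "income": 48_000, "pet": "dog"}
num_ranges = {"age": 60, "income": 100_000}   # assumed feature ranges used for normalisation

# Euclidean and Manhattan distance only use the numerical features
num_a = np.array([a["age"], a["income"]], dtype=float)
num_b = np.array([b["age"], b["income"]], dtype=float)
euclidean = float(np.sqrt(np.sum((num_a - num_b) ** 2)))
manhattan = float(np.sum(np.abs(num_a - num_b)))

def gower_distance(x: dict, y: dict, num_ranges: dict) -> float:
    """Average per-feature dissimilarity: range-scaled for numerical, 0/1 mismatch for categorical."""
    per_feature = []
    for key in x:
        if key in num_ranges:                                   # numerical feature
            per_feature.append(abs(x[key] - y[key]) / num_ranges[key])
        else:                                                   # categorical feature
            per_feature.append(0.0 if x[key] == y[key] else 1.0)
    return sum(per_feature) / len(per_feature)

print(euclidean, manhattan, gower_distance(a, b, num_ranges))
```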



Source link

11Jul

Moving From Natural Language Understanding To Mobile UI Understanding | by Cobus Greyling | Jul, 2024


As with conversations, context is of paramount importance. It is very hard to derive meaning from any conversation if there is not sufficient context. That is the underlying principle of RAG, to supply the LLM with context at inference.

Ferret-UI is a model designed to understand user interactions with a mobile screen.

The image below is quite self-explanatory as to how the mobile screen can be interrogated in natural language. Numerous use-cases come to mind.

This solution can be seen as a conversational enablement of a mobile operating system. Or the information can be used to learn from user behaviour and supply users with a customised experience.

This is something referred to as ambient orchestration, where user behaviour can be learned and suggestions can be made by the mobile OS, and automation of user routines can be intelligent and truly orchestrated.



Source link

10Jul

Teaching Small Language Models to Reason | by Cobus Greyling | Jul, 2024


Chain-Of-Thought Prompting at a foundational level is so successful that it gave rise to what some refer to as the Chain-Of-X phenomenon. Google Research explored how to generate a CoT data ontology for existing datasets using LLMs and then how to fine-tune smaller Language Models on the CoT.

As most everyone knows, Chain-Of-Thought prompting improves the reasoning capabilities of large language models.

Google asserts that reasoning capabilities only emerge in models with at least tens of billions of parameters. This research from Google explores transferring these capabilities to smaller models via knowledge distillation.

They fine-tuned a student model using the Chain-Of-Thought outputs from a larger teacher model.

Researchers from Google found that this method improves task performance in arithmetic, common sense, and symbolic reasoning datasets.

Chain of thought (CoT) prompting teaches Language Models (LMs) to decompose a reasoning task into a series of intermediate steps.

It is demonstrated that this prompting significantly increases the task accuracy of large language models (LLMs) across common sense, symbolic and mathematical reasoning datasets.

However, the reasoning capabilities of smaller LMs do not improve with CoT prompting, mostly producing illogical CoT. Notably, CoT prompting even reduces the accuracy of models with less than 10 billion parameters.

Research attributes this to abilities such as semantic understanding and symbolic mapping only emerging in larger-scale models.

Google Research proposes a two-step pipeline for CoT (Chain-Of-Thought) knowledge distillation; a code sketch follows the two steps below.

Annotation with CoT Reasoning

  1. Use a teacher model, like PaLM 540B or GPT-3 175B, to annotate an existing supervised dataset with CoT reasoning.
  2. Perform few-shot prompting with 8 examples to generate CoTs, adapting prompts to provide the target answer after the question and before the example CoT. This helps correct small mistakes.
  3. Remove incorrect CoTs based on the target answer to ensure quality.

Fine-Tuning the Student Model

  1. Fine-Tune a student model using teacher forcing.
  2. Provide the question as input and the CoT and answer as the target.
  3. This training eliminates the need for prompting during fine-tuning.
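A rough sketch of what step one (annotate with a teacher, then filter by the target answer) might look like in code; the `teacher` call, the prompt format and the answer-check logic are assumptions for illustration, not the paper’s exact setup:

```python
def build_cot_dataset(examples: list[dict], teacher, few_shot_prefix: str) -> list[dict]:
    """Annotate (question, answer) pairs with teacher-generated CoT, keeping only correct rationales."""
    distilled = []
    for ex in examples:
        # Provide the target answer after the question so the teacher's rationale stays on track
        prompt = f"{few_shot_prefix}\nQ: {ex['question']}\nA: {ex['answer']}\nExplain step by step:"
        cot = teacher(prompt)

        # Filter: discard rationales whose final line does not contain the known target answer
        if cot and ex['answer'] in cot.splitlines()[-1]:
            # Student fine-tuning target: question as input, CoT plus answer as output
            distilled.append({"input": ex["question"], "target": f"{cot}\nThe answer is {ex['answer']}."})
    return distilled
```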

An overview of the proposed method is shown in the figure below:



Source link
