
Comparing Sex Ratios: Revisiting a Famous Statistical Problem from the 1700s | by Ryan Burn | Aug, 2024

What can we say about the difference of two binomial distribution probabilities?

18th century Paris and London [12]

Consider two independent binomial distributions with success probabilities p_1 and p_2. If we observe a_1 successes and b_1 failures from the first distribution and a_2 successes and b_2 failures from the second, what can we say about the difference p_1 - p_2?
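As a rough numerical illustration of this question, the sketch below assumes independent uniform priors on p_1 and p_2, so that each posterior is a Beta(a + 1, b + 1) distribution, and estimates the posterior probability that p_1 < p_2 by sampling. The function name `prob_p1_less_than_p2` and the counts in the usage line are made up for illustration; they are not Laplace's data.

```python
import random

def prob_p1_less_than_p2(a1, b1, a2, b2, n_samples=50_000, seed=0):
    """Monte Carlo estimate of P(p_1 < p_2 | data) under uniform priors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # With a uniform prior, the posterior of each p is Beta(successes + 1, failures + 1)
        p1 = rng.betavariate(a1 + 1, b1 + 1)
        p2 = rng.betavariate(a2 + 1, b2 + 1)
        if p1 < p2:
            hits += 1
    return hits / n_samples

# Illustrative counts only (assumed, not historical data)
estimate = prob_p1_less_than_p2(a1=520, b1=480, a2=505, b2=495)
print(f"P(p_1 < p_2 | data) is roughly {estimate:.3f}")
```

Sampling trades exactness for simplicity; a closed-form or numerical-integration approach would give the same quantity without Monte Carlo noise.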

Binomial model differences like this were first studied by Laplace in 1778. Laplace observed that the ratio of boys-to-girls born in London was notably larger than the ratio of boys-to-girls born in Paris, and he sought to determine whether the difference was significant.

Using what would now be called Bayesian inference together with a uniform prior, Laplace computed the posterior probability that the birth ratio in London was less than the birth ratio in Paris as




Structured Outputs and How to Use Them | by Armin Catovic | Aug, 2024

Building robustness and determinism in LLM applications


OpenAI recently announced support for Structured Outputs in its latest gpt-4o-2024-08-06 models. Structured outputs for large language models (LLMs) are nothing new: developers have long relied on prompt engineering techniques or third-party tools to obtain them.

In this article we will explain what structured outputs are, how they work, and how you can apply them in your own LLM-based applications. Although OpenAI’s announcement makes them quite easy to implement using their APIs (as we will demonstrate here), you may want to opt instead for the open source Outlines package (maintained by the lovely folks over at dottxt), since it can be applied both to self-hosted open-weight models (e.g. Mistral and LLaMA) and to proprietary APIs. (Disclaimer: due to this issue, Outlines does not as of this writing support structured JSON generation via OpenAI APIs, but that will change soon!)

If the RedPajama dataset is any indication, the overwhelming majority of pre-training data is human text. “Natural language” is therefore the native domain of LLMs, in both the input and the output. When we build applications, however, we would like to use machine-readable formal structures or schemas to encapsulate our data input/output. This way we build robustness and determinism into our applications.

Structured Outputs is a mechanism by which we enforce a pre-defined schema on the LLM output. This typically means enforcing a JSON schema, but it is not limited to JSON: we could in principle enforce XML, Markdown, or a completely custom-made schema. The benefits of Structured Outputs are two-fold:

  1. Simpler prompt design: we need not be overly verbose when specifying what the output should look like
  2. Deterministic names and types: we can guarantee that the LLM response contains, for example, an attribute age with a Number JSON type

For this example, we will use the first sentence from Sam Altman’s Wikipedia entry:

Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023).

…and we are going to use the latest GPT-4o checkpoint as a named-entity recognition (NER) system. We will enforce the following JSON schema:

json_schema = {
    "name": "NamedEntities",
    "schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "description": "List of entity names and their corresponding types",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The actual name as specified in the text, e.g. a person's name, or the name of the country"
                        },
                        "type": {
                            "type": "string",
                            "description": "The entity type, such as 'Person' or 'Organization'",
                            "enum": ["Person", "Organization", "Location", "DateTime"]
                        }
                    },
                    "required": ["name", "type"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["entities"],
        "additionalProperties": False
    },
    "strict": True
}

In essence, our LLM response should contain a NamedEntities object, which consists of an array of entities, each one containing a name and type. There are a few things to note here. We can, for example, enforce an enum type, which is very useful in NER since we can constrain the output to a fixed set of entity types. We must specify all the fields in the required array; however, we can emulate “optional” fields by setting the type to e.g. ["string", "null"].

We can now pass our schema, together with the data and the instructions, to the API. We need to populate the response_format argument with a dict where we set type to "json_schema" and then supply the corresponding schema.

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
                Your job is to identify and return all entity names and their
                types for a given piece of text. You are to strictly conform
                only to the following entity types: Person, Location, Organization
                and DateTime. If uncertain about entity type, please ignore it.
                Be careful of certain acronyms, such as role titles "CEO", "CTO",
                "VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": s,  # s holds the input sentence
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    },
)

The output should look something like this:

{   'entities': [   {'name': 'Samuel Harris Altman', 'type': 'Person'},
{'name': 'April 22, 1985', 'type': 'DateTime'},
{'name': 'American', 'type': 'Location'},
{'name': 'OpenAI', 'type': 'Organization'},
{'name': '2019', 'type': 'DateTime'},
{'name': 'November 2023', 'type': 'DateTime'}]}
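Because the schema is enforced, the response content can be treated as ordinary JSON downstream. A minimal sketch, assuming the JSON string has already been extracted from the API response: the `raw` literal below simply mirrors the example output above, and `group_by_type` is a hypothetical helper rather than part of any SDK.

```python
import json
from collections import defaultdict

# Stand-in for the JSON text returned by the model (copied from the example output)
raw = """{"entities": [
  {"name": "Samuel Harris Altman", "type": "Person"},
  {"name": "April 22, 1985", "type": "DateTime"},
  {"name": "American", "type": "Location"},
  {"name": "OpenAI", "type": "Organization"},
  {"name": "2019", "type": "DateTime"},
  {"name": "November 2023", "type": "DateTime"}
]}"""

def group_by_type(payload):
    """Parse the structured output and collect entity names per entity type."""
    grouped = defaultdict(list)
    for entity in json.loads(payload)["entities"]:
        grouped[entity["type"]].append(entity["name"])
    return dict(grouped)

entities = group_by_type(raw)
print(entities["DateTime"])
```

No defensive parsing or key checks are needed here precisely because the names and types are guaranteed by the schema.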

The full source code used in this article is available here.

The magic is in the combination of constrained sampling and context-free grammar (CFG). We mentioned previously that the overwhelming majority of pre-training data is “natural language”. Statistically this means that for every decoding/sampling step, there is a non-negligible probability of sampling some arbitrary token from the learned vocabulary (and in modern LLMs, vocabularies typically stretch across 40 000+ tokens). However, when dealing with formal schemas, we would really like to rapidly eliminate all improbable tokens.

In the previous example, if we have already generated…

{   'entities': [   {'name': 'Samuel Harris Altman',

…then ideally we would like to place a very high logit bias on the 'typ token in the next decoding step, and very low probability on all the other tokens in the vocabulary.

This is in essence what happens. When we supply the schema, it gets converted into a formal grammar, or CFG, which serves to guide the logit bias values during the decoding step. CFG is one of those old-school computer science and natural language processing (NLP) mechanisms that is making a comeback. A very nice introduction to CFG was actually presented in this StackOverflow answer, but essentially it is a way of describing transformation rules for a collection of symbols.
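The idea of grammar-guided logit biasing can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular API implements it: the four-token vocabulary, the logit values and the `allowed_next` set (standing in for what a grammar would permit at this step) are all made up, and `constrained_softmax` is a hypothetical helper that assumes at least one token is allowed.

```python
import math

def constrained_softmax(logits, allowed):
    """Softmax over `logits` after masking every token outside `allowed`."""
    masked = {tok: (val if tok in allowed else float("-inf"))
              for tok, val in logits.items()}
    top = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - top) for tok, val in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits: unconstrained, the natural-language token "the" would win
logits = {"the": 3.2, "'type'": 1.1, "}": 0.4, "hello": 2.5}
# Assumed grammar state: after a "name" value, only the "type" key may follow
allowed_next = {"'type'"}

probs = constrained_softmax(logits, allowed_next)
next_token = max(probs, key=probs.get)
print(next_token, probs)
```

Masking with negative infinity before the softmax is the extreme case of a logit bias: disallowed tokens receive exactly zero probability, so the sampled token is always valid under the grammar.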

Structured Outputs are nothing new, but are certainly becoming top-of-mind with proprietary APIs and LLM services. They provide a bridge between the erratic and unpredictable “natural language” domain of LLMs, and the deterministic and structured domain of software engineering. Structured Outputs are essentially a must for anyone designing complex LLM applications where LLM outputs must be shared or “presented” in various components. While API-native support has finally arrived, builders should also consider using libraries such as Outlines, as they provide an LLM/API-agnostic way of dealing with structured output.



How to Reduce Class Imbalance Bias in AI? (Explained with a Riddle) | by Diana Morales

Do you like riddles? Perfect! In this article I’ll use a riddle as a fun way to explain class imbalance bias in machine learning models.

For International Women’s Day, Mindspace asked 22 people to solve the following riddle and recorded their responses:

A father is about to bring his son to a job interview, applying to work at a large stock trading company. The son is incredibly nervous… In the car during their drive over they hardly speak… Just when arriving at the parking lot of the company the son receives a phone call. He looks up at his father, who says: “Go ahead, pick it up.” The caller is the CEO of the stock trading company, who says: “Good luck son…you’ve got this.” The boy hangs up the phone and again looks at his father, who is still sitting next to him in the car.

How is this possible? No, really… take a minute and think about it. Alright! Final answer? ˙ɹǝɥʇoɯ s,uos ǝɥʇ sı OƎƆ ǝɥ⊥



Visualizing Stochastic Regularization for Entity Embeddings | by Valerie Carey | Aug, 2024

A glimpse into how neural networks perceive categoricals and their hierarchies

Photo by Rachael Crowe on Unsplash

Industry data often contains non-numeric data with many possible values, for example zip codes, medical diagnosis codes, or preferred footwear brands. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.

I’ve been writing a series of blog posts on methods for these features. Last episode, I showed how perturbed training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].

In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!

Here, I use visualizations and SHAP values to “look under the hood” and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it’s cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.




Agent AI: Agentic Applications Are Software Systems With A Foundation Model AI Backbone & Defined Autonomy via Tools | by Cobus Greyling | Aug, 2024

Flow Engineering

Prompt Engineering alone was not enough and we had to find a way of re-using prompts; hence templates were introduced where key data fields could be populated at inference. This was followed by prompts being chained to create longer flows and more complex applications.
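The progression from templates to chains can be sketched with the standard library alone. Everything here is illustrative: the two templates, the `run_chain` helper and the `echo_llm` placeholder (which stands in for a real model call) are assumptions, not code from any framework mentioned in this article.

```python
from string import Template

# Reusable prompt templates whose fields are populated at inference time
summarize = Template("Summarize the following text in one sentence:\n$text")
translate = Template("Translate this summary into $language:\n$summary")

def run_chain(llm, text, language):
    """Chain two prompts: the output of the first fills a field of the second."""
    summary = llm(summarize.substitute(text=text))
    return llm(translate.substitute(summary=summary, language=language))

def echo_llm(prompt):
    # Placeholder model: returns the last line of the prompt, upper-cased
    return prompt.splitlines()[-1].upper()

result = run_chain(echo_llm, "Agents use tools.", "French")
print(result)
```

Passing the model in as a function keeps the chain logic independent of any particular provider, which is the same separation that templating introduced for prompts.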

Chaining was supplemented with highly contextual information and inference, giving rise to an approach leveraging In-Context Learning (ICL) via Retrieval Augmented Generation (RAG).

The next step in this evolution is Agentic Applications (AI Agents) where a certain level of agency (autonomy) is given to the application. LlamaIndex combined advanced RAG capabilities with an Agent approach to coin Agentic RAG.

For Agentic Applications to have an increased level of agency, more modalities need to be introduced. MindSearch can explore the web via a text interface, whereas OmniParser, Ferret-UI and WebVoyager enable agentic applications to define a graphical interface and navigate the GUI.

The image above shows OmniParser from Microsoft; a similar approach is followed by Apple’s Ferret-UI and by WebVoyager. Screen elements are detected, mapped with bounding boxes and named. From here a natural language layer can be created between a UI and any conversational AI system.

MindSearch is premised on the problem that complex requests often cannot be accurately and completely retrieved by the search engine via a single instance.

Corresponding information which needs to be integrated into solving a problem or a question, is spread over multiple web pages along with significant noise.

Also, a large number of web pages with long contents may quickly exceed the maximum context length of LLMs.

The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process.

It decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search results from WebSearcher, using either GPT-4o or InternLM2.5-7B models.



Let’s reproduce NanoGPT with JAX! (Part 1) | by Louis Wang | Jul, 2024

Inspired by Andrej Karpathy’s recent YouTube video, Let’s reproduce GPT-2 (124M), I’d like to rebuild it in Jax with most of the training optimizations. Jax is built for highly efficient computation, and it is interesting to compare PyTorch, with its recent training optimizations, against Jax and its related libraries such as Flax (a layers API for neural network training in Jax) and Optax (a gradient processing and optimization library for Jax). We will quickly learn what Jax is and rebuild GPT with it. In the end, we will compare the tokens/sec of multi-GPU training between PyTorch and Jax!

AI generated GPT

What is Jax?

According to its documentation, JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. I would like to introduce JAX through its name. While some call it Just Another XLA (Accelerated Linear Algebra), I prefer to call it J(it) A(utograd) X(LA) to highlight its capabilities.

J: Just-in-time (JIT) compilation. When you run your Python function, Jax converts it into a set of primitive operations called a Jaxpr. The Jaxpr expression is then converted into an input for XLA, which compiles the lower-level program to produce an optimized executable for the target device (CPU, GPU or TPU).

A: Autograd. Computing gradients is a critical part of modern machine learning methods, and you can just call jax.grad() to get the gradients needed to optimize your models.

X: XLA. This is an open-source machine learning compiler for CPUs, GPUs and ML accelerators. In general, XLA performs several built-in optimization and analysis passes on the StableHLO graph, then sends the HLO computation to a backend for further HLO-level optimizations. The backend then performs target-specific code generation.

Those are just some key features of JAX. It also has many user-friendly numpy-like APIs in jax.numpy, automatic vectorization with jax.vmap, and parallelization of your code across multiple devices via jax.pmap. We will cover more Jax concepts and applications in future blogs, but now let’s reproduce NanoGPT with Jax!

From Attention to Transformer

GPT is a decoder-only transformer model, and its key building block is the Attention module. We can first define a model config dataclass to hold the model hyperparameters, so that the model modules can consume it to initialize the model architecture. Similar to the 124M GPT model, here we initialize a 12-layer transformer decoder with 12 heads and a vocab size of 50257 tokens, each of which has a 768-dimensional embedding. The block size for the attention calculation is 1024.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 50257
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 1024
    n_layer: int = 12
    dropout_rate: float = 0.1

Next comes the key building block of the transformer model: Attention. The idea is to project the inputs with three weight matrices: Key, Query, and Value. Here we rely on flax, the Jax layer and training API library, to initialize the 3 weight matrices by simply calling flax.linen.Dense. As mentioned, Jax has many numpy-like APIs, so we reshape the outputs of the dense layers with jax.numpy.reshape from [batch_size, sequence_length, embedding_dim] to [batch_size, sequence_length, num_head, embedding_dim / num_head]. Since we need to do matrix multiplication between the query and key matrices (and then with the value matrix), Jax also provides the jax.numpy.matmul and jax.numpy.transpose APIs (the latter to transpose the key matrix for multiplication).

Multihead Attention

Note that we need to put a mask on the attention matrix to avoid information leakage (preventing earlier tokens from accessing later tokens). jax.numpy.tril helps build a lower-triangular array, and jax.numpy.where fills the masked positions with negative infinity so that they become 0 after jax.nn.softmax. The full code for multi-head attention can be found below.

import jax
import jax.numpy as jnp
from flax import linen as nn

class CausalSelfAttention(nn.Module):
    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=True):
        assert len(x.shape) == 3
        b, l, d = x.shape
        n_head = self.config.n_head
        head_dim = d // n_head

        q = nn.Dense(self.config.n_embd)(x)
        k = nn.Dense(self.config.n_embd)(x)
        v = nn.Dense(self.config.n_embd)(x)
        # Split into heads: (b, l, d) -> (b, l, n_head, head_dim) -> (b, n_head, l, head_dim)
        q = jnp.transpose(jnp.reshape(q, (b, l, n_head, head_dim)), (0, 2, 1, 3))
        k = jnp.transpose(jnp.reshape(k, (b, l, n_head, head_dim)), (0, 2, 1, 3))
        v = jnp.transpose(jnp.reshape(v, (b, l, n_head, head_dim)), (0, 2, 1, 3))
        # q @ k^T / sqrt(head_dim) -> causal mask -> softmax -> @ v
        attn = jnp.matmul(q, jnp.transpose(k, (0, 1, 3, 2))) / jnp.sqrt(head_dim)
        # Lower-triangular mask: position i may only attend to positions j <= i
        mask = jnp.tril(jnp.ones((l, l), dtype=bool))
        attn = jnp.where(mask, attn, float("-inf"))
        probs = jax.nn.softmax(attn, axis=-1)
        y = jnp.matmul(probs, v)  # (b, n_head, l, head_dim)
        # Merge the heads back: (b, n_head, l, head_dim) -> (b, l, d)
        y = jnp.reshape(jnp.transpose(y, (0, 2, 1, 3)), (b, l, d))
        y = nn.Dense(self.config.n_embd)(y)
        return y
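The masking step can be looked at in isolation. The following plain-Python sketch (deliberately not JAX, so it runs without any dependencies) applies a causal mask to a small, made-up score matrix and then normalizes each row; `causal_softmax` is a hypothetical helper written for this illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, so they contribute exp(-inf) = 0
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy attention scores for a sequence of three tokens
scores = [[0.0, 1.0, 2.0],
          [0.0, 0.0, 5.0],
          [1.0, 1.0, 1.0]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row collapses to a single weight of 1, while the large score of 5.0 in the second row is ignored because it belongs to a future position.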

You may notice that there are no __init__ or forward methods like those you see in PyTorch. This is a distinctive feature of Flax: you can either define the layers explicitly in a setup method, or define them implicitly within the forward pass by adding nn.compact on top of the __call__ method. [ref]

Next let’s build the MLP and Block layers, which include the Dense layer, GELU activation function, LayerNorm and Dropout. Again flax.linen has the layer APIs to help us build the module. Note that we will pass a deterministic boolean variable to control different behaviors during training or evaluation for some layers like Dropout.

class MLP(nn.Module):
    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=True):
        x = nn.Dense(self.config.n_embd * 4)(x)
        x = nn.gelu(x, approximate=True)
        x = nn.Dropout(rate=self.config.dropout_rate)(x, deterministic=deterministic)
        x = nn.Dense(self.config.n_embd)(x)
        x = nn.Dropout(rate=self.config.dropout_rate)(x, deterministic=deterministic)
        return x

class Block(nn.Module):
    config: ModelConfig

    @nn.compact
    def __call__(self, x):
        # Pre-norm residual connections, as in GPT-2: x + f(LayerNorm(x))
        x = x + CausalSelfAttention(self.config)(nn.LayerNorm()(x))
        x = x + MLP(self.config)(nn.LayerNorm()(x))
        return x

Now let’s use the above blocks to build the NanoGPT:

Given an input sequence of token ids, we use the flax.linen.Embed layer to get position embeddings and token embeddings. Then we pass them through the Block module N times, where N is the number of layers defined in the model config. In the end, we map the outputs from the last Block into scores (logits) for each token in the vocab to predict the next token. Besides the forward __call__ method, let’s also create an init method that uses dummy inputs to get the model’s parameters.

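class GPT(nn.Module):
    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=False):
        B, T = x.shape
        # The sequence must fit within the positional embedding table
        # (the comparison was lost in the original listing and is restored here)
        assert T <= self.config.block_size

        pos = jnp.arange(0, T)[None]
        pos_emb = nn.Embed(self.config.block_size, self.config.n_embd)(pos)
        wte = nn.Embed(self.config.vocab_size, self.config.n_embd)
        tok_emb = wte(x)
        x = tok_emb + pos_emb

        for _ in range(self.config.n_layer):
            x = Block(self.config)(x)
        x = nn.LayerNorm()(x)
        # Parameter sharing: reuse the token embedding matrix as the output
        # projection, which is what gives the 124M parameter count
        logits = wte.attend(x)
        # logits = nn.Dense(self.config.vocab_size)(x)  # untied alternative
        return logits

    def init(self, rng):
        # Dummy tokens are only used to trace shapes and create the parameters
        tokens = jnp.zeros((1, self.config.block_size), dtype=jnp.uint16)
        params = jax.jit(super().init, static_argnums=(2,))(rng, tokens, True)
        return params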

Now let’s verify the number of parameters. We first initialize the model config dataclass and the random key, then create dummy inputs and feed them into the GPT model. Then we use the jax.tree_util.tree_map API to create a parameter-counting function. We got 124439808 (124M) parameters, the same amount as Hugging Face’s GPT2, BOOM!
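The 124M figure can also be cross-checked by hand from the model config. The arithmetic below is a sketch that assumes every Dense layer has a bias, every LayerNorm has a scale and a bias, and the output projection shares its weights with the token embedding; the helper `dense` is introduced only for this calculation.

```python
vocab_size, block_size, n_embd, n_layer = 50257, 1024, 768, 12

def dense(n_in, n_out):
    """Parameter count of a fully connected layer: weights plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * n_embd                          # scale + bias
attention = 4 * dense(n_embd, n_embd)            # query, key, value, output projections
mlp = dense(n_embd, 4 * n_embd) + dense(4 * n_embd, n_embd)
block = 2 * layer_norm + attention + mlp         # one transformer block

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
# Final LayerNorm is added once; the tied output head contributes no new parameters
total = embeddings + n_layer * block + layer_norm
print(total)  # 124439808
```

Without weight tying, a separate output projection would add roughly another 38.6M parameters, so matching the 124M figure is a useful sanity check that the sharing is in place.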

Colab Result: number of parameters
Verify number of params in huggingface’s GPT2

DataLoader and Training Loop

Let’s now overfit a small dataset. To make it comparable to Andrej’s video on PyTorch NanoGPT, let’s use the toy dataset that he shared in his video. We use the GPT-2 tokenizer from the tiktoken library to tokenize all the text from the input file, and convert the tokens into a jax.numpy.array for Jax’s model training.

import jax.numpy as jnp
import tiktoken

class DataLoader:
    def __init__(self, B, T):
        self.current_position = 0
        self.B = B
        self.T = T

        with open("input.txt", "r") as f:
            text = f.read()
        enc = tiktoken.get_encoding("gpt2")
        self.tokens = jnp.array(enc.encode(text))
        print(f"loaded {len(self.tokens)} tokens in the datasets")
        print(f" 1 epoch = {len(self.tokens)//(B*T)} batches")

    def next_batch(self):
        B, T = self.B, self.T
        # Take B*T + 1 tokens so that the targets are the inputs shifted by one
        buf = self.tokens[self.current_position:self.current_position + B*T + 1]
        x, y = jnp.reshape(buf[:-1], (B, T)), jnp.reshape(buf[1:], (B, T))
        self.current_position += B*T
        # Wrap around when the next batch would run past the end of the data
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y

Colab Result: Simple dataloader with 4 batch size and 128 sequence length

Next, let’s set aside distributed training and optimization for now, and just create a naive training loop for a sanity check. The first thing to do after initializing the model is to create a TrainState, a model state where we can update the parameters and gradients. The TrainState takes three important inputs: apply_fn (model forward function), params (model parameters from the init method), and tx (an Optax gradient transformation).

Then we use the train_step function to update the model state (gradients and parameters) and advance the model training. Optax provides softmax cross entropy as the loss function for the next-token prediction task, and jax.value_and_grad calculates the gradients and the loss value for the loss function. Finally, we update the model’s state with the new parameters using the apply_gradients API. [ref] Don’t forget to jit the train_step function to reduce the computation overhead!

from typing import Tuple

import jax
import jax.numpy as jnp
import optax
from flax.core import FrozenDict
from flax.training.train_state import TrainState

def init_train_state(key, config) -> TrainState:
    model = GPT(config)
    params = model.init(key)
    optimizer = optax.adamw(3e-4, b1=0.9, b2=0.98, eps=1e-9, weight_decay=1e-1)
    # The three inputs described above: forward function, parameters, optimizer
    train_state = TrainState.create(
        apply_fn=model.apply,
        params=params,
        tx=optimizer)
    return train_state

@jax.jit
def train_step(state: TrainState, x: jnp.ndarray, y: jnp.ndarray) -> Tuple[jnp.ndarray, TrainState]:

    def loss_fn(params: FrozenDict) -> jnp.ndarray:
        logits = state.apply_fn(params, x, False)
        loss = optax.softmax_cross_entropy_with_integer_labels(logits, y).mean()
        return loss

    loss, grads = jax.value_and_grad(loss_fn, has_aux=False)(state.params)
    new_state = state.apply_gradients(grads=grads)
    return loss, new_state

Now everything is ready for the poor man’s training loop. Let’s check the loss value. The model’s prediction should be better than a random guess, so the loss should be lower than -ln(1/50257)≈10.825. What we expect from overfitting a single batch is that the loss starts close to 10.825 and then goes down to close to 0. Let’s take a batch of (x, y) and run the training loop 50 times. I also add similar logging to calculate the training speed.
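The 10.825 baseline is simply the cross-entropy of a uniform guess over the vocabulary, which is easy to confirm with the standard library. The variable names below are chosen for this sketch only.

```python
import math

vocab_size = 50257

# Cross-entropy of a model that assigns probability 1/vocab_size to every token
uniform_loss = -math.log(1.0 / vocab_size)
print(f"loss of a uniform guess: {uniform_loss:.3f}")
```

A freshly initialized model should start near this value; a starting loss far above it usually points to a problem with initialization or with how the logits and labels are wired together.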

As we can see, the loss value is exactly what we expect, and the training throughput is around 400–500k tokens/sec, which is already 40x faster than PyTorch’s initial version without any optimization in Andrej’s video. Note that we run the Jax scripts on 1 A100 GPU, which should remove the hardware difference from the speed comparison. There is no .to(device) stuff to move your model or data from host CPU to device GPU, which is one of the benefits of Jax!

So that’s it, we made it. We will make the training 10x faster in Part 2 with more optimizations…

Part 2: The journey of training optimization to 1350k tokens/sec in a single GPU!

“Unless otherwise noted, all images are by the author”



Productionizing a RAG App with Prefect, Weave, and RAGAS | by Ed Izaguirre | Aug, 2024

Adding evaluation, automated data pulling, and other improvements.


From Film Search to Rosebud 🌹. Image from Unsplash.

Table of Contents

  1. Introduction
  2. Offline Evaluation
  3. Online Evaluation
  4. Automated Data Pulling with Prefect
  5. Summary


A few months ago, I released the Film Search app, a Retrieval-Augmented Generation (RAG) application designed to recommend films based on user queries. For example, a user may ask: “Find me drama movies in English that are less than 2 hours long and feature dogs.” and receive a recommendation like:

Title of Film: Hachi: A Dog’s Tale

Runtime: 93 minutes
Release Year: 2009
Streaming: Not available for streaming
This film tells the poignant true story of Hachiko, an Akita dog known for his remarkable loyalty to his owner. The emotional depth and the themes of friendship and loyalty resonate strongly, making it a touching drama that showcases the profound bond between humans and dogs. It’s perfect for anyone looking for a heartfelt story that highlights the importance of companionship.

This was not just a simple RAG app, however. It included what is known as self-querying retrieval. This means that the bot takes the user’s query and transforms it by adding metadata filters. This ensures any documents pulled into the chat model’s context respect the constraints set by the user’s query. For more information, I recommend checking out my earlier article linked above.

Unfortunately, there were some issues with the app:

  • There was no offline evaluation done, besides passing the ‘eye test’. This test is necessary, but not sufficient.
  • Observability was non-existent. If a query went poorly, you had to manually pull up the project and run some ad-hoc scripts in an attempt to see what went wrong.
  • The Pinecone vector database had to be pulled manually. This meant the documents would quickly be out of date if, say, a film got pulled from a streaming service.

In this article, I will briefly cover some of the improvements made to the Film Search app. This will cover:

  • Offline Evaluation using RAGAS and Weave
  • Online Evaluation and Observability
  • Automated Data Pulling using Prefect

One thing before we jump in: I found the name Film Search to be a bit generic, so I rebranded the app as Rosebud 🌹, hence the image shown above. Real film geeks will understand the reference.

It is important to be able to judge if a change made to your LLM application improves or degrades its performance. Unfortunately, evaluation of LLM apps is a difficult and novel space. There is simply not much agreement on what constitutes a good evaluation.

For Rosebud 🌹, I decided to tackle what is known as the RAG triad. This approach is promoted by TruLens, a platform to evaluate and track LLM applications.

The RAG Triad. Image by author.

The triad covers three aspects of a RAG app:

  • Context Relevancy: When a query is made by the user, documents fill the context of the chat model. Is the retrieved context actually useful? If not, you may need to tweak things like document embedding, chunking, or metadata filtering.
  • Faithfulness: Is the model’s response actually grounded in the retrieved documents? You don’t want the model making up facts; the whole point of RAG is to help reduce hallucinations by using retrieved documents.
  • Answer Relevancy: Does the model’s response actually answer the user’s query? If the user asks for “Comedy films made in the 1990s?”, the model’s answer better contain only comedy films made in the 1990s.

There are a few ways to attempt to assess these three functions of a RAG app. One way would be to use human expert evaluators. Unfortunately, this would be expensive and wouldn’t scale. For Rosebud 🌹 I decided to use the LLM-as-a-judge approach. This means using a chat model to look at each of the three criteria above and assign a score from 0 to 1 for each. This method has the advantage of being cheap and scaling well. To accomplish this, I used RAGAS, a popular framework that helps you evaluate your RAG applications. The RAGAS framework includes the three metrics mentioned above and makes it fairly easy to use them to evaluate your apps. Below is a code snippet demonstrating how I conducted this offline evaluation:

import asyncio

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import AnswerRelevancy, ContextRelevancy, Faithfulness
import weave

# `config` and `rosebud_chat_model` are defined elsewhere in the project

@weave.op()
def evaluate_with_ragas(query, model_output):
    # Put data into a Dataset object
    data = {
        "question": [query],
        "contexts": [[model_output['context']]],
        "answer": [model_output['answer']]
    }
    dataset = Dataset.from_dict(data)

    # Define metrics to judge
    metrics = [
        AnswerRelevancy(),
        ContextRelevancy(),
        Faithfulness(),
    ]

    judge_model = ChatOpenAI(model=config['JUDGE_MODEL_NAME'])
    embeddings_model = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])

    evaluation = evaluate(dataset=dataset, metrics=metrics, llm=judge_model, embeddings=embeddings_model)

    return {
        "answer_relevancy": float(evaluation['answer_relevancy']),
        "context_relevancy": float(evaluation['context_relevancy']),
        "faithfulness": float(evaluation['faithfulness']),
    }

def run_evaluation():
    # Initialize chat model
    model = rosebud_chat_model()

    # Define evaluation questions
    questions = [
        {"query": "Suggest a good movie based on a book."},  # Adaptations
        {"query": "Suggest a film for a cozy night in."},  # Mood-Based
        {"query": "What are some must-watch horror movies?"},  # Genre-Specific
        # Total of 20 questions
    ]

    # Create Weave Evaluation object
    evaluation = weave.Evaluation(dataset=questions, scorers=[evaluate_with_ragas])

    # Run the evaluation
    asyncio.run(evaluation.evaluate(model))

if __name__ == "__main__":
    weave.init("rosebud")  # project name is a placeholder
    run_evaluation()

A few notes:

  • With twenty questions and three criteria to judge across, you’re looking at sixty LLM calls for a single evaluation! It gets even worse though; with the rosebud_chat_model, there are two calls for every query: one to construct the metadata filter and another to provide the answer, so really this is 120 calls for a single eval! All models used in my evaluation are the new gpt-4o-mini, which I strongly recommend. In my experience the calls cost $0.05 per evaluation.
  • Note that we are using asyncio.run to run the evals. It is ideal to use asynchronous calls because you don’t want to evaluate each question sequentially one after the other. Instead, with asyncio we can begin evaluating other questions as we wait for previous I/O operations to finish.
  • There are a total of twenty questions for a single evaluation. These span a variety of typical film queries a user may ask. I mostly came up with these myself, but in practice it would be better to use queries actually asked by users in production.
  • Notice the weave.init and the @weave.op decorator that are being used. These are part of the new Weave library from Weights & Biases (W&B). Weave is a complement to the traditional W&B library, with a focus on LLM applications. It allows you to capture inputs and outputs of LLMs by using the simple @weave.op decorator. It also allows you to capture the results of evaluations using weave.Evaluation(…). By integrating RAGAS to perform evaluations and Weave to capture and log them, we get a powerful duo that helps GenAI developers iteratively improve their applications. You also get to log the model latency, cost, and more.
Example of Weave + RAGAS integration. Image by author.
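The benefit of asynchronous evaluation can be seen in a small self-contained sketch. Here `evaluate_question` is a hypothetical stand-in for an I/O-bound judge call (simulated with `asyncio.sleep`); it is not part of Weave or RAGAS, and the scores it returns are dummy values.

```python
import asyncio
import time

async def evaluate_question(question):
    # Placeholder for an I/O-bound call, e.g. a request to a judge model
    await asyncio.sleep(0.1)
    return {"query": question, "score": 1.0}

async def evaluate_all(questions):
    # gather schedules all evaluations at once and preserves the input order
    return await asyncio.gather(*(evaluate_question(q) for q in questions))

questions = [
    "Suggest a good movie based on a book.",
    "Suggest a film for a cozy night in.",
    "What are some must-watch horror movies?",
]
start = time.perf_counter()
results = asyncio.run(evaluate_all(questions))
elapsed = time.perf_counter() - start
print(f"{len(results)} evaluations finished in {elapsed:.2f}s")
```

Because the three simulated calls wait concurrently, the total time stays close to that of a single call rather than the sum of all three.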

In theory, one can now tweak a hyperparameter (e.g. temperature), re-run the evaluation, and see if the adjustment has a positive or negative impact. Unfortunately, in practice I found the LLM judging to be finicky, and I am not the only one. LLM judges seem to be fairly bad at using a floating point value to assess these metrics. Instead, they seem to do better at classification, e.g. a thumbs up/thumbs down. RAGAS doesn’t yet support LLM judges performing classification. Writing it by hand doesn’t seem too difficult, and perhaps in a future update I may attempt this myself.

Offline evaluation is good for seeing how tweaking hyperparameters affects performance, but in my opinion online evaluation is far more useful. In Rosebud 🌹 I have now incorporated the use of 👍/👎 buttons at the bottom of every response to provide feedback.

Example of online feedback. Image by author.

When a user clicks on either button they are told that their feedback was logged. Below is a snippet of how this was accomplished in the Streamlit interface:

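import datetime
import threading

import streamlit as st
import wandb

def start_log_feedback(feedback):
    print("Logging feedback.")
    st.session_state.feedback_given = True
    st.session_state.sentiment = feedback
    # The remaining arguments are cut off in the source; the session-state keys
    # below are assumed to mirror the parameters of log_feedback.
    thread = threading.Thread(target=log_feedback, args=(st.session_state.sentiment,
                                                         st.session_state.query,
                                                         st.session_state.query_constructor,
                                                         st.session_state.context,
                                                         st.session_state.response))
    thread.start()

def log_feedback(sentiment, query, query_constructor, context, response):
    ct = datetime.datetime.now()
    # Any other wandb.init arguments (e.g. the project) are cut off in the source
    wandb.init(name=f"query: {ct}")
    table = wandb.Table(columns=["sentiment", "query", "query_constructor", "context", "response"])
    table.add_data(sentiment, query, query_constructor, context, response)
    wandb.log({"Query Log": table})
    wandb.finish()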

Note that the process of sending the feedback to W&B runs on a separate thread rather than on the main thread. This is to prevent the user from getting stuck for a few seconds waiting for the logging to complete.
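
The effect of moving the logging off the main thread can be illustrated with a small standalone sketch. It uses a hypothetical slow_log function that sleeps in place of the real W&B call, so none of the names below come from the app itself.

```python
import threading
import time

logged = []  # stand-in for a remote store such as a W&B table


def slow_log(sentiment: str) -> None:
    # Hypothetical slow network call, simulated with a sleep.
    time.sleep(0.2)
    logged.append(sentiment)


def on_feedback(sentiment: str) -> threading.Thread:
    # Start the logging in the background and return immediately.
    thread = threading.Thread(target=slow_log, args=(sentiment,))
    thread.start()
    return thread


start = time.perf_counter()
worker = on_feedback("thumbs_up")
returned_after = time.perf_counter() - start  # time the caller was blocked
worker.join()  # only needed here so the script can inspect the result
print(f"caller blocked for {returned_after:.4f}s, logged: {logged}")
```

The caller gets control back almost immediately, while the slow call completes in the background.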

A W&B table is used to store the feedback. Five quantities are logged in the table:

  • Sentiment: Whether the user clicked thumbs up or thumbs down
  • Query: The user’s query, e.g. Find me drama movies in English that are less than 2 hours long and feature dogs.
  • Query_Constructor: The results of the query constructor, which rewrites the user’s query and includes metadata filtering if necessary, e.g.
    {
      "query": "drama English dogs",
      "filter": {
        "operator": "and",
        "arguments": [
          {"comparator": "eq", "attribute": "Genre", "value": "Drama"},
          {"comparator": "eq", "attribute": "Language", "value": "English"},
          {"comparator": "lt", "attribute": "Runtime (minutes)", "value": 120}
        ]
      }
    }

  • Context: The retrieved context based on the reconstructed query, e.g. Title: Hachi: A Dog’s Tale. Overview: A drama based on the true story of a college professor’s…
  • Response: The model’s response
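
To show what a structured filter like the one above expresses, here is a rough in-memory sketch that applies it to a few made-up films. This is not how LangChain or Pinecone evaluate filters internally: the matches helper, the film records, and the use of a single string for Genre are all assumptions made for illustration.

```python
import operator

# Comparator names mapped to Python operators (only the two used in the example).
COMPARATORS = {"eq": operator.eq, "lt": operator.lt}


def matches(film: dict, flt: dict) -> bool:
    # A node with an "operator" key combines sub-filters; only "and" is handled here.
    if "operator" in flt:
        if flt["operator"] != "and":
            raise ValueError(f"unsupported operator: {flt['operator']}")
        return all(matches(film, arg) for arg in flt["arguments"])
    compare = COMPARATORS[flt["comparator"]]
    return compare(film[flt["attribute"]], flt["value"])


structured_query = {
    "query": "drama English dogs",
    "filter": {
        "operator": "and",
        "arguments": [
            {"comparator": "eq", "attribute": "Genre", "value": "Drama"},
            {"comparator": "eq", "attribute": "Language", "value": "English"},
            {"comparator": "lt", "attribute": "Runtime (minutes)", "value": 120},
        ],
    },
}

# Hypothetical film records, not taken from the real dataset.
films = [
    {"Title": "Film A", "Genre": "Drama", "Language": "English", "Runtime (minutes)": 93},
    {"Title": "Film B", "Genre": "Drama", "Language": "English", "Runtime (minutes)": 150},
    {"Title": "Film C", "Genre": "Comedy", "Language": "English", "Runtime (minutes)": 95},
]

kept = [film["Title"] for film in films if matches(film, structured_query["filter"])]
print(kept)
```

Only films that pass every clause of the "and" filter survive, and the free-text part of the query ("drama English dogs") would then be used for the similarity search over those candidates.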

All of this is logged conveniently in the same project as the Weave evaluations shown earlier. Now, when a query goes south it is as simple as hitting the thumbs down button to see exactly what happened. This will allow much faster iteration and improvement of the Rosebud 🌹 recommendation application.

Image showing observability of the model’s response. Note on the left-hand side how it is seamless to transition between W&B and Weave. Image by author.

To ensure recommendations from Rosebud 🌹 continue to stay accurate it was important to automate the process of pulling data and uploading them to Pinecone. For this task, I chose Prefect. Prefect is a popular workflow orchestration tool. I was looking for something lightweight, easy to learn, and Pythonic. I found all of this in Prefect.

Automated flow for pulling and updating Pinecone vector store provided by Prefect. Image by author.

Prefect offers a variety of ways to schedule your workflows. I decided to use the push work pools with automatic infrastructure provisioning. I found that this setup balances simplicity with configurability. It allows the user to task Prefect with automatically provisioning all of the infrastructure needed to run your flow in your cloud provider of choice. I chose to deploy on Azure, but deploying on GCP or AWS only requires changing a few lines of code. Refer to the pinecone_flow.py file for more details. A simplified flow is provided below:

# Simplified excerpt: imports, helper functions such as get_id_list, and the
# parts marked "..." are omitted here; see pinecone_flow.py for the full code.

@task
def start():
    """
    Start-up: check everything works or fail fast!
    """
    # Print out some debug info
    print("Starting flow!")

    # Ensure user has set the appropriate env variables
    assert os.environ['LANGCHAIN_API_KEY']
    assert os.environ['OPENAI_API_KEY']

@task(retries=3, retry_delay_seconds=[1, 10, 100])
def pull_data_to_csv(config):
    TMBD_API_KEY = os.getenv('TMBD_API_KEY')
    YEARS = range(config["years"][0], config["years"][-1] + 1)
    CSV_HEADER = ['Title', 'Runtime (minutes)', 'Language', 'Overview', ...]

    for year in YEARS:
        # Grab list of ids for all films made in {YEAR}
        movie_list = list(set(get_id_list(TMBD_API_KEY, year)))

        FILE_NAME = f'./data/{year}_movie_collection_data.csv'

        # Creating file
        with open(FILE_NAME, 'w') as f:
            writer = csv.writer(f)
            writer.writerow(CSV_HEADER)
            ...  # one row per film in movie_list (omitted)

    print("Successfully pulled data from TMDB and created csv files in data/")

@task
def convert_csv_to_docs():
    # Loading in data from all csv files
    loader = DirectoryLoader(...)

    docs = loader.load()

    metadata_field_info = [
        AttributeInfo(name="Title",
                      description="The title of the movie", type="string"),
        AttributeInfo(name="Runtime (minutes)",
                      description="The runtime of the movie in minutes", type="integer"),
        # ...
    ]

    def convert_to_list(doc, field):
        if field in doc.metadata and doc.metadata[field] is not None:
            doc.metadata[field] = [item.strip()
                                   for item in doc.metadata[field].split(',')]

    ...

    fields_to_convert_list = ['Genre', 'Actors', 'Directors',
                              'Production Companies', 'Stream', 'Buy', 'Rent']

    # Set 'overview' and 'keywords' as 'page_content' and other fields as 'metadata'
    for doc in docs:
        # Parse the page_content string into a dictionary
        page_content_dict = dict(line.split(": ", 1)
                                 for line in doc.page_content.split("\n") if ": " in line)

        doc.page_content = (
            'Title: ' + page_content_dict.get('Title') +
            '. Overview: ' + page_content_dict.get('Overview')
            # + ... (remaining fields omitted)
        )
        ...

    print("Successfully took csv files and created docs")

    return docs

@task
def upload_docs_to_pinecone(docs, config):
    # Create empty index
    ...  # PINECONE_KEY and PINECONE_INDEX_NAME are set in lines omitted here

    pc = Pinecone(api_key=PINECONE_KEY)

    # Target index and check status
    pc_index = pc.Index(PINECONE_INDEX_NAME)

    embeddings = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])
    namespace = "film_search_prod"

    ...  # upload of docs using these embeddings and this namespace (omitted)

    print("Successfully uploaded docs to Pinecone vector store")

@task
def publish_dataset_to_weave(docs):
    # Initialize Weave
    ...

    rows = []
    for doc in docs:
        row = {
            'Title': doc.metadata.get('Title'),
            'Runtime (minutes)': doc.metadata.get('Runtime (minutes)'),
            # ...
        }
        rows.append(row)

    dataset = Dataset(name='Movie Collection', rows=rows)
    weave.publish(dataset)
    print("Successfully published dataset to Weave")

@flow
def pinecone_flow():
    with open('./config.json') as f:
        config = json.load(f)

    start()
    pull_data_to_csv(config)
    docs = convert_csv_to_docs()
    upload_docs_to_pinecone(docs, config)
    publish_dataset_to_weave(docs)

if __name__ == "__main__":
    pinecone_flow.deploy(
        name=...,  # arbitrary deployment name (omitted in this excerpt)
        work_pool_name="my-aci-pool",  # must match the pool created via the CLI
        cron="0 0 * * 0",
        image=DeploymentImage(
            name=...,      # arbitrary image name (omitted in this excerpt)
            platform=...,  # a Linux platform, since the Azure instances run Linux
        ),
    )

Notice how simple it is to turn Python functions into a Prefect flow. All you need are some sub-functions styled with @task decorators and a @flow decorator on the main function. Also note that after uploading the documents to Pinecone, the last step of our flow publishes the dataset to Weave. This is important for reproducibility purposes.

At the bottom of the script we see how deployment is done in Prefect.

  • We need to provide a name for the deployment. This is arbitrary.
  • We also need to specify a work_pool_name. Push work pools in Prefect submit flow runs directly to serverless infrastructure, so there is no need to run a worker as a middleman. This name needs to match the name used to create the pool, which we’ll see below.
  • You also need to specify a cron schedule (the name derives from chronos, the Greek word for time). This allows you to specify how often to repeat a workflow. The value “0 0 * * 0” means run this workflow at midnight every Sunday, i.e. once a week. Check out this website for details on how the cron syntax works.
  • Finally, you need to specify a DeploymentImage . Here you specify both a name and a platform . The name is arbitrary, but the platform is not. Since I want to deploy to Azure compute instances, and these instances run Linux, it’s important I specify that in the DeploymentImage .
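
For readers new to cron syntax, the five fields of “0 0 * * 0” can be unpacked with a short sketch. The describe_cron and next_sunday_midnight helpers below are illustrative names, not part of Prefect, and the second one only handles this particular weekly schedule rather than cron expressions in general.

```python
from datetime import datetime, timedelta

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    # Label each of the five whitespace-separated cron fields.
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def next_sunday_midnight(now: datetime) -> datetime:
    # Cron's day_of_week 0 is Sunday; Python's weekday() gives Sunday as 6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


schedule = describe_cron("0 0 * * 0")
next_run = next_sunday_midnight(datetime(2024, 8, 14, 12, 0))
print(schedule)
print(next_run)
```

Minute 0 and hour 0 pin the time to midnight, the two asterisks leave the day of month and the month unconstrained, and the final 0 selects Sunday.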

To deploy this flow on Azure using the CLI, run the following commands:

prefect work-pool create --type azure-container-instance:push --provision-infra my-aci-pool
prefect deployment run 'get_repo_info/my-deployment'

These commands will automatically provision all of the necessary infrastructure on Azure. This includes an Azure Container Registry (ACR) that will hold a Docker image containing all files in your directory as well as any necessary libraries listed in a requirements.txt . It will also include an Azure Container Instance (ACI) Identity that will have permissions necessary to deploy a container with the aforementioned Docker image. Finally, the deployment run command will schedule the code to be run every week. You can check the Prefect dashboard to see your flow get run:

Image of a flow in Prefect being successfully run. Image by author.

By updating my Pinecone vector store weekly, I can ensure that the recommendations from Rosebud 🌹 remain accurate.

In this article, I discussed my experience improving the Rosebud 🌹 app. This included the process of incorporating offline and online evaluation, as well as automating the update of my Pinecone vector store.

Some other improvements not mentioned in this article:

  • Including ratings from The Movie Database in the film data. You can now ask for “highly rated films” and the chat model will filter for films above a 7/10.
  • Upgraded chat models. Now the query and summary models are using gpt-4o-mini . Recall that the LLM judge model is also using gpt-4o-mini .
  • Embedding model upgraded to text-embedding-3-small from text-embedding-ada-002 .
  • Years now span 1950–2023, instead of starting at 1920. Film data from 1920–1950 was not high quality, and only messed up recommendations.
  • UI is cleaner, with all details regarding the project relegated to a sidebar.
  • Vastly improved documentation on GitHub.
  • Bug fixes.

As mentioned at the top of the article, the app is now 100% free to use! I will foot the bill for queries for the foreseeable future (hence the choice of gpt-4o-mini instead of the more expensive gpt-4o). I really want to get the experience of running an app in production, and having my readers test out Rosebud 🌹 is a great way to do this. In the unlikely event that the app really blows up, I will have to come up with some other model of funding. But that would be a great problem to have.

Enjoy discovering awesome films! 🎥



What You Need To Know To Build Large Streamlit Applications With Stripe Subscriptions And Firestore Integration. | by Erdogan Taskesen | Aug, 2024

The ability to turn ideas into software products is a great skill to learn. In this blog, I will describe what it takes, and how to put the parts together to create a software product without starting costs but with a subscription model and Firestore integration.

Photo by Shane Aldendorff on Unsplash

Whether you are a data scientist, data engineer, or in another field of software development, turning your thoughts into real working software products using only a laptop may be the greatest skill to have. Various fields of software development come together in such a process, from UX and front-end to back-end development, data handling, visualizations, cloud/server configurations, and so on. It is a process of going back and forth. The challenge is to decide which idea to start with, and how to avoid (starting) costs until the point that people want your product. In this blog I will discuss different kinds of ideas and showcase how I created and deployed SkyWalk using Streamlit, with subscriptions using Stripe, and data storage using Google Firestore.



LLM-Driven Synthetic Data Generation, Curation & Evaluation | by Cobus Greyling | Aug, 2024

Key considerations include:

  • Ensuring readability and interpretability of LLM-generated information to facilitate human understanding.
  • Implementing upstream knowledge enrichment or filtering to optimise human resource use and reduce time spent on low-value tasks.
  • Adding engaging interactive features to make data processing tasks more enjoyable and attract a wider audience.

In traditional crowdsourced annotation, workers receive a codebook detailing the task purpose, data explanation, and background knowledge to better understand their jobs.

Similarly, for LLM-driven data generation, task specification is crucial and can include role-play, format clarification, and knowledge augmentation.

A simple prompt like “suppose you are a {xxx}” can significantly improve LLM performance by setting the right context. This approach reminds me of another study, where the researchers propose a new persona-driven data synthesis method that uses different perspectives within a large language model (LLM) to create varied synthetic data.

To support this method on a large scale, they introduce Persona Hub, a collection of 1 billion diverse personas automatically gathered from web data.

To ensure valid supervision, generated data must be logically and grammatically coherent.

However, inherent issues like hallucination and the fat-tailed knowledge distribution in large language models (LLMs) can introduce significant noise. This often leads to factual errors, incorrect labels, or irrelevant content, particularly when generating long, complex, or domain-specific data.

Diversity refers to the variations in generated data, such as differences in text length, topic, and writing style.

It is crucial for creating synthetic samples that reflect the diversity of real-world data, which helps prevent overfitting and bias during model training or evaluation.

However, inherent biases in large language models (LLMs) often result in monotonous and less diverse content, limiting its usefulness in downstream tasks.

The aim of synthetic data is not to imbue the target model with knowledge, but rather to train the model on certain personas and special abilities like advanced reasoning or task decomposition.

By combining strong data discovery and data design practices within a well-structured data topology, the process of creating synthetic data becomes more efficient, accurate, and aligned with real-world needs.

This foundational layer is essential for generating high-quality synthetic data that can effectively train and validate machine learning models.


I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.




Stable and fast randomization using hash spaces | by David Clarance | Jul, 2024

Generate consistent assignments on the fly across different implementation environments

A bird’s eye view

A core part of running an experiment is to assign an experimental unit (for instance a customer) to a specific treatment (payment button variant, marketing push notification framing). Often this assignment needs to meet the following conditions:

  1. It needs to be random.
  2. It needs to be stable. If the customer comes back to the screen, they need to be exposed to the same payment button variant.
  3. It needs to be retrieved or generated very quickly.
  4. It needs to be available after the actual assignment so it can be used in downstream analysis.

When organizations first start their experimentation journey, a common pattern is to pre-generate assignments, store them in a database and then retrieve them at the time of assignment. This is a perfectly valid method to use and works great when you’re starting off. However, as you start to scale in customer and experiment volumes, this method becomes harder and harder to maintain and use reliably. You’ve got to (a) manage the complexity of storage, (b) ensure that assignments are actually random within and across experiments, and (c) retrieve the assignment reliably. All of these are hard at scale.

Using hash spaces helps solve some of these problems. It’s a simple solution but isn’t as widely known as it probably should be. This blog is an attempt to explain the technique. There are links to code in different languages at the end. However, if you’d like, you can also directly jump to code here.

We’re running an experiment to test which variant of a progress bar on our customer app drives the most engagement. There are three variants: Control (the default experience), Variant 1 and Variant 2.

We have 10 million customers that use our app every week and we want to ensure that these 10 million customers get randomly assigned to one of the three variants. Each time the customer comes back to the app they should see the same variant. We want control to be assigned with a 50% probability, Variant 1 to be assigned with a 30% probability and Variant 2 to be assigned with a 20% probability.

probability_assignments = {"Control": 50, "Variant 1": 30, "Variant 2": 20}

To make things simpler, we’ll start with 4 customers. These customers have IDs that we use to refer to them. These IDs are generally either GUIDs (something like "b7be65e3-c616-4a56-b90a-e546728a6640") or integers (like 1019222, 1028333). Any of these ID types would work but to make things easier to follow we’ll simply assume that these IDs are: “Customer1”, “Customer2”, “Customer3”, “Customer4”.

Our goal is to map these 4 customers to the three possible variants.

This method primarily relies on using hash algorithms that come with some very desirable properties. Hashing algorithms take a string of arbitrary length and map it to a ‘hash’ of a fixed length. The easiest way to understand this is through some examples.

A hash function takes a string and maps it to a constant hash space. For example, a hash function (in this case md5) takes the words “Hello”, “World”, “Hello World” and “Hello WorLd” (note the capital L) and maps each of them to an alphanumeric string of 32 characters.

A few important things to note:

  • The hashes are all of the same length.
  • A minor difference in the input (capital L instead of small L) changes the hash.
  • Hashes are hexadecimal strings. That is, they consist of the digits 0 to 9 and the first six letters of the alphabet (a, b, c, d, e and f).
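
The properties listed above can be checked directly with Python's hashlib. This is a small sketch that hashes the same four example words and prints the digests, so the values can be inspected rather than taken on trust.

```python
import hashlib
import string

words = ["Hello", "World", "Hello World", "Hello WorLd"]

# md5 always returns a 128-bit digest, i.e. 32 hexadecimal characters.
hashes = {word: hashlib.md5(word.encode()).hexdigest() for word in words}

for word, digest in hashes.items():
    print(f"{word!r:>14} -> {digest} (length {len(digest)})")

# Every character should be one of 0-9 or a-f.
all_hex = all(ch in string.hexdigits for digest in hashes.values() for ch in digest)
print("all hexadecimal:", all_hex)
```

Changing a single character of the input, as in “Hello World” versus “Hello WorLd”, produces a different digest.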

We can use this same logic and get hashes for our four customers:

import hashlib

representative_customers = ["Customer1", "Customer2", "Customer3", "Customer4"]

def get_hash(customer_id):
    hash_object = hashlib.md5(customer_id.encode())
    return hash_object.hexdigest()

{customer: get_hash(customer) for customer in representative_customers}

# {'Customer1': 'becfb907888c8d48f8328dba7edf6969',
# 'Customer2': '0b0216b290922f789dd3efd0926d898e',
# 'Customer3': '2c988de9d49d47c78f9f1588a1f99934',
# 'Customer4': 'b7ca9bb43a9387d6f16cd7b93a7e5fb0'}

Hexadecimal strings are just representations of numbers in base 16. We can convert them to integers in base 10.

⚠️ One important note here: We rarely need to use the full hash. In practice (for instance in the linked code) we use a much smaller part of the hash (first 10 characters). Here we use the full hash to make explanations a bit easier.

def get_integer_representation_of_hash(customer_id):
    hash_value = get_hash(customer_id)
    return int(hash_value, 16)

{
    customer: get_integer_representation_of_hash(customer)
    for customer in representative_customers
}

# {'Customer1': 253631877491484416479881095850175195497,
# 'Customer2': 14632352907717920893144463783570016654,
# 'Customer3': 59278139282750535321500601860939684148,
# 'Customer4': 244300725246749942648452631253508579248}

There are two important properties of these integers:

  1. These integers are stable: Given a fixed input (“Customer1”), the hashing algorithm will always give the same output.
  2. These integers are uniformly distributed: This one hasn’t been explained yet and mostly applies to cryptographic hash functions (such as md5). Uniformity is a design requirement for these hash functions. If they were not uniformly distributed, the chances of collisions (getting the same output for different inputs) would be higher and weaken the security of the hash. There are some explorations of the uniformity property.
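
One rough way to probe both properties is to hash a large number of synthetic IDs and count how often each residue modulo 100 occurs. The sketch below assumes IDs of the form "Customer{i}", which is an illustrative choice. It is an informal sanity check, not a statistical test.

```python
import hashlib
from collections import Counter


def last_two_digits(customer_id: str) -> int:
    # Integer value of the md5 hex digest, reduced to the range 0-99.
    return int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % 100


n_ids = 100_000
counts = Counter(last_two_digits(f"Customer{i}") for i in range(n_ids))

expected = n_ids / 100  # ideal count per residue if perfectly uniform
max_relative_deviation = max(
    abs(counts[residue] - expected) / expected for residue in range(100)
)

print("stable:", last_two_digits("Customer1") == last_two_digits("Customer1"))
print(f"largest relative deviation from uniform: {max_relative_deviation:.3f}")
```

With 100,000 IDs each residue should appear close to 1,000 times, with the sort of small fluctuations expected from random sampling.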

Now that we have an integer representation of each ID that’s stable (always has the same value) and uniformly distributed, we can use it to get to an assignment.

Going back to our probability assignments, we want to assign customers to variants with the following distribution:

{"Control": 50, "Variant 1": 30, "Variant 2": 20}

If we have 100 slots, we can divide them into 3 buckets where the number of slots in each bucket represents the probability we want to assign to that bucket. For instance, in our example, we divide the integer range 0–99 (100 units) into 0–49 (50 units), 50–79 (30 units) and 80–99 (20 units).

def divide_space_into_partitions(prob_distribution):
    partition_ranges = []
    start = 0
    for partition in prob_distribution:
        partition_ranges.append((start, start + partition))
        start += partition
    return partition_ranges

divide_space_into_partitions(prob_distribution=probability_assignments.values())

# note that this is zero indexed, lower bound inclusive and upper bound exclusive
# [(0, 50), (50, 80), (80, 100)]

Now, if we assign a customer to one of the 100 slots randomly, the resultant distribution should then be equal to our intended distribution. Another way to think about this is, if we choose a number randomly between 0 and 99, there’s a 50% chance it’ll be between 0 and 49, 30% chance it’ll be between 50 and 79 and 20% chance it’ll be between 80 and 99.

The only remaining step is to map the customer integers we generated to one of these hundred slots. We do this by extracting the last two digits of the integer generated and using that as the assignment. For instance, the last two digits for customer 1 are 97 (you can check this against the integer listed above). This falls in the third bucket (Variant 2) and hence the customer is assigned to Variant 2.
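
As a worked example of that step, the arithmetic for customer 1 can be spelled out using the integer listed earlier. The buckets dictionary is just an illustrative way of writing the three ranges.

```python
# Integer representation of the md5 hash of "Customer1", as listed earlier.
customer1_int = 253631877491484416479881095850175195497

# Taking the remainder modulo 100 keeps the last two decimal digits.
slot = customer1_int % 100

# The three buckets from the 50/30/20 split, written as slot ranges.
buckets = {
    "Control": range(0, 50),
    "Variant 1": range(50, 80),
    "Variant 2": range(80, 100),
}
variant = next(name for name, slots in buckets.items() if slot in slots)
print(slot, variant)
```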

We repeat this process iteratively for each customer. When we’re done with all our customers, we should find that the end distribution will be what we’d expect: 50% of customers are in control, 30% in variant 1, 20% in variant 2.

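def get_relevant_place_value(customer_id, place):
    # Not shown in this excerpt; assumed to keep the last digits of the integer
    # (place=100 keeps the last two digits, as described above)
    return get_integer_representation_of_hash(customer_id) % place

def assign_groups(customer_id, partitions):
    hash_value = get_relevant_place_value(customer_id, 100)
    for idx, (start, end) in enumerate(partitions):
        if start <= hash_value < end:
            return idx
    return None

partitions = divide_space_into_partitions(
    prob_distribution=probability_assignments.values()
)

groups = {
    customer: list(probability_assignments.keys())[assign_groups(customer, partitions)]
    for customer in representative_customers
}
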
# output
# {'Customer1': 'Variant 2',
# 'Customer2': 'Variant 1',
# 'Customer3': 'Control',
# 'Customer4': 'Control'}

The linked gist has a replication of the above for 1,000,000 customers where we can observe that customers are distributed in the expected proportions.

# resulting proportions from a simulation on 1 million customers.
# {'Variant 1': 0.299799, 'Variant 2': 0.199512, 'Control': 0.500689}
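
For readers who want to try this without opening the gist, here is a self-contained sketch of a similar simulation. It is not the gist's code: it assumes synthetic IDs of the form "Customer{i}", uses 100,000 customers to keep the run short, and uses only the first 10 hex characters of the hash, as the earlier note says is done in practice.

```python
import hashlib
from collections import Counter

probability_assignments = {"Control": 50, "Variant 1": 30, "Variant 2": 20}


def assign_variant(customer_id: str) -> str:
    # Use a 10-character prefix of the md5 hex digest and keep its last two digits.
    prefix = hashlib.md5(customer_id.encode()).hexdigest()[:10]
    slot = int(prefix, 16) % 100
    upper = 0
    for variant, width in probability_assignments.items():
        upper += width  # exclusive upper bound of this variant's bucket
        if slot < upper:
            return variant
    raise ValueError("probabilities must sum to 100")


n_customers = 100_000
counts = Counter(assign_variant(f"Customer{i}") for i in range(1, n_customers + 1))
proportions = {
    variant: counts[variant] / n_customers for variant in probability_assignments
}
print(proportions)
```

Because the assignment depends only on the ID, rerunning the script gives every customer the same variant, and the proportions should land close to 50%, 30% and 20%.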

