05Aug

Let’s reproduce NanoGPT with JAX! (Part 1) | by Louis Wang | Jul, 2024


Inspired by Andrej Karpathy’s recent YouTube video on Let’s reproduce GPT-2 (124M), I’d like to rebuild it with most of the training optimizations in JAX. JAX is built for highly efficient computation, and it is quite interesting to compare PyTorch, with its recent training optimizations, against JAX and its related libraries like Flax (a layers API for neural network training in JAX) and Optax (a gradient processing and optimization library for JAX). We will quickly learn what JAX is, rebuild the GPT with JAX, and in the end compare the tokens/sec of multi-GPU training between PyTorch and JAX!

AI generated GPT

What is Jax?

Based on its documentation, JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. I would like to introduce JAX through its name. While some call it Just Another XLA (Accelerated Linear Algebra), I prefer to call it J(it) A(utograd) X(LA) to highlight its capability for high efficiency.

J — Just-in-time (JIT) compilation. When you run your Python function, JAX traces it into a primitive set of operations called a jaxpr. The jaxpr expression is then converted into an input for XLA, which compiles this lower-level program to produce an optimized executable for the target device (CPU, GPU or TPU).

A — Autograd. Computing gradients is a critical part of modern machine learning methods, and you can simply call jax.grad() to get gradients, which enables you to optimize the models.

X — XLA. This is an open-source machine learning compiler for CPUs, GPUs and ML accelerators. In general, XLA performs several built-in optimization and analysis passes on the StableHLO graph, then sends the HLO computation to a backend for further HLO-level optimizations. The backend then performs target-specific code generation.

Those are just some key features of JAX, but it also has many user-friendly NumPy-like APIs in jax.numpy, automatic vectorization with jax.vmap, and parallelization of your code across multiple devices via jax.pmap. We will cover more JAX concepts and applications in future blogs, but now let’s reproduce NanoGPT with JAX!
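
To make these transformations concrete, here is a minimal, self-contained sketch (the toy loss function and values are purely illustrative) showing a jaxpr, jit, grad and vmap in action:

import jax
import jax.numpy as jnp

def loss(w, x):
    # a toy quadratic loss: sum of squared projections
    return jnp.sum((x @ w) ** 2)

w = jnp.ones((3,))
x = jnp.arange(6.0).reshape(2, 3)

print(jax.make_jaxpr(loss)(w, x))        # the jaxpr that jit hands to XLA
fast_loss = jax.jit(loss)                # J: just-in-time compile with XLA
grads = jax.grad(loss)(w, x)             # A: gradient of loss w.r.t. w
row_proj = jax.vmap(lambda r: r @ w)(x)  # automatic vectorization over rows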

From Attention to Transformer

GPT is a decoder-only transformer model, and the key building block is the attention module. We can first define a model config dataclass to hold the model hyperparameters, so that the model module can consume it efficiently to initialize the model architecture. Similar to the 124M GPT model, here we initialize a 12-layer transformer decoder with 12 heads and a vocabulary of 50257 tokens, each of which has a 768-dimensional embedding. The block size for the attention calculation is 1024.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 50257
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 1024
    n_layer: int = 12
    dropout_rate: float = 0.1

Next comes the key building block of the transformer model: attention. The idea is to project the inputs into three weight matrices: query, key, and value. Here we rely on Flax, the JAX layers and training API library, to initialize the three weight matrices by simply calling flax.linen.Dense. As mentioned, JAX has many NumPy-like APIs, so after applying the weight matrices we reshape the outputs with jax.numpy.reshape from [batch_size, sequence_length, embedding_dim] to [batch_size, sequence_length, num_head, embedding_dim / num_head]. Since we need matrix multiplication between the query and key matrices, JAX also provides jax.numpy.matmul and jax.numpy.transpose (to transpose the key matrix for the multiplication).

Multihead Attention

Note that we need to apply a mask to the attention matrix to avoid information leakage (tokens must not attend to later tokens). jax.numpy.tril helps build a lower-triangular array, and jax.numpy.where fills the masked positions with negative infinity so that they become 0 after the softmax (jax.nn.softmax). The full code of the multi-head attention can be found below.

from flax import linen as nn
import jax
import jax.numpy as jnp

class CausalSelfAttention(nn.Module):

    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=True):

        assert len(x.shape) == 3
        b, l, d = x.shape
        head_dim = d // self.config.n_head

        q = nn.Dense(self.config.n_embd)(x)
        k = nn.Dense(self.config.n_embd)(x)
        v = nn.Dense(self.config.n_embd)(x)
        # split heads: [b, l, d] -> [b, n_head, l, head_dim]
        q = jnp.transpose(jnp.reshape(q, (b, l, self.config.n_head, head_dim)), (0, 2, 1, 3))
        k = jnp.transpose(jnp.reshape(k, (b, l, self.config.n_head, head_dim)), (0, 2, 1, 3))
        v = jnp.transpose(jnp.reshape(v, (b, l, self.config.n_head, head_dim)), (0, 2, 1, 3))
        # q @ k^T / sqrt(head_dim) -> causal mask -> softmax -> @ v
        norm = jnp.sqrt(head_dim)
        attn = jnp.matmul(q, jnp.transpose(k, (0, 1, 3, 2))) / norm
        mask = jnp.tril(jnp.ones((l, l), dtype=bool))[None, None, :, :]
        attn = jnp.where(mask, attn, float("-inf"))
        probs = jax.nn.softmax(attn, axis=-1)
        y = jnp.matmul(probs, v)
        # merge heads back: [b, n_head, l, head_dim] -> [b, l, d]
        y = jnp.reshape(jnp.transpose(y, (0, 2, 1, 3)), (b, l, d))
        y = nn.Dense(self.config.n_embd)(y)
        return y

You may notice that there are no __init__ or forward methods as you would see in PyTorch. This is specific to Flax, where you can explicitly define the layers with the setup method, or implicitly define them within the forward pass by adding the nn.compact decorator on top of the __call__ method. [ref]
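
For illustration only, here is a minimal sketch of an equivalent setup-style module (the class and attribute names are made up for this example):

class MLPWithSetup(nn.Module):

    config: ModelConfig

    def setup(self):
        # layers are declared once here instead of inline in __call__
        self.fc1 = nn.Dense(self.config.n_embd * 4)
        self.fc2 = nn.Dense(self.config.n_embd)

    def __call__(self, x):
        return self.fc2(nn.gelu(self.fc1(x), approximate=True))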

Next let’s build the MLP and Block layers, which include Dense layers, the GELU activation function, LayerNorm and Dropout. Again, flax.linen has the layer APIs to help us build the modules. Note that we will pass a deterministic boolean variable to control different behaviors during training and evaluation for layers like Dropout.

class MLP(nn.Module):

    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=True):
        x = nn.Dense(self.config.n_embd * 4)(x)
        x = nn.gelu(x, approximate=True)
        x = nn.Dropout(rate=self.config.dropout_rate)(x, deterministic=deterministic)
        x = nn.Dense(self.config.n_embd)(x)
        x = nn.Dropout(rate=self.config.dropout_rate)(x, deterministic=deterministic)
        return x

class Block(nn.Module):

    config: ModelConfig

    @nn.compact
    def __call__(self, x):
        # pre-norm residual connections, as in GPT-2
        x = x + CausalSelfAttention(self.config)(nn.LayerNorm()(x))
        x = x + MLP(self.config)(nn.LayerNorm()(x))
        return x

Now let’s use the above blocks to build the NanoGPT:

Given an input sequence of token IDs, we use the flax.linen.Embed layer to get position embeddings and token embeddings. Then we pass them through the Block module N times, where N is the number of layers defined in the model config. In the end, we map the outputs of the last Block into logits over the vocabulary to predict the next token. Besides the forward __call__ method, let’s also create an init method that feeds dummy inputs through the model to get the model’s parameters.

class GPT(nn.Module):

    config: ModelConfig

    @nn.compact
    def __call__(self, x, deterministic=False):

        B, T = x.shape
        assert T <= self.config.block_size

        pos = jnp.arange(0, T)[None]
        pos_emb = nn.Embed(self.config.block_size, self.config.n_embd)(pos)
        wte = nn.Embed(self.config.vocab_size, self.config.n_embd)
        tok_emb = wte(x)
        x = tok_emb + pos_emb

        for _ in range(self.config.n_layer):
            x = Block(self.config)(x)
        x = nn.LayerNorm()(x)
        logits = wte.attend(x)  # parameter sharing with the token embedding (keeps the 124M count)
        # alternatively: logits = nn.Dense(self.config.vocab_size)(x), at the cost of extra parameters
        return logits

    def init(self, rng):
        tokens = jnp.zeros((1, self.config.block_size), dtype=jnp.uint16)
        params = jax.jit(super().init, static_argnums=(2,))(rng, tokens, True)
        return params

Now let’s verify the number of parameters: we first initialize the model config dataclass and the random key, then create dummy inputs and feed them into the GPT model. Then we use the jax.tree_util.tree_map API to build a parameter-counting function. We get 124,439,808 (124M) parameters, the same amount as Hugging Face’s GPT-2, BOOM!
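
The counting itself is a short walk over the parameter pytree. A minimal sketch, reusing the ModelConfig and GPT defined above:

import jax

config = ModelConfig()
key = jax.random.PRNGKey(0)
params = GPT(config).init(key)

# map every parameter leaf to its size, then sum over all leaves
sizes = jax.tree_util.tree_map(lambda p: p.size, params)
num_params = sum(jax.tree_util.tree_leaves(sizes))
print(f"number of parameters: {num_params:,}")  # expect ~124M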

Colab Result: number of parameters
Verify number of params in huggingface’s GPT2

DataLoader and Training Loop

Let’s now overfit a small dataset. To make it comparable to Andrej’s video on the PyTorch NanoGPT, let’s use the toy dataset that he shared in his video. We use the GPT-2 tokenizer from the tiktoken library to tokenize all the text from the input file, and convert the tokens into a jax.numpy array for JAX model training.

import tiktoken
import jax.numpy as jnp

class DataLoader:
    def __init__(self, B, T):
        self.current_position = 0
        self.B = B
        self.T = T

        with open("input.txt", "r") as f:
            text = f.read()
        enc = tiktoken.get_encoding("gpt2")
        self.tokens = jnp.array(enc.encode(text))
        print(f"loaded {len(self.tokens)} tokens in the datasets")
        print(f"1 epoch = {len(self.tokens)//(B*T)} batches")

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position:self.current_position+B*T+1]
        x, y = jnp.reshape(buf[:-1], (B, T)), jnp.reshape(buf[1:], (B, T))
        self.current_position += B*T
        if self.current_position + B*T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y

Colab Result: Simple dataloader with 4 batch size and 128 sequence length

Next, let’s set distributed training and optimization aside for now, and just create a naive training loop for a sanity check. The first thing after initializing the model is to create a TrainState, a model state through which we can update the parameters and gradients. The TrainState takes three important inputs: apply_fn (the model forward function), params (the model parameters from the init method), and tx (an Optax gradient transformation).

Then we use the train_step function to update the model state (gradients and parameters) and drive the model training. Optax provides softmax cross entropy as the loss function for the next-token prediction task, and jax.value_and_grad calculates both the gradients and the loss value for the loss function. Finally, we update the model’s state with the new parameters using the apply_gradients API. [ref] Don’t forget to jit the train_step function to reduce the computation overhead!

from typing import Tuple

import jax
import optax
from flax.core import FrozenDict
from flax.training.train_state import TrainState

def init_train_state(key, config) -> TrainState:
    model = GPT(config)
    params = model.init(key)
    optimizer = optax.adamw(3e-4, b1=0.9, b2=0.98, eps=1e-9, weight_decay=1e-1)
    train_state = TrainState.create(
        apply_fn=model.apply,
        params=params,
        tx=optimizer)
    return train_state

@jax.jit
def train_step(state: TrainState, x: jnp.ndarray, y: jnp.ndarray) -> Tuple[jnp.ndarray, TrainState]:

    def loss_fn(params: FrozenDict) -> jnp.ndarray:
        logits = state.apply_fn(params, x, False)
        loss = optax.softmax_cross_entropy_with_integer_labels(logits, y).mean()
        return loss

    loss, grads = jax.value_and_grad(loss_fn, has_aux=False)(state.params)
    new_state = state.apply_gradients(grads=grads)
    return loss, new_state

Now everything is ready for the poor man’s training loop. Let’s check the loss value. The model’s prediction should be better than a random guess, so the loss should be lower than -ln(1/50257) ≈ 10.825. What we expect from overfitting a single batch is that the loss starts close to 10.825 and then drops close to 0. Let’s take a batch of (x, y) and run the training loop 50 times. I also add a simple log to measure the training speed.
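
Here is a minimal sketch of that loop, reusing the DataLoader, init_train_state and train_step defined above (the batch size and log format are illustrative):

import time

config = ModelConfig()
key = jax.random.PRNGKey(0)
train_state = init_train_state(key, config)

data_loader = DataLoader(B=4, T=128)
x, y = data_loader.next_batch()  # overfit this single batch

for step in range(50):
    t0 = time.time()
    loss, train_state = train_step(train_state, x, y)
    loss.block_until_ready()     # wait for async dispatch before timing
    dt = time.time() - t0
    tokens_per_sec = x.shape[0] * x.shape[1] / dt
    print(f"step {step} | loss {float(loss):.4f} | {tokens_per_sec:,.0f} tok/sec")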

As we can see, the loss value is exactly what we expect, and the training throughput is around 400–500k tokens/sec, which is already about 40x faster than the initial, unoptimized PyTorch version in Andrej’s video. Note that we run the JAX scripts on a single A100 GPU, which should remove hardware differences from the speed comparison. There is no .to(device) dance to move your model or data from the host CPU to the device GPU, which is one of the benefits of JAX!

So that’s it, and we made it. We will make the training 10x faster in Part 2 with more optimizations…

Part 2: The journey of training optimization to 1350k tokens/sec on a single GPU!

“Unless otherwise noted, all images are by the author”



Source link

04Aug

Productionizing a RAG App with Prefect, Weave, and RAGAS | by Ed Izaguirre | Aug, 2024


Adding evaluation, automated data pulling, and other improvements.

From Film Search to Rosebud 🌹. Image from Unsplash.

Table of Contents

  1. Introduction
  2. Offline Evaluation
  3. Online Evaluation
  4. Automated Data Pulling with Prefect
  5. Summary

Relevant Links

A few months ago, I released the Film Search app, a Retrieval-Augmented Generation (RAG) application designed to recommend films based on user queries. For example, a user may ask: “Find me drama movies in English that are less than 2 hours long and feature dogs.” and receive a recommendation like:

Title of Film: Hachi: A Dog’s Tale

Runtime: 93 minutes
Release Year: 2009
Streaming: Not available for streaming
This film tells the poignant true story of Hachiko, an Akita dog known for his remarkable loyalty to his owner. The emotional depth and the themes of friendship and loyalty resonate strongly, making it a touching drama that showcases the profound bond between humans and dogs. It’s perfect for anyone looking for a heartfelt story that highlights the importance of companionship.

This was not just a simple RAG app, however. It included what is known as self-querying retrieval. This means that the bot takes the user’s query and transforms it by adding metadata filters. This ensures that any documents pulled into the chat model’s context respect the constraints set by the user’s query. For more information, I recommend checking out my earlier article linked above.

Unfortunately, there were some issues with the app:

  • There was no offline evaluation done, besides passing the ‘eye test’. This test is necessary, but not sufficient.
  • Observability was non-existent. If a query went poorly, you had to manually pull up the project and run some ad-hoc scripts in an attempt to see what went wrong.
  • The Pinecone vector database had to be pulled manually. This meant the documents would quickly be out of date if, say, a film got pulled from a streaming service.

In this article, I will briefly cover some of the improvements made to the Film Search app. This will cover:

  • Offline Evaluation using RAGAS and Weave
  • Online Evaluation and Observability
  • Automated Data Pulling using Prefect

One thing before we jump in: I found the name Film Search to be a bit generic, so I rebranded the app as Rosebud 🌹, hence the image shown above. Real film geeks will understand the reference.

It is important to be able to judge if a change made to your LLM application improves or degrades its performance. Unfortunately, evaluation of LLM apps is a difficult and novel space. There is simply not much agreement on what constitutes a good evaluation.

For Rosebud 🌹, I decided to tackle what is known as the RAG triad. This approach is promoted by TruLens, a platform to evaluate and track LLM applications.

The RAG Triad. Image by author.

The triad covers three aspects of a RAG app:

  • Context Relevancy: When a query is made by the user, documents fill the context of the chat model. Is the retrieved context actually useful? If not, you may need to tweak things like document embedding, chunking, or metadata filtering.
  • Faithfulness: Is the model’s response actually grounded in the retrieved documents? You don’t want the model making up facts; the whole point of RAG is to help reduce hallucinations by using retrieved documents.
  • Answer Relevancy: Does the model’s response actually answer the user’s query? If the user asks for “Comedy films made in the 1990s?”, the model’s answer better contain only comedy films made in the 1990s.

There are a few ways to attempt to assess these three functions of a RAG app. One way would be to use human expert evaluators. Unfortunately, this would be expensive and wouldn’t scale. For Rosebud 🌹 I decided to use an LLM-as-a-judge. This means using a chat model to look at each of the three criteria above and assign a score from 0 to 1 for each. This method has the advantage of being cheap and scaling well. To accomplish this, I used RAGAS, a popular framework that helps you evaluate your RAG applications. The RAGAS framework includes the three metrics mentioned above and makes it fairly easy to use them to evaluate your apps. Below is a code snippet demonstrating how I conducted this offline evaluation:

import asyncio

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import AnswerRelevancy, ContextRelevancy, Faithfulness
import weave

# config and rosebud_chat_model are defined elsewhere in the project;
# the imports above are the ones assumed by this snippet

@weave.op()
def evaluate_with_ragas(query, model_output):
    # Put data into a Dataset object
    data = {
        "question": [query],
        "contexts": [[model_output['context']]],
        "answer": [model_output['answer']]
    }
    dataset = Dataset.from_dict(data)

    # Define metrics to judge
    metrics = [
        AnswerRelevancy(),
        ContextRelevancy(),
        Faithfulness(),
    ]

    judge_model = ChatOpenAI(model=config['JUDGE_MODEL_NAME'])
    embeddings_model = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])

    evaluation = evaluate(dataset=dataset, metrics=metrics, llm=judge_model, embeddings=embeddings_model)

    return {
        "answer_relevancy": float(evaluation['answer_relevancy']),
        "context_relevancy": float(evaluation['context_relevancy']),
        "faithfulness": float(evaluation['faithfulness']),
    }

def run_evaluation():
    # Initialize chat model
    model = rosebud_chat_model()

    # Define evaluation questions
    questions = [
        {"query": "Suggest a good movie based on a book."},       # Adaptations
        {"query": "Suggest a film for a cozy night in."},         # Mood-Based
        {"query": "What are some must-watch horror movies?"},     # Genre-Specific
        ...
        # Total of 20 questions
    ]

    # Create Weave Evaluation object
    evaluation = weave.Evaluation(dataset=questions, scorers=[evaluate_with_ragas])

    # Run the evaluation
    asyncio.run(evaluation.evaluate(model))

if __name__ == "__main__":
    weave.init('film-search')
    run_evaluation()

A few notes:

  • With twenty questions and three criteria to judge across, you’re looking at sixty LLM calls for a single evaluation! It gets even worse though; with the rosebud_chat_model, there are two calls for every query: one to construct the metadata filter and another to provide the answer, so really this is 120 calls for a single eval! All models used in my evaluation are the new gpt-4o-mini, which I strongly recommend. In my experience the calls cost about $0.05 per evaluation.
  • Note that we are using asyncio.run to run the evals. It is ideal to use asynchronous calls because you don’t want to evaluate each question sequentially one after the other. Instead, with asyncio we can begin evaluating other questions as we wait for previous I/O operations to finish.
  • There are a total of twenty questions for a single evaluation. These span a variety of typical film queries a user may ask. I mostly came up with these myself, but in practice it would be better to use queries actually asked by users in production.
  • Notice the weave.init and the @weave.op decorator that are being used. These are part of the new Weave library from Weights & Biases (W&B). Weave is a complement to the traditional W&B library, with a focus on LLM applications. It allows you to capture inputs and outputs of LLMs by using the simple @weave.op decorator. It also allows you to capture the results of evaluations using weave.Evaluation(…). By integrating RAGAS to perform evaluations and Weave to capture and log them, we get a powerful duo that helps GenAI developers iteratively improve their applications. You also get to log the model latency, cost, and more.
Example of Weave + RAGAS integration. Image by author.

In theory, one can now tweak a hyperparameter (e.g. temperature), re-run the evaluation, and see if the adjustment has a positive or negative impact. Unfortunately, in practice I found the LLM judging to be finicky, and I am not the only one. LLM judges seem to be fairly bad at using a floating point value to assess these metrics. Instead, it appears they seem to do better at classification e.g. a thumbs up/thumbs down. RAGAS doesn’t yet support LLM judges performing classification. Writing it by hand doesn’t seem too difficult, and perhaps in a future update I may attempt this myself.
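
As a rough illustration of that idea, here is a sketch of a hand-rolled thumbs-up/thumbs-down faithfulness judge. It is not part of RAGAS or the Rosebud codebase; the prompt wording and the binary_faithfulness name are made up for this example:

from langchain_openai import ChatOpenAI
import weave

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

@weave.op()
def binary_faithfulness(query: str, model_output: dict) -> dict:
    # ask the judge for a simple YES/NO verdict instead of a float score
    prompt = (
        f"Context:\n{model_output['context']}\n\n"
        f"Answer:\n{model_output['answer']}\n\n"
        "Is every claim in the answer supported by the context? Reply YES or NO."
    )
    verdict = judge.invoke(prompt).content.strip().upper()
    return {"faithful": verdict.startswith("YES")}

A scorer like this could be passed to weave.Evaluation alongside, or instead of, the RAGAS metrics.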

Offline evaluation is good for seeing how tweaking hyperparameters affects performance, but in my opinion online evaluation is far more useful. In Rosebud 🌹 I have now incorporated the use of 👍/👎 buttons at the bottom of every response to provide feedback.

Example of online feedback. Image by author.

When a user clicks on either button they are told that their feedback was logged. Below is a snippet of how this was accomplished in the Streamlit interface:

import datetime
import threading

import streamlit as st
import wandb

def start_log_feedback(feedback):
    print("Logging feedback.")
    st.session_state.feedback_given = True
    st.session_state.sentiment = feedback
    thread = threading.Thread(target=log_feedback, args=(st.session_state.sentiment,
                                                         st.session_state.query,
                                                         st.session_state.query_constructor,
                                                         st.session_state.context,
                                                         st.session_state.response))
    thread.start()

def log_feedback(sentiment, query, query_constructor, context, response):
    ct = datetime.datetime.now()
    wandb.init(project="film-search",
               name=f"query: {ct}")
    table = wandb.Table(columns=["sentiment", "query", "query_constructor", "context", "response"])
    table.add_data(sentiment,
                   query,
                   query_constructor,
                   context,
                   response)
    wandb.log({"Query Log": table})
    wandb.finish()

Note that the process of sending the feedback to W&B runs on a separate thread rather than on the main thread. This is to prevent the user from getting stuck for a few seconds waiting for the logging to complete.

A W&B table is used to store the feedback. Five quantities are logged in the table:

  • Sentiment: Whether the user clicked thumbs up or thumbs down
  • Query: The user’s query, e.g. Find me drama movies in English that are less than 2 hours long and feature dogs.
  • Query_Constructor: The results of the query constructor, which rewrites the user’s query and includes metadata filtering if necessary, e.g.
{
    "query": "drama English dogs",
    "filter": {
        "operator": "and",
        "arguments": [
            {"comparator": "eq", "attribute": "Genre", "value": "Drama"},
            {"comparator": "eq", "attribute": "Language", "value": "English"},
            {"comparator": "lt", "attribute": "Runtime (minutes)", "value": 120}
        ]
    }
}

  • Context: The retrieved context based on the reconstructed query, e.g. Title: Hachi: A Dog’s Tale. Overview: A drama based on the true story of a college professor’s…
  • Response: The model’s response

All of this is logged conveniently in the same project as the Weave evaluations shown earlier. Now, when a query goes south it is as simple as hitting the thumbs down button to see exactly what happened. This will allow much faster iteration and improvement of the Rosebud 🌹 recommendation application.

Image showing observability of the model’s response. Note on the left-hand side how it is seamless to transition between W&B and Weave. Image by author.

To ensure that recommendations from Rosebud 🌹 stay accurate, it was important to automate the process of pulling data and uploading it to Pinecone. For this task, I chose Prefect. Prefect is a popular workflow orchestration tool. I was looking for something lightweight, easy to learn, and Pythonic. I found all of this in Prefect.

Automated flow for pulling and updating Pinecone vector store provided by Prefect. Image by author.

Prefect offers a variety of ways to schedule your workflows. I decided to use push work pools with automatic infrastructure provisioning. I found that this setup balances simplicity with configurability. It lets you task Prefect with automatically provisioning all of the infrastructure needed to run your flow in your cloud provider of choice. I chose to deploy on Azure, but deploying on GCP or AWS only requires changing a few lines of code. Refer to the pinecone_flow.py file for more details. A simplified flow is provided below:

@task
def start():
    """
    Start-up: check everything works or fail fast!
    """

    # Print out some debug info
    print("Starting flow!")

    # Ensure user has set the appropriate env variables
    assert os.environ['LANGCHAIN_API_KEY']
    assert os.environ['OPENAI_API_KEY']
    ...

@task(retries=3, retry_delay_seconds=[1, 10, 100])
def pull_data_to_csv(config):
    TMBD_API_KEY = os.getenv('TMBD_API_KEY')
    YEARS = range(config["years"][0], config["years"][-1] + 1)
    CSV_HEADER = ['Title', 'Runtime (minutes)', 'Language', 'Overview', ...]

    for year in YEARS:
        # Grab list of ids for all films made in {YEAR}
        movie_list = list(set(get_id_list(TMBD_API_KEY, year)))

        FILE_NAME = f'./data/{year}_movie_collection_data.csv'

        # Creating file
        with open(FILE_NAME, 'w') as f:
            writer = csv.writer(f)
            writer.writerow(CSV_HEADER)

        ...

    print("Successfully pulled data from TMDB and created csv files in data/")

@task
def convert_csv_to_docs():
    # Loading in data from all csv files
    loader = DirectoryLoader(
        ...
        show_progress=True)

    docs = loader.load()

    metadata_field_info = [
        AttributeInfo(name="Title",
                      description="The title of the movie", type="string"),
        AttributeInfo(name="Runtime (minutes)",
                      description="The runtime of the movie in minutes", type="integer"),
        ...
    ]

    def convert_to_list(doc, field):
        if field in doc.metadata and doc.metadata[field] is not None:
            doc.metadata[field] = [item.strip()
                                   for item in doc.metadata[field].split(',')]

    ...

    fields_to_convert_list = ['Genre', 'Actors', 'Directors',
                              'Production Companies', 'Stream', 'Buy', 'Rent']
    ...

    # Set 'overview' and 'keywords' as 'page_content' and other fields as 'metadata'
    for doc in docs:
        # Parse the page_content string into a dictionary
        page_content_dict = dict(line.split(": ", 1)
                                 for line in doc.page_content.split("\n") if ": " in line)

        doc.page_content = (
            'Title: ' + page_content_dict.get('Title') +
            '. Overview: ' + page_content_dict.get('Overview') +
            ...
        )

    ...

    print("Successfully took csv files and created docs")

    return docs

@task
def upload_docs_to_pinecone(docs, config):
    # Create empty index
    PINECONE_KEY, PINECONE_INDEX_NAME = os.getenv(
        'PINECONE_API_KEY'), os.getenv('PINECONE_INDEX_NAME')

    pc = Pinecone(api_key=PINECONE_KEY)

    # Target index and check status
    pc_index = pc.Index(PINECONE_INDEX_NAME)
    print(pc_index.describe_index_stats())

    embeddings = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])
    namespace = "film_search_prod"

    PineconeVectorStore.from_documents(
        docs,
        ...
    )

    print("Successfully uploaded docs to Pinecone vector store")

@task
def publish_dataset_to_weave(docs):
    # Initialize Weave
    weave.init('film-search')

    rows = []
    for doc in docs:
        row = {
            'Title': doc.metadata.get('Title'),
            'Runtime (minutes)': doc.metadata.get('Runtime (minutes)'),
            ...
        }
        rows.append(row)

    dataset = Dataset(name='Movie Collection', rows=rows)
    weave.publish(dataset)
    print("Successfully published dataset to Weave")

@flow(log_prints=True)
def pinecone_flow():
    with open('./config.json') as f:
        config = json.load(f)

    start()
    pull_data_to_csv(config)
    docs = convert_csv_to_docs()
    upload_docs_to_pinecone(docs, config)
    publish_dataset_to_weave(docs)

if __name__ == "__main__":
    pinecone_flow.deploy(
        name="pinecone-flow-deployment",
        work_pool_name="my-aci-pool",
        cron="0 0 * * 0",
        image=DeploymentImage(
            name="prefect-flows:latest",
            platform="linux/amd64",
        )
    )

Notice how simple it is to turn Python functions into a Prefect flow. All you need are some sub-functions styled with @task decorators and a @flow decorator on the main function. Also note that after uploading the documents to Pinecone, the last step of our flow publishes the dataset to Weave. This is important for reproducibility purposes.

At the bottom of the script we see how deployment is done in Prefect.

  • We need to provide a name for the deployment. This is arbitrary.
  • We also need to specify a work_pool_name . Push work pools in Prefect automatically send tasks to serverless computers without needing a middleman. This name needs to match the name used to create the pool, which we’ll see below.
  • You also need to specify a cron expression, which defines the schedule (the name derives from chronos, the Greek word for time). This allows you to specify how often to repeat a workflow. The value “0 0 * * 0” means run this workflow every Sunday at midnight, i.e. once a week. Check out this website for details on how the cron syntax works.
  • Finally, you need to specify a DeploymentImage . Here you specify both a name and a platform . The name is arbitrary, but the platform is not. Since I want to deploy to Azure compute instances, and these instances run Linux, it’s important I specify that in the DeploymentImage .

To deploy this flow on Azure using the CLI, run the following commands:

prefect work-pool create --type azure-container-instance:push --provision-infra my-aci-pool
prefect deployment run 'pinecone-flow/pinecone-flow-deployment'

These commands will automatically provision all of the necessary infrastructure on Azure. This includes an Azure Container Registry (ACR) that will hold a Docker image containing all files in your directory as well as any necessary libraries listed in a requirements.txt . It will also include an Azure Container Instance (ACI) Identity that will have permissions necessary to deploy a container with the aforementioned Docker image. Finally, the deployment run command will schedule the code to be run every week. You can check the Prefect dashboard to see your flow get run:

Image of a flow in Prefect being successfully run. Image by author.

By updating my Pinecone vector store weekly, I can ensure that the recommendations from Rosebud 🌹 remain accurate.

In this article, I discussed my experience improving the Rosebud 🌹 app. This included the process of incorporating offline and online evaluation, as well as automating the update of my Pinecone vector store.

Some other improvements not mentioned in this article:

  • Including ratings from The Movie Database in the film data. You can now ask for “highly rated films” and the chat model will filter for films above a 7/10.
  • Upgraded chat models. Now the query and summary models are using gpt-4o-mini . Recall that the LLM judge model is also using gpt-4o-mini .
  • Embedding model upgraded to text-embedding-3-small from text-embedding-ada-002 .
  • Years now span 1950–2023, instead of starting at 1920. Film data from 1920–1950 was not high quality, and only messed up recommendations.
  • UI is cleaner, with all details regarding the project relegated to a sidebar.
  • Vastly improved documentation on GitHub.
  • Bug fixes.

As mentioned at the top of the article, the app is now 100% free to use! I will foot the bill for queries for the foreseeable future (hence the choice of gpt-4o-mini instead of the more expensive gpt-4o). I really want to get the experience of running an app in production, and having my readers test out Rosebud 🌹 is a great way to do this. In the unlikely event that the app really blows up, I will have to come up with some other model of funding. But that would be a great problem to have.

Enjoy discovering awesome films! 🎥



Source link

03Aug

What You Need To Know To Build Large Streamlit Applications With Stripe Subscriptions And Firestore Integration. | by Erdogan Taskesen | Aug, 2024


The ability to turn ideas into software products is a great skill to learn. In this blog, I will describe what it takes, and how to put the parts together to create a software product without starting costs but with a subscription model and Firestore integration.

Photo by Shane Aldendorff on Unsplash

Whether you are a data scientist, data engineer, or in another field of software development, turning your ideas into real working software products using only a laptop may be the greatest skill to have. Various fields of software development come together in such a process: UX, front-end and backend development, data handling, visualizations, cloud/server configuration, and so on. It is a process of going back and forth. The challenge is deciding which idea to start with, and how to avoid (starting) costs until the point where people want your product. In this blog I will discuss different kinds of ideas and showcase how I created and deployed SkyWalk using Streamlit, with subscriptions handled by Stripe and data storage in Google Firestore.



Source link

02Aug

LLM-Driven Synthetic Data Generation, Curation & Evaluation | by Cobus Greyling | Aug, 2024


Key considerations include:

  • Ensuring readability and interpretability of LLM-generated information to facilitate human understanding.
  • Implementing upstream knowledge enrichment or filtering to optimise human resource use and reduce time spent on low-value tasks.
  • Adding engaging interactive features to make data processing tasks more enjoyable and attract a wider audience.

In traditional crowdsourced annotation, workers receive a codebook detailing the task purpose, data explanation, and background knowledge to better understand their jobs.

Similarly, for LLM-driven data generation, task specification is crucial and can include role-play, format clarification, and knowledge augmentation.

A simple prompt like suppose you are a {xxx} can significantly improve LLM performance by setting the right context. This approach is reminiscent of another study, where the researchers propose a new persona-driven data synthesis method that uses different perspectives within a large language model (LLM) to create varied synthetic data.

To support this method on a large scale, they introduce Persona Hub, a collection of 1 billion diverse personas automatically gathered from web data.

To ensure valid supervision, generated data must be logically and grammatically coherent.

However, inherent issues like hallucination and the fat-tailed knowledge distribution in large language models (LLMs) can introduce significant noise. This often leads to factual errors, incorrect labels, or irrelevant content, particularly when generating long, complex, or domain-specific data.

Diversity refers to the variations in generated data, such as differences in text length, topic, and writing style.

It is crucial for creating synthetic samples that reflect the diversity of real-world data, which helps prevent overfitting and bias during model training or evaluation.

However, inherent biases in large language models (LLMs) often result in monotonous and less diverse content, limiting its usefulness in downstream tasks.

The aim of synthetic data is not to imbue the target model with knowledge, but rather to train the model on certain personas and special abilities like advanced reasoning or task decomposition.

By combining strong data discovery and data design practices within a well-structured data topology, the process of creating synthetic data becomes more efficient, accurate, and aligned with real-world needs.

This foundational layer is essential for generating high-quality synthetic data that can effectively train and validate machine learning models.

⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn



Source link

01Aug

Stable and fast randomization using hash spaces | by David Clarance | Jul, 2024


Generate consistent assignments on the fly across different implementation environments

A bird’s eye view

A core part of running an experiment is to assign an experimental unit (for instance a customer) to a specific treatment (payment button variant, marketing push notification framing). Often this assignment needs to meet the following conditions:

  1. It needs to be random.
  2. It needs to be stable. If the customer comes back to the screen, they need to be exposed to the same payment button variant.
  3. It needs to be retrieved or generated very quickly.
  4. It needs to be available after the actual assignment so it can be used in downstream analysis.

When organizations first start their experimentation journey, a common pattern is to pre-generate assignments, store them in a database and then retrieve them at the time of assignment. This is a perfectly valid method to use and works great when you’re starting off. However, as you start to scale in customer and experiment volumes, this method becomes harder and harder to maintain and use reliably. You’ve got to (a) manage the complexity of storage, (b) ensure that assignments are actually random within and across experiments, and (c) retrieve the assignment reliably. All of these are hard at scale.

Using hash spaces helps solve some of these problems. It’s a simple solution but isn’t as widely known as it probably should be. This blog is an attempt to explain the technique. There are links to code in different languages at the end. However, if you’d like, you can also directly jump to code here.

We’re running an experiment to test which variant of a progress bar on our customer app drives the most engagement. There are three variants: Control (the default experience), Variant A and Variant B.

We have 10 million customers that use our app every week and we want to ensure that these 10 million customers get randomly assigned to one of the three variants. Each time the customer comes back to the app they should see the same variant. We want control to be assigned with a 50% probability, Variant 1 to be assigned with a 30% probability and Variant 2 to be assigned with a 20% probability.

probability_assignments = {"Control": 50, "Variant 1": 30, "Variant 2": 20}

To make things simpler, we’ll start with 4 customers. These customers have IDs that we use to refer to them. These IDs are generally either GUIDs (something like "b7be65e3-c616-4a56-b90a-e546728a6640") or integers (like 1019222, 1028333). Any of these ID types would work but to make things easier to follow we’ll simply assume that these IDs are: “Customer1”, “Customer2”, “Customer3”, “Customer4”.

Our goal is to map these 4 customers to the three possible variants.

This method primarily relies on using hash algorithms that come with some very desirable properties. Hashing algorithms take a string of arbitrary length and map it to a ‘hash’ of a fixed length. The easiest way to understand this is through some examples.

A hash function takes a string and maps it to a fixed hash space. In the example below, a hash function (in this case md5) takes the words “Hello”, “World”, “Hello World” and “Hello WorLd” (note the capital L) and maps them to an alphanumeric string of 32 characters.
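
A minimal sketch reproducing that example with Python’s hashlib (run it to print the four 32-character hashes):

import hashlib

for word in ["Hello", "World", "Hello World", "Hello WorLd"]:
    digest = hashlib.md5(word.encode()).hexdigest()
    print(f"{word!r} -> {digest}")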

A few important things to note:

  • The hashes are all of the same length.
  • A minor difference in the input (capital L instead of small L) changes the hash.
  • Hashes are hexadecimal strings. That is, they consist of the digits 0 to 9 and the first six letters of the alphabet (a, b, c, d, e and f).

We can use this same logic and get hashes for our four customers:

import hashlib

representative_customers = ["Customer1", "Customer2", "Customer3", "Customer4"]

def get_hash(customer_id):
    hash_object = hashlib.md5(customer_id.encode())
    return hash_object.hexdigest()

{customer: get_hash(customer) for customer in representative_customers}

# {'Customer1': 'becfb907888c8d48f8328dba7edf6969',
# 'Customer2': '0b0216b290922f789dd3efd0926d898e',
# 'Customer3': '2c988de9d49d47c78f9f1588a1f99934',
# 'Customer4': 'b7ca9bb43a9387d6f16cd7b93a7e5fb0'}

Hexadecimal strings are just representations of numbers in base 16. We can convert them to integers in base 10.

⚠️ One important note here: We rarely need to use the full hash. In practice (for instance in the linked code) we use a much smaller part of the hash (first 10 characters). Here we use the full hash to make explanations a bit easier.

def get_integer_representation_of_hash(customer_id):
    hash_value = get_hash(customer_id)
    return int(hash_value, 16)

{
    customer: get_integer_representation_of_hash(customer)
    for customer in representative_customers
}

# {'Customer1': 253631877491484416479881095850175195497,
# 'Customer2': 14632352907717920893144463783570016654,
# 'Customer3': 59278139282750535321500601860939684148,
# 'Customer4': 244300725246749942648452631253508579248}

There are two important properties of these integers:

  1. These integers are stable: Given a fixed input (“Customer1”), the hashing algorithm will always give the same output.
  2. These integers are uniformly distributed: This one hasn’t been explained yet and mostly applies to cryptographic hash functions (such as md5). Uniformity is a design requirement for these hash functions. If they were not uniformly distributed, the chances of collisions (getting the same output for different inputs) would be higher and weaken the security of the hash. There are some explorations of the uniformity property.

Now that we have an integer representation of each ID that’s stable (always has the same value) and uniformly distributed, we can use it to get to an assignment.

Going back to our probability assignments, we want to assign customers to variants with the following distribution:

{"Control": 50, "Variant 1": 30, "Variant 2": 20}

If we had 100 slots, we can divide them into 3 buckets where the number of slots represents the probability we want to assign to that bucket. For instance, in our example, we divide the integer range 0–99 (100 units), into 0–49 (50 units), 50–79 (30 units) and 80–99 (20 units).

def divide_space_into_partitions(prob_distribution):
    partition_ranges = []
    start = 0
    for partition in prob_distribution:
        partition_ranges.append((start, start + partition))
        start += partition
    return partition_ranges

divide_space_into_partitions(prob_distribution=probability_assignments.values())

# note that this is zero indexed, lower bound inclusive and upper bound exclusive
# [(0, 50), (50, 80), (80, 100)]

Now, if we assign a customer to one of the 100 slots randomly, the resultant distribution should then be equal to our intended distribution. Another way to think about this is, if we choose a number randomly between 0 and 99, there’s a 50% chance it’ll be between 0 and 49, 30% chance it’ll be between 50 and 79 and 20% chance it’ll be between 80 and 99.

The only remaining step is to map the customer integers we generated to one of these hundred slots. We do this by extracting the last two digits of the integer generated and using that as the assignment. For instance, the last two digits for customer 1 are 97 (you can check the diagram below). This falls in the third bucket (Variant 2) and hence the customer is assigned to Variant 2.

We repeat this process iteratively for each customer. When we’re done with all our customers, we should find that the end distribution will be what we’d expect: 50% of customers are in control, 30% in variant 1, 20% in variant 2.

def get_relevant_place_value(customer_id, divisor):
    # helper described above: take the last two digits of the integer
    # representation (i.e. the value modulo 100)
    return get_integer_representation_of_hash(customer_id) % divisor

def assign_groups(customer_id, partitions):
    hash_value = get_relevant_place_value(customer_id, 100)
    for idx, (start, end) in enumerate(partitions):
        if start <= hash_value < end:
            return idx
    return None

partitions = divide_space_into_partitions(
    prob_distribution=probability_assignments.values()
)

groups = {
    customer: list(probability_assignments.keys())[assign_groups(customer, partitions)]
    for customer in representative_customers
}

# output
# {'Customer1': 'Variant 2',
# 'Customer2': 'Variant 1',
# 'Customer3': 'Control',
# 'Customer4': 'Control'}

The linked gist has a replication of the above for 1,000,000 customers where we can observe that customers are distributed in the expected proportions.

# resulting proportions from a simulation on 1 million customers
{'Variant 1': 0.299799, 'Variant 2': 0.199512, 'Control': 0.500689}



Source link

29Jul

AI Agents: Exploring Agentic Applications | by Cobus Greyling | Jul, 2024


Applications based on LLMs are evolving & the next step in this progression of AI Agents are Agentic Applications. Agentic applications still have a Foundation Model as their backbone, but have more agency.

Agentic applications are AI-driven systems designed to autonomously perform tasks and make decisions based on user inputs and environmental context.

These applications leverage advanced models and tools to plan, execute, and adapt their actions dynamically.

By integrating capabilities like tool access, multi-step reasoning, and real-time adjustments, agentic applications can generate and complete complex workflows and provide intelligent solutions.

I must add that while many theories and future projections are based on speculation, I prioritise prototyping and creating working examples. This approach grounds commentary in practical experience, leading to more accurate future projections.

Generative and language-related AI are moving at a tremendous pace. As recently as 2018, the first notion of prompt engineering was introduced to combine NLP tasks and cast them as one question-answering problem within a specific context.

In 2020, the term RAG was coined by researchers and described as Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

Only in January 2022 the chain-of-thought prompting technique was proposed by Google researchers.

September 2022 OpenAI introduced Whisper, an open-source acoustic model which approaches human level robustness and accuracy on speech recognition.

In 2023 we saw the progression of Large Language Models from a text-only interface, by introducing image processing and audio.

The term Foundation Model was an apt new reference to Large Language Models which, apart from generating compelling text, can also generate images, videos, speech, music, and more.

The term Foundation Model was coined by the Stanford Institute for Human-Centered Artificial Intelligence as early as August 2021.

Also in 2023 we saw the rise of Small Language Models (SLMs). Even though SLMs have a small footprint, they have advanced capabilities in reasoning, Natural Language Generation (NLG), context and dialog management, and more.

In 2023 we also saw the rise of Agents. Agents have as their backbone an LLM, while agents also have access to one or more tools to perform specific tasks.

Agents are able to answer highly ambiguous and complex questions…

Agents leverage LLMs to make a decision on which Action to take. After an Action is completed, the Agent enters the Observation step.

From Observation step, the Agent shares a Thought; if a final answer is not reached, the Agent cycles back to another Action in order to move closer to a Final Answer.

Agents are empowered by tools, these tools can include math libraries, web search, Weather APIs, and other integration points.

Agentic Applications can be seen as the next step in this progression, where the application has more agency: it can browse and interpret the web, has mobile understanding, and is capable of accessing multiple modalities.



Source link

29Jul

Why we need Continual Learning for AI models


Why, in a world where the only constant is change, we need a Continual Learning approach to AI models.

Image by the author generated in Midjourney

Imagine you have a small robot that is designed to walk around your garden and water your plants. Initially, you spend a few weeks collecting data to train and test the robot, investing considerable time and resources. The robot learns to navigate the garden efficiently when the ground is covered with grass and bare soil.

However, as the weeks go by, flowers begin to bloom and the appearance of the garden changes significantly. The robot, trained on data from a different season, now fails to recognise its surroundings accurately and struggles to complete its tasks. To fix this, you need to add new examples of the blooming garden to the model.

Your first thought is to add new data examples to the training and retrain the model from scratch. But this is expensive and you do not want to do this every time the environment changes. In addition, you have just realised that you do not have all the historical training data available.

Now you consider just fine-tuning the model with new samples. But this is risky because the model may lose some of its previously learned capabilities, leading to catastrophic forgetting (a situation where the model loses previously acquired knowledge and skills when it learns new information).

So, is there an alternative? Yes: Continual Learning!

Of course, the robot watering plants in a garden is only an illustrative example of the problem. In the later parts of the text you will see more realistic applications.

Learn adaptively with Continual Learning (CL)

It is not possible to foresee and prepare for all the possible scenarios that a model may be confronted with in the future. Therefore, in many cases, adaptive training of the model as new samples arrive can be a good option.

In CL we want to find a balance between the stability of a model and its plasticity. Stability is the ability of a model to retain previously learned information, and plasticity is its ability to adapt to new information as new tasks are introduced.

“(…) in the Continual Learning scenario, a learning model is required to incrementally build and dynamically update internal representations as the distribution of tasks dynamically changes across its lifetime.” [2]

But how do we control stability and plasticity?

Researchers have identified a number of ways to build adaptive models. In [3] the following categories have been established:

  1. Regularisation-based approach
  • In this approach we add a regularisation term that should balance the effects of old and new tasks on the model structure.
  • For example, weight regularisation aims to control the variation of the parameters, by adding a penalty term to the loss function, which penalises the change of the parameter by taking into account how much it contributed to the previous tasks.

2. Replay-based approach

  • This group of methods focuses on recovering some of the historical data so that the model can still reliably solve previous tasks. One of the limitations of this approach is that we need access to historical data, which is not always possible.
  • For example, experience replay, where we preserve and replay a sample of old training data. When training on a new task, some examples from previous tasks are added to expose the model to a mixture of old and new task types, thereby limiting catastrophic forgetting (a small code sketch follows this list).

3. Optimisation based approach

  • Here we want to manipulate the optimisation methods to maintain performance for all tasks, while reducing the effects of catastrophic forgetting.
  • For example, gradient projection is a method where gradients computed for new tasks are projected so as not to affect previous gradients.

4. Representation-based approach

  • This group of methods focuses on obtaining and using robust feature representations to avoid catastrophic forgetting.
  • For example, self-supervised learning, where a model can learn a robust representation of the data before being trained on specific tasks. The idea is to learn high-quality features that reflect good generalisation across different tasks that a model may encounter in the future.

5. Architecture-based approach

  • The previous methods assume a single model with a single parameter space, but there are also a number of techniques in CL that exploit model’s architecture.
  • For example, parameter allocation, where, during training, each new task is given a dedicated subspace in a network, which removes the problem of parameter destructive interference. However, if the network is not fixed, its size will grow with the number of new tasks.
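
To make the replay idea above concrete, here is a small framework-agnostic sketch; the buffer capacity, batch format and the update_fn hook are placeholders rather than the API of any particular continual learning library:

import random

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.samples = []

    def add(self, batch):
        # keep a bounded memory of examples from previous tasks
        self.samples.extend(batch)
        self.samples = self.samples[-self.capacity:]

    def sample(self, k):
        return random.sample(self.samples, min(k, len(self.samples)))

def train_on_task(update_fn, task_batches, buffer, replay_size=16):
    for batch in task_batches:
        # mix new-task examples with replayed old-task examples
        mixed = list(batch) + buffer.sample(replay_size)
        update_fn(mixed)   # one gradient step on the mixed batch
        buffer.add(batch)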

And how do we evaluate the performance of CL models?

The basic performance of CL models can be measured from a number of angles [3]:

  • Overall performance evaluation: average performance across all tasks
  • Memory stability evaluation: calculating the difference between maximum performance for a given task before and its current performance after continual training
  • Learning plasticity evaluation: measuring the difference between joint training performance (if trained on all data) and performance when trained using CL

So why don’t all AI researchers switch to Continual Learning right away?

If you have access to the historical training data and are not worried about the computational cost, it may seem easier to just train from scratch.

One of the reasons for this is that the interpretability of what happens in the model during continual training is still limited. If training from scratch gives the same or better results than continual training, then people may prefer the easier approach, i.e. retraining from scratch, rather than spending time trying to understand the performance problems of CL methods.

In addition, current research tends to focus on the evaluation of models and frameworks, which may not reflect well the real use cases that the business may have. As mentioned in [6], there are many synthetic incremental benchmarks that do not reflect well real-world situations where there is a natural evolution of tasks.

Finally, as noted in [4], many papers on the topic of CL focus on storage rather than computational costs, and in reality, storing historical data is much less costly and energy consuming than retraining the model.

If there were more focus on the inclusion of computational and environmental costs in model retraining, more people might be interested in improving the current state of the art in CL methods as they would see measurable benefits. For example, as mentioned in [4], model re-training can exceed 10 000 GPU days of training for recent large models.

Why should we work on improving CL models?

Continual learning seeks to address one of the most challenging bottlenecks of current AI models — the fact that data distribution changes over time. Retraining is expensive and requires large amounts of computation, which is not a very sustainable approach from both an economic and environmental perspective. Therefore, in the future, well-developed CL methods may allow for models that are more accessible and reusable by a larger community of people.

As found and summarised in [4], there is a list of applications that inherently require or could benefit from the well-developed CL methods:

  1. Model Editing
  • Selective editing of an error-prone part of a model without damaging other parts of the model. Continual Learning techniques could help to continuously correct model errors at much lower computational cost.

2. Personalisation and specialisation

  • General purpose models sometimes need to be adapted to be more personalised for specific users. With Continual Learning, we could update only a small set of parameters without introducing catastrophic forgetting into the model.

3. On-device learning

  • Small devices have limited memory and computational resources, so methods that can efficiently train the model in real time as new data arrives, without having to start from scratch, could be useful in this area.

4. Faster retraining with warm start

  • Models need to be updated when new samples become available or when the distribution shifts significantly. With Continual Learning, this process can be made more efficient by updating only the parts affected by new samples, rather than retraining from scratch.

5. Reinforcement learning

  • Reinforcement learning involves agents interacting with an environment that is often non-stationary. Therefore, efficient Continual Learning methods and approaches could be potentially useful for this use case.

Learn more

As you can see, there is still a lot of room for improvement in the area of Continual Learning methods. If you are interested you can start with the materials below:

  • Introduction course: [Continual Learning Course] Lecture #1: Introduction and Motivation from ContinualAI on YouTube https://youtu.be/z9DDg2CJjeE?si=j57_qLNmpRWcmXtP
  • Paper about the motivation for the Continual Learning: Continual Learning: Application and the Road Forward [4]
  • Paper about the state of the art techniques in Continual Learning: Comprehensive Survey of Continual Learning: Theory, Method and Application [3]

If you have any questions or comments, please feel free to share them in the comments section.

Cheers!

Image by the author generated in Midjourney

[1] Awasthi, A., & Sarawagi, S. (2019). Continual Learning with Neural Networks: A Review. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 362–365). Association for Computing Machinery.

[2] Continual AI Wiki Introduction to Continual Learning https://wiki.continualai.org/the-continualai-wiki/introduction-to-continual-learning

[3] Wang, L., Zhang, X., Su, H., & Zhu, J. (2024). A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5362–5383.

[4] Eli Verwimp, Rahaf Aljundi, Shai Ben-David, Matthias Bethge, Andrea Cossu, Alexander Gepperth, Tyler L. Hayes, Eyke Hüllermeier, Christopher Kanan, Dhireesha Kudithipudi, Christoph H. Lampert, Martin Mundt, Razvan Pascanu, Adrian Popescu, Andreas S. Tolias, Joost van de Weijer, Bing Liu, Vincenzo Lomonaco, Tinne Tuytelaars, & Gido M. van de Ven. (2024). Continual Learning: Applications and the Road Forward https://arxiv.org/abs/2311.11908

[5] Awasthi, A., & Sarawagi, S. (2019). Continual Learning with Neural Networks: A Review. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 362–365). Association for Computing Machinery.

[6] Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, & Fartash Faghri. (2024). TiC-CLIP: Continual Training of CLIP Models.



Source link

28Jul

I found a hidden gem in Matplotlib’s library: Packed Bubble Charts in Python | by Anna Gordun Peiro | Jul, 2024


For my chart, I am using an Olympic Historical Dataset from Olympedia.org, which Joseph Cheng shared on Kaggle with a public domain license.

Screenshot of dataset

It contains event-to-athlete-level Olympic Games results from Athens 1896 to Beijing 2022. After an EDA (Exploratory Data Analysis), I transformed it into a dataset that details the number of female athletes in each sport/event per year. My bubble chart idea is to show which sports have a 50/50 female-to-male athlete ratio and how that ratio has evolved over time.

My plotting data is composed of two datasets, one for each year: 2020 and 1996. For each dataset I’ve computed the total number of athletes that participated in each event (athlete_sum) and the share that number represents of the total number of athletes, male plus female (difference). See a screenshot of the data below:

Screenshot of plotting dataset
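As a rough illustration of that preprocessing step, here is a minimal pandas sketch; the file name and the column names (year, sport, sex, athlete_id) are assumptions rather than the dataset’s exact schema:

import pandas as pd

# Hypothetical file and column names; the real Kaggle dataset may differ
raw = pd.read_csv("olympic_results.csv")   # one row per athlete per event

per_sport = (
    raw[raw["year"] == 2020]
    .groupby("sport")
    .agg(
        athlete_sum=("athlete_id", "nunique"),                  # total athletes per sport
        difference=("sex", lambda s: 100 * (s == "F").mean()),  # % of athletes that are female
    )
    .reset_index()
)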

This is my approach to visualise it:

  • Size proportion. Using the radius of the bubbles to compare the number of athletes per sport. Bigger bubbles will represent highly competitive events, such as Athletics.
  • Multi-variable interpretation. Using colours to encode female representation. Light green bubbles will represent events with a 50/50 split, such as Hockey. (A rough code sketch of this mapping follows the list.)
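To make that mapping concrete before looking at the real chart, here is a minimal, self-contained sketch: the sport names and numbers are made up, and a plain scatter stands in for the packed-bubble layout.

import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for the real plotting dataset (hypothetical values)
per_sport = pd.DataFrame({
    "sport": ["Athletics", "Hockey", "Boxing"],
    "athlete_sum": [2000, 400, 300],
    "difference": [48.0, 50.0, 27.0],   # % of athletes that are female
})

# Colour encodes the female share: 50% sits in the light middle of a diverging colormap
cmap = plt.cm.PiYG
colors = [cmap(share / 100) for share in per_sport["difference"]]

# Marker area proportional to the number of athletes
fig, ax = plt.subplots(figsize=(12, 8), subplot_kw=dict(aspect="equal"))
ax.scatter(range(len(per_sport)), [0] * len(per_sport),
           s=per_sport["athlete_sum"], c=colors)
for x, sport in enumerate(per_sport["sport"]):
    ax.annotate(sport, (x, 0), ha="center", va="center")
ax.axis("off")
plt.show()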

Here is my starting point (using the code and approach from above):

First result

Some easy fixes: increasing the figure size and setting labels to empty strings when the bubble size isn’t over 250, to avoid having words spill outside the bubbles.

import matplotlib.pyplot as plt

# Bigger figure, equal aspect ratio so the bubbles stay circular
fig, ax = plt.subplots(figsize=(12, 8), subplot_kw=dict(aspect="equal"))

# Labels edited directly in the dataset (blanked when the bubble is too small)
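For the label blanking, one option is to do it in the dataframe before plotting; a tiny sketch, continuing from the hypothetical per_sport frame above and the 250-size threshold just mentioned:

import numpy as np

# Show a sport's name only when its bubble is large enough to hold the text
per_sport["label"] = np.where(per_sport["athlete_sum"] > 250, per_sport["sport"], "")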

Second result

Well, now at least it’s readable. But why is Athletics pink and Boxing blue? Let’s add a legend to illustrate the relationship between colours and female representation.

Because this isn’t your regular bar chart, plt.legend() doesn’t do the trick here.

Using matplotlib’s AnnotationBbox, we can create rectangles (or circles) to show the meaning behind each colour. We can also do the same thing to show a bubble-size scale.

import matplotlib.pyplot as plt
from matplotlib.offsetbox import (AnnotationBbox, DrawingArea,
                                  TextArea, HPacker)
from matplotlib.patches import Circle, Rectangle

# This is an example for one section of the legend

# Define where the annotation (legend entry) will be placed
xy = [50, 128]

# Create your coloured rectangle or circle
da = DrawingArea(20, 20, 0, 0)
p = Rectangle((10, 10), 10, 10, color="#fc8d62ff")
da.add_artist(p)

# Add the text
text = TextArea("20%", textprops=dict(color="#fc8d62ff", size=14, fontweight='bold'))

# Combine rectangle and text
vbox = HPacker(children=[da, text], align="top", pad=0, sep=3)

# Annotate both in a box (change alpha if you want to see the box)
ab = AnnotationBbox(vbox, xy,
                    xybox=(1.005, xy[1]),
                    xycoords='data',
                    boxcoords=("axes fraction", "data"),
                    box_alignment=(0.2, 0.5),
                    bboxprops=dict(alpha=0))

# Add to your bubble chart
ax.add_artist(ab)

I’ve also added a subtitle and a text description under the chart just by using plt.text().
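For reference, here is a minimal plt.text() sketch of that kind of annotation; the coordinates and wording are placeholders, not the exact ones used in the final figure.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8), subplot_kw=dict(aspect="equal"))
ax.set_title("Female participation at the Olympic Games", fontsize=16)

# Subtitle just above the axes (axes-fraction coordinates)
plt.text(0.5, 1.02, "Bubble size = number of athletes, colour = female share",
         ha="center", fontsize=11, color="grey", transform=ax.transAxes)

# Short description under the chart
plt.text(0.5, -0.08, "Data: Olympedia.org historical results, Athens 1896 to Beijing 2022",
         ha="center", fontsize=9, color="grey", transform=ax.transAxes)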

Final visualisation

Straightforward and user-friendly interpretations of the graph:

  • The majority of bubbles are light green → light green means a ~50% female share → the majority of Olympic competitions have an even 50/50 female-to-male split (yay🙌)
  • Only one sport (Baseball), in dark green, has no female participation.
  • Three sports have only female participation, but the number of athletes is fairly low.
  • The biggest sports in terms of athlete numbers (Swimming, Athletics and Gymnastics) are very close to having a 50/50 split.



Source link

27Jul

Radical Simplicity in Data Engineering | by Cai Parry-Jones | Jul, 2024


Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking

source: unsplash.com

Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were:

  • Not knowing why something broke
  • Getting burnt with high cloud compute costs
  • Taking too long to build data solutions/complete data projects
  • Needing expertise on many tools and technologies

These problems aren’t new. I’ve experienced them, and you’ve probably experienced them too. Yet we can’t seem to find a solution that solves all of these issues in the long run. You might think to yourself, ‘well, point one can be solved with {insert data observability tool}’, or ‘point two just needs a stricter data governance plan in place’. The problem with this style of solution is that it adds additional layers of complexity, which makes the final two pain points more serious. The aggregate sum of pain stays the same; it is just distributed differently across the four points.

created by the author using Google Sheets

This article aims to present a contrary style of problem solving: radical simplicity.

TL;DR

  • Software engineers have found massive success in embracing simplicity.
  • Over-engineering and pursuing perfection can result in bloated, slow-to-develop data systems, with sky high costs to the business.
  • Data teams should consider sacrificing some functionality for the sake of simplicity and speed.

A Lesson From Those Software Guys

In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems, paradoxically titled ‘Worse Is Better’. I won’t go into the details; you can read the essay here if you like, but the underlying message was that software quality does not necessarily improve as functionality increases. In other words, on occasion, you can sacrifice completeness for simplicity and end up with an inherently ‘better’ product because of it.

This was a strange idea to the pioneers of computing during the 1950/60s. The philosophy of the day was: a computer system needs to be pure, and it can only be pure if it accounts for all possible scenarios. This was likely due to the fact that most leading computer scientists at the time were academics, who very much wanted to treat computer science as a hard science.

Academics at MIT, the leading institution in computing at the time, started working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the MIT guys released their new system. It was unquestionably the most advanced operating system of the time, however it was a pain to install due to the computing requirements, and feature updates were slow due to the size of the code base. As a result, it never caught on beyond a few select universities and industries.

While Multics was being built, a small group supporting its development became frustrated with the system’s ever-growing requirements. They eventually decided to break away from the project. Armed with this experience, they set their sights on creating their own operating system, one with a fundamental shift in philosophy:

The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.

— Richard P. Gabriel

Five years after Multics’s release, the breakaway group released their operating system, Unix. Slowly but steadily it caught traction, and by the 1990s Unix became the go-to choice for computers, with over 90% of the world’s top 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the system underlying macOS.

There were obviously other factors beyond its simplicity that led to Unix’s success. But its lightweight design was, and still is, a highly valuable asset of the system. That could only come about because the designers were willing to sacrifice functionality. The data industry should not be afraid to think the same way.

Back to Data in the 21st Century

Thinking back on my own experiences, the philosophy behind most big data engineering projects I’ve worked on was similar to that of Multics. For example, there was a project where we needed to automate the standardisation of the raw data coming in from all our clients. The decision was made to do this in the data warehouse via dbt, since we could then have a full view of data lineage from the very raw files right through to the standardised single-table version and beyond. The problem was that the first stage of transformation was very manual: it required loading each individual raw client file into the warehouse, and then having dbt create a model for cleaning each client’s file. This led to hundreds of dbt models needing to be generated, all using essentially the same logic. dbt became so bloated that it took minutes for the data lineage chart to load in the dbt docs website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for each pull request.

This could have been resolved fairly simply if leadership had allowed us to make the first layer of transformations outside of the data warehouse, using AWS Lambda and Python (a rough sketch of what that might have looked like follows below). But no, that would have meant the data lineage produced by dbt wouldn’t be 100% complete. That was it. That was the whole reason not to massively simplify the project. Similar to the group who broke away from the Multics project, I left this project mid-build; it was simply too frustrating to work on something that so clearly could have been much simpler. As I write this, they are still working on the project.
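To give a flavour of what that alternative could look like, here is a minimal sketch of a Lambda-based first transformation layer. The bucket layout, column mappings and file format are hypothetical illustrations, not the project’s actual code.

import io
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Hypothetical per-client column mappings; a real project would hold many more
COLUMN_MAPS = {
    "client_a": {"cust_id": "customer_id", "amt": "amount"},
    "client_b": {"CustomerID": "customer_id", "Value": "amount"},
}

def handler(event, context):
    """Standardise one raw client file as it lands in S3, before it reaches the warehouse."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]        # e.g. raw/client_a/2024-07-01.csv
    client = key.split("/")[1]

    obj = s3.get_object(Bucket=bucket, Key=key)
    raw = pd.read_csv(io.BytesIO(obj["Body"].read()))
    standardised = raw.rename(columns=COLUMN_MAPS[client])

    out_key = key.replace("raw/", "standardised/", 1)
    buffer = io.StringIO()
    standardised.to_csv(buffer, index=False)
    s3.put_object(Bucket=bucket, Key=out_key, Body=buffer.getvalue())

    return {"statusCode": 200, "body": json.dumps({"written": out_key})}

The standardised files could then be loaded into the warehouse as a single, already-clean layer, leaving dbt to model only what comes after.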

So, What the Heck is Radical Simplicity?

Radical simplicity in data engineering isn’t a framework or data-stack toolkit; it is simply a frame of mind: a philosophy that prioritises simple, straightforward solutions over complex, all-encompassing systems.

Key principles of this philosophy include:

  1. Minimalism: Focusing on core functionalities that deliver the most value, rather than trying to accommodate every possible scenario or requirement.
  2. Accepting trade-offs: Willingly sacrificing some degree of completeness or perfection in favour of simplicity, speed, and ease of maintenance.
  3. Pragmatism over idealism: Prioritising practical, workable solutions that solve real business problems efficiently, rather than pursuing theoretically perfect but overly complex systems.
  4. Reduced cognitive load: Designing systems and processes that are easier to understand, implement, and maintain, thus reducing the expertise required across multiple tools and technologies.
  5. Cost-effectiveness: Embracing simpler solutions that often require less computational resources and human capital, leading to lower overall costs.
  6. Agility and adaptability: Creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
  7. Focus on outcomes: Emphasising the end results and business value rather than getting caught up in the intricacies of the data processes themselves.

This mindset can be in direct contradiction to modern data engineering practice, which tends to add more tools, processes, and layers. As a result, expect to have to fight your corner. Before suggesting an alternative, simpler solution, come prepared with a deep understanding of the problem at hand. I am reminded of the quote:

It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. […] It’s not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. […] You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.

— Steve Jobs

Side note: Be aware that adopting radical simplicity doesn’t mean ignoring new tools and advanced technologies. In fact, one of my favourite solutions for a data warehouse at the moment is a new open-source database called DuckDB. Check it out; it’s pretty cool (a small taste below).
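As a quick, hedged illustration (the file path and columns are made up), DuckDB lets you query Parquet files in place from Python, with no cluster and no load step:

import duckdb

# Aggregate directly over Parquet files on disk; no warehouse load required
result = duckdb.sql("""
    SELECT client_id, SUM(amount) AS total_amount
    FROM 'standardised/*.parquet'
    GROUP BY client_id
    ORDER BY total_amount DESC
""").df()

print(result.head())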

Conclusion

The lessons from software engineering history offer valuable insights for today’s data landscape. By embracing radical simplicity, data teams can address many of the pain points plaguing modern data solutions.

Don’t be afraid to champion radical simplicity in your data team. Be the catalyst for change if you see opportunities to streamline and simplify. The path to simplicity isn’t easy, but the potential rewards can be substantial.



Source link

26Jul

LangChain Based Plan & Execute AI Agent With GPT-4o-mini | by Cobus Greyling | Jul, 2024


As has been widely established by now, Chain-of-Thought (CoT) prompting is a highly effective method for querying LLMs in a zero-shot or few-shot setting.

It excels at tasks requiring multi-step reasoning, where the model is guided through step-by-step demonstrations before addressing the problem with the instruction “Let us think step by step.”

However, recent studies have identified three main limitations of CoT prompting:

  • Calculations: 7% failure rate in test examples.
  • Missing Steps: 12% failure rate in sequential events.
  • Semantic Misunderstanding: 27% failure rate in test examples.

To address these issues, Plan-and-Solve (PS) prompting and its enhanced version, Plan-and-Solve with Detailed Instructions (PS+), have been introduced.

PS involves two key steps:

  1. Creating a plan to break the task into smaller subtasks and then
  2. Executing these subtasks according to the plan.

This simple architecture represents the planning agent framework. It has two main components:

  1. Planner: Prompts an LLM to create a multi-step plan for a large task.
  2. Executors: Receive the user query and a step in the plan, then invoke one or more tools to complete that task.

After execution, the agent is prompted to re-plan, deciding whether to provide a final response or generate a follow-up plan if the initial plan was insufficient.

This design minimises the need to call the large planner LLM for every tool invocation.

However, it remains limited by serial tool calling and requires an LLM call for each task, as it doesn’t support variable assignment.

The LLM assignment is done in the following way:

llm = OpenAI(temperature=0, model_name='gpt-4o-mini')

Below is the complete Python code for the AI agent. The only changes you will need to make are adding your OpenAI API key and your LangSmith project variables.

### Install the required packages:
pip install -qU langchain-openai langchain langchain_community langchain_experimental
pip install -U duckduckgo-search

### Import the required modules and set the LangSmith environment variables:
import os
from uuid import uuid4

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"OpenAI_SM_{unique_id}"  # unique project name per run
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  # add your LangSmith API key here

### Import the LangChain components and set the OpenAI API key:
from langchain.chains import LLMMathChain
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_core.tools import Tool
from langchain_experimental.plan_and_execute import (
    PlanAndExecute,
    load_agent_executor,
    load_chat_planner,
)
from langchain_openai import ChatOpenAI, OpenAI

os.environ['OPENAI_API_KEY'] = str("")  # add your OpenAI API key here
llm = OpenAI(temperature=0, model_name='gpt-4o-mini')

### Set up the search and math chain tools:
search = DuckDuckGoSearchAPIWrapper()
# Note: this overrides the assignment above; the math chain uses the default completion model
llm = OpenAI(temperature=0)
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for when you need to answer questions about math",
    ),
]

### Initialise the planner and executor (both backed by gpt-4o-mini):
model = ChatOpenAI(model_name='gpt-4o-mini', temperature=0)
planner = load_chat_planner(model)
executor = load_agent_executor(model, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor)

### Invoke the agent:
agent.invoke(
    "Who is the founder of SpaceX and what is the square root of his year of birth?"
)



Source link
