17Sep

An AI Agent Architecture & Framework Is Emerging | by Cobus Greyling | Sep, 2024


We are beginning to see the convergence on fundamental architectural principles that are poised to define the next generation of AI agents…

These architectures are far more than just advanced models — there are definitive building blocks emerging that will enable AI Agents & Agentic Applications to act autonomously, adapt dynamically, and interact and explore seamlessly within digital environments.

And as AI Agents become more capable, builders are converging on the common principles and approaches for core components.

I want to add a caveat: while there’s plenty of futuristic speculation around AI Agents, Agentic Discovery, and Agentic Applications, the insights and comments I share here are grounded in concrete research papers and hands-on experience with prototypes that I’ve either built or forked and tested in my own environment.

But First, Let’s Set The Stage With Some Key Concepts…

At a high level, an AI Agent is a system designed to perform tasks autonomously or semi-autonomously. Considering semi-autonomous for a moment, agents make use of tools to achieve their objective, and a human-in-the-loop can be a tool.

AI Agent tasks can range from a virtual assistant that schedules your appointments, to more complex agents involved in exploring and interacting with digital environments. With regards to digital environments, the most prominent research includes Apple’s Ferret-UI, WebVoyager, and work from Microsoft and others, as seen below…

An AI Agent is a program that uses one or more Large Language Models (LLMs) or Foundation Models (FMs) as its backbone, enabling it to operate autonomously.

By decomposing queries, planning & creating a sequence of events, the AI Agent effectively addresses and solves complex problems.

AI Agents can handle highly ambiguous questions by decomposing them through a chain of thought process, similar to human reasoning.

These agents have access to a variety of tools, including programs, APIs, web searches, and more, to perform tasks and find solutions.

Much like how Large language models (LLMs) transformed natural language processing, Large Action Models (LAMs) are poised to revolutionise the way AI agents interact with their environments.

In a recent piece I wrote, I explored the emergence of Large Action Models (LAMs) and their future impact on AI Agents.

Salesforce AI Research open-sourced a number of LAMs, including a Small Action Model.

LAMs are designed to go beyond simple language generation by enabling AI to take meaningful actions in real-world scenarios.

Function calling has become a crucial element in the context of AI Agents, particularly from a model capability standpoint, because it significantly extends the functionality of large language models (LLMs) beyond text generation.

Hence the advent of Large Action Models, one of whose main traits is the ability to excel at function calling.

AI Agents often need to perform actions based on user input, such as retrieving information, scheduling tasks, or performing computations.

Function calling allows the model to generate parameters for these tasks, enabling the agent to trigger external processes like database queries or API calls.

While LAMs form the action backbone, model orchestration brings together smaller, more specialised language models (SLMs) to assist in niche tasks.

Instead of relying solely on massive, resource-heavy models, agents can utilise these smaller models in tandem, orchestrating them for specific functions — whether that’s summarising data, parsing user commands, or providing insights based on historical context.

Small Language Models are ideal for development and testing, since they can be run locally in offline mode.

Large Language Models (LLMs) have rapidly gained traction due to several key characteristics that align well with the demands of natural language processing. These characteristics include natural language generation, common-sense reasoning, dialogue and conversation context management, natural language understanding, and the ability to handle unstructured input data. While LLMs are knowledge-intensive and have proven to be powerful tools, they are not without their limitations.

One significant drawback of LLMs is their tendency to hallucinate, meaning they can generate responses that are coherent, contextually accurate, and plausible, yet factually incorrect.

Additionally, LLMs are constrained by the scope of their training data, which has a fixed cut-off date. This means they do not possess ongoing, up-to-date knowledge or specific insights tailored to particular industries, organizations, or companies.

Updating an LLM to address these gaps is not straightforward; it requires fine-tuning the base model, which involves considerable effort in data preparation, costs, and testing. This process introduces a non-transparent, complex approach to data integration within LLMs.

To address these shortcomings, the concept of Retrieval-Augmented Generation (RAG) has been introduced.

RAG helps bridge the gap for Small Language Models (SLMs), supplementing them with the deep, intensive knowledge capabilities they typically lack.

While SLMs inherently manage other key aspects such as language generation and understanding, RAG equips them to perform comparably to their larger counterparts by enhancing their knowledge base.

This makes RAG a critical equalizer in the realm of AI language models, allowing smaller models to function with the robustness of a full-scale LLM.
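
To make the pattern concrete, here is a minimal, illustrative sketch of the RAG loop (retrieve relevant context, then ground the prompt with it). The toy keyword retriever and the prompt wording are placeholder assumptions, not any specific product's API:

def retrieve(query, documents, top_k=2):
    # naive keyword-overlap retriever, purely for illustration
    scored = [(sum(w in doc.lower() for w in query.lower().split()), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_k] if score > 0]

def build_rag_prompt(query, documents):
    context = "\n".join(retrieve(query, documents))
    # the grounded prompt is then passed to the SLM or LLM of your choice
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Our support desk is open 08:00-17:00 on weekdays.",
    "Enterprise customers have a dedicated account manager.",
]
print(build_rag_prompt("When is the support desk open?", docs))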

As AI Agents gain capabilities to explore and interact with digital environments, the integration of vision capabilities with language models becomes crucial.

Projects like Ferret-UI from Apple and WebVoyager are excellent examples of this.

These agents can navigate within their digital surroundings, whether that means identifying elements on a user interface or exploring websites autonomously.

Imagine an AI Agent tasked with setting up an application in a new environment — it would not only read text-based instructions but also recognise UI elements via OCR, mapping bounding boxes and interpreting text to interact with them, and provide visual feedback.

A fundamental shift is happening in how AI agents handle inputs and outputs.

Traditionally, LLMs have operated on unstructured input and generated unstructured output: anything from a short phrase to paragraphs of text. But now, with function calling, we are moving toward structured, actionable outputs.

While LLMs are great for understanding and producing unstructured content, LAMs are designed to bridge the gap by turning language into structured, executable actions.

When an AI Agent can structure its output to align with specific functions, it can interact with other systems far more effectively.

For instance, instead of generating a merely unstructured, conversational text response, the AI could call a specific function to book a meeting, send a request, or trigger an API call, all with more efficient token usage.

Not only does this reduce the overhead of processing unstructured responses, but it also makes interactions between systems more seamless.

Something to realise about function calling is that, when using the OpenAI API with function calling, the model does not execute functions directly; it only returns the name of the function to call and the arguments to call it with, and your application performs the actual call.
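
As a rough sketch of that flow (assuming the OpenAI Python SDK's chat-completions tools interface; the tool name and schema below are illustrative), the model only proposes the call, and the application executes it:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "book_meeting",  # hypothetical tool exposed by the agent
        "description": "Book a meeting in the user's calendar",
        "parameters": {
            "type": "object",
            "properties": {"topic": {"type": "string"}, "time": {"type": "string"}},
            "required": ["topic", "time"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Book a meeting about the Q3 roadmap at 10am tomorrow."}],
    tools=tools,
)

# The model returns a function name plus JSON arguments; it does not run anything itself.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print(call.function.name, args)  # the application code now performs the actual booking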

AI Agents can now become truly part of the larger digital ecosystem.

Finally, let’s talk about the importance of tools in the architecture of AI agents.

Tools can be thought of as the mechanisms through which AI Agents interact with the world — whether that’s fetching data, performing calculations, or executing tasks. In many ways, these tools are like pipelines, carrying inputs from one stage to another, transforming them along the way.

What’s even more fascinating is that a tool doesn’t necessarily have to be an algorithm or script. In some cases, the tool can be a human-in-the-loop, where humans intervene at key moments to guide or validate the agent’s actions.

This is particularly valuable in high-stakes environments, such as healthcare or finance, where absolute accuracy is critical.

Tools not only extend the capabilities of AI agents but also serve as the glue that holds various systems together. Whether it’s a human or a digital function, these tools allow AI agents to become more powerful, modular, and context-aware.

As we stand at the cusp of this new era, it’s clear that AI agents are becoming far more sophisticated than we ever anticipated.

With Large Action Models, Model Orchestration, vision-enabled language models, Function Calling, and the critical role of Tools, these agents are active participants in solving problems, exploring digital landscapes, and learning autonomously.

By focusing on these core building blocks, we’re setting the foundation for AI agents that are not just smarter, but more adaptable, efficient, and capable of acting in ways that start to resemble human problem solving and thought processes.

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn




16Sep

Introducing NumPy, Part 3: Manipulating Arrays | by Lee Vaughan | Sep, 2024


Shaping, transposing, joining, and splitting arrays

A grayscale Rubik’s cube hits itself with a hammer, breaking off tiny cubes.
Manipulating an array as imagined by DALL-E3

Welcome to Part 3 of Introducing NumPy, a primer for those new to this essential Python library. Part 1 introduced NumPy arrays and how to create them. Part 2 covered indexing and slicing arrays. Part 3 will show you how to manipulate existing arrays by reshaping them, swapping their axes, and merging and splitting them. These tasks are handy for jobs like rotating, enlarging, and translating images and fitting machine learning models.

NumPy comes with methods to change the shape of arrays, transpose arrays (invert columns with rows), and swap axes. You’ve already been working with the reshape() method in this series.

One thing to be aware of with reshape() is that, like many NumPy operations, it returns a view of the array (when possible) rather than a copy, and it does not change the original array in place. In the following example, reshaping arr1d returns a new 2 x 2 view, but arr1d itself is unchanged:

In [1]: import numpy as np

In [2]: arr1d = np.array([1, 2, 3, 4])

In [3]: arr1d.reshape(2, 2)
Out[3]:
array([[1, 2],
       [3, 4]])

In [4]: arr1d
Out[4]: array([1, 2, 3, 4])

This behavior is useful when you want to temporarily change the shape of the array for use in a…
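
To keep the new shape, assign the result of reshape() to a name. Because the result is a view (when the data allows it), it shares memory with the original array, so changing one changes the other; continuing the session above:

In [5]: arr2d = arr1d.reshape(2, 2)

In [6]: arr2d[0, 0] = 99

In [7]: arr1d
Out[7]: array([99,  2,  3,  4])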




15Sep

Applications of Rolling Windows for Time Series, with Python | by Piero Paialunga | Sep, 2024


Here are some powerful applications of Rolling Windows for Time Series

Photo by Claudia Aran on Unsplash

Last night I was doing laundry with my wife. We have this non-verbal agreement (it becomes pretty verbal when I break it though) about laundry: she is the one who puts the laundry in the washer and drier and I am the one who folds it.

The way we do this is usually like this:

Image made by author using DALLE

Now, I don’t really fold all the clothes and put them away. Otherwise, I would be swimming in clothes. What I do is an approach that reminds me of the rolling window method:

Image made by author using DALLE

Why do I say that it reminds me of a rolling window? Let’s see the analogy.

Image made by author using DALLE

The idea of rolling windows is exactly the one that I apply when folding laundry. I have a task to do but you…




14Sep

Bayesian Linear Regression: A Complete Beginner’s guide | by Samvardhan Vishnoi | Sep, 2024


A workflow and code walkthrough for building a Bayesian regression model in STAN

Note: Check out my previous article for a practical discussion on why Bayesian modeling may be the right choice for your task.

This tutorial will focus on a workflow + code walkthrough for building a Bayesian regression model in STAN, a probabilistic programming language. STAN is widely adopted and interfaces with your language of choice (R, Python, shell, MATLAB, Julia, Stata). See the installation guide and documentation.

I will use PyStan for this tutorial, simply because I code in Python. Even if you use another language, the general Bayesian practices and STAN language syntax I will discuss here don’t vary much.

For the more hands-on reader, here is a link to the notebook for this tutorial, part of my Bayesian modeling workshop at Northwestern University (April, 2024).

Let’s dive in!

Let’s learn how to build a simple linear regression model, the bread and butter of any statistician, the Bayesian way. Assuming a dependent variable Y and covariate X, I propose the following simple model:

Y = α + β * X + ϵ

Where ⍺ is the intercept, β is the slope, and ϵ is some random error. Assuming that,

ϵ ~ Normal(0, σ)

we can show that

Y ~ Normal(α + β * X, σ)

We will learn how to code this model form in STAN.

Generate Data

First, let’s generate some fake data.

#imports (needed for the snippets below)
import numpy as np
import matplotlib.pyplot as plt

#Model Parameters
alpha = 4.0 #intercept
beta = 0.5 #slope
sigma = 1.0 #error-scale

#Generate fake data
x = 8 * np.random.rand(100)
y = alpha + beta * x
y = np.random.normal(y, scale=sigma) #noise

#visualize generated data
plt.scatter(x, y, alpha = 0.8)
Generated data for Linear Regression (Image from code by Author)

Now that we have some data to model, let’s dive into how to structure it and pass it to STAN along with modeling instructions. This is done via the model string, which typically contains 4 (occasionally more) blocks- data, parameters, model, and generated quantities. Let’s discuss each of these blocks in detail.

DATA block

data {                    //input the data to STAN
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
}

The data block is perhaps the simplest, it tells STAN internally what data it should expect, and in what format. For instance, here we pass-

N: the size of our dataset as type int. The <lower=0> part declares that N≥0. (Even though it is obvious here that data length cannot be negative, stating these bounds is good standard practice that can make STAN’s job easier.)

x: the covariate as a vector of length N.

y: the dependent as a vector of length N.

See docs here for a full range of supported data types. STAN offers support for a wide range of types like arrays, vectors, matrices etc. As we saw above, STAN also has support for encoding limits on variables. Encoding limits is recommended! It leads to better specified models and simplifies the probabilistic sampling processes operating under the hood.

Model Block

Next is the model block, where we tell STAN the structure of our model.

//simple model block
model {
    //priors
    alpha ~ normal(0,10);
    beta ~ normal(0,1);

    //model
    y ~ normal(alpha + beta * x, sigma);
}

The model block also contains an important, and often confusing, element: prior specification. Priors are a quintessential part of Bayesian modeling, and must be specified suitably for the sampling task.

See my previous article for a primer on the role and intuition behind priors. To summarize, the prior is a presupposed functional form for the distribution of parameter values — often referred to, simply, as prior belief. Even though priors don’t have to exactly match the final solution, they must allow us to sample from it.

In our example, we use Normal priors of mean 0 with different variances, depending on how sure we are of the supplied mean value: 10 for alpha (very unsure), 1 for beta (somewhat sure). Here, I supplied the general belief that while alpha can take a wide range of different values, the slope is generally more constrained and won’t have a large magnitude.

Hence, in the example above, the prior for alpha is ‘weaker’ than the prior for beta.

As models get more complicated, the sampling solution space expands and supplying beliefs gains importance. If there is no strong intuition, it is good practice to encode less belief into the model, i.e. use a weakly informative prior, and remain flexible to incoming data.

The form for y, which you might have recognized already, is the standard linear regression equation.
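
One thing the snippets above do not show: alpha, beta and sigma must also be declared before STAN can sample them, which is done in a separate parameters block. A minimal sketch of the standard form (my assumption of what it looks like; the lower bound keeps the error scale positive):

parameters {              //unknowns to be estimated by STAN
    real alpha;
    real beta;
    real<lower=0> sigma;
}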

Generated Quantities

Lastly, we have our block for generated quantities. Here we tell STAN what quantities we want to calculate and receive as output.

generated quantities {    //get quantities of interest from fitted model
    vector[N] yhat;
    vector[N] log_lik;
    for (n in 1:N){
        yhat[n] = normal_rng(alpha + x[n] * beta, sigma);             //generate samples from model
        log_lik[n] = normal_lpdf(y[n] | alpha + x[n] * beta, sigma);  //probability of data given the model and parameters
    }
}

Note: STAN supports vectors to be passed either directly into equations, or as iterations 1:N for each element n. In practice, I’ve found this support to change with different versions of STAN, so it is good to try the iterative declaration if the vectorized version fails to compile.

In the above example-

yhat: generates samples for y from the fitted parameter values.

log_lik: generates probability of data given the model and fitted parameter value.

The purpose of these values will be clearer when we talk about model evaluation.

Altogether, we have now fully specified our first simple Bayesian regression model:

model = """
data { //input the data to STAN
int N;
vector[N] x;
vector[N] y;
}
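
Only the data block is shown above; the full string continues with the parameters, model and generated quantities blocks covered earlier and is closed with triple quotes. Assembled in one piece (the parameters block is the assumed standard declaration, everything else comes from the blocks above), it would read roughly:

model = """
data {                    //input the data to STAN
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
}

parameters {              //unknowns to be estimated by STAN
    real alpha;
    real beta;
    real<lower=0> sigma;
}

model {
    //priors
    alpha ~ normal(0,10);
    beta ~ normal(0,1);

    //model
    y ~ normal(alpha + beta * x, sigma);
}

generated quantities {    //get quantities of interest from fitted model
    vector[N] yhat;
    vector[N] log_lik;
    for (n in 1:N){
        yhat[n] = normal_rng(alpha + x[n] * beta, sigma);
        log_lik[n] = normal_lpdf(y[n] | alpha + x[n] * beta, sigma);
    }
}
"""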

All that remains is to compile the model and run the sampling.

#STAN takes data as a dict
data = {'N': len(x), 'x': x, 'y': y}

STAN takes input data in the form of a dictionary. It is important that this dict contains all the variables that we told STAN to expect in the model-data block, otherwise the model won’t compile.

import stan   # PyStan 3

#parameters for STAN fitting
chains = 2
samples = 1000
warmup = 10

# Compile the model (random_seed also sets the seed)
posterior = stan.build(model, data=data, random_seed=42)
# Train the model and generate samples
fit = posterior.sample(num_chains=chains, num_samples=samples)

The .sample() method parameters control the Hamiltonian Monte Carlo (HMC) sampling process, where:
  • num_chains: is the number of times we repeat the sampling process.
  • num_samples: is the number of samples to be drawn in each chain.
  • warmup: is the number of initial samples that we discard (as it takes some time to reach the general vicinity of the solution space).

Knowing the right values for these parameters depends on both the complexity of our model and the resources available.

Higher sampling sizes are of course ideal, yet for an ill-specified model they will prove to be just a waste of time and computation. Anecdotally, I’ve had large data models that took a week to finish running, only to find that the model didn’t converge. It is important to start small and sanity-check your model before running a full-fledged sampling.

Model Evaluation

The generated quantities are used for

  • evaluating the goodness of fit i.e. convergence,
  • predictions
  • model comparison

Convergence

The first step for evaluating the model, in the Bayesian framework, is visual. We observe the sampling draws of the Hamiltonian Monte Carlo (HMC) sampling process.

Model Convergence: visually evaluating the overlap of independent sampling chains (Image from code by Author)

In simplistic terms, STAN iteratively draws samples for our parameter values and evaluates them (HMC does way more, but that’s beyond our current scope). For a good fit, the sample draws must converge to some common general area which would, ideally, be the global optima.

The figure above shows the sampling draws for our model across 2 independent chains (red and blue).

  • On the left, we plot the overall distribution of the fitted parameter value i.e. the posteriors. We expect a normal distribution if the model, and its parameters, are well specified. (Why is that? Well, a normal distribution just implies that there exists a certain range of best-fit values for the parameter, which speaks in support of our chosen model form). Furthermore, we should expect a considerable overlap across chains IF the model is converging to an optimum.
  • On the right, we plot the actual samples drawn in each iteration (just to be extra sure). Here, again, we wish to see not only a narrow range but also a lot of overlap between the draws.

Not all evaluation metrics are visual. Gelman et al. [1] also propose the Rhat diagnostic, which is essentially a mathematical measure of the sample similarity across chains. Using Rhat, one can define a cutoff point beyond which the two chains are judged too dissimilar to be converging. The cutoff, however, is hard to define due to the iterative nature of the process and the variable warmup periods.

Visual comparison is hence a crucial component, regardless of diagnostic tests.

A frequentist thought you may have here is that, “well, if all we have is chains and distributions, what is the actual parameter value?” This is exactly the point. The Bayesian formulation only deals in distributions, NOT point estimates with their hard-to-interpret test statistics.

That said, the posterior can still be summarized using credible intervals like the High Density Interval (HDI), which includes all the x% highest probability density points.

95% HDI for beta (Image from code by Author)
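
For reference, here is a minimal sketch of how such an interval can be computed directly from the posterior draws (libraries like ArviZ provide this out of the box; I assume fit["beta"] returns the flattened draws, as in PyStan 3):

import numpy as np

def hdi(draws, prob=0.95):
    # narrowest interval containing `prob` of the draws
    x = np.sort(np.asarray(draws).ravel())
    n = len(x)
    m = int(np.ceil(prob * n))              # number of draws inside the interval
    widths = x[m - 1:] - x[:n - m + 1]      # width of every candidate interval
    i = int(np.argmin(widths))              # index of the narrowest candidate
    return x[i], x[i + m - 1]

beta_draws = fit["beta"]                    # posterior draws for the slope
print("95% HDI for beta:", hdi(beta_draws))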

It is important to contrast Bayesian credible intervals with frequentist confidence intervals.

  • The credible interval gives a probability distribution on the possible values for the parameter i.e. the probability of the parameter assuming each value in some interval, given the data.
  • The confidence interval regards the parameter value as fixed, and estimates instead the confidence that repeated random samplings of the data would match.

Hence the

Bayesian approach lets the parameter values be fluid and takes the data at face value, while the frequentist approach demands that there exists the one true parameter value… if only we had access to all the data ever

Phew. Let that sink in, read it again until it does.

Another important implication of using credible intervals, or in other words, allowing the parameter to be variable, is that the predictions we make capture this uncertainty with transparency, with a certain HDI % informing the best fit line.

95% HDI line of best fit (Image from code by Author)

Model comparison

In the Bayesian framework, the Watanabe-Akaike Information Metric (WAIC) score is the widely accepted choice for model comparison. A simple explanation of the WAIC score is that it estimates the model likelihood while regularizing for the number of model parameters. In simple words, it can account for overfitting. This is also a major draw of the Bayesian framework — one does not necessarily need to hold out a model validation dataset. Hence,

Bayesian modeling offers a crucial advantage when data is scarce.

The WAIC score is a comparative measure i.e. it only holds meaning when compared across different models that attempt to explain the same underlying data. Thus in practice, one can keep adding more complexity to the model as long as the WAIC increases. If at some point in this process of adding maniacal complexity, the WAIC starts dropping, one can call it a day — any more complexity will not offer an informational advantage in describing the underlying data distribution.
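
As a rough illustration of how the log_lik samples from the generated quantities block feed into this, here is a sketch of the WAIC computation (written in the elpd form, where higher is better, to match the convention used above; I assume fit["log_lik"] returns an (N, draws) array as in PyStan 3):

import numpy as np
from scipy.special import logsumexp

log_lik = np.asarray(fit["log_lik"])                     # shape: (N data points, S draws)
S = log_lik.shape[1]

lppd = np.sum(logsumexp(log_lik, axis=1) - np.log(S))    # log pointwise predictive density
p_waic = np.sum(np.var(log_lik, axis=1, ddof=1))         # effective number of parameters
elpd_waic = lppd - p_waic                                # higher is better
print("elpd_waic:", elpd_waic, "| deviance-scale WAIC:", -2 * elpd_waic)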

Conclusion

To summarize, the STAN model is simply a string. It explains to STAN what you are going to give to it (data), what is to be found (parameters), what you think is going on (model), and what it should give you back (generated quantities).

When turned on, STAN simply turns the crank and gives its output.

The real challenge lies in defining a proper model (see the discussion of priors above), structuring the data appropriately, asking STAN exactly what you need from it, and evaluating the sanity of its output.

Once we have this part down, we can delve into the real power of STAN, where specifying increasingly complicated models becomes just a simple syntactical task. In fact, in our next tutorial we will do exactly this. We will build upon this simple regression example to explore Bayesian Hierarchical models: an industry standard, state-of-the-art, de facto… you name it. We will see how to add group-level random or fixed effects into our models, and marvel at the ease of adding complexity while maintaining comparability in the Bayesian framework.

Subscribe if this article helped, and stay tuned for more!

References

[1] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari and Donald B. Rubin (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC.




14Sep

Emergence of Large Action Models (LAMs) and Their Impact on AI Agents | by Cobus Greyling | Sep, 2024


While LLMs are great for understanding and producing unstructured content, LAMs are designed to bridge the gap by turning language into structured, executable actions.

As I have mentioned in the past, Autonomous AI Agents powered by large language models (LLMs) have recently emerged as a key focus of research, driving the development of concepts like agentic applications, agentic retrieval-augmented generation (RAG), and agentic discovery.

However, according to Salesforce AI Research, the open-source community continues to face significant challenges in building specialised models tailored for these tasks.

A major hurdle is the scarcity of high-quality, agent-specific datasets, coupled with the absence of standardised protocols, which complicates the development process.

To bridge this gap, researchers at Salesforce have introduced xLAM, a series of Large Action Models specifically designed for AI agent tasks.

The xLAM series comprises five models, featuring architectures that range from dense to mixture-of-experts, with parameter sizes from 1 billion upwards.

These models aim to advance the capabilities of autonomous agents by providing purpose-built solutions tailored to the complex demands of agentic tasks.

Function calling has become a crucial element in the context of AI agents, particularly from a model capability standpoint, because it significantly extends the functionality of large language models (LLMs) beyond static text generation.

Hence the advent of Large Action Models, one of whose main traits is the ability to excel at function calling.

AI agents often need to perform actions based on user input, such as retrieving information, scheduling tasks, or performing computations.

Function calling allows the model to generate parameters for these tasks, enabling the agent to trigger external processes like database queries or API calls.

This makes the agent not just reactive, but action-oriented, turning passive responses into dynamic interactions.

Interoperability with External Systems

For AI Agents, sub-tasks involve interacting with various tools. Tools are in turn linked to external systems (CRM systems, financial databases, weather APIs, etc).

Through function calling, LAMs can serve as a broker, providing the necessary data or actions for those systems without needing the model itself to have direct access. This allows for seamless integration with other software environments and tools.
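
In practice, this broker role often comes down to a simple dispatch step in the agent code: the model emits a function name plus arguments, and the agent maps that onto the real external system. A minimal, illustrative sketch (the tool names and the model output format are assumptions, not any vendor's API):

import json

def get_weather(city: str) -> str:
    return f"Sunny in {city}"                      # stand-in for a real weather API call

def lookup_customer(customer_id: str) -> dict:
    return {"id": customer_id, "tier": "gold"}     # stand-in for a CRM query

TOOLS = {"get_weather": get_weather, "lookup_customer": lookup_customer}

# Imagine the LAM produced this structured action instead of free text:
model_output = '{"name": "get_weather", "arguments": {"city": "London"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)                                      # fed back to the model or returned to the user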

By moving from an LLM to a LAM, the model’s utility is also expanded; LAMs can thus be seen as purpose-built to act as the centrepiece of an agentic implementation.

Large Language Models (LLMs) are designed to handle unstructured input and output, excelling at tasks like generating human-like text, summarising content, and answering open-ended questions.

LLMs are highly flexible, allowing them to process diverse forms of natural language without needing predefined formats.

However, their outputs can be ambiguous or loosely structured, which can limit their effectiveness for specific task execution. Using an LLM for an agentic implementation is not wrong, and it can serve the purpose quite well.

But Large Action Models (LAMs) can be considered purpose-built, focusing on structuring outputs by generating precise parameters or instructions for specific actions, making them suitable for tasks that require clear and actionable results, such as function calling or API interactions.

While LLMs are great for understanding and producing unstructured content, LAMs are designed to bridge the gap by turning language into structured, executable actions.

Overall, in the context of AI agents, function calling enables more robust, capable, and practical applications by allowing LLMs to serve as a bridge between natural language understanding and actionable tasks within digital systems.




12Sep

Strategic Chain-of-Thought (SCoT) | by Cobus Greyling | Sep, 2024


As LLMs evolve, I believe that while CoT remains simple and transparent, managing the growing complexity of prompts and multi-inference architectures will demand more sophisticated tools and a strong focus on data-centric approaches.

Human oversight will be essential to maintaining the integrity of these systems.

As LLM-based applications become more complex, their underlying processes must be accommodated somewhere, preferably on a resilient platform that can handle the growing functionality and complexity.

The prompt engineering process itself can become intricate, requiring dedicated infrastructure to manage data flow, API calls, and multi-step reasoning.

But as this complexity scales, introducing an agentic approach becomes essential to scale automated tasks, manage complex workflows, and navigate digital environments efficiently.

These agents enable applications to break down complex requests into manageable steps, optimising both performance and scalability.

Ultimately, hosting this complexity requires adaptable systems that support real-time interaction and seamless integration with broader data and AI ecosystems.

Strategic knowledge refers to a clear method or principle that guides reasoning toward a correct and stable solution. It involves using structured processes that logically lead to the desired outcome, thereby improving the stability and quality of CoT generation.
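
As a purely illustrative sketch of that two-step idea (elicit the strategy first, then apply it), not the paper's exact prompt wording:

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

scot_prompt = (
    "Step 1: State, in one sentence, the most reliable strategy or principle for solving "
    "the problem. Do not solve it yet.\n"
    "Step 2: Apply that strategy step by step and give the final answer.\n\n"
    f"Problem: {question}"
)

print(scot_prompt)   # send this to the LLM of your choice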




11Sep

Market Basket Analysis Using High Utility Itemset Mining | by Laurin Brechter | Sep, 2024


Finding high-value patterns in transactions

In this post, I will give an alternative to popular techniques in market basket analysis that can help practitioners find high-value patterns rather than just the most frequent ones. We will gain some intuition into different pattern mining problems and look at a real-world example. The full code can be found here. All images are created by the author.

I have written a more introductory article about pattern mining already; if you’re not familiar with some of the concepts that come up here, feel free to check that one out first.

In short, pattern mining tries to find patterns in data (duuh). Most of the time, this data comes in the form of (multi-)sets or sequences. In my last article, for example, I looked at the sequence of actions that a user performs on a website. In this case, we would care about the ordering of the items.

In other cases, such as the one we will discuss below, we do not care about the ordering of the items. We only list all the items that were in the transaction and how often they appeared.

Example Transaction Database

So for example, transaction 1 contained 🥪 3 times and 🍎 once. As we see, we lose information about the ordering of the items, but in many scenarios (as the one we will discuss below), there is no logical ordering of the items. This is similar to a bag of words in NLP.

Market Basket Analysis (MBA) is a data analysis technique commonly used in retail and marketing to uncover relationships between products that customers tend to purchase together. It aims to identify patterns in customers’ shopping baskets or transactions by analyzing their purchasing behavior. The central idea is to understand the co-occurrence of items in shopping transactions, which helps businesses optimize their strategies for product placement, cross-selling, and targeted marketing campaigns.

Frequent Itemset Mining (FIM) is the process of finding frequent patterns in transaction databases. We can look at the frequency of a pattern (i.e. a set of items) by calculating its support. In other words, the support of a pattern X is the number of transactions T that contain X (and are in the database D). That is, we are simply looking at how often the pattern X appears in the database.

Definition of the support.

In FIM, we then want to find all the patterns that have a support bigger than some threshold (often called minsup). If the support of a pattern is higher than minsup, it is considered frequent.

Limitations

In FIM, we only look at the existence of an item in a sequence. That is, whether an item appears two times or 200 times does not matter, we simply represent it as a one. But we often have cases (such as MBA), where not only the existence of an item in a transaction is relevant but also how many times it appeared in the transaction.

Another problem is that frequency does not always imply relevance. In that sense, FIM assumes that all items in the transaction are equally important. However, it is reasonable to assume that someone buying caviar might be more important for a business than someone buying bread, as caviar is potentially a high ROI/profit item.

These limitations directly bring us to High Utility Itemset Mining (HUIM) and High Utility Quantitative Itemset Mining (HUQIM) which are generalizations of FIM that try to address some of the problems of normal FIM.

Our first generalization is that items can appear more than once in a transaction (i.e. we have a multiset instead of a simple set). As said before, in normal itemset mining, we transform the transaction into a set and only look at whether the item exists in the transaction or not. So for example the two transactions below would have the same representation.

t1 = ['a','a','a','a','a','b'] # repr. as {a,b} in FIM
t2 = ['a','b'] # repr. as {a,b} in FIM

Above, both these two transactions would be represented as [a,b] in regular FIM. We quickly see that, in some cases, we could miss important details. For example, if a and b were some items in a customer’s shopping cart, it would matter a lot whether we have a (e.g. a loaf of bread) five times or only once. Therefore, we represent the transaction as a multiset in which we write down, how many times each item appeared.

# multiset representation
t1_ms = {('a',5),('b',1)}
t2_ms = {('a',1),('b',1)}

This is also efficient when an item can appear a very large number of times (e.g. 100 or 1,000). In that case, we need not write down all the a’s or b’s but simply how often they appear.

The generalization that both the quantitative and non-quantitative methods make, is to assign every item in the transaction a utility (e.g. profit or time). Below, we have a table that assigns every possible item a unit profit.

Utility of Items

We can then calculate the utility of a specific pattern such as {🥪, 🍎} by summing up the utility of those items in the transactions that contain them. In our example we would have:

(3🥪 * $1 + 1🍎 * $2) + (1🥪 * $1 + 2🍎 * $2) = $5 + $5 = $10

Transaction Database from Above

So, we get that this pattern has a utility of $10. With FIM, we had the task of finding frequent patterns. Now, we have to find patterns with high utility. This is mainly because we assume that frequency does not imply importance. In regular FIM, we might have missed rare (infrequent) patterns that provide a high utility (e.g. the diamond), which is not true with HUIM.

We also need to define the notion of a transaction utility. This is simply the sum of the utility of all the items in the transaction. For our transaction 3 in the database, this would be

1🥪 * $1 + 2🦞*$10 + 2🍎*$2 = $25

Note that solving this problem and finding all high-utility items is more difficult than regular FIM. This is because the utility does not follow the Apriori property.

The Apriori Property

Let X and Y be two patterns occurring in a transaction database D. The apriori property says that if X is a subset of Y, then the support of X must be at least as big as Y’s.

Apriori property.

This means that if a subset of Y is infrequent, Y itself must be infrequent since it must have a smaller support. Let’s say we have X = {a} and Y = {a,b}. If Y appears 4 times in our database, then X must appear at least 4 times, since X is a subset of Y. This makes sense since we are making the pattern less general / more specific by adding an item, which means that it will fit fewer transactions. This property is used in most algorithms as it implies that if {a} is infrequent all supersets are also infrequent and we can eliminate them from the search space [3].

This property does not hold when we are talking about utility. A superset Y of transaction X could have more or less utility. If we take the example from above, {🥪} has a utility of $4. But this does not mean we cannot look at supersets of this pattern. For example, the superset we looked at {🥪, 🍎} has a higher utility of $10. At the same time, a superset of a pattern won’t always have more utility since it might be that this superset just doesn’t appear very often in the DB.

Idea Behind HUIM

Since we can’t use the apriori property for HUIM directly, we have to come up with some other upper bound for narrowing down the search space. One such bound is called Transaction-Weighted Utilization (TWU). To calculate it, we sum up the transaction utility of the transactions that contain the pattern X of interest. Any superset Y of X can’t have a higher utility than the TWU. Let’s make this clearer with an example. The TWU of {🥪,🍎} is $30 ($5 from transaction 1 and $25 from transaction 3). When we look at a superset pattern Y such as {🥪 🦞 🍎} we can see that there is no way it would have more utility since all transactions that have Y in them also have X in them.

There are now various algorithms for solving HUIM. All of them receive a minimum utility and produce the patterns that have at least that utility as their output. In this case, I have used the EFIM algorithm since it is fast and memory efficient.

For this article, I will work with the Market Basket Analysis dataset from Kaggle (used with permission from the original dataset author).

Above, we can see the distribution of transaction values found in the data. There is a total of around 19,500 transactions with an average transaction value of $526 and 26 distinct items per transaction. In total, there are around 4000 unique items. We can also make an ABC analysis where we put items into different buckets depending on their share of total revenue. We can see that around 500 of the 4000 items make up around 70% of the revenue (A-items). We then have a long right-tail of items (around 2250) that make up around 5% of the revenue (C-items).

Preprocessing

The initial data is in a long format where each row is a line item within a bill. From the BillNo we can see to which transaction the item belongs.

Initial Data Format

After some preprocessing, we get the data into the format required by PAMI which is the Python library we are going to use for applying the EFIM algorithm.

data['item_id'] = pd.factorize(data.Itemname)[0].astype(str) # map item names to id
data["Value_Int"] = data["Value"].astype(int).astype(str)
data = data.loc[data.Value_Int != '0'] # exclude items w/o utility

transaction_db = data.groupby('BillNo').agg(
    items=('item_id', lambda x: ' '.join(list(x))),
    total_value=('Value', lambda x: int(x.sum())),
    values=('Value_Int', lambda x: ' '.join(list(x))),
    num_items=('item_id', 'count'),  # items per bill (assumed; needed for the filter below)
)

# filter out long transactions, only use subset of transactions
max_items = 10  # illustrative upper bound; the exact cutoff is not given here
transaction_db = transaction_db.loc[transaction_db.num_items < max_items]

Transaction Database
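
One intermediate step: PAMI reads the transaction database from a flat file, so the grouped columns built above (items, total_value, values) have to be written out first. To the best of my knowledge the expected utility format is one transaction per line as items : total utility : per-item utilities, so a hedged sketch would be:

# write the transaction database in PAMI's utility format (assumed: items:TU:utilities)
with open('tdb.csv', 'w') as f:
    for _, row in transaction_db.iterrows():
        f.write(f"{row['items']}:{row['total_value']}:{row['values']}\n")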

We can then apply the EFIM algorithm.

import PAMI.highUtilityPattern.basic.EFIM as efim 

obj = efim.EFIM('tdb.csv', minUtil=1000, sep=' ')
obj.startMine() #start the mining process
obj.save('out.txt') #store the patterns in file
results = obj.getPatternsAsDataFrame() #Get the patterns discovered into a dataframe
obj.printResults()

The algorithm then returns a list of patterns that meet this minimum utility criterion.




10Sep

Logistic Regression, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Sep, 2024


CLASSIFICATION ALGORITHM

Finding the perfect weights to fit the data in

While some probabilistic-based machine learning models (like Naive Bayes) make bold assumptions about feature independence, logistic regression takes a more measured approach. Think of it as drawing a line (or plane) that separates two outcomes, allowing us to predict probabilities with a bit more flexibility.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Logistic regression is a statistical method used for predicting binary outcomes. Despite its name, it’s used for classification rather than regression. It estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class; otherwise, it predicts the other class.

Throughout this article, we’ll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Just like in KNN, logistic regression requires the data to be scaled first. Convert categorical columns into 0 & 1 and also scale the numerical features so that no single feature dominates the optimization.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature). The categorical columns (Outlook & Wind) are encoded using one-hot encoding while the numerical columns are scaled using standard scaling (z-normalization).
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Create dataset from dictionary
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Split data into features and target
X, y = df.drop(columns='Play'), df['Play']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Print results
print("Training set:")
print(pd.concat([X_train, y_train], axis=1), '\n')
print("Test set:")
print(pd.concat([X_test, y_test], axis=1))

Logistic regression works by applying the logistic function to a linear combination of the input features. Here’s how it operates:

  1. Calculate a weighted sum of the input features (similar to linear regression).
  2. Apply the logistic function (also called sigmoid function) to this sum, which maps any real number to a value between 0 and 1.
  3. Interpret this value as the probability of belonging to the positive class.
  4. Use a threshold (typically 0.5) to make the final classification decision.
For our golf dataset, logistic regression might combine the weather factors into a single score, then transform this score into a probability of playing golf.

The training process for logistic regression involves finding the best weights for the input features. Here’s the general outline:

  1. Initialize the weights (often to small random values).
# Initialize weights (including bias) to 0.1
initial_weights = np.full(X_train_np.shape[1], 0.1)

# Display the initial weights
print(f"Initial Weights: {initial_weights}")

2. For each training example:
a. Calculate the predicted probability using the current weights.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def calculate_probabilities(X, weights):
    z = np.dot(X, weights)
    return sigmoid(z)

def calculate_log_loss(probabilities, y):
    return -y * np.log(probabilities) - (1 - y) * np.log(1 - probabilities)

def create_output_dataframe(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)

    df = pd.DataFrame({
        'Probability': probabilities,
        'Label': y,
        'Log Loss': log_losses
    })

    return df

def calculate_average_log_loss(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)
    return np.mean(log_losses)

# Convert X_train and y_train to numpy arrays for easier computation
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

# Add a column of 1s to X_train_np for the bias term
X_train_np = np.column_stack((np.ones(X_train_np.shape[0]), X_train_np))

# Create and display DataFrame for initial weights
initial_df = create_output_dataframe(X_train_np, y_train_np, initial_weights)
print(initial_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print(f"\nAverage Log Loss: {calculate_average_log_loss(X_train_np, y_train_np, initial_weights):.6f}")

b. Compare this probability to the actual class label by calculating its log loss.

3. Update the weights to minimize the loss (usually with an optimization algorithm like gradient descent; this involves repeating Step 2 until the log loss stops decreasing).

def gradient_descent_step(X, y, weights, learning_rate):
    m = len(y)
    probabilities = calculate_probabilities(X, weights)
    gradient = np.dot(X.T, (probabilities - y)) / m
    new_weights = weights - learning_rate * gradient # Create new array for updated weights
    return new_weights

# Perform one step of gradient descent (one of the simplest optimization algorithms)
learning_rate = 0.1
updated_weights = gradient_descent_step(X_train_np, y_train_np, initial_weights, learning_rate)

# Print initial and updated weights
print("\nInitial weights:")
for feature, weight in zip(['Bias'] + list(X_train.columns), initial_weights):
    print(f"{feature:11}: {weight:.2f}")

print("\nUpdated weights after one iteration:")
for feature, weight in zip(['Bias'] + list(X_train.columns), updated_weights):
    print(f"{feature:11}: {weight:.2f}")

# With sklearn, you can get the final weights (coefficients)
# and final bias (intercepts) easily.
# The result is almost the same as doing it manually above.

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(penalty=None, solver='saga')
lr_clf.fit(X_train, y_train)

coefficients = lr_clf.coef_
intercept = lr_clf.intercept_

y_train_prob = lr_clf.predict_proba(X_train)[:, 1]
loss = -np.mean(y_train * np.log(y_train_prob) + (1 - y_train) * np.log(1 - y_train_prob))

print(f"Weights & Bias Final: {coefficients[0].round(2)}, {round(intercept[0],2)}")
print("Loss Final:", loss.round(3))

Once the model is trained:
1. For a new instance, calculate the probability with the final weights (also called coefficients), just like during the training step.

2. Interpret the output by seeing the probability: if p ≥ 0.5, predict class 1; otherwise, predict class 0

# Calculate prediction probability
predicted_probs = lr_clf.predict_proba(X_test)[:, 1]

z_values = np.log(predicted_probs / (1 - predicted_probs))

result_df = pd.DataFrame({
'ID': X_test.index,
'Z-Values': z_values.round(3),
'Probabilities': predicted_probs.round(3)
}).set_index('ID')

print(result_df)

# Make predictions
y_pred = lr_clf.predict(X_test)
print(y_pred)

Evaluation Step

result_df = pd.DataFrame({
'ID': X_test.index,
'Label': y_test,
'Probabilities': predicted_probs.round(2),
'Prediction': y_pred,
}).set_index('ID')

print(result_df)

Logistic regression has several important parameters that control its behavior:

1. Penalty: The type of regularization to use (‘l1’, ‘l2’, ‘elasticnet’, or ‘none’). Regularization in logistic regression prevents overfitting by adding a penalty term to the model’s loss function that encourages simpler models.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

regs = [None, 'l1', 'l2']
coeff_dict = {}

for reg in regs:
    lr_clf = LogisticRegression(penalty=reg, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_
    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[reg] = {
        'Coefficients': coefficients,
        'Intercept': intercept,
        'Loss': loss,
        'Accuracy': accuracy
    }

for reg, vals in coeff_dict.items():
    print(f"{reg}: Coeff: {vals['Coefficients'][0].round(2)}, Intercept: {vals['Intercept'].round(2)}, Loss: {vals['Loss'].round(3)}, Accuracy: {vals['Accuracy'].round(3)}")

2. Regularization Strength (C): Controls the trade-off between fitting the training data and keeping the model simple. A smaller C means stronger regularization.

# List of regularization strengths to try for L1
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l1', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)

    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L1_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0],2),
        'Loss': round(loss,3),
        'Accuracy': round(accuracy*100,2)
    }

print(pd.DataFrame(coeff_dict).T)

# List of regularization strengths to try for L2
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l2', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L2_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0],2),
        'Loss': round(loss,3),
        'Accuracy': round(accuracy*100,2)
    }

print(pd.DataFrame(coeff_dict).T)

3. Solver: The algorithm to use for optimization (‘liblinear’, ‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’). Some regularization might require a particular algorithm.

4. Max Iterations: The maximum number of iterations for the solver to converge.

For our golf dataset, we might start with ‘l2’ penalty, ‘liblinear’ solver, and C=1.0 as a baseline.
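
Put together, that baseline might look like this (a sketch; max_iter is left at a typical value of 100 since it was the last knob discussed):

from sklearn.linear_model import LogisticRegression

baseline_clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=100)
baseline_clf.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline_clf.score(X_test, y_test):.3f}")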

Like any algorithm in machine learning, logistic regression has its strengths and limitations.

Pros:

  1. Simplicity: Easy to implement and understand.
  2. Interpretability: The weights directly show the importance of each feature.
  3. Efficiency: Doesn’t require too much computational power.
  4. Probabilistic Output: Provides probabilities rather than just classifications.

Cons:

  1. Linearity Assumption: Assumes a linear relationship between features and log-odds of the outcome.
  2. Feature Independence: Assumes features are not highly correlated.
  3. Limited Complexity: May underfit in cases where the decision boundary is highly non-linear.
  4. Requires More Data: Needs a relatively large sample size for stable results.

In our golf example, logistic regression might provide a clear, interpretable model of how each weather factor influences the decision to play golf. However, it might struggle if the decision involves complex interactions between weather conditions that can’t be captured by a linear model.

Logistic regression shines as a powerful yet straightforward classification tool. It stands out for its ability to handle complex data while remaining easy to interpret. Unlike some other basic models, it provides smooth probability estimates and works well with many features. In the real world, from predicting customer behavior to medical diagnoses, logistic regression often performs surprisingly well. It’s not just a stepping stone — it’s a reliable model that can match more complex models in many situations.

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
lr_clf = LogisticRegression(penalty='l2', C=1, solver='saga')
lr_clf.fit(X_train, y_train)

# Make predictions
y_pred = lr_clf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")




08Sep

Python QuickStart for People Learning AI | by Shaw Talebi | Sep, 2024


Many computers come with Python pre-installed. To see if your machine has it, go to your Terminal (Mac/Linux) or Command Prompt (Windows), and simply enter “python”.

Using Python in Terminal. Image by author.

If you don’t see a screen like this, you can download Python manually (Windows/ Mac). Alternatively, one can install Anaconda, a popular Python package system for AI and data science. If you run into installation issues, ask your favorite AI assistant for help!

With Python running, we can now start writing some code. I recommend running the examples on your computer as we go along. You can also download all the example code from the GitHub repo.

Strings & Numbers

A data type (or just “type”) is a way to classify data so that it can be processed appropriately and efficiently in a computer.

Types are defined by a possible set of values and operations. For example, strings are arbitrary character sequences (i.e. text) that can be manipulated in specific ways. Try the following strings in your command line Python instance.

"this is a string"
>> 'this is a string'
'so is this:-1*!@&04"(*&^}":>?'
>> 'so is this:-1*!@&04"(*&^}":>?'
"""and
this is
too!!11!"""
>> 'and\n this is\n too!!11!'
"we can even " + "add strings together"
>> 'we can even add strings together'

Although strings can be added together (i.e. concatenated), they can’t be added to numerical data types like int (i.e. integers) or float (i.e. numbers with decimals). If we try that in Python, we will get an error message because operations are only defined for compatible types.

# we can't add strings to other data types (BTW this is how you write comments in Python)
"I am " + 29
>> TypeError: can only concatenate str (not "int") to str
# so we have to write 29 as a string
"I am " + "29"
>> 'I am 29'

Lists & Dictionaries

Beyond the basic types of strings, ints, and floats, Python has types for structuring larger collections of data.

One such type is a list, an ordered collection of values. We can have lists of strings, numbers, strings + numbers, or even lists of lists.

# a list of strings
["a", "b", "c"]

# a list of ints
[1, 2, 3]

# list with a string, int, and float
["a", 2, 3.14]

# a list of lists
[["a", "b"], [1, 2], [1.0, 2.0]]

Another core data type is a dictionary, which consists of key-value pairs, where keys are typically strings and values can be any data type. This is a great way to represent data with multiple attributes.

# a dictionary
{"Name":"Shaw"}

# a dictionary with multiple key-value pairs
{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]}

# a list of dictionaries
[{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
{"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}]

# a nested dictionary
{"User":{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
"Last_login":"2024-09-06",
"Membership_Tier":"Free"}

So far, we’ve seen some basic Python data types and operations. However, we are still missing an essential feature: variables.

Variables provide an abstract representation of an underlying data type instance. For example, I might create a variable called user_name, which represents a string containing my name, “Shaw.” This enables us to write flexible programs not limited to specific values.

# creating a variable and printing it
user_name = "Shaw"
print(user_name)

#>> Shaw

We can do the same thing with other data types, e.g. ints and lists.

# defining more variables and printing them as a formatted string. 
user_age = 29
user_interests = ["AI", "Music", "Bread"]

print(f"{user_name} is {user_age} years old. His interests include {user_interests}.")

#>> Shaw is 29 years old. His interests include ['AI', 'Music', 'Bread'].

Now that our example code snippets are getting longer, let’s see how to create our first script. This is how we write and execute more sophisticated programs from the command line.

To do that, create a new folder on your computer. I’ll call mine python-quickstart. If you have a favorite IDE (i.e. an Integrated Development Environment), use that to open this new folder and create a new Python file, e.g., my-script.py. There, we can write the ceremonial “Hello, world” program.

# ceremonial first program
print("Hello, world!")

If you don’t have an IDE (not recommended), you can use a basic text editor (e.g. Apple’s TextEdit, Windows’ Notepad). In that case, open the text editor and save a new text file using the .py extension instead of .txt. Note: If you use TextEdit on Mac, you may need to put the application in plain text mode via Format > Make Plain Text.

We can then run this script using the Terminal (Mac/Linux) or Command Prompt (Windows) by navigating to the folder with our new Python file and running the following command.

python my-script.py

Congrats! You ran your first Python script. Feel free to expand this program by copy-pasting the upcoming code examples and rerunning the script to see their outputs.

Two fundamental constructs in Python (or any other programming language) are loops and conditionals.

Loops allow us to run a particular chunk of code multiple times. The most popular is the for loop, which runs the same code while iterating over a variable.

# a simple for loop iterating over a sequence of numbers
for i in range(5):
    print(i) # print ith element

# for loop iterating over a list
user_interests = ["AI", "Music", "Bread"]

for interest in user_interests:
    print(interest) # print each item in list

# for loop iterating over items in a dictionary
user_dict = {"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]}

for key in user_dict.keys():
    print(key, "=", user_dict[key]) # print each key and corresponding value

The other core construct is the conditional, such as an if-else statement, which enables us to program logic. For example, we may want to check if the user is an adult or evaluate their wisdom.

# check if user is 18 or older
if user_dict["Age"] >= 18:
    print("User is an adult")

# check if user is 1000 or older, if not print they have much to learn
if user_dict["Age"] >= 1000:
    print("User is wise")
else:
    print("User has much to learn")

It’s common to use conditionals within for loops to apply different operations based on specific conditions, such as counting the number of users interested in bread.

# count the number of users interested in bread
user_list = [{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
             {"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}]
count = 0 # initialize count

for user in user_list:
    if "Bread" in user["Interests"]:
        count = count + 1 # update count

print(count, "user(s) interested in Bread")

Functions are operations we can perform on our data, and many of them are only defined for specific data types.

We’ve already seen a basic function, print(), which works on any data type. However, there are a few other handy ones worth knowing.

# print(), a function we've used several times already
for key in user_dict.keys():
    print(key, ":", user_dict[key])

# type(), getting the data type of a variable
for key in user_dict.keys():
    print(key, ":", type(user_dict[key]))

# len(), getting the length of a variable
for key in user_dict.keys():
    print(key, ":", len(user_dict[key]))
# TypeError: object of type 'int' has no len()

We see that, unlike print() and type(), len() is not defined for all data types, so it throws an error when applied to an int. There are several other type-specific functions like this.

# string methods
# --------------
# make string all lowercase
print(user_dict["Name"].lower())

# make string all uppercase
print(user_dict["Name"].upper())

# split string into list based on a specific character sequence
print(user_dict["Name"].split("ha"))

# replace a character sequence with another
print(user_dict["Name"].replace("w", "whin"))

# list methods
# ------------
# add an element to the end of a list
user_dict["Interests"].append("Entrepreneurship")
print(user_dict["Interests"])

# remove a specific element from a list
user_dict["Interests"].pop(0)
print(user_dict["Interests"])

# insert an element into a specific place in a list
user_dict["Interests"].insert(1, "AI")
print(user_dict["Interests"])

# dict methods
# ------------
# accessing dict keys
print(user_dict.keys())

# accessing dict values
print(user_dict.values())

# accessing dict items
print(user_dict.items())

# removing a key
user_dict.pop("Name")
print(user_dict.items())

# adding a key
user_dict["Name"] = "Shaw"
print(user_dict.items())

While the core Python functions are helpful, the real power comes from creating user-defined functions to perform custom operations. Additionally, custom functions allow us to write much cleaner code. For example, here are some of the previous code snippets repackaged as user-defined functions.

# define a custom function
def user_description(user_dict):
    """
    Function to return a sentence (string) describing input user
    """
    return f'{user_dict["Name"]} is {user_dict["Age"]} years old and is interested in {user_dict["Interests"][0]}.'

# print user description
description = user_description(user_dict)
print(description)

# print description for a new user!
new_user_dict = {"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}
print(user_description(new_user_dict))

# define another custom function
def interested_user_count(user_list, topic):
    """
    Function to count number of users interested in an arbitrary topic
    """
    count = 0

    for user in user_list:
        if topic in user["Interests"]:
            count = count + 1

    return count

# define user list and topic
user_list = [user_dict, new_user_dict]
topic = "Shopping"

# compute interested user count and print it
count = interested_user_count(user_list, topic)
print(f"{count} user(s) interested in {topic}")

Although we could implement an arbitrary program using core Python, this can be incredibly time-consuming for some use cases. One of Python’s key benefits is its vibrant developer community and a robust ecosystem of software packages. Almost anything you might want to implement with core Python (probably) already exists as an open-source library.

We can install such packages using Python’s native package manager, pip. To install new packages, we run pip commands from the command line. Here is how we can install numpy, an essential data science library that implements basic mathematical objects and operations.

pip install numpy

After we’ve installed numpy, we can import it into a new Python script and use some of its data types and functions.

import numpy as np

# create a "vector"
v = np.array([1, 3, 6])
print(v)

# multiply a "vector" by a scalar
print(2*v)

# create a matrix
X = np.array([v, 2*v, v/2])
print(X)

# element-wise multiplication (broadcasts v across the rows of X)
print(X*v)

# matrix-vector multiplication uses the @ operator
print(X @ v)

The previous pip command added numpy to our base Python environment. Alternatively, it’s a best practice to create so-called virtual environments. These are collections of Python libraries that can be readily interchanged for different projects.

Here’s how to create a new virtual environment called my-env.

python -m venv my-env

Then, we can activate it.

# mac/linux
source my-env/bin/activate

# windows
.\my-env\Scripts\activate.bat

Finally, we can install new libraries, such as numpy, using pip.

pip install numpy

Note: If you’re using Anaconda, check out this handy cheatsheet for creating a new conda environment.
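A common follow-on step (not covered in the original guide) is to record the environment’s packages so it can be recreated elsewhere; requirements.txt is simply the conventional file name.

# inside the activated environment, write the installed packages to a file
pip freeze > requirements.txt

# in a fresh environment (or on another machine), install from that file
pip install -r requirements.txt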

Several other libraries are commonly used in AI and data science. Here is a non-comprehensive overview of some helpful ones for building AI projects.

A non-comprehensive overview of Python libs for data science and AI. Image by author.

Now that we have been exposed to the basics of Python, let’s see how we can use it to implement a simple AI project. Here, I will use the OpenAI API to create a research paper summarizer and keyword extractor.

Like all the other snippets in this guide, the example code is available at the GitHub repository.
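For a rough sense of what such a project involves, a minimal summarizer-plus-keyword call with the openai package might look like the sketch below. This is an illustration only, not the author’s implementation from the repo; the model name and prompt wording are assumptions.

import os
from openai import OpenAI

# assumes the openai package is installed and an API key is set in OPENAI_API_KEY
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

paper_text = "..."  # paste the paper's abstract or body text here

response = client.chat.completions.create(
    model="gpt-4o-mini",  # model name is an assumption; any chat model can be used
    messages=[
        {"role": "system", "content": "You summarize research papers and extract keywords."},
        {"role": "user", "content": f"Summarize this paper in 3 sentences, then list 5 keywords:\n\n{paper_text}"},
    ],
)

print(response.choices[0].message.content)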




04Sep

Small Language Models Supporting Large Language Models | by Cobus Greyling | Sep, 2024


Consider the image above, which demonstrates Hallucination Detection with an LLM as a Constrained Reasoner; a rough code sketch of this flow follows the steps below…

  • Initial Detection: Grounding sources and hypothesis pairs are input into a small language model (SLM) classifier.
  • No Hallucination: If no hallucination is detected, the “no hallucination” result is sent directly to the client.
  • Hallucination Detected: If the SLM detects a hallucination, an LLM-based constrained reasoner steps in to interpret the SLM’s decision.
  • Alignment Check: If the reasoner agrees with the SLM’s hallucination detection, this information, along with the original hypothesis, is sent to the client.
  • Discrepancy: If there’s a disagreement, the potentially problematic hypothesis is either filtered out or used as feedback to improve the SLM.
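To make the routing concrete, here is a minimal, hypothetical Python sketch of that flow. The helper functions slm_classify and llm_explain_and_check are stubs I have made up for illustration, not APIs from the paper or any library.

# Hypothetical sketch of the SLM-first, LLM-as-constrained-reasoner routing above.

def slm_classify(grounding_sources, hypothesis):
    # placeholder: a real implementation would run a small classifier here
    return hypothesis not in grounding_sources

def llm_explain_and_check(grounding_sources, hypothesis):
    # placeholder: a real implementation would prompt an LLM to explain the SLM's decision
    explanation = f"'{hypothesis}' is not supported by the grounding sources."
    reasoner_agrees = True
    return explanation, reasoner_agrees

def handle_hypothesis(grounding_sources, hypothesis):
    # Initial Detection: cheap check with the small language model
    if not slm_classify(grounding_sources, hypothesis):
        # No Hallucination: return directly to the client
        return {"hypothesis": hypothesis, "hallucination": False}

    # Hallucination Detected: the LLM-based constrained reasoner interprets the SLM's decision
    explanation, reasoner_agrees = llm_explain_and_check(grounding_sources, hypothesis)

    if reasoner_agrees:
        # Alignment Check: send the hypothesis and the explanation to the client
        return {"hypothesis": hypothesis, "hallucination": True, "explanation": explanation}

    # Discrepancy: filter the hypothesis out or keep it as feedback to improve the SLM
    return {"hypothesis": None, "flagged_for_slm_feedback": hypothesis}

print(handle_hypothesis(["the sky is blue"], "the sky is green"))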

Given the infrequent occurrence of hallucinations in practical use, the average time and cost of using LLMs for reasoning on hallucinated texts is manageable.

This approach leverages the existing reasoning and explanation capabilities of LLMs, eliminating the need for substantial domain-specific data and costly fine-tuning.

While LLMs have traditionally been used as end-to-end solutions, recent approaches have explored their ability to explain small classifiers through latent features.

We propose a novel workflow to address this challenge by balancing latency and interpretability. ~ Source

One challenge of this implementation is the possible delta between the SLM’s decisions and the LLM’s explanations…

  • This work introduces a constrained reasoner for hallucination detection, balancing latency and interpretability.
  • It provides a comprehensive analysis of upstream-downstream consistency.
  • It offers practical solutions to improve alignment between detection and explanation.
  • It demonstrates effectiveness on multiple open-source datasets.

If you find any of my observations to be inaccurate, please feel free to let me know…🙂

  • I appreciate that this study focuses on introducing guardrails & checks for conversational UIs.
  • When interacting with real users, incorporating a human-in-the-loop approach helps with data annotation and continuous improvement by reviewing conversations.
  • It also adds an element of discovery, observation and interpretation, providing insights into the effectiveness of hallucination detection.
  • The architecture presented in this study offers a glimpse into the future, showcasing a more orchestrated approach where multiple models work together.
  • The study also addresses current challenges like cost, latency, and the need to critically evaluate any additional overhead.
  • Using small language models is advantageous as it allows for the use of open-source models, which reduces costs, offers hosting flexibility, and provides other benefits.
  • Additionally, this architecture can be applied asynchronously, where the framework reviews conversations after they occur. These human-supervised reviews can then be used to fine-tune the SLM or perform system updates.

✨ Follow me on LinkedIn for updates ✨

I’m currently the Chief Evangelist @ Kore.ai. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.




