29Jul

AI Agents: Exploring Agentic Applications | by Cobus Greyling | Jul, 2024


Applications based on LLMs are evolving, and the next step in this progression of AI Agents is Agentic Applications. Agentic applications still have a Foundation Model as their backbone, but they have more agency.

Agentic applications are AI-driven systems designed to autonomously perform tasks and make decisions based on user inputs and environmental context.

These applications leverage advanced models and tools to plan, execute, and adapt their actions dynamically.

By integrating capabilities like tool access, multi-step reasoning, and real-time adjustments, agentic applications can generate and complete complex workflows and provide intelligent solutions.

I must add that while many theories and future projections are based on speculation, I prioritise prototyping and creating working examples. This approach grounds commentary in practical experience, leading to more accurate future projections.

Generative and language-related AI is moving at a tremendous pace. As recently as 2018, the first notion of prompt engineering was introduced to combine NLP tasks and cast them as a single question-answering problem within a specific context.

As recently as April 2021, the term RAG was coined by researchers, described as Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

Only in January 2022 was the chain-of-thought prompting technique proposed by Google researchers.

In September 2022, OpenAI introduced Whisper, an open-source acoustic model that approaches human-level robustness and accuracy in speech recognition.

In 2023 we saw Large Language Models progress beyond a text-only interface with the introduction of image and audio processing.

The term Foundation Model was an apt new reference to Large Language Models which, apart from generating compelling text, can also generate images, videos, speech, music, and more.

The term Foundation Model was coined by the Stanford Institute for Human-Centered Artificial Intelligence as early as August 2021.

Also in 2023 we saw the rise of Small Language Models (SLMs). Even though SLMs have a small footprint, they have advanced capabilities in reasoning, Natural Language Generation (NLG), context and dialog management, and more.

In 2023 we also saw the rise of Agents. Agents have an LLM as their backbone, along with access to one or more tools to perform specific tasks.

Agents are able to answer highly ambiguous and complex questions…

Agents leverage LLMs to make a decision on which Action to take. After an Action is completed, the Agent enters the Observation step.

From the Observation step, the Agent shares a Thought; if a final answer is not reached, the Agent cycles back to another Action in order to move closer to a Final Answer.

Agents are empowered by tools; these can include math libraries, web search, weather APIs, and other integration points.
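As a rough, framework-agnostic illustration, this Action-Observation-Thought loop can be sketched in Python as follows; llm_decide and the tool functions are hypothetical stand-ins, not part of any specific library:

# Minimal sketch of an agent loop; llm_decide() and the tool functions are hypothetical.
def run_agent(question, tools, max_steps=5):
    observations = []
    for _ in range(max_steps):
        # The LLM reasons over the question and prior observations (Thought),
        # then either picks a tool to call (Action) or returns a Final Answer.
        decision = llm_decide(question, observations)
        if decision["type"] == "final_answer":
            return decision["answer"]
        result = tools[decision["tool"]](decision["tool_input"])  # Action
        observations.append(result)                               # Observation
    return "No final answer reached within the step limit."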

Agentic Applications can be seen as the next step in this progression, where applications have more agency because they are able to browse and interpret the web, understand mobile interfaces, and access multiple modalities.




29Jul

Why we need Continual Learning for AI models


Why, in a world where the only constant is change, we need a Continual Learning approach to AI models.

Image by the author generated in Midjourney

Imagine you have a small robot that is designed to walk around your garden and water your plants. Initially, you spend a few weeks collecting data to train and test the robot, investing considerable time and resources. The robot learns to navigate the garden efficiently when the ground is covered with grass and bare soil.

However, as the weeks go by, flowers begin to bloom and the appearance of the garden changes significantly. The robot, trained on data from a different season, now fails to recognise its surroundings accurately and struggles to complete its tasks. To fix this, you need to add new examples of the blooming garden to the model.

Your first thought is to add new data examples to the training and retrain the model from scratch. But this is expensive and you do not want to do this every time the environment changes. In addition, you have just realised that you do not have all the historical training data available.

Now you consider just fine-tuning the model with new samples. But this is risky because the model may lose some of its previously learned capabilities, leading to catastrophic forgetting (a situation where the model loses previously acquired knowledge and skills when it learns new information).

So, is there an alternative? Yes, there is: Continual Learning!

Of course, the robot watering plants in a garden is only an illustrative example of the problem. In the later parts of the text you will see more realistic applications.

Learn adaptively with Continual Learning (CL)

It is not possible to foresee and prepare for all the possible scenarios that a model may be confronted with in the future. Therefore, in many cases, adaptive training of the model as new samples arrive can be a good option.

In CL we want to find a balance between the stability of a model and its plasticity. Stability is the ability of a model to retain previously learned information, and plasticity is its ability to adapt to new information as new tasks are introduced.

“(…) in the Continual Learning scenario, a learning model is required to incrementally build and dynamically update internal representations as the distribution of tasks dynamically changes across its lifetime.” [2]

But how do we control stability and plasticity?

Researchers have identified a number of ways to build adaptive models. In [3] the following categories have been established:

1. Regularisation-based approach

  • In this approach we add a regularisation term that balances the effects of old and new tasks on the model structure.
  • For example, weight regularisation aims to control the variation of the parameters by adding a penalty term to the loss function, which penalises a parameter change according to how much that parameter contributed to previous tasks (see the sketch after this list).

2. Replay-based approach

  • This group of methods focuses on recovering some of the historical data so that the model can still reliably solve previous tasks. One of the limitations of this approach is that we need access to historical data, which is not always possible.
  • For example, experience replay, where we preserve and replay a sample of old training data. When training a new task, some examples from previous tasks are added to expose the model to a mixture of old and new task types, thereby limiting catastrophic forgetting.

3. Optimisation-based approach

  • Here we want to manipulate the optimisation methods to maintain performance for all tasks, while reducing the effects of catastrophic forgetting.
  • For example, gradient projection is a method where gradients computed for new tasks are projected so as not to affect previous gradients.

4. Representation-based approach

  • This group of methods focuses on obtaining and using robust feature representations to avoid catastrophic forgetting.
  • For example, self-supervised learning, where a model can learn a robust representation of the data before being trained on specific tasks. The idea is to learn high-quality features that reflect good generalisation across different tasks that a model may encounter in the future.

5. Architecture-based approach

  • The previous methods assume a single model with a single parameter space, but there are also a number of techniques in CL that exploit the model’s architecture.
  • For example, parameter allocation, where, during training, each new task is given a dedicated subspace in a network, which removes the problem of parameter destructive interference. However, if the network is not fixed, its size will grow with the number of new tasks.
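To make the weight-regularisation example above concrete, here is a minimal sketch, assuming PyTorch, a stored copy of the parameters after the previous task (old_params) and precomputed per-parameter importance weights (importance, e.g. estimated from the Fisher information as in EWC); all names are illustrative:

import torch

def regularised_loss(model, task_loss, old_params, importance, lam=0.1):
    # EWC-style penalty: discourage changing parameters that mattered for previous tasks.
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + lam * penalty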

And how do we evaluate the performance of CL models?

The basic performance of CL models can be measured from a number of angles [3] (a small numeric sketch follows the list below):

  • Overall performance evaluation: average performance across all tasks
  • Memory stability evaluation: calculating the difference between maximum performance for a given task before and its current performance after continual training
  • Learning plasticity evaluation: measuring the difference between joint training performance (if trained on all data) and performance when trained using CL
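As a small numeric illustration (with made-up accuracy values), the first two quantities can be computed from a matrix acc[i, j] that holds the accuracy on task j after training sequentially on tasks 0..i:

import numpy as np

# acc[i, j]: accuracy on task j after training sequentially on tasks 0..i (made-up values).
acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.88, 0.00],
                [0.80, 0.84, 0.91]])

overall = acc[-1].mean()  # overall performance: average accuracy over all tasks at the end
# memory stability: average drop from the best accuracy a task ever had to its final accuracy
forgetting = np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(acc.shape[1] - 1)])
# learning plasticity would additionally require a joint-training baseline (not shown here)
print(f"average accuracy: {overall:.2f}, average forgetting: {forgetting:.2f}")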

So why don’t all AI researchers switch to Continual Learning right away?

If you have access to the historical training data and are not worried about the computational cost, it may seem easier to just train from scratch.

One of the reasons for this is that the interpretability of what happens in the model during continual training is still limited. If training from scratch gives the same or better results than continual training, then people may prefer the easier approach, i.e. retraining from scratch, rather than spending time trying to understand the performance problems of CL methods.

In addition, current research tends to focus on the evaluation of models and frameworks, which may not reflect well the real use cases that the business may have. As mentioned in [6], there are many synthetic incremental benchmarks that do not reflect well real-world situations where there is a natural evolution of tasks.

Finally, as noted in [4], many papers on the topic of CL focus on storage rather than computational costs, and in reality, storing historical data is much less costly and energy consuming than retraining the model.

If there were more focus on the inclusion of computational and environmental costs in model retraining, more people might be interested in improving the current state of the art in CL methods as they would see measurable benefits. For example, as mentioned in [4], model re-training can exceed 10 000 GPU days of training for recent large models.

Why should we work on improving CL models?

Continual learning seeks to address one of the most challenging bottlenecks of current AI models — the fact that data distribution changes over time. Retraining is expensive and requires large amounts of computation, which is not a very sustainable approach from both an economic and environmental perspective. Therefore, in the future, well-developed CL methods may allow for models that are more accessible and reusable by a larger community of people.

As found and summarised in [4], there is a list of applications that inherently require or could benefit from the well-developed CL methods:

1. Model Editing

  • Selective editing of an error-prone part of a model without damaging other parts of the model. Continual Learning techniques could help to continuously correct model errors at a much lower computational cost.

2. Personalisation and specialisation

  • General purpose models sometimes need to be adapted to be more personalised for specific users. With Continual Learning, we could update only a small set of parameters without introducing catastrophic forgetting into the model.

3. On-device learning

  • Small devices have limited memory and computational resources, so methods that can efficiently train the model in real time as new data arrives, without having to start from scratch, could be useful in this area.

4. Faster retraining with warm start

  • Models need to be updated when new samples become available or when the distribution shifts significantly. With Continual Learning, this process can be made more efficient by updating only the parts affected by new samples, rather than retraining from scratch.

5. Reinforcement learning

  • Reinforcement learning involves agents interacting with an environment that is often non-stationary. Therefore, efficient Continual Learning methods and approaches could be potentially useful for this use case.

Learn more

As you can see, there is still a lot of room for improvement in the area of Continual Learning methods. If you are interested you can start with the materials below:

  • Introduction course: [Continual Learning Course] Lecture #1: Introduction and Motivation from ContinualAI on YouTube https://youtu.be/z9DDg2CJjeE?si=j57_qLNmpRWcmXtP
  • Paper about the motivation for the Continual Learning: Continual Learning: Application and the Road Forward [4]
  • Paper about the state of the art techniques in Continual Learning: Comprehensive Survey of Continual Learning: Theory, Method and Application [3]

If you have any questions or comments, please feel free to share them in the comments section.

Cheers!

Image by the author generated in Midjourney

[1] Awasthi, A., & Sarawagi, S. (2019). Continual Learning with Neural Networks: A Review. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 362–365). Association for Computing Machinery.

[2] Continual AI Wiki Introduction to Continual Learning https://wiki.continualai.org/the-continualai-wiki/introduction-to-continual-learning

[3] Wang, L., Zhang, X., Su, H., & Zhu, J. (2024). A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5362–5383.

[4] Eli Verwimp, Rahaf Aljundi, Shai Ben-David, Matthias Bethge, Andrea Cossu, Alexander Gepperth, Tyler L. Hayes, Eyke Hüllermeier, Christopher Kanan, Dhireesha Kudithipudi, Christoph H. Lampert, Martin Mundt, Razvan Pascanu, Adrian Popescu, Andreas S. Tolias, Joost van de Weijer, Bing Liu, Vincenzo Lomonaco, Tinne Tuytelaars, & Gido M. van de Ven. (2024). Continual Learning: Applications and the Road Forward https://arxiv.org/abs/2311.11908

[5] Awasthi, A., & Sarawagi, S. (2019). Continual Learning with Neural Networks: A Review. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 362–365). Association for Computing Machinery.

[6] Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, & Fartash Faghri. (2024). TiC-CLIP: Continual Training of CLIP Models.




28Jul

I found a hidden gem in Matplotlib’s library: Packed Bubble Charts in Python | by Anna Gordun Peiro | Jul, 2024


For my chart, I am using an Olympic Historical Dataset from Olympedia.org which Joseph Cheng shared in Kaggle with a public domain license.

Screenshot of dataset

It contains event-to-athlete-level Olympic Games results from Athens 1896 to Beijing 2022. After an EDA (Exploratory Data Analysis), I transformed it into a dataset that details the number of female athletes in each sport/event per year. My bubble chart idea is to show which sports have a 50/50 female-to-male athlete ratio and how it has evolved over time.

My plotting data is composed of two different datasets, one for each year: 2020 and 1996. For each dataset I’ve computed the total number of athletes that participated in each event (athlete_sum) and the share that this sum represents of the total number of athletes, male plus female (difference). See a screenshot of the data below:

Screenshot of plotting dataset

This is my approach to visualise it:

  • Size proportion: using the radius of the bubbles to compare the number of athletes per sport. Bigger bubbles represent highly competitive events, such as Athletics.
  • Multi-variable interpretation: using colours to represent female representation. Light green bubbles represent events with a 50/50 split, such as Hockey.

Here is my starting point (using the code and approach from above):

First result

Some easy fixes: increasing the figure size, and changing labels to empty strings if the bubble size isn’t over 250, to avoid having words spill outside the bubbles.

fig, ax = plt.subplots(figsize=(12,8),subplot_kw=dict(aspect="equal"))

#Labels edited directly in dataset
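A minimal sketch of that label fix in pandas, assuming hypothetical column names sport (the label) and athlete_sum (the bubble size):

# Blank out labels for bubbles that are too small to hold text (column names are illustrative).
df["label"] = df["sport"].where(df["athlete_sum"] > 250, "")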

Second result

Well, now at least it’s readable. But, why is Athletics pink and Boxing blue? Let’s add a legend to illustrate the relationship between colours and female representation.

Because it’s not your regular barplot chart, plt.legend() doesn’t do the trick here.

Using Matplotlib’s AnnotationBbox we can create rectangles (or circles) to show the meaning behind each colour. We can also do the same thing to show a bubble scale.

import matplotlib.pyplot as plt
from matplotlib.offsetbox import (AnnotationBbox, DrawingArea,
                                  TextArea, HPacker)
from matplotlib.patches import Circle, Rectangle

# This is an example for one section of the legend

# Define where the annotation (legend) will be
xy = [50, 128]

# Create your colored rectangle or circle
da = DrawingArea(20, 20, 0, 0)
p = Rectangle((10, 10), 10, 10, color="#fc8d62ff")
da.add_artist(p)

# Add text
text = TextArea("20%", textprops=dict(color="#fc8d62ff", size=14, fontweight='bold'))

# Combine rectangle and text
vbox = HPacker(children=[da, text], align="top", pad=0, sep=3)

# Annotate both in a box (change alpha if you want to see the box)
ab = AnnotationBbox(vbox, xy,
                    xybox=(1.005, xy[1]),
                    xycoords='data',
                    boxcoords=("axes fraction", "data"),
                    box_alignment=(0.2, 0.5),
                    bboxprops=dict(alpha=0))

# Add to your bubble chart
ax.add_artist(ab)

I’ve also added a subtitle and a text description under the chart just by using plt.text()
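For example, such annotations could look like the lines below; the wording and coordinates are illustrative, and plt.figtext is used here so the text can be placed in figure coordinates:

# Subtitle above the axes and a short description below the chart (illustrative values).
plt.figtext(0.5, 0.93, "Female representation per Olympic sport", ha="center", fontsize=12)
plt.figtext(0.5, 0.02, "Bubble size reflects the number of athletes; colour reflects the female share.",
            ha="center", fontsize=9)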

Final visualisation

Straightforward and user-friendly interpretations of the graph:

  • Majority of bubbles are light green → green means 50% females → majority of Olympic competitions have an even 50/50 female to male split (yay🙌)
  • Only one sport (Baseball), in dark green colour, has no female participation.
  • 3 sports have only female participation but the number of athletes is fairly low.
  • The biggest sports in terms of athlete number (Swimming, Athletics and Gymnastics) are very close to having a 50/50 split




27Jul

Radical Simplicity in Data Engineering | by Cai Parry-Jones | Jul, 2024


Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking

source: unsplash.com

Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were:

  • Not knowing why something broke
  • Getting burnt with high cloud compute costs
  • Taking too long to build data solutions/complete data projects
  • Needing expertise on many tools and technologies

These problems aren’t new. I’ve experienced them, and you’ve probably experienced them. Yet, we can’t seem to find a solution that solves all of these issues in the long run. You might think to yourself, ‘well, point one can be solved with {insert data observability tool}’, or ‘point two just needs a stricter data governance plan in place’. The problem with this style of solution is that it adds additional layers of complexity, which causes the final two pain points to increase in seriousness. The aggregate sum of pain remains the same, just with a different distribution between the four points.

created by the author using Google Sheets

This article aims to present a contrary style of problem solving: radical simplicity.

TL;DR

  • Software engineers have found massive success in embracing simplicity.
  • Over-engineering and pursuing perfection can result in bloated, slow-to-develop data systems, with sky high costs to the business.
  • Data teams should consider sacrificing some functionality for the sake of simplicity and speed.

A Lesson From Those Software Guys

In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems paradoxically called ‘Worse Is Better’. I won’t go into the details, you can read the essay here if you like, but the underlying message was that software quality does not necessarily improve as functionality increases. In other words, on occasions, you can sacrifice completeness for simplicity and end up with an inherently ‘better’ product because of it.

This was a strange idea to the pioneers of computing during the 1950/60s. The philosophy of the day was: a computer system needs to be pure, and it can only be pure if it accounts for all possible scenarios. This was likely due to the fact that most leading computer scientists at the time were academics, who very much wanted to treat computer science as a hard science.

Academics at MIT, the leading institution in computing at the time, started working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the MIT guys released their new system. It was unquestionably the most advanced operating system of the time, however it was a pain to install due to the computing requirements, and feature updates were slow due to the size of the code base. As a result, it never caught on beyond a few select universities and industries.

While Multics was being built, a small group supporting Multics’s development became frustrated with the growing requirements of the system. They eventually decided to break away from the project. Armed with this experience, they set their sights on creating their own operating system, one with a fundamental philosophy shift:

The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.

— Richard P. Gabriel

Five years after Multics’s release, the breakaway group released their operating system, Unix. Slowly but steadily it caught traction, and by the 1990s Unix became the go-to choice for computers, with over 90% of the world’s top 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the system underlying macOS.

There were obviously other factors beyond its simplicity that led to Unix’s success. But its lightweight design was, and still is, a highly valuable asset of the system. That could only come about because the designers were willing to sacrifice functionality. The data industry should not be afraid to think the same way.

Back to Data in the 21st Century

Thinking back on my own experiences, the philosophy of most big data engineering projects I’ve worked on was similar to that of Multics. For example, there was a project where we needed to automate standardising the raw data coming in from all our clients. The decision was made to do this in the data warehouse via dbt, since we could then have a full view of data lineage from the very raw files right through to the standardised single-table version and beyond. The problem was that the first stage of transformation was very manual: it required loading each individual raw client file into the warehouse and then having dbt create a model for cleaning each client’s file. This led to hundreds of dbt models needing to be generated, all using essentially the same logic. dbt became so bloated it took minutes for the data lineage chart to load in the dbt docs website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for each pull request.

This could have been resolved fairly simply if leadership had allowed us to make the first layer of transformations outside of the data warehouse, using AWS Lambda and Python. But no, that would have meant the data lineage produced by dbt wouldn’t be 100% complete. That was it. That was the whole reason not to massively simplify the project. Similar to the group who broke away from the Multics project, I left this project mid-build; it was simply too frustrating to work on something that so clearly could have been much simpler. As I write this, I discovered they are still working on the project.

So, What the Heck is Radical Simplicity?

Radical simplicity in data engineering isn’t a framework or data-stack toolkit, it is simply a frame of mind. A philosophy that prioritises simple, straightforward solutions over complex, all-encompassing systems.

Key principles of this philosophy include:

  1. Minimalism: Focusing on core functionalities that deliver the most value, rather than trying to accommodate every possible scenario or requirement.
  2. Accepting trade-offs: Willingly sacrificing some degree of completeness or perfection in favour of simplicity, speed, and ease of maintenance.
  3. Pragmatism over idealism: Prioritising practical, workable solutions that solve real business problems efficiently, rather than pursuing theoretically perfect but overly complex systems.
  4. Reduced cognitive load: Designing systems and processes that are easier to understand, implement, and maintain, thus reducing the expertise required across multiple tools and technologies.
  5. Cost-effectiveness: Embracing simpler solutions that often require less computational resources and human capital, leading to lower overall costs.
  6. Agility and adaptability: Creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
  7. Focus on outcomes: Emphasising the end results and business value rather than getting caught up in the intricacies of the data processes themselves.

This mindset can be in direct contradiction to modern data engineering solutions of adding more tools, processes, and layers. As a result, expect to fight your corner. Before suggesting an alternative, simpler solution, come prepared with a deep understanding of the problem at hand. I am reminded of the quote:

It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. […] It’s not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. […] You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.

— Steve Jobs

Side note: Be aware that adopting radical simplicity doesn’t mean ignoring new tools and advanced technologies. In fact, one of my favourite solutions for a data warehouse at the moment is a new open-source database called DuckDB. Check it out, it’s pretty cool.

Conclusion

The lessons from software engineering history offer valuable insights for today’s data landscape. By embracing radical simplicity, data teams can address many of the pain points plaguing modern data solutions.

Don’t be afraid to champion radical simplicity in your data team. Be the catalyst for change if you see opportunities to streamline and simplify. The path to simplicity isn’t easy, but the potential rewards can be substantial.




26Jul

LangChain Based Plan & Execute AI Agent With GPT-4o-mini | by Cobus Greyling | Jul, 2024


As has been widely established by now, Chain-of-Thought (CoT) prompting is a highly effective method for querying LLMs using a single zero-shot or few-shot approach.

It excels at tasks requiring multi-step reasoning, where the model is guided through step-by-step demonstrations before addressing the problem with the instruction "Let us think step by step".

However, recent studies have identified three main limitations of CoT prompting:

  • Calculation errors: 7% failure rate in test examples.
  • Missing steps: 12% failure rate in sequential events.
  • Semantic misunderstanding: 27% failure rate in test examples.

To address these issues, Plan-and-Solve (PS) prompting and its enhanced version, Plan-and-Solve with Detailed Instructions (PS+), have been introduced.

PS involves two key steps:

  1. Creating a plan to break the task into smaller subtasks and then
  2. Executing these subtasks according to the plan.

This simple architecture represents the planning agent framework. It has two main components:

  1. Planner: Prompts an LLM to create a multi-step plan for a large task.
  2. Executors: Receive the user query and a step in the plan, then invoke one or more tools to complete that task.

After execution, the agent is prompted to re-plan, deciding whether to provide a final response or generate a follow-up plan if the initial plan was insufficient.

This design minimises the need to call the large planner LLM for every tool invocation.

However, it remains limited by serial tool calling and requires an LLM for each task, as it doesn’t support variable assignment.

The LLM is assigned in the following way:

llm = OpenAI(temperature=0, model_name='gpt-4o-mini')

Below is the complete Python code for the AI agent. The only changes you need to make are adding your OpenAI API key and your LangSmith project variables.

### Install Required Packages:
pip install -qU langchain-openai langchain langchain_community langchain_experimental
pip install -U duckduckgo-search
pip install -U langchain langchain-openai

### Import Required Modules and Set Environment Variables:
import os
from uuid import uuid4

### Setup the LangSmith environment variables
unique_id = uuid4().hex[0:8]  # can be appended to the project name to keep runs separate
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "OpenAI_SM_1"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  # add your LangSmith API key here

### Import LangChain Components and OpenAI API Key
from langchain.chains import LLMMathChain
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_core.tools import Tool
from langchain_experimental.plan_and_execute import (
    PlanAndExecute,
    load_agent_executor,
    load_chat_planner,
)
from langchain_openai import ChatOpenAI, OpenAI

os.environ['OPENAI_API_KEY'] = ""  # add your OpenAI API key here
llm = OpenAI(temperature=0, model_name='gpt-4o-mini')

### Set Up Search and Math Chain Tools
search = DuckDuckGoSearchAPIWrapper()
llm = OpenAI(temperature=0)  # note: this overrides the assignment above, so the math chain uses the default completion model
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for when you need to answer questions about math",
    ),
]

### Initialize Planner and Executor
model = ChatOpenAI(model_name='gpt-4o-mini', temperature=0)
planner = load_chat_planner(model)
executor = load_agent_executor(model, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor)

### Invoke the Agent
agent.invoke(
    "Who is the founder of SpaceX and what is the square root of his year of birth?"
)




26Jul

I Built a New AI Marketing Tool and am giving it out for free! | by Hasan Aboul Hasan | Jul, 2024


In this post, I want to introduce a new AI tool I’ve recently developed.

It’s designed to help you conduct intensive topic research, complete with search metrics, and create blog post drafts directly on WordPress.

Something like this:

The best part?

This tool, along with its source code, will be available to you completely free!

Let’s Start with a simple demo of the tool:

The concept is simple: you input a topic you want to research, and the tool uses recursive LLM calls to perform in-depth research with AI, generating hundreds of related subtopics organized and interconnected in a graph structure.

But that’s not all — it comes with additional features like fetching keyword data for each generated topic and automatically creating and publishing blog post drafts on your WordPress site.

Before we dive in and run the tool, let’s explore some of its benefits.

This tool can be super valuable for:

  • Conducting master’s and Ph.D.-level academic surveys and research
  • Generating blogging ideas and drafts
  • Automating SEO processes
  • Brainstorming YouTube content ideas
  • Building your own topic research tools on top of this tool
  • Offering freelance services based on this technology

Download the project files

First, ensure you have Python installed on your computer. Then, click on the link below to download the project files.

Download Project Files

Once downloaded, open the files with your preferred IDE. I use Visual Studio Code, but feel free to use whichever editor you’re comfortable with.

Start by creating a virtual environment and installing the libraries.

So, open a new terminal and run the following step-by-step:

1- Create the Virtual Environment:

python -m venv venv

2- Activate the Virtual Environment:

venv/scripts/activate

(That command is for Windows; on macOS/Linux, use source venv/bin/activate instead.)

3- Install Libraries:

pip install -r requirements.txt

We now have a virtual environment with the necessary libraries installed; it isolates the project and avoids conflicts between package versions.

To run the tool, enter this command:

streamlit run ui.py

If you only want to do topic research, you are good to go from here.

However, if you want to get keyword data and draft blog posts, bear with me a little longer 🙂

To get keyword data, you need to use a trusted source like the Google Keyword Planner API, Semrush API, SpyFu API, or any other reliable API.

In my case, I am using my own Keyword Metrics API hosted on RapidAPI.

If you want to try it, you will need to get the API key from here:

Then, add it to the .env File like this:

RAPIDAPI_API_KEY = "99d728b8e3msh2345tfdgdfsdjsnb1dcf1c170eb"

If you prefer using other APIs, change the following function in the helpers.py file:

def get_keyword_metrics(keywords):
    rapid = RapidAPIClient()
    response = rapid.call_api(api_url=f"https://bulk-keyword-metrics.p.rapidapi.com/seo-tools/get-bulk-keyword-metrics?keywords_count=20&query={keywords}&countryCode=US")
    return response

If you want to automatically create post drafts, you will need to generate an application password from your WordPress dashboard.

Navigate to users, click on your admin user, create an application password, and then add it to the .env file like this:

WORDPRESS_URL = "http://data-tools.local/"

WORDPRESS_USER = "admin"

WORDPRESS_APP_PASSWORD = "AaM7XsSe3x51RKLe3ccogilp"

💡 Pro tip: You can use LocalWP to run a local WordPress website on your PC for testing. It’s free.

The core of this tool, topic research, is built on the concept of generating topic ideas for a given topic across multiple levels.

Let me explain.

If you open ChatGPT now and ask it to generate the top child topics for “Quantum computing,” with a special prompt like this:

as an expert in keyword and topic research specialized in {topic}, 
generate {count} sub topics to write about in the form of SEARCHABLE keywords
for the following parent topic: {topic}

It will respond like this:

Then, if you ask it to generate a list of child topics for one of these topics, like “Quantum Computing Algorithms” it will respond like this:

So, you are digging deeper and discovering child topics for each topic.

I automated this process with a recursive function that keeps calling the LLM to generate topics based on the parent topic.

I used SimplerLLM to make my life easier when creating the project.

Here is the code for running these recursive calls:

def generate_subtopics_graph(graph, topic, current_level, max_level):
    if current_level > max_level:
        return

    print(f"Level {current_level}: Generating subtopics for '{topic}'")

    subtopics = get_topic_children(topic)
    for subtopic in subtopics:
        print(f"  Adding edge from '{topic}' to '{subtopic}'")
        graph.add_edge(topic, subtopic)
        generate_subtopics_graph(graph, subtopic, current_level + 1, max_level)

And this is the get_topic_children Function:

def get_topic_children(topic: str, num_results=3):
    prompt = sub_topics_prompt.format(topic=topic, count=num_results)

    response = generate_pydantic_json_model(model_class=SubTopics,
                                            prompt=prompt,
                                            llm_instance=llm_instance,
                                            max_tokens=1024)
    return response.sub_topics
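A short usage sketch of the two functions above; the networkx DiGraph is an assumption based on the add_edge call:

import networkx as nx

graph = nx.DiGraph()
# Two levels deep: children of the root topic, plus their children.
generate_subtopics_graph(graph, "Quantum computing", current_level=1, max_level=2)
print(f"Generated {graph.number_of_nodes()} topics and {graph.number_of_edges()} links")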

I used generate_pydantic_json_model from SimplerLLM to get the list in JSON format, making it easy to loop and call each topic with my script.

Let me share some advanced tips to help you get the most out of this tool and even monetize it!

First, you are not obliged to use OpenAI. You can simply change the provider by changing this line:

llm_instance = LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o")

This is the power of SimplerLLM!

Second, when starting with the tool, it is better to test with 1–2 levels to make sure it works, then you can test with more sub-levels.

Third, this tool can be a great prototype or MVP for a SaaS.

By adding features like saving data for each user and adding an authentication system, or maybe recreating it with NextJS, you can turn it into your next online business.

Or you can turn it into a simple tool on WordPress and monetize it with a points system, as I do on my website with my tools.

Remember, if you need help running this tool or have any questions, I will be available almost every day on the forum 🙂

Have fun!




25Jul

LangChain Search AI Agent Using GPT-4o-mini | by Cobus Greyling | Jul, 2024


LangSmith also allows for the creation of datasets; output can be annotated, marked as correct or incorrect, and automatic evaluations can be run to determine correctness.

The agent decomposes the compound user question into sub-questions, which are then individually answered.

This code demonstrates how to set up and use a language model (LLM) from OpenAI, specifically with the LangChain framework. It also integrates the Tavily search tool for enhanced information retrieval.

Web search LangChain agent:

### Install Necessary Packages
pip install -qU langchain-openai langchain langchain_community
### Import Required Modules
import getpass
import os
### Set Environment Variables for API Keys
os.environ["OPENAI_API_KEY"] = getpass.getpass()
os.environ["TAVILY_API_KEY"] = getpass.getpass()
### Initialize the OpenAI LLM
from langchain_openai import ChatOpenAI
###
llm = ChatOpenAI(model="gpt-4o-mini")
### Import Necessary LangChain Components
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.prompts import ChatPromptTemplate
### Set Up the Tavily Search Tool
tools = [TavilySearchResults(max_results=1)]
### Create a Chat Prompt Template
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant. Make sure to use the tavily_search_results_json tool for information.",
        ),
        ("placeholder", "{chat_history}"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

### Construct the Tools agent
agent = create_tool_calling_agent(llm, tools, prompt)

### Create an agent executor by passing in the agent and tools
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

### Invoke the Agent Executor:
agent_executor.invoke({"input": "What year was IBM founded and in what year was Apple founded?"})

And the output from the agent. It is clear that there are two queries, year IBM founded and year Apple founded, with the final output consolidated with web references.

> Entering new AgentExecutor chain...

Invoking: `tavily_search_results_json` with `{'query': 'year IBM founded'}`

[{'url': 'https://en.wikipedia.org/wiki/History_of_IBM', 'content': 'Recognizing this trend, management, with the support of the Board of Directors, began to implement a plan to split IBM into increasingly autonomous business units (e.g. processors, storage, software, services, printers, etc.) to compete more effectively with competitors that were more focused and nimble and had lower cost structures.[citation needed]\nIBM also began spinning off its many divisions into autonomous subsidiaries (so-called "Baby Blues") in an attempt to make the company more manageable and to streamline IBM by having other investors finance those companies.[180][181] These included AdStar, dedicated to disk drives and other data storage products (on creation the largest data storage business in the world);[182] IBM Application Business Systems, dedicated to mid-range computers; IBM Enterprise Systems, dedicated to mainframes; Pennant Systems, dedicated to mid-range and large printers; Lexmark, dedicated to small printers, keyboards, and typewriters (such as the Selectric); and more.[183] Lexmark was acquired by Clayton & Dubilier in a leveraged buyout shortly after its formation.[184]\nIn September 1992, IBM combined and spun off their various non-mainframe and non-midrange, personal computer manufacturing divisions into an autonomous wholly owned subsidiary known as the IBM Personal Computer Company (IBM PC Co.).[185][186] Purchases were often instigated by middle managers and senior staff who saw the potential – once the revolutionary VisiCalc spreadsheet, the killer app, had been surpassed by a far more powerful and stable product, Lotus 1-2-3.[citation needed]\nIBM\'s dominance of the mainframe market in Europe and the US encouraged existing customers to buy the PC,[171][173] and vice versa; as sales of what had been an experiment in a new market became a substantial part of IBM\'s financials, the company found that customers also bought larger IBM computers.[174][167][162] Unlike the BUNCH and other rivals IBM quickly adjusted to the retail market,[171][175] with its own sales force competing with outside retailers for the first time.[162] The US government\'s case sustained by four US Presidents and their Attorneys General was dropped as "without merit" in 1982 by William Baxter, US President Reagans\' Assistant Attorney General in charge of the Antitrust Division of the U.S. Department of Justice.[315]\nCDC filed an antitrust lawsuit against IBM in Minnesota\'s federal court alleging that IBM had monopolized the market for computers in violation of section 2 of the Sherman Antitrust Act by among other things announcing products it could not deliver.[316] A 1965 internal IBM memo by an IBM attorney noted that Control Data had publicly blamed its declining earnings on IBM, "and its frequent model and price changes. By 1981 its stock price had declined by 22%.[166] IBM\'s earnings for the first half of the year grew by 5.3% – one third of the inflation rate – while those of DEC grew by more than 35%.[165] Although IBM began selling minicomputers,[170] in January 1982 the Justice Department ended the antitrust suit because, The New York Times reported, the government "recognized what computer experts and securities analysts had long since concluded: I.B.M. 
no longer dominates the computer business".[147]\nIBM wished to avoid the same outcome with the new personal computer industry.[169] He asserts that the company was not plundered, its leased machinery was not confiscated, and IBM continued to receive funds through its Geneva-based subsidiary.[86] Black argues that IBM persisted in its business relations with the Nazi regime beyond the point where they should have ceased, maintaining and expanding services to the Third Reich[86] until the seizure of Dehomag following the United States\' declaration of war against Germany in 1941.[citation needed]\nIBM countered these claims by stating that the allegations were based on known facts and previously disclosed documents, asserting the absence of new revelations.'}]
Invoking: `tavily_search_results_json` with `{'query': 'year Apple founded'}`

[{'url': 'https://en.wikipedia.org/wiki/History_of_Apple_Inc.', 'content': 'The head of a retail chain said "It appears that IBM had a better understanding of why the Apple II was successful than had Apple".[78] Gene Amdahl predicted that Apple would be another of the many "brash young companies" that IBM had defeated.[88]\nBy 1984 the press called the two companies archrivals,[89] but IBM had $4 billion in annual PC revenue, more than twice that of Apple and as much as the sales of it and the next three companies combined.[90] A Fortune survey found that 56% of American companies with personal computers used IBM PCs, compared to 16% for Apple.[91] Small businesses, schools, and some homes became the II\'s primary market.[76]\nXerox PARC and the Lisa[edit]\nApple Computer\'s business division was focused on the Apple III, another iteration of the text-based computer. In 1979, the Apple II was chosen to be the desktop platform for the first "killer application" of the business world: VisiCalc, a spreadsheet.[55] So important that the Apple II became what John Markoff described as a "VisiCalc accessory",[57] the application created a business market for the computer and gave home users an additional reason to buy it: compatibility with the office.[55] Before VisiCalc, Apple had been a distant third place competitor to Commodore and Tandy.[58][59]\nThe Apple II was one of the three "1977 Trinity" computers generally credited with creating the home computer market (the other two being the Commodore PET and the Tandy Corporation TRS-80).[60] After giving their results for the first quarter of 2011, Microsoft\'s net profits of $5.2 billion were lower for the quarter than those of Apple, which earned $6 billion in net profit for the quarter.[182][183] The late April announcement of profits by the companies marked the first time in 20 years that Microsoft\'s profits had been lower than Apple\'s,[184] a situation described by Ars Technica as "unimaginable a decade ago".[182]\nThe Guardian reported that one of the reasons for the change was because PC software, where Microsoft dominates, has become less important compared to the tablet and smartphone markets, where Apple has a strong presence.[184] One reason for this was a surprise drop in PC sales in the quarter.[184] Nonetheless, he kept his word and paid the two Steves the money promised.[39][37][38][40]\nThe Apple I went on sale in July 1976 as an assembled circuit board with a retail price of $666.66.[41][42][43] Wozniak later said he had had no idea about the relation between the number and the mark of the beast, and that he came up with the price because he liked repeating digits.[39] About 200 units of the Apple I were eventually sold.[44]\nThe Apple I computer had some notable features, including the use of a TV display, whereas many machines had no display at all. 
The new corporation bought out the partnership the two Steves had formed nine months earlier.[46]\nIn February 1977, Markkula recruited Michael Scott from National Semiconductor to serve as the first president and CEO of Apple Computer, as the two Steves were both insufficiently experienced and he was not interested in taking that position himself.[47][48]\nThat same month, Wozniak resigned from his job at Hewlett-Packard to work full-time for Apple.[46][49]\nApple II[edit]\nAlmost as soon as Apple had started selling its first computers, Wozniak moved on from the Apple I and began designing a greatly improved computer: the Apple II.[45] Wozniak completed a working prototype of the new machine by August 1976.[38][50]'}]IBM was founded in 1911, originally as the Computing-Tabulating-Recording Company (CTR), and was renamed International Business Machines Corporation (IBM) in 1924.

Apple Inc. was founded in 1976.

For more detailed histories, you can visit the following links:
- [IBM History](https://en.wikipedia.org/wiki/History_of_IBM)
- [Apple History](https://en.wikipedia.org/wiki/History_of_Apple_Inc.)

> Finished chain.
{'input': 'What year was IBM founded and in what year was Apple founded?',
'output': 'IBM was founded in 1911, originally as the Computing-Tabulating-Recording Company (CTR), and was renamed International Business Machines Corporation (IBM) in 1924. \n\nApple Inc. was founded in 1976. \n\nFor more detailed histories, you can visit the following links: \n- [IBM History](https://en.wikipedia.org/wiki/History_of_IBM)\n- [Apple History](https://en.wikipedia.org/wiki/History_of_Apple_Inc.)'}

Below is just the input and output text from the agent.

{'input': 'What year was IBM founded and in what year was Apple founded?',
'output': 'IBM was founded in 1911, originally as the Computing-Tabulating-Recording Company (CTR), and was renamed International Business Machines Corporation (IBM) in 1924. \n\nApple Inc. was founded in 1976. \n\nFor more detailed histories, you can visit the following links: \n- [IBM History](https://en.wikipedia.org/wiki/History_of_IBM)\n- [Apple History](https://en.wikipedia.org/wiki/History_of_Apple_Inc.)'}

In summary, this code sets up an LLM-based agent using the LangChain framework, integrates a custom search tool, and uses it to answer a specific query.

GPT-4o-mini Considerations

  • Local Control: Open-source SLMs offer the advantage of running models locally with full control over inferencing, which is not applicable for OpenAI’s commercial hosted API model.
  • OpenAI Focus: OpenAI emphasises speed, cost, and capability while following the trend of smaller models.
  • Competitors: Highly capable open-source text-based SLMs like Orca-2, Phi3, and TinyLlama are notable competitors.
  • Differentiators: GPT-4o-mini stands out with its cost, speed, capability, and available modalities.

Advantages of GPT-4o-mini

  • Text & Vision Support: GPT-4o-mini supports text and vision inputs in both the API and playground.
  • Future Expansions: Upcoming features include handling text, image, video, and audio inputs and outputs.
  • Large Context Window: The model boasts a context window of 128K tokens and has knowledge up to October 2023.
  • Multi-Language Capabilities: The model supports multiple languages.
  • Enhanced Inference Speeds: Improved speeds make it suitable for various applications.
  • Cost-Effective: The combination of speed and cost makes it ideal for agentic applications with multiple parallel calls. It costs 15 cents per million input tokens and 60 cents per million output tokens.
  • Fine-Tuning: Fine-tuning for GPT-4o-mini will be available soon.




24Jul

Large Language Model Use & Augmentation | by Cobus Greyling | Jul, 2024


GPT-3 was launched on May 28, 2020, and over the past four years, a rapidly developing ecosystem has emerged to create LLM-based solutions.

As the potential and use of LLMs became apparent, a drive emerged to develop applications and tools around them. There is obviously also a significant business opportunity, which is continuously unfolding.

However, while trying to harness LLMs and build applications with LLMs at the centre, it was discovered that LLMs have vulnerabilities, and solutions and frameworks had to be built to accommodate these vulnerabilities and support LLM and Generative AI application development.

Hence we now have a whole host of terms not known before, like RAG, ICL and others.

No Memory

LLMs can’t remember previous prompts, which limits their use in applications needing state retention. The same goes for maintaining context within a conversation. Hence methods of few-shot learning had to be implemented, summarising conversation history and injecting the summary into the prompt.

Or approaches like seeding had to be implemented, where a particular response can be replicated for a certain input.
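For instance, the OpenAI API exposes a seed parameter which, combined with a low temperature, makes responses largely reproducible for the same input; a minimal sketch (the model and prompt are illustrative):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise the conversation history above."}],
    seed=42,        # same seed + same input yields largely repeatable output
    temperature=0,  # reduce sampling variability
)
print(response.choices[0].message.content)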

Stochastic Responses

LLMs give different answers to the same prompt. Parameters like temperature can limit this variability, but it remains an inherent trait.

This can be good in the case of some conversational UIs, where a level of liveness is required to simulate human behaviour.

Outdated Information

LLMs can’t access current data or know the present time, relying only on their training data. Hence the notion of RAG, and retrieving highly contextually relevant information for each inference.

Large Size

LLMs require costly GPU resources for training and serving, leading to potential latency issues.

Here Small Language Models have come to the fore; SLMs have all the characteristics of LLMs apart from their knowledge-intensive nature, which has been superseded by ICL and RAG. Hence SLMs are ideal for most solutions.

Hallucinations

LLMs can generate highly plausible, succinct and coherent answers, but answers which are factually incorrect.

These limitations, especially hallucinations, have sparked interest and led to various prompt engineering and LLM augmentation strategies.

In LLMs, hallucinations refer to the generation of nonsensical or unfaithful content.

This phenomenon has gained significant attention, particularly highlighted in the Survey of Hallucination in Natural Language Generation paper.

According to a recent study, hallucinations can be categorised as:

Intrinsic Hallucinations

These introduce factual inaccuracies or logical inconsistencies, directly conflicting with the source material.

Extrinsic Hallucinations

These are unverifiable against the source, involving speculative or unconfirmable elements.

The definition of source varies with the task.

In dialogues, it can refer to world knowledge, while in text summarisation, it pertains to the input text.

The context of hallucinations matters too; in creative writing, like poetry, they might be acceptable or even beneficial.

LLMs are trained on diverse datasets, including the internet, books, and Wikipedia, and generate text based on probabilistic models without an inherent understanding of truth.

Techniques like instruction tuning and Reinforcement Learning from Human Feedback (RLHF) aim to produce more factual outputs, but the inherent probabilistic nature remains a challenge.

Chain-of-Thought (CoT) prompting makes the implicit reasoning process of LLMs explicit.




20Jul

Forecasting in the Age of Foundation Models | by Alvaro Corrales Cano | Jul, 2024


Benchmarking Lag-Llama against XGBoost

Cliffs near Ribadesella. Photo by Enric Domas on Unsplash

On Hugging Face, there are 20 models tagged “time series” at the time of writing. While certainly not a lot (the “text-generation-inference” tag yields 125,950 results), time series forecasting with foundation models is an interesting enough niche for big companies like Amazon, IBM and Salesforce to have developed their own models: Chronos, TinyTimeMixer and Moirai, respectively. At the time of writing, one of the most popular on Hugging Face by number of likes is Lag-Llama, a univariate probabilistic model. Developed by Kashif Rasul, Arjun Ashok and co-authors [1], Lag-Llama was open sourced in February 2024. The authors of the model claim “strong zero-shot generalization capabilities” on a variety of datasets across different domains. Once fine-tuned for specific tasks, they also claim it to be the best general-purpose model of its kind. Big words!

In this blog, I showcase my experience fine-tuning Lag-Llama, and test its capabilities against a more classical machine learning approach. In particular, I benchmark it against an XGBoost model designed to handle univariate time series data. Gradient boosting algorithms such as XGBoost are widely considered the epitome of “classical” machine learning (as opposed to deep-learning), and have been shown to perform extremely well with tabular data [2]. Therefore, it seems fitting to use XGBoost to test if Lag-Llama lives up to its promises. Will the foundation model do better? Spoiler alert: it is not that simple.

By the way, I will not go into the details of the model architecture, but the paper is worth a read, as is this nice walk-through by Marco Peixeiro.

The data that I use for this exercise is a 4-year-long series of hourly wave heights off the coast of Ribadesella, a town in the Spanish region of Asturias. The series is available at the Spanish ports authority data portal. The measurements were taken at a station located in the coordinates (43.5, -5.083), from 18/06/2020 00:00 to 18/06/2024 23:00 [3]. I have decided to aggregate the series to a daily level, taking the max over the 24 observations in each day. The reason is that the concepts that we go through in this post are better illustrated from a slightly less granular point of view. Otherwise, the results become very volatile very quickly. Therefore, our target variable is the maximum height of the waves recorded in a day, measured in meters.

Distribution of target data. Image by author

There are several reasons why I chose this series: the first one is that the Lag-Llama model was trained on some weather-related data, although relatively not a lot. I would expect the model to find this type of data slightly challenging, but still manageable. The second one is that, while meteorological forecasts are typically produced using numerical weather models, statistical models can still complement these forecasts, especially for long-range predictions. At the very least, in the era of climate change, I think statistical models can tell us what we would typically expect, and how far off it is from what is actually happening.

The dataset is pretty standard and does not require much preprocessing other than imputing a few missing values. The plot below shows what it looks like after we split it into train, validation and test sets. The last two sets have a length of 5 months. To know more about how we preprocess the data, have a look at this notebook.

Maximum daily wave heights in Ribadesella. Image by author

We are going to benchmark Lag-Llama against XGBoost on two univariate forecasting tasks: point forecasting and probabilistic forecasting. The two tasks complement each other: point forecasting gives us a specific, single-number prediction, whereas probabilistic forecasting gives us a confidence region around it. One could say that Lag-Llama was only trained for the latter, so we should focus on that one. While that is true, I believe that humans find it easier to understand a single number than a confidence interval, so I think the point forecast is still useful, even if just for illustrative purposes.

There are many factors that we need to consider when producing a forecast. Some of the most important include the forecast horizon, the last observation(s) that we feed the model, or how often we update the model (if at all). Different combinations of factors yield their own types of forecast with their own interpretations. In our case, we are going to do a recursive multi-step forecast without updating the model, with a step size of 7 days. This means that we are going to use one single model to produce batches of 7 forecasts at a time. After producing one batch, the model sees 7 more data points, corresponding to the dates that it just predicted, and it produces 7 more forecasts. The model, however, is not retrained as new data is available. In terms of our dataset, this means that we will produce a forecast of maximum wave heights for each day of the next week.

For point forecasting, we are going to use the Mean Absolute Error (MAE) as the performance metric. In the case of probabilistic forecasting, we will aim for an empirical coverage, or coverage probability, of 80%.

The scene is set. Let’s get our hands dirty with the experiments!

While originally not designed for time series forecasting, gradient boosting algorithms in general, and XGBoost in particular, can be great predictors. We just need to feed the algorithm the data in the right format. For instance, if we want to use three lags of our target series, we can simply create three columns (say, in a pandas dataframe) with the lagged values and voilà! An XGBoost forecaster. However, this process can quickly become onerous, especially if we intend to use many lags. Luckily for us, the library Skforecast [4] can do this. In fact, Skforecast is the one-stop shop for developing and testing all sorts of forecasters. I honestly can’t recommend it enough!
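As a quick illustration, with made-up numbers and a hypothetical column name, the manual approach boils down to something like this:

import pandas as pd

# toy example: the target series of daily maximum wave heights (values are made up)
df = pd.DataFrame({'wave_height': [1.2, 1.5, 1.1, 0.9, 1.3, 1.8, 2.0]})

# create one column per lag of the target
for lag in (1, 2, 3):
    df[f'lag_{lag}'] = df['wave_height'].shift(lag)

# drop the first rows, where lags are not yet available; what remains is a
# tabular dataset that XGBoost can consume: lag_1, lag_2, lag_3 -> wave_height
df = df.dropna()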

Creating a forecaster with Skforecast is pretty straightforward. We just need to create a ForecasterAutoreg object with an XGBoost regressor, which we can then fine-tune. On top of the XGBoost hyperparameters that we would typically optimise, we also need to search for the best number of lags to include in our model. To do that, Skforecast provides a Bayesian optimisation method, bayesian_search_forecaster, which runs Optuna in the background.

Defining and optimising hyperparameters of XGBoost forecaster
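In outline, the step looks roughly like the sketch below. It assumes the training series is already loaded as a pandas Series called y_train; the search space shown is purely illustrative, and exact argument names can vary between Skforecast versions.

from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import bayesian_search_forecaster

forecaster = ForecasterAutoreg(regressor=XGBRegressor(), lags=7)

# search space as a function of an Optuna trial: number of lags plus XGBoost hyperparameters
def search_space(trial):
    return {
        'lags': trial.suggest_categorical('lags', [7, 14, 21, 28]),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000, step=100),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
    }

results, best_trial = bayesian_search_forecaster(
    forecaster=forecaster,
    y=y_train,
    search_space=search_space,
    steps=7,                      # 7-day forecast batches
    metric='mean_absolute_error',
    initial_train_size=365,       # data available before the first validation batch
    refit=False,
    n_trials=20,
    return_best=True,             # leave the best configuration in the forecaster
)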

The search yields an optimised XGBoost forecaster which, among other hyperparameters, uses 21 lags of the target variable, i.e. 21 days of maximum wave heights to predict the next:

Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21] 
Parameters: {'n_estimators': 900,
'max_depth': 12,
'learning_rate': 0.30394338985367425,
'reg_alpha': 0.5,
'reg_lambda': 0.0,
'subsample': 1.0,
'colsample_bytree': 0.2}

But is the model any good? Let’s find out!

Point forecasting

First, let’s look at how well the XGBoost forecaster does at predicting the next 7 days of maximum wave heights. The chart below plots the predictions against the actual values of our test set. We can see that the prediction tends to follow the general trend of the actual data, but it is far from perfect.

Maximum wave heights and XGBoost predictions. Image by author

To create the predictions depicted above, we have used Skforecast’s backtesting_forecaster function, which allows us to evaluate the model on a test set, as shown in the following code snippet. On top of the predictions, we also get a performance metric, which in our case is the MAE.

Backtesting our XGBoost forecaster
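In outline, and again with arguments that may differ slightly across Skforecast versions, the backtest looks like this (data is the full series and data_test the test split, both hypothetical names):

from skforecast.model_selection import backtesting_forecaster

mae, predictions = backtesting_forecaster(
    forecaster=forecaster,
    y=data['wave_height'],
    steps=7,                                        # 7-day batches
    metric='mean_absolute_error',
    initial_train_size=len(data) - len(data_test),  # train on everything before the test set
    refit=False,                                    # the model is not retrained between batches
    verbose=False,
)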

Our model’s MAE is 0.64. This means that, on average, our predictions are 64cm off the actual measurement. To put this value in context, the standard deviation of the target variable is 0.86. Therefore, our model’s average error is about 0.74 units of the standard deviation. Furthermore, if we were to simply use the previous equivalent observation as a dummy best guess for our forecast, we would get an MAE of 0.84 (see point 1 of this notebook). All things considered, it seems that, so far, our model is better than a simple logical rule, which is a relief!

Probabilistic forecasting

Skforecast allows us to calculate distribution intervals where the future outcome is likely to fall. The library provides two methods: using either bootstrapped residuals or quantile regression. The results are not very different, so I am going to focus here on the bootstrapped residuals method. You can see more results in part 3 of this notebook.

The idea of constructing prediction intervals using bootstrapped residuals is that we can randomly take a model’s forecast errors (residuals) and add them to the same model’s forecasts. By repeating the process a number of times, we can construct an equal number of alternative forecasts. These predictions follow a distribution from which we can derive prediction intervals. In other words, if we assume that the forecast errors are random and identically distributed in time, adding these errors creates a universe of equally possible forecasts. In this universe, we would expect to see at least a given percentage of the actual values of the forecasted series. In our case, we will aim for 80% of the values (that is, a coverage of 80%).

To construct the prediction intervals with Skforecast, we follow a 3-step process: first, we generate forecasts for our validation set; second, we compute the residuals from those forecasts and store them in our forecaster class; third, we get the probabilistic forecasts for our test set. The second and third steps are illustrated in the snippet below (the first one corresponds to the code snippet in the previous section). Lines 14-17 are the parameters that govern our bootstrap calculation.

Generating prediction intervals with bootstrapped residuals
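A hedged sketch of steps two and three follows, assuming the validation predictions are stored in predictions_val and using Skforecast method names that may vary slightly between versions:

# step 2: compute validation residuals and store them in the forecaster
residuals = data_val['wave_height'] - predictions_val['pred']
forecaster.set_out_sample_residuals(residuals=residuals)

# step 3: backtest on the test set, requesting an 80% interval via bootstrapping
mae, predictions = backtesting_forecaster(
    forecaster=forecaster,
    y=data['wave_height'],
    steps=7,
    metric='mean_absolute_error',
    initial_train_size=len(data) - len(data_test),
    refit=False,
    interval=[10, 90],          # 10th and 90th percentiles -> 80% coverage target
    n_boot=500,                 # number of bootstrapped forecasts
    in_sample_residuals=False,  # use the stored out-of-sample residuals
)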

The resulting prediction intervals are depicted in the chart below.

Bootstrapped prediction intervals with XGBoost forecaster. Image by author

84.67% of the values in the test set fall within our prediction intervals, which is just above our target of 80%. While this is not bad, it may also mean that we are overshooting and our intervals are too big. Think of it this way: if we said that tomorrow’s waves would be between 0 and infinity meters high, we would always be right, but the forecast would be useless! To get an idea of how big our intervals are, Skforecast’s docs suggest that we compute the area of our intervals by taking the sum of the differences between the upper and lower boundaries of the intervals. This is not an absolute measure, but it can help us compare across forecasters. In our case, the area is 348.28.
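With the predictions dataframe from the backtest above, and assuming Skforecast’s default interval column names (lower_bound and upper_bound), that area is a one-liner:

interval_area = (predictions['upper_bound'] - predictions['lower_bound']).sum()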

These are our XGBoost results. How about Lag-Llama?

The authors of Lag-Llama provide a demo notebook to start forecasting with the model without fine-tuning it. The code is ready to produce probabilistic forecasts given a set horizon, or prediction length, and a context length, or the number of previous data points to consider in the forecast. We just need to call the get_llama_predictions function below:

Modified version of get_llama_predictions function to produce probabilistic forecasts.

The core of the function is a LagLlamaEstimator class (lines 19–47), which is a PyTorch Lightning estimator based on the GluonTS [5] package for probabilistic forecasting. I suggest you go through the GluonTS docs to get familiar with the package.
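Condensed heavily from the authors’ demo notebook, and with constructor arguments read from the pretrained checkpoint (names may change between Lag-Llama releases), the function looks roughly like this:

import torch
from gluonts.evaluation import make_evaluation_predictions
from lag_llama.gluon.estimator import LagLlamaEstimator

def get_llama_predictions(dataset, prediction_length, context_length=64,
                          num_samples=100, device='cuda'):
    ckpt = torch.load('lag-llama.ckpt', map_location=device)
    estimator_args = ckpt['hyper_parameters']['model_kwargs']

    estimator = LagLlamaEstimator(
        ckpt_path='lag-llama.ckpt',
        prediction_length=prediction_length,
        context_length=context_length,
        # architecture hyperparameters taken from the pretrained checkpoint
        input_size=estimator_args['input_size'],
        n_layer=estimator_args['n_layer'],
        n_embd_per_head=estimator_args['n_embd_per_head'],
        n_head=estimator_args['n_head'],
        scaling=estimator_args['scaling'],
        time_feat=estimator_args['time_feat'],
    )

    lightning_module = estimator.create_lightning_module()
    transformation = estimator.create_transformation()
    predictor = estimator.create_predictor(transformation, lightning_module)

    forecasts, tss = make_evaluation_predictions(
        dataset=dataset, predictor=predictor, num_samples=num_samples
    )
    return list(forecasts), list(tss)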

We can leverage the get_llama_predictions function to produce recursive multistep forecasts. We simply need to produce batches of predictions over consecutive windows of the series, letting the model see the newly observed data after each batch. This is what we do in the function below, recursive_forecast:

This function produces recursive probabilistic and point forecasts

In lines 37 to 39 of the code snippet above, we extract the percentiles 10 and 90 to produce an 80% probabilistic forecast (90–10), as well as the median of the probabilistic prediction to get a point forecast. If you need to learn more about the output of the model, I suggest you have a look at the authors’ tutorial mentioned above.
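As a rough sketch, and assuming the get_llama_predictions helper above plus a daily pandas Series indexed by date, the recursive loop could look like this:

import numpy as np
import pandas as pd
from gluonts.dataset.pandas import PandasDataset

def recursive_forecast(series, start_date, horizon, step=7,
                       context_length=64, num_samples=100):
    rows = []
    batch_starts = pd.date_range(start_date, periods=horizon // step, freq=f'{step}D')
    for start in batch_starts:
        # the model only ever sees data up to the day before the current batch
        history = series.loc[:start - pd.Timedelta(days=1)]
        dataset = PandasDataset(pd.DataFrame({'target': history}), target='target', freq='D')

        forecasts, _ = get_llama_predictions(
            dataset, prediction_length=step,
            context_length=context_length, num_samples=num_samples,
        )
        samples = forecasts[0].samples  # array of shape (num_samples, step)
        for i, date in enumerate(pd.date_range(start, periods=step, freq='D')):
            rows.append({
                'date': date,
                'lower': np.percentile(samples[:, i], 10),  # 10th percentile
                'pred': np.median(samples[:, i]),           # median as point forecast
                'upper': np.percentile(samples[:, i], 90),  # 90th percentile
            })
    return pd.DataFrame(rows).set_index('date')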

The authors of the model advise that different datasets and forecasting tasks may require different context lengths. In our case, we try context lengths of 32, 64 and 128 tokens (lags). The chart below shows the results of the 64-token model.

Zero-shot Lag-Llama predictions with a context length of 128 tokens. Image by author

Point forecasting

As we said above, Lag-Llama is not meant to calculate point forecasts, but we can get one by taking the median of the probabilistic interval that it returns. Another potential point forecast would be the mean, although it would be subject to outliers in the interval. In any case, for our particular dataset, both options yield similar results.
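With a GluonTS sample forecast like the ones returned by the helper above (here simply called forecast), both options are one-liners:

import numpy as np

point_median = np.median(forecast.samples, axis=0)  # robust to extreme samples
point_mean = forecast.samples.mean(axis=0)          # pulled around by outliers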

The MAE of the 32-token model was 0.75. That of the 64-token model was 0.77, while the MAE of the 128-token model was 0.77 as well. These are all higher than the XGBoost forecaster’s, which went down to 0.64. In fact, they are very close to the baseline, dummy model that used the previous week’s value as today’s forecast (MAE 0.84).

Probabilistic forecasting

With a predicted interval coverage of 68.67% and an interval area of 280.05, the 32-token forecast does not perform up to our required standard. The 64-token one reaches a 74.0% coverage, which gets closer to the 80% region that we are looking for. To do so, it takes an interval area of 343.74. The 128-token model overshoots but is closer to the mark, with an 84.67% coverage and an area of 399.25. We can see an interesting trend here: more coverage implies a larger interval area. This should not always be the case, since a very narrow interval could always be right. However, in practice this trade-off is very much present in all the models I have trained.

Notice the periodic bulges in the chart (around March 10 or April 7, for instance). Since we are producing a 7-day forecast, the bulges represent the increased uncertainty as we move away from the last observation that the model saw. In other words, a forecast for the next day will be less uncertain than a forecast for the day after next, and so on.

The 128-token model yields very similar results to the XGBoost forecaster, which had an area of 348.28 and a coverage of 84.67%. Based on these results, we can say that, with no training, Lag-Llama’s performance is rather solid and on par with an optimised traditional forecaster.

Lag-Llama’s Github repo comes with a “best practices” section with tips to use and fine-tune the model. The authors especially recommend tuning the context length and the learning rate. We are going to explore some of the suggested values for these hyperparameters. The code snippet below, which I have taken and modified from the authors’ fine-tuning tutorial notebook, shows how we can conduct a small grid search:

Grid search for fine-tuning Lag-Llama
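Stripped to its essentials, the loop looks roughly like the sketch below. Constructor and training arguments may differ between Lag-Llama releases; arch_kwargs stands for the architecture arguments read from the checkpoint (as in the zero-shot sketch above), and learning_rates for the values mentioned in the text.

from itertools import product

results = []
for context_length, lr in product([32, 64, 128], learning_rates):
    estimator = LagLlamaEstimator(
        ckpt_path='lag-llama.ckpt',
        prediction_length=7,
        context_length=context_length,
        lr=lr,
        trainer_kwargs={'max_epochs': 50},
        **arch_kwargs,
    )

    # fine-tune with an explicit validation set to curb overfitting
    predictor = estimator.train(train_dataset, validation_data=val_dataset, cache_data=True)

    forecasts, tss = make_evaluation_predictions(dataset=test_dataset, predictor=predictor)
    # coverage metrics are computed from the forecasts here (see the helper further down)
    results.append({'context_length': context_length, 'lr': lr,
                    'forecasts': list(forecasts), 'tss': list(tss)})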

In the code above, we loop over context lengths of 32, 64, and 128 tokens, as well as learning rates of 0.001, 0.001, and 0.005. Within the loop, we also calculate some test metrics: Coverage[0.8], Coverage[0.9] and Mean Absolute Error of Coverage (MAE Coverage). Coverage[0.x] measures how many predictions fall within their prediction interval. For instance, a good model should have a Coverage[0.8] of around 80%. MAE Coverage, on the other hand, measures the deviation of the actual coverage probabilities from the nominal coverage levels. Therefore, a good model in our case should be one with a small MAE and coverages of around 80% and 90%, respectively.
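For reference, these metrics can be computed from the actuals and predicted quantiles along the following lines (variable names are hypothetical, and this is just one possible formulation of MAE Coverage, averaging the absolute deviations over the levels we track):

import numpy as np

def empirical_coverage(actuals, lower, upper):
    # fraction of observations that fall inside their prediction interval
    return np.mean((actuals >= lower) & (actuals <= upper))

coverage_80 = empirical_coverage(y_test, q10, q90)  # should sit near 0.8
coverage_90 = empirical_coverage(y_test, q05, q95)  # should sit near 0.9
mae_coverage = np.mean([abs(coverage_80 - 0.8), abs(coverage_90 - 0.9)])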

One of the main differences with respect to the original fine-tuning code from the authors is line 46. In that line, the original code does not include a validation set. In my experience, not including it meant that all models that I trained ended up overfitting the training data. On the other hand, with a validation set most models were optimised in Epoch 0 and did not improve the validation loss thereafter. With more data, we may see less extreme outcomes.

Once trained, most of the models in the loop yield an MAE of 0.5 and coverages of 1 on the test set. This means that the models have very broad prediction intervals, but the predictions are not very precise. The model that strikes a better balance is model 6 (counting from 0 to 8 in the loop), with the following hyperparameters and metrics:

 {'context_length': 128,
'lr': 0.001,
'Coverage[0.8]': 0.7142857142857143,
'Coverage[0.9]': 0.8571428571428571,
'MAE_Coverage': 0.36666666666666664}

Since this is the most promising model, we are going to run it through the tests that we have with the other forecasters.

The chart below shows the predictions from the fine-tuned model.

Fine-tuned Lag-Llama predictions with a context length of 64 tokens. Image by author

Something that catches the eye very quickly is that prediction intervals are substantially smaller than those from the zero-shot version. In fact, the interval area is 188.69. With these prediction intervals, the model reaches a coverage of 56.67% over the 7-day recursive forecast. Remember that our best zero-shot predictions, with a 128-token context, had an area of 399.25, reaching a coverage of 84.67%. This means a 55% reduction in the interval area, with only a 33% decrease in coverage. However, the fine-tuned model is too far from the 80% coverage that we are aiming for, whereas the zero-shot model with 128 tokens wasn’t.

When it comes to point forecasting, the MAE of the model is 0.77, which is not an improvement over the zero-shot forecasts and worse than the XGBoost forecaster.

Overall, the fine-tuned model doesn’t leave us with a good picture: it doesn’t do better than the zero-shot version at either point or probabilistic forecasting. The authors do suggest that the model can improve if fine-tuned with more data, so it may be that our training set was not large enough.

To recap, let’s ask again the question that we set out at the beginning of this blog: is Lag-Llama better at forecasting than XGBoost? For our dataset, the short answer is no, they are similar. The long answer is more complicated, though. Zero-shot forecasts with a 128-token context length were at the same level as XGBoost in terms of probabilistic forecasting. Fine-tuning Lag-Llama further reduced the prediction area, making the model’s correct forecasts more precise, albeit at a substantial cost in terms of probabilistic coverage. This raises the question of where the model could get with more training data. But more data we did not have, so we can’t say that Lag-Llama beat XGBoost.

These results inevitably open a broader debate: since one is not better than the other in terms of performance, which one should we use? In this case, we’d need to consider other variables such as ease of use, deployment, maintenance, and inference costs. While I haven’t formally tested the two options in any of those aspects, I suspect XGBoost would come out better. Less data- and resource-hungry, pretty robust to overfitting and time-tested are hard-to-beat characteristics, and XGBoost has them all.

But do not believe me! The code that I used is publicly available on this Github repo, so go have a look and run it yourself.



Source link

20Jul

RAG Implementations Fail Due To Insufficient Focus On Question Intent | by Cobus Greyling | Jul, 2024


  1. Large Language Models (LLMs) are good at generating coherent and contextually relevant text but struggle with knowledge-intensive queries, especially in domain-specific and factual question-answering tasks.
  2. Retrieval-augmented generation (RAG) systems address this challenge by integrating external knowledge sources like structured knowledge graphs (KGs).
  3. Despite access to KG-extracted information, LLMs often fail to deliver accurate answers.
  4. A recent study examines this issue by analysing error patterns in KG-based RAG methods, identifying eight critical failure points.

The research found that these errors stem largely from an inadequate understanding of the question’s intent and, consequently, insufficient context extraction from knowledge graph facts.

Based on this analysis, the study proposes Mindful-RAG, a framework focused on intent-based and contextually aligned knowledge retrieval.

This approach aims to enhance the accuracy and relevance of LLM responses, marking a significant advancement over current methods.

In the diagram below, the two error categories are shown: Reasoning Failures and Knowledge Graph (data topology) challenges.

The error types are listed, with a description and failure examples…



Source link
