22Oct

Windows Agent Arena (WAA). And The Multi-Modal Agent Called Navi | by Cobus Greyling | Oct, 2024


Lastly, below is an example of an agent prompt within the WindowsAgentArena environment, used with the Navi agent.

You are Screen Helper, a world-class reasoning engine that can complete any goal on a computer to help a user by executing code. When you output actions, they will be executed **on the user’s computer**. The user has given you **full and complete permission** to execute any code necessary to complete the task. In general, try to make plans with as few steps as possible. As for actually executing actions to carry out that plan, **don’t do more than one action per step**. Verify at each step whether or not you’re on track.
# Inputs
1. User objective. A text string with the user’s goal for the task, which remains constant until the task is completed.
2. Window title. A string with the title of the foreground active window.
3. All window names. A list with the names of all the windows/apps currently open on the user’s computer. These names can be used in case the user’s objective involves switching between windows.
4. Clipboard content. A string with the current content of the clipboard. If the clipboard contains copied text this will show the text itself. If the clipboard contains an image, this will contain some description of the image. This can be useful for storing information which you plan to use later.
5. Text rendering. A multi-line block of text with the screen’s text OCR contents, rendered with their approximate screen locations. Note that none of the images or icons will be present in the screen rendering, even though they are visible on the real computer screen.
6. List of candidate screen elements. A list of candidate screen elements with which you can interact, each represented with the following fields:
- ID: A unique identifier for the element.
- Type: The type of the element (e.g., image, button, icon).
- Content: The content of the element, expressed in text format. This is the text content of each button region, or empty in the case of images and icons classes.
- Location: The normalized location of the element on the screen (0-1), expressed as a tuple (x1, y1, x2, y2) where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
7. Images of the current screen:
7.0 Raw previous screen image.
7.1 Raw screen image.
7.2 Annotated screen with bounding boxes drawn around the image (red bounding boxes) and icon (green bounding boxes) elements, tagged with their respective IDs. Note that the button text elements are not annotated in this screen, even though they might be the most relevant for the current step’s objective.
Very important note about annotated screen image: the element IDs from images and icons are marked on the bottom right corner of each respective element with a white font on top of a colored background box. Be very careful not to confuse the element numbers with other numbered elements which occur on the screen, such as numbered lists or especially numbers marking slide thumbnails on the left side of a PowerPoint presentation. When selecting an element for interaction you should reference the colored annotated IDs, and not the other numbers that might be present on the screen.
8. History of the previous N actions code blocks taken to reach the current screen, which can help you understand the context of the current screen.
9. Textual memory. A multi-line block of text where you can choose to store information for future steps. This can be useful for storing information which you plan to use in later steps.
# Outputs
Your goal is to analyze all the inputs and output the following items:
Screen annotation:
0. Complete filling in the ”List of candidate screen elements” which was given to you as input. Analyze both image inputs (raw screen and annotated screen) and output a list containing the ID and functional description of each image and icon type element. There is no need to repeat the text elements.
Reasoning over the screen content. Answer the following questions:
1. In a few words, what is happening on the screen?
2. How does the screen content relate to the current step’s objective?
Multi-step planning:
3. On a high level, what are the next actions and screens you expect to happen between now and the goal being accomplished?
4. Consider the very next step that should be performed on the current screen. Think out loud about which elements you need to interact with to fulfill the user’s objective at this step. Provide a clear rationale and train-of-thought for your choice.
Reasoning about current action step:
5. Output a high-level decision about what to do in the current step. You may choose only one from the following options:
- DONE: If the task is completed and no further action is needed. This will trigger the end of the episode.
- FAIL: If the task is impossible to complete due to an error or unexpected issue. This can be useful if the task cannot be completed due to a technical issue, or if the user’s objective is unclear or impossible to achieve. This will trigger the end of the episode.
- WAIT: If the screen is in a loading state such as a page being rendered, or a download in progress, and you need to wait for the next screen to be ready before taking further actions. This will trigger a sleep delay until your next iteration.
- COMMAND: This decision will execute the code block output for the current action step, which is explained in more detail below. Make sure that you wrap the decision in a block with the following format:
```decision
# your comment about the decision
COMMAND # or DONE, FAIL, WAIT
```
6. Output a block of code that represents the action to be taken on the current screen. The code should be wrapped in a python block with the following format:
```python
# your code here
# more code...
# last line of code
```
7. Textual memory output. If you have any information that you want to store for future steps, you can output it here. This can be useful for storing information which you plan to use in later steps (for example if you want to store a piece of text like a summary, description of a previous page, or a song title which you will type or use as context later). You can either copy the information from the input textual memory, append to it, or write new information.
```memory
# your memory here
# more memory...
# more memory...
```
Note: remember that you are a multi-modal vision and text reasoning engine, and can store information on your textual memory based on what you see and receive as text input.
Below we provide further instructions about which functions are available for you to use in the code block.
# Instructions for outputting code for the current action step
You may use the `computer` Python module to complete tasks:
```python
# GUI-related functions
computer.mouse.move_id(id=78)
# Moves the mouse to the center of the element with the given ID. Use this very frequently.
computer.mouse.move_abs(x=0.22, y=0.75)
# Moves the mouse to the absolute normalized position on the screen. The top-left corner is (0, 0) and the bottom-right corner is (1, 1). Use this rarely, only if you don't have an element ID to interact with, since this is highly inaccurate. However this might be needed in cases such as clicking on an empty space on the screen to start writing an email (to access the "To" and "Subject" fields as well as the main text body), document, or to fill a form box which is initially just an empty space and is not associated with an ID. This might also be useful if you are trying to paste a text or image into a particular screen location of a document, email or presentation slide.
computer.mouse.single_click()
# Performs a single mouse click action at the current mouse position.
computer.mouse.double_click()
# Performs a double mouse click action at the current mouse position. This action can be useful for opening files, folders, or songs, or for selecting text.
computer.mouse.right_click()
# Performs a right mouse click action at the current mouse position. This action can be useful for opening context menus or other options.
computer.mouse.scroll(dir="down")
# Scrolls the screen in a particular direction ("up" or "down"). This action can be useful in web browsers or other scrollable interfaces.
# keyboard-related functions
computer.keyboard.write("hello") # Writes the given text string
computer.keyboard.press("enter") # Presses the enter key
# OS-related functions
computer.clipboard.copy_text("text to copy")
# Copies the given text to the clipboard. This can be useful for storing information which you plan to use later.
computer.clipboard.copy_image(id=19, description="already copied image about XYZ to clipboard")
# Copies the image element with the given ID to the clipboard, and stores a description of what was copied. This can be useful for copying images to paste them somewhere else.
computer.clipboard.paste()
# Pastes the current clipboard content. Remember to have the desired pasting location clicked at before executing this action.
computer.os.open_program("msedge")
# Opens the program with the given name (e.g., ”spotify”, ”notepad”, ”outlook”, ”msedge”, ”winword”, ”excel”, ”powerpnt”). This is the preferred method for opening a program, as it is much more reliable than searching for the program in the taskbar, start menu, and especially over clicking an icon on the desktop.
computer.window_manager.switch_to_application("semester review.pptx - PowerPoint")
# Switches to the foreground window application with that exact given name, which can be extracted from the ”All window names” input list
```
# Examples
## Example 0
User query = ”search news about ’Artificial Intelligence’”.
The current screen shows the user’s desktop.
Output:
```python
computer.os.open_program("msedge") # Open the web browser as the first thing to do
```
## Example 1
User query = ”buy a baby monitor”.
The current screen shows a new empty browser window.
Output:
```python
computer.mouse.move_id(id=29) # Move the mouse to element with ID 29 which has text saying 'Search or enter web address'
computer.mouse.single_click() # Click on the current mouse location, which will be above the search bar at this point
computer.keyboard.write("amazon.com") # Type 'amazon.com' into the address bar
computer.keyboard.press("enter") # go to website
```
## Example 2
User query = ”play hips don’t lie by shakira”.
The current screen shows a music player with a search bar and a list of songs, one of which is hips don’t lie by shakira.
Output:
```python
computer.mouse.move_id(id=107) # Move the mouse to element with ID 107 which has text saying 'Hips don't', the first part of the song name
computer.mouse.double_click() # Double click on the current mouse location, which will be above the song at this point, so that it starts playing
```
## Example 3
User query = ”email the report’s revenue projection plot to Justin Wagle with a short summary”.
The current screen shows a powerpoint presentation with a slide containing text and images with financial information about a company. One of the plots contains the revenue projection.
Output:
```python
computer.clipboard.copy_image(id=140, description="already copied image about revenue projection plot to clipboard") # Copy the image with ID 140 which contains the revenue projection plot
computer.os.open_program("outlook") # Open the email client so that we can open a new email in the next step
```
## Example 4
User query = ”email the report’s revenue projection plot to Justin Wagle with a short summary”.
The current screen shows newly opened email window with the ”To”, ”Cc”, ”Subject”, and ”Body” fields empty.
Output:
```python
computer.mouse.move_abs(x=0.25, y=0.25) # Move the mouse to the text area to the right of the "To" button (44 — ocr — To — [0.14, 0.24, 0.16, 0.26]). This is where the email recipient's email address should be typed.
computer.mouse.single_click() # Click on the current mouse location, which will be above the text area to the right of the "To" button.
computer.keyboard.write("Justin Wagle") # Type the email recipient's name
computer.keyboard.press("enter") # select the person from the list of suggestions that should auto-appear
```
## Example 5
User query = ”email the report’s revenue projection plot to Justin Wagle with a short summary”.
The current screen shows an email window with the ”To” field filled, but ”Cc”, ”Subject”, and ”Body” fields empty.
Output:
```python
computer.mouse.move_abs(x=0.25, y=0.34) # Move the mouse to the text area to the right of the "Subject" button (25 — ocr — Subject — [0.13, 0.33, 0.17, 0.35]). This is where the email subject line should be typed.
computer.mouse.single_click() # Click on the current mouse location, which will be above the text area to the right of the "Subject" button.
computer.keyboard.write("Revenue projections") # Type the email subject line
```
## Example 6
User query = ”copy the ppt’s architecture diagram and paste into the doc”.
The current screen shows the first slide of a powerpoint presentation with multiple slides. The left side of the screen shows a list of slide thumbnails. There are numbers by the side of each thumbnail which indicate the slide number. The current slide just shows a title ”The New Era of AI”, with no architecture diagram. The thumbnail of slide number 4 shows an ”Architecture” title and an image that looks like a block diagram. Therefore we need to switch to slide number 4 first, and then once there copy the architecture diagram image on a next step.
Output:
```python
# Move the mouse to the thumbnail of the slide titled "Architecture"
computer.mouse.move_id(id=12) # The ID for the slide thumbnail with the architecture diagram. Note that the ID is not the slide number, but a unique identifier for the element based on the numbering of the red bounding boxes in the annotated screen image.
# Click on the thumbnail to make it the active slide
computer.mouse.single_click()
```
## Example 7
User query = ”share the doc with jaques”.
The current screen shows a word doc.
Output:
```python
computer.mouse.move_id(id=78) # The ID for the "Share" button on the top right corner of the screen. Move the mouse to the "Share" button.
computer.mouse.single_click()
```
## Example 8
User query = ”find the lyrics for this song”.
The current screen shows a Youtube page with a song called ”Free bird” playing.
Output:
```python
computer.os.open_program("msedge") # Open the web browser so that we can search for the lyrics in the next step
```
```memory
# The user is looking for the lyrics of the song "Free bird"
```
Remember, do not try to complete the entire task in one step. Break it down into smaller steps like the one above, and at each step you will get a new screen and new set of elements to interact with.





20Oct

Evaluating Model Retraining Strategies | by Reinhard Sellmair | Oct, 2024


How data drift and concept drift matter when choosing the right retraining strategy

(created with Image Creator in Bing)

Many people in the field of MLOps have probably heard a story like this:

Company A embarked on an ambitious quest to harness the power of machine learning. It was a journey fraught with challenges, as the team struggled to pinpoint a topic that would not only leverage the prowess of machine learning but also deliver tangible business value. After many brainstorming sessions, they finally settled on a use case that promised to revolutionize their operations. With excitement, they contracted Company B, a reputed expert, to build and deploy a ML model. Following months of rigorous development and testing, the model passed all acceptance criteria, marking a significant milestone for Company A, who looked forward to future opportunities.

However, as time passed, the model began producing unexpected results, rendering it ineffective for its intended use. Company A reached out to Company B for advice, only to learn that the changed circumstances required building a new model, necessitating an even higher investment than the original.

What went wrong? Was the model Company B created not as good as expected? Was Company A just unlucky that something unexpected happened?

Probably the issue was that even the most rigorous testing of a model before deployment does not guarantee that this model will perform well for an unlimited amount of time. The two most important aspects that impact a model’s performance over time are data drift and concept drift.

Data Drift: Also known as covariate shift, this occurs when the statistical properties of the input data change over time. If an ML model was trained on data from a specific demographic but the demographic characteristics of the input data change, the model’s performance can degrade. Imagine you taught a child multiplication tables until 10. It can quickly give you the correct answers for what is 3 * 7 or 4 * 9. However, one time you ask what is 4 * 13, and although the rules of multiplication did not change it may give you the wrong answer because it did not memorize the solution.

Concept Drift: This happens when the relationship between the input data and the target variable changes. This can lead to a degradation in model performance as the model’s predictions no longer align with the evolving data patterns. An example here could be spelling reforms. When you were a child, you may have learned to write “co-operate”, however now it is written as “cooperate”. Although you mean the same word, your output of writing that word has changed over time.

In this article I investigate how different scenarios of data drift and concept drift impact a model’s performance over time. Furthermore, I show what retraining strategies can mitigate performance degradation.

I focus on evaluating retraining strategies with respect to the model’s prediction performance. In practice, additional aspects like:

  • Data Availability and Quality: Ensure that sufficient and high-quality data is available for retraining the model.
  • Computational Costs: Evaluate the computational resources required for retraining, including hardware and processing time.
  • Business Impact: Consider the potential impact on business operations and outcomes when choosing a retraining strategy.
  • Regulatory Compliance: Ensure that the retraining strategy complies with any relevant regulations and standards, e.g. anti-discrimination.

need to be considered to identify a suitable retraining strategy.

(created with Image Creator in Bing)

To highlight the differences between data drift and concept drift I synthesized datasets where I controlled to what extent these aspects appear.

I generated datasets in 100 steps where I changed parameters incrementally to simulate the evolution of the dataset. Each step contains multiple data points and can be interpreted as the amount of data that was collected over an hour, a day or a week. After every step the model was re-evaluated and could be retrained.

To create the datasets, I first randomly sampled features from a normal distribution where mean µ and standard deviation σ depend on the step number s:
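(The formula appears as an image in the original post; from the description, it is presumably of the form)

$$x_i \sim \mathcal{N}\left(\mu_i(s),\ \sigma_i(s)\right)$$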

The data drift of feature x_i depends on how much µ_i and σ_i are changing with respect to the step number s.

All features are aggregated as follows:
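(Again an image in the original; based on the description of the coefficients and the noise term below, presumably)

$$X = \sum_i c_i \, x_i + \varepsilon$$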

Where c_i are coefficients that describe the impact of feature x_i on X. Concept drift can be controlled by changing these coefficients with respect to s. A random number ε, which is not available for model training, is added to account for the fact that the features do not contain complete information to predict the target y.

The target variable y is calculated by inputting X into a non-linear function. By doing this we create a more challenging task for the ML model since there is no linear relation between the features and the target. For the scenarios in this article, I chose a sine function.
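(With the sine function chosen here, the target is presumably)

$$y = \sin(X)$$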

(created with Image Creator in Bing)

I created the following scenarios to analyze:

  • Steady State: simulating no data or concept drift — parameters µ, σ, and c were independent of step s
  • Distribution Drift: simulating data drift — parameters µ, σ were linear functions of s, parameters c were independent of s
  • Coefficient Drift: simulating concept drift — parameters µ, σ were independent of s, parameters c were linear functions of s
  • Black Swan: simulating an unexpected and sudden change — parameters µ, σ, and c were independent of step s except for one step when these parameters were changed

The COVID-19 pandemic serves as a quintessential example of a Black Swan event. A Black Swan is characterized by its extreme rarity and unexpectedness. COVID-19 could not have been predicted to mitigate its effects beforehand. Many deployed ML models suddenly produced unexpected results and had to be retrained after the outbreak.

For each scenario I used the first 20 steps as training data of the initial model. For the remaining steps I evaluated three retraining strategies:

  • None: No retraining — the model trained on the training data was used for all remaining steps.
  • All Data: All previous data was used to train a new model, e.g. the model evaluated at step 30 was trained on the data from step 0 to 29.
  • Window: A fixed window size was used to select the training data, e.g. for a window size of 10 the training data at step 30 contained step 20 to 29.

I used an XGBoost regression model and mean squared error (MSE) as the evaluation metric.
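To make the setup concrete, below is a condensed sketch of the evaluation loop for the distribution drift scenario. It is an assumed reconstruction for illustration, not the original code (which is in the repository linked further down):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

N_STEPS, TRAIN_STEPS, WINDOW = 100, 20, 10
rng = np.random.default_rng(0)

def make_step(s, drift=0.02):
    """One step of the distribution drift scenario: the feature mean shifts with s."""
    x = rng.normal(loc=drift * s, scale=1.0, size=(100, 1))
    eps = rng.normal(scale=0.1, size=100)  # noise not available to the model
    y = np.sin(x[:, 0] + eps)              # non-linear target
    return x, y

steps = [make_step(s) for s in range(N_STEPS)]
errors = {"none": [], "all_data": [], "window": []}

# "None" strategy: train once on the first 20 steps and never retrain
x0 = np.vstack([x for x, _ in steps[:TRAIN_STEPS]])
y0 = np.hstack([y for _, y in steps[:TRAIN_STEPS]])
model_none = xgb.XGBRegressor().fit(x0, y0)

for s in range(TRAIN_STEPS, N_STEPS):
    x_eval, y_eval = steps[s]
    errors["none"].append(mean_squared_error(y_eval, model_none.predict(x_eval)))
    # "All Data": retrain on steps 0..s-1; "Window": retrain on the last 10 steps
    for name, start in [("all_data", 0), ("window", s - WINDOW)]:
        x_tr = np.vstack([x for x, _ in steps[start:s]])
        y_tr = np.hstack([y for _, y in steps[start:s]])
        model = xgb.XGBRegressor().fit(x_tr, y_tr)
        errors[name].append(mean_squared_error(y_eval, model.predict(x_eval)))
```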

Steady State

Prediction error of steady state scenario

The diagram above shows the evaluation results of the steady state scenario. As the first 20 steps were used to train the models, the evaluation error there was much lower than at later steps. The performance of the None and Window retraining strategies remained at a similar level throughout the scenario. The All Data strategy slightly reduced the prediction error at higher step numbers.

In this case All Data is the best strategy because it profits from an increasing amount of training data while the models of the other strategies were trained on a constant training data size.

Distribution Drift (Data Drift)

Prediction error of distribution drift scenario

When the input data distributions changed, we can clearly see that the prediction error continuously increased if the model was not retrained on the latest data. Retraining on all data or on a data window resulted in very similar performances. The reason for this is that although All Data was using more data, older data was not relevant for predicting the most recent data.

Coefficient Drift (Concept Drift)

Prediction error of coefficient drift scenario

Changing coefficients means that the importance of features changes over time. In this case we can see that the None retraining strategy had a drastic increase in prediction error. Additionally, the results showed that retraining on all data also led to a continuous increase of the prediction error, while the Window retraining strategy kept the prediction error at a constant level.

The reason why the All Data strategy performance also decreased over time was that the training data contained more and more cases where similar inputs resulted in different outputs. Hence, it became more challenging for the model to identify clear patterns to derive decision rules. This was less of a problem for the Window strategy since older data was ignored, which allowed the model to “forget” older patterns and focus on the most recent cases.

Black Swan

Prediction error of black swan event scenario

The black swan event occurred at step 39, and the errors of all models suddenly increased at this point. However, after retraining a new model on the latest data, the errors of the All Data and Window strategies recovered to the previous level. This was not the case with the None retraining strategy, where the error increased around 3-fold compared to before the black swan event and remained at that level until the end of the scenario.

In contrast to the previous scenarios, the black swan event contained both data drift and concept drift. It is remarkable that the All Data and Window strategies recovered in the same way after the black swan event, while we found a significant difference between these strategies in the concept drift scenario. Probably the reason for this is that data drift occurred at the same time as concept drift. Hence, patterns that had been learned on older data were not relevant anymore after the black swan event because the input data had shifted.

An example of this could be that you are a translator and you get requests to translate a language that you haven’t translated before (data drift). At the same time there was a comprehensive spelling reform of this language (concept drift). While translators who have translated this language for many years may struggle with applying the reform, it wouldn’t affect you because you didn’t even know the rules before the reform.

To reproduce this analysis or explore further you can check out my git repository.

Identifying, quantifying, and mitigating the impact of data drift and concept drift is a challenging topic. In this article I analyzed simple scenarios to present basic characteristics of these concepts. More comprehensive analyses will undoubtedly provide deeper and more detailed conclusions on this topic.

Here is what I learned from this project:

Mitigating concept drift is more challenging than mitigating data drift. While data drift could be handled by basic retraining strategies, concept drift requires a more careful selection of training data. Ironically, cases where data drift and concept drift occur at the same time may be easier to handle than pure concept drift cases.

A comprehensive analysis of the training data would be the ideal starting point for finding an appropriate retraining strategy. Thereby, it is essential to partition the training data with respect to the time when it was recorded. To make the most realistic assessment of the model’s performance, the latest data should only be used as test data. To make an initial assessment regarding data drift and concept drift, the remaining training data can be split into two equally sized sets with the older data in one set and the newer data in the other. Comparing feature distributions of these sets makes it possible to assess data drift. Training one model on each set and comparing the change of feature importance allows an initial assessment of concept drift.
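As a minimal sketch of this initial assessment (the file and column names here are hypothetical):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy.stats import ks_2samp

# Hypothetical training data with a timestamp, several features and a target
df = pd.read_csv("training_data.csv").sort_values("timestamp")
older, newer = np.array_split(df, 2)  # older half vs. newer half

features = [c for c in df.columns if c not in ("timestamp", "target")]

# Data drift: compare each feature's distribution across the two halves
for f in features:
    stat, p = ks_2samp(older[f], newer[f])
    print(f"{f}: KS statistic={stat:.3f}, p-value={p:.3g}")

# Concept drift: train one model per half and compare feature importances
m_old = xgb.XGBRegressor().fit(older[features], older["target"])
m_new = xgb.XGBRegressor().fit(newer[features], newer["target"])
print(pd.DataFrame({"older": m_old.feature_importances_,
                    "newer": m_new.feature_importances_}, index=features))
```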

No retraining turned out to be the worst option in all scenarios. Furthermore, in cases where model retraining is not taken into consideration, it is also more likely that data to evaluate and/or retrain the model is not collected in an automated way. This means that model performance degradation may go unrecognized or only be noticed at a late stage. Once developers become aware that there is a potential issue with the model, precious time would be lost until new data is collected that can be used to retrain the model.

Identifying the perfect retraining strategy at an early stage is very difficult and may even be impossible if there are unexpected changes in the serving data. Hence, I think a reasonable approach is to start with a retraining strategy that performed well on the partitioned training data. This strategy should be reviewed and updated whenever cases occur where it does not address changes in the optimal way. Continuous model monitoring is essential to quickly notice and react when the model performance decreases.

If not otherwise stated all images were created by the author.




20Oct

Visualization of Data with Pie Charts in Matplotlib | by Diana Rozenshteyn | Oct, 2024


Examples of how to create different types of pie charts using Matplotlib to visualize the results of database analysis in a Jupyter Notebook with Pandas

Photo by Niko Nieminen on Unsplash

While working on my Master’s Thesis titled “Factors Associated with Impactful Scientific Publications in NIH-Funded Heart Disease Research”, I have used different types of pie charts to illustrate some of the key findings from the database analysis.

A pie chart can be an effective choice for data visualization when a dataset contains a limited number of categories representing parts of a whole, making it well-suited for displaying categorical data with an emphasis on comparing the relative proportions of each category.

In this article, I will demonstrate how to create four different types of pie charts using the same dataset to provide a more comprehensive visual representation and deeper insight into the data. To achieve this, I will use Matplotlib, Python’s plotting library, to display pie chart visualizations of the statistical data stored in the dataframe. If you are not familiar with the Matplotlib library, a good start is the Python Data Science Handbook by Jake VanderPlas, specifically the chapter on Visualization with Matplotlib, and matplotlib.org.

First, let’s import all the necessary libraries and extensions:
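(The original code is a GitHub embed that does not render here; below is a minimal sketch of what it likely includes, assuming a Jupyter environment.)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Render plots inline in the Jupyter Notebook
%matplotlib inline
```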

Next, we’ll prepare the CSV file for processing:
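(Again a sketch in place of the embed; the file name and column names are assumptions based on the dataset description below.)

```python
# Load the mini dataset of the top 10 journals (hypothetical file name)
df = pd.read_csv("top10_journals_heart_disease.csv")
df.head()  # expected columns: Journal, Female, Male, Unknown, Total
```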

The mini dataset used in this article highlights the top 10 journals for heart disease research publications from 2002 to 2020 and is part of a larger database collected for the Master’s Thesis research. The columns “Female,” “Male,” and “Unknown” represent the gender of the first author of the published articles, while the “Total” column reflects the total number of heart disease research articles published in each journal.

Image by the author and represents output of the Pie_Chart_Artcile_2.py sample code above.

For smaller datasets with fewer categories, a pie chart with exploding slices can effectively highlight a key category by pulling it out slightly from the rest of the chart. This visual effect draws attention to specific categories, making them stand out from the whole. Each slice represents a portion of the total, with its size proportional to the data it represents. Labels can be added to each slice to indicate the category, along with percentages to show their proportion to the total. This visual technique makes the exploded slice stand out without losing the context of the full data representation.
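(A sketch of an exploded-slice chart consistent with this description; the original Pie_Chart_Artcile_3.py embed may differ.)

```python
# Gender distribution for one journal, with the "Female" slice pulled out
row = df.iloc[0]
sizes = [row["Female"], row["Male"], row["Unknown"]]
fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(sizes, labels=["Female", "Male", "Unknown"], explode=(0.1, 0, 0),
       autopct="%1.1f%%", startangle=90)
ax.set_title(row["Journal"])
plt.show()
```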

Image by the author and represents output of the Pie_Chart_Artcile_3.py sample code above.

The same exploding slices technique can be applied to all other entries in the sample dataset, and the resulting charts can be displayed within a single figure, as sketched below. This type of visualization helps to highlight the overrepresentation or underrepresentation of a particular category within the dataset. In the example provided, presenting all 10 charts in one figure reveals that none of the top 10 journals in heart disease research published more articles authored by women than men, thereby emphasizing the gender disparity.
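(A sketch of combining all ten journals into one figure; the original Pie_Chart_Artcile_4.py embed may differ.)

```python
# One exploded-slice pie per journal, arranged in a 2 x 5 grid
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
for ax, (_, row) in zip(axes.flat, df.iterrows()):
    wedges, *_ = ax.pie([row["Female"], row["Male"], row["Unknown"]],
                        explode=(0.1, 0, 0), autopct="%1.1f%%", startangle=90)
    ax.set_title(row["Journal"], fontsize=8)
fig.legend(wedges, ["Female", "Male", "Unknown"], loc="lower center", ncol=3)
plt.show()
```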

Gender distributions for top 10 journals for heart disease research publications, 2002–2020. Image by the author and represents output of the Pie_Chart_Artcile_4.py sample code above.

A variation of the pie chart, known as a donut chart, can also be used to visualize data. Donut charts, like pie charts, display the proportions of categories that make up a whole, but the center of the donut chart can also be utilized to present additional data. This format is less cluttered visually and can make it easier to compare the relative sizes of slices compared to a standard pie chart. In the example used in this article, the donut chart highlights that among the top 10 journals for heart disease research publications, the American Journal of Physiology, Heart and Circulatory Physiology published the most articles, accounting for 21.8%.
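(A donut chart sketch; the original Pie_Chart_Artcile_5.py embed may differ.)

```python
# Share of total publications per journal, drawn as a donut
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(df["Total"], labels=df["Journal"], autopct="%1.1f%%", startangle=90,
       wedgeprops=dict(width=0.35))  # width < 1 hollows out the center
ax.set(aspect="equal")
plt.show()
```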

Image by the author and represents output of the Pie_Chart_Artcile_5.py sample code above.

We can enhance the visualization of additional information from the sample dataset by building on the previous donut chart and creating a nested version. The add_artist() method from Matplotlib’s figure module is used to incorporate any additional Artist (such as figures or objects) into the base figure. Similar to the earlier donut chart, this variation displays the distribution of publications across the top 10 journals for heart disease research. However, it also includes an additional layer that shows the gender distribution of first authors for each journal. This visualization highlights that a larger percentage of the first authors are male.
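(A nested-donut sketch using add_artist(); the original Pie_Chart_Artcile_6.py embed may differ.)

```python
# Outer ring: journals; inner ring: first-author gender within each journal
fig, ax = plt.subplots(figsize=(9, 9))
ax.pie(df["Total"], labels=df["Journal"], radius=1.0, startangle=90,
       wedgeprops=dict(width=0.3, edgecolor="white"))
inner = df[["Female", "Male", "Unknown"]].to_numpy().flatten()
ax.pie(inner, radius=0.7, startangle=90,
       wedgeprops=dict(width=0.3, edgecolor="white"))
# add_artist() places an extra artist on the figure, here an approximate
# white center circle in figure coordinates
fig.add_artist(plt.Circle((0.5, 0.5), 0.1, color="white"))
ax.set(aspect="equal")
plt.show()
```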

Image by the author and represents output of the Pie_Chart_Artcile_6.py sample code above.

In conclusion, pie charts are effective for visualizing data with a limited number of categories, as they enable viewers to quickly understand the most important categories or dominant proportions at a glance. In this specific example, the use of four different types of pie charts provides a clear visualization of the gender distribution among first authors in the top 10 journals for heart disease research publications, based on the 2002 to 2020 mini dataset used in this study. It is evident that a higher percentage of the publications’ first authors are male, and none of the top 10 journals for heart disease research published more articles authored by females than by males during the examined period.

The Jupyter Notebook and dataset used for this article can be found on GitHub.

Thank you for reading,

Diana

Note: I used GitHub embeds to publish this article.




19Oct

UI-Focused AI Agent


The UFO AI Agent aims to seamlessly navigate applications within the Windows OS and orchestrate events to fulfil a user query.

Initial Observations

This Windows OS-based AI Agent called UFO can work well as a personal workflow optimiser, suggesting the most optimal workflow to achieve a task on your PC.

We all have a process through which we interact with our UI…this agent can help optimise this personal workflow.

I once read that when a new type of UI is introduced, like a surface or touch screen, we start interacting with it and over time loose patterns of behaviour are established, which later turn into UI design conventions.

The same is happening with AI agents. Key ingredients of AI Agents are complex task decomposition and creating a sequence of chains, and AI Agent framework creators are converging on a set of good ideas.

They go through an iterative process of action, observation and thought prior to taking the next step.

AI Agents are also starting to exist within digital worlds, like in this case, Windows OS. Other examples are Apple’s iOS, or a web browser like Web Voyager.

You will see that, just as we as users have design affordances at our disposal to interact and navigate, these affordances are also available to the AI Agent.

There is also a set of actions identified which are potentially high in consequence, like deleting files or sending an email. The ramifications of these risks will grow considerably when AI Agents are embodied in the real world.

Lastly, quite a while ago I wrote about the ability of LLMs to perform symbolic reasoning. The ability of Language Models to do symbolic reasoning was a feature which I felt did not enjoy the attention it should.

We all perform symbolic reasoning as humans: we observe a room, and are able to mentally plan and project tasks based on what we have seen in a spatial setting. LLMs also have this capability, but visual information was always delivered via a text description. With the advent of vision capabilities in LLMs, images can be used directly.

The image below shows a common trait in AI Agents with a digital environment, where observation, thought and action are really all language based.

In user interface design, loose patterns of behaviour in time turn into UI design conventions

UFO = “U”I-”Fo”cused AI Agent

The goal of UFO as an AI agent is to effortlessly navigate and operate within individual applications, as well as across multiple apps, to complete user requests.

One powerful use-case is leveraging Vision-Language Models (VLMs) to interact with software interfaces, responding to natural language commands and executing them within real-world environments.

The development of Language Models with vision marks a shift from Large Language Models (LLMs) to Large Action Models (LAMs), enabling AI to translate decisions into real-world actions.

UFO also features an application-switching mechanism, allowing it to seamlessly transition between apps when necessary.

Vision-Language-Action models transfer web knowledge to robotic control

The image above illustrates the UFO Windows AI Agent. The AI Agent completes the user request by retrieving information from various applications including Word, Photos, PowerPoint etc. An email is then composed with the synthesised information.

UFO Process Overview

Initial Setup — UFO provides HostAgent with a full desktop screenshot and a list of available applications. The HostAgent uses this information to select the appropriate application for the task and creates a global plan to complete the user request.

Focus and Execution — The selected application is brought into focus on the desktop. The AppAgent begins executing actions based on the global plan.

Action Selection — Before each action, UFO captures a screenshot of the current application window with annotated controls. UFO provides details about each control for AppAgent’s observation.

Below is an image with annotation examples…

Action Execution — the AppAgent chooses a control, selects an action to execute, and carries it out using a control interaction module.

After each action, UFO builds a local plan for the next step and continues the process until the task is completed in the application.

Handling Multi-App Requests

If the task requires multiple applications, AppAgent passes control back to HostAgent to switch to the next app.

The process is repeated for each application until the user request is fully completed.

To some extent, it feels like the HostAgent acts as the orchestration agent, and the AppAgents are really agents in their own right. There are no tools per se, but rather applications it is accessing.

Interactive Requests

Users can introduce new requests at any time, prompting UFO to repeat the process.

Once all user requests are completed or fulfilled, UFO ends its operation.
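On a conceptual level, this flow can be sketched as a simple dual-agent loop. The sketch below is purely illustrative: every class and function name is a stand-in, not UFO's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AppAgent:
    app: str
    plan: list
    memory: list = field(default_factory=list)  # thoughts, actions, results
    status: str = "CONTINUE"

    def step(self, screenshot: str) -> None:
        # Observe the annotated screenshot, pick a control and an action (stubbed)
        action = f"execute next step of plan in {self.app}"
        self.memory.append((screenshot, action))
        self.plan.pop(0)                          # advance the local plan
        self.status = "CONTINUE" if self.plan else "FINISH"

def host_agent(user_request: str, apps_needed: list) -> None:
    # HostAgent: the global plan visits the applications needed for the request
    for app in apps_needed:
        agent = AppAgent(app, plan=[f"do part of '{user_request}' in {app}"])
        while agent.status == "CONTINUE":
            screenshot = f"annotated screenshot of {app}"  # stub observation
            agent.step(screenshot)
        # AppAgent hands control back here, and we switch to the next app

host_agent("summarise the report and email it", ["winword", "outlook"])
```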

More On The HostAgent

The process begins with a detailed observation of the current desktop window, captured through screenshots that provide a clear view of the active interface.

Based on this observation, the next logical step to complete the task is determined, following the Chain-of-Thought (CoT) reasoning approach.

Once the appropriate application is selected, its label and name are identified and noted.

The status of the task is then assessed, with the system indicating whether to continue or finish.

Alongside this, a global plan is devised — typically a broad, high-level outline for fulfilling the user request. If this plan were visible to the user and editable, it would make for an excellent human-in-the-loop feedback mechanism and a source of future improvements.

Throughout the process, additional comments or information are provided, often including a brief summary of progress or highlighting key points for further consideration.

More On The AppAgent

The process starts with the user submitting a request to UFO, which is identical to the one received by the HostAgent.

UFO then captures screenshots of the application, divided into three types:

  1. a previous screenshot,
  2. a clean current one,
  3. and an annotated version showing available controls.

Alongside this, control information is collected, listing the names and types of controls available for interaction in the selected application.

The system also recalls previous thoughts, comments, actions and execution results, building a memory that mirrors the HostAgent’s own recollections.

Additionally, examples are provided to demonstrate possible action choices for the AppAgent.

With this comprehensive input, AppAgent carefully analyses the details.

First, it makes an observation, providing a detailed description of the current application window and evaluating whether the last action had the intended effect.

The rationale behind each action is also considered, as the AppAgent logically determines the next move.

Control

Once a control is selected for interaction, its label and name are identified, and the specific function to be performed on it is defined.

The AppAgent then assesses the task status, deciding whether to continue if further actions are needed, finish if the task is complete, pending if awaiting user confirmation, screenshot if a new screenshot is required for more control annotations, or App Selection if it’s time to switch to another application.

To ensure smooth progress, the AppAgent generates a local plan, a more detailed and precise roadmap for upcoming steps to fully satisfy the user request.

Throughout this process, additional comments are provided, summarising progress or highlighting key points, mirroring the feedback offered by the HostAgent.

Observation & Thought

When HostAgent is prompted to provide its Observation and Thoughts, it serves two key purposes.

First, it pushes HostAgent to thoroughly analyse the current state of the task, offering a clear explanation of its logic and decision-making process.

This not only strengthens the internal consistency of its choices but also makes UFO’s operations more transparent and easier to understand.

Second, HostAgent assesses the task’s progress, outputting “FINISH” if the task is complete.

It can also provide feedback to the user, such as reporting progress, pointing out potential issues, or answering any queries.

Once the correct application is identified, UFO moves forward with the task, and AppAgent takes charge of executing the necessary actions within the application to fulfil the user request.

Design Consideration

UFO integrates a range of design features specifically crafted for the Windows OS.

These enhancements streamline interactions with UI controls, making them more efficient, automated, and secure, ultimately improving UFO’s ability to handle user requests.

Key aspects include interactive mode, customisable actions, control filtering, plan reflection, and safety mechanisms, each of which is discussed in more detail in the following sections.

Interactive Mode

UFO allows users to engage in interactive and iterative exchanges instead of relying on one-time completions.

After finishing a task, users can request enhancements to the previous task, propose entirely new tasks, or even assist UFO with operations it might struggle with, such as entering a password.

The researchers believe this user-friendly approach sets UFO apart from other UI agents in the market, enabling it to absorb user feedback and effectively manage longer, more complex tasks.








18Oct

Revisiting Karpathy’s “State of Computer Vision and AI” | by Dr. Leon Eversberg | Oct, 2024


Looking back at AI progress since the 2012 blog post “The state of Computer Vision and AI: we are really, really far away”

On August 9, 2010, President Barack Obama jokingly put his toe on the scale as Trip Director Marvin Nicholson weighed himself in the volleyball locker room at the University of Texas in Austin.
President Barack Obama jokingly puts his toe on the scale. Photo by Pete Souza on flickr.com

“What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense.” [1]

Twelve years ago, on October 22, 2012, Andrej Karpathy published a blog post titled “The state of computer vision and AI: we are really, really far away” [1].

In his blog post, he used the image of former President Barack Obama jokingly putting his toe on the scale as a starting point for his take on the state of computer vision and artificial intelligence (AI) in 2012.

Karpathy argues that AI models need to have a lot of knowledge about our world in order to make inferences based on the values of pixels in an image, not only to understand what’s happening but also to understand the context of why it’s funny.

“It is mind-boggling that all of the above inferences unfold from a brief…




17Oct

A Novel Approach to Detect Coordinated Attacks Using Clustering | by Trupti Bavalatti | Oct, 2024


Unveiling hidden patterns: grouping malicious behavior

Clustering is a powerful technique within unsupervised machine learning that groups data based on inherent similarities. Unlike supervised learning methods, such as classification, which rely on pre-labeled data to guide the learning process, clustering operates on unlabeled data. This means there are no predefined categories or labels; instead, the algorithm discovers the underlying structure of the data without prior knowledge of what the grouping should look like.

The main goal of clustering is to organize data points into clusters, where data points within the same cluster have higher similarity to each other compared to those in different clusters. This distinction allows the clustering algorithm to form groups that reflect natural patterns in the data. Essentially, clustering aims to maximize intra-cluster similarity while minimizing inter-cluster similarity. This technique is particularly useful in use-cases where you need to find hidden relationships or structure in data, making it valuable in areas such as fraud detection and anomaly identification.

By applying clustering, one can reveal patterns and insights that might not be obvious through other methods, and its simplicity and flexibility makes it adaptable to a wide variety of data types and applications.

A practical application of clustering is fraud detection in online systems. Consider an example where multiple users are making requests to a website, and each request includes details like the IP address, time of the request, and transaction amount.

Here’s how clustering can help detect fraud:

  • Imagine that most users are making requests from unique IP addresses, and their transaction patterns naturally differ.
  • However, if multiple requests come from the same IP address and show similar transaction patterns (such as frequent, high-value transactions), it could indicate that a fraudster is making multiple fake transactions from one source.

By clustering all user requests based on IP address and transaction behavior, we could detect suspicious clusters of requests that all originate from a single IP. This can flag potentially fraudulent activity and help in taking preventive measures.

An example diagram that visually demonstrates the concept of clustering is shown in the figure below.

Imagine you have data points representing transaction requests, plotted on a graph where:

  • X-axis: Number of requests from the same IP address.
  • Y-axis: Average transaction amount.

On the left side, we have the raw data. Without labels, we might already see some patterns forming. On the right, after applying clustering, the data points are grouped into clusters, with each cluster representing a different user behavior.
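To make this concrete, here is a toy illustration using DBSCAN from scikit-learn; the feature values and parameters below are invented for the sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Each row: [number of requests from the same IP, average transaction amount]
requests = np.array([
    [1, 20.0], [2, 35.0], [1, 15.0],        # typical users
    [40, 480.0], [42, 510.0], [41, 495.0],  # dense, high-value burst from one IP
])

scaled = StandardScaler().fit_transform(requests)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(scaled)
print(labels)  # points sharing a label form a cluster; -1 marks outliers
```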

Example of clustering of fraudulent user behavior. Image source (CC BY 4.0)

To group data effectively, we must define a similarity measure, or metric, that quantifies how close data points are to each other. This similarity can be measured in multiple ways, depending on the data’s structure and the insights we aim to discover. There are two key approaches to measuring similarity — manual similarity measures and embedded similarity measures.

A manual similarity measure involves explicitly defining a mathematical formula to compare data points based on their raw features. This method is intuitive and we can use distance metrics like Euclidean distance, cosine similarity, or Jaccard similarity to evaluate how similar two points are. For instance, in fraud detection, we could manually compute the Euclidean distance between transaction attributes (e.g., transaction amount, frequency of requests) to detect clusters of suspicious behavior. Although this approach is relatively easy to set up, it requires careful selection of the relevant features and may miss deeper patterns in the data.

On the other hand, an embedded similarity measure leverages the power of machine learning models to create learned representations, or embeddings of the data. Embeddings are vectors that capture complex relationships in the data and can be generated from models like Word2Vec for text or neural networks for images. Once these embeddings are computed, similarity can be measured using traditional metrics like cosine similarity, but now the comparison occurs in a transformed, lower-dimensional space that captures more meaningful information. Embedded similarity is particularly useful for complex data, such as user behavior on websites or text data in natural language processing. For example, in a movie or ad recommendation system, user actions can be embedded into vectors, and similarities in this embedding space can be used to recommend content to similar users.
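The two approaches can use the same metric over different representations. A small sketch (the vectors are invented; the "embedding" stands in for the output of a learned model such as Word2Vec):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Manual: compare raw transaction features (amount, requests per hour)
u1, u2 = np.array([120.0, 4.0]), np.array([115.0, 5.0])
print(cosine(u1, u2))  # similarity in raw feature space

# Embedded: compare learned behavior vectors in a lower-dimensional space
e1, e2 = np.array([0.12, -0.80, 0.55]), np.array([0.10, -0.75, 0.60])
print(cosine(e1, e2))  # similarity in embedding space
```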

While manual similarity measures provide transparency and greater control over feature selection and setup, embedded similarity measures give the ability to capture deeper and more abstract relationships in the data. The choice between the two depends on the complexity of the data and the specific goals of the clustering task. If you have well-understood, structured data, a manual measure may be sufficient. But if your data is rich and multi-dimensional, such as in text or image analysis, an embedding-based approach may give more meaningful clusters. Understanding these trade-offs is key to selecting the right approach for your clustering task.

In cases like fraud detection, where the data is often rich and based on behavior of user activity, an embedding-based approach is generally more effective for capturing nuanced patterns that could signal risky activity.

Coordinated fraudulent attack behaviors often exhibit specific patterns or characteristics. For instance, fraudulent activity may originate from a set of similar IP addresses or rely on consistent, repeated tactics. Detecting these patterns is crucial for maintaining the integrity of a system, and clustering is an effective technique for grouping entities based on shared traits. This helps the identification of potential threats by examining the collective behavior within clusters.

However, clustering alone may not be enough to accurately detect fraud, as it can also group benign activities alongside harmful ones. For example, in a social media environment, users posting harmless messages like “How are you today?” might be grouped with those engaged in phishing attacks. Hence, additional criteria are necessary to separate harmful behavior from benign actions.

To address this, we introduce the Behavioral Analysis and Cluster Classification System (BACCS) as a framework designed to detect and manage abusive behaviors. BACCS works by generating and classifying clusters of entities, such as individual accounts, organizational profiles, and transactional nodes, and can be applied across a wide range of sectors including social media, banking, and e-commerce. Importantly, BACCS focuses on classifying behaviors rather than content, making it more suitable for identifying complex fraudulent activities.

The system evaluates clusters by analyzing the aggregate properties of the entities within them. These properties are typically boolean (true/false), and the system assesses the proportion of entities exhibiting a specific characteristic to determine the overall nature of the cluster. For example, a high percentage of newly created accounts within a cluster might indicate fraudulent activity. Based on predefined policies, BACCS identifies combinations of property ratios that suggest abusive behavior and determines the appropriate actions to mitigate the threat.

The BACCS framework offers several advantages:

  • It enables the grouping of entities based on behavioral similarities, enabling the detection of coordinated attacks.
  • It allows for the classification of clusters by defining relevant properties of the cluster members and applying custom policies to identify potential abuse.
  • It supports automatic actions against clusters flagged as harmful, ensuring system integrity and enhancing protection against malicious activities.

This flexible and adaptive approach allows BACCS to continuously evolve, ensuring that it remains effective in addressing new and emerging forms of coordinated attacks across different platforms and industries.

Let’s understand more with the help of an analogy: Let’s say you have a wagon full of apples that you want to sell. All apples are put into bags before being loaded onto the wagon by multiple workers. Some of these workers don’t like you, and try to fill their bags with sour apples to mess with you. You need to identify any bag that might contain sour apples. To identify a sour apple you need to check if it is soft; the only problem is that some apples are naturally softer than others. You solve the problem of these malicious workers by opening each bag, picking out five apples, and checking whether they are soft or not. If almost all the apples are soft, it’s likely that the bag contains sour apples, and you put it to the side for further inspection later on. Once you’ve identified all the potential bags with a suspicious amount of softness, you pour out their contents, pick out the healthy apples which are hard, and throw away all the soft ones. You’ve now minimized the risk of your customers taking a bite of a sour apple.

BACCS operates in a similar manner; instead of apples, you have entities (e.g., user accounts). Instead of bad workers, you have malicious users, and instead of the bag of apples, you have entities grouped by common characteristics (e.g., similar account creation times). BACCS samples each group of entities and checks for signs of malicious behavior (e.g., a high rate of policy violations). If a group shows a high prevalence of these signs, it’s flagged for further investigation.

Just like sampling apples from each bag, BACCS uses predefined signals (also referred to as properties) to assess the quality of entities within a cluster. If a cluster is found to be problematic, further actions can be taken to isolate or remove the malicious entities. This system is flexible and can adapt to new types of malicious behavior by adjusting the criteria for flagging clusters or by creating new types of clusters based on emerging patterns of abuse.

This analogy illustrates how BACCS helps maintain the integrity of the environment by proactively identifying and mitigating potential issues, ensuring a safer and more reliable space for all legitimate users.

The system offers numerous advantages:

  • Better Precision: By clustering entities, BACCS provides strong evidence of coordination, enabling the creation of policies that would be too imprecise if applied to individual entities in isolation.
  • Explainability: Unlike some machine learning techniques, the classifications made by BACCS are transparent and understandable. It is straightforward to trace and understand how a particular decision was made.
  • Quick Response Time: Since BACCS operates on a rule-based system rather than relying on machine learning, there is no need for extensive model training. This results in faster response times, which is important for immediate issue resolution.

BACCS might be the right solution for your needs if you:

  • Focus on classifying behavior rather than content: While many clusters in BACCS may be formed around content (e.g., images, email content, user phone numbers), the system itself does not classify content directly.
  • Handle issues with a relatively high frequency of occurrence: BACCS employs a statistical approach that is most effective when the clusters contain a significant proportion of abusive entities. It may not be as effective for harmful events that occur sparsely but is more suited for highly prevalent problems such as spam.
  • Deal with coordinated or similar behavior: The clustering signal primarily indicates coordinated or similar behavior, making BACCS particularly useful for addressing these types of issues.

Here’s how you can incorporate the BACCS framework into a real production system:

Setting up BACCS in production. Image by Author
  1. When entities engage in activities on a platform, you build an observation layer to capture this activity and convert it into events. These events can then be monitored by a system designed for cluster analysis and actioning.
  2. Based on these events, the system needs to group entities into clusters using various attributes — for example, all users posting from the same IP address are grouped into one cluster. These clusters should then be forwarded for further classification.
  3. During the classification process, the system needs to compute a set of specialized boolean signals for a sample of the cluster members. An example of such a signal could be whether the account age is less than a day. The system then aggregates these signal counts for the cluster, such as determining that, in a sample of 100 users, 80 have an account age of less than one day.
  4. These aggregated signal counts should be evaluated against policies that determine whether a cluster appears to be anomalous and what actions should be taken if it is. For instance, a policy might state that if more than 60% of the members in an IP cluster have an account age of less than a day, these members should undergo further verification.
  5. If a policy identifies a cluster as anomalous, the system should identify all members of the cluster exhibiting the signals that triggered the policy (e.g., all members with an account age of less than one day).
  6. The system should then direct all such users to the appropriate action framework, implementing the action specified by the policy (e.g., further verification or blocking their account).

Typically, the entire process, from an entity’s activity to the application of an action, is completed within several minutes. It’s also crucial to recognize that while this system provides a framework and infrastructure for cluster classification, clients/organizations need to supply their own cluster definitions, properties, and policies tailored to their specific domain.

Let’s look at an example where we try to mitigate spam by clustering users by IP address when they send an email, and blocking them if more than 60% of the cluster members have an account age of less than a day.

Clustering and blocking in action. Image by Author
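To make the steps concrete, here is a minimal sketch of this example in code. The event fields, sample size, and helper names are hypothetical stand-ins for illustration, not the actual BACCS implementation:

import random
from collections import defaultdict

# Hypothetical event records produced by the observation layer (step 1);
# the field names are illustrative.
events = [
    {"user_id": "u1", "ip": "10.0.0.1", "account_age_days": 0.5},
    {"user_id": "u2", "ip": "10.0.0.1", "account_age_days": 0.2},
    {"user_id": "u3", "ip": "10.0.0.1", "account_age_days": 40.0},
    {"user_id": "u4", "ip": "10.0.0.2", "account_age_days": 90.0},
]

# Step 2: group entities into clusters by a shared attribute (the IP).
clusters = defaultdict(list)
for event in events:
    clusters[event["ip"]].append(event)

# Step 3: a boolean property computed per entity.
def is_young_account(entity):
    return entity["account_age_days"] < 1.0

# Steps 4-6: evaluate the policy on a sample and action the members
# exhibiting the triggering signal.
ANOMALY_THRESHOLD = 0.6  # "more than 60% of cluster members"
SAMPLE_SIZE = 100

for ip, members in clusters.items():
    sample = random.sample(members, min(len(members), SAMPLE_SIZE))
    share = sum(is_young_account(m) for m in sample) / len(sample)
    if share > ANOMALY_THRESHOLD:
        to_action = [m for m in members if is_young_account(m)]
        print(f"cluster {ip}: blocking {len(to_action)} accounts")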

Members may already be present in the clusters. A re-classification of a cluster can be triggered when it reaches a certain size or has accumulated enough changes since the previous classification.
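A sketch of such a trigger, with hypothetical field names and thresholds:

# Hypothetical re-classification trigger; field names and thresholds
# are illustrative, not from an actual implementation.
def should_reclassify(cluster, min_size=100, min_new_members=20):
    reached_size = len(cluster["members"]) >= min_size
    enough_changes = cluster["members_added_since_last_run"] >= min_new_members
    return reached_size or enough_changes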

When selecting clustering criteria and defining properties for users, the goal is to identify patterns or behaviors that align with the specific risks or activities you’re trying to detect. For instance, if you’re working on detecting fraudulent behavior or coordinated attacks, the criteria should capture traits that are often shared by malicious actors. Here are some factors to consider when picking clustering criteria and defining user properties:

The clustering criteria you choose should revolve around characteristics that represent behavior likely to signal risk. These characteristics could include:

  • Time-Based Patterns: For example, grouping users by account creation times or the frequency of actions in a given time period can help detect spikes in activity that may be indicative of coordinated behavior.
  • Geolocation or IP Addresses: Clustering users by their IP address or geographical location can be especially effective in detecting coordinated actions, such as multiple fraudulent logins or content submissions originating from the same region.
  • Content Similarity: In cases like misinformation or spam detection, clustering by the similarity of content (e.g., similar text in posts/emails) can identify suspiciously coordinated efforts.
  • Behavioral Metrics: Characteristics like the number of transactions made, average session time, or the types of interactions with the platform (e.g., likes, comments, or clicks) can indicate unusual patterns when grouped together.

The key is to choose criteria that do not merely correlate with benign user behavior but are distinct enough to isolate risky patterns, which will lead to more effective clustering.

Defining User Properties

Once you’ve chosen the criteria for clustering, defining meaningful properties for the users within each cluster is critical. These properties should be measurable signals that can help you assess the likelihood of harmful behavior. Common properties include:

  • Account Age: Newly created accounts tend to have a higher risk of being involved in malicious activities, so a property like “Account Age < 1 day” can serve as a strong signal.
  • Connection Density: For social media platforms, properties like the number of connections or interactions between accounts within a cluster can signal abnormal behavior.
  • Transaction Amounts: In cases of financial fraud, the average transaction size or the frequency of high-value transactions can be key properties to flag risky clusters.

Each property should be clearly linked to a behavior that could indicate either legitimate use or potential abuse. Importantly, properties should be boolean or numerical values that allow for easy aggregation and comparison across the cluster.

Another advanced strategy is using a machine learning classifier’s output as a property, but with an adjusted threshold. Normally, you would set a high threshold for classifying harmful behavior to avoid false positives. However, when combined with clustering, you can afford to lower this threshold because the clustering itself acts as an additional signal to reinforce the property.

Let’s say there is a model X that catches scams and disables email accounts with a model X score > 0.95. Assume this model is already live in production and is disabling bad email accounts at threshold 0.95 with 100% precision. We want to increase the recall of this model without impacting its precision.

  • First, we need to define clusters that can group coordinated activity together. Let’s say we know that there’s coordinated activity going on, where bad actors are using the same subject line but different email IDs to send scammy emails. So using BACCS, we will form clusters of email accounts whose sent emails all share the same subject line.
  • Next, we need to lower the raw model threshold and define a BACCS property. We will now integrate model X into our production detection infra and create property using lowered model threshold, say 0.75. This property will have a value of “True” for an email account that has model X score >= 0.75.
  • Then we’ll define the anomaly threshold: if more than 50% of the entities in a subject-line cluster have this property, classify the cluster as bad and take down the email accounts within it that have this property set to True.

So we have essentially lowered the model’s threshold, disabling entities in particular clusters at a significantly lower threshold than the model currently enforces, yet we can be sure that the precision of enforcement does not drop while recall increases. Let’s understand how:

Suppose we have 6 entities that share the same subject line, with model X scores as follows:

Entities actioned by ML model. Image by Author

If we used the raw model threshold (0.95), we would have disabled only 2 of the 6 email accounts.

If we instead cluster entities on subject-line text and define a policy that flags clusters in which more than 50% of entities have a model X score >= 0.75, we would have taken down every account exhibiting the property:

Entities actioned by clustering, using ML scores as properties. Image by Author

So we increased the recall of enforcement from 33% to 83%. Essentially, even if individual behaviors seem less risky, the fact that they are part of a suspicious cluster elevates their importance. This combination provides a strong signal for detecting harmful activity while minimizing the chances of false positives.
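To make the arithmetic concrete, here is a minimal sketch with six hypothetical scores chosen to reproduce the outcome above (the actual scores live in the table image):

scores = [0.97, 0.96, 0.85, 0.80, 0.78, 0.60]  # hypothetical model X scores

# Individual enforcement at the raw threshold catches 2 of 6 accounts.
raw_actioned = [s for s in scores if s > 0.95]

# Cluster-level policy: the property fires at the lowered threshold 0.75,
# and the cluster is flagged if more than 50% of members have it.
has_property = [s >= 0.75 for s in scores]
if sum(has_property) / len(scores) > 0.5:
    cluster_actioned = [s for s, p in zip(scores, has_property) if p]
else:
    cluster_actioned = []

print(len(raw_actioned), len(cluster_actioned))  # 2 5 -> recall 33% vs 83%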

By lowering the threshold, you allow the clustering process to surface patterns that might otherwise be missed if you relied on classification alone. This approach takes advantage of both the granular insights from machine learning models and the broader behavioral patterns that clustering can identify. Together, they create a more robust system for detecting and mitigating risks and catching many more entities while still keeping a lower false positive rate.

Clustering techniques remain an important method for detecting coordinated attacks and ensuring system safety, particularly on platforms more prone to fraud, abuse, or other malicious activities. By grouping similar behaviors into clusters and applying policies to remove bad entities from those clusters, we can detect and mitigate harmful activity and ensure a safer digital ecosystem for all users. More advanced embedding-based approaches can represent complex user behavioral patterns better than manually engineered similarity measures.

As we continue advancing our security protocols, frameworks like BACCS play a crucial role in taking down large coordinated attacks. The integration of clustering with behavior-based policies allows for dynamic adaptation, enabling us to respond swiftly to new forms of abuse while reinforcing trust and safety across platforms.

In the future, there is a big opportunity for further research and exploration into complementary techniques that could enhance clustering’s effectiveness. Techniques such as graph-based analysis for mapping complex relationships between entities could be integrated with clustering to offer even higher precision in threat detection. Moreover, hybrid approaches that combine clustering with machine learning classification can be a very effective approach for detecting malicious activities at higher recall and lower false positive rate. Exploring these methods, along with continuous refinement of current methods, will ensure that we remain resilient against the evolving landscape of digital threats.





15Oct

AI Feels Easier Than Ever, But Is It Really? | by Anna Via | Oct, 2024


The 4 Big Challenges of building AI products

Picture by ynsplt on Unsplash

A few days ago, I was speaking at an event about how to move from using ChatGPT at a personal level to implementing AI-powered technical solutions for teams and companies. We covered everything from prompt engineering and fine-tuning to agents and function calling. One of the questions from the audience stood out to me, even though it was one I should have expected: “How long does it take to get an AI-powered feature into production?”

In many ways, integrating AI into features can be incredibly easy. With recent progress, leveraging a state-of-the-art LLM can be as simple as making an API call. The entry barriers to use and integrate AI are now really low. There is a big but though. Getting an AI feature into production while accounting for all risks linked with this new technology can be a real challenge.

And that’s the paradox: AI feels easier and more accessible than ever, but its open-ended (free input / free output…




12Oct

Gaussian Naive Bayes, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Oct, 2024


CLASSIFICATION ALGORITHM

Bell-shaped assumptions for better predictions

⛳️ More CLASSIFICATION ALGORITHM, explained:
· Dummy Classifier
· K Nearest Neighbor Classifier
· Bernoulli Naive Bayes
Gaussian Naive Bayes
· Decision Tree Classifier
· Logistic Regression
· Support Vector Classifier
· Multilayer Perceptron (soon!)

Building on our previous article about Bernoulli Naive Bayes, which handles binary data, we now explore Gaussian Naive Bayes for continuous data. Unlike the binary approach, this algorithm assumes each feature follows a normal (Gaussian) distribution.

Here, we’ll see how Gaussian Naive Bayes handles continuous, bell-shaped data — ringing in accurate predictions — all without getting into the intricate math of Bayes’ Theorem.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Like other Naive Bayes variants, Gaussian Naive Bayes makes the “naive” assumption of feature independence. It assumes that the features are conditionally independent given the class label.

However, while Bernoulli Naive Bayes is suited for datasets with binary features, Gaussian Naive Bayes assumes that the features follow a continuous normal (Gaussian) distribution. Although this assumption may not always hold true in reality, it simplifies the calculations and often leads to surprisingly accurate results.

Bernoulli NB assumes binary data, Multinomial NB works with discrete counts, and Gaussian NB handles continuous data assuming a normal distribution.

Throughout this article, we’ll use this artificial golf dataset (made by author) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Columns: ‘RainfallAmount’ (in mm), ‘Temperature’ (in Celsius), ‘Humidity’ (in %), ‘WindSpeed’ (in km/h) and ‘Play’ (Yes/No, target feature)
# IMPORTING DATASET #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

dataset_dict = {
'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
print(pd.concat([X_train, y_train], axis=1), end='\n\n')
print(pd.concat([X_test, y_test], axis=1))

Gaussian Naive Bayes works with continuous data, assuming each feature follows a Gaussian (normal) distribution.

  1. Calculate the probability of each class in the training data.
  2. For each feature and class, estimate the mean and variance of the feature values within that class.
  3. For a new instance:
    a. For each class, calculate the probability density function (PDF) of each feature value under the Gaussian distribution of that feature within the class.
    b. Multiply the class probability by the product of the PDF values for all features.
  4. Predict the class with the highest resulting probability.
Gaussian Naive Bayes uses the normal distribution to model the likelihood of different feature values for each class. It then combines these likelihoods to make a prediction.
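For intuition, the four steps above can be condensed into a few lines of NumPy/SciPy. This is a bare-bones sketch (it assumes NumPy array inputs and omits the variance smoothing discussed later), not the scikit-learn implementation:

import numpy as np
from scipy.stats import norm

def gnb_predict(X_train, y_train, x_new):
    # X_train: (n_samples, n_features) array, y_train: (n_samples,) labels,
    # x_new: (n_features,) array for a single new instance.
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        X_c = X_train[y_train == c]
        prior = len(X_c) / len(X_train)                 # step 1
        mean, std = X_c.mean(axis=0), X_c.std(axis=0)   # step 2
        likelihood = norm.pdf(x_new, mean, std).prod()  # steps 3a-3b
        scores.append(prior * likelihood)
    return classes[np.argmax(scores)]                   # step 4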

Transforming non-Gaussian distributed data

Remember that this algorithm naively assumes that all the input features follow a Gaussian/normal distribution?

Since we are not really sure about the distribution of our data, especially for features that clearly don’t follow a Gaussian distribution, applying a power transformation (like Box-Cox) before using Gaussian Naive Bayes can be beneficial. This approach can help make the data more Gaussian-like, which aligns better with the assumptions of the algorithm.

All columns are scaled using Power Transformation (Box-Cox Transformation) and then standardized.
from sklearn.preprocessing import PowerTransformer

# Initialize and fit the PowerTransformer
pt = PowerTransformer(standardize=True) # Standard Scaling already included
X_train_transformed = pt.fit_transform(X_train)
X_test_transformed = pt.transform(X_test)

Now we are ready for the training.

1. Class Probability Calculation: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)

from fractions import Fraction

def calc_target_prob(attr):
    total_counts = attr.value_counts().sum()
    prob_series = attr.value_counts().apply(lambda x: Fraction(x, total_counts).limit_denominator())
    return prob_series

print(calc_target_prob(y_train))

2. Feature Probability Calculation: For each feature and each class, calculate the mean (μ) and standard deviation (σ) of the feature values within that class using the training data. Then, calculate the probability using the Gaussian Probability Density Function (PDF) formula.

For each weather condition, determine the mean and standard deviation for both “YES” and “NO” instances. Then calculate their PDF using the PDF formula for normal/Gaussian distribution.
The same process is applied to all of the other features.
def calculate_class_probabilities(X_train_transformed, y_train, feature_names):
    classes = y_train.unique()
    equations = pd.DataFrame(index=classes, columns=feature_names)

    for cls in classes:
        X_class = X_train_transformed[y_train == cls]
        mean = X_class.mean(axis=0)
        std = X_class.std(axis=0)
        k1 = 1 / (std * np.sqrt(2 * np.pi))
        k2 = 2 * (std ** 2)

        for i, column in enumerate(feature_names):
            equation = f"{k1[i]:.3f}·exp(-(x-({mean[i]:.2f}))²/{k2[i]:.3f})"
            equations.loc[cls, column] = equation

    return equations

# Use the function with the transformed training data
equation_table = calculate_class_probabilities(X_train_transformed, y_train, X.columns)

# Display the equation table
print(equation_table)

3. Smoothing: Gaussian Naive Bayes uses a unique smoothing approach. Unlike Laplace smoothing in other variants, it adds a tiny value (0.000000001 times the largest variance) to all variances. This prevents numerical instability from division by zero or very small numbers.
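A sketch of the smoothing step described above, reusing X_train_transformed from the earlier code (for brevity it smooths the pooled feature variances; the per-class variances are smoothed the same way):

import numpy as np

# A tiny fraction of the largest feature variance is added to every
# estimated variance, so no variance is ever exactly zero.
var_smoothing = 1e-9  # scikit-learn's default
feature_variances = np.var(X_train_transformed, axis=0)
epsilon = var_smoothing * feature_variances.max()
smoothed_variances = feature_variances + epsilon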

Given a new instance with continuous features:

1. Probability Collection:
For each possible class:
· Start with the probability of this class occurring (class probability).
· For each feature in the new instance, calculate the probability density function of that feature within the class.

For ID 14, we calculate the PDF of each feature for both “YES” and “NO” instances.

2. Score Calculation & Prediction:
For each class:
· Multiply all the collected PDF values together.
· The result is the score for this class.
· The class with the highest score is the prediction.

from scipy.stats import norm

def calculate_class_probability_products(X_train_transformed, y_train, X_new, feature_names, target_name):
    classes = y_train.unique()
    n_features = X_train_transformed.shape[1]

    # Create column names using actual feature names
    column_names = [target_name] + list(feature_names) + ['Product']
    probability_products = pd.DataFrame(index=classes, columns=column_names)

    for cls in classes:
        X_class = X_train_transformed[y_train == cls]
        mean = X_class.mean(axis=0)
        std = X_class.std(axis=0)

        prior_prob = np.mean(y_train == cls)
        probability_products.loc[cls, target_name] = prior_prob

        feature_probs = []
        for i, feature in enumerate(feature_names):
            prob = norm.pdf(X_new[0, i], mean[i], std[i])
            probability_products.loc[cls, feature] = prob
            feature_probs.append(prob)

        product = prior_prob * np.prod(feature_probs)
        probability_products.loc[cls, 'Product'] = product

    return probability_products

# Assuming X_new is your new sample reshaped to (1, n_features)
X_new = np.array([-1.28, 1.115, 0.84, 0.68]).reshape(1, -1)

# Calculate probability products
prob_products = calculate_class_probability_products(X_train_transformed, y_train, X_new, X.columns, y.name)

# Display the probability product table
print(prob_products)

For this particular dataset, this accuracy is considered quite good.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train_transformed, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test_transformed)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy:.4f}")

GaussianNB is known for its simplicity and effectiveness. The main thing to remember about its parameters is:

  1. priors: This is the most notable parameter, similar to Bernoulli Naive Bayes. In most cases, you don’t need to set it manually. By default, it’s calculated from your training data, which often works well.
  2. var_smoothing: This is a stability parameter that you rarely need to adjust (the default is 1e-9).

The key takeaway is that this algorithm is designed to work well out of the box. In most situations, you can use it without worrying about parameter tuning.
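If you ever do need to override the defaults, both parameters are set at construction time. The values below are purely illustrative:

from sklearn.naive_bayes import GaussianNB

# Illustrative values only: a fixed 70/30 prior (in the sorted order of
# the class labels) and a slightly larger smoothing term.
gnb_custom = GaussianNB(priors=[0.7, 0.3], var_smoothing=1e-8)
gnb_custom.fit(X_train_transformed, y_train)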

Pros:

  1. Simplicity: Maintains the easy-to-implement and understand trait.
  2. Efficiency: Remains swift in training and prediction, making it suitable for large-scale applications with continuous features.
  3. Flexibility with Data: Handles both small and large datasets well, adapting to the scale of the problem at hand.
  4. Continuous Feature Handling: Thrives with continuous and real-valued features, making it ideal for tasks like predicting real-valued outputs or working with data where features vary on a continuum.

Cons:

  1. Independence Assumption: Still assumes that features are conditionally independent given the class, which might not hold in all real-world scenarios.
  2. Gaussian Distribution Assumption: Works best when feature values truly follow a normal distribution. Non-normal distributions may lead to suboptimal performance (though this can often be mitigated with the Power Transformation we discussed).
  3. Sensitivity to Outliers: Can be significantly affected by outliers in the training data, as they skew the mean and variance calculations.

Gaussian Naive Bayes stands as an efficient classifier for a wide range of applications involving continuous data. Its ability to handle real-valued features extends its use beyond binary classification tasks, making it a go-to choice for numerous applications.

While it makes some assumptions about data (feature independence and normal distribution), when these conditions are met, it gives robust performance, making it a favorite among both beginners and seasoned data scientists for its balance of simplicity and power.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_dict = {
'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(dataset_dict)

# Prepare data for model
X, y = df.drop('Play', axis=1), (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

# Apply PowerTransformer
pt = PowerTransformer(standardize=True)
X_train_transformed = pt.fit_transform(X_train)
X_test_transformed = pt.transform(X_test)

# Train the model
nb_clf = GaussianNB()
nb_clf.fit(X_train_transformed, y_train)

# Make predictions
y_pred = nb_clf.predict(X_test_transformed)

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")




10Oct

Building 5 Machine Learning Models: From Simplicity to Optimization


Building, comparing, and optimizing models.

Model Selection

Now we are moving to the second part of our project on Machine Learning Model Selection in Multivariate Analysis with Anonymized Data.

This second part is where the glamour comes in — predictive modeling, machine learning. Everyone is eager to jump straight into building machine learning models. I get that, and I feel the same excitement because I love this stage.

But before we get there, we must go through data processing — which is exactly what we covered in the previous tutorial.

We begin by installing the XGBoost package, one of the favorites among those who participate in Machine Learning competitions on the Kaggle platform.

# This package does not come with Anaconda and needs to be installed
!pip install -q xgboost

This package doesn’t come with Anaconda, so you need to install it separately. To…




09Oct

Implementing Sequential Algorithms on TPU | by Chaim Rand | Oct, 2024


Accelerating AI/ML Model Training with Custom Operators — Part 3.A

Photo by Bernd Dittrich on Unsplash

This is a direct sequel to a previous post on the topic of implementing custom TPU operations with Pallas. Of particular interest are custom kernels that leverage the unique properties of the TPU architecture in a manner that optimizes runtime performance. In this post, we will attempt to demonstrate this opportunity by applying the power of Pallas to the challenge of running sequential algorithms that are interspersed within a predominantly parallelizable deep learning (DL) workload.

We will focus on Non-Maximum Suppression (NMS) of bounding-box proposals as a representative algorithm, and explore ways to optimize its implementation. An important component of computer vision (CV) object detection solutions (e.g., Mask RCNN), NMS is commonly used to filter out overlapping bounding boxes, keeping only the “best” ones. NMS receives a list of bounding box proposals, an associated list of scores, and an IOU threshold, and proceeds to greedily and iteratively choose the remaining box with the highest score and disqualify all other boxes with which it has an IOU that exceeds the given threshold. The fact that the box chosen at the n-th iteration depends on the preceding n-1 steps of the algorithm dictates the sequential nature of its implementation. Please see here and/or here for more on the rationale behind NMS and its implementation. Although we have chosen to focus on one specific algorithm, most of our discussion should carry over to other sequential algorithms.

Offloading Sequential Algorithms to CPU

The presence of a sequential algorithm within a predominantly parallelizable ML model (e.g., Mask R-CNN) presents an interesting challenge. While GPUs, commonly used for such workloads, excel at executing parallel operations like matrix multiplication, they can significantly underperform compared to CPUs when handling sequential algorithms. This often leads to computation graphs that include crossovers between the GPU and CPU, where the GPU handles the parallel operations and the CPU handles the sequential ones. NMS is a prime example of a sequential algorithm that is commonly offloaded onto the CPU. In fact, a close analysis of torchvision’s “CUDA” implementation of NMS, reveals that even it runs a significant portion of the algorithm on CPU.

Although offloading sequential operations to the CPU may lead to improved runtime performance, there are several potential drawbacks to consider:

  1. Cross-device execution between the CPU and GPU usually requires multiple points of synchronization between the devices, which commonly results in idle time on the GPU while it waits for the CPU to complete its tasks. Given that the GPU is typically the most expensive component of the training platform, our goal is to minimize such idle time.
  2. In standard ML workflows, the CPU is responsible for preparing and feeding data to the model, which resides on the GPU. If the data input pipeline involves compute-intensive processing, this can strain the CPU, leading to “input starvation” on the GPU. In such scenarios, offloading portions of the model’s computation to the CPU could further exacerbate this issue.

To avoid these drawbacks, you could consider alternative approaches, such as replacing the sequential algorithm with a comparable alternative (e.g., the one suggested here), settling for a slow/suboptimal GPU implementation of the sequential algorithm, or running the workload on CPU, each of which comes with its own potential trade-offs.

Sequential Algorithms on TPU

This is where the unique architecture of the TPU could present an opportunity. Unlike GPUs, TPUs are sequential processors. While their ability to run highly vectorized operations makes them competitive with GPUs when running parallelizable operations such as matrix multiplication, their sequential nature could make them uniquely suited for running ML workloads that include a mix of both sequential and parallel components. Armed with the Pallas extension to JAX, our newfound TPU kernel creation tool, we will explore this opportunity by implementing and evaluating a custom implementation of NMS for TPU.

Disclaimers

The NMS implementations we will share below are intended for demonstrative purposes only. We have not made any significant effort to optimize them or to verify their robustness, durability, or accuracy. Please keep in mind that, as of the time of this writing, Pallas is an experimental feature — still under active development. The code we share (based on JAX version 0.4.32) may become outdated by the time you read this. Be sure to refer to the most up-to-date APIs and resources available for your Pallas development. Please do not view our mention of any algorithm, library, or API as an endorsement for their use.

We begin with a simple implementation of NMS in numpy that will serve as a baseline for performance comparison:

import numpy as np

def nms_cpu(boxes, scores, max_output_size, threshold=0.1):
    epsilon = 1e-5

    # Convert bounding boxes and scores to numpy
    boxes = np.array(boxes)
    scores = np.array(scores)

    # Coordinates of bounding boxes
    start_x = boxes[:, 0]
    start_y = boxes[:, 1]
    end_x = boxes[:, 2]
    end_y = boxes[:, 3]

    # Compute areas of bounding boxes
    areas = (end_x - start_x) * (end_y - start_y)

    # Sort by confidence score of bounding boxes (ascending)
    order = np.argsort(scores)

    # Picked bounding boxes
    picked_boxes = []

    # Iterate over bounding boxes
    while order.size > 0 and len(picked_boxes) < max_output_size:

        # The index of the remaining box with the highest score
        index = order[-1]

        # Pick the bounding box with the largest confidence score
        picked_boxes.append(index.item())

        # Compute coordinates of intersection
        x1 = np.maximum(start_x[index], start_x[order[:-1]])
        x2 = np.minimum(end_x[index], end_x[order[:-1]])
        y1 = np.maximum(start_y[index], start_y[order[:-1]])
        y2 = np.minimum(end_y[index], end_y[order[:-1]])

        # Compute areas of intersection and union
        w = np.maximum(x2 - x1, 0.0)
        h = np.maximum(y2 - y1, 0.0)

        intersection = w * h
        union = areas[index] + areas[order[:-1]] - intersection

        # Compute the ratio between intersection and union
        ratio = intersection / np.clip(union, epsilon, None)

        # Discard boxes with overlap above the threshold
        keep = np.where(ratio < threshold)[0]
        order = order[keep]

    return picked_boxes

To evaluate the performance of our NMS function, we generate a batch of random boxes and scores (as JAX tensors) and run the script on a Google Cloud TPU v5e system using the same environment and same benchmarking utility as in our previous post. For this experiment, we specify the CPU as the JAX default device:

import jax
from jax import random
import jax.numpy as jnp

def generate_random_boxes(run_on_cpu=False):
    if run_on_cpu:
        jax.config.update('jax_default_device', jax.devices('cpu')[0])
    else:
        jax.config.update('jax_default_device', jax.devices('tpu')[0])

    n_boxes = 1024
    img_size = 1024

    k1, k2, k3 = random.split(random.key(0), 3)

    # Randomly generate box sizes and positions
    box_sizes = random.randint(k1,
                               shape=(n_boxes, 2),
                               minval=1,
                               maxval=img_size)
    top_left = random.randint(k2,
                              shape=(n_boxes, 2),
                              minval=0,
                              maxval=img_size - 1)
    bottom_right = jnp.clip(top_left + box_sizes, 0, img_size - 1)

    # Concatenate top-left and bottom-right coordinates
    rand_boxes = jnp.concatenate((top_left, bottom_right),
                                 axis=1).astype(jnp.bfloat16)
    rand_scores = jax.random.uniform(k3,
                                     shape=(n_boxes,),
                                     minval=0.0,
                                     maxval=1.0)

    return rand_boxes, rand_scores

rand_boxes, rand_scores = generate_random_boxes(run_on_cpu=True)

time = benchmark(nms_cpu)(rand_boxes, rand_scores, max_output_size=128)
print(f'nms_cpu: {time}')

The resultant average runtime is 2.99 milliseconds. Note the assumption that the input and output tensors reside on the CPU. If they are on the TPU, then the time to copy them between the devices should also be taken into consideration.
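The benchmark utility comes from the previous post in the series. For readers who want to reproduce the measurements, a minimal stand-in might look like the sketch below; this is a simple wall-clock average, not the original utility:

from time import perf_counter
import jax

def benchmark(f, warmup=3, iters=100):
    # Returns a wrapped function that reports mean runtime in milliseconds.
    def timed(*args, **kwargs):
        for _ in range(warmup):
            jax.block_until_ready(f(*args, **kwargs))  # compile and warm up
        start = perf_counter()
        for _ in range(iters):
            jax.block_until_ready(f(*args, **kwargs))
        return (perf_counter() - start) / iters * 1e3
    return timed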

If our NMS function is a component within a larger computation graph running on the TPU, we might prefer a TPU-compatible implementation to avoid the drawbacks of cross-device execution. The code block below contains a JAX implementation of NMS specifically designed to enable acceleration via JIT compilation. Denoting the number of boxes by N, we begin by calculating the IOU between each of the N(N-1) pairs of boxes and preparing an NxN boolean tensor (mask_threshold) where the (i,j)-th entry indicates whether the IOU between boxes i and j exceeds the predefined threshold.

To simplify the iterative selection of boxes, we create a copy of the mask tensor (mask_threshold2) where the diagonal elements are zeroed to prevent a box from suppressing itself. We further define two score-tracking tensors: out_scores, which retains the scores of the chosen boxes (and zeros the scores of the eliminated ones), and remaining_scores, which maintains the scores of the boxes still being considered. We then use the jax.lax.while_loop function to iteratively choose boxes while updating the out_scores and remaining_scores tensors. Note that the format of the output of this function differs from the previous function and may need to be adjusted to fit into subsequent steps of the computation graph.

import functools

# Given N boxes, calculates mask_threshold, an NxN boolean mask
# where the (i,j) entry indicates whether the IOU of boxes i and j
# exceeds the threshold. Also returns mask_threshold2, which is
# equivalent to mask_threshold with a zeroed diagonal, and the
# scores shifted so that all values are greater than 0.
def init_tensors(boxes, scores, threshold=0.1):
    epsilon = 1e-5

    # Extract left, top, right, bottom coordinates
    left = boxes[:, 0]
    top = boxes[:, 1]
    right = boxes[:, 2]
    bottom = boxes[:, 3]

    # Compute areas of boxes
    areas = (right - left) * (bottom - top)

    # Calculate intersection points
    inter_l = jnp.maximum(left[None, :], left[:, None])
    inter_t = jnp.maximum(top[None, :], top[:, None])
    inter_r = jnp.minimum(right[None, :], right[:, None])
    inter_b = jnp.minimum(bottom[None, :], bottom[:, None])

    # Width, height, and area of the intersection
    inter_w = jnp.clip(inter_r - inter_l, 0)
    inter_h = jnp.clip(inter_b - inter_t, 0)
    inter_area = inter_w * inter_h

    # Union of the areas
    union = areas[None, :] + areas[:, None] - inter_area

    # IoU calculation
    iou = inter_area / jnp.clip(union, epsilon)

    # Shift scores to be greater than zero
    out_scores = scores - jnp.min(scores) + epsilon

    # Create mask based on IoU threshold
    mask_threshold = iou > threshold

    # Create mask excluding diagonal (i.e., self IoU is ignored)
    mask_threshold2 = mask_threshold * (1 - jnp.eye(mask_threshold.shape[0],
                                                    dtype=mask_threshold.dtype))

    return mask_threshold, mask_threshold2, out_scores

@functools.partial(jax.jit, static_argnames=['max_output_size', 'threshold'])
def nms_jax(boxes, scores, max_output_size, threshold=0.1):
    # Initialize mask and score tensors
    mask_threshold, mask_threshold2, out_scores = init_tensors(boxes,
                                                               scores,
                                                               threshold)

    # The out_scores tensor will retain the scores of the chosen boxes
    # and zero the scores of the eliminated ones.
    # remaining_scores will maintain non-zero scores for boxes that
    # have not been chosen or eliminated.
    remaining_scores = out_scores.copy()

    def choose_box(state):
        i, remaining_scores, out_scores = state
        # Choose the index of the box with the highest remaining score
        index = jnp.argmax(remaining_scores)
        # Check the validity of the chosen box
        valid = remaining_scores[index] > 0
        # If valid, zero all scores with IOU greater than threshold
        # (including the chosen index)
        remaining_scores = jnp.where(mask_threshold[index] * valid,
                                     0,
                                     remaining_scores)
        # Zero the scores of the eliminated boxes (not including
        # the chosen index)
        out_scores = jnp.where(mask_threshold2[index] * valid,
                               0,
                               out_scores)

        i = i + 1
        return i, remaining_scores, out_scores

    def cond_fun(state):
        i, _, _ = state
        return i < max_output_size

    i = 0
    state = (i, remaining_scores, out_scores)

    _, _, out_scores = jax.lax.while_loop(cond_fun, choose_box, state)

    # Output the resultant scores. To extract the chosen boxes,
    # take the max_output_size highest scores:
    # min = jnp.minimum(jnp.count_nonzero(scores), max_output_size)
    # indexes = jnp.argsort(out_scores, descending=True)[:min]
    return out_scores

# nms_jax can be run on either the CPU or the TPU
rand_boxes, rand_scores = generate_random_boxes(run_on_cpu=True)

time = benchmark(nms_jax)(rand_boxes, rand_scores, max_output_size=128)
print(f'nms_jax on CPU: {time}')

rand_boxes, rand_scores = generate_random_boxes(run_on_cpu=False)

time = benchmark(nms_jax)(rand_boxes, rand_scores, max_output_size=128)
print(f'nms_jax on TPU: {time}')

The runtimes of this implementation of NMS are 1.231 and 0.416 milliseconds on CPU and TPU, respectively.

We now present a custom implementation of NMS in which we explicitly leverage the fact that on TPUs Pallas kernels are executed in a sequential manner. Our implementation uses two boolean matrix masks and two score-keeping tensors, similar to the approach in our previous function.

We define a kernel function, choose_box, responsible for selecting the next box and updating the score-keeping tensors, which are maintained in scratch memory. We invoke the kernel across a one-dimensional grid where the number of steps (i.e., the grid-size) is determined by the max_output_size parameter.

Note that due to some limitations (as of the time of this writing) on the operations supported by Pallas, some acrobatics are required to implement both the “argmax” function and the validity check for the selected boxes. For the sake of brevity, we omit the technical details and refer the interested reader to the comments in the code below.

from jax.experimental import pallas as pl
from jax.experimental.pallas import tpu as pltpu

# argmax helper function
def pallas_argmax(scores, n_boxes):
    # We assume that the index of each box is stored in the
    # least significant bits of the score (see below)
    idx = jnp.max(scores.astype(float)).astype(int) % n_boxes
    return idx

# Pallas kernel definition
def choose_box(scores, thresh_mask1, thresh_mask2, ret_scores,
               scores_scratch, remaining_scores_scratch, *, nsteps, n_boxes):
    # Initialize scratch memory on the first step
    @pl.when(pl.program_id(0) == 0)
    def _():
        scores_scratch[...] = scores[...]
        remaining_scores_scratch[...] = scores[...]

    remaining_scores = remaining_scores_scratch[...]

    # Choose a box
    idx = pallas_argmax(remaining_scores, n_boxes)

    # We use any() to verify the validity of the chosen box due
    # to limitations on indexing in Pallas
    valid = (remaining_scores > 0).any()

    # Update the score tensors
    remaining_scores_scratch[...] = jnp.where(thresh_mask1[idx, ...] * valid,
                                              0,
                                              remaining_scores)
    scores_scratch[...] = jnp.where(thresh_mask2[idx, ...] * valid,
                                    0,
                                    scores_scratch[...])

    # Set the return value on the final step
    @pl.when(pl.program_id(0) == nsteps - 1)
    def _():
        ret_scores[...] = scores_scratch[...]

@functools.partial(jax.jit, static_argnames=['max_output_size', 'threshold'])
def nms_pallas(boxes, scores, max_output_size, threshold=0.1):
    n_boxes = scores.size
    mask_threshold, mask_threshold2, scores = init_tensors(boxes,
                                                           scores,
                                                           threshold)

    # In order to work around the Pallas argsort limitation
    # we create a new scores tensor with the same ordering of
    # the input scores tensor in which the index of each score
    # in the ordering is encoded in the least significant bits
    sorted = jnp.argsort(scores, descending=True)

    # Descending integers: n_boxes-1, ..., 2, 1, 0
    descending = jnp.flip(jnp.arange(n_boxes))

    # New scores in descending order with the least significant
    # bits carrying the argsort of the input scores
    ordered_scores = n_boxes * descending + sorted

    # New scores with the same ordering as the input scores
    scores = jnp.empty_like(ordered_scores).at[sorted].set(ordered_scores)

    grid = (max_output_size,)
    return pl.pallas_call(
        functools.partial(choose_box,
                          nsteps=max_output_size,
                          n_boxes=n_boxes),
        grid_spec=pltpu.PrefetchScalarGridSpec(
            num_scalar_prefetch=0,
            in_specs=[
                pl.BlockSpec(block_shape=(n_boxes,)),
                pl.BlockSpec(block_shape=(n_boxes, n_boxes)),
                pl.BlockSpec(block_shape=(n_boxes, n_boxes)),
            ],
            out_specs=pl.BlockSpec(block_shape=(n_boxes,)),
            scratch_shapes=[pltpu.VMEM((n_boxes,), scores.dtype),
                            pltpu.VMEM((n_boxes,), scores.dtype)],
            grid=grid,
        ),
        out_shape=jax.ShapeDtypeStruct((n_boxes,), scores.dtype),
        compiler_params=dict(mosaic=dict(
            dimension_semantics=("arbitrary",)))
    )(scores, mask_threshold, mask_threshold2)

rand_boxes, rand_scores = generate_random_boxes(run_on_cpu=False)

time = benchmark(nms_pallas)(rand_boxes, rand_scores, max_output_size=128)
print(f'nms_pallas: {time}')

The average runtime of our custom NMS operator is 0.139 milliseconds, making it roughly three times faster than our JAX-native implementation. This result highlights the potential of tailoring the implementation of sequential algorithms to the unique properties of the TPU architecture.

Note that in our Pallas kernel implementation, we load the full input tensors into TPU VMEM memory. Given the limited capacity of VMEM, scaling up the input size (i.e., increasing the number of bounding boxes) will likely lead to memory issues. Typically, such limitations can be addressed by chunking the inputs with BlockSpecs. Unfortunately, applying this approach would break the current NMS implementation. Implementing NMS across input chunks would require a different design, which is beyond the scope of this post.

The results of our experiments are summarized in the table below:

Results of NMS experiments (lower is better) — by Author

These results demonstrate the potential for running full ML computation graphs on TPU, even when they include sequential components. The performance improvement demonstrated by our Pallas NMS operator, in particular, highlights the opportunity of customizing kernels in a way that leverages the TPU's strengths.

In our previous post we learned of the opportunity for building custom TPU operators using the Pallas extension for JAX. Maximizing this opportunity requires tailoring the kernel implementations to the specific properties of the TPU architecture. In this post, we focused on the sequential nature of the TPU processor and its use in optimizing a custom NMS kernel. While scaling the solution to support an unrestricted number of bounding boxes would require further work, the core principles we have discussed remain applicable.

Pallas is still in the experimental phase of its development, and some limitations remain that may require creative workarounds. But its strength and potential are clearly evident, and we anticipate that they will only increase as the framework matures.



