
Market Basket Analysis Using High Utility Itemset Mining | by Laurin Brechter | Sep, 2024


Finding high-value patterns in transactions

In this post, I will give an alternative to popular techniques in market basket analysis that can help practitioners find high-value patterns rather than just the most frequent ones. We will gain some intuition into different pattern mining problems and look at a real-world example. The full code can be found here. All images are created by the author.

I have written a more introductory article about pattern mining already; if you’re not familiar with some of the concepts that come up here, feel free to check that one out first.

In short, pattern mining tries to find patterns in data (duuh). Most of the time, this data comes in the form of (multi-)sets or sequences. In my last article, for example, I looked at the sequence of actions that a user performs on a website. In this case, we would care about the ordering of the items.

In other cases, such as the one we will discuss below, we do not care about the ordering of the items. We only list all the items that were in the transaction and how often they appeared.

Example Transaction Database

So for example, transaction 1 contained đŸ„Ș 3 times and 🍎 once. As we see, we lose information about the ordering of the items, but in many scenarios (as the one we will discuss below), there is no logical ordering of the items. This is similar to a bag of words in NLP.

Market Basket Analysis (MBA) is a data analysis technique commonly used in retail and marketing to uncover relationships between products that customers tend to purchase together. It aims to identify patterns in customers’ shopping baskets or transactions by analyzing their purchasing behavior. The central idea is to understand the co-occurrence of items in shopping transactions, which helps businesses optimize their strategies for product placement, cross-selling, and targeted marketing campaigns.

Frequent Itemset Mining (FIM) is the process of finding frequent patterns in transaction databases. We can look at the frequency of a pattern (i.e. a set of items) by calculating its support. In other words, the support of a pattern X is the number of transactions T that contain X (and are in the database D). That is, we are simply looking at how often the pattern X appears in the database.

Definition of the support.

In FIM, we then want to find all the itemsets that have a support bigger than some threshold (often called minsup). If the support of an itemset is higher than minsup, it is considered frequent.
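To make the definition concrete, here is a minimal sketch (not from the original article) that computes support over a toy transaction database. Transactions 1 and 3 mirror the worked example from the figures; transaction 2 and the minsup value are made up for illustration.

# Illustrative sketch: computing the support of an itemset over a toy transaction DB.
# Transactions 1 and 3 follow the article's worked example; transaction 2 is made up.
transactions = [
    {"đŸ„Ș", "🍎"},        # transaction 1 (quantities are ignored in plain FIM)
    {"đŸ„Ș"},              # transaction 2 (illustrative)
    {"đŸ„Ș", "🩞", "🍎"},  # transaction 3
]

def support(itemset: set, database: list) -> int:
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in database if itemset.issubset(t))

minsup = 2
pattern = {"đŸ„Ș", "🍎"}
print(support(pattern, transactions) >= minsup)  # True -> the pattern is frequent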

Limitations

In FIM, we only look at the existence of an item in a transaction. That is, whether an item appears two times or 200 times does not matter; we simply represent it as a one. But we often have cases (such as MBA) where not only the existence of an item in a transaction is relevant but also how many times it appeared in the transaction.

Another problem is that frequency does not always imply relevance. In that sense, FIM assumes that all items in the transaction are equally important. However, it is reasonable to assume that someone buying caviar might be more important for a business than someone buying bread, as caviar is potentially a high ROI/profit item.

These limitations directly bring us to High Utility Itemset Mining (HUIM) and High Utility Quantitative Itemset Mining (HUQIM) which are generalizations of FIM that try to address some of the problems of normal FIM.

Our first generalization is that items can appear more than once in a transaction (i.e. we have a multiset instead of a simple set). As said before, in normal itemset mining, we transform the transaction into a set and only look at whether the item exists in the transaction or not. So for example the two transactions below would have the same representation.

t1 = [a,a,a,a,a,b] # repr. as {a,b} in FIM
t2 = [a,b] # repr. as {a,b} in FIM

Above, both transactions would be represented as {a,b} in regular FIM. We quickly see that, in some cases, we could miss important details. For example, if a and b were items in a customer’s shopping cart, it would matter a lot whether we have a (e.g. a loaf of bread) five times or only once. Therefore, we represent the transaction as a multiset in which we write down how many times each item appeared.

# multiset representation
t1_ms = {(a,5),(b,1)}
t2_ms = {(a,1),(b,1)}

This is also efficient if an item can appear a large number of times (e.g. 100 or 1,000 times). In that case, we need not write down all the a’s or b’s but simply how often they appear.

The generalization that both the quantitative and non-quantitative methods make is to assign every item in the transaction a utility (e.g. profit or time). Below, we have a table that assigns every possible item a unit profit.

Utility of Items

We can then calculate the utility of a specific pattern such as {đŸ„Ș, 🍎} by summing up the utility of those items in the transactions that contain them. In our example we would have:

(3 đŸ„Ș * $1 + 1 🍎 * $2) + (1 đŸ„Ș * $1 + 2 🍎 * $2) = $5 + $5 = $10

Transaction Database from Above

So, we get that this pattern has a utility of $10. With FIM, we had the task of finding frequent patterns. Now, we have to find patterns with high utility. This is mainly because we assume that frequency does not imply importance. In regular FIM, we might have missed rare (infrequent) patterns that provide a high utility (e.g. the diamond); HUIM is designed to surface exactly those patterns.

We also need to define the notion of a transaction utility. This is simply the sum of the utility of all the items in the transaction. For our transaction 3 in the database, this would be

1đŸ„Ș * $1 + 2🩞*$10 + 2🍎*$2 = $25
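Here is a small illustrative sketch of both calculations (pattern utility and transaction utility). The unit profits and transactions 1 and 3 follow the worked example above; transaction 2 is made up.

# Illustrative sketch of pattern utility and transaction utility.
# Unit profits and transactions 1/3 follow the article's example; transaction 2 is made up.
unit_profit = {"đŸ„Ș": 1, "🍎": 2, "🩞": 10}

# multiset transactions: item -> quantity
transactions = [
    {"đŸ„Ș": 3, "🍎": 1},           # transaction 1
    {"đŸ„Ș": 1},                    # transaction 2 (illustrative)
    {"đŸ„Ș": 1, "🩞": 2, "🍎": 2},  # transaction 3
]

def pattern_utility(pattern: set, db: list) -> int:
    """Sum the pattern's item utilities over all transactions that contain the whole pattern."""
    total = 0
    for t in db:
        if pattern.issubset(t):
            total += sum(t[i] * unit_profit[i] for i in pattern)
    return total

def transaction_utility(t: dict) -> int:
    """Utility of all items in a single transaction."""
    return sum(q * unit_profit[i] for i, q in t.items())

print(pattern_utility({"đŸ„Ș", "🍎"}, transactions))  # 10, matching the worked example
print(transaction_utility(transactions[2]))          # 25, transaction 3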

Note that solving this problem and finding all high-utility itemsets is more difficult than regular FIM. This is because the utility does not follow the Apriori property.

The Apriori Property

Let X and Y be two patterns occurring in a transaction database D. The apriori property says that if X is a subset of Y, then the support of X must be at least as big as Y’s.

Apriori property.

This means that if a subset of Y is infrequent, Y itself must be infrequent, since its support can be at most that of the subset. Let’s say we have X = {a} and Y = {a,b}. If Y appears 4 times in our database, then X must appear at least 4 times, since X is a subset of Y. This makes sense since we are making the pattern less general / more specific by adding an item, which means that it will fit fewer transactions. This property is used in most algorithms as it implies that if {a} is infrequent, all supersets are also infrequent and we can eliminate them from the search space [3].

This property does not hold when we are talking about utility. A superset Y of an itemset X could have more or less utility. If we take the example from above, {đŸ„Ș} has a utility of $4. But this does not mean we cannot look at supersets of this pattern. For example, the superset we looked at, {đŸ„Ș, 🍎}, has a higher utility of $10. At the same time, a superset of a pattern won’t always have more utility since it might be that this superset just doesn’t appear very often in the DB.

Idea Behind HUIM

Since we can’t use the apriori property for HUIM directly, we have to come up with some other upper bound for narrowing down the search space. One such bound is called Transaction-Weighted Utilization (TWU). To calculate it, we sum up the transaction utility of the transactions that contain the pattern X of interest. Any superset Y of X can’t have a higher utility than the TWU of X. Let’s make this clearer with an example. The TWU of {đŸ„Ș,🍎} is $30 ($5 from transaction 1 and $25 from transaction 3). When we look at a superset pattern Y such as {đŸ„Ș 🩞 🍎}, we can see that there is no way its utility could exceed this TWU, since every transaction that contains Y also contains X.
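A small sketch of the TWU computation on the same toy data as before (again, transaction 2 is made up for illustration):

# Illustrative sketch: Transaction-Weighted Utilization (TWU) of a pattern on the toy data above.
unit_profit = {"đŸ„Ș": 1, "🍎": 2, "🩞": 10}
transactions = [
    {"đŸ„Ș": 3, "🍎": 1},           # transaction 1
    {"đŸ„Ș": 1},                    # transaction 2 (illustrative)
    {"đŸ„Ș": 1, "🩞": 2, "🍎": 2},  # transaction 3
]

def transaction_utility(t: dict) -> int:
    return sum(q * unit_profit[i] for i, q in t.items())

def twu(pattern: set, db: list) -> int:
    """Upper bound: sum of transaction utilities of all transactions containing the pattern."""
    return sum(transaction_utility(t) for t in db if pattern.issubset(t))

print(twu({"đŸ„Ș", "🍎"}, transactions))  # 30 = $5 (transaction 1) + $25 (transaction 3)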

There are now various algorithms for solving HUIM. All of them receive a minimum utility and produce the patterns that have at least that utility as their output. In this case, I have used the EFIM algorithm since it is fast and memory efficient.

For this article, I will work with the Market Basket Analysis dataset from Kaggle (used with permission from the original dataset author).

Above, we can see the distribution of transaction values found in the data. There is a total of around 19,500 transactions with an average transaction value of $526 and 26 distinct items per transaction. In total, there are around 4000 unique items. We can also make an ABC analysis where we put items into different buckets depending on their share of total revenue. We can see that around 500 of the 4000 items make up around 70% of the revenue (A-items). We then have a long right-tail of items (around 2250) that make up around 5% of the revenue (C-items).

Preprocessing

The initial data is in a long format where each row is a line item within a bill. From the BillNo we can see to which transaction the item belongs.

Initial Data Format

After some preprocessing, we get the data into the format required by PAMI, which is the Python library we are going to use for applying the EFIM algorithm.

data['item_id'] = pd.factorize(data.Itemname)[0].astype(str) # map item names to id
data["Value_Int"] = data["Value"].astype(int).astype(str)
data = data.loc[data.Value_Int != '0'] # exclude items w/o utility

transaction_db = data.groupby('BillNo').agg(
    items=('item_id', lambda x: ' '.join(list(x))),
    num_items=('item_id', 'count'),  # number of items per transaction, needed for the filter below
    total_value=('Value', lambda x: int(x.sum())),
    values=('Value_Int', lambda x: ' '.join(list(x))),
)

# filter out long transactions, only use a subset of transactions
# (the cutoff and sample size below are illustrative; the original line was truncated)
transaction_db = transaction_db.loc[transaction_db.num_items < 10].head(10000)

Transaction Database

We can then apply the EFIM algorithm.

import PAMI.highUtilityPattern.basic.EFIM as efim 

obj = efim.EFIM('tdb.csv', minUtil=1000, sep=' ')
obj.startMine() #start the mining process
obj.save('out.txt') #store the patterns in file
results = obj.getPatternsAsDataFrame() #Get the patterns discovered into a dataframe
obj.printResults()

The algorithm then returns a list of patterns that meet this minimum utility criterion.





Logistic Regression, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Sep, 2024


CLASSIFICATION ALGORITHM

Finding the perfect weights to fit the data in

While some probabilistic-based machine learning models (like Naive Bayes) make bold assumptions about feature independence, logistic regression takes a more measured approach. Think of it as drawing a line (or plane) that separates two outcomes, allowing us to predict probabilities with a bit more flexibility.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Logistic regression is a statistical method used for predicting binary outcomes. Despite its name, it’s used for classification rather than regression. It estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class; otherwise, it predicts the other class.

Throughout this article, we’ll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Just like in KNN, logistic regression requires the data to be scaled first. Convert categorical columns into 0 & 1 and also scale the numerical features so that no single feature dominates the weight updates during training.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature). The categorical columns (Outlook & Wind) are encoded (Outlook using one-hot encoding, Wind as 0/1), while the numerical columns are scaled using standard scaling (z-normalization).
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Create dataset from dictionary
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Split data into features and target
X, y = df.drop(columns='Play'), df['Play']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Print results
print("Training set:")
print(pd.concat([X_train, y_train], axis=1), '\n')
print("Test set:")
print(pd.concat([X_test, y_test], axis=1))

Logistic regression works by applying the logistic function to a linear combination of the input features. Here’s how it operates:

  1. Calculate a weighted sum of the input features (similar to linear regression).
  2. Apply the logistic function (also called sigmoid function) to this sum, which maps any real number to a value between 0 and 1.
  3. Interpret this value as the probability of belonging to the positive class.
  4. Use a threshold (typically 0.5) to make the final classification decision.
For our golf dataset, logistic regression might combine the weather factors into a single score, then transform this score into a probability of playing golf.
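Here is a minimal sketch of those four steps on a single made-up feature row; the weights, bias, and feature values are illustrative, not fitted values from this dataset.

import numpy as np

# Illustrative forward pass of logistic regression; weights and input values are made up.
weights = np.array([0.1, -0.3, 0.5, 0.2, -0.4, 0.1])  # one weight per feature
bias = 0.05
x = np.array([1, 0, 0, 0.8, -1.2, 1])  # one scaled row: sunny, overcast, rainy, Temperature, Humidity, Wind

z = np.dot(weights, x) + bias          # 1. weighted sum of the input features
p = 1 / (1 + np.exp(-z))               # 2. sigmoid maps z to a value between 0 and 1
print(f"probability of playing golf: {p:.3f}")  # 3. interpret as P(class = 1)
print("prediction:", int(p >= 0.5))             # 4. threshold at 0.5 for the final decision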

The training process for logistic regression involves finding the best weights for the input features. Here’s the general outline:

  1. Initialize the weights (often to small random values).
# Initialize weights (including bias) to 0.1
# (X_train_np, with a bias column prepended, is built in the next code snippet)
initial_weights = np.full(X_train_np.shape[1], 0.1)

# Display the initial weights
print(f"Initial Weights: {initial_weights}")

2. For each training example:
a. Calculate the predicted probability using the current weights.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def calculate_probabilities(X, weights):
    z = np.dot(X, weights)
    return sigmoid(z)

def calculate_log_loss(probabilities, y):
    return -y * np.log(probabilities) - (1 - y) * np.log(1 - probabilities)

def create_output_dataframe(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)

    df = pd.DataFrame({
        'Probability': probabilities,
        'Label': y,
        'Log Loss': log_losses
    })

    return df

def calculate_average_log_loss(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)
    return np.mean(log_losses)

# Convert X_train and y_train to numpy arrays for easier computation
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

# Add a column of 1s to X_train_np for the bias term
X_train_np = np.column_stack((np.ones(X_train_np.shape[0]), X_train_np))

# Create and display DataFrame for initial weights
initial_df = create_output_dataframe(X_train_np, y_train_np, initial_weights)
print(initial_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print(f"\nAverage Log Loss: {calculate_average_log_loss(X_train_np, y_train_np, initial_weights):.6f}")

b. Compare this probability to the actual class label by calculating its log loss.

3. Update the weights to minimize the loss, usually with an optimization algorithm such as gradient descent. This involves repeating Step 2 until the log loss stops decreasing.

def gradient_descent_step(X, y, weights, learning_rate):
    m = len(y)
    probabilities = calculate_probabilities(X, weights)
    gradient = np.dot(X.T, (probabilities - y)) / m
    new_weights = weights - learning_rate * gradient  # Create new array for updated weights
    return new_weights

# Perform one step of gradient descent (one of the simplest optimization algorithms)
learning_rate = 0.1
updated_weights = gradient_descent_step(X_train_np, y_train_np, initial_weights, learning_rate)

# Print initial and updated weights
print("\nInitial weights:")
for feature, weight in zip(['Bias'] + list(X_train.columns), initial_weights):
    print(f"{feature:11}: {weight:.2f}")

print("\nUpdated weights after one iteration:")
for feature, weight in zip(['Bias'] + list(X_train.columns), updated_weights):
    print(f"{feature:11}: {weight:.2f}")

# With sklearn, you can get the final weights (coefficients)
# and final bias (intercepts) easily.
# The result is almost the same as doing it manually above.

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(penalty=None, solver='saga')
lr_clf.fit(X_train, y_train)

coefficients = lr_clf.coef_
intercept = lr_clf.intercept_

y_train_prob = lr_clf.predict_proba(X_train)[:, 1]
loss = -np.mean(y_train * np.log(y_train_prob) + (1 - y_train) * np.log(1 - y_train_prob))

print(f"Weights & Bias Final: {coefficients[0].round(2)}, {round(intercept[0],2)}")
print("Loss Final:", loss.round(3))

Once the model is trained:
1. For a new instance, calculate the probability with the final weights (also called coefficients), just like during the training step.

2. Interpret the output probability: if p ≄ 0.5, predict class 1; otherwise, predict class 0.

# Calculate prediction probability
predicted_probs = lr_clf.predict_proba(X_test)[:, 1]

z_values = np.log(predicted_probs / (1 - predicted_probs))

result_df = pd.DataFrame({
'ID': X_test.index,
'Z-Values': z_values.round(3),
'Probabilities': predicted_probs.round(3)
}).set_index('ID')

print(result_df)

# Make predictions
y_pred = lr_clf.predict(X_test)
print(y_pred)

Evaluation Step

result_df = pd.DataFrame({
'ID': X_test.index,
'Label': y_test,
'Probabilities': predicted_probs.round(2),
'Prediction': y_pred,
}).set_index('ID')

print(result_df)

Logistic regression has several important parameters that control its behavior:

1. Penalty: The type of regularization to use (‘l1’, ‘l2’, ‘elasticnet’, or None). Regularization in logistic regression prevents overfitting by adding a penalty term to the model’s loss function that encourages simpler models.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

regs = [None, 'l1', 'l2']
coeff_dict = {}

for reg in regs:
    lr_clf = LogisticRegression(penalty=reg, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_
    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[reg] = {
        'Coefficients': coefficients,
        'Intercept': intercept,
        'Loss': loss,
        'Accuracy': accuracy
    }

for reg, vals in coeff_dict.items():
    print(f"{reg}: Coeff: {vals['Coefficients'][0].round(2)}, Intercept: {vals['Intercept'].round(2)}, Loss: {vals['Loss'].round(3)}, Accuracy: {vals['Accuracy'].round(3)}")

2. Regularization Strength (C): Controls the trade-off between fitting the training data and keeping the model simple. A smaller C means stronger regularization.

# List of regularization strengths to try for L1
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l1', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)

    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L1_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0],2),
        'Loss': round(loss,3),
        'Accuracy': round(accuracy*100,2)
    }

print(pd.DataFrame(coeff_dict).T)

# List of regularization strengths to try for L2
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l2', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L2_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0],2),
        'Loss': round(loss,3),
        'Accuracy': round(accuracy*100,2)
    }

print(pd.DataFrame(coeff_dict).T)

3. Solver: The algorithm to use for optimization (‘liblinear’, ‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’). Some penalties are only supported by particular solvers; for example, ‘l1’ requires ‘liblinear’ or ‘saga’, and ‘elasticnet’ requires ‘saga’.

4. Max Iterations: The maximum number of iterations for the solver to converge.

For our golf dataset, we might start with ‘l2’ penalty, ‘liblinear’ solver, and C=1.0 as a baseline.
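As a sketch, that baseline could look like the following, reusing X_train/X_test from the earlier snippets; max_iter=100 is scikit-learn's default and is written out only for completeness.

from sklearn.linear_model import LogisticRegression

# Baseline configuration suggested above; the values shown are the starting point, not tuned ones.
baseline_clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=100)
baseline_clf.fit(X_train, y_train)
print(baseline_clf.score(X_test, y_test))  # accuracy on the held-out set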

Like any algorithm in machine learning, logistic regression has its strengths and limitations.

Pros:

  1. Simplicity: Easy to implement and understand.
  2. Interpretability: The weights directly show the importance of each feature.
  3. Efficiency: Doesn’t require too much computational power.
  4. Probabilistic Output: Provides probabilities rather than just classifications.

Cons:

  1. Linearity Assumption: Assumes a linear relationship between features and log-odds of the outcome.
  2. Feature Independence: Assumes features are not highly correlated.
  3. Limited Complexity: May underfit in cases where the decision boundary is highly non-linear.
  4. Requires More Data: Needs a relatively large sample size for stable results.

In our golf example, logistic regression might provide a clear, interpretable model of how each weather factor influences the decision to play golf. However, it might struggle if the decision involves complex interactions between weather conditions that can’t be captured by a linear model.

Logistic regression shines as a powerful yet straightforward classification tool. It stands out for its ability to handle complex data while remaining easy to interpret. Unlike some other basic models, it provides smooth probability estimates and works well with many features. In the real world, from predicting customer behavior to medical diagnoses, logistic regression often performs surprisingly well. It’s not just a stepping stone — it’s a reliable model that can match more complex models in many situations.

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
lr_clf = LogisticRegression(penalty='l2', C=1, solver='saga')
lr_clf.fit(X_train, y_train)

# Make predictions
y_pred = lr_clf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")





Cultivated Foie Gras flies into Europe – prepare for legal disruption · European Law Blog


Introduction

The opening of the Paris Olympics on 26 July 2024 coincided with a potentially significant development in one of France’s most renowned gastronomic traditions. The French cultivated meat startup, Gourmey, applied for a novel food authorization for its cultivated foie gras in the EU, as well as in Singapore, Switzerland, the United Kingdom, and the United States. This first-ever authorization procedure for a cultivated meat product in the EU represents a potential “legal disruption” and warrants close attention from both policy and research communities.

Cultivated meat, produced from animal cells grown in controlled environments outside of animals, is hailed as a potential solution to the numerous environmental and ethical challenges posed by conventional meat production. At the same time, scientific, sustainability and regulatory challenges are well regarded. Notwithstanding, the first commercial cultivated meat product was introduced in Singapore in 2020, following regulatory approval granted to the American company “Eat Just” for chicken nuggets partially composed of cultivated cells. As of May 2024, these nuggets are also available in Singaporean retail stores. Meanwhile, several other cultivated meat products have received approval in Singapore, the United States, and Israel. Regulators in various jurisdictions are actively collaborating with innovators to establish pathways to market for these products.

The EU has not yet taken a leading role in the regulation of cultivated meat. Most food innovations are governed by the EU’s novel food framework, defined by Regulation (EU) No 2015/2283, and cultured meat is no exception. The novel food framework is praised for its robustness but criticized for hindering innovation due to lengthy and demanding procedures. Innovators also fear political interference in the authorization process, as cultivated meat faces intense political backlash in several EU Member States. Since 2020, for example, the French legislature has repeatedly attempted to prohibit the use of meat-related terms for alternative protein products. In November 2023, the Italian government adopted Law No. 172/2023 prohibiting the production and commercialization of cultivated meat. Article 1 provides that the ban is necessary to:

“ensure the protection of human health and citizens’ interests as well as preserve the agri-food heritage, as a set of products that are an expression of the socio-economic and cultural evolution process of Italy, of strategic importance for the national interest”.

Policymakers in Poland and Romania have expressed similar intentions, and the governments of these sceptical countries are proposing revisions to the novel food framework at the EU level.

Disruptive Potential for Novel Food Framework and Animal Welfare

Legal disruption occurs when new technologies challenge the applicability and suitability of existing regulatory frameworks. In our view, the authorization procedure for cultivated foie gras could trigger such disruption concerning the novel food framework, food labelling regulations, and animal welfare laws.

First and foremost, the authorization procedure at the EU level will test the Commission’s claim that the existing novel food framework is adequate for handling such applications. The European Food Safety Authority (EFSA) has recently taken several steps to engage with stakeholders in the field of cellular agriculture through a Scientific Colloquium on cell culture-derived foods and food ingredients. It has promised specific guidelines for submitting dossiers on cultivated meat products, which are expected to be included in the new general guidance for novel food applications to be published in September 2024. This application will illustrate whether these steps effectively address the concerns of the cellular agriculture industry.

The parallel filing of applications in Singapore, Switzerland, the UK and the United States will also enable a comparative assessment of regulatory regimes and potentially expedite regulatory cooperation. While all countries share the fundamental objective of ensuring food safety, their specific authorization procedures, approval times and transparency requirements vary significantly. Different countries exhibit varying levels of risk acceptance when it comes to food innovations. For instance, Singapore aims to position itself as a regulatory pioneer to attract innovators because it views cultivated meat and novel foods as essential to achieve the objectives of the national food security strategy ‘30 by 30’, aiming to produce locally 30% of the country’s nutritional needs by 2030.

In this context, the foie gras application may influence the EU’s stance on radical food innovation more broadly. Whilst the EU’s novel food framework primarily focuses on food safety, the political discourse on cultivated meat encompasses additional aspects. Legislative efforts in France and Italy reflect concerns about agriculture, rural development, and the right to informed consumer choices. An EU novel food authorization would challenge the effectiveness of such national legislation and compel stakeholders to defend the (perceived) interests of conventional animal production at the EU level.

Unlike most dairy and meat products, foie gras is a “luxury” product that is already highly controversial. It has been the subject of heated political debate and regulatory action. The process of force-feeding geese to enlarge their livers has been banned in several countries, including more than half of the EU Member States, and some countries have started banning foie gras imports. Protecting its conventional producers is unlikely to garner broad public support.

Au contraire, the authorization of cultivated foie gras could even spur advancements in animal welfare regulation. Animal welfare is enshrined as a principle of EU primary law in Article 13 TFEU. For animals kept for farming purposes, this translates into Article 3 of Directive 98/58/EC, according to which ‘Member States shall [
] ensure that the owners or keepers take all reasonable steps to ensure the welfare of animals under their care and to ensure that those animals are not caused any unnecessary pain, suffering or injury’. The availability of cultivated alternatives to animal products alters the trade-offs implied by this rule. The authorization of cultivated foie gras could thus reshape the regulatory debate on biotechnological food innovation in general. Until now, opponents have argued for consideration of the broader socio-economic implications of innovative products during the authorization process, assuming this would justify limitations and prohibitions. However, considerations of animal welfare (or other aspects such as working conditions, one health, or ecological impacts) could support the urgent approval of such products.

Conclusion

The first novel food application concerning cultivated meat in Europe is now a reality. Gourmey’s focus on foie gras, as a controversial and high-value luxury item, appears to be a smart strategy given the polarized political debate on cultivated meat in Europe. This move should prompt French and other European policymakers to reconsider their positions and potentially reinvent one of their most recognizable food delicacies. The timing of the application’s publication on the opening day of the Paris Olympics 2024 may have limited broader public scrutiny, but this should not deter food innovation scholars from carefully monitoring its development.





ILPC Annual Conference: AI and Power: Regulating Risk and Rights (21-22 November 2024) · European Law Blog


We are looking for high-quality contributions exploring how best to regulate and govern the use of AI systems across society (including generative AI and other automated decision-making and data-driven systems), particularly their implications for human rights and the responsibilities of organisations.

Papers should address the development and future of regulation, policymaking, and governance within the United Kingdom, Europe, and/or internationally. Interdisciplinary and cross-sector papers are welcomed.

The conference organisers would like to encourage submissions from Early Career Researchers and post-doctoral researchers who have been awarded their PhD within the past five years.

The ILPC Annual Conference will include the ILPC Annual Lecture 2024, and we are delighted to announce that this will be delivered by world-leading scholar danah boyd.

The event is hosted by the Institute of Advanced Legal Studies (IALS) and supported by the School of Advanced Study, University of London (SAS).

Further information can be found at:

https://ials.sas.ac.uk/sites/default/files/institute_advanced_legal_studies/Call%20for%20Papers%20%282024%29.pdf





ERDAL: “Towards ‘oversight by design’? Legal foundations for effective oversight in automated public administration” · European Law Blog


The scientific open access journal “ERDAL”, European Review of Digital Administration & Law, is pleased to announce its upcoming special issue on the topic of oversight of automated public administration.

Topic areas for contributions could include:

  • Human oversight requirements and challenges in relation to fully or semi-automated decision-making procedures in public administration.

  • The relationship between oversight requirements stemming from traditional principles of administrative law and those stemming from regulatory frameworks aimed at automated decision-making or other facets of automated administration.

  • Impact assessment requirements and challenges across legal frameworks at national and European regulatory levels.

  • Transparency requirements and challenges linked to the enabling of effective monitoring or supervision of automated procedures within the administration, by courts, by supervisory bodies or by affected individuals.

  • AI-literacy requirements and challenges in relation to public administration personnel, including case managers and decision-makers across hierarchical levels within the administration.

  • The utilisation of automated procedures in supervising other automated procedures.

  • Oversight measures from the perspective of mitigating biases and ensuring equitable accessibility in digital self-service systems, particularly for vulnerable citizen groups such as the elderly and disabled.

  • National and EU-level administrative supervisory infrastructures over automated processes as well as cross-border cooperative structures for oversight over AI systems.

  • Other topics related to oversight over automated public administration.

Under the thematic heading of “Towards ‘oversight by design’? Legal foundations for effective oversight in automated public administration”, we invite submissions that critically examine these issues. Contributions may encompass theoretical analyses, empirical studies, case studies, and policy proposals aimed at advancing our understanding of oversight and associated legal frameworks that aim to uphold fundamental principles of justice, accountability, and public trust in automated public administration practices.

SUBMISSION DETAILS

  • We encourage authors to reach out to our editorial group to confirm if the topic of your proposed article aligns with the focus of the special issue.

  • The contributions should be previously unpublished scientific papers.

  • Contributions should follow the ERDAL style guidelines (to be found here).

  • Papers should be between 8 000 and 15 000 words in length.

  • Contributions must be submitted by 15 July 2025.

  • Contributions should be submitted in two versions: one with author details included and one anonymised version.

  • Erdal uses a procedure of double-blind peer review (further information may be found here).

  • All submissions should be sent to: [email protected]

Best regards from the editorial group:

Sen. Lect. Dr. Lena Enqvist, UmeÄ University, Sweden, [email protected] Associate editor of ERDAL and co-editor for this special issue

Prof. Dr. Hanne Marie Motzfeldt, University of Copenhagen, Denmark, [email protected] Associate editor of ERDAL and co-editor for this special issue

Prof. Dr. Markku Suksi, Åbo Akademi University, Finland, [email protected] Associate editor of ERDAL and co-editor for this special issue





Python QuickStart for People Learning AI | by Shaw Talebi | Sep, 2024


Many computers come with Python pre-installed. To see if your machine has it, go to your Terminal (Mac/Linux) or Command Prompt (Windows), and simply enter “python”.

Using Python in Terminal. Image by author.

If you don’t see a screen like this, you can download Python manually (Windows/Mac). Alternatively, one can install Anaconda, a popular Python package system for AI and data science. If you run into installation issues, ask your favorite AI assistant for help!

With Python running, we can now start writing some code. I recommend running the examples on your computer as we go along. You can also download all the example code from the GitHub repo.

Strings & Numbers

A data type (or just “type”) is a way to classify data so that it can be processed appropriately and efficiently in a computer.

Types are defined by a possible set of values and operations. For example, strings are arbitrary character sequences (i.e. text) that can be manipulated in specific ways. Try the following strings in your command line Python instance.

"this is a string"
>> 'this is a string'
'so is this:-1*!@&04"(*&^}":>?'
>> 'so is this:-1*!@&04"(*&^}":>?'
"""and
this is
too!!11!"""
>> 'and\n this is\n too!!11!'
"we can even " + "add strings together"
>> 'we can even add strings together'

Although strings can be added together (i.e. concatenated), they can’t be added to numerical data types like int (i.e. integers) or float (i.e. numbers with decimals). If we try that in Python, we will get an error message because operations are only defined for compatible types.

# we can't add strings to other data types (BTW this is how you write comments in Python)
"I am " + 29
>> TypeError: can only concatenate str (not "int") to str
# so we have to write 29 as a string
"I am " + "29"
>> 'I am 29'

Lists & Dictionaries

Beyond the basic types of strings, ints, and floats, Python has types for structuring larger collections of data.

One such type is a list, an ordered collection of values. We can have lists of strings, numbers, strings + numbers, or even lists of lists.

# a list of strings
["a", "b", "c"]

# a list of ints
[1, 2, 3]

# list with a string, int, and float
["a", 2, 3.14]

# a list of lists
[["a", "b"], [1, 2], [1.0, 2.0]]

Another core data type is a dictionary, which consists of key-value pair sequences where keys are strings and values can be any data type. This is a great way to represent data with multiple attributes.

# a dictionary
{"Name":"Shaw"}

# a dictionary with multiple key-value pairs
{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]}

# a list of dictionaries
[{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
{"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}]

# a nested dictionary
{"User":{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
"Last_login":"2024-09-06",
"Membership_Tier":"Free"}

So far, we’ve seen some basic Python data types and operations. However, we are still missing an essential feature: variables.

Variables provide an abstract representation of an underlying data type instance. For example, I might create a variable called user_name, which represents a string containing my name, “Shaw.” This enables us to write flexible programs not limited to specific values.

# creating a variable and printing it
user_name = "Shaw"
print(user_name)

#>> Shaw

We can do the same thing with other data types e.g. ints and lists.

# defining more variables and printing them as a formatted string. 
user_age = 29
user_interests = ["AI", "Music", "Bread"]

print(f"{user_name} is {user_age} years old. His interests include {user_interests}.")

#>> Shaw is 29 years old. His interests include ['AI', 'Music', 'Bread'].

Now that our example code snippets are getting longer, let’s see how to create our first script. This is how we write and execute more sophisticated programs from the command line.

To do that, create a new folder on your computer. I’ll call mine python-quickstart. If you have a favorite IDE (i.e., an Integrated Development Environment), use that to open this new folder and create a new Python file, e.g., my-script.py. There, we can write the ceremonial “Hello, world” program.

# ceremonial first program
print("Hello, world!")

If you don’t have an IDE (not recommended), you can use a basic text editor (e.g. Apple’s TextEdit, Windows’ Notepad). In those cases, you can open the text editor and save a new text file using the .py extension instead of .txt. Note: If you use TextEdit on Mac, you may need to put the application in plain text mode via Format > Make Plain Text.

We can then run this script using the Terminal (Mac/Linux) or Command Prompt (Windows) by navigating to the folder with our new Python file and running the following command.

python my-script.py

Congrats! You ran your first Python script. Feel free to expand this program by copy-pasting the upcoming code examples and rerunning the script to see their outputs.

Two fundamental functionalities of Python (or any other programming language) are loops and conditions.

Loops allow us to run a particular chunk of code multiple times. The most popular is the for loop, which runs the same code while iterating over a variable.

# a simple for loop iterating over a sequence of numbers
for i in range(5):
    print(i)  # print ith element

# for loop iterating over a list
user_interests = ["AI", "Music", "Bread"]

for interest in user_interests:
    print(interest)  # print each item in list

# for loop iterating over items in a dictionary
user_dict = {"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]}

for key in user_dict.keys():
    print(key, "=", user_dict[key])  # print each key and corresponding value

The other core function is conditions, such as if-else statements, which enable us to program logic. For example, we may want to check if the user is an adult or evaluate their wisdom.

# check if user is 18 or older
if user_dict["Age"] >= 18:
    print("User is an adult")

# check if user is 1000 or older, if not print they have much to learn
if user_dict["Age"] >= 1000:
    print("User is wise")
else:
    print("User has much to learn")

It’s common to use conditionals within for loops to apply different operations based on specific conditions, such as counting the number of users interested in bread.

# count the number of users interested in bread
user_list = [{"Name":"Shaw", "Age":29, "Interests":["AI", "Music", "Bread"]},
             {"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}]
count = 0  # initialize count

for user in user_list:
    if "Bread" in user["Interests"]:
        count = count + 1  # update count

print(count, "user(s) interested in Bread")

Functions are operations we can perform on specific data types.

We’ve already seen a basic function print(), which is defined for any datatype. However, there are a few other handy ones worth knowing.

# print(), a function we've used several times already
for key in user_dict.keys():
    print(key, ":", user_dict[key])

# type(), getting the data type of a variable
for key in user_dict.keys():
    print(key, ":", type(user_dict[key]))

# len(), getting the length of a variable
for key in user_dict.keys():
    print(key, ":", len(user_dict[key]))
# TypeError: object of type 'int' has no len()

We see that, unlike print() and type(), len() is not defined for all data types, so it throws an error when applied to an int. There are several other type-specific functions like this.

# string methods
# --------------
# make string all lowercase
print(user_dict["Name"].lower())

# make string all uppercase
print(user_dict["Name"].upper())

# split string into list based on a specific character sequence
print(user_dict["Name"].split("ha"))

# replace a character sequence with another
print(user_dict["Name"].replace("w", "whin"))

# list methods
# ------------
# add an element to the end of a list
user_dict["Interests"].append("Entrepreneurship")
print(user_dict["Interests"])

# remove a specific element from a list
user_dict["Interests"].pop(0)
print(user_dict["Interests"])

# insert an element into a specific place in a list
user_dict["Interests"].insert(1, "AI")
print(user_dict["Interests"])

# dict methods
# ------------
# accessing dict keys
print(user_dict.keys())

# accessing dict values
print(user_dict.values())

# accessing dict items
print(user_dict.items())

# removing a key
user_dict.pop("Name")
print(user_dict.items())

# adding a key
user_dict["Name"] = "Shaw"
print(user_dict.items())

While the core Python functions are helpful, the real power comes from creating user-defined functions to perform custom operations. Additionally, custom functions allow us to write much cleaner code. For example, here are some of the previous code snippets repackaged as user-defined functions.

# define a custom function
def user_description(user_dict):
    """
    Function to return a sentence (string) describing input user
    """
    return f'{user_dict["Name"]} is {user_dict["Age"]} years old and is interested in {user_dict["Interests"][0]}.'

# print user description
description = user_description(user_dict)
print(description)

# print description for a new user!
new_user_dict = {"Name":"Ify", "Age":27, "Interests":["Marketing", "YouTube", "Shopping"]}
print(user_description(new_user_dict))

# define another custom function
def interested_user_count(user_list, topic):
    """
    Function to count number of users interested in an arbitrary topic
    """
    count = 0

    for user in user_list:
        if topic in user["Interests"]:
            count = count + 1

    return count

# define user list and topic
user_list = [user_dict, new_user_dict]
topic = "Shopping"

# compute interested user count and print it
count = interested_user_count(user_list, topic)
print(f"{count} user(s) interested in {topic}")

Although we could implement an arbitrary program using core Python, this can be incredibly time-consuming for some use cases. One of Python’s key benefits is its vibrant developer community and a robust ecosystem of software packages. Almost anything you might want to implement with core Python (probably) already exists as an open-source library.

We can install such packages using Python’s native package manager, pip. To install new packages, we run pip commands from the command line. Here is how we can install numpy, an essential data science library that implements basic mathematical objects and operations.

pip install numpy

After we’ve installed numpy, we can import it into a new Python script and use some of its data types and functions.

import numpy as np

# create a "vector"
v = np.array([1, 3, 6])
print(v)

# multiply a "vector"
print(2*v)

# create a matrix
X = np.array([v, 2*v, v/2])
print(X)

# matrix-vector multiplication (note: X*v would be element-wise broadcasting, not matrix multiplication)
print(X @ v)

The previous pip command added numpy to our base Python environment. Alternatively, it’s a best practice to create so-called virtual environments. These are collections of Python libraries that can be readily interchanged for different projects.

Here’s how to create a new virtual environment called my-env.

python -m venv my-env

Then, we can activate it.

# mac/linux
source my-env/bin/activate

# windows
.\my-env\Scripts\activate.bat

Finally, we can install new libraries, such as numpy, using pip.

pip install numpy

Note: If you’re using Anaconda, check out this handy cheatsheet for creating a new conda environment.

Several other libraries are commonly used in AI and data science. Here is a non-comprehensive overview of some helpful ones for building AI projects.

A non-comprehensive overview of Python libs for data science and AI. Image by author.

Now that we have been exposed to the basics of Python, let’s see how we can use it to implement a simple AI project. Here, I will use the OpenAI API to create a research paper summarizer and keyword extractor.

Like all the other snippets in this guide, the example code is available at the GitHub repository.
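As a rough sketch of the kind of API call involved, here is a minimal example using the openai Python client; the model name, prompt wording, and paper_text placeholder are illustrative assumptions, not the project's actual implementation (see the repo for that).

# Rough sketch of a summarize-and-extract-keywords call via the OpenAI API.
# The model name, prompt, and paper_text are placeholders; see the linked repo for the real code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
paper_text = "..."  # abstract or full text of a research paper goes here

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You summarize research papers and extract keywords."},
        {"role": "user", "content": f"Summarize this paper in 3 sentences, then list 5 keywords:\n\n{paper_text}"},
    ],
)
print(response.choices[0].message.content)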





Small Language Models Supporting Large Language Models | by Cobus Greyling | Sep, 2024


Consider the image above, which demonstrates hallucination detection with an LLM as a constrained reasoner (a code sketch of this flow follows the list below):


  ‱ Initial Detection: Grounding sources and hypothesis pairs are input into a small language model (SLM) classifier.
  ‱ No Hallucination: If no hallucination is detected, the “no hallucination” result is sent directly to the client.
  ‱ Hallucination Detected: If the SLM detects a hallucination, an LLM-based constrained reasoner steps in to interpret the SLM’s decision.
  ‱ Alignment Check: If the reasoner agrees with the SLM’s hallucination detection, this information, along with the original hypothesis, is sent to the client.
  ‱ Discrepancy: If there’s a disagreement, the potentially problematic hypothesis is either filtered out or used as feedback to improve the SLM.
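As promised above, here is a minimal Python sketch of this flow. The three helper functions are trivial stubs standing in for real models and logging infrastructure; none of this is the paper's actual implementation.

# Illustrative sketch of the SLM-detector + LLM-constrained-reasoner flow described above.
# slm_classifier, llm_reason, and log_for_slm_feedback are stubs, not the paper's code.
def slm_classifier(sources, hypothesis):
    return hypothesis not in sources  # stub: flag a "hallucination" if the hypothesis isn't in the sources

def llm_reason(sources, hypothesis):
    return {"agrees_with_slm": True, "explanation": "Hypothesis is not supported by the grounding sources."}

def log_for_slm_feedback(sources, hypothesis, reasoning):
    pass  # stub: collect the case for later SLM fine-tuning

def handle_response(grounding_sources, hypothesis):
    # Initial detection: the SLM classifies the (grounding sources, hypothesis) pair
    if not slm_classifier(grounding_sources, hypothesis):
        return {"hallucination": False, "hypothesis": hypothesis}  # sent directly to the client

    # Hallucination detected: the LLM-based constrained reasoner interprets the SLM's decision
    reasoning = llm_reason(grounding_sources, hypothesis)

    # Alignment check: the reasoner agrees with the SLM's detection
    if reasoning["agrees_with_slm"]:
        return {"hallucination": True, "explanation": reasoning["explanation"], "hypothesis": hypothesis}

    # Discrepancy: filter the hypothesis out and/or use the case as feedback to improve the SLM
    log_for_slm_feedback(grounding_sources, hypothesis, reasoning)
    return {"hallucination": None, "hypothesis": None}

print(handle_response("The sky is blue.", "The sky is green."))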

Given the infrequent occurrence of hallucinations in practical use, the average time and cost of using LLMs for reasoning on hallucinated texts is manageable.

This approach leverages the existing reasoning and explanation capabilities of LLMs, eliminating the need for substantial domain-specific data and costly fine-tuning.

While LLMs have traditionally been used as end-to-end solutions, recent approaches have explored their ability to explain small classifiers through latent features.

We propose a novel workflow to address this challenge by balancing latency and interpretability. ~ Source

One challenge of this implementation is the possible delta between the SLM’s decisions and the LLM’s explanations.


  • This work introduces a constrained reasoner for hallucination detection, balancing latency and interpretability.
  • Provides a comprehensive analysis of upstream-downstream consistency.
  • Offers practical solutions to improve alignment between detection and explanation.
  • Demonstrates effectiveness on multiple open-source datasets.

If you find any of my observations to be inaccurate, please feel free to let me know
🙂

  • I appreciate that this study focuses on introducing guardrails & checks for conversational UIs.
  • When interacting with real users, incorporating a human-in-the-loop approach helps with data annotation and continuous improvement by reviewing conversations.
  • It also adds an element of discovery, observation and interpretation, providing insights into the effectiveness of hallucination detection.
  • The architecture presented in this study offers a glimpse into the future, showcasing a more orchestrated approach where multiple models work together.
  • The study also addresses current challenges like cost, latency, and the need to critically evaluate any additional overhead.
  • Using small language models is advantageous as it allows for the use of open-source models, which reduces costs, offers hosting flexibility, and provides other benefits.
  • Additionally, this architecture can be applied asynchronously, where the framework reviews conversations after they occur. These human-supervised reviews can then be used to fine-tune the SLM or perform system updates.


I’m currently the Chief Evangelist @ Kore.ai. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.






The History & Future of Prompt Engineering | by Cobus Greyling | Sep, 2024


Prompt #1: You are a scholar in machine learning and language models. I am writing a paper on the history of prompt engineering and generation. Can you give me a timeline for prompt engineering evolution? (We used this timeline to create prompts for each section later)

Prompt #2: Write the introduction of this paper. Emphasize that this paper focuses on how language prompts and queries have been used so far.

Prompt #3: Now generate history of prompting or querying in early language models and information retrieval

Prompt #4: Now write the history between 2010 and 2015 before attention mechanism was invented

Prompt #5: now write a section on how attention mechanism changed the future of prompt engineering in 2015

Prompt #6: now discuss how the advent of reinforcement learning techniques in 2017 changed the prompt engineering

Prompt #7: (a) Now write the section on developments in prompt engineering in 2019. (b) Now can you rewrite the section on developments in prompt engineering in 2019? Please organize your thoughts in paragraphs instead of bullet points.

Prompt #8: (a) now write the section for 2020 and 2021 in prompt engineering (b) now rewrite the section for 2020 and 2021 in prompt engineering? Please organize your thoughts in paragraphs instead of bullet points

Prompt #9: (a) Can you now write a section on 2022 and 2023 on advanced prompt techniques? (b) can you write the section on 2022 and 2023 on advanced prompt techniques in paragraphs instead of bullet points?





What do we know so far? · European Law Blog


Since C-300/21 Österreichische Post, the first ECJ decision on non-material damages under GDPR, the ECJ has handed down multiple other decisions on the topic (C-340/21 Natsionalna agentsia za prihodite, C-667/21 Krankenversicherung Nordrhein, C-456/22 Gemeinde Ummendorf and C‑687/21 MediaMarktSaturn). There seems to be a marked effort by the Court to create a reliable jurisprudence for non-material damages. In fact, all the decisions have been assigned to and decided by the Third Chamber under Article 60 of the Rules of Procedure of the Court of Justice. This post analyses the subsequent cases after Österreichische Post to flesh out the Court’s conception of non-material damages under Article 82 GDPR and to analyse whether a coherent approach emerges from the case law.

Requirements

Based on Article 82(2) GDPR, the Court delineates three cumulative elements for non-material damages (Österreichische Post at 36, Natsionalna agentsia za prihodite at 77, Gemeinde Ummendorf at 14, Krankenversicherung Nordrhein at 82 and MediaMarktSaturn at 58):

  1. Infringement of the GDPR

  2. Damage

  3. A causal link between the infringement and damage

Once these three elements are in place, a controller is liable for the non-material damage and must compensate the claimant in accordance with Article 82(1) GDPR.

(1) Infringement

As per Article 82 GDPR, a controller has to compensate for damage which arose as a consequence of an infringement of the GDPR (Österreichische Post at 31). However, an infringement alone is insufficient to confer a right to compensation (MediaMarktSaturn at 58, Österreichische Post at 33 and 34). This is because the three elements are cumulative (as seen above).

An infringement of the GDPR cannot be established simply from the fact that there was, for example, a data breach (MediaMarktSaturn at 45). According to MediaMarktSaturn, a court hearing an action for damages under Article 82 must also take into account all the evidence that the controller provides to demonstrate, for example, that its technical and organisational measures were sufficient and therefore complied with Articles 24 and 32 GDPR (MediaMarktSaturn at 44).

In other words, to ascertain whether an “infringement” occurred in a specific case, the Court seems to consider not only its factual consequences (i.e. whether the controller lost control over the personal data following a breach) but also whether that event is attributable to the controller in terms of intent or culpability (did the controller intend that event, or were they negligent in failing to adopt reasonable countermeasures?). It seems that a controller can rely on a lack of intent or negligence to argue against the alleged infringement. For example, if a breach occurred but the controller proved that it was not negligent and had the necessary technical and organisational measures in place, then there is arguably no infringement and a claim for damages would end here.

(2) Damage

Recital 85 to the GDPR provides a non-binding list of what could constitute material or non-material damage under the GDPR. It lists the following: ‘loss of control over [
] personal data, limitation of [
] rights, discrimination, identity theft or fraud, financial loss, unauthorised reversal of pseudonymisation, damage to reputation, loss of confidentiality of personal data protected by professional secrecy or any other significant economic or social disadvantage to the natural person concerned.’

The first item on this list – loss of control over personal data – has been clarified further and defined rather broadly by the ECJ. Fear deriving from the loss of control over personal data following an infringement of the GDPR is sufficient to give rise to non-material damages (Natsionalna agentsia za prihodite at 80). The period for which the claimant feels that fear can be short: in Gemeinde Ummendorf, a few days of fear, with no noticeable consequence for the claimant beyond the fear itself, were sufficient for non-material damages (Gemeinde Ummendorf at 22). This follows an earlier decision which, in doing away with a threshold of seriousness for non-material damages, allows all non-material damages, even if limited in scope, to lead to possible claims (Österreichische Post at 49). The fear itself is sufficient, as there is no requirement that the damage be linked to an actual misuse of the data by third parties by the time of the claim (Natsionalna agentsia za prihodite at 79). Nor does the claimant need to show that there has been a misuse to their detriment (Natsionalna agentsia za prihodite at 82 and Gemeinde Ummendorf at 22). Thus, it is sufficient that the breach of the GDPR be linked to the claimant’s fear that such misuse may occur in the future.

This is a broad reading of loss of control. As noted by AG Pitruzzella, the GDPR does not state that fear should create a ground for compensation for non-material damages (AG Opinion in C‑340/21 at 78). There is undoubtedly ‘a fine line between mere upset (which is not eligible for compensation) and genuine non-material damage (which is eligible for compensation)’ (AG Opinion in C‑340/21 at 83). The Court could have gone either way here, especially on facts such as those in Natsionalna agentsia za prihodite, where no misuse of the claimant’s personal data had been established and the claimant had suffered no harm beyond the fear of possible future misuse (AG Opinion in C‑340/21 at 77). Nonetheless, because the definition of damage should be ‘broad’ and allow for ‘full and effective’ compensation as per Recital 146 to the GDPR, AG Pitruzzella stated that the Court should hold the fear itself to be sufficient (AG Opinion in C‑340/21 at 71 and 77). Not only did the Court follow the AG’s Opinion at paragraph 81 of the judgment, but it has consistently referred to the broadness point of Recital 146 in its later non-material damages judgments (Gemeinde Ummendorf at 19 and 20 and MediaMarktSaturn at 65).

The ECJ did not, however, go as far as to establish a presumption that all infringements result in damage (cf. AG Opinion in C‑340/21 at 74). The claimant still needs to show consequences of the infringement (Österreichische Post at 50 and MediaMarktSaturn at 60). Thus, they must show that they have suffered actual damage, however minimal it may be (Gemeinde Ummendorf at 22). The burden of proof is also on the claimant to show this damage (MediaMarktSaturn at 61 and 68 and Natsionalna agentsia za prihodite at 84). This makes sense given that the claimant is the only one who has experienced the damage (for example, fear) and is in a position to prove it.

It is perhaps due to this logic that the ECJ (on the concept of loss of control) also stated that the fear must be ‘well-founded’ and that the risk cannot be hypothetical (MediaMarktSaturn at 67 and 68 and Natsionalna agentsia za prihodite at 85). While it is for national courts to determine whether these requirements are met (MediaMarktSaturn at 67 and 6), the ECJ nonetheless determined that the disclosure of data to a third party who never became aware of it would not give rise to non-material damages (MediaMarktSaturn at 69). In this case, it was clear that the risk was unfounded; the third party never became aware of the personal data during the breach and the document containing the data was returned within half an hour. So the fear linked to this so-called hypothetical risk proved insufficient for non-material damages. If the claimant cannot evidence damage as defined above, then the claim for damages will also end at this point.

(3) Causal link

A causal link must exist between the infringement and the damage (Österreichische Post at 32 and Article 82(1) GDPR). The Court has not yet developed this criterion in detail, but it can be inferred that the claimant should show some form of reasonable relationship between the infringement and their damage. If there is no causal link, it follows that there cannot be a right to compensation under Article 82 GDPR.

The fact that the damage was caused by a third party within the meaning of Article 4(10) GDPR, rather than by the controller itself, is not a limiting factor. Article 4(10) GDPR defines a third party as a person other than the data subject, the controller, the processor and persons who, under the ‘direct authority’ of the controller or processor, are authorised to process the data. The Court in Natsionalna agentsia za prihodite found hackers to be third parties under Article 4(10) GDPR (at 71). Thus, Article 4(10) has been interpreted broadly in that it does not require third parties to be employees of the controller or subject to its control (at 66). Nonetheless, for the third party’s act to be attributable to the controller, the controller must have made the infringement possible in the first place by failing to comply with its GDPR obligations, for example by failing to implement appropriate technical and organisational measures (at 71).

Defences

Liability is subject to fault on the part of the controller, which is presupposed unless the controller proves that it is ‘not in any way responsible’ for the event giving rise to the damage (MediaMarktSaturn at 52, Recital 146 to the GDPR, and Natsionalna agentsia za prihodite at 37 and 69). The circumstances in which the controller may claim to be exempt from civil liability under Article 82 GDPR are ‘strictly limited’ to those in which the controller is able to demonstrate that the damage is not attributable to it (Natsionalna agentsia za prihodite at 70). It is explicitly for the controller to rebut this presumption of fault (Krankenversicherung Nordrhein at 94 and also Natsionalna agentsia za prihodite at 69 and 70). This allocation of the burden of proof to the controller ensures that the effectiveness of the right to compensation under Article 82 GDPR is maintained (MediaMarktSaturn at 42).

Questions remain over what type of defence Article 82(3) is and how it relates more widely to the concept of non-material damages. For example, if liability (the link between the controller’s fault and the damage) is presupposed, does this mean that the causal link (between the infringement and the damage) is presupposed as well? Is Article 82(3) GDPR, therefore, a defence against causation or a separate general defence against liability? Moreover, does this presumption of fault also mean that intent or negligence should become a rebuttable presumption when deciding on an infringement? These are questions that will inevitably arise before the ECJ in the future. 

Compensation

Article 82(1) GDPR has a compensatory rather than punitive function (MediaMarktSaturn at 48). Compensation is limited to monetary compensation and should do no more than fully compensate for the damage suffered as a result of the infringement of the GDPR (Krankenversicherung Nordrhein at 84 to 87, Österreichische Post at 58 and MediaMarktSaturn at 54). It is because of this compensatory function that national courts should not look at the controller’s behaviour when quantifying non-material damages. The compensation is not affected by the degree of the controller’s responsibility, and it does not matter whether there was intent or negligence on the part of the controller (Krankenversicherung Nordrhein at 86, 87, and 102 and MediaMarktSaturn at 48).

Final compensation must be ‘full and effective’ (Recital 146 to the GDPR). This means that national rules must enable the claiming of compensation (Österreichische Post at 56). Nonetheless, it is for national courts to determine the exact amount of pecuniary damages in accordance with their national law (Krankenversicherung Nordrhein at 83 and 101), as long as the internal rules of the Member State follow the principles of equivalence and effectiveness of EU law (MediaMarktSaturn at 53).

Damages under the GDPR are conceptually autonomous, and therefore ‘special national’ interpretations, except as to the amount of compensation, should not occur (MediaMarktSaturn at 59). In general, the divergence or unity of GDPR damages compared with national law conceptions of damages will require a more detailed discussion than is possible within this blog post.

A coherent vision

The cases briefly analysed above reveal a coherent line of argumentation behind the non-material damages case law under Article 82 GDPR. The rulings do not radically diverge from each other, and the concepts developed are re-used, cross-referenced, and built upon. As more preliminary references arrive and non-material damages develop further, the Court could even begin to send some questions back to national courts under Article 99 (Reply by Reasoned Order) of the Rules of Procedure of the Court. This applies where the question referred is identical to one on which the Court has already ruled, or where the answer may be clearly deduced from existing case law.

A practical point to mention is that the definition of non-material damages is also likely to affect class action suits and collective redress. A broad interpretation of non-material damages could make data breaches exorbitantly expensive for controllers, to the point that they may no longer want to operate in Europe. Rather than restricting the concept of damages, a solution would be to avoid creating an impossibly high threshold for controllers and processors to prove that they have complied with the relevant Articles of the GDPR. It is perhaps for this reason that the Court has so far been reasonable with its thresholds and decided, for example, that unauthorised disclosure of personal data to third parties is not sufficient in itself to hold that Articles 24 and 32 GDPR have been infringed by the controller (MediaMarktSaturn at 40).

Material and non-material damages are well-defined concepts within national law, and so conflicts will inevitably occur between national systems and the GDPR. It is important that the ECJ maintain its coherent vision of non-material damages to create a uniform application of the GDPR and therefore protect the effectiveness of Articles 7 and 8 of the Charter of Fundamental Rights of the European Union and Article 16 of the Treaty on the Functioning of the European Union.



Source link
