06Aug


A glimpse into how neural networks perceive categoricals and their hierarchies


Industry data often contains non-numeric features with many possible values, for example zip codes, medical diagnosis codes, or preferred footwear brands. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.
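The usual neural-network approach is an entity embedding: each code is mapped to an integer index, and the index looks up a trainable dense vector. Here is a minimal sketch of that lookup in NumPy (the codes, dimensions, and random table are illustrative, not from the actual models in this series):

```python
import numpy as np

# Hypothetical example: map each categorical code (e.g. a zip code)
# to an integer index, then look up a dense vector for it.
codes = ["90210", "10001", "60614", "10001"]
vocab = {c: i for i, c in enumerate(dict.fromkeys(codes))}  # code -> index

rng = np.random.default_rng(0)
embedding_dim = 4
# In a real model this table is a trainable layer; here it is just random.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

indices = np.array([vocab[c] for c in codes])
vectors = embedding_table[indices]  # shape: (4, embedding_dim)
```

Repeated codes share the same row of the table, which is how the model learns one representation per code regardless of how often it appears.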

I’ve been writing a series of blog posts on methods for these features. Last episode, I showed how perturbing the training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].
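One simple form of this perturbation is to randomly replace a fraction of the code indices with a reserved "unknown" index during training, so the model learns a useful representation for codes it has never seen. A minimal sketch, assuming a reserved index 0 and a hypothetical `perturb_codes` helper (the replacement rate and batch are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
UNKNOWN = 0  # reserved embedding index for codes unseen at training time

def perturb_codes(indices, p=0.1, rng=rng):
    """Randomly replace a fraction p of code indices with UNKNOWN,
    so the network is forced to learn a fallback representation."""
    indices = np.asarray(indices)
    mask = rng.random(indices.shape) < p
    return np.where(mask, UNKNOWN, indices)

batch = np.array([3, 7, 2, 9, 5, 1])
perturbed = perturb_codes(batch, p=0.5)  # roughly half become UNKNOWN
```

At inference time, any code missing from the training vocabulary is simply mapped to `UNKNOWN`, and the model already knows what to do with it.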

In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!

Here, I use visualizations and SHAP values to “look under the hood” and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it’s cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.
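A common way to "look under the hood" like this is to project the learned embedding vectors down to two dimensions and scatter-plot the codes. Here is a sketch using PCA via SVD; the embedding matrix is random stand-in data, not the trained embeddings discussed in the post:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a trained embedding table: 50 codes, 8-dim vectors.
embeddings = rng.normal(size=(50, 8))

# Project to 2-D with PCA (via SVD) for a scatter plot of codes.
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T  # shape: (50, 2)
```

Codes that the network treats as similar land near each other in the 2-D plot, which is what makes it possible to watch clusters shift as the training data is perturbed.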

NAICS Codes
