26Nov

Data Operations Manager at Legal & General – London, United Kingdom


Company Description

The brand with the brolly is choosing today to change tomorrow.

Since 1836, we’ve grown to become one of the world’s largest asset managers, homebuilders, pension providers and insurance brands. 

We’re all here to improve the lives of our customers, build a better society for the long term, and create value for our shareholders – helping to shape a better future for society and the planet.

We need people who share our ambitions, agility and entrepreneurial spirit to help us do it.

At L&G, you’ll find a balance that helps you be your best. Empowered by hybrid working, we’re supported by technology and workplaces that enable us to work effectively wherever we are. We come together in offices to collaborate and connect, and use time at home for individual, focused activities. And, when we achieve great things, we celebrate our success and reward strong performance.

Today, there’s over 10,000 of us, working towards our mission, with plenty of opportunities to grow your career as we grow L&G. Will you join us?

Job Description

As the Data Operations Manager at Legal & General Affordable Homes, you will play a crucial role in overseeing our data transformation initiatives. This position requires a strategic approach to data management and a commitment to driving organisational efficiency through data-driven solutions.

Key responsibilities include:

  1. Designing and implementing robust data controls to ensure the highest standards of data quality across the organisation
  2. Managing and optimising our Salesforce CRM system, leveraging its capabilities to enhance data management and operational efficiency
  3. Developing and maintaining comprehensive data governance frameworks to ensure compliance and data integrity
  4. Leading multiple high-priority projects simultaneously, collaborating with a diverse, high-performing team to achieve organisational objectives
  5. Serving as a liaison between subject matter experts and business stakeholders, effectively translating complex data concepts into accessible, actionable insights
  6. Developing innovative solutions to address organisational challenges, utilising benchmarking methodologies, professional networking, and strategic idea sharing across departments

This role demands a thorough understanding of data management principles and the ability to apply them in a fast-paced, evolving business environment. The successful candidate will be instrumental in shaping our data strategy and driving our organisation forward in the digital age.

Qualifications

Ready to join our fantastic team? Here's what we're looking for in our future Data Operations Manager:

  • Extensive experience in data quality and controls, with a proven ability to design and implement robust data controls, structures, quality metrics, and advanced statistics and data visualisation techniques.

  • Proficiency in major CRM technology solutions, with Salesforce experience being essential for this role.

  • A comprehensive background blending operations experience (potentially including operational risk) with substantial exposure to data governance basics and practices.

  • In-depth knowledge of data structures, data marts, business intelligence systems, and data quality and governance capabilities.

  • Familiarity with data management technologies, including data catalogues and data quality management systems, as well as enterprise and Cloud technologies, is highly advantageous.

  • Thorough understanding of core business functions and associated data roles and responsibilities (owner, steward, custodian, consumer) to ensure compliance with stringent data quality and control requirements.

  • Exceptional communication skills, encompassing verbal, written, and energetic listening abilities, coupled with strong inter-personal and influencing capabilities. The ability to effectively bridge the gap between subject matter experts and business stakeholders through clear, concise, and user-friendly language is crucial.

  • Demonstrated passion and a curious mind for developing innovative solutions to complex organisational challenges, utilising benchmarking methodologies, professional networking, and the strategic sharing of creative ideas and empirical data across organisational boundaries.

So, if you’re ready to embark on a wild data adventure with Legal & General Affordable Homes, apply now! 

Additional Information

Legal & General is a leading financial services group and major global investor, named Britain’s Most Admired Company in 2023, for the second year running. Rated top in our sector and top for inspirational leadership, we have a strong heritage and an exciting future.

We aim to build a better society for the long term by investing our customers’ money in things that make life better for everyone.

If you join us, you’ll be part of a welcoming culture, with opportunities to collaborate with people of diverse backgrounds, views and experiences. Guided by leaders with integrity who care about your future and wellbeing. Empowered through initiatives which support people to develop their careers and excel.

We strive to be open, mindful and inclusive, so are always willing to discuss flexible working arrangements and reasonable accommodations for candidates with specific needs.

If you're open to finding out more, we'd love to hear from you.



Source link

25Nov

How Did We Go From Large Language Models to Large Behaviour Models? | by Cobus Greyling | Nov, 2024


Large Content & Behaviour Models (LBM / LCBM) integrate language and vision to simulate and understand both content (like text, images, or videos) and behaviours (such as user interactions or preferences).

It connects a language model with a visual encoder and is fine-tuned using datasets linking content to human behaviours.

For example, it analyses YouTube videos and their engagement metrics or Twitter posts and their interactions. LCBMs aim to predict behaviours based on content and vice versa, demonstrating both domain adaptation and generalisation.

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.



Source link

25Nov

Business Insights Analyst at TaskUs – PHL – Imus – Lizzy’s Nook


About TaskUs: TaskUs is a provider of outsourced digital services and next-generation customer experience to fast-growing technology companies, helping its clients represent, protect and grow their brands. Leveraging a cloud-based infrastructure, TaskUs serves clients in the fastest-growing sectors, including social media, e-commerce, gaming, streaming media, food delivery, ride-sharing, HiTech, FinTech, and HealthTech. 

The People First culture at TaskUs has enabled the company to expand its workforce to approximately 45,000 employees globally. Presently, we have a presence in twenty-three locations across twelve countries, which include the Philippines, India, and the United States.

It started with one ridiculously good idea to create a different breed of Business Process Outsourcing (BPO)! We at TaskUs understand that achieving growth for our partners requires a culture of constant motion, exploring new technologies, being ready to handle any challenge at a moment’s notice, and mastering consistency in an ever-changing world.

What We Offer: At TaskUs, we prioritize our employees’ well-being by offering competitive industry salaries and comprehensive benefits packages. Our commitment to a People First culture is reflected in the various departments we have established, including Total Rewards, Wellness, HR, and Diversity. We take pride in our inclusive environment and positive impact on the community. Moreover, we actively encourage internal mobility and professional growth at all stages of an employee’s career within TaskUs. Join our team today and experience firsthand our dedication to supporting People First.

What can you expect in a “Business Insight Analyst” role with TaskUs: 

A business insights analyst plays a crucial role in an organization by extracting valuable information from various data sources and transforming it into actionable insights. They leverage their analytical skills and expertise to identify patterns, trends, and correlations in the data, enabling businesses to make informed decisions and drive growth. By analyzing market trends, customer behavior, and internal operations, the analyst provides valuable recommendations to optimize business strategies, improve operational efficiency, and identify new growth opportunities. They often collaborate with cross-functional teams, including marketing, finance, and operations, to ensure data-driven decision-making and effective implementation of strategies. Overall, the business insights analyst plays a pivotal role in helping organizations leverage data to gain a competitive edge and achieve their goals.

In addition to data analysis, the business insights analyst is responsible for creating and maintaining robust reporting systems and dashboards that provide stakeholders with timely and accurate information. They work closely with IT teams to design and develop data visualization tools that effectively communicate complex information in a clear and concise manner. These reports and dashboards are instrumental in tracking key performance indicators, monitoring business metrics, and measuring the success of various initiatives. The analyst ensures data integrity, accuracy, and consistency by implementing rigorous data quality control measures and addressing any discrepancies or anomalies. They also stay updated with the latest industry trends, tools, and technologies to continually enhance data analysis processes and deliver innovative solutions. Overall, the business insights analyst bridges the gap between data and decision-making, empowering businesses to make data-driven choices that drive profitability and success.

Required Qualifications: 

  • Bachelor’s degree in a relevant field such as business analytics, data science, statistics, economics, or a related discipline.

  • Proven experience in data analysis, data manipulation, and statistical analysis techniques.

  • Proficiency in programming languages such as SQL, Python, or R.

  • Strong analytical thinking and problem-solving skills to interpret complex data sets and generate actionable insights.

  • Excellent communication skills to effectively convey data insights to diverse stakeholders.

  • Familiarity with data visualization tools such as Tableau, Power BI, or Excel.

  • Ability to work with large datasets and apply data mining techniques.

Preferred Qualifications: 

  • Master’s degree in a relevant field or advanced certifications in business analytics or data science.

  • Prior experience in a business insights or data analysis role, preferably in a similar industry or sector.

  • Knowledge of specific domain areas, such as marketing, finance, or operations, relevant to the organization.

  • Experience with machine learning techniques and predictive modeling.

  • Familiarity with cloud-based data platforms and big data technologies.

  • Strong business acumen and the ability to connect data insights to strategic objectives.

  • Proven track record of delivering impactful insights and recommendations to drive business growth.

Education / Certifications: 

  • Bachelor’s or Master’s degree in a relevant field such as business analytics, data science, statistics, economics, or a related discipline.

  • Certifications in data analysis, data science, or related fields (e.g., Certified Analytics Professional, Microsoft Certified: Data Analyst Associate).

Work Location / Work Schedule / Travel: 

  • Hybrid Work

  • Occasional visits to the office may be required for meetings, conferences, or client engagements, as per business needs.

How We Partner To Protect You: TaskUs will neither solicit money from you during your application process nor require any form of payment in order to proceed with your application. Kindly ensure that you are always in communication with only authorized recruiters of TaskUs.

DEI: In TaskUs we believe that innovation and higher performance are brought by people from all walks of life. We welcome applicants of different backgrounds, demographics, and circumstances. Inclusive and equitable practices are our responsibility as a business. TaskUs is committed to providing equal access to opportunities. If you need reasonable accommodations in any part of the hiring process, please let us know.

We invite you to explore all TaskUs career opportunities and apply through the provided URL https://www.taskus.com/careers/.



Source link

25Nov

Senior Analyst, Global Real Estate and Retail Franchise at Levi Strauss & Co. – Bengaluru, India


JOB DESCRIPTION

The Senior Analyst, Global Real Estate and Retail Franchise will play a pivotal role in shaping and supporting Levi Strauss & Co.’s global real estate and retail franchise strategy. This position involves data-driven decision support for the Global DTC Investment Board, development of mapping tools, and insightful analysis to guide and enhance distribution strategies. A strong focus on automation and Python programming will be key to streamline and elevate reporting processes.

Key Responsibilities:

Support for Global DTC Investment Board:

  • Drive the enhancement and ongoing maintenance of the PER hindsighting tool to streamline data collection and reporting.
  • Document all approved real estate transactions and deals from diverse market clusters to maintain an accurate, up-to-date database.
  • Retrieve and analyze data from PER files to derive strategic insights and hindsight reporting for future investments.
  • Develop and implement automated data pipelines using Python to ensure seamless updates and minimize manual effort.

Development of Global Distribution Mapping Tool:

  • Build a comprehensive distribution mapping tool on ArcGIS, covering all fleet locations globally and incorporating key performance metrics.
  • Ensure the tool is fully integrated with existing databases, such as HANA and ELM, for automated, real-time data updates.
  • Use Python to automate data extraction, transformation, and loading processes for consistent and efficient data integration.
  • Explore the potential to expand the mapping tool’s capabilities to include corporate real estate data for broader utility across the organization.

Analysis and Reporting to Enhance Distribution Strategy:

  • Leverage the distribution mapping tool to perform fleet analysis, including market penetration and growth opportunity assessments.
  • Utilize Python for automating and expanding current reports and analytical models to ensure timely, accurate reporting in real estate and franchise areas.
  • Conduct ad-hoc analyses to support critical decision-making processes, helping refine and optimize global distribution strategies.
  • Implement Python-based dashboards and reporting scripts to automate recurring reports and enable data visualization for strategic insights.

Qualifications:

  • Education: Bachelor's degree in Business, Real Estate, Data Science, or a related field (or equivalent experience).
  • Experience: 3 to 5 years in Retail, Franchise, Account Management, or Strategy Consulting, ideally with international exposure.
  • Skills:
    • Proficient in ArcGIS, SAP HANA, ELM, and Python for data automation.
    • Advanced Excel and PowerPoint; comfortable adopting new tools.
    • Strong analytical skills, with experience in store mathematics, P&L, and KPIs.
    • Excellent organizational, influencing, and stakeholder management abilities.
  • Attributes: Commercially driven, pragmatic, passionate about retail/fashion.
  • Language: Fluent in English.
  • Travel: 5-10%.

Additional Skills:

  • Proven ability to manage and enhance complex tools and reporting mechanisms.
  • Experience with retail market analysis, penetration, and fleet analysis.
  • Comfort with managing multiple projects across regions and time zones.

LOCATION

Bengaluru, India

FULL TIME/PART TIME

Full time



Source link

25Nov

CDP Engineer (Adobe RTCDP) (Contract) at Bounteous – United States


We are searching for a CDP Data Engineer (Adobe RTCDP) to help clients get value out of their investment in Customer Data Platforms like Adobe Real-Time CDP (RTCDP) by extracting, transforming, and loading the data into the platform to build the unified customer profile. You will work with a multidisciplinary team of Consultants, Solution Architects, Data Scientists, and Digital Marketers.

Role and Responsibilities

  • Be a platform expert in Adobe Experience Platform Real-Time CDP
  • Provide deep domain expertise in our client’s business, technology stack and data infrastructure, along with a broad knowledge base in data engineering
  • Extract, transform and load marketing and customer data into the platform in an automated and scalable manner
  • Design, build/configure and test the necessary Adobe Experience Platform source and destination connections, data sets, XDM schemas and identity namespaces
  • Build the unified customer profile 
  • Identify and write necessary queries needed for segmentation, reporting, analysis and ML models in query service

Preferred Qualifications

  • College degree in Computer Science, Data Science, Analytics or related field
  • Must have hands-on experience configuring Adobe Experience Platform RTCDP, including schema creation, ingesting data from a variety of sources, configuring identity resolution, and connecting destinations for activation
  • At least 5 years of experience architecting and building data pipelines
  • At least 3 years of experience working in an agency environment and strong consulting skills working closely with clients
  • Strong understanding and application of SQL
  • Strong understanding of working with APIs
  • Strong understanding of customer data platforms and the modern data infrastructure
  • Experience working with cloud technologies such as AWS, Google Cloud, Azure or similar
  • Experience working with data warehouse solutions like Amazon Redshift, Google BigQuery, Snowflake or similar
  • Experience with data visualization tools like Tableau, PowerBI or similar is a plus



Source link

25Nov

Perform outlier detection more effectively using subsets of features | by W Brett Kennedy | Nov, 2024


Identify relevant subspaces: subsets of features that allow you to most effectively perform outlier detection on tabular data

This article is part of a series related to the challenges, and the techniques that may be used, to best identify outliers in data, including articles related to using PCA, Distance Metric Learning, Shared Nearest Neighbors, Frequent Patterns Outlier Factor, Counts Outlier Detector (a multi-dimensional histogram-based method), and doping. This article also contains an excerpt from my book, Outlier Detection in Python.

We look here at techniques to create, instead of a single outlier detector examining all features within a dataset, a series of smaller outlier detectors, each working with a subset of the features (referred to as subspaces).

When performing outlier detection on tabular data, we’re looking for the records in the data that are the most unusual — either relative to the other records in the same dataset, or relative to previous data.

There are a number of challenges associated with finding the most meaningful outliers, particularly that there is no definition of statistically unusual that definitively specifies which anomalies in the data should be considered the strongest. As well, the outliers that are most relevant (and not necessarily the most statistically unusual) for your purposes will be specific to your project, and may evolve over time.

There are also a number of technical challenges that appear in outlier detection. Among these are the difficulties that occur where data has many features. As covered in previous articles related to Counts Outlier Detector and Shared Nearest Neighbors, where we have many features, we often face an issue known as the curse of dimensionality.

This has a number of implications for outlier detection, including that it makes distance metrics unreliable. Many outlier detection algorithms rely on calculating the distances between records — in order to identify as outliers the records that are similar to unusually few other records, and that are unusually different from most other records — that is, records that are close to few other records and far from most other records.

For example, if we have a table with 40 features, each record in the data may be viewed as a point in 40-dimensional space, and its outlierness can be evaluated by the distances from it to the other points in this space. This, then, requires a way to measure the distance between records. A variety of measures are used, with Euclidean distances being quite common (assuming the data is numeric, or is converted to numeric values). So, the outlierness of each record is often measured based on the Euclidean distance between it and the other records in the dataset.

These distance calculations can, though, break down where we are working with many features and, in fact, issues with distance metrics may appear even with only ten or twenty features, and very often with about thirty or forty or more.

We should note, though, that issues with large numbers of features do not appear with all outlier detectors. For example, they do not tend to be significant when working with univariate tests (tests such as z-score or interquartile range tests, which consider each feature one at a time, independently of the other features, and are described in more detail in A Simple Example Using PCA for Outlier Detection) or when using categorical outlier detectors such as FPOF.

However, the majority of outlier detectors commonly used are numeric multi-variate outlier detectors — detectors that assume all features are numeric, and that generally work on all features at once. For example, LOF (Local Outlier Factor) and KNN (k-Nearest Neighbors) are two of the most widely-used detectors, and both evaluate the outlierness of each record based on its distances (in the high-dimensional spaces the data points live in) to the other records.

Consider the plots below. This presents a dataset with six features, shown in three 2d scatter plots. This includes two points that can reasonably be considered outliers, P1 and P2.

Looking, for now, at P1, it is far from the other points, at least in feature A. That is, considering just feature A, P1 can easily be flagged as an outlier. However, most detectors will consider the distance of each point to the other points using all six dimensions, which, unfortunately, means P1 may not necessarily stand out as an outlier, due to the nature of distance calculations in high-dimensional spaces. P1 is fairly typical in the other five features, and so its distance to the other points, in 6d space, may be fairly normal.

Nevertheless, we can see that this general approach to outlier detection — where we examine the distances from each record to the other records — is quite reasonable: P1 and P2 are outliers because they are far (at least in some dimensions) from the other points.

As KNN and LOF are very commonly used detectors, we’ll look at them a little closer here, and then look specifically at using subspaces with these algorithms.

With the KNN outlier detector, we pick a value for k, which determines how many neighbors each record is compared to. Let’s say we pick 10 (in practice, this would be a fairly typical value).

For each record, we then measure the distance to its 10 nearest neighbors, which provides a good sense of how isolated and remote each point is. We then need to create a single outlier score (i.e., a single number) for each record based on these 10 distances. For this, we generally then take either the mean or the maximum of these distances.

Let's assume we take the maximum (using the mean, median, or another function works similarly, though each has its nuances). If a record has an unusually large distance to its 10th nearest neighbor, this means there are at most 9 records that are reasonably close to it (and possibly fewer), and that it is otherwise unusually far from most other points, so it can be considered an outlier.
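
To make this concrete, here is a minimal sketch of my own (not code from the article), using scikit-learn's NearestNeighbors on synthetic data, that computes a KNN-style outlier score as the maximum distance to the k nearest neighbors:

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
X = np.random.randn(100, 6)          # 100 records, 6 features
X[99] = [8.0, 0, 0, 0, 0, 0]         # one record unusual in the first feature only

k = 10
# Request k + 1 neighbors, as each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)

# Score each record by its distance to its 10th nearest neighbor
knn_scores = dists[:, 1:].max(axis=1)
print(np.argsort(knn_scores)[-3:])   # indexes of the three highest-scoring records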

With the LOF outlier detector, we use a similar approach, though it works a bit differently. We also look at the distance of each point to its k nearest neighbors, but then compare this to the distances of these k neighbors to their k nearest neighbors. So LOF measures the outlierness of each point relative to the other points in their neighborhoods.

That is, while KNN uses a global standard to determine what are unusually large distances to their neighbors, LOF uses a local standard to determine what are unusually large distances.
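
A similarly minimal sketch of LOF (again my own illustration, here using scikit-learn's LocalOutlierFactor; PyOD's LOF behaves comparably):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(0)
X = np.random.randn(100, 6)
X[99] = [8.0, 0, 0, 0, 0, 0]

# LOF compares each point's neighbor distances to those of its neighbors;
# scores well above 1.0 indicate points in locally sparse regions
lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_
print(np.argsort(lof_scores)[-3:])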

The details of the LOF algorithm are actually a bit more involved, and the implications of the specific differences in these two algorithms (and the many variations of these algorithms) are covered in more detail in Outlier Detection in Python.

These are interesting considerations in themselves, but the main point for here is that KNN and LOF both evaluate records based on their distances to their closest neighbors. And these distance metrics can work sub-optimally (or even completely break down) when using many features at once, a problem that is greatly reduced by working with small numbers of features (subspaces) at a time.

The idea of using subspaces is useful even where the detector used does not rely on distance metrics, but where detectors based on distance calculations are used, some of the benefits of using subspaces are a bit clearer. And using distances in ways similar to KNN and LOF is quite common among detectors. As well as KNN and LOF, for example, the Radius, ODIN, INFLO, and LoOP detectors, as well as detectors based on sampling and detectors based on clustering, all use distances.

However, issues with the curse of dimensionality can occur with other detectors as well. For example, ABOD (Angle-based Outlier Detector) uses the angles between records to evaluate the outlierness of each record, as opposed to the distances. But, the idea is similar, and using subspaces can also be helpful when working with ABOD.

As well, other benefits of subspaces I’ll go through below apply equally to many detectors, whether using distance calculations or not. Still, the curse of dimensionality is a serious concern in outlier detection: where detectors use distance calculations (or similar measures, such as angle calculations), and there are many features, these distance calculations can break down. In the plots above, P1 and P2 may be detected well considering only six dimensions, and quite possibly if using 10 or 20 features, but if there were, say, 100 dimensions, the distances between all points would actually end up about the same, and P1 and P2 would not stand out at all as unusual.
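
This concentration of distances is easy to observe directly. The following quick experiment (my own, on purely random data, not from the article) shows how the gap between the largest and smallest pairwise distances shrinks as the number of features grows:

import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)
for num_feats in [2, 10, 100, 1000]:
    X = np.random.randn(500, num_feats)
    dists = pdist(X)                          # all pairwise Euclidean distances
    print(f"{num_feats} features: max/min distance ratio = "
          f"{dists.max() / dists.min():.2f}")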

Outside of the issues related to working with very large numbers of features, our attempts to identify the most unusual records in a dataset can be undermined even when working with fairly small numbers of features.

While very large numbers of features can make the distances calculated between records meaningless, even moderate numbers of features can make records that are unusual in just one or two features more difficult to identify.

Consider again the scatter plot shown earlier, repeated here. Point P1 is an outlier in feature A (though not in the other five features). Point P2 is unusual in features C and D (but not in the other four features). However, when considering the Euclidean distances of these points to the other points in 6-dimensional space, they may not reliably stand out as outliers. The same would be true using Manhattan distances, and most other distance metrics as well.

The left pane shows point P1 in a 2D dataspace. The point is unusual considering feature A, but less so if using Euclidean distances in the full 6D dataspace, or even the 2D dataspace shown in this plot. This is an example where using additional features can be counterproductive. In the middle pane, we see another point, point P2, which is an outlier in the C-D subspace but not in the A-B or E-F subspaces. We need only features C and D to identify this outlier, and again including other features will simply make P2 more difficult to identify.

P1, for example, even in the 2d space shown in the left-most plot, is not unusually far from most other points. It’s unusual that there are no other points near it (which KNN and LOF will detect), but the distance from P1 to the other points in this 2d space is not unusual: it’s similar to the distances between most other pairs of points.

Using a KNN algorithm, we would likely be able to detect this, at least if k is set fairly low, for example, to 5 or 10 — most records have their 5th (and their 10th) nearest neighbors much closer than P1 does. Though, when including all six features in the calculations, this is much less clear than when viewing just feature A, or just the left-most plot, with just features A and B.

Point P2 stands out well as an outlier when considering just features C and D. Using a KNN detector with a k value of, say, 5, we can identify its 5 nearest neighbors, and the distances to these would be larger than is typical for points in this dataset.

Using an LOF detector, again with a k value of, say, 5, we can compare the distances from P1 or P2 to their 5 nearest neighbors against the distances from those neighbors to their own 5 nearest neighbors; here as well, the distances from P1 or P2 to their 5 nearest neighbors would be found to be unusually large.

At least this is straightforward when considering only Features A and B, or Features C and D, but again, when considering the full 6-d space, they become more difficult to identify as outliers.

While many outlier detectors may still be able to identify P1 and P2 even with six, or a small number more, dimensions, it is clearly easier and more reliable to use fewer features. To detect P1, we really only need to consider feature A; and to identify P2, we really only need to consider features C and D. Including other features in the process simply makes this more difficult.

This is actually a common theme with outlier detection. We often have many features in the datasets we work with, and each can be useful. For example, if we have a table with 50 features, it may be that all 50 features are relevant: either a rare value in any of these features would be interesting, or a rare combination of values in two or more features, for each of these 50 features, would be interesting. It would be, then, worth keeping all 50 features for analysis.

But, to identify any one anomaly, we generally need only a small number of features. In fact, it's very rare for a record to be unusual in all features. And it's very rare for a record to have an anomaly based on a rare combination of many features (see Counts Outlier Detector for more explanation of this).

Any given outlier will likely have a rare value in one or two features, or a rare combination of values in a pair, or a set of perhaps three or four features. Only these features are necessary to identify the anomalies in that row, even though the other features may be necessary to detect the anomalies in other rows.

To address these issues, an important technique in outlier detection is using subspaces. The term subspaces simply refers to subsets of the features. In the example above, if we use the subspaces: A-B, C-D, E-F, A-E, B-C, B-D-F, and A-B-E, then we have seven subspaces (five 2d subspaces and two 3d subspaces). Creating these, we would run one (or more) detectors on each subspace, so would run at least seven detectors on each record.
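
As a rough sketch of this idea (my own illustration on synthetic data, not code from the article, using a simple KNN-style score as the detector), we can run one detector per subspace and then, for each record, combine the per-subspace scores, here by taking the maximum:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=10):
    # Simple outlier score: distance to the k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return dists[:, 1:].max(axis=1)

np.random.seed(0)
X = np.random.randn(200, 6)   # synthetic data with features A through F (columns 0 to 5)

# The seven subspaces from the example above, expressed as column indexes
subspaces = [[0, 1], [2, 3], [4, 5], [0, 4], [1, 2], [1, 3, 5], [0, 1, 4]]

# One detector per subspace; each record keeps its maximum score across subspaces
all_scores = np.column_stack([knn_scores(X[:, s]) for s in subspaces])
combined = all_scores.max(axis=1)
print(np.argsort(combined)[-5:])  # the five most outlier-like records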

Realistically, subspaces become more useful where we have many more features than six, and generally even the subspaces themselves will have more than six features, not just two or three; but viewing this simple case, for now, with a small number of small subspaces, is fairly easy to understand.

Using these subspaces, we can more reliably find P1 and P2 as outliers. P1 would likely be scored high by the detector running on features A-B, the detector running on features A-E, and the detector running on features A-B-E. P2 would likely be detected by the detector running on features C-D, and possibly the detector running on B-C.

However, we have to be careful: using only these seven subspaces, as opposed to a single 6d space covering all features, would miss any rare combinations of, for example, A and D, or C and E. These may or may not be detected using a detector covering all six features, but definitely could not be detected using a suite of detectors that simply never examine these combinations of features.

Using subspaces does have some large benefits, but does have some risk of missing relevant outliers. We’ll cover some techniques to generate subspaces below that mitigate this issue, but it can be useful to still run one or more outlier detectors on the full dataspace as well. In general, with outlier detection, we’re rarely able to find the full set of outliers we’re interested in unless we apply many techniques. As important as the use of subspaces can be, it is still often useful to use a variety of techniques, which may include running some detectors on the full data.

Similarly, with each subspace, we may execute multiple detectors. For example, we may use both a KNN and LOF detector, as well as Radius, ABOD, and possibly a number of other detectors — again, using multiple techniques allows us to better cover the range of outliers we wish to detect.

We've seen, then, a couple of motivations for working with subspaces: we can mitigate the curse of dimensionality, and we can more reliably identify anomalies that are based on small numbers of features and would otherwise be lost among many other features.

As well as handling situations like this, there are a number of other advantages to using subspaces with outlier detection. These include:

  • Accuracy due to the effects of using ensembles — Using multiple subspaces allows us to create ensembles (collections of outlier detectors), which allows us to combine the results of many detectors. In general, using ensembles of detectors provides greater accuracy than using a single detector. This is similar (though with some real differences too) to the way ensembles of predictors tend to be stronger for classification and regression problems than a single predictor. Here, using subspaces, each record is examined multiple times, which provides a more stable evaluation of each record than any single detector would.
  • Interpretability — The results can be more interpretable, and interpretability is often a key concern in outlier detection. Very often in outlier detection, we’re flagging unusual records with the idea that they may be a concern, or a point of interest, in some way, and often they will be manually examined. Knowing why they are unusual is necessary to be able to do this efficiently and effectively. Manually assessing outliers that are flagged by detectors that examined many features can be especially difficult; on the other hand, outliers flagged by detectors using only a small number of features can be much more manageable to assess.
  • Faster systems — Using fewer features allows us to create faster (and less memory-intensive) detectors. This can speed up both fitting and inference, particularly when working with detectors whose execution time is non-linear in the number of features (many detectors are, for example, quadratic in execution time based on the number of features). Depending on the detectors, using, say, 20 detectors, each covering 8 features, may actually execute faster than a single detector covering 100 features.
  • Execution in parallel — Given that we use many small detectors instead of one large detector, it’s possible to execute both the fitting and the predicting steps in parallel, allowing for faster execution where there are the hardware resources to support this.
  • Ease of tuning over time — Using many simple detectors creates a system that’s easier to tune over time. Very often with outlier detection, we’re simply evaluating a single dataset and wish just to identify the outliers in this. But it’s also very common to execute outlier detection systems on a long-running basis, for example, monitoring industrial processes, website activity, financial transactions, the data being input to machine learning systems or other software applications, the output of these systems, and so on. In these cases, we generally wish to improve the outlier detection system over time, allowing us to focus better on the more relevant outliers. Having a suite of simple detectors, each based on a small number of features, makes this much more manageable. It allows us to, over time, increase the weight of the more useful detectors and decrease the weight of the less useful detectors.

As indicated, we will need, for each dataset evaluated, to determine the appropriate subspaces. It can, though, be difficult to find the relevant set of subspaces, or at least to find the optimal set of subspaces. That is, assuming we are interested in finding any unusual combinations of values, it can be difficult to know which sets of features will contain the most relevant of the unusual combinations.

As an example, if a dataset has 100 features, we may train 10 models, each covering 10 features. We may use, say, the first 10 features for the first detector, the second set of 10 features for the second, and so on. If the first two features have some rows with anomalous combinations of values, we will detect this. But if there are anomalous combinations related to the first feature and any of the 90 features not covered by the same model, we will miss these.

We can improve the odds of putting relevant features together by using many more subspaces, but it can be difficult to ensure all sets of features that should be together are actually together at least once, particularly where there are relevant outliers in the data that are based on three, four, or more features — which must appear together in at least one subspace to be detected. For example, in a table of staff expenses, you may wish to identify expenses for rare combinations of Department, Expense Type, and Amount. If so, these three features must appear together in at least one subspace.

So, we have the questions of how many features should be in each subspace, which features should go together, and how many subspaces to create.

There are a very large number of combinations to consider. If there are 20 features, there are 2²⁰ possible subspaces, which is just over a million. If there are 30 features, there are over a billion. If we decide ahead of time how many features will be in each subspace, the number of combinations decreases, but is still very large. If there are 20 features and we wish to use subspaces with 8 features each, there are 20 choose 8, or 125,970, combinations. If there are 30 features and we wish for subspaces with 7 features each, there are 30 choose 7, or 2,035,800, combinations.
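
These counts are easy to verify with Python's standard library (a quick check, not from the article):

import math

print(2 ** 20)            # 1,048,576 possible subspaces of 20 features
print(2 ** 30)            # 1,073,741,824 possible subspaces of 30 features
print(math.comb(20, 8))   # 125,970 subspaces of exactly 8 of 20 features
print(math.comb(30, 7))   # 2,035,800 subspaces of exactly 7 of 30 features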

One approach we may wish to take is to keep the subspaces small, which allows for greater interpretability. The most interpretable option, using two features per subspace, also allows for simple visualization. However, if we have d features, we will need d*(d-1)/2 models to cover all combinations, which can be intractable. With 100 features, we would require 4,950 detectors. We usually need to use at least several features per detector, though not necessarily a large number.

We wish to use enough detectors, and enough features per detector, that each pair of features appears together ideally at least once, and few enough features per detector that the detectors have largely different features from each other. For example, if each detector used 90 out of the 100 features, we’d cover all combinations of features well, but the subspaces would still be quite large (undoing much of the benefit of using subspaces), and all the subspaces will be quite similar to each other (undoing much of the benefit of creating ensembles).

While the number of features per subspace requires balancing these concerns, the number of subspaces created is a bit more straightforward: in terms of accuracy, using more subspaces is strictly better, but is computationally more expensive.

There are a few broad approaches to finding useful subspaces. I list these here quickly, then look at some in more detail below.

  • Based on domain knowledge — Here we consider which sets of features could potentially have combinations of values we would consider noteworthy.
  • Based on associations — Unusual combinations of values are only possible if a set of features are associated in some way. In prediction problems, we often wish to minimize the correlations between features, but with outlier detection, these are the features that are most useful to consider together. The features with the strongest associations will have the most meaningful outliers if there are exceptions to the normal patterns.
  • Based on finding very sparse regions — Records are typically considered as outliers if they are unlike most other records in the data, which implies they are located in sparse regions of the data. Therefore, useful subspaces can be found as those that contain large, nearly-empty regions.
  • Randomly — This is the method used by a technique shown later called FeatureBagging and, while it can be suboptimal, it avoids the expensive searches for associations and sparse regions, and can work reasonably well where many subspaces are used.
  • Exhaustive searches — This is the method employed by Counts Outlier Detector. This is limited to subspaces with small numbers of features, but the results are highly interpretable. It also avoids any computation, or biases, associated with selecting only a subset of the possible subspaces.
  • Using the features related to any known outliers — If we have a set of known outliers, can identify why they are outliers (the relevant features), and are in a situation where we do not wish to identify unknown outliers (only these specific outliers), then we can take advantage of this and identify the sets of features relevant for each known outlier, and construct models for the various sets of features required.

We’ll look at a few of these next in a little more detail.

Domain knowledge

Let’s take the example of a dataset, specifically an expenses table, shown below. If examining this table, we may be able to determine the types of outliers we would and would not be interested in. Unusual combinations of Account and Amount, as well as unusual combinations of Department and Account, may be of interest; whereas Date of Expense and Time would likely not be a useful combination. We can continue in this way, creating a small number of subspaces, each with likely two, three, or four features, which can allow for very efficient and interpretable outlier detection, flagging the most relevant outliers.

Expenses table

This can miss cases where we have an association in the data, though the association is not obvious. So, as well as taking advantage of domain knowledge, it may be worth searching the data for associations. We can discover relationships among the features, for example, testing whether features can be predicted accurately from the other features using simple predictive models. Where we find such associations, these can be worth investigating.
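
One simple way to do this (a sketch of my own, using a shallow decision tree; the function name, threshold, and column names are all illustrative) is to try to predict each numeric feature from the others and keep the features that are well predicted:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def find_predictable_features(df, min_r2=0.3):
    # Return the features that can be predicted reasonably well from the others
    results = []
    for col in df.columns:
        X, y = df.drop(columns=[col]), df[col]
        r2 = cross_val_score(DecisionTreeRegressor(max_depth=3),
                             X, y, cv=3, scoring='r2').mean()
        if r2 >= min_r2:
            results.append((col, round(r2, 3)))
    return results

# Synthetic example: C is a function of A and B, while D is unrelated
np.random.seed(0)
df = pd.DataFrame({'A': np.random.randn(500), 'B': np.random.randn(500)})
df['C'] = df['A'] + df['B'] + np.random.randn(500) * 0.1
df['D'] = np.random.randn(500)
print(find_predictable_features(df))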

Discovering these associations, though, may be useful for some purposes, but may or may not be useful for the outlier detection process. If there is, for example, a relationship between accounts and the time of the day, this may simply be due to the process people happen to typically use to submit their expenses, and it may be that deviations from this are of interest, but more likely they are not.

Random feature subspaces

Creating subspaces randomly can be effective if there is no domain knowledge to draw on. This is fast and can create a set of subspaces that will tend to catch the strongest outliers, though it can miss some important outliers too.

The code below provides an example of one method to create a set of random subspaces. This example uses a set of eight features, named A through H, and creates a set of subspaces of these.

Each subspace starts by selecting the feature that is so far the least-used (if there is a tie, one is selected randomly). It uses an array called ft_used_counts to track this. It then adds features to this subspace one at a time, each step selecting the feature that has appeared in other subspaces least often alongside the features already in the subspace. It uses a matrix called ft_pair_mtx to track how many subspaces each pair of features has appeared in together so far. Doing this, we create a set of subspaces that pairs each pair of features together roughly equally often.

import pandas as pd
import numpy as np

def get_random_subspaces(features_arr, num_base_detectors,
                         num_feats_per_detector):
    num_feats = len(features_arr)
    feat_sets_arr = []
    ft_used_counts = np.zeros(num_feats)
    ft_pair_mtx = np.zeros((num_feats, num_feats))

    # Each loop generates one subspace, which is one set of features
    for _ in range(num_base_detectors):
        # Get the set of features with the minimum count
        min_count = ft_used_counts.min()
        idxs = np.where(ft_used_counts == min_count)[0]

        # Pick one of these randomly and add to the current set
        feat_set = [np.random.choice(idxs)]

        # Find the remaining set of features
        while len(feat_set) < num_feats_per_detector:
            # Select the feature that has so far appeared least often
            # together with the features already in the subspace
            mtx_with_set = ft_pair_mtx[:, feat_set]
            sums = mtx_with_set.sum(axis=1)
            min_sum = sums.min()
            min_idxs = np.where(sums == min_sum)[0]
            new_feat = np.random.choice(min_idxs)
            feat_set.append(new_feat)
            feat_set = list(set(feat_set))

            # Update ft_pair_mtx
            for c in feat_set:
                ft_pair_mtx[c][new_feat] += 1
                ft_pair_mtx[new_feat][c] += 1

        # Update ft_used_counts
        for c in feat_set:
            ft_used_counts[c] += 1

        feat_sets_arr.append(feat_set)

    return feat_sets_arr

np.random.seed(0)
features_arr = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
num_base_detectors = 4
num_feats_per_detector = 5

feat_sets_arr = get_random_subspaces(features_arr,
                                     num_base_detectors,
                                     num_feats_per_detector)
for feat_set in feat_sets_arr:
    print([features_arr[x] for x in feat_set])

Normally we would create many more base detectors (each subspace often corresponds to one base detector, though we can also run multiple base detectors on each subspace) than we do in this example, but this uses just four to keep things simple. This will output the following subspaces:

['A', 'E', 'F', 'G', 'H']
['B', 'C', 'D', 'F', 'H']
['A', 'B', 'C', 'D', 'E']
['B', 'D', 'E', 'F', 'G']

The code here will create the subspaces such that all have the same number of features. There is also an advantage in having the subspaces cover different numbers of features, as this can introduce some more diversity (which is important when creating ensembles), but there is strong diversity in any case from using different features (so long as each uses a relatively small number of features, such that the subspaces use largely different sets of features).

Having the same number of features has a couple of benefits. It simplifies tuning the models, as many parameters used by outlier detectors depend on the number of features. If all subspaces have the same number of features, they can also use the same parameters.

It also simplifies combining the scores, as the detectors will be more comparable to each other. If using different numbers of features, this can produce scores that are on different scales, and not easily comparable. For example, with k-Nearest Neighbors (KNN), we expect greater distances between neighbors if there are more features.
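
Where subspaces of different sizes are used anyway, a common workaround (shown here only as a rough sketch of my own, not something the detectors do for you) is to put each detector's scores on a common scale before combining them:

import numpy as np
from scipy.stats import zscore

# Hypothetical raw scores from two detectors using different numbers of features,
# and therefore on different scales
scores_small_subspace = np.random.gamma(2.0, 1.0, size=100)
scores_large_subspace = np.random.gamma(2.0, 5.0, size=100)

# Standardize each detector's scores, then combine (here with the per-record maximum)
normalized = np.column_stack([zscore(scores_small_subspace),
                              zscore(scores_large_subspace)])
combined = normalized.max(axis=1)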

Feature subspaces based on correlations

Everything else being equal, in creating the subspaces, it's useful to keep associated features together as much as possible. The code below provides an example of selecting subspaces based on correlations.

There are several ways to test for associations. We can create predictive models to attempt to predict each feature from each other single feature (this will capture even relatively complex relationships between features). With numeric features, the simplest method is likely to check for Spearman correlations, which will miss nonmonotonic relationships, but will detect most strong relationships. This is what is used in the code example below.

To execute the code, we first specify the number of subspaces desired and the number of features in each.

This executes by first finding all pairwise correlations between the features and storing this in a matrix. We then create the first subspace, starting by finding the largest correlation in the correlation matrix (this adds two features to this subspace) and then looping over the number of other features to be added to this subspace. For each, we take the largest correlation in the correlation matrix for any pair of features, such that one feature is currently in the subspace and one is not. Once this subspace has a sufficient number of features, we create the next subspace, taking the largest correlation remaining in the correlation matrix, and so on.

For this example, we use a real dataset, the baseball dataset from OpenML (available with a public license). The dataset turns out to contain some large correlations. The correlation, for example, between At bats and Runs is 0.94, indicating that any values that deviate significantly from this pattern would likely be outliers.

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

# Function to find the pair of features remaining in the matrix with the
# highest correlation
def get_highest_corr():
    return np.unravel_index(
        np.argmax(corr_matrix.values, axis=None),
        corr_matrix.shape)

def get_correlated_subspaces(corr_matrix, num_base_detectors,
                             num_feats_per_detector):
    sets = []

    # Loop through each subspace to be created
    for _ in range(num_base_detectors):
        m1, m2 = get_highest_corr()

        # Start each subspace as the two remaining features with
        # the highest correlation
        curr_set = [m1, m2]
        for _ in range(2, num_feats_per_detector):
            # Get the other remaining correlations, sorted from
            # highest to lowest
            m = np.unravel_index(np.argsort(corr_matrix.values, axis=None),
                                 corr_matrix.shape)
            m0 = m[0][::-1]
            m1 = m[1][::-1]
            for i in range(len(m0)):
                d0 = m0[i]
                d1 = m1[i]
                # Add the pair if either feature is already in the subset
                if (d0 in curr_set) or (d1 in curr_set):
                    curr_set.append(d0)
                    curr_set = list(set(curr_set))
                    if len(curr_set) < num_feats_per_detector:
                        curr_set.append(d1)
                        # Remove duplicates
                        curr_set = list(set(curr_set))
                    if len(curr_set) >= num_feats_per_detector:
                        break

            # Update the correlation matrix, removing the features now used
            # in the current subspace
            for i in curr_set:
                i_idx = corr_matrix.index[i]
                for j in curr_set:
                    j_idx = corr_matrix.columns[j]
                    corr_matrix.loc[i_idx, j_idx] = 0
            if len(curr_set) >= num_feats_per_detector:
                break

        sets.append(curr_set)
    return sets

data = fetch_openml('baseball', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)

corr_matrix = abs(df.corr(method='spearman'))
corr_matrix = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
corr_matrix = corr_matrix.fillna(0)

feat_sets_arr = get_correlated_subspaces(corr_matrix, num_base_detectors=5,
                                         num_feats_per_detector=4)
for feat_set in feat_sets_arr:
    print([df.columns[x] for x in feat_set])

This produces:

['Games_played', 'At_bats', 'Runs', 'Hits']
['RBIs', 'At_bats', 'Hits', 'Doubles']
['RBIs', 'Games_played', 'Runs', 'Doubles']
['Walks', 'Runs', 'Games_played', 'Triples']
['RBIs', 'Strikeouts', 'Slugging_pct', 'Home_runs']

PyOD is likely the most comprehensive and well-used tool for outlier detection on numeric tabular data available in Python today. It includes a large number of detectors, ranging from very simple to very complex — including several deep learning-based methods.

Now that we have an idea of how subspaces work with outlier detection, we’ll look at two tools provided by PyOD that work with subspaces, called SOD and FeatureBagging. Both of these tools identify a set of subspaces, execute a detector on each subspace, and combine the results for a single score for each record.

Whether using subspaces or not, it’s necessary to determine what base detectors to use. If not using subspaces, we would select one or more detectors and run these on the full dataset. And, if we are using subspaces, we again select one or more detectors, here running these on each subspace. As indicated above, LOF and KNN can be reasonable choices, but PyOD provides a number of others as well that can work well if executed on each subspace, including, for example, Angle-based Outlier Detector (ABOD), models based on Gaussian Mixture Models (GMMs), Kernel Density Estimations (KDE), and several others. Other detectors, provided outside PyOD can work very effectively as well.

SOD was designed specifically to handle situations such as shown in the scatter plots above. SOD works, similarly to KNN and LOF, by identifying a neighborhood of k neighbors for each point, known as the reference set. The reference set is found in a different way, though, using a method called shared nearest neighbors (SNN).

Shared nearest neighbors are described thoroughly in this article, but the general idea is that if two points are generated by the same mechanism, they will tend to not only be close, but also to have many of the same neighbors. And so, the similarity of any two records can be measured by the number of shared neighbors they have. Given this, neighborhoods can be identified by using not only the sets of points with the smallest Euclidean distances between them (as KNN and LOF do), but the points with the most shared neighbors. This tends to be robust even in high dimensions and even where there are many irrelevant features: the rank order of neighbors tends to remain meaningful even in these cases, and so the set of nearest neighbors can be reliably found even where specific distances cannot.
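
A bare-bones illustration of the shared nearest neighbors idea (my own sketch, not the exact calculation SOD performs): for any two points, count how many of their k nearest neighbors they have in common.

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
X = np.random.randn(200, 10)
k = 15

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, neighbor_idxs = nn.kneighbors(X)
neighbor_sets = [set(row[1:]) for row in neighbor_idxs]  # drop each point itself

def snn_similarity(i, j):
    # Number of k-nearest neighbors that points i and j share
    return len(neighbor_sets[i] & neighbor_sets[j])

print(snn_similarity(0, 1))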

Once we have the reference set, we use this to determine the subspace, which here is the set of features that explain the greatest amount of variance for the reference set. Once we identify these subspaces, SOD examines the distances of each point to the data center.

I provide a quick example using SOD below. This assumes pyod has been installed, which requires running:

pip install pyod

We’ll use, as an example, a synthetic dataset, which allows us to experiment with the data and model hyperparameters to get a better sense of the strengths and limitations of each detector. The code here provides an example of working with 35 features, where two features (features 8 and 9) are correlated and the other features are irrelevant. A single outlier is created as an unusual combination of the two correlated features.

SOD is able to identify the one known outlier as the top outlier. I set the contamination rate to 0.01 to specify that (given there are 100 records) only a single outlier should be returned. Testing this beyond 35 features, though, SOD scores this point much lower. This example specifies the size of the reference set to be 3; different results may be seen with different values.

import pandas as pd
import numpy as np
from pyod.models.sod import SOD

np.random.seed(0)
d = np.random.randn(100, 35)
d = pd.DataFrame(d)

# Ensure features 8 and 9 are correlated, while all others are irrelevant
d[9] = d[9] + d[8]

# Insert a single outlier
d.loc[99, 8] = 3.5
d.loc[99, 9] = -3.8

# Execute SOD, flagging only 1 outlier
clf = SOD(ref_set=3, contamination=0.01)
clf.fit(d)
d['SOD Scores'] = clf.labels_

We display four scatterplots below, showing four pairs of the 35 features. The known outlier is shown as a star in each of these. We can see features 8 and 9 (the two relevant features) in the second pane, and we can see the point is a clear outlier, though it is typical in all other dimensions.

Testing SOD with 35-dimensional data. One outlier was inserted into the data and can be seen clearly in the second pane for features 8 and 9. Although the point is typical otherwise, it is flagged as the top outlier by SOD. The third pane also includes feature 9, and we can see the point is somewhat unusual here, though no more so than many other points in other dimensions. The relationship in features 8 and 9 is the most relevant, and SOD appears to detect this.

FeatureBagging was designed to solve the same problem as SOD, though it takes a different approach to determining the subspaces. It creates the subspaces completely randomly (slightly differently than the example above, which keeps a record of how often each pair of features is placed in a subspace together and attempts to balance this). It also subsamples the rows for each base detector, which provides a little more diversity between the detectors.

A specified number of base detectors are used (10 by default, though it is preferable to use more), each of which selects a random set of rows and features. For each, the maximum number of features that may be selected is specified as a parameter, defaulting to all. So, for each base detector, FeatureBagging:

  • Determines the number of features to use, up to the specified maximum.
  • Chooses this many features randomly.
  • Chooses a set of rows randomly. This is a bootstrap sample of the same size as the number of rows.
  • Creates an LOF detector (by default; other base detectors may be used) to evaluate the subspace.

Once this is complete, each row will have been scored by each base detector and the scores must then be combined into a single, final score for each row. PyOD’s FeatureBagging provides two options for this: using the maximum score and using the mean score.

As we saw in the scatter plots above, points can be strong outliers in some subspaces and not in others, and averaging in their scores from the subspaces where they are typical can water down their scores and defeat the benefit of using subspaces. In other forms of ensembling with outlier detection, using the mean can work well, but when working with multiple subspaces, using the maximum will typically be the better of the two options. Doing that, we give each record a score based on the subspace where it was most unusual. This isn’t perfect either, and there can be better options, but using the maximum is simple and is almost always preferable to the mean.
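
To make the difference concrete, here is a small sketch (my own, with made-up score values, not output from any detector) of combining a matrix of per-subspace scores:

import numpy as np

# Hypothetical scores for 3 rows from 4 subspace detectors
scores = np.array([
    [0.1, 0.2, 0.1, 0.1],   # typical in every subspace
    [0.1, 0.9, 0.1, 0.2],   # a strong outlier in one subspace only
    [0.4, 0.4, 0.4, 0.4],   # mildly unusual everywhere
])

print(scores.mean(axis=1))  # averaging dilutes the second row's score
print(scores.max(axis=1))   # the maximum preserves it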

Any detector can be used within the subspaces. PyOD uses LOF by default, as did the original paper describing FeatureBagging. LOF is a strong detector and a sensible choice, though you may find better results with other base detectors.

In the original paper, subspaces are created randomly, each using between d/2 and d - 1 features, where d is the total number of features. Some researchers have pointed out that the number of features used in the original paper is likely much larger than is appropriate.

If the full number of features is large, using over half the features at once will allow the curse of dimensionality to take effect. And using many features in each detector will result in the detectors being correlated with each other (for example, if all base detectors use 90% of the features, they will use roughly the same features and tend to score each record roughly the same), which can also remove much of the benefit of creating ensembles.

PyOD allows setting the number of features used in each subspace, and it should typically be set fairly low, with a large number of base estimators created.
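
As a rough illustration of this advice, here is a hand-rolled sketch of the feature-bagging idea (my own code, not a demonstration of PyOD’s FeatureBagging class or its exact parameters), using small random subspaces, many base detectors, and the maximum score per row:

import numpy as np
from pyod.models.lof import LOF

rng = np.random.default_rng(0)
X = d[list(range(35))].values    # the 35 feature columns from the example above
n_rows, n_features = X.shape

n_detectors = 100
subspace_size = 5                # kept fairly low relative to the 35 features

all_scores = np.zeros((n_rows, n_detectors))
for i in range(n_detectors):
    features = rng.choice(n_features, size=subspace_size, replace=False)
    rows = rng.choice(n_rows, size=n_rows, replace=True)   # bootstrap sample
    clf = LOF()
    clf.fit(X[np.ix_(rows, features)])
    # Score all rows (not just the bootstrap sample) on this subspace
    all_scores[:, i] = clf.decision_function(X[:, features])

# Combine per-subspace scores by taking the maximum per row
final_scores = all_scores.max(axis=1)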

In this article we’ve looked at subspaces as a way to improve outlier detection in a number of ways, including reducing the curse of dimensionality, increasing interpretability, allowing parallel execution, allowing easier tuning over time, and so on. Each of these is an important consideration, and using subspaces is often very helpful.

There are, though, often other approaches that can be used for these purposes, sometimes as alternatives and sometimes in combination with the use of subspaces. For example, to improve interpretability, it’s important to, as much as possible, select model types that are inherently interpretable (for example, univariate tests such as z-score tests, Counts Outlier Detector, or a detector provided by PyOD called ECOD).
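
As a simple illustration of an inherently interpretable univariate test (a sketch of my own, not a specific library’s implementation), a per-feature z-score check could look like:

import numpy as np
import pandas as pd

def zscore_flags(df, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the column mean
    z = (df - df.mean()) / df.std()
    return z.abs() > threshold

# Each flagged cell points directly at the feature responsible,
# which keeps the results easy to explain
flags = zscore_flags(d[list(range(35))])
rows_flagged = flags.any(axis=1)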

Where the main interest is in reducing the curse of dimensionality, it can again be useful to look at model types that scale well to many features, for instance Isolation Forest or Counts Outlier Detector. It can also be useful to execute univariate tests, or to apply PCA.
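
For instance, a minimal sketch of applying PCA before running a detector might look like the following (my own example, assuming scikit-learn and PyOD are available; the choice of five components and of KNN is arbitrary):

from sklearn.decomposition import PCA
from pyod.models.knn import KNN

# Project the 35 features onto a small number of components,
# then run a standard detector on the reduced data
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(d[list(range(35))])

clf = KNN(contamination=0.01)
clf.fit(X_reduced)
d['PCA-KNN Scores'] = clf.labels_

As discussed below, the components are less interpretable than the original features, which is part of the trade-off.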

One thing to be aware of when constructing subspaces, if they are formed based on correlations or on sparse regions, is that the relevant subspaces may change over time as the data changes. New associations may emerge between features, and new sparse regions may form that will be useful for identifying outliers, though these will be missed if the subspaces are not recalculated from time to time. Finding the relevant subspaces in these ways can be quite effective, but they may need to be updated on some schedule, or where the data is known to have changed.

With outlier detection projects on tabular data, it’s often worth looking at using subspaces, particularly where we have many features. Using subspaces is a relatively straightforward technique with a number of noteworthy advantages.

Where you face issues related to large data volumes, execution times, or memory limits, using PCA may also be a useful technique, and may work better in some cases than creating subspaces, though working with subspaces (and so with the original features, rather than the components created by PCA) can be substantially more interpretable, and interpretability is often quite important in outlier detection.

Subspaces can be used in combination with other techniques to improve outlier detection. As an example, using subspaces can be combined with other ways to create ensembles: it’s possible to create larger ensembles using both subspaces (where different detectors in the ensemble use different features) as well as different model types, different training rows, different pre-processing, and so on. This can provide some further benefits, though with some increase in computation as well.

All images by author



Source link

24Nov

Data Analyst – Freelance, Remote at Magic – LatAm – Argentina


Data Analyst – Freelance, Remote

Department: Boutique Client

Employment Type: Freelance

Location: LatAm – Argentina

Reporting To: Client via Magic

Compensation: $5.00 / hour

Description

About the Opportunity

We’re building an inclusive pool of Data Analysts across all experience levels and specializations to match with exciting client projects. Whether you’re just starting your data analysis journey or you’re a seasoned expert, we welcome your application. This is not a traditional job posting – instead, we’re creating a talent pipeline to connect analysts with opportunities that match their expertise level and interests.
How It Works

  1. Submit your application with detailed information about your expertise
  2. Complete our skills assessment and technical evaluation
  3. Join our talent pool if qualified
  4. Get matched with relevant client projects based on your skills and availability
  5. Accept projects that interest you and match your schedule

Why Join Our Talent Pool?

  • Opportunities at Every Level: From entry-level to expert positions
  • Flexible Engagement: Work on projects that match your schedule and interests
  • Growth Potential: Start with basic projects and progress to more complex ones
  • Mentorship Possibilities: Opportunities to learn from experienced analysts
  • Diverse Projects: Access opportunities across industries and analytical specialties
  • Quick Matching: Get fast-tracked to relevant opportunities when they arise
  • Professional Development: Expand your portfolio with varied client projects
  • Competitive Rates: Benefit from fair compensation based on your experience level

What We’re Looking For

We welcome applications from Data Analysts of all levels with expertise or interest in:

  • Business Intelligence & Reporting
  • Marketing Analytics
  • Financial Analysis
  • Operations Analytics
  • Product Analytics
  • Customer Analytics
  • Data Visualization
  • Statistical Analysis
  • Predictive Modeling
  • Real-Time Data Monitoring
  • Dashboard Operation
  • Report Generation & Distribution
  • Data Quality Monitoring
  • Basic Data Collection & Cleaning
  • Excel-Based Analysis
  • KPI Tracking & Reporting
  • Performance Metrics Monitoring
  • Junior Data Analyst
  • Data Analysis Assistant
  • Reporting Analyst
  • Data Quality Analyst
  • Analytics Support Specialist
  • Metrics Coordinator
  • Entry-Level Business Analyst
  • Data Operations Analyst
  • E-commerce Analytics
  • Social Media Analytics
  • Website Analytics
  • Sales Analytics
  • Support Analytics
  • Inventory Analytics
  • Basic Market Research
  • Survey Data Analysis

Skills, Knowledge and Expertise

Required:

  • Bachelor’s degree in a relevant field (Statistics, Mathematics, Economics, Computer Science, or related) OR equivalent practical experience
  • Strong Excel/Google Sheets skills
  • Basic understanding of data analysis concepts
  • Attention to detail and accuracy
  • Willingness to learn and grow
  • Excellent written and verbal communication skills
  • Basic problem-solving abilities
  • Bachelor’s degree in a relevant field (Statistics, Mathematics, Economics, Computer Science, or related)
  • 2+ years of professional experience in data analysis
  • Strong proficiency in data manipulation and analysis tools
  • Excellence in data visualization and presentation
  • Strong problem-solving and analytical thinking skills
  • Ability to work independently
  • Experience working in a remote/freelance capacity (preferred)

Candidates should be proficient in one or more of the following:

  • Excel/Google Sheets
  • Basic SQL or willingness to learn
  • Basic data visualization skills
  • Report generation
  • Data entry and validation
  • SQL and database querying
  • Python or R for data analysis
  • Business Intelligence tools (Power BI, Tableau, Looker, etc.)
  • Advanced Excel/Google Sheets
  • Statistical analysis software
  • ETL tools and processes

What to expect…

Schedule:

  • The majority of our clients operate within US business hours, Monday to Friday, but some may have unusual business hours. Opportunities may be full-time or part-time.

Compensation:

  • Rate starts at $5 per hour
  • Payment is made via Deel

Equipment:

  • Candidate must have the following WFH Set-Up:
    • Computer with at least 8GB RAM and an Intel Core i5 / AMD Ryzen 5 processor or better
    • Internet speed of at least 40 Mbps
    • Headset with an extended mic that has noise cancellation and a webcam
    • Back-up computer and internet connection
    • Quiet, dedicated workspace at home

Benefits



Source link

24Nov

ML Student at Samsung Research and Development Center Israel – Tel Aviv-Yafo, Tel Aviv District, IL


Description

Join the cutting-edge team at Samsung R&D Center as we embark on shaping the future of technology today. Beyond merely envisioning the horizon, we are actively pushing boundaries in various key areas of innovation. Samsung is committed to pioneering continuous advancements, delivering value to society, and fostering an environment where our employees can fully unleash their talents, creativity, and passion. Join us in building the tomorrow we envision.

The Group

Samsung Semiconductor Israel’s Computer Vision team is responsible for developing and productizing highly challenging, cutting-edge computer vision and machine learning algorithms for Samsung’s next-generation products. Our team is making a significant impact on technology that reaches hundreds of millions of users around the world.

What will you do?

Take part in a project building a proof of concept for a new camera feature with image generation capabilities for Samsung phones.

  • Explore recent computer vision (CV)/machine learning (ML) technologies
  • Collaborate with top notch CV and ML researchers
  • Define, implement, test, train and evaluate neural network schemes
  • Deal with HW and SW implementation constraints
  • Define and implement algorithmic evaluation protocols 

Requirements

  • MSc/PhD student (or an exceptional BSc student) in CS/EE/Math or a related field
  • Knowledge of and experience with ML and neural networks
  • Great coding skills, with Python experience
  • Capability to quickly grasp new fields and technologies
  • Great interpersonal skills



Source link

24Nov

Data Science in Marketing: Hands-on Propensity Modelling with Python | by Rebecca Vickery | Nov, 2024


All the code you need to predict the likelihood of a customer purchasing your product

Photo by Campaign Creators on Unsplash

Propensity models are a powerful application of machine learning in marketing. These models use historical examples of customer behaviour to make predictions about future behaviour. The predictions generated by the propensity model are commonly used to understand the likelihood of a customer purchasing a particular product or taking up a specific offer within a given time frame.

In essence, propensity models are examples of the machine learning technique known as classification. What makes propensity models unique is the problem statement they solve and how the output needs to be crafted for use in marketing.

The output of a propensity model is a probability score describing the predicted likelihood of the desired customer behaviour. This score can be used to create customer segments or rank customers for increased personalisation and targeting of new products or offers.

In this article, I’ll provide an end-to-end practical tutorial describing how to build a propensity model ready for use by a marketing team.

This is the first in a series of hands-on Python tutorials I’ll be writing



Source link

24Nov

Manager, Software Engineering (DLP) at Palo Alto Networks – Santa Clara, CA, United States


Company Description

Our Mission

At Palo Alto Networks® everything starts and ends with our mission:

Being the cybersecurity partner of choice, protecting our digital way of life.
Our vision is a world where each day is safer and more secure than the one before. We are a company built on the foundation of challenging and disrupting the way things are done, and we’re looking for innovators who are as committed to shaping the future of cybersecurity as we are.

Who We Are

We take our mission of protecting the digital way of life seriously. We are relentless in protecting our customers, and we believe that the unique ideas of every member of our team contribute to our collective success. Our values were crowdsourced by employees and are brought to life through each of us every day – from disruptive innovation and collaboration to execution, and from showing up for each other with integrity to creating an environment where we all feel included.

As a member of our team, you will be shaping the future of cybersecurity. We work fast, value ongoing learning, and we respect each employee as a unique individual. Knowing we all have different needs, our development and personal wellbeing programs are designed to give you choice in how you are supported. This includes our FLEXBenefits wellbeing spending account with over 1,000 eligible items selected by employees, our mental and financial health resources, and our personalized learning opportunities – just to name a few!

At Palo Alto Networks, we believe in the power of collaboration and value in-person interactions. This is why our employees generally work full time from our office with flexibility offered where needed. This setup fosters casual conversations, problem-solving, and trusted relationships. Our goal is to create an environment where we all win with precision.

Job Description

Your Career

You can build a career solving network security challenges by leveraging advanced machine learning techniques on exabytes of data. You will help create and deliver the most advanced AI and Machine Learning solutions to detect and remediate network issues, discover threats, provide intelligence, and protect devices.

At Palo Alto Networks, Managers of Engineering are:

  • Deep technical experts and thought leaders who help accelerate the adoption of the very best engineering practices while maintaining knowledge of industry innovations, trends, and practices
  • Visionaries who deliver on critical business needs and are recognized across the company as go-to engineering resources on given domains
  • Role models and mentors who exemplify the best of Palo Alto Networks culture – They do the right thing even when it’s hard, treat challenges as a chance to learn, and provide honest opinions so the team can improve
  • Leaders who can communicate cogently with hands-on engineers as well as executives

Your Impact

  • You will work in a fast-paced team to create and deliver new product and mission features to the network and security platform that many customers use daily
  • You will actively engage and contribute to the network security community through collaborations
  • You will be part of the team that is leading and driving industry technology evolution and creating a positive impact on business
  • Participate in strategy planning to build next-generation data security products
  • Manage the day-to-day to ensure the software is developed on schedule
  • Manage multiple projects simultaneously
  • Actively participate in code and architecture reviews
  • Mentor and manage the professional development of the team members
  • Thought leader of customer-first mindset

Qualifications

Your Experience 

  • Passion for leadership and innovation
  • Experience leading a team from either a technical or managerial perspective
  • Demonstrable project management skills
  • Excellent written and verbal communication skills
  • Proven ability to collaborate with cross-functional teams, including Product Management and Design teams, emphasizing end-to-end delivery
  • Excellent communication skills with the ability to influence at all levels of the organization
  • M.S. or Ph.D. in Computer Science, Mathematics, Statistics, or Electrical Engineering, or equivalent military experience required
  • 8+ years of development experience in Public Cloud, Data Engineering and Analytics
  • 4+ years of direct people leadership experience
  • 8+ years of experience in design, algorithms, and data structures – expertise with one or more of the following languages is a must: Java, Python, Golang
  • Proven leadership skills with an ability to deliver under tight deadlines
  • Familiarity with Agile (e.g., Scrum Process), Big Data technologies like Hive, Kafka, Hadoop, SQL, developing APIs, Cloud platforms such as GCP, AWS and Azure

Additional Information

The Team

To stay ahead of the curve, it’s critical to know where the curve is, and how to anticipate the changes we may be facing. For the fastest growing cybersecurity company, the curve is the evolution of cyberattacks, and the products and services that proactively address them. As a predictive enterprise environment, we analyze petabytes of data that pass through our walls daily and we hire the most talented minds in data science to build creative predictive analytics and data science solutions for our cybersecurity solutions.

We define the industry, instead of waiting for directions. We need individuals who feel comfortable in ambiguity, excited by the prospect of a challenge, and empowered by the unknown risks facing our everyday lives that are only enabled by a secure digital environment.

Compensation Disclosure 

The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be between $165,000 and $267,500 per year. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here.

Our Commitment

We’re problem solvers that take risks and challenge cybersecurity’s status quo. It’s simple: we can’t accomplish our mission without diverse teams innovating, together.

We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at ac************@pa**************.com.

Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.

All your information will be kept confidential according to EEO guidelines.

Is role eligible for Immigration Sponsorship?: Yes



Source link
