21Oct

CPU Research Scientist – Platform Architecture at Apple – Cambridge, Massachusetts, United States


Summary

Posted: Oct 19, 2024

Role Number: 200574214

The CPU Platform Architecture team is responsible for pushing the boundaries of both single-threaded and multi-threaded CPU performance, to enhance the user experience of many Apple products. The team is composed of experts with deep experience in microarchitecture, ISA definition, performance modeling, power modeling, and workload analysis. We are seeking a highly motivated, innovative, and confident individual to join the CPU Platform Architecture team to help drive advanced exploration for next-generation iPhone, iPad, and Mac CPU designs.

Description

In this research-centric role, you will be responsible for exploring and defining next-generation CPU architectures that seek to maintain performance and efficiency leadership. This role will challenge you to:

– Discover insights into CPU performance bottlenecks and drive performance improvements through architectural and microarchitectural enhancements

– Set direction for next-generation high-performance CPUs in areas such as branch prediction, instruction/data prefetching, the memory subsystem, or other CPU areas

– Build the right tools suited for performance analysis so that you can parse through the noise and focus on the real challenges

– Apply AI/ML techniques as both a tool for exploration and for CPU algorithmic feature improvement

– Conduct continuous research into one or more CPU areas, brainstorm ideas, and model ideas in a performance simulator to refine and prove their utility

– Present analysis/findings to guide CPU architecture and design teams on features that should be implemented in future CPUs

– Work cross-functionally with software and system partners

– Provide recommendations and influence the roadmap for future Apple CPUs used in iPhone, iPad, and Mac systems

Minimum Qualifications

  • B.S. degree
  • Familiarity with CPU architecture or microarchitecture concepts
  • Research experience and knowledge of CPU microarchitecture or AI/ML literature
  • Programming experience in either Python or C/C++

Preferred Qualifications

  • Expertise in one or more disciplines within CPU architecture: branch prediction, instruction or data prefetching, value prediction, caching policies, etc.
  • Ability to identify performance bottlenecks in workloads in an effort to craft ideas to solve them, and ability to implement those ideas in a performance simulator
  • Experience applying traditional or state-of-the-art machine learning techniques to novel problems. Knowledge of model development and tuning a plus
  • Knowledge and experience with common industry performance benchmarks and workloads
  • 10+ years of relevant experience
  • PhD in Computer Science or Computer Engineering with a focus in Computer Architecture

  • Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.



Source link

20Oct

Evaluating Model Retraining Strategies | by Reinhard Sellmair | Oct, 2024


How do data drift and concept drift matter when choosing the right retraining strategy?

(created with Image Creator in Bing)

Many people in the field of MLOps have probably heard a story like this:

Company A embarked on an ambitious quest to harness the power of machine learning. It was a journey fraught with challenges, as the team struggled to pinpoint a topic that would not only leverage the prowess of machine learning but also deliver tangible business value. After many brainstorming sessions, they finally settled on a use case that promised to revolutionize their operations. With excitement, they contracted Company B, a reputed expert, to build and deploy a ML model. Following months of rigorous development and testing, the model passed all acceptance criteria, marking a significant milestone for Company A, who looked forward to future opportunities.

However, as time passed, the model began producing unexpected results, rendering it ineffective for its intended use. Company A reached out to Company B for advice, only to learn that the changed circumstances required building a new model, necessitating an even higher investment than the original.

What went wrong? Was the model Company B created not as good as expected? Was Company A just unlucky that something unexpected happened?

Probably the issue was that even the most rigorous testing of a model before deployment does not guarantee that this model will perform well for an unlimited amount of time. The two most important aspects that impact a model’s performance over time are data drift and concept drift.

Data Drift: Also known as covariate shift, this occurs when the statistical properties of the input data change over time. If an ML model was trained on data from a specific demographic but the demographic characteristics of the input data change, the model's performance can degrade. Imagine you taught a child multiplication tables up to 10. The child can quickly give you the correct answers to 3 * 7 or 4 * 9. However, if you then ask what 4 * 13 is, the child may give you the wrong answer, even though the rules of multiplication did not change, because it never memorized that solution.

Concept Drift: This happens when the relationship between the input data and the target variable changes. This can lead to a degradation in model performance as the model’s predictions no longer align with the evolving data patterns. An example here could be spelling reforms. When you were a child, you may have learned to write “co-operate”, however now it is written as “cooperate”. Although you mean the same word, your output of writing that word has changed over time.

In this article I investigate how different scenarios of data drift and concept drift impact a model’s performance over time. Furthermore, I show what retraining strategies can mitigate performance degradation.

I focus on evaluating retraining strategies with respect to the model's prediction performance. In practice, additional aspects such as:

  • Data Availability and Quality: Ensure that sufficient and high-quality data is available for retraining the model.
  • Computational Costs: Evaluate the computational resources required for retraining, including hardware and processing time.
  • Business Impact: Consider the potential impact on business operations and outcomes when choosing a retraining strategy.
  • Regulatory Compliance: Ensure that the retraining strategy complies with any relevant regulations and standards, e.g. anti-discrimination.

need to be considered to identify a suitable retraining strategy.

(created with Image Creator in Bing)

To highlight the differences between data drift and concept drift, I synthesized datasets in which I controlled the extent to which these aspects appear.

I generated datasets in 100 steps where I changed parameters incrementally to simulate the evolution of the dataset. Each step contains multiple data points and can be interpreted as the amount of data that was collected over an hour, a day or a week. After every step the model was re-evaluated and could be retrained.

To create the datasets, I first randomly sampled features from a normal distribution whose mean µ and standard deviation σ depend on the step number s:

x_i ∼ N(µ_i(s), σ_i(s))

The data drift of feature x_i depends on how much µ_i and σ_i change with respect to the step number s.

All features are aggregated as follows:

X = Σ_i c_i · x_i + ε

where the c_i are coefficients that describe the impact of feature x_i on X. Concept drift can be controlled by changing these coefficients with respect to s. A random number ε, which is not available for model training, is added to account for the fact that the features do not contain complete information to predict the target y.

The target variable y is calculated by inputting X into a non-linear function; for the scenarios in this article, I chose a sine function, y = sin(X). This creates a more challenging task for the ML model, since there is no linear relation between the features and the target.
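As a minimal sketch of this data-generation process (the function name and the exact drift parameterization are my assumptions; the author's repository contains the original code):

    import numpy as np

    def generate_step(s, n_features=5, n_samples=100, drift_mu=0.0,
                      drift_sigma=0.0, drift_coef=0.0, noise_std=0.1, seed=None):
        rng = np.random.default_rng(seed)
        # mean and standard deviation drift linearly with step s (data drift)
        mu = 1.0 + drift_mu * s + np.arange(n_features)
        sigma = 1.0 + drift_sigma * s
        x = rng.normal(mu, sigma, size=(n_samples, n_features))
        # coefficients drift linearly with s (concept drift)
        c = 1.0 + drift_coef * s + np.arange(n_features)
        # the noise eps is not available for model training
        eps = rng.normal(0.0, noise_std, size=n_samples)
        X = x @ c + eps   # aggregate: X = sum_i c_i * x_i + eps
        y = np.sin(X)     # non-linear target
        return x, y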

(created with Image Creator in Bing)

I created the following scenarios to analyze:

  • Steady State: simulating no data or concept drift — parameters µ, σ, and c were independent of step s
  • Distribution Drift: simulating data drift — parameters µ and σ were linear functions of s, coefficients c were independent of s
  • Coefficient Drift: simulating concept drift — parameters µ and σ were independent of s, coefficients c were a linear function of s
  • Black Swan: simulating an unexpected and sudden change — parameters µ, σ, and c were independent of step s except for one step when these parameters were changed

The COVID-19 pandemic serves as a quintessential example of a Black Swan event. A Black Swan is characterized by its extreme rarity and unexpectedness. COVID-19 could not have been predicted, so its effects could not be mitigated beforehand. Many deployed ML models suddenly produced unexpected results and had to be retrained after the outbreak.

For each scenario, I used the first 20 steps as training data for the initial model. For the remaining steps, I evaluated three retraining strategies:

  • None: No retraining — the model trained on the training data was used for all remaining steps.
  • All Data: All previous data was used to train a new model, e.g. the model evaluated at step 30 was trained on the data from step 0 to 29.
  • Window: A fixed window size was used to select the training data, e.g. for a window size of 10 the training data at step 30 contained step 20 to 29.

I used an XGBoost regression model and mean squared error (MSE) as the evaluation metric.
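A sketch of the evaluation loop under these three strategies (reusing the hypothetical generate_step function from above; the hyperparameters and drift setting are illustrative):

    import numpy as np
    from xgboost import XGBRegressor
    from sklearn.metrics import mean_squared_error

    N_STEPS, TRAIN_STEPS, WINDOW = 100, 20, 10
    steps = [generate_step(s, drift_mu=0.05, seed=s) for s in range(N_STEPS)]

    def fit(train_steps):
        X = np.vstack([x for x, _ in train_steps])
        y = np.concatenate([y for _, y in train_steps])
        return XGBRegressor(n_estimators=100).fit(X, y)

    model_none = fit(steps[:TRAIN_STEPS])   # trained once, never retrained
    errors = {"None": [], "All Data": [], "Window": []}
    for s in range(TRAIN_STEPS, N_STEPS):
        x_eval, y_eval = steps[s]
        models = {
            "None": model_none,
            "All Data": fit(steps[:s]),           # all previous steps
            "Window": fit(steps[s - WINDOW:s]),   # last WINDOW steps only
        }
        for name, model in models.items():
            errors[name].append(mean_squared_error(y_eval, model.predict(x_eval)))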

Steady State

Prediction error of steady state scenario

The diagram above shows the evaluation results of the steady state scenario. As the first 20 steps were used to train the models, the evaluation error there was much lower than at later steps. The performance of the None and Window retraining strategies remained at a similar level throughout the scenario. The All Data strategy slightly reduced the prediction error at higher step numbers.

In this case, All Data is the best strategy because it profits from an increasing amount of training data, while the models of the other strategies were trained on a constant amount of data.

Distribution Drift (Data Drift)

Prediction error of distribution drift scenario

When the input data distributions changed, we can clearly see that the prediction error continuously increased if the model was not retrained on the latest data. Retraining on all data or on a data window resulted in very similar performance. The reason is that although All Data used more data, the older data was not relevant for predicting the most recent data.

Coefficient Drift (Concept Drift)

Prediction error of coefficient drift scenario

Changing coefficients means that the importance of features changes over time. In this case, we can see that the None retraining strategy showed a drastic increase in prediction error. Additionally, the results showed that retraining on all data also led to a continuous increase in prediction error, while the Window retraining strategy kept the prediction error at a constant level.

The reason the All Data strategy's performance also decreased over time is that the training data contained more and more cases where similar inputs resulted in different outputs. Hence, it became more challenging for the model to identify clear patterns to derive decision rules. This was less of a problem for the Window strategy, since older data was ignored, which allowed the model to "forget" older patterns and focus on the most recent cases.

Black Swan

Prediction error of black swan event scenario

The black swan event occurred at step 39, and the errors of all models suddenly increased at this point. However, after retraining a new model on the latest data, the errors of the All Data and Window strategies recovered to the previous level. This was not the case with the None retraining strategy, where the error increased around 3-fold compared to before the black swan event and remained at that level until the end of the scenario.

In contrast to the previous scenarios, the black swan event contained both data drift and concept drift. It is remarkable that the All Data and Window strategies recovered in the same way after the black swan event, while we found a significant difference between these strategies in the concept drift scenario. Probably the reason is that data drift occurred at the same time as concept drift; hence, patterns that had been learned on older data were no longer relevant after the black swan event because the input data had shifted.

An example of this could be that you are a translator and you get requests to translate a language that you haven't translated before (data drift), while at the same time there was a comprehensive spelling reform of this language (concept drift). While translators who had translated this language for many years might struggle with applying the reform, it wouldn't affect you, because you didn't even know the rules before the reform.

To reproduce this analysis or explore further, you can check out my Git repository.

Identifying, quantifying, and mitigating the impact of data drift and concept drift is a challenging topic. In this article I analyzed simple scenarios to present basic characteristics of these concepts. More comprehensive analyses will undoubtedly provide deeper and more detailed conclusions on this topic.

Here is what I learned from this project:

Mitigating concept drift is more challenging than mitigating data drift. While data drift can be handled by basic retraining strategies, concept drift requires a more careful selection of training data. Ironically, cases where data drift and concept drift occur at the same time may be easier to handle than pure concept drift cases.

A comprehensive analysis of the training data would be the ideal starting point for finding an appropriate retraining strategy. Thereby, it is essential to partition the training data with respect to the time when it was recorded. To make the most realistic assessment of the model's performance, the latest data should only be used as test data. To make an initial assessment regarding data drift and concept drift, the remaining training data can be split into two equally sized sets, with the older data in one set and the newer data in the other. Comparing the feature distributions of these sets allows an assessment of data drift. Training one model on each set and comparing the change in feature importance allows an initial assessment of concept drift.
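A minimal sketch of such an initial assessment, assuming time-ordered arrays X_train and y_train (the KS test is my choice of distribution comparison; the article does not prescribe one):

    import numpy as np
    from scipy.stats import ks_2samp
    from xgboost import XGBRegressor

    def assess_drift(X_train, y_train):
        # split the time-ordered training data into an older and a newer half
        half = len(X_train) // 2
        X_old, y_old = X_train[:half], y_train[:half]
        X_new, y_new = X_train[half:], y_train[half:]

        # data drift: compare each feature's distribution between the halves
        for i in range(X_train.shape[1]):
            stat, p = ks_2samp(X_old[:, i], X_new[:, i])
            print(f"feature {i}: KS statistic={stat:.3f}, p-value={p:.3g}")

        # concept drift: train one model per half, compare feature importances
        imp_old = XGBRegressor().fit(X_old, y_old).feature_importances_
        imp_new = XGBRegressor().fit(X_new, y_new).feature_importances_
        print("importance shift per feature:", imp_new - imp_old)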

No retraining turned out to be the worst option in all scenarios. Furthermore, in cases where model retraining is not taken into consideration, it is also more likely that data to evaluate and/or retrain the model is not collected in an automated way. This means that model performance degradation may go unrecognized or only be noticed at a late stage. Once developers become aware that there is a potential issue with the model, precious time would be lost until new data is collected that can be used to retrain the model.

Identifying the perfect retraining strategy at an early stage is very difficult and may even be impossible if there are unexpected changes in the serving data. Hence, I think a reasonable approach is to start with a retraining strategy that performed well on the partitioned training data. This strategy should be reviewed and updated whenever cases occur in which it does not address changes in an optimal way. Continuous model monitoring is essential to quickly notice and react when model performance decreases.

If not otherwise stated all images were created by the author.



Source link

20Oct

Data Engineer at Kyndryl – Madrid HQ (KES51610)


Who We Are

At Kyndryl, we design, build, manage and modernize the mission-critical technology systems that the world depends on every day. So why work at Kyndryl? We are always moving forward – always pushing ourselves to go further in our efforts to build a more equitable, inclusive world for our employees, our customers and our communities.

The Role

We are looking for a Data Engineer to join our dynamic and multifunctional team. In this role, you will participate in several projects across multiple clouds, covering the entire data lifecycle.

You will work in a multi-client and multi-project environment, utilizing different technologies and managing data within the cloud ecosystem.

Responsibilities:

  • Design and implement data platforms in multicloud environments.
  • Manage data projects from collection to analysis and visualization.
  • Collaborate with cross-functional teams to ensure data integration and flow.
  • Evaluate and select appropriate technologies for each project.
  • Provide technical support and training to end users.

We Offer:

  • A dynamic and collaborative work environment.
  • Opportunities for professional development and continuous training.
  • Participation in innovative and high-impact projects.

If you are interested in joining our team, we invite you to apply!

Who You Are

Required Technical and Professional Expertise

  • Experience in building and maintaining large-scale data warehouses and ensuring data quality at scale. Strong understanding of Data services in one of: AWS, GCP, and/or Azure.
  • Expertise in data mining, data storage and Extract-Transform-Load (ETL) processes.
  • Python and SQL programming skills (PySpark, Pandas, etc.)
  • Knowledge of BI tools such as Looker or PowerBI for data visualization and reporting.
  • Communication skills.
  • Fluent or native Spanish.

 

Preferred Technical and Professional Experience

  • BS, MS, or PhD in Mathematics, Computer Science, or a related field.
  • Experience in data pipeline development and tooling, e.g. Databricks, Cloudera, Teradata, Snowflake
  • Knowledge of DevOps/DataOps or CI/CD pipelines.  
  • Cloud platform certification.

Being You

Diversity is a whole lot more than what we look like or where we come from, it's how we think and who we are. We welcome people of all cultures, backgrounds, and experiences. But we're not doing it single-handedly: Our Kyndryl Inclusion Networks are only one of many ways we create a workplace where all Kyndryls can find and provide support and advice. This dedication to welcoming everyone into our company means that Kyndryl gives you – and everyone next to you – the ability to bring your whole self to work, individually and collectively, and support the activation of our equitable culture. That's the Kyndryl Way.

What You Can Expect

With state-of-the-art resources and Fortune 100 clients, every day is an opportunity to innovate, build new capabilities, new relationships, new processes, and new value. Kyndryl cares about your well-being and prides itself on offering benefits that give you choice, reflect the diversity of our employees and support you and your family through the moments that matter – wherever you are in your life journey. Our employee learning programs give you access to the best learning in the industry to receive certifications, including Microsoft, Google, Amazon, Skillsoft, and many more. Through our company-wide volunteering and giving platform, you can donate, start fundraisers, volunteer, and search over 2 million non-profit organizations.  At Kyndryl, we invest heavily in you, we want you to succeed so that together, we will all succeed.

Get Referred!

If you know someone that works at Kyndryl, when asked ‘How Did You Hear About Us’ during the application process, select ‘Employee Referral’ and enter your contact’s Kyndryl email address.



Source link

20Oct

Data Engineer at Chubb – Mexico


With you, Chubb is better! 
 
Are you passionate about data infrastructure, metrics, and coding? Do you love creating pipelines to support the business? Would you like to be a member of a fun working environment where your innovative projects make a real impact? Then check out this outstanding opportunity in our new Technology Hub in Mexico – CECM as a Data Engineer.
 
 
If you are a tech lover raring to develop your career in our growing, pioneering, diverse team within one of the largest companies in the world, we would love to hear from you! 
 
The Opportunity 
 
Your Responsibilities for this role may include, but are not limited to: 
 

  • Conceptualize, support, and drive the data architecture for multiple large-scale projects as well as recommend solutions to improve processes. 
  • Integrate data from various sources and build robust, multi-functional data assets to support analytics. Responsible for data asset design, development, integration, and optimization.  
  • Must love coding – be prepared to spend more than 80% of the time on hands-on development with groundbreaking technologies. 
  • Gatekeeper of end-to-end applications and frameworks ranging from system programming to micro-services to simple front-end applications 
  • Build pipelines, dashboards, frameworks, and systems to facilitate easier development of data artifacts. Moreover, clean, unify and organize messy and complex data sets for easy access and analysis. 
  • Collaborate with others to understand data needs, representing key data insights in a meaningful way. 
  • Ability to own a complete project or a subject-area delivery while leading team members in a scrum setting. 
  • Design, build, and launch collections of sophisticated data models and visualizations that support multiple use cases across different products or domains. 
  • Solve our most exciting data integration problems, applying optimal ETL patterns, frameworks, query techniques, sourcing from structured and unstructured data sources. 
  • Be the point of reference for solving a challenging technical problem 
     

Knowledge, Skills, And Abilities 
 

  • At least 4 years of data/software engineering experience including data analysis, design and integration/ETL. 
  • Good knowledge of Informatica Intelligent Cloud Services (IICS) 
  • Strong knowledge of Python 
  • Experience with SQL 
  • Bonus points for: 
        • PySpark 
        • Databricks 
        • Snowflake 
        • NoSQL 
        • Data Modelling 

Our team makes a difference, every time. For this reason, here is what we offer in return! 
 
We offer a hybrid working model; explicit, structured career development; a competitive salary package; an annual bonus; private medical cover; a monthly lunch allowance; continuous learning experiences; and work in a fun, lively environment with mentoring from our groundbreaking senior mentors. 

 

Integrity. Client Focus. Respect. Excellence. Teamwork 
 
Our core values instruct how we live and work. We’re an ethical and honest company that’s wholly committed to its clients. A business that’s engaged in mutual trust and respect for its employees and partners. A place where colleagues perform at the highest levels. And a working environment that’s collaborative and encouraging. 

 
 
Diversity & Inclusion. At Chubb, we consider our people our chief competitive advantage and as such we treat colleagues, candidates, clients, and business partners with equality, fairness, and respect, regardless of their age, disability, race, religion or belief, gender, sexual orientation, marital status or family circumstances. We strive to achieve an environment where all colleagues feel comfortable performing to their full potential and are recognized for their contributions. 
 
Many voices, One Chubb! 



Source link

20Oct

Data Governance Program Manager at NielsenIQ – Sofia, Bulgaria


Job Description

This role is part of the newly formed Data Strategy and Stewardship team within the Product organization. This team will be instrumental in shaping the organization’s data landscape. As a Data Governance Program Manager, you will lead complex, high-visibility initiatives throughout the project lifecycle. Your projects may cut across time zones, clients, and technologies and it is your job to keep track of all the moving parts, bring people together across multiple teams and communicate to junior and senior stakeholders.

At NielsenIQ, we empower our Program Managers to make decisions and own their projects. We’re looking for motivated, analytical, dynamic leaders with a passion for data and technology to join our Data Strategy and Stewardship team. If you thrive in high-energy environments and if you love the idea of working across every business function with visibility to our CTO and Product Leaders, you would be a great fit for our team!

RESPONSIBILITIES

  • Act as liaison for all facets of the program and pull people together to make decisions (Sales, Product, Operations, Technology, Data Science, Client Deployment) throughout the project
  • Act as the face of the program and communicate progress to upper management and stakeholders
  • Facilitate cross-team prioritization between various functions
  • Define business objectives clearly and ensure that the program meets them
  • Define and document release milestones, timelines, and deployment plans
  • Facilitate conversations to ensure Cross-Functional Team Dependencies are planned, identifying gaps, and mitigating or escalating risks
  • Track and communicate program objectives and progress
  • Track and ensure prompt resolution of issues; manage risks

QUALIFICATIONS

  • 5+ years in program and project management or a similar role running large projects, preferably with a focus on technology or data
  • Bachelor’s (undergraduate) degree in related field
  • Superior communication skills (interpersonal, verbal, written)
  • Ability to influence & gain buy-in at multiple levels, across divisions, functions, and cultures
  • Experience working with senior-level management and virtual teams
  • Ability to prioritize, manage, and deliver on multiple projects simultaneously; highly motivated and able to work against aggressive schedules
  • Strong analytical thinking with the ability to link data to the problem to solve
  • Proficiency in Microsoft Excel
  • The ability to learn other software tools as required
  • English language proficiency

Technical skills
Desirable but not mandatory:

  • Proficient end user of Business Intelligence tools, e.g. PowerBI, MicroStrategy, Qlik
  • Ability to write basic SQL (select statements)
  • Ability to research data from public sources, e.g. identify ownership of brands and retail businesses, or confirm the countries in which a brand is available.
  • Scrum Alliance certifications (e.g. CSM, CSPO) preferred

WE OFFER 

  • Food vouchers of 70 BGN monthly. 
  • Monthly Hybrid Model Allowance of 51 BGN gross 
  • Additional Medical Insurance, incl Prophylactics, Outpatient care, Inpatient care, Expenses for medications and medical products 
  • Life Insurance 
  • One-time amount of 200 BGN gross for a newborn child and for marriage 
  • Additional paid leave of 3 days in case of no overdue leave days from previous year 
  • Birthday allowance of 44 BGN gross 
  • Free access to LinkedIn Learning platform 
  • Multisport card, funded by the employee 
  • Employee Assistance Program
  • In addition: 
  • A hybrid model of working: part of the week you work from home (home office) and part from the office 
  • The opportunity to gain valuable experience and knowledge of FMCG at the country and international level 
  • Friendly working environment 

Additional Information

Our Benefits

  • Flexible working environment
  • Volunteer time off
  • LinkedIn Learning
  • Employee-Assistance-Program (EAP)

About NIQ

NIQ is the world’s leading consumer intelligence company, delivering the most complete understanding of consumer buying behavior and revealing new pathways to growth. In 2023, NIQ combined with GfK, bringing together the two industry leaders with unparalleled global reach. With a holistic retail read and the most comprehensive consumer insights—delivered with advanced analytics through state-of-the-art platforms—NIQ delivers the Full View™. NIQ is an Advent International portfolio company with operations in 100+ markets, covering more than 90% of the world’s population.

For more information, visit NIQ.com

Want to keep up with our latest updates?

Follow us on: LinkedIn | Instagram | Twitter | Facebook

Our commitment to Diversity, Equity, and Inclusion

NIQ is committed to reflecting the diversity of the clients, communities, and markets we measure within our own workforce. We exist to count everyone and are on a mission to systematically embed inclusion and diversity into all aspects of our workforce, measurement, and products. We enthusiastically invite candidates who share that mission to join us. We are proud to be an Equal Opportunity/Affirmative Action-Employer, making decisions without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability status, age, marital status, protected veteran status or any other protected class. Our global non-discrimination policy covers these protected classes in every market in which we do business worldwide. Learn more about how we are driving diversity and inclusion in everything we do by visiting the NIQ News Center: https://nielseniq.com/global/en/news-center/diversity-inclusion





Source link

20Oct

Visualization of Data with Pie Charts in Matplotlib | by Diana Rozenshteyn | Oct, 2024


Examples of how to create different types of pie charts using Matplotlib to visualize the results of database analysis in a Jupyter Notebook with Pandas

Photo by Niko Nieminen on Unsplash

While working on my Master’s Thesis titled “Factors Associated with Impactful Scientific Publications in NIH-Funded Heart Disease Research”, I have used different types of pie charts to illustrate some of the key findings from the database analysis.

A pie chart can be an effective choice for data visualization when a dataset contains a limited number of categories representing parts of a whole, making it well-suited for displaying categorical data with an emphasis on comparing the relative proportions of each category.

In this article, I will demonstrate how to create four different types of pie charts using the same dataset to provide a more comprehensive visual representation and deeper insight into the data. To achieve this, I will use Matplotlib, Python's plotting library, to display pie chart visualizations of the statistical data stored in the dataframe. If you are not familiar with the Matplotlib library, a good start is the Python Data Science Handbook by Jake VanderPlas, specifically the chapter on Visualization with Matplotlib, and matplotlib.org.

First, let’s import all the necessary libraries and extensions:
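The original code was published as a GitHub embed that is not rendered here; a minimal equivalent sketch would be:

    import pandas as pd
    import matplotlib.pyplot as plt

    # render plots inline in the Jupyter Notebook
    %matplotlib inline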

Next, we’ll prepare the CSV file for processing:
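Again the embed (Pie_Chart_Artcile_2.py) is not rendered here; a sketch consistent with the dataset described below, using a placeholder file name, would be:

    # placeholder file name; the CSV holds the top 10 journals with first-author
    # gender counts ("Female", "Male", "Unknown") and a "Total" column
    df = pd.read_csv("top10_journals_2002_2020.csv")
    df.head(10)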

The mini dataset used in this article highlights the top 10 journals for heart disease research publications from 2002 to 2020 and is part of a larger database collected for the Master’s Thesis research. The columns “Female,” “Male,” and “Unknown” represent the gender of the first author of the published articles, while the “Total” column reflects the total number of heart disease research articles published in each journal.

Image by the author, representing the output of the Pie_Chart_Artcile_2.py sample code above.

For smaller datasets with fewer categories, a pie chart with exploding slices can effectively highlight a key category by pulling it out slightly from the rest of the chart. This visual effect draws attention to specific categories, making them stand out from the whole. Each slice represents a portion of the total, with its size proportional to the data it represents. Labels can be added to each slice to indicate the category, along with percentages to show their proportion to the total. This visual technique makes the exploded slice stand out without losing the context of the full data representation.
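A sketch of such an exploded pie chart for a single journal (the row selection, a "Journal" name column, and the styling are my assumptions; the original is in Pie_Chart_Artcile_3.py):

    # gender split for one journal, with the "Female" slice pulled out
    row = df.iloc[0]
    sizes = [row["Female"], row["Male"], row["Unknown"]]
    explode = (0.1, 0, 0)  # offset the first slice to highlight it

    fig, ax = plt.subplots()
    ax.pie(sizes, explode=explode, labels=["Female", "Male", "Unknown"],
           autopct="%1.1f%%", startangle=90)
    ax.set_title(row["Journal"])
    plt.show()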

Image by the author, representing the output of the Pie_Chart_Artcile_3.py sample code above.

The same exploding-slices technique can be applied to all other entries in the sample dataset, and the resulting charts can be displayed within a single figure, as sketched below. This type of visualization helps to highlight the overrepresentation or underrepresentation of a particular category within the dataset. In the example provided, presenting all 10 charts in one figure reveals that none of the top 10 journals in heart disease research published more articles authored by women than by men, thereby emphasizing the gender disparity.
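A sketch of the combined figure (the 2×5 layout is an assumption; the original is in Pie_Chart_Artcile_4.py):

    # one exploded pie per journal, all within a single figure
    fig, axes = plt.subplots(2, 5, figsize=(20, 8))
    for ax, (_, row) in zip(axes.flat, df.iterrows()):
        ax.pie([row["Female"], row["Male"], row["Unknown"]],
               explode=(0.1, 0, 0), labels=["Female", "Male", "Unknown"],
               autopct="%1.1f%%")
        ax.set_title(row["Journal"], fontsize=9)
    plt.tight_layout()
    plt.show()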

Gender distributions for top 10 journals for heart disease research publications, 2002–2020. Image by the author, representing the output of the Pie_Chart_Artcile_4.py sample code above.

A variation of the pie chart, known as a donut chart, can also be used to visualize data. Donut charts, like pie charts, display the proportions of categories that make up a whole, but the center of the donut chart can also be utilized to present additional data. This format is less cluttered visually and can make it easier to compare the relative sizes of slices compared to a standard pie chart. In the example used in this article, the donut chart highlights that among the top 10 journals for heart disease research publications, the American Journal of Physiology, Heart and Circulatory Physiology published the most articles, accounting for 21.8%.
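A sketch of the donut chart (the original is in Pie_Chart_Artcile_5.py); setting a wedge width below 1 is one common way to hollow out a pie:

    # share of total publications per journal, drawn as a donut
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(df["Total"], labels=df["Journal"], autopct="%1.1f%%",
           wedgeprops={"width": 0.4})  # width < 1 leaves the center open
    ax.set_title("Top 10 journals for heart disease research, 2002-2020")
    plt.show()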

Image by the author, representing the output of the Pie_Chart_Artcile_5.py sample code above.

We can enhance the visualization of additional information from the sample dataset by building on the previous donut chart and creating a nested version. The add_artist() method from Matplotlib’s figure module is used to incorporate any additional Artist (such as figures or objects) into the base figure. Similar to the earlier donut chart, this variation displays the distribution of publications across the top 10 journals for heart disease research. However, it also includes an additional layer that shows the gender distribution of first authors for each journal. This visualization highlights that a larger percentage of the first authors are male.
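A sketch of the nested donut (the original is in Pie_Chart_Artcile_6.py); here add_artist() places a white circle over the center, and the inner ring flattens the per-journal gender counts:

    # outer ring: journal shares; inner ring: gender split per journal
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(df["Total"], labels=df["Journal"], radius=1.0,
           wedgeprops={"width": 0.3, "edgecolor": "white"})
    inner = df[["Female", "Male", "Unknown"]].to_numpy().flatten()
    ax.pie(inner, radius=0.7, wedgeprops={"width": 0.3, "edgecolor": "white"})
    # add_artist() incorporates an additional Artist into the base figure
    fig.gca().add_artist(plt.Circle((0, 0), 0.35, fc="white"))
    plt.show()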

Image by the author, representing the output of the Pie_Chart_Artcile_6.py sample code above.

In conclusion, pie charts are effective for visualizing data with a limited number of categories, as they enable viewers to quickly understand the most important categories or dominant proportions at a glance. In this specific example, the use of four different types of pie charts provides a clear visualization of the gender distribution among first authors in the top 10 journals for heart disease research publications, based on the 2002 to 2020 mini dataset used in this study. It is evident that a higher percentage of first authors are male, and that none of the top 10 journals for heart disease research published more articles authored by females than by males during the examined period.

The Jupyter Notebook and dataset used for this article can be found on GitHub.

Thank you for reading,

Diana

Note: I used GitHub embeds to publish this article.



Source link

19Oct

Postdoctoral Associate at Virginia Tech – Blacksburg, Virginia


Job Description

We are searching for a postdoctoral researcher in the Hydrologic Innovation and Remote Sensing Lab of the Geoscience Department, Virginia Tech. The NASA-funded research includes prediction of groundwater resources using multiple remote sensing technologies and quantification of related hazards, including water resource overdraft, land subsidence, flooding, and anthropogenic and climate droughts. The selected candidate will apply a suite of statistical and physical models to integrate observations of different accuracies, improve predictions of water resources, and/or predict future hazards.

Required Qualifications

Ph.D. in Earth Sciences, Geodesy, Physics, Mathematics, Computer Sciences, Civil & Environmental Engineering, or a related field. Strong skills in technical programming languages, e.g., Matlab, Python, or C++, and in computational statistics, e.g., unsupervised and supervised ML methods, including deep learning. Experience with at least two of the following: remote sensing of surface and ground water resources, analysis of satellite gravimetry (GRACE) data, analysis of radar and optical remote sensing, poromechanical modeling, and elastic crustal load modeling.

Preferred Qualifications

Knowledge of remote sensing of surface water quality, numerical modeling, and GIS; leadership and advising skills; teaching and writing experience.

Appointment Type

Restricted

Salary Information

$50,000-$55,000

Review Date

May 1, 2024

Additional Information

PhD awarded no more than four years prior to the effective date of appointment with a minimum of one year eligibility remaining.

The successful candidate will be required to have a criminal conviction check.

About Virginia Tech

Dedicated to its motto, Ut Prosim (That I May Serve), Virginia Tech pushes the boundaries of knowledge by taking a hands-on, transdisciplinary approach to preparing scholars to be leaders and problem-solvers. A comprehensive land-grant institution that enhances the quality of life in Virginia and throughout the world, Virginia Tech is an inclusive community dedicated to knowledge, discovery, and creativity. The university offers more than 280 majors to a diverse enrollment of more than 36,000 undergraduate, graduate, and professional students in eight undergraduate colleges, a school of medicine, a veterinary medicine college, Graduate School, and Honors College. The university has a significant presence across Virginia, including the Innovation Campus in Northern Virginia; the Health Sciences and Technology Campus in Roanoke; sites in Newport News and Richmond; and numerous Extension offices and research centers. A leading global research institution, Virginia Tech conducts more than $500 million in research annually.

Virginia Tech does not discriminate against employees, students, or applicants on the basis of age, color, disability, sex (including pregnancy), gender, gender identity, gender expression, genetic information, national origin, political affiliation, race, religion, sexual orientation, or military status, or otherwise discriminate against employees or applicants who inquire about, discuss, or disclose their compensation or the compensation of other employees or applicants, or on any other basis protected by law.

If you are an individual with a disability and desire an accommodation, please contact Susanna Werth at

sw****@vt.edu during regular business hours at least 10 business days prior to the event.



Source link

19Oct

Data Warehouse Lead at Citi – CITI AV. REVOLUCION NO. 1267, COL. LOS ALPES CIUDAD DE MEXICO


The Data Analytics Lead Analyst is a strategic professional who stays abreast of developments within own field and contributes to directional strategy by considering their application in own job and the business. Recognized technical authority for an area within the business. Requires basic commercial awareness. There are typically multiple people within the business that provide the same level of subject matter expertise. Developed communication and diplomacy skills are required in order to guide, influence and convince others, in particular colleagues in other areas and occasional external customers. Significant impact on the area through complex deliverables. Provides advice and counsel related to the technology or operations of the business. Work impacts an entire area, which eventually affects the overall performance and effectiveness of the sub-function/job family.

Responsibilities:

  • Integrates subject matter and industry expertise within a defined area.
  • Contributes to data analytics standards around which others will operate.
  • Applies in-depth understanding of how data analytics collectively integrate within the sub-function as well as coordinate and contribute to the objectives of the entire function.
  • Employs developed communication and diplomacy skills are required in order to guide, influence and convince others, in particular colleagues in other areas and occasional external customers.
  • Resolves occasionally complex and highly variable issues.
  • Produces detailed analysis of issues where the best course of action is not evident from the information available, but actions must be recommended/ taken.
  • Responsible for volume, quality, timeliness and delivery of data science projects, along with short-term resource planning.
  • Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm’s reputation and safeguarding Citigroup, its clients and assets, by driving compliance with applicable laws, rules and regulations, adhering to Policy, applying sound ethical judgment regarding personal behavior, conduct and business practices, and escalating, managing and reporting control issues with transparency.

Qualifications:

  • 6-10 years of experience using code for statistical modeling of large data sets

Education:

  • Bachelor's/University degree or equivalent experience, potentially a Master's degree

This job description provides a high-level review of the types of work performed. Other job-related duties may be assigned as required.

  • 5+ years of experience in data projects (ETL, DWH, DataLake, information analysis)
  • 5+ years of experience in AbInitio/ExpressIT.
  • 5+ years of experience with Relational Database Systems and SQL such as Teradata, SQL Server, Oracle, or similar.
  • 2+ years of experience with Agile practices
  • Preferred 5+ years of in-depth experience with the Hadoop stack (MapReduce, HDFS, YARN) and with some Hadoop ecosystem tools (HIVE, HBase, Pig, Impala, Sqoop, etc.)
  • Preferred 5+ years of experience in at least one scripting language (Python, Perl, PHP, Java, Javascript, C++, C#, C, etc.)
  • Preferred 2+ years of experience with NoSQL implementation (Mongo, Cassandra, Neo4j, etc).
  • Preferred 3+ years of experience with Linux or Unix operating systems
  • Preferred 2+ years of experience using DevOps to do CI / CD
  • Strong analytical and quantitative skills
  • Data driven and results-oriented
  • Experience delivering with an agile methodology
  • Experience leading infrastructure and technical programs
  • Skilled at working with third party service providers
  • Excellent written and oral communication skills
  • English proficiency at 80% in speaking and writing.
  • Ability to deliver cross-functional solutions with multiple customers
  • Strong people skills
  • Communicating complex technical concepts to non-technical people
  • High teamwork capacity
  • Ability to handle multiple projects

——————————————————

Job Family Group:

Technology

——————————————————

Job Family:

Data Analytics

——————————————————

Time Type:

Full time

——————————————————

Citi is an equal opportunity and affirmative action employer.

Qualified applicants will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.

Citigroup Inc. and its subsidiaries (“Citi”) invite all qualified interested applicants to apply for career opportunities. If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity review Accessibility at Citi.

View the “EEO is the Law” poster. View the EEO is the Law Supplement.

View the EEO Policy Statement.

View the Pay Transparency Posting



Source link

19Oct

Software Engineer, ML System Scheduling at ByteDance – Seattle


Responsibilities

Founded in 2012, ByteDance’s mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok and Helo as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

ByteDance is a global incubator of platforms at the cutting edge of commerce, content, entertainment, and enterprise services – over 2.5 billion people interact with ByteDance products, including TikTok.

Why Join Us
Creation is the core of ByteDance’s purpose. Our products are built to help imaginations thrive. This is doubly true of the teams that make our innovations possible.
Together, we inspire creativity and enrich life – a mission we aim towards achieving every day.
To us, every challenge, no matter how ambiguous, is an opportunity; to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always.
At ByteDance, we create together and grow together. That’s how we drive impact – for ourselves, our company, and the users we serve.
Join us.

Team Intro
Founded in 2023, the ByteDance Doubao (Seed) Team is dedicated to pioneering advanced AI foundation models. Our goal is to lead in cutting-edge research and drive technological and societal advancements.

With a strong commitment to AI, our research areas span deep learning, reinforcement learning, Language, Vision, Audio, AI Infra and AI Safety. Our team has labs and research positions across China, Singapore, and the US.

Leveraging substantial data and computing resources and through continued investment in these domains, we have developed a proprietary general-purpose model with multimodal capabilities. In the Chinese market, Doubao models power over 50 ByteDance apps and business lines, including Doubao, Coze, and Dreamina, and are available to external enterprise clients via Volcano Engine. Today, the Doubao app stands as the most widely used AIGC application in China.

The Machine Learning (ML) System sub-team combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems/services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. In our team, you'll have the opportunity to build large-scale heterogeneous systems integrating GPU/NPU/RDMA/storage and keep them running stably and reliably, enrich your expertise in coding, performance analysis, and distributed systems, and be involved in the decision-making process. You'll also be part of a global team with members from the United States, China, and Singapore working collaboratively towards a unified project direction.

Responsibilities:
1. Responsible for the design and development of resource scheduling, including model training, model evaluation and model inference in various scenarios (LLM/AIGC/NLP/CV/Speech, etc.)
2. Responsible for the optimal orchestration of various computing resources (GPU, CPU, other heterogeneous hardware), realizing the rational use of stable resources, tidal resources, mixed resources, and multi-cloud resources
3. Responsible for the optimal combination of computing resources, RDMA high-speed network resources, and storage resources, and giving full play to the power of large-scale distributed clusters
4. Responsible for offline and online workload scheduling in global data centers integrating multi-cloud scenarios to achieve rational distributions

Qualifications

Minimum Qualifications:
1. Be proficient in 1 to 2 programming languages such as Go/Python/Shell in Linux environment
2. Be familiar with Kubernetes architecture and container technology such as Docker/Containerd/Kata/Podman, and have rich experience in Machine Learning system practice and development
3. Understand the principles of distributed systems and have experience in the design, development and maintenance of large-scale distributed systems
4. Have an excellent logical analysis ability, able to reasonably abstract and split business logic
5. Have a strong sense of responsibility, good learning ability, communication skills and self-drive, able to respond and act quickly

Preferred Qualifications
1. Familiar with at least one major Machine Learning framework (TensorFlow/PyTorch)
2. Experience in one of the following fields: AI Infrastructure, HW/SW Co-Design, High Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking)

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.

ByteDance Inc. is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at https://shorturl.at/cdpT2

By submitting an application for this role, you accept and agree to our global applicant privacy policy, which may be accessed here: https://jobs.bytedance.com/en/legal/privacy.

Job Information

【For Pay Transparency】Compensation Description (annually)

The base salary range for this position in the selected city is $184,300 – $337,250 annually.

Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.​

Our company benefits are designed to convey company culture and values, to create an efficient and inspiring work environment, and to support our employees to give their best in both work and life. We offer the following benefits to eligible employees: ​

We cover 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents and offer a Health Savings Account(HSA) with a company match. As well as Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life and AD&D insurance plans. In addition to Flexible Spending Account(FSA) Options like Health Care, Limited Purpose and Dependent Care. ​

Our time off and leave plans are: 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure) and 10 paid sick days per year as well as 12 weeks of paid Parental leave and 8 weeks of paid Supplemental Disability. ​

We also provide generous benefits like mental and emotional health benefits through our EAP and Lyra. A 401K company match, gym and cellphone service reimbursements. The Company reserves the right to modify or change these benefits programs at any time, with or without notice.​

For Los Angeles County (unincorporated) Candidates:​

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:​

1. Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;​

2. Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and​

3. Exercising sound judgment.​



Source link

19Oct

UI-Focused AI Agent


The UFO AI Agent aims to seamlessly navigate applications within the Windows OS and orchestrate events to fulfil a user query.

Initial Observations

This Windows OS-based AI agent, called UFO, can work well as a personal workflow optimiser, suggesting the most optimal workflow to achieve a task on your PC.

We all have a process through which we interact with our UI… this agent can help optimise that personal workflow.

I once read that when a new type of UI is introduced, like a surface or touch screen, we start interacting with it, and over time loose patterns of behaviour are established, which later turn into UI design conventions.

The same is happening with AI agents. Key ingredients of AI agents are complex task decomposition and the creation of sequences of chained steps, and AI agent framework creators are converging on a set of good ideas.

They go through an iterative process of action, observation and thought prior to taking the next step.

AI Agents are also starting to exist within digital worlds, like in this case, Windows OS. Other examples are Apple’s iOS, or a web browser like Web Voyager.

You will see that just as we as users have design affordances at our disposal to interact and navigate, these affordances are also available to the AI agent.

There is also a set of actions identified as potentially high in consequence, like deleting files or sending an email. The ramifications of these risks will grow considerably when AI agents are embodied in the real world.

Lastly, quite a while ago I wrote about the ability of LLMs to perform symbolic reasoning, a feature which I felt did not enjoy the attention it deserved.

We all perform symbolic reasoning as humans: we observe a room and are able to mentally plan and project tasks based on what we have seen in a spatial setting. LLMs also have this capability, but visual information was always delivered via a text description. With the advent of vision capabilities in LLMs, images can be used.

The image below shows a common trait in AI Agents with a digital environment, where observation, thought and action are really all language based.

In user interface design, loose patterns of behaviour in time turn into UI design conventions

UFO = “U”I-”Fo”cused AI Agent

The goal of UFO as an AI agent is to effortlessly navigate and operate within individual applications, as well as across multiple apps, to complete user requests.

One powerful use-case is leveraging Vision-Language Models (VLMs) to interact with software interfaces, responding to natural language commands and executing them within real-world environments.

The development of Language Models with vision marks a shift from Large Language Models (LLMs) to Large Action Models (LAMs), enabling AI to translate decisions into real-world actions.

UFO also features an application-switching mechanism, allowing it to seamlessly transition between apps when necessary.

Vision-Language-Action models transfer web knowledge to robotic control

The image above illustrates the UFO Windows AI Agent. The AI Agent completes the user request by retrieving information from various applications, including Word, Photos, PowerPoint, etc. An email is then composed with the synthesised information.

UFO Process Overview

Initial Setup — UFO provides HostAgent with a full desktop screenshot and a list of available applications. The HostAgent uses this information to select the appropriate application for the task and creates a global plan to complete the user request.

Focus and Execution — The selected application is brought into focus on the desktop. The AppAgent begins executing actions based on the global plan.

Action Selection — Before each action, UFO captures a screenshot of the current application window with annotated controls. UFO provides details about each control for AppAgent’s observation.

Below is an image with annotation examples…

Action Execution — the AppAgent chooses a control, selects an action to execute, and carries it out using a control interaction module.

After each action, UFO builds a local plan for the next step and continues the process until the task is completed in the application.

Handling Multi-App Requests

If the task requires multiple applications, AppAgent passes control back to HostAgent to switch to the next app.

The process is repeated for each application until the user request is fully completed.

To some extent, it feels like the HostAgent acts as the orchestration agent, and the AppAgents are really agents in their own right. There are no tools per se, but rather applications being accessed.

Interactive Requests

Users can introduce new requests at any time, prompting UFO to repeat the process.

Once all user requests are completed or fulfilled, UFO ends its operation.
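Putting the overview together, the control flow could be sketched roughly as follows; every class and function name here is illustrative pseudocode, not UFO's actual API:

    def ufo_session(user_request):
        host = HostAgent()
        while True:
            # HostAgent observes the desktop and drafts a global plan
            desktop = capture_desktop_screenshot()
            app, global_plan, status = host.select_application(user_request, desktop)
            if status == "FINISH":
                break
            bring_into_focus(app)
            agent = AppAgent(app, global_plan)
            while True:
                # AppAgent observes the app window with annotated controls
                screenshot = capture_annotated_screenshot(app)
                controls = list_controls(app)
                action = agent.select_action(screenshot, controls)
                execute(action)            # control interaction module
                agent.update_local_plan()  # plan the next step
                if agent.status in ("FINISH", "APP_SELECTION"):
                    break  # task done here, or hand control back to HostAgent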

More On The HostAgent

The process begins with a detailed observation of the current desktop window, captured through screenshots that provide a clear view of the active interface.

Based on this observation, the next logical step to complete the task is determined, following the Chain-of-Thought (CoT) reasoning approach.

Once the appropriate application is selected, its label and name are identified and noted.

The status of the task is then assessed, with the system indicating whether to continue or finish.

Alongside this, a global plan is devised — typically a broad, high-level outline for fulfilling the user request. If this plan were visible to the user, and editable, it would make for an excellent human-in-the-loop feedback mechanism and a source of future improvements.

Throughout the process, additional comments or information are provided, often including a brief summary of progress or highlighting key points for further consideration.

More On The AppAgent

The process starts with the user submitting a request to UFO, which is identical to the one received by the HostAgent.

UFO then captures screenshots of the application, divided into three types:

  1. a previous screenshot,
  2. a clean current one,
  3. and an annotated version showing available controls.

Alongside this, control information is collected, listing the names and types of controls available for interaction in the selected application.

The system also recalls previous thoughts, comments, actions and execution results, building a memory that mirrors the HostAgent’s own recollections.

Additionally, examples are provided to demonstrate possible action choices for the AppAgent.

With this comprehensive input, AppAgent carefully analyses the details.

First, it makes an observation, providing a detailed description of the current application window and evaluating whether the last action had the intended effect.

The rationale behind each action is also considered, as the AppAgent logically determines the next move.

Control

Once a control is selected for interaction, its label and name are identified, and the specific function to be performed on it is defined.

The AppAgent then assesses the task status, deciding whether to continue if further actions are needed, finish if the task is complete, pending if awaiting user confirmation, screenshot if a new screenshot is required for more control annotations, or App Selection if it’s time to switch to another application.
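These five statuses could be modelled as a simple enumeration (names are illustrative, not UFO's actual implementation):

    from enum import Enum

    class AppAgentStatus(Enum):
        CONTINUE = "continue"            # further actions are needed
        FINISH = "finish"                # the task is complete
        PENDING = "pending"              # awaiting user confirmation
        SCREENSHOT = "screenshot"        # request fresh control annotations
        APP_SELECTION = "app_selection"  # switch to another application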

To ensure smooth progress, the AppAgent generates a local plan, a more detailed and precise roadmap for upcoming steps to fully satisfy the user request.

Throughout this process, additional comments are provided, summarising progress or highlighting key points, mirroring the feedback offered by the HostAgent.

Observation & Thought

When HostAgent is prompted to provide its Observation and Thoughts, it serves two key purposes.

First, it pushes HostAgent to thoroughly analyse the current state of the task, offering a clear explanation of its logic and decision-making process.

This not only strengthens the internal consistency of its choices but also makes UFO’s operations more transparent and easier to understand.

Second, HostAgent assesses the task’s progress, outputting “FINISH” if the task is complete.

It can also provide feedback to the user, such as reporting progress, pointing out potential issues, or answering any queries.

Once the correct application is identified, UFO moves forward with the task, and AppAgent takes charge of executing the necessary actions within the application to fulfil the user request.

Design Consideration

UFO integrates a range of design features specifically crafted for the Windows OS.

These enhancements streamline interactions with UI controls, making them more efficient, automated, and secure, ultimately improving UFO’s ability to handle user requests.

Key aspects include interactive mode, customisable actions, control filtering, plan reflection, and safety mechanisms, each of which is discussed in more detail in the following sections.

Interactive Mode

UFO allows users to engage in interactive and iterative exchanges instead of relying on one-time completions.

After finishing a task, users can request enhancements to the previous task, propose entirely new tasks, or even assist UFO with operations it might struggle with, such as entering a password.

The researchers believe this user-friendly approach sets UFO apart from other UI agents on the market, enabling it to absorb user feedback and effectively manage longer, more complex tasks.

Follow me on LinkedIn ✨✨

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.





Source link
