30Apr

InternVL 1.5 Advances Multimodal AI with High-Resolution and Bilingual Capabilities in Open-Source Models


Multimodal large language models (MLLMs) integrate text and visual data processing to enhance how artificial intelligence understands and interacts with the world. This area of research focuses on creating systems that can comprehend and respond to a combination of visual cues and linguistic information, mimicking human-like interactions more closely.

The challenge often lies in the limited capabilities of open-source models compared to their commercial counterparts. Open-source models frequently exhibit deficiencies in processing complex visual inputs and supporting various languages, which can restrict their practical applications and effectiveness in diverse scenarios.

Historically, most open-source MLLMs have been trained at fixed resolutions, primarily using datasets limited to the English language. This approach significantly hinders their functionality when encountering high-resolution images or content in other languages, making it difficult for these models to perform well in tasks that require detailed visual understanding or multilingual capabilities.

The research from Shanghai AI Laboratory, SenseTime Research, Tsinghua University, Nanjing University, Fudan University, and The Chinese University of Hong Kong introduces InternVL 1.5, an open-source MLLM designed to significantly enhance the capabilities of open-source systems in multimodal understanding. This model incorporates three major improvements to close the performance gap between open-source and proprietary commercial models. The three main components are:

  1. A strong vision encoder, InternViT-6B, optimized through a continuous learning strategy that enhances its visual understanding capabilities.
  2. A dynamic high-resolution approach that lets the model handle images up to 4K resolution by adjusting the number of image tiles to the input’s aspect ratio and resolution.
  3. A high-quality bilingual dataset, meticulously assembled to cover common scenes and document images annotated with English and Chinese question-answer pairs.

These three steps significantly boost the model’s performance in OCR and Chinese-language tasks, enabling InternVL 1.5 to compete robustly across various benchmarks and comparative studies. InternVL 1.5 employs a segmented approach to image handling: it processes images at resolutions up to 4K by dividing them into 448×448-pixel tiles, with the number and layout of tiles adapted dynamically to the image’s aspect ratio and resolution. This method improves comprehension of detailed scenes and documents. The model’s enhanced linguistic capabilities stem from training on a diverse dataset covering a variety of scenes and document types in both English and Chinese, which boosts its performance in OCR and text-based tasks across languages.
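
For intuition, here is a minimal sketch of such aspect-ratio-aware tiling, assuming a Pillow image and an illustrative tile budget. The grid-selection rule below is a simplification, not the paper’s exact algorithm (which also weighs the input resolution when choosing a grid):

from PIL import Image

TILE = 448  # tile size expected by the vision encoder

def dynamic_tiles(img: Image.Image, max_tiles: int = 40) -> list:
    """Split an image into 448x448 tiles on a grid whose aspect ratio
    approximates the input's (illustrative sketch, not InternVL 1.5's
    exact tiling algorithm)."""
    w, h = img.size
    aspect = w / h
    # Enumerate candidate (cols, rows) grids within the tile budget and
    # pick the one whose aspect ratio is closest to the input's.
    candidates = [(c, r)
                  for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles // c + 1)]
    cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))
    # Resize to an exact multiple of the tile size, then crop the tiles.
    resized = img.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]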

The model’s performance is evidenced by its results across multiple benchmarks, where it excels particularly in OCR-related datasets and bilingual scene understanding. InternVL 1.5 demonstrates state-of-the-art results, showing marked improvements over previous versions and surpassing some proprietary models in specific tests. For example, text-based visual question answering achieves an accuracy of 80.6%, and document-based question answering reaches an impressive 90.9%. In multimodal benchmarks that assess models on both visual and textual understanding, InternVL 1.5 consistently delivers competitive results, often outperforming other open-source models and rivaling commercial models.

In conclusion, InternVL 1.5 addresses the significant challenges that open-source multimodal large language models face, particularly in processing high-resolution images and supporting multilingual capabilities. This model significantly narrows the performance gap with commercial counterparts by implementing a robust vision encoder, dynamic resolution adaptation, and a comprehensive bilingual dataset. The enhanced capabilities of InternVL 1.5 are demonstrated through its superior performance in OCR-related tasks and bilingual scene understanding, establishing it as a formidable competitor in advanced artificial intelligence systems. 




Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.







30Apr

OpenVoice V2: Evolving Multilingual Voice Cloning with Enhanced Style Control and Cross-Lingual Capabilities


Instant Voice Cloning (IVC) in Text-to-Speech (TTS) synthesis, also known as Zero-shot TTS, allows TTS models to replicate the voice of any given speaker from just a short audio sample, without requiring additional training on that speaker. While existing methods like VALLE and XTTS can replicate tone color, they offer little flexibility in controlling style parameters such as emotion, accent, and rhythm. Auto-regressive models, though effective, are computationally expensive and slow. Non-autoregressive approaches like YourTTS and Voicebox offer faster inference but lack comprehensive style control. Additionally, achieving cross-lingual voice cloning demands extensive datasets, hindering the inclusion of new languages. Closed-source projects further impede collaborative advancement in the field.

MIT CSAIL, MyShell.ai, and Tsinghua University researchers have developed OpenVoice V2, a groundbreaking text-to-speech model enabling voice cloning across languages. OpenVoice V2 transcends language barriers, offering applications like personalized digital interfaces, multilingual virtual assistants, and automatic dubbing. With enhanced audio quality and native support for English, Spanish, French, Chinese, Japanese, and Korean, OpenVoice V2 surpasses its predecessor. It allows granular control over voice styles, including emotion and accent, without relying on the reference speaker’s style. Moreover, it achieves zero-shot cross-lingual voice cloning, even for languages absent from its training data, while maintaining computational efficiency and real-time inference capabilities.

Prior research in IVC encompasses auto-regressive methods like VALLE and XTTS, which extract speaker characteristics and generate speech sequentially. While they replicate tone color effectively, they lack flexibility in adjusting style parameters like emotion and accent, and they are computationally intensive and slow. Non-autoregressive approaches like YourTTS and Voicebox offer faster inference but struggle with style parameter control, and they often rely on extensive datasets for cross-lingual cloning, limiting language inclusivity. Closed-source research from tech giants hampers collaborative progress, hindering innovation and accessibility for the research community.

OpenVoice V2 integrates features from its predecessor and introduces Accurate Tone Color Cloning, Flexible Voice Style Control, and Zero-shot Cross-lingual Voice Cloning. The model’s simplicity lies in decoupling tone color cloning from style and language control, achieved through a base speaker TTS model and a tone color converter. The TTS model handles style and language, while the converter embodies the reference speaker’s tone color. Training involves collecting datasets for TTS and tone color conversion separately. The model structure employs flow layers for tone color conversion, ensuring natural sound while removing tone color information. The approach facilitates fluent multilingual speech generation.
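
Conceptually, the decoupled pipeline can be sketched as follows; the class and method names are hypothetical illustrations of the base-TTS-plus-converter split described above, not OpenVoice’s actual API:

# Hypothetical sketch of OpenVoice-style decoupling; names are
# illustrative, not the project's actual API.

class BaseSpeakerTTS:
    """Handles language and style (emotion, accent, rhythm) in a fixed base voice."""
    def synthesize(self, text, language, style):
        ...

class ToneColorConverter:
    """Flow-based converter that swaps the base voice's tone color
    for the reference speaker's while preserving everything else."""
    def extract_tone_color(self, reference_audio):
        ...
    def convert(self, audio, tone_color):
        ...

def clone_voice(text, language, style, reference_audio, tts, converter):
    # 1) The base TTS model controls style and language.
    base_audio = tts.synthesize(text, language, style)
    # 2) The converter embodies the reference speaker's tone color,
    #    enabling cloning even for languages unseen during training.
    tone = converter.extract_tone_color(reference_audio)
    return converter.convert(base_audio, tone)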

The evaluation of voice cloning faces challenges in objectivity due to variations in training/test sets and objectives across studies. OpenVoice focuses on tone color cloning, style parameter control, and cross-lingual cloning. Rather than numerical comparisons, it emphasizes qualitative analysis, offering publicly available audio samples for assessment. It accurately clones tone color across diverse voice distributions, preserves various speech styles, and enables cross-lingual cloning with minimal speaker data. OpenVoice’s feed-forward structure ensures rapid inference, achieving 12× real-time performance on a single A10G GPU, with potential for further optimization.

In conclusion, OpenVoice V2 enhances audio quality through a revised training strategy and introduces native English, Spanish, French, Chinese, Japanese, and Korean support. V1 and V2 are now available for free commercial use under the MIT License. Building upon V1’s features, V2 excels in tone color cloning across languages and accents, offers precise control over voice styles, and enables zero-shot cross-lingual cloning. By decoupling tone color cloning from other voice styles and languages, OpenVoice achieves greater flexibility and provides its source code and model weights for future research.



29Apr

Top Data Science Courses in 2024


As businesses increasingly rely on data-driven decision-making, the ability to extract insights and derive value from data has become quite essential. Acquiring skills in data science enables professionals to unlock new opportunities for innovation and gain a competitive edge in today’s digital age. This article lists the top data science courses one should take to master the necessary skills and meet the growing demand for data expertise in various industries.

IBM Data Science Professional Certificate

This beginner-friendly course helps students master the practical skills and knowledge necessary to become a proficient data scientist. It teaches the tools, languages, and libraries data scientists use, such as Python and SQL, and lets students demonstrate their proficiency through real-world projects.

Data Science Specialization

“Data Science Specialization” covers the concepts and tools required throughout the entire data science pipeline. The course also has a separate section on statistics, which is essential for data science. It uses R language for all programming tasks such as data analysis, statistical inference, and building machine learning models.

Applied Data Science with Python Specialization

This course is ideal for learners with a basic programming background. It teaches data science through Python and covers its libraries, such as matplotlib, pandas, nltk, scikit-learn, and networkx, covering topics like information visualization, text analysis, and social network analysis.

Programming for Data Science with Python

This course covers the programming skills required to discover patterns and insights in extensive datasets, execute queries against relational databases, and use the Unix shell and Git. It includes instruction on libraries such as NumPy and Pandas, along with core programming concepts like control flow.

Python for Data Science

This course introduces a comprehensive set of tools crucial for data analysis and conducting data science. It covers Jupyter Notebooks, Pandas, NumPy, Matplotlib, Git, and numerous other tools. Through engaging with compelling data science problems, students will acquire proficiency in utilizing these tools, gaining practical experience within a real-world context.

Data Science: R Basics

This course introduces the basics of R programming and moves on to cover advanced topics such as probability, inference, regression, and machine learning. It also covers data manipulation using dplyr, visualization with ggplot2, file management in UNIX/Linux, version control through Git and GitHub, and creating reproducible documents with RStudio.

Applied Data Science Specialization

This course covers the tools needed to analyze data and make data-driven business decisions, leveraging computer science and statistical analysis. Through lectures, hands-on labs, and projects hosted in the IBM Cloud, students gain practical experience addressing intriguing data challenges from beginning to end.

Data Science with Python Certification Course

This course is designed to help you become proficient in key Python programming principles, including data and file operations, object-oriented programming, and essential Python libraries like Pandas, NumPy, and Matplotlib for data science. Tailored for both professionals and beginners, it covers various machine learning (ML) techniques, recommendation systems, and other important ML concepts.

Foundations of Data Science

This course is intended for those already in the industry and helps develop the skills needed to apply for more advanced data professional roles. It covers the project workflow PACE (Plan, Analyze, Construct, Execute) and explains how it can help organize data projects. 

Associate Data Scientist in Python

This course is designed by DataCamp, and it enables learners to apply theoretical concepts by executing code directly in the browser. It thoroughly explores libraries such as pandas, Seaborn, Matplotlib, scikit-learn, and others. Additionally, it provides opportunities for learners to engage with real-world datasets, mastering statistical and machine learning techniques necessary for hypothesis testing and constructing predictive models.


We make a small profit from purchases made via referral/affiliate links attached to each course mentioned in the above list.

If you want to suggest any course that we missed from this list, please email us at as**@ma**********.com


Shobha is a data analyst with a proven track record of developing innovative machine-learning solutions that drive business value.





28Apr

This AI Paper from Google DeepMind Introduces Enhanced Learning Capabilities with Many-Shot In-Context Learning


In-context learning (ICL) in large language models (LLMs) uses input-output examples to adapt to new tasks without altering the underlying model architecture. This method has transformed how models handle various tasks by learning from direct examples provided during inference. The problem at hand is the limitation of few-shot ICL in handling intricate tasks: such tasks often demand a depth of comprehension that few-shot learning cannot provide, since it operates under the restriction of minimal input data. This falls short for applications requiring detailed analysis and decision-making based on extensive data, such as advanced reasoning or language translation.

Existing research in the field of ICL has primarily focused on the few-shot learning capabilities of models like GPT-3, which adapt to new tasks with a limited set of examples. Studies have investigated the performance limits of these models within small context windows, revealing constraints in task complexity and scalability. The development of models with larger context windows, such as Gemini 1.5 Pro, which supports up to 1 million tokens, represents a significant evolution. This expansion allows for exploring many-shot ICL, greatly enhancing the models’ ability to process and learn from a larger dataset.

Researchers from Google DeepMind have introduced a shift toward many-shot ICL, leveraging the larger context windows of models like Gemini 1.5 Pro. This move from few-shot to many-shot learning uses a greatly increased number of input examples, significantly enhancing model performance and adaptability across complex tasks. The unique aspect of this methodology is the integration of Reinforced ICL and Unsupervised ICL, which reduce reliance on human-generated content by employing model-generated data and domain-specific inputs alone.

In terms of methodology, the Gemini 1.5 Pro model was employed to handle an expanded array of input-output examples, supporting up to 1 million tokens in its context window. This allowed the exploration of Reinforced ICL, where the model generates and evaluates its rationales for correctness, and Unsupervised ICL, which challenges the model to operate without explicit rationales. The experiments were conducted across diverse domains, including machine translation, summarization, and complex reasoning tasks, using datasets like MATH for mathematical problem-solving and FLORES for machine translation tasks to test and validate the effectiveness of the many-shot ICL framework.
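
As a rough illustration, assembling a Reinforced ICL prompt might look like the sketch below; the prompt format, filtering rule, and helper functions are assumptions made for illustration, not DeepMind’s exact recipe:

def build_reinforced_icl_prompt(problems, model_solve, is_correct,
                                new_question, max_shots=500):
    # Reinforced ICL: use model-generated rationales as shots, keeping
    # only those whose final answers check out against reference answers.
    shots = []
    for question, gold_answer in problems:
        rationale, answer = model_solve(question)  # model writes its own rationale
        if is_correct(answer, gold_answer):
            shots.append(f"Problem: {question}\nSolution: {rationale}\nAnswer: {answer}")
        if len(shots) >= max_shots:
            break
    # A 1M-token context window fits hundreds of worked examples in one prompt.
    return "\n\n".join(shots) + f"\n\nProblem: {new_question}\nSolution:"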

The results from implementing many-shot ICL demonstrate significant performance enhancements. In machine translation tasks, the Gemini 1.5 Pro model outperformed previous benchmarks, achieving a 4.5% increase in accuracy for Kurdish and a 1.5% increase for Tamil translations compared to earlier models. In mathematical problem-solving, the MATH dataset showed a 35% improvement in solution accuracy when using many-shot settings. These quantitative outcomes validate the effectiveness of many-shot ICL in enhancing the model’s adaptability and accuracy across diverse and complex cognitive tasks.

In conclusion, the research marks a significant step forward in ICL by transitioning from few-shot to many-shot ICL using the Gemini 1.5 Pro model. By expanding the context window and integrating innovative methodologies like Reinforced and Unsupervised ICL, the study has successfully enhanced model performance across various tasks, including machine translation and mathematical problem-solving. These advancements not only improve the adaptability and efficiency of large language models but also pave the way for more sophisticated applications in AI.




Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.







27Apr

FlashSpeech: A Novel Speech Generation System that Significantly Reduces Computational Costs while Maintaining High-Quality Speech Output


In recent years, speech synthesis has undergone a profound transformation thanks to the emergence of large-scale generative models. This evolution has led to significant strides in zero-shot speech synthesis systems, including text-to-speech (TTS), voice conversion (VC), and editing. These systems aim to generate speech by incorporating unseen speaker characteristics from a reference audio segment during inference without requiring additional training data.

The latest advancements in this domain leverage language and diffusion-style models for in-context speech generation on large-scale datasets. However, due to the intrinsic mechanisms of language and diffusion models, the generation process of these methods often entails extensive computational time and cost.

To tackle the challenge of slow generation speed while upholding high-quality speech synthesis, a team of researchers has introduced FlashSpeech as a groundbreaking stride towards efficient zero-shot speech synthesis. This novel approach builds upon recent advancements in generative models, particularly the latent consistency model (LCM), which paves a promising path for accelerating inference speed. 

FlashSpeech leverages the LCM and adopts the encoder of a neural audio codec to convert speech waveforms into latent vectors as the training target. To train the model efficiently, the researchers introduce adversarial consistency training, a novel technique that combines consistency and adversarial training using pre-trained speech-language models as discriminators.

One of FlashSpeech’s key components is the prosody generator module, which enhances the diversity of prosody while maintaining stability. By conditioning the LCM on prior vectors obtained from a phoneme encoder, a prompt encoder, and the prosody generator, FlashSpeech achieves more diverse expressions and prosody in the generated speech. 
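
Putting the pieces together, inference might flow roughly as in the sketch below; every component name and the sampling call are illustrative placeholders for the architecture described above, not FlashSpeech’s actual code:

# Illustrative sketch of a FlashSpeech-style pipeline; names are
# placeholders, not the actual implementation.

def flashspeech_infer(text, prompt_audio, phoneme_encoder, prompt_encoder,
                      prosody_generator, lcm, codec_decoder, num_steps=1):
    phonemes = phoneme_encoder(text)               # linguistic content
    prompt = prompt_encoder(prompt_audio)          # reference speaker conditioning
    prosody = prosody_generator(phonemes, prompt)  # diverse yet stable prosody
    # The latent consistency model maps noise to a speech latent in very
    # few steps, which is where the large inference speedup comes from.
    latent = lcm.sample(condition=(phonemes, prompt, prosody), num_steps=num_steps)
    return codec_decoder(latent)                   # decode latent back to a waveform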

When it comes to performance, FlashSpeech not only surpasses strong baselines in audio quality but also matches them in speaker similarity. What’s truly remarkable is that it achieves this at a speed approximately 20 times faster than comparable systems, marking an unprecedented level of efficiency in zero-shot speech synthesis.

The introduction of FlashSpeech signifies a significant leap forward in the field of zero-shot speech synthesis. By addressing the core limitations of existing approaches and harnessing recent innovations in generative modeling, FlashSpeech presents a compelling solution for real-world applications that demand rapid and high-quality speech synthesis. 

With its efficient generation speed and superior performance, FlashSpeech holds immense promise for a variety of applications, including virtual assistants, audio content creation, and accessibility tools. As the field continues to evolve, FlashSpeech sets a new standard for efficient and effective zero-shot speech synthesis systems.




Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc Physics from the Indian Institute of Technology Kharagpur. He believes that understanding things at the fundamental level leads to new discoveries, which lead to advancements in technology. He is passionate about understanding nature fundamentally with the help of tools like mathematical models, ML models, and AI.







26Apr

SenseTime from China Launched SenseNova 5.0: Unleashing High-Speed, Low-Cost Large-Scale Modeling, Challenging GPT-4 Turbo’s Performance


Artificial intelligence continues evolving, pushing data processing and computational efficiency boundaries. A standout development in this space is the emergence of large-scale AI models that are not just expansive but also uniquely capable of handling complex datasets and multi-faceted tasks with greater precision and speed. These models advance various technologies, from automated reasoning to complex problem-solving across multiple domains.

One persistent challenge in AI has been optimizing the balance between computational power and efficiency. Traditional AI systems rely heavily on cloud-based infrastructures, which, while powerful, often suffer from significant latency issues. This lag can be detrimental in scenarios where real-time data processing is crucial, such as autonomous driving systems or medical diagnostics.

The current generation of AI models has seen significant enhancements in response to these limitations. No longer confined to centralized servers, these models are increasingly capable of running on local devices at the edge of networks. This shift reduces latency by processing data where it is collected, though such setups demand more refined and capable handling of data to maintain efficiency.

SenseTime from China has launched the RiRiXin SenseNova 5.0. This model represents a leap in AI capabilities, employing a hybrid expert architecture that leverages both the depth of cloud computing and the responsiveness of edge computing technologies. The model was trained on over 10TB of tokens, encompassing extensive synthetic data, and can handle a 200K context window during inference. It focuses on boosting proficiency in knowledge, mathematics, reasoning, and coding, achieving improvements of 10% or more in mainstream objective evaluations and surpassing the performance of GPT-4 Turbo.

The SenseNova 5.0 model notably excels in its operational metrics. Compared to its predecessors, it has achieved a performance improvement of over 10% in mainstream objective evaluations. Specifically, it has shown prowess in enhancing knowledge-based tasks and multi-modal functions, including image and language processing. It supports an inference speed of up to 109.5 words per second, over five times faster than the human eye can read.

SenseTime has equipped the model to operate seamlessly across various devices, like mobile phones and tablets, integrating edge computing solutions that significantly reduce cloud server dependency. This integration has substantially reduced inference costs by up to 80% compared to similar models in the industry. The deployment of these models in specialized sectors like finance, medicine, and government operations has demonstrated both high efficiency and cost-effectiveness, offering scalable solutions that adapt quickly to user demands.

In conclusion, SenseTime’s development of the RiRiXin SenseNova 5.0 model marks a transformative step in artificial intelligence. By harmonizing high-level data processing with swift, localized computation, this model sets a new standard in the efficiency and application of AI technology. The significant reductions in latency and operational costs, the model’s adaptability across various platforms, and its superior performance in multi-modal evaluations underscore its potential to enhance a wide range of AI-driven services and applications, making advanced AI more accessible and practical for everyday use.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.





26Apr

Neural Flow Diffusion Models (NFDM): A Novel Machine Learning Framework that Enhances Diffusion Models by Supporting a Broader Range of Forward Processes Beyond the Fixed Linear Gaussian


Generative models, a class of probabilistic machine learning models, have many uses across domains, including the visual and performing arts, the medical industry, and even physics. Generative models excel at building probability distributions that appropriately describe a dataset, from which they can generate new samples similar to the original data. These capabilities make them ideal for generating synthetic datasets to supplement training data (data augmentation) and for discovering latent structures and patterns in an unsupervised learning setting.

The two main steps in building diffusion models, a type of generative model, are the forward and reverse processes. The forward process progressively corrupts the data distribution, taking it from its original condition to a noisy one. The reverse process learns to invert the corruptions introduced by the forward process, restoring the data distribution; in this way, the model learns to generate data starting from pure noise. Diffusion models have shown impressive performance in several fields. The majority of current diffusion models, however, assume a fixed forward process that is Gaussian in nature, rendering them incapable of adapting to the task or simplifying the reverse process’s target.

New research by the University of Amsterdam and Constructor University, Bremen, introduces Neural Flow Diffusion Models (NFDM), a framework that enables the forward process to specify and learn latent variable distributions. Unlike traditional diffusion models that depend on a conditional Gaussian forward process, NFDM can accommodate any continuous (and learnable) distribution that can be represented as an invertible mapping applied to noise. Additionally, the researchers minimize a variational upper bound on the negative log-likelihood (NLL) using an end-to-end, simulation-free optimization technique. They also propose a parameterization of the forward process based on efficient neural networks, allowing it to adapt to the reverse process during training and making the data distribution easier to learn.
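
In schematic notation, the idea reads as follows; the symbols below are an illustrative rendering of this description, not the paper’s exact formulation:

% Learnable forward process: the latent z_t is an invertible, learnable
% transformation of Gaussian noise, conditioned on the data x and time t.
z_t = F_\varphi(\varepsilon, t, x), \qquad \varepsilon \sim \mathcal{N}(0, I)

% Training minimizes a variational upper bound on the negative
% log-likelihood, jointly over the forward parameters \varphi and the
% reverse-process parameters \theta:
-\log p_\theta(x) \;\le\; \mathcal{L}(\theta, \varphi)

Here F_\varphi is the invertible, learnable mapping applied to noise, and \mathcal{L} is the variational upper bound minimized end-to-end over both sets of parameters.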

Using NFDM’s adaptability, the researchers delve deeper into training with constraints on the reverse process to obtain generative dynamics with targeted attributes, taking a curvature penalty on the deterministic generative trajectories as a case study. The empirical results show better computational efficiency than baselines on synthetic datasets, MNIST, CIFAR-10, and downsampled ImageNet.

Presenting their experimental findings on CIFAR-10 and ImageNet at 32×32 and 64×64 resolutions, the team showcased the vast potential of NFDM with a learnable forward process. The state-of-the-art NLL results they achieved are crucial for a myriad of applications, including data compression, anomaly detection, and out-of-distribution detection. They also demonstrated NFDM’s application in learning generative processes with specific attributes, such as dynamics with straight-line trajectories; in these cases, NFDM delivered significantly faster sampling, improved generation quality, and fewer required sampling steps, underscoring its practical value.

The researchers are candid about the considerations that must be made when adopting NFDM. They acknowledge that compared to traditional diffusion models, the computational costs increase when a neural network is used to parameterize the forward process. Their results indicate that NFDM optimization iterations take around 2.2 times longer than traditional diffusion models. However, they believe that NFDM’s potential in various fields and practical applications is driven by its flexibility in learning generative processes. They also propose potential avenues for improvement, such as incorporating orthogonal methods like distillation, changing the target, and exploring different parameterizations. 




Dhanshree Shenwai is a computer science engineer with solid experience in FinTech companies spanning the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world that make everyone’s life easier.







25Apr

Researchers at MIT Propose ‘MAIA’: An Artificial Intelligence System that Uses Neural Network Models to Automate Neural Model Understanding Tasks


MIT CSAIL researchers introduced MAIA (Multimodal Automated Interpretability Agent) to address the challenge of understanding neural models, especially in computer vision, where interpreting the behavior of complex models is essential for improving accuracy and robustness and for identifying biases. Current methods rely on manual effort, such as exploratory data analysis, hypothesis formulation, and controlled experimentation, making the process slow and expensive. MAIA instead uses neural models to automate interpretability tasks, such as feature interpretation and failure mode discovery.

Existing approaches to model interpretability are often unscalable and inaccurate, limiting their utility to hypothesis generation rather than providing actionable insights. MAIA, on the other hand, automates interpretability tasks through a modular framework. It utilizes a pre-trained vision-language model as its backbone and provides a set of tools that enable the system to conduct experiments on neural models iteratively. These tools include synthesizing and editing inputs, computing exemplars from real-world datasets, and summarizing experimental results. 

MAIA’s ability to generate descriptions of neural model behavior is compared to both baseline methods and human expert labels, demonstrating its effectiveness in understanding model behavior.

MAIA’s framework is designed to freely conduct experiments on neural systems by composing interpretability tasks into Python programs. Leveraging a pre-trained multimodal model, MAIA can process images directly and design experiments to answer user queries about model behavior. The System class within MAIA’s API instruments the system to be interpreted, making subcomponents individually callable for experimentation. Meanwhile, the Tools class comprises a suite of functions enabling MAIA to write modular programs that test hypotheses about system behavior. 
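
For intuition, an experiment program MAIA composes might look something like the sketch below; the method names are hypothetical stand-ins for the System/Tools split described above, not MAIA’s actual API:

# Hypothetical sketch of a MAIA-style experiment; names illustrate the
# System/Tools split and are not MAIA's actual API.

def describe_neuron(system, tools, neuron_id):
    # Probe the target unit with exemplars drawn from a real dataset.
    exemplars = tools.dataset_exemplars(neuron_id)
    activations = [system.call(neuron_id, image) for image in exemplars]

    # Form a hypothesis about the unit, then test it with synthesized
    # and edited inputs designed to confirm or falsify it.
    hypothesis = tools.describe(exemplars, activations)
    probes = tools.text2image([hypothesis, f"scene without {hypothesis}"])
    probe_activations = [system.call(neuron_id, image) for image in probes]

    # Summarize the accumulated evidence into a behavioral description.
    return tools.summarize(hypothesis, probes, probe_activations)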

The evaluation of MAIA on the black-box neuron description task demonstrates its ability to produce predictive explanations of vision system components, identify spurious features, and automatically detect biases in classifiers. It is effective in generating descriptions of both real and synthetic neurons, outperforms baseline methods, and approaches human expert labels.

In conclusion, MAIA presents a promising solution to the challenge of understanding neural models by automating interpretability tasks. MAIA streamlines the process of understanding model behavior by combining a pre-trained vision-language model with a set of interpretability tools. While human supervision is still necessary to avoid common pitfalls and maximize effectiveness, MAIA’s framework demonstrates high potential utility in the interpretability workflow, offering a flexible and adaptable approach to understanding complex neural systems. Overall, MAIA significantly helps in bridging the gap between human interpretability and automated techniques in model understanding and analysis.




Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in the scope of software and data science applications, and she is always reading about developments in different fields of AI and ML.







24Apr

Meet CopilotKit: An Open-Source Copilot Platform for Seamless AI Integration in Any Application


What is CopilotKit?

CopilotKit is an open-source framework designed to facilitate the integration of AI into applications. With over 4.4k GitHub stars, it has received great appreciation within the open-source community. It helps developers create custom AI copilots, including in-app AI chatbots and agents capable of interacting dynamically with the application’s environment. The framework streamlines AI integration by handling complex aspects like app context awareness and interaction.

Please star CopilotKit to support their work: 

https://github.com/CopilotKit/CopilotKit

Challenges Resolved Through CopilotKit 

Here are four of the many challenges that CopilotKit helps address:

Components of CopilotKit
CopilotKit offers many components that you can use in your applications. It has native support for LangChain, LangGraph, and LangServe, and it also provides built-in native UI/UX components:

  • CopilotChat: This tool enables the building of app-aware AI chatbots that can interact with the app’s frontend and backend, as well as third-party services.
  • CopilotTextarea: It acts as a drop-in replacement for any ‘<textarea/>’ and offers AI-assisted text generation and editing.
  • In-App Agents: CopilotKit allows real-time context access to applications and lets agents take action within applications.
  • Co-Agents: Soon to be released; these will enable end-users to intervene in and restart agent operations if needed.
  • Purpose-specific LLM chains: It customizes the language model chains for specific applications.
  • Built-in UI Components: Includes components like ‘CopilotSidebar’ and ‘CopilotPopup’ for UI customization.

How does CopilotKit work? 

Let’s look at key points about how CopilotKit works: 

  1. Framework-first: a framework for connecting every component of your application to the copilot engine.
  2. The copilot engine: receives the user request, pulls in the relevant application context, formats it for the LLM, and then initiates in-app action on the user’s behalf. It integrates deeply with the frontend and backend.
  3. AI components: customizable and headless UI components for native AI features: chatbots, AI agents, and AI-powered textareas.
  4. Generative UI: custom interactive user interfaces rendered inside the chat, alongside AI-initiated actions.
  5. In-app agents: bring LangChain agents into the application as interactive components. They can see real-time application context and initiate action inside the application.
  6. Copilot Cloud: turnkey cloud services for scaling and productionizing copilots: copilot memory and chat histories, guardrails, and self-learning (the copilot gets smarter with use).
  7. Simplicity in integration: CopilotKit integrates into existing app infrastructures through simple entry points, making it easy to add advanced AI functionality to applications.

Use Case: CopilotKit Presentation Creator

Let’s build something cool using CopilotKit: a text-to-PowerPoint creator application.

We have to fulfill some prerequisites before proceeding further:

Now, let’s follow the essential steps to build the slide-creation app:

  • Clone the demo repository:
git clone https://github.com/CopilotKit/presentation-demo
  • Navigate to the cloned repo and install the packages:
npm install
  • Create a “.env.local” file in the root directory of the project and add the two API keys obtained in the prerequisites:
OPENAI_API_KEY = "...."
TAVILY_API_KEY = "........"
  • Start the development server:
npm run dev
  • Open http://localhost:3000 in your browser to see the app.
  • A CopilotSidebar will appear in the app. Enter this prompt: “Create a slide on the benefits of AI in healthcare.” You will get the desired slide.

Here’s what CopilotKit did on the backend:

  • It takes the prompt and sends it to Tavily to research the topic.
  • The response is then forwarded to OpenAI to create the slide content.
  • CopilotKit then places the output from the OpenAI LLM in the desired places, using its update functionalities.

Trending Examples of CopilotKit Applications

  1. Chat with Your Resume: an AI-powered resume builder application using Next.js, CopilotKit, and OpenAI.
  2. Text-to-PowerPoint Application: an AI-powered PowerPoint application that can search the web to make a presentation about any topic automatically, built with Next.js, OpenAI, LangChain & Tavily, and CopilotKit.
  3. AI-Powered Blogging Platform: an AI-powered blogging platform that can search the web and research any topic for a blog article, built with Next.js, OpenAI, LangChain & Tavily, CopilotKit, and Supabase.

Conclusion
The introduction of CopilotKit reveals a robust and promising framework for smoothly integrating AI capabilities into applications. By incorporating CopilotKit, developers gain access to a suite of tools for creating interactive AI features through intuitive interfaces like CopilotChat, CopilotSidebar, and CopilotTextarea. The straightforward installation process, comprehensive documentation, and illustrative code examples ensure that even developers new to AI can embark on this journey confidently. Whether you’re trying to build AI-driven chatbots, enrich text areas with smart completions, or create fully customized AI interactions within your apps, CopilotKit can help.



23Apr

Nota AI Researchers Introduce LD-Pruner: A Novel Performance-Preserving Structured Pruning Method for Compressing Latent Diffusion Models (LDMs)


Generative models have emerged as transformative tools across various domains, including computer vision and natural language processing, by learning data distributions and generating samples from them. Among these models, Diffusion Models (DMs) have garnered attention for their ability to produce high-quality images. Latent Diffusion Models (LDMs) stand out for their rapid generation capabilities and reduced computational cost. However, deploying LDMs on resource-limited devices remains challenging due to significant compute requirements, particularly from the Unet component.

Researchers have explored various compression techniques for LDMs to address this challenge, aiming to reduce computational overhead while maintaining performance. These strategies include quantization, low-rank filter decomposition, token merging, and pruning. Pruning, traditionally used for compressing convolutional networks, has been adapted to DMs through methods like Diff-Pruning, which identifies non-contributory diffusion steps and important weights to reduce computational complexity.

While pruning offers promise for LDM compression, its adaptability and effectiveness across various tasks remain limited. Moreover, evaluating pruning’s impact on generative models is challenging because performance metrics like Fréchet Inception Distance (FID) are complex and resource-intensive to compute. In response, researchers from Nota AI propose a novel task-agnostic metric for measuring the importance of individual operators in LDMs, leveraging the latent space during the pruning process.

Their proposed approach ensures independence from output types and enhances computational efficiency by operating in the latent space, where data is compact. This allows for seamless adaptation to different tasks without requiring task-specific adjustments. The method effectively identifies and removes components with minimal contribution to the output, resulting in compressed models with faster inference speeds and fewer parameters.
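
Schematically, such a latent-space importance score might be computed as in the sketch below, where `ablate` is assumed to be a user-supplied context manager that temporarily disables one operator; the distance measure and ablation mechanics are illustrative assumptions, not the paper’s exact metric:

import torch

@torch.no_grad()
def operator_importance(model, operators, prompts, ablate, distance):
    # Score each operator by how much removing it perturbs the model's
    # output latents (illustrative latent-space, task-agnostic criterion).
    reference = [model(p) for p in prompts]    # latents from the full model
    scores = {}
    for op in operators:
        with ablate(model, op):                # temporarily disable one operator
            perturbed = [model(p) for p in prompts]
        # A low score means the operator barely affects the output latents
        # and is therefore a candidate for removal.
        scores[op] = sum(distance(a, b) for a, b in zip(reference, perturbed))
    return scores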

Their study introduces a comprehensive metric for comparing LDM latents and formulates a task-agnostic algorithm for compressing LDMs through architectural pruning. Experimental results across various tasks demonstrate the versatility and effectiveness of the proposed approach, promising wider applicability of LDMs in resource-constrained environments.

Furthermore, the novel metric, grounded in rigorous experimental evaluation and logical reasoning, offers a nuanced understanding of the latent representations of LDMs. By thoroughly assessing each element of the metric’s design, the researchers ensure its effectiveness in comparing LDM latents accurately and sensitively. This level of granularity enhances the interpretability of the pruning process and enables precise identification of components for removal while preserving output quality.

In addition to its technical contributions, their study showcases the proposed method’s practical applicability across three distinct tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG), and Unconditional Audio Generation (UAG). The successful execution of these experiments underscores the approach’s versatility and potential impact in diverse real-world scenarios. Their research validates the proposed method by demonstrating its effectiveness across multiple tasks. It opens avenues for its adoption in various applications, further advancing the field of generative modeling and compression techniques.




