11Apr

Researchers at Apple Propose Ferret-UI: A New Multimodal Large Language Model (MLLM) Tailored for Enhanced Understanding of Mobile UI Screens


Mobile applications are integral to daily life, serving myriad purposes, from entertainment to productivity. However, the complexity and diversity of mobile user interfaces (UIs) often pose challenges regarding accessibility and user-friendliness. These interfaces are characterized by unique features such as elongated aspect ratios and densely packed elements, including icons and texts, which conventional models struggle to interpret accurately. This gap in technology underscores the pressing need for specialized models capable of deciphering the intricate landscape of mobile apps.

Existing research and methodologies in mobile UI understanding have introduced frameworks and models such as the RICO dataset, Pix2Struct, and ILuvUI, focusing on structural analysis and language-vision modeling. CogAgent leverages screen images for UI navigation, while Spotlight applies vision-language models to mobile interfaces. Models like Ferret, Shikra, and Kosmos2 enhance referring and grounding capabilities but mainly target natural images. MobileAgent and AppAgent employ MLLMs for screen navigation, indicating a growing emphasis on intuitive interaction mechanisms despite their reliance on external modules or predefined actions.

Apple researchers have introduced Ferret-UI, a model specifically developed to advance the understanding and interaction with mobile UIs. Distinguishing itself from existing models, Ferret-UI incorporates an “any resolution” capability, adapting to screen aspect ratios and focusing on fine details within UI elements. This approach ensures a deeper, more nuanced comprehension of mobile interfaces.

Ferret-UI’s methodology revolves around adapting its architecture for mobile UI screens, utilizing an “any resolution” strategy for handling various aspect ratios. The model processes UI screens by dividing them into sub-images, ensuring detailed element focus. Training involves the RICO dataset for Android and proprietary data for iPhone screens, covering elementary and advanced UI tasks. This includes widget classification, icon recognition, OCR, and grounding tasks like find widget and find icon, leveraging GPT-4 for generating advanced task data. The sub-images are encoded separately, using visual features of varying granularity to enrich the model’s understanding and interaction capabilities with mobile UIs.

Ferret-UI is more than just a promising model; it’s a proven performer. It outperformed open-source UI MLLMs and GPT-4V, exhibiting a significant leap in task-specific performances. In icon recognition tasks, Ferret-UI reached an accuracy rate of 95%, a substantial 25% increase over the nearest competitor model. It achieved a 90% success rate for widget classification, surpassing GPT-4V by 30%. Grounding tasks like finding widgets and icons saw Ferret-UI maintaining 92% and 93% accuracy, respectively, marking 20% and 22% improvements over existing models. These figures underline Ferret-UI’s enhanced capability in mobile UI understanding, setting new benchmarks in accuracy and reliability for the field.

In conclusion, the research introduced Ferret-UI, Apple’s novel approach to improving mobile UI understanding through an “any resolution” strategy and a specialized training regimen. By leveraging detailed aspect-ratio adjustments and comprehensive datasets, Ferret-UI significantly advanced task-specific performance metrics, notably exceeding those of existing models. The quantitative results underscore the model’s enhanced interpretative capabilities. But it’s not just about the numbers. Ferret-UI’s success illustrates the potential for more intuitive and accessible mobile app interactions, paving the way for future advancements in UI comprehension. It’s a model that can truly make a difference in how we interact with mobile UIs.



10Apr

The “Zero-Shot” Mirage: How Data Scarcity Limits Multimodal AI


Imagine an AI system that can recognize any object, comprehend any text, and generate realistic images without being explicitly trained on those concepts. This is the enticing promise of “zero-shot” capabilities in AI. But how close are we to realizing this vision?

Major tech companies have released impressive multimodal AI models like CLIP for vision-language tasks and DALL-E for text-to-image generation. These models seem to perform remarkably well on a variety of tasks “out-of-the-box” without being explicitly trained on them – the hallmark of zero-shot learning. However, a new study by researchers from the Tübingen AI Center, University of Cambridge, University of Oxford, and Google DeepMind casts doubt on the true generalization abilities of these systems.

The researchers conducted a large-scale analysis of the data used to pretrain popular multimodal models like CLIP and Stable Diffusion. They looked at over 4,000 concepts spanning images, text, and various AI tasks. Surprisingly, they found that a model’s performance on a particular concept is strongly tied to how frequently that concept appeared in the pretraining data. The more training examples for a concept, the better the model’s accuracy.

But here’s the kicker – the relationship follows an exponential curve. To get just a linear increase in performance, the model needs to see exponentially more examples of that concept during pre-training. This reveals a fundamental bottleneck – current AI systems are extremely data hungry and sample inefficient when it comes to learning new concepts from scratch.

The researchers dug deeper and unearthed some other concerning patterns. Most concepts in the pretraining datasets are relatively rare, following a long-tailed distribution. There are also many cases where the images and text captions are misaligned, containing different concepts. This “noise” likely further impairs a model’s generalization abilities.  

To put their findings to the test, the team created a new “Let It Wag!” dataset containing many long-tailed, infrequent concepts across different domains like animals, objects, and activities. When evaluated on this dataset, all models – big and small, open and private – showed significant performance drops compared to more commonly used benchmarks like ImageNet. Qualitatively, the models often failed to properly comprehend or render images for these rare concepts.

The study’s key revelation is that while current AI systems excel at specialized tasks, their impressive zero-shot capabilities are somewhat of an illusion. What seems like broad generalization is largely enabled by the models’ immense training on similar data from the internet. As soon as we move away from this data distribution, their performance craters.

So where do we go from here? One path is improving data curation pipelines to cover long-tailed concepts more comprehensively. Alternatively, model architectures may need fundamental changes to achieve better compositional generalization and sample efficiency when learning new concepts. Lastly, retrieval mechanisms that can enhance or “look up” a pre-trained model’s knowledge could potentially compensate for generalization gaps.  

In summary, while zero-shot AI is an exciting goal, we aren’t there yet. Uncovering blind spots like data hunger is crucial for sustaining progress towards true machine intelligence. The road ahead is long, but clearly mapped by this insightful study.



10Apr

Cornell University Researchers Introduce Reinforcement Learning for Consistency Models for Efficient Training and Inference in Text-to-Image Generation


Computer vision often involves complex generative models and seeks to bridge the gap between textual semantics and visual representation. It offers myriad applications, from enhancing digital art creation to aiding in design processes. One of the primary challenges in this domain is the efficient generation of high-quality images that closely align with given textual prompts. 

Existing research spans foundational diffusion models capable of producing high-quality, realistic images through a gradual noise reduction. Parallel developments in consistency models present a quicker method by directly mapping noise to data, enhancing the efficiency of image creation. The integration of reinforcement learning (RL) with diffusion models represents a significant innovation, treating the model’s inference as a decision-making process to refine image generation towards specific goals. Despite their advancements, these methods grapple with a common issue: a trade-off between generation quality and computational efficiency, often resulting in slow processing times that limit their practical application in real-time scenarios.

A team of researchers from Cornell University has introduced the Reinforcement Learning for Consistency Models (RLCM) framework, a novel intervention that distinctly accelerates text-to-image generation. Unlike traditional approaches that rely on iterative refinement, RLCM uses RL to fine-tune consistency models, facilitating rapid image generation without sacrificing quality, a leap in efficiency and effectiveness for the domain.

The RLCM framework applies a policy gradient approach to fine-tune consistency models, specifically targeting the Dreamshaper v7 model for optimization. The methodology hinges on leveraging datasets like LAION for aesthetic assessments alongside a bespoke dataset designed to evaluate image compressibility and incompressibility tasks. Through this structured approach, RLCM efficiently adapts these models to generate high-quality images, optimizing for speed and fidelity to task-specific rewards. The process entails a calculated application of RL techniques to significantly reduce both training and inference times, ensuring the models’ effectiveness across varied image generation objectives without compromise.

Compared to traditional RL fine-tuned diffusion models, RLCM achieves a training speed that is up to 17 times faster. For image compressibility, RLCM managed to generate images with a 50% reduction in necessary inference steps, translating to a substantial decrease in processing time from initiation to output. On aesthetic evaluation tasks, RLCM improved reward scores by 30% compared to conventional methods. These results underscore RLCM’s capacity to deliver high-quality images efficiently, marking a substantial leap forward in the text-to-image generation domain.

To conclude, the research introduced the RLCM framework, a novel method that significantly accelerates the text-to-image generation process. By leveraging RL to fine-tune consistency models, RLCM achieves faster training and inference times while maintaining high image quality. The framework’s superior performance on various tasks, including aesthetic score optimization and image compressibility, showcases its potential to enhance the efficiency and applicability of generative models. This pivotal contribution offers a promising direction for future computer vision and artificial intelligence developments.



09Apr

LlamaIndex vs LangChain: A Comparison of Artificial Intelligence (AI) Frameworks


In the rapidly evolving landscape of AI frameworks, two prominent players have emerged: LlamaIndex and LangChain. Both offer unique approaches to enhancing the performance and functionality of large language models (LLMs), but they cater to slightly different needs and preferences within the developer community. This comparison delves into their key features, use cases, and main differences to help developers decide based on their project requirements.

LlamaIndex 

LlamaIndex is a specialized tool that enhances the interaction between data and LLMs. Its strength is in streamlining the indexing and retrieval processes, making it particularly useful for developers focused on search-oriented applications. By facilitating efficient data integration and enhancing LLM performance, LlamaIndex is tailored for scenarios where rapid, accurate access to structured data is paramount.

Key Features of LlamaIndex:

  • Data Connectors: Facilitates the integration of various data sources, simplifying the data ingestion process.
  • Engines: Act as the bridge between data sources and LLMs, allowing seamless data access and interaction.
  • Data Agents: Empower data management through dynamic interaction with data structures and external APIs.
  • Application Integrations: Supports a wide array of integrations with other tools and services, enhancing the capabilities of LLM-powered applications.

Use Cases of LlamaIndex:

  • Semantic Search: Optimized for indexing and retrieval, making it highly suitable for applications requiring precise and speedy search capabilities.
  • Document Indexing: Enhances the quality and performance of data used with LLMs, facilitating efficient data retrieval (see the minimal sketch after this list).
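
To give a feel for this workflow, here is a minimal sketch of building and querying an index with LlamaIndex. The “data” folder and the question are placeholders, it assumes an OpenAI API key is configured for the default embedding and LLM settings, and import paths vary between LlamaIndex versions (older releases used “from llama_index import ...”), so treat it as a starting point rather than a definitive recipe:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local documents from a (hypothetical) "data" folder and build a vector index over them.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Turn the index into a query engine and ask a natural-language question.
query_engine = index.as_query_engine()
response = query_engine.query("What does the onboarding guide say about API keys?")
print(response)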

LangChain

LangChain offers a flexible and comprehensive framework that excels in developing diverse, LLM-powered applications. Its modular design and extensible components enable developers to craft applications that intelligently interact with users, utilize external data, and execute complex workflows. LangChain’s versatility makes it suitable for innovators looking to push the boundaries of what’s possible with AI, offering the tools to build sophisticated applications that are highly adaptable to user needs.

Key Features of LangChain:

  • Model I/O: Standardizes interactions with LLMs, making it easier for developers to incorporate LLM capabilities.
  • Retrieval Systems: Features Retrieval Augmented Generation (RAG) for personalized outputs by accessing external data during the generative phase.
  • Chains: Offers a versatile component for orchestrating complex operations, including RAG and task-specific workflows.

Use Cases of LangChain:

  • Context-Aware Query Engines: Allows the creation of sophisticated query engines that consider the context of queries for more accurate responses.
  • Complex Application Development: Its flexible and modular framework supports the development of diverse LLM-powered applications (see the sketch after this list).
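
As a rough illustration of LangChain’s modular, chain-based style, here is a minimal sketch that pipes a prompt template into an OpenAI chat model and parses the output as a string. The prompt, model name, and inputs are placeholders, it assumes an OpenAI API key is configured, and package layouts differ between LangChain versions (this follows the langchain-core / langchain-openai split), so adapt it to your installation:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A simple prompt template with two input variables.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the provided context.\n\nContext: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")

# Chain the pieces together: prompt -> model -> plain-string output.
chain = prompt | llm | StrOutputParser()

print(chain.invoke({
    "context": "Our store ships all orders within 2 business days.",
    "question": "How fast do orders ship?",
}))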

Main Differences Between LlamaIndex and LangChain

Three major differences between these key AI frameworks are as follows:

  1. Focus and Optimization: LlamaIndex is specifically crafted for search and retrieval applications, emphasizing data indexing and interaction. In contrast, LangChain offers a broader, more flexible framework for creating various LLM-powered applications.
  2. Integration and Extension: While LlamaIndex excels in integrating data for LLM enhancement, LangChain stands out in its extensibility, allowing developers to craft custom solutions by combining various data sources and services.
  3. Toolset and Components: LlamaIndex is renowned for its data connectors and agents, which streamline data tasks. Meanwhile, LangChain distinguishes itself with its modular components, like Model I/O and Chains, which facilitate complex operations and application development.

Comparative Analysis

Let’s have a look at the comparative snapshot of these two AI frameworks:

This comparison shows how LlamaIndex and LangChain cater to different facets of AI application development. LlamaIndex is your go-to for data-centric tasks requiring precise indexing and retrieval, making it indispensable for search-oriented applications. On the other hand, LangChain’s flexibility and comprehensive toolkit make it ideal for developers aiming to build complex, multifaceted applications that leverage LLMs in innovative ways. 

Conclusion

The choice between LlamaIndex and LangChain hinges on the specific requirements of your AI project. Both frameworks offer powerful capabilities to leverage LLMs yet serve distinct purposes. Understanding the nuances of each can help developers and organizations harness the full potential of AI in their applications, whether the focus is on data indexing and retrieval or on building complex, customizable applications.



08Apr

Researchers at Tsinghua University Propose SPMamba: A Novel AI Architecture Rooted in State-Space Models for Enhanced Audio Clarity in Multi-Speaker Environments


Navigating through the intricate landscape of speech separation, researchers have continually sought to refine the clarity and intelligibility of audio in bustling environments. This endeavor has been met with several methodologies, each with strengths and shortcomings. Amidst this pursuit, the emergence of State-Space Models (SSMs) marks a significant stride toward efficacious audio processing, marrying the prowess of neural networks with the finesse required for discerning individual voices from a composite auditory tapestry.

The challenge extends beyond mere noise filtration; it is the art of disentangling overlapping speech signals, a task that grows increasingly complex with the addition of multiple speakers. Earlier tools, from Convolutional Neural Networks (CNNs) to Transformer models, have offered groundbreaking insights yet falter when processing extensive audio sequences. CNNs, for instance, are constrained by their local receptive capabilities, limiting their effectiveness across lengthy audio stretches. Transformers are adept at modeling long-range dependencies, but their computational voracity dampens their utility.

Researchers from the Department of Computer Science and Technology, BNRist, Tsinghua University introduce SPMamba, a novel architecture rooted in the principles of SSMs. The discourse around speech separation has been enriched by introducing innovative models that balance efficiency with effectiveness. SSMs exemplify such balance. By adeptly integrating the strengths of CNNs and RNNs, SSMs address the pressing need for models that can efficiently process long sequences without compromising performance. 

SPMamba is developed by leveraging the TF-GridNet framework. This architecture supplants Transformer components with bidirectional Mamba modules, effectively widening the model’s contextual grasp. Such an adaptation not only surmounts the limitations of CNNs in dealing with long-sequence audio but also curtails the computational inefficiencies characteristic of RNN-based approaches. The crux of SPMamba’s innovation lies in its bidirectional Mamba modules, designed to capture an expansive range of contextual information, enhancing the model’s understanding and processing of audio sequences.

SPMamba achieves a 2.42 dB improvement in Signal-to-Interference-plus-Noise Ratio (SI-SNRi) over traditional separation models, significantly enhancing separation quality. With 6.14 million parameters and a computational complexity of 78.69 Giga Operations per Second (G/s), SPMamba not only outperforms the baseline model, TF-GridNet, which operates with 14.43 million parameters and a computational complexity of 445.56 G/s, but also establishes new benchmarks in the efficiency and effectiveness of speech separation tasks.

In conclusion, the introduction of SPMamba signifies a pivotal moment in the field of audio processing, bridging the gap between theoretical potential and practical application. By integrating State-Space Models into the architecture of speech separation, this innovative approach not only enhances speech separation quality to unprecedented levels but also alleviates the computational burden. The synergy between SPMamba’s innovative design and its operational efficiency sets a new standard, demonstrating the profound impact of SSMs in revolutionizing audio clarity and comprehension in environments with multiple speakers.



07Apr

SiloFuse: Transforming Synthetic Data Generation in Distributed Systems with Enhanced Privacy, Efficiency, and Data Utility


In an era when data is as valuable as currency, many industries face the challenge of sharing and augmenting data across various entities without breaching privacy norms. Synthetic data generation allows organizations to circumvent privacy hurdles and unlock the potential for collaborative innovation. This is particularly relevant in distributed systems, where data is not centralized but scattered across multiple locations, each with its privacy and security protocols.

Researchers from TU Delft, BlueGen.ai, and the University of Neuchatel introduced SiloFuse in search of a method that can seamlessly generate synthetic data in a fragmented landscape. Unlike traditional techniques that struggle with distributed datasets, SiloFuse introduces a groundbreaking framework that synthesizes high-quality tabular data from siloed sources without compromising privacy. The method leverages a distributed latent tabular diffusion architecture, ingeniously combining autoencoders with a stacked training paradigm to navigate the complexities of cross-silo data synthesis.

SiloFuse employs a technique where autoencoders learn latent representations of each client’s data, effectively masking the true values. This ensures that sensitive data remains on-premise, thereby upholding privacy. A significant advantage of SiloFuse is its communication efficiency. The framework drastically reduces the need for frequent data exchanges between clients by utilizing stacked training, minimizing the communication overhead typically associated with distributed data processing. Experimental results testify to SiloFuse’s efficacy, showcasing its ability to outperform centralized synthesizers regarding data resemblance and utility by significant margins. For instance, SiloFuse achieved up to 43.8% higher resemblance scores and 29.8% better utility scores than traditional Generative Adversarial Networks (GANs) across various datasets.

SiloFuse addresses the paramount concern of privacy in synthetic data generation. The framework’s architecture ensures that reconstructing original data from synthetic samples is practically impossible, offering robust privacy guarantees. Through extensive testing, including attacks designed to quantify privacy risks, SiloFuse demonstrated superior performance, reinforcing its position as a secure method for synthetic data generation in distributed settings.

Research Snapshot

In conclusion, SiloFuse addresses a critical challenge in synthetic data generation within distributed systems, presenting a groundbreaking solution that bridges the gap between data privacy and utility. By ingeniously integrating distributed latent tabular diffusion with autoencoders and a stacked training approach, SiloFuse surpasses traditional efficiency and data fidelity methods and sets a new standard for privacy preservation. The remarkable outcomes of its application, highlighted by significant improvements in resemblance and utility scores, alongside robust defenses against data reconstruction, underscore SiloFuse’s potential to redefine collaborative data analytics in privacy-sensitive environments.



07Apr

API Strategies for Effective Database Management and Integration


API (Application Programming Interface) strategies are pivotal in effective database management and integration. In today’s fast-paced digital landscape, where organizations operate across various databases and applications, seamlessly integrating these components is not just beneficial; it’s necessary for operational efficiency, insightful data analysis, and delivering superior customer experiences. APIs serve as the linchpins in this integration process, providing a structured and secure way for disparate systems to communicate and exchange data. Let’s delve into the essence of API strategies, comparing different approaches and elucidating their advantages and disadvantages through a detailed case study.

API Strategies for Database Management

APIs are the bridge that allows applications to interact with databases seamlessly. This interaction happens without the applications needing to know the intricacies of the database schema or the specific programming language in which the database operations are written. By abstracting these details, APIs simplify the development process, bolster security measures, and ensure that systems remain modular and easy to maintain. The strategic selection of an API can have far-reaching effects on integration ease, system performance, scalability, and the overall lifecycle of the application and database ecosystem.

Types of APIs in Database Management

The landscape of APIs in database management is diverse, with each type catering to specific needs and scenarios:

  1. RESTful APIs: The go-to for many web services, Representational State Transfer (REST) APIs utilize simple HTTP requests to create, read, update, and delete data. These APIs can work with various data formats, including text, JSON, and XML, making them highly versatile and easy to implement in multiple environments.
  2. SOAP APIs: The Simple Object Access Protocol (SOAP) APIs are known for their strict standards and high level of security, which makes them a favorite among enterprise-level applications where data security and integrity are paramount. Despite their robustness, they tend to have more overhead than RESTful APIs.
  3. GraphQL APIs: A relatively new API strategy, GraphQL offers an efficient query language for complex systems with interrelated data. It allows clients to request exactly the data they need, reducing bandwidth and improving response times (a short sketch contrasting these styles follows this list).
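
To make the contrast between these styles concrete, here is a minimal Python sketch that fetches the same (entirely hypothetical) product record from a REST endpoint and from a GraphQL endpoint; the URLs, fields, and query are placeholders, not a real API:

import requests

# RESTful style: one URL per resource, and the server decides which fields come back.
rest_response = requests.get("https://api.example.com/products/42", timeout=10)
print(rest_response.json())

# GraphQL style: a single endpoint, and the client asks for exactly the fields it needs.
query = """
query {
  product(id: 42) {
    name
    price
    stockLevel
  }
}
"""
graphql_response = requests.post("https://api.example.com/graphql", json={"query": query}, timeout=10)
print(graphql_response.json())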

A Comparative Look at API Strategies

Let’s compare these API strategies to highlight their distinct characteristics:

A Comparative Snapshot of API Strategies

Pros and Cons of the API Strategies

RESTful APIs:

  • Pros: Their lightweight nature and simplicity make RESTful APIs incredibly approachable for developers. They offer high flexibility, allowing quick adjustments and updates without significant overhead.
  • Cons: They may not offer the same security features as SOAP, making them less suitable for highly sensitive applications. Their efficiency can wane in scenarios requiring complex queries over multiple resources.

SOAP APIs:

  • Pros: They provide a highly standardized approach to API design, with robust security features and support for transactional integrity through ACID compliance. This makes them ideal for enterprise environments where consistency and security are non-negotiable.
  • Cons: The complexity and verbosity of SOAP can lead to slower performance, especially in web applications where quick responses are essential.

GraphQL APIs:

  • Pros: They dramatically reduce the need for multiple queries by allowing clients to specify exactly what data they need. This specificity can lead to significant performance improvements and greater flexibility in handling complex data relationships.
  • Cons: The complexity of setting up a GraphQL API can be a barrier for teams unfamiliar with its query language. Moreover, optimizing query performance requires a deeper understanding of the underlying systems.

Case Study: E-Commerce Integration Challenge

Consider an e-commerce company facing the challenge of integrating its online shopping platform with a legacy inventory management system. The integration is needed to ensure real-time synchronization of product information, stock levels, and order data, improving operational efficiency and customer satisfaction.

  • Solution: The company can opt for GraphQL APIs for this integration. The decision can be driven by the need for efficient, real-time data retrieval and updates across complex, interrelated datasets encompassing products, stocks, and orders.
  • Implementation Process:
    • A GraphQL server can be developed as an intermediary capable of interacting with the shopping platform’s database and the inventory system.
    • The implementation can leverage GraphQL’s powerful query capabilities to manage and synchronize data efficiently across systems, ensuring that product listings on the e-commerce site remain accurate and up-to-date.
  • Outcomes:
    • Using GraphQL can reduce unnecessary data over-fetching and under-fetching, optimizing server load and response times.
    • Customers can enjoy a better shopping experience with real-time visibility into product availability.
    • Due to GraphQL’s flexible query language, the development team may find it easier to address complex data retrieval and manipulation requirements.

Conclusion

The strategic selection and implementation of APIs are fundamental to successful database management and integration. Whether opting for the simplicity and flexibility of RESTful APIs, the security and robustness of SOAP, or the efficiency and precision of GraphQL, the choice should align with the project’s specific needs around security, data complexity, and performance. The discussed comparative analysis and case study illustrate how a well-considered API strategy can facilitate seamless integration, enhance system interoperability, and drive digital transformation efforts forward.



06Apr

How to Build a Plagiarism Detector Using Python [Part 1]


In this post, I will show you how to detect the percentage of plagiarism in a piece of text. A direct, practical solution I created and tested!

How to Build a Plagiarism Checker

The idea is very simple, acting as a perfect starting point to check plagiarism for any piece of text. I will explain the approach step by step with a practical example, so let’s start!

How Do Plagiarism Checkers Work?

Plagiarism checkers scan for matches between your text and existing texts and give a similarity percentage at the end.

Behind the scenes, most of these tools surf the web and scan your text for similarities against existing content found on the internet.

Exact matches are highlighted easily by comparison. However, more complex checkers can also identify non-exact matches, aka paraphrased text.

But before applying all that, these plagiarism tools start by splitting text into smaller chunks like phrases, sentences, and paragraphs; notice how I didn’t say “splitting into individual words.” That’s because individual words carry no context on their own, which would make for a far less effective plagiarism test.

How Do Plagiarism Checkers Work

So, which chunking method should you choose?

Each approach has its pros and cons:

For example, if you choose to chunk by sentences, you’d get a more accurate result; however, the code will need more time to execute.

Moreover, this method wouldn’t be fair if you’re examining someone else’s work (for example, teachers checking students’ essays), because some generic sentences may already exist somewhere on the internet even though the person didn’t copy them.

The chunking-by-paragraphs method, on the other hand, gives a less accurate result but takes less time to execute. It is the go-to method when running a plagiarism detector on student work.

Here are the results I got when I tried both methods:

Difference between chunking by sentences and chunking by paragraphs

In the end, you choose the method based on your needs.

My Implementation

Let’s keep things simple with a real practical example! Here is what we need:

1- A function that takes care of the chunking of our text

2- A function that surfs the web and checks if this chunk exists

3- Add up all the results and get the percentage

Step 1: Text Chunking

Let’s make it dynamic!

# Imports used by the snippets in this post
import re
import time
from typing import List

def chunk_text(text, chunk_by) -> List[str]:
    if chunk_by == "sentence":
        sentences = re.split(r'(?<!\d)[.?!](?!\d)', text)
        sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
        return sentences
    elif chunk_by == "paragraph":
        paragraphs = [paragraph.strip() for paragraph in text.split("\n") if paragraph.strip()]
        return paragraphs
    else:
        raise ValueError("Invalid chunk_by value. Choose 'sentence' or 'paragraph'.")

This function takes as input the text and your chosen chunking method, then if you choose:

  • By Sentence: I used a very straightforward method: I split whenever I find a ‘.’ or ‘!’ or ‘?’ between sentences.
  • By Paragraph: I used a similar approach to the one above, which splits the input whenever there’s a new line between paragraphs. In Python, the new line is defined as \n.

This dynamic approach makes it easier to switch to whichever method you prefer. Plus, you can experiment yourself and see how the accuracy changes depending on the text and the method used.
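
For instance, here is a quick way to try both modes on a small sample string (the sample text is just an illustration), using the chunk_text function above:

sample = "Plagiarism checkers scan for matches. They give a similarity percentage!\n\nThis is a second paragraph."

print(chunk_text(sample, "sentence"))
# ['Plagiarism checkers scan for matches', 'They give a similarity percentage', 'This is a second paragraph']

print(chunk_text(sample, "paragraph"))
# ['Plagiarism checkers scan for matches. They give a similarity percentage!', 'This is a second paragraph.']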

Step 2: Surf the Web

Now that we have split the text into chunks, we need to take each chunk, put it between double quotes like “[chunk]”, and search whether it matches something on the internet.

Here’s an example of a unique chunk:

Add double quotes when searching on Google to search for exact matches

As you can see, no results were found for “Learnwithhasan is the best website” although it’s a well-known fact 😂

💡 Tip 💡 

When you’re searching for an exact match of something, wrap it in double quotes. That way, the search engine knows you’re looking for that exact phrase rather than doing a normal keyword search.

Back to our code:

# Assumes search_with_serpapi is imported from SimplerLLM (introduced below);
# check the SimplerLLM docs for the exact import path in your installed version.
def search_chunk(chunk) -> bool:
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        return found
    except Exception as e:
        print(f"An error occurred: {e}")
        return False 

In this function, I used my library SimplerLLM, specifically a method that uses SerperAPI to search on Google from the code.

To access Google’s search engine from your code, you would normally need an API key and the corresponding request code. However, with SimplerLLM, the function is already built in, and you just call it using the “search_with_serpapi” method.

But, you need to generate your API key from their website, create a .env file, and add your key like this:

SERPER_API_KEY = "YOUR_API_KEY"
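
Depending on how you run the script, the .env file may not be picked up automatically (SimplerLLM may handle this for you; I haven’t verified every setup). If it isn’t, a common pattern is to load it yourself with python-dotenv:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from the local .env file into the environment
print("Serper key loaded:", os.getenv("SERPER_API_KEY") is not None)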

So, using the above function, each chunk is searched for on Google, and if a result exists, it returns True; otherwise, it returns False.

Step 3: Calculating the Result

Now it’s time to take these Trues and Falses and turn them into a percentage:

def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    for chunk in chunks:
        if search_chunk(chunk):
            plagiarised_chunks += 1
    
    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score

This function works by first calling the chunking method explained in Step 1, and then counting the total number of these chunks.

Using step 2, we determine whether each chunk is available on the web. If it returns True, it increases the count of plagiarized chunks.

After checking all chunks, the plagiarism score is calculated by dividing the number of plagiarized chunks by the total number of chunks and multiplying by 100 to get a percentage. For example, if 2 out of 10 chunks are found online, the score is 20%. Finally, the function returns the plagiarism score as a decimal number (float).

Step 4: Running the Script

All the above functions won’t produce anything if you don’t give them input and print the result.

#MAIN SECTION
start_time = time.time() 
text = "YOUR_TEXT" # The Input Text

chunk_by = "sentence"  # "sentence" or "paragraph"
plagiarism_score = calculate_plagiarism_score(text, chunk_by)

end_time = time.time()  # Record the end time
runtime = end_time - start_time  # Calculate the runtime

print(f"Plagiarism Score: {plagiarism_score}%")
print(f"Runtime: {runtime} seconds")  # Print the runtime

In this section of the code, you need to enter the text you want to run the plagiarism checker on, pick your preferred method of chunking, and print the results!

You’ll even get the time it took to generate the results (we’ll use it later🤫)

The Role of SimplerLLM

SimplerLLM is an open-source Python library designed to simplify interactions with large language models (LLMs). It offers a unified interface for different LLM providers and a suite of tools to enhance language model capabilities.

I created it to facilitate coding, and it did indeed save me a lot of time. But the main reason I’m using it in this script is that I’m planning on improving this code more and making it detect similarities, too, not just exact copies of the text. So, keep an eye out for the Semantic Plagiarism Checker Post!

Advanced Technique

Now, although the script we created is working properly, why don’t we improve it a little?

For example, when we find that the chunk is available on a webpage somewhere, we can fetch the URLs of these web pages. This simple tweak to the code would make the results of this script a lot more interesting, especially if you turned it into a tool with a nice UI.

Here’s what the new code will look like:

def search_chunk(chunk) -> List[str]:
    result = []  # will hold [found_flag, url_or_"None"]
    try:
        search_results = search_with_serpapi(f"\"{chunk}\"")
        found = len(search_results) > 0
        if found:
            result.append(found)
            result.append(search_results[0].URL)
            return result
        else:
            result.append(found)
            result.append("None")
            return result
    except Exception as e:
        print(f"An error occurred: {e}")
        result.append(False)
        result.append("None")
        return result

def calculate_plagiarism_score(text, chunk_by) -> float:
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)
    plagiarised_chunks = 0
    counter = 1
    for chunk in chunks:
        found, url = search_chunk(chunk)  # search once per chunk and reuse the result
        print(f"Chunk {counter} : {chunk} .... {found} .... {url}")
        counter += 1
        if found:
            plagiarised_chunks += 1

    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score

As you can see, I edited the “search_chunk” function so that it returns a list containing True/False depending on whether it found an existing duplicate, plus the link to the webpage that contains the same chunk. I also added a print statement in the “calculate_plagiarism_score” function to print each chunk, its number, True/False, and the URL of the webpage.

Here’s what the result will look like:

Plagiarism Checker response

Performance Optimization

A major limitation of the above script is that running it on a large amount of data, like multiple blog posts at a time, would be inefficient. Every chunk has to be searched for on Google to see if there is existing content that matches it.

So, how can we fix this? There are two approaches we can try.

Approach 1:

The first is leaving the code logic as is but applying multi-threading, so that the chunk searches run concurrently instead of one after another, making the script much faster. The code will look something like this:

def calculate_plagiarism_score(text, chunk_by) -> float:
    """
    Calculates the plagiarism score of a given text by chunking it and checking each chunk for plagiarism in parallel.
    """
    chunks = chunk_text(text, chunk_by)
    total_chunks = len(chunks)

    with ThreadPoolExecutor() as executor:
        # Use map to apply search_chunk to all chunks. Results are returned in the order the calls were made.
        results = executor.map(search_chunk, chunks)
    
    plagiarised_chunks = sum(results)

    plagiarism_score = (plagiarised_chunks / total_chunks) * 100 if total_chunks > 0 else 0
    return plagiarism_score

The “calculate_plagiarism_score” is the only function that gets updated because all the work happens in it, so it is the function with the most run-time. Therefore, we apply ThreadPoolExecutor() to distribute the workload over multiple threads, which decreases the program runtime.

To use ThreadPoolExecutor, you need to import it from the standard library like this:

from concurrent.futures import ThreadPoolExecutor

Now let’s run the normal code and the one we just optimized and see the difference in speed:

Increase the speed of Plagiarism Checker by running it using threads

See the difference? The optimized one is about 7x faster 😲

The normal one took 29 seconds to run, while the optimized one (using threads) took only 4 seconds.

Approach 2:

The other approach is to decrease the number of search-engine calls by indexing search results somewhere on our local machine. Then, instead of searching the internet every time, we first check whether a matching chunk already exists in our local index.

Now, covering this approach properly could take another 100 pages, so I’ll leave it for you to experiment with 😂 (a very rough starting sketch is below).
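
That said, as a very rough starting point (nothing close to a full indexing pipeline), you could wrap the boolean search_chunk from Step 2 with a simple cache persisted to a local JSON file, so repeated chunks never hit the search API twice:

import json
import os

CACHE_FILE = "search_cache.json"  # hypothetical local index file

def load_cache() -> dict:
    # Load previously seen chunks and their results, if the cache file exists.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache: dict) -> None:
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(cache, f)

def search_chunk_cached(chunk, cache) -> bool:
    if chunk in cache:           # reuse a stored result instead of calling the search API
        return cache[chunk]
    found = search_chunk(chunk)  # fall back to the web search from Step 2
    cache[chunk] = found
    return found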

If you tried it, make sure to share it with us, and if you have other better approaches to improve the plagiarism detector, drop it in the comments below!


04Apr

Meet Plandex: An Open-Source Terminal-based AI Coding Engine for Complex Tasks


The field of software development is evolving rapidly, and the integration of artificial intelligence (AI) with coding practices is poised to transform the way developers work on their projects. Against this backdrop, there is a new project called Plandex that aims to simplify the process of building complex software. This is an open-source, terminal-based AI coding engine that utilizes the capabilities of OpenAI. It represents a significant advancement in coding efficiency and project management.

Plandex is a tool that automates the routine tasks of coding, allowing developers to concentrate on more innovative and challenging assignments. It was developed by a programmer who found the tedious process of constantly copying and pasting code between ChatGPT and other projects to be inconvenient. Plandex is exceptional, not only because it can handle intricate tasks that involve multiple files and steps but also because of its unique approach to managing the inevitable errors and the iterative nature of coding.

Plandex utilizes long-running agents that break down large tasks into manageable subtasks, methodically implementing each one. This approach ensures that tasks requiring extensive multi-file operations are completed efficiently, transforming how developers tackle their backlogs, explore new technologies, and overcome coding obstacles.

One of the key features of Plandex is its integration with the OpenAI API, requiring users to provide their API key. However, its roadmap includes support for other models, such as Google’s Gemini and Anthropic’s Claude, as well as open-source models, indicating a future where Plandex becomes even more versatile and powerful.

The Plandex project offers a range of functionalities tailored to enhance the coding experience:

  • The ability to build complex software functionalities beyond mere autocomplete.
  • Efficient context management within the terminal, enabling seamless updates of files and directories to ensure the AI models have access to the latest project state.
  • A sandbox environment for testing changes before applying them to project files, complete with built-in version control and branching capabilities for exploring different coding strategies.
  • It is compatible across Mac, Linux, FreeBSD, and Windows, running as a single binary without dependencies.

Plandex is more than just a tool; it is a real help for developers, reducing the “copy-pasting madness” that currently affects modern software development. By providing a platform where developers can experiment, revise, and select the best approach without the need for manual context management, Plandex is leading the way towards a new era of software development.

Key Takeaways

  • Plandex is an open-source, AI-powered coding engine designed to streamline the development of complex software projects.
  • It leverages the OpenAI API to automate tasks across multiple files, enhancing productivity and focus for developers.
  • Unique features like version control, sandbox testing, and efficient context management in the terminal set Plandex apart in the coding tools landscape.
  • By minimizing the tedious aspects of coding and focusing on automation and efficiency, Plandex represents a significant advancement in the integration of AI into software development.



03Apr

How to Build a Plagiarism Detector [Part 2] – Semantic Search


In this post, I will show you a better approach to building a plagiarism detector tool than the one we built last time, which only checks for exact matches on the internet.

Checking for Plagiarism in a given text using AI and Vector Embeddings

Today’s method will check for plagiarism based on how close the meaning and sentence structure are rather than searching for exact matches. This will help detect paraphrased text in addition to copy-pasted text.

We will go over 4 approaches, comparing the articles both as a whole and in chunks. Each comparison will be done once with vector embeddings and once with AI, resulting in the following 4 methods:

Method 1: Comparing chunks of both articles using vector embeddings

Method 2: Comparing 2 articles as a whole using vector embeddings

Method 3: Comparing chunks of both articles using AI

Method 4: Comparing 2 articles as a whole using AI

What are Vector Embeddings?

If you know what vector embeddings are, feel free to skip this section.

Vector Embedding is one of the most important concepts in machine learning. It is used in many NLP, recommendation, and search algorithms.

It enables computers to understand and process complex data, such as text, images, and more, in a more human-like manner.

By representing objects or words as high-dimensional vectors (points in space), embeddings capture their meanings, relationships, and properties in a way that numerical algorithms can manipulate.

How embedding models work

So, all words/phrases/paragraphs with similar meanings are positioned closely together in the embedding space, allowing models to recognize patterns and make predictions.

In our case, we’ll use vector embeddings to generate high-dimensional vectors for pieces of text. Then, using something called cosine similarity, we’ll know if the texts are similar in meaning and structure.
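
To make the cosine similarity idea concrete, here is a tiny sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

from scipy.spatial.distance import cosine

vec_a = [0.9, 0.1, 0.0]
vec_b = [0.8, 0.2, 0.0]  # points in almost the same direction as vec_a
vec_c = [0.0, 0.0, 1.0]  # points in a completely different direction

print(1 - cosine(vec_a, vec_b))  # close to 1.0 -> very similar
print(1 - cosine(vec_a, vec_c))  # 0.0 -> unrelated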

💡 Tip 💡 

Make sure to go over Part 1 and apply the simpler implementation first; understanding how basic plagiarism checkers work will make it much easier to navigate this one.

The Implementations

I’m gonna go over each implementation briefly, explaining the idea, and in the end, I’m gonna compare all the results we got and analyze them. (I’ll share all the resources I used at the end, don’t worry.)

For more accurate results and analysis, and to make things easier, I won’t be surfing the web and comparing my pieces of text with pieces on the web as we did in the first part. Instead, I’ll use 2 pieces of text I have that are close in meaning to each other, and I’ll apply all the methods to them.

After understanding how these methods work, you can then add the web search function I mentioned in Part 1 to your code, and you’ll have your own custom semantic plagiarism checker!

Method 1: Comparing Chunks Using Vector Embeddings

The idea of this approach is I’m gonna split the article we want to test and the article we’re comparing it to into chunks, where the chunking method used is by-paragraph.

Then I’m gonna turn all these chunks into vector embeddings using OpenAI’s “text-embedding-3-small” model, and I’ll compare each chunk from the input article with all chunks in the other article, using the cosine similarity function, giving it a threshold of 0.7

The cosine similarity output is compared against this threshold: if it is greater than 0.7, the 2 vectors are similar in meaning, and the chunk is therefore considered plagiarised. The threshold I chose is just for testing; if you want a more accurate one, you’ll have to do your own research to find the best value for this use case.

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from scipy.spatial.distance import cosine
import time 
import resources
import openai

def search_semantically_similar(text):
    """
    This function takes a piece of text and calculates its plagiarism score
    against another text by comparing the semantic similarity of their chunks.
    """
    chunks = chunk_by_paragraphs(text)  # Divide the input text into chunks/paragraphs
    article_paragraphs = chunk_by_paragraphs(resources.article_two)  # Divide the second text into chunks/paragraphs for comparison
    all_comparisons = 0  # Initialize a counter for all comparison attempts
    plagiarised_chunks = 0  # Initialize a counter for chunks found to be plagiarised based on similarity threshold

    for chunk in chunks.chunks:  # Iterate over each chunk in the first text
        chunk_vector = convert_to_vector(chunk.text)  # Convert the chunk text to a vector using an embedding model
            
        for paragraph in article_paragraphs.chunks:  # Iterate over each paragraph in the second text
            if paragraph.text.strip():  # Ensure the paragraph is not just whitespace
                all_comparisons += 1  # Increment the total comparisons counter
                paragraph_vector = convert_to_vector(paragraph.text)  # Convert the paragraph text to a vector
                similarity = calculate_cosine_similarity(chunk_vector, paragraph_vector)  # Calculate the cosine similarity between vectors
                
                if is_similarity_significant(similarity):  # Check if the similarity score is above a certain threshold
                    plagiarised_chunks += 1  # If so, increment the count of plagiarised chunks
        
    plagiarism_score = ((plagiarised_chunks / all_comparisons) * 100)  # Calculate the percentage of chunks considered plagiarised
    return plagiarism_score  # Return the plagiarism score

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_score = search_semantically_similar(text_to_check)  # Calculate the plagiarism score

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {plagiarism_score}%")  # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

As you can see in the above code, in the main section we pass in text_to_check, which is run through the search_semantically_similar function, which in turn goes over all the steps I mentioned above.


In the code, I’ll be using the SimplerLLM library I built to facilitate and speed up the coding process. In these implementations, I’ll be using it to generate text using OpenAI’s API (methods 3 and 4) and to chunk text by paragraphs using this simple function:

chunks = chunk_by_paragraphs(text)

Other than that, the code should be simple to read and understand, given all the comments I added throughout the code 😅 However, if you find something unclear and need some help, don’t hesitate to drop your questions on the forum!

Method 2: Comparing 2 articles as a whole using vector embeddings

In this method, we’ll be directly comparing both articles as a whole without chunking them by converting both of them into vector embeddings. Then, using cosine similarity, we’ll see if they’re similar to each other.

from scipy.spatial.distance import cosine
import time 
import resources
import openai

def convert_to_vector(text):
    """
    Converts a given piece of text into a vector using OpenAI's embeddings API.
    """
    text = text.replace("\n", " ")  # Remove newlines for consistent embedding processing
    response = openai.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return response.data[0].embedding  # Return the embedding vector

def calculate_cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, representing the similarity of their originating texts.
    """
    return 1 - cosine(vec1, vec2)  # The cosine function returns the cosine distance, so 1 minus this value gives similarity

def is_similarity_significant(similarity_score):
    """
    Determines if a cosine similarity score indicates significant semantic similarity, implying potential plagiarism.
    """
    threshold = 0.7  # Define a threshold for significant similarity; adjust based on empirical data
    return similarity_score >= threshold  # Return True if the similarity is above the threshold, False otherwise

def search_semantically_similar(text_to_check):
    """
    Compares the semantic similarity between the input text and a predefined article text.
    It returns a list containing the similarity score and a boolean indicating whether
    the similarity is considered significant.
    """
    result = []  # Initialize an empty list to store the similarity score and significance flag

    input_vector = convert_to_vector(text_to_check)  # Convert the input text to a vector using an embedding model
        
    article_text = resources.article_two  # resources.article_two contains the text of the article to compare with
        
    article_vector = convert_to_vector(article_text)  # Convert the article text to a vector
        
    similarity = calculate_cosine_similarity(input_vector, article_vector)  # Calculate the cosine similarity between the two vectors
        
    result.append(similarity)  # Append the similarity score to the list
    result.append(is_similarity_significant(similarity))  # Append the result of the significance check to the list
    
    return result  # Return the list containing the similarity score and significance flag
    
def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a given text by comparing its semantic similarity
    with a predefined article text. The score is expressed as a percentage.
    """
    data = search_semantically_similar(text) # Obtain the similarity data for the input text
    data[0] = data[0] * 100  # Convert the first item in the data list (similarity score) to a percentage
    
    return data  # Return the plagiarism score and significance

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_data = calculate_plagiarism_score(text_to_check)  # Run the check once so we don't make duplicate API calls
plagiarism_score = plagiarism_data[0]  # The similarity expressed as a percentage
significance = plagiarism_data[1]  # Whether the similarity crosses the 0.7 threshold

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {plagiarism_score}%")  # Print the calculated plagiarism score
print(f"Is result Significant: {significance}")  # Print the signficance of the score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

As you can see, the code is very similar in structure to method 1. However, the search_semantically_similar function was edited to directly turn both articles into vectors, compare them, and return the result without chunking.

Plus, I added the calculate_plagiarism_score function, which converts the similarity score into a percentage. It returns that percentage along with a True/False flag indicating whether the plagiarism score is significant, which is determined by comparing the cosine similarity against the threshold I set to 0.7.

Method 3: Comparing chunks of both articles using AI

Now it’s time for AI to enter the battlefield😂

This method is the same as method 1 in concept; however, instead of comparing the chunks by embedding them into vectors and computing cosine similarity, we’ll compare them using a power prompt and OpenAI’s GPT model.

from SimplerLLM.tools.text_chunker import chunk_by_paragraphs
from SimplerLLM.language.llm import LLM, LLMProvider
import time 
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a text chunk with an article text and generates a response using an OpenAI model
    """
    article_text = resources.article_two  # The text to compare against

    prompt = resources.prompt3  # A template string for creating the comparison prompt
    final_prompt = prompt.format(piece=text_chunk, article=article_text)  # Formatting the prompt with the chunk and article texts

    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(final_prompt)  # Generating text/response from the LLM

    response_data = json.loads(response)  # Parsing the response into a JSON object

    return response_data  # Returning the parsed response data

def calculate_plagiarism_score(text):
    """
    Calculates the plagiarism score of a text by comparing its chunks against an article text
    and evaluating the responses from OpenAI's Model
    """
    text_chunks = chunk_by_paragraphs(text)  # Split the input text into chunks using SimplerLLM built-in method
    total_chunks = text_chunks.num_chunks  # The total number of chunks in the input text

    similarities_json = {}  # Dictionary to store similarities found
    chunk_index = 1  # Index counter for naming the chunks in the JSON
    plagiarised_chunks_count = 0  # Counter for the number of chunks considered plagiarised
    total_scores = 0  # Sum of scores from the LLM responses

    for chunk in text_chunks.chunks:
        response_data = compare_chunks(chunk.text)  # Compare each chunk using the LLM
        total_scores += response_data["score"]  # Add the score from this chunk to the total scores

        if response_data["score"] > 6:  # A score above 6 indicates plagiarism
            plagiarised_chunks_count += 1
            similarities_json[f"chunk {chunk_index}"] = response_data["article"]  # Record the article text identified as similar
            chunk_index += 1  # Increment the chunk index

    plagiarism_result_json = {}  # Dictionary to store the final plagiarism results
    plagiarism_score = (plagiarised_chunks_count / total_chunks) * 100 if total_chunks > 0 else 0  # Calculate the plagiarism score as a percentage

    plagiarism_result_json["Score"] = plagiarism_score
    plagiarism_result_json["Similarities"] = similarities_json # Adding where we found similaritites
    plagiarism_result_json["IsPlagiarised"] = (total_scores > total_chunks * 6)  # Recording if the response is really plagiarised

    return plagiarism_result_json  # Return the plagiarism results as a dictionary

#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one  # Assign the text to check for plagiarism

plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the calculated plagiarism score
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

In the code, the main function is calculate_plagiarism_score, which chunks the articles, sends each chunk to the compare_chunks function to get a similarity score, generates a total plagiarism score, and formats the results as JSON so that the extra details stay clear and readable alongside the score.

The compare_chunks function creates a GPT instance using SimplerLLM, then uses a power prompt to compare each chunk against the article and generate a score out of 10 for how similar they are. Here’s the prompt I’m using:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input chunk,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to 
the input chunk. After that you're gonna pick the piece of article which is most similar and generate for it
a score out of 10 for how similar it is to the input chunk. Then you're gonna need to generate the output
as a JSON format that contains the input chunk, the article chunk which is the most similar, and the score
out of 10. 

### SCORING CRITERIA 
When checking for pieces in the article that are close in meaning to the chunk of text, make sure you 
go over the article at least 2 times to make sure you picked the right chunk in the article which is the most 
similar to the input chunk. Then when picking a score, it should be based on how similar the meanings 
and structure of both these sentences are.

### INPUTS
input chunk: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format nothing else, here's an example structure:
{{
    "chunk": "[input chunk]",
    "article": "[chunk from article which is similar]",
    "score": [score]
}}

As you can see, it is a detailed prompt, carefully crafted to generate a specific result. You can learn how to craft similar prompts yourself by becoming a Prompt Engineer.

Method 4: Comparing 2 articles as a whole using AI

This method is a combination of methods 2 and 3, where we’re gonna be comparing both articles as a whole but using AI instead of vector embeddings.

from SimplerLLM.language.llm import LLM, LLMProvider
import time 
import resources
import json

def compare_chunks(text_chunk):
    """
    Compares a given text chunk with an article to determine plagiarism using a language model.
    
    Returns dict: The response from the language model, parsed as a JSON dictionary.
    """
    article_text = resources.article_two  # The text to compare against

    # Formatting the prompt to include both the input text chunk and the article text
    comparison_prompt = resources.prompt4.format(piece=text_chunk, article=article_text)

    llm_instance = LLM.create(provider=LLMProvider.OPENAI)  # Creating an instance of the language model
    response = llm_instance.generate_text(comparison_prompt)  # Generating response

    response_data = json.loads(response)  # Parsing the response string into a JSON dictionary

    return response_data  # Returning the parsed JSON data

def calculate_plagiarism_score(text_to_analyze):
    """
    Calculates the plagiarism score based on the analysis of a given text against a predefined article text.
    
    Returns dict: A JSON dictionary containing the plagiarism score and the raw data from the analysis.
    """
    plagiarism_results = {}  # Dictionary to store the final plagiarism score and analysis data
    plagiarised_chunk_count = 0  # Counter for chunks considered plagiarised

    analysis_data = compare_chunks(text_to_analyze)  # Analyze the input text for plagiarism
    total_chunks = len(analysis_data)  # Total number of chunk pairs returned by the model
    
    for key, value in analysis_data.items():
        # Check if the value is a list with at least one item and contains a 'score' key
        if isinstance(value, list) and len(value) > 0 and 'score' in value[0] and value[0]['score'] > 6:
            plagiarised_chunk_count += 1
        # Check if the value is a dictionary and contains a 'score' key
        elif isinstance(value, dict) and 'score' in value and value['score'] > 6:
            plagiarised_chunk_count += 1

    plagiarism_score = (plagiarised_chunk_count / total_chunks) * 100 if total_chunks > 0 else 0 # Calculate plagiarism score as a percentage
    plagiarism_results["Total Score"] = plagiarism_score  # Add the score to the results dictionary
    plagiarism_results["Data"] = analysis_data  # Add the raw analysis data to the results dictionary

    return plagiarism_results  # Return the final results dictionary
    
#MAIN SECTION
start_time = time.time()  # Record the start time of the operation

text_to_check = resources.article_one # Assign the text to check for plagiarism

plagiarism_score = calculate_plagiarism_score(text_to_check)
formatted_plagiarism_score = json.dumps(plagiarism_score, indent=2) # Format the output for better readability

end_time = time.time()  # Record the end time of the operation
runtime = end_time - start_time  # Calculate the total runtime

# Output the results
print(f"Plagiarism Score: {formatted_plagiarism_score}")  # Print the scores
print(f"Runtime: {runtime} seconds")  # Print the total runtime of the script

This code is 80% like the code in method 3. However, instead of comparing each chunk, we send both articles as a whole and let OpenAI’s GPT run a detailed plagiarism test, comparing whichever parts of the articles it chooses. In the end, it returns a detailed output containing a plagiarism score and the sections it found most similar, along with their similarity scores.

All this is done using this power prompt:

### TASK
You are an expert in plagiarism checking. Your task is to analyze two pieces of text, an input text,
and an article. Then you're gonna check if there are pieces of the article that are similar in meaning to 
the pieces of the input text. After that you're gonna pick chunk pairs that are most similar to each other
in meaning and structure, a chunk from the input text and a chunk from the article. You will then generate 
a score out of 10 for each pair for how similar they are.
Then you're gonna need to generate the output as a JSON format for each pair that contains 
the input text chunk, the article chunk which are the most similar, and the score out of 10. 

### SCORING CRITERIA 
When checking for pieces in the article that are close in meaning to the chunk of text, make sure you 
go over the article at least 2 times to make sure you picked the right pairs of chunks which are most similar.
Then when picking a score, it should be based on how similar the meanings and structure of both these sentences are.

### INPUTS
input text: [{piece}]
article: [{article}]

### OUTPUT
The output should be only a valid JSON format nothing else, here's an example structure:
{{
    "pair 1": 
    [
    "chunk 1": "[chunk from input text]",
    "article 1": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 2": 
    [
    "chunk 2": "[chunk from input text]",
    "article 2": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 3": 
    [
    "chunk 3": "[chunk from input text]",
    "article 3": "[chunk from article which is similar]",
    "score": [score]
    ],
    "pair 4": 
    [
    "chunk 4": "[chunk from input text]",
    "article 4": "[chunk from article which is similar]",
    "score": [score]
    ]
}}

The prompts in methods 3 and 4 need to be well crafted, since all the results depend on them. Feel free to tweak and optimize them to your liking, and if you get better results, make sure to share them with us in the comments below!

Method 5: My Opinion

After trying 2 types of machine approaches to do the work for us, let’s now apply some human intelligence and see whether their results actually hold up!

Here are the 2 texts I was comparing:

Article 1:

What is generative AI?

Generative AI refers to deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on.

Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of ChatGPT seems to mark a turning point. OpenAI's chatbot, powered by its latest large language model, can write poems, tell jokes, and churn out essays that look like a human created them. 
Prompt ChatGPT with a few words, and out comes love poems in the form of Yelp reviews, or song lyrics in the style of Nick Cave.
Article 2:

What is generative AI?

Generative artificial intelligence (AI) describes algorithms (such as ChatGPT) that can be used to create new content, including audio, code, images, text, simulations, and videos. 
Recent breakthroughs in the field have the potential to drastically change the way we approach content creation.

Generative AI systems fall under the broad category of machine learning, and here's how one such system—ChatGPT—describes what it can do:

Ready to take your creativity to the next level? Look no further than generative AI! 
This nifty form of machine learning allows computers to generate all sorts of new and exciting content, from music and art to entire virtual worlds. And it's not just for fun—generative AI has plenty of practical uses too, like creating new product designs and optimizing business processes. So why wait? Unleash the power of generative AI and see what amazing creations you can come up with!

Did anything in that paragraph seem off to you? Maybe not. The grammar is perfect, the tone works, and the narrative flows.

As you can see, both articles are about the same topic, and these are just small excerpts of them, so it’s logical for the plagiarism score to be at least 50%, if not 80%. Read both of them, and you’ll see they’re very close; they were just written in different styles.

Therefore, to get more accurate results and see which of the methods is the best among them, we’ll need to run all of them on 10-20 pairs of long articles.

Of course, I can’t do that in this blog and share all the results; it would take forever😂 So, I’ll leave the experimentation to you; run the tests and share your results with us!

Run The Codes

To run the codes, you’re gonna have to create a .env file that contains your OpenAI API key like this:

Syntax to put an API Key in a .env file
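
The file itself is a single line; the exact variable name depends on how the repo reads it, but with the standard OpenAI client it is usually OPENAI_API_KEY, so something like:

OPENAI_API_KEY="sk-your-key-here"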

This way, all the methods will run perfectly using the articles I presented above. If you wish to test your own articles, you can swap them in the resources.py file, where you’ll also find both power prompts I mentioned above.

Plus, don’t forget to install all the necessary libraries, which you will find in the requirements file. Install them by running this in the terminal:

pip install -r requirements.txt
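
For reference, the file should list roughly the libraries used above. The exact package names (especially SimplerLLM’s) may differ, so treat this as an illustration and rely on the repo’s own requirements.txt:

openai
scipy
SimplerLLM
python-dotenv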

Comparison

I executed all the methods on the same set of articles I presented above, and here were the results:

Factors             Method 1    Method 2    Method 3    Method 4
Plagiarism Score    25%         85%         100%        100%
Runtime             44 secs     1 sec       8 secs      10 secs

Runtime Analysis

Logically, Methods 1 and 3 are supposed to take more time than Methods 2 and 4 because they compare individual chunks rather than the articles as a whole. However, the runtime of method 1 is very bad, so to use this method you’re gonna need to optimize the code to run faster (for example, with parallel programming), as sketched below.
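
Here’s a minimal sketch of one possible speed-up (my own suggestion, not part of the original code). It reuses convert_to_vector, calculate_cosine_similarity, and is_similarity_significant from Method 1, assumes the chunks are already plain lists of strings, embeds each chunk only once, and sends the embedding requests concurrently since they’re I/O bound:

from concurrent.futures import ThreadPoolExecutor

def embed_all(chunks, max_workers=8):
    """Embeds every chunk concurrently; the API calls are I/O bound, so threads help."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_to_vector, chunks))  # One request per chunk, sent in parallel

def plagiarism_score_parallel(input_chunks, article_chunks):
    """Same scoring idea as Method 1, but every chunk is embedded exactly once, up front."""
    input_vectors = embed_all(input_chunks)      # Hypothetical helper, defined above
    article_vectors = embed_all(article_chunks)
    plagiarised = 0
    for input_vector in input_vectors:
        for article_vector in article_vectors:
            if is_similarity_significant(calculate_cosine_similarity(input_vector, article_vector)):
                plagiarised += 1
    total_comparisons = len(input_vectors) * len(article_vectors)
    return (plagiarised / total_comparisons) * 100 if total_comparisons > 0 else 0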

Other than that, all the runtimes are good, so there is no need to optimize the code of the other methods.

Plagiarism Score Analysis

I’ll give my personal analysis of each method and then draw a full conclusion.

Method 1:

25% for these 2 articles is very low, considering they’re very similar in meaning. However, I hypothesize that this is because each chunk is compared against every other chunk, and not all parts of an article are about the same idea.

So, when we compare, for example, a chunk from paragraph 1 with a chunk from the last paragraph, those chunks will of course cover totally different ideas. Plus, the probability that a pair is similar is much lower than the probability that it covers different ideas, because in each article every idea is mentioned once, so only one chunk will be very close in meaning to the chunk we’re comparing.
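
To see why, here’s a rough back-of-the-envelope illustration with hypothetical numbers (the real articles may have more or fewer paragraphs):

# Hypothetical numbers, just to show why the score stays low
chunks_in_input = 4       # paragraphs in the input article
chunks_in_article = 4     # paragraphs in the article we compare against
all_comparisons = chunks_in_input * chunks_in_article  # 16 pairwise comparisons
genuine_matches = 4       # at best, each input chunk matches one paragraph in the other article
print((genuine_matches / all_comparisons) * 100)       # 25.0, so even a perfect detector tops out around there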

However, this method has a major drawback: if one article contains a paragraph and the other article contains exactly the same paragraph but split into 2, it won’t detect that they’re the same. That’s because we’re chunking by paragraph, so in article 2 those would be 2 different chunks while in article 1 they would be 1 chunk, which lowers the score. To solve this, we need a better chunking method; one possible alternative is sketched below.
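
Here’s a minimal sketch of one alternative (my own idea, independent of SimplerLLM’s chunk_by_paragraphs): split the text into sentences and group them into fixed-size windows with a small overlap, so the same content ends up in comparable chunks even when paragraph boundaries differ:

import re

def chunk_by_sentence_window(text, sentences_per_chunk=3, overlap=1):
    """Groups sentences into overlapping windows so paragraph boundaries matter less."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]  # Naive sentence split
    chunks = []
    step = max(sentences_per_chunk - overlap, 1)  # How far the window slides each time
    for start in range(0, len(sentences), step):
        window = sentences[start:start + sentences_per_chunk]
        if window:
            chunks.append(" ".join(window))
        if start + sentences_per_chunk >= len(sentences):  # Stop once the window reaches the end
            break
    return chunks

Feeding these windows into Method 1 instead of whole paragraphs should make the comparison less sensitive to how each author happened to split their paragraphs.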

Method 2:

A score of 85% is very fair for such articles; they are truly very similar. However, do you think comparing 2 articles as a whole is really an efficient way to test for plagiarism? Personally, I don’t think it’s good practice, since the purpose of plagiarism detection is to check whether parts of an article can be found elsewhere on the internet.

In this case, it will only work well when one article is practically a full copy of the other, from the introduction to the conclusion, where it would give an accurate result.

Methods 3 and 4:

These 2 methods are kinda the same because, in the background, both use AI to go over the different chunks and check which pairs are the most similar. The main difference is that in method 3 we chunk the articles by paragraph ourselves, while in method 4 the AI chunks them however it finds most efficient, so we can’t actually tell how it is splitting the articles.

In addition, these methods rely entirely on how well-crafted the prompt is, so you can get better results by improving it, and worse results by neglecting it. The main factor that determines how good the prompt is is whether it applies a solid plagiarism algorithm, so you’ll have to do your own research, understand the algorithm, and implement it in a well-crafted prompt.

Conclusion

There is no actual conclusion; it’s more of an opinion.

Based on the tests I did, I’d say method 1 might be the best foundation for a good plagiarism checker, because it goes into detail, comparing all the chunks of both articles. So, with a better chunking method and more optimized, faster code, I think it would make a solid plagiarism checker!

Agree or Disagree? Share your thoughts in the comments section!
