04Jul


Phi-3, the Small Language Model (SLM) from Microsoft, was trained using an approach built on a novel synthetic dataset called TinyStories.

Microsoft used the following recipe to create synthetic training data for the Phi-3 language model:

  1. Microsoft researchers created a discrete dataset based on 3,000 words, comprising roughly equal numbers of nouns, verbs, and adjectives.
  2. They then instructed an LLM to create children’s stories using one noun, one verb, and one adjective from the list.
  3. This prompt was repeated millions of times over several days, generating millions of tiny children’s stories.

The resulting TinyStories dataset combines the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning. The main challenge in using large language models to produce training data is generating a dataset that is sufficiently diverse, and constraining each prompt to a fresh word combination also forces the LLM not to be too repetitive in the content it generates.
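The prompting recipe in steps 1 to 3 can be sketched in a few lines of Python. The word lists and prompt wording below are illustrative stand-ins, not Microsoft's actual 3,000-word list or prompt:

```python
import random

# Hypothetical word lists standing in for the ~3,000-word vocabulary,
# split roughly equally into nouns, verbs, and adjectives.
NOUNS = ["dog", "river", "kite", "cookie"]
VERBS = ["jump", "sing", "hide", "share"]
ADJECTIVES = ["happy", "tiny", "brave", "shiny"]

def make_story_prompt(rng: random.Random) -> str:
    """Draw one word of each type and build one story prompt."""
    noun = rng.choice(NOUNS)
    verb = rng.choice(VERBS)
    adjective = rng.choice(ADJECTIVES)
    return (
        "Write a short story for young children that uses "
        f"the noun '{noun}', the verb '{verb}', "
        f"and the adjective '{adjective}'."
    )

# In the actual recipe, a prompt like this was sent to an LLM
# millions of times; here we just generate a few examples.
rng = random.Random(42)
prompts = [make_story_prompt(rng) for _ in range(3)]
```

Because each prompt pins a different random triple of words, no two generations are steered toward the same story.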

The Small Language Model (SLM) Phi-3 was trained on synthetic data generated by GPT-3.5 and GPT-4. Training data created by large language models can often be too repetitive and lack diversity in verbs, nouns, and adjectives.

The dataset needed to include all the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning, but it was designed to be smaller, less diverse, and more restricted in content.

The concept of creating a framework or data topology for the LLM to generate synthetic training data is intriguing.

The study indicates that training generative models on TinyStories can typically be completed in less than a day on a single GPU, while still exhibiting behaviours similar to those observed in larger models.

Instead of relying solely on raw web data, the creators of Phi-3 sought high-quality data. As the recipe above describes, the researchers drew one noun, one verb, and one adjective from a discrete 3,000-word list and prompted a large language model to write a children's story, repeating the prompt millions of times over several days to generate millions of tiny stories.
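A quick back-of-the-envelope calculation shows why such a small word list still yields diverse prompts. Assuming the 3,000 words split evenly (an assumption on our part; the article only says "roughly equal"), there are a billion distinct word triples:

```python
# Assumed even split of the 3,000-word list into the three categories.
nouns = verbs = adjectives = 3000 // 3  # 1,000 words per category

# Each prompt fixes one (noun, verb, adjective) triple.
triples = nouns * verbs * adjectives
```

With around 10^9 possible triples, even millions of generated stories sample only a tiny fraction of the prompt space, so near-duplicate prompts are rare.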

Small language models are designed to excel at simpler tasks, making them more accessible and easier to use for organisations with limited resources. They can also be fine-tuned more easily to meet specific needs.



