Łukasz Białozor
-
Apr 29, 2024
-
10 min read
ChatGPT dataset – what did it learn from?
There is no doubt that ChatGPT is one of the most widely used online tools. Businesses integrate it into their daily operations and even customise it to suit their needs. In addition to those enterprise-friendly versions, everybody can use ChatGPT 3.5 or purchase access to ChatGPT 4. But, at its core, OpenAI’s golden child functions very similarly to other tools capable of generating text.
These tools rely on language models trained on massive datasets containing billions of words. GPT-4 is reported to have around 1.76 trillion parameters, and although OpenAI hasn’t disclosed the exact size of its training data, text files are very lightweight compared to image or video files, so the amount of written content ingested by the model is virtually unimaginable. It came from various sources, such as web pages, research papers, and books, and was then processed, cleaned and organised to improve the model’s pattern recognition. Because, as obvious as it may seem, writing is about much more than just knowing words and randomly putting them together. Some of them simply don’t fit next to each other, and their juxtaposition looks very artificial.
Prediction mechanisms of ChatGPT
That’s why a prediction mechanism (the decoder) is needed. ChatGPT tries to answer one question: what is the most logical way to continue the text?
The most basic way to predict the next word is simply to assign a probability to its occurrence. Let’s say the sentences beginning with “Dinner was delightful because” were followed by only four words in our dataset. Some appear more often, some less often, and we can express these values as percentages and rank our words:
Dinner was delightful because:
- of: 51.2%
- the: 33.1%
- we: 8.6%
- it: 7.1%
Using only the method described above, we’d conclude that our sentence should continue as “Dinner was delightful because of”. That is, after all, the most probable option, at least according to our dataset. But, of course, this outcome would be very predictable and mundane, not reflecting human writing patterns very well. That’s why ChatGPT sometimes chooses lower-ranked words at random to mimic human-like unpredictability. However, it’s not as chaotic as it seems.
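The difference between always taking the top word and sampling proportionally to probability can be sketched in a few lines of Python. The percentages are the illustrative ones from the list above, not real model probabilities:

```python
import random

# Toy next-word distribution for "Dinner was delightful because",
# using the illustrative percentages from the article.
next_words = {"of": 0.512, "the": 0.331, "we": 0.086, "it": 0.071}

def greedy_pick(dist):
    """Always choose the most probable word (deterministic, mundane)."""
    return max(dist, key=dist.get)

def sample_pick(dist):
    """Sample a word proportionally to its probability (more varied)."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(greedy_pick(next_words))   # always "of"
print(sample_pick(next_words))   # usually "of", but sometimes a lower-ranked word
```

Run `sample_pick` a few times and you’ll occasionally see “we” or “it” – the small dose of randomness the next section puts under control.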
Controlling the chaos – temperature parameter
The randomness of ChatGPT is controlled by another parameter called temperature. It guides the process of choosing the next word. You can test it yourself by using Bing Chat. This GPT-4-based tool allows you to select a conversation style among creative, balanced, and precise. Although Bing Chat doesn’t disclose its temperature settings, we can estimate them based on the ChatGPT API and research:
- Precise style has a low temperature (close to 0) and is therefore more deterministic. It’s more focused, predictable and conservative. Low temperature is the way to go if you need a specific response, e.g. one containing usable JavaScript code.
- Balanced style, which has a temperature of around 0.8, creates outputs suited to most contexts. It lies between predictability and creativity, and its responses are the most human-like.
- Creative style ups the temperature to above 1, coming up with unpredictable yet attractive and out-of-the-box responses. It’s not recommended for obtaining specific information, as it’s the most prone to ChatGPT hallucinations, but it might allow you to glean some inspiration.
Of course, exact values are unknown and change dynamically based on your prompt. ChatGPT versions available on the OpenAI website don’t allow for any temperature control; they adjust it according to the task they’re facing. That’s why providing the most context is so important while interacting with any AI tool.
Context and text styles
As ChatGPT is capable of outputting comprehensible press releases, engaging social media posts or entirely fictional stories, it obviously needs the ability to differentiate between writing styles and to distinguish good examples from bad ones.
Text embedding
Contrary to what intuition might tell us, the ChatGPT learning process didn’t start by manually teaching it to distinguish text styles, contexts or even correct words from incorrect ones. This first happened automatically, when each chunk of text was embedded – mapped to a numerical vector. Embedding can be compared to a careful examination, during which ChatGPT learns how the text is structured. With enough embeddings, the tool gains a better awareness of the semantic information about words, their parts and their surroundings, because words used in similar contexts end up with similar vectors. In the subsequent learning phase, this process was further refined with human feedback.
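The idea that embeddings place semantically related words close together can be illustrated with cosine similarity. Real embeddings have hundreds or thousands of learned dimensions; the three-dimensional vectors below are invented purely for the example:

```python
import math

# Hypothetical 3-dimensional embeddings (real models learn vectors with
# hundreds or thousands of dimensions during training).
embeddings = {
    "dinner": [0.9, 0.1, 0.0],
    "meal":   [0.85, 0.15, 0.05],
    "car":    [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction (1 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["dinner"], embeddings["meal"]))  # high: related words
print(cosine_similarity(embeddings["dinner"], embeddings["car"]))   # low: unrelated words
```

“Dinner” and “meal” score close to 1 while “dinner” and “car” don’t, which is exactly the kind of semantic awareness the paragraph above describes.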
Error correction
Thanks to embeddings, you can make a lot of typos in your prompt and still be understood – the tool will infer the intended word. A similar mechanism works when ChatGPT is presented with a word that wasn’t included in its dataset or is asked to come up with a made-up one. For example, suppose it has never seen the word “unfriendly” but knows the meaning of “friendly”. In that case, it can infer what adding “un” means, as it encountered this prefix in many other word pairs, such as “able/unable”, “common/uncommon”, “known/unknown”, and so on.
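This “un-” intuition is often illustrated with simple vector arithmetic on embeddings. The two-dimensional vectors below are deliberately constructed so the offset works out neatly; real models learn such regularities from data rather than by design:

```python
# Hypothetical 2-dimensional embeddings, constructed so that the
# "un-" prefix corresponds to a consistent vector offset.
vec = {
    "able":     [1.0, 0.0],
    "unable":   [1.0, 1.0],
    "friendly": [0.0, 0.5],
}

# Learn the "un-" offset from a known pair...
un_offset = [u - a for u, a in zip(vec["unable"], vec["able"])]

# ...and apply it to a new word to estimate where "unfriendly" would sit,
# even if that word never appeared in the training data.
estimated_unfriendly = [f + o for f, o in zip(vec["friendly"], un_offset)]
print(estimated_unfriendly)  # [0.0, 1.5]
```

The model never needs an explicit rule for negation: if “un-” shifts vectors in a consistent direction across many pairs, that shift generalises to unseen words.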
Human feedback and continuous improvement
The embeddings were further reinforced using a narrower dataset and human feedback. Reviewers helped ChatGPT improve its responses, which, to an extent, continues to this day. Everybody can submit their thoughts about the tool’s responses while using it. Our input isn’t merely feedback for ChatGPT’s creators – it helps fine-tune the model. Of course, despite these efforts, no AI chatbot is perfect. It can still produce biased or inappropriate content, although as the datasets grow with each successive model, this becomes less and less likely.
ChatGPT’s (im)perfect memory
It’s impossible to talk about AI-based chatbots without discussing tokens. Even the best pattern recognition algorithm falls flat if the tool cannot meaningfully answer our questions and maintain the conversation flow. We already know how ChatGPT learned our languages’ intricacies from a vast mix of texts from various sources. Still, there’s no possibility that it ever encountered every question imaginable. People interacting with the tool are naturally curious and willing to test its limitations, so they ask it abstract, unprecedented questions. How does ChatGPT handle such situations? The answer is simple: tokens.
Why does ChatGPT forget the chat history?
In language models, a word or a piece of a word is represented as a token. The model can only consider a limited number of tokens at any given time. In the case of chatbots, this number covers both our statements and the tool’s own responses. Together, they form the context window. The bigger it is, the more text the AI tool analyses on the fly and the more it “remembers.” In longer conversations, ChatGPT cannot return to the oldest responses because it’s literally unaware of their existence. There are some workarounds for this issue, though. Longer conversations can be truncated or summarised so that only the most important information is retained in a compact form that doesn’t exceed the context window limit.
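The truncation workaround can be sketched as follows: keep the most recent messages, drop the oldest ones once the budget is spent. For simplicity, the “token” count here is just a word count – real models use subword tokenizers, so the numbers would differ:

```python
def fit_to_context_window(messages, max_tokens):
    """Keep the most recent messages whose combined 'token' count fits.

    Token counts here are simple word counts for illustration only;
    real language models count subword tokens instead.
    """
    kept, used = [], 0
    for msg in reversed(messages):        # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                         # older messages are "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = [
    "Hello, what is a token?",
    "A token is a word or piece of a word.",
    "Thanks! And what is a context window?",
]
print(fit_to_context_window(history, max_tokens=20))  # drops the oldest message
```

With a budget of 20 the first question no longer fits and is silently dropped – exactly the “unaware of its existence” behaviour described above. A summarisation step would instead compress the old messages before dropping them.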
Summary
As we aimed to explain the basic idea behind ChatGPT in simple words, some topics had to be omitted. For example, we didn’t touch on the sophisticated neural networks that power the system, the Transformer (which the “T” in “ChatGPT” stands for) or the many other layers of the technical miracle in question. Despite that, grasping even the basic idea behind ChatGPT can enable its more informed use.
Although the results of using this tool may resemble human creations, they rely on rather complex and abstract algorithms. The language models assign embeddings to chunks of text without understanding them in a human sense. Chatbots carefully “plan” their steps, crafting their responses word by word to best fit the expectations of a given context. They try to remember our conversations, but sometimes their responses lack cohesiveness due to a narrow context window. Despite all those limitations, the results of OpenAI’s efforts are awe-inspiring. The accomplishment of ChatGPT lies in an intricate and clever web of algorithms and code analysing multiple inputs in a fraction of a second. All of this mimics the most basic ability of humans – communication.