Computer Science & AI — Journal Vol. 1

Attention Is All You Need

By Aiden Zheng, Akhil Rajdeep, Allen Zheng, Amy Liu, Carter Li, Vaibhav Vijay, & Vihaan Vijay

"Attention Is All You Need" was written by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, researchers at Google and the University of Toronto. It introduces a new way for computers to process human language. Before this paper, most language systems used recurrent networks, which read a sentence one word at a time. Recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent networks are older types of these models. While these models work, they are not ideal. Because each word has to be processed after the previous one, recurrent networks are slow to train, and they have difficulty with longer sentences. This paper proposes a different architecture, the Transformer, which drops recurrence and relies entirely on a mechanism called "attention".

This study outlines how attention in transformers works. The researchers describe an encoder-decoder structure built from a few components: embeddings, scaled dot-product attention, and feed-forward networks, with the whole model trained by back propagation. One of these components, scaled dot-product attention, uses queries and keys to determine how relevant each word is to the other words in the sentence. The resulting model outperformed the RNN- and LSTM-based models it was compared against on translation tasks. According to the paper, self-attention is favorable for three reasons: it needs less computation per layer for typical sentence lengths, more of that computation can run in parallel, and the path between distant words is shorter, which makes long-range relationships easier to learn.

Transformers determine how much attention each part of the input gets by using queries, keys, and values. The first step is to match the queries with the keys. For example, in the sentence "She ate the cake," we match the word "ate" with the other words in the sentence by using dot products. A dot product measures how well a query lines up with a key: the larger the dot product, the more relevant that word is. In this example, "ate" and "cake" would likely have a large dot product. Then we scale each dot product by dividing it by the square root of the key dimension, and apply a softmax, which squashes the scores into weights between 0 and 1 that add up to 1. These weights are then used to take a weighted average of the values. The paper shows the usefulness of attention in transformers compared to RNNs and LSTMs. This matters because RNNs and similar networks struggle to remember information from much earlier in a sequence, while attention lets a transformer look at any other word directly and so capture context better. This has led to better large language models (LLMs). Most well-known AI models today use transformers, for example the GPT series (GPT-1 through GPT-4), Google's Gemini, and BERT.
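The matching, scaling, and softmax steps can be sketched in a few lines of Python. The paper writes the operation as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The sketch below handles a single query and uses only the standard library. The word vectors are made-up 2-dimensional numbers chosen for illustration, not values from the paper, and the function names are our own. Reusing the keys as the values is also a simplification; a real transformer learns separate projections for queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(query, keys, values):
    """Attention for one query vector over lists of key and value vectors."""
    d_k = len(query)
    # Dot product of the query with every key, divided by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors for "She ate the cake" (illustrative numbers only).
words = ["She", "ate", "the", "cake"]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.1], [0.0, 2.0]]
values = keys  # simplification: reuse the keys as the values
query_ate = [0.0, 2.0]  # assumed query vector for "ate"

weights, output = scaled_dot_product_attention(query_ate, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.3f}")
```

With these assumed vectors, "cake" receives the largest weight for the query "ate", which matches the intuition in the example sentence.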

Back propagation is important because it helps neural networks learn from their mistakes. After the network makes a prediction, back propagation works backward through the hidden layers to measure how much each weight and bias contributed to the error, and then those weights and biases are adjusted. This allows the network to get better over time. Back propagation is similar to how a teacher corrects you in class: the teacher goes back, sees what went wrong, and goes step by step to help you fix it. In summary, back propagation works as follows. Step 1 is to guess the answer to the question. Step 2 is to check how wrong the answer was. Step 3 is to work out what caused the error and fix it. Step 4 is to try again with a better guess.
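The four steps can be sketched with a tiny network that has one hidden value and two weights. This is a toy example under our own assumptions (a single training example, squared error, a fixed learning rate, and the hypothetical function name `train`), not the training setup from the paper. It only illustrates how the error is passed backward through the layers with the chain rule.

```python
def train(x, target, w1=0.5, w2=0.5, learning_rate=0.05, steps=200):
    """Fit prediction = w2 * (w1 * x) to a single target value."""
    losses = []
    for _ in range(steps):
        # Step 1: guess the answer (forward pass).
        hidden = w1 * x
        prediction = w2 * hidden

        # Step 2: check how wrong the guess was (squared error).
        error = prediction - target
        losses.append(error ** 2)

        # Step 3: go backward through the layers with the chain rule
        # to find how much each weight contributed to the error.
        grad_prediction = 2 * error
        grad_w2 = grad_prediction * hidden
        grad_w1 = grad_prediction * w2 * x

        # Step 4: adjust the weights so the next guess is better.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2, losses

w1, w2, losses = train(x=1.0, target=2.0)
print("first loss:", losses[0])
print("last loss:", losses[-1])
```

The loss shrinks as the loop repeats, which is the "try again with a better guess" step in action.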

Thus, with these improvements, transformers take less time to train than the recurrent models they replaced, because they can process all the words in parallel, and they achieve better results. However, transformers aren't perfect. They compare every word in a sentence to every other word, so the amount of work grows with the square of the input length. For a long text or video, this becomes very expensive. Even though transformers are very powerful, they also need a lot of data to train. Reducing the training data generally means the model will perform worse and handle fewer tasks. This research is significant because it showed that older, popular neural networks like RNNs and LSTMs could be replaced by transformers. One implication is that transformers led to new AI models such as ChatGPT. This research shows that a model built around attention alone can reach a better understanding of language.
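To see why long inputs are costly, we can count the query-key comparisons a single self-attention layer makes. This is rough arithmetic with a hypothetical helper name, and it ignores the number of attention heads, the number of layers, and the vector size.

```python
def attention_comparisons(num_tokens):
    """Number of query-key scores in one self-attention layer:
    every token is compared with every token, including itself."""
    return num_tokens * num_tokens

for length in (10, 100, 1000):
    print(length, "tokens ->", attention_comparisons(length), "comparisons")
# 10 tokens -> 100 comparisons
# 100 tokens -> 10000 comparisons
# 1000 tokens -> 1000000 comparisons
```

Making the input 10 times longer makes the number of comparisons 100 times larger.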

The Transformer was a new model designed to remove the constraint of sequential computation. It uses attention to determine the importance of words, which allows it to focus on the most relevant parts of the input. On the WMT 2014 English-to-German and English-to-French translation tasks, the Transformer achieved state-of-the-art results. It was a translation model rather than a large language model, but it became the foundation for the LLMs that followed and significantly improved AI. In conclusion, this paper focuses on the use of attention in the Transformer, how the process functions, and why it is notable. The authors described the main steps of the model and how attention plays a major role in the process. Overall, the emphasis on attention highlights a shift in AI towards models that process information more efficiently, paving the way for more powerful and efficient models in the future.