GTU MOST IMP QUESTIONS FOR AIML
1. What is the Word Embedding Technique?
Word Embedding is a technique used in NLP to represent words or phrases as vectors of real numbers. The key idea behind word embeddings is to map words from a sparse, high-dimensional space (one-hot encoding) to a denser, continuous vector space where semantically similar words are located closer to each other.
Purpose of Word Embeddings:
- Dimensionality Reduction: Word embeddings reduce the dimensions of the data, making it easier to process.
- Capturing Semantics: Embeddings capture relationships between words (synonymy, antonymy, etc.) by placing semantically similar words closer together in the vector space.
- Efficiency: Word embeddings enable more efficient computation and are typically used as input features for machine learning models, reducing the computational complexity of NLP tasks.
2. Define the Term Word Embedding and List Various Word Embedding Techniques
Word Embedding:
A word embedding is a learned representation of a word, where each word is represented by a dense vector of numbers. These vectors are learned through training on a large corpus of text data. The vectors capture the semantic meaning of words based on their context in the corpus.
For example:
- "King" might be represented by a vector like .
- "Queen" might have a similar vector, with slight differences in each dimension, reflecting their similar meanings.
Common Word Embedding Techniques:
Word2Vec:
- Overview: Word2Vec is a popular method that learns word representations using a shallow neural network. It comes in two architectures:
- Skip-gram model: Predicts the context (surrounding words) for a given word.
- Continuous Bag of Words (CBOW): Predicts the target word given the context.
- Example Use Case: It can learn that "dog" and "puppy" are similar words based on their usage in similar contexts.
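Below is a minimal sketch of training Word2Vec with the gensim library on a tiny, made-up corpus; the corpus, vector size, and other parameters are purely illustrative.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only).
sentences = [
    ["the", "dog", "barked", "at", "the", "mailman"],
    ["the", "puppy", "barked", "at", "the", "stranger"],
    ["the", "cat", "slept", "on", "the", "sofa"],
]

# sg=1 selects the Skip-gram architecture; sg=0 would use CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Each word is now a dense 50-dimensional vector.
print(model.wv["dog"].shape)                 # (50,)
print(model.wv.similarity("dog", "puppy"))   # cosine similarity between the two vectors
```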
GloVe (Global Vectors for Word Representation):
- Overview: GloVe is a count-based model that combines the advantages of matrix factorization and the context of words. It focuses on the global co-occurrence statistics of a corpus to derive the embeddings.
- Key Feature: GloVe attempts to create embeddings such that the dot product of two word vectors approximates the log of the probability of the two words co-occurring in the corpus.
FastText:
- Overview: Developed by Facebook, FastText extends Word2Vec by breaking words down into n-grams (sub-word units). This allows the model to capture morphology (such as prefixes and suffixes) and better handle out-of-vocabulary words.
ELMo (Embeddings from Language Models):
- Overview: ELMo uses a deep bi-directional LSTM language model to generate word embeddings. Unlike Word2Vec, ELMo captures context-dependent meanings (word meaning changes based on context).
BERT (Bidirectional Encoder Representations from Transformers):
- Overview: BERT uses transformer-based models for pre-training and produces contextual embeddings for words, considering the full context of a sentence, making it more powerful than traditional embeddings.
3. Explain Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BoW), Word2Vec
Term Frequency - Inverse Document Frequency (TF-IDF):
TF-IDF is a statistical measure used to evaluate how important a word is in a document relative to a collection of documents (corpus). It helps to highlight words that are unique or specific to a document while suppressing common words.
Term Frequency (TF): The frequency of a word in a specific document.
Inverse Document Frequency (IDF): Measures the importance of the word across the entire corpus.
TF-IDF Calculation: TF-IDF(t, d) = TF(t, d) × IDF(t), i.e. the term frequency of term t in document d multiplied by the inverse document frequency of t across the corpus.
Bag of Words (BoW):
Overview: BoW is a simple model used to represent text data. It treats a document as a collection of words without considering word order. Each word is represented as a feature, and the model counts the occurrences of each word in the document.
Drawback: It does not capture semantics or word order. For example, "cat" and "dog" would be treated as independent features without understanding their relationship.
Word2Vec:
- Overview: As described above, Word2Vec represents words as dense vectors and tries to capture semantic relationships by using the context of words in a neural network framework. It uses either the Skip-gram or CBOW models.
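As a rough sketch of how these representations are produced in practice, scikit-learn's CountVectorizer (BoW) and TfidfVectorizer (TF-IDF) can be compared on a small, made-up corpus; the documents below are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the dog jumped over the fence",
    "the dog played in the yard",
    "a cat sat on the mat",
]

# Bag of Words: raw term counts, with no notion of word importance or order.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted so that terms appearing in every document (e.g. "the")
# receive less weight than terms specific to a single document.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```

Note that scikit-learn applies smoothing and normalization to its TF-IDF values, so they will not exactly match a hand calculation with the plain formula above.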
4. Calculate the TF and IDF for the Below Example
Given Sentences:
- [He is Walter]
- [He is William]
- [He isn’t Peter or Walter]
Step 1: Calculate Term Frequency (TF):
- TF for each term in each sentence:
| Word | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| He | 1/3 | 1/3 | 1/5 |
| is | 1/3 | 1/3 | 0 |
| Walter | 1/3 | 0 | 1/5 |
| William | 0 | 1/3 | 0 |
| isn't | 0 | 0 | 1/5 |
| Peter | 0 | 0 | 1/5 |
| or | 0 | 0 | 1/5 |
- Note: TF for a word is calculated by dividing the number of times the word appears in a sentence by the total number of words in that sentence. Here "isn't" is treated as a single token, so Sentence 3 contains 5 words.
Step 2: Calculate Inverse Document Frequency (IDF):
- IDF for each term:
The formula for IDF is: IDF(t) = log(N / df(t)), where N is the total number of documents (here, sentences) and df(t) is the number of documents containing the term t. The natural logarithm is used below.
Document count: 3 (total sentences)
Word frequency across documents:
- "He" appears in all 3 sentences.
- "is" appears in all 3 sentences.
- "Walter" appears in 2 sentences.
- "William", "isn’t", "Peter", "or" appear in only 1 sentence.
IDF values:
- IDF("He") = log(3/3) = 0
- IDF("is") = log(3/3) = 0
- IDF("Walter") = log(3/2) ≈ 0.176
- IDF("William") = log(3/1) ≈ 1.098
- IDF("isn't") = log(3/1) ≈ 1.098
- IDF("Peter") = log(3/1) ≈ 1.098
- IDF("or") = log(3/1) ≈ 1.098
Step 3: Calculate TF-IDF:
Now, you multiply the TF values by the corresponding IDF values to obtain the TF-IDF scores for each word.
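The following sketch reproduces the hand calculation above in Python (natural log for IDF, "isn't" kept as one token).

```python
import math
from collections import Counter

sentences = [
    ["He", "is", "Walter"],
    ["He", "is", "William"],
    ["He", "isn't", "Peter", "or", "Walter"],
]

N = len(sentences)
vocab = sorted({w for s in sentences for w in s})

# Document frequency: number of sentences that contain each word.
df = {w: sum(1 for s in sentences if w in s) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

for i, sent in enumerate(sentences, start=1):
    counts = Counter(sent)
    for w in vocab:
        tf = counts[w] / len(sent)
        print(f"Sentence {i}  {w:<8} TF={tf:.3f}  IDF={idf[w]:.3f}  TF-IDF={tf * idf[w]:.3f}")
```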
5. Write Short Notes on GloVe
GloVe (Global Vectors for Word Representation) is a word embedding technique developed by Stanford researchers. It is based on factorizing the word co-occurrence matrix, using both global and local statistical information to generate word vectors. Unlike Word2Vec, which is based on a local context window, GloVe tries to capture the global relationships between words.
Key Features of GloVe:
- Matrix Factorization: GloVe starts by constructing a large co-occurrence matrix, where each element (i, j) represents how often word i occurs in the context of word j.
- Optimization Objective: The goal is to minimize the difference between the dot product of word vectors and the logarithm of the co-occurrence probability of the corresponding words.
- Handling Global Statistics: GloVe captures information about the global structure of the corpus, making it efficient for tasks involving semantic meaning and analogies.
Example Use Case:
GloVe has been shown to effectively capture word analogies, such as:
- "king" - "man" + "woman" ≈ "queen"
Advantages of GloVe:
- It effectively models word co-occurrences.
- Suitable for large corpora and efficient training.
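A minimal sketch of the analogy above, using pretrained 50-dimensional GloVe vectors fetched through gensim's downloader (assumes network access for the one-time download; the vocabulary of this model is lower-cased).

```python
import gensim.downloader as api

# Pretrained GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: king - man + woman should rank "queen" near the top.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```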
6. List Out All Applications of NLP and Explain Any Three
Natural Language Processing (NLP) is a subfield of AI that focuses on the interaction between computers and human language. NLP is widely used in many applications, some of which include:
- Text Classification (Sentiment Analysis)
- Machine Translation
- Named Entity Recognition (NER)
- Speech Recognition
- Question Answering Systems
- Text Summarization
- Spell and Grammar Checking
- Chatbots and Virtual Assistants
- Information Retrieval
- Text-to-Speech and Speech-to-Text
Explanation of Three Applications:
Text Classification (Sentiment Analysis):
- Purpose: Involves categorizing text into predefined categories, such as detecting positive or negative sentiment in product reviews or social media posts.
- Example: Analyzing customer feedback to determine if the sentiment is positive, neutral, or negative.
- Use Case: It helps businesses understand customer opinions, leading to better decision-making and improvements in products or services.
Machine Translation:
- Purpose: Translates text or speech from one language to another.
- Example: Google Translate and DeepL use NLP to convert English text into Spanish, French, German, etc., while maintaining grammatical and contextual accuracy.
- Use Case: It enables communication across language barriers, enhancing global interaction in education, business, and technology.
Named Entity Recognition (NER):
- Purpose: Identifies and classifies named entities (such as people, organizations, dates, locations) in text.
- Example: In the sentence "Apple was founded by Steve Jobs in Cupertino in 1976," NER can identify "Apple" as an organization, "Steve Jobs" as a person, "Cupertino" as a location, and "1976" as a date.
- Use Case: NER is used in search engines, social media monitoring, and automatic document indexing.
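As a short illustration of NER in practice, the sentence from the example above can be run through spaCy (a sketch assuming the small English model has been installed with `python -m spacy download en_core_web_sm`).

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")

# Print each detected entity with its label, e.g. Apple/ORG, Steve Jobs/PERSON,
# Cupertino/GPE, 1976/DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```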
7. Explain the Calculation of TF (Term Frequency) for a Document with a Suitable Example
Term Frequency (TF) is a metric used to measure the frequency of a word in a document relative to the total number of words in that document. It gives us an idea of how important a word is within a document.
Formula: TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Example:
Consider the following document:
Document 1: "The dog jumped over the fence."
- Step 1: Count the frequency of each word in the document (counting case-insensitively, so "The" and "the" are the same term):
| Word | Frequency |
|---|---|
| the | 2 |
| dog | 1 |
| jumped | 1 |
| over | 1 |
| fence | 1 |
Step 2: Total number of terms in the document:
The document has 6 words in total, so the total number of terms is 6.
Step 3: Calculate the TF for each word:
For the word "the" (appears twice in the document):
For the word "dog" (appears once):
Repeat this for all words in the document.
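A small sketch of the same calculation in Python, lower-casing the text so that "The" and "the" count as the same term.

```python
from collections import Counter

document = "The dog jumped over the fence."
tokens = document.lower().replace(".", "").split()   # ['the', 'dog', 'jumped', 'over', 'the', 'fence']

counts = Counter(tokens)
total = len(tokens)                                   # 6 terms in the document

for word, count in counts.items():
    print(f"TF('{word}') = {count}/{total} = {count / total:.3f}")
# TF('the') = 2/6 = 0.333, TF('dog') = 1/6 = 0.167, ...
```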
8. Explain the Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) is a measure of how important a word is within the entire corpus. While TF gives the frequency of the word in a document, IDF helps to scale down the weight of words that occur very frequently across documents (like common words such as "the", "is", etc.) and gives higher weight to words that occur less frequently, thus providing more discriminative power to rare words.
Formula: IDF(t) = log(N / df(t)) (natural logarithm used here)
Where:
- N = Total number of documents in the corpus
- df(t) = The number of documents containing the term t
Example:
Consider a corpus with 5 documents:
- Document 1: "The dog jumped over the fence."
- Document 2: "The dog played in the yard."
- Document 3: "A cat sat on the mat."
- Document 4: "The fence was painted red."
- Document 5: "The dog barked loudly."
Now, to calculate the IDF for the word "dog":
- The word "dog" appears in 3 out of 5 documents (Documents 1, 2, and 5).
Using the IDF formula: IDF("dog") = log(5/3) ≈ 0.511
For a common word like "the," which appears in all 5 documents: IDF("the") = log(5/5) = 0, so it carries no discriminative weight.
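A minimal sketch of the same IDF calculation over the 5-document corpus, using the natural logarithm as above.

```python
import math

corpus = [
    "The dog jumped over the fence.",
    "The dog played in the yard.",
    "A cat sat on the mat.",
    "The fence was painted red.",
    "The dog barked loudly.",
]

def idf(term, docs):
    # df = number of documents whose lower-cased tokens contain the term.
    df = sum(1 for d in docs if term in d.lower().replace(".", "").split())
    return math.log(len(docs) / df)

print(f"IDF('dog') = {idf('dog', corpus):.3f}")   # log(5/3) ≈ 0.511
print(f"IDF('the') = {idf('the', corpus):.3f}")   # log(5/5) = 0
```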
9. List Out and Explain Each Challenge of TF-IDF and BOW
While both TF-IDF and Bag of Words (BoW) are widely used methods for representing text data, they come with several challenges:
Challenges of Bag of Words (BoW):
Lack of Semantic Meaning:
- BoW only counts word occurrences, disregarding their meanings. It treats words independently, which may not capture word relationships like synonyms or antonyms.
High Dimensionality:
- BoW can create large feature vectors, especially in a large corpus, leading to sparse vectors that are computationally inefficient and harder to process.
Loss of Word Order:
- BoW does not preserve the word order in the sentence. This is a critical issue since the meaning of a sentence can change depending on word sequence. For example, "dog bites man" and "man bites dog" are represented the same way in BoW (see the sketch after this list).
Handling Rare Words:
- BoW tends to give equal importance to all words, including very common or very rare ones, which may not be meaningful. It doesn't differentiate between high-frequency common words and rare, potentially meaningful terms.
Out-of-Vocabulary Words:
- BoW may struggle with words that weren’t seen during training, leading to difficulties in processing new data (out-of-vocabulary words).
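The sketch below demonstrates the loss of word order mentioned above: two sentences with opposite meanings produce identical Bag of Words vectors (scikit-learn's CountVectorizer is used for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["dog bites man", "man bites dog"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'man']
print(vectors)                             # both rows are [1 1 1] -- word order is lost
```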
Challenges of TF-IDF:
Overweighting Rare Words:
- While rare words can be more discriminative, TF-IDF may overemphasize them, especially when they appear only in one document but are still assigned a high IDF score.
Sparsity:
- Like BoW, TF-IDF representations are also sparse and can lead to high-dimensional vectors that are hard to manage and computationally expensive.
Inability to Capture Context:
- TF-IDF treats each word independently and does not capture the context in which words are used. This is problematic in tasks like word sense disambiguation, where the meaning of a word depends on its context.
Sensitivity to Document Length:
- Longer documents may have higher TF values for frequent terms, which can skew the results if not normalized. While IDF helps, it may not fully correct for the disparity in document lengths.