DIPLOMA CH 4 MOST IMP QUESTIONS (AI-ML)

GTU MOST IMP QUESTIONS FOR AI-ML

1. What is NLP? Write the Advantages and Disadvantages of NLP

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that enables computers to process, understand, and generate human language in a way that is both meaningful and useful. It combines computational linguistics and machine learning to bridge the gap between human communication and computer understanding. NLP encompasses a variety of tasks, such as language translation, sentiment analysis, text classification, speech recognition, and chatbot interaction.

Advantages of NLP:

  • Automation of Manual Tasks: NLP can automate tedious processes such as data entry, document categorization, and sentiment analysis. This helps businesses save time and resources.

  • Improved User Interaction: NLP is the backbone of many conversational agents like virtual assistants (e.g., Siri, Alexa). It allows users to interact with machines through natural language, improving user experience.

  • Enhanced Decision Making: NLP can analyze vast amounts of unstructured data (e.g., social media posts, customer reviews) to extract valuable insights that support business decision-making.

  • Language Translation: NLP enables real-time translation between languages. Applications like Google Translate are powered by NLP, breaking down language barriers globally.

  • Information Extraction: NLP is used to extract structured data from unstructured sources, such as pulling key facts from a news article or legal documents.

Disadvantages of NLP:

  • Language Ambiguity: Human language is highly ambiguous. Words, phrases, and sentences can have multiple meanings depending on context. This poses a challenge for NLP systems in terms of disambiguation.

  • Complexity of Language: Idiomatic expressions, humor, sarcasm, and context-specific phrases are difficult for machines to understand. For instance, interpreting “kick the bucket” as a literal phrase is misleading without contextual understanding.

  • High Resource Consumption: Advanced NLP techniques, such as deep learning-based models, require significant computational resources and data for training. This can be costly and time-consuming.

  • Bias and Ethical Concerns: NLP systems are often trained on large datasets that may contain inherent biases. This can lead to biased predictions, such as gender or racial biases, in sentiment analysis or automated decision-making systems.

  • Cultural Nuances: NLP systems may fail to grasp the cultural or regional nuances present in the language, leading to inaccuracies in text interpretation.


2. Difference Between Natural Language Understanding (NLU) and Natural Language Generation (NLG)

Natural Language Understanding (NLU) and Natural Language Generation (NLG) are two fundamental tasks in NLP that focus on interpreting and producing human language, respectively.

Natural Language Understanding (NLU):

  • Goal: NLU focuses on enabling computers to understand and interpret human language. It involves extracting meaning, intent, and context from text or speech.

  • Key Tasks:

    • Named Entity Recognition (NER): Identifying entities like names of people, organizations, locations, dates, etc.
    • Part-of-Speech (POS) Tagging: Identifying the grammatical role of each word (e.g., noun, verb, adjective).
    • Sentiment Analysis: Determining the emotional tone of the text (positive, negative, neutral).
    • Intent Recognition: Understanding the user's intention, such as whether they want information or want to perform an action.
  • Example: Given the sentence, “I want to book a flight to New York tomorrow,” NLU would identify "book" as the intent (action) and "New York" as the destination (location).

Natural Language Generation (NLG):

  • Goal: NLG is the process of generating human-like language from structured data or internal representations. It allows computers to produce text that reads naturally, often used for creating reports or summaries.

  • Key Tasks:

    • Summarization: Creating a brief summary of a longer text while retaining its key points.
    • Text Paraphrasing: Rewriting text in a different way without changing its meaning.
    • Report Generation: Generating coherent and structured reports from raw data (e.g., creating a weather report from sensor data).
  • Example: From the input data "Temperature: 72°F, Condition: Sunny," an NLG system might generate, "The weather today is sunny with a temperature of 72°F."

Aspect        | NLU                                                       | NLG
Focus         | Understanding and interpreting human language             | Generating human-like text from structured data
Task Examples | Sentiment analysis, entity recognition, POS tagging       | Summarization, paraphrasing, report generation
End Result    | Structured meaning (e.g., extracted entities or intents)  | Coherent, human-readable text
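
For intuition only, here is a toy, rule-based sketch of the two directions: a keyword match standing in for NLU intent and entity extraction, and a string template standing in for NLG. Real systems use trained models; the names and logic below are purely illustrative.

    # Toy NLU: crude keyword matching standing in for intent/entity extraction
    utterance = "I want to book a flight to New York tomorrow"
    intent = "book_flight" if "book" in utterance.lower() and "flight" in utterance.lower() else "unknown"
    destination = utterance.split(" to ")[-1].replace(" tomorrow", "")
    print(intent, "->", destination)  # book_flight -> New York

    # Toy NLG: template-based generation from structured data
    data = {"temperature": 72, "condition": "Sunny"}
    print(f"The weather today is {data['condition'].lower()} with a temperature of {data['temperature']}°F.")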

3. Applications of NLP

NLP is used in a wide range of industries and applications, improving efficiency, enhancing user experience, and enabling powerful features.

  1. Machine Translation:

    • Example: Google Translate, which translates text between languages, relies heavily on NLP to understand context and meaning in both source and target languages.
    • Challenges: Handling idiomatic expressions, slang, and word order differences between languages.
  2. Speech Recognition:

    • Example: Virtual assistants like Siri, Alexa, and Google Assistant use NLP to convert spoken language into text, which is then processed for action (e.g., setting reminders, checking the weather).
    • Challenges: Accents, background noise, and homophones can affect recognition accuracy.
  3. Sentiment Analysis:

    • Example: Businesses use sentiment analysis on customer reviews or social media posts to gauge consumer opinions about products, services, or brands.
    • Challenges: Understanding context, detecting sarcasm, and differentiating between subjective and objective statements.
  4. Text Summarization:

    • Example: Automatically summarizing long documents like research papers or news articles into concise abstracts or key points.
    • Challenges: Maintaining the original meaning while compressing the information into a much shorter format.
  5. Chatbots and Virtual Assistants:

    • Example: Chatbots used in customer service interact with users to resolve issues or answer queries.
    • Challenges: Understanding complex or vague customer queries and providing accurate responses.
  6. Text Classification:

    • Example: Categorizing emails into spam or non-spam, or classifying news articles by topic (e.g., politics, sports).
    • Challenges: Misclassifications can occur due to subtle nuances or complex language.
  7. Information Extraction:

    • Example: Extracting structured data like dates, names, and places from unstructured sources like legal contracts, news reports, or medical records.
    • Challenges: The text may be complex, contain errors, or use ambiguous language.

4. Phases of NLP in Detail

NLP tasks typically follow a series of phases to convert raw language data into structured information:

  1. Text Preprocessing:

    • Tokenization: Dividing text into smaller chunks such as words, sentences, or phrases. For example, the sentence “I love NLP” would be tokenized into ["I", "love", "NLP"].
    • Lowercasing: Converting all text to lowercase to maintain uniformity and reduce complexity.
    • Punctuation Removal: Eliminating unnecessary punctuation marks (unless they’re significant to the task).
    • Stop Words Removal: Removing common words like "the", "is", "and" that do not contribute to meaning in most NLP tasks.
  2. Lexical Analysis:

    • Stemming: The process of reducing words to their root forms. For example, “running” becomes “run.”
    • Lemmatization: A more refined process that involves mapping words to their lemma (base form), ensuring it’s a valid word. For example, “better” becomes “good.”
  3. Syntactic Analysis:

    • Parsing: Understanding the grammatical structure of a sentence. Parsing identifies subject-verb-object relationships in sentences.
    • Dependency Parsing: Identifies how words in a sentence depend on each other. For example, in the sentence "She ate the apple," "She" (the subject) and "apple" (the object) both depend on the head verb "ate".
  4. Semantic Analysis:

    • Word Sense Disambiguation (WSD): Determining the correct meaning of words based on context. For example, "bank" could refer to a financial institution or the side of a river, depending on the context.
    • Named Entity Recognition (NER): Identifying and categorizing key entities in text, such as names of people, locations, organizations, etc.
  5. Discourse Analysis:

    • Understanding how sentences relate to one another and the larger context. For example, resolving coreference like “he” referring to “John” in a paragraph.
  6. Pragmatics:

    • Analyzing how the context of a sentence affects its interpretation. It helps understand implied meaning or politeness, for example, recognizing that “Could you pass the salt?” is a request, not a question.
  7. Text Generation:

    • Generating human-readable text, either as a direct translation of structured data (e.g., weather reports from temperature data) or as a natural continuation of a conversation (e.g., chatbots).
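
A minimal sketch tying the preprocessing and lexical-analysis phases together with NLTK (assuming the 'punkt' and 'stopwords' data packages have been downloaded via nltk.download()):

    import string

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    text = "I love NLP, and NLP loves me!"
    tokens = word_tokenize(text.lower())                          # tokenization + lowercasing
    tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]           # stop word removal
    ps = PorterStemmer()
    print([ps.stem(t) for t in tokens])                           # stemming: ['love', 'nlp', 'nlp', 'love']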

5. Lexical Ambiguity, Syntactic Ambiguity, and Referential Ambiguity

Lexical Ambiguity:

  • Definition: Lexical ambiguity occurs when a word has multiple meanings or interpretations. The ambiguity arises from the fact that a single word can belong to different categories or have different meanings based on context.

  • Example:

    • Word: "Bat"
      • Meaning 1: A flying mammal.
      • Meaning 2: A piece of equipment used in baseball.
    • In the sentence “He hit the ball with a bat,” it’s clear from context that "bat" refers to a piece of equipment.
  • How it’s resolved: NLP systems rely on context to distinguish between different meanings of ambiguous words. This can be done using Word Sense Disambiguation (WSD) algorithms, which determine the most likely meaning based on the surrounding words.
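
As a rough illustration of WSD, the sketch below runs NLTK's implementation of the Lesk algorithm (assuming the 'wordnet' and 'punkt' data packages are downloaded). Lesk is a simple dictionary-overlap baseline, so its chosen sense is not always the intuitively correct one.

    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    sentence = "He deposited money at the bank"
    sense = lesk(word_tokenize(sentence), "bank")   # returns a WordNet Synset (or None)
    print(sense, "-", sense.definition() if sense else "no sense found")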

Syntactic Ambiguity:

  • Definition: Syntactic ambiguity occurs when the structure of a sentence can be interpreted in multiple ways, leading to different meanings. This happens because of multiple ways to parse the sentence's syntax.

  • Example:

    • Sentence: “I saw the man with the telescope.”
      • Interpretation 1: The speaker saw a man who had a telescope.
      • Interpretation 2: The speaker used a telescope to see the man.
  • How it’s resolved: Syntactic ambiguity is often resolved through parsing, where the syntactic structure of a sentence is analyzed to determine the correct interpretation. Dependency parsing is often used to resolve ambiguities in sentence structure.
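
As a rough illustration, the sketch below uses a small hand-written toy grammar (purely for this example) with NLTK's ChartParser to recover both readings of the ambiguous sentence.

    from nltk import CFG, ChartParser

    # deliberately ambiguous toy grammar: the PP can attach to the NP or the VP
    grammar = CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'the'
    N -> 'man' | 'telescope'
    V -> 'saw'
    P -> 'with'
    """)

    parser = ChartParser(grammar)
    tokens = "I saw the man with the telescope".split()
    for tree in parser.parse(tokens):
        print(tree)   # prints two parse trees, one per interpretation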

Referential Ambiguity:

  • Definition: Referential ambiguity occurs when it is unclear what a pronoun or noun phrase refers to in a sentence. This typically involves pronouns like “he,” “she,” or “it” and the antecedents to which they refer.

  • Example:

    • Sentence: “John told David that he was going to the store.”
      • Who is “he”? Is it John or David? This is an example of referential ambiguity.
  • How it’s resolved: Coreference resolution techniques are used to identify which entity a pronoun refers to. In this case, the ambiguity can be resolved by looking at the context or using algorithms that track the antecedents in a conversation.


6. Stemming Words and Parts of Speech (POS) Tagging with Suitable Example

Stemming:

  • Definition: Stemming is the process of reducing words to their base or root form. It involves chopping off the suffixes of words to get to a common stem, which might not always be a valid word in the language.

  • Example:

    • Words: “running,” “runner,” “ran”
    • Stem: “run” (all variations are reduced to “run”)
  • Tools: Popular stemming algorithms include the Porter Stemmer and the Lancaster Stemmer.

Parts of Speech (POS) Tagging:

  • Definition: POS tagging is the process of assigning grammatical labels to each word in a sentence based on its function. These tags include nouns, verbs, adjectives, adverbs, etc.

  • Example:

    • Sentence: “She enjoys reading books.”
      • POS Tags:
        • She -> Pronoun (PRP)
        • enjoys -> Verb (VBZ)
        • reading -> Verb (VBG)
        • books -> Noun (NNS)
  • Why it’s important: POS tagging helps NLP systems understand sentence structure and disambiguate words that may have multiple meanings based on their function.
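
A short sketch combining the two ideas above with NLTK (assuming the 'punkt' and 'averaged_perceptron_tagger' data packages are downloaded; exact tags can vary with the tagger version):

    from nltk import pos_tag
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("She enjoys reading books.")
    print(pos_tag(tokens))
    # [('She', 'PRP'), ('enjoys', 'VBZ'), ('reading', 'VBG'), ('books', 'NNS'), ('.', '.')]

    ps = PorterStemmer()
    print([ps.stem(t) for t in tokens if t.isalpha()])
    # ['she', 'enjoy', 'read', 'book']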


7. Difference Between Stemming and Lemmatization

Both stemming and lemmatization aim to reduce words to their base forms, but they do so in different ways:

Stemming:

  • Process: Stemming uses heuristics to strip prefixes or suffixes from words. It does not always produce valid words but aims to reduce the word to its root form.
  • Example:
    • "better" -> "better" (no stemming applied)
    • "running" -> "run" (using heuristics)
  • Advantages: Fast and computationally efficient.
  • Disadvantages: May lead to non-words or stems that are not linguistically correct.

Lemmatization:

  • Process: Lemmatization is more sophisticated, involving vocabulary and morphological analysis. It uses a dictionary or lexicon to reduce a word to its lemma, ensuring the root form is a valid word.
  • Example:
    • "better" -> "good" (using a lexicon to get the correct lemma)
    • "running" -> "run" (reducing it to the base form using grammar rules)
  • Advantages: Results in valid words and is more linguistically accurate.
  • Disadvantages: Slower and requires more computational resources than stemming.

Aspect     | Stemming                                     | Lemmatization
Complexity | Simple and fast                              | More complex and slower
Output     | May result in non-words or incorrect stems   | Outputs valid words (lemmas)
Accuracy   | Less accurate                                | More accurate
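
A small side-by-side sketch of the difference (assuming the 'wordnet' NLTK data package is downloaded for the lemmatizer):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    ps = PorterStemmer()
    lem = WordNetLemmatizer()

    for word, pos in [("running", "v"), ("better", "a"), ("studies", "n")]:
        print(word, "-> stem:", ps.stem(word), "| lemma:", lem.lemmatize(word, pos=pos))
    # running -> stem: run    | lemma: run
    # better  -> stem: better | lemma: good
    # studies -> stem: studi  | lemma: study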

8. Filtering of Stop Words in Detail

Stop Words are common words that occur frequently in a language (e.g., "the", "and", "is", "to") but do not carry significant meaning or value in certain NLP tasks like text classification, sentiment analysis, or topic modeling.

Why Stop Words Are Removed:

  • Noise Reduction: Stop words often introduce noise in the data, so removing them helps focus on the more meaningful words.
  • Improved Efficiency: Removing stop words reduces the size of the data being processed, speeding up NLP tasks and saving computational resources.

Example:

  • Original Text: "I am going to the market to buy some fruits."
  • After Removing Stop Words: "going market buy fruits"

Challenges:

  • In some contexts, stop words may carry important meaning (e.g., in sentiment analysis, the word “not” can change the meaning of a sentence), so a blanket removal strategy is not always ideal.

Commonly Removed Stop Words:

  • Articles: "the", "a", "an"
  • Pronouns: "I", "you", "he", "she"
  • Prepositions: "in", "on", "at"
  • Conjunctions: "and", "but", "or"
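
A short sketch of the caveat noted under Challenges above: NLTK's English stop word list includes "not", so sentiment-style tasks may need to keep negation words (assuming the 'stopwords' and 'punkt' data packages are downloaded).

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))
    stop_words.discard('not')   # keep negation words for sentiment-style tasks

    tokens = word_tokenize("The service was not good".lower())
    print([t for t in tokens if t not in stop_words])   # ['service', 'not', 'good']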

9. Data Processing Using the NLTK Library

The Natural Language Toolkit (NLTK) is a comprehensive Python library for NLP tasks. It provides modules and functions for handling text processing, including tokenization, stemming, POS tagging, and more. Here's how data processing is done using NLTK:

1. Tokenization:

  • Purpose: Splitting text into smaller units like words or sentences.

  • Example:

    from nltk.tokenize import word_tokenize

    text = "I love NLP."
    tokens = word_tokenize(text)
    print(tokens)  # Output: ['I', 'love', 'NLP', '.']

2. POS Tagging:

  • Purpose: Assign grammatical labels to each token in the text.

  • Example:

    from nltk import pos_tag
    from nltk.tokenize import word_tokenize

    text = "I love NLP"
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    print(tagged)  # Output: [('I', 'PRP'), ('love', 'VBP'), ('NLP', 'NNP')]

3. Stemming:

  • Purpose: Reduce words to their root form.

  • Example:

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    print(ps.stem("running"))  # Output: "run"

4. Lemmatization:

  • Purpose: Reduce words to their dictionary form (lemma).

  • Example:

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("better", pos="a"))  # Output: "good"

5. Named Entity Recognition (NER):

  • Purpose: Identify entities like names, locations, and dates.

  • Example:

    from nltk import ne_chunk, pos_tag
    from nltk.tokenize import word_tokenize

    text = "Barack Obama was born in Hawaii."
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)  # groups POS-tagged tokens into named-entity chunks
    print(tree)

6. Stop Word Removal:

  • Purpose: Remove common words that do not contribute much to the meaning.

  • Example:

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    words = ["I", "love", "NLP"]
    filtered_words = [word for word in words if word.lower() not in stop_words]
    print(filtered_words)  # Output: ['love', 'NLP']

10. What do you mean by Corpus in NLP?

Definition:

A corpus (plural: corpora) is a large and structured set of text or speech data that is used for linguistic analysis, machine learning, and Natural Language Processing (NLP). It serves as the foundational dataset for training and evaluating NLP models.

  • Purpose: A corpus is used to train models, test hypotheses, and validate the performance of NLP systems. It contains representative examples of language, and by processing this data, algorithms can learn the structure, grammar, semantics, and other aspects of language.

Types of Corpora:

  1. Text Corpora: These are collections of written texts, such as books, articles, and social media posts.

    • Example: The Brown Corpus is a well-known collection of text from various genres (news, fiction, etc.).
  2. Speech Corpora: These consist of spoken language data, often transcribed into text.

    • Example: The Switchboard Corpus, a collection of telephone conversations.
  3. Parallel Corpora: These corpora contain text that is translated into multiple languages, helping in tasks like machine translation.

    • Example: The Europarl Corpus contains European Parliament proceedings in various languages.
  4. Domain-Specific Corpora: These are corpora related to specific fields like medical or legal texts.

    • Example: The PubMed corpus is a large collection of medical papers and articles.

Role of Corpus in NLP:

  • Training Data for Models: A corpus provides the necessary data for training models in NLP tasks like part-of-speech tagging, named entity recognition, sentiment analysis, etc.
  • Data Annotation: Corpora often come annotated with information such as syntactic structures or named entities, which is critical for supervised learning.
  • Corpus Analysis: Linguists and researchers use corpora to analyze language patterns, trends, and usage statistics across different contexts or over time.
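
A small sketch exploring the Brown Corpus bundled with NLTK (assuming nltk.download('brown') has been run):

    from nltk.corpus import brown

    print(brown.categories()[:5])               # e.g. ['adventure', 'belles_lettres', ...]
    print(len(brown.words()))                   # total number of word tokens in the corpus
    print(brown.words(categories='news')[:8])   # first few tokens from the news genre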

11. Discuss what is the NER and WORDNET.

Named Entity Recognition (NER):

  • Definition: NER is an NLP task that involves identifying and classifying named entities (people, organizations, locations, dates, etc.) in a text. The goal is to locate and categorize these entities in a structured manner.

Why NER is Important:

NER helps systems to understand and extract critical information from unstructured text. It is widely used in information extraction, machine translation, question answering systems, and more.

Example:

  • Sentence: "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."
  • NER Output:
    • "Apple Inc." → Organization
    • "Steve Jobs" → Person
    • "Cupertino" → Location
    • "1976" → Date

NER Techniques:

  • Rule-based Systems: Uses manually crafted rules to identify entities.
  • Statistical Models: Machine learning-based methods like Hidden Markov Models (HMM) or Conditional Random Fields (CRF).
  • Deep Learning Models: More recent methods use deep learning models like LSTMs or BERT for high-performance NER tasks.

WordNet:

  • Definition: WordNet is a lexical database or a thesaurus of the English language, which groups words into sets of synonyms called synsets. It also provides short definitions and records the relationships between different synsets.

Purpose of WordNet:

  • It helps in understanding the meanings of words and the relationships between them, such as synonyms, antonyms, hyponyms, and meronyms. This is crucial for various NLP applications like information retrieval, text classification, and machine translation.

Example:

  • WordNet Synsets:
    • Synset for “car”:
      • Synonyms: automobile, auto, motorcar
      • Hypernyms (broader term): motor vehicle
      • Hyponyms (specific types): sedan, hatchback, convertible

Features of WordNet:

  1. Synonyms: Words with similar meanings.
  2. Antonyms: Words with opposite meanings.
  3. Hypernyms: More general terms (e.g., "vehicle" is a hypernym of "car").
  4. Hyponyms: More specific terms (e.g., "sedan" is a hyponym of "car").
  5. Meronyms: Parts of something (e.g., "wheel" is a meronym of "car").
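
A minimal sketch of these WordNet relations using NLTK (assuming nltk.download('wordnet') has been run):

    from nltk.corpus import wordnet as wn

    car = wn.synsets('car')[0]            # first synset for "car"
    print(car.lemma_names())              # synonyms, e.g. ['car', 'auto', 'automobile', ...]
    print(car.hypernyms())                # broader terms, e.g. [Synset('motor_vehicle.n.01')]
    print(car.hyponyms()[:3])             # a few more specific types of car
    print(car.part_meronyms()[:3])        # parts of a car, e.g. wheels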

12. Explain the Frequency Distribution of Words in Detail

Definition:

A frequency distribution in NLP refers to how often each word appears in a corpus or a document. It provides a statistical view of the vocabulary and can help to understand the structure of the text. The frequency of a word indicates its importance or relevance in the document or corpus.

Purpose of Word Frequency Distribution:

  1. Text Analysis: Helps to identify the most important or frequent terms in a text.
  2. Feature Extraction: For tasks like text classification, word frequencies can be used as features to train machine learning models.
  3. Topic Modeling: Words with high frequencies may indicate key themes or topics in the text.

How to Calculate Word Frequency Distribution:

  • Step 1: Tokenize the text into individual words (tokens).
  • Step 2: Count the occurrence of each word.
  • Step 3: Create a distribution showing each word's frequency.

Example:

  • Text: “I love machine learning. Machine learning is fun.”
  • Tokenization (after lowercasing, ignoring punctuation): [“i”, “love”, “machine”, “learning”, “machine”, “learning”, “is”, “fun”]
  • Word Frequency Distribution:
    • "machine" -> 2
    • "learning" -> 2
    • "I" -> 1
    • "love" -> 1
    • "is" -> 1
    • "fun" -> 1

How to Compute Word Frequency in Python Using NLTK:


    from nltk import FreqDist
    from nltk.tokenize import word_tokenize

    text = "I love machine learning. Machine learning is fun."
    tokens = word_tokenize(text.lower())  # convert to lowercase for case-insensitive counting
    fdist = FreqDist(tokens)

    # Print a summary of the frequency distribution
    print(fdist)

    # View the frequency of individual words:
    for word, freq in fdist.items():
        print(f'{word}: {freq}')

Word Cloud:

A common visualization built from frequency distributions is the word cloud, in which words are displayed in sizes proportional to their frequency: words that appear more often are drawn larger.
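
A minimal sketch using the third-party wordcloud package (assumed installed via `pip install wordcloud`):

    from wordcloud import WordCloud

    text = "I love machine learning. Machine learning is fun."
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    wc.to_file("word_cloud.png")   # more frequent words are drawn larger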

Applications of Word Frequency Distribution:

  1. Stopword Removal: High-frequency words like "the", "is", "and" are typically stop words, which are often removed to focus on more meaningful content.
  2. Text Classification: Word frequency counts are used as features to classify texts into categories (e.g., spam vs. non-spam).
  3. Keyword Extraction: Frequent words in a document can help to identify key terms that summarize the content.

Advanced Techniques:

  • Term Frequency-Inverse Document Frequency (TF-IDF): This measure adjusts word frequency by how often the word appears across all documents in a corpus, making it useful for finding words that are important in specific documents but not common across all documents.

Example of TF-IDF:

  • TF: Measures how frequently a word appears in a document.
  • IDF: Measures how important a word is in the entire corpus (if a word is common across all documents, its importance decreases).

Formula:

  • TF: $TF(w) = \dfrac{\text{number of times word } w \text{ appears in the document}}{\text{total number of words in the document}}$
  • IDF: $IDF(w) = \log\left(\dfrac{\text{total number of documents}}{\text{number of documents containing word } w}\right)$
  • TF-IDF: $\text{TF-IDF}(w) = TF(w) \times IDF(w)$
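
A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer (assumed available); note that scikit-learn uses a smoothed, normalized IDF variant, so the values differ slightly from the textbook formula above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "I love machine learning",
        "Machine learning is fun",
        "I love NLP",
    ]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
    print(tfidf.toarray().round(2))             # one row per document, one column per word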

