Questions tagged [natural-language]
Natural Language Processing (NLP) is a set of techniques from linguistics, artificial intelligence, machine learning, and statistics that aim to process and understand human language.
1,151 questions
1 vote · 0 answers · 12 views · +100 bounty
NER with custom tags: how should I approach it?
I am building a "field tagger" for documents. Basically, a document, in my case something like a proposal or sales quote, would have a bunch of entities scattered throughout it, and we want ...
0 votes · 0 answers · 28 views
Normalizing the embedding space of an encoder language model with respect to categorical data
Suppose we have a tree/hierarchy of categories (e.g. categories of products in an e-commerce website), each node being assigned a title. Assume that the title of each node is semantically accurate, ...
0 votes · 0 answers · 9 views
Why learn an embedding before self attention when training transformers?
I understand that self-attention layers learn the "role" of a word in a sentence while embedding layers learn the relationship between the words. But I am not totally convinced that a self-...
0 votes · 0 answers · 12 views
Log-likelihood calculation for unigrams
I am calculating the log-likelihood for each unigram that I generated using the CountVectorizer, to see each unigram's importance. However, I got all positive values after calculating the log-...
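For context, here is a minimal sketch (plain Python, with a made-up toy corpus standing in for the CountVectorizer output) of per-unigram log-likelihood under a maximum-likelihood unigram model. Each term is a count times a log-probability, so it is always ≤ 0; uniformly positive values usually indicate a log-likelihood *ratio* statistic or a sign error, not a plain log-likelihood:

```python
import math
from collections import Counter

# Hypothetical corpus standing in for the real CountVectorizer input.
tokens = "the cat sat on the mat the cat slept".split()
counts = Counter(tokens)
total = sum(counts.values())

# Log-likelihood contribution of each unigram under the MLE unigram model:
# count(w) * log(count(w) / N). Each term is <= 0 because count/N <= 1.
loglik = {w: c * math.log(c / total) for w, c in counts.items()}

for w, ll in sorted(loglik.items(), key=lambda kv: kv[1]):
    print(f"{w:6s} {ll:8.3f}")
```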
4 votes · 2 answers · 534 views
Why is my randomForest model in R overfitting?
I am trying to train a Random Forest model in R for sentiment analysis. The model works with a tf-idf matrix and learns from it how to classify a review as positive or negative.
Positive ones are ...
0 votes · 0 answers · 20 views
Where does the equation $ C = 6 \times N \times T $ come from for Large Language Models, especially with a simple explanation for both passes?
Why $ C = 6 \times N \times T $?
I'm trying to understand the computational steps, specifically during the backward pass of neural networks, in relation to the widely cited formula $ C = 6 \times N \...
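For reference, the standard FLOP accounting behind this approximation (as used in the transformer scaling-law literature) is: each of the $N$ parameters contributes one multiply and one add per token in the forward pass, and the backward pass costs roughly twice the forward pass because it computes gradients with respect to both activations and weights.

```latex
% Per token, the forward pass through N parameters costs about 2N FLOPs
% (one multiply-accumulate per parameter). The backward pass costs about
% 4N FLOPs (gradients w.r.t. activations and w.r.t. weights). Over T
% training tokens:
C \approx \underbrace{2N}_{\text{forward}} T + \underbrace{4N}_{\text{backward}} T = 6NT
```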
0 votes · 0 answers · 18 views
Can 3D convolutions appropriately capture a frozen embedding space?
My project is a strange combination of NLP and Computer Vision.
I have datapoints that are 3D tensors, where each element is a token from an NLP vocabulary. The vocabulary is around 1000 unique "words"...
0 votes · 1 answer · 26 views
Find event date given the probabilities of finding an event
I have a set of clinical notes with dates for each patient, and an NLP model which gives a score between 0.0 and 1.0 for a certain event being present in the note. Given the scores, what is the best ...
0 votes · 0 answers · 10 views
Appropriateness of the Universal Sentence Encoder model
I have a classification problem where the goal is to predict, based on a small paragraph, if an individual is British or not.
The model used for the classification is Universal Sentence Encoder (to ...
0 votes · 1 answer · 33 views
Clustering of large text datasets with unknown number of clusters
I have a list of hotel names which may or may not be correct, and which may use different spellings (such as '&' instead of 'and'). I want to use clustering in order to group the hotels with different ...
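A common baseline for this kind of task, sketched below under assumptions (toy hotel names, a hypothetical similarity threshold, `difflib.SequenceMatcher` as the string-similarity measure), is to normalize the names and then greedily group them by similarity; the number of clusters then falls out of the threshold rather than being fixed up front:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Collapse common spelling variants before comparing.
    return name.lower().replace("&", "and").strip()

def cluster_names(names, threshold=0.85):
    """Greedy single-pass clustering: each name joins the first cluster
    whose representative is similar enough, else it starts a new cluster.
    The cluster count is implied by the threshold, not chosen in advance."""
    clusters = []  # list of lists of original names
    for name in names:
        norm = normalize(name)
        for cluster in clusters:
            rep = normalize(cluster[0])
            if SequenceMatcher(None, norm, rep).ratio() >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

hotels = ["Grand Hotel & Spa", "grand hotel and spa", "Seaview Inn", "Sea View Inn"]
print(cluster_names(hotels))
```

Greedy single-pass grouping is order-dependent; for larger lists, pairwise similarities fed into agglomerative clustering with a distance threshold behave more predictably.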
1 vote · 0 answers · 18 views
BERT eval loss increases while performance metrics also increase
I want to fine-tune BERT for Named Entity Recognition (NER). However, when fine-tuning over several epochs on different datasets I get a weird behaviour where the training loss decreases, eval loss ...
0 votes · 0 answers · 100 views
Locality sensitive hashing (LSH) with word embeddings and cosine similarity
I would like to ask about the methodology of the LSH algorithm with word embeddings and cosine similarity for identifying similar documents.
First, I tokenize my sentences to create a list of tokens. Then, I ...
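For reference, the standard way to pair LSH with cosine similarity is random-hyperplane hashing (SimHash): each embedding is reduced to a bit signature, and the fraction of matching bits approximates the angle between vectors. A minimal pure-Python sketch with made-up 4-dimensional "document embeddings" (the dimensions and vectors are illustrative, not from the question):

```python
import random

random.seed(0)

DIM = 4        # toy embedding dimensionality
NUM_BITS = 16  # signature length: more bits -> finer cosine approximation

# Each hyperplane is a random Gaussian vector; the sign of the dot
# product with an embedding contributes one bit of the signature.
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BITS)]

def signature(vec):
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def hamming_similarity(sig_a, sig_b):
    # Fraction of matching bits; in expectation this equals
    # 1 - angle(a, b) / pi, a monotone proxy for cosine similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy document embeddings: doc_b points nearly the same way as doc_a,
# doc_c points in a very different direction.
doc_a = [0.9, 0.1, 0.0, 0.2]
doc_b = [0.8, 0.2, 0.1, 0.2]
doc_c = [-0.7, 0.9, -0.3, 0.1]

sim_ab = hamming_similarity(signature(doc_a), signature(doc_b))
sim_ac = hamming_similarity(signature(doc_a), signature(doc_c))
print(sim_ab, sim_ac)
```

In practice the signatures are bucketed (e.g. by banding the bits) so that only documents sharing a bucket are compared exactly, which is what makes LSH sub-quadratic.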
0 votes · 0 answers · 9 views
Problems in understanding Word2vec architectures
I probably have a very simple question, but I did not find any clear resource on the web.
First let's consider the Skip-gram model, in which we try to predict a context word given the target word. In ...
2 votes · 1 answer · 141 views
If a document set is too small for running a topic model, can you simply multiply the document set by a factor of 10 to be able to run the model?
Say I'm using Top2Vec as a topic model to capture the top 10 salient topics across documents. I have an array that contains the documents of the corpus. Initially, there are not enough documents to ...
0 votes · 0 answers · 73 views
How does unigram tokenization use the EM algorithm?
I intuitively understand what is happening in the unigram tokenizer, and I think I would also understand the EM algorithm if I could figure out the formulation in terms I understand, i.e. what is the latent ...