Machine Learning: Improve text features
For categorical or nominal variables with binary labels (like baby names), scikit-learn's DictVectorizer class performs one-hot encoding.
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
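As a quick sketch of how the encoder above is used (the city values here are made-up sample data, not from the original post):

```python
from sklearn.feature_extraction import DictVectorizer

# Each record is a dict of feature name -> categorical value.
instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'},
]

# sparse=False returns a dense array so the result is easy to inspect.
onehot_encoder = DictVectorizer(sparse=False)
features = onehot_encoder.fit_transform(instances)
print(features)  # one binary column per distinct city value
```

Each row has exactly one 1, in the column for that row's city.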
Text needs to be transformed into a representation that encodes its meaning properly. The following are a few techniques, using scikit-learn and NLTK (Natural Language Toolkit) in Python.
- The most popular representation is bag-of-words: a multiset that encodes one feature per word, with the intuition that documents containing similar words may have similar meanings. For example, a corpus of 2 documents,
corpus = [
'UNC played Duke in basketball',
'Duke lost the basketball game'
]
has eight unique words. Word presence or absence is encoded as 1 or 0, and Euclidean distance can be used to measure similarity between document vectors.
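A minimal sketch of this for the two-document corpus above, using scikit-learn's CountVectorizer (binary=True gives the 0/1 presence encoding described, and euclidean_distances measures similarity):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

# binary=True records word presence/absence as 1/0 instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus).toarray()
print(X.shape)  # (2, 8): two documents, eight unique words
print(euclidean_distances(X[:1], X[1:]))  # distance between the two documents
```

The two documents share only "duke" and "basketball", so they differ in six positions and the distance is sqrt(6).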
- Stop-word filtering keeps common words (stop words) such as a, the, etc. out of contention, and the vectorizer also normalizes case by lowercasing. This helps enormously in reducing dimensionality. One can vectorize and filter a corpus using the code below:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
- Stemming and lemmatization further reduce dimensions by recognizing that words like "jump" and "jumping" are inflected forms of the same word, condensing both into a single feature:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
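As an illustration of the same idea, NLTK's PorterStemmer collapses inflected forms without needing the WordNet corpus download that the lemmatizer above requires (the example words are my own):

```python
from nltk.stem import PorterStemmer

# Stemming strips affixes so inflected forms map to one shared feature.
stemmer = PorterStemmer()
print(stemmer.stem('jumping'))  # 'jump'
print(stemmer.stem('jumps'))    # 'jump'
```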
- TF-IDF weighting captures the frequency of a word within a document. Binary bag-of-words cannot distinguish a word's excessive presence in one document from its scarce presence in another, but by adding logarithmically scaled term frequency we can train the model to capture such differences.
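This weighting can be sketched with scikit-learn's TfidfVectorizer, where sublinear_tf=True applies the logarithmically scaled term frequency mentioned above (the sample corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the dog ate a sandwich and I ate a sandwich',
    'the wizard transfigured a sandwich'
]

# sublinear_tf replaces a raw count tf with 1 + log(tf), damping
# the effect of a word repeated many times within one document.
vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=True)
X = vectorizer.fit_transform(corpus)
print(X.toarray())
```

After stop-word filtering, the vocabulary here is just {ate, dog, sandwich, transfigured, wizard}, so each document becomes a 5-dimensional weighted vector.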
- The hashing trick helps minimize memory utilization, avoids multiple passes through the corpus, and can improve model performance:
from sklearn.feature_extraction.text import HashingVectorizer
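A sketch of the hashing trick in use: HashingVectorizer maps tokens straight to a fixed number of columns, so no vocabulary dictionary is held in memory and the corpus needs only one pass (n_features=6 is an arbitrarily small choice for illustration; real use would pick a much larger value to limit collisions):

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'the dog ate a sandwich',
    'the cat ate a sandwich'
]

# The vectorizer is stateless: no fit step is needed, because the
# column for each token is determined by a hash function, not a
# learned vocabulary. Different tokens may collide in one column.
vectorizer = HashingVectorizer(n_features=6)
X = vectorizer.transform(corpus)
print(X.toarray())
```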
Features that best capture gender in names, from https://blog.ayoungprogrammer.com/2016/04/determining-gender-of-name-with-80.html/, achieving roughly 80% accuracy:
- Last character
- Frequency and order of the character 'a'
- Second-to-last character
- Order of the characters 'y' and 'o'
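The features listed above might be extracted with a small helper like this (a hypothetical sketch of my own, not the linked post's exact code), whose output dicts could feed straight into DictVectorizer:

```python
def gender_features(name):
    """Extract simple character features from a name (illustrative sketch)."""
    name = name.lower()
    return {
        'last_char': name[-1],
        'second_last_char': name[-2] if len(name) > 1 else '',
        'count_a': name.count('a'),
        'has_y_or_o': int('y' in name or 'o' in name),
    }

print(gender_features('Hannah'))
```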
The link https://nlpforhackers.io/introduction-machine-learning/ shows the use of a decision tree classifier on similar features, also checking the last character(s).
Below are some more feature ideas:
"Names ending in -a, -e and -i are likely to be female, while names ending in -k, -o, -r, -s and -t are likely to be male… names ending in -yn appear to be predominantly female, despite the fact that names ending in -n tend to be male; and names ending in -ch are usually male, even though names that end in -h tend to be female."
Efficiency across languages:
It is hard to say how well those features will transfer across languages. If the modern languages involved descend from the same classical language, then despite modernization and change over time, the basic structure of names remains fairly similar. Since our discussion focuses mostly on English, most of these features should work well for any language that evolved post-Latin or even post-Greek. A different approach may be needed for subcontinental post-Sanskrit languages, as most of those names will not fit this pattern.
Also, religious links can be taken into account, as many names derived from the Abrahamic religions are very similar (Jacob vs. Yakub, etc.).
References:
1. Bird, S., Klein, E., and Loper, E. "6.1.1 Gender Identification." Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly, 2009.
2. Hackeling, G. Mastering Machine Learning with scikit-learn, Packt Publishing.