Bag of Words
This post reviews a homework assignment in which I used a Bag of Words model to analyze 28 scientific papers. By converting each document into a vector representation, I tested whether a simple NLP pipeline could recover meaningful scientific subtopics from the corpus.
Bag of Words is one of the earliest and most influential vector space approaches in natural language processing. While it lacks the sophistication of modern language models, it provides an interpretable foundation for understanding how text can be represented numerically and compared across documents. I present this analysis as an introduction to computational text representation and the broader ideas it motivates.
Overview
1) Preprocessing and vectorization
- lowercase conversion
- punctuation removal
- whitespace token splitting
- term-document matrix construction
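The four preprocessing steps above can be sketched in a few lines of pure Python; the two-document corpus is illustrative, not the actual papers:

```python
import string
from collections import Counter

def tokenize(text):
    # lowercase, strip punctuation, split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def term_document_matrix(docs):
    # rows = documents, columns = vocabulary terms (sorted for determinism)
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted({t for c in counts for t in c})
    X = [[c[t] for t in vocab] for c in counts]
    return X, vocab

# toy two-document corpus
docs = ["T cells and B cells.", "Exhausted T cells express PD-1."]
X, vocab = term_document_matrix(docs)
```

Note that stripping punctuation turns "PD-1" into pd1, which is exactly how tokens like pd1 in the tables below arise.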
From the raw count matrix (X), I computed TF-IDF:
\[T_{d,t} = \frac{X_{d,t}}{\sum_{t'} X_{d,t'}} \cdot \log\left( \frac{N}{ \sum_{d'} \mathbf{1}[X_{d',t} > 0] } \right)\]

Then each document vector was L2-normalized.
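The formula maps directly onto a few lines of NumPy; this is a sketch rather than the assignment's exact code:

```python
import numpy as np

def tfidf_l2(X):
    """TF-IDF as defined above, followed by L2 normalization of each row."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    tf = X / X.sum(axis=1, keepdims=True)   # term-frequency factor X_{d,t} / sum_t' X_{d,t'}
    df = (X > 0).sum(axis=0)                # document frequency per term
    T = tf * np.log(N / df)                 # unsmoothed idf, matching the formula
    return T / np.linalg.norm(T, axis=1, keepdims=True)
```

One caveat of the unsmoothed idf: a term occurring in every document gets weight exactly zero, so a document containing only corpus-wide terms would have a zero-norm row; smoothed variants (e.g. adding 1 inside the log) avoid this.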
2) Dimensionality reduction and similarity
- PCA on normalized TF-IDF vectors
- UMAP with cosine distance (n_neighbors=5, min_dist=0.1)
- cosine similarity matrix + hierarchical ordering using distance \(1 - \cos(i,j)\)
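Assuming the rows of Z are the L2-normalized TF-IDF vectors, the PCA and cosine-similarity steps can be sketched with NumPy alone (the UMAP step would use the umap-learn package and is omitted here):

```python
import numpy as np

def pca_project(Z, k=3):
    # PCA via SVD of the feature-centered matrix
    Zc = Z - Z.mean(axis=0)
    U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = (S**2) / np.sum(S**2)       # variance ratio per component
    return Zc @ Vt[:k].T, Vt[:k], explained[:k]

def cosine_similarity(Z):
    # rows are unit-norm, so the cosine matrix is just the Gram matrix
    return Z @ Z.T

# the hierarchical ordering then uses 1 - cosine_similarity(Z) as its distance matrix
```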
Main findings
- papers with similar subtopics tended to group together
- the first three PCA components explained 0.106, 0.100, and 0.095 of the variance
- PCA loadings reflected biologically meaningful terms (for example tfh, gc, germinal, follicular)
- UMAP revealed groupings not visible in the PCA projection
- the cosine clustermap was consistent with those grouping patterns
Additional comments
Demonstrating the utility of TF-IDF weighting and L2 normalization was a key part of this assignment. When I built the raw term-document matrix, the most frequent token was, unsurprisingly, "and".
This illustrates why raw counts alone can produce weak embeddings: high-frequency background tokens dominate the signal across nearly all documents. After TF-IDF weighting, high-weight terms shifted toward more domain-specific vocabulary such as cgas.
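A quick arithmetic check shows why: with the unsmoothed idf above, a token that appears in every one of the N = 28 documents receives weight log(28/28) = 0, while a term confined to a few papers keeps a large weight (the document frequency of 3 below is hypothetical):

```python
import math

N = 28                     # papers in the corpus
print(math.log(N / 28))    # "and" appears in all 28 papers -> idf = 0.0
print(math.log(N / 3))     # a term in only 3 papers -> idf of about 2.23
```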
| Rank | Term | Count |
|---|---|---|
| 1 | and | 14573 |
| 2 | of | 13337 |
| 3 | the | 12363 |
| 4 | in | 10033 |
| 5 | a | 6960 |
| 6 | to | 6528 |
| 7 | cells | 6462 |
| 8 | t | 6120 |
| 9 | al | 5411 |
| 10 | et | 5343 |
The token t (rank 8) is likely an artifact of PDF extraction splitting hyphenated terms such as “T-cell” into t and cell.
Nearest-neighbor check
| Query document | Doc 0 |
|---|---|
| Nearest neighbor | Doc 6 |
| Cosine similarity | 0.6177 |
| Doc | Title | Subtopic |
|---|---|---|
| 0 | Molecular and cellular insights into T cell exhaustion. | T-cell exhaustion |
| 6 | Defining "T cell exhaustion". | T-cell exhaustion |
| Term | Shared score |
|---|---|
| exhausted | 0.2281 |
| exhaustion | 0.1415 |
| pd1 | 0.0550 |
| pubmed | 0.0462 |
| manuscript | 0.0460 |
| author | 0.0364 |
| wherry | 0.0048 |
| tumour | 0.0042 |
| pmc | 0.0035 |
| progenitor | 0.0029 |
| tumours | 0.0027 |
| cd8 | 0.0020 |
Several top shared terms (pubmed, manuscript, author, pmc) are metadata artifacts from PDF extraction. The cosine similarity of 0.6177 is therefore partially inflated by source noise rather than topic signal alone.
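The "shared score" column is consistent with decomposing the cosine similarity into per-term contributions: for unit vectors, cos(i, j) is the sum over terms of z_i[t] * z_j[t], so each term's product is its share of the similarity. A sketch, assuming that interpretation:

```python
import numpy as np

def shared_term_scores(z_i, z_j, vocab, top=5):
    # per-term contributions to the cosine similarity of two unit vectors;
    # all contributions together sum exactly to the dot product z_i . z_j
    contrib = z_i * z_j
    order = np.argsort(contrib)[::-1][:top]
    return [(vocab[t], float(contrib[t])) for t in order]
```

Because the contributions sum to the reported similarity, metadata terms like pubmed and manuscript directly inflate the 0.6177 figure.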
Supplementary: top PCA loading tables (PC1 / PC2 / PC3)
PC1

| Term | Loading |
|---|---|
| gc | 0.4543 |
| tfh | 0.2394 |
| germinal | 0.1579 |
| org | 0.1191 |
| gcs | 0.1136 |
| lz | 0.1119 |
| follicular | 0.1003 |
| annualreviews | 0.0984 |
| dz | 0.0959 |
| https | 0.0869 |
| self-reactive | 0.0822 |
| pubmed | -0.2997 |
| manuscript | -0.2147 |
| autophagy | -0.2022 |
| author | -0.1893 |
| exhausted | -0.1622 |
| exhaustion | -0.1589 |
| deretic | -0.1132 |
| tex | -0.0924 |
| pd1 | -0.0754 |
PC2

| Term | Loading |
|---|---|
| https | 0.3579 |
| org | 0.3322 |
| tex | 0.1823 |
| ammasome | 0.1255 |
| 1038 | 0.0911 |
| ammatory | 0.0862 |
| guest | 0.0773 |
| gc | -0.3272 |
| pubmed | -0.3113 |
| manuscript | -0.2155 |
| author | -0.1884 |
| germinal | -0.1300 |
| autophagy | -0.1139 |
| gcs | -0.1081 |
| dz | -0.1043 |
| lz | -0.1042 |
| tfh | -0.0774 |
| shm | -0.0714 |
| deretic | -0.0667 |
| bcr | -0.0653 |
Note: ammasome and ammatory appear to be truncated tokens from PDF extraction, likely fragments of inflammasome and inflammatory.
PC3

| Term | Loading |
|---|---|
| tex | 0.5089 |
| exhaustion | 0.3057 |
| guest | 0.2314 |
| exhausted | 0.1664 |
| annualreviews | 0.1507 |
| iy37ch19-wherry | 0.1199 |
| pd-1 | 0.1158 |
| cls | 0.1157 |
| arjats | 0.1157 |
| etal | 0.1138 |
| tmem | 0.1073 |
| teff | 0.1057 |
| downloaded | 0.0997 |
| 2026 | 0.0997 |
| www | 0.0970 |
| mon | 0.0907 |
| https | -0.2133 |
| autophagy | -0.1412 |
| org | -0.1320 |
| cgas | -0.1191 |
Limitations
- preprocessing was intentionally simple, so metadata/noise leaked into features
- tokens such as annualreviews, https, pubmed, and author appeared with large weights
- BoW/TF-IDF ignores word order and deeper semantics
Takeaway
Bag-of-Words + TF-IDF provided an interpretable baseline that partially recovered subtopic structure in this 28-paper corpus, though the results were strongly affected by simple preprocessing choices and residual PDF/source metadata.