Bag of Words

This post reviews a homework assignment in which I used a Bag of Words model to analyze 28 scientific papers. By converting each document into a vector representation, I tested whether a simple NLP pipeline could recover meaningful scientific subtopics from the corpus.

Bag of Words is one of the earliest and most influential vector space approaches in natural language processing. While it lacks the sophistication of modern language models, it provides an interpretable foundation for understanding how text can be represented numerically and compared across documents. I present this analysis as an introduction to computational text representation and the broader ideas it motivates.

At a glance

  • Corpus: 28 papers
  • Methods: BoW + TF-IDF
  • Embeddings: PCA, UMAP
  • Best cosine pair: 0.6177

Overview

1) Preprocessing and vectorization

  • lowercase conversion
  • punctuation removal
  • whitespace token splitting
  • term-document matrix construction
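The steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not the assignment's exact code; the sample documents and function names are mine:

```python
import re
from collections import Counter

def tokenize(text):
    # lowercase, replace punctuation (including hyphens) with spaces, split on whitespace
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def term_document_matrix(docs):
    # rows = documents, columns = vocabulary terms (sorted for a deterministic order)
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    X = [[c[t] for t in vocab] for c in counts]
    return X, vocab

docs = ["T-cell exhaustion and PD-1.", "Germinal center Tfh cells."]
X, vocab = term_document_matrix(docs)
```

Note that stripping punctuation before splitting breaks "T-cell" into the tokens `t` and `cell`, which is exactly the artifact discussed later in this post.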

From the raw count matrix (X), I computed TF-IDF:

\[T_{d,t} = \frac{X_{d,t}}{\sum_{t'} X_{d,t'}} \cdot \log\left( \frac{N}{ \sum_{d'} \mathbf{1}[X_{d',t} > 0] } \right)\]

Then each document vector was L2-normalized.
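A minimal NumPy sketch of this weighting, assuming a dense document-by-term count matrix (the function name is mine, not the assignment's):

```python
import numpy as np

def tfidf_l2(X):
    """TF-IDF as in the formula above (unsmoothed idf), then row-wise L2 normalization."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    tf = X / X.sum(axis=1, keepdims=True)   # X_{d,t} / sum_t' X_{d,t'}
    df = (X > 0).sum(axis=0)                # number of documents containing each term
    T = tf * np.log(N / df)                 # idf factor; terms in every document get weight 0
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    return T / np.where(norms == 0.0, 1.0, norms)
```

With this definition, every nonzero document row has unit Euclidean length, so cosine similarity between documents reduces to a dot product.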

2) Dimensionality reduction and similarity

  • PCA on normalized TF-IDF vectors
  • UMAP with cosine distance (n_neighbors=5, min_dist=0.1)
  • cosine similarity matrix + hierarchical ordering using the cosine distance \(1 - \cos(i, j)\)
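Because the document vectors are L2-normalized, the full cosine similarity matrix reduces to a single matrix product. A small sketch with toy vectors (illustrative only; the hierarchical ordering itself would be done with e.g. `scipy.cluster.hierarchy.linkage` on the distance matrix):

```python
import numpy as np

def cosine_similarity_matrix(V):
    # row-normalize, then S[i, j] = cos(i, j) via a single matrix product
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    U = V / np.where(norms == 0.0, 1.0, norms)
    return U @ U.T

# toy example: three 2-D "document" vectors
V = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
S = cosine_similarity_matrix(V)
D = 1.0 - S  # cosine distance, suitable input for hierarchical clustering
```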

Main findings

  • papers with similar subtopics tended to group together
  • the first three PCA components explained 0.106, 0.100, and 0.095 of variance
  • PCA loadings reflected biologically meaningful terms (for example tfh, gc, germinal, follicular)
  • UMAP revealed groupings not visible in the PCA projection
  • the cosine clustermap was consistent with those grouping patterns

[Figure] PCA projection of normalized TF-IDF vectors, colored by immunology subtopic. Subtopic structure is partially visible in linear space.
[Figure] UMAP embedding (`n_neighbors=5`, `min_dist=0.1`) with cosine distance, showing clearer nonlinear grouping by subtopic.
[Figure] Cosine similarity clustermap reordered by hierarchical clustering. Similar papers align in local blocks.
[Figure] UMAP hyperparameter sweep over `n_neighbors`.
[Figure] UMAP hyperparameter sweep over `min_dist`.

Additional comments

A key part of this assignment was seeing why TF-IDF weighting and L2 normalization matter. When I built the raw term-document matrix, the most frequent token was, unsurprisingly, `and`.

This illustrates why raw counts alone can produce weak embeddings: high-frequency background tokens dominate the signal across nearly all documents. After TF-IDF weighting, high-weight terms shifted toward more domain-specific vocabulary such as `cgas`.
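The mechanism is visible directly in the IDF factor: a term that appears in every document gets \(\log(N/N) = 0\) and is zeroed out regardless of its raw count, while rare terms keep a large weight. A toy illustration (corpus invented for this example):

```python
import math

# toy corpus: "and" appears in every document, "cgas" in only one
docs = [["and", "cgas", "dna"], ["and", "cells"], ["and", "immune"]]
N = len(docs)

def idf(term):
    # unsmoothed idf, matching the formula used in this post
    df = sum(term in d for d in docs)
    return math.log(N / df)

# idf("and") == 0, so every TF-IDF weight for "and" vanishes;
# idf("cgas") == log(3), so "cgas" retains its weight
```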

Top 10 most frequent tokens in the raw corpus

| Rank | Term  | Count |
|------|-------|-------|
| 1    | and   | 14573 |
| 2    | of    | 13337 |
| 3    | the   | 12363 |
| 4    | in    | 10033 |
| 5    | a     | 6960  |
| 6    | to    | 6528  |
| 7    | cells | 6462  |
| 8    | t     | 6120  |
| 9    | al    | 5411  |
| 10   | et    | 5343  |

The token `t` (rank 8) is likely an artifact of PDF extraction splitting hyphenated terms such as “T-cell” into `t` and `cell`.

[Figure] Term-document matrix heatmap showing sparse token occurrence. The figure shows whether a word appears in a given document.

Nearest-neighbor check

Closest document pair under cosine similarity of TF-IDF vectors:

  • Query document: Doc 0
  • Nearest neighbor: Doc 6
  • Cosine similarity: 0.6177

Metadata for the closest document pair

| Doc | Title                                                   | Subtopic          |
|-----|---------------------------------------------------------|-------------------|
| 0   | Molecular and cellular insights into T cell exhaustion. | T-cell exhaustion |
| 6   | Defining "T cell exhaustion".                           | T-cell exhaustion |

Top shared TF-IDF terms for Docs 0 and 6

| Term       | Shared score |
|------------|--------------|
| exhausted  | 0.2281       |
| exhaustion | 0.1415       |
| pd1        | 0.0550       |
| pubmed     | 0.0462       |
| manuscript | 0.0460       |
| author     | 0.0364       |
| wherry     | 0.0048       |
| tumour     | 0.0042       |
| pmc        | 0.0035       |
| progenitor | 0.0029       |
| tumours    | 0.0027       |
| cd8        | 0.0020       |

Several top shared terms (pubmed, manuscript, author, pmc) are metadata artifacts from PDF extraction. The cosine similarity of 0.6177 is therefore partially inflated by source noise rather than topic signal alone.

Supplementary: top PCA loading tables (PC1 / PC2 / PC3)

PC1 top contributing terms (explained variance = 0.106)

| Term          | Loading |
|---------------|---------|
| gc            | 0.4543  |
| tfh           | 0.2394  |
| germinal      | 0.1579  |
| org           | 0.1191  |
| gcs           | 0.1136  |
| lz            | 0.1119  |
| follicular    | 0.1003  |
| annualreviews | 0.0984  |
| dz            | 0.0959  |
| https         | 0.0869  |
| self-reactive | 0.0822  |
| pubmed        | -0.2997 |
| manuscript    | -0.2147 |
| autophagy     | -0.2022 |
| author        | -0.1893 |
| exhausted     | -0.1622 |
| exhaustion    | -0.1589 |
| deretic       | -0.1132 |
| tex           | -0.0924 |
| pd1           | -0.0754 |

PC2 top contributing terms (explained variance = 0.100)

| Term       | Loading |
|------------|---------|
| https      | 0.3579  |
| org        | 0.3322  |
| tex        | 0.1823  |
| ammasome   | 0.1255  |
| 1038       | 0.0911  |
| ammatory   | 0.0862  |
| guest      | 0.0773  |
| gc         | -0.3272 |
| pubmed     | -0.3113 |
| manuscript | -0.2155 |
| author     | -0.1884 |
| germinal   | -0.1300 |
| autophagy  | -0.1139 |
| gcs        | -0.1081 |
| dz         | -0.1043 |
| lz         | -0.1042 |
| tfh        | -0.0774 |
| shm        | -0.0714 |
| deretic    | -0.0667 |
| bcr        | -0.0653 |

Note: `ammasome` and `ammatory` appear to be truncated tokens from PDF extraction, likely fragments of inflammasome and inflammatory.

PC3 top contributing terms (explained variance = 0.095)

| Term            | Loading |
|-----------------|---------|
| tex             | 0.5089  |
| exhaustion      | 0.3057  |
| guest           | 0.2314  |
| exhausted       | 0.1664  |
| annualreviews   | 0.1507  |
| iy37ch19-wherry | 0.1199  |
| pd-1            | 0.1158  |
| cls             | 0.1157  |
| arjats          | 0.1157  |
| etal            | 0.1138  |
| tmem            | 0.1073  |
| teff            | 0.1057  |
| downloaded      | 0.0997  |
| 2026            | 0.0997  |
| www             | 0.0970  |
| mon             | 0.0907  |
| https           | -0.2133 |
| autophagy       | -0.1412 |
| org             | -0.1320 |
| cgas            | -0.1191 |

Limitations

  • preprocessing was intentionally simple, so metadata/noise leaked into features
  • tokens such as `annualreviews`, `https`, `pubmed`, and `author` appeared with large weights
  • BoW/TF-IDF ignores word order and deeper semantics
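One straightforward mitigation for the first two points, sketched here with a hypothetical blocklist assembled from the artifact terms observed above, is to drop metadata tokens, bare numbers, and single-character fragments before vectorizing:

```python
# hypothetical blocklist of PDF/metadata artifacts seen in this corpus
METADATA_TOKENS = {"pubmed", "manuscript", "author", "pmc", "https", "www",
                   "org", "annualreviews", "downloaded", "guest", "etal"}

def clean_tokens(tokens):
    # drop blocklisted terms, bare numbers, and single-character fragments
    return [t for t in tokens
            if t not in METADATA_TOKENS
            and not t.isdigit()
            and len(t) > 1]

# e.g. clean_tokens(["pubmed", "t", "cell", "exhaustion", "2026"])
# keeps only ["cell", "exhaustion"]
```

This would not recover split words like `t`/`cell` (that needs smarter extraction upstream), but it would stop source metadata from inflating cosine similarities.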

Takeaway

Bag-of-Words + TF-IDF provided an interpretable baseline that partially recovered subtopic structure in this 28-paper corpus, though the results were strongly affected by simple preprocessing choices and residual PDF/source metadata.



