# Top2Vec

**Repository Path**: mirrors_lepy/Top2Vec

## Basic Information

- **Project Name**: Top2Vec
- **Description**: Top2Vec learns jointly embedded topic, document and word vectors.
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-25
- **Last Updated**: 2026-03-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[PyPI](https://pypi.org/project/top2vec/) | [License](https://github.com/ddangelov/Top2Vec/blob/master/LICENSE) | [Documentation](https://top2vec.readthedocs.io/en/latest/?badge=latest) | [Paper](http://arxiv.org/abs/2008.09470)

Top2Vec
=======

Top2Vec is an algorithm for **topic modeling** and **semantic search**. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:

* Get number of detected topics.
* Get topics.
* Get topic sizes.
* Get hierarchical topics.
* Search topics by keywords.
* Search documents by topic.
* Search documents by keywords.
* Find similar words.
* Find similar documents.
* Expose model with [RESTful-Top2Vec](https://github.com/ddangelov/RESTful-Top2Vec)

See the [paper](http://arxiv.org/abs/2008.09470) for more details on how it works.

Benefits
--------

1. Automatically finds number of topics.
2. No stop word lists required.
3. No need for stemming/lemmatizing.
4. Works on short text.
5. Creates jointly embedded topic, document, and word vectors.
6. Has search functions built in.

How does it work?
-----------------

The algorithm assumes that many semantically similar documents indicate an underlying topic. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space, the goal is to find dense clusters of documents and then identify which words attracted those documents together. Each dense area is a topic, and the words that attracted the documents to the dense area are the topic words.

### The Algorithm:

**1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).**

>Documents will be placed close to other similar documents and close to the most distinguishing words.
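The idea of topics and topic words living in the same space as documents can be illustrated with a toy example. The sketch below is not Top2Vec's implementation: it skips Doc2Vec and clustering entirely and uses hand-written 2-D vectors (`doc_vectors`, `word_vectors`) that are purely hypothetical. It assumes that the three documents already form one dense cluster, that the topic vector is the centroid of their vectors, and that topic words are the words closest to that centroid by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 2-D embeddings; a trained model would use many more dimensions.
doc_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.2],
    "doc_c": [1.0, 0.0],
}
word_vectors = {
    "nasa": [0.95, 0.05],
    "orbit": [0.7, 0.3],
    "recipe": [0.0, 1.0],
}

# Assumption: all three documents belong to the same dense area.
topic_vector = centroid(list(doc_vectors.values()))

# Topic words: all words, ranked by similarity to the topic vector.
topic_words = sorted(
    word_vectors,
    key=lambda w: cosine(word_vectors[w], topic_vector),
    reverse=True,
)
print(topic_words)
```

In this toy space the words that point in the same direction as the document cluster rank first, which is the sense in which words "attract" documents to a dense area.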
### Search Documents by Topic
We are going to search by **topic 48**, a topic that appears to be about **science**.
```python
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
```
Returns:
* ``documents``: The documents in a list, with the most similar first.
* ``document_scores``: Semantic similarity of each document to the topic, computed as the cosine similarity of the document vector and the topic vector.
* ``document_ids``: Unique ids of the documents. If ids were not given, the index of each document in the original corpus.
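To make the scoring concrete, here is a minimal sketch of ranking documents by cosine similarity to a topic vector. It is an illustration under stated assumptions, not the library's code: `search_by_vector`, the toy `documents`, and the 2-D `doc_vectors` are all hypothetical, and document ids are taken to be corpus indices, as described above for the case where no ids were given.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_by_vector(query, doc_vectors, documents, num_docs):
    """Return (documents, scores, ids) for the num_docs closest documents."""
    # Score every document, remembering its index in the corpus.
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(doc_vectors)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:num_docs]
    ids = [idx for _, idx in top]
    scores = [score for score, _ in top]
    return [documents[i] for i in ids], scores, ids

# Hypothetical corpus and embeddings.
documents = ["evolution is a theory", "encryption and privacy", "science explains facts"]
doc_vectors = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
topic_vector = [1.0, 0.0]

docs, scores, ids = search_by_vector(topic_vector, doc_vectors, documents, num_docs=2)
for doc, score, doc_id in zip(docs, scores, ids):
    print(f"Document: {doc_id}, Score: {score:.4f}, Text: {doc}")
```

The three returned lists are parallel, which is why the examples in this section iterate over them with `zip`.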
For each of the returned documents we are going to print its content, score and document number.
```python
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
# The three lists are parallel, so iterate over them together.
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()
```

    Document: 15227, Score: 0.6322
    -----------
    Evolution is both fact and theory. The THEORY of evolution represents the
    scientific attempt to explain the FACT of evolution. The theory of evolution
    does not provide facts; it explains facts. It can be safely assumed that ALL
    scientific theories neither provide nor become facts but rather EXPLAIN facts.
    I recommend that you do some appropriate reading in general science. A good
    starting point with regard to evolution for the layman would be "Evolution as
    Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
    Gould. There is a great deal of other useful information in this publication.
    -----------
    Document: 14515, Score: 0.6186
    -----------
    Just what are these "scientific facts"? I have never heard of such a thing.
    Science never proves or disproves any theory - history does.
    -Tim
    -----------
    Document: 9433, Score: 0.5997
    -----------
    The same way that any theory is proven false. You examine the predicitions
    that the theory makes, and try to observe them. If you don't, or if you
    observe things that the theory predicts wouldn't happen, then you have some
    evidence against the theory. If the theory can't be modified to
    incorporate the new observations, then you say that it is false.
    For example, people used to believe that the earth had been created
    10,000 years ago. But, as evidence showed that predictions from this
    theory were not true, it was abandoned.
    -----------
    Document: 11917, Score: 0.5845
    -----------
    The point about its being real or not is that one does not waste time with
    what reality might be when one wants predictions. The questions if the
    atoms are there or if something else is there making measurements indicate
    atoms is not necessary in such a system.
    And one does not have to write a new theory of existence everytime new
    models are used in Physics.
    -----------
    ...

### Semantic Search Documents by Keywords
Search documents for content semantically similar to **cryptography** and **privacy**.
```python
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
# Print each matching document with its id and similarity score.
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()
```

    Document: 16837, Score: 0.6112
    -----------
    ...
    Email and account privacy, anonymity, file encryption, academic
    computer policies, relevant legislation and references, EFF, and
    other privacy and rights issues associated with use of the Internet
    and global networks in general.
    ...
    Document: 16254, Score: 0.5722
    -----------
    ...
    The President today announced a new initiative that will bring
    the Federal Government together with industry in a voluntary
    program to improve the security and privacy of telephone
    communications while meeting the legitimate needs of law
    enforcement.
    ...
    -----------
    ...

### Similar Keywords
Search for similar words to **space**.
```python
words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
# Print each similar word next to its cosine similarity score.
for word, score in zip(words, word_scores):
    print(f"{word} {score}")
```

    space 1.0
    nasa 0.6589
    shuttle 0.5976
    exploration 0.5448
    planetary 0.5391
    missions 0.5069
    launch 0.4941
    telescope 0.4821
    astro 0.4696
    jsc 0.4549
    ames 0.4515
    satellite 0.446
    station 0.4445
    orbital 0.4438
    solar 0.4386
    astronomy 0.4378
    observatory 0.4355
    facility 0.4325
    propulsion 0.4251
    aerospace 0.4226
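
The `keywords` and `keywords_neg` arguments suggest that positive and negative words are combined into a single query vector before the nearest words are looked up. The sketch below shows one plausible way to do that, assuming the positive vectors are added, the negative vectors are subtracted, and the result is divided by the number of words. This combination rule is an assumption rather than the library's confirmed implementation, and `combine`, the local `similar_words` function, and the 3-D `word_vectors` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combine(positive, negative):
    """Add positive vectors, subtract negative ones, divide by the word count."""
    total = [0.0] * len(positive[0])
    for vec in positive:
        total = [t + v for t, v in zip(total, vec)]
    for vec in negative:
        total = [t - v for t, v in zip(total, vec)]
    count = len(positive) + len(negative)
    return [t / count for t in total]

# Hypothetical 3-D word embeddings.
word_vectors = {
    "space": [1.0, 0.0, 0.2],
    "nasa": [0.9, 0.1, 0.1],
    "station": [0.6, 0.0, 0.8],
    "train": [0.0, 0.1, 1.0],
}

def similar_words(keywords, keywords_neg, num_words):
    """Toy stand-in: rank all words by cosine similarity to the combined query."""
    query = combine(
        [word_vectors[k] for k in keywords],
        [word_vectors[k] for k in keywords_neg],
    )
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine(word_vectors[w], query),
        reverse=True,
    )[:num_words]
    return ranked, [cosine(word_vectors[w], query) for w in ranked]

print(similar_words(["space"], [], 2))
print(similar_words(["space"], ["train"], 2))
```

With no negative keywords, the query word itself ranks first with a score of about 1.0, which matches the `space 1.0` line in the output above. Adding a negative keyword pushes the query away from that word's direction, so words that share its meaning drop in the ranking.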