IR2Vec
seed_embeddings Directory Reference

Directories

 OpenKE
 

Detailed Description

If you do not want to train the model to generate seed embeddings and prefer to use the pretrained vocabulary, use the vocabulary provided in the vocabulary directory and skip the following steps.

Currently, we support three embedding dimensions (75, 100, and 300). If you want to use embeddings of a different dimension, follow the steps below and copy the resultant vocabulary to the vocabulary directory with the naming convention seedEmbeddingVocab<DIM>D.txt. Such vocabularies are picked up automatically during the build process of IR2Vec.
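For example, a 75-dimensional vocabulary produced by the steps below would be copied as follows (the source path is a placeholder):

cp <resultant vocabulary>.txt vocabulary/seedEmbeddingVocab75D.txt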

Generation of Seed Embedding Vocabulary

This directory helps in generating the seed embedding vocabulary in three steps:

  1. Building ir2vec
  2. Generating Triplets
  3. Training TransE to generate seed embedding vocabulary

Step 1: Building ir2vec

If you have not already run make, follow the steps below to build the ir2vec binary.
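As a minimal sketch (IR2Vec builds against LLVM, so you may need to point CMake at your LLVM installation; consult the top-level README for the exact flags):

mkdir build && cd build
cmake ..
make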

Step 2: Generating Triplets

Steps to collect the triplets

Run the triplets.sh script with the required parameters.

Usage: bash triplets.sh <build dir> <No of opt> <llFile list> <output FileName>

Example Usage:

bash triplets.sh ../build 2 files_path.txt triplets.txt
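Here, files_path.txt holds the paths of the input .ll files, presumably one per line. Assuming they live under a directory such as ./ll_files (an illustrative path), it can be created with:

find ./ll_files -name "*.ll" > files_path.txt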

Files used to generate Seed Embedding Vocabulary

We used .ll files generated from the Boost libraries and the SPEC CPU 2017 benchmarks to produce the triplets.

Dataset          Source
Boost            https://www.boost.org/
SPEC CPU 2017    https://www.spec.org/cpu2017/

Step 3: Training TransE to generate seed embedding vocabulary

The OpenKE directory is a modified version of the OpenKE repository (https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch) with the necessary changes for training the seed embedding vocabulary.

Please see OpenKE/README.md for further information on OpenKE.

Requirements

Create a conda environment and install the packages listed in openKE.yaml.
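For example (the environment name to activate is defined inside openKE.yaml):

conda env create -f openKE.yaml
conda activate <env name>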

Preprocessing the triplets

We preprocess the triplets generated in the previous step into a form suitable for training TransE.
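As a rough sketch, assuming a preprocessing script in this directory whose name and flags are illustrative placeholders (not the repository's actual interface), the output would land in the preprocessed/ directory referenced by --index_dir in the next step:

python preprocess.py --triplet_file triplets.txt --out_dir ../seed_embeddings/preprocessed/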

Training TransE to generate embeddings

Run python generate_embedding_ray.py. All arguments have default values, which can be overridden on the command line (for example, --index_dir, --epoch, --is_analogy, and --use_gpu, as in the command below).

Example Command

To train a model with analogy scoring enabled and a batch size of 200, you can run:

python generate_embedding_ray.py --index_dir "../seed_embeddings/preprocessed/" --epoch 1500 --is_analogy True --use_gpu true

TensorBoard Tracking

Once training begins, you can monitor the progress using TensorBoard (Ray Tune writes its logs to ~/ray_results by default) by running the following command:

tensorboard --logdir=~/ray_results

ASHA Scheduler for Hyperparameter Optimization

We employ the ASHA scheduler to efficiently optimize hyperparameters and terminate suboptimal trials early. The key metric the scheduler tracks depends on the training configuration, e.g., on whether analogy scoring (--is_analogy) is enabled.

Results

Once the training completes, the best model will be saved in the specified index_dir with the filename format:

seedEmbedding_{}E_{}D_{}batches_{}margin.ckpt

In addition, the entity embeddings will be stored in the index_dir/embeddings subdirectory in the following format:

embeddings/seedEmbedding_{}E_{}D_{}batches_{}margin.txt
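For instance, the run from the example command above (1500 epochs; the dimension, batch, and margin values below are illustrative) might produce:

seedEmbedding_1500E_100D_200batches_1.0margin.ckpt
embeddings/seedEmbedding_1500E_100D_200batches_1.0margin.txt

The resulting .txt file can then be copied to the vocabulary directory following the seedEmbeddingVocab<DIM>D.txt convention described above.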