IR2Vec
If you do not want to train the model to generate seed embeddings and want to use the pretrained vocabulary instead, please use the vocabulary given here and skip the following steps.
Currently we support three different embedding dimensions (75, 100, 300). If you want to use embeddings of a different dimension, you can follow the steps below and copy the resultant vocabulary to the vocabulary directory with the following naming convention: `seedEmbeddingVocab<DIM>D.txt`. Such vocabularies will be used automatically during the build process of IR2Vec.
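For instance, a hypothetical 128-dimensional vocabulary would be installed like this (the file name and the relative path of the `vocabulary` directory are illustrative):

```bash
# Hypothetical example: copy a custom 128-D vocabulary into the vocabulary
# directory so that the IR2Vec build picks it up.
cp seedEmbeddingVocab128D.txt ../vocabulary/
```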
This directory helps in generating the seed embedding vocabulary in three steps.
If you have not done `make`, follow these steps to build the `ir2vec` binary:

- Go to the `build` directory (`cd ../build`)
- Run `make`
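That is, starting from this directory:

```bash
cd ../build   # go to IR2Vec's build folder
make          # build the ir2vec binary
```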
Run the `triplets.sh` script with the required parameters.

Usage: `bash triplets.sh <build dir> <No of opt> <llFile list> <output FileName>`

- `buildDir` points to the path of IR2Vec's build folder
- `numOpt` is an integer between 1 and 6; that many optimization levels (`O[0-3sz]`) are selected at random
- `llFileList` is a file containing the paths of all the `.ll` files. Use `find <ll_dir> -type f > files_path.txt`
- `outputFileName` is the file where the triplets will be written

Example usage: `bash triplets.sh ../build 2 files_path.txt triplets.txt`
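Putting the two commands together, a minimal end-to-end sketch (the `ll_files/` directory is illustrative):

```bash
# Collect the paths of all .ll files (ll_files/ is a hypothetical location).
find ll_files/ -type f > files_path.txt
# Generate triplets, applying 2 randomly chosen optimization levels.
bash triplets.sh ../build 2 files_path.txt triplets.txt
```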
We generated `.ll` files from the Boost libraries and the SPEC CPU 2017 benchmarks to produce the triplets.
| Dataset | Source |
|---|---|
| Boost | https://www.boost.org/ |
| SPEC CPU 2017 | https://www.spec.org/cpu2017/ |
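For reference, a single C++ translation unit can be lowered to a `.ll` file with clang (file names here are illustrative):

```bash
# Emit textual LLVM IR (.ll) for a single source file.
clang++ -S -emit-llvm example.cpp -o example.ll
```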
The `OpenKE` directory is a modified version of the OpenKE repository (https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch) with the necessary changes for training the seed embedding vocabulary. Please see `OpenKE/README.md` for further information on OpenKE.
Create a `conda` environment and install the packages given in `openKE.yaml`:

`conda env create -f ./OpenKE/openKE.yaml`
`conda activate openKE`
We preprocess the triplets generated in the previous step into a form suitable for training TransE:

`cd OpenKE`
`python preprocess.py --tripletFile=<tripletsFilePath>`
- `--tripletFile` points to the location of the `outputFileName` generated in the previous step
- `entity2id.txt`, `train2id.txt` and `relation2id.txt` will be generated in the same directory as `tripletsFilePath`
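A hedged example, assuming the triplets from the previous step were written to `triplets.txt` in the parent directory:

```bash
cd OpenKE
python preprocess.py --tripletFile=../triplets.txt
# entity2id.txt, train2id.txt and relation2id.txt are written next to
# ../triplets.txt.
```

In the standard OpenKE layout, the first line of each of these files holds the record count, with one id mapping (or one head/tail/relation triple) per following line; the modified copy in this repository is assumed to follow the same convention.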
Run `python generate_embedding_ray.py`.

Possible arguments (all have default values that are used unless overridden):
- `--index_dir`: Specifies the directory containing the processed files generated from preprocessing the triplets.
- `--epoch`: Sets the number of epochs. Default is `1000`.
- `--is_analogy`: Boolean flag to report analogy scores, calculated every 10 epochs using `analogies.txt`. Default is `False`.
- `--link_pred`: Boolean flag to report link prediction scores. Requires the testing files (`test2id.txt`, `valid2id.txt`) in the `--index_dir`. Link prediction scores include hit@1, hit@3, hit@10, mean rank (MR), and mean reciprocal rank (MRR). Default is `False`.
- `--nbatches`: Specifies the batch size. Default is `100`.
- `--margin`: Specifies the margin size for training. Default is `1.0`.

To train a model with analogy scoring enabled and a batch size of 200, you can run a command along the lines sketched below (the exact boolean-flag syntax depends on how the script parses its arguments):
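```bash
# Sketch: flag syntax assumes argparse-style --name=value parsing.
python generate_embedding_ray.py --is_analogy=True --nbatches=200
```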
Once training begins, you can monitor the progress with TensorBoard; a typical invocation, assuming Ray Tune's default log directory, is:
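```bash
# Assumes Ray Tune's default log directory; adjust --logdir if your run
# writes its logs elsewhere.
tensorboard --logdir ~/ray_results
```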
We employ the ASHA scheduler to optimize hyperparameters efficiently and terminate suboptimal trials. The scheduler tracks a key metric, determined by the following conditions:

- If `--is_analogy` is set to `True`, the AnalogyScore will be the key metric.
- If `--link_pred` is set to `True`, hit@1 will be the key metric.

Once the training completes, the best model will be saved in the specified `index_dir` with the filename format:
In addition, the entity embeddings will be stored in the `index_dir/embeddings` subdirectory in the following format: