# IR2Vec Seed Embeddings

## Directories

| Directory | Description |
|---|---|
| OpenKE | Modified version of the OpenKE repository used to train the seed embedding vocabulary (see below) |
If you do not want to train the model to generate seed embeddings and instead want to use the pretrained vocabulary, use the vocabulary provided here and skip the following steps.
Currently we support three embedding dimensions (75, 100, 300). If you want to use embeddings of a different dimension, you can follow the steps below and copy the resultant vocabulary to the vocabulary directory with the naming convention `seedEmbeddingVocab<DIM>D.txt`. Such vocabularies are picked up automatically during the build process of IR2Vec.
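For example, a minimal sketch of installing a newly trained 100-dimensional vocabulary; the source filename and repository path are placeholders, not actual output names:

```bash
# Copy the generated vocabulary into IR2Vec's vocabulary directory using the
# expected naming convention (100-D case shown; both paths are placeholders).
cp <generated_vocab>.txt <ir2vec_repo>/vocabulary/seedEmbeddingVocab100D.txt
```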
This directory helps in generating the seed embedding vocabulary in three steps.
### Step 1: Generate triplets

If you have not already run `make`, follow these steps to build the `ir2vec` binary:

1. Go to the build directory (`cd ../build`)
2. Run `make`

Then run the `triplets.sh` script with the required parameters.

Usage: `bash triplets.sh <buildDir> <numOpt> <llFileList> <outputFileName>`

- `buildDir` points to the path of IR2Vec's build folder
- `numOpt` is an integer between 1 and 6; optimization levels from `O[0-3sz]` are selected at random
- `llFileList` is a file containing the paths of all the `.ll` files; use `find <ll_dir> -type f > files_path.txt` to generate it
- `outputFileName` is the file to which the triplets will be written

Example usage:
```bash
bash triplets.sh ../build 2 files_path.txt triplets.txt
```
We generated the `.ll` files used for triplet generation from the Boost libraries and the SPEC CPU 2017 benchmarks; a generic way to emit `.ll` files is sketched after the table.
| Dataset | Source |
|---|---|
| Boost | https://www.boost.org/ |
| SPEC CPU 2017 | https://www.spec.org/cpu2017/ |
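If you need to produce `.ll` files from your own sources, the standard clang flags for emitting textual LLVM IR can be used. This is a generic sketch, not a command taken from the IR2Vec scripts:

```bash
# -S -emit-llvm makes clang emit textual LLVM IR (.ll) instead of object code.
clang++ -S -emit-llvm -o example.ll example.cpp
```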
The OpenKE directory is a modified version of the OpenKE repository (https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch) with the necessary changes for training the seed embedding vocabulary.
Please see OpenKE/README.md for further information on OpenKE.
### Step 2: Preprocess the triplets

Create a conda environment and install the packages given in `openKE.yaml`:

```bash
conda env create -f ./OpenKE/openKE.yaml
conda activate openKE
```

We preprocess the triplets generated in the previous step into a form suitable for training TransE.
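As background, this is the standard TransE formulation, summarized here for context rather than taken from this repository's code: each triple $(h, r, t)$ is embedded so that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$, and training minimizes a margin-based ranking loss over the score

$$f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$$

The `--margin` argument described below sets the margin of that loss. To run the preprocessing: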
```bash
cd OpenKE
python preprocess.py --tripletFile=<tripletsFilePath>
```

- `--tripletFile` points to the location of the `outputFileName` generated in the previous step
- `entity2id.txt`, `train2id.txt`, and `relation2id.txt` will be generated in the same directory as `tripletsFilePath`
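For reference, a sketch of OpenKE's standard index-file layout; the entity names below are made-up placeholders, not actual IR2Vec entities:

```
# entity2id.txt: entity count, then tab-separated "<entity> <id>" pairs
3
add	0
mul	1
i32	2

# train2id.txt: triple count, then "<head> <tail> <relation>" id triples
2
0 2 0
1 2 0
```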
### Step 3: Train TransE to generate seed embeddings

Run `python generate_embedding_ray.py`. Possible arguments (all have default values unless provided):

- `--index_dir`: Specifies the directory containing the processed files generated from preprocessing the triplets.
- `--epoch`: Sets the number of epochs. Default is 1000.
- `--is_analogy`: Boolean flag to report analogy scores, calculated every 10 epochs using `analogies.txt`. Default is False.
- `--link_pred`: Boolean flag to report link prediction scores. Requires testing files (`test2id.txt`, `valid2id.txt`) in the `--index_dir`. Link prediction scores include hit@1, hit@3, hit@10, mean rank (MR), and mean reciprocal rank (MRR). Default is False.
- `--nbatches`: Specifies the batch size. Default is 100.
- `--margin`: Specifies the margin size for training. Default is 1.0.

To train a model with analogy scoring enabled and a batch size of 200, you can run:
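```bash
# Flag syntax assumed to mirror preprocess.py's "--flag=value" style.
python generate_embedding_ray.py --is_analogy=True --nbatches=200
```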
Once training begins, you can monitor the progress using TensorBoard by running the following command:
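A typical invocation, assuming Ray Tune's default results directory (`~/ray_results`); adjust `--logdir` if your setup logs elsewhere:

```bash
tensorboard --logdir ~/ray_results
```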
We employ the ASHA scheduler to efficiently optimize hyperparameters and terminate suboptimal trials early. The scheduler tracks a key metric, which is determined as follows:
- If `--is_analogy` is set to True, the `AnalogyScore` will be the key metric.
- If `--link_pred` is set to True, the `hit@1` score will be the key metric.

Once training completes, the best model will be saved in the specified `index_dir` with the following filename format:
In addition, the entity embeddings will be stored in the `index_dir/embeddings` subdirectory in the following format: