# IR2Vec
The following guide details the steps for training IR2Vec with the ComPile dataset. The IR2Vec-Version-Upgrade-Checks repository contains the scripts required for this process.

## ComPile

- `ComPile/collect_dataset_info.py`: generates the list of all the unique C/C++ files in the dataset.
- `ComPile/save_ir.py`: downloads the bitcode files for all the C/C++ files in the dataset, generates the IR files, saves them at the specified location, and then writes the file names to the `ir_paths.txt` file.
- `ComPile/prep_ir_list.py`: takes the `ir_paths.txt` file and generates the list of paths of all the IR files in the downloaded dataset.

Once the list of all the `.ll` file paths has been generated, go to the `seed_embeddings` folder in the main IR2Vec repository. Here, the process involves the following tasks:
- Update the `triplets.sh` bash file with the relevant changes for the newer LLVM version (see `seed_embeddings/README.md`).
- Instead of `./tmp`, specify a custom path to store the temp files.
- Use the `gen_triplets.sh` helper script available in the `ComPile` folder: copy `gen_triplets.sh` to the `seed_embeddings` folder, make the relevant changes to the script, and run it.

The resulting `triplets.txt` file will be extremely large — so much so that attempting to open it, or read it directly, is likely to exhaust the available RAM and crash the system. To work around this, go to the folder where `triplets.txt` is stored, create a new folder, say `split_files`, and run:

```
split -C 500M triplets.txt split_files/triplets_part -d -a 2 --numeric-suffixes=11 --additional-suffix=.txt
```

This splits the `triplets.txt` file into multiple files of at most 500MB each, stored in the `split_files` folder and labelled by number.
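Because `-C` splits only at line boundaries, no triplet line is ever cut in half. A quick sanity check (shown here with a tiny dummy file standing in for the real `triplets.txt`) is to compare line counts before and after the split:

```shell
mkdir -p split_files
# Tiny stand-in for the real triplets.txt, for illustration only.
printf 'e1 e2 rel1\ne3 e4 rel2\ne5 e6 rel3\n' > triplets.txt
split -C 500M triplets.txt split_files/triplets_part -d -a 2 --numeric-suffixes=11 --additional-suffix=.txt
# -C never breaks a line, so the total line counts must match exactly.
wc -l triplets.txt
cat split_files/triplets_part*.txt | wc -l
```

With `--numeric-suffixes=11` and `-a 2`, the first output file is `split_files/triplets_part11.txt`.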
Earlier, in the `OpenKE` folder, the script `IR2Vec/seed_embeddings/OpenKE/preprocess.py` was used to preprocess the data from `triplets.txt`. Use `IR2Vec/seed_embeddings/OpenKE/preprocess_hybrid.py` instead. This script takes as input the folder of split-up triplets files created in the previous step, and iterates over all the files in that folder to generate the entity, relation, and training sets in a safe manner, without any RAM overshoot. Previously, a single `triplets.txt` file was sufficient to generate the requisite preprocessed information; to run with a single triplets file, simply place it in a folder and pass the folder path to this script.
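The exact contents of `preprocess_hybrid.py` are not reproduced here, but the core idea — iterating over the split files one line at a time instead of loading any file whole — can be sketched as follows (the function name and the whitespace-separated, three-column triplet layout are assumptions for illustration):

```python
import os

def build_vocab(split_dir):
    """Stream every split triplets file line by line, assigning ids to
    unique entities and relations without holding a full file in RAM."""
    entities, relations = {}, {}
    for name in sorted(os.listdir(split_dir)):
        with open(os.path.join(split_dir, name)) as fh:
            for line in fh:  # iterating a file handle yields one line at a time
                parts = line.split()
                if len(parts) != 3:
                    continue  # skip malformed lines
                head, tail, rel = parts  # column order is an assumption
                entities.setdefault(head, len(entities))
                entities.setdefault(tail, len(entities))
                relations.setdefault(rel, len(relations))
    return entities, relations
```

Because only one line is materialised at a time, peak memory is bounded by the size of the vocabularies rather than by the size of `triplets.txt`.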
Like `triplets.txt`, the `train2id.txt` file is also going to be extremely large, and attempting to open it will likely overshoot the RAM and cause a system crash. In the original flow, the `train2id.txt` file is read and all the unique train sets are extracted from it; with a large `train2id.txt`, this is another likely site of RAM overshoot and a subsequent system crash. To avoid this, the helper script `ComPile/get_uniq_train.sh` is supplied. Copy it to the location of `train2id.txt` and run it with the appropriate path changes. This produces an output of much reduced size, containing only the unique train sets.

Once this step is reached, training can be resumed as in the original flow. A helper script, `ComPile/run_training_ray.sh`, has been provided to help the user set log paths, specify properly formatted parameters, and run the training accordingly.
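The contents of `get_uniq_train.sh` are not shown here, but a typical memory-safe way to deduplicate a huge file like `train2id.txt` is `sort -u`, which uses an external merge sort that spills to temporary files on disk rather than holding the whole input in RAM (the file names below are illustrative, with a tiny dummy input):

```shell
# Tiny stand-in for the real train2id.txt, for illustration only.
printf '1 2 0\n1 2 0\n3 4 1\n1 2 0\n' > train2id.txt
# -u keeps one copy of each line, -S caps sort's in-memory buffer,
# -T points the on-disk spill files at a directory with enough space.
sort -u -S 64M -T "${TMPDIR:-/tmp}" train2id.txt > train2id_uniq.txt
wc -l < train2id_uniq.txt   # prints 2
```

Capping the buffer with `-S` keeps the step well-behaved even on machines with limited memory; the trade-off is more temporary-file I/O under `-T`.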