Parameters for running Xander are set in the configuration file (xander_setenv.sh). There are comments in the configuration file and in the README file on GitHub that explain at least in part how to select the parameters. There are three sections of the configuration file:
- Directory assignment: paths to the user’s data, the output results, and the programs Xander calls.
- Sample naming: a unique sample identifier to be prepended to all output file names and contig IDs (i.e. fasta headers).
- Parameters: defines parameters important to the quality of Xander results.
  - Build: parameters for building the De Bruijn graph. These relate to the size of the data file.
  - Contig search: parameters for searching for contigs. These impact the timing of search and are not adjusted often.
  - Contig merge: parameters for assembling contigs. These can impact the length and quality of assembled sequences.
  - Contig clustering: parameters for clustering the resulting contigs.
  - Other parameters
 
Directory assignment
This section must be modified to match your file structure.
- SEQFILE: absolute path to the sequence file.
  - Accepted file formats: fasta, fastq, or gz.
  - Wildcards (*) can be used to point to multiple files, as long as there are no spaces in the file names.
- WORKDIR: absolute path to working directory. It is useful to have a separate working directory for each sample of interest.
- REF_DIR: absolute path to the Xander_assembler directory.
- JAR_DIR: absolute path to RDPTools.
- UCHIME: absolute path to uchime.
- HMMALIGN: absolute path to hmmalign.
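As an illustration, the directory section of xander_setenv.sh might look like the sketch below. Every path is a hypothetical placeholder (the /home/user/... layout is an assumption, not part of Xander) and must be replaced with the locations on your own system.

```shell
# Directory assignment: all paths below are hypothetical placeholders.
# The wildcard is stored literally here; the shell expands it later,
# wherever $SEQFILE is used unquoted.
SEQFILE=/home/user/data/sample1/*.fastq
WORKDIR=/home/user/xander/sample1         # one working directory per sample
REF_DIR=/home/user/tools/Xander_assembler
JAR_DIR=/home/user/tools/RDPTools
UCHIME=/home/user/tools/uchime
HMMALIGN=/home/user/tools/hmmalign
```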
Build Parameters
As the name suggests, these determine how the De Bruijn graph, or bloom filter, is built. It is important to get this right because everything else depends on the quality of the bloom filter. Some experimentation may be needed to select the parameters, so in practice it is best to first run Xander with only the build step.
De Bruijn graph set up
- FILTER_SIZE: This value depends on the size of the data file. Based on our experience with metagenomic data from soil samples, the following choices are appropriate:
| Data file size | FILTER_SIZE parameter | 
|---|---|
| 2 GB | 32 | 
| 6 GB | 35 | 
| 70 GB | 38 | 
| 350 GB | 40 | 
- K_SIZE: The kmer size (in base pairs) used for contig assembly. It must be a multiple of 3 (3 nucleotides code for an amino acid) and cannot be larger than 63; a value of 45 is recommended. Higher numbers yield more stringent results.
- MIN_COUNT: The minimum number of times a kmer must occur in SEQFILE (the data file) to be included in the final bloom filter. This is almost always equal to 1. Larger values require more memory (java heap size).
- MAX_JVM_HEAP (java heap size): The maximum amount of memory allowed for the build processes.
  - Must be larger than the size of the bloom filter, which is determined by the values of FILTER_SIZE and MIN_COUNT.
  - If MIN_COUNT is 1, the size of the bloom filter is approximately 2^(FILTER_SIZE-3) / 10^9 GB. For example:
| FILTER_SIZE parameter | Approximate bloom filter size | 
|---|---|
| 35 | 4 GB | 
| 36 | 8 GB | 
| 37 | 16 GB | 
| 38 | 32 GB | 
  - If MIN_COUNT is 2, then the bloom filter is approximately twice as large.
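The size estimate can be checked with a few lines of shell before choosing MAX_JVM_HEAP. The sketch below only applies the approximation 2^(FILTER_SIZE-3) / 10^9 GB and doubles it when MIN_COUNT is 2. The function name bloom_size_gb is hypothetical and not part of Xander, and other MIN_COUNT values are rejected because this section gives no estimate for them.

```shell
# Estimate the bloom filter size in GB from FILTER_SIZE and MIN_COUNT.
# bloom_size_gb is a hypothetical helper, not a Xander command.
bloom_size_gb() {
    local filter_size=$1
    local min_count=${2:-1}
    case "$min_count" in
        1|2) ;;
        *) echo "no estimate available for MIN_COUNT=$min_count" >&2; return 1 ;;
    esac
    # 2^(FILTER_SIZE-3) bytes, converted to GB; doubled when MIN_COUNT is 2
    awk -v f="$filter_size" -v m="$min_count" \
        'BEGIN { printf "%.1f\n", m * 2^(f - 3) / 1e9 }'
}

bloom_size_gb 35      # FILTER_SIZE=35, MIN_COUNT=1
bloom_size_gb 38 2    # FILTER_SIZE=38, MIN_COUNT=2
```

The exact figures are slightly larger than the rounded values in the table (for example about 4.3 GB rather than 4 GB for FILTER_SIZE 35), so it is safer to leave some headroom when setting MAX_JVM_HEAP.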
De Bruijn graph quality
The quality of the bloom filter can be evaluated by examining the false discovery rate reported on the next-to-last line of the output file knn_bloom_stat.txt, found in the knn sub-directory of the data output directory. Here knn stands for k followed by the kmer size specified by the K_SIZE parameter (e.g. k45).
- The false discovery rate should be less than 0.01 (1 %) and depends on the parameter FILTER_SIZE.
- If the false discovery rate is greater than 0.01, delete the bloom filter (knn.bloom) in the working directory and then re-run Xander build with a larger FILTER_SIZE.
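The check described above can be scripted. The following is a minimal sketch that reads the next-to-last line of the stat file and compares the rate with the 0.01 threshold. It assumes that the rate is the last whitespace-separated field on that line, which is not confirmed here, so inspect your own knn_bloom_stat.txt before relying on it. The function name check_bloom_fdr is hypothetical.

```shell
# Return 0 if the false discovery rate is below the threshold, 1 if not,
# and 2 if the file has no usable line.
# ASSUMPTION: the rate is the last field on the next-to-last line.
check_bloom_fdr() {
    local stat_file=$1
    local max_rate=${2:-0.01}
    tail -n 2 "$stat_file" | head -n 1 |
        awk -v max="$max_rate" '
            { rate = $NF + 0; seen = 1 }
            END { if (!seen) exit 2; exit (rate < max) ? 0 : 1 }'
}
```

For K_SIZE=45 it could be called from the output directory as `check_bloom_fdr k45/k45_bloom_stat.txt || echo "re-run build with a larger FILTER_SIZE"`.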
Contig Search Parameters
These impact the timing of search and are not adjusted often.
- PRUNE prunes the search if the score does not improve after n_nodes nodes, where n_nodes is the value given. The recommended value is 20. If this is set to 0, pruning is disabled, but the required memory and time increase.
- PATHS is the number of paths to search for each starting kmer. A value of 1 returns the shortest path.
- LIMIT_IN_SECS is the time limit in seconds to spend searching for each kmer. The recommended value is 100 seconds if PATHS = 1. If PATHS is larger, then the value for LIMIT_IN_SECS needs to be increased.
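With the recommended values, the contig search section of the configuration file could read as in this sketch. Only the variable names and values come from the text above; the comments are added for illustration.

```shell
# Contig search parameters, set to the recommended values
PRUNE=20           # stop a search whose score has not improved after 20 nodes; 0 disables pruning
PATHS=1            # paths to search per starting kmer; 1 returns the shortest path
LIMIT_IN_SECS=100  # per-kmer time limit in seconds; raise it if PATHS is larger than 1
```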
Contig Merge Parameters
These can impact the length and quality of assembled sequences.
- MIN_BITS is the minimum bit score for assembled contigs. The recommended value is 50. This value can be increased if low quality sequences are assembled.
- MIN_LENGTH is the minimum length, in amino acids, for assembled protein contigs. The recommended value is 150, which gives a minimum assembled length of 150 aa (450 bp). You may need to reduce this for very small proteins.
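Because MIN_LENGTH is given in amino acids, the corresponding nucleotide length is three times larger. The sketch below sets the recommended merge values and prints both lengths. MIN_LENGTH_BP is a hypothetical helper variable used only for this illustration and is not a Xander setting.

```shell
# Contig merge parameters, set to the recommended values
MIN_BITS=50
MIN_LENGTH=150                      # minimum protein contig length in aa

# Illustration only: 3 nucleotides code for one amino acid
MIN_LENGTH_BP=$((MIN_LENGTH * 3))
echo "minimum contig length: ${MIN_LENGTH} aa = ${MIN_LENGTH_BP} bp"
```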
Contig Clustering Parameters
- DIST_CUTOFF is the distance at which to cluster aa sequences. The recommended value is 0.01, which clusters final contigs at 99% aa identity.
Other parameters
- THREADS is the number of computer cores to use.
  - Only one core is used to build the bloom filter; THREADS does not impact this step.
  - The find and search steps may be run in parallel, one core for each gene, as explained in the sections Test a Local Xander Installation and Interactive Example on MSU’s HPCC.
  - Set THREADS to the number of genes you are searching for, but do not exceed one less than the number of cores you have on your computer.
  - For example, when submitting a job to MSU’s cluster, the value of ppn should be THREADS plus 1.
- NAME=k$K_SIZE need not be changed. It defines the name of the sub-directory to which results are written (knn).
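The rule for THREADS (one core per gene, but never more than the number of cores minus one) can be expressed as a small helper. This is a sketch under stated assumptions: suggest_threads is a hypothetical function, the gene count of 3 is only an example, and `getconf _NPROCESSORS_ONLN` is used to read the number of online cores on Linux and macOS.

```shell
# Suggest a THREADS value: the number of genes, capped at (cores - 1),
# and never less than 1. suggest_threads is a hypothetical helper.
suggest_threads() {
    local n_genes=$1
    local n_cores=$2
    local cap=$((n_cores - 1))
    if [ "$cap" -lt 1 ]; then cap=1; fi
    if [ "$n_genes" -lt "$cap" ]; then
        echo "$n_genes"
    else
        echo "$cap"
    fi
}

# Example: searching for 3 genes on the current machine
THREADS=$(suggest_threads 3 "$(getconf _NPROCESSORS_ONLN)")
echo "THREADS=$THREADS, ppn=$((THREADS + 1))"
```

The second value printed follows the cluster rule above, where ppn is THREADS plus 1.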
