cmbuild(1)                      Infernal Manual                     cmbuild(1)

NAME
       cmbuild - construct covariance model(s) from structurally annotated
       RNA multiple sequence alignment(s)

SYNOPSIS
       cmbuild [options] _cmfile_out_ _msafile_

DESCRIPTION
       For each multiple sequence alignment in _msafile_, build a covariance
       model and save it to a new file _cmfile_out_.

       The alignment file must be in Stockholm or SELEX format, and must
       contain consensus secondary structure annotation. cmbuild uses the
       consensus structure to determine the architecture of the CM.

       _msafile_ may be '-' (dash), which means reading this input from
       stdin rather than a file. To use '-', you must also specify the
       alignment file format with --informat _s_, as in --informat stockholm
       (because of a current limitation in our implementation, MSA file
       formats cannot be autodetected in a nonrewindable input stream).

       _cmfile_out_ may not be '-' (stdout), because sending the CM file to
       stdout would conflict with the other text output of the program.

       In addition to writing CM(s) to _cmfile_out_, cmbuild also outputs a
       single line for each model created to stdout. Each line has the
       following fields:

       "aln"               the index of the alignment used to build the CM;
       "idx"               the index of the CM in the _cmfile_out_;
       "name"              the name of the CM;
       "nseq"              the number of sequences in the alignment used to
                           build the CM;
       "eff_nseq"          the effective number of sequences used to build
                           the model;
       "alen"              the length of the alignment used to build the CM;
       "clen"              the number of columns from the alignment defined
                           as consensus (match) columns;
       "bps"               the number of basepairs in the CM;
       "bifs"              the number of bifurcations in the CM;
       "rel entropy: CM"   the total relative entropy of the model divided
                           by the number of consensus columns;
       "rel entropy: HMM"  the total relative entropy of the model ignoring
                           secondary structure divided by the number of
                           consensus columns;
       "description"       description of the model/alignment.

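The argument rules in DESCRIPTION can be captured in a small wrapper that assembles a cmbuild command line before it is run. The following is a minimal sketch in Python, not part of Infernal: the helper name cmbuild_argv and its parameters are hypothetical, and the file names in the usage note are placeholders. It uses only options documented on this page (--informat, and -F from the OPTIONS section).

```python
def cmbuild_argv(cmfile_out, msafile, informat=None, overwrite=False):
    """Assemble a cmbuild argument list following the DESCRIPTION rules.

    Hypothetical helper for illustration; it does not run cmbuild itself.
    """
    if cmfile_out == "-":
        # The CM file may not go to stdout: it would mix with the
        # per-model summary lines that cmbuild prints there.
        raise ValueError("cmfile_out may not be '-'")
    if msafile == "-" and informat is None:
        # The MSA format cannot be autodetected on a nonrewindable stream.
        raise ValueError("reading msafile from stdin requires informat")

    argv = ["cmbuild"]
    if overwrite:
        argv.append("-F")  # allow an existing cmfile_out to be overwritten
    if informat is not None:
        argv += ["--informat", informat]
    argv += [cmfile_out, msafile]
    return argv
```

The returned list could then be handed to subprocess.run(), for example subprocess.run(cmbuild_argv("my.cm", "-", informat="stockholm"), stdin=open("my.sto"), check=True), assuming cmbuild is on the PATH. Rejecting the two invalid combinations up front avoids launching a process that is certain to fail.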
OPTIONS
       -h     Help; print a brief reminder of command line usage and
              available options.

       -n _s_ Name the new CM _s_. The default is to use the name of the
              alignment (if one is present in the _msafile_), or, failing
              that, the name of the _msafile_. If _msafile_ contains more
              than one alignment, -n doesn't work, and every alignment must
              have a name annotated in the _msafile_ (as in Stockholm #=GF
              ID annotation).

       -F     Allow _cmfile_out_ to be overwritten. Without this option, if
              _cmfile_out_ already exists, cmbuild exits with an error.

       -o _f_ Direct the summary output to file _f_, rather than to stdout.

       -O _f_ After each model is constructed, resave annotated source
              alignments to a file _f_ in Stockholm format. Sequences are
              annotated with the relative sequence weights that were
              assigned. The alignments are also annotated with a reference
              annotation line indicating which columns were assigned as
              consensus. If the source alignment had reference annotation
              ("#=GC RF"), it will be replaced with the consensus residue of
              the model for consensus columns and '.' for insert columns,
              unless the --hand option was used for specifying consensus
              positions, in which case it will be unchanged.

       --devhelp
              Print help, as with -h, but also include expert options that
              are not displayed with -h. These expert options are not
              expected to be relevant for the vast majority of users and so
              are not described in the manual page. The only resources for
              understanding what they actually do are the brief one-line
              descriptions output when --devhelp is enabled, and the source
              code.

OPTIONS CONTROLLING MODEL CONSTRUCTION
       These options control how consensus columns are defined in an
       alignment.

       --fast Define consensus columns automatically as those that have a
              fraction >= symfrac of residues as opposed to gaps. (See below
              for the --symfrac option.) This is the default.

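The --fast rule can be illustrated with a few lines of Python. This is a rough sketch of the rule as stated above, not Infernal's implementation: the function name consensus_columns, the set of characters treated as gaps, and the optional per-sequence weights argument (defaulting to 1.0 for every sequence) are assumptions made for the example.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols; adjust to your alignment


def consensus_columns(seqs, symfrac=0.5, weights=None):
    """Return 0-based indices of columns whose residue fraction >= symfrac.

    seqs    : aligned sequences, all of the same length
    weights : optional per-sequence weights (assumed 1.0 each if omitted)
    """
    if weights is None:
        weights = [1.0] * len(seqs)
    total = sum(weights)
    consensus = []
    for j in range(len(seqs[0])):
        # Weighted count of sequences with a residue (non-gap) in column j.
        residues = sum(w for s, w in zip(seqs, weights) if s[j] not in GAP_CHARS)
        if residues / total >= symfrac:
            consensus.append(j)
    return consensus
```

With a threshold of 0.0 every column qualifies, and with 1.0 only gap-free columns do, which matches the behaviour described for the --symfrac option.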
       --hand Use reference coordinate annotation (#=GC RF line, in
              Stockholm) to determine which columns are consensus, and which
              are inserts. Any non-gap character indicates a consensus
              column. (For example, mark consensus columns with "x", and
              insert columns with ".".) This option was called --rf in
              previous versions of Infernal (0.1 through 1.0.2).

       --symfrac _x_
              Define the residue fraction threshold necessary to define a
              consensus column when not using --hand. The default is 0.5.
              The symbol fraction in each column is calculated after taking
              relative sequence weighting into account. Setting this to 0.0
              means that every alignment column will be assigned as
              consensus, which may be useful in some cases. Setting it to
              1.0 means that only columns that include 0 gaps will be
              assigned as consensus. This option replaces the --gapthresh
              _y_ option from previous versions of Infernal (0.1 through
              1.0.2), with _x_ equal to (1.0 - _y_). For example, to
              reproduce the behavior of cmbuild --gapthresh 0.8 in a
              previous version, use cmbuild --symfrac 0.2 with this version.

       --noss Ignore the secondary structure annotation, if any, in
              _msafile_ and build a CM with zero basepairs. This model will
              be similar to a profile HMM, and the cmsearch and cmscan
              programs will use HMM algorithms, which are faster than CM
              ones, for this model. Additionally, a zero basepair model need
              not be calibrated with cmcalibrate prior to running cmsearch
              with it. The --noss option must be used if there is no
              secondary structure annotation in _msafile_.

       --rsearch _f_
              Parameterize emission scores a la RSEARCH, using the RIBOSUM
              matrix in file _f_. With --rsearch enabled, all alignments in
              _msafile_ must contain exactly one sequence or the --call
              option must also be enabled. All positions in each sequence
              will be considered consensus "columns".
              Note that the emission scores for these models will not be
              identical to RIBOSUM scores, because of differences in the
              modelling strategy between Infernal and RSEARCH, but they will
              be as similar as possible. RIBOSUM matrix files are included
              with Infernal in the "matrices/" subdirectory of the top-level
              "infernal-xxx" directory. RIBOSUM matrices are substitution
              score matrices trained specifically for structural RNAs, with
              separate single stranded residue and base pair substitution
              scores. For more information see the RSEARCH publication
              (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

OTHER MODEL CONSTRUCTION OPTIONS
       --null _f_
              Read a null model from _f_. The null model defines the
              probability of each RNA nucleotide in background sequence; the
              default is to use 0.25 for each nucleotide. The format of null
              files is specified in the user guide.

       --prior _f_
              Read a Dirichlet prior from _f_, replacing the default mixture
              Dirichlet. The format of prior files is specified in the user
              guide.

       Use --devhelp to see additional, otherwise undocumented, model
       construction options.

OPTIONS CONTROLLING RELATIVE WEIGHTS
       cmbuild uses an ad hoc sequence weighting algorithm to downweight
       closely related sequences and upweight distantly related ones. This
       has the effect of making models less biased by uneven phylogenetic
       representation. For example, two identical sequences would typically
       each receive half the weight that one sequence would. These options
       control which algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme
              [Henikoff and Henikoff, J. Mol. Biol. 243:574, 1994]. This is
              the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al, J. Mol. Biol. 235:1067, 1994].

       --wnone
              Turn sequence weighting off; i.e. explicitly set all sequence
              weights to 1.0.

       --wgiven
              Use sequence weights as given in annotation in the input
              alignment file. If no weights were given, assume they are all
              1.0.
              The default is to determine new sequence weights by the
              Henikoff position-based algorithm (see --wpb), ignoring any
              annotated weights.

       --wblosum
              Use the BLOSUM filtering algorithm to weight the sequences,
              instead of the default position-based weighting. Cluster the
              sequences at a given percentage identity (see --wid); assign
              each cluster a total weight of 1.0, distributed equally
              amongst the members of that cluster.

       --wid _x_
              Controls the behavior of the --wblosum weighting option by
              setting the percent identity for clustering the alignment to
              _x_.

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After relative weights are determined, they are normalized to sum to
       a total effective sequence number, eff_nseq. This number may be the
       actual number of sequences in the alignment, but it is almost always
       smaller than that. The default entropy weighting method (--eent)
       reduces the effective sequence number to reduce the information
       content (relative entropy, or average expected score on true
       homologs) per consensus position. The target relative entropy is
       controlled by a two-parameter function, where the two parameters are
       settable with --ere and --esigma.

       --eent Use the entropy weighting strategy to determine the effective
              sequence number that gives a target mean match state relative
              entropy. This option is the default, and can be turned off
              with --enone. The default target mean match state relative
              entropy is 0.59 bits for models with at least 1 basepair and
              0.38 bits for models with zero basepairs, but can be changed
              with --ere. The default of 0.59 or 0.38 bits is automatically
              changed if the total relative entropy of the model (summed
              match state relative entropy) is less than a cutoff, which is
              controlled by the --esigma option. If you really want to play
              with that option, consult the source code.
              Additionally, the effective sequence number cannot be larger
              than the number of sequences in the alignment, although this
              can be overridden to set the maximum possible effective
              sequence number with the --emaxseq option.

       --enone
              Turn off the entropy weighting strategy. The effective
              sequence number is just the number of sequences in the
              alignment.

       --ere _x_
              Set the target mean match state relative entropy as _x_. By
              default the target relative entropy per match position is 0.59
              bits for models with at least 1 basepair and 0.38 bits for
              models with zero basepairs.

       --eminseq _x_
              Define the minimum allowed effective sequence number as _x_.

       --emaxseq _x_
              Define the maximum allowed effective sequence number as _x_.
              This number can be larger than the number of sequences in the
              alignment.

       --ehmmre _x_
              Set the target HMM mean match state relative entropy as _x_.
              Entropy for basepairing match states is calculated using
              marginalized basepair emission probabilities.

       --eset _x_
              Set the effective sequence number for entropy weighting as
              _x_.

OPTIONS CONTROLLING FILTER P7 HMM CONSTRUCTION
       For each CM that cmbuild constructs, an accompanying filter p7 HMM is
       built from the input alignment as well. These options control filter
       HMM construction:

       --p7ere _x_
              Set the target mean match state relative entropy for the
              filter p7 HMM as _x_. By default the target relative entropy
              per match position is 0.38 bits.

       --p7ml Use a maximum likelihood p7 HMM built from the CM as the
              filter HMM. This HMM will be as similar as possible to the CM
              (while necessarily ignorant of secondary structure).

       Use --devhelp to see additional, otherwise undocumented, filter HMM
       construction options.

OPTIONS CONTROLLING FILTER P7 HMM CALIBRATION
       After building each filter HMM, cmbuild determines appropriate
       E-value parameters to use during filtering in cmsearch and cmscan by
       sampling a set of sequences and searching them with each HMM filter
       configuration and algorithm.

       --EmN _n_
              Set the number of sampled sequences for local MSV filter HMM
              calibration to _n_. 200 by default.

       --EvN _n_
              Set the number of sampled sequences for local Viterbi filter
              HMM calibration to _n_. 200 by default.

       --ElfN _n_
              Set the number of sampled sequences for local Forward filter
              HMM calibration to _n_. 200 by default.

       --EgfN _n_
              Set the number of sampled sequences for glocal Forward filter
              HMM calibration to _n_. 200 by default.

       Use --devhelp to see additional, otherwise undocumented, filter HMM
       calibration options.

OPTIONS FOR REFINING THE INPUT ALIGNMENT
       --refine _f_
              Attempt to refine the alignment before building the CM using
              expectation-maximization (EM). A CM is first built from the
              initial alignment as usual. Then, the sequences in the
              alignment are realigned optimally to the CM (with the HMM
              banded CYK algorithm; "optimal" here means optimal given the
              bands), and a new CM is built from the resulting alignment.
              The sequences are then realigned to the new CM, and a new CM
              is built from that alignment. This is continued until
              convergence, specifically when the alignments for two
              successive iterations are not significantly different (the
              summed bit scores of all the sequences in the alignment
              changes less than 1% between two successive iterations). The
              final alignment (the alignment used to build the CM that gets
              written to _cmfile_out_) is written to _f_.

       -l     With --refine, turn on the local alignment algorithm, which
              allows the alignment to span two or more subsequences if
              necessary (e.g. if the structures of the query model and
              target sequence are only partially shared), allowing certain
              large insertions and deletions in the structure to be
              penalized differently than normal indels. The default is to
              globally align the query model to the target sequences.

       --gibbs
              Modifies the behavior of --refine so Gibbs sampling is used
              instead of EM.
              The difference is that during the alignment stage the
              alignment is not necessarily optimal; instead, an alignment
              (parsetree) for each sequence is sampled from the posterior
              distribution of alignments as determined by the Inside
              algorithm. Due to this sampling step --gibbs is
              non-deterministic, so different runs with the same alignment
              may yield different results. This is not true when --refine is
              used without the --gibbs option, in which case the final
              alignment and CM will always be the same. When --gibbs is
              enabled, the --seed _n_ option can be used to seed the random
              number generator predictably, making the results reproducible.
              The goal of the --gibbs option is to help expert RNA alignment
              curators refine structural alignments by allowing them to
              observe alternative high scoring alignments.

       --seed _n_
              Seed the random number generator with _n_, an integer >= 0.
              This option can only be used in combination with --gibbs. If
              _n_ is nonzero, stochastic sampling of alignments will be
              reproducible; the same command will give the same results. If
              _n_ is 0, the random number generator is seeded arbitrarily,
              and stochastic samplings may vary from run to run of the same
              command. The default seed is 0.

       --cyk  With --refine, align with the CYK algorithm. By default the
              optimal accuracy algorithm is used. There is more information
              on this in the cmalign manual page.

       --notrunc
              With --refine, turn off the truncated alignment algorithm.
              There is more information on this in the cmalign manual page.

       Use --devhelp to see additional, otherwise undocumented, alignment
       refinement options as well as other output file options and options
       for building multiple models for a single alignment.

SEE ALSO
       See infernal(1) for a master man page with a list of all the
       individual man pages for programs in the Infernal package.

       For complete documentation, see the user guide that came with your
       Infernal distribution (Userguide.pdf); or see the Infernal web page.

COPYRIGHT
       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your Infernal source distribution, or see the
       Infernal web page.

AUTHOR
       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

Infernal 1.1.3                      Nov 2019                        cmbuild(1)