POCKETSPHINX(1)             General Commands Manual            POCKETSPHINX(1)

NAME
   pocketsphinx - Run speech recognition on audio data

SYNOPSIS
   pocketsphinx [ options... ] [ config | live | single | align | help | soxflags ] INPUTS...

DESCRIPTION
   The pocketsphinx command-line program reads single-channel 16-bit PCM
   audio from one or more input files (or - to read from standard input),
   and attempts to recognize speech in it using the default acoustic and
   language model.  The input files can be raw audio, WAV, or NIST Sphere
   files, though some of these may not be recognized properly.  It accepts
   a large number of options which you probably don't care about, and a
   command which defaults to live.  The commands are as follows:

   help
       Print usage and help text, including the long list of those options
       you don't care about.

   config
       Dump configuration as JSON to standard output (can be loaded with
       the -config option).

   live
       Detect speech segments in input files, run recognition on them
       (using those options you don't care about), and write the results
       to standard output in line-delimited JSON.  I realize this isn't
       the prettiest format, but it sure beats XML.  Each line contains a
       JSON object with these fields, which have short names to make the
       lines more readable:

       "b":  Start time in seconds, from the beginning of the stream
       "d":  Duration in seconds
       "p":  Estimated probability of the recognition result, i.e. a
             number between 0 and 1 which may be used as a confidence
             score
       "t":  Full text of recognition result
       "w":  List of segments (usually words), each of which in turn
             contains the b, d, p, and t fields, for start, duration,
             probability, and the text of the word.  In the future we may
             also support hierarchical results, in which case each segment
             could contain a w field of its own.

   single
       Recognize the input as a single utterance, and write a JSON object
       in the same format described above.

   align
       Align a single input file (or - for standard input) to a word
       sequence, and write a JSON object in the same format described
       above.  The first positional argument is the input, and all
       subsequent ones are concatenated to make the text, to avoid
       surprises if you forget to quote it.  You are responsible for
       normalizing the text to remove punctuation, uppercase, centipedes,
       etc.  For example:

           pocketsphinx align goforward.wav "go forward ten meters"

       By default, only word-level alignment is done.  To get phone
       alignments, pass -phone_align yes in the flags, e.g.:

           pocketsphinx -phone_align yes align audio.wav $text

       This produces output that is not particularly readable, but you can
       use jq (https://stedolan.github.io/jq/) to clean it up.  For
       example, you can get just the word names and start times like this:

           pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'

       Or you could get the phone names and durations like this:

           pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'

       There are many, many other possibilities, of course.

   soxflags
       Return arguments to sox which will create the appropriate input
       format.  Note that because the sox command-line is slightly quirky,
       these must always come after the filename or -d (which tells sox to
       read from the microphone).  You can run live recognition like this:

           sox -d $(pocketsphinx soxflags) | pocketsphinx -

       or decode from a file named "audio.mp3" like this:

           sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -

   By default, only errors are printed to standard error, but if you want
   more information you can pass -loglevel INFO.  Partial results are not
   printed; maybe they will be in the future, but don't hold your breath.
   Force-alignment, on the other hand, is already supported through the
   align command described above.
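   The line-delimited JSON written by live (the default command) can be
   post-processed with the same jq tool shown above.  As a rough sketch,
   assuming jq is installed and audio.wav is a placeholder name for an
   input file the default model can handle, the following prints one
   tab-separated line per utterance with its start time, duration, and
   text:

       # "live" may be omitted here since it is the default command;
       # each output line is one JSON object, which jq reduces to b, d, t.
       pocketsphinx audio.wav | jq -r '[.b, .d, .t] | @tsv'

   Add -loglevel INFO before the input file if you also want to see what
   the decoder is doing on standard error.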
OPTIONS
   -agc  Automatic gain control for c0 ('max', 'emax', 'noise', or 'none')
   -agcthresh  Initial threshold for automatic gain control
   -allphone  Perform phoneme decoding with phonetic lm (given here)
   -allphone_ci  Perform phoneme decoding with phonetic lm and context-independent units only
   -alpha  Preemphasis parameter
   -ascale  Inverse of acoustic model scale for confidence score calculation
   -aw  Inverse weight applied to acoustic scores.
   -backtrace  Print results and backtraces to log.
   -beam  Beam width applied to every frame in Viterbi search (smaller values mean wider beam)
   -bestpath  Run bestpath (Dijkstra) search over word lattice (3rd pass)
   -bestpathlw  Language model probability weight for bestpath search
   -ceplen  Number of components in the input feature vector
   -cmn  Cepstral mean normalization scheme ('live', 'batch', or 'none')
   -cmninit  Initial values (comma-separated) for cepstral mean when 'live' is used
   -compallsen  Compute all senone scores in every frame (can be faster when there are many senones)
   -dict  Main pronunciation dictionary (lexicon) input file
   -dictcase  Dictionary is case sensitive (NOTE: case insensitivity applies to ASCII characters only)
   -dither  Add 1/2-bit noise
   -doublebw  Use double bandwidth filters (same center freq)
   -ds  Frame GMM computation downsampling ratio
   -fdict  Noise word pronunciation dictionary input file
   -feat  Feature stream type, depends on the acoustic model
   -featparams  File containing feature extraction parameters
   -fillprob  Filler word transition probability
   -frate  Frame rate
   -fsg  Sphinx format finite state grammar file
   -fsgusealtpron  Add alternate pronunciations to FSG
   -fsgusefiller  Insert filler words at each state.
   -fwdflat  Run forward flat-lexicon search over word lattice (2nd pass)
   -fwdflatbeam  Beam width applied to every frame in second-pass flat search
   -fwdflatefwid  Minimum number of end frames for a word to be searched in fwdflat search
   -fwdflatlw  Language model probability weight for flat lexicon (2nd pass) decoding
   -fwdflatsfwin  Window of frames in lattice to search for successor words in fwdflat search
   -fwdflatwbeam  Beam width applied to word exits in second-pass flat search
   -fwdtree  Run forward lexicon-tree search (1st pass)
   -hmm  Directory containing acoustic model files (see the example at the end of this list)
   -input_endian  Endianness of input data, big or little, ignored if NIST or MS Wav
   -jsgf  JSGF grammar file
   -keyphrase  Keyphrase to spot
   -kws  A file with keyphrases to spot, one per line
   -kws_delay  Delay to wait for best detection score
   -kws_plp  Phone loop probability for keyphrase spotting
   -kws_threshold  Threshold for p(hyp)/p(alternatives) ratio
   -latsize  Initial backpointer table size
   -lda  File containing transformation matrix to be applied to features (single-stream features only)
   -ldadim  Dimensionality of output of feature transformation (0 to use entire matrix)
   -lifter  Length of sin-curve for liftering, or 0 for no liftering.
   -lm  Word trigram language model input file
   -lmctl  Specify a set of language model
   -lmname  Which language model in -lmctl to use by default
   -logbase  Base in which all log-likelihoods are calculated
   -logfn  File to write log messages in
   -loglevel  Minimum level of log messages (DEBUG, INFO, WARN, ERROR)
   -logspec  Write out logspectral files instead of cepstra
   -lowerf  Lower edge of filters
   -lpbeam  Beam width applied to last phone in words
   -lponlybeam  Beam width applied to last phone in single-phone words
   -lw  Language model probability weight
   -maxhmmpf  Maximum number of active HMMs to maintain at each frame (or -1 for no pruning)
   -maxwpf  Maximum number of distinct word exits at each frame (or -1 for no pruning)
   -mdef  Model definition input file
   -mean  Mixture gaussian means input file
   -mfclogdir  Directory to log feature files to
   -min_endfr  Nodes ignored in lattice construction if they persist for fewer than N frames
   -mixw  Senone mixture weights input file (uncompressed)
   -mixwfloor  Senone mixture weights floor (applied to data from -mixw file)
   -mllr  MLLR transformation to apply to means and variances
   -mmap  Use memory-mapped I/O (if possible) for model files
   -ncep  Number of cep coefficients
   -nfft  Size of FFT, or 0 to set automatically (recommended)
   -nfilt  Number of filter banks
   -nwpen  New word transition penalty
   -pbeam  Beam width applied to phone transitions
   -pip  Phone insertion penalty
   -pl_beam  Beam width applied to phone loop search for lookahead
   -pl_pbeam  Beam width applied to phone loop transitions for lookahead
   -pl_pip  Phone insertion penalty for phone loop
   -pl_weight  Weight for phoneme lookahead penalties
   -pl_window  Phoneme lookahead window size, in frames
   -rawlogdir  Directory to log raw audio files to
   -remove_dc  Remove DC offset from each frame
   -remove_noise  Remove noise using spectral subtraction
   -round_filters  Round mel filter frequencies to DFT points
   -samprate  Sampling rate
   -seed  Seed for random number generator; if less than zero, pick our own
   -sendump  Senone dump (compressed mixture weights) input file
   -senlogdir  Directory to log senone score files to
   -senmgau  Senone to codebook mapping input file (usually not needed)
   -silprob  Silence word transition probability
   -smoothspec  Write out cepstral-smoothed logspectral files
   -svspec  Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)
   -tmat  HMM state transition matrix input file
   -tmatfloor  HMM state transition probability floor (applied to -tmat file)
   -topn  Maximum number of top Gaussians to use in scoring.
   -topn_beam  Beam width used to determine top-N Gaussians (or a list, per-feature)
   -toprule  Start rule for JSGF (first public rule is default)
   -transform  Which type of transform to use to calculate cepstra (legacy, dct, or htk)
   -unit_area  Normalize mel filters to unit area
   -upperf  Upper edge of filters
   -uw  Unigram weight
   -var  Mixture gaussian variances input file
   -varfloor  Mixture gaussian variance floor (applied to data from -var file)
   -varnorm  Variance normalize each utterance (only if CMN == current)
   -verbose  Show input filenames
   -warp_params  Parameters defining the warping function
   -warp_type  Warping function type (or shape)
   -wbeam  Beam width applied to word exits
   -wip  Word insertion penalty
   -wlen  Hamming window length
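   Most of these options can be left at their defaults.  As a rough sketch
   of how the model-related options fit together (the paths below are
   placeholders for wherever your own model actually lives), you might
   point the decoder at a different acoustic model, dictionary, and
   language model like this:

       # Decode audio.wav with your own model files instead of the defaults.
       # /path/to/model, custom.dict and custom.lm.bin are hypothetical names.
       pocketsphinx -hmm /path/to/model \
                    -dict /path/to/custom.dict \
                    -lm /path/to/custom.lm.bin \
                    single audio.wav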
AUTHOR
   Written by numerous people at CMU from 1994 onwards.  This manual page
   by David Huggins-Daines <dhdaines@gmail.com>.

COPYRIGHT
   Copyright (C) 1994-2016 Carnegie Mellon University.  See the file
   LICENSE included with this package for more information.

SEE ALSO
   pocketsphinx_batch(1), sphinx_fe(1).

                                  2022-09-27                 POCKETSPHINX(1)
