API Reference

This section documents the ProtParCon core API. It is intended for developers and experienced users who want to extend ProtParCon.

msa

Provides a common interface for aligning multiple sequences using various alignment programs.

Users only need to provide an alignment program’s executable and a multiple sequence file in FASTA format. The general use function msa() will always return the pathname of the alignment output file, or exit with error code 1 and a logged error message.

Users are advised to use only the function msa() and to avoid any private functions inside the module. However, users are strongly encouraged to implement new private functions for additional alignment programs that they are interested in and to incorporate them into the general use function msa().

ProtParCon.msa._clustal(exe, seq, outfile)

Align multiple sequences using CLUSTAL (OMEGA).

Parameters:
  • exe – str, path to the executable of a multiple sequence alignment program.
  • seq – str, path to the multiple sequence file (must be in FASTA format).
  • outfile – str, path to the aligned sequence output file (in FASTA format).
Returns:

str, path to the aligned sequence output file (in FASTA format).

ProtParCon.msa._guess(exe)

Guess the name of a multiple sequence alignment (MSA) program according to its executable.

Parameters:
  • exe – str, path to the executable of an MSA program.
Returns:

tuple, name of the MSA program and the corresponding function.

ProtParCon.msa._mafft(exe, seq, outfile)

Align multiple sequences using MAFFT.

Parameters:
  • exe – str, path to the executable of a multiple sequence alignment program.
  • seq – str, path to the multiple sequence file (must be in FASTA format).
  • outfile – str, path to the aligned sequence output file (in FASTA format).
Returns:

str, path to the aligned sequence output file (in FASTA format).

ProtParCon.msa._muscle(exe, seq, outfile)

Align multiple sequences using MUSCLE.

Parameters:
  • exe – str, path to the executable of a multiple sequence alignment program.
  • seq – str, path to the multiple sequence file (must be in FASTA format).
  • outfile – str, path to the aligned sequence output file (in FASTA format).
Returns:

str, path to the aligned sequence output file (in FASTA format).

ProtParCon.msa._tcoffee(exe, seq, outfile)

Align multiple sequences using T-COFFEE.

Parameters:
  • exe – str, path to the executable of a multiple sequence alignment program.
  • seq – str, path to the multiple sequence file (must be in FASTA format).
  • outfile – str, path to the aligned sequence output file (in FASTA format).
Returns:

str, path to the aligned sequence output file (in FASTA format).

ProtParCon.msa.msa(exe, seq, outfile='', trimming=False, verbose=False)

General use function for multiple sequence alignment (MSA).

Parameters:
  • exe – str, path to the executable of an MSA program.
  • seq – str, path to the multiple sequence file (must be in FASTA format).
  • outfile – str, path to the aligned sequence output (FASTA) file, default: [basename].[aligner].fasta, where basename is the filename of the sequence file without its known FASTA file extension, aligner is the name of the alignment program in lowercase, and fasta is the extension for FASTA format files.
  • trimming – bool, trim gaps and ambiguous sites if True; otherwise, leave them untouched.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

str, path to the aligned sequence output file (in FASTA format).
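The default output name described for outfile can be illustrated with a short sketch. This is only an illustration under stated assumptions: the helper default_outfile() and the tuple FASTA_ENDINGS are hypothetical names that are not part of ProtParCon, and the real list of recognized FASTA extensions may differ.

```python
import os

# Hypothetical list of FASTA extensions; ProtParCon's real list may differ.
FASTA_ENDINGS = ('.fasta', '.fas', '.fa', '.faa')


def default_outfile(seq, aligner):
    """Build [basename].[aligner].fasta from a sequence file path."""
    root, ext = os.path.splitext(seq)
    # Strip the extension only when it is a known FASTA extension.
    base = root if ext.lower() in FASTA_ENDINGS else seq
    return '{}.{}.fasta'.format(base, aligner.lower())


print(default_outfile('data/globin.fa', 'MAFFT'))
```

An unknown extension is kept in this sketch, so the aligner name is simply appended to the full filename.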

asr

Provides a common interface for ancestral states reconstruction (ASR) using various ASR programs.

Users only need to provide an ASR program’s executable, an aligned multiple sequence file (in FASTA format), and a guide tree (in NEWICK format). The general use function asr() will always return a dict object containing sequence records and a tree object (generated by the Bio.Phylo module inside Biopython), or exit with error code 1 and a logged error message.

Users are advised to use only the function asr() and to avoid any private functions inside the module. However, users are strongly encouraged to implement new private functions for additional ASR programs that they are interested in and to incorporate them into the general use function asr().

ProtParCon.asr._codeml(exe, msa, tree, model, gamma, alpha, freq, outfile)

Reconstruct ancestral sequences using CODEML (inside the PAML package).

Parameters:
  • exe – str, path to the executable of an ASR program.
  • msa – str, path to the MSA file (must be in FASTA format).
  • tree – str, path to the tree file (must be in NEWICK format) or a NEWICK format tree string (must start with “(” and end with “;”).
  • model – namedtuple, substitution model for ASR.
  • gamma – int, the number of categories for the discrete gamma rate heterogeneity model.
  • alpha – float, the shape (alpha) for the gamma rate heterogeneity.
  • freq – str, the equilibrium frequencies of the twenty amino acids.
  • outfile – str, path to the output file.
Returns:

tuple, a tree object, a dict for sequences, and a list of rates.

Note

See the docstring of function asr() for details of all arguments.

ProtParCon.asr._fastml(exe, msa, tree, model, gamma, alpha, freq, outfile)

Reconstruct ancestral sequences using FastML.

Parameters:
  • exe – str, path to the executable of an ASR program.
  • msa – str, path to the MSA file (must be in FASTA format).
  • tree – str, path to the tree file (must be in NEWICK format) or a NEWICK format tree string (must start with “(” and end with “;”).
  • model – namedtuple, substitution model for ASR.
  • gamma – int, the number of categories for the discrete gamma rate heterogeneity model.
  • alpha – float, the shape (alpha) for the gamma rate heterogeneity.
  • freq – str, the equilibrium frequencies of the twenty amino acids.
  • outfile – str, path to the output file.
Returns:

tuple, a tree object and a dict for sequences.

Note

See the docstring of function asr() for details of all arguments.

ProtParCon.asr._guess(exe)

Guess the name of an ancestral states reconstruction (ASR) program according to its executable.

Parameters:
  • exe – str, path to the executable of an ASR program.
Returns:

tuple, name of the ASR program and the corresponding function.

ProtParCon.asr._label(tree, ancestors)

Relabel internal nodes of a tree and map them to the corresponding name of ancestral sequences.

Parameters:
  • tree – str, a NEWICK format string or file for a tree (a string must start with “(” and end with “;”).
  • ancestors – dict, a dict object that stores sequences.
Returns:

tuple, a relabeled tree object and a dict object for sequences.

ProtParCon.asr._parse(wd)

Parse the rst file generated by CODEML.

Parameters:
  • wd – str, work directory of CODEML (inside the PAML package).
Returns:

tuple, a tree object, a dict for sequences, and a list of rates.

ProtParCon.asr._raxml(exe, msa, tree, model, gamma, alpha, freq, outfile)

Reconstruct ancestral sequences using RAxML.

Parameters:
  • exe – str, path to the executable of an ASR program.
  • msa – str, path to the MSA file (must be in FASTA format).
  • tree – str, path to the tree file (must be in NEWICK format) or a NEWICK format tree string (must start with “(” and end with “;”).
  • model – namedtuple, substitution model for ASR.
  • gamma – int, the number of categories for the discrete gamma rate heterogeneity model.
  • alpha – float, the shape (alpha) for the gamma rate heterogeneity.
  • freq – str, the equilibrium frequencies of the twenty amino acids.
  • outfile – str, path to the output file.
Returns:

tuple, a tree object and a dict for sequences.

Note

See the docstring of function asr() for details of all arguments.

ProtParCon.asr._write(tree, ancestor, rates, aps, outfile)

Write tree (object) and ancestor (dict) to an output file.

Parameters:
  • tree – object, tree object.
  • ancestor – dict, dict object for sequence records.
  • rates – list, a list of rates.
  • aps – dict, dict object for probabilities of ancestral states.
  • outfile – str, path to the output file.
Returns:

str, path to the output file.
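The documentation of asr() describes the layout of the tsv output: a first line that starts with ‘#TREE’ followed by a TAB and a NEWICK tree string, a blank second line, and then tab separated sequence IDs and sequences. A minimal sketch of writing that layout is shown here. The function write_ancestors() is a hypothetical helper, not the actual _write(); it takes a tree string rather than a tree object and does not handle rates or probabilities of ancestral states.

```python
def write_ancestors(newick, records, outfile):
    """Write a tree string and sequence records in the tsv layout."""
    with open(outfile, 'w') as handle:
        # First line: '#TREE', a TAB, then the NEWICK tree string.
        handle.write('#TREE\t{}\n'.format(newick))
        # Second line is intentionally left blank.
        handle.write('\n')
        # Remaining lines: tab separated sequence ID and sequence.
        for name, sequence in records.items():
            handle.write('{}\t{}\n'.format(name, sequence))
    return outfile
```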

ProtParCon.asr.asr(exe, msa, tree, model, gamma=4, alpha=1.8, freq='', outfile='', verbose=False)

General use function for (marginal) ancestral states reconstruction (ASR).

Parameters:
  • exe – str, path to the executable of an ASR program.
  • msa – str, path to the MSA file (must be in FASTA format).
  • tree – str, path to the tree file (must be in NEWICK format) or a NEWICK format tree string (must start with “(” and end with “;”).
  • model – str, substitution model for ASR. Either a path to a model file or a valid model string (name of an empirical model plus some other options like gamma category and equilibrium frequency option). If a model file is in use, the file format of the model file depends on the ASR program; see its documentation for details.
  • gamma – int, the number of categories for the discrete gamma rate heterogeneity model. If gamma is not set, RAxML will use the CAT model instead, while CODEML will use 4 gamma categories.
  • alpha – float, the shape (alpha) for the gamma rate heterogeneity.
  • freq – str, the base frequencies of the twenty amino acids. Accepts empirical or estimate, where empirical sets the frequencies to the empirical values associated with the specified substitution model, and estimate uses an ML estimate of the base frequencies.
  • outfile – str, path to the output file. If not set, the results of ancestral states reconstruction will be saved using the filename [basename].[asrer].tsv, where basename is the filename of the MSA file without its known FASTA file extension and asrer is the name of the ASR program (in lowercase). The first line of the file starts with ‘#TREE’, followed by a TAB and then a NEWICK formatted tree string with labeled internal nodes. The second line of the tsv file is intentionally left blank, and the remaining lines are tab separated sequence IDs and amino acid sequences.
  • verbose – bool, invoke verbose or silent (default) process mode.
Returns:

tuple, the paths of the ancestral states file.

Note

If a tree (with branch lengths and/or internal nodes labeled) is provided, the branch lengths and internal node labels will be ignored.

If the model name is combined with a Gamma category number, e.g. JTT+G4 or WAG+G8, only the name of the model will be used. For all models that contain the letter G, a discrete Gamma model will be used to account for among-site rate variation. If there is a number after the letter G, the number will be used to define the number of categories in CODEML. For RAxML, the number of categories will always be set to 4 if G is present.
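The rule in this note can be sketched as a small parser. This is not ProtParCon's own parsing code: split_model() and its program argument are hypothetical, and the fallback to 4 categories for CODEML when G carries no number is an assumption based on the description of the gamma argument.

```python
import re


def split_model(model, program):
    """Return (name, categories) for model strings such as 'JTT+G4'."""
    parts = model.split('+')
    name, categories = parts[0], 0
    for part in parts[1:]:
        match = re.fullmatch(r'G(\d*)', part.upper())
        if match:
            if program == 'raxml':
                # RAxML: always 4 categories when G is present.
                categories = 4
            else:
                # CODEML: use the number after G; 4 is assumed if absent.
                categories = int(match.group(1) or 4)
    return name, categories


print(split_model('WAG+G8', 'codeml'))
print(split_model('WAG+G8', 'raxml'))
```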

imc

Provides a common interface for identifying parallel and convergent amino acid replacements in orthologous protein sequences. To make this module suitable for general use, function ProtParCon() is built on top of other modules to facilitate the identification of parallel and convergent amino acid replacements using a wide range of sequence data. Depending on the sequence data, optional parameters and external programs may be required.

ProtParCon.imc._load(tsv)

Load tree, rates, and data blocks from a tsv file.

Parameters:
  • tsv – str, path to the tsv file that stores ancestral states or simulated sequences.
Returns:

tuple, tree, rates (list) and sequence records (defaultdict).

ProtParCon.imc._pairing(tree, indpairs=True)

Check whether two branches are a sister branch pair or a branch pair sharing the same evolutionary path.

Parameters:
  • tree – object, a tree object.
  • indpairs – bool, only return independent branch pairs if True, or return all branch pairs if False.
Returns:

tuple, a list of branches and a list of branch pairs.
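The two checks named above can be sketched on a toy tree. Everything in this block is hypothetical: the tree is stored as a plain child-to-parent dict rather than a tree object, each branch is identified by its child node, and a pair is treated as independent when it is neither a sister pair nor on the same evolutionary path, which is an assumption about what indpairs selects.

```python
from itertools import combinations

# Toy tree ((A,B)N1,(C,D)N2)N0 stored as child -> parent (hypothetical layout).
PARENT = {'A': 'N1', 'B': 'N1', 'C': 'N2', 'D': 'N2', 'N1': 'N0', 'N2': 'N0'}


def ancestors(node):
    """Return the set of nodes on the path from node up to the root."""
    path = set()
    while node in PARENT:
        node = PARENT[node]
        path.add(node)
    return path


def independent(a, b):
    """True if branches a and b are neither sisters nor on the same path."""
    sisters = PARENT[a] == PARENT[b]
    same_path = a in ancestors(b) or b in ancestors(a)
    return not (sisters or same_path)


pairs = [p for p in combinations(sorted(PARENT), 2) if independent(*p)]
print(pairs)
```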

ProtParCon.imc._sequencing(sequence, tree, aligner, ancestor, wd, asr_model, verbose)

Identify the type of the sequence file.

Parameters:
  • sequence – str, path to a sequence data file.
  • tree – str, path to a NEWICK tree file.
Returns:

tuple, sequence, alignment, ancestor, and simulation data file.

ProtParCon.imc.imc(sequence, tree='', aligner='', ancestor='', simulator='', asr_model='JTT', exp_model='JTT', n=100, divergent=True, indpairs=True, threshold=0.0, exp_prob=False, verbose=False)

Identify molecular parallel and convergent changes.

Parameters:
  • sequence

    str, path to the sequence data file. Sequence data file here covers a wide range of files and formats:

    • sequences: raw protein sequence file, must be in FASTA format; a NEWICK format tree is also required for argument tree.
    • msa: multiple sequence alignment file, must be in FASTA format; a NEWICK format tree is also required for argument tree.
    • ancestors: reconstructed ancestral states file, must be a tsv (tab separated) file; the first line needs to start with #TREE, the second line needs to be blank, and the remaining lines need to be tab separated sequence names (or IDs) and amino acid sequences.
    • simulations: simulated sequences, must be in a tsv file; the first line needs to start with #TREE, the second line needs to be blank, datasets need to be separated by a blank line, and inside each dataset block each line should consist of a tab separated sequence name (or ID) and amino acid sequence.
  • tree – str, NEWICK format tree string or tree file. This needs to be set according to argument sequence. If sequence is a raw sequence file or an MSA file, tree is required for guiding ancestral states reconstruction. If sequence is ancestors or simulations, then tree is not necessary.
  • aligner – str, executable of an alignment program.
  • ancestor – str, executable of an ancestral states reconstruction program.
  • simulator – str, executable of a sequence simulation program.
  • asr_model – str, model name or model file for ancestral states reconstruction, default: JTT.
  • exp_model – str, model name or model file for estimating expected changes based on simulation or replacement probability manipulation, default: JTT.
  • n – int, number of datasets (or duplicates) to simulate.
  • divergent – bool, identify divergent changes if True, or only identify parallel and convergent changes if False.
  • indpairs – bool, only identify changes for independent branch pairs if True, or identify changes for all branch pairs if False.
  • threshold – float, a probability threshold that ranges from 0.0 to 1.0. If provided, only ancestral states with a probability equal to or larger than the threshold will be used, default: 0.0.
  • exp_prob – bool, calculate the probability of expected changes if set to True and the exp_model contains a probability matrix. This calculation is time consuming.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

tuple, a dict object of counts of parallel replacements, a dict object of counts of convergent replacements, a list of details of replacements (namedtuples), and the p-value of the AU test (float or None).
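The tsv layout required for the ancestors and simulations inputs can be read with a few lines of Python. The sketch below is an illustration only: load_blocks() is a hypothetical helper, not the module's _load(), and it ignores any rates stored in the file. An ancestors file simply yields a single block.

```python
def load_blocks(text):
    """Split tsv text into a tree string and a list of {name: sequence} blocks."""
    lines = text.splitlines()
    tree = ''
    if lines and lines[0].startswith('#TREE'):
        # Everything after the first TAB on the first line is the tree.
        tree = lines[0].partition('\t')[2]
    blocks, block = [], {}
    for line in lines[1:]:
        if line.strip():
            name, sequence = line.split('\t')
            block[name] = sequence
        elif block:
            # A blank line closes the current dataset block.
            blocks.append(block)
            block = {}
    if block:
        blocks.append(block)
    return tree, blocks
```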

detect

ProtParCon.detect._tester(obs, exp, values, alpha=0.05)

One-sample t-test to determine whether the observed value differs significantly from the expected value.

Parameters:
  • obs – int, observed value.
  • exp – float, the expected value.
  • values – list or tuple, a list of expected values from which exp was calculated.
  • alpha – float, significance level.
Returns:

float, p-value of the t-test.
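For readers unfamiliar with the one-sample t-test, the statistic behind it can be computed with the standard library alone. This sketch is not the implementation of _tester(): t_statistic() is a hypothetical helper and the input values are made up. Turning the statistic into a p-value additionally requires the t distribution with n - 1 degrees of freedom, for example via scipy.stats.ttest_1samp(values, obs); whether ProtParCon relies on SciPy is not stated here.

```python
import math
import statistics


def t_statistic(obs, values):
    """One-sample t statistic comparing the mean of values with obs."""
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - obs) / se


simulated = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up expected counts
print(t_statistic(8, simulated))
```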

ProtParCon.detect.detect(branchpair=None, pars=None, cons=None, wd='', fn='', tester=None, printout=True, verbose=True)

Pairwise comparison for parallel and convergent amino acid replacements in protein sequences.

Parameters:
  • branchpair – list, a list of branch pairs that need to be tested.
  • pars – dict, a dict object that stores parallel changes.
  • cons – dict, a dict object that stores convergent changes.
  • wd – str, path to the work directory. If not specified, it will be set to the current work directory. A file ending with ‘.counts.tsv’ in the work directory will be used if neither pars nor cons is provided.
  • fn – str, path to the result file.
  • tester – function, a function for testing the differences.
  • printout – bool, print out the test results (default) or only return the test results without printing them out.
  • verbose – bool, invoke verbose or silent process mode, default: True, verbose mode.
Returns:

list, a list of test results.

sim

Provides a common interface for simulating amino acid sequences using various simulation programs. At this stage, the module only supports simulating sequences using EVOLVER (inside the PAML package) and Seq-Gen.

At a minimum, this interface requires users to provide a NEWICK format tree string or tree file and an executable of a simulation program. Users can also pass additional options to simulate amino acid sequences under various scenarios.

Users are advised to use only the function sim() and to avoid any private functions inside the module. However, users are strongly encouraged to implement new private functions for additional simulation programs and to wrap them into the general use function sim().

ProtParCon.sim._evolver(exe, tree, length, freq, model, n, seed, gamma, alpha, invp, outfile)

Sequence simulation via EVOLVER.

Parameters:
  • exe – str, path to the executable of EVOLVER.
  • tree – str, path to the tree (must have branch lengths and be in NEWICK format).
  • length – int, the number of amino acid sites to simulate.
  • freq – list or None, base frequencies of the 20 amino acids.
  • model – str, name of a model or filename of a model file.
  • n – int, number of datasets (or duplicates) to simulate.
  • seed – int, the seed used to initialize the random number generator.
  • gamma – int, 0 means the discrete Gamma model is not in use; any positive integer larger than 1 will invoke the discrete Gamma model and set the number of categories to gamma.
  • alpha – float, the Gamma shape parameter alpha. If not set, the value will be estimated by the program; in case an initial value is needed, the initial value of alpha will be set to 0.5.
  • invp – float, proportion of invariable sites.
  • outfile – str, path to the simulation output file.
Returns:

str, path to the simulation output file.

ProtParCon.sim._evolver_parse(wd)

Parse the work directory of EVOLVER to get the simulated results.

Parameters:
  • wd – str, work directory of EVOLVER.
Returns:

tuple, a list of simulated sequences (dict) and a tree object.

ProtParCon.sim._guess(exe)

Guess the name of a sequence simulation program according to its executable.

Parameters:
  • exe – str, path to the executable of a simulation program.
Returns:

tuple, name of the simulation program and the corresponding function.

ProtParCon.sim._seqgen(exe, tree, length, freq, model, n, seed, gamma, alpha, invp, outfile)

Sequence simulation via Seq-Gen.

Parameters:
  • exe – str, path to the executable of Seq-Gen.
  • tree – str, path to the tree (must have branch lengths and be in NEWICK format).
  • length – int, the number of amino acid sites to simulate.
  • freq – list or None, base frequencies of the 20 amino acids.
  • model – str, name of a model or filename of a model file.
  • n – int, number of datasets (or duplicates) to simulate.
  • seed – int, the seed used to initialize the random number generator.
  • gamma – int, 0 means the discrete Gamma model is not in use; any positive integer larger than 1 will invoke the discrete Gamma model and set the number of categories to gamma.
  • alpha – float, the Gamma shape parameter alpha. If not set, the value will be estimated by the program; in case an initial value is needed, the initial value of alpha will be set to 0.5.
  • invp – float, proportion of invariable sites.
  • outfile – str, path to the simulation output file.
Returns:

str, path to the simulation output file.

ProtParCon.sim.sim(exe, tree='', sequence='', model='JTT', length=100, freq='empirical', n=100, seed=0, gamma=4, alpha=0.5, invp=0, outfile='', verbose=False)

General use function for sequence simulation.

Parameters:
  • exe – str, path to the executable of a simulation program.
  • tree – str, path to the tree (must have branch lengths and be in NEWICK format). If not provided, the sequence file needs to be a tsv file containing a tree with branch lengths and sequences. If both a tree and a sequence file in tsv format are provided, the tree in the tsv file will be ignored.
  • sequence – str, path to a multiple sequence alignment file in FASTA format or a tsv file generated by function asr() that has a line containing a tree with branch lengths. If provided, the length and base amino acid frequencies will be calculated based on the leaf sequences.
  • model – str, name of a model or filename of a model file.
  • length – int, the number of amino acid sites to simulate. If set to 0, the length will be obtained from the sequence.
  • freq – str, “empirical”, “estimate”, or a comma separated string of base frequencies of the 20 amino acids in the order of “ARNDCQEGHILKMFPSTWYV”.
  • n – int, number of datasets (or duplicates) to simulate.
  • seed – int, the seed used to initialize the random number generator.
  • gamma – int, 0 means the discrete Gamma model is not in use; any positive integer larger than 1 will invoke the discrete Gamma model and set the number of categories to gamma.
  • alpha – float, the Gamma shape parameter alpha. If not set, the value will be estimated by the program; in case an initial value is needed, the initial value of alpha will be set to 0.5.
  • invp – float, proportion of invariable sites.
  • outfile – str, path to the simulation output file.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

str, path to the simulation output file.
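The three accepted forms of the freq argument can be illustrated with a small parser. This is a sketch rather than the code used by sim(): parse_freq() is a hypothetical helper, and returning a dict keyed by amino acid is a choice made only for this example.

```python
AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV'


def parse_freq(freq):
    """Return a keyword unchanged or a dict of amino acid frequencies."""
    if freq in ('empirical', 'estimate'):
        return freq
    # Otherwise expect 20 comma separated values in AMINO_ACIDS order.
    values = [float(v) for v in freq.split(',')]
    if len(values) != len(AMINO_ACIDS):
        raise ValueError('expected 20 comma separated frequencies')
    return dict(zip(AMINO_ACIDS, values))


uniform = parse_freq(','.join(['0.05'] * 20))
print(uniform['A'], uniform['V'])
```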

aut

Provides a common interface for performing a topology test (AU test) over protein alignments using various programs.

Users are asked to provide a multiple sequence alignment (MSA) file, a NEWICK format tree file, and a topology test program’s executable. If the tree file contains only one tree, a maximum-likelihood (ML) tree will be inferred and the AU test will be performed to test the difference between the user specified tree and the ML tree. If the tree file contains a set of trees in NEWICK format, only these trees will be evaluated, without reconstructing the ML tree. In both cases, only the p-value of the AU test for the first tree will be returned.

Users are advised to use only the function aut() and to avoid any private functions inside the module. However, users are encouraged to implement new private functions for additional topology test programs that they are interested in and to incorporate them into the general use function aut().
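The branching on the number of trees in the tree file can be sketched as follows. Both helpers are hypothetical and are not taken from ProtParCon; the count assumes every NEWICK tree ends with “;” and that no quoted label contains a semicolon.

```python
def count_trees(text):
    """Count NEWICK trees in text, assuming each tree ends with ';'."""
    return sum(1 for chunk in text.split(';') if chunk.strip())


def needs_ml_tree(text):
    """True when only one tree is given, so an ML tree must be inferred."""
    return count_trees(text) == 1


print(needs_ml_tree('((A,B),C);\n'))
print(needs_ml_tree('((A,B),C);\n((A,C),B);\n'))
```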

ProtParCon.aut._iqtree(exe, msa, tree, model, seed)

Perform topology test (AU test) using IQ-TREE.

Parameters:
  • exe – str, path to a topology test program’s executable.
  • msa – str, path to a FASTA format MSA file.
  • tree – str, path to a NEWICK format tree file.
  • model – str, name of the model or path of the model file.
  • seed – int, the seed used to initialize the random number generator.
Returns:

tuple, p-value of the AU test (float), first tree (string), and second tree (string).

ProtParCon.aut.aut(exe, msa, tree, model='', seed=0, outfile='', verbose=False)

General use function for performing topology test (AU test).

Parameters:
  • exe – str, path to a topology test program’s executable.
  • msa – str, path to a FASTA format MSA file.
  • tree – str, path to a NEWICK format tree file.
  • model – str, name of the model or path of the model file.
  • seed – int, the seed used to initialize the random number generator.
  • outfile – str, path to the output file for storing the test result. If not set, the result is only returned without being saved to a file.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

tuple, p-value of the AU test (float), first tree (string), and second tree (string).

mlt

Provides a common interface for inferring phylogenetic trees from alignments of protein sequences using the maximum-likelihood method.

The minimum requirement for using this module is a multiple sequence alignment (MSA) file and an executable of a maximum-likelihood tree inference program. Users who want full control of the tree inference can also pass optional parameters, depending on the program. If no evolutionary model is provided, the default model (the JTT model for FastTree and the LG model for PhyML) or a model chosen by a model selection procedure (RAxML and IQ-TREE) will be used.

Users are advised to use only the function mlt() and to avoid any private functions in this module. However, users are strongly encouraged to implement new private functions for additional maximum-likelihood tree inference programs that they are interested in and to incorporate them into the general use function mlt().

Todo

It seems FastTree does not accept FASTA format files that have a space between ‘>’ and the sequence name (or ID); FASTA files should be checked before being fed in.
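One way to perform the check mentioned in this Todo is to normalize the header lines before the file is passed on. The sketch below is a suggestion only, not code from ProtParCon: clean_headers() is a hypothetical helper that removes whitespace directly after ‘>’ and leaves all other lines untouched.

```python
def clean_headers(lines):
    """Yield FASTA lines with whitespace after '>' removed."""
    for line in lines:
        if line.startswith('>'):
            # Keep the marker and drop any whitespace that follows it.
            yield '>' + line[1:].lstrip()
        else:
            yield line


print(list(clean_headers(['> seq1\n', 'MKV\n', '>seq2\n'])))
```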

ProtParCon.mlt.MODEL

alias of ProtParCon.mlt.model

ProtParCon.mlt._fasttree(exe, msa, model, cat, gamma, alpha, freq, invp, start_tree, constraint_tree, seed, outfile)

Infer ML phylogenetic tree using FastTree.

ProtParCon.mlt._guess(exe)

Guess the name of a tree inference program according to its executable.

Parameters:
  • exe – str, path to the executable of a maximum-likelihood tree inference program.
Returns:

tuple, name of the program and the corresponding function.

ProtParCon.mlt._iqtree(exe, msa, model, cat, gamma, alpha, freq, invp, start_tree, constraint_tree, seed, outfile)

Infer ML phylogenetic tree using IQ-TREE.

ProtParCon.mlt._phyml(exe, msa, model, cat, gamma, alpha, freq, invp, start_tree, constraint_tree, seed, outfile)

Infer ML phylogenetic tree using PhyML.

ProtParCon.mlt._raxml(exe, msa, model, cat, gamma, alpha, freq, invp, start_tree, constraint_tree, seed, outfile)

Infer ML phylogenetic tree using RAxML.

ProtParCon.mlt.mlt(exe, msa, model='', cat=0, gamma=0, alpha=0.0, freq='empirical', invp=0.0, start_tree='', constraint_tree='', seed=0, outfile='', verbose=False)

Common interface for inferring ML phylogenetic tree.

Parameters:
  • exe – str, path of the executable of the ML tree inference program.
  • msa – str, path of the multiple sequence alignment (FASTA) file.
  • model – str, name of the model or path of the model file.
  • cat – int, invoke the rate heterogeneity (CAT) model and set the number of categories to cat; if the CAT model is not in use, it will be ignored. When FastTree is in use, set cat=None to invoke nocat mode.
  • gamma – int, 0 means the discrete Gamma model is not in use; any positive integer larger than 1 will invoke the discrete Gamma model and set the number of categories to gamma.
  • alpha – float, the Gamma shape parameter alpha. If not set, the value will be estimated by the program; in case an initial value is needed, the initial value of alpha will be set to 0.5.
  • freq – str, the base frequencies of the twenty amino acids. Accepts empirical or estimate, where empirical sets the frequencies to the empirical values associated with the specified substitution model, and estimate uses an ML estimate of the base frequencies.
  • invp – float, proportion of invariable sites.
  • start_tree – str, path of the starting tree file; the tree file must be in NEWICK format.
  • constraint_tree – str, path of the constraint tree file; the tree file must be in NEWICK format.
  • seed – int, the seed used to initialize the random number generator.
  • outfile – str, pathname of the output ML tree. If not set, the default name is [basename].[program].ML.newick, where basename is the filename of the sequence file without its extension, program is the name of the ML inference program, and newick is the extension for NEWICK format tree files.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

path of the maximum-likelihood tree file.

utilities

ProtParCon.utilities.basename(name)

Remove the file extension if the extension is in ENDINGS.

Parameters:
  • name – str, a filename.
Returns:

str, filename without known extensions for tsv, txt, FASTA, and NEWICK as well as some alignment format files.

ProtParCon.utilities.modeling(model)

Parse evolutionary model.

The model needs to be in the format of MODEL+<FreqType>+<RateType> or a model (text) file, where MODEL is a model name (e.g. JTT, WAG, …), FreqType is how the model will handle amino acid frequency (e.g. F, FO, or FQ), and RateType is the rate heterogeneity type. Protein mixture models are not supported.

Parameters:
  • model – str, name of the model or a pathname to the model file.
Returns:

namedtuple, in the order of name, frequency, gamma, rates, invp, and type.

ProtParCon.utilities.trim(msa, fmt='fasta', outfile='', verbose=False)

Remove gaps and ambiguous characters from protein multiple sequence alignment (MSA) file or a dictionary object of MSA records.

Parameters:
  • msa – str or dict, pathname of the protein multiple sequence alignment (MSA) file or a dictionary object of MSA records.
  • fmt – str, format of the MSA file, if msa is the pathname of an MSA file.
  • outfile – str, pathname for saving the trimmed MSA to a file; if not set, trimmed sequences will only be returned without being saved to a file.
  • verbose – bool, invoke verbose or silent process mode, default: False, silent mode.
Returns:

dict, trimmed MSA in a dictionary.

Note

Ambiguous characters (characters not in ARNDCQEGHILKMFPSTWYV) will be treated as gaps and removed. The trimmed alignment file will be saved to outfile in FASTA format if outfile is not an empty string and there are sites left after trimming. Whether or not outfile is set, trim() always returns the trimmed MSA in a dictionary (even an empty dict).

Be careful with PHYLIP format MSA files: trim() uses Bio.AlignIO to read the MSA file, and users are responsible for giving the right format name for PHYLIP format files, e.g. phylip-relaxed or phylip-sequential.
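The trimming rule in this note can be sketched for a dictionary of MSA records. The helper trim_columns() is hypothetical and is not the implementation of trim(). It assumes that trimming is column-wise (a site is dropped when any sequence has a gap or ambiguous character there) and that sequences are upper case and of equal length.

```python
AMINO_ACIDS = set('ARNDCQEGHILKMFPSTWYV')


def trim_columns(records):
    """Drop every alignment column that holds a gap or ambiguous character."""
    names = list(records)
    columns = zip(*(records[name] for name in names))
    # Keep only columns in which every character is a standard amino acid.
    kept = [col for col in columns if all(c in AMINO_ACIDS for c in col)]
    return {name: ''.join(col[i] for col in kept)
            for i, name in enumerate(names)}


print(trim_columns({'a': 'MK-VX', 'b': 'MRAVL'}))
```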

class ProtParCon.utilities.Tree(tree, leave=False)

file(filename, ic=True, ic2name=False, nodes=False, brlen=True)

Write a tree to a file after manipulating.

Parameters:
  • filename – str, the name of the tree output file.
  • ic – bool, whether to keep confidence of internal nodes.
  • ic2name – bool, whether to convert confidence of internal nodes to their names.
  • nodes – bool, discard or keep name of internal nodes.
  • brlen – bool, discard or keep branch lengths.
Returns:

str, a newick tree string.

string(ic=True, ic2name=False, nodes=False, brlen=True)

Get a newick tree string for a tree after manipulating.

Parameters:
  • ic – bool, whether to keep confidence of internal nodes.
  • ic2name – bool, whether to convert confidence of internal nodes to their names.
  • nodes – bool, discard or keep name of internal nodes.
  • brlen – bool, discard or keep branch lengths.
Returns:

str, a newick tree string.