tmkit

Overview

TMKit is an open-source, modular, and scalable Python programming interface designed specifically for processing transmembrane protein data. It is a one-stop computational analysis tool for transmembrane proteins: users can wrangle databases, engineer features at the mutational, domain, and topological levels, and visualize protein-protein interaction interfaces. In addition, TMKit includes seqNetRR, a high-performance computing library for customised construction and rewiring of residue connections. This library is particularly well suited to assigning coevolutionary features quickly.

TMKit offers quality control; I/O and collation of protein sequences and structures; sequence-predicted and structure-derived topologies; multiple sequence alignment generation; generation of canonical TM-specific features; and visualization of protein structures and of interfaces between TM proteins. It also processes functional (CATH), interactome, and mutational databases, with more functionality expected, and supports performance evaluation of residue-residue contact predictions.

1. Functions of TMKit

TMKit provides nine function classes for transmembrane protein sequence and structural analysis: visualization, sequence, quality control, topology, mapping, annotation, connectivity, edge extraction, and feature.

1.1 Modules summary

After installing tmkit (please see how to install it), you can import the library by adding the following line to a Python script or a Jupyter notebook. You can then access the 14 modules covering 9 function classes.

import tmkit as tmk

The table below summarizes the tasks each module can be used for.

#	Tool module	Function class	Note
1	`tmk.fetch`	Quality control	fetch example data
2	`tmk.qc`	Quality control	generate and extract metrics of sequences and structures
3	`tmk.seq`	Sequence	parse sequences and structures
4	`tmk.msa`	Sequence	produce commands for generating multiple sequence alignment
5	`tmk.feature`	Feature	protein biological features
6	`tmk.collate`	Mapping	find differences between RCSB and PDBTM structures
7	`tmk.topo`	Topology	transmembrane protein topologies
8	`tmk.rrc`	Feature	performance evaluation of residue contact prediction
9	`tmk.ppi`	Connectivity	protein connectivity
10	`tmk.mut`	Annotation	transmembrane protein's mutation data processing
11	`tmk.vs`	Visualization	visualize protein structures
12	`tmk.cath`	Annotation	access protein domains and families
13	`tmk.mapping`	Mapping	conversion between protein identifiers
14	`tmk.edge`	Edge extraction	rewiring of connections between residues

1.2 Module functions

The module functions are described briefly in the cards below.

Visualization

Identifying the protein-protein interaction (PPI) interfaces of proteins is critical to understanding the biological processes they govern.

Sequence

The sequence pre-processing module is a fundamental component of TMKit, designed to handle sequence reading in diverse formats, sequence retrieval from various sources, and multiple sequence alignment (MSA) generation.

Quality control

This module evaluates various criteria, including the experimentation methods used, resolution, subclass, and sequence length, to qualify proteins in bulk.
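
The kind of bulk filtering described above can be illustrated with a minimal sketch in plain Python. It does not use the `tmk.qc` API; the `ProteinRecord` layout, the `passes_qc` helper, the thresholds, and the toy records are all hypothetical and exist only to show the idea of qualifying proteins by experimental method, resolution, and sequence length.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ProteinRecord:
    # Hypothetical record layout for illustration; not a TMKit data structure.
    pdb_id: str
    method: str                  # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: Optional[float]  # in angstroms; None when not reported
    seq_len: int                 # number of residues


def passes_qc(rec: ProteinRecord, allowed_methods: Set[str],
              max_resolution: float, min_len: int, max_len: int) -> bool:
    """Return True if the record satisfies every quality criterion."""
    if rec.method not in allowed_methods:
        return False
    # A missing resolution is treated as a failure in this sketch.
    if rec.resolution is None or rec.resolution > max_resolution:
        return False
    return min_len <= rec.seq_len <= max_len


# Toy records with made-up identifiers and values.
records = [
    ProteinRecord("aaaa", "X-RAY DIFFRACTION", 2.1, 310),
    ProteinRecord("bbbb", "SOLUTION NMR", None, 120),
    ProteinRecord("cccc", "X-RAY DIFFRACTION", 4.5, 280),
    ProteinRecord("dddd", "X-RAY DIFFRACTION", 1.8, 25),
]

kept = [r.pdb_id for r in records
        if passes_qc(r, {"X-RAY DIFFRACTION"}, 3.5, 40, 1000)]
print(kept)  # ['aaaa']
```

Only the first record survives: the second fails on method, the third on resolution, and the fourth on sequence length.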

Topology

TMKit can be used to obtain more detailed non-TM topologies, that is, side 1, side 2, strand, coil, inside, loop, and interfacial. Besides the structure-derived topologies, TMKit also supplies predicted topologies by embedding TMHMM and Phobius, which can be run from the command line interface (CLI) or within Python.

Mapping

Identifier mapping between structural and sequence data (e.g., FASTA residue IDs and PDB residue IDs) is an important technical premise to guarantee the correct interpretation of biological findings.
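
To see why this mapping matters, consider a chain whose PDB numbering starts at an offset and skips a residue. The sketch below is not the `tmk.mapping` or `tmk.collate` API; the helper `fasta_to_pdb_map` is hypothetical, and it assumes that the FASTA sequence lists exactly the residues observed in the structure, in the same order, and that PDB residue IDs are plain integers (no insertion codes).

```python
from typing import Dict, List


def fasta_to_pdb_map(pdb_residue_ids: List[int]) -> Dict[int, int]:
    """Map 1-based FASTA positions to PDB residue IDs.

    Assumes one FASTA position per observed residue, in structure order.
    """
    return {pos: pdb_id for pos, pdb_id in enumerate(pdb_residue_ids, start=1)}


# Toy chain: PDB numbering starts at 23 and residue 26 is missing.
pdb_ids = [23, 24, 25, 27, 28]

f2p = fasta_to_pdb_map(pdb_ids)
# Invert the mapping to go from PDB residue IDs back to FASTA positions.
p2f = {pdb_id: pos for pos, pdb_id in f2p.items()}

print(f2p[4])   # 27
print(p2f[28])  # 5
```

Without such a lookup, a feature computed at FASTA position 4 would be wrongly attributed to PDB residue 4, which does not exist in this toy chain.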

Annotation

Amino acid residues of transmembrane proteins that are involved in mutations or functional domains can be annotated through the MutHTP, Pred-MutHTP, and CATH databases.

Connectivity

Studying the connections of a protein to others in a PPI network is crucial to understanding its biological role.

Edge extraction

We provide a high-performance computing library for extracting connections between residues by constructing bipartite and unipartite graphs (where residue connections are treated as edges) and assigning features in linear time with respect to the number of residues used.
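
The graph terminology above can be made concrete with a small sketch. This is not the seqNetRR implementation or its API; the function names (`unipartite_edges`, `bipartite_edges`, `window`) and the `min_separation` and `half_width` parameters are hypothetical. A unipartite graph pairs residues within one chain, a bipartite graph pairs residues across two chains, and a fixed-size window around a residue is one common way to gather neighbouring positions for feature assignment.

```python
from itertools import combinations, product
from typing import List, Optional, Tuple


def unipartite_edges(n_residues: int, min_separation: int = 1) -> List[Tuple[int, int]]:
    """Residue pairs (i, j) with i < j within a single chain (1-based)."""
    return [(i, j) for i, j in combinations(range(1, n_residues + 1), 2)
            if j - i >= min_separation]


def bipartite_edges(n_a: int, n_b: int) -> List[Tuple[int, int]]:
    """Residue pairs (i, j) with i in chain A and j in chain B (1-based)."""
    return list(product(range(1, n_a + 1), range(1, n_b + 1)))


def window(position: int, n_residues: int, half_width: int) -> List[Optional[int]]:
    """Positions in a window centred on `position`; None marks out-of-range slots."""
    return [p if 1 <= p <= n_residues else None
            for p in range(position - half_width, position + half_width + 1)]


print(unipartite_edges(4, min_separation=2))  # [(1, 3), (1, 4), (2, 4)]
print(len(bipartite_edges(2, 3)))             # 6
print(window(1, 5, 1))                        # [None, 1, 2]
```

Because the window has a fixed size, collecting neighbours for each residue costs a constant amount of work per residue in this sketch.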

Feature

A set of transmembrane protein-specific and general-purpose features is provided by TMKit in support of machine learning modelling.