Functionality Guide
This guide documents every major pdb_cpp workflow with fully worked
examples. For copy-paste snippets see Quick Recipes;
for the auto-generated API reference see pdb_cpp package.
1. Reading and writing structures
Supported formats
Format |
Extension |
Read |
Write |
Notes |
|---|---|---|---|---|
PDB |
|
yes |
yes |
Bond topology via |
mmCIF |
|
yes |
yes |
Bond topology via |
PQR |
|
yes |
yes |
Charge in |
GRO |
|
yes |
yes |
GROMACS format; coordinates converted nm ↔ Å |
Loading from a local file
from pdb_cpp import Coor
coor = Coor("structure.cif") # auto-detects format by extension
coor = Coor("structure.pdb")
The format can also be forced explicitly with the format keyword.
This is useful when the file extension is absent or misleading — for
example a compressed pipeline that writes to a generic filename:
from pdb_cpp import Coor
# Via the constructor
coor = Coor("structure.dat", format="pdb") # treat as PDB
coor = Coor("structure.dat", format="cif") # treat as mmCIF
coor = Coor("structure.dat", format="pqr") # treat as PQR
coor = Coor("structure.dat", format="gro") # treat as GROMACS GRO
# Via read() on an existing object
coor = Coor()
coor.read("structure.dat", format="cif")
Accepted format values: "pdb", "cif", "pqr", "gro".
format defaults to "", which means infer from extension — the same
behaviour as before.
Fetching from the PDB archive
coor = Coor(pdb_id="1y0m") # downloads mmCIF, caches in /tmp/pdb_cpp_cache/
The cached file is reused on subsequent calls with the same PDB ID.
Interaction analysis
Interface-oriented tools live directly in the analysis submodules:
pdb_cpp.analysis.hbonds.hbonds()for hydrogen bondspdb_cpp.analysis.salt_bridge.salt_bridges()for ionic contacts / salt bridgespdb_cpp.analysis.sasa.buried_surface_area()for buried surface and interface areapdb_cpp.analysis.sasa.shape_complementarity()for Lawrence-Colman style interface shape complementarity
from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds, salt_bridge, sasa
coor = Coor("tests/input/1A0A.cif")
hb = hbonds.hbonds(coor, donor_sel="protein", acceptor_sel="nucleic")
sb = salt_bridge.salt_bridges(coor, cation_sel="protein", anion_sel="nucleic")
iface = sasa.buried_surface_area(
coor,
receptor_sel="chain C D",
ligand_sel="chain A B",
)[0]
print(len(hb[0]), len(sb[0]), iface["interface_area"])
Loading through the rcsb module
The pdb_cpp.rcsb module gives explicit control over which RCSB coordinate
file is downloaded.
from pdb_cpp import rcsb
# Deposited asymmetric unit
asym_unit = rcsb.load("1y0m", structure="asymmetric_unit")
# Biological assembly 1 in mmCIF format
assembly_1 = rcsb.load("5a9z", structure="biological_assembly", assembly_id=1)
# Download without loading, using a custom cache directory
local_path = rcsb.download(
"5a9z",
structure="biological_assembly",
assembly_id=1,
cache_dir="/tmp/pdb_cpp_rcsb",
)
Supported structure values are:
"asymmetric_unit": the deposited entry file, for example1y0m.cif"biological_assembly": an assembly-specific RCSB file, for example5a9z-assembly1.cif
Supported file_format values are "cif" and "pdb". For biological
assemblies, the RCSB URL pattern differs by format:
mmCIF:
<pdb_id>-assembly<assembly_id>.cifPDB:
<pdb_id>.pdb<assembly_id>
If you prefer the older convenience API, Coor(pdb_id="...") still works and
now accepts extra RCSB-related keyword arguments:
from pdb_cpp import Coor
coor = Coor(
pdb_id="5a9z",
rcsb_structure="biological_assembly",
assembly_id=1,
cache_dir="/tmp/pdb_cpp_rcsb",
force_download=False,
)
This path currently reads the downloaded file as returned by RCSB. It does not
construct assemblies locally from _pdbx_struct_assembly_gen; instead it relies
on the explicit assembly files published by RCSB.
Writing output
coor.write("output.pdb") # PDB format
coor.write("output.cif") # mmCIF format
coor.write("output.pqr") # PQR format
coor.write("output.gro") # GRO format
The format is determined by the file extension. Selections can also be written directly:
sel = coor.select_atoms("chain A")
sel.write("chain_A.pdb")
2. The Coor and Model objects
A Coor object holds one or more Model frames. Many properties are
available on both classes via Python property accessors.
Key Coor properties
Property |
Type |
Description |
|---|---|---|
|
|
Number of atoms in the active model |
|
|
Number of models |
|
|
All models |
|
|
Index of the active model (read/write) |
|
|
CONECT bond records (atom number → bonded atom numbers) |
Coor methods
Method |
Returns |
Description |
|---|---|---|
|
|
Select atoms by string query (see §3) |
|
|
Select atoms by boolean mask (one bool per atom) |
|
|
Atom indices matching a selection string |
|
|
Read a file into an existing |
|
— |
Write to PDB/mmCIF/PQR/GRO (by extension) |
|
— |
Append a |
|
— |
Remove all models |
|
|
Unique chain IDs (char arrays) |
|
|
Unique chain IDs as Python strings |
|
|
Amino acid sequences per chain |
|
|
D/L amino acid sequences (D as lowercase) |
|
|
Amino acid + nucleic acid sequences |
|
|
Remove residues missing backbone atoms |
Model methods
Method |
Returns |
Description |
|---|---|---|
|
|
Boolean mask of atoms matching selection |
|
|
SASA total with optional per-atom/per-residue data |
|
|
Add a single atom (see below) |
|
— |
Remove all atoms from this model |
|
|
Centroid of all atoms, or of given indices |
Note that Model.select_atoms() returns a boolean mask (not a new
object), unlike Coor.select_atoms() which returns a new Coor.
Model.addAtom() takes positional arguments:
num, name, resname, resid, chain, x, y, z, occ, beta, altloc, elem, insertres, field, uniq_resid.
Per-atom properties (on both Coor and Model)
These properties return data for the active model when called on Coor,
or directly when called on a Model.
Property |
Type |
Description |
|---|---|---|
|
|
Cartesian coordinates |
|
|
Coordinates as a NumPy array |
|
|
Atom names (fixed-length char arrays) |
|
|
Atom names as Python strings |
|
|
Residue names (char arrays) |
|
|
Residue names as strings |
|
|
Residue sequence numbers |
|
|
Unique (0-based) residue identifiers |
|
|
Chain identifiers (char arrays) |
|
|
Chain identifiers as strings |
|
|
Atom serial numbers |
|
|
B-factors (temperature factors) |
|
|
Occupancies |
|
|
Element symbols |
|
|
Alternate location indicators |
|
|
Residue insertion codes |
coor = Coor("tests/input/1y0m.cif")
# Access via Coor (delegates to active model)
print(coor.x[:5])
print(coor.chain_str[:5])
print(coor.xyz.shape)
# Access a specific model directly
model_0 = coor.models[0]
print(model_0.name_str[:5])
print(model_0.beta[:5])
Multi-model structures
coor = Coor("tests/input/2rri.pdb")
print(coor.model_num) # e.g. 20 models (NMR ensemble)
# Switch active model
coor.active_model = 5
print(coor.x[:3]) # coordinates from model 5
# Iterate over all models
for i, model in enumerate(coor.models):
print(f"Model {i}: {model.len} atoms")
Centroid calculation
model = coor.models[0]
# Centroid of all atoms
centroid = model.get_centroid()
# Centroid of specific atom indices
ca_indices = coor.get_index_select("name CA")
centroid_ca = model.get_centroid(ca_indices)
Boolean-mask selection
When you already have a boolean array (e.g. from NumPy operations),
use select_bool_index instead of building a string query:
import numpy as np
coor = Coor("tests/input/1y0m.cif")
mask = np.array(coor.beta) < 30.0 # low B-factor atoms
low_b = coor.select_bool_index(mask.tolist())
print(f"{low_b.len} atoms with B < 30")
3.1 SASA and interface SASA
sasa.sasa() computes solvent-accessible surface area for each model in a
Coor object. It returns one dictionary per model, each with at least
total, polar, and apolar, and optionally atom_areas and
residue_areas.
SASA for one chain or selection
from pdb_cpp import Coor
from pdb_cpp.analysis import sasa
coor = Coor("tests/input/1a2k.pdb")
result = sasa.sasa(coor, selection="chain A", by_residue=True)[0]
print(f"Total SASA: {result['total']:.2f} A^2")
print(f"Polar SASA: {result['polar']:.2f} A^2")
print(f"Apolar SASA: {result['apolar']:.2f} A^2")
print(result["residue_areas"][:3])
The polar/apolar split is FreeSASA-like: atom areas are partitioned from the
per-atom SASA using a simple element-based classifier, with N, O, S,
P, and Se treated as polar atoms.
Protein-protein interface SASA
Use pdb_cpp.analysis.sasa.buried_surface_area() to evaluate two partners inside one
complex. This helper computes three SASA values internally:
receptor alone
ligand alone
receptor + ligand together as the complex
The returned interface metrics follow the standard convention:
buried_surface = sasa_receptor + sasa_ligand - sasa_complexinterface_area = buried_surface / 2
from pdb_cpp import Coor
from pdb_cpp.analysis import sasa
coor = Coor("tests/input/1a2k.pdb")
interface = sasa.buried_surface_area(
coor,
receptor_sel="chain A",
ligand_sel="chain B",
by_residue=True,
)[0]
print(f"Complex SASA : {interface['complex_sasa']:.2f} A^2")
print(f"Complex polar : {interface['complex_polar_sasa']:.2f} A^2")
print(f"Complex apolar : {interface['complex_apolar_sasa']:.2f} A^2")
print(f"Receptor SASA : {interface['receptor_sasa']:.2f} A^2")
print(f"Ligand SASA : {interface['ligand_sasa']:.2f} A^2")
print(f"Buried surface : {interface['buried_surface']:.2f} A^2")
print(f"Buried polar : {interface['buried_polar_surface']:.2f} A^2")
print(f"Buried apolar : {interface['buried_apolar_surface']:.2f} A^2")
print(f"Interface area : {interface['interface_area']:.2f} A^2")
print(f"Interface polar: {interface['interface_polar_area']:.2f} A^2")
print(f"Interface apol.: {interface['interface_apolar_area']:.2f} A^2")
for residue in interface["residue_buried_surface"]:
print(
residue["partner"],
residue["chain"],
residue["resname"],
residue["resid"],
residue["buried_area"],
)
The residue_buried_surface list contains one entry per residue from the
receptor and ligand selections, with:
isolated_area: SASA of that residue in the isolated partnercomplex_area: SASA of that residue inside the full complexburied_area: difference between the two
For residue-level SASA, each residue entry also contains polar_area and
apolar_area; interface residue burial entries additionally expose
buried_polar_area and buried_apolar_area.
Shape complementarity
pdb_cpp.analysis.sasa.shape_complementarity() estimates a Lawrence-Colman style
shape complementarity score from rolling-probe surface dots and opposing
surface normals. Each accessible interface-facing dot on one partner is matched
to its nearest dot on the other partner within a configurable interface radius
using a voxel hash, and the score aggregates dot(n_a, -n_b) over the matched
interface dots.
Warning
shape_complementarity() is currently experimental and has not yet been
validated against established reference implementations or benchmark datasets.
Use this score with caution for quantitative conclusions.
from pdb_cpp import Coor
from pdb_cpp.analysis import sasa
coor = Coor("tests/input/1a2k.pdb")
sc = sasa.shape_complementarity(
coor,
receptor_sel="chain A",
ligand_sel="chain B",
dots_per_sq_angstrom=12.0,
search_radius=1.5,
)[0]
print(f"Sc: {sc['shape_complementarity']:.3f}")
print(f"Interface dot pairs: {sc['interface_dot_pairs']}")
print(f"Mean paired distance: {sc['mean_interface_distance']:.3f} A")
Scores are bounded to [-1, 1], with values closer to 1 indicating more
complementary opposing surface normals.
Important parameters:
dots_per_sq_angstrom: sampling density of the rolling-probe surfacesearch_radius: maximum nearest-dot pairing distance across the interfacereducer: use"trimmed_mean"(default) or"median"to aggregate interface dot scorestrim_fraction: fraction removed from each tail whenreducer="trimmed_mean"
The returned dictionary also includes the per-frame receptor_shape_complementarity,
ligand_shape_complementarity, interface_dot_pairs, and
mean_interface_distance diagnostics.
CONECT records
CONECT records (covalent bonds) are stored as a dictionary mapping atom numbers to lists of bonded atom numbers. Bond topology is read from and written to both supported formats:
PDB:
CONECTlines (hybrid-36 encoded atom serial numbers)mmCIF:
_struct_connloop (anyconn_type_idis accepted on read;covaleis written on output)
from pdb_cpp import Coor
# Load a structure with covalent bonds (ligands, zinc coordination, etc.)
coor = Coor("tests/input/1u85.pdb")
# Inspect bonds
for atom_num, bonded in list(coor.conect.items())[:3]:
print(f"Atom {atom_num} bonded to {bonded}")
# Bonds survive PDB and mmCIF round-trips
coor.write("output_with_conect.pdb") # writes CONECT lines
coor.write("output_with_conect.cif") # writes _struct_conn loop
Note
mmCIF files distributed by the PDB typically store only inter-residue bonds
in _struct_conn (disulfides, metal coordination, covalent cross-links).
Intra-ligand bonds may be absent unless the file was produced by pdb_cpp.
Programmatic construction
You can build a Coor from scratch or append models:
from pdb_cpp import Coor
coor_1 = Coor("tests/input/1y0m.cif")
coor_2 = Coor("tests/input/1y0m.cif")
# Append all models from coor_2 into coor_1
for model in coor_2.models:
coor_1.add_Model(model)
print(coor_1.model_num) # doubled
# Clear everything
coor_1.clear()
print(coor_1.model_num) # 0
3. Atom selection
Selection syntax
Selections are performed with coor.select_atoms(selection_string) which
returns a new Coor object containing only the matching atoms.
Field selectors
Keyword |
Matches |
Example |
|---|---|---|
|
Atom name |
|
|
Residue name |
|
|
Chain identifier |
|
|
Residue sequence number |
|
|
Unique residue index (0-based) |
|
|
Alternate location indicator |
|
|
Coordinate value |
|
|
B-factor |
|
|
Occupancy |
|
Shortcut keywords
Keyword |
Equivalent to |
|---|---|
|
Standard amino acid residue names |
|
|
|
Non-hydrogen atoms |
|
RNA and DNA residue names |
|
RNA residue names ( |
|
DNA residue names ( |
Logical operatorsfrom pdb_cpp import Coor, alignment
coor_1 = Coor(“tests/input/1rxz_colabfold_model_1.pdb”) coor_2 = Coor(“tests/input/1rxz.pdb”)
rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2) print(“Best RMSD:”, rmsds[0]) print(“Returned mapping tuple lengths:”, len(mappings[0]), len(mappings[1]))
and— intersectionor— unionnot— negation
Comparison operators
==, !=, <, <=, >, >= — used with numeric fields (resid,
residue, x, y, z, beta, occ).
Spatial queries
within <distance> of <sub-selection>
Selects atoms within distance Ångströms of any atom matching the
sub-selection.
Selection examples
from pdb_cpp import Coor
coor = Coor("tests/input/1rxz.pdb")
# Chain selection
chain_a = coor.select_atoms("chain A")
# Backbone of a residue range
seg = coor.select_atoms("backbone and chain A and residue >= 6 and residue <= 58")
# Interface: atoms of A within 5 Å of B
interface = coor.select_atoms("chain A and within 5.0 of chain B")
# Exclude water and alternate locations
clean = coor.select_atoms("protein and not altloc B C D and not resname HOH")
# Numeric coordinate filter
slab = coor.select_atoms("name CA and x >= 10.0 and x <= 30.0")
Getting indices without creating a new Coor
indices = coor.get_index_select("name CA and chain A")
# Returns list[int] of atom indices in the active model
Frame-specific selections (multi-model)
# Select using coordinates from a specific model frame
sel = coor.select_atoms("within 5.0 of chain A", frame=3)
4. Sequence extraction
Amino acid sequences
seqs = coor.get_aa_seq()
# Returns dict[str, str]: chain ID -> one-letter sequence
print(seqs["A"])
With gap insertion for missing residues:
seqs = coor.get_aa_seq(gap_in_seq=True) # default: gaps inserted
seqs = coor.get_aa_seq(gap_in_seq=False) # no gap characters
D/L amino acid sequences
dl_seqs = coor.get_aa_DL_seq()
# D-amino acids are returned as lowercase letters
Amino acid and nucleic acid sequences
all_seqs = coor.get_aa_na_seq()
# Includes both protein and nucleic acid chains
Via the sequence module
from pdb_cpp import sequence
seqs = sequence.get_aa_seq(coor)
dl_seqs = sequence.get_aa_DL_seq(coor)
5. Sequence alignment
pdb_cpp.alignment provides pairwise sequence alignment using a BLOSUM62
substitution matrix and affine gap penalties.
from pdb_cpp import alignment
seq_1 = "ACDEFGHIKLMNPQRSTVWY"
seq_2 = "ACDFGHIKLMNPQRSTVWY" # missing E
aln_1, aln_2, score = alignment.align_seq(seq_1, seq_2)
print(aln_1)
print(aln_2)
print(f"Score: {score}")
# Pretty-print with identity/similarity stats
alignment.print_align_seq(aln_1, aln_2)
Parameters
Parameter |
Default |
Description |
|---|---|---|
|
-11 |
Gap opening penalty |
|
-1 |
Gap extension penalty |
|
|
Custom substitution matrix file; defaults to bundled BLOSUM62 |
Low-level C++ alignment: core.sequence_align
The alignment.align_seq() function is a Python wrapper around the C++
core.sequence_align() function, which returns an Alignment_cpp object:
from pdb_cpp import core
result = core.sequence_align("ACDEFG", "ACDEG")
print(result.seq1) # aligned sequence 1 (with gaps)
print(result.seq2) # aligned sequence 2 (with gaps)
print(result.score) # alignment score (int)
Alignment_cpp fields
Field |
Type |
Description |
|---|---|---|
|
|
Aligned sequence 1 (with gaps) |
|
|
Aligned sequence 2 (with gaps) |
|
|
Raw alignment score |
6. Structural alignment (RMSD-based)
Step-by-step alignment
from pdb_cpp import Coor, core, analysis
coor_1 = Coor("tests/input/1u85.pdb")
coor_2 = Coor("tests/input/1ubd.pdb")
# 1. Find common atoms by sequence alignment
idx_1, idx_2 = core.get_common_atoms(
coor_1, coor_2,
chain_1=["A"], chain_2=["C"],
back_names=["C", "N", "O", "CA"], # backbone atoms
)
# 2. Superpose coor_1 onto coor_2 (modifies coor_1 in-place)
core.coor_align(coor_1, coor_2, idx_1, idx_2, frame_ref=0)
# 3. Compute RMSD
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])
print(f"RMSD: {rmsd_values[0]:.3f} Å")
One-step sequence-based alignment
rmsd_list, align_idx_1, align_idx_2 = core.align_seq_based(
coor_1, coor_2,
chain_1=["A"], chain_2=["C"],
)
print(f"RMSD: {rmsd_list[0]:.3f} Å")
This combines get_common_atoms, coor_align, and RMSD in a single call.
Chain-permutation alignment (multi-chain complexes)
When chain correspondence is unknown:
from pdb_cpp import Coor, alignment
coor_1 = Coor("tests/input/1rxz_colabfold_model_1.pdb")
coor_2 = Coor("tests/input/1rxz.pdb")
rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2)
print("Best RMSD:", rmsds[0])
print("Returned mapping tuple lengths:", len(mappings[0]), len(mappings[1]))
This tries all possible chain pairings and returns the best RMSD.
RMSD functions
The analysis API is also grouped into submodules, so both of these patterns are supported:
from pdb_cpp import analysis
from pdb_cpp.analysis import dockq, sasa, hbonds
from pdb_cpp import analysis
# RMSD using a selection string
rmsd_values = analysis.rmsd(coor_1, coor_2, selection="name CA")
# RMSD with explicit index lists (no re-selection)
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])
7. TM-align and TM-score
TM-align provides length-independent structural similarity scoring via the bundled USalign implementation.
from pdb_cpp import Coor, geom
from pdb_cpp.core import tmalign_ca
coor_1 = Coor("tests/input/1y0m.cif")
coor_2 = Coor("tests/input/1ubd.pdb")
result = tmalign_ca(
coor_1, coor_2,
chain_1=["A"], chain_2=["C"],
mm=0,
include_transform=True,
)
print(result.rotation)
print(result.translation)
# Apply x' = R x + t to all atoms of the mobile structure
coor_1.xyz = geom.apply_transform(coor_1.xyz, result.rotation, result.translation)
TMalignResult fields
Field |
Type |
Description |
|---|---|---|
|
|
TM-score normalized by length of structure 1 |
|
|
TM-score normalized by length of structure 2 |
|
|
TM-score normalized by aligned length |
|
|
RMSD of aligned residues |
|
|
Number of aligned residues |
|
|
Number of identical aligned residues |
|
|
Aligned sequence of structure 1 |
|
|
Aligned sequence of structure 2 |
|
|
Alignment match string ( |
|
|
3x3 rotation matrix (when |
|
|
Length-3 translation vector (when |
The mm parameter
Value |
Meaning |
|---|---|
0 |
Monomer TM-align |
1 |
Monomer TM-align (same) |
Multi-chain alignment
For multi-chain alignment with chain-pairing, pass multiple chains:
result = tmalign_ca(
coor_1, coor_2,
chain_1=["A", "B"], chain_2=["A", "B"],
mm=0,
)
Citation
If you use the TM-align functionality, please cite:
Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang (2022) Nat Methods. 19(9), 1109-1115.
Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.
8. Secondary structure assignment
Uses the TM-align secondary structure assignment algorithm (DSSP-like).
from pdb_cpp import Coor, TMalign
coor = Coor("tests/input/1y0m.cif")
ss_list = TMalign.compute_secondary_structure(coor)
Returns a list (one entry per model) of dictionaries mapping chain ID to a secondary structure string.
Secondary structure codes
Code |
Meaning |
|---|---|
|
Helix |
|
Strand (sheet) |
|
Turn |
|
Coil / other |
for chain_id, ss_string in ss_list[0].items():
print(f"Chain {chain_id}: {ss_string}")
Low-level: core.compute_SS
The TMalign.compute_secondary_structure() wraps the C++ function
core.compute_SS(), which returns secondary structure strings directly:
from pdb_cpp import Coor, core
coor = Coor("tests/input/1y0m.cif")
ss = core.compute_SS(coor, gap_in_seq=False)
# Returns list[list[str]]: one list of SS strings per model
9. DockQ scoring
DockQ evaluates docking model quality by combining interface contacts, ligand RMSD, and interface RMSD into a single score.
Basic usage (automatic chain-role inference)
from pdb_cpp import Coor, analysis
model = Coor("tests/input/1rxz_colabfold_model_1.pdb")
native = Coor("tests/input/1rxz.pdb")
scores = analysis.dockQ(model, native)
With explicit chain roles
scores = analysis.dockQ(
model, native,
rec_chains=["B"],
lig_chains=["C"],
native_rec_chains=["A"],
native_lig_chains=["B"],
)
Command-line multimer scoring
The installed pdb_cpp_dockq executable runs analysis.dockQ_multimer().
This is the same multimer code path used in the benchmark script, including
automatic native-to-model chain-map discovery when --chain-map is omitted.
pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb
To provide an explicit mapping, pass native:model pairs:
pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb --chain-map A:B,B:A,C:C
DockQ output dictionary
Key |
Type |
Description |
|---|---|---|
|
|
Combined DockQ score (0-1) |
|
|
Fraction of native contacts |
|
|
Fraction of non-native contacts |
|
|
Ligand RMSD (Å) |
|
|
Interface RMSD (Å) |
|
|
Receptor RMSD (Å) |
Each value is a list with one entry per model in the input Coor.
DockQ quality thresholds
DockQ range |
Quality |
|---|---|
0.00 – 0.23 |
Incorrect |
0.23 – 0.49 |
Acceptable |
0.49 – 0.80 |
Medium |
0.80 – 1.00 |
High |
Individual metric functions
# Fraction of native/non-native contacts
fnat_list, fnonnat_list = analysis.native_contact(
model, native,
rec_chains=["A"], lig_chains=["B"],
native_rec_chains=["A"], native_lig_chains=["B"],
cutoff=5.0,
)
# Interface RMSD
irmsd_list = analysis.interface_rmsd(
model, native,
rec_chains_native=["A"],
lig_chains_native=["B"],
cutoff=10.0,
)
native_contact parameters
Parameter |
Default |
Description |
|---|---|---|
|
— |
Model |
|
— |
Native |
|
— |
Model receptor chain IDs |
|
— |
Model ligand chain IDs |
|
— |
Native receptor chain IDs |
|
— |
Native ligand chain IDs |
|
|
Contact distance cutoff (Å) |
|
|
|
|
|
|
The residue_id_map parameters are useful when model and native have
different residue numbering — provide dictionaries that map each residue
uniq_resid to a common numbering scheme.
interface_rmsd notes
interface_rmsd returns a list of floats with one entry per model. Entries
are None when no interface residues are found within the cutoff distance.
Citation
If you use DockQ scoring, please cite:
Mirabello, C. & Wallner, B. (2024) Bioinformatics. DOI: 10.1093/bioinformatics/btae586
10. Hydrogen bond detection
pdb_cpp.analysis.hbonds detects hydrogen bonds using a purely geometric approach
(Baker & Hubbard, 1984). For structures without explicit hydrogen atoms,
backbone NH and sidechain hydrogen positions are reconstructed
algebraically. The algorithm covers all standard L/D amino acids and RNA/DNA
nucleotides.
Basic usage
from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds
coor = Coor("tests/input/2rri.cif")
# All protein–protein H-bonds for every model
all_bonds = hbonds.hbonds(coor)
# List of HBond objects for model 0
bonds_model0 = all_bonds[0]
print(f"Model 0: {len(bonds_model0)} hydrogen bonds")
for b in bonds_model0[:3]:
print(b)
Function signature
hbonds.hbonds(
coor,
donor_sel = "protein",
acceptor_sel = "protein",
dist_DA_cutoff = 3.5, # D···A distance cutoff (Å)
dist_HA_cutoff = 2.5, # H···A distance cutoff (Å)
angle_cutoff = 90.0, # D−H···A angle cutoff (°)
) -> list[list[HBond]]
Returns a list with one inner list (of HBond objects) per model frame.
Parameters
Parameter |
Default |
Description |
|---|---|---|
|
— |
A |
|
|
Selection string for donor-containing atoms |
|
|
Selection string for acceptor-containing atoms |
|
|
Maximum heavy-atom donor–acceptor distance (Å) |
|
|
Maximum hydrogen–acceptor distance (Å) |
|
|
Minimum D−H···A angle (°) |
HBond object
Each entry in the returned lists is an HBond object with the following
read-only attributes:
Attribute |
Type |
Description |
|---|---|---|
|
|
Donor residue sequence number |
|
|
Donor residue name |
|
|
Donor chain identifier |
|
|
Donor heavy-atom name (e.g. |
|
|
Donor hydrogen atom name (e.g. |
|
|
Donor heavy-atom coordinates (x, y, z) |
|
|
Donor hydrogen coordinates (x, y, z) |
|
|
Acceptor residue sequence number |
|
|
Acceptor residue name |
|
|
Acceptor chain identifier |
|
|
Acceptor atom name (e.g. |
|
|
Acceptor atom coordinates (x, y, z) |
|
|
D···A distance (Å) |
|
|
H···A distance (Å) |
|
|
D−H···A angle (°) |
Filtering and analysis examples
from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds
coor = Coor("tests/input/2rri.cif")
bonds = hbonds.hbonds(coor)[0]
# Backbone–backbone H-bonds only
bb_names = {"N", "O"}
bb_bonds = [b for b in bonds
if b.donor_heavy_name in bb_names
and b.acceptor_name in bb_names]
# Inter-chain H-bonds (cross-chain contacts)
inter = [b for b in bonds if b.donor_chain != b.acceptor_chain]
print(f"{len(inter)} inter-chain H-bonds")
# Sort by D–A distance
bonds.sort(key=lambda b: b.dist_DA)
for b in bonds[:5]:
print(f"{b.donor_chain}{b.donor_resid}{b.donor_resname:>4}"
f" {b.donor_heavy_name} → "
f"{b.acceptor_chain}{b.acceptor_resid}{b.acceptor_resname:>4}"
f" {b.acceptor_name} d(DA)={b.dist_DA:.2f} Å"
f" angle={b.angle_DHA:.1f}°")
Cross-selection H-bonds (e.g. protein–nucleic)
The donor_sel and acceptor_sel parameters accept any valid selection
string, enabling inter-molecular or cross-type analysis:
# Protein side-chains donating to RNA acceptors
rna_contacts = hbonds.hbonds(
coor,
donor_sel = "protein",
acceptor_sel = "nucleic",
)
# H-bonds between two specific chains
ab_bonds = hbonds.hbonds(coor, donor_sel="chain A", acceptor_sel="chain B")
Baker & Hubbard criteria
The default thresholds reproduce the original Baker & Hubbard (1984) geometric criteria:
A stricter cut-off of 120° (angle_cutoff=120) is closer to values used
by biotite and is recommended when comparing with other tools.
11. Geometry utilities
Distance matrix
from pdb_cpp import Coor, geom
coor = Coor("tests/input/1y0m.cif")
ca = coor.select_atoms("name CA")
# Pairwise distance matrix between two coordinate sets
dmat = geom.distance_matrix(ca.xyz, ca.xyz)
print(dmat.shape) # (N_ca, N_ca)
The function accepts any two Coor objects (or selections) and returns an
(N, M) float32 NumPy array of Euclidean distances.
Hybrid-36 encoding/decoding
PDB format uses hybrid-36 encoding for atom/residue numbers > 99999:
from pdb_cpp.core import hy36encode, hy36decode
encoded = hy36encode(5, 100000) # "A0000"
decoded = hy36decode(5, "A0000") # 100000
12. Data cleaning utilities
Remove incomplete backbone residues
Residues missing any backbone atom (CA, C, N, O by default) are removed:
from pdb_cpp import select
clean_coor = select.remove_incomplete_backbone_residues(coor)
# Or with custom backbone definition
clean_coor = select.remove_incomplete_backbone_residues(
coor, back_atom=["CA", "C", "N"]
)
This is recommended before structural alignment to avoid mismatches.
13. Module summary
Module |
Key functions / classes |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
14. Data subpackage (pdb_cpp.data)
The pdb_cpp.data module provides residue dictionaries and the BLOSUM62
substitution matrix used internally by the alignment routines.
BLOSUM62 matrix
from pdb_cpp.data.blosum import BLOSUM62
# Substitution score for two amino acids
print(BLOSUM62[("A", "A")]) # 4
print(BLOSUM62[("A", "W")]) # -3
# Load a custom matrix from file
from pdb_cpp.data.blosum import load_blosum
custom = load_blosum("/path/to/matrix.txt")
BLOSUM62 is a lazy-loaded dictionary mapping (aa1, aa2) tuples to
integer substitution scores.
Residue dictionaries
from pdb_cpp.data.res_dict import AA_DICT, AA_DICT_L, AA_DICT_D, NA_DICT
# 3-letter to 1-letter amino acid code
print(AA_DICT["ALA"]) # "A"
print(AA_DICT["GLY"]) # "G"
# D-amino acids (e.g. DAL -> "A")
print(AA_DICT_D["DAL"]) # "A"
# Nucleic acids
print(NA_DICT["DA"]) # "A"
# Combined amino acid + nucleic acid dictionary
from pdb_cpp.data.res_dict import AA_NA_DICT
Dictionary |
Contents |
|---|---|
|
Standard L-amino acids (31 entries incl. protonation variants) |
|
D-amino acids (22 entries) |
|
Combined L + D amino acids |
|
DNA nucleotides (DA, DT, DC, DG) |
|
Combined amino acids + nucleotides |