Functionality Guide

This guide documents every major pdb_cpp workflow with fully worked examples. For copy-paste snippets see Quick Recipes; for the auto-generated API reference see pdb_cpp package.


1. Reading and writing structures

Supported formats

Format

Extension

Read

Write

Notes

PDB

.pdb

yes

yes

Bond topology via CONECT lines

mmCIF

.cif

yes

yes

Bond topology via _struct_conn loop

PQR

.pqr

yes

yes

Charge in occ, radius in beta slot

GRO

.gro

yes

yes

GROMACS format; coordinates converted nm ↔ Å

Loading from a local file

from pdb_cpp import Coor

coor = Coor("structure.cif")   # auto-detects format by extension
coor = Coor("structure.pdb")

The format can also be forced explicitly with the format keyword. This is useful when the file extension is absent or misleading — for example a compressed pipeline that writes to a generic filename:

from pdb_cpp import Coor

# Via the constructor
coor = Coor("structure.dat", format="pdb")   # treat as PDB
coor = Coor("structure.dat", format="cif")   # treat as mmCIF
coor = Coor("structure.dat", format="pqr")   # treat as PQR
coor = Coor("structure.dat", format="gro")   # treat as GROMACS GRO

# Via read() on an existing object
coor = Coor()
coor.read("structure.dat", format="cif")

Accepted format values: "pdb", "cif", "pqr", "gro". format defaults to "", which means infer from extension — the same behaviour as before.

Fetching from the PDB archive

coor = Coor(pdb_id="1y0m")    # downloads mmCIF, caches in /tmp/pdb_cpp_cache/

The cached file is reused on subsequent calls with the same PDB ID.


Interaction analysis

Interface-oriented tools live directly in the analysis submodules:

  • pdb_cpp.analysis.hbonds.hbonds() for hydrogen bonds

  • pdb_cpp.analysis.salt_bridge.salt_bridges() for ionic contacts / salt bridges

  • pdb_cpp.analysis.sasa.buried_surface_area() for buried surface and interface area

  • pdb_cpp.analysis.sasa.shape_complementarity() for Lawrence-Colman style interface shape complementarity

from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds, salt_bridge, sasa

coor = Coor("tests/input/1A0A.cif")

hb = hbonds.hbonds(coor, donor_sel="protein", acceptor_sel="nucleic")
sb = salt_bridge.salt_bridges(coor, cation_sel="protein", anion_sel="nucleic")
iface = sasa.buried_surface_area(
    coor,
    receptor_sel="chain C D",
    ligand_sel="chain A B",
)[0]

print(len(hb[0]), len(sb[0]), iface["interface_area"])

Loading through the rcsb module

The pdb_cpp.rcsb module gives explicit control over which RCSB coordinate file is downloaded.

from pdb_cpp import rcsb

# Deposited asymmetric unit
asym_unit = rcsb.load("1y0m", structure="asymmetric_unit")

# Biological assembly 1 in mmCIF format
assembly_1 = rcsb.load("5a9z", structure="biological_assembly", assembly_id=1)

# Download without loading, using a custom cache directory
local_path = rcsb.download(
    "5a9z",
    structure="biological_assembly",
    assembly_id=1,
    cache_dir="/tmp/pdb_cpp_rcsb",
)

Supported structure values are:

  • "asymmetric_unit": the deposited entry file, for example 1y0m.cif

  • "biological_assembly": an assembly-specific RCSB file, for example 5a9z-assembly1.cif

Supported file_format values are "cif" and "pdb". For biological assemblies, the RCSB URL pattern differs by format:

  • mmCIF: <pdb_id>-assembly<assembly_id>.cif

  • PDB: <pdb_id>.pdb<assembly_id>

If you prefer the older convenience API, Coor(pdb_id="...") still works and now accepts extra RCSB-related keyword arguments:

from pdb_cpp import Coor

coor = Coor(
    pdb_id="5a9z",
    rcsb_structure="biological_assembly",
    assembly_id=1,
    cache_dir="/tmp/pdb_cpp_rcsb",
    force_download=False,
)

This path currently reads the downloaded file as returned by RCSB. It does not construct assemblies locally from _pdbx_struct_assembly_gen; instead it relies on the explicit assembly files published by RCSB.

Writing output

coor.write("output.pdb")      # PDB format
coor.write("output.cif")      # mmCIF format
coor.write("output.pqr")      # PQR format
coor.write("output.gro")      # GRO format

The format is determined by the file extension. Selections can also be written directly:

sel = coor.select_atoms("chain A")
sel.write("chain_A.pdb")

2. The Coor and Model objects

A Coor object holds one or more Model frames. Many properties are available on both classes via Python property accessors.

Key Coor properties

Property

Type

Description

len

int

Number of atoms in the active model

model_num

int

Number of models

models

list[Model]

All models

active_model

int

Index of the active model (read/write)

conect

dict[int, list[int]]

CONECT bond records (atom number → bonded atom numbers)

Coor methods

Method

Returns

Description

select_atoms(selection)

Coor

Select atoms by string query (see §3)

select_bool_index(bool_array)

Coor

Select atoms by boolean mask (one bool per atom)

get_index_select(selection)

list[int]

Atom indices matching a selection string

read(filename)

bool

Read a file into an existing Coor object

write(filename)

Write to PDB/mmCIF/PQR/GRO (by extension)

add_Model(model)

Append a Model to this Coor

clear()

Remove all models

get_uniq_chain()

list

Unique chain IDs (char arrays)

get_uniq_chain_str()

list[str]

Unique chain IDs as Python strings

get_aa_seq(gap_in_seq=True)

dict

Amino acid sequences per chain

get_aa_DL_seq()

dict

D/L amino acid sequences (D as lowercase)

get_aa_na_seq()

dict

Amino acid + nucleic acid sequences

remove_incomplete_backbone_residues()

Coor

Remove residues missing backbone atoms

Model methods

Method

Returns

Description

select_atoms(selection)

list[bool]

Boolean mask of atoms matching selection

sasa(...)

dict

SASA total with optional per-atom/per-residue data

addAtom(...)

bool

Add a single atom (see below)

clear()

Remove all atoms from this model

get_centroid(indices=None)

ndarray(3,)

Centroid of all atoms, or of given indices

Note that Model.select_atoms() returns a boolean mask (not a new object), unlike Coor.select_atoms() which returns a new Coor. Model.addAtom() takes positional arguments: num, name, resname, resid, chain, x, y, z, occ, beta, altloc, elem, insertres, field, uniq_resid.

Per-atom properties (on both Coor and Model)

These properties return data for the active model when called on Coor, or directly when called on a Model.

Property

Type

Description

x, y, z

list[float]

Cartesian coordinates

xyz

ndarray(N,3)

Coordinates as a NumPy array

name

list

Atom names (fixed-length char arrays)

name_str

list[str]

Atom names as Python strings

resname

list

Residue names (char arrays)

resname_str

list[str]

Residue names as strings

resid

list[int]

Residue sequence numbers

uniq_resid

list[int]

Unique (0-based) residue identifiers

chain

list

Chain identifiers (char arrays)

chain_str

list[str]

Chain identifiers as strings

num

list[int]

Atom serial numbers

beta

list[float]

B-factors (temperature factors)

occ

list[float]

Occupancies

elem_str

list[str]

Element symbols

alterloc_str

list[str]

Alternate location indicators

insertres_str

list[str]

Residue insertion codes

coor = Coor("tests/input/1y0m.cif")

# Access via Coor (delegates to active model)
print(coor.x[:5])
print(coor.chain_str[:5])
print(coor.xyz.shape)

# Access a specific model directly
model_0 = coor.models[0]
print(model_0.name_str[:5])
print(model_0.beta[:5])

Multi-model structures

coor = Coor("tests/input/2rri.pdb")
print(coor.model_num)   # e.g. 20 models (NMR ensemble)

# Switch active model
coor.active_model = 5
print(coor.x[:3])       # coordinates from model 5

# Iterate over all models
for i, model in enumerate(coor.models):
    print(f"Model {i}: {model.len} atoms")

Centroid calculation

model = coor.models[0]

# Centroid of all atoms
centroid = model.get_centroid()

# Centroid of specific atom indices
ca_indices = coor.get_index_select("name CA")
centroid_ca = model.get_centroid(ca_indices)

Boolean-mask selection

When you already have a boolean array (e.g. from NumPy operations), use select_bool_index instead of building a string query:

import numpy as np

coor = Coor("tests/input/1y0m.cif")
mask = np.array(coor.beta) < 30.0          # low B-factor atoms
low_b = coor.select_bool_index(mask.tolist())
print(f"{low_b.len} atoms with B < 30")

3.1 SASA and interface SASA

sasa.sasa() computes solvent-accessible surface area for each model in a Coor object. It returns one dictionary per model, each with at least total, polar, and apolar, and optionally atom_areas and residue_areas.

SASA for one chain or selection

from pdb_cpp import Coor
from pdb_cpp.analysis import sasa

coor = Coor("tests/input/1a2k.pdb")
result = sasa.sasa(coor, selection="chain A", by_residue=True)[0]
print(f"Total SASA: {result['total']:.2f} A^2")
print(f"Polar SASA: {result['polar']:.2f} A^2")
print(f"Apolar SASA: {result['apolar']:.2f} A^2")
print(result["residue_areas"][:3])

The polar/apolar split is FreeSASA-like: atom areas are partitioned from the per-atom SASA using a simple element-based classifier, with N, O, S, P, and Se treated as polar atoms.

Protein-protein interface SASA

Use pdb_cpp.analysis.sasa.buried_surface_area() to evaluate two partners inside one complex. This helper computes three SASA values internally:

  • receptor alone

  • ligand alone

  • receptor + ligand together as the complex

The returned interface metrics follow the standard convention:

  • buried_surface = sasa_receptor + sasa_ligand - sasa_complex

  • interface_area = buried_surface / 2

from pdb_cpp import Coor
from pdb_cpp.analysis import sasa

coor = Coor("tests/input/1a2k.pdb")

interface = sasa.buried_surface_area(
    coor,
    receptor_sel="chain A",
    ligand_sel="chain B",
    by_residue=True,
)[0]

print(f"Complex SASA   : {interface['complex_sasa']:.2f} A^2")
print(f"Complex polar  : {interface['complex_polar_sasa']:.2f} A^2")
print(f"Complex apolar : {interface['complex_apolar_sasa']:.2f} A^2")
print(f"Receptor SASA  : {interface['receptor_sasa']:.2f} A^2")
print(f"Ligand SASA    : {interface['ligand_sasa']:.2f} A^2")
print(f"Buried surface : {interface['buried_surface']:.2f} A^2")
print(f"Buried polar   : {interface['buried_polar_surface']:.2f} A^2")
print(f"Buried apolar  : {interface['buried_apolar_surface']:.2f} A^2")
print(f"Interface area : {interface['interface_area']:.2f} A^2")
print(f"Interface polar: {interface['interface_polar_area']:.2f} A^2")
print(f"Interface apol.: {interface['interface_apolar_area']:.2f} A^2")

for residue in interface["residue_buried_surface"]:
    print(
        residue["partner"],
        residue["chain"],
        residue["resname"],
        residue["resid"],
        residue["buried_area"],
    )

The residue_buried_surface list contains one entry per residue from the receptor and ligand selections, with:

  • isolated_area: SASA of that residue in the isolated partner

  • complex_area: SASA of that residue inside the full complex

  • buried_area: difference between the two

For residue-level SASA, each residue entry also contains polar_area and apolar_area; interface residue burial entries additionally expose buried_polar_area and buried_apolar_area.

Shape complementarity

pdb_cpp.analysis.sasa.shape_complementarity() estimates a Lawrence-Colman style shape complementarity score from rolling-probe surface dots and opposing surface normals. Each accessible interface-facing dot on one partner is matched to its nearest dot on the other partner within a configurable interface radius using a voxel hash, and the score aggregates dot(n_a, -n_b) over the matched interface dots.

Warning

shape_complementarity() is currently experimental and has not yet been validated against established reference implementations or benchmark datasets. Use this score with caution for quantitative conclusions.

from pdb_cpp import Coor
from pdb_cpp.analysis import sasa

coor = Coor("tests/input/1a2k.pdb")

sc = sasa.shape_complementarity(
    coor,
    receptor_sel="chain A",
    ligand_sel="chain B",
    dots_per_sq_angstrom=12.0,
    search_radius=1.5,
)[0]

print(f"Sc: {sc['shape_complementarity']:.3f}")
print(f"Interface dot pairs: {sc['interface_dot_pairs']}")
print(f"Mean paired distance: {sc['mean_interface_distance']:.3f} A")

Scores are bounded to [-1, 1], with values closer to 1 indicating more complementary opposing surface normals.

Important parameters:

  • dots_per_sq_angstrom: sampling density of the rolling-probe surface

  • search_radius: maximum nearest-dot pairing distance across the interface

  • reducer: use "trimmed_mean" (default) or "median" to aggregate interface dot scores

  • trim_fraction: fraction removed from each tail when reducer="trimmed_mean"

The returned dictionary also includes the per-frame receptor_shape_complementarity, ligand_shape_complementarity, interface_dot_pairs, and mean_interface_distance diagnostics.

CONECT records

CONECT records (covalent bonds) are stored as a dictionary mapping atom numbers to lists of bonded atom numbers. Bond topology is read from and written to both supported formats:

  • PDB: CONECT lines (hybrid-36 encoded atom serial numbers)

  • mmCIF: _struct_conn loop (any conn_type_id is accepted on read; covale is written on output)

from pdb_cpp import Coor

# Load a structure with covalent bonds (ligands, zinc coordination, etc.)
coor = Coor("tests/input/1u85.pdb")

# Inspect bonds
for atom_num, bonded in list(coor.conect.items())[:3]:
    print(f"Atom {atom_num} bonded to {bonded}")

# Bonds survive PDB and mmCIF round-trips
coor.write("output_with_conect.pdb")   # writes CONECT lines
coor.write("output_with_conect.cif")   # writes _struct_conn loop

Note

mmCIF files distributed by the PDB typically store only inter-residue bonds in _struct_conn (disulfides, metal coordination, covalent cross-links). Intra-ligand bonds may be absent unless the file was produced by pdb_cpp.

Programmatic construction

You can build a Coor from scratch or append models:

from pdb_cpp import Coor

coor_1 = Coor("tests/input/1y0m.cif")
coor_2 = Coor("tests/input/1y0m.cif")

# Append all models from coor_2 into coor_1
for model in coor_2.models:
    coor_1.add_Model(model)

print(coor_1.model_num)  # doubled

# Clear everything
coor_1.clear()
print(coor_1.model_num)  # 0

3. Atom selection

Selection syntax

Selections are performed with coor.select_atoms(selection_string) which returns a new Coor object containing only the matching atoms.

Field selectors

Keyword

Matches

Example

name

Atom name

name CA

resname

Residue name

resname ALA GLY

chain

Chain identifier

chain A B

resid

Residue sequence number

resid 1 2 3

residue

Unique residue index (0-based)

residue >= 10 and residue < 50

altloc

Alternate location indicator

not altloc B C D

x, y, z

Coordinate value

x >= 20.0

beta

B-factor

beta < 50.0

occ

Occupancy

occ > 0.5

Shortcut keywords

Keyword

Equivalent to

protein

Standard amino acid residue names

backbone

name CA C N O within protein residues

noh

Non-hydrogen atoms

nucleic

RNA and DNA residue names

rna

RNA residue names (A U G C)

dna

DNA residue names (DA DC DG DT)

Logical operatorsfrom pdb_cpp import Coor, alignment

coor_1 = Coor(“tests/input/1rxz_colabfold_model_1.pdb”) coor_2 = Coor(“tests/input/1rxz.pdb”)

rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2) print(“Best RMSD:”, rmsds[0]) print(“Returned mapping tuple lengths:”, len(mappings[0]), len(mappings[1]))

  • and — intersection

  • or — union

  • not — negation

Comparison operators

==, !=, <, <=, >, >= — used with numeric fields (resid, residue, x, y, z, beta, occ).

Spatial queries

within <distance> of <sub-selection>

Selects atoms within distance Ångströms of any atom matching the sub-selection.

Selection examples

from pdb_cpp import Coor

coor = Coor("tests/input/1rxz.pdb")

# Chain selection
chain_a = coor.select_atoms("chain A")

# Backbone of a residue range
seg = coor.select_atoms("backbone and chain A and residue >= 6 and residue <= 58")

# Interface: atoms of A within 5 Å of B
interface = coor.select_atoms("chain A and within 5.0 of chain B")

# Exclude water and alternate locations
clean = coor.select_atoms("protein and not altloc B C D and not resname HOH")

# Numeric coordinate filter
slab = coor.select_atoms("name CA and x >= 10.0 and x <= 30.0")

Getting indices without creating a new Coor

indices = coor.get_index_select("name CA and chain A")
# Returns list[int] of atom indices in the active model

Frame-specific selections (multi-model)

# Select using coordinates from a specific model frame
sel = coor.select_atoms("within 5.0 of chain A", frame=3)

4. Sequence extraction

Amino acid sequences

seqs = coor.get_aa_seq()
# Returns dict[str, str]: chain ID -> one-letter sequence
print(seqs["A"])

With gap insertion for missing residues:

seqs = coor.get_aa_seq(gap_in_seq=True)   # default: gaps inserted
seqs = coor.get_aa_seq(gap_in_seq=False)   # no gap characters

D/L amino acid sequences

dl_seqs = coor.get_aa_DL_seq()
# D-amino acids are returned as lowercase letters

Amino acid and nucleic acid sequences

all_seqs = coor.get_aa_na_seq()
# Includes both protein and nucleic acid chains

Via the sequence module

from pdb_cpp import sequence

seqs = sequence.get_aa_seq(coor)
dl_seqs = sequence.get_aa_DL_seq(coor)

5. Sequence alignment

pdb_cpp.alignment provides pairwise sequence alignment using a BLOSUM62 substitution matrix and affine gap penalties.

from pdb_cpp import alignment

seq_1 = "ACDEFGHIKLMNPQRSTVWY"
seq_2 = "ACDFGHIKLMNPQRSTVWY"   # missing E

aln_1, aln_2, score = alignment.align_seq(seq_1, seq_2)
print(aln_1)
print(aln_2)
print(f"Score: {score}")

# Pretty-print with identity/similarity stats
alignment.print_align_seq(aln_1, aln_2)

Parameters

Parameter

Default

Description

gap_cost

-11

Gap opening penalty

gap_ext

-1

Gap extension penalty

matrix_file

None

Custom substitution matrix file; defaults to bundled BLOSUM62

Low-level C++ alignment: core.sequence_align

The alignment.align_seq() function is a Python wrapper around the C++ core.sequence_align() function, which returns an Alignment_cpp object:

from pdb_cpp import core

result = core.sequence_align("ACDEFG", "ACDEG")
print(result.seq1)    # aligned sequence 1 (with gaps)
print(result.seq2)    # aligned sequence 2 (with gaps)
print(result.score)   # alignment score (int)

Alignment_cpp fields

Field

Type

Description

seq1

str

Aligned sequence 1 (with gaps)

seq2

str

Aligned sequence 2 (with gaps)

score

int

Raw alignment score


6. Structural alignment (RMSD-based)

Step-by-step alignment

from pdb_cpp import Coor, core, analysis

coor_1 = Coor("tests/input/1u85.pdb")
coor_2 = Coor("tests/input/1ubd.pdb")

# 1. Find common atoms by sequence alignment
idx_1, idx_2 = core.get_common_atoms(
    coor_1, coor_2,
    chain_1=["A"], chain_2=["C"],
    back_names=["C", "N", "O", "CA"],  # backbone atoms
)

# 2. Superpose coor_1 onto coor_2 (modifies coor_1 in-place)
core.coor_align(coor_1, coor_2, idx_1, idx_2, frame_ref=0)

# 3. Compute RMSD
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])
print(f"RMSD: {rmsd_values[0]:.3f} Å")

One-step sequence-based alignment

rmsd_list, align_idx_1, align_idx_2 = core.align_seq_based(
    coor_1, coor_2,
    chain_1=["A"], chain_2=["C"],
)
print(f"RMSD: {rmsd_list[0]:.3f} Å")

This combines get_common_atoms, coor_align, and RMSD in a single call.

Chain-permutation alignment (multi-chain complexes)

When chain correspondence is unknown:

from pdb_cpp import Coor, alignment

coor_1 = Coor("tests/input/1rxz_colabfold_model_1.pdb")
coor_2 = Coor("tests/input/1rxz.pdb")

rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2)
print("Best RMSD:", rmsds[0])
print("Returned mapping tuple lengths:", len(mappings[0]), len(mappings[1]))

This tries all possible chain pairings and returns the best RMSD.

RMSD functions

The analysis API is also grouped into submodules, so both of these patterns are supported:

from pdb_cpp import analysis
from pdb_cpp.analysis import dockq, sasa, hbonds
from pdb_cpp import analysis

# RMSD using a selection string
rmsd_values = analysis.rmsd(coor_1, coor_2, selection="name CA")

# RMSD with explicit index lists (no re-selection)
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])

7. TM-align and TM-score

TM-align provides length-independent structural similarity scoring via the bundled USalign implementation.

from pdb_cpp import Coor, geom
from pdb_cpp.core import tmalign_ca

coor_1 = Coor("tests/input/1y0m.cif")
coor_2 = Coor("tests/input/1ubd.pdb")

result = tmalign_ca(
    coor_1, coor_2,
    chain_1=["A"], chain_2=["C"],
    mm=0,
    include_transform=True,
)

print(result.rotation)
print(result.translation)

# Apply x' = R x + t to all atoms of the mobile structure
coor_1.xyz = geom.apply_transform(coor_1.xyz, result.rotation, result.translation)

TMalignResult fields

Field

Type

Description

TM1

float

TM-score normalized by length of structure 1

TM2

float

TM-score normalized by length of structure 2

TM_ali

float

TM-score normalized by aligned length

rmsd

float

RMSD of aligned residues

L_ali

int

Number of aligned residues

Liden

int

Number of identical aligned residues

seqxA

str

Aligned sequence of structure 1

seqyA

str

Aligned sequence of structure 2

seqM

str

Alignment match string (: aligned, . close, far)

rotation

list[list[float]]

3x3 rotation matrix (when include_transform=True)

translation

list[float]

Length-3 translation vector (when include_transform=True)

The mm parameter

Value

Meaning

0

Monomer TM-align

1

Monomer TM-align (same)

Multi-chain alignment

For multi-chain alignment with chain-pairing, pass multiple chains:

result = tmalign_ca(
    coor_1, coor_2,
    chain_1=["A", "B"], chain_2=["A", "B"],
    mm=0,
)

Citation

If you use the TM-align functionality, please cite:

  • Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang (2022) Nat Methods. 19(9), 1109-1115.

  • Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.


8. Secondary structure assignment

Uses the TM-align secondary structure assignment algorithm (DSSP-like).

from pdb_cpp import Coor, TMalign

coor = Coor("tests/input/1y0m.cif")
ss_list = TMalign.compute_secondary_structure(coor)

Returns a list (one entry per model) of dictionaries mapping chain ID to a secondary structure string.

Secondary structure codes

Code

Meaning

H

Helix

E

Strand (sheet)

T

Turn

C

Coil / other

for chain_id, ss_string in ss_list[0].items():
    print(f"Chain {chain_id}: {ss_string}")

Low-level: core.compute_SS

The TMalign.compute_secondary_structure() wraps the C++ function core.compute_SS(), which returns secondary structure strings directly:

from pdb_cpp import Coor, core

coor = Coor("tests/input/1y0m.cif")
ss = core.compute_SS(coor, gap_in_seq=False)
# Returns list[list[str]]: one list of SS strings per model

9. DockQ scoring

DockQ evaluates docking model quality by combining interface contacts, ligand RMSD, and interface RMSD into a single score.

Basic usage (automatic chain-role inference)

from pdb_cpp import Coor, analysis

model = Coor("tests/input/1rxz_colabfold_model_1.pdb")
native = Coor("tests/input/1rxz.pdb")

scores = analysis.dockQ(model, native)

With explicit chain roles

scores = analysis.dockQ(
    model, native,
    rec_chains=["B"],
    lig_chains=["C"],
    native_rec_chains=["A"],
    native_lig_chains=["B"],
)

Command-line multimer scoring

The installed pdb_cpp_dockq executable runs analysis.dockQ_multimer(). This is the same multimer code path used in the benchmark script, including automatic native-to-model chain-map discovery when --chain-map is omitted.

pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb

To provide an explicit mapping, pass native:model pairs:

pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb --chain-map A:B,B:A,C:C

DockQ output dictionary

Key

Type

Description

DockQ

list[float]

Combined DockQ score (0-1)

Fnat

list[float]

Fraction of native contacts

Fnonnat

list[float]

Fraction of non-native contacts

LRMS

list[float]

Ligand RMSD (Å)

iRMS

list[float]

Interface RMSD (Å)

rRMS

list[float]

Receptor RMSD (Å)

Each value is a list with one entry per model in the input Coor.

DockQ quality thresholds

DockQ range

Quality

0.00 – 0.23

Incorrect

0.23 – 0.49

Acceptable

0.49 – 0.80

Medium

0.80 – 1.00

High

Individual metric functions

# Fraction of native/non-native contacts
fnat_list, fnonnat_list = analysis.native_contact(
    model, native,
    rec_chains=["A"], lig_chains=["B"],
    native_rec_chains=["A"], native_lig_chains=["B"],
    cutoff=5.0,
)

# Interface RMSD
irmsd_list = analysis.interface_rmsd(
    model, native,
    rec_chains_native=["A"],
    lig_chains_native=["B"],
    cutoff=10.0,
)

native_contact parameters

Parameter

Default

Description

coor

Model Coor

native_coor

Native Coor

rec_chains

Model receptor chain IDs

lig_chains

Model ligand chain IDs

native_rec_chains

Native receptor chain IDs

native_lig_chains

Native ligand chain IDs

cutoff

5.0

Contact distance cutoff (Å)

residue_id_map

None

dict[int, int] mapping model residue IDs to a shared ID space

native_residue_id_map

None

dict[int, int] mapping native residue IDs to the same shared ID space

The residue_id_map parameters are useful when model and native have different residue numbering — provide dictionaries that map each residue uniq_resid to a common numbering scheme.

interface_rmsd notes

interface_rmsd returns a list of floats with one entry per model. Entries are None when no interface residues are found within the cutoff distance.

Citation

If you use DockQ scoring, please cite:

  • Mirabello, C. & Wallner, B. (2024) Bioinformatics. DOI: 10.1093/bioinformatics/btae586


10. Hydrogen bond detection

pdb_cpp.analysis.hbonds detects hydrogen bonds using a purely geometric approach (Baker & Hubbard, 1984). For structures without explicit hydrogen atoms, backbone NH and sidechain hydrogen positions are reconstructed algebraically. The algorithm covers all standard L/D amino acids and RNA/DNA nucleotides.

Basic usage

from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds

coor = Coor("tests/input/2rri.cif")

# All protein–protein H-bonds for every model
all_bonds = hbonds.hbonds(coor)

# List of HBond objects for model 0
bonds_model0 = all_bonds[0]
print(f"Model 0: {len(bonds_model0)} hydrogen bonds")

for b in bonds_model0[:3]:
    print(b)

Function signature

hbonds.hbonds(
    coor,
    donor_sel    = "protein",
    acceptor_sel = "protein",
    dist_DA_cutoff = 3.5,   # D···A distance cutoff (Å)
    dist_HA_cutoff = 2.5,   # H···A distance cutoff (Å)
    angle_cutoff   = 90.0,  # D−H···A angle cutoff (°)
) -> list[list[HBond]]

Returns a list with one inner list (of HBond objects) per model frame.

Parameters

Parameter

Default

Description

coor

A Coor object

donor_sel

"protein"

Selection string for donor-containing atoms

acceptor_sel

"protein"

Selection string for acceptor-containing atoms

dist_DA_cutoff

3.5

Maximum heavy-atom donor–acceptor distance (Å)

dist_HA_cutoff

2.5

Maximum hydrogen–acceptor distance (Å)

angle_cutoff

90.0

Minimum D−H···A angle (°)

HBond object

Each entry in the returned lists is an HBond object with the following read-only attributes:

Attribute

Type

Description

donor_resid

int

Donor residue sequence number

donor_resname

str

Donor residue name

donor_chain

str

Donor chain identifier

donor_heavy_name

str

Donor heavy-atom name (e.g. N, OG)

donor_h_name

str

Donor hydrogen atom name (e.g. H, HG)

donor_heavy_xyz

tuple

Donor heavy-atom coordinates (x, y, z)

donor_h_xyz

tuple

Donor hydrogen coordinates (x, y, z)

acceptor_resid

int

Acceptor residue sequence number

acceptor_resname

str

Acceptor residue name

acceptor_chain

str

Acceptor chain identifier

acceptor_name

str

Acceptor atom name (e.g. O, OD1)

acceptor_xyz

tuple

Acceptor atom coordinates (x, y, z)

dist_DA

float

D···A distance (Å)

dist_HA

float

H···A distance (Å)

angle_DHA

float

D−H···A angle (°)

Filtering and analysis examples

from pdb_cpp import Coor
from pdb_cpp.analysis import hbonds

coor = Coor("tests/input/2rri.cif")
bonds = hbonds.hbonds(coor)[0]

# Backbone–backbone H-bonds only
bb_names = {"N", "O"}
bb_bonds = [b for b in bonds
            if b.donor_heavy_name in bb_names
            and b.acceptor_name in bb_names]

# Inter-chain H-bonds (cross-chain contacts)
inter = [b for b in bonds if b.donor_chain != b.acceptor_chain]
print(f"{len(inter)} inter-chain H-bonds")

# Sort by D–A distance
bonds.sort(key=lambda b: b.dist_DA)
for b in bonds[:5]:
    print(f"{b.donor_chain}{b.donor_resid}{b.donor_resname:>4}"
          f" {b.donor_heavy_name} → "
          f"{b.acceptor_chain}{b.acceptor_resid}{b.acceptor_resname:>4}"
          f" {b.acceptor_name}  d(DA)={b.dist_DA:.2f} Å"
          f"  angle={b.angle_DHA:.1f}°")

Cross-selection H-bonds (e.g. protein–nucleic)

The donor_sel and acceptor_sel parameters accept any valid selection string, enabling inter-molecular or cross-type analysis:

# Protein side-chains donating to RNA acceptors
rna_contacts = hbonds.hbonds(
    coor,
    donor_sel    = "protein",
    acceptor_sel = "nucleic",
)

# H-bonds between two specific chains
ab_bonds = hbonds.hbonds(coor, donor_sel="chain A", acceptor_sel="chain B")

Baker & Hubbard criteria

The default thresholds reproduce the original Baker & Hubbard (1984) geometric criteria:

\[ d(\text{D}{\cdots}\text{A}) \leq 3.5~\text{Å}, \quad d(\text{H}{\cdots}\text{A}) \leq 2.5~\text{Å}, \quad \angle(\text{D{-}H}{\cdots}\text{A}) \geq 90° \]

A stricter cut-off of 120° (angle_cutoff=120) is closer to values used by biotite and is recommended when comparing with other tools.


11. Geometry utilities

Distance matrix

from pdb_cpp import Coor, geom

coor = Coor("tests/input/1y0m.cif")
ca = coor.select_atoms("name CA")

# Pairwise distance matrix between two coordinate sets
dmat = geom.distance_matrix(ca.xyz, ca.xyz)
print(dmat.shape)  # (N_ca, N_ca)

The function accepts any two Coor objects (or selections) and returns an (N, M) float32 NumPy array of Euclidean distances.

Hybrid-36 encoding/decoding

PDB format uses hybrid-36 encoding for atom/residue numbers > 99999:

from pdb_cpp.core import hy36encode, hy36decode

encoded = hy36encode(5, 100000)   # "A0000"
decoded = hy36decode(5, "A0000")  # 100000

12. Data cleaning utilities

Remove incomplete backbone residues

Residues missing any backbone atom (CA, C, N, O by default) are removed:

from pdb_cpp import select

clean_coor = select.remove_incomplete_backbone_residues(coor)

# Or with custom backbone definition
clean_coor = select.remove_incomplete_backbone_residues(
    coor, back_atom=["CA", "C", "N"]
)

This is recommended before structural alignment to avoid mismatches.


13. Module summary

Module

Key functions / classes

pdb_cpp

Coor, Model — main data objects

pdb_cpp.core

get_common_atoms, coor_align, align_seq_based, tmalign_ca, distance_matrix, compute_SS, sequence_align, Alignment_cpp, hy36encode, hy36decode

pdb_cpp.alignment

align_seq, print_align_seq, align_chain_permutation

pdb_cpp.analysis

rmsd, interface_rmsd, native_contact, dockQ

pdb_cpp.analysis.hbonds

hbonds — Baker & Hubbard H-bond detection

pdb_cpp.TMalign

compute_secondary_structure

pdb_cpp.geom

distance_matrix

pdb_cpp.select

remove_incomplete_backbone_residues

pdb_cpp.sequence

get_aa_seq, get_aa_DL_seq

pdb_cpp.data

BLOSUM62, get_blosum62, load_blosum, AA_DICT, NA_DICT, AA_NA_DICT


14. Data subpackage (pdb_cpp.data)

The pdb_cpp.data module provides residue dictionaries and the BLOSUM62 substitution matrix used internally by the alignment routines.

BLOSUM62 matrix

from pdb_cpp.data.blosum import BLOSUM62

# Substitution score for two amino acids
print(BLOSUM62[("A", "A")])   # 4
print(BLOSUM62[("A", "W")])   # -3

# Load a custom matrix from file
from pdb_cpp.data.blosum import load_blosum
custom = load_blosum("/path/to/matrix.txt")

BLOSUM62 is a lazy-loaded dictionary mapping (aa1, aa2) tuples to integer substitution scores.

Residue dictionaries

from pdb_cpp.data.res_dict import AA_DICT, AA_DICT_L, AA_DICT_D, NA_DICT

# 3-letter to 1-letter amino acid code
print(AA_DICT["ALA"])   # "A"
print(AA_DICT["GLY"])   # "G"

# D-amino acids (e.g. DAL -> "A")
print(AA_DICT_D["DAL"])  # "A"

# Nucleic acids
print(NA_DICT["DA"])     # "A"

# Combined amino acid + nucleic acid dictionary
from pdb_cpp.data.res_dict import AA_NA_DICT

Dictionary

Contents

AA_DICT_L

Standard L-amino acids (31 entries incl. protonation variants)

AA_DICT_D

D-amino acids (22 entries)

AA_DICT

Combined L + D amino acids

NA_DICT

DNA nucleotides (DA, DT, DC, DG)

AA_NA_DICT

Combined amino acids + nucleotides