# Functionality Guide This guide documents every major `pdb_cpp` workflow with fully worked examples. For copy-paste snippets see [Quick Recipes](quick_recipes.md); for the auto-generated API reference see [pdb_cpp package](pdb_cpp.rst). --- ## 1. Reading and writing structures ### Supported formats | Format | Extension | Read | Write | Notes | |--------|-----------|------|-------|------------------------------------------| | PDB | `.pdb` | yes | yes | Bond topology via `CONECT` lines | | mmCIF | `.cif` | yes | yes | Bond topology via `_struct_conn` loop | | PQR | `.pqr` | yes | yes | Charge in `occ`, radius in `beta` slot | | GRO | `.gro` | yes | yes | GROMACS format; coordinates converted nm ↔ Å | ### Loading from a local file ```python from pdb_cpp import Coor coor = Coor("structure.cif") # auto-detects format by extension coor = Coor("structure.pdb") ``` The format can also be forced explicitly with the `format` keyword. This is useful when the file extension is absent or misleading — for example a compressed pipeline that writes to a generic filename: ```python from pdb_cpp import Coor # Via the constructor coor = Coor("structure.dat", format="pdb") # treat as PDB coor = Coor("structure.dat", format="cif") # treat as mmCIF coor = Coor("structure.dat", format="pqr") # treat as PQR coor = Coor("structure.dat", format="gro") # treat as GROMACS GRO # Via read() on an existing object coor = Coor() coor.read("structure.dat", format="cif") ``` Accepted `format` values: `"pdb"`, `"cif"`, `"pqr"`, `"gro"`. `format` defaults to `""`, which means *infer from extension* — the same behaviour as before. ### Fetching from the PDB archive ```python coor = Coor(pdb_id="1y0m") # downloads mmCIF, caches in /tmp/pdb_cpp_cache/ ``` The cached file is reused on subsequent calls with the same PDB ID. --- ## Interaction analysis Interface-oriented tools live directly in the analysis submodules: - `pdb_cpp.analysis.hbonds.hbonds()` for hydrogen bonds - `pdb_cpp.analysis.salt_bridge.salt_bridges()` for ionic contacts / salt bridges - `pdb_cpp.analysis.sasa.buried_surface_area()` for buried surface and interface area - `pdb_cpp.analysis.sasa.shape_complementarity()` for Lawrence-Colman style interface shape complementarity ```python from pdb_cpp import Coor from pdb_cpp.analysis import hbonds, salt_bridge, sasa coor = Coor("tests/input/1A0A.cif") hb = hbonds.hbonds(coor, donor_sel="protein", acceptor_sel="nucleic") sb = salt_bridge.salt_bridges(coor, cation_sel="protein", anion_sel="nucleic") iface = sasa.buried_surface_area( coor, receptor_sel="chain C D", ligand_sel="chain A B", )[0] print(len(hb[0]), len(sb[0]), iface["interface_area"]) ``` ### Loading through the `rcsb` module The `pdb_cpp.rcsb` module gives explicit control over which RCSB coordinate file is downloaded. ```python from pdb_cpp import rcsb # Deposited asymmetric unit asym_unit = rcsb.load("1y0m", structure="asymmetric_unit") # Biological assembly 1 in mmCIF format assembly_1 = rcsb.load("5a9z", structure="biological_assembly", assembly_id=1) # Download without loading, using a custom cache directory local_path = rcsb.download( "5a9z", structure="biological_assembly", assembly_id=1, cache_dir="/tmp/pdb_cpp_rcsb", ) ``` Supported `structure` values are: - `"asymmetric_unit"`: the deposited entry file, for example `1y0m.cif` - `"biological_assembly"`: an assembly-specific RCSB file, for example `5a9z-assembly1.cif` Supported `file_format` values are `"cif"` and `"pdb"`. For biological assemblies, the RCSB URL pattern differs by format: - mmCIF: `-assembly.cif` - PDB: `.pdb` If you prefer the older convenience API, `Coor(pdb_id="...")` still works and now accepts extra RCSB-related keyword arguments: ```python from pdb_cpp import Coor coor = Coor( pdb_id="5a9z", rcsb_structure="biological_assembly", assembly_id=1, cache_dir="/tmp/pdb_cpp_rcsb", force_download=False, ) ``` This path currently reads the downloaded file as returned by RCSB. It does not construct assemblies locally from `_pdbx_struct_assembly_gen`; instead it relies on the explicit assembly files published by RCSB. ### Writing output ```python coor.write("output.pdb") # PDB format coor.write("output.cif") # mmCIF format coor.write("output.pqr") # PQR format coor.write("output.gro") # GRO format ``` The format is determined by the file extension. Selections can also be written directly: ```python sel = coor.select_atoms("chain A") sel.write("chain_A.pdb") ``` --- ## 2. The `Coor` and `Model` objects A `Coor` object holds one or more `Model` frames. Many properties are available on both classes via Python property accessors. ### Key `Coor` properties | Property | Type | Description | |-------------------|-----------------|-----------------------------------------------| | `len` | `int` | Number of atoms in the active model | | `model_num` | `int` | Number of models | | `models` | `list[Model]` | All models | | `active_model` | `int` | Index of the active model (read/write) | | `conect` | `dict[int, list[int]]` | CONECT bond records (atom number → bonded atom numbers) | ### `Coor` methods | Method | Returns | Description | |-------------------------------------|----------------|---------------------------------------------------| | `select_atoms(selection)` | `Coor` | Select atoms by string query (see §3) | | `select_bool_index(bool_array)` | `Coor` | Select atoms by boolean mask (one bool per atom) | | `get_index_select(selection)` | `list[int]` | Atom indices matching a selection string | | `read(filename)` | `bool` | Read a file into an existing `Coor` object | | `write(filename)` | — | Write to PDB/mmCIF/PQR/GRO (by extension) | | `add_Model(model)` | — | Append a `Model` to this `Coor` | | `clear()` | — | Remove all models | | `get_uniq_chain()` | `list` | Unique chain IDs (char arrays) | | `get_uniq_chain_str()` | `list[str]` | Unique chain IDs as Python strings | | `get_aa_seq(gap_in_seq=True)` | `dict` | Amino acid sequences per chain | | `get_aa_DL_seq()` | `dict` | D/L amino acid sequences (D as lowercase) | | `get_aa_na_seq()` | `dict` | Amino acid + nucleic acid sequences | | `remove_incomplete_backbone_residues()` | `Coor` | Remove residues missing backbone atoms | ### `Model` methods | Method | Returns | Description | |-------------------------------------|-----------------|--------------------------------------------------| | `select_atoms(selection)` | `list[bool]` | Boolean mask of atoms matching selection | | `sasa(...)` | `dict` | SASA total with optional per-atom/per-residue data | | `addAtom(...)` | `bool` | Add a single atom (see below) | | `clear()` | — | Remove all atoms from this model | | `get_centroid(indices=None)` | `ndarray(3,)` | Centroid of all atoms, or of given indices | Note that `Model.select_atoms()` returns a **boolean mask** (not a new object), unlike `Coor.select_atoms()` which returns a new `Coor`. `Model.addAtom()` takes positional arguments: `num, name, resname, resid, chain, x, y, z, occ, beta, altloc, elem, insertres, field, uniq_resid`. ### Per-atom properties (on both `Coor` and `Model`) These properties return data for the active model when called on `Coor`, or directly when called on a `Model`. | Property | Type | Description | |-----------------|-----------------|------------------------------------------| | `x`, `y`, `z` | `list[float]` | Cartesian coordinates | | `xyz` | `ndarray(N,3)` | Coordinates as a NumPy array | | `name` | `list` | Atom names (fixed-length char arrays) | | `name_str` | `list[str]` | Atom names as Python strings | | `resname` | `list` | Residue names (char arrays) | | `resname_str` | `list[str]` | Residue names as strings | | `resid` | `list[int]` | Residue sequence numbers | | `uniq_resid` | `list[int]` | Unique (0-based) residue identifiers | | `chain` | `list` | Chain identifiers (char arrays) | | `chain_str` | `list[str]` | Chain identifiers as strings | | `num` | `list[int]` | Atom serial numbers | | `beta` | `list[float]` | B-factors (temperature factors) | | `occ` | `list[float]` | Occupancies | | `elem_str` | `list[str]` | Element symbols | | `alterloc_str` | `list[str]` | Alternate location indicators | | `insertres_str` | `list[str]` | Residue insertion codes | ```python coor = Coor("tests/input/1y0m.cif") # Access via Coor (delegates to active model) print(coor.x[:5]) print(coor.chain_str[:5]) print(coor.xyz.shape) # Access a specific model directly model_0 = coor.models[0] print(model_0.name_str[:5]) print(model_0.beta[:5]) ``` ### Multi-model structures ```python coor = Coor("tests/input/2rri.pdb") print(coor.model_num) # e.g. 20 models (NMR ensemble) # Switch active model coor.active_model = 5 print(coor.x[:3]) # coordinates from model 5 # Iterate over all models for i, model in enumerate(coor.models): print(f"Model {i}: {model.len} atoms") ``` ### Centroid calculation ```python model = coor.models[0] # Centroid of all atoms centroid = model.get_centroid() # Centroid of specific atom indices ca_indices = coor.get_index_select("name CA") centroid_ca = model.get_centroid(ca_indices) ``` ### Boolean-mask selection When you already have a boolean array (e.g. from NumPy operations), use `select_bool_index` instead of building a string query: ```python import numpy as np coor = Coor("tests/input/1y0m.cif") mask = np.array(coor.beta) < 30.0 # low B-factor atoms low_b = coor.select_bool_index(mask.tolist()) print(f"{low_b.len} atoms with B < 30") ``` --- ## 3.1 SASA and interface SASA `sasa.sasa()` computes solvent-accessible surface area for each model in a `Coor` object. It returns one dictionary per model, each with at least `total`, `polar`, and `apolar`, and optionally `atom_areas` and `residue_areas`. ### SASA for one chain or selection ```python from pdb_cpp import Coor from pdb_cpp.analysis import sasa coor = Coor("tests/input/1a2k.pdb") result = sasa.sasa(coor, selection="chain A", by_residue=True)[0] print(f"Total SASA: {result['total']:.2f} A^2") print(f"Polar SASA: {result['polar']:.2f} A^2") print(f"Apolar SASA: {result['apolar']:.2f} A^2") print(result["residue_areas"][:3]) ``` The polar/apolar split is FreeSASA-like: atom areas are partitioned from the per-atom SASA using a simple element-based classifier, with `N`, `O`, `S`, `P`, and `Se` treated as polar atoms. ### Protein-protein interface SASA Use `pdb_cpp.analysis.sasa.buried_surface_area()` to evaluate two partners inside one complex. This helper computes three SASA values internally: - receptor alone - ligand alone - receptor + ligand together as the complex The returned interface metrics follow the standard convention: - `buried_surface = sasa_receptor + sasa_ligand - sasa_complex` - `interface_area = buried_surface / 2` ```python from pdb_cpp import Coor from pdb_cpp.analysis import sasa coor = Coor("tests/input/1a2k.pdb") interface = sasa.buried_surface_area( coor, receptor_sel="chain A", ligand_sel="chain B", by_residue=True, )[0] print(f"Complex SASA : {interface['complex_sasa']:.2f} A^2") print(f"Complex polar : {interface['complex_polar_sasa']:.2f} A^2") print(f"Complex apolar : {interface['complex_apolar_sasa']:.2f} A^2") print(f"Receptor SASA : {interface['receptor_sasa']:.2f} A^2") print(f"Ligand SASA : {interface['ligand_sasa']:.2f} A^2") print(f"Buried surface : {interface['buried_surface']:.2f} A^2") print(f"Buried polar : {interface['buried_polar_surface']:.2f} A^2") print(f"Buried apolar : {interface['buried_apolar_surface']:.2f} A^2") print(f"Interface area : {interface['interface_area']:.2f} A^2") print(f"Interface polar: {interface['interface_polar_area']:.2f} A^2") print(f"Interface apol.: {interface['interface_apolar_area']:.2f} A^2") for residue in interface["residue_buried_surface"]: print( residue["partner"], residue["chain"], residue["resname"], residue["resid"], residue["buried_area"], ) ``` The `residue_buried_surface` list contains one entry per residue from the receptor and ligand selections, with: - `isolated_area`: SASA of that residue in the isolated partner - `complex_area`: SASA of that residue inside the full complex - `buried_area`: difference between the two For residue-level SASA, each residue entry also contains `polar_area` and `apolar_area`; interface residue burial entries additionally expose `buried_polar_area` and `buried_apolar_area`. ### Shape complementarity `pdb_cpp.analysis.sasa.shape_complementarity()` estimates a Lawrence-Colman style shape complementarity score from rolling-probe surface dots and opposing surface normals. Each accessible interface-facing dot on one partner is matched to its nearest dot on the other partner within a configurable interface radius using a voxel hash, and the score aggregates `dot(n_a, -n_b)` over the matched interface dots. ```{warning} `shape_complementarity()` is currently experimental and has not yet been validated against established reference implementations or benchmark datasets. Use this score with caution for quantitative conclusions. ``` ```python from pdb_cpp import Coor from pdb_cpp.analysis import sasa coor = Coor("tests/input/1a2k.pdb") sc = sasa.shape_complementarity( coor, receptor_sel="chain A", ligand_sel="chain B", dots_per_sq_angstrom=12.0, search_radius=1.5, )[0] print(f"Sc: {sc['shape_complementarity']:.3f}") print(f"Interface dot pairs: {sc['interface_dot_pairs']}") print(f"Mean paired distance: {sc['mean_interface_distance']:.3f} A") ``` Scores are bounded to `[-1, 1]`, with values closer to `1` indicating more complementary opposing surface normals. Important parameters: - `dots_per_sq_angstrom`: sampling density of the rolling-probe surface - `search_radius`: maximum nearest-dot pairing distance across the interface - `reducer`: use `"trimmed_mean"` (default) or `"median"` to aggregate interface dot scores - `trim_fraction`: fraction removed from each tail when `reducer="trimmed_mean"` The returned dictionary also includes the per-frame `receptor_shape_complementarity`, `ligand_shape_complementarity`, `interface_dot_pairs`, and `mean_interface_distance` diagnostics. ### CONECT records CONECT records (covalent bonds) are stored as a dictionary mapping atom numbers to lists of bonded atom numbers. Bond topology is read from and written to both supported formats: - **PDB**: `CONECT` lines (hybrid-36 encoded atom serial numbers) - **mmCIF**: `_struct_conn` loop (any `conn_type_id` is accepted on read; `covale` is written on output) ```python from pdb_cpp import Coor # Load a structure with covalent bonds (ligands, zinc coordination, etc.) coor = Coor("tests/input/1u85.pdb") # Inspect bonds for atom_num, bonded in list(coor.conect.items())[:3]: print(f"Atom {atom_num} bonded to {bonded}") # Bonds survive PDB and mmCIF round-trips coor.write("output_with_conect.pdb") # writes CONECT lines coor.write("output_with_conect.cif") # writes _struct_conn loop ``` ```{note} mmCIF files distributed by the PDB typically store only inter-residue bonds in `_struct_conn` (disulfides, metal coordination, covalent cross-links). Intra-ligand bonds may be absent unless the file was produced by pdb_cpp. ``` ### Programmatic construction You can build a `Coor` from scratch or append models: ```python from pdb_cpp import Coor coor_1 = Coor("tests/input/1y0m.cif") coor_2 = Coor("tests/input/1y0m.cif") # Append all models from coor_2 into coor_1 for model in coor_2.models: coor_1.add_Model(model) print(coor_1.model_num) # doubled # Clear everything coor_1.clear() print(coor_1.model_num) # 0 ``` --- ## 3. Atom selection ### Selection syntax Selections are performed with `coor.select_atoms(selection_string)` which returns a new `Coor` object containing only the matching atoms. #### Field selectors | Keyword | Matches | Example | |------------|----------------------------------|----------------------------------| | `name` | Atom name | `name CA` | | `resname` | Residue name | `resname ALA GLY` | | `chain` | Chain identifier | `chain A B` | | `resid` | Residue sequence number | `resid 1 2 3` | | `residue` | Unique residue index (0-based) | `residue >= 10 and residue < 50` | | `altloc` | Alternate location indicator | `not altloc B C D` | | `x`, `y`, `z` | Coordinate value | `x >= 20.0` | | `beta` | B-factor | `beta < 50.0` | | `occ` | Occupancy | `occ > 0.5` | #### Shortcut keywords | Keyword | Equivalent to | |------------|---------------------------------------------| | `protein` | Standard amino acid residue names | | `backbone` | `name CA C N O` within protein residues | | `noh` | Non-hydrogen atoms | | `nucleic` | RNA and DNA residue names | | `rna` | RNA residue names (`A U G C`) | | `dna` | DNA residue names (`DA DC DG DT`) | #### Logical operatorsfrom pdb_cpp import Coor, alignment coor_1 = Coor("tests/input/1rxz_colabfold_model_1.pdb") coor_2 = Coor("tests/input/1rxz.pdb") rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2) print("Best RMSD:", rmsds[0]) print("Returned mapping tuple lengths:", len(mappings[0]), len(mappings[1])) - `and` — intersection - `or` — union - `not` — negation #### Comparison operators `==`, `!=`, `<`, `<=`, `>`, `>=` — used with numeric fields (`resid`, `residue`, `x`, `y`, `z`, `beta`, `occ`). #### Spatial queries ``` within of ``` Selects atoms within `distance` Ångströms of any atom matching the sub-selection. ### Selection examples ```python from pdb_cpp import Coor coor = Coor("tests/input/1rxz.pdb") # Chain selection chain_a = coor.select_atoms("chain A") # Backbone of a residue range seg = coor.select_atoms("backbone and chain A and residue >= 6 and residue <= 58") # Interface: atoms of A within 5 Å of B interface = coor.select_atoms("chain A and within 5.0 of chain B") # Exclude water and alternate locations clean = coor.select_atoms("protein and not altloc B C D and not resname HOH") # Numeric coordinate filter slab = coor.select_atoms("name CA and x >= 10.0 and x <= 30.0") ``` ### Getting indices without creating a new `Coor` ```python indices = coor.get_index_select("name CA and chain A") # Returns list[int] of atom indices in the active model ``` ### Frame-specific selections (multi-model) ```python # Select using coordinates from a specific model frame sel = coor.select_atoms("within 5.0 of chain A", frame=3) ``` --- ## 4. Sequence extraction ### Amino acid sequences ```python seqs = coor.get_aa_seq() # Returns dict[str, str]: chain ID -> one-letter sequence print(seqs["A"]) ``` With gap insertion for missing residues: ```python seqs = coor.get_aa_seq(gap_in_seq=True) # default: gaps inserted seqs = coor.get_aa_seq(gap_in_seq=False) # no gap characters ``` ### D/L amino acid sequences ```python dl_seqs = coor.get_aa_DL_seq() # D-amino acids are returned as lowercase letters ``` ### Amino acid and nucleic acid sequences ```python all_seqs = coor.get_aa_na_seq() # Includes both protein and nucleic acid chains ``` ### Via the `sequence` module ```python from pdb_cpp import sequence seqs = sequence.get_aa_seq(coor) dl_seqs = sequence.get_aa_DL_seq(coor) ``` --- ## 5. Sequence alignment `pdb_cpp.alignment` provides pairwise sequence alignment using a BLOSUM62 substitution matrix and affine gap penalties. ```python from pdb_cpp import alignment seq_1 = "ACDEFGHIKLMNPQRSTVWY" seq_2 = "ACDFGHIKLMNPQRSTVWY" # missing E aln_1, aln_2, score = alignment.align_seq(seq_1, seq_2) print(aln_1) print(aln_2) print(f"Score: {score}") # Pretty-print with identity/similarity stats alignment.print_align_seq(aln_1, aln_2) ``` ### Parameters | Parameter | Default | Description | |---------------|---------|------------------------------------| | `gap_cost` | -11 | Gap opening penalty | | `gap_ext` | -1 | Gap extension penalty | | `matrix_file` | `None` | Custom substitution matrix file; defaults to bundled BLOSUM62 | ### Low-level C++ alignment: `core.sequence_align` The `alignment.align_seq()` function is a Python wrapper around the C++ `core.sequence_align()` function, which returns an `Alignment_cpp` object: ```python from pdb_cpp import core result = core.sequence_align("ACDEFG", "ACDEG") print(result.seq1) # aligned sequence 1 (with gaps) print(result.seq2) # aligned sequence 2 (with gaps) print(result.score) # alignment score (int) ``` #### `Alignment_cpp` fields | Field | Type | Description | |---------|-------|----------------------------------| | `seq1` | `str` | Aligned sequence 1 (with gaps) | | `seq2` | `str` | Aligned sequence 2 (with gaps) | | `score` | `int` | Raw alignment score | --- ## 6. Structural alignment (RMSD-based) ### Step-by-step alignment ```python from pdb_cpp import Coor, core, analysis coor_1 = Coor("tests/input/1u85.pdb") coor_2 = Coor("tests/input/1ubd.pdb") # 1. Find common atoms by sequence alignment idx_1, idx_2 = core.get_common_atoms( coor_1, coor_2, chain_1=["A"], chain_2=["C"], back_names=["C", "N", "O", "CA"], # backbone atoms ) # 2. Superpose coor_1 onto coor_2 (modifies coor_1 in-place) core.coor_align(coor_1, coor_2, idx_1, idx_2, frame_ref=0) # 3. Compute RMSD rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2]) print(f"RMSD: {rmsd_values[0]:.3f} Å") ``` ### One-step sequence-based alignment ```python rmsd_list, align_idx_1, align_idx_2 = core.align_seq_based( coor_1, coor_2, chain_1=["A"], chain_2=["C"], ) print(f"RMSD: {rmsd_list[0]:.3f} Å") ``` This combines `get_common_atoms`, `coor_align`, and RMSD in a single call. ### Chain-permutation alignment (multi-chain complexes) When chain correspondence is unknown: ```python from pdb_cpp import Coor, alignment coor_1 = Coor("tests/input/1rxz_colabfold_model_1.pdb") coor_2 = Coor("tests/input/1rxz.pdb") rmsds, mappings = alignment.align_chain_permutation(coor_1, coor_2) print("Best RMSD:", rmsds[0]) print("Returned mapping tuple lengths:", len(mappings[0]), len(mappings[1])) ``` This tries all possible chain pairings and returns the best RMSD. ### RMSD functions The analysis API is also grouped into submodules, so both of these patterns are supported: ```python from pdb_cpp import analysis from pdb_cpp.analysis import dockq, sasa, hbonds ``` ```python from pdb_cpp import analysis # RMSD using a selection string rmsd_values = analysis.rmsd(coor_1, coor_2, selection="name CA") # RMSD with explicit index lists (no re-selection) rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2]) ``` --- ## 7. TM-align and TM-score TM-align provides length-independent structural similarity scoring via the bundled USalign implementation. ```python from pdb_cpp import Coor, geom from pdb_cpp.core import tmalign_ca coor_1 = Coor("tests/input/1y0m.cif") coor_2 = Coor("tests/input/1ubd.pdb") result = tmalign_ca( coor_1, coor_2, chain_1=["A"], chain_2=["C"], mm=0, include_transform=True, ) print(result.rotation) print(result.translation) # Apply x' = R x + t to all atoms of the mobile structure coor_1.xyz = geom.apply_transform(coor_1.xyz, result.rotation, result.translation) ``` ### `TMalignResult` fields | Field | Type | Description | |----------|---------|-----------------------------------------------------| | `TM1` | `float` | TM-score normalized by length of structure 1 | | `TM2` | `float` | TM-score normalized by length of structure 2 | | `TM_ali` | `float` | TM-score normalized by aligned length | | `rmsd` | `float` | RMSD of aligned residues | | `L_ali` | `int` | Number of aligned residues | | `Liden` | `int` | Number of identical aligned residues | | `seqxA` | `str` | Aligned sequence of structure 1 | | `seqyA` | `str` | Aligned sequence of structure 2 | | `seqM` | `str` | Alignment match string (`:` aligned, `.` close, ` ` far) | | `rotation` | `list[list[float]]` | 3x3 rotation matrix (when `include_transform=True`) | | `translation` | `list[float]` | Length-3 translation vector (when `include_transform=True`) | ### The `mm` parameter | Value | Meaning | |-------|--------------------------| | 0 | Monomer TM-align | | 1 | Monomer TM-align (same) | ### Multi-chain alignment For multi-chain alignment with chain-pairing, pass multiple chains: ```python result = tmalign_ca( coor_1, coor_2, chain_1=["A", "B"], chain_2=["A", "B"], mm=0, ) ``` ### Citation If you use the TM-align functionality, please cite: - Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang (2022) *Nat Methods*. 19(9), 1109-1115. - Chengxin Zhang, Anna Marie Pyle (2022) *iScience*. 25(10), 105218. --- ## 8. Secondary structure assignment Uses the TM-align secondary structure assignment algorithm (DSSP-like). ```python from pdb_cpp import Coor, TMalign coor = Coor("tests/input/1y0m.cif") ss_list = TMalign.compute_secondary_structure(coor) ``` Returns a list (one entry per model) of dictionaries mapping chain ID to a secondary structure string. ### Secondary structure codes | Code | Meaning | |------|------------------| | `H` | Helix | | `E` | Strand (sheet) | | `T` | Turn | | `C` | Coil / other | ```python for chain_id, ss_string in ss_list[0].items(): print(f"Chain {chain_id}: {ss_string}") ``` ### Low-level: `core.compute_SS` The `TMalign.compute_secondary_structure()` wraps the C++ function `core.compute_SS()`, which returns secondary structure strings directly: ```python from pdb_cpp import Coor, core coor = Coor("tests/input/1y0m.cif") ss = core.compute_SS(coor, gap_in_seq=False) # Returns list[list[str]]: one list of SS strings per model ``` --- ## 9. DockQ scoring DockQ evaluates docking model quality by combining interface contacts, ligand RMSD, and interface RMSD into a single score. ### Basic usage (automatic chain-role inference) ```python from pdb_cpp import Coor, analysis model = Coor("tests/input/1rxz_colabfold_model_1.pdb") native = Coor("tests/input/1rxz.pdb") scores = analysis.dockQ(model, native) ``` ### With explicit chain roles ```python scores = analysis.dockQ( model, native, rec_chains=["B"], lig_chains=["C"], native_rec_chains=["A"], native_lig_chains=["B"], ) ``` ### Command-line multimer scoring The installed `pdb_cpp_dockq` executable runs `analysis.dockQ_multimer()`. This is the same multimer code path used in the benchmark script, including automatic native-to-model chain-map discovery when `--chain-map` is omitted. ```bash pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb ``` To provide an explicit mapping, pass native:model pairs: ```bash pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb --chain-map A:B,B:A,C:C ``` ### DockQ output dictionary | Key | Type | Description | |------------|---------------|--------------------------------------| | `DockQ` | `list[float]` | Combined DockQ score (0-1) | | `Fnat` | `list[float]` | Fraction of native contacts | | `Fnonnat` | `list[float]` | Fraction of non-native contacts | | `LRMS` | `list[float]` | Ligand RMSD (Å) | | `iRMS` | `list[float]` | Interface RMSD (Å) | | `rRMS` | `list[float]` | Receptor RMSD (Å) | Each value is a list with one entry per model in the input `Coor`. ### DockQ quality thresholds | DockQ range | Quality | |-------------|------------| | 0.00 – 0.23 | Incorrect | | 0.23 – 0.49 | Acceptable | | 0.49 – 0.80 | Medium | | 0.80 – 1.00 | High | ### Individual metric functions ```python # Fraction of native/non-native contacts fnat_list, fnonnat_list = analysis.native_contact( model, native, rec_chains=["A"], lig_chains=["B"], native_rec_chains=["A"], native_lig_chains=["B"], cutoff=5.0, ) # Interface RMSD irmsd_list = analysis.interface_rmsd( model, native, rec_chains_native=["A"], lig_chains_native=["B"], cutoff=10.0, ) ``` #### `native_contact` parameters | Parameter | Default | Description | |--------------------------|---------|----------------------------------------------------| | `coor` | — | Model `Coor` | | `native_coor` | — | Native `Coor` | | `rec_chains` | — | Model receptor chain IDs | | `lig_chains` | — | Model ligand chain IDs | | `native_rec_chains` | — | Native receptor chain IDs | | `native_lig_chains` | — | Native ligand chain IDs | | `cutoff` | `5.0` | Contact distance cutoff (Å) | | `residue_id_map` | `None` | `dict[int, int]` mapping model residue IDs to a shared ID space | | `native_residue_id_map` | `None` | `dict[int, int]` mapping native residue IDs to the same shared ID space | The `residue_id_map` parameters are useful when model and native have different residue numbering — provide dictionaries that map each residue `uniq_resid` to a common numbering scheme. #### `interface_rmsd` notes `interface_rmsd` returns a list of floats with one entry per model. Entries are `None` when no interface residues are found within the cutoff distance. ### Citation If you use DockQ scoring, please cite: - Mirabello, C. & Wallner, B. (2024) *Bioinformatics*. DOI: 10.1093/bioinformatics/btae586 --- ## 10. Hydrogen bond detection `pdb_cpp.analysis.hbonds` detects hydrogen bonds using a purely geometric approach (Baker & Hubbard, 1984). For structures without explicit hydrogen atoms, backbone NH and sidechain hydrogen positions are reconstructed algebraically. The algorithm covers all standard L/D amino acids and RNA/DNA nucleotides. ### Basic usage ```python from pdb_cpp import Coor from pdb_cpp.analysis import hbonds coor = Coor("tests/input/2rri.cif") # All protein–protein H-bonds for every model all_bonds = hbonds.hbonds(coor) # List of HBond objects for model 0 bonds_model0 = all_bonds[0] print(f"Model 0: {len(bonds_model0)} hydrogen bonds") for b in bonds_model0[:3]: print(b) ``` ### Function signature ```python hbonds.hbonds( coor, donor_sel = "protein", acceptor_sel = "protein", dist_DA_cutoff = 3.5, # D···A distance cutoff (Å) dist_HA_cutoff = 2.5, # H···A distance cutoff (Å) angle_cutoff = 90.0, # D−H···A angle cutoff (°) ) -> list[list[HBond]] ``` Returns a list with one inner list (of `HBond` objects) per model frame. #### Parameters | Parameter | Default | Description | |------------------|-------------|-------------------------------------------------------| | `coor` | — | A `Coor` object | | `donor_sel` | `"protein"` | Selection string for donor-containing atoms | | `acceptor_sel` | `"protein"` | Selection string for acceptor-containing atoms | | `dist_DA_cutoff` | `3.5` | Maximum heavy-atom donor–acceptor distance (Å) | | `dist_HA_cutoff` | `2.5` | Maximum hydrogen–acceptor distance (Å) | | `angle_cutoff` | `90.0` | Minimum D−H···A angle (°) | ### `HBond` object Each entry in the returned lists is an `HBond` object with the following read-only attributes: | Attribute | Type | Description | |----------------------|---------|----------------------------------------------------| | `donor_resid` | `int` | Donor residue sequence number | | `donor_resname` | `str` | Donor residue name | | `donor_chain` | `str` | Donor chain identifier | | `donor_heavy_name` | `str` | Donor heavy-atom name (e.g. `N`, `OG`) | | `donor_h_name` | `str` | Donor hydrogen atom name (e.g. `H`, `HG`) | | `donor_heavy_xyz` | `tuple` | Donor heavy-atom coordinates (x, y, z) | | `donor_h_xyz` | `tuple` | Donor hydrogen coordinates (x, y, z) | | `acceptor_resid` | `int` | Acceptor residue sequence number | | `acceptor_resname` | `str` | Acceptor residue name | | `acceptor_chain` | `str` | Acceptor chain identifier | | `acceptor_name` | `str` | Acceptor atom name (e.g. `O`, `OD1`) | | `acceptor_xyz` | `tuple` | Acceptor atom coordinates (x, y, z) | | `dist_DA` | `float` | D···A distance (Å) | | `dist_HA` | `float` | H···A distance (Å) | | `angle_DHA` | `float` | D−H···A angle (°) | ### Filtering and analysis examples ```python from pdb_cpp import Coor from pdb_cpp.analysis import hbonds coor = Coor("tests/input/2rri.cif") bonds = hbonds.hbonds(coor)[0] # Backbone–backbone H-bonds only bb_names = {"N", "O"} bb_bonds = [b for b in bonds if b.donor_heavy_name in bb_names and b.acceptor_name in bb_names] # Inter-chain H-bonds (cross-chain contacts) inter = [b for b in bonds if b.donor_chain != b.acceptor_chain] print(f"{len(inter)} inter-chain H-bonds") # Sort by D–A distance bonds.sort(key=lambda b: b.dist_DA) for b in bonds[:5]: print(f"{b.donor_chain}{b.donor_resid}{b.donor_resname:>4}" f" {b.donor_heavy_name} → " f"{b.acceptor_chain}{b.acceptor_resid}{b.acceptor_resname:>4}" f" {b.acceptor_name} d(DA)={b.dist_DA:.2f} Å" f" angle={b.angle_DHA:.1f}°") ``` ### Cross-selection H-bonds (e.g. protein–nucleic) The `donor_sel` and `acceptor_sel` parameters accept any valid selection string, enabling inter-molecular or cross-type analysis: ```python # Protein side-chains donating to RNA acceptors rna_contacts = hbonds.hbonds( coor, donor_sel = "protein", acceptor_sel = "nucleic", ) # H-bonds between two specific chains ab_bonds = hbonds.hbonds(coor, donor_sel="chain A", acceptor_sel="chain B") ``` ### Baker & Hubbard criteria The default thresholds reproduce the original Baker & Hubbard (1984) geometric criteria: $$ d(\text{D}{\cdots}\text{A}) \leq 3.5~\text{Å}, \quad d(\text{H}{\cdots}\text{A}) \leq 2.5~\text{Å}, \quad \angle(\text{D{-}H}{\cdots}\text{A}) \geq 90° $$ A stricter cut-off of 120° (`angle_cutoff=120`) is closer to values used by biotite and is recommended when comparing with other tools. --- ## 11. Geometry utilities ### Distance matrix ```python from pdb_cpp import Coor, geom coor = Coor("tests/input/1y0m.cif") ca = coor.select_atoms("name CA") # Pairwise distance matrix between two coordinate sets dmat = geom.distance_matrix(ca.xyz, ca.xyz) print(dmat.shape) # (N_ca, N_ca) ``` The function accepts any two `Coor` objects (or selections) and returns an `(N, M)` float32 NumPy array of Euclidean distances. ### Hybrid-36 encoding/decoding PDB format uses hybrid-36 encoding for atom/residue numbers > 99999: ```python from pdb_cpp.core import hy36encode, hy36decode encoded = hy36encode(5, 100000) # "A0000" decoded = hy36decode(5, "A0000") # 100000 ``` --- ## 12. Data cleaning utilities ### Remove incomplete backbone residues Residues missing any backbone atom (CA, C, N, O by default) are removed: ```python from pdb_cpp import select clean_coor = select.remove_incomplete_backbone_residues(coor) # Or with custom backbone definition clean_coor = select.remove_incomplete_backbone_residues( coor, back_atom=["CA", "C", "N"] ) ``` This is recommended before structural alignment to avoid mismatches. --- ## 13. Module summary | Module | Key functions / classes | |---------------------|-----------------------------------------------------------| | `pdb_cpp` | `Coor`, `Model` — main data objects | | `pdb_cpp.core` | `get_common_atoms`, `coor_align`, `align_seq_based`, `tmalign_ca`, `distance_matrix`, `compute_SS`, `sequence_align`, `Alignment_cpp`, `hy36encode`, `hy36decode` | | `pdb_cpp.alignment` | `align_seq`, `print_align_seq`, `align_chain_permutation` | | `pdb_cpp.analysis` | `rmsd`, `interface_rmsd`, `native_contact`, `dockQ` | | `pdb_cpp.analysis.hbonds` | `hbonds` — Baker & Hubbard H-bond detection | | `pdb_cpp.TMalign` | `compute_secondary_structure` | | `pdb_cpp.geom` | `distance_matrix` | | `pdb_cpp.select` | `remove_incomplete_backbone_residues` | | `pdb_cpp.sequence` | `get_aa_seq`, `get_aa_DL_seq` | | `pdb_cpp.data` | `BLOSUM62`, `get_blosum62`, `load_blosum`, `AA_DICT`, `NA_DICT`, `AA_NA_DICT` | --- ## 14. Data subpackage (`pdb_cpp.data`) The `pdb_cpp.data` module provides residue dictionaries and the BLOSUM62 substitution matrix used internally by the alignment routines. ### BLOSUM62 matrix ```python from pdb_cpp.data.blosum import BLOSUM62 # Substitution score for two amino acids print(BLOSUM62[("A", "A")]) # 4 print(BLOSUM62[("A", "W")]) # -3 # Load a custom matrix from file from pdb_cpp.data.blosum import load_blosum custom = load_blosum("/path/to/matrix.txt") ``` `BLOSUM62` is a lazy-loaded dictionary mapping `(aa1, aa2)` tuples to integer substitution scores. ### Residue dictionaries ```python from pdb_cpp.data.res_dict import AA_DICT, AA_DICT_L, AA_DICT_D, NA_DICT # 3-letter to 1-letter amino acid code print(AA_DICT["ALA"]) # "A" print(AA_DICT["GLY"]) # "G" # D-amino acids (e.g. DAL -> "A") print(AA_DICT_D["DAL"]) # "A" # Nucleic acids print(NA_DICT["DA"]) # "A" # Combined amino acid + nucleic acid dictionary from pdb_cpp.data.res_dict import AA_NA_DICT ``` | Dictionary | Contents | |-------------|---------------------------------------------| | `AA_DICT_L` | Standard L-amino acids (31 entries incl. protonation variants) | | `AA_DICT_D` | D-amino acids (22 entries) | | `AA_DICT` | Combined L + D amino acids | | `NA_DICT` | DNA nucleotides (DA, DT, DC, DG) | | `AA_NA_DICT`| Combined amino acids + nucleotides |