Representing Molecular Interaction Data: From Crystal Structures to Learned Embeddings

Apr 9

Consider a single moment of molecular recognition: an inhibitor settles into the ATP binding pocket of a kinase, displaces an ordered water, forms a hydrogen bond to the hinge backbone, and makes van der Waals contact with a gatekeeper methionine. This event has a physical reality -- it is what the X-ray diffraction pattern records, what the binding assay measures, what the free energy calculation tries to reproduce. But by the time the event reaches a downstream analysis -- a scoring function, a docking benchmark, a machine learning model -- it has been translated into one of a dozen different data structures. Each representation preserves some aspects of the interaction and discards others. Each is an opinion, enforced at the level of the file format, about which features of the binding event matter.

The choice of representation is not neutral. A protein-ligand complex stored as atomic coordinates contains information that a 2048-bit interaction fingerprint cannot recover. A distance matrix captures geometric relationships that a tabulated IC50 value erases entirely. A voxelized density grid enables 3D convolutional processing but buries the identities of individual atoms inside a tensor of floating-point numbers. None of these is the binding event itself. Each is a projection of it into a space chosen to match a specific question.

This article surveys the major representations of molecular interaction data used in computational drug discovery and structural biology. It covers where each representation comes from, what it encodes, what it discards, which tools produce and consume it, and which downstream tasks it enables. It is not a ranking. The appropriate representation depends on the question, the source data, the compute budget, and the modeling approach. But understanding the landscape -- and the tradeoffs between representations -- is the difference between choosing a tool deliberately and reaching for whatever the last paper happened to use.

What Counts as Molecular Interaction Data

Before cataloging representations, it is worth being precise about scope. "Molecular interaction data" in the drug discovery context generally refers to information about non-covalent contacts between a small molecule and a macromolecular target -- most commonly a protein, but also nucleic acids or membrane bilayers. The same framework extends to protein-protein interactions, host-pathogen binding, and enzyme-substrate complexes, though each has its own specialized conventions.

The data itself can come from several sources with very different characteristics. Experimental structural biology (X-ray crystallography, cryo-electron microscopy, NMR) produces high-resolution atomic coordinates at the cost of slow and expensive acquisition. Molecular docking generates predicted poses at low cost but with uncertain accuracy. Molecular dynamics simulations produce trajectories that capture temporal behavior but can generate gigabytes to terabytes of data per system. Biochemical assays produce scalar affinity measurements with no structural detail at all. Sequence-based predictions infer binding from protein and ligand strings without ever computing a 3D structure. These sources differ in spatial resolution, confidence, and coverage, and the representation chosen downstream must be compatible with the source's actual information content -- a graph neural network trained on AlphaFold predictions is learning from structures, but not from structures with the reliability of a 1.5 Angstrom crystal structure.

Coordinate-Based Structures

The most information-rich representation is the raw atomic coordinate file. A PDB file records the three-dimensional Cartesian position, element, residue membership, chain identifier, occupancy, and B-factor for every modeled atom in the asymmetric unit, along with crystallographic metadata (resolution, R-factors, space group, ligand identifiers). SDF and MOL2 files play the analogous role for small molecules and ligand-bound complexes, with MOL2 adding explicit Sybyl atom types and partial charges that MDL-format files leave implicit. The newer mmCIF format replaces the 80-column PDB record structure with a more extensible tabular layout and is now the primary distribution format for the RCSB Protein Data Bank.
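To make the fixed-column layout concrete, here is a minimal sketch that slices a single ATOM record by the column positions of the legacy PDB format. The AtomRecord class, the parse_atom_line function, and the sample atom are illustrative assumptions, not part of any library; real pipelines should rely on an established parser rather than hand-rolled slicing.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    """Hypothetical container for the fields of one ATOM/HETATM line."""
    record: str
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice a PDB ATOM/HETATM line by its fixed column positions."""
    line = line.ljust(80)  # pad short lines so every slice is in range
    return AtomRecord(
        record=line[0:6].strip(),     # columns 1-6
        serial=int(line[6:11]),       # columns 7-11
        name=line[12:16].strip(),     # columns 13-16
        alt_loc=line[16].strip(),     # column 17
        res_name=line[17:20].strip(), # columns 18-20
        chain_id=line[21].strip(),    # column 22
        res_seq=int(line[22:26]),     # columns 23-26
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
        occupancy=float(line[54:60]), # columns 55-60
        b_factor=float(line[60:66]),  # columns 61-66
        element=line[76:78].strip(),  # columns 77-78
    )


# The sample line is built with a format string so the columns are aligned by construction.
sample = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
).format("ATOM", 1, " N", "", "MET", "A", 318, "", 12.345, -4.210, 7.008, 1.0, 23.45, "N")

atom = parse_atom_line(sample)
print(atom.res_name, atom.res_seq, (atom.x, atom.y, atom.z), atom.b_factor)
```

Note that nothing in the parsed record says what the atom interacts with; that has to be derived from the positions.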

Coordinate files are the lingua franca of structural biology. Every other structure-derived representation in this article is ultimately computed from them. The PDBbind database (Wang et al. 2004) pairs roughly 20,000 experimentally determined protein-ligand complexes with measured binding affinities and is distributed as coordinate files plus plain-text index files of affinity metadata. PDBbind serves as the backbone dataset for most structure-based machine learning benchmarks in binding affinity prediction. Binding MOAD (Hu et al. 2005) offers a complementary collection focused on biologically relevant complexes with curated ligand annotations.

The limitations of coordinate representations are practical rather than informational. They are relatively large (a typical protein-ligand complex file runs 100-500 KB uncompressed), they are spatially sparse (most of the 3D extent is empty), and they do not directly encode interactions. A hydrogen bond between a ligand carbonyl and a protein backbone amide is an emergent property of the atomic positions, not an explicit field in the file. Any downstream consumer must either learn to identify interactions implicitly or apply a pre-processing step that extracts them. Crystallographic artifacts complicate the picture further: alternate conformations, partially occupied ligands, crystallographic waters whose positions may or may not be mechanistically meaningful, and missing density in flexible loops all introduce ambiguity that a naive pipeline will silently mishandle.
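One of the artifacts mentioned above, alternate conformations, can be handled explicitly instead of silently. The sketch below keeps the highest-occupancy entry for each atom; the function name and the dictionary layout are assumptions made for illustration. Selecting per atom can mix conformers within a residue, so some pipelines choose one alternate location per residue instead.

```python
def select_highest_occupancy(atoms):
    """Keep one entry per (chain, residue number, atom name): the highest occupancy.

    `atoms` is an iterable of dicts with the keys chain, res_seq, name, alt_loc,
    and occupancy. On ties, the first entry encountered is kept.
    """
    best = {}
    for atom in atoms:
        key = (atom["chain"], atom["res_seq"], atom["name"])
        if key not in best or atom["occupancy"] > best[key]["occupancy"]:
            best[key] = atom
    return list(best.values())


# Hypothetical atoms: one sulfur modeled in two alternate locations, one ordinary atom.
atoms = [
    {"chain": "A", "res_seq": 318, "name": "SD", "alt_loc": "A", "occupancy": 0.65},
    {"chain": "A", "res_seq": 318, "name": "SD", "alt_loc": "B", "occupancy": 0.35},
    {"chain": "A", "res_seq": 319, "name": "CA", "alt_loc": "", "occupancy": 1.00},
]
kept = select_highest_occupancy(atoms)
print([(a["res_seq"], a["name"], a["alt_loc"]) for a in kept])
```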

Interaction Fingerprints

If coordinates are the maximum-information representation, interaction fingerprints sit at the opposite end of the information-density spectrum. They reduce a binding event to a fixed-length binary or count vector, throwing away geometric detail in exchange for compactness and comparability. The idea originated with Structural Interaction Fingerprints (SIFt; Deng et al. 2004), which represent each residue in the binding site as a 7-bit vector encoding whether the residue is in contact with the ligand, whether the contact involves main-chain atoms, whether it involves side-chain atoms, whether it is polar, whether it is non-polar, whether the residue acts as a hydrogen-bond acceptor, and whether it acts as a hydrogen-bond donor. Concatenated across all binding-site residues, the result is a sparse binary fingerprint suitable for Tanimoto similarity search, clustering, and classical machine learning models.
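The per-residue concatenation is easy to see in code. The following sketch assumes the interactions have already been detected and only shows how seven bits per residue become one flat vector. The bit labels, residue names, and function name are assumptions of this sketch; any set of seven per-residue categories can be substituted.

```python
# Seven per-residue bits. The labels are an assumption of this sketch and can be swapped.
BITS = ("contact", "backbone", "sidechain", "polar", "nonpolar", "hb_acceptor", "hb_donor")


def sift_fingerprint(residues, interactions):
    """Concatenate one 7-bit block per binding-site residue, in the given residue order.

    `interactions` maps a residue label to the set of bit names that are switched on.
    Residues with no entry contribute seven zeros.
    """
    bits = []
    for residue in residues:
        present = interactions.get(residue, set())
        bits.extend(1 if name in present else 0 for name in BITS)
    return bits


site = ["GLU316", "MET318", "ASP381"]  # hypothetical binding-site residues
observed = {
    "MET318": {"contact", "backbone", "polar", "hb_donor"},
    "ASP381": {"contact", "sidechain", "nonpolar"},
}
fp = sift_fingerprint(site, observed)
print(len(fp), fp)
```

Because every complex of the same target uses the same residue order, bit positions line up across complexes, which is what makes the fingerprints directly comparable.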

Subsequent variants extend this framework in different directions. The Atom-Pair Interaction Fingerprint (APIF) encodes pairs of ligand-protein atom contacts to capture higher-order interaction patterns. SPLIF (Da and Kireev 2014) captures local structural environments around each contact using extended-connectivity fingerprints on both the ligand and protein atoms involved. PLEC (Wojcikowski et al. 2019) combines the two approaches, hashing pairs of environment fingerprints into a Morgan-style vector that has proven effective for rescoring docking poses. ProLIF (Bouysset and Fiorucci 2021) provides a modern Python implementation supporting eight interaction types -- hydrophobic, hydrogen bond, halogen bond, ionic, cation-pi, pi-stacking, pi-cation, and metal coordination -- with direct compatibility for MDAnalysis trajectories, which makes it a common choice for analyzing interactions across molecular dynamics simulations rather than single static structures.

Interaction fingerprints have three main virtues. They are compact (typically 256 to 8192 bits). They are comparable across complexes: the Tanimoto similarity between two fingerprints is a reasonable proxy for binding-mode similarity. And they are interpretable at the bit level -- each bit or residue position can be traced back to a specific contact, which allows a chemist to read a fingerprint directly and understand which interactions are present.
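The comparability claim rests on a one-line formula: the Tanimoto similarity is the number of bits set in both fingerprints divided by the number set in either. A minimal sketch follows; the function name and the two example poses are hypothetical, and returning 0.0 when both fingerprints are empty is a choice made here, not a universal convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length binary vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption of this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Two hypothetical poses described over the same two residues (7 bits each).
pose_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
pose_b = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(tanimoto(pose_a, pose_b), 3))
```

Here the poses share all four bits on the first residue, while the second pose makes no contact with the second residue, so the similarity is 4/7.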

The main limitation is information loss. A fingerprint records that an interaction exists but usually not its geometry; a hydrogen bond at an ideal 2.8 Angstrom distance and a stretched 3.4 Angstrom one register identically. Strength gradations, the relative positions of multiple interactions, and residue identity context are compressed or lost. For tasks that depend on subtle geometric discrimination -- selectivity engineering between close analogs, covalent warhead positioning, resistance mutation analysis -- fingerprints alone are insufficient and must be combined with richer representations.

Distance Matrices and Contact Maps

A distance matrix represents an interaction by tabulating pairwise distances between selected atoms -- commonly all heavy atoms of the ligand against all atoms of the binding pocket, or alpha carbons of the binding-site residues against the ligand centroid. A contact map binarizes this: position (i, j) is 1 if atoms i and j lie within some threshold distance (typically 4 to 5 Angstroms for heavy-atom contacts), 0 otherwise. Optional channels can layer interaction type on top of raw distance.
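As a minimal sketch, both objects can be computed from two coordinate lists with the standard library alone. The function names, the toy coordinates, and the 4.5 Angstrom default cutoff are illustrative assumptions; the cutoff sits inside the typical range quoted above.

```python
import math


def distance_matrix(ligand_xyz, pocket_xyz):
    """Rows are ligand atoms, columns are pocket atoms, entries are distances in Angstroms."""
    return [[math.dist(lig, poc) for poc in pocket_xyz] for lig in ligand_xyz]


def contact_map(distances, cutoff=4.5):
    """Binarize a distance matrix: 1 where the distance is within the cutoff, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in distances]


# Toy coordinates, not taken from any real structure.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
pocket = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 8.0)]

distances = distance_matrix(ligand, pocket)
contacts = contact_map(distances)
print([[round(d, 2) for d in row] for row in distances])
print(contacts)
```

The binarized map is smaller and easier to compare, but the distinction between a 1.5 Angstrom and a 4.3 Angstrom separation is gone once the threshold is applied.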

This representation occupies a useful middle ground. It preserves geometry more faithfully than a fingerprint -- exact distances rather than presence/absence flags -- while remaining a fixed-size tensor that standard machine learning architectures can consume directly. Contact maps are the native input for several protein-ligand prediction architectures and closely mirror the intra-protein contact maps and binned distance distributions (distograms) used by structure prediction methods like AlphaFold.

The fixed-size requirement is also the main source of friction. Ligands have variable atom counts, so the matrix dimensions must either be padded to a maximum size or managed with dynamic architectures. Binding sites similarly vary in residue count, and the choice of how to define "binding site residues" -- by distance cutoff from the ligand, by a fixed sphere around a reference point, by manual annotation -- affects what the contact map contains. Two distance matrices computed for the same complex with different binding-site definitions are not directly comparable, which makes sharing and benchmarking harder than it first appears.
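The padding option can be sketched in a few lines. The version below pads a matrix to a chosen maximum shape and returns a mask marking which entries are real, so that a downstream model can ignore the filler. The function name and the fill value are assumptions of this sketch.

```python
def pad_matrix(matrix, max_rows, max_cols, fill=0.0):
    """Pad a rectangular 2D list to (max_rows, max_cols).

    Returns the padded matrix and a same-shaped mask with 1 for real entries
    and 0 for padding.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    if n_rows > max_rows or n_cols > max_cols:
        raise ValueError("matrix is larger than the requested padded shape")
    padded = [[fill] * max_cols for _ in range(max_rows)]
    mask = [[0] * max_cols for _ in range(max_rows)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            padded[i][j] = value
            mask[i][j] = 1
    return padded, mask


# A 2 x 3 distance matrix (two ligand atoms, three pocket atoms) padded to 4 x 4.
small = [[3.0, 4.0, 8.0], [1.5, 4.3, 8.1]]
padded, mask = pad_matrix(small, max_rows=4, max_cols=4)
print(padded)
print(mask)
```

A fill value of 0.0 is ambiguous for distances, since zero also means "same position", which is why the mask matters more than the filler.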

Interaction Graphs

Graph representations treat a molecular interaction as a heterogeneous network. Nodes correspond to atoms (at the finest granularity) or residues (for coarser views), and edges correspond to covalent bonds, through-space contacts, or specific interaction types. Separate node types distinguish protein atoms from ligand atoms, and separate edge types distinguish covalent bonds from non-covalent contacts. Node features encode atom-level chemistry (element, hybridization, partial charge, aromaticity, hydrogen count), and edge features encode bond or interaction properties (type, distance, directionality).
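A minimal sketch of such a heterogeneous graph follows, using plain Python containers rather than any graph library. The InteractionGraph class, the build_graph function, the 4.5 Angstrom contact cutoff, and the toy atoms are all assumptions made for illustration; covalent bonds within the protein are omitted to keep the example short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class InteractionGraph:
    """Hypothetical container: typed nodes plus typed, distance-labeled edges."""
    nodes: list = field(default_factory=list)  # dicts with kind, element, xyz
    edges: list = field(default_factory=list)  # tuples (i, j, edge_type, distance)


def build_graph(ligand_atoms, protein_atoms, ligand_bonds, contact_cutoff=4.5):
    """Build an atom-level graph.

    ligand_atoms and protein_atoms are lists of (element, (x, y, z));
    ligand_bonds is a list of index pairs into ligand_atoms.
    """
    graph = InteractionGraph()
    for element, xyz in ligand_atoms:
        graph.nodes.append({"kind": "ligand", "element": element, "xyz": xyz})
    offset = len(ligand_atoms)  # protein node indices start after the ligand nodes
    for element, xyz in protein_atoms:
        graph.nodes.append({"kind": "protein", "element": element, "xyz": xyz})

    for i, j in ligand_bonds:
        d = math.dist(ligand_atoms[i][1], ligand_atoms[j][1])
        graph.edges.append((i, j, "covalent", d))

    for i, (_, lig_xyz) in enumerate(ligand_atoms):
        for j, (_, prot_xyz) in enumerate(protein_atoms):
            d = math.dist(lig_xyz, prot_xyz)
            if d <= contact_cutoff:
                graph.edges.append((i, offset + j, "contact", d))
    return graph


ligand_atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0))]
protein_atoms = [("N", (4.0, 0.0, 0.0)), ("C", (9.0, 0.0, 0.0))]
graph = build_graph(ligand_atoms, protein_atoms, ligand_bonds=[(0, 1)])
print(len(graph.nodes), [(i, j, kind) for i, j, kind, _ in graph.edges])
```

Adding a third ligand atom would simply add a node and whatever edges it earns, with no change to the data structure.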

This is the representation of choice for graph neural network architectures applied to structure-based tasks. Message-passing or graph convolutional layers learn to aggregate local features into task-relevant embeddings, and frameworks like PyTorch Geometric, DGL-LifeSci, and TorchDrug provide production implementations with pre-built readers for the common file formats.

Graph representations have two advantages over voxel- and coordinate-based approaches. First, they are naturally invariant to rotation and translation, provided node and edge features are built from quantities such as distances rather than raw coordinates, since graph structure does not depend on a choice of coordinate system. Second, they scale gracefully to variable-size inputs -- adding atoms adds nodes, without requiring padding or resizing. Their main limitation is geometric expressiveness: a standard message-passing graph encodes topology and pairwise distances but not higher-order geometric features like bond angles or torsions. Recent work on geometric and equivariant graph networks (SE(3)-transformers, EGNN, DimeNet, Equiformer) addresses this by incorporating directional information into the message-passing step, often at the cost of added architectural complexity and longer training times. RoseTTAFold All-Atom, which jointly predicts protein structure and bound ligand geometry, represents small molecules as atom-bond graphs; AlphaFold 3 (Abramson et al. 2024) works with atom-level tokens and pairwise features rather than an explicit graph, but the underlying idea of typed atoms with pairwise relationships is closely related.
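The invariance argument can be checked numerically: rotating and translating a set of atoms changes every coordinate but leaves every pairwise distance, and therefore every distance-based edge feature, unchanged. The sketch below uses a rotation about the z axis for simplicity; the function names and coordinates are illustrative.

```python
import math


def rotate_z(xyz, angle_rad):
    """Rotate a point about the z axis by the given angle."""
    x, y, z = xyz
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def pairwise_distances(coords):
    """Distances for every unordered pair of points, in a fixed order."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.5), (-1.0, 3.0, 2.0)]

# Apply a rigid-body motion: rotate by 37 degrees, then translate.
rotated = [rotate_z(p, math.radians(37.0)) for p in coords]
moved = [(x + 10.0, y - 5.0, z + 2.5) for x, y, z in rotated]

before = pairwise_distances(coords)
after = pairwise_distances(moved)
max_diff = max(abs(a - b) for a, b in zip(before, after))
coordinates_changed = any(math.dist(p, q) > 1.0 for p, q in zip(coords, moved))

print(coordinates_changed, max_diff < 1e-9)
```

A model that consumes the raw coordinates sees two different inputs here; a model that consumes only the distances sees one.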

Voxel Grids

Voxel-based representations discretize the 3D space around a binding site into a regular grid of small cubes -- typically 0.5 to 1.0 Angstrom per voxel, covering a 20 to 30 Angstrom cube centered on the ligand. Each voxel stores one or more channels representing the presence or density of different atom types (carbon, nitrogen, oxygen, sulfur, halogen) or interaction features (hydrogen-bond donor, acceptor, hydrophobic character). The result is a 4D tensor of shape (channels, x, y, z) that can be fed directly to a 3D convolutional neural network without further preprocessing.
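A minimal voxelization sketch is shown below. It counts atoms per voxel in five element channels, which is the simplest possible occupancy scheme; published methods generally spread each atom over neighboring voxels with a distance-dependent density instead. The channel list, function name, and toy atoms are assumptions of this sketch.

```python
CHANNELS = ("C", "N", "O", "S", "halogen")
HALOGENS = {"F", "Cl", "Br", "I"}


def voxelize(atoms, center, box=24.0, resolution=1.0):
    """Return grid[channel][ix][iy][iz] holding atom counts.

    atoms is a list of (element, (x, y, z)). Atoms that fall outside the box,
    or whose element has no channel, are skipped.
    """
    n = int(round(box / resolution))
    grid = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in CHANNELS]
    origin = [c - box / 2.0 for c in center]  # lower corner of the box
    for element, xyz in atoms:
        channel = "halogen" if element in HALOGENS else element
        if channel not in CHANNELS:
            continue
        idx = [int((v - o) // resolution) for v, o in zip(xyz, origin)]
        if all(0 <= i < n for i in idx):
            grid[CHANNELS.index(channel)][idx[0]][idx[1]][idx[2]] += 1.0
    return grid


atoms = [
    ("C", (0.2, 0.2, 0.2)),
    ("Cl", (-11.5, 3.4, 0.0)),
    ("O", (30.0, 0.0, 0.0)),   # outside the 24 Angstrom box, skipped
    ("H", (0.0, 0.0, 0.0)),    # no hydrogen channel in this sketch, skipped
]
grid = voxelize(atoms, center=(0.0, 0.0, 0.0))
shape = (len(grid), len(grid[0]), len(grid[0][0]), len(grid[0][0][0]))
total = sum(v for ch in grid for plane in ch for row in plane for v in row)
print(shape, total)
```

Almost every entry of the resulting tensor is zero, which illustrates the storage cost of putting sparse atoms on a dense grid.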

Voxelization was the entry point for deep learning into structure-based drug design. KDEEP (Jimenez et al. 2018) demonstrated that 3D CNNs trained on voxel grids could predict binding affinities competitively with knowledge-based scoring functions, and the closely related Atomic Convolutional Networks (Gomes et al. 2017) applied convolution-like operations to atomic neighborhoods rather than to a regular grid. The appeal of the voxel approach is the direct analogy to image classification: the binding site is a 3D image, voxels are pixels, and the convolutional filters respond to the same local pattern wherever it appears in the grid.

The limitations are computational and representational. Voxel grids are expensive to store and process -- a 24-Angstrom cube at 1-Angstrom resolution is 13,824 voxels per channel, and memory scales cubically with the edge length of the box and with the inverse of the voxel size. They are not rotation-invariant: a rotated binding site produces a different tensor, requiring either data augmentation during training or explicit equivariance in the architecture. And the discretization introduces boundary artifacts that can affect predictions for small ligand translations. For these reasons, voxel methods have partially been displaced by graph and equivariant approaches in recent benchmarks, though they remain competitive for tasks where the regular grid structure is a natural fit.
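The storage arithmetic is worth running once. The sketch below reproduces the 24-Angstrom example and shows what happens when the voxel size is halved; the function name, the eight-channel setting, and the assumption of 4-byte floats are illustrative.

```python
def grid_size(box_angstrom, resolution_angstrom, channels=1, bytes_per_value=4):
    """Return (voxels per channel, total bytes) for a cubic grid of 4-byte values."""
    n = round(box_angstrom / resolution_angstrom)
    voxels = n ** 3
    return voxels, voxels * channels * bytes_per_value


coarse = grid_size(24, 1.0)            # one channel at 1 Angstrom
fine = grid_size(24, 0.5, channels=8)  # eight channels at 0.5 Angstrom
print(coarse)  # (13824, 55296)
print(fine)    # (110592, 3538944)
```

Halving the voxel size multiplies the voxel count by eight, so a single complex goes from tens of kilobytes per channel to several megabytes across eight channels before any rotational augmentation is applied.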

Pharmacophore Models

A pharmacophore is an abstract representation of the chemical features responsible for a molecule's biological activity, independent of which specific atoms provide those features. The canonical feature types are hydrogen bond donors, hydrogen bond acceptors, positive ionizable groups, negative ionizable groups, hydrophobic centers, and aromatic rings. A pharmacophore model encodes the 3D arrangement of these features -- typically as spheres with distance and angle tolerances -- that a molecule must match to be considered consistent with the model.
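The sphere-based matching can be sketched as follows, under a strong simplifying assumption: the candidate conformer has already been aligned into the model's coordinate frame. The function name, feature labels, and coordinates are hypothetical, and the check does not enforce a one-to-one assignment between features and spheres or any angular constraint.

```python
import math


def matches_pharmacophore(model, features):
    """True if every model sphere contains at least one feature of the same type.

    model is a list of (feature_type, center_xyz, radius); features is a list of
    (feature_type, xyz) expressed in the same coordinate frame as the model.
    """
    return all(
        any(f_type == m_type and math.dist(xyz, center) <= radius
            for f_type, xyz in features)
        for m_type, center, radius in model
    )


model = [
    ("donor", (0.0, 0.0, 0.0), 1.0),
    ("acceptor", (3.0, 0.0, 0.0), 1.0),
    ("aromatic", (0.0, 5.0, 0.0), 1.5),
]
candidate_a = [
    ("donor", (0.3, 0.2, 0.0)),
    ("acceptor", (3.4, -0.3, 0.1)),
    ("aromatic", (0.5, 5.8, 0.2)),
]
candidate_b = [
    ("donor", (0.3, 0.2, 0.0)),
    ("acceptor", (3.4, -0.3, 0.1)),
    ("aromatic", (4.0, 9.0, 0.0)),  # aromatic ring far from the required position
]
print(matches_pharmacophore(model, candidate_a), matches_pharmacophore(model, candidate_b))
```

Note that the check never asks which atoms supply a feature, only whether a feature of the right type sits in the right place.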

Pharmacophores are the oldest interaction representation in rational drug design, predating crystallographic drug design and molecular docking. Their modern form is implemented in software like LigandScout, MOE, Phase, Catalyst, and RDKit's native pharmacophore tools. They are particularly valuable for ligand-based virtual screening in the absence of a target structure: a pharmacophore derived from a set of known actives can screen millions of compounds quickly, and the geometric constraints naturally exclude molecules that cannot adopt the required feature arrangement.

The strength of pharmacophores -- abstraction away from specific atoms -- is also their weakness. They capture what matters for activity in a human-interpretable way but discard the atomic-level detail required for quantitative binding affinity prediction. For classification tasks (active versus inactive) pharmacophores are efficient. For regression tasks (predicting IC50 within a factor of three) they are usually supplemented with more detailed representations.

Tabular Affinity Data

The simplest representation strips away structural information entirely and records only the pairing: compound identifier, target identifier, affinity measurement, units, assay type, and a small set of metadata fields. This is the format of BindingDB, ChEMBL's activities table, and the bioactivity sections of PubChem and the IUPHAR/BPS Guide to Pharmacology. A single row encodes a single measurement without any reference to coordinates, conformation, or contact geometry.

Tabular affinity data is the substrate for ligand-based models -- quantitative structure-activity relationship regressions, message-passing networks trained only on SMILES, random forests on Morgan fingerprints, and the more recent ligand-only transformer models. It is also the ground truth against which structure-based models are evaluated: a docking pose is scored, the score is compared to a tabular Ki value, and the correlation determines whether the method works.

The practical concerns here are the inverse of the coordinate-file concerns. Tabular data is compact, uniform, and easy to query, but it has been stripped of everything that makes an interaction spatial. Two compounds with identical IC50 values against the same target may bind in completely different poses through completely different contacts, and the table cannot distinguish them. Curation issues also loom larger than they first appear: unit inconsistency, assay type conflation, mixing of functional and binding measurements, and stereochemistry loss during compound identifier resolution are all failure modes that silently corrupt downstream models. These are the same issues covered at length in the data foundation article, applied here to bioactivity rather than training metadata.
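Unit inconsistency is the easiest of these failure modes to guard against. The sketch below converts affinities to a negative log10 molar scale and refuses units it does not recognize, while leaving the assay type as a separate field so that IC50 and Ki values are not silently merged. The compound and target identifiers, the function name, and the row layout are hypothetical.

```python
import math

UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_p_affinity(value, unit):
    """Convert an affinity (IC50, Ki, Kd) to -log10 of its molar value."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unrecognized unit: {unit!r}")
    if value <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


rows = [
    {"compound": "CPD-1", "target": "KINASE-X", "type": "IC50", "value": 25.0, "unit": "nM"},
    {"compound": "CPD-1", "target": "KINASE-X", "type": "IC50", "value": 0.025, "unit": "uM"},
    {"compound": "CPD-2", "target": "KINASE-X", "type": "Ki", "value": 3.0, "unit": "uM"},
]
for row in rows:
    row["p_affinity"] = round(to_p_affinity(row["value"], row["unit"]), 2)

print([(r["compound"], r["type"], r["p_affinity"]) for r in rows])
```

The first two rows describe the same measurement in different units and collapse to the same value once normalized; without the conversion they would differ by a factor of a thousand.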

Extraction Tools: Bridging Between Representations

Most practical pipelines do not start from scratch when moving between representations. A mature toolkit sits between coordinates and the more abstract forms, automating the extraction step so that a scientist does not have to hand-write geometry checks for every interaction type. PLIP (Protein-Ligand Interaction Profiler; Salentin et al. 2015) takes a PDB file as input and returns an annotated list of hydrogen bonds, hydrophobic contacts, halogen bonds, salt bridges, pi-stacking, pi-cation, water bridges, and metal complexes, with residue-level assignments and geometric details. It is the de facto reference implementation for automated interaction extraction from static structures. ProLIF serves the analogous role for trajectories, where the question is not which interactions are present in a single pose but which fraction of the trajectory contains each interaction.
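The trajectory question reduces to counting. Assuming some upstream tool has already produced a set of interaction labels for each frame, the fraction of frames containing each interaction can be computed as below. This sketch does not call ProLIF or PLIP; the label format, function name, and frame contents are hypothetical.

```python
from collections import Counter


def interaction_occupancy(frames):
    """Map each interaction label to the fraction of frames in which it appears.

    frames is a list with one collection of labels per trajectory frame.
    """
    if not frames:
        return {}
    counts = Counter(label for frame in frames for label in set(frame))
    return {label: counts[label] / len(frames) for label in sorted(counts)}


frames = [
    {"MET318:HBDonor", "PHE382:PiStacking"},
    {"MET318:HBDonor"},
    {"MET318:HBDonor", "ASP381:Ionic"},
    {"PHE382:PiStacking"},
]
occupancy = interaction_occupancy(frames)
print(occupancy)
```

An interaction present in three of four frames and one present in a single frame would look identical in a fingerprint computed from one snapshot, which is the case for analyzing trajectories rather than static poses.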

The Open Drug Discovery Toolkit (ODDT) and RDKit both provide lower-level programmatic access to the same operations, allowing a pipeline to compute interaction fingerprints, extract contact maps, or build graphs from a shared upstream representation. A typical modern workflow reads coordinates with RDKit or MDAnalysis, runs PLIP or ProLIF to identify interactions, computes one or more fingerprints with ProLIF or the ODDT fingerprint module, and stores the results alongside the original coordinates in a versioned data store. The representations are not competitors in such a pipeline; they are co-existing views of the same underlying event, each feeding a different downstream consumer.

Hybrid Representations and Practical Guidance

In practice, most modern structure-based ML pipelines use multiple representations simultaneously. A typical workflow might ingest coordinates from PDBbind, extract a graph for the ligand and a graph for the binding pocket, compute an interaction fingerprint for each complex, and store tabular affinity values as the prediction target. The graphs feed a message-passing model, the fingerprints serve as both a baseline and an interpretability layer, and the table provides labels and stratification information. No single representation carries the entire workload.

Choosing a representation for a new task is a matter of matching information content to requirement. If the task is similarity search across a known database of complexes, interaction fingerprints are cheap and effective. If the task is quantitative affinity prediction with strong structural signal and data on the scale of PDBbind, a graph or voxel representation is appropriate. If the task is pose scoring inside a docking workflow, SPLIF or PLEC fingerprints have performed well in their published benchmarks (Da and Kireev 2014; Wojcikowski et al. 2019). If the task is selectivity engineering between close analogs, coordinate files and explicit free energy calculations are probably required -- no compressed representation captures the energetics with the precision needed. If the task is ligand-based screening with no structure of the target, tabular affinity data feeding a QSAR model is both the cheapest and often the only available option.

A useful diagnostic question: what information does the representation discard, and is the discarded information relevant to the question at hand? An interaction fingerprint discards geometry; if geometry matters, look elsewhere. A voxel grid discards atom identities within a channel; if identity matters, look elsewhere. A contact map discards interaction types; if type matters, supplement with an interaction fingerprint or a graph with edge features. A tabular entry discards everything structural; if any structural information matters, a table is not enough on its own.

The second diagnostic question is about compatibility with the source. A representation cannot manufacture information the source did not contain. A graph built from an AlphaFold-predicted complex is not interchangeable with a graph built from a crystal structure, even though both are "graphs," because the upstream coordinates have different reliability. Good pipelines propagate uncertainty about the source through to the downstream representation, rather than treating all graphs (or all fingerprints, or all distance matrices) as equivalent.
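One lightweight way to keep the source attached to the representation is to store provenance next to every derived view and to stratify by it before training or evaluation. The sketch below shows the idea with two hypothetical dataclasses; the field names, method labels, and complex identifiers are assumptions, and a real pipeline would likely carry richer confidence information.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructureSource:
    """Hypothetical provenance record for the coordinates a view was derived from."""
    method: str                                  # e.g. "xray", "docking", "predicted"
    resolution_angstrom: Optional[float] = None  # only meaningful for experimental structures


@dataclass
class ComplexView:
    """A derived representation that keeps its provenance alongside the data."""
    complex_id: str
    source: StructureSource
    fingerprint: tuple


def group_by_method(views):
    """Group complex identifiers by the method that produced their coordinates."""
    groups = defaultdict(list)
    for view in views:
        groups[view.source.method].append(view.complex_id)
    return dict(groups)


views = [
    ComplexView("complex-001", StructureSource("xray", 1.5), (1, 0, 1, 1)),
    ComplexView("complex-002", StructureSource("predicted"), (1, 0, 1, 1)),
    ComplexView("complex-003", StructureSource("xray", 2.8), (0, 1, 1, 0)),
]
print(group_by_method(views))
```

The first two views carry identical fingerprints, and only the provenance field records that one of them rests on a measured structure and the other on a prediction.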

Conclusion

A molecular interaction is a physical event. The data structures used to represent it are human inventions, each embedding a theory about which features of the event matter. Coordinate files theorize that everything matters and the model can figure out the rest. Interaction fingerprints theorize that the presence or absence of typed contacts captures the relevant signal. Graphs theorize that topology plus node and edge features, processed through learned aggregation, is sufficient. Voxel grids theorize that the 3D image of a binding site is a sufficient statistic. Pharmacophores theorize that abstract feature geometry is what binds molecules to targets. Tabular data theorizes that the pairing and the number are what you need.

None of these theories is universally correct. They are tools whose fitness depends on the question. A scientist encountering molecular interaction data for the first time confronts a landscape of representations that can feel arbitrary: why does one paper use contact maps, another use SPLIF, a third use equivariant graph networks? The answer, almost always, is that each author chose the representation whose information content matched the task and whose compute profile fit the available hardware. Understanding what each representation encodes, and what it quietly lets fall away, is the difference between picking the right tool and reaching for whatever happens to be closest.

References

Deng, Z., Chuaqui, C., & Singh, J. (2004). Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein-ligand binding interactions. Journal of Medicinal Chemistry, 47(2), 337-344. doi:10.1021/jm030331x
Da, C. & Kireev, D. (2014). Structural protein-ligand interaction fingerprints (SPLIF) for structure-based virtual screening: method and benchmark study. Journal of Chemical Information and Modeling, 54(9), 2555-2561. doi:10.1021/ci500319f
Wojcikowski, M., Kukielka, M., Stepniewska-Dziubinska, M. M., & Siedlecki, P. (2019). Development of a protein-ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics, 35(8), 1334-1341. doi:10.1093/bioinformatics/bty757
Bouysset, C. & Fiorucci, S. (2021). ProLIF: a library to encode molecular interactions as fingerprints. Journal of Cheminformatics, 13(1), 72. doi:10.1186/s13321-021-00548-6
Wang, R., Fang, X., Lu, Y., & Wang, S. (2004). The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 47(12), 2977-2980. doi:10.1021/jm030580l
Liu, T., Lin, Y., Wen, X., Jorissen, R. N., & Gilson, M. K. (2007). BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research, 35(Database issue), D198-D201. doi:10.1093/nar/gkl999
Hu, L., Benson, M. L., Smith, R. D., Lerner, M. G., & Carlson, H. A. (2005). Binding MOAD (Mother Of All Databases). Proteins: Structure, Function, and Bioinformatics, 60(3), 333-340. doi:10.1002/prot.20512
Gomes, J., Ramsundar, B., Feinberg, E. N., & Pande, V. S. (2017). Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.
Jimenez, J., Skalic, M., Martinez-Rosell, G., & De Fabritiis, G. (2018). KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of Chemical Information and Modeling, 58(2), 287-296. doi:10.1021/acs.jcim.7b00650
Satorras, V. G., Hoogeboom, E., & Welling, M. (2021). E(n) equivariant graph neural networks. Proceedings of the 38th International Conference on Machine Learning.
Salentin, S., Schreiber, S., Haupt, V. J., Adasme, M. F., & Schroeder, M. (2015). PLIP: fully automated protein-ligand interaction profiler. Nucleic Acids Research, 43(W1), W443-W447. doi:10.1093/nar/gkv315
Wojcikowski, M., Zielenkiewicz, P., & Siedlecki, P. (2015). Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field. Journal of Cheminformatics, 7, 26. doi:10.1186/s13321-015-0078-2
Abramson, J., Adler, J., Dunger, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493-500. doi:10.1038/s41586-024-07487-w

Taylor Powell
