Building the Data Foundation: Data Engineering Patterns for Molecular and Genomic ML in Pharma
A biotech's lead optimization team spent four months making medicinal chemistry decisions guided by an ADME prediction model that, by all standard metrics, appeared to be performing well. Validation accuracy was strong. Stakeholders trusted the outputs. The model was integrated into weekly compound prioritization meetings.
Then a new computational chemist joined the team and noticed something peculiar in the training data. The same molecule — ibuprofen — appeared twice: once as its sodium salt form extracted from a clinical formulation database, and once as the free acid pulled from ChEMBL. The SMILES strings looked different. The molecular weights were different. To every automated check in the pipeline, these were two distinct compounds. But they were not. They were the same active pharmaceutical ingredient, represented differently because two data sources used different conventions and no standardization step existed between data ingestion and feature computation.
The problem was worse than a single duplicate. Systematic investigation revealed that roughly 8% of the training set contained similar identity confusions — salt forms counted separately from parent molecules, tautomers treated as distinct compounds, and non-canonical SMILES producing different fingerprint bits for identical structures. The model's reported performance was inflated by data leakage through molecular identity confusion, and four months of compound prioritization decisions had been influenced by a model that was, in effect, learning from artifacts rather than real chemistry.
This was not a modeling failure. The architecture was sound. The hyperparameters were well-tuned. The evaluation methodology was rigorous. It was a data engineering failure — the most common and least discussed category of failure in pharmaceutical machine learning.
In previous articles on this platform, we have examined this problem from multiple angles. We cataloged the common failure modes in pathogen genomics ML pipelines, showing that most failures stem from infrastructure rather than models. We prescribed a systems-first approach to pipeline design, emphasizing modularity, observability, and reproducibility. We outlined ten essential practices for sustainability, covering everything from data provenance to governance frameworks. But across all three articles, one assumption went unexamined: that clean, well-structured data arrives at the pipeline's front door.
This article supplies that missing foundation layer. It covers what happens before that door — the data engineering that determines whether molecular and genomic ML systems learn from reality or from artifacts of sloppy data handling. For pharmaceutical organizations where model predictions influence decisions worth millions of dollars, the data engineering layer is not plumbing to be delegated — it is infrastructure to be designed.
The Unique Data Engineering Challenges of Molecular and Genomic Data
Most data engineering best practices were developed for tabular business data, text corpora, or image datasets. Molecular and genomic data breaks these patterns in specific, domain-dependent ways that generic ETL tooling does not anticipate.
Molecular Identity Is Not Obvious
The most fundamental challenge in molecular data engineering is that the same molecule can be represented in dozens of valid ways, and determining whether two representations refer to the same entity requires domain-specific computation rather than simple string comparison.
Consider SMILES notation, the most common text-based molecular representation. SMILES is not canonical by default — the string depends on which atom the writer starts from and which path they traverse through the molecular graph. The molecule caffeine can be written as `CN1C=NC2=C1C(=O)N(C(=O)N2C)C` or `Cn1c(=O)c2c(ncn2C)n(C)c1=O` or any number of other valid permutations. Canonical SMILES algorithms exist to resolve this ambiguity, but different software packages can produce different canonical forms for the same molecule, particularly for complex aromatic systems or molecules with unusual bonding patterns.
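A minimal RDKit sketch makes the point concrete — both spellings collapse to a single canonical string, within one RDKit version (a different toolkit or release may choose a different canonical form):

```python
from rdkit import Chem

# Two valid SMILES spellings of caffeine, written from different
# starting atoms and traversal paths through the molecular graph.
variants = [
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Cn1c(=O)c2c(ncn2C)n(C)c1=O",
]

# Canonicalization collapses both to one string for this RDKit version.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)  # a set containing exactly one canonical SMILES
```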
Tautomerism introduces a deeper problem. Tautomers are structural isomers that interconvert rapidly through proton migration — the classic example being keto-enol tautomerism, where a carbonyl group and its enol form exist in equilibrium. In solution, both forms are present simultaneously. But in a database, a molecule is stored as one tautomer or the other, and the choice is often arbitrary. The RDKit tautomer canonicalization algorithm addresses this by enumerating possible tautomers, scoring them according to rules that favor aromatic systems and penalize certain substructures, and selecting the highest-scoring form. But as Greg Landrum noted when introducing this algorithm, the goal is to produce a canonical result — always the same output for tautomerically equivalent inputs — not necessarily the chemically preferred form. Different toolkits (RDKit versus ChemAxon, for instance) apply different rules and can produce different canonical tautomers for the same molecule.
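The canonicalization behavior is easy to demonstrate with RDKit's `TautomerEnumerator`, here on the classic 2-hydroxypyridine/2-pyridone pair:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# 2-hydroxypyridine and 2-pyridone: two spellings of one tautomer pair.
tautomers = ["Oc1ccccn1", "O=c1cccc[nH]1"]

enumerator = rdMolStandardize.TautomerEnumerator()
canonical = {
    Chem.MolToSmiles(enumerator.Canonicalize(Chem.MolFromSmiles(s)))
    for s in tautomers
}
print(canonical)  # both inputs map to the same canonical tautomer
```

Which form wins is a property of the scoring rules, not a claim about solution-phase stability — the guarantee is only that equivalent inputs converge.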
Salt forms add another layer of complexity. Pharmaceutical compounds are frequently formulated as salts — ibuprofen sodium, metformin hydrochloride, atorvastatin calcium — to improve solubility, stability, or manufacturing properties. In bioactivity databases, the same compound may appear as its salt form in one record and as the free base or free acid in another. For most ML applications, these should be treated as the same active pharmaceutical ingredient, requiring automated salt stripping to extract the parent molecule. But the boundary is not always clear: some counterions influence biological activity (lithium in lithium carbonate is the active moiety), and some metal-containing compounds have metals integral to their mechanism of action.
Stereochemistry presents perhaps the most consequential identity challenge. Enantiomers — molecules that are mirror images of each other — have identical SMILES strings except for chirality annotations. Yet they can have dramatically different biological activity. The cautionary example is thalidomide, where one enantiomer treated morning sickness and the other caused severe birth defects. A data pipeline that strips stereochemistry to simplify identity matching may inadvertently merge compounds with opposite safety profiles.
Heterogeneous Source Formats
Molecular data arrives in formats that encode fundamentally different levels of structural information. SMILES and InChI are one-dimensional text strings capturing connectivity. SDF and MOL files are connection-table formats that can additionally store two- or three-dimensional atomic coordinates alongside connectivity. These formats cannot be naively interconverted — generating 3D coordinates from a SMILES string requires conformer generation algorithms that introduce computational cost and non-determinism. Going the other direction, converting a 3D structure to SMILES discards spatial information that may be critical for certain applications.
Genomic data presents analogous heterogeneity. FASTA files store raw sequences. VCF files encode variants relative to a reference assembly. BAM files contain read alignments with quality scores. GFF files provide feature annotations with genomic coordinates. Each format has its own schema, coordinate system, and versioning cadence, and they are deeply interdependent — a VCF file is meaningless without knowing which reference genome assembly it was called against.
Bioactivity data from public databases introduces unit heterogeneity. ChEMBL reports activity measurements in the units provided by the original publication — nM, μM, mM, %, or arbitrary units — with varying assay types (binding, functional, ADME, toxicity) and different statistical frameworks. The pChEMBL value attempts to standardize this by converting to negative log molar units, but it only applies to specific activity types (IC50, Ki, Kd, EC50) with exact equality relations and nanomolar units. A pipeline that treats all bioactivity values as comparable without filtering on these constraints will produce training data that mixes fundamentally different measurements.
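The pChEMBL arithmetic itself is simple — the negative base-10 logarithm of the molar activity value — as a quick sketch shows (the helper name `pchembl` is ours, not ChEMBL's API):

```python
import math

# pChEMBL = -log10(activity in molar). ChEMBL assigns it only for
# specific activity types (IC50, Ki, Kd, EC50), exact '=' relations,
# and nanomolar-convertible units.
def pchembl(value_nm: float) -> float:
    """Convert a nanomolar potency value to the pChEMBL scale."""
    return -math.log10(value_nm * 1e-9)

print(pchembl(50))    # IC50 = 50 nM -> ~7.3
print(pchembl(1000))  # IC50 = 1 uM  -> 6.0
```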
The Reference Database Problem
Molecular and genomic databases evolve continuously. ChEMBL releases new versions approximately biannually, each potentially reorganizing compound hierarchies, correcting structural errors, and merging or splitting entries. PubChem updates daily. UniProt revises protein annotations and occasionally merges accession numbers. Reference genome assemblies change across versions (GRCh37 to GRCh38), invalidating coordinate-based annotations and requiring liftover operations that can fail for complex genomic regions.
For ML systems that depend on these databases, each update is a potential source of silent data corruption. A model trained on ChEMBL v31 may reference compound identifiers that were reorganized in v33. A genomic annotation pipeline built against GRCh37 coordinates will produce incorrect gene assignments if applied to GRCh38-aligned data. The data engineering response is to pin database versions explicitly, track them as dependencies alongside code and library versions, and build validation checks that detect when upstream databases have changed in ways that affect downstream outputs — exactly the provenance and versioning practices we advocated in our earlier articles, but applied at the data source layer rather than the model layer.
Chemical Structure Standardization as a Data Engineering Discipline
Structure standardization is the single most impactful data engineering operation in pharmaceutical ML. Getting it wrong contaminates every downstream process — feature computation, deduplication, train/test splitting, and model evaluation. Getting it right requires treating standardization not as a preprocessing convenience but as a rigorous, versioned, auditable pipeline stage.
The Standardization Pipeline
The community has converged on a multi-step standardization workflow, most thoroughly documented in the canSAR chemistry registration pipeline published in the Journal of Cheminformatics. This pipeline implements five sequential steps, each addressing a specific category of molecular identity ambiguity.
The first step is structure validation: parsing input representations (SMILES strings or SDF records) into molecular objects and rejecting or flagging those that fail chemical validity checks. In RDKit, this means calling `Chem.MolFromSmiles()` with sanitization enabled, which checks valence rules, aromaticity perception, and ring system consistency. Molecules that fail sanitization may contain invalid bond orders, impossible valence states, or unparseable notation. The design decision is whether to reject these outright or attempt repair — the canSAR pipeline attempts correction of kekulized forms and stereochemistry where possible, accepting a broader range of inputs than the more conservative ChEMBL pipeline.
The second step is normalization: applying rule-based transformations to standardize functional group representations. Nitro groups, for example, can be drawn as `-N(=O)=O` or `-[N+](=O)[O-]`, and charge-separated representations of the same functional group vary across data sources. RDKit's `rdMolStandardize.Cleanup()` performs hydrogen removal, metal atom disconnection, functional group normalization, and reionization in a single call, producing a consistent representation regardless of input conventions.
The third step is canonical tautomer generation: selecting a single canonical tautomer from the set of possible forms. The RDKit implementation enumerates tautomers using SMIRKS-based transformation rules, scores each candidate according to criteria that favor aromatic rings and penalize certain substructures, and selects the highest-scoring form. In the event of ties, the tautomer with the lexicographically smaller canonical SMILES is selected. This is a canonicalization algorithm, not a stability prediction — it guarantees that tautomerically equivalent inputs always produce the same output, which is the property that matters for database registration and ML training data deduplication.
The fourth step is salt stripping and fragment selection: extracting the parent molecule from salt forms, solvates, and multi-component mixtures. The `FragmentParent()` function selects the largest organic fragment, which is correct for the vast majority of pharmaceutical compounds. Edge cases require domain-specific handling: pharmaceutical salts where the counterion is pharmacologically relevant, metal complexes where the metal is integral to the mechanism of action, and co-crystals where multiple active components are present by design.
The fifth step is charge neutralization: neutralizing formal charges where chemically appropriate. The `Uncharger()` class handles common cases — protonating carboxylates, deprotonating amines — but some species should remain charged. Quaternary ammonium compounds are permanently charged by virtue of their bonding. Zwitterionic amino acids exist as charged species at physiological pH. The decision of whether to neutralize must reflect the intended use of the standardized structure.
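Under the stated assumptions — RDKit's `rdMolStandardize` module, with steps applied in the order described above (production pipelines sometimes reorder steps three through five) — the workflow can be sketched as:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str):
    # Step 1: validation -- parsing with sanitization enabled.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # or route to a repair/quarantine path
    # Step 2: normalization of functional groups, metals, charges.
    mol = rdMolStandardize.Cleanup(mol)
    # Step 3: canonical tautomer selection.
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    # Step 4: salt stripping -- keep the largest organic fragment.
    mol = rdMolStandardize.FragmentParent(mol)
    # Step 5: charge neutralization where chemically appropriate.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)

# The sodium salt of aspirin reduces to the parent free acid.
print(standardize("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```

The edge cases noted above (pharmacologically active counterions, integral metals, permanent charges) are exactly the inputs for which this default path needs domain-specific overrides.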
Uniqueness Identifiers: The Registration Decision
After standardization, every compound needs a unique identifier that determines how the system distinguishes "same molecule" from "different molecule." This seemingly simple requirement conceals a consequential design decision.
ChEMBL uses InChI (International Chemical Identifier) and its hashed form, InChIKey, as uniqueness measures. Standard InChI collapses most tautomers into a single representation, treating keto and enol forms as identical. PubChem opted for de-aromatized isomeric canonical SMILES, which preserves tautomeric distinctions. The canSAR knowledgebase chose Non-Standard InChI with fixed hydrogens, appending an extra layer to the InChI string that makes it specific to a single tautomeric form. Each choice reflects a different philosophy about molecular identity.
For ML training data, collapsing tautomers is usually the correct behavior — a model should learn that acetone and its enol form have the same properties, not treat them as independent data points. For tracking experimental results, preserving tautomeric form may be important — the specific form tested in an assay matters for reproducibility. The data engineering challenge is supporting both use cases through a compound hierarchy rather than forcing a single definition of identity.
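The tautomer-collapsing behavior of Standard InChI is easy to verify with RDKit, using the pyridinol/pyridone pair again — two different SMILES, one identifier:

```python
from rdkit import Chem

# Standard InChI places mobile hydrogens in a shared layer, so both
# forms of the 2-hydroxypyridine / 2-pyridone pair get one identifier.
a = Chem.MolToInchi(Chem.MolFromSmiles("Oc1ccccn1"))
b = Chem.MolToInchi(Chem.MolFromSmiles("O=c1cccc[nH]1"))
print(a == b)  # True: the tautomers collapse to a single InChI
print(Chem.MolToInchiKey(Chem.MolFromSmiles("Oc1ccccn1")))
```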
Building Compound Hierarchies
The concept of compound hierarchies addresses this tension by maintaining multiple levels of molecular identity for different use cases. A well-designed hierarchy might include: the Standard Compound (validated and normalized but otherwise unmodified), the Canonical Representative (canonical tautomer selected), the Unsalted Canonical Representative (salts stripped, canonical tautomer), and the Abstract Compound (stereochemistry and isotope labels removed). Each level serves different purposes — the Abstract Compound enables broad similarity searches across stereoisomers, while the Canonical Representative enables precise identity matching.
In practice, this hierarchy should be captured in the compound registration record alongside standardization metadata:
```json
{
  "compound_id": "CMP-2025-00142",
  "original_input": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
  "standardization_pipeline_version": "v3.1.0",
  "rdkit_version": "2024.09.1",
  "hierarchy": {
    "standard_compound_smiles": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]",
    "canonical_representative_smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "unsalted_canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "abstract_compound_smiles": "CC(=O)Oc1ccccc1C(=O)O"
  },
  "identifiers": {
    "inchi": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"
  },
  "standardization_log": [
    "salt_stripped: Na+ removed",
    "charge_neutralized: carboxylate protonated",
    "tautomer_canonical: no tautomerization applied"
  ]
}
```
This record provides full traceability from raw input to standardized output, with versioned pipeline and library metadata enabling exact reproduction. When an auditor asks "what molecule was this prediction based on?" or a model developer needs to understand how training data was constructed, the answer is documented at the point of registration rather than reconstructed from scattered logs.
Molecular Feature Engineering as Infrastructure
Molecular descriptor computation — the transformation of chemical structures into numerical feature vectors suitable for machine learning — is typically treated as a preprocessing step embedded in training scripts. This is a fundamental architectural mistake. Descriptor computation is a data engineering service that should be centralized, versioned, and served consistently to both training and inference workloads. Treating it as ad hoc scripting is the primary mechanism by which training-serving skew enters pharmaceutical ML systems.
The Descriptor Landscape
The molecular descriptor space spans several levels of structural representation. One-dimensional descriptors — molecular weight, calculated LogP, polar surface area, hydrogen bond donor and acceptor counts, rotatable bond counts — are scalar properties computable directly from molecular connectivity. RDKit's `Descriptors` module provides over two hundred such descriptors. These are fast to compute, interpretable to medicinal chemists, and sufficient for many QSAR applications, but they discard structural information that distinguishes molecules with similar bulk properties but different biological activity.
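A brief sketch of scalar descriptor computation with RDKit's `Descriptors` module, using aspirin as the example:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# A handful of the 200+ scalar descriptors RDKit exposes.
profile = {
    "mol_wt": Descriptors.MolWt(mol),
    "clogp": Descriptors.MolLogP(mol),
    "tpsa": Descriptors.TPSA(mol),
    "hbd": Descriptors.NumHDonors(mol),
    "hba": Descriptors.NumHAcceptors(mol),
    "rot_bonds": Descriptors.NumRotatableBonds(mol),
}
print(profile)
```

Even these "simple" values are implementation-dependent — calculated LogP in particular varies across toolkits — which is why the computing library and version belong in the feature metadata.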
Two-dimensional descriptors, primarily molecular fingerprints, capture structural patterns as fixed-length bit vectors. Morgan fingerprints (also called Extended Connectivity Fingerprints, ECFP) are the de facto standard for similarity searching and many ML applications. Each bit encodes the presence or absence of a circular substructure centered on each atom, extending to a specified radius. The standard parameterization — radius 2 with 2048 bits — balances specificity against bit collision rates, but both parameters significantly affect downstream model performance and should be treated as engineering decisions rather than defaults accepted without evaluation.
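A sketch using RDKit's fingerprint generator API, with the radius and bit-width parameters made explicit rather than left as defaults (the example molecules are ibuprofen and naproxen):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Radius and bit width are engineering decisions; 2/2048 is the
# conventional ECFP4-equivalent starting point, not a law of nature.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

fp_ibu = gen.GetFingerprint(Chem.MolFromSmiles("CC(C)Cc1ccc(C(C)C(=O)O)cc1"))
fp_nap = gen.GetFingerprint(Chem.MolFromSmiles("COc1ccc2cc(C(C)C(=O)O)ccc2c1"))

print(fp_ibu.GetNumBits())                          # fixed-length bit vector
print(DataStructs.TanimotoSimilarity(fp_ibu, fp_nap))  # structural similarity
```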
Three-dimensional descriptors — shape descriptors, pharmacophore fingerprints, electrostatic surface properties — require conformer generation before computation, introducing both computational cost and non-determinism. The same molecule can produce different 3D descriptors depending on which conformer was generated, how the generation algorithm was parameterized, and even the random seed used for initialization. For pipeline reproducibility, 3D descriptor computation requires either deterministic conformer generation or explicit storage of the generated conformer alongside the computed descriptors.
Learned representations — embeddings from transformer models trained on SMILES strings, or latent vectors from graph neural networks — represent the frontier of molecular featurization. These approaches can capture patterns that handcrafted descriptors miss, but they introduce a new infrastructure dependency: the embedding model itself must be maintained, versioned, and served alongside the downstream prediction model. A feature that requires running a separate neural network to compute is categorically different from a feature computed by a deterministic function, and pipeline architecture must reflect this distinction.
The Training-Serving Skew Problem
Training-serving skew occurs when features are computed differently in the training environment and the production inference environment. In standard software engineering, this manifests as differences in library versions, preprocessing logic, or data transformations between the training pipeline and the serving API. In molecular ML, the problem is amplified by the sensitivity of molecular descriptors to implementation details.
Consider a concrete scenario: a medicinal chemist queries a deployed solubility prediction model via a web API. The API computes Morgan fingerprints using RDKit 2024.09, but the model was trained using fingerprints generated with RDKit 2023.03. Between those releases, RDKit modified its handling of certain aromatic systems during canonicalization, and adjusted default behavior for stereo group processing in SMILES generation. For most molecules, the fingerprints are identical. For a fraction of molecules — those with the specific structural motifs affected by the changes — the fingerprints differ by one or more bits. The model returns predictions for these molecules, but the predictions are subtly wrong because the input features do not match what the model learned during training.
This failure mode is insidious because it produces no errors, no warnings, and no obvious degradation in aggregate metrics. The affected molecules are a small fraction of total queries, and their predictions are wrong by amounts that fall within normal prediction uncertainty. The skew is only detectable by someone who knows to look for it — by comparing feature vectors generated at training time with feature vectors generated at inference time for the same molecules.
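One way to make the skew detectable is a sentinel check: recompute features for a fixed panel of molecules in the serving environment and compare against digests captured at training time. A minimal sketch, with a toy featurizer standing in for the real pinned fingerprint routine:

```python
import hashlib
import json

def feature_digest(vector) -> str:
    """Stable digest of a feature vector for cheap equality checks."""
    return hashlib.sha256(json.dumps(list(vector)).encode()).hexdigest()

def check_skew(sentinels, stored_digests, compute_features):
    """Recompute sentinel features in the serving environment and
    compare against digests captured at training time."""
    return [
        smiles
        for smiles in sentinels
        if feature_digest(compute_features(smiles)) != stored_digests[smiles]
    ]

# Illustrative stand-in featurizer; in practice this is the pinned
# fingerprint routine and the sentinel set covers known-sensitive motifs.
toy = lambda s: [len(s), s.count("c")]
stored = {"CC(=O)Oc1ccccc1C(=O)O": feature_digest(toy("CC(=O)Oc1ccccc1C(=O)O"))}
print(check_skew(stored.keys(), stored, toy))  # [] -> no skew detected
```

Running this at deployment time, and again after any library upgrade, turns a silent failure into a loud one.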
Toward a Molecular Feature Store
The architectural solution is to centralize feature computation in a single service that feeds both training and inference. In the broader ML engineering ecosystem, this pattern is called a feature store. For molecular data, the implementation is straightforward in concept if not always in practice.
A molecular feature store accepts standardized molecular identifiers (canonical SMILES from the standardization pipeline) and returns precomputed, versioned feature vectors. Its key properties are idempotency (the same input always produces the same output), versioning (the feature computation version is tracked alongside the features themselves), caching (expensive descriptors are computed once and stored for retrieval), and auditability (every model can trace its features back to a specific computation version and a specific molecular representation).
The architecture follows a standard pattern: an input layer that accepts canonical SMILES and validates them against the compound registry; a computation layer that generates descriptors using pinned library versions and explicit parameterization; a storage layer keyed on the tuple of canonical SMILES and feature set version; a serving layer (REST API or gRPC) that returns feature vectors for real-time inference; and a batch export path that produces feature matrices with full version metadata for training pipelines.
For small biotechs and CROs, this does not need to be a production deployment of Feast or Tecton. A PostgreSQL table with compound identifiers, precomputed descriptors, and version metadata — served by a Flask API with a few hundred lines of code — achieves the core value proposition: training and inference consume features from the same source, eliminating the primary vector for training-serving skew. The investment is modest. The protection against silent prediction degradation is substantial.
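A minimal sketch of that idea — SQLite standing in for PostgreSQL, a toy featurizer standing in for real descriptor computation — illustrates the compute-once, read-many contract keyed on compound and feature version:

```python
import json
import sqlite3

class FeatureStore:
    """Toy feature store keyed on (canonical_smiles, feature_version)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS features (
                   canonical_smiles TEXT NOT NULL,
                   feature_version  TEXT NOT NULL,
                   vector           TEXT NOT NULL,
                   PRIMARY KEY (canonical_smiles, feature_version))"""
        )

    def get_or_compute(self, smiles, version, compute):
        row = self.db.execute(
            "SELECT vector FROM features "
            "WHERE canonical_smiles=? AND feature_version=?",
            (smiles, version),
        ).fetchone()
        if row:
            return json.loads(row[0])  # cache hit: no recomputation
        vector = compute(smiles)
        self.db.execute("INSERT INTO features VALUES (?,?,?)",
                        (smiles, version, json.dumps(vector)))
        return vector

store = FeatureStore()
calls = []
def toy_features(smiles):  # stand-in for a pinned descriptor routine
    calls.append(smiles)
    return [len(smiles)]

v1 = store.get_or_compute("CC(=O)Oc1ccccc1C(=O)O", "desc-v1", toy_features)
v2 = store.get_or_compute("CC(=O)Oc1ccccc1C(=O)O", "desc-v1", toy_features)
print(v1 == v2, len(calls))  # True 1 -> second call was a cache hit
```

Because the version string is part of the key, upgrading the descriptor library means registering a new feature version rather than silently overwriting old vectors.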
Genomic Data Integration Patterns
Pharmaceutical ML increasingly requires integrating chemical and biological data. Compound properties must be joined with target information, pathway annotations, expression data, and genomic context. This multi-modal integration is a data engineering challenge that precedes any modeling decision, and it introduces failure modes that are invisible to practitioners working within a single data type.
The Multi-Format Problem
Genomic pipelines produce fundamentally different data artifacts from chemical pipelines, and the artifacts within genomics are themselves heterogeneous. Sequence data in FASTA format, variant calls in VCF format, expression quantifications in count matrices, and functional annotations in GFF format each use different schemas, different coordinate systems, and different versioning conventions. Joining chemical data — keyed on molecular identifiers like SMILES or InChIKey — with genomic data — keyed on gene symbols, protein accessions, or genomic coordinates — requires robust mapping layers that translate between identifier systems.
These mappings are fragile. Gene symbols change when the HUGO Gene Nomenclature Committee updates its conventions. Protein accessions are deprecated when UniProt merges entries during curation. Genomic coordinates shift between assembly versions, and liftover operations between assemblies can fail for complex regions including segmental duplications and centromeric sequences. A pipeline that hard-codes any of these identifiers without a managed cross-reference layer will silently lose data whenever upstream databases reorganize.
Building the Join Layer
The connective tissue of multi-modal pharmaceutical datasets is a set of mapping tables between identifier systems: ChEMBL target IDs to UniProt accessions, UniProt accessions to Ensembl gene IDs, Ensembl gene IDs to HGNC symbols, and so on. Each mapping must be versioned and refreshed as source databases update.
The engineering pattern is to maintain a versioned entity resolution table that is updated on a regular cadence (quarterly is typical for databases with biannual release cycles) with change detection that flags downstream impacts. When UniProt merges two accessions that previously mapped to different ChEMBL targets, the entity resolution system should detect this merge, flag all affected bioactivity records, and propagate the change to downstream datasets with an audit trail. Without this infrastructure, cross-database joins gradually degrade as identifier mappings become stale — a form of data drift that operates at the metadata level rather than the feature level.
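The change-detection core can be sketched in a few lines (the accession and target IDs here are illustrative, not real mappings):

```python
# Diff two releases of an accession -> target mapping table and
# classify the changes for downstream impact flagging.
def diff_mappings(old: dict, new: dict):
    return {
        "deprecated": sorted(set(old) - set(new)),
        "remapped": sorted(k for k in old if k in new and old[k] != new[k]),
        "added": sorted(set(new) - set(old)),
    }

v1 = {"P00001": "CHEMBL203", "P00002": "CHEMBL204"}
v2 = {"P00001": "CHEMBL203", "P00002": "CHEMBL203"}  # accessions merged
print(diff_mappings(v1, v2))
```

Every key in `remapped` or `deprecated` should fan out to the bioactivity records and training datasets that reference it, with the diff itself stored as the audit trail.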
Bioactivity Data Curation
Raw bioactivity data from public databases requires substantial curation before it is suitable for ML training. ChEMBL, the richest public source of structure-activity relationships, contains measurements reported in heterogeneous units (nM, μM, mM, %), from different assay types (binding, functional, ADME, toxicity), and with varying quality annotations. The database faithfully preserves what was reported in the original publication, which means the data engineering burden of harmonization falls on the consumer.
Effective curation involves several automated steps. Unit normalization converts all measurements to a common scale — typically nanomolar for potency measurements — handling the arithmetic correctly (micromolar to nanomolar is multiplication by 1,000, not division). Assay type filtering restricts training data to measurements from comparable experimental paradigms — mixing binding Ki values with functional EC50 values in a single regression target introduces systematic noise. Duplicate detection, applied after structure standardization, identifies cases where the same compound-target pair was measured in multiple publications and resolves conflicts (typically by median aggregation or by preferring higher-confidence assay types). Quality flag filtering removes records tagged with data validity concerns — potential transcription errors, values outside typical ranges, or suspected author errors.
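The unit-normalization, filtering, and deduplication steps can be sketched with illustrative records (the compound and target IDs are invented):

```python
from statistics import median

TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

records = [  # (compound, target, assay_type, value, unit)
    ("CMP-1", "TGT-9", "IC50", 50.0, "nM"),
    ("CMP-1", "TGT-9", "IC50", 0.06, "uM"),  # duplicate measurement
    ("CMP-1", "TGT-9", "Ki",   12.0, "nM"),  # different paradigm: excluded
    ("CMP-2", "TGT-9", "IC50", 0.002, "mM"),
]

# Filter to one assay type, normalize units to nM, aggregate by median.
grouped = {}
for cmp_id, tgt, assay, value, unit in records:
    if assay != "IC50":
        continue
    grouped.setdefault((cmp_id, tgt), []).append(value * TO_NM[unit])

curated = {k: median(v) for k, v in grouped.items()}
print(curated)  # one median IC50 (nM) per compound-target pair
```

In production the same logic runs after structure standardization, so that salt forms and tautomers of one parent are correctly grouped before aggregation.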
Each of these curation steps requires domain-specific knowledge to implement correctly and must be documented in sufficient detail that a regulatory reviewer could understand and reproduce the curation logic. This is data engineering in the fullest sense: not merely moving data between systems, but actively transforming it according to domain-specific rules with full provenance tracking.
Orchestration and Pipeline Architecture for Molecular Data
The data engineering operations described in preceding sections — structure standardization, feature computation, genomic integration, bioactivity curation — form a dependency graph with specific ordering constraints. Standardization must complete before feature computation. Feature computation must complete before training data assembly. All stages depend on specific reference database versions. Manual execution of these dependencies — running scripts in sequence, eyeballing outputs, checking logs — is the default at most small pharmaceutical organizations. It is also the primary source of the silent failures documented in the first article of this series.
Why Orchestration Matters
Modern data orchestration tools solve precisely the problems that manual pipeline execution creates. They make dependencies explicit rather than implicit. They enforce execution order automatically. They provide retry logic for transient failures. They log what ran, when, and with what result. And they surface failures immediately through alerting, rather than allowing corrupted data to propagate silently through downstream stages.
For molecular data pipelines specifically, orchestration addresses several failure patterns that are difficult to catch manually. If the ChEMBL extraction step produces fewer records than expected (indicating a failed download or changed schema), an orchestrated pipeline can detect the anomaly and halt before standardization begins. If the standardization step encounters an unusual number of validation failures (suggesting a change in upstream data format), the pipeline can alert rather than silently dropping records. If feature computation takes significantly longer than historical baselines (possibly indicating library version changes or input distribution shifts), the anomaly is logged and flagged.
Tool Selection for Pharma Context
The data orchestration ecosystem has consolidated around three primary open-source options, each suited to different organizational contexts.
Dagster's asset-based orchestration model aligns naturally with molecular data workflows where the outputs — standardized compound sets, feature matrices, curated training datasets — are the primary concern. Assets are defined as Python functions with typed inputs and outputs, and Dagster automatically constructs the dependency graph from these definitions. Built-in data quality checks at each pipeline step enable validation-first development, where assertions about data properties are first-class citizens rather than afterthoughts. For organizations building new data infrastructure without legacy constraints, Dagster's design philosophy reduces the cognitive overhead of managing complex molecular data pipelines.
Apache Airflow remains the industry standard, with over 320 million downloads in 2024 alone — an order of magnitude more than its nearest competitor. Its task-based model is well-understood across the data engineering community, and managed cloud offerings (Amazon MWAA, Google Cloud Composer) reduce operational burden. For enterprise pharmaceutical organizations with existing platform engineering teams and established Airflow infrastructure, the ecosystem maturity and deployment options outweigh Dagster's design advantages.
Prefect occupies a middle ground, offering flexible hybrid execution models with lower operational overhead than Airflow for small teams. Its event-driven architecture is well-suited for ad hoc and irregularly scheduled workloads, which is common in research environments where pipeline execution is triggered by new data availability rather than fixed schedules.
For a computational team of ten to thirty people at a biotech or CRO — the typical profile of organizations building molecular ML systems — the decision is not primarily about technical capability (all three tools handle the required workloads) but about organizational fit. Dagster's asset-centric model and built-in validation are worth the smaller ecosystem for teams prioritizing data quality and lineage. Airflow's maturity and managed service options reduce operational risk for teams that need to move fast with minimal platform engineering investment.
A Reference Pipeline Architecture
Regardless of tool selection, the dependency structure for molecular ML data preparation follows a consistent pattern. A ChEMBL extraction stage pulls compound and bioactivity data from a pinned database version. A UniProt extraction stage retrieves target annotations. Both feed into an entity resolution stage that maintains cross-reference mappings between identifier systems. Standardization processes raw molecular structures into canonical forms and registers them in the compound hierarchy. Bioactivity curation filters, normalizes, and deduplicates activity measurements. Feature computation generates descriptor vectors from standardized structures and writes them to the feature store. Finally, training dataset assembly joins curated bioactivity data with computed features and produces versioned training matrices with full provenance metadata.
Each stage in this pipeline has defined inputs and outputs with schema validation, pinned dependency versions, data quality checks at boundaries, and metadata logging for auditability. The pipeline as a whole produces not just a training dataset, but a complete audit trail documenting every transformation from raw source data to ML-ready features.
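Whatever orchestrator enforces it, the dependency structure itself is just a directed acyclic graph. As a tool-agnostic sketch (stage names taken from the description above, not from any particular codebase), the standard library's `graphlib` can express and order it:

```python
from graphlib import TopologicalSorter

# Dependency graph mirroring the reference architecture described above:
# each stage maps to the set of stages that must complete before it runs.
PIPELINE = {
    "chembl_extract": set(),
    "uniprot_extract": set(),
    "entity_resolution": {"chembl_extract", "uniprot_extract"},
    "standardization": {"entity_resolution"},
    "bioactivity_curation": {"entity_resolution"},
    "feature_computation": {"standardization"},
    "training_dataset": {"bioactivity_curation", "feature_computation"},
}

def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return a valid stage execution order (raises CycleError on cycles)."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(PIPELINE)
```

In Dagster this graph falls out of asset definitions automatically; in Airflow it is declared with task dependencies. Either way, training dataset assembly can never run before standardization and curation have completed.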
Data Quality and Validation Patterns
Data quality in molecular datasets cannot be assessed by generic data validation tools alone. While standard checks — null detection, type validation, uniqueness constraints — remain necessary, they are insufficient for catching the domain-specific errors that most frequently corrupt pharmaceutical ML training data. Effective validation requires domain-specific rules automated at every pipeline boundary.
Domain-Specific Validation Rules
Chemical validity checks ensure that molecular representations parse correctly and produce chemically reasonable structures. A SMILES string that fails RDKit sanitization may contain invalid valence states (carbon with five bonds), impossible ring systems, or notation errors. Beyond parseability, property range checks flag molecules with implausible characteristics for their intended compound class: a "small molecule" with a molecular weight exceeding 2,000 daltons, a calculated LogP below -5 or above 10, or a polar surface area suggesting a molecule that could not plausibly cross a cell membrane.
Biological plausibility checks apply domain knowledge to activity measurements. An IC50 of 0.001 nM represents extraordinary potency that is vanishingly rare in medicinal chemistry — values this extreme warrant verification against the source publication. A molecular weight of 50,000 daltons for a compound in a small molecule bioactivity dataset indicates a data entry error or an improperly classified biologic. An assay reporting 150% inhibition at a single concentration suggests a normalization error rather than genuine super-stoichiometric activity.
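These plausibility rules are straightforward to automate. A minimal sketch, using the example thresholds from the text (which are illustrative starting points, not universal constants):

```python
# Illustrative plausibility windows for small-molecule bioactivity records.
# Field names and bounds are assumptions drawn from the examples above.
RULES = {
    "ic50_nm": (0.01, 1e9),            # sub-0.01 nM potency warrants manual review
    "mol_weight": (50.0, 2000.0),      # "small molecule" mass window, daltons
    "pct_inhibition": (-20.0, 120.0),  # tolerance around the 0-100% range
}

def plausibility_flags(record: dict) -> list[str]:
    """Return names of fields whose values fall outside plausible ranges."""
    flags = []
    for fieldname, (lo, hi) in RULES.items():
        value = record.get(fieldname)
        if value is not None and not (lo <= value <= hi):
            flags.append(fieldname)
    return flags
```

A record claiming an IC50 of 0.001 nM and a molecular weight of 50,000 daltons would be flagged on both fields and routed to manual review rather than silently entering the training set.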
Duplicate detection, applied after standardization, checks for compounds that arrived as different representations but resolve to the same canonical SMILES. The duplicate rate itself is an informative metric: a baseline rate of 2-3% is typical for datasets assembled from multiple public sources, while a sudden increase to 10% suggests that a new data source with different conventions has been added without adequate standardization.
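The duplicate-rate metric itself is a few lines once a standardizer exists. In this sketch the `canonicalize` function is injected so the metric stays dependency-free; in practice it would be RDKit canonical SMILES generation after salt stripping:

```python
from collections import Counter

def duplicate_rate(smiles_list: list[str], canonicalize) -> float:
    """Fraction of records that collapse onto an already-seen canonical form.

    `canonicalize` is a stand-in for a real standardizer (e.g. RDKit canonical
    SMILES after salt stripping); it is passed in so the metric itself has no
    cheminformatics dependency.
    """
    if not smiles_list:
        return 0.0
    counts = Counter(canonicalize(s) for s in smiles_list)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(smiles_list)
```

Tracking this number per ingestion batch turns the "sudden jump from 2-3% to 10%" signal described above into an automated alert rather than a retrospective discovery.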
Schema Contracts Between Pipeline Stages
The principle of schema contracts — defining explicit expectations about data structure at every pipeline boundary — translates directly from general data engineering to molecular data workflows. Every pipeline stage should declare what it expects as input (required columns, data types, value ranges, null tolerance), what it produces as output (column names, types, expected row count ranges), and what conditions cause it to fail rather than proceed.
Framework-native tools enforce these contracts automatically. Dagster asset checks, Great Expectations suites, and Pandera schemas all provide mechanisms for declaring and validating data contracts. The specific choice matters less than the discipline of using one consistently. The pattern is simple: every pipeline stage produces a validation report alongside its data output. The downstream stage consumes both the data and the report, and refuses to proceed if validation failed. This transforms data quality from a hopeful aspiration into an engineering guarantee.
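The report-then-refuse pattern can be sketched in plain Python to make the mechanics concrete; Pandera or Great Expectations would declare the same expectations with far richer checks, but the discipline is identical:

```python
from dataclasses import dataclass

@dataclass
class Contract:
    """Minimal schema contract: required columns, types, row-count bounds."""
    columns: dict[str, type]   # required column -> expected Python type
    min_rows: int = 1

    def validate(self, rows: list[dict]) -> dict:
        errors = []
        if len(rows) < self.min_rows:
            errors.append(f"expected >= {self.min_rows} rows, got {len(rows)}")
        for i, row in enumerate(rows):
            for col, typ in self.columns.items():
                if col not in row:
                    errors.append(f"row {i}: missing column '{col}'")
                elif not isinstance(row[col], typ):
                    errors.append(f"row {i}: '{col}' is not {typ.__name__}")
        return {"passed": not errors, "errors": errors}

def run_stage(stage_fn, rows, contract: Contract):
    """Refuse to run a downstream stage unless the input satisfies its contract."""
    report = contract.validate(rows)
    if not report["passed"]:
        raise ValueError(f"contract violated: {report['errors']}")
    return stage_fn(rows), report
```

The downstream stage receives the data and the validation report together, and failure is loud: corrupted input stops the pipeline at the boundary where the error is cheapest to diagnose.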
Monitoring for Data Drift at Ingestion
The drift detection practices described in the sustainability article — monitoring feature distributions, tracking prediction confidence, alerting on statistical shifts — should be applied at the data ingestion layer, not just the model layer. Tracking statistical properties of incoming data batches (feature distributions, null rates, categorical value frequencies, new category appearance rates) catches upstream changes before they corrupt downstream models.
When a data source changes its schema, updates its assay protocols, or modifies its formatting conventions, ingestion-level monitoring detects the shift immediately rather than allowing it to propagate through standardization, feature computation, and model training before manifesting as degraded prediction quality weeks or months later. This is the data engineering implementation of the proactive monitoring we advocated in the sustainability article — catching errors at the point of entry, where the cost of detection is measured in seconds rather than months of compromised analysis.
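Two of the simplest ingestion-level signals, null-rate shifts and unseen categorical values, cover a surprising share of upstream changes. A deliberately minimal sketch (field names and the tolerance are assumptions; production systems would add per-feature distributional tests such as KS or PSI):

```python
def ingestion_drift_report(baseline: dict, batch: list[dict], tol: float = 0.05) -> dict:
    """Compare a new batch against baseline statistics captured at setup time.

    Flags per-column null-rate shifts beyond `tol` and any categorical values
    never seen in the baseline. A sketch, not a full drift-detection suite.
    """
    report = {"null_rate_shifts": {}, "new_categories": {}}
    n = len(batch)
    for col, expected in baseline["null_rates"].items():
        observed = sum(1 for row in batch if row.get(col) is None) / n
        if abs(observed - expected) > tol:
            report["null_rate_shifts"][col] = round(observed, 3)
    for col, known in baseline["categories"].items():
        seen = {row[col] for row in batch if row.get(col) is not None}
        unseen = sorted(seen - known)
        if unseen:
            report["new_categories"][col] = unseen
    return report
```

An upstream provider that starts emitting a new assay-type code, or begins leaving a previously populated column null, shows up in the report on the first affected batch instead of weeks later as degraded predictions.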
Practical Implementation for Resource-Constrained Teams
The patterns described in this article may seem to require the infrastructure budget and platform engineering headcount of a large pharmaceutical company. They do not. Most biotechs and CROs can implement the highest-impact elements incrementally, starting with interventions that deliver immediate value for modest effort and building toward more sophisticated infrastructure as needs grow and organizational maturity increases.
The Minimum Viable Data Layer
The following prioritized implementation sequence provides maximum risk reduction with minimum upfront investment.
During the first two weeks, implement chemical standardization. Write a standardization function using RDKit's `MolStandardize` module that performs validation, normalization, canonical tautomer generation, salt stripping, and charge neutralization. Apply this function to all incoming molecular data before any other processing. Store the canonical SMILES as the primary molecular identifier alongside the original input representation. This single step eliminates the largest category of silent data errors in pharmaceutical ML — the molecular identity confusions that inflated our opening scenario's model performance by contaminating training data with duplicates and near-duplicates.
During week three, add schema validation at ingestion. Use Pandera or Great Expectations to define expected columns, data types, and value ranges for your primary data sources. Configure the validation to fail loudly on violations rather than proceeding with corrupted data. The first time this catches an upstream format change that would previously have silently broken your pipeline, the investment pays for itself.
During weeks four through six, centralize feature computation. Move molecular descriptor computation from individual analysis notebooks into a shared Python module with a pinned RDKit version and explicit parameterization. Create a simple feature cache — even a pickle file or SQLite database keyed on the tuple of canonical SMILES and feature configuration hash — that prevents redundant computation and ensures consistent features across analyses. This eliminates the training-serving skew problem without requiring production feature store infrastructure.
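The SQLite variant of that cache fits in a short class. The descriptor function is injected by the caller (in practice an RDKit computation pinned to a specific version); everything else here is standard library:

```python
import hashlib
import json
import sqlite3

class FeatureCache:
    """SQLite-backed cache keyed on (canonical SMILES, feature-config hash).

    A minimal sketch of the caching layer described above: the same molecule
    under the same feature configuration is computed exactly once.
    """
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS features "
            "(smiles TEXT, config_hash TEXT, vector TEXT, "
            "PRIMARY KEY (smiles, config_hash))"
        )

    @staticmethod
    def config_hash(config: dict) -> str:
        # Stable hash of the feature configuration (keys sorted for determinism).
        return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

    def get_or_compute(self, smiles: str, config: dict, compute) -> list:
        key = self.config_hash(config)
        row = self.db.execute(
            "SELECT vector FROM features WHERE smiles = ? AND config_hash = ?",
            (smiles, key),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])            # cache hit: no recomputation
        vector = compute(smiles, config)         # cache miss: compute and store
        self.db.execute(
            "INSERT INTO features VALUES (?, ?, ?)",
            (smiles, key, json.dumps(vector)),
        )
        self.db.commit()
        return vector
```

Because the key includes the configuration hash, changing a descriptor parameter (say, a fingerprint radius) transparently triggers recomputation instead of serving stale features, which is exactly the training-serving consistency property the text describes.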
During the second month, introduce orchestration. Wrap the standardization, validation, and feature computation steps into a Dagster job or Airflow DAG. This makes dependencies explicit, provides a dashboard for monitoring pipeline health, and creates a foundation for adding more sophisticated stages later. The orchestration layer also serves as living documentation of the pipeline's structure — new team members can understand the data flow by examining the DAG rather than tracing through scripts.
During months three through six, formalize the feature store. Migrate the feature cache into a queryable PostgreSQL table with compound identifiers, precomputed descriptors, and version metadata. Build a lightweight API that serves features to both training scripts and inference endpoints from the same source. At this point, training-serving consistency is architecturally guaranteed rather than manually maintained.
What Not to Build
Knowing what to avoid is as important as knowing what to build. Do not implement a custom molecular database when PostgreSQL with the RDKit cartridge extension provides SQL-level chemical searching, substructure queries, and similarity calculations out of the box. Do not write your own standardization rules when RDKit's MolStandardize module and the broader MolVS library provide peer-reviewed, community-maintained implementations that have been validated against millions of compounds across ChEMBL, PubChem, and canSAR. Do not engineer for scale you do not have — a compound library of 50,000 molecules does not need a distributed feature store running on Kubernetes; it needs a well-indexed PostgreSQL table and a few hundred lines of serving code.
The organizations that fail at molecular data engineering rarely fail because their infrastructure was insufficiently sophisticated. They fail because they had no infrastructure at all — because every computational chemist maintained their own preprocessing scripts, each implementing slightly different standardization conventions, producing slightly different features, and feeding slightly different training data to models that appeared to be trained identically.
Conclusion: The Complete Stack
This article completes a four-part series that addresses the full stack of concerns for pharmaceutical ML systems. The first article cataloged what breaks in pathogen genomics ML pipelines, establishing that most failures are infrastructure failures rather than modeling failures. The second article prescribed a systems-first approach to pipeline design, emphasizing modularity, observability, and explicit handling of uncertainty. The third article covered long-term sustainability — the organizational, governance, and knowledge-transfer practices that keep ML systems alive across personnel changes and organizational evolution. This article fills the foundation: the data engineering layer that determines whether models built on top of it are learning from chemistry or from data artifacts.
The data engineering layer is not glamorous. It does not produce novel architectures, state-of-the-art benchmarks, or publications in high-impact journals. But it determines whether models built on top of it are learning from reality or from the accidents of inconsistent data handling. A perfectly architected pipeline feeding carefully maintained models with comprehensive governance documentation will still produce unreliable predictions if the data it ingests confuses salt forms with parent molecules, treats tautomers as distinct compounds, or computes different features at training time and inference time.
The convergence of chemical and biological data in pharmaceutical ML is accelerating. Foundation models trained simultaneously on molecular structures, protein sequences, and genomic data demand even more rigorous data engineering — consistent molecular representations, reliable cross-modal entity resolution, and auditable provenance across data types that were never designed to interoperate. Organizations that invest in their data foundation now, treating molecular data engineering as infrastructure rather than scripting, will be positioned to adopt these capabilities as they mature. Those that continue treating data preparation as someone else's problem will continue to build models on sand.
Recall the $200,000 model from the opening of the sustainability article — the one shelved 18 months after its developer left because nobody could run it, debug it, or update it. Trace the root cause of failures like that back far enough and, in most cases, the problem started in the data layer. Not in the architecture. Not in the hyperparameters. Not in the deployment configuration. In the data.
Fix the foundation first.
**References**
Al-Sherhi, M. et al. (2022). canSAR chemistry registration and standardization pipeline. *Journal of Cheminformatics*, 14(38). https://doi.org/10.1186/s13321-022-00606-7
Landrum, G. (2020). Trying out the new tautomer canonicalization code. *RDKit Blog*. https://greglandrum.github.io/rdkit-blog/
RDKit: Open-Source Cheminformatics Software. MolStandardize module documentation. https://www.rdkit.org/docs/source/rdkit.Chem.MolStandardize.rdMolStandardize.html
Ebejer, J.P. (2022). Better Models Through Molecular Standardization. *Oxford Protein Informatics Group Blog*. https://www.blopig.com/blog/2022/05/molecular-standardization/
David, L. et al. (2023). From intuition to AI: evolution of small molecule representations in drug discovery. *Briefings in Bioinformatics*, 25(1), bbad422.
Vamathevan, J. et al. (2019). Applications of machine learning in drug discovery and development. *Nature Reviews Drug Discovery*, 18, 463–477.
Jiménez-Luna, J. et al. (2020). Drug discovery with explainable artificial intelligence. *Nature Machine Intelligence*, 2, 573–584.
U.S. Food and Drug Administration. (2025). Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products (Draft Guidance).
U.S. Food and Drug Administration. (2021). Good Machine Learning Practice for Medical Device Development: Guiding Principles.
Workflow Informatics. (2025). From Data to Drug Candidates: Optimizing Informatics for ML and GenAI. *Drug Discovery & Development*.
PracData. (2025). State of Open Source Workflow Orchestration Systems 2025. https://www.pracdata.io/
