Ten Essential Practices for Building Sustainable ML Systems in Pharma

A promising machine learning model for predicting drug-drug interactions gets shelved 18 months after a postdoc leaves the organization. The model cost $200,000 to develop and showed excellent validation performance, but now it sits dormant. The principal investigator cannot figure out how to run it, the data pipeline mysteriously broke after a dependency update, and nobody documented the preprocessing steps that were critical to achieving the reported accuracy. This scenario, while fictional, reflects a pattern playing out across pharmaceutical and biotech organizations: machine learning systems that fail not due to poor science, but due to inadequate attention to sustainability.

Unlike the consumer technology sector where "move fast and break things" can be a viable philosophy, pharmaceutical ML systems operate under fundamentally different constraints. These systems need lifespans measured in years or decades, not months. Regulatory requirements demand that organizations demonstrate reproducibility years after initial deployment, potentially in response to FDA audits or patent disputes. Team turnover is inevitable as postdocs graduate, scientists accept positions at other companies, and organizational restructuring shuffles responsibilities. Perhaps most critically, you cannot easily sunset a model that has become integrated into active drug development programs without significant business disruption.

The U.S. Food and Drug Administration has recognized these challenges in its guidance for AI/ML-based medical devices, emphasizing the need for a Total Product Lifecycle (TPLC) approach that maintains safety and effectiveness throughout a system's operational life. The FDA's 2021 Good Machine Learning Practice principles explicitly call on organizations to foster a culture of quality and organizational excellence, with responsibility extending beyond initial development to long-term viability.

In previous articles on this platform, we have covered ML pipeline failure modes and customization strategies. But even well-designed pipelines fail if they cannot be maintained by teams that change over time. Sustainability is not purely a technical challenge—it encompasses organizational processes, regulatory compliance, and human factors. These ten practices will help ensure your ML systems remain useful, compliant, and operable long after the original developers move on, transforming them from fragile prototypes into robust infrastructure.

 

Practice 1: Treat Data Provenance as a First-Class Requirement

Every data point used in training, validation, or production must be traceable to its source with complete timestamps and transformation history. This principle, seemingly obvious, is violated more often than upheld in practice.

Why It Matters in Pharma

Pharmaceutical machine learning systems face unique accountability requirements. FDA audits may ask you to explain exactly how a model arrived at a specific prediction made several years ago. Dataset drift detection requires knowing precisely what changed between training and production. Reproducing results from a 2019 publication for a regulatory submission demands access to the exact data version used in the original analysis.

The FDA's guidance on computational model credibility emphasizes "comprehensive and representative input" documentation, making data provenance not merely a best practice but a regulatory expectation. When ML predictions influence drug development decisions, the ability to trace those predictions back to specific data sources becomes a compliance requirement.

Implementation Approaches

Modern data version control tools provide Git-like workflows specifically designed for large datasets. DVC (Data Version Control), an open-source system widely used in drug discovery applications, "helps data scientists manage, track and version data and models, as well as run reproducible experiments" through a command-line interface that integrates seamlessly with existing Git repositories.

DVC addresses the fundamental mismatch between traditional version control systems and machine learning requirements. While Git excels at tracking source code, it struggles with the multi-gigabyte datasets common in pharmaceutical applications. DVC stores large data files in external storage (Amazon S3, Google Cloud Storage, or on-premises systems) while maintaining version metadata in Git, providing "the same user experience as if they were in the repo" without the performance penalties.

Practical implementation requires establishing metadata standards. Every dataset should include a data card documenting its provenance:

```json
{
  "data_id": "chembl_v31_subset_2023-06-15",
  "source_system": "ChEMBL v31",
  "extraction_date": "2023-06-15T14:30:00Z",
  "extraction_query": "SELECT * FROM compound_properties WHERE herg_activity IS NOT NULL",
  "preprocessing_pipeline_version": "v2.3.1",
  "preprocessing_git_commit": "a4f3d9e",
  "validation_checks_passed": ["schema_valid", "no_nulls_in_key_fields", "id_uniqueness"],
  "row_count": 12450,
  "checksum_md5": "e4d909c290d0fb1ca068ffaddf22cbd0"
}
```

Pharma-Specific Example

A toxicity prediction model at a mid-size pharmaceutical company failed an FDA audit not because the model was inaccurate, but because the team could not prove which version of ChEMBL they had used for training. The model performed well, but recreating the exact training set required three months of archaeological work through Git commit messages and email threads. Engineers eventually found scattered references suggesting ChEMBL v29 but could not demonstrate this with certainty.

Following this painful experience, the organization implemented a policy requiring a `data_manifest.json` file with every model artifact. This simple metadata file documents data sources, extraction dates, preprocessing versions, and validation status. What once required months of investigation now takes minutes, and the organization can confidently respond to regulatory inquiries with documented evidence.
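Checking a manifest against its data file is easy to automate. The sketch below assumes the `data_manifest.json` layout shown earlier (only the `checksum_md5` field is used) and is illustrative rather than the organization's actual tooling:

```python
import hashlib
import json

def validate_manifest(manifest_path, data_path):
    """Check that a data file still matches the checksum recorded in its manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    # Hash the data file in chunks so multi-gigabyte files do not exhaust memory
    md5 = hashlib.md5()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    actual = md5.hexdigest()
    expected = manifest["checksum_md5"]
    return {"match": actual == expected, "expected": expected, "actual": actual}
```

Running a check like this in CI, or before any retraining run, turns the manifest from passive documentation into an enforced invariant.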

Common Pitfalls

The most frequent failure mode is storing files named "cleaned_data_final_v3_FINAL_ACTUALLY_FINAL.csv" without accompanying metadata. Such naming conventions encode minimal information and quickly become archaeological puzzles. Even when version numbers appear in filenames, they typically lack crucial context about what changed between versions or how the data was generated.

Practical Recommendations

Create a data card template inspired by the model card framework and require it for every dataset. This lightweight documentation overhead during data preparation pays enormous dividends during audits, troubleshooting, and knowledge transfer. Store data cards in version control alongside code, making them easily discoverable and ensuring they evolve with the data itself.

 

Practice 2: Build for the "Bus Factor" of One

Assume the primary developer will be unavailable tomorrow. Can someone else run, debug, and update your system using only available documentation? This thought experiment, sometimes called the "bus factor" or "lottery factor," exposes sustainability vulnerabilities in ML systems.

Why It Matters in Pharma

Pharmaceutical organizations face constant personnel transitions. Postdocs complete their training and move to academic positions or industry roles. Contractors reach the end of their engagements. Key personnel take parental leave, medical leave, or accept positions at competing organizations. Mergers and acquisitions shuffle entire teams. According to industry surveys, the average tenure for postdoctoral researchers ranges from two to four years, making knowledge loss a predictable rather than exceptional event.

Implementation Approaches

Effective documentation operates at multiple levels. At the repository level, every directory should contain a `README.md` explaining its purpose and contents. At the pipeline level, maintain a runbook covering common operational tasks. At the architecture level, document key design decisions using Architecture Decision Records (ADRs) that explain not just what was built but why specific approaches were chosen over alternatives.

Runbook example structure:

```markdown
# Pipeline Runbook: Toxicity Prediction Model
## Quick Start (5 minutes to first prediction)
1. Clone repository: `git clone ...`
2. Install dependencies: `conda env create -f environment.yml`
3. Download data: `dvc pull`
4. Run inference: `python predict.py --input examples/sample.csv`
## Architecture Overview
[Include diagram showing data flow from input to prediction]
## Common Tasks
### Task: Retrain model with new data
1. Place new data in `data/raw/update_YYYY-MM-DD.csv`
2. Run preprocessing: `python src/preprocess.py --input data/raw/update_YYYY-MM-DD.csv`
3. Trigger training: `python src/train.py --config configs/production.yaml`
4. Expected output: New model in `models/toxicity_vX.Y.Z.pkl` with validation report
### Task: Debug prediction failures
1. Check logs in `logs/predictions.log` for error messages
2. Validate input schema: `python src/validate_input.py --data problem_file.csv`
3. Common issues:
   - Missing features → Error: "KeyError: 'molecular_weight'" → Check preprocessing step
   - Out-of-range values → Warning: "LogP=15.2 exceeds training range" → Consider if molecule is in applicable chemical space 
## Key Design Decisions
### Why MongoDB over PostgreSQL?
We chose MongoDB for feature storage because molecular descriptors vary significantly between compounds (not all molecules have the same set of calculated properties). MongoDB's flexible schema accommodates this variability without requiring complex JOIN operations. Decision logged in ADR-003.
### Why ensemble models?
Single models showed instability on edge cases. Ensemble of Random Forest + Gradient Boosting + Neural Network provides robustness through diversity. Increases inference time by 3x but reduces prediction variance by 40%. Decision logged in ADR-007.
```

Minimize dependency complexity. Every additional dependency, especially exotic or cutting-edge libraries, increases maintenance burden. A project using PyTorch, NumPy, and Scikit-learn will be maintainable for years. A project requiring specific versions of seven specialized libraries will become unmaintainable as soon as one library's API changes or becomes deprecated.

Pharma-Specific Example

At a contract development and manufacturing organization (CDMO), the primary crystal structure prediction pipeline was maintained by a single computational chemist. When he departed for a faculty position, the system broke within two weeks due to a routine dependency update. Recovery took six months because nobody understood the custom preprocessing steps for handling protein-ligand interactions.

The organization now conducts quarterly "handoff drills" where a different team member must successfully run the entire pipeline using only available documentation. If they cannot complete the task within a day, documentation gaps are identified and addressed. This discipline ensures that every critical system can survive the loss of its primary maintainer.

Practical Testing

Have someone unfamiliar with your project attempt to run it using only your documentation. Time how long the process takes and track every question they ask. Each question represents a documentation gap. Well-documented systems should enable a qualified but unfamiliar user to achieve basic functionality within an hour and complete complex tasks within a day.

Practice 3: Implement Automated Monitoring Before Deployment 

You cannot maintain what you cannot measure. Establishing comprehensive monitoring before deploying to production enables detection of degradation before users encounter problems. 

Why It Matters in Pharma 

ML models degrade silently as chemical space shifts, assay conditions change, or underlying biological understanding evolves. Data pipeline failures may not trigger immediate errors but gradually compromise prediction quality. Regulatory compliance requires demonstrating ongoing performance throughout a system's operational lifetime, not just at initial validation.

Monitoring Hierarchy

Effective monitoring operates at multiple levels, from basic system health to business impact:

  •  Input Data Quality: Track missing value percentages, outlier detection, schema validation, and feature distribution statistics. A sudden increase in missing values often indicates upstream pipeline issues before they manifest as prediction failures.

  •  Model Performance: Monitor predictions on held-out test sets over time. For classification tasks, track accuracy, precision, recall, and AUC-ROC. For regression tasks, monitor mean absolute error and R-squared. Performance should remain stable; degradation signals the need for investigation.

  •  Prediction Distribution: Analyze whether the model encounters novel inputs outside its training distribution. Calculate similarity metrics between new inputs and training data. In pharmaceutical applications, Tanimoto similarity for molecular structures provides effective chemical space coverage assessment.

  •  System Health: Monitor inference latency, memory consumption, error rates, and throughput. Performance degradation often appears in system metrics before manifesting in prediction quality.

  •  Business Metrics: Track whether predictions are actually being used. The most accurate model provides zero value if researchers do not trust it or find it too slow for practical use.
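The input-data-quality tier above can be sketched in a few lines; the field names and the 5% threshold below are illustrative placeholders, and rows are plain dicts for self-containment:

```python
def check_input_quality(rows, required_fields, max_missing_fraction=0.05):
    """Return per-field missing-value fractions and flag fields above threshold."""
    n = len(rows)
    report = {"row_count": n, "alerts": []}
    for field in required_fields:
        # Count rows where the field is absent or explicitly null
        missing = sum(1 for r in rows if r.get(field) is None)
        fraction = missing / n if n else 0.0
        report[field] = fraction
        if fraction > max_missing_fraction:
            report["alerts"].append(f"{field}: {fraction:.1%} missing exceeds threshold")
    return report
```

A report like this, logged daily, makes a sudden jump in missing values visible long before it surfaces as a prediction failure.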

Implementation Tiers

Organizations can adopt monitoring at different sophistication levels depending on resources and maturity:

  • Basic: Daily cron job executes test suite and emails results if failures occur. Total implementation time: one afternoon.

  • Intermediate: Prometheus + Grafana dashboards with alerting. Visualize key metrics and configure alerts for threshold violations. Implementation time: one week.

  • Advanced: ML-specific monitoring platforms like Evidently AI or Whylabs providing sophisticated drift detection and model observability. These platforms detect "changes in the overall data distribution" and provide statistical tests for identifying significant shifts. Implementation time: two weeks.

Pharma-Specific Example

A bioavailability prediction model at a preclinical contract research organization silently degraded over eight months. The training data consisted entirely of oral small molecules in traditional pharmaceutical chemical space. However, the company pivoted to peptide therapeutics, and the model encountered molecules far outside its training distribution. Nobody noticed until a failed synthesis campaign prompted investigation.

The organization now monitors three key indicators: Tanimoto similarity between new query molecules and the training set (alert triggers if more than 10% of predictions fall below 0.3 similarity), prediction confidence distributions (flag if high-uncertainty predictions exceed baseline), and downstream success rates of recommended compounds. These leading indicators provide early warning before business impact occurs.
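The similarity alert can be sketched as follows. Fingerprints are represented here as plain sets of bit indices to keep the example self-contained (in practice they would come from RDKit Morgan fingerprints); the 0.3 cutoff and 10% alert threshold follow the text:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def coverage_alert(query_fps, training_fps, sim_cutoff=0.3, max_low_fraction=0.10):
    """Alert if too many queries have no similar neighbor in the training set."""
    low = 0
    for q in query_fps:
        # Nearest-neighbor similarity to the training set
        best = max((tanimoto(q, t) for t in training_fps), default=0.0)
        if best < sim_cutoff:
            low += 1
    fraction = low / len(query_fps) if query_fps else 0.0
    return {"low_similarity_fraction": fraction, "alert": fraction > max_low_fraction}
```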

Monitoring Checklist

Effective monitoring implementations address these components:

  • Input data schema validation (detect structural changes)

  • Feature distribution drift detection (identify statistical shifts)

  • Model prediction distribution tracking (monitor output patterns)

  • Performance metrics on validation set (measure accuracy, updated quarterly or monthly)

  • System uptime and latency (ensure reliability)

  • Dependency version tracking (prevent unexpected breakage)

Starting Simply

A Python script that runs weekly and posts results to Slack is infinitely better than no monitoring. Start with basic checks and incrementally add sophistication. The perfect monitoring system that never gets implemented helps nobody.
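Such a script can be very small. The sketch below builds an alert message and posts it to a Slack incoming webhook using only the standard library; the webhook URL is a placeholder and `run_checks` is a stand-in for your real test suite:

```python
import json
import urllib.request

def run_checks():
    """Stand-in health checks; replace with calls into your real test suite."""
    return {"schema_valid": True, "missing_value_rate_ok": True, "latency_ok": False}

def build_message(results):
    """Summarize check results as a one-line Slack message."""
    failures = [name for name, ok in results.items() if not ok]
    if failures:
        return "ML pipeline checks FAILED: " + ", ".join(failures)
    return "ML pipeline checks passed"

def post_to_slack(webhook_url, text):
    """POST the message to a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode()
    request = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

# Wire into cron or CI, e.g.:
# post_to_slack(WEBHOOK_URL, build_message(run_checks()))
```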

Practice 4: Version Everything (Data, Code, Models, Environment)

Reproducibility requires versioning the entire computational graph, not just model weights or source code. Incomplete versioning creates reproduction challenges that compound over time.

Why It Matters in Pharma

The inability to reproduce results from an Investigational New Drug (IND) filing creates regulatory nightmares. Debugging production issues requires knowing exactly what code, data, and dependencies ran at specific times. Rolling back to previous versions must be fast and reliable when new deployments introduce unexpected behavior. Patent disputes may require demonstrating computational methods used years earlier.

The FDA's guidance on AI/ML-enabled medical devices emphasizes "detailed traceability from raw data to model output" as essential for regulatory compliance. This traceability depends fundamentally on comprehensive versioning.

The Four Versioning Pillars

  • Code: Use Git with tagged releases for production deployments. Tags create immutable references to specific code states, enabling precise reproduction. Tag format example: `v2.3.1-production-2023-11-20`.

  • Data: DVC, LakeFS, or at minimum timestamped snapshots with checksums. DVC enables "describing projects in a way that can be built and reproduced", forming the foundation for continuous integration and continuous delivery of ML systems. Store data version hashes in metadata to enable exact reproduction.

  • Models: MLflow, Weights & Biases, or versioned object storage with comprehensive metadata. Every model artifact should be retrievable by unique identifier with associated training metadata.

  • Environment: Docker images with pinned dependencies or Conda `environment.yml` files specifying exact versions. Using "latest" tags or unpinned versions creates reproducibility landmines.

Implementation Pattern

Every model artifact should include comprehensive metadata:

```python
model_metadata = {
    'model_id': 'toxicity_v2.3.1',
    'model_type': 'random_forest_ensemble',
    'git_commit': 'a4f3d9e',
    'git_tag': 'v2.3.1-production',
    'data_version': 'chembl_v31_2023-06-15',
    'data_checksum': 'e4d909c290d0fb1ca068ffaddf22cbd0',
    'docker_image': 'company/ml-base:2023.11@sha256:abcd...',
    'training_date': '2023-11-20T14:30:00Z',
    'training_duration_hours': 4.2,
    'dependencies': 'requirements_v2.3.txt',
    'hyperparameters': {
        'n_estimators': 100,
        'max_depth': 15,
        'min_samples_split': 10
    },
    'validation_metrics': {
        'auc_roc': 0.89,
        'precision': 0.84,
        'recall': 0.82
    }
}
```

Pharma-Specific Example

A large pharmaceutical company's QSAR model for hERG liability needed rerunning for a patent dispute four years after initial development. Because the organization had implemented comprehensive versioning—code in Git with tags, data in DVC, environment as Docker image, model in MLflow—one engineer reproduced exact results in two hours. The version metadata pointed to specific Git commit `v1.8.2`, data version `herg_training_2019-03-15`, and Docker image with SHA hash ensuring identical dependencies.

The opposing counsel's technical expert could not reproduce their own model's results, significantly weakening their case. This example demonstrates how comprehensive versioning provides business value far exceeding implementation cost.

Common Pitfalls

The most dangerous failure mode is versioning code while using "latest" for Docker base images or running `pip install package` without pinning versions. This creates the illusion of reproducibility while introducing subtle variation. A model trained with NumPy 1.20 may behave differently than the same code with NumPy 1.24, even if the code is identical.

Practical Workflow

Establish this discipline for every model training run:

  1. Tag Git commit when training starts (`git tag v2.3.1-training-start`)

  2. Log data version hash to MLflow (`data_version: chembl_v31_abc123`)

  3. Save Docker image SHA to model metadata (not just `ml-base:latest` but `ml-base:2023.11@sha256:...`)

  4. Create reproduction script referencing all versions

  5. Test reproduction quarterly to verify the process works

Quarterly reproduction testing catches versioning failures before they become critical. If you cannot reproduce your own results three months later, you certainly cannot reproduce them three years later under regulatory scrutiny.
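A reproduction smoke test along these lines can run unattended each quarter. The sketch below re-derives the data checksum and compares installed package versions against pins; the metadata fields mirror the earlier `model_metadata` example and are illustrative:

```python
import hashlib
from importlib import metadata as importlib_metadata

def verify_reproduction(model_metadata, data_path, pinned_packages):
    """Return mismatches between recorded and current state; empty list means pass."""
    problems = []
    # Re-derive the training data checksum and compare to the recorded value
    md5 = hashlib.md5()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    if md5.hexdigest() != model_metadata["data_checksum"]:
        problems.append("data checksum changed")
    # Compare installed package versions to the pinned manifest
    for package, pinned in pinned_packages.items():
        try:
            installed = importlib_metadata.version(package)
        except importlib_metadata.PackageNotFoundError:
            problems.append(f"{package} not installed")
            continue
        if installed != pinned:
            problems.append(f"{package}: pinned {pinned}, installed {installed}")
    return problems
```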

Practice 5: Design for Data Drift and Concept Drift

The world changes. Training data becomes stale. Failing to plan for this reality guarantees eventual model failure.

Why It Matters in Pharma

Chemical libraries evolve as medicinal chemists explore new scaffolds and modifications. Assay conditions change when laboratories acquire new equipment or update protocols. Biological understanding improves, shifting what features are considered important. Regulatory requirements evolve. Market factors influence which therapeutic areas receive investment, changing the distribution of molecules being evaluated.

Machine learning systems deployed in pharmaceutical settings must anticipate these changes. The research literature on learning under concept drift warns that failing to account for changing relationships between inputs and outputs can severely degrade models in production.

Types of Drift

  • Data Drift (Covariate Shift): Input distribution changes. Example: An ADME model trained on small molecules now receives peptide queries. Feature distributions shift even though the underlying relationships remain constant.

  • Concept Drift: Relationships between inputs and outputs change. Example: A new crystallization protocol alters how molecular properties influence solubility, invalidating learned relationships even when molecules themselves remain similar.

  • Label Drift: Output distribution changes. Example: A company pivots from central nervous system drugs to oncology therapeutics, completely changing the distribution of desired properties.

Understanding which type of drift is occurring informs appropriate responses. Data drift may require expanding training data to cover new chemical space. Concept drift typically necessitates retraining to learn new relationships.

Detection Strategies

Statistical hypothesis tests provide quantitative drift detection. The Kolmogorov-Smirnov test assesses whether two continuous distributions differ significantly; chi-squared tests serve the same role for categorical features. IBM's analysis of model drift notes that many popular drift detectors rely on distribution-based methods that measure potential deviations between two probability distributions.
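The two-sample Kolmogorov-Smirnov statistic is simple enough to sketch directly; in production you would typically use `scipy.stats.ks_2samp`, which also returns a p-value:

```python
def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    values = sorted(set(a) | set(b))
    max_d = 0.0
    for v in values:
        # Empirical CDF of each sample evaluated at v
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        max_d = max(max_d, abs(cdf_a - cdf_b))
    return max_d
```

Identical distributions give a statistic near zero; completely disjoint ones give 1.0, so a rising statistic on, say, a monitored molecular descriptor is a direct drift signal.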

Monitor model performance on recent validation data compared to historical baselines. Degrading performance on recent data while historical performance remains stable signals that the model's learned relationships no longer reflect current reality.

Domain experts provide invaluable perspective on edge case predictions. Establish regular reviews where scientists examine challenging or surprising predictions. Their reactions often reveal drift before it appears in quantitative metrics.

Response Strategies

  • Retrain: Scheduled retraining (quarterly or annually) or triggered by drift detection ensures models stay current. When sufficient new data accumulates, retraining captures evolved patterns.

  • Update: Incremental learning techniques add new knowledge without full retraining. Appropriate when relationships gradually evolve rather than fundamentally shift.

  • Ensemble: Combine old and new models during transition periods. Gradual weighting shifts from old to new model as confidence in new model grows.

  • Retire: If drift is severe and fundamental, acknowledge the model's limitations. Continuing to use an obsolete model may be worse than having no model.
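The transition-period ensemble above can be sketched as a simple weight ramp; the 90-day window and linear schedule are illustrative choices, not a prescribed policy:

```python
def blended_prediction(old_pred, new_pred, days_since_deploy, ramp_days=90):
    """Linearly shift weight from the old model to the new one over ramp_days."""
    # Clamp the new-model weight to [0, 1]
    w_new = min(1.0, max(0.0, days_since_deploy / ramp_days))
    return (1.0 - w_new) * old_pred + w_new * new_pred
```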

Pharma-Specific Example

A CYP450 inhibition model at a biotech company was trained on data from 2015-2018, covering traditional small molecule inhibitors. By 2022, the compound library had shifted toward PROTACs and molecular glues—chemistries barely present in training data. Automated Tanimoto similarity monitoring detected this shift: average similarity to training set dropped from 0.65 to 0.42 over 18 months.

Rather than trusting degraded predictions, the team implemented a four-phase response:

  1. Flag predictions on novel chemotypes as "low confidence" to prevent researchers from making decisions based on unreliable outputs

  2. Collect experimental data on 500 representative PROTACs to establish ground truth

  3. Retrain with mixed dataset containing both traditional inhibitors and new chemistries

  4. Validate performance on held-out set before restoring full confidence

This measured approach prevented both the continuation of unreliable predictions and the premature abandonment of useful infrastructure.

Drift Response Decision Tree

```
Is performance declining?
├─ Yes → Is new data available?
│  ├─ Yes → Retrain and validate
│  └─ No → Reduce model scope or flag uncertain predictions
└─ No → Is input distribution shifting?
   ├─ Yes → Collect new validation data, increase monitoring frequency
   └─ No → Maintain current schedule
```

This simple decision tree guides responses to drift signals, balancing the costs of action (retraining requires time and data) against the costs of inaction (degraded predictions mislead researchers).

Practice 6: Maintain a Living Test Suite 

Tests are not merely for catching bugs. They serve as executable documentation of expected behavior and regression prevention mechanisms.

Why It Matters in Pharma

Refactoring without tests is reckless when predictions influence drug development decisions involving millions of dollars. Tests document edge cases and business logic that may not be obvious from code alone. Regulatory auditors appreciate demonstrated validation and quality assurance practices.

Test Hierarchy for ML Systems

Unit Tests (fast, numerous): Test data preprocessing functions, feature engineering logic, and utility functions in isolation. Example:

```python
def test_molecular_weight_calculation():
    """Caffeine should have MW ~194 Da"""
    caffeine_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    mw = calculate_molecular_weight(caffeine_smiles)
    assert 194.0 < mw < 195.0, f"Unexpected MW: {mw}"
```

Integration Tests (medium speed): Test pipeline end-to-end on small datasets, verify API endpoints function correctly, confirm database connections work. These tests ensure components work together even when individual units function correctly.

Model Validation Tests (slower): Verify performance on fixed validation sets, test prediction invariants (certain structural changes should not drastically alter predictions), confirm known molecules give expected results.

```python
def test_known_herg_blockers():
    """Terfenadine is a known hERG blocker and should be flagged"""
    terfenadine_smiles = "CC(C)(C)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(c3ccccc3)c4ccccc4"
    prediction = model.predict(terfenadine_smiles)
    assert prediction['herg_inhibition'] > 0.7, \
        f"Known blocker should be flagged high risk, got {prediction['herg_inhibition']}"
```

Property-Based Tests: Test invariants that should hold across many inputs. Adding a single methyl group should not change logP by more than 2 units. Molecules with symmetric structures should give identical predictions regardless of which tautomer is provided.
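One way to express such an invariant as a plain test; `predict_logp` and the parent/methyl-analog pairs below are hypothetical stand-ins for your model and a curated reference set:

```python
def check_methyl_invariant(predict_logp, analog_pairs, max_delta=2.0):
    """Each pair is (molecule, methylated analog); predictions should stay close."""
    violations = []
    for parent, analog in analog_pairs:
        delta = abs(predict_logp(parent) - predict_logp(analog))
        if delta > max_delta:
            # Record the offending pair and the size of the jump
            violations.append((parent, analog, delta))
    return violations
```

Run over a few dozen curated pairs, a non-empty violation list is a strong hint that the model has memorized artifacts rather than learned chemistry.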

Pharma-Specific Example

A formulation prediction model at a pharmaceutical company contained a subtle bug: NaN values in tablet hardness features were being filled with zero instead of the column median. This caused systematic under-prediction of dissolution rates, but the error went undetected for nine months. Researchers noticed that all predictions for a novel excipient seemed suspiciously low, triggering investigation.

A proper test suite would have caught this immediately:

```python
def test_nan_handling_in_predictions():
    """NaN inputs should be handled explicitly, not silently converted to zero"""
    input_with_nan = create_test_input(hardness=np.nan)
    result = model.predict(input_with_nan)
    # Should not produce NaN prediction
    assert not np.isnan(result['dissolution_rate']), \
        "Model produced NaN prediction"
    # Should document that imputation occurred
    assert 'missing_features' in result['metadata'], \
        "Missing feature handling not documented"
    assert 'hardness' in result['metadata']['missing_features'], \
        "Hardness imputation not logged"
```

Practical Test Coverage Goals

Aim for 80% or higher coverage of data processing code, where bugs are frequent and consequences are severe. Achieve 100% coverage of critical path functions that directly influence predictions. Maintain at least 20 "known molecules" with expected predictions as regression tests. Add regression tests for every bug discovered—if a bug occurred once, similar bugs may occur during future changes.

Continuous Testing

Run the full test suite on every commit via continuous integration/continuous delivery (CI/CD) pipelines. DVC's CI/CD integration can enforce integrity automatically through application-specific tests covering both data validation and model validation.

Run validation tests weekly on production models to detect drift or degradation. Conduct annual "adversarial testing" sessions where domain experts deliberately try to find edge cases that fool the model. These sessions often reveal failure modes that standard metrics miss.

Practice 7: Document Deployment Context and Limitations

Be explicit about when your model should and should not be used. Ambiguity leads to misuse, which leads to expensive failures.

Why It Matters in Pharma

Misapplied models cause costly failures—bad compounds advance while good compounds are incorrectly rejected. Legal and regulatory liability increases when limitations are not clearly communicated. New users will absolutely misuse models without explicit guidance about appropriate application domains.

Model Card Components for Pharma

Adapt standard model card frameworks to pharmaceutical contexts:

  • Intended Use: "This model predicts blood-brain barrier permeability for small molecules (MW 150-600 Da) with standard functional groups. Optimized for central nervous system drug discovery applications."

  • Out-of-Scope Use: "NOT suitable for: peptides (>800 Da), PROTACs, compounds containing unusual metals (beyond Fe, Zn, Cu), prodrugs, molecules violating Lipinski's Rule of Five. Performance degrades significantly on these compound classes."

  • Training Data Characteristics: "Trained on 12,450 molecules from ChEMBL v29, published 2010-2020. Dataset composition: 70% CNS-active compounds, 30% peripherally acting drugs. Chemical space: primarily Lipinski-compliant small molecules."

  • Performance Metrics: "Test set (n=3,112): AUC-ROC 0.83, Precision 0.76, Recall 0.79. Performance degrades on high molecular weight compounds: AUC-ROC 0.71 for MW > 600 Da vs. 0.85 for MW < 500 Da."

  • Known Failure Modes: "Systematically under-predicts permeability for zwitterionic compounds at physiological pH. Struggles with quaternary ammonium compounds. Overconfident predictions for molecules containing exotic heterocycles not well-represented in training data."

  • Update Schedule: "Retrained annually each Q1 using updated ChEMBL release. Last update: March 2023. Next scheduled update: March 2024. Ad hoc retraining triggered if drift detection exceeds thresholds."

  • Responsible Use Guidelines: "Predictions should be treated as prioritization tools, not definitive assessments. Always validate top candidates experimentally before advancing to expensive synthesis or in vivo studies. Consult medicinal chemist for compounds flagged as 'low confidence' or falling outside training distribution."

Pharma-Specific Example

A solubility prediction model at a contract research organization was being used by medicinal chemists to optimize compound properties. Several expensive synthesis failures occurred because chemists assumed the model worked for all molecules. Investigation revealed the model had been trained exclusively on neutral molecules, but 30% of queries involved salts and zwitterions where performance was terrible (R² = 0.23 vs. 0.81 for neutral molecules).

After adding prominent warnings ("This model only works for neutral molecules. For salts and zwitterions, use Model XYZ instead") and implementing an automatic router based on molecular charge state, misuse dropped by 90%. The router checks formal charge and directs each query to the appropriate model, sparing users from having to remember the limitation.
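A minimal sketch of such a charge-state router, assuming the per-atom formal charges have already been computed upstream by a cheminformatics toolkit. The model names are hypothetical placeholders, not the models from the example:

```python
def route_solubility_query(formal_charges):
    """Route a query to a solubility model based on charge state.

    `formal_charges` is the list of per-atom formal charges for the
    query molecule. Model names below are illustrative placeholders.
    """
    has_positive = any(c > 0 for c in formal_charges)
    has_negative = any(c < 0 for c in formal_charges)

    if has_positive and has_negative:
        return 'solubility-zwitterion-v1'  # zwitterions: dedicated model
    if has_positive or has_negative:
        return 'solubility-salt-v1'        # charged species and salts
    return 'solubility-neutral-v2'         # neutral molecules: original model
```

Routing on a cheap, deterministic property like formal charge keeps the check transparent and auditable, which matters when the router itself becomes part of a regulated workflow.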

Implementation Approaches

Create a `MODEL_CARD.md` file in your repository documenting all limitations and appropriate use cases. Display limitations prominently in UI/API responses alongside predictions. Add programmatic checks that warn users when inputs fall in uncertain regions:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Morgan fingerprints (radius 2, 2048 bits) for the similarity check
_fp_generator = GetMorganGenerator(radius=2, fpSize=2048)

def assess_prediction_confidence(molecule, training_set_sample):
    """Assess confidence in a prediction based on molecular properties.

    `molecule` is an RDKit Mol; `training_set_sample` is a representative
    sample of training-set Mols used for the applicability-domain check.
    """
    confidence = 'high'
    warnings = []

    # Check similarity to the training distribution
    query_fp = _fp_generator.GetFingerprint(molecule)
    max_similarity = max(
        DataStructs.TanimotoSimilarity(query_fp, _fp_generator.GetFingerprint(m))
        for m in training_set_sample
    )
    if max_similarity < 0.3:
        confidence = 'low'
        warnings.append("Novel chemical space - no similar training examples")

    # Check molecular weight range
    mw = Descriptors.MolWt(molecule)
    if mw > 600:
        # Only downgrade confidence; never upgrade an already-low value
        confidence = 'medium' if confidence == 'high' else confidence
        warnings.append(f"Molecular weight {mw:.1f} Da exceeds typical range (150-600 Da)")

    # Check for problematic functional groups (quaternary ammonium SMARTS)
    if molecule.HasSubstructMatch(Chem.MolFromSmarts('[NX4+]')):
        confidence = 'low'
        warnings.append("Contains quaternary ammonium group - known failure mode")

    return {
        'confidence': confidence,
        'warnings': warnings,
        'recommendation': get_recommendation(confidence),
    }

def get_recommendation(confidence):
    """Provide usage guidance based on confidence level"""
    recommendations = {
        'high': "Prediction reliable for prioritization",
        'medium': "Use with caution, validate top candidates",
        'low': "Prediction unreliable - consult domain expert or collect experimental data",
    }
    return recommendations[confidence]
```

This programmatic approach ensures warnings appear automatically rather than relying on users to remember limitations.

Practice 8: Plan for Knowledge Transfer from Day One

Documentation is necessary but insufficient. Create multiple knowledge transfer mechanisms operating at different levels.

Why It Matters in Pharma

The average tenure of postdoctoral researchers ranges from two to four years. Scientists receive promotions, transfer internally, or join competitors. Institutional knowledge evaporates without explicit capture mechanisms. The cost of lost knowledge manifests in replication of previous work, inability to maintain systems, and institutional amnesia about why specific decisions were made.

Knowledge Transfer Hierarchy

Level 1: Documentation (passive): READMEs, runbooks, and architecture diagrams provide baseline reference. Code comments should explain *why* decisions were made, not just *what* the code does. Architecture decision records (ADRs) document the rationale for non-obvious choices.

Level 2: Recorded Walkthroughs (asynchronous): 15-minute Loom videos demonstrating "how to retrain the model" or "how to debug common failure modes" enable self-service learning. Screen recordings of debugging sessions capture tacit knowledge about troubleshooting approaches. Quarterly "state of the system" presentations (recorded) provide context about current status and planned changes.

Level 3: Shadowing and Pairing (synchronous): New team members pair with maintainers for one week, observing and asking questions. "Teaching sessions" where maintainers explain design decisions build deeper understanding than documentation alone. Code reviews become teaching opportunities rather than mere quality gates.

Level 4: Redundancy (organizational): Always maintain two or more people who can maintain critical systems. Quarterly "swap days" where backup maintainers take over operations test readiness. Cross-training between teams creates broader organizational resilience.

Pharma-Specific Example

A biosimulation platform at a large pharmaceutical company was maintained by one exceptional computational biologist for six years. When she accepted a director role at another company, the system nearly collapsed. Documentation existed but assumed deep domain knowledge. Recovery revealed critical gaps:

  • Documentation stated "calibrate the PK parameters" but not which parameters or how to judge successful calibration

  • No record of which experimental datasets were trustworthy versus problematic

  • Custom preprocessing steps were mentioned but reasoning existed only in her memory

The organization now requires that every significant system has a "primary" and "shadow" maintainer. The shadow must successfully complete quarterly updates solo. Exit interviews include four-hour technical handoff sessions with recordings preserved. This structure ensures knowledge persists beyond individual tenures.

Knowledge Transfer Checklist

For departing team members:

  • Update all documentation to reflect current state

  • Record 30-minute walkthrough of system architecture

  • Document all known issues and workarounds

  • Capture tribal knowledge ("this data source is unreliable on Tuesdays due to ETL job timing")

  • Conduct handoff session with replacement (recorded)

  • Provide two hours of "office hours" one month after departure (Slack/email) for follow-up questions

Practical Recommendations

Create a "new maintainer onboarding checklist" that forces knowledge transfer discipline. The checklist should enable a qualified but unfamiliar person to achieve operational competence within one week and deep understanding within one month.

 

Practice 9: Establish Clear Governance and Ownership

Someone must be responsible for each ML system. Ambiguity leads to neglect, which leads to system failure.

Why It Matters in Pharma

Research projects end, but models often linger in production. Organizational reorganizations shuffle responsibilities. Without clear ownership, nobody maintains systems, monitors performance, or makes retirement decisions. According to industry surveys, the average pharmaceutical organization has dozens of ML models with unclear ownership and uncertain business value.

Governance Framework

Ownership Roles:

  • Owner: Responsible for system health, updates, and retirement decisions (typically senior scientist or engineering lead)

  • Maintainers: Handle day-to-day operations and bug fixes (1-2 engineers or scientists)

  • Stakeholders: Business users who rely on predictions (medicinal chemistry teams, project leaders)

  • Sponsor: Provides funding and strategic direction (director or VP level)

Regular Reviews:

  • Monthly: Maintainer checks monitoring dashboards, addresses alerts, documents issues

  • Quarterly: Owner reviews performance metrics, decides whether retraining is needed, assesses business value

  • Annually: Sponsor reviews business impact, decides to continue/sunset/major update, allocates budget

Decision Authorities:

  • Maintainer can: fix bugs, update dependencies, tune hyperparameters within validated ranges

  • Owner can: retrain model, change architecture, allocate maintenance budget, expand scope

  • Sponsor can: sunset system, approve major overhaul, redirect resources

Lifecycle States:

```
Development → Testing → Production → Maintenance → Deprecated → Archived
```

Each state has defined entry/exit criteria and responsibilities. Clear transitions prevent systems from lingering in undefined states.
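One way to make those transitions enforceable rather than aspirational is a small state machine in the model registry tooling. This is a sketch under assumptions: the maintenance-to-production return path is not shown in the diagram above but is assumed here so that a system can re-enter active service after a maintenance cycle:

```python
# Allowed lifecycle transitions, following the state diagram above.
# The maintenance -> production edge is an assumption (re-entry to
# active service after maintenance); everything else is linear.
ALLOWED_TRANSITIONS = {
    'development': {'testing'},
    'testing': {'production'},
    'production': {'maintenance'},
    'maintenance': {'production', 'deprecated'},
    'deprecated': {'archived'},
    'archived': set(),
}

def transition(current_state, new_state):
    """Validate a lifecycle transition, rejecting skipped or undefined states."""
    if new_state not in ALLOWED_TRANSITIONS.get(current_state, set()):
        raise ValueError(f"Illegal transition: {current_state} -> {new_state}")
    return new_state
```

Encoding the rule that, say, a production model cannot jump straight to "archived" forces the deprecation and notification steps to actually happen.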

Pharma-Specific Example

A mid-size biotech company had 23 ML models scattered across teams. An audit revealed concerning findings:

  • 7 models with unclear owners (original developers had left the company)

  • 4 models still running but with no users accessing predictions

  • 2 models with critical bugs that nobody was monitoring

  • $40,000 per year in cloud computing costs for abandoned systems

The organization implemented a mandatory model registry with required fields:

```yaml
model_id: "adme-logd-v3"
owner: "jane.smith@company.com"
maintainers:
  - "john.doe@company.com"
  - "sarah.jones@company.com"
sponsor: "vp-research@company.com"
stakeholders:
  - "medchem-team@company.com"
  - "adme-group@company.com"
status: "production"
last_review: "2023-11-01"
next_review: "2024-02-01"
business_value: "Screens 500 compounds/month, saves approximately $200K annually in failed synthesis attempts"
retirement_criteria: "If usage drops below 50 predictions/month for 2 consecutive quarters, trigger retirement review"
```

They decommissioned 9 unused models within six months, saving $25,000 annually in cloud costs while improving focus on valuable systems.
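A registry like this is only useful if somebody checks it. A minimal audit sketch, assuming registry entries have been loaded into plain dicts mirroring the YAML fields above (shown dependency-free rather than via a YAML parser):

```python
from datetime import date

# Required fields, mirroring the registry schema sketched above
REQUIRED_FIELDS = {'model_id', 'owner', 'maintainers', 'sponsor',
                   'status', 'last_review', 'next_review'}

def audit_registry(entries, today=None):
    """Flag registry entries with missing fields or overdue reviews.

    `entries` is a list of dicts shaped like the YAML above, with dates
    as ISO strings. Returns a list of (model_id, issue) tuples.
    """
    today = today or date.today()
    issues = []
    for entry in entries:
        model_id = entry.get('model_id', '<missing id>')
        for field in REQUIRED_FIELDS - entry.keys():
            issues.append((model_id, f"missing required field: {field}"))
        next_review = entry.get('next_review')
        if next_review and date.fromisoformat(next_review) < today:
            issues.append((model_id, f"review overdue since {next_review}"))
    return issues
```

Run on a schedule, a check like this surfaces exactly the audit findings from the example—orphaned models and lapsed reviews—before they accumulate for years.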

Red Flags Indicating Governance Failure

  • Nobody can provide an accurate count of production models

  • Critical model maintained by someone who left six months ago

  • Users do not know who to contact when predictions seem incorrect

  • Models running with no documented business justification

These symptoms indicate governance breakdowns that will eventually cause operational failures.

 

Practice 10: Design Exit Strategies and Archival Procedures

All systems eventually become obsolete. Plan for graceful retirement rather than abandonment.

Why It Matters in Pharma

Regulatory requirements may demand access to old models years after they stop active use. Failed drug development projects still generate intellectual property requiring preservation. Computational resources are finite—running everything forever is unsustainable. Abandoned systems create security vulnerabilities and technical debt.

FDA guidance on software as a medical device establishes retention requirements. For systems involved in FDA-regulated work, organizations must maintain model artifacts for a minimum of seven years. For models used in approved drugs, retention extends to the product lifetime plus ten years or longer.

Retirement Triggers

Models should be considered for retirement when:

  • Usage drops below threshold (e.g., <10 predictions/month for 6 months)

  • Replacement model deployed and validated

  • Business area shut down (project terminated, therapeutic area exited)

  • Performance degraded beyond acceptable limits and retraining is infeasible

  • Maintenance cost exceeds business value
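These triggers are mechanical enough to evaluate automatically. A sketch under assumptions—the thresholds mirror the examples above, and all arguments are illustrative rather than prescribed:

```python
def retirement_triggers(monthly_usage, months_low=6, usage_threshold=10,
                        replacement_validated=False, project_active=True,
                        annual_maintenance_cost=0, annual_value=0):
    """Return the retirement triggers (if any) that apply to a model.

    `monthly_usage` is a list of monthly prediction counts, most recent
    last. Threshold defaults follow the illustrative values above.
    """
    triggers = []
    recent = monthly_usage[-months_low:]
    if len(recent) == months_low and all(n < usage_threshold for n in recent):
        triggers.append('sustained low usage')
    if replacement_validated:
        triggers.append('validated replacement deployed')
    if not project_active:
        triggers.append('business area shut down')
    if annual_maintenance_cost > annual_value:
        triggers.append('maintenance cost exceeds business value')
    return triggers
```

A nonempty result should open a retirement review ticket, not decommission anything automatically—the decision authority stays with the owner and sponsor as defined in Practice 9.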

Retirement Process

Phase 1: Sunset Announcement (3-6 months notice)

Email stakeholders with retirement date and alternatives. Add deprecation warnings to API/UI notifying users. Stop accepting new integrations. Document migration path to replacement system. Provide users time to adapt workflows.

Phase 2: Archival (before shutdown)

Create comprehensive archive package containing:

  • All code (Git repository snapshot with complete history)

  • All data (or representative sampling if full dataset is impractically large, plus data card)

  • Trained model artifacts with version metadata

  • Environment specification (Docker image with SHA hash)

  • Full documentation including lessons learned and known limitations

  • Test suite results demonstrating last-known performance

  • Business context (why built, how used, why retired, what value delivered)

Store in durable locations with appropriate retention policies:

  • Institutional repository with DOI (e.g., Zenodo) for public/publishable work

  • Company archive system with long-term retention guarantees

  • Cloud object storage with lifecycle policies and redundancy
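Whatever the storage target, the archive package should carry its own integrity record. A minimal sketch using standard-library hashing—file layout and manifest name are assumptions, not a prescribed format:

```python
import hashlib
import json
from pathlib import Path

def build_archive_manifest(archive_dir, manifest_name='MANIFEST.json'):
    """Write a manifest of SHA-256 checksums for every file in an archive.

    Years later, a retrieval (e.g., for an audit) can re-run the hashing
    and compare against the stored manifest to verify nothing was lost
    or corrupted in storage.
    """
    archive_dir = Path(archive_dir)
    checksums = {}
    for path in sorted(archive_dir.rglob('*')):
        if path.is_file() and path.name != manifest_name:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            checksums[str(path.relative_to(archive_dir))] = digest
    (archive_dir / manifest_name).write_text(json.dumps(checksums, indent=2))
    return checksums
```

Checksums are cheap insurance: they turn "we think this is the model we filed with" into a verifiable claim, which is precisely what a patent dispute or FDA audit will demand.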

Phase 3: Decommission

Shut down compute resources and redirect web endpoints to archived documentation. Remove from active monitoring systems. Update model registry status to "archived" with pointer to archive location.

Phase 4: Retention Policy

Document retention requirements based on regulatory obligations. Maintain minimum seven years for FDA-regulated work, longer for models used in approved drug applications. Specify what is retained and where.

Pharma-Specific Example

A pharmacokinetics model at a large pharmaceutical company was used in the IND filing for a compound that eventually failed Phase II clinical trials. The project terminated in 2018. In 2023, five years later, a patent dispute emerged around similar chemistry.

Because the organization had properly archived the model, the legal team retrieved the Docker image from the company archives. A computational expert spun up the environment and reproduced the predictions within one day. The complete audit trail from data → prediction → decision was available, enabling the court to accept the computational evidence as reliable. Had the model simply been deleted after project termination, this evidence would have been irretrievably lost, potentially costing millions in patent litigation.

Archive Checklist

Before retirement, verify:

  • Code snapshot (Git tag + export)

  • Model artifacts with version hashes (checksums for verification)

  • Training/validation data (or representative sample with documented sampling method)

  • Environment specification (Docker image with SHA, or Conda environment.yml)

  • Documentation bundle (README, model card, runbook, lessons learned)

  • Performance test results (final validation metrics)

  • Known issues and limitations (what worked, what didn't, edge cases)

  • Business context memo (why created, business value delivered, why retired)

  • Archive location documented in model registry

  • Stakeholders notified of retirement and archive location

Cost-Benefit Analysis

Archive creation typically requires 8-16 hours of effort. Storage costs approximate $50-100 annually. If needed for regulatory compliance or legal proceedings, value can reach millions. This makes proper archival one of the highest return-on-investment sustainability practices.

Testing Archive Usability

Have someone unfamiliar with the system attempt to run it from the archive alone. If they cannot reproduce basic functionality within a day, the archive is incomplete. Good archives enable reproduction by qualified personnel without requiring tribal knowledge.
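Part of that usability test can be automated: store a set of reference inputs and their predictions in the archive, and compare them against what a fresh environment reproduces. A dependency-free sketch (the prediction dictionaries and tolerance are illustrative assumptions):

```python
import math

def verify_archive_reproduction(reference_predictions, reproduced_predictions,
                                rel_tol=1e-6):
    """Compare predictions reproduced from an archive against the archived
    reference outputs.

    Both arguments map compound IDs to numeric predictions. Returns the
    list of compound IDs that are missing or fail to match within
    `rel_tol` (relative tolerance).
    """
    mismatches = []
    for compound_id, expected in reference_predictions.items():
        actual = reproduced_predictions.get(compound_id)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatches.append(compound_id)
    return mismatches
```

An empty mismatch list does not prove the archive is complete, but a nonempty one proves it is broken—run this check before decommissioning, while the original system can still be consulted.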

Conclusion: Sustainability as a Competitive Advantage

These ten practices represent more than defensive measures against organizational failure. They create strategic value that compounds over time.

  • Faster Iteration: Well-maintained systems are easier to update and improve. When you can confidently modify a system without breaking production, experimentation becomes less risky and innovation accelerates.

  • Knowledge Compounding: Each generation of researchers and engineers builds on previous work instead of restarting from scratch. Institutional knowledge accumulates rather than evaporating with each personnel change.

  • Regulatory Confidence: Regulators trust organizations with robust ML governance. Demonstrating comprehensive version control, monitoring, and documentation during audits builds credibility that extends beyond specific models to organizational competence.

  • Talent Retention: Scientists and engineers want to work on maintainable systems rather than fighting technical debt. Organizations known for sustainable ML infrastructure attract and retain higher quality talent.

  • ROI Realization: Models only provide value if they remain useful long-term. A model that costs $200,000 to develop but is used for 5 years delivers far better return on investment than a model that gets abandoned after 18 months.

The Sustainability Mindset

Sustainable ML development requires thinking in years rather than sprints. Value boring reliability over exciting novelty—production systems should be boring. Treat future maintainers (including future-you) with respect by leaving clear documentation and robust infrastructure. Recognize that documentation is not overhead but a core deliverable of any production ML system.

Action Items for Tomorrow

Pick your most critical ML system and work through this checklist:

  1. Reproducibility: Can I reproduce last month's predictions exactly using only version control artifacts?

  2. Bus Factor: If I got hit by a bus tomorrow, could someone else maintain this system?

  3. Monitoring: Do I know when this model is being misused or performing poorly?

  4. Currency: Have I tested this model on data collected in the last 30 days?

  5. Governance: Is there a clear owner and retirement plan?

Identify the most glaring gap and fix it first. Then schedule quarterly sustainability reviews to maintain discipline over time.

Final Reflection

In pharmaceutical ML, unlike consumer technology, your systems may need to justify decisions made today for decades to come. A model supporting an IND filing in 2024 may face scrutiny in patent litigation in 2034 or a regulatory audit in 2029. Investing in sustainability is not optional—it is a professional responsibility.

The models we build today will either become robust infrastructure that compounds in value, or they will become tomorrow's technical debt. Choose wisely.

 

References

  1. U.S. Food and Drug Administration. (2021). Artificial Intelligence and Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. https://www.fda.gov/media/145022/download

  2. U.S. Food and Drug Administration. (2021). Good Machine Learning Practice for Medical Device Development: Guiding Principles. https://www.fda.gov/media/153486/download

  3. U.S. Food and Drug Administration. (2022). Using Artificial Intelligence and Machine Learning in the Development of Drug and Biological Products. https://www.fda.gov/media/167973/download

  4. Ardigen. (2021). DVC – Data Version Control System. https://ardigen.com/dvc-data-version-control-system/

  5. DVC. (2025). Get Started with DVC. https://doc.dvc.org/start

  6. Evidently AI. Data drift in ML, and how to detect and handle it. https://www.evidentlyai.com/ml-in-production/data-drift

  7. IBM. (2025). What Is Model Drift? https://www.ibm.com/think/topics/model-drift

  8. Databricks. (2019). Productionizing Machine Learning - From Deployment to Drift Detection. https://www.databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html

  9. Arize. (2023). Model Drift & Machine Learning: Concept Drift, Feature Drift, Etc. https://arize.com/model-drift/

  10. Evidently AI. What is concept drift in ML, and how to detect and address it. https://www.evidentlyai.com/ml-in-production/concept-drift
