🔬 Scientific Methodology

Transparent, reproducible, and rigorously validated approach to protein disorder prediction

📊 Validation Results

Overall Performance

Accuracy: 84.52% (k=5 optimal threshold)
Precision (Structured): 88.01% (low false positives)
Recall (Disordered): 83.13% (identifies disorder well)
F1-Score (DisProt): 80.12% (disordered proteins)
F1-Score (PDB): 74.36% (structured proteins)
Dataset Size: 40,000 training sequences
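The figures above are per-class metrics, so it can help to see how they are derived from predictions. The following is a minimal sketch using the standard definitions of accuracy, precision, recall, and F1; the function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the published pipeline.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy plus precision/recall/F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (hypothetical), used only to show the call.
y_true = ["structured", "structured", "disordered", "disordered", "disordered"]
y_pred = ["structured", "disordered", "disordered", "disordered", "structured"]
print(binary_metrics(y_true, y_pred, positive="structured"))
print(binary_metrics(y_true, y_pred, positive="disordered"))
```

Switching the `positive` argument is what separates a "Precision (Structured)" figure from a "Recall (Disordered)" figure on the same set of predictions.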

Cross-Validation Results

Homology-Aware CV: >75% accuracy (MMseqs2 clustering prevents data leakage between folds)
MobiDB Independent Test: >70% accuracy (generalization to an external dataset)
Label-Shuffle Control: ~50% accuracy (chance-level performance on shuffled labels indicates the model learns real patterns, not artifacts)
Bootstrap Confidence Intervals: Narrow intervals indicate that the performance estimates are stable under resampling
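As an illustration of the bootstrap step, here is a minimal sketch of a percentile bootstrap confidence interval for accuracy. It assumes per-sequence outcomes are available as a list of 1 (correct) and 0 (incorrect); the function name, the number of resamples, and the toy outcomes are assumptions, and the procedure actually used for the reported results may differ.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from 0/1 per-sequence outcomes."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        # Resample the outcomes with replacement and recompute accuracy.
        sample = rng.choices(outcomes, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy outcomes (hypothetical): 85 correct and 15 incorrect predictions.
outcomes = [1] * 85 + [0] * 15
low, high = bootstrap_accuracy_ci(outcomes)
print(f"95% CI for accuracy: [{low:.3f}, {high:.3f}]")
```

The interval width shrinks as the number of evaluated sequences grows, which is why a narrow interval mainly reflects a stable estimate rather than a particular accuracy level.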

Processing Speed

Single sequence: <50ms
Batch (50 sequences): <500ms
1000 sequences: <10 seconds
Throughput: 100+ sequences/second for batch processing

🔬 Scientific Approach

Our classifier leverages fundamental biophysical principles to distinguish structured from intrinsically disordered proteins. Unlike machine learning approaches that can act as "black boxes," our method is interpretable and based on well-established protein chemistry.

1. Global Sequence Features

We extract seven biophysical and compositional features from protein sequences:

Amino acid composition analysis - Distribution of different amino acid types
Charge distribution patterns - Absolute net charge proportion
Hydrogen bonding potential - Capacity for forming stabilizing H-bonds
Hydrophobicity profiles - Normalized average hydrophobicity
Shannon entropy measurements - Sequence complexity and diversity
Proline frequency - Known disorder-promoting residue
Bulky hydrophobic frequency - W, C, F, Y, I, V, L content
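To make the feature list more concrete, the sketch below computes four of the seven features directly from a sequence string. The exact definitions and normalization constants used by the classifier are not published, so the function name, the output keys, and the charge convention (K and R as positive, D and E as negative, histidine ignored) are assumptions. Features that depend on an unspecified scale, such as hydrophobicity and hydrogen bonding potential, are left out.

```python
import math
from collections import Counter

# Residue set taken from the "bulky hydrophobic" item in the list above.
BULKY_HYDROPHOBIC = set("WCFYIVL")


def sequence_features(seq):
    """Compute a few global composition features for one protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)
    # Shannon entropy (bits) of the residue distribution.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    # Assumed charge convention: K/R positive, D/E negative.
    positive = counts["K"] + counts["R"]
    negative = counts["D"] + counts["E"]
    return {
        "shannon_entropy": entropy,
        "proline_freq": counts["P"] / n,
        "bulky_hydrophobic_freq": sum(counts[a] for a in BULKY_HYDROPHOBIC) / n,
        "abs_net_charge": abs(positive - negative) / n,
    }


print(sequence_features("PPKKDWLA"))
```

Every value is normalized by sequence length, so proteins of different sizes can be compared against the same thresholds.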

2. Threshold-Based Classification

Our approach uses empirically derived thresholds from training data:

No machine learning training required - Thresholds are calculated as statistical midpoints of the training data
Reproducible across datasets - Same features, same thresholds
Interpretable results - Each feature's contribution is transparent
Fast computation - No complex model inference needed
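The threshold formulas themselves are not published, so the following is only one plausible reading of "statistical midpoints": take the midpoint between the mean feature value of the structured class and that of the disordered class. The function name `midpoint_threshold`, the direction convention, and the toy values are all assumptions.

```python
from statistics import mean


def midpoint_threshold(structured_values, disordered_values):
    """Midpoint between the two class means, plus the side that is structured-like."""
    s_mean = mean(structured_values)
    d_mean = mean(disordered_values)
    threshold = (s_mean + d_mean) / 2
    # ">=" means values at or above the threshold look structured-like.
    direction = ">=" if s_mean >= d_mean else "<"
    return threshold, direction


# Toy feature values (hypothetical), e.g. bulky hydrophobic frequency per sequence.
structured = [0.30, 0.34, 0.32]
disordered = [0.20, 0.18, 0.22]
print(midpoint_threshold(structured, disordered))
```

Because the threshold is a closed-form statistic of the training data, recomputing it on the same data always gives the same value.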

Classification Rule: A protein is classified as structured if it meets at least k=5 of the 7 feature conditions (empirically optimized for best F1-score).
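The classification rule above is a k-of-n vote, which can be sketched in a few lines. The feature names, the placeholder thresholds of 0.5, and the `classify` function are hypothetical; only the rule itself (at least k=5 of 7 conditions met means "structured") comes from this page.

```python
def classify(features, rules, k=5):
    """Return (label, conditions_met, per-feature votes) for a k-of-n rule."""
    votes = {}
    for name, (threshold, direction) in rules.items():
        value = features[name]
        # A vote is True when the feature value falls on the structured-like side.
        votes[name] = value >= threshold if direction == ">=" else value < threshold
    met = sum(votes.values())
    label = "structured" if met >= k else "disordered"
    return label, met, votes


# Placeholder rules: seven normalized features, each structured-like at >= 0.5.
RULES = {f"feature_{i}": (0.5, ">=") for i in range(1, 8)}

example = {
    "feature_1": 0.9, "feature_2": 0.7, "feature_3": 0.6, "feature_4": 0.55,
    "feature_5": 0.5, "feature_6": 0.2, "feature_7": 0.1,
}
label, met, votes = classify(example, RULES)
print(label, met)  # structured 5
```

Returning the per-feature votes alongside the label keeps each decision traceable to the conditions that were or were not met.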

3. Validation Framework

We employ rigorous validation to ensure our model learns true biological principles:

Homology-aware cross-validation - MMseqs2 clustering at 30% identity prevents sequence similarity leakage
Independent test sets - MobiDB validation ensures generalization beyond training data
Statistical significance testing - Bootstrap confidence intervals and McNemar's test
Label-shuffle controls - Confirms model learns real patterns (performance drops to ~50% with random labels)
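Two of the checks above can be sketched without any external library. The example assumes that each sequence already has a cluster ID, for instance parsed from MMseqs2 clustering output, and assigns whole clusters to folds so that similar sequences never end up on both sides of a split. It also shows the label permutation used for a shuffle control. The function names and the toy cluster IDs are assumptions, not the project's actual validation code.

```python
import random
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds=5):
    """Assign whole clusters to folds so that no cluster spans two folds."""
    clusters = defaultdict(list)
    for index, cluster_id in enumerate(cluster_ids):
        clusters[cluster_id].append(index)
    # Place the largest clusters first, always into the currently smallest fold.
    order = sorted(clusters, key=lambda c: (-len(clusters[c]), str(c)))
    folds = [[] for _ in range(n_folds)]
    for cluster_id in order:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(clusters[cluster_id])
    return folds


def shuffled_labels(labels, seed=0):
    """Return a permuted copy of the labels for a label-shuffle control."""
    rng = random.Random(seed)
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


# Toy cluster IDs (hypothetical): one entry per sequence.
cluster_ids = ["a", "a", "b", "c", "c", "c", "d"]
print(cluster_folds(cluster_ids, n_folds=2))
print(shuffled_labels(["structured", "disordered", "disordered", "structured"]))
```

Rerunning the full threshold derivation and evaluation on the shuffled labels is what produces the chance-level baseline that the real labels are compared against.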

🔒 Proprietary Details Not Revealed:

Exact feature weightings and normalization constants
Threshold calculation formulas and optimization algorithms
Proprietary performance optimizations and caching strategies

📚 Training & Validation Data

Structured Proteins (PDB)

15,000 sequences

High-quality protein structures from the Protein Data Bank (PDB). These represent well-folded, stable protein domains with experimentally determined 3D structures.

Disordered Proteins (DisProt)

25,000 sequences

Intrinsically disordered proteins from the DisProt database, experimentally validated to lack stable 3D structure under physiological conditions.

Independent Validation (MobiDB)

1,000+ sequences

External validation set from MobiDB, which aggregates disorder predictions and experimental annotations from multiple sources.

🧠 Concept Model Framework

Our classifier implements the Concept Model framework, a paradigm for building interpretable classification systems:

Four-Layer Architecture

M1 (Property Vectors): The 7 biophysical features extracted from each sequence
M2 (Constraints): Empirically derived thresholds that define "structured-like" vs "disordered-like" for each feature
M3 (Transformation Rules): Count how many conditions are met; classify based on threshold k=5
M4 (Goal State): True labels (structured vs. disordered) used for validation

This framework ensures transparency and interpretability - every classification decision can be traced back to specific biophysical properties.

⚠️ Known Limitations

Important Considerations:

Binary classification only: Predicts disordered vs. structured at the whole-protein level, not per-residue disorder
Global features: Does not detect local disorder regions within otherwise structured proteins
No structural details: Does not predict specific 3D structures or binding sites
Threshold-based: Edge cases near decision boundary may have lower confidence

Not Suitable For:

Clinical diagnostic decisions requiring regulatory approval
Single-residue disorder mapping or IDR boundary detection
Regulatory submissions requiring FDA/EMA-approved tools
Detailed structural analysis or protein engineering

🚀 Future Improvements

Per-Residue Prediction

Sliding window approach to predict disorder at individual residue level (currently in research phase).
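As a rough illustration of what a sliding-window profile looks like, the sketch below computes, for every residue, the fraction of a chosen residue set inside a centered window. This is not the planned per-residue method, which is still in research; the function name, the window size of 21, and the choice of proline as the example residue are assumptions.

```python
def windowed_fraction(seq, residues, window=21):
    """Per-residue fraction of `residues` in a centered window (shorter at the ends)."""
    seq = seq.upper()
    half = window // 2
    profile = []
    for i in range(len(seq)):
        # The window is truncated at the sequence ends rather than padded.
        segment = seq[max(0, i - half): i + half + 1]
        profile.append(sum(ch in residues for ch in segment) / len(segment))
    return profile


# Toy sequence (hypothetical) with a proline-rich stretch in the middle.
profile = windowed_fraction("MKVLAAGPPSPPEPKLLAV", residues="P", window=5)
print([round(x, 2) for x in profile])
```

A per-residue predictor would apply the same windowing to each global feature and then classify each position from its local values.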

Confidence Scores

Enhanced confidence metrics based on distance from decision boundary and feature agreement.

Expanded Features

Additional biophysical properties and sequence motifs for improved accuracy.

📖 References & Further Reading

Concept Model Framework: Emergent Concept Modeling
DisProt Database: Hatos et al. (2020) "DisProt: intrinsic protein disorder annotation in 2020"
MobiDB Database: Piovesan et al. (2021) "MobiDB: intrinsically disordered proteins in 2021"
PDB: Protein Data Bank - www.rcsb.org
MMseqs2: Steinegger & Söding (2017) "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets"
Full Validation Guide: GitHub Documentation
