Tautomer and Protomer Analysis for Drug Discovery

This notebook covers the chemistry of tautomerism and protonation states, which are critical for:

  • Molecular docking - Wrong tautomer = wrong binding pose

  • pKa prediction - Understanding ionization at physiological pH

  • SAR analysis - Tautomeric preference affects activity

  • Crystal structure interpretation - Hydrogen positions often ambiguous

Chemistry Background

Tautomerism

Tautomers are constitutional isomers that interconvert rapidly via proton transfer:

Keto ⇌ Enol         (carbonyl compounds)
Lactam ⇌ Lactim     (cyclic amides)
Amino ⇌ Imino       (aminoheterocycles)
Thione ⇌ Thiol      (thiocarbonyls)

The tautomeric ratio depends on:

  • Thermodynamic stability (ΔG)

  • Solvent effects

  • Intramolecular hydrogen bonding

Protonation States (Protomers)

At physiological pH (7.4), functional groups are protonated or deprotonated:

Group

Typical pKa

State at pH 7.4

Carboxylic acid

4-5

Deprotonated (COO⁻)

Aliphatic amine

9-11

Protonated (NH₃⁺)

Aromatic amine

4-5

Neutral (NH₂)

Phenol

9-10

Neutral (OH)

Imidazole

6-7

Mixed

[ ]:
import os
import tempfile
from pathlib import Path

import Auto3D
from Auto3D import Auto3DOptions, main
from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.MolStandardize import rdMolStandardize
import pandas as pd

print(f"Auto3D version: {Auto3D.__version__}")

1. Common Tautomeric Systems in Drugs

Many drug molecules contain tautomerizable groups. Understanding which tautomer is bioactive is crucial.

[ ]:
# Classic tautomeric examples from medicinal chemistry
tautomeric_examples = {
    # Keto-enol tautomerism
    "warfarin_keto": "CC(=O)CC(C1=CC=CC=C1)C2=C(O)C3=CC=CC=C3OC2=O",
    "warfarin_enol": "CC(O)=CC(C1=CC=CC=C1)C2=C(O)C3=CC=CC=C3OC2=O",

    # Lactam-lactim in nucleobases
    "guanine_keto": "NC1=NC2=C(N=CN2)C(=O)N1",  # Dominant form
    "guanine_enol": "NC1=NC2=C(N=CN2)C(O)=N1",  # Rare form

    # Thione-thiol in antithyroid drugs
    "methimazole_thione": "CN1C=CN=C1S",  # Actually C=S

    # Amino-imino in sulfonamides
    "sulfamethoxazole": "CC1=CC(NS(=O)(=O)C2=CC=C(N)C=C2)=NO1",

    # Imidazole tautomerism (histidine)
    "histamine_N1H": "NCCC1=CNC=N1",  # N1-H tautomer
    "histamine_N3H": "NCCC1=CN=CN1",  # N3-H tautomer
}

print("Key tautomeric systems in drugs:")
for name, smi in tautomeric_examples.items():
    mol = Chem.MolFromSmiles(smi)
    if mol:
        print(f"  {name}: {Chem.MolToSmiles(mol)}")

2. Enumerating Tautomers with Auto3D

Auto3D can enumerate tautomers before 3D generation using the enumerate_tautomer=True option.

[ ]:
# Let's analyze warfarin - an important anticoagulant
# The keto-enol equilibrium affects its binding to vitamin K epoxide reductase

warfarin = "CC(=O)CC(C1=CC=CC=C1)C2=C(O)C3=CC=CC=C3OC2=O"

# Use RDKit's tautomer enumerator to see all forms
enumerator = rdMolStandardize.TautomerEnumerator()
mol = Chem.MolFromSmiles(warfarin)

tautomers = list(enumerator.Enumerate(mol))
print(f"Warfarin tautomers found: {len(tautomers)}")
print()

for i, taut in enumerate(tautomers):
    smi = Chem.MolToSmiles(taut)
    print(f"Tautomer {i+1}: {smi}")
[ ]:
# Generate 3D structures for all tautomers to compare energies

# Write tautomers to file
with tempfile.NamedTemporaryFile(mode='w', suffix='.smi', delete=False) as f:
    for i, taut in enumerate(tautomers[:5]):  # Limit to top 5
        smi = Chem.MolToSmiles(taut)
        f.write(f"{smi} warfarin_taut_{i+1}\n")
    taut_file = f.name

print(f"Wrote {min(5, len(tautomers))} tautomers to {taut_file}")
[ ]:
# Run Auto3D on tautomers
if __name__ == "__main__":
    config = Auto3DOptions(
        path=taut_file,
        k=1,                        # Lowest energy conformer per tautomer
        optimizing_engine="AIMNET",
        use_gpu=True,
    )

    taut_output = main(config)
    print(f"Output: {taut_output}")

3. Calculating Tautomer Stability

The most stable tautomer has the lowest energy. We can rank tautomers by their electronic energy from Auto3D.

[ ]:
def analyze_tautomer_energies(sdf_path):
    """
    Compare energies of different tautomers.

    The energy difference determines the equilibrium ratio:
    K = exp(-ΔG/RT)

    At 298K, 1.36 kcal/mol difference = 10:1 ratio
    """
    mols = list(Chem.SDMolSupplier(sdf_path, removeHs=False))

    data = []
    for mol in mols:
        if mol is None:
            continue
        name = mol.GetProp("_Name")

        # Get energy in Hartree
        if mol.HasProp("E_hartree"):
            e_hartree = float(mol.GetProp("E_hartree"))
        elif mol.HasProp("E_tot"):
            e_hartree = float(mol.GetProp("E_tot")) / 27.2114  # eV to Hartree
        else:
            continue

        data.append({"Name": name, "E_hartree": e_hartree})

    df = pd.DataFrame(data)

    # Calculate relative energies in kcal/mol
    e_min = df["E_hartree"].min()
    df["E_rel_kcal"] = (df["E_hartree"] - e_min) * 627.509  # Hartree to kcal/mol

    # Estimate Boltzmann populations at 298K
    import numpy as np
    RT = 0.001987 * 298  # kcal/mol
    df["Boltzmann_weight"] = np.exp(-df["E_rel_kcal"] / RT)
    df["Population_%"] = 100 * df["Boltzmann_weight"] / df["Boltzmann_weight"].sum()

    return df.sort_values("E_rel_kcal")


if 'taut_output' in dir():
    df_taut = analyze_tautomer_energies(taut_output)
    print("Tautomer Stability Analysis:")
    print(df_taut[["Name", "E_rel_kcal", "Population_%"]].to_string(index=False))

4. Protonation State Analysis

For docking and simulations, we need the correct protonation state at the target pH.

[ ]:
# Common ionizable groups and their typical pKa values
ionizable_examples = {
    # Acids - deprotonated at pH 7.4
    "benzoic_acid": "C1=CC=C(C=C1)C(=O)O",       # pKa ~4.2
    "phenol": "C1=CC=C(C=C1)O",                   # pKa ~10
    "sulfonamide": "CC1=CC=C(C=C1)S(=O)(=O)N",   # pKa ~10

    # Bases - protonated at pH 7.4
    "piperidine": "C1CCNCC1",                     # pKa ~11
    "aniline": "NC1=CC=CC=C1",                    # pKa ~4.6
    "imidazole": "C1=CN=CN1",                     # pKa ~7.0
    "pyridine": "C1=CC=NC=C1",                    # pKa ~5.2

    # Zwitterions
    "glycine": "NCC(=O)O",                        # pKa1 ~2.3, pKa2 ~9.6
}

def predict_protonation_state(smiles, pH=7.4):
    """
    Simple rule-based prediction of protonation state.

    For more accurate predictions, use:
    - Dimorphite-DL
    - OpenEye's pKa tools
    - Epik (Schrodinger)
    """
    mol = Chem.MolFromSmiles(smiles)

    # Count ionizable groups
    carboxylic = len(mol.GetSubstructMatches(Chem.MolFromSmarts("C(=O)[OH]")))
    aliphatic_N = len(mol.GetSubstructMatches(Chem.MolFromSmarts("[NX3;H2,H1;!$(NC=O)]")))
    aromatic_N = len(mol.GetSubstructMatches(Chem.MolFromSmarts("[nH]")))

    state = {
        "carboxylic_acids": carboxylic,
        "aliphatic_amines": aliphatic_N,
        "aromatic_N": aromatic_N,
        "predicted_charge": aliphatic_N - carboxylic  # Simplified
    }

    return state


print("Ionizable group analysis at pH 7.4:")
print("-" * 60)
for name, smi in ionizable_examples.items():
    state = predict_protonation_state(smi)
    print(f"{name:20s}: charge = {state['predicted_charge']:+d}")

5. pH-Dependent Conformer Generation

Different protonation states can have different conformational preferences due to:

  • Electrostatic effects - charged groups repel/attract

  • Hydrogen bonding - protonated groups form different H-bonds

  • Tautomeric shifts - pH affects tautomer equilibrium

[ ]:
# Compare neutral vs zwitterionic glycine
glycine_forms = {
    "glycine_neutral": "NCC(=O)O",           # Low pH
    "glycine_zwitterion": "[NH3+]CC(=O)[O-]", # Physiological pH
    "glycine_anion": "NCC(=O)[O-]",          # High pH
}

# Write to file
with tempfile.NamedTemporaryFile(mode='w', suffix='.smi', delete=False) as f:
    for name, smi in glycine_forms.items():
        f.write(f"{smi} {name}\n")
    glycine_file = f.name

print("Glycine protonation states:")
for name, smi in glycine_forms.items():
    mol = Chem.MolFromSmiles(smi)
    charge = Chem.GetFormalCharge(mol)
    print(f"  {name}: charge = {charge:+d}")
[ ]:
# Generate 3D structures for each protonation state
if __name__ == "__main__":
    config = Auto3DOptions(
        path=glycine_file,
        k=1,
        optimizing_engine="AIMNET",
        use_gpu=True,
    )

    glycine_output = main(config)
    print(f"Output: {glycine_output}")

6. Histidine: A Complex Case

Histidine has three ionizable groups and two tautomeric forms, making it one of the most complex amino acids:

  • Carboxyl: pKa ~1.8

  • Amino: pKa ~9.2

  • Imidazole: pKa ~6.0

At pH 7.4, ~10% of histidine’s imidazole is protonated.

[ ]:
# All relevant histidine forms for drug design
histidine_forms = {
    # Neutral imidazole tautomers
    "His_Nδ1H": "[NH3+][C@@H](Cc1c[nH]cn1)C(=O)[O-]",  # δ-tautomer
    "His_Nε2H": "[NH3+][C@@H](Cc1cnc[nH]1)C(=O)[O-]",  # ε-tautomer

    # Protonated imidazole (positively charged)
    "His_charged": "[NH3+][C@@H](Cc1c[nH+]c[nH]1)C(=O)[O-]",
}

print("Histidine forms relevant at physiological pH:")
for name, smi in histidine_forms.items():
    mol = Chem.MolFromSmiles(smi)
    if mol:
        charge = Chem.GetFormalCharge(mol)
        print(f"  {name}: net charge = {charge:+d}")

7. Integration with Auto3D Tautomer Enumeration

Auto3D has built-in tautomer enumeration that can be enabled during conformer generation.

CLI Usage

# Enable tautomer enumeration from command line
auto3d run drugs.smi --k=3 --enumerate-tautomer --gpu

# With ANI2xt (recommended for tautomers)
auto3d run drugs.smi --k=3 --enumerate-tautomer --engine=ANI2xt --gpu

# Using configuration file for full control
auto3d run drugs.smi -c tautomer_config.yaml

Example tautomer_config.yaml:

k: 3
enumerate_tautomer: true
tauto_engine: rdkit
pKaNorm: true
optimizing_engine: ANI2xt
use_gpu: true
[ ]:
# Example: Run Auto3D with tautomer enumeration
# This generates conformers for ALL tautomers, then ranks by energy

drug_with_tautomers = {
    "sulfasalazine": "OC1=CC=C(N=NC2=CC=C(NC3=CC=CC=N3)C=C2)C(=C1)C(=O)O",
    "allopurinol": "O=C1NC=NC2=C1NNC=2",  # Xanthine oxidase inhibitor
    "omeprazole": "COC1=CC2=NC(CS(=O)C3=NC4=CC=CC=C4N3)=NC2=CC1OC",
}

with tempfile.NamedTemporaryFile(mode='w', suffix='.smi', delete=False) as f:
    for name, smi in drug_with_tautomers.items():
        f.write(f"{smi} {name}\n")
    drug_file = f.name

print(f"Drugs with complex tautomerism: {len(drug_with_tautomers)}")
[ ]:
# Run with tautomer enumeration enabled
if __name__ == "__main__":
    config = Auto3DOptions(
        path=drug_file,
        k=3,                          # Top 3 conformers (may include different tautomers)
        enumerate_tautomer=True,      # Enable tautomer enumeration
        tauto_engine="rdkit",         # Use RDKit tautomer engine
        pKaNorm=True,                 # Normalize to dominant protonation state
        optimizing_engine="AIMNET",
        use_gpu=True,
    )

    drug_output = main(config)
    print(f"Output: {drug_output}")

8. Practical Guidelines

For Docking

  1. Generate multiple tautomers if the molecule has tautomerizable groups

  2. Use the dominant protomer at target tissue pH (usually 7.4)

  3. Consider both forms for ambiguous cases (e.g., imidazole pKa ~7)

For QSAR/ML

  1. Use canonical tautomer for consistency

  2. Consider tautomeric preference as a descriptor

  3. Be aware that training data may have inconsistent tautomers

For Thermodynamics

  1. Calculate ΔG for each tautomer/protomer

  2. Use Boltzmann averaging for ensemble properties

  3. Account for pH in free energy calculations

[ ]:
# Cleanup
import os
for f in [taut_file, glycine_file, drug_file]:
    if f in dir() and os.path.exists(f):
        os.unlink(f)

Summary

This tutorial covered:

  1. Tautomerism fundamentals - keto-enol, lactam-lactim, amino-imino

  2. Tautomer enumeration - using RDKit and Auto3D

  3. Energy-based ranking - identifying the dominant tautomer

  4. Protonation states - pH-dependent ionization

  5. Complex cases - histidine, drugs with multiple ionizable groups

Key takeaway: Always consider tautomers and protonation states when preparing molecules for computational chemistry.