{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e50753cf",
   "metadata": {},
   "source": [
    "# This notebook performs a goodlist/badlist substructure screening for a multi-molecule .sdf input file.  Also screens for rings of a minimum size, and maximum bond order."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75c36b50",
   "metadata": {},
   "source": [
    "## Enter screening parameters in the cell below.  Chemical input and output filenames should be in .sdf format, while goodlist and badlist files can be .txt files containing SMARTS patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "21ecbc57",
   "metadata": {},
   "outputs": [],
   "source": [
    "in_file = \"C:/Users/chris/Desktop/JoVE_demo/C2H5NO2.sdf\"\n",
    "out_file = \"C:/Users/chris/Desktop/JoVE_demo/gly-out.sdf\"\n",
    "goodlist = \"C:/Users/chris/Desktop/JoVE_demo/N-meGly+Ala.txt\" # substructures desired in final library\n",
    "badlist = \"C:/Users/chris/Desktop/JoVE_demo/JoVE-MAYGEN_badlist_SMARTS.txt\"\n",
    "max_bond = 3 # only molecules with bond orders below this number will be kept\n",
    "min_ring = 5 # only molecules with rings of this size or larger will be kept"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e2e8ab7",
   "metadata": {},
   "source": [
    "### If you encounter a SyntaxError mentioning a unicode error, make sure there are no single backslashes (\\\\) in the file names.  Change any single backslashes to double backslashes (\\\\\\\\) or forward slashes (/)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5116189",
   "metadata": {},
   "source": [
    "## Module imports below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "349fffbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "from rdkit import Chem"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9d47ae5",
   "metadata": {},
   "source": [
    "## Function definitions below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "bc32f2b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def minRingSize(mol, ring_min):\n",
    "    \"\"\"Determine presence of rings below a specified size.\n",
    "    \n",
    "    Determine if there are rings present (size 3 or greater) that are smaller than a user-defined minimum.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    mol : RDKit Mol object\n",
    "    \n",
    "    ring_min : int\n",
    "        Minimum ring size desired.  E.g. If ring_min = 6, then the function will look for rings of size 3, 4, and 5.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    True if there are no rings below the specified minimum ring size; returns False otherwise.\n",
    "\n",
    "    \"\"\"\n",
    "    ring_atoms = mol.GetRingInfo().AtomRings()\n",
    "    sizes = set([len(a) for a in ring_atoms]) #lengths of all ring systems in molecule\n",
    "    mins = set(list(range(3,ring_min)))\n",
    "    return sizes.isdisjoint(mins)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0cfbd825",
   "metadata": {},
   "outputs": [],
   "source": [
    "def maxBonds(mol, bonds):\n",
    "    \"\"\"Test maximum bond order of input molecule.\n",
    "    \n",
    "    Tests if a molecule has a bond order equal to the input value.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    mol : RDKit Mol object\n",
    "    \n",
    "    bonds : float\n",
    "        Maximum bond order desired in a molecule.  Set as a float value because aromatic bonds in RDKit have a bond order of 2.5.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    True if at least one bond in the molecule has the specified bond order.  Returns False otherwise.\n",
    "\n",
    "    \"\"\"\n",
    "    bonds = float(bonds)\n",
    "    max_bonds = False\n",
    "    for b in mol.GetBonds():\n",
    "        if b.GetBondTypeAsDouble() == bonds:\n",
    "            max_bonds = True\n",
    "            break\n",
    "        else:\n",
    "            continue\n",
    "    return max_bonds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08142cd3",
   "metadata": {},
   "source": [
    "## Main program"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "64450c8a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecules read:  84\n",
      "CPU times: total: 15.6 ms\n",
      "Wall time: 11 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# load in source data file\n",
    "data = Chem.SDMolSupplier(in_file, removeHs=True)\n",
    "data_set = set([Chem.MolToSmiles(m) for m in data]) # set of all input SMILES\n",
    "bad_mols = set() # set of SMILES for molecules with unwanted substructures; output will lack these molecules\n",
    "bad_substructs = []\n",
    "for line in open(badlist, 'r'):\n",
    "    bad_substructs.append(Chem.MolFromSmarts(line))\n",
    "print(\"Molecules read: \", len(data_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "daa5db86",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecules left after ring screening:  53\n",
      "CPU times: total: 15.6 ms\n",
      "Wall time: 4 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# perform ring size screening\n",
    "for mol in list(data_set):\n",
    "    if not minRingSize(Chem.MolFromSmiles(mol), min_ring): #evaluates as True if molecule has a ring below the stated minimum size\n",
    "        bad_mols.add(mol)\n",
    "len(bad_mols)\n",
    "# update data to search after each screening step, reducing subsequent screening time\n",
    "data_set = (data_set - bad_mols)\n",
    "print(\"Molecules left after ring screening: \", len(data_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0cf07e1c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecules left after bond order screening:  53\n",
      "CPU times: total: 15.6 ms\n",
      "Wall time: 3 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# perform bond order screening\n",
    "for mol in list(data_set):\n",
    "    if maxBonds(Chem.MolFromSmiles(mol), max_bond):\n",
    "        bad_mols.add(mol)\n",
    "data_set = (data_set - bad_mols)\n",
    "print(\"Molecules left after bond order screening: \", len(data_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2e31107d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecules left after removal of unwanted substructures:  5\n",
      "CPU times: total: 62.5 ms\n",
      "Wall time: 32 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "for mol in list(data_set):\n",
    "    bad_chk = False\n",
    "    for bad in bad_substructs:\n",
    "        if not bad_chk:\n",
    "            if Chem.MolFromSmiles(mol).HasSubstructMatch(bad):\n",
    "                bad_chk = True\n",
    "                break\n",
    "            else:\n",
    "                continue\n",
    "        else:\n",
    "            continue\n",
    "    if bad_chk:\n",
    "        bad_mols.add(mol)\n",
    "data_set = (data_set-bad_mols)\n",
    "print(\"Molecules left after removal of unwanted substructures: \", len(data_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dbea550",
   "metadata": {},
   "source": [
    "## Use the following cell if there is a goodlist (file with SMARTS strings for substructures that must be included).  Otherwise, change the cell type from \"Code\" to \"Markdown\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c1e7cb46",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecules left after retention of desired substructures:  1\n",
      "CPU times: total: 0 ns\n",
      "Wall time: 0 ns\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "good_substructs = []\n",
    "for line in open(goodlist, 'r'):\n",
    "    good_substructs.append(Chem.MolFromSmarts(line))\n",
    "for mol in list(data_set):\n",
    "    #perform goodlist screening if backbone not represented by a phosphorous pseudoatom\n",
    "    if (\"P\" in mol) or (\"p\" in mol):\n",
    "        continue\n",
    "    else:\n",
    "        good_count = 0 # counts number of goodlist substructures in a molecule\n",
    "        for good in good_substructs:\n",
    "            if Chem.MolFromSmiles(mol).HasSubstructMatch(good):\n",
    "                good_count += 1\n",
    "        #if good_count < len(good_substructs): # retained structures must have all goodlist substructures, can be changed if desired\n",
    "        if good_count == 0: # retained structures must have at least one goodlist substructure\n",
    "            bad_mols.add(mol)\n",
    "data_set = (data_set-bad_mols)\n",
    "print(\"Molecules left after retention of desired substructures: \", len(data_set))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c9e3061",
   "metadata": {},
   "source": [
    "### If you encounter a NameError in this cell for the 'goodlist' variable, you may have commented out the goodlist definition line in the first cell but ran this cell accidentally.  Change the cell type to \"Markdown\" by highlighting the cell and pressing \"M\", or in the menu at the top of the page select Cell --> Cell Type --> Markdown."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7d9496a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate and output filtered molecules\n",
    "out_mols = list(data_set - bad_mols)\n",
    "w = Chem.SDWriter(out_file)\n",
    "for mol in out_mols:\n",
    "    m2 = Chem.MolFromSmiles(mol)\n",
    "    w.write(m2)\n",
    "w.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "002bd404",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}