{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Duplicated values\n",
    "\n",
    "Sometimes, data is messy and contains duplicated records.\n",
    "\n",
    "We will use the package [dedupe](https://dedupe.io/) to eliminate duplicates. \n",
    "\n",
    "\n",
    "Some alternatives are the packages [recordlinkage](https://pypi.org/project/recordlinkage/) and [thefuzz](https://github.com/seatgeek/thefuzz).\n",
    "\n",
    "Instead of using the dedupe package directly, we are going to use **pandas-dedupe**:\n",
    "\n",
    "\n",
    "**pip install pandas-dedupe**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "Let's start by loading messy data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore') # Suppress warnings to keep the output readable\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pandas_dedupe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('https://raw.githubusercontent.com/dedupeio/dedupe-examples/master/csv_example/csv_example_messy_input.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Let's run some initial checks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3337, 32)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Source</th>\n",
       "      <th>Site name</th>\n",
       "      <th>Address</th>\n",
       "      <th>Zip</th>\n",
       "      <th>Phone</th>\n",
       "      <th>Fax</th>\n",
       "      <th>Program Name</th>\n",
       "      <th>Length of Day</th>\n",
       "      <th>IDHS Provider ID</th>\n",
       "      <th>...</th>\n",
       "      <th>Executive Director</th>\n",
       "      <th>Center Director</th>\n",
       "      <th>ECE Available Programs</th>\n",
       "      <th>NAEYC Valid Until</th>\n",
       "      <th>NAEYC Program Id</th>\n",
       "      <th>Email Address</th>\n",
       "      <th>Ounce of Prevention Description</th>\n",
       "      <th>Purple binder service type</th>\n",
       "      <th>Column</th>\n",
       "      <th>Column2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Salvation Army - Temple / Salvation Army</td>\n",
       "      <td>1 N Ogden Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2262649.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Salvation Army - Temple / Salvation Army</td>\n",
       "      <td>1 N Ogden Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2262649.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>National Louis University - Dr. Effie O. Elli...</td>\n",
       "      <td>10 S Kedzie Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5339011.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>National Louis University - Dr. Effie O. Elli...</td>\n",
       "      <td>10 S Kedzie Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5339011.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
       "      <td>10001 S Woodlawn Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2916100.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
       "      <td>10001 S Woodlawn Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2916100.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
       "      <td>1001 W Roosevelt Rd</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9395115.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
       "      <td>1001 W Roosevelt Rd</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9395115.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Hull House Association - Uptown Head Start / ...</td>\n",
       "      <td>1020 W Bryn Mawr Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7695753.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Hull House Association - Child Dev. Central O...</td>\n",
       "      <td>1030 W Van Buren St</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9068600.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Id                                 Source  \\\n",
       "0   0  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "1   1  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "2   2  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "3   3  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "4   4  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "5   5  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "6   6  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "7   7  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "8   8  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "9   9  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "\n",
       "                                           Site name                Address  \\\n",
       "0           Salvation Army - Temple / Salvation Army         1 N Ogden Ave    \n",
       "1           Salvation Army - Temple / Salvation Army         1 N Ogden Ave    \n",
       "2   National Louis University - Dr. Effie O. Elli...       10 S Kedzie Ave    \n",
       "3   National Louis University - Dr. Effie O. Elli...       10 S Kedzie Ave    \n",
       "4   Board Trustees-City Colleges of Chicago - Oli...  10001 S Woodlawn Ave    \n",
       "5   Board Trustees-City Colleges of Chicago - Oli...  10001 S Woodlawn Ave    \n",
       "6   Easter Seals Society of Metropolitan Chicago ...   1001 W Roosevelt Rd    \n",
       "7   Easter Seals Society of Metropolitan Chicago ...   1001 W Roosevelt Rd    \n",
       "8   Hull House Association - Uptown Head Start / ...  1020 W Bryn Mawr Ave    \n",
       "9   Hull House Association - Child Dev. Central O...   1030 W Van Buren St    \n",
       "\n",
       "   Zip      Phone  Fax Program Name Length of Day IDHS Provider ID  ...  \\\n",
       "0  NaN  2262649.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "1  NaN  2262649.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "2  NaN  5339011.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "3  NaN  5339011.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "4  NaN  2916100.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "5  NaN  2916100.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "6  NaN  9395115.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "7  NaN  9395115.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "8  NaN  7695753.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "9  NaN  9068600.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "\n",
       "  Executive Director Center Director ECE Available Programs NAEYC Valid Until  \\\n",
       "0                NaN             NaN                    NaN               NaN   \n",
       "1                NaN             NaN                    NaN               NaN   \n",
       "2                NaN             NaN                    NaN               NaN   \n",
       "3                NaN             NaN                    NaN               NaN   \n",
       "4                NaN             NaN                    NaN               NaN   \n",
       "5                NaN             NaN                    NaN               NaN   \n",
       "6                NaN             NaN                    NaN               NaN   \n",
       "7                NaN             NaN                    NaN               NaN   \n",
       "8                NaN             NaN                    NaN               NaN   \n",
       "9                NaN             NaN                    NaN               NaN   \n",
       "\n",
       "  NAEYC Program Id Email Address  Ounce of Prevention Description  \\\n",
       "0              NaN           NaN                              NaN   \n",
       "1              NaN           NaN                              NaN   \n",
       "2              NaN           NaN                              NaN   \n",
       "3              NaN           NaN                              NaN   \n",
       "4              NaN           NaN                              NaN   \n",
       "5              NaN           NaN                              NaN   \n",
       "6              NaN           NaN                              NaN   \n",
       "7              NaN           NaN                              NaN   \n",
       "8              NaN           NaN                              NaN   \n",
       "9              NaN           NaN                              NaN   \n",
       "\n",
       "   Purple binder service type Column Column2  \n",
       "0                         NaN    NaN     NaN  \n",
       "1                         NaN    NaN     NaN  \n",
       "2                         NaN    NaN     NaN  \n",
       "3                         NaN    NaN     NaN  \n",
       "4                         NaN    NaN     NaN  \n",
       "5                         NaN    NaN     NaN  \n",
       "6                         NaN    NaN     NaN  \n",
       "7                         NaN    NaN     NaN  \n",
       "8                         NaN    NaN     NaN  \n",
       "9                         NaN    NaN     NaN  \n",
       "\n",
       "[10 rows x 32 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
      "       'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
      "       'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
      "       'Number per Site EHS', 'Number per Site HS', 'Director',\n",
      "       'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
      "       'Website', 'Executive Director', 'Center Director',\n",
      "       'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
      "       'Email Address', 'Ounce of Prevention Description',\n",
      "       'Purple binder service type', 'Column', 'Column2'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Id                                   int64\n",
      "Source                              object\n",
      "Site name                           object\n",
      "Address                             object\n",
      "Zip                                float64\n",
      "Phone                              float64\n",
      "Fax                                 object\n",
      "Program Name                        object\n",
      "Length of Day                       object\n",
      "IDHS Provider ID                    object\n",
      "Agency                              object\n",
      "Neighborhood                        object\n",
      "Funded Enrollment                   object\n",
      "Program Option                      object\n",
      "Number per Site EHS                 object\n",
      "Number per Site HS                  object\n",
      "Director                           float64\n",
      "Head Start Fund                    float64\n",
      "Eearly Head Start Fund              object\n",
      "CC fund                             object\n",
      "Progmod                             object\n",
      "Website                             object\n",
      "Executive Director                  object\n",
      "Center Director                     object\n",
      "ECE Available Programs              object\n",
      "NAEYC Valid Until                   object\n",
      "NAEYC Program Id                   float64\n",
      "Email Address                       object\n",
      "Ounce of Prevention Description     object\n",
      "Purple binder service type          object\n",
      "Column                             float64\n",
      "Column2                             object\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Id                                    0\n",
       "Source                                0\n",
       "Site name                             0\n",
       "Address                               0\n",
       "Zip                                1333\n",
       "Phone                               146\n",
       "Fax                                3299\n",
       "Program Name                       2009\n",
       "Length of Day                      2009\n",
       "IDHS Provider ID                   3298\n",
       "Agency                             3325\n",
       "Neighborhood                       2754\n",
       "Funded Enrollment                  2424\n",
       "Program Option                     2800\n",
       "Number per Site EHS                3319\n",
       "Number per Site HS                 3319\n",
       "Director                           3337\n",
       "Head Start Fund                    3337\n",
       "Eearly Head Start Fund             2881\n",
       "CC fund                            2818\n",
       "Progmod                            2818\n",
       "Website                            2815\n",
       "Executive Director                 3114\n",
       "Center Director                    2874\n",
       "ECE Available Programs             2379\n",
       "NAEYC Valid Until                  2968\n",
       "NAEYC Program Id                   3337\n",
       "Email Address                      3203\n",
       "Ounce of Prevention Description    3185\n",
       "Purple binder service type         3215\n",
       "Column                             3337\n",
       "Column2                            3018\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the missing values in each column\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Check duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       False\n",
       "1       False\n",
       "2       False\n",
       "3       False\n",
       "4       False\n",
       "        ...  \n",
       "3332    False\n",
       "3333    False\n",
       "3334    False\n",
       "3335    False\n",
       "3336    False\n",
       "Length: 3337, dtype: bool"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.duplicated()"
   ]
  },
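  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Every row is reported as unique, although the preview above shows rows that differ only in `Id`. As a minimal sketch, the next cell rebuilds three of those rows in a small toy DataFrame (`toy` and `cols` are illustrative names, and only four of the columns are kept) and compares the rows with and without `Id` through the `subset` parameter of `duplicated`. This only catches exact matches on the chosen columns, so it is a quick check rather than a replacement for fuzzy matching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy frame copied from the first rows of the preview (illustrative, four columns only)\n",
    "toy = pd.DataFrame({\n",
    "    'Id': [0, 1, 2],\n",
    "    'Site name': ['Salvation Army - Temple / Salvation Army',\n",
    "                  'Salvation Army - Temple / Salvation Army',\n",
    "                  'National Louis University - Dr. Effie O. Elli...'],\n",
    "    'Address': ['1 N Ogden Ave', '1 N Ogden Ave', '10 S Kedzie Ave'],\n",
    "    'Phone': [2262649.0, 2262649.0, 5339011.0],\n",
    "})\n",
    "\n",
    "# Comparing whole rows finds no duplicates, because 'Id' is unique per row\n",
    "print(toy.duplicated().sum())\n",
    "\n",
    "# Ignoring 'Id' exposes the repeated record\n",
    "cols = [c for c in toy.columns if c != 'Id']\n",
    "print(toy.duplicated(subset=cols).sum())"
   ]
  },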
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Remove duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Source</th>\n",
       "      <th>Site name</th>\n",
       "      <th>Address</th>\n",
       "      <th>Zip</th>\n",
       "      <th>Phone</th>\n",
       "      <th>Fax</th>\n",
       "      <th>Program Name</th>\n",
       "      <th>Length of Day</th>\n",
       "      <th>IDHS Provider ID</th>\n",
       "      <th>...</th>\n",
       "      <th>Executive Director</th>\n",
       "      <th>Center Director</th>\n",
       "      <th>ECE Available Programs</th>\n",
       "      <th>NAEYC Valid Until</th>\n",
       "      <th>NAEYC Program Id</th>\n",
       "      <th>Email Address</th>\n",
       "      <th>Ounce of Prevention Description</th>\n",
       "      <th>Purple binder service type</th>\n",
       "      <th>Column</th>\n",
       "      <th>Column2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Salvation Army - Temple / Salvation Army</td>\n",
       "      <td>1 N Ogden Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2262649.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Salvation Army - Temple / Salvation Army</td>\n",
       "      <td>1 N Ogden Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2262649.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>National Louis University - Dr. Effie O. Elli...</td>\n",
       "      <td>10 S Kedzie Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5339011.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>National Louis University - Dr. Effie O. Elli...</td>\n",
       "      <td>10 S Kedzie Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5339011.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
       "      <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
       "      <td>10001 S Woodlawn Ave</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2916100.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Child Care</td>\n",
       "      <td>EXTENDED DAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Id                                 Source  \\\n",
       "0   0  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "1   1  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "2   2  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "3   3  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "4   4  CPS_Early_Childhood_Portal_scrape.csv   \n",
       "\n",
       "                                           Site name                Address  \\\n",
       "0           Salvation Army - Temple / Salvation Army         1 N Ogden Ave    \n",
       "1           Salvation Army - Temple / Salvation Army         1 N Ogden Ave    \n",
       "2   National Louis University - Dr. Effie O. Elli...       10 S Kedzie Ave    \n",
       "3   National Louis University - Dr. Effie O. Elli...       10 S Kedzie Ave    \n",
       "4   Board Trustees-City Colleges of Chicago - Oli...  10001 S Woodlawn Ave    \n",
       "\n",
       "   Zip      Phone  Fax Program Name Length of Day IDHS Provider ID  ...  \\\n",
       "0  NaN  2262649.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "1  NaN  2262649.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "2  NaN  5339011.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "3  NaN  5339011.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "4  NaN  2916100.0  NaN   Child Care  EXTENDED DAY              NaN  ...   \n",
       "\n",
       "  Executive Director Center Director ECE Available Programs NAEYC Valid Until  \\\n",
       "0                NaN             NaN                    NaN               NaN   \n",
       "1                NaN             NaN                    NaN               NaN   \n",
       "2                NaN             NaN                    NaN               NaN   \n",
       "3                NaN             NaN                    NaN               NaN   \n",
       "4                NaN             NaN                    NaN               NaN   \n",
       "\n",
       "  NAEYC Program Id Email Address  Ounce of Prevention Description  \\\n",
       "0              NaN           NaN                              NaN   \n",
       "1              NaN           NaN                              NaN   \n",
       "2              NaN           NaN                              NaN   \n",
       "3              NaN           NaN                              NaN   \n",
       "4              NaN           NaN                              NaN   \n",
       "\n",
       "   Purple binder service type Column Column2  \n",
       "0                         NaN    NaN     NaN  \n",
       "1                         NaN    NaN     NaN  \n",
       "2                         NaN    NaN     NaN  \n",
       "3                         NaN    NaN     NaN  \n",
       "4                         NaN    NaN     NaN  \n",
       "\n",
       "[5 rows x 32 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.drop_duplicates(inplace=True)\n",
    "df[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Remove 'real duplicates'\n",
    "\n",
    "The problem is that the records are not the same. \n",
    "\n",
    "Data is messy. \n",
    "\n",
    "We will use **dedupe**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Importing data ...\n",
      "Reading from dedupe_dataframe_learned_settings\n",
      "Clustering...\n",
      "# duplicate sets 871\n"
     ]
    }
   ],
   "source": [
    "# canonalize for standardizing names in a cluster\n",
    "df_dedupe = pandas_dedupe.dedupe_dataframe(df, ['Source', 'Site name', 'Address', 'Zip', 'Phone', 'Email Address'], canonicalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "If you want to retrain, you should delete the settings and training files (the dedupe* and link_dataframes* files).\n",
    "\n",
    "\n",
    "Now, if you inspect the dataframe, you will see the duplicated records that have been clustered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
       "       'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
       "       'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
       "       'Number per Site EHS', 'Number per Site HS', 'Director',\n",
       "       'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
       "       'Website', 'Executive Director', 'Center Director',\n",
       "       'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
       "       'Email Address', 'Ounce of Prevention Description',\n",
       "       'Purple binder service type', 'Column', 'Column2', 'cluster id',\n",
       "       'confidence', 'canonical_Id', 'canonical_Source', 'canonical_Site name',\n",
       "       'canonical_Address', 'canonical_Zip', 'canonical_Phone',\n",
       "       'canonical_Fax', 'canonical_Program Name', 'canonical_Length of Day',\n",
       "       'canonical_IDHS Provider ID', 'canonical_Agency',\n",
       "       'canonical_Neighborhood', 'canonical_Funded Enrollment',\n",
       "       'canonical_Program Option', 'canonical_Number per Site EHS',\n",
       "       'canonical_Number per Site HS', 'canonical_Director',\n",
       "       'canonical_Head Start Fund', 'canonical_Eearly Head Start Fund',\n",
       "       'canonical_CC fund', 'canonical_Progmod', 'canonical_Website',\n",
       "       'canonical_Executive Director', 'canonical_Center Director',\n",
       "       'canonical_ECE Available Programs', 'canonical_NAEYC Valid Until',\n",
       "       'canonical_NAEYC Program Id', 'canonical_Email Address',\n",
       "       'canonical_Ounce of Prevention Description',\n",
       "       'canonical_Purple binder service type', 'canonical_Column',\n",
       "       'canonical_Column2'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dedupe.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Source</th>\n",
       "      <th>Site name</th>\n",
       "      <th>Address</th>\n",
       "      <th>Zip</th>\n",
       "      <th>Phone</th>\n",
       "      <th>Fax</th>\n",
       "      <th>Program Name</th>\n",
       "      <th>Length of Day</th>\n",
       "      <th>IDHS Provider ID</th>\n",
       "      <th>...</th>\n",
       "      <th>canonical_Executive Director</th>\n",
       "      <th>canonical_Center Director</th>\n",
       "      <th>canonical_ECE Available Programs</th>\n",
       "      <th>canonical_NAEYC Valid Until</th>\n",
       "      <th>canonical_NAEYC Program Id</th>\n",
       "      <th>canonical_Email Address</th>\n",
       "      <th>canonical_Ounce of Prevention Description</th>\n",
       "      <th>canonical_Purple binder service type</th>\n",
       "      <th>canonical_Column</th>\n",
       "      <th>canonical_Column2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3327</th>\n",
       "      <td>3327</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>precious infants &amp; tots learning center</td>\n",
       "      <td>624 e 47th street</td>\n",
       "      <td>60653.0</td>\n",
       "      <td>2682685.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>early head start</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3300</th>\n",
       "      <td>3300</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>ywca metropolitan chicago</td>\n",
       "      <td>360 n michigan avenue</td>\n",
       "      <td>60601.0</td>\n",
       "      <td>3726600.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3299</th>\n",
       "      <td>3299</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>ymca west side</td>\n",
       "      <td>5080 w harrison street</td>\n",
       "      <td>60644.0</td>\n",
       "      <td>9553100.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3285</th>\n",
       "      <td>3285</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>woodlawn organization</td>\n",
       "      <td>6040 s harper avenue</td>\n",
       "      <td>60637.0</td>\n",
       "      <td>2885840.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3281</th>\n",
       "      <td>3281</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>urban family and community centers</td>\n",
       "      <td>4241 w washington boulevard</td>\n",
       "      <td>60624.0</td>\n",
       "      <td>7228333.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1837</th>\n",
       "      <td>1837</td>\n",
       "      <td>chapin_dfss_providers_2011_070212.csv</td>\n",
       "      <td>north avenue day nursery fcch-carolyn price</td>\n",
       "      <td>2020 w jackson</td>\n",
       "      <td>60612.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
       "      <td>www.crcl.net</td>\n",
       "      <td>cp</td>\n",
       "      <td>betty lee</td>\n",
       "      <td></td>\n",
       "      <td>05/31/14</td>\n",
       "      <td>723127</td>\n",
       "      <td>youngt@crcl.net</td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2643</th>\n",
       "      <td>2643</td>\n",
       "      <td>ece chicago find a school scrape.csv</td>\n",
       "      <td>mary crane north 0-3</td>\n",
       "      <td>2905 n. leavitt</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>3485528.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "      <td>www.marycrane.org</td>\n",
       "      <td>lavetter terry</td>\n",
       "      <td>martuice williams</td>\n",
       "      <td></td>\n",
       "      <td>08/01/16</td>\n",
       "      <td>722999</td>\n",
       "      <td>info@marycrane.org</td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3228</th>\n",
       "      <td>3228</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>mary crane center north</td>\n",
       "      <td>2905 n leavitt street</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>9753322.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "      <td>www.marycrane.org</td>\n",
       "      <td>lavetter terry</td>\n",
       "      <td>martuice williams</td>\n",
       "      <td></td>\n",
       "      <td>08/01/16</td>\n",
       "      <td>722999</td>\n",
       "      <td>info@marycrane.org</td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3229</th>\n",
       "      <td>3229</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>mary crane family and day care center</td>\n",
       "      <td>2905 n clybourn avenue</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>3485528.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "      <td>www.marycrane.org</td>\n",
       "      <td>lavetter terry</td>\n",
       "      <td>martuice williams</td>\n",
       "      <td></td>\n",
       "      <td>08/01/16</td>\n",
       "      <td>722999</td>\n",
       "      <td>info@marycrane.org</td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3241</th>\n",
       "      <td>3241</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>our lady of guadalupe early childhood center</td>\n",
       "      <td>9129 s burley avenue</td>\n",
       "      <td>60617.0</td>\n",
       "      <td>9785320.0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
       "      <td>www.catholiccharities.net</td>\n",
       "      <td>laura rios</td>\n",
       "      <td>deborah o'brien</td>\n",
       "      <td></td>\n",
       "      <td>01/31/13</td>\n",
       "      <td>486949</td>\n",
       "      <td>pgutierr@catholiccharities.net</td>\n",
       "      <td></td>\n",
       "      <td>child care</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3337 rows × 66 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Id                                 Source  \\\n",
       "3327  3327      purple_binder_early_childhood.csv   \n",
       "3300  3300      purple_binder_early_childhood.csv   \n",
       "3299  3299      purple_binder_early_childhood.csv   \n",
       "3285  3285      purple_binder_early_childhood.csv   \n",
       "3281  3281      purple_binder_early_childhood.csv   \n",
       "...    ...                                    ...   \n",
       "1837  1837  chapin_dfss_providers_2011_070212.csv   \n",
       "2643  2643   ece chicago find a school scrape.csv   \n",
       "3228  3228      purple_binder_early_childhood.csv   \n",
       "3229  3229      purple_binder_early_childhood.csv   \n",
       "3241  3241      purple_binder_early_childhood.csv   \n",
       "\n",
       "                                         Site name  \\\n",
       "3327       precious infants & tots learning center   \n",
       "3300                     ywca metropolitan chicago   \n",
       "3299                                ymca west side   \n",
       "3285                         woodlawn organization   \n",
       "3281            urban family and community centers   \n",
       "...                                            ...   \n",
       "1837   north avenue day nursery fcch-carolyn price   \n",
       "2643                          mary crane north 0-3   \n",
       "3228                       mary crane center north   \n",
       "3229         mary crane family and day care center   \n",
       "3241  our lady of guadalupe early childhood center   \n",
       "\n",
       "                          Address      Zip      Phone   Fax Program Name  \\\n",
       "3327            624 e 47th street  60653.0  2682685.0  None         None   \n",
       "3300        360 n michigan avenue  60601.0  3726600.0  None         None   \n",
       "3299       5080 w harrison street  60644.0  9553100.0  None         None   \n",
       "3285         6040 s harper avenue  60637.0  2885840.0  None         None   \n",
       "3281  4241 w washington boulevard  60624.0  7228333.0  None         None   \n",
       "...                           ...      ...        ...   ...          ...   \n",
       "1837               2020 w jackson  60612.0       None  None         None   \n",
       "2643              2905 n. leavitt  60618.0  3485528.0  None         None   \n",
       "3228        2905 n leavitt street  60618.0  9753322.0  None         None   \n",
       "3229       2905 n clybourn avenue  60618.0  3485528.0  None         None   \n",
       "3241         9129 s burley avenue  60617.0  9785320.0  None         None   \n",
       "\n",
       "     Length of Day IDHS Provider ID  ...  \\\n",
       "3327          None             None  ...   \n",
       "3300          None             None  ...   \n",
       "3299          None             None  ...   \n",
       "3285          None             None  ...   \n",
       "3281          None             None  ...   \n",
       "...            ...              ...  ...   \n",
       "1837          None             None  ...   \n",
       "2643          None             None  ...   \n",
       "3228          None             None  ...   \n",
       "3229          None             None  ...   \n",
       "3241          None             None  ...   \n",
       "\n",
       "                           canonical_Executive Director  \\\n",
       "3327                                                      \n",
       "3300                                                      \n",
       "3299                                                      \n",
       "3285                                                      \n",
       "3281                                                      \n",
       "...                                                 ...   \n",
       "1837  5833ip(ehs collaboration enhanced home ip), 58...   \n",
       "2643  4374it(child care it center), 4374ps(child car...   \n",
       "3228  4374it(child care it center), 4374ps(child car...   \n",
       "3229  4374it(child care it center), 4374ps(child car...   \n",
       "3241  2488it(child care it center), 7030ps(hs collab...   \n",
       "\n",
       "      canonical_Center Director canonical_ECE Available Programs  \\\n",
       "3327                                                               \n",
       "3300                                                               \n",
       "3299                                                               \n",
       "3285                                                               \n",
       "3281                                                               \n",
       "...                         ...                              ...   \n",
       "1837               www.crcl.net                               cp   \n",
       "2643          www.marycrane.org                   lavetter terry   \n",
       "3228          www.marycrane.org                   lavetter terry   \n",
       "3229          www.marycrane.org                   lavetter terry   \n",
       "3241  www.catholiccharities.net                       laura rios   \n",
       "\n",
       "     canonical_NAEYC Valid Until canonical_NAEYC Program Id  \\\n",
       "3327                                                          \n",
       "3300                                                          \n",
       "3299                                                          \n",
       "3285                                                          \n",
       "3281                                                          \n",
       "...                          ...                        ...   \n",
       "1837                   betty lee                              \n",
       "2643           martuice williams                              \n",
       "3228           martuice williams                              \n",
       "3229           martuice williams                              \n",
       "3241             deborah o'brien                              \n",
       "\n",
       "     canonical_Email Address canonical_Ounce of Prevention Description  \\\n",
       "3327                                                                     \n",
       "3300                                                                     \n",
       "3299                                                                     \n",
       "3285                                                                     \n",
       "3281                                                                     \n",
       "...                      ...                                       ...   \n",
       "1837                05/31/14                                    723127   \n",
       "2643                08/01/16                                    722999   \n",
       "3228                08/01/16                                    722999   \n",
       "3229                08/01/16                                    722999   \n",
       "3241                01/31/13                                    486949   \n",
       "\n",
       "     canonical_Purple binder service type canonical_Column canonical_Column2  \n",
       "3327                                                        early head start  \n",
       "3300                                                              child care  \n",
       "3299                                                              child care  \n",
       "3285                                                              child care  \n",
       "3281                                                              child care  \n",
       "...                                   ...              ...               ...  \n",
       "1837                      youngt@crcl.net                         child care  \n",
       "2643                   info@marycrane.org                         child care  \n",
       "3228                   info@marycrane.org                         child care  \n",
       "3229                   info@marycrane.org                         child care  \n",
       "3241       pgutierr@catholiccharities.net                         child care  \n",
       "\n",
       "[3337 rows x 66 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sorted = df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)\n",
    "df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>cluster id</th>\n",
       "      <th>confidence</th>\n",
       "      <th>Source</th>\n",
       "      <th>Zip</th>\n",
       "      <th>Address</th>\n",
       "      <th>canonical_Executive Director</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3327</th>\n",
       "      <td>3327</td>\n",
       "      <td>870</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60653.0</td>\n",
       "      <td>624 e 47th street</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3300</th>\n",
       "      <td>3300</td>\n",
       "      <td>869</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60601.0</td>\n",
       "      <td>360 n michigan avenue</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3299</th>\n",
       "      <td>3299</td>\n",
       "      <td>868</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60644.0</td>\n",
       "      <td>5080 w harrison street</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3285</th>\n",
       "      <td>3285</td>\n",
       "      <td>867</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60637.0</td>\n",
       "      <td>6040 s harper avenue</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3281</th>\n",
       "      <td>3281</td>\n",
       "      <td>866</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60624.0</td>\n",
       "      <td>4241 w washington boulevard</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1837</th>\n",
       "      <td>1837</td>\n",
       "      <td>40</td>\n",
       "      <td>0.133246</td>\n",
       "      <td>chapin_dfss_providers_2011_070212.csv</td>\n",
       "      <td>60612.0</td>\n",
       "      <td>2020 w jackson</td>\n",
       "      <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2643</th>\n",
       "      <td>2643</td>\n",
       "      <td>31</td>\n",
       "      <td>0.130475</td>\n",
       "      <td>ece chicago find a school scrape.csv</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>2905 n. leavitt</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3228</th>\n",
       "      <td>3228</td>\n",
       "      <td>31</td>\n",
       "      <td>0.130475</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>2905 n leavitt street</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3229</th>\n",
       "      <td>3229</td>\n",
       "      <td>31</td>\n",
       "      <td>0.130474</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60618.0</td>\n",
       "      <td>2905 n clybourn avenue</td>\n",
       "      <td>4374it(child care it center), 4374ps(child car...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3241</th>\n",
       "      <td>3241</td>\n",
       "      <td>27</td>\n",
       "      <td>0.058419</td>\n",
       "      <td>purple_binder_early_childhood.csv</td>\n",
       "      <td>60617.0</td>\n",
       "      <td>9129 s burley avenue</td>\n",
       "      <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3337 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Id  cluster id  confidence                                 Source  \\\n",
       "3327  3327         870    1.000000      purple_binder_early_childhood.csv   \n",
       "3300  3300         869    1.000000      purple_binder_early_childhood.csv   \n",
       "3299  3299         868    1.000000      purple_binder_early_childhood.csv   \n",
       "3285  3285         867    1.000000      purple_binder_early_childhood.csv   \n",
       "3281  3281         866    1.000000      purple_binder_early_childhood.csv   \n",
       "...    ...         ...         ...                                    ...   \n",
       "1837  1837          40    0.133246  chapin_dfss_providers_2011_070212.csv   \n",
       "2643  2643          31    0.130475   ece chicago find a school scrape.csv   \n",
       "3228  3228          31    0.130475      purple_binder_early_childhood.csv   \n",
       "3229  3229          31    0.130474      purple_binder_early_childhood.csv   \n",
       "3241  3241          27    0.058419      purple_binder_early_childhood.csv   \n",
       "\n",
       "          Zip                      Address  \\\n",
       "3327  60653.0            624 e 47th street   \n",
       "3300  60601.0        360 n michigan avenue   \n",
       "3299  60644.0       5080 w harrison street   \n",
       "3285  60637.0         6040 s harper avenue   \n",
       "3281  60624.0  4241 w washington boulevard   \n",
       "...       ...                          ...   \n",
       "1837  60612.0               2020 w jackson   \n",
       "2643  60618.0              2905 n. leavitt   \n",
       "3228  60618.0        2905 n leavitt street   \n",
       "3229  60618.0       2905 n clybourn avenue   \n",
       "3241  60617.0         9129 s burley avenue   \n",
       "\n",
       "                           canonical_Executive Director  \n",
       "3327                                                     \n",
       "3300                                                     \n",
       "3299                                                     \n",
       "3285                                                     \n",
       "3281                                                     \n",
       "...                                                 ...  \n",
       "1837  5833ip(ehs collaboration enhanced home ip), 58...  \n",
       "2643  4374it(child care it center), 4374ps(child car...  \n",
       "3228  4374it(child care it center), 4374ps(child car...  \n",
       "3229  4374it(child care it center), 4374ps(child car...  \n",
       "3241  2488it(child care it center), 7030ps(hs collab...  \n",
       "\n",
       "[3337 rows x 7 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sorted[['Id', 'cluster id', 'confidence', 'Source', 'Zip', 'Address', 'canonical_Executive Director']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Matching / Linking records\n",
    "Another common problem is matching (linking) records that refer to the same entity but come from different sources."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Let's load two datasets from FEBRL (Freely Extensible Biomedical Record Linkage)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Load the two FEBRL datasets as dataframes\n",
    "dfa = pd.read_csv('data/dataset1-febrl.csv')\n",
    "dfb = pd.read_csv('data/dataset2-febrl.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "This library cannot apply record matching when there are missing values, so we remove the rows that contain them.\n",
    "\n",
    "The problem is that many missing values are stored as ' ' (a blank string, not NaN). So, we first convert them to NaN, and then we drop the affected rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "dfa.replace(['', ' '], np.nan, inplace=True)\n",
    "dfb.replace(['', ' '], np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "dfa.dropna(inplace=True)\n",
    "dfb.dropna(inplace=True)"
   ]
  },
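  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The two steps above can be tried on a toy dataframe. This is only a minimal sketch: the `toy` data is made up for illustration, and the regular expression passed to `replace` is an assumed generalisation that treats any empty or whitespace-only string as missing, rather than just `''` and `' '`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: blanks are stored as strings, not as NaN\n",
    "toy = pd.DataFrame({\n",
    "    'given_name': ['lachlan', ' ', 'luke', 'abbey'],\n",
    "    'surname': ['berry', 'purdon', '', '   '],\n",
    "})\n",
    "\n",
    "# isna() does not see blank strings as missing\n",
    "missing_before = int(toy.isna().sum().sum())\n",
    "\n",
    "# Convert empty or whitespace-only strings to NaN, then drop those rows\n",
    "toy_clean = toy.replace(r'^\\s*$', np.nan, regex=True).dropna()\n",
    "\n",
    "print(missing_before, len(toy), len(toy_clean))\n",
    "```\n",
    "\n",
    "Returning a new dataframe instead of using `inplace=True` keeps the original available, which makes it easy to check how many rows were discarded."
   ]
  },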
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rec_id</th>\n",
       "      <th>given_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>street_number</th>\n",
       "      <th>address_1</th>\n",
       "      <th>address_2</th>\n",
       "      <th>suburb</th>\n",
       "      <th>postcode</th>\n",
       "      <th>state</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>soc_sec_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>rec-122-org</td>\n",
       "      <td>lachlan</td>\n",
       "      <td>berry</td>\n",
       "      <td>69</td>\n",
       "      <td>giblin street</td>\n",
       "      <td>killarney</td>\n",
       "      <td>bittern</td>\n",
       "      <td>4814</td>\n",
       "      <td>qld</td>\n",
       "      <td>19990219</td>\n",
       "      <td>7364009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>rec-373-org</td>\n",
       "      <td>deakin</td>\n",
       "      <td>sondergeld</td>\n",
       "      <td>48</td>\n",
       "      <td>goldfinch circuit</td>\n",
       "      <td>kooltuo</td>\n",
       "      <td>canterbury</td>\n",
       "      <td>2776</td>\n",
       "      <td>vic</td>\n",
       "      <td>19600210</td>\n",
       "      <td>2635962</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>rec-227-org</td>\n",
       "      <td>luke</td>\n",
       "      <td>purdon</td>\n",
       "      <td>23</td>\n",
       "      <td>ramsay place</td>\n",
       "      <td>mirani</td>\n",
       "      <td>garbutt</td>\n",
       "      <td>2260</td>\n",
       "      <td>vic</td>\n",
       "      <td>19831024</td>\n",
       "      <td>8099933</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>rec-294-org</td>\n",
       "      <td>william</td>\n",
       "      <td>bishop</td>\n",
       "      <td>21</td>\n",
       "      <td>neworra place</td>\n",
       "      <td>apmnt 65</td>\n",
       "      <td>worongary</td>\n",
       "      <td>6225</td>\n",
       "      <td>qld</td>\n",
       "      <td>19490130</td>\n",
       "      <td>9773843</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>rec-81-dup-0</td>\n",
       "      <td>abbey</td>\n",
       "      <td>fit</td>\n",
       "      <td>13</td>\n",
       "      <td>kosciusko avenue</td>\n",
       "      <td>the wharf complex</td>\n",
       "      <td>yass</td>\n",
       "      <td>2594</td>\n",
       "      <td>nsw</td>\n",
       "      <td>19870510</td>\n",
       "      <td>7661096</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>rec-34-org</td>\n",
       "      <td>isabella</td>\n",
       "      <td>lodder</td>\n",
       "      <td>156</td>\n",
       "      <td>messenger street</td>\n",
       "      <td>tongbong sanctuary</td>\n",
       "      <td>bayswater</td>\n",
       "      <td>4870</td>\n",
       "      <td>vic</td>\n",
       "      <td>19650714</td>\n",
       "      <td>2790666</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>rec-478-org</td>\n",
       "      <td>anthony</td>\n",
       "      <td>beazley</td>\n",
       "      <td>12</td>\n",
       "      <td>birubi place</td>\n",
       "      <td>currandina</td>\n",
       "      <td>flemington</td>\n",
       "      <td>2477</td>\n",
       "      <td>qld</td>\n",
       "      <td>19730924</td>\n",
       "      <td>6558077</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>rec-225-org</td>\n",
       "      <td>alia</td>\n",
       "      <td>streich</td>\n",
       "      <td>74</td>\n",
       "      <td>maranoa street</td>\n",
       "      <td>rocky bend</td>\n",
       "      <td>rowville</td>\n",
       "      <td>6152</td>\n",
       "      <td>vic</td>\n",
       "      <td>19790418</td>\n",
       "      <td>1975340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>rec-452-org</td>\n",
       "      <td>alissa</td>\n",
       "      <td>kilmartin</td>\n",
       "      <td>37</td>\n",
       "      <td>reveley crescent</td>\n",
       "      <td>crown allot</td>\n",
       "      <td>wolumla</td>\n",
       "      <td>6210</td>\n",
       "      <td>nsw</td>\n",
       "      <td>19041118</td>\n",
       "      <td>7994055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>rec-67-org</td>\n",
       "      <td>jacob</td>\n",
       "      <td>lyden</td>\n",
       "      <td>25</td>\n",
       "      <td>haddon street</td>\n",
       "      <td>glenview</td>\n",
       "      <td>woodville north</td>\n",
       "      <td>2226</td>\n",
       "      <td>qld</td>\n",
       "      <td>19910424</td>\n",
       "      <td>6426415</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          rec_id  given_name      surname  street_number           address_1  \\\n",
       "1    rec-122-org     lachlan        berry             69       giblin street   \n",
       "2    rec-373-org      deakin   sondergeld             48   goldfinch circuit   \n",
       "4    rec-227-org        luke       purdon             23        ramsay place   \n",
       "7    rec-294-org     william       bishop             21       neworra place   \n",
       "10  rec-81-dup-0       abbey          fit             13    kosciusko avenue   \n",
       "11    rec-34-org    isabella       lodder            156    messenger street   \n",
       "12   rec-478-org     anthony      beazley             12        birubi place   \n",
       "13   rec-225-org        alia      streich             74      maranoa street   \n",
       "15   rec-452-org      alissa    kilmartin             37    reveley crescent   \n",
       "16    rec-67-org       jacob        lyden             25       haddon street   \n",
       "\n",
       "              address_2            suburb   postcode  state  date_of_birth  \\\n",
       "1             killarney           bittern       4814    qld       19990219   \n",
       "2               kooltuo        canterbury       2776    vic       19600210   \n",
       "4                mirani           garbutt       2260    vic       19831024   \n",
       "7              apmnt 65         worongary       6225    qld       19490130   \n",
       "10    the wharf complex              yass       2594    nsw       19870510   \n",
       "11   tongbong sanctuary         bayswater       4870    vic       19650714   \n",
       "12           currandina        flemington       2477    qld       19730924   \n",
       "13           rocky bend          rowville       6152    vic       19790418   \n",
       "15          crown allot           wolumla       6210    nsw       19041118   \n",
       "16             glenview   woodville north       2226    qld       19910424   \n",
       "\n",
       "     soc_sec_id  \n",
       "1       7364009  \n",
       "2       2635962  \n",
       "4       8099933  \n",
       "7       9773843  \n",
       "10      7661096  \n",
       "11      2790666  \n",
       "12      6558077  \n",
       "13      1975340  \n",
       "15      7994055  \n",
       "16      6426415  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfa.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rec_id</th>\n",
       "      <th>given_name</th>\n",
       "      <th>surname</th>\n",
       "      <th>street_number</th>\n",
       "      <th>address_1</th>\n",
       "      <th>address_2</th>\n",
       "      <th>suburb</th>\n",
       "      <th>postcode</th>\n",
       "      <th>state</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>soc_sec_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>rec-2778-org</td>\n",
       "      <td>sarah</td>\n",
       "      <td>bruhn</td>\n",
       "      <td>44</td>\n",
       "      <td>forbes street</td>\n",
       "      <td>wintersloe</td>\n",
       "      <td>kellerberrin</td>\n",
       "      <td>4510</td>\n",
       "      <td>vic</td>\n",
       "      <td>19300213</td>\n",
       "      <td>7535316</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>rec-712-dup-0</td>\n",
       "      <td>jacob</td>\n",
       "      <td>lanyon</td>\n",
       "      <td>5</td>\n",
       "      <td>milne cove</td>\n",
       "      <td>wellwod</td>\n",
       "      <td>beaconsfield upper</td>\n",
       "      <td>2602</td>\n",
       "      <td>vic</td>\n",
       "      <td>19080712</td>\n",
       "      <td>9497788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>rec-1321-org</td>\n",
       "      <td>brinley</td>\n",
       "      <td>efthimiou</td>\n",
       "      <td>35</td>\n",
       "      <td>sturdee crescent</td>\n",
       "      <td>tremearne</td>\n",
       "      <td>scarborough</td>\n",
       "      <td>5211</td>\n",
       "      <td>qld</td>\n",
       "      <td>19940319</td>\n",
       "      <td>6814956</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>rec-3004-org</td>\n",
       "      <td>aleisha</td>\n",
       "      <td>hobson</td>\n",
       "      <td>54</td>\n",
       "      <td>oliver street</td>\n",
       "      <td>inglewood</td>\n",
       "      <td>toowoomba</td>\n",
       "      <td>3175</td>\n",
       "      <td>qld</td>\n",
       "      <td>19290427</td>\n",
       "      <td>5967384</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>rec-1384-org</td>\n",
       "      <td>ethan</td>\n",
       "      <td>gazzola</td>\n",
       "      <td>49</td>\n",
       "      <td>sheaffe street</td>\n",
       "      <td>bimby vale</td>\n",
       "      <td>port pirie</td>\n",
       "      <td>3088</td>\n",
       "      <td>sa</td>\n",
       "      <td>19631225</td>\n",
       "      <td>3832742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>rec-3981-org</td>\n",
       "      <td>alicia</td>\n",
       "      <td>hope</td>\n",
       "      <td>100</td>\n",
       "      <td>mansfield place</td>\n",
       "      <td>sunset</td>\n",
       "      <td>byford</td>\n",
       "      <td>6061</td>\n",
       "      <td>sa</td>\n",
       "      <td>19421201</td>\n",
       "      <td>7934773</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>rec-916-org</td>\n",
       "      <td>benjamin</td>\n",
       "      <td>kolosche</td>\n",
       "      <td>78</td>\n",
       "      <td>keenan street</td>\n",
       "      <td>wingara</td>\n",
       "      <td>raymond terrace</td>\n",
       "      <td>3212</td>\n",
       "      <td>sa</td>\n",
       "      <td>19450918</td>\n",
       "      <td>5698873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>rec-63-dup-0</td>\n",
       "      <td>olivia</td>\n",
       "      <td>white</td>\n",
       "      <td>55</td>\n",
       "      <td>duffy street</td>\n",
       "      <td>shopping village</td>\n",
       "      <td>mirrabooka</td>\n",
       "      <td>2260</td>\n",
       "      <td>vic</td>\n",
       "      <td>19000106</td>\n",
       "      <td>4996142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>rec-112-org</td>\n",
       "      <td>joshua</td>\n",
       "      <td>rudd</td>\n",
       "      <td>78</td>\n",
       "      <td>max henry crescent</td>\n",
       "      <td>brentwood vlge</td>\n",
       "      <td>port douglas</td>\n",
       "      <td>2315</td>\n",
       "      <td>vic</td>\n",
       "      <td>19951125</td>\n",
       "      <td>1697892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>rec-3297-org</td>\n",
       "      <td>rachael</td>\n",
       "      <td>lomman</td>\n",
       "      <td>37</td>\n",
       "      <td>carlile street</td>\n",
       "      <td>clonturkle</td>\n",
       "      <td>bronte</td>\n",
       "      <td>2177</td>\n",
       "      <td>nsw</td>\n",
       "      <td>19910228</td>\n",
       "      <td>9462397</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           rec_id  given_name     surname  street_number            address_1  \\\n",
       "0    rec-2778-org       sarah       bruhn             44        forbes street   \n",
       "1   rec-712-dup-0       jacob      lanyon              5           milne cove   \n",
       "2    rec-1321-org     brinley   efthimiou             35     sturdee crescent   \n",
       "3    rec-3004-org     aleisha      hobson             54        oliver street   \n",
       "4    rec-1384-org       ethan     gazzola             49       sheaffe street   \n",
       "5    rec-3981-org      alicia        hope            100      mansfield place   \n",
       "6     rec-916-org    benjamin    kolosche             78        keenan street   \n",
       "8    rec-63-dup-0      olivia       white             55         duffy street   \n",
       "10    rec-112-org      joshua        rudd             78   max henry crescent   \n",
       "11   rec-3297-org     rachael      lomman             37       carlile street   \n",
       "\n",
       "            address_2               suburb   postcode  state  date_of_birth  \\\n",
       "0          wintersloe         kellerberrin       4510    vic       19300213   \n",
       "1             wellwod   beaconsfield upper       2602    vic       19080712   \n",
       "2           tremearne          scarborough       5211    qld       19940319   \n",
       "3           inglewood            toowoomba       3175    qld       19290427   \n",
       "4          bimby vale           port pirie       3088     sa       19631225   \n",
       "5              sunset               byford       6061     sa       19421201   \n",
       "6             wingara      raymond terrace       3212     sa       19450918   \n",
       "8    shopping village           mirrabooka       2260    vic       19000106   \n",
       "10     brentwood vlge         port douglas       2315    vic       19951125   \n",
       "11         clonturkle               bronte       2177    nsw       19910228   \n",
       "\n",
       "     soc_sec_id  \n",
       "0       7535316  \n",
       "1       9497788  \n",
       "2       6814956  \n",
       "3       5967384  \n",
       "4       3832742  \n",
       "5       7934773  \n",
       "6       5698873  \n",
       "8       4996142  \n",
       "10      1697892  \n",
       "11      9462397  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfb.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Check that the two datasets have the same columns. Note that every column name except `rec_id` starts with a space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
      "       ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
      "       ' soc_sec_id'],\n",
      "      dtype='object')\n",
      "Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
      "       ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
      "       ' soc_sec_id'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(dfa.columns)\n",
    "print(dfb.columns)"
   ]
  },
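  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Printing both indexes works, but the comparison can also be made explicit. The sketch below uses two made-up dataframes (`left` and `right` are hypothetical names) whose column names carry a leading space, as in the output above, and compares them with `Index.equals`. Stripping the names with `rename(columns=str.strip)` is optional and shown only as an assumption about what is convenient; if the names are changed, any later reference to a field has to use the new names.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frames with a leading space in some column names\n",
    "left = pd.DataFrame({'rec_id': ['rec-1-org'], ' given_name': ['jacob'], ' surname': ['lyden']})\n",
    "right = pd.DataFrame({'rec_id': ['rec-1-dup-0'], ' given_name': ['jacob'], ' surname': ['lyen']})\n",
    "\n",
    "# True only if both frames have the same column names in the same order\n",
    "same_columns = left.columns.equals(right.columns)\n",
    "\n",
    "# Optional: remove surrounding whitespace from the column names\n",
    "stripped = left.rename(columns=str.strip)\n",
    "\n",
    "print(same_columns, list(stripped.columns))\n",
    "```"
   ]
  },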
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
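    "Let's match the records. The library starts an interactive active-learning session in which we label candidate pairs as matches or non-matches."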
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Importing data ...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : reeve\n",
      " surname : quilliam\n",
      " street_number : 2\n",
      " address_1 : renwick street\n",
      " address_2 : yarrabee\n",
      " suburb : barwon heads\n",
      " postcode : 2340\n",
      " state : nsw\n",
      " date_of_birth : 19810406\n",
      " soc_sec_id : 1066923\n",
      "\n",
      " given_name : jessica\n",
      " surname : reid\n",
      " street_number : 280\n",
      " address_1 : medley street\n",
      " address_2 : warra creek\n",
      " suburb : ballarat\n",
      " postcode : 3149\n",
      " state : nsw\n",
      " date_of_birth : 19830907\n",
      " soc_sec_id : 1067529\n",
      "\n",
      "0/10 positive, 0/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting active labeling...\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : daniel\n",
      " surname : couzens\n",
      " street_number : 37\n",
      " address_1 : coventry close\n",
      " address_2 : cressbrook\n",
      " suburb : mount eliza\n",
      " postcode : 5073\n",
      " state : nsw\n",
      " date_of_birth : 19881127\n",
      " soc_sec_id : 6934299\n",
      "\n",
      " given_name : dante\n",
      " surname : dakin\n",
      " street_number : 3\n",
      " address_1 : chuculba crescent\n",
      " address_2 : greenpatch\n",
      " suburb : forbes\n",
      " postcode : 5072\n",
      " state : nsw\n",
      " date_of_birth : 19481028\n",
      " soc_sec_id : 7288639\n",
      "\n",
      "0/10 positive, 1/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : lachlan\n",
      " surname : jukic\n",
      " street_number : 2\n",
      " address_1 : morgan crescent\n",
      " address_2 : parklea\n",
      " suburb : raymond terrace\n",
      " postcode : 2250\n",
      " state : nsw\n",
      " date_of_birth : 19780702\n",
      " soc_sec_id : 4027998\n",
      "\n",
      " given_name : meg\n",
      " surname : feil\n",
      " street_number : 17\n",
      " address_1 : biraban place\n",
      " address_2 : hughloch lincoln red stud\n",
      " suburb : hawthorne\n",
      " postcode : 3429\n",
      " state : vic\n",
      " date_of_birth : 19060812\n",
      " soc_sec_id : 4027997\n",
      "\n",
      "0/10 positive, 2/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : jacob\n",
      " surname : lyen\n",
      " street_number : 25\n",
      " address_1 : haddon srteet\n",
      " address_2 : glenvie w\n",
      " suburb : woodville north\n",
      " postcode : 2226\n",
      " state : qld\n",
      " date_of_birth : 19910424\n",
      " soc_sec_id : 6426415\n",
      "\n",
      " given_name : zac\n",
      " surname : white\n",
      " street_number : 26\n",
      " address_1 : companion crescent\n",
      " address_2 : glenview\n",
      " suburb : toronto\n",
      " postcode : 2226\n",
      " state : sa\n",
      " date_of_birth : 19431117\n",
      " soc_sec_id : 3437945\n",
      "\n",
      "0/10 positive, 3/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : kydan\n",
      " surname : mccarthy\n",
      " street_number : 67\n",
      " address_1 : clemenger street\n",
      " address_2 : the points holsteins\n",
      " suburb : fairlawn\n",
      " postcode : 6415\n",
      " state : nsw\n",
      " date_of_birth : 19720518\n",
      " soc_sec_id : 6527653\n",
      "\n",
      " given_name : daniel\n",
      " surname : mccarthy\n",
      " street_number : 6\n",
      " address_1 : brunton street\n",
      " address_2 : tall pines\n",
      " suburb : fairlight\n",
      " postcode : 3155\n",
      " state : nsw\n",
      " date_of_birth : 19760107\n",
      " soc_sec_id : 8093038\n",
      "\n",
      "0/10 positive, 4/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : brooklyn\n",
      " surname : manson\n",
      " street_number : 27\n",
      " address_1 : clive steele avenue\n",
      " address_2 : port fairy road\n",
      " suburb : mount low\n",
      " postcode : 3450\n",
      " state : vic\n",
      " date_of_birth : 19710727\n",
      " soc_sec_id : 4493900\n",
      "\n",
      " given_name : ruby\n",
      " surname : mason\n",
      " street_number : 7\n",
      " address_1 : clive steele avenue\n",
      " address_2 : kooyong\n",
      " suburb : botany\n",
      " postcode : 3636\n",
      " state : vic\n",
      " date_of_birth : 19730913\n",
      " soc_sec_id : 4397223\n",
      "\n",
      "0/10 positive, 5/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : emiily\n",
      " surname : coleman\n",
      " street_number : 108\n",
      " address_1 : chewings street\n",
      " address_2 : berkeley vlge\n",
      " suburb : wellington point\n",
      " postcode : 2550\n",
      " state : nsw\n",
      " date_of_birth : 19421221\n",
      " soc_sec_id : 7206933\n",
      "\n",
      " given_name : emiily\n",
      " surname : went\n",
      " street_number : 18\n",
      " address_1 : glenmaggie street\n",
      " address_2 : berkeley vlge\n",
      " suburb : blue haven\n",
      " postcode : 6051\n",
      " state : vic\n",
      " date_of_birth : 19521205\n",
      " soc_sec_id : 8530937\n",
      "\n",
      "0/10 positive, 6/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : jacob\n",
      " surname : white\n",
      " street_number : 5\n",
      " address_1 : findlay street\n",
      " address_2 : booroopki park rmb 596\n",
      " suburb : robina\n",
      " postcode : 2197\n",
      " state : vic\n",
      " date_of_birth : 19170205\n",
      " soc_sec_id : 4702928\n",
      "\n",
      " given_name : talia\n",
      " surname : reid\n",
      " street_number : 147\n",
      " address_1 : sid barnes crescent\n",
      " address_2 : tathra\n",
      " suburb : berowra heights\n",
      " postcode : 2170\n",
      " state : vic\n",
      " date_of_birth : 19230203\n",
      " soc_sec_id : 4712927\n",
      "\n",
      "0/10 positive, 7/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : carla\n",
      " surname : amiet\n",
      " street_number : 50\n",
      " address_1 : carstensz street\n",
      " address_2 : oaklodge\n",
      " suburb : blackmans bay\n",
      " postcode : 3180\n",
      " state : nsw\n",
      " date_of_birth : 19790801\n",
      " soc_sec_id : 9646483\n",
      "\n",
      " given_name : cameron\n",
      " surname : coleman\n",
      " street_number : 10\n",
      " address_1 : edwards street\n",
      " address_2 : broadbridge manor\n",
      " suburb : blackmans bay\n",
      " postcode : 3630\n",
      " state : nsw\n",
      " date_of_birth : 19871030\n",
      " soc_sec_id : 5502408\n",
      "\n",
      "0/10 positive, 8/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : lara\n",
      " surname : matthews\n",
      " street_number : 6\n",
      " address_1 : beaumaris street\n",
      " address_2 : sunforest therapy centre\n",
      " suburb : the basin\n",
      " postcode : 4179\n",
      " state : nsw\n",
      " date_of_birth : 19911006\n",
      " soc_sec_id : 2164704\n",
      "\n",
      " given_name : dominic\n",
      " surname : matthews\n",
      " street_number : 67\n",
      " address_1 : campbell street\n",
      " address_2 : narraburra lodge\n",
      " suburb : coonabarabran\n",
      " postcode : 3174\n",
      " state : nsw\n",
      " date_of_birth : 19470226\n",
      " soc_sec_id : 3115384\n",
      "\n",
      "0/10 positive, 9/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : jack\n",
      " surname : matthews\n",
      " street_number : 17\n",
      " address_1 : herron crescent\n",
      " address_2 : broadmere\n",
      " suburb : highton\n",
      " postcode : 2035\n",
      " state : vic\n",
      " date_of_birth : 19081119\n",
      " soc_sec_id : 5613395\n",
      "\n",
      " given_name : alexandra\n",
      " surname : matthews\n",
      " street_number : 174\n",
      " address_1 : port jackson circuit\n",
      " address_2 : old timers south\n",
      " suburb : whaleback\n",
      " postcode : 2830\n",
      " state : vic\n",
      " date_of_birth : 19261017\n",
      " soc_sec_id : 2919332\n",
      "\n",
      "0/10 positive, 10/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : emiily\n",
      " surname : kowald\n",
      " street_number : 18\n",
      " address_1 : daglish street\n",
      " address_2 : oakdale\n",
      " suburb : avalon\n",
      " postcode : 3030\n",
      " state : vic\n",
      " date_of_birth : 19250313\n",
      " soc_sec_id : 1590627\n",
      "\n",
      " given_name : emiily\n",
      " surname : went\n",
      " street_number : 18\n",
      " address_1 : glenmaggie street\n",
      " address_2 : berkeley vlge\n",
      " suburb : blue haven\n",
      " postcode : 6051\n",
      " state : vic\n",
      " date_of_birth : 19521205\n",
      " soc_sec_id : 8530937\n",
      "\n",
      "0/10 positive, 11/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : jaden\n",
      " surname : humphreys\n",
      " street_number : 14\n",
      " address_1 : euree street\n",
      " address_2 : moorillyah\n",
      " suburb : thornlie\n",
      " postcode : 3116\n",
      " state : wa\n",
      " date_of_birth : 19700116\n",
      " soc_sec_id : 9382782\n",
      "\n",
      " given_name : isabelle\n",
      " surname : jolly\n",
      " street_number : 166\n",
      " address_1 : hodges street\n",
      " address_2 : bosmit\n",
      " suburb : thornlie\n",
      " postcode : 3163\n",
      " state : wa\n",
      " date_of_birth : 19050126\n",
      " soc_sec_id : 2719590\n",
      "\n",
      "0/10 positive, 12/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : brooklyn\n",
      " surname : manson\n",
      " street_number : 27\n",
      " address_1 : clive steele avenue\n",
      " address_2 : port fairy road\n",
      " suburb : yagoona\n",
      " postcode : 3450\n",
      " state : vic\n",
      " date_of_birth : 19710727\n",
      " soc_sec_id : 4493900\n",
      "\n",
      " given_name : laura\n",
      " surname : campbell\n",
      " street_number : 152\n",
      " address_1 : clive steele avenue\n",
      " address_2 : irrigation farm\n",
      " suburb : yagoona\n",
      " postcode : 3350\n",
      " state : vic\n",
      " date_of_birth : 19160610\n",
      " soc_sec_id : 6214635\n",
      "\n",
      "0/10 positive, 13/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : jacob\n",
      " surname : white\n",
      " street_number : 5\n",
      " address_1 : findlay street\n",
      " address_2 : booroopki park rmb 596\n",
      " suburb : robina\n",
      " postcode : 2197\n",
      " state : vic\n",
      " date_of_birth : 19170205\n",
      " soc_sec_id : 4702928\n",
      "\n",
      " given_name : jakob\n",
      " surname : menzies\n",
      " street_number : 33\n",
      " address_1 : coverdale street\n",
      " address_2 : bundong\n",
      " suburb : worongary\n",
      " postcode : 2190\n",
      " state : vic\n",
      " date_of_birth : 19140610\n",
      " soc_sec_id : 4557295\n",
      "\n",
      "0/10 positive, 14/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : benjamin\n",
      " surname : kirchener\n",
      " street_number : 12\n",
      " address_1 : stuart street\n",
      " address_2 : wurrami\n",
      " suburb : theodore\n",
      " postcode : 2620\n",
      " state : wa\n",
      " date_of_birth : 19751110\n",
      " soc_sec_id : 1766048\n",
      "\n",
      " given_name : max\n",
      " surname : rees\n",
      " street_number : 10\n",
      " address_1 : waite street\n",
      " address_2 : rosedown\n",
      " suburb : nambour\n",
      " postcode : 2620\n",
      " state : wa\n",
      " date_of_birth : 19751230\n",
      " soc_sec_id : 1361900\n",
      "\n",
      "0/10 positive, 15/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " n\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : brydon\n",
      " surname : webb\n",
      " street_number : 11\n",
      " address_1 : walker crescent\n",
      " address_2 : bugoren\n",
      " suburb : charlestown\n",
      " postcode : 6258\n",
      " state : nsw\n",
      " date_of_birth : 19190528\n",
      " soc_sec_id : 4191569\n",
      "\n",
      " given_name : bradley\n",
      " surname : haberfield\n",
      " street_number : 11\n",
      " address_1 : carumbo place\n",
      " address_2 : bungarra\n",
      " suburb : canley heights\n",
      " postcode : 2758\n",
      " state : vic\n",
      " date_of_birth : 19190528\n",
      " soc_sec_id : 5039500\n",
      "\n",
      "0/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : brydon\n",
      " surname : webb\n",
      " street_number : 11\n",
      " address_1 : walker crescent\n",
      " address_2 : bugoren\n",
      " suburb : charlestown\n",
      " postcode : 6258\n",
      " state : nsw\n",
      " date_of_birth : 19190528\n",
      " soc_sec_id : 4191569\n",
      "\n",
      " given_name : bradley\n",
      " surname : haberfield\n",
      " street_number : 11\n",
      " address_1 : carumbi place\n",
      " address_2 : bungarra\n",
      " suburb : canley heights\n",
      " postcode : 2758\n",
      " state : vic\n",
      " date_of_birth : 19190528\n",
      " soc_sec_id : 5039500\n",
      "\n",
      "1/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : lochlan\n",
      " surname : savidge\n",
      " street_number : 29\n",
      " address_1 : pohlman street\n",
      " address_2 : moline village\n",
      " suburb : jingili\n",
      " postcode : 3071\n",
      " state : vic\n",
      " date_of_birth : 19140228\n",
      " soc_sec_id : 1498207\n",
      "\n",
      " given_name : jayme\n",
      " surname : parr\n",
      " street_number : 2\n",
      " address_1 : clive steele avenue\n",
      " address_2 : henry kendall hostel\n",
      " suburb : hoskinstown\n",
      " postcode : 2770\n",
      " state : nsw\n",
      " date_of_birth : 19140228\n",
      " soc_sec_id : 5840194\n",
      "\n",
      "2/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : charlie\n",
      " surname : headon\n",
      " street_number : 17\n",
      " address_1 : sheehy street\n",
      " address_2 : hawkins masonic vlge\n",
      " suburb : warrandyte\n",
      " postcode : 3073\n",
      " state : vic\n",
      " date_of_birth : 19880814\n",
      " soc_sec_id : 7871445\n",
      "\n",
      " given_name : william\n",
      " surname : hislop\n",
      " street_number : 17\n",
      " address_1 : deane street\n",
      " address_2 : sunbury\n",
      " suburb : cedar creek\n",
      " postcode : 3073\n",
      " state : qld\n",
      " date_of_birth : 19830819\n",
      " soc_sec_id : 6153593\n",
      "\n",
      "3/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : mitchell\n",
      " surname : scrbak\n",
      " street_number : 950\n",
      " address_1 : holyman street\n",
      " address_2 : berkeley vlge\n",
      " suburb : safety bay\n",
      " postcode : 4300\n",
      " state : qld\n",
      " date_of_birth : 19060811\n",
      " soc_sec_id : 3592109\n",
      "\n",
      " given_name : hannah\n",
      " surname : beams\n",
      " street_number : 9\n",
      " address_1 : light street\n",
      " address_2 : castle hill farm\n",
      " suburb : sale\n",
      " postcode : 3221\n",
      " state : vic\n",
      " date_of_birth : 19560811\n",
      " soc_sec_id : 7444484\n",
      "\n",
      "4/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : finley\n",
      " surname : haeusler\n",
      " street_number : 27\n",
      " address_1 : noarlunga crescent\n",
      " address_2 : spring ridge\n",
      " suburb : nambour\n",
      " postcode : 3180\n",
      " state : vic\n",
      " date_of_birth : 19660711\n",
      " soc_sec_id : 3025217\n",
      "\n",
      " given_name : elki\n",
      " surname : trent\n",
      " street_number : 27\n",
      " address_1 : wray place\n",
      " address_2 : the mews royal hotel bldg\n",
      " suburb : gawler east\n",
      " postcode : 3152\n",
      " state : nsw\n",
      " date_of_birth : 19600211\n",
      " soc_sec_id : 5679502\n",
      "\n",
      "5/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : india\n",
      " surname : negrean\n",
      " street_number : 1\n",
      " address_1 : barringer street\n",
      " address_2 : sherwood\n",
      " suburb : parkinson\n",
      " postcode : 3168\n",
      " state : nsw\n",
      " date_of_birth : 19860923\n",
      " soc_sec_id : 2097928\n",
      "\n",
      " given_name : logan\n",
      " surname : selth\n",
      " street_number : 147\n",
      " address_1 : goyder street\n",
      " address_2 : rivonia\n",
      " suburb : queenscliff\n",
      " postcode : 2120\n",
      " state : tas\n",
      " date_of_birth : 19860921\n",
      " soc_sec_id : 4161322\n",
      "\n",
      "6/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : annablle\n",
      " surname : kounis\n",
      " street_number : 121\n",
      " address_1 : calder place\n",
      " address_2 : brambletye vinyard\n",
      " suburb : joondkalup\n",
      " postcode : 3120\n",
      " state : nsw\n",
      " date_of_birth : 19640907\n",
      " soc_sec_id : 1612956\n",
      "\n",
      " given_name : claurdia\n",
      " surname : clelland\n",
      " street_number : 12\n",
      " address_1 : box hill a venue\n",
      " address_2 : st francis vlge\n",
      " suburb : old beach\n",
      " postcode : 3127\n",
      " state : wa\n",
      " date_of_birth : 19640902\n",
      " soc_sec_id : 9508954\n",
      "\n",
      "7/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : troy\n",
      " surname : reid\n",
      " street_number : 1\n",
      " address_1 : allan street\n",
      " address_2 : townview\n",
      " suburb : page\n",
      " postcode : 2774\n",
      " state : qld\n",
      " date_of_birth : 19250727\n",
      " soc_sec_id : 3580821\n",
      "\n",
      " given_name : william\n",
      " surname : tossell\n",
      " street_number : 1\n",
      " address_1 : lutana street\n",
      " address_2 : nara cnsa\n",
      " suburb : craigmore\n",
      " postcode : 2509\n",
      " state : nsw\n",
      " date_of_birth : 19250116\n",
      " soc_sec_id : 5322906\n",
      "\n",
      "8/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : isaac\n",
      " surname : quilliam\n",
      " street_number : 11\n",
      " address_1 : namatjira drive\n",
      " address_2 : delaware\n",
      " suburb : geelong west\n",
      " postcode : 3072\n",
      " state : nsw\n",
      " date_of_birth : 19930926\n",
      " soc_sec_id : 1556150\n",
      "\n",
      " given_name : bailey\n",
      " surname : clarke\n",
      " street_number : 1\n",
      " address_1 : hetherington circuit\n",
      " address_2 : gundaline\n",
      " suburb : harden\n",
      " postcode : 2077\n",
      " state : vic\n",
      " date_of_birth : 19930416\n",
      " soc_sec_id : 6134615\n",
      "\n",
      "9/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : james\n",
      " surname : blake\n",
      " street_number : 19\n",
      " address_1 : sturt avenue\n",
      " address_2 : laloki\n",
      " suburb : carnegie\n",
      " postcode : 2218\n",
      " state : nsw\n",
      " date_of_birth : 19050716\n",
      " soc_sec_id : 7830672\n",
      "\n",
      " given_name : finn\n",
      " surname : kapoor\n",
      " street_number : 1994\n",
      " address_1 : sturt avenue\n",
      " address_2 : john flynn medical centre\n",
      " suburb : mullumbimby\n",
      " postcode : 2262\n",
      " state : vic\n",
      " date_of_birth : 19880816\n",
      " soc_sec_id : 8680815\n",
      "\n",
      "10/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " y\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " given_name : harry\n",
      " surname : ryn\n",
      " street_number : 5\n",
      " address_1 : kellway street\n",
      " address_2 : rowethorpe\n",
      " suburb : toowong\n",
      " postcode : 3931\n",
      " state : nsw\n",
      " date_of_birth : 19220503\n",
      " soc_sec_id : 7228670\n",
      "\n",
      " given_name : samantha\n",
      " surname : grierson\n",
      " street_number : 5\n",
      " address_1 : kennedy street\n",
      " address_2 : tantallon\n",
      " suburb : oakleigh\n",
      " postcode : 3034\n",
      " state : vic\n",
      " date_of_birth : 19210114\n",
      " soc_sec_id : 4683164\n",
      "\n",
      "11/10 positive, 16/10 negative\n",
      "Do these records refer to the same thing?\n",
      "(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " f\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Finished labeling\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Clustering...\n"
     ]
    },
    {
     "ename": "BlockingError",
     "evalue": "No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mBlockingError\u001b[0m                             Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[31], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m#initiate matching\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m df_final \u001b[38;5;241m=\u001b[39m pandas_dedupe\u001b[38;5;241m.\u001b[39mlink_dataframes(dfa, dfb, [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m given_name\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m surname\u001b[39m\u001b[38;5;124m'\u001b[39m,  \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m street_number\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_1\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_2\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m suburb\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m postcode\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m state\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m date_of_birth\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m soc_sec_id\u001b[39m\u001b[38;5;124m'\u001b[39m])\n",
      "File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas_dedupe/link_dataframes.py:112\u001b[0m, in \u001b[0;36mlink_dataframes\u001b[0;34m(dfa, dfb, field_properties, config_name, n_cores)\u001b[0m\n\u001b[1;32m    100\u001b[0m \u001b[38;5;66;03m# ## Blocking\u001b[39;00m\n\u001b[1;32m    101\u001b[0m \n\u001b[1;32m    102\u001b[0m \u001b[38;5;66;03m# ## Clustering\u001b[39;00m\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    108\u001b[0m \u001b[38;5;66;03m# If we had more data, we would not pass in all the blocked data into\u001b[39;00m\n\u001b[1;32m    109\u001b[0m \u001b[38;5;66;03m# this function but a representative sample.\u001b[39;00m\n\u001b[1;32m    111\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mClustering...\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m--> 112\u001b[0m linked_records \u001b[38;5;241m=\u001b[39m linker\u001b[38;5;241m.\u001b[39mjoin(data_1, data_2, \u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m    114\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m# duplicate sets\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;28mlen\u001b[39m(linked_records))\n\u001b[1;32m    117\u001b[0m \u001b[38;5;66;03m#Convert linked records into dataframe\u001b[39;00m\n",
      "File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:549\u001b[0m, in \u001b[0;36mRecordLinkMatching.join\u001b[0;34m(self, data_1, data_2, threshold, constraint)\u001b[0m\n\u001b[1;32m    543\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m constraint \u001b[38;5;129;01min\u001b[39;00m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m}, (\n\u001b[1;32m    544\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m is an invalid constraint option. Valid options include \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    545\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one, many-to-one, or many-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m constraint\n\u001b[1;32m    546\u001b[0m )\n\u001b[1;32m    548\u001b[0m pairs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpairs(data_1, data_2)\n\u001b[0;32m--> 549\u001b[0m pair_scores \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mscore(pairs)\n\u001b[1;32m    551\u001b[0m links: Links\n\u001b[1;32m    552\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m constraint \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
      "File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:125\u001b[0m, in \u001b[0;36mIntegralMatching.score\u001b[0;34m(self, pairs)\u001b[0m\n\u001b[1;32m    116\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m    117\u001b[0m \u001b[38;5;124;03mScores pairs of records. Returns pairs of tuples of records id and\u001b[39;00m\n\u001b[1;32m    118\u001b[0m \u001b[38;5;124;03massociated probabilities that the pair of records are match\u001b[39;00m\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    122\u001b[0m \n\u001b[1;32m    123\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m    124\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 125\u001b[0m     matches \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mscoreDuplicates(\n\u001b[1;32m    126\u001b[0m         pairs, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata_model\u001b[38;5;241m.\u001b[39mdistances, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclassifier, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnum_cores\n\u001b[1;32m    127\u001b[0m     )\n\u001b[1;32m    128\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m:\n\u001b[1;32m    129\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[1;32m    130\u001b[0m \u001b[38;5;250m        \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m    131\u001b[0m \u001b[38;5;124;03m        You need to either turn off multiprocessing or protect\u001b[39;00m\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    134\u001b[0m \u001b[38;5;124;03m        https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods\"\"\"\u001b[39;00m\n\u001b[1;32m    135\u001b[0m     )\n",
      "File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/core.py:126\u001b[0m, in \u001b[0;36mscoreDuplicates\u001b[0;34m(record_pairs, featurizer, classifier, num_cores)\u001b[0m\n\u001b[1;32m    124\u001b[0m first, record_pairs \u001b[38;5;241m=\u001b[39m peek(record_pairs)\n\u001b[1;32m    125\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m first \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 126\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m BlockingError(\n\u001b[1;32m    127\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo records have been blocked together. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    128\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIs the data you are trying to match like \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    129\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mthe data you trained on? If so, try adding \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    130\u001b[0m         \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmore training data.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m    131\u001b[0m     )\n\u001b[1;32m    133\u001b[0m record_pairs_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue(\u001b[38;5;241m2\u001b[39m)\n\u001b[1;32m    134\u001b[0m exception_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue()\n",
      "\u001b[0;31mBlockingError\u001b[0m: No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data."
     ]
    }
   ],
   "source": [
    "# Start interactive matching: link the records of dfa and dfb on the listed fields\n",
    "df_final = pandas_dedupe.link_dataframes(dfa, dfb, [' given_name', ' surname',  ' street_number', ' address_1', ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',' soc_sec_id'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Exercise\n",
    "Try to deduplicate the White House visitor records.\n",
    "The data is available [here](https://obamawhitehouse.archives.gov/goodgovernment/tools/visitor-records)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# References\n",
    "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019\n",
    "* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
    "* [Dedupe](https://dedupe.io/) package\n",
    "* [pandas-dedupe](https://pypi.org/project/pandas-dedupe/) package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "datacleaner": {
   "position": {
    "height": "158.667px",
    "left": "400px",
    "right": "20px",
    "top": "50px",
    "width": "700px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}