mirror of https://github.com/gsi-upm/sitc synced 2025-01-08 20:11:27 +00:00
sitc/ml21/preprocessing/05_Duplicated_Values.ipynb
2024-04-03 22:50:36 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Duplicated values\n",
"\n",
"Real-world data is often messy: the same entity may appear in several slightly different records. \n",
"\n",
"We will use the package [dedupe](https://dedupe.io/) to eliminate duplicates. \n",
"\n",
"\n",
"Some alternatives are the packages [recordlinkage](https://pypi.org/project/recordlinkage/) and [thefuzz](https://github.com/seatgeek/thefuzz).\n",
"\n",
"Instead of using the dedupe package directly, we are going to use **pandas-dedupe**, which works on pandas DataFrames:\n",
"\n",
"\n",
"**pip install pandas-dedupe**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"Let's start by loading messy data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore') # Avoid warnings\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import pandas_dedupe"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/dedupeio/dedupe-examples/master/csv_example/csv_example_messy_input.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Let's do some initial checks."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(3337, 32)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Source</th>\n",
" <th>Site name</th>\n",
" <th>Address</th>\n",
" <th>Zip</th>\n",
" <th>Phone</th>\n",
" <th>Fax</th>\n",
" <th>Program Name</th>\n",
" <th>Length of Day</th>\n",
" <th>IDHS Provider ID</th>\n",
" <th>...</th>\n",
" <th>Executive Director</th>\n",
" <th>Center Director</th>\n",
" <th>ECE Available Programs</th>\n",
" <th>NAEYC Valid Until</th>\n",
" <th>NAEYC Program Id</th>\n",
" <th>Email Address</th>\n",
" <th>Ounce of Prevention Description</th>\n",
" <th>Purple binder service type</th>\n",
" <th>Column</th>\n",
" <th>Column2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
" <td>10001 S Woodlawn Ave</td>\n",
" <td>NaN</td>\n",
" <td>2916100.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
" <td>10001 S Woodlawn Ave</td>\n",
" <td>NaN</td>\n",
" <td>2916100.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>6</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
" <td>1001 W Roosevelt Rd</td>\n",
" <td>NaN</td>\n",
" <td>9395115.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>7</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
" <td>1001 W Roosevelt Rd</td>\n",
" <td>NaN</td>\n",
" <td>9395115.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>8</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Hull House Association - Uptown Head Start / ...</td>\n",
" <td>1020 W Bryn Mawr Ave</td>\n",
" <td>NaN</td>\n",
" <td>7695753.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Hull House Association - Child Dev. Central O...</td>\n",
" <td>1030 W Van Buren St</td>\n",
" <td>NaN</td>\n",
" <td>9068600.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 32 columns</p>\n",
"</div>"
],
"text/plain": [
" Id Source \\\n",
"0 0 CPS_Early_Childhood_Portal_scrape.csv \n",
"1 1 CPS_Early_Childhood_Portal_scrape.csv \n",
"2 2 CPS_Early_Childhood_Portal_scrape.csv \n",
"3 3 CPS_Early_Childhood_Portal_scrape.csv \n",
"4 4 CPS_Early_Childhood_Portal_scrape.csv \n",
"5 5 CPS_Early_Childhood_Portal_scrape.csv \n",
"6 6 CPS_Early_Childhood_Portal_scrape.csv \n",
"7 7 CPS_Early_Childhood_Portal_scrape.csv \n",
"8 8 CPS_Early_Childhood_Portal_scrape.csv \n",
"9 9 CPS_Early_Childhood_Portal_scrape.csv \n",
"\n",
" Site name Address \\\n",
"0 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"1 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"2 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"3 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"4 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
"5 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
"6 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
"7 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
"8 Hull House Association - Uptown Head Start / ... 1020 W Bryn Mawr Ave \n",
"9 Hull House Association - Child Dev. Central O... 1030 W Van Buren St \n",
"\n",
" Zip Phone Fax Program Name Length of Day IDHS Provider ID ... \\\n",
"0 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
"1 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
"2 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
"3 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
"4 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
"5 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
"6 NaN 9395115.0 NaN Child Care EXTENDED DAY NaN ... \n",
"7 NaN 9395115.0 NaN Child Care EXTENDED DAY NaN ... \n",
"8 NaN 7695753.0 NaN Child Care EXTENDED DAY NaN ... \n",
"9 NaN 9068600.0 NaN Child Care EXTENDED DAY NaN ... \n",
"\n",
" Executive Director Center Director ECE Available Programs NAEYC Valid Until \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"\n",
" NAEYC Program Id Email Address Ounce of Prevention Description \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"5 NaN NaN NaN \n",
"6 NaN NaN NaN \n",
"7 NaN NaN NaN \n",
"8 NaN NaN NaN \n",
"9 NaN NaN NaN \n",
"\n",
" Purple binder service type Column Column2 \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"5 NaN NaN NaN \n",
"6 NaN NaN NaN \n",
"7 NaN NaN NaN \n",
"8 NaN NaN NaN \n",
"9 NaN NaN NaN \n",
"\n",
"[10 rows x 32 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
" 'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
" 'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
" 'Number per Site EHS', 'Number per Site HS', 'Director',\n",
" 'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
" 'Website', 'Executive Director', 'Center Director',\n",
" 'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
" 'Email Address', 'Ounce of Prevention Description',\n",
" 'Purple binder service type', 'Column', 'Column2'],\n",
" dtype='object')\n"
]
}
],
"source": [
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Id int64\n",
"Source object\n",
"Site name object\n",
"Address object\n",
"Zip float64\n",
"Phone float64\n",
"Fax object\n",
"Program Name object\n",
"Length of Day object\n",
"IDHS Provider ID object\n",
"Agency object\n",
"Neighborhood object\n",
"Funded Enrollment object\n",
"Program Option object\n",
"Number per Site EHS object\n",
"Number per Site HS object\n",
"Director float64\n",
"Head Start Fund float64\n",
"Eearly Head Start Fund object\n",
"CC fund object\n",
"Progmod object\n",
"Website object\n",
"Executive Director object\n",
"Center Director object\n",
"ECE Available Programs object\n",
"NAEYC Valid Until object\n",
"NAEYC Program Id float64\n",
"Email Address object\n",
"Ounce of Prevention Description object\n",
"Purple binder service type object\n",
"Column float64\n",
"Column2 object\n",
"dtype: object\n"
]
}
],
"source": [
"print(df.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Id 0\n",
"Source 0\n",
"Site name 0\n",
"Address 0\n",
"Zip 1333\n",
"Phone 146\n",
"Fax 3299\n",
"Program Name 2009\n",
"Length of Day 2009\n",
"IDHS Provider ID 3298\n",
"Agency 3325\n",
"Neighborhood 2754\n",
"Funded Enrollment 2424\n",
"Program Option 2800\n",
"Number per Site EHS 3319\n",
"Number per Site HS 3319\n",
"Director 3337\n",
"Head Start Fund 3337\n",
"Eearly Head Start Fund 2881\n",
"CC fund 2818\n",
"Progmod 2818\n",
"Website 2815\n",
"Executive Director 3114\n",
"Center Director 2874\n",
"ECE Available Programs 2379\n",
"NAEYC Valid Until 2968\n",
"NAEYC Program Id 3337\n",
"Email Address 3203\n",
"Ounce of Prevention Description 3185\n",
"Purple binder service type 3215\n",
"Column 3337\n",
"Column2 3018\n",
"dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count missing values per column\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Check duplicates"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"3332 False\n",
"3333 False\n",
"3334 False\n",
"3335 False\n",
"3336 False\n",
"Length: 3337, dtype: bool"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.duplicated()"
]
},
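{
"cell_type": "markdown",
"metadata": {},
"source": [
"`df.duplicated()` compares complete rows, so it reports `False` everywhere here: rows 0 and 1 in the preview above match in every column shown except `Id`, and `Id` is different for every record. A minimal sketch, assuming a small hypothetical frame `toy` rather than the real dataset, of how `sum()` and the `subset` parameter make this visible:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: rows 0 and 1 differ only in 'Id'\n",
"toy = pd.DataFrame({\n",
"    'Id': [0, 1, 2],\n",
"    'Site name': ['Salvation Army', 'Salvation Army', 'Hull House'],\n",
"    'Address': ['1 N Ogden Ave', '1 N Ogden Ave', '1030 W Van Buren St'],\n",
"})\n",
"\n",
"# Whole-row comparison: 'Id' makes every row unique\n",
"n_full = int(toy.duplicated().sum())\n",
"\n",
"# Ignore 'Id' by comparing only the remaining columns\n",
"cols = [c for c in toy.columns if c != 'Id']\n",
"n_subset = int(toy.duplicated(subset=cols).sum())\n",
"\n",
"print(n_full, n_subset)\n",
"```\n",
"\n",
"The same `subset` argument is accepted by `drop_duplicates`."
]
},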
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Remove duplicates"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Source</th>\n",
" <th>Site name</th>\n",
" <th>Address</th>\n",
" <th>Zip</th>\n",
" <th>Phone</th>\n",
" <th>Fax</th>\n",
" <th>Program Name</th>\n",
" <th>Length of Day</th>\n",
" <th>IDHS Provider ID</th>\n",
" <th>...</th>\n",
" <th>Executive Director</th>\n",
" <th>Center Director</th>\n",
" <th>ECE Available Programs</th>\n",
" <th>NAEYC Valid Until</th>\n",
" <th>NAEYC Program Id</th>\n",
" <th>Email Address</th>\n",
" <th>Ounce of Prevention Description</th>\n",
" <th>Purple binder service type</th>\n",
" <th>Column</th>\n",
" <th>Column2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
" <td>10001 S Woodlawn Ave</td>\n",
" <td>NaN</td>\n",
" <td>2916100.0</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 32 columns</p>\n",
"</div>"
],
"text/plain": [
" Id Source \\\n",
"0 0 CPS_Early_Childhood_Portal_scrape.csv \n",
"1 1 CPS_Early_Childhood_Portal_scrape.csv \n",
"2 2 CPS_Early_Childhood_Portal_scrape.csv \n",
"3 3 CPS_Early_Childhood_Portal_scrape.csv \n",
"4 4 CPS_Early_Childhood_Portal_scrape.csv \n",
"\n",
" Site name Address \\\n",
"0 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"1 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"2 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"3 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"4 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
"\n",
" Zip Phone Fax Program Name Length of Day IDHS Provider ID ... \\\n",
"0 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
"1 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
"2 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
"3 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
"4 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
"\n",
" Executive Director Center Director ECE Available Programs NAEYC Valid Until \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"\n",
" NAEYC Program Id Email Address Ounce of Prevention Description \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
" Purple binder service type Column Column2 \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
"[5 rows x 32 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop_duplicates(inplace=True)\n",
"df[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Remove 'real duplicates'\n",
"\n",
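"The problem is that duplicated records are rarely identical: the same site can appear with different spellings, abbreviations, or missing fields, so exact comparison does not detect them. \n",
"\n",
"We will use **dedupe** to find these near-duplicates."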
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Importing data ...\n",
"Reading from dedupe_dataframe_learned_settings\n",
"Clustering...\n",
"# duplicate sets 871\n"
]
}
],
"source": [
"# canonicalize=True adds canonical_* columns with standardized values for each cluster\n",
"df_dedupe = pandas_dedupe.dedupe_dataframe(df, ['Source', 'Site name', 'Address', 'Zip', 'Phone', 'Email Address'], canonicalize=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"If you want to retrain, you should delete the settings and training files (the `dedupe*` and `link_dataframes*` files).\n",
"\n",
"\n",
"Now, if you inspect the dataframe, you will see the duplicated records that have been clustered."
]
},
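{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result keeps every original row and adds the columns `cluster id` and `confidence` (plus the `canonical_*` columns when `canonicalize=True`). One possible way to collapse each cluster to a single row is to keep the record with the highest confidence. The sketch below assumes a small hypothetical frame `clustered` that only mimics those two columns; the values are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for the output of dedupe_dataframe\n",
"clustered = pd.DataFrame({\n",
"    'Site name': ['mary crane north 0-3', 'mary crane center north', 'ymca west side'],\n",
"    'cluster id': [7, 7, 12],\n",
"    'confidence': [0.91, 0.86, 1.0],\n",
"})\n",
"\n",
"# Sort so the most confident record comes first, then keep one row per cluster\n",
"representatives = (\n",
"    clustered.sort_values('confidence', ascending=False)\n",
"             .drop_duplicates(subset='cluster id', keep='first')\n",
"             .sort_values('cluster id')\n",
"             .reset_index(drop=True)\n",
")\n",
"\n",
"print(representatives)\n",
"```\n",
"\n",
"Keeping the most confident row is only one policy; depending on the task, merging fields across the cluster or using the canonical columns may be preferable."
]
},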
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
" 'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
" 'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
" 'Number per Site EHS', 'Number per Site HS', 'Director',\n",
" 'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
" 'Website', 'Executive Director', 'Center Director',\n",
" 'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
" 'Email Address', 'Ounce of Prevention Description',\n",
" 'Purple binder service type', 'Column', 'Column2', 'cluster id',\n",
" 'confidence', 'canonical_Id', 'canonical_Source', 'canonical_Site name',\n",
" 'canonical_Address', 'canonical_Zip', 'canonical_Phone',\n",
" 'canonical_Fax', 'canonical_Program Name', 'canonical_Length of Day',\n",
" 'canonical_IDHS Provider ID', 'canonical_Agency',\n",
" 'canonical_Neighborhood', 'canonical_Funded Enrollment',\n",
" 'canonical_Program Option', 'canonical_Number per Site EHS',\n",
" 'canonical_Number per Site HS', 'canonical_Director',\n",
" 'canonical_Head Start Fund', 'canonical_Eearly Head Start Fund',\n",
" 'canonical_CC fund', 'canonical_Progmod', 'canonical_Website',\n",
" 'canonical_Executive Director', 'canonical_Center Director',\n",
" 'canonical_ECE Available Programs', 'canonical_NAEYC Valid Until',\n",
" 'canonical_NAEYC Program Id', 'canonical_Email Address',\n",
" 'canonical_Ounce of Prevention Description',\n",
" 'canonical_Purple binder service type', 'canonical_Column',\n",
" 'canonical_Column2'],\n",
" dtype='object')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dedupe.columns"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Source</th>\n",
" <th>Site name</th>\n",
" <th>Address</th>\n",
" <th>Zip</th>\n",
" <th>Phone</th>\n",
" <th>Fax</th>\n",
" <th>Program Name</th>\n",
" <th>Length of Day</th>\n",
" <th>IDHS Provider ID</th>\n",
" <th>...</th>\n",
" <th>canonical_Executive Director</th>\n",
" <th>canonical_Center Director</th>\n",
" <th>canonical_ECE Available Programs</th>\n",
" <th>canonical_NAEYC Valid Until</th>\n",
" <th>canonical_NAEYC Program Id</th>\n",
" <th>canonical_Email Address</th>\n",
" <th>canonical_Ounce of Prevention Description</th>\n",
" <th>canonical_Purple binder service type</th>\n",
" <th>canonical_Column</th>\n",
" <th>canonical_Column2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3327</th>\n",
" <td>3327</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>precious infants &amp; tots learning center</td>\n",
" <td>624 e 47th street</td>\n",
" <td>60653.0</td>\n",
" <td>2682685.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>early head start</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3300</th>\n",
" <td>3300</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>ywca metropolitan chicago</td>\n",
" <td>360 n michigan avenue</td>\n",
" <td>60601.0</td>\n",
" <td>3726600.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3299</th>\n",
" <td>3299</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>ymca west side</td>\n",
" <td>5080 w harrison street</td>\n",
" <td>60644.0</td>\n",
" <td>9553100.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3285</th>\n",
" <td>3285</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>woodlawn organization</td>\n",
" <td>6040 s harper avenue</td>\n",
" <td>60637.0</td>\n",
" <td>2885840.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3281</th>\n",
" <td>3281</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>urban family and community centers</td>\n",
" <td>4241 w washington boulevard</td>\n",
" <td>60624.0</td>\n",
" <td>7228333.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1837</th>\n",
" <td>1837</td>\n",
" <td>chapin_dfss_providers_2011_070212.csv</td>\n",
" <td>north avenue day nursery fcch-carolyn price</td>\n",
" <td>2020 w jackson</td>\n",
" <td>60612.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
" <td>www.crcl.net</td>\n",
" <td>cp</td>\n",
" <td>betty lee</td>\n",
" <td></td>\n",
" <td>05/31/14</td>\n",
" <td>723127</td>\n",
" <td>youngt@crcl.net</td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2643</th>\n",
" <td>2643</td>\n",
" <td>ece chicago find a school scrape.csv</td>\n",
" <td>mary crane north 0-3</td>\n",
" <td>2905 n. leavitt</td>\n",
" <td>60618.0</td>\n",
" <td>3485528.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" <td>www.marycrane.org</td>\n",
" <td>lavetter terry</td>\n",
" <td>martuice williams</td>\n",
" <td></td>\n",
" <td>08/01/16</td>\n",
" <td>722999</td>\n",
" <td>info@marycrane.org</td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3228</th>\n",
" <td>3228</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>mary crane center north</td>\n",
" <td>2905 n leavitt street</td>\n",
" <td>60618.0</td>\n",
" <td>9753322.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" <td>www.marycrane.org</td>\n",
" <td>lavetter terry</td>\n",
" <td>martuice williams</td>\n",
" <td></td>\n",
" <td>08/01/16</td>\n",
" <td>722999</td>\n",
" <td>info@marycrane.org</td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3229</th>\n",
" <td>3229</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>mary crane family and day care center</td>\n",
" <td>2905 n clybourn avenue</td>\n",
" <td>60618.0</td>\n",
" <td>3485528.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" <td>www.marycrane.org</td>\n",
" <td>lavetter terry</td>\n",
" <td>martuice williams</td>\n",
" <td></td>\n",
" <td>08/01/16</td>\n",
" <td>722999</td>\n",
" <td>info@marycrane.org</td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3241</th>\n",
" <td>3241</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>our lady of guadalupe early childhood center</td>\n",
" <td>9129 s burley avenue</td>\n",
" <td>60617.0</td>\n",
" <td>9785320.0</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
" <td>www.catholiccharities.net</td>\n",
" <td>laura rios</td>\n",
" <td>deborah o'brien</td>\n",
" <td></td>\n",
" <td>01/31/13</td>\n",
" <td>486949</td>\n",
" <td>pgutierr@catholiccharities.net</td>\n",
" <td></td>\n",
" <td>child care</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3337 rows × 66 columns</p>\n",
"</div>"
],
"text/plain": [
" Id Source \\\n",
"3327 3327 purple_binder_early_childhood.csv \n",
"3300 3300 purple_binder_early_childhood.csv \n",
"3299 3299 purple_binder_early_childhood.csv \n",
"3285 3285 purple_binder_early_childhood.csv \n",
"3281 3281 purple_binder_early_childhood.csv \n",
"... ... ... \n",
"1837 1837 chapin_dfss_providers_2011_070212.csv \n",
"2643 2643 ece chicago find a school scrape.csv \n",
"3228 3228 purple_binder_early_childhood.csv \n",
"3229 3229 purple_binder_early_childhood.csv \n",
"3241 3241 purple_binder_early_childhood.csv \n",
"\n",
" Site name \\\n",
"3327 precious infants & tots learning center \n",
"3300 ywca metropolitan chicago \n",
"3299 ymca west side \n",
"3285 woodlawn organization \n",
"3281 urban family and community centers \n",
"... ... \n",
"1837 north avenue day nursery fcch-carolyn price \n",
"2643 mary crane north 0-3 \n",
"3228 mary crane center north \n",
"3229 mary crane family and day care center \n",
"3241 our lady of guadalupe early childhood center \n",
"\n",
" Address Zip Phone Fax Program Name \\\n",
"3327 624 e 47th street 60653.0 2682685.0 None None \n",
"3300 360 n michigan avenue 60601.0 3726600.0 None None \n",
"3299 5080 w harrison street 60644.0 9553100.0 None None \n",
"3285 6040 s harper avenue 60637.0 2885840.0 None None \n",
"3281 4241 w washington boulevard 60624.0 7228333.0 None None \n",
"... ... ... ... ... ... \n",
"1837 2020 w jackson 60612.0 None None None \n",
"2643 2905 n. leavitt 60618.0 3485528.0 None None \n",
"3228 2905 n leavitt street 60618.0 9753322.0 None None \n",
"3229 2905 n clybourn avenue 60618.0 3485528.0 None None \n",
"3241 9129 s burley avenue 60617.0 9785320.0 None None \n",
"\n",
" Length of Day IDHS Provider ID ... \\\n",
"3327 None None ... \n",
"3300 None None ... \n",
"3299 None None ... \n",
"3285 None None ... \n",
"3281 None None ... \n",
"... ... ... ... \n",
"1837 None None ... \n",
"2643 None None ... \n",
"3228 None None ... \n",
"3229 None None ... \n",
"3241 None None ... \n",
"\n",
" canonical_Executive Director \\\n",
"3327 \n",
"3300 \n",
"3299 \n",
"3285 \n",
"3281 \n",
"... ... \n",
"1837 5833ip(ehs collaboration enhanced home ip), 58... \n",
"2643 4374it(child care it center), 4374ps(child car... \n",
"3228 4374it(child care it center), 4374ps(child car... \n",
"3229 4374it(child care it center), 4374ps(child car... \n",
"3241 2488it(child care it center), 7030ps(hs collab... \n",
"\n",
" canonical_Center Director canonical_ECE Available Programs \\\n",
"3327 \n",
"3300 \n",
"3299 \n",
"3285 \n",
"3281 \n",
"... ... ... \n",
"1837 www.crcl.net cp \n",
"2643 www.marycrane.org lavetter terry \n",
"3228 www.marycrane.org lavetter terry \n",
"3229 www.marycrane.org lavetter terry \n",
"3241 www.catholiccharities.net laura rios \n",
"\n",
" canonical_NAEYC Valid Until canonical_NAEYC Program Id \\\n",
"3327 \n",
"3300 \n",
"3299 \n",
"3285 \n",
"3281 \n",
"... ... ... \n",
"1837 betty lee \n",
"2643 martuice williams \n",
"3228 martuice williams \n",
"3229 martuice williams \n",
"3241 deborah o'brien \n",
"\n",
" canonical_Email Address canonical_Ounce of Prevention Description \\\n",
"3327 \n",
"3300 \n",
"3299 \n",
"3285 \n",
"3281 \n",
"... ... ... \n",
"1837 05/31/14 723127 \n",
"2643 08/01/16 722999 \n",
"3228 08/01/16 722999 \n",
"3229 08/01/16 722999 \n",
"3241 01/31/13 486949 \n",
"\n",
" canonical_Purple binder service type canonical_Column canonical_Column2 \n",
"3327 early head start \n",
"3300 child care \n",
"3299 child care \n",
"3285 child care \n",
"3281 child care \n",
"... ... ... ... \n",
"1837 youngt@crcl.net child care \n",
"2643 info@marycrane.org child care \n",
"3228 info@marycrane.org child care \n",
"3229 info@marycrane.org child care \n",
"3241 pgutierr@catholiccharities.net child care \n",
"\n",
"[3337 rows x 66 columns]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sorted = df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)\n",
"df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>cluster id</th>\n",
" <th>confidence</th>\n",
" <th>Source</th>\n",
" <th>Zip</th>\n",
" <th>Address</th>\n",
" <th>canonical_Executive Director</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3327</th>\n",
" <td>3327</td>\n",
" <td>870</td>\n",
" <td>1.000000</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60653.0</td>\n",
" <td>624 e 47th street</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3300</th>\n",
" <td>3300</td>\n",
" <td>869</td>\n",
" <td>1.000000</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60601.0</td>\n",
" <td>360 n michigan avenue</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3299</th>\n",
" <td>3299</td>\n",
" <td>868</td>\n",
" <td>1.000000</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60644.0</td>\n",
" <td>5080 w harrison street</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3285</th>\n",
" <td>3285</td>\n",
" <td>867</td>\n",
" <td>1.000000</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60637.0</td>\n",
" <td>6040 s harper avenue</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3281</th>\n",
" <td>3281</td>\n",
" <td>866</td>\n",
" <td>1.000000</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60624.0</td>\n",
" <td>4241 w washington boulevard</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1837</th>\n",
" <td>1837</td>\n",
" <td>40</td>\n",
" <td>0.133246</td>\n",
" <td>chapin_dfss_providers_2011_070212.csv</td>\n",
" <td>60612.0</td>\n",
" <td>2020 w jackson</td>\n",
" <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2643</th>\n",
" <td>2643</td>\n",
" <td>31</td>\n",
" <td>0.130475</td>\n",
" <td>ece chicago find a school scrape.csv</td>\n",
" <td>60618.0</td>\n",
" <td>2905 n. leavitt</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3228</th>\n",
" <td>3228</td>\n",
" <td>31</td>\n",
" <td>0.130475</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60618.0</td>\n",
" <td>2905 n leavitt street</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3229</th>\n",
" <td>3229</td>\n",
" <td>31</td>\n",
" <td>0.130474</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60618.0</td>\n",
" <td>2905 n clybourn avenue</td>\n",
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3241</th>\n",
" <td>3241</td>\n",
" <td>27</td>\n",
" <td>0.058419</td>\n",
" <td>purple_binder_early_childhood.csv</td>\n",
" <td>60617.0</td>\n",
" <td>9129 s burley avenue</td>\n",
" <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3337 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" Id cluster id confidence Source \\\n",
"3327 3327 870 1.000000 purple_binder_early_childhood.csv \n",
"3300 3300 869 1.000000 purple_binder_early_childhood.csv \n",
"3299 3299 868 1.000000 purple_binder_early_childhood.csv \n",
"3285 3285 867 1.000000 purple_binder_early_childhood.csv \n",
"3281 3281 866 1.000000 purple_binder_early_childhood.csv \n",
"... ... ... ... ... \n",
"1837 1837 40 0.133246 chapin_dfss_providers_2011_070212.csv \n",
"2643 2643 31 0.130475 ece chicago find a school scrape.csv \n",
"3228 3228 31 0.130475 purple_binder_early_childhood.csv \n",
"3229 3229 31 0.130474 purple_binder_early_childhood.csv \n",
"3241 3241 27 0.058419 purple_binder_early_childhood.csv \n",
"\n",
" Zip Address \\\n",
"3327 60653.0 624 e 47th street \n",
"3300 60601.0 360 n michigan avenue \n",
"3299 60644.0 5080 w harrison street \n",
"3285 60637.0 6040 s harper avenue \n",
"3281 60624.0 4241 w washington boulevard \n",
"... ... ... \n",
"1837 60612.0 2020 w jackson \n",
"2643 60618.0 2905 n. leavitt \n",
"3228 60618.0 2905 n leavitt street \n",
"3229 60618.0 2905 n clybourn avenue \n",
"3241 60617.0 9129 s burley avenue \n",
"\n",
" canonical_Executive Director \n",
"3327 \n",
"3300 \n",
"3299 \n",
"3285 \n",
"3281 \n",
"... ... \n",
"1837 5833ip(ehs collaboration enhanced home ip), 58... \n",
"2643 4374it(child care it center), 4374ps(child car... \n",
"3228 4374it(child care it center), 4374ps(child car... \n",
"3229 4374it(child care it center), 4374ps(child car... \n",
"3241 2488it(child care it center), 7030ps(hs collab... \n",
"\n",
"[3337 rows x 7 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sorted[['Id', 'cluster id', 'confidence', 'Source', 'Zip', 'Address', 'canonical_Executive Director']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Matching / Linking records\n",
"Another problem is matching / linking records from different sources."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let's load two datasets from FEBRl (Freely extensible biomedical record linkage)."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"#load dataframes\n",
"dfa = pd.read_csv('data/dataset1-febrl.csv')\n",
"dfb = pd.read_csv('data/dataset2-febrl.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We cannot have missing values for applying record matching with this library, so we fill them.\n",
"\n",
"The problem is that many values are ' ' (not NaN). So, we first convert to NaN, and then we drop them."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"dfa.replace(['', ' '], np.nan, inplace=True)\n",
"dfb.replace(['', ' '], np.nan, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"dfa.dropna(inplace=True)\n",
"dfb.dropna(inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>rec_id</th>\n",
" <th>given_name</th>\n",
" <th>surname</th>\n",
" <th>street_number</th>\n",
" <th>address_1</th>\n",
" <th>address_2</th>\n",
" <th>suburb</th>\n",
" <th>postcode</th>\n",
" <th>state</th>\n",
" <th>date_of_birth</th>\n",
" <th>soc_sec_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>rec-122-org</td>\n",
" <td>lachlan</td>\n",
" <td>berry</td>\n",
" <td>69</td>\n",
" <td>giblin street</td>\n",
" <td>killarney</td>\n",
" <td>bittern</td>\n",
" <td>4814</td>\n",
" <td>qld</td>\n",
" <td>19990219</td>\n",
" <td>7364009</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>rec-373-org</td>\n",
" <td>deakin</td>\n",
" <td>sondergeld</td>\n",
" <td>48</td>\n",
" <td>goldfinch circuit</td>\n",
" <td>kooltuo</td>\n",
" <td>canterbury</td>\n",
" <td>2776</td>\n",
" <td>vic</td>\n",
" <td>19600210</td>\n",
" <td>2635962</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>rec-227-org</td>\n",
" <td>luke</td>\n",
" <td>purdon</td>\n",
" <td>23</td>\n",
" <td>ramsay place</td>\n",
" <td>mirani</td>\n",
" <td>garbutt</td>\n",
" <td>2260</td>\n",
" <td>vic</td>\n",
" <td>19831024</td>\n",
" <td>8099933</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>rec-294-org</td>\n",
" <td>william</td>\n",
" <td>bishop</td>\n",
" <td>21</td>\n",
" <td>neworra place</td>\n",
" <td>apmnt 65</td>\n",
" <td>worongary</td>\n",
" <td>6225</td>\n",
" <td>qld</td>\n",
" <td>19490130</td>\n",
" <td>9773843</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>rec-81-dup-0</td>\n",
" <td>abbey</td>\n",
" <td>fit</td>\n",
" <td>13</td>\n",
" <td>kosciusko avenue</td>\n",
" <td>the wharf complex</td>\n",
" <td>yass</td>\n",
" <td>2594</td>\n",
" <td>nsw</td>\n",
" <td>19870510</td>\n",
" <td>7661096</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>rec-34-org</td>\n",
" <td>isabella</td>\n",
" <td>lodder</td>\n",
" <td>156</td>\n",
" <td>messenger street</td>\n",
" <td>tongbong sanctuary</td>\n",
" <td>bayswater</td>\n",
" <td>4870</td>\n",
" <td>vic</td>\n",
" <td>19650714</td>\n",
" <td>2790666</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>rec-478-org</td>\n",
" <td>anthony</td>\n",
" <td>beazley</td>\n",
" <td>12</td>\n",
" <td>birubi place</td>\n",
" <td>currandina</td>\n",
" <td>flemington</td>\n",
" <td>2477</td>\n",
" <td>qld</td>\n",
" <td>19730924</td>\n",
" <td>6558077</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>rec-225-org</td>\n",
" <td>alia</td>\n",
" <td>streich</td>\n",
" <td>74</td>\n",
" <td>maranoa street</td>\n",
" <td>rocky bend</td>\n",
" <td>rowville</td>\n",
" <td>6152</td>\n",
" <td>vic</td>\n",
" <td>19790418</td>\n",
" <td>1975340</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>rec-452-org</td>\n",
" <td>alissa</td>\n",
" <td>kilmartin</td>\n",
" <td>37</td>\n",
" <td>reveley crescent</td>\n",
" <td>crown allot</td>\n",
" <td>wolumla</td>\n",
" <td>6210</td>\n",
" <td>nsw</td>\n",
" <td>19041118</td>\n",
" <td>7994055</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>rec-67-org</td>\n",
" <td>jacob</td>\n",
" <td>lyden</td>\n",
" <td>25</td>\n",
" <td>haddon street</td>\n",
" <td>glenview</td>\n",
" <td>woodville north</td>\n",
" <td>2226</td>\n",
" <td>qld</td>\n",
" <td>19910424</td>\n",
" <td>6426415</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rec_id given_name surname street_number address_1 \\\n",
"1 rec-122-org lachlan berry 69 giblin street \n",
"2 rec-373-org deakin sondergeld 48 goldfinch circuit \n",
"4 rec-227-org luke purdon 23 ramsay place \n",
"7 rec-294-org william bishop 21 neworra place \n",
"10 rec-81-dup-0 abbey fit 13 kosciusko avenue \n",
"11 rec-34-org isabella lodder 156 messenger street \n",
"12 rec-478-org anthony beazley 12 birubi place \n",
"13 rec-225-org alia streich 74 maranoa street \n",
"15 rec-452-org alissa kilmartin 37 reveley crescent \n",
"16 rec-67-org jacob lyden 25 haddon street \n",
"\n",
" address_2 suburb postcode state date_of_birth \\\n",
"1 killarney bittern 4814 qld 19990219 \n",
"2 kooltuo canterbury 2776 vic 19600210 \n",
"4 mirani garbutt 2260 vic 19831024 \n",
"7 apmnt 65 worongary 6225 qld 19490130 \n",
"10 the wharf complex yass 2594 nsw 19870510 \n",
"11 tongbong sanctuary bayswater 4870 vic 19650714 \n",
"12 currandina flemington 2477 qld 19730924 \n",
"13 rocky bend rowville 6152 vic 19790418 \n",
"15 crown allot wolumla 6210 nsw 19041118 \n",
"16 glenview woodville north 2226 qld 19910424 \n",
"\n",
" soc_sec_id \n",
"1 7364009 \n",
"2 2635962 \n",
"4 8099933 \n",
"7 9773843 \n",
"10 7661096 \n",
"11 2790666 \n",
"12 6558077 \n",
"13 1975340 \n",
"15 7994055 \n",
"16 6426415 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfa.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>rec_id</th>\n",
" <th>given_name</th>\n",
" <th>surname</th>\n",
" <th>street_number</th>\n",
" <th>address_1</th>\n",
" <th>address_2</th>\n",
" <th>suburb</th>\n",
" <th>postcode</th>\n",
" <th>state</th>\n",
" <th>date_of_birth</th>\n",
" <th>soc_sec_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>rec-2778-org</td>\n",
" <td>sarah</td>\n",
" <td>bruhn</td>\n",
" <td>44</td>\n",
" <td>forbes street</td>\n",
" <td>wintersloe</td>\n",
" <td>kellerberrin</td>\n",
" <td>4510</td>\n",
" <td>vic</td>\n",
" <td>19300213</td>\n",
" <td>7535316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>rec-712-dup-0</td>\n",
" <td>jacob</td>\n",
" <td>lanyon</td>\n",
" <td>5</td>\n",
" <td>milne cove</td>\n",
" <td>wellwod</td>\n",
" <td>beaconsfield upper</td>\n",
" <td>2602</td>\n",
" <td>vic</td>\n",
" <td>19080712</td>\n",
" <td>9497788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>rec-1321-org</td>\n",
" <td>brinley</td>\n",
" <td>efthimiou</td>\n",
" <td>35</td>\n",
" <td>sturdee crescent</td>\n",
" <td>tremearne</td>\n",
" <td>scarborough</td>\n",
" <td>5211</td>\n",
" <td>qld</td>\n",
" <td>19940319</td>\n",
" <td>6814956</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>rec-3004-org</td>\n",
" <td>aleisha</td>\n",
" <td>hobson</td>\n",
" <td>54</td>\n",
" <td>oliver street</td>\n",
" <td>inglewood</td>\n",
" <td>toowoomba</td>\n",
" <td>3175</td>\n",
" <td>qld</td>\n",
" <td>19290427</td>\n",
" <td>5967384</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>rec-1384-org</td>\n",
" <td>ethan</td>\n",
" <td>gazzola</td>\n",
" <td>49</td>\n",
" <td>sheaffe street</td>\n",
" <td>bimby vale</td>\n",
" <td>port pirie</td>\n",
" <td>3088</td>\n",
" <td>sa</td>\n",
" <td>19631225</td>\n",
" <td>3832742</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>rec-3981-org</td>\n",
" <td>alicia</td>\n",
" <td>hope</td>\n",
" <td>100</td>\n",
" <td>mansfield place</td>\n",
" <td>sunset</td>\n",
" <td>byford</td>\n",
" <td>6061</td>\n",
" <td>sa</td>\n",
" <td>19421201</td>\n",
" <td>7934773</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>rec-916-org</td>\n",
" <td>benjamin</td>\n",
" <td>kolosche</td>\n",
" <td>78</td>\n",
" <td>keenan street</td>\n",
" <td>wingara</td>\n",
" <td>raymond terrace</td>\n",
" <td>3212</td>\n",
" <td>sa</td>\n",
" <td>19450918</td>\n",
" <td>5698873</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>rec-63-dup-0</td>\n",
" <td>olivia</td>\n",
" <td>white</td>\n",
" <td>55</td>\n",
" <td>duffy street</td>\n",
" <td>shopping village</td>\n",
" <td>mirrabooka</td>\n",
" <td>2260</td>\n",
" <td>vic</td>\n",
" <td>19000106</td>\n",
" <td>4996142</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>rec-112-org</td>\n",
" <td>joshua</td>\n",
" <td>rudd</td>\n",
" <td>78</td>\n",
" <td>max henry crescent</td>\n",
" <td>brentwood vlge</td>\n",
" <td>port douglas</td>\n",
" <td>2315</td>\n",
" <td>vic</td>\n",
" <td>19951125</td>\n",
" <td>1697892</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>rec-3297-org</td>\n",
" <td>rachael</td>\n",
" <td>lomman</td>\n",
" <td>37</td>\n",
" <td>carlile street</td>\n",
" <td>clonturkle</td>\n",
" <td>bronte</td>\n",
" <td>2177</td>\n",
" <td>nsw</td>\n",
" <td>19910228</td>\n",
" <td>9462397</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rec_id given_name surname street_number address_1 \\\n",
"0 rec-2778-org sarah bruhn 44 forbes street \n",
"1 rec-712-dup-0 jacob lanyon 5 milne cove \n",
"2 rec-1321-org brinley efthimiou 35 sturdee crescent \n",
"3 rec-3004-org aleisha hobson 54 oliver street \n",
"4 rec-1384-org ethan gazzola 49 sheaffe street \n",
"5 rec-3981-org alicia hope 100 mansfield place \n",
"6 rec-916-org benjamin kolosche 78 keenan street \n",
"8 rec-63-dup-0 olivia white 55 duffy street \n",
"10 rec-112-org joshua rudd 78 max henry crescent \n",
"11 rec-3297-org rachael lomman 37 carlile street \n",
"\n",
" address_2 suburb postcode state date_of_birth \\\n",
"0 wintersloe kellerberrin 4510 vic 19300213 \n",
"1 wellwod beaconsfield upper 2602 vic 19080712 \n",
"2 tremearne scarborough 5211 qld 19940319 \n",
"3 inglewood toowoomba 3175 qld 19290427 \n",
"4 bimby vale port pirie 3088 sa 19631225 \n",
"5 sunset byford 6061 sa 19421201 \n",
"6 wingara raymond terrace 3212 sa 19450918 \n",
"8 shopping village mirrabooka 2260 vic 19000106 \n",
"10 brentwood vlge port douglas 2315 vic 19951125 \n",
"11 clonturkle bronte 2177 nsw 19910228 \n",
"\n",
" soc_sec_id \n",
"0 7535316 \n",
"1 9497788 \n",
"2 6814956 \n",
"3 5967384 \n",
"4 3832742 \n",
"5 7934773 \n",
"6 5698873 \n",
"8 4996142 \n",
"10 1697892 \n",
"11 9462397 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfb.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Check the two datasets have the same columns."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
" ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
" ' soc_sec_id'],\n",
" dtype='object')\n",
"Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
" ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
" ' soc_sec_id'],\n",
" dtype='object')\n"
]
}
],
"source": [
"print(dfa.columns)\n",
"print(dfb.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's match..."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Importing data ...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : reeve\n",
" surname : quilliam\n",
" street_number : 2\n",
" address_1 : renwick street\n",
" address_2 : yarrabee\n",
" suburb : barwon heads\n",
" postcode : 2340\n",
" state : nsw\n",
" date_of_birth : 19810406\n",
" soc_sec_id : 1066923\n",
"\n",
" given_name : jessica\n",
" surname : reid\n",
" street_number : 280\n",
" address_1 : medley street\n",
" address_2 : warra creek\n",
" suburb : ballarat\n",
" postcode : 3149\n",
" state : nsw\n",
" date_of_birth : 19830907\n",
" soc_sec_id : 1067529\n",
"\n",
"0/10 positive, 0/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting active labeling...\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : daniel\n",
" surname : couzens\n",
" street_number : 37\n",
" address_1 : coventry close\n",
" address_2 : cressbrook\n",
" suburb : mount eliza\n",
" postcode : 5073\n",
" state : nsw\n",
" date_of_birth : 19881127\n",
" soc_sec_id : 6934299\n",
"\n",
" given_name : dante\n",
" surname : dakin\n",
" street_number : 3\n",
" address_1 : chuculba crescent\n",
" address_2 : greenpatch\n",
" suburb : forbes\n",
" postcode : 5072\n",
" state : nsw\n",
" date_of_birth : 19481028\n",
" soc_sec_id : 7288639\n",
"\n",
"0/10 positive, 1/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : lachlan\n",
" surname : jukic\n",
" street_number : 2\n",
" address_1 : morgan crescent\n",
" address_2 : parklea\n",
" suburb : raymond terrace\n",
" postcode : 2250\n",
" state : nsw\n",
" date_of_birth : 19780702\n",
" soc_sec_id : 4027998\n",
"\n",
" given_name : meg\n",
" surname : feil\n",
" street_number : 17\n",
" address_1 : biraban place\n",
" address_2 : hughloch lincoln red stud\n",
" suburb : hawthorne\n",
" postcode : 3429\n",
" state : vic\n",
" date_of_birth : 19060812\n",
" soc_sec_id : 4027997\n",
"\n",
"0/10 positive, 2/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : jacob\n",
" surname : lyen\n",
" street_number : 25\n",
" address_1 : haddon srteet\n",
" address_2 : glenvie w\n",
" suburb : woodville north\n",
" postcode : 2226\n",
" state : qld\n",
" date_of_birth : 19910424\n",
" soc_sec_id : 6426415\n",
"\n",
" given_name : zac\n",
" surname : white\n",
" street_number : 26\n",
" address_1 : companion crescent\n",
" address_2 : glenview\n",
" suburb : toronto\n",
" postcode : 2226\n",
" state : sa\n",
" date_of_birth : 19431117\n",
" soc_sec_id : 3437945\n",
"\n",
"0/10 positive, 3/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : kydan\n",
" surname : mccarthy\n",
" street_number : 67\n",
" address_1 : clemenger street\n",
" address_2 : the points holsteins\n",
" suburb : fairlawn\n",
" postcode : 6415\n",
" state : nsw\n",
" date_of_birth : 19720518\n",
" soc_sec_id : 6527653\n",
"\n",
" given_name : daniel\n",
" surname : mccarthy\n",
" street_number : 6\n",
" address_1 : brunton street\n",
" address_2 : tall pines\n",
" suburb : fairlight\n",
" postcode : 3155\n",
" state : nsw\n",
" date_of_birth : 19760107\n",
" soc_sec_id : 8093038\n",
"\n",
"0/10 positive, 4/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : brooklyn\n",
" surname : manson\n",
" street_number : 27\n",
" address_1 : clive steele avenue\n",
" address_2 : port fairy road\n",
" suburb : mount low\n",
" postcode : 3450\n",
" state : vic\n",
" date_of_birth : 19710727\n",
" soc_sec_id : 4493900\n",
"\n",
" given_name : ruby\n",
" surname : mason\n",
" street_number : 7\n",
" address_1 : clive steele avenue\n",
" address_2 : kooyong\n",
" suburb : botany\n",
" postcode : 3636\n",
" state : vic\n",
" date_of_birth : 19730913\n",
" soc_sec_id : 4397223\n",
"\n",
"0/10 positive, 5/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : emiily\n",
" surname : coleman\n",
" street_number : 108\n",
" address_1 : chewings street\n",
" address_2 : berkeley vlge\n",
" suburb : wellington point\n",
" postcode : 2550\n",
" state : nsw\n",
" date_of_birth : 19421221\n",
" soc_sec_id : 7206933\n",
"\n",
" given_name : emiily\n",
" surname : went\n",
" street_number : 18\n",
" address_1 : glenmaggie street\n",
" address_2 : berkeley vlge\n",
" suburb : blue haven\n",
" postcode : 6051\n",
" state : vic\n",
" date_of_birth : 19521205\n",
" soc_sec_id : 8530937\n",
"\n",
"0/10 positive, 6/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : jacob\n",
" surname : white\n",
" street_number : 5\n",
" address_1 : findlay street\n",
" address_2 : booroopki park rmb 596\n",
" suburb : robina\n",
" postcode : 2197\n",
" state : vic\n",
" date_of_birth : 19170205\n",
" soc_sec_id : 4702928\n",
"\n",
" given_name : talia\n",
" surname : reid\n",
" street_number : 147\n",
" address_1 : sid barnes crescent\n",
" address_2 : tathra\n",
" suburb : berowra heights\n",
" postcode : 2170\n",
" state : vic\n",
" date_of_birth : 19230203\n",
" soc_sec_id : 4712927\n",
"\n",
"0/10 positive, 7/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : carla\n",
" surname : amiet\n",
" street_number : 50\n",
" address_1 : carstensz street\n",
" address_2 : oaklodge\n",
" suburb : blackmans bay\n",
" postcode : 3180\n",
" state : nsw\n",
" date_of_birth : 19790801\n",
" soc_sec_id : 9646483\n",
"\n",
" given_name : cameron\n",
" surname : coleman\n",
" street_number : 10\n",
" address_1 : edwards street\n",
" address_2 : broadbridge manor\n",
" suburb : blackmans bay\n",
" postcode : 3630\n",
" state : nsw\n",
" date_of_birth : 19871030\n",
" soc_sec_id : 5502408\n",
"\n",
"0/10 positive, 8/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : lara\n",
" surname : matthews\n",
" street_number : 6\n",
" address_1 : beaumaris street\n",
" address_2 : sunforest therapy centre\n",
" suburb : the basin\n",
" postcode : 4179\n",
" state : nsw\n",
" date_of_birth : 19911006\n",
" soc_sec_id : 2164704\n",
"\n",
" given_name : dominic\n",
" surname : matthews\n",
" street_number : 67\n",
" address_1 : campbell street\n",
" address_2 : narraburra lodge\n",
" suburb : coonabarabran\n",
" postcode : 3174\n",
" state : nsw\n",
" date_of_birth : 19470226\n",
" soc_sec_id : 3115384\n",
"\n",
"0/10 positive, 9/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : jack\n",
" surname : matthews\n",
" street_number : 17\n",
" address_1 : herron crescent\n",
" address_2 : broadmere\n",
" suburb : highton\n",
" postcode : 2035\n",
" state : vic\n",
" date_of_birth : 19081119\n",
" soc_sec_id : 5613395\n",
"\n",
" given_name : alexandra\n",
" surname : matthews\n",
" street_number : 174\n",
" address_1 : port jackson circuit\n",
" address_2 : old timers south\n",
" suburb : whaleback\n",
" postcode : 2830\n",
" state : vic\n",
" date_of_birth : 19261017\n",
" soc_sec_id : 2919332\n",
"\n",
"0/10 positive, 10/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : emiily\n",
" surname : kowald\n",
" street_number : 18\n",
" address_1 : daglish street\n",
" address_2 : oakdale\n",
" suburb : avalon\n",
" postcode : 3030\n",
" state : vic\n",
" date_of_birth : 19250313\n",
" soc_sec_id : 1590627\n",
"\n",
" given_name : emiily\n",
" surname : went\n",
" street_number : 18\n",
" address_1 : glenmaggie street\n",
" address_2 : berkeley vlge\n",
" suburb : blue haven\n",
" postcode : 6051\n",
" state : vic\n",
" date_of_birth : 19521205\n",
" soc_sec_id : 8530937\n",
"\n",
"0/10 positive, 11/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : jaden\n",
" surname : humphreys\n",
" street_number : 14\n",
" address_1 : euree street\n",
" address_2 : moorillyah\n",
" suburb : thornlie\n",
" postcode : 3116\n",
" state : wa\n",
" date_of_birth : 19700116\n",
" soc_sec_id : 9382782\n",
"\n",
" given_name : isabelle\n",
" surname : jolly\n",
" street_number : 166\n",
" address_1 : hodges street\n",
" address_2 : bosmit\n",
" suburb : thornlie\n",
" postcode : 3163\n",
" state : wa\n",
" date_of_birth : 19050126\n",
" soc_sec_id : 2719590\n",
"\n",
"0/10 positive, 12/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : brooklyn\n",
" surname : manson\n",
" street_number : 27\n",
" address_1 : clive steele avenue\n",
" address_2 : port fairy road\n",
" suburb : yagoona\n",
" postcode : 3450\n",
" state : vic\n",
" date_of_birth : 19710727\n",
" soc_sec_id : 4493900\n",
"\n",
" given_name : laura\n",
" surname : campbell\n",
" street_number : 152\n",
" address_1 : clive steele avenue\n",
" address_2 : irrigation farm\n",
" suburb : yagoona\n",
" postcode : 3350\n",
" state : vic\n",
" date_of_birth : 19160610\n",
" soc_sec_id : 6214635\n",
"\n",
"0/10 positive, 13/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : jacob\n",
" surname : white\n",
" street_number : 5\n",
" address_1 : findlay street\n",
" address_2 : booroopki park rmb 596\n",
" suburb : robina\n",
" postcode : 2197\n",
" state : vic\n",
" date_of_birth : 19170205\n",
" soc_sec_id : 4702928\n",
"\n",
" given_name : jakob\n",
" surname : menzies\n",
" street_number : 33\n",
" address_1 : coverdale street\n",
" address_2 : bundong\n",
" suburb : worongary\n",
" postcode : 2190\n",
" state : vic\n",
" date_of_birth : 19140610\n",
" soc_sec_id : 4557295\n",
"\n",
"0/10 positive, 14/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : benjamin\n",
" surname : kirchener\n",
" street_number : 12\n",
" address_1 : stuart street\n",
" address_2 : wurrami\n",
" suburb : theodore\n",
" postcode : 2620\n",
" state : wa\n",
" date_of_birth : 19751110\n",
" soc_sec_id : 1766048\n",
"\n",
" given_name : max\n",
" surname : rees\n",
" street_number : 10\n",
" address_1 : waite street\n",
" address_2 : rosedown\n",
" suburb : nambour\n",
" postcode : 2620\n",
" state : wa\n",
" date_of_birth : 19751230\n",
" soc_sec_id : 1361900\n",
"\n",
"0/10 positive, 15/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" n\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : brydon\n",
" surname : webb\n",
" street_number : 11\n",
" address_1 : walker crescent\n",
" address_2 : bugoren\n",
" suburb : charlestown\n",
" postcode : 6258\n",
" state : nsw\n",
" date_of_birth : 19190528\n",
" soc_sec_id : 4191569\n",
"\n",
" given_name : bradley\n",
" surname : haberfield\n",
" street_number : 11\n",
" address_1 : carumbo place\n",
" address_2 : bungarra\n",
" suburb : canley heights\n",
" postcode : 2758\n",
" state : vic\n",
" date_of_birth : 19190528\n",
" soc_sec_id : 5039500\n",
"\n",
"0/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : brydon\n",
" surname : webb\n",
" street_number : 11\n",
" address_1 : walker crescent\n",
" address_2 : bugoren\n",
" suburb : charlestown\n",
" postcode : 6258\n",
" state : nsw\n",
" date_of_birth : 19190528\n",
" soc_sec_id : 4191569\n",
"\n",
" given_name : bradley\n",
" surname : haberfield\n",
" street_number : 11\n",
" address_1 : carumbi place\n",
" address_2 : bungarra\n",
" suburb : canley heights\n",
" postcode : 2758\n",
" state : vic\n",
" date_of_birth : 19190528\n",
" soc_sec_id : 5039500\n",
"\n",
"1/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : lochlan\n",
" surname : savidge\n",
" street_number : 29\n",
" address_1 : pohlman street\n",
" address_2 : moline village\n",
" suburb : jingili\n",
" postcode : 3071\n",
" state : vic\n",
" date_of_birth : 19140228\n",
" soc_sec_id : 1498207\n",
"\n",
" given_name : jayme\n",
" surname : parr\n",
" street_number : 2\n",
" address_1 : clive steele avenue\n",
" address_2 : henry kendall hostel\n",
" suburb : hoskinstown\n",
" postcode : 2770\n",
" state : nsw\n",
" date_of_birth : 19140228\n",
" soc_sec_id : 5840194\n",
"\n",
"2/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : charlie\n",
" surname : headon\n",
" street_number : 17\n",
" address_1 : sheehy street\n",
" address_2 : hawkins masonic vlge\n",
" suburb : warrandyte\n",
" postcode : 3073\n",
" state : vic\n",
" date_of_birth : 19880814\n",
" soc_sec_id : 7871445\n",
"\n",
" given_name : william\n",
" surname : hislop\n",
" street_number : 17\n",
" address_1 : deane street\n",
" address_2 : sunbury\n",
" suburb : cedar creek\n",
" postcode : 3073\n",
" state : qld\n",
" date_of_birth : 19830819\n",
" soc_sec_id : 6153593\n",
"\n",
"3/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : mitchell\n",
" surname : scrbak\n",
" street_number : 950\n",
" address_1 : holyman street\n",
" address_2 : berkeley vlge\n",
" suburb : safety bay\n",
" postcode : 4300\n",
" state : qld\n",
" date_of_birth : 19060811\n",
" soc_sec_id : 3592109\n",
"\n",
" given_name : hannah\n",
" surname : beams\n",
" street_number : 9\n",
" address_1 : light street\n",
" address_2 : castle hill farm\n",
" suburb : sale\n",
" postcode : 3221\n",
" state : vic\n",
" date_of_birth : 19560811\n",
" soc_sec_id : 7444484\n",
"\n",
"4/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : finley\n",
" surname : haeusler\n",
" street_number : 27\n",
" address_1 : noarlunga crescent\n",
" address_2 : spring ridge\n",
" suburb : nambour\n",
" postcode : 3180\n",
" state : vic\n",
" date_of_birth : 19660711\n",
" soc_sec_id : 3025217\n",
"\n",
" given_name : elki\n",
" surname : trent\n",
" street_number : 27\n",
" address_1 : wray place\n",
" address_2 : the mews royal hotel bldg\n",
" suburb : gawler east\n",
" postcode : 3152\n",
" state : nsw\n",
" date_of_birth : 19600211\n",
" soc_sec_id : 5679502\n",
"\n",
"5/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : india\n",
" surname : negrean\n",
" street_number : 1\n",
" address_1 : barringer street\n",
" address_2 : sherwood\n",
" suburb : parkinson\n",
" postcode : 3168\n",
" state : nsw\n",
" date_of_birth : 19860923\n",
" soc_sec_id : 2097928\n",
"\n",
" given_name : logan\n",
" surname : selth\n",
" street_number : 147\n",
" address_1 : goyder street\n",
" address_2 : rivonia\n",
" suburb : queenscliff\n",
" postcode : 2120\n",
" state : tas\n",
" date_of_birth : 19860921\n",
" soc_sec_id : 4161322\n",
"\n",
"6/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : annablle\n",
" surname : kounis\n",
" street_number : 121\n",
" address_1 : calder place\n",
" address_2 : brambletye vinyard\n",
" suburb : joondkalup\n",
" postcode : 3120\n",
" state : nsw\n",
" date_of_birth : 19640907\n",
" soc_sec_id : 1612956\n",
"\n",
" given_name : claurdia\n",
" surname : clelland\n",
" street_number : 12\n",
" address_1 : box hill a venue\n",
" address_2 : st francis vlge\n",
" suburb : old beach\n",
" postcode : 3127\n",
" state : wa\n",
" date_of_birth : 19640902\n",
" soc_sec_id : 9508954\n",
"\n",
"7/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : troy\n",
" surname : reid\n",
" street_number : 1\n",
" address_1 : allan street\n",
" address_2 : townview\n",
" suburb : page\n",
" postcode : 2774\n",
" state : qld\n",
" date_of_birth : 19250727\n",
" soc_sec_id : 3580821\n",
"\n",
" given_name : william\n",
" surname : tossell\n",
" street_number : 1\n",
" address_1 : lutana street\n",
" address_2 : nara cnsa\n",
" suburb : craigmore\n",
" postcode : 2509\n",
" state : nsw\n",
" date_of_birth : 19250116\n",
" soc_sec_id : 5322906\n",
"\n",
"8/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : isaac\n",
" surname : quilliam\n",
" street_number : 11\n",
" address_1 : namatjira drive\n",
" address_2 : delaware\n",
" suburb : geelong west\n",
" postcode : 3072\n",
" state : nsw\n",
" date_of_birth : 19930926\n",
" soc_sec_id : 1556150\n",
"\n",
" given_name : bailey\n",
" surname : clarke\n",
" street_number : 1\n",
" address_1 : hetherington circuit\n",
" address_2 : gundaline\n",
" suburb : harden\n",
" postcode : 2077\n",
" state : vic\n",
" date_of_birth : 19930416\n",
" soc_sec_id : 6134615\n",
"\n",
"9/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : james\n",
" surname : blake\n",
" street_number : 19\n",
" address_1 : sturt avenue\n",
" address_2 : laloki\n",
" suburb : carnegie\n",
" postcode : 2218\n",
" state : nsw\n",
" date_of_birth : 19050716\n",
" soc_sec_id : 7830672\n",
"\n",
" given_name : finn\n",
" surname : kapoor\n",
" street_number : 1994\n",
" address_1 : sturt avenue\n",
" address_2 : john flynn medical centre\n",
" suburb : mullumbimby\n",
" postcode : 2262\n",
" state : vic\n",
" date_of_birth : 19880816\n",
" soc_sec_id : 8680815\n",
"\n",
"10/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" y\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" given_name : harry\n",
" surname : ryn\n",
" street_number : 5\n",
" address_1 : kellway street\n",
" address_2 : rowethorpe\n",
" suburb : toowong\n",
" postcode : 3931\n",
" state : nsw\n",
" date_of_birth : 19220503\n",
" soc_sec_id : 7228670\n",
"\n",
" given_name : samantha\n",
" surname : grierson\n",
" street_number : 5\n",
" address_1 : kennedy street\n",
" address_2 : tantallon\n",
" suburb : oakleigh\n",
" postcode : 3034\n",
" state : vic\n",
" date_of_birth : 19210114\n",
" soc_sec_id : 4683164\n",
"\n",
"11/10 positive, 16/10 negative\n",
"Do these records refer to the same thing?\n",
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
" f\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Finished labeling\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Clustering...\n"
]
},
{
"ename": "BlockingError",
"evalue": "No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mBlockingError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[31], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m#initiate matching\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m df_final \u001b[38;5;241m=\u001b[39m pandas_dedupe\u001b[38;5;241m.\u001b[39mlink_dataframes(dfa, dfb, [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m given_name\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m surname\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m street_number\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_1\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_2\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m suburb\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m postcode\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m state\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m date_of_birth\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m soc_sec_id\u001b[39m\u001b[38;5;124m'\u001b[39m])\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas_dedupe/link_dataframes.py:112\u001b[0m, in \u001b[0;36mlink_dataframes\u001b[0;34m(dfa, dfb, field_properties, config_name, n_cores)\u001b[0m\n\u001b[1;32m 100\u001b[0m \u001b[38;5;66;03m# ## Blocking\u001b[39;00m\n\u001b[1;32m 101\u001b[0m \n\u001b[1;32m 102\u001b[0m \u001b[38;5;66;03m# ## Clustering\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 108\u001b[0m \u001b[38;5;66;03m# If we had more data, we would not pass in all the blocked data into\u001b[39;00m\n\u001b[1;32m 109\u001b[0m \u001b[38;5;66;03m# this function but a representative sample.\u001b[39;00m\n\u001b[1;32m 111\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mClustering...\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m--> 112\u001b[0m linked_records \u001b[38;5;241m=\u001b[39m linker\u001b[38;5;241m.\u001b[39mjoin(data_1, data_2, \u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 114\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m# duplicate sets\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;28mlen\u001b[39m(linked_records))\n\u001b[1;32m 117\u001b[0m \u001b[38;5;66;03m#Convert linked records into dataframe\u001b[39;00m\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:549\u001b[0m, in \u001b[0;36mRecordLinkMatching.join\u001b[0;34m(self, data_1, data_2, threshold, constraint)\u001b[0m\n\u001b[1;32m 543\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m constraint \u001b[38;5;129;01min\u001b[39;00m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m}, (\n\u001b[1;32m 544\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m is an invalid constraint option. Valid options include \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 545\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one, many-to-one, or many-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m constraint\n\u001b[1;32m 546\u001b[0m )\n\u001b[1;32m 548\u001b[0m pairs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpairs(data_1, data_2)\n\u001b[0;32m--> 549\u001b[0m pair_scores \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mscore(pairs)\n\u001b[1;32m 551\u001b[0m links: Links\n\u001b[1;32m 552\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m constraint \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:125\u001b[0m, in \u001b[0;36mIntegralMatching.score\u001b[0;34m(self, pairs)\u001b[0m\n\u001b[1;32m 116\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 117\u001b[0m \u001b[38;5;124;03mScores pairs of records. Returns pairs of tuples of records id and\u001b[39;00m\n\u001b[1;32m 118\u001b[0m \u001b[38;5;124;03massociated probabilities that the pair of records are match\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 122\u001b[0m \n\u001b[1;32m 123\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 124\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 125\u001b[0m matches \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mscoreDuplicates(\n\u001b[1;32m 126\u001b[0m pairs, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata_model\u001b[38;5;241m.\u001b[39mdistances, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclassifier, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnum_cores\n\u001b[1;32m 127\u001b[0m )\n\u001b[1;32m 128\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m:\n\u001b[1;32m 129\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[1;32m 130\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 131\u001b[0m \u001b[38;5;124;03m You need to either turn off multiprocessing or protect\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[38;5;124;03m https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods\"\"\"\u001b[39;00m\n\u001b[1;32m 135\u001b[0m )\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/core.py:126\u001b[0m, in \u001b[0;36mscoreDuplicates\u001b[0;34m(record_pairs, featurizer, classifier, num_cores)\u001b[0m\n\u001b[1;32m 124\u001b[0m first, record_pairs \u001b[38;5;241m=\u001b[39m peek(record_pairs)\n\u001b[1;32m 125\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m first \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 126\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m BlockingError(\n\u001b[1;32m 127\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo records have been blocked together. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 128\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIs the data you are trying to match like \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 129\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mthe data you trained on? If so, try adding \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 130\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmore training data.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 131\u001b[0m )\n\u001b[1;32m 133\u001b[0m record_pairs_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue(\u001b[38;5;241m2\u001b[39m)\n\u001b[1;32m 134\u001b[0m exception_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue()\n",
"\u001b[0;31mBlockingError\u001b[0m: No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data."
]
}
],
"source": [
"#initiate matching\n",
"df_final = pandas_dedupe.link_dataframes(dfa, dfb, [' given_name', ' surname', ' street_number', ' address_1', ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',' soc_sec_id'])\n"
]
},
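{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The `BlockingError` raised by the previous cell means that the learned blocking rules did not place any pair of records in the same block, so there was nothing to score. A likely cause is the labelling session itself: several pairs that clearly describe different people were marked as matches.\n",
"\n",
"The cell below is a minimal sketch of how to start over, not a definitive recipe. It assumes that pandas-dedupe saved its state in the working directory under the default `config_name`, in files named `link_dataframes_learned_settings` and `link_dataframes_training.json`; check the file names in your own folder before deleting anything. `dfa` and `dfb` are the dataframes loaded earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import os\n",
"import pandas_dedupe\n",
"\n",
"# Assumption: these are the files written by the failed run (default config_name).\n",
"saved_state = ['link_dataframes_learned_settings', 'link_dataframes_training.json']\n",
"for path in saved_state:\n",
"    if os.path.exists(path):\n",
"        os.remove(path)  # discard the old labels and model so training restarts\n",
"\n",
"fields = [' given_name', ' surname', ' street_number', ' address_1', ' address_2',\n",
"          ' suburb', ' postcode', ' state', ' date_of_birth', ' soc_sec_id']\n",
"\n",
"# Answer (y) only when both records describe the same person, (n) otherwise.\n",
"df_final = pandas_dedupe.link_dataframes(dfa, dfb, fields)\n"
]
},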
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Exercise\n",
"Try to deduplicate the data of the visitors of the White House.\n",
"You can find the data [here](https://obamawhitehouse.archives.gov/goodgovernment/tools/visitor-records)"
]
},
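{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a possible starting point for the exercise, exact duplicates can be removed with plain pandas before training dedupe on the remaining fuzzy ones. The sketch below uses a toy dataframe instead of the real file; the column names `NAMELAST` and `NAMEFIRST` are assumptions and should be replaced by the headers of the file you download."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy stand-in for the visitor records (hypothetical values and column names).\n",
"visitors = pd.DataFrame({\n",
"    'NAMELAST': ['Smith', 'SMITH ', 'Jones'],\n",
"    'NAMEFIRST': ['Ann', 'ann', 'Bob'],\n",
"})\n",
"\n",
"# Normalise case and surrounding whitespace so trivial variants compare equal.\n",
"clean = visitors.apply(lambda col: col.str.strip().str.lower())\n",
"\n",
"# Drop rows that are identical after normalisation; typos still need dedupe.\n",
"deduplicated = clean.drop_duplicates().reset_index(drop=True)\n",
"print(deduplicated)\n"
]
},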
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
"* [Dedupe](https://dedupe.io/) package\n",
"* [pandas-dedupe](https://pypi.org/project/pandas-dedupe/) package"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"height": "158.667px",
"left": "400px",
"right": "20px",
"top": "50px",
"width": "700px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}