mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-16 19:42:28 +00:00
3536 lines
121 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Course Notes for Learning Intelligent Systems"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"Department of Telematic Systems Engineering, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Duplicated values\n",
|
||
"\n",
|
||
"Real-world data is often messy and contains duplicated records. \n",
|
||
"\n",
|
||
"We will use the package [dedupe](https://dedupe.io/) to eliminate duplicates. \n",
|
||
"\n",
|
||
"\n",
|
||
"Some alternatives are the packages [recordlinkage](https://pypi.org/project/recordlinkage/) and [thefuzz](https://github.com/seatgeek/thefuzz).\n",
|
||
"\n",
|
||
"Instead of using the dedupe package directly, we are going to use **pandas-dedupe**, a wrapper that applies dedupe to pandas DataFrames:\n",
|
||
"\n",
|
||
"\n",
|
||
"**pip install pandas-dedupe**\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"\n",
|
||
"Let's start by loading messy data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import warnings\n",
|
||
"warnings.filterwarnings('ignore') # Suppress warnings to keep the output readable\n",
|
||
"\n",
|
||
"import pandas as pd\n",
|
||
"import numpy as np\n",
|
||
"import pandas_dedupe"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"df = pd.read_csv('https://raw.githubusercontent.com/dedupeio/dedupe-examples/master/csv_example/csv_example_messy_input.csv')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Let's run some initial checks on the data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(3337, 32)"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Id</th>\n",
|
||
" <th>Source</th>\n",
|
||
" <th>Site name</th>\n",
|
||
" <th>Address</th>\n",
|
||
" <th>Zip</th>\n",
|
||
" <th>Phone</th>\n",
|
||
" <th>Fax</th>\n",
|
||
" <th>Program Name</th>\n",
|
||
" <th>Length of Day</th>\n",
|
||
" <th>IDHS Provider ID</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>Executive Director</th>\n",
|
||
" <th>Center Director</th>\n",
|
||
" <th>ECE Available Programs</th>\n",
|
||
" <th>NAEYC Valid Until</th>\n",
|
||
" <th>NAEYC Program Id</th>\n",
|
||
" <th>Email Address</th>\n",
|
||
" <th>Ounce of Prevention Description</th>\n",
|
||
" <th>Purple binder service type</th>\n",
|
||
" <th>Column</th>\n",
|
||
" <th>Column2</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Salvation Army - Temple / Salvation Army</td>\n",
|
||
" <td>1 N Ogden Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2262649.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Salvation Army - Temple / Salvation Army</td>\n",
|
||
" <td>1 N Ogden Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2262649.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
|
||
" <td>10 S Kedzie Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>5339011.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>3</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
|
||
" <td>10 S Kedzie Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>5339011.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
|
||
" <td>10001 S Woodlawn Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2916100.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>5</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
|
||
" <td>10001 S Woodlawn Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2916100.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>6</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
|
||
" <td>1001 W Roosevelt Rd</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>9395115.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>7</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
|
||
" <td>1001 W Roosevelt Rd</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>9395115.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>8</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Hull House Association - Uptown Head Start / ...</td>\n",
|
||
" <td>1020 W Bryn Mawr Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>7695753.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9</th>\n",
|
||
" <td>9</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Hull House Association - Child Dev. Central O...</td>\n",
|
||
" <td>1030 W Van Buren St</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>9068600.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>10 rows × 32 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Id Source \\\n",
|
||
"0 0 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"1 1 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"2 2 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"3 3 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"4 4 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"5 5 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"6 6 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"7 7 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"8 8 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"9 9 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"\n",
|
||
" Site name Address \\\n",
|
||
"0 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
|
||
"1 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
|
||
"2 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
|
||
"3 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
|
||
"4 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
|
||
"5 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
|
||
"6 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
|
||
"7 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
|
||
"8 Hull House Association - Uptown Head Start / ... 1020 W Bryn Mawr Ave \n",
|
||
"9 Hull House Association - Child Dev. Central O... 1030 W Van Buren St \n",
|
||
"\n",
|
||
" Zip Phone Fax Program Name Length of Day IDHS Provider ID ... \\\n",
|
||
"0 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"1 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"2 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"3 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"4 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"5 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"6 NaN 9395115.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"7 NaN 9395115.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"8 NaN 7695753.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"9 NaN 9068600.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"\n",
|
||
" Executive Director Center Director ECE Available Programs NAEYC Valid Until \\\n",
|
||
"0 NaN NaN NaN NaN \n",
|
||
"1 NaN NaN NaN NaN \n",
|
||
"2 NaN NaN NaN NaN \n",
|
||
"3 NaN NaN NaN NaN \n",
|
||
"4 NaN NaN NaN NaN \n",
|
||
"5 NaN NaN NaN NaN \n",
|
||
"6 NaN NaN NaN NaN \n",
|
||
"7 NaN NaN NaN NaN \n",
|
||
"8 NaN NaN NaN NaN \n",
|
||
"9 NaN NaN NaN NaN \n",
|
||
"\n",
|
||
" NAEYC Program Id Email Address Ounce of Prevention Description \\\n",
|
||
"0 NaN NaN NaN \n",
|
||
"1 NaN NaN NaN \n",
|
||
"2 NaN NaN NaN \n",
|
||
"3 NaN NaN NaN \n",
|
||
"4 NaN NaN NaN \n",
|
||
"5 NaN NaN NaN \n",
|
||
"6 NaN NaN NaN \n",
|
||
"7 NaN NaN NaN \n",
|
||
"8 NaN NaN NaN \n",
|
||
"9 NaN NaN NaN \n",
|
||
"\n",
|
||
" Purple binder service type Column Column2 \n",
|
||
"0 NaN NaN NaN \n",
|
||
"1 NaN NaN NaN \n",
|
||
"2 NaN NaN NaN \n",
|
||
"3 NaN NaN NaN \n",
|
||
"4 NaN NaN NaN \n",
|
||
"5 NaN NaN NaN \n",
|
||
"6 NaN NaN NaN \n",
|
||
"7 NaN NaN NaN \n",
|
||
"8 NaN NaN NaN \n",
|
||
"9 NaN NaN NaN \n",
|
||
"\n",
|
||
"[10 rows x 32 columns]"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.head(10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
|
||
" 'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
|
||
" 'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
|
||
" 'Number per Site EHS', 'Number per Site HS', 'Director',\n",
|
||
" 'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
|
||
" 'Website', 'Executive Director', 'Center Director',\n",
|
||
" 'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
|
||
" 'Email Address', 'Ounce of Prevention Description',\n",
|
||
" 'Purple binder service type', 'Column', 'Column2'],\n",
|
||
" dtype='object')\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(df.columns)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Id int64\n",
|
||
"Source object\n",
|
||
"Site name object\n",
|
||
"Address object\n",
|
||
"Zip float64\n",
|
||
"Phone float64\n",
|
||
"Fax object\n",
|
||
"Program Name object\n",
|
||
"Length of Day object\n",
|
||
"IDHS Provider ID object\n",
|
||
"Agency object\n",
|
||
"Neighborhood object\n",
|
||
"Funded Enrollment object\n",
|
||
"Program Option object\n",
|
||
"Number per Site EHS object\n",
|
||
"Number per Site HS object\n",
|
||
"Director float64\n",
|
||
"Head Start Fund float64\n",
|
||
"Eearly Head Start Fund object\n",
|
||
"CC fund object\n",
|
||
"Progmod object\n",
|
||
"Website object\n",
|
||
"Executive Director object\n",
|
||
"Center Director object\n",
|
||
"ECE Available Programs object\n",
|
||
"NAEYC Valid Until object\n",
|
||
"NAEYC Program Id float64\n",
|
||
"Email Address object\n",
|
||
"Ounce of Prevention Description object\n",
|
||
"Purple binder service type object\n",
|
||
"Column float64\n",
|
||
"Column2 object\n",
|
||
"dtype: object\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(df.dtypes)"
|
||
]
|
||
},
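The listing above shows `Zip` and `Phone` as `float64`, which is how pandas stores integer-like columns that contain missing values. A minimal sketch, on a small made-up frame rather than the course dataset, of turning such columns into strings so that `60653.0` reads as `60653` before records are compared:

```python
import pandas as pd

# Toy data (hypothetical values) mimicking identifier columns with gaps
toy = pd.DataFrame({
    "Zip": [60653.0, None, 60601.0],
    "Phone": [2682685.0, 3726600.0, None],
})

# Nullable integer first (drops the trailing ".0"), then nullable string;
# missing values stay missing instead of becoming the text "nan"
ids = toy.astype("Int64").astype("string")
print(ids)
```

Whether identifiers should be compared as text or as numbers depends on the matching step that follows; this is only one possible choice.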
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Id 0\n",
|
||
"Source 0\n",
|
||
"Site name 0\n",
|
||
"Address 0\n",
|
||
"Zip 1333\n",
|
||
"Phone 146\n",
|
||
"Fax 3299\n",
|
||
"Program Name 2009\n",
|
||
"Length of Day 2009\n",
|
||
"IDHS Provider ID 3298\n",
|
||
"Agency 3325\n",
|
||
"Neighborhood 2754\n",
|
||
"Funded Enrollment 2424\n",
|
||
"Program Option 2800\n",
|
||
"Number per Site EHS 3319\n",
|
||
"Number per Site HS 3319\n",
|
||
"Director 3337\n",
|
||
"Head Start Fund 3337\n",
|
||
"Eearly Head Start Fund 2881\n",
|
||
"CC fund 2818\n",
|
||
"Progmod 2818\n",
|
||
"Website 2815\n",
|
||
"Executive Director 3114\n",
|
||
"Center Director 2874\n",
|
||
"ECE Available Programs 2379\n",
|
||
"NAEYC Valid Until 2968\n",
|
||
"NAEYC Program Id 3337\n",
|
||
"Email Address 3203\n",
|
||
"Ounce of Prevention Description 3185\n",
|
||
"Purple binder service type 3215\n",
|
||
"Column 3337\n",
|
||
"Column2 3018\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Count missing values per column\n",
|
||
"df.isnull().sum()"
|
||
]
|
||
},
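In the counts above, `Director`, `Head Start Fund`, `NAEYC Program Id` and `Column` are missing in all 3337 rows, so they carry no information. A short sketch, on a made-up frame with the same kind of empty column, of measuring the share of missing values and dropping columns that are entirely empty:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); "Director" is empty in every row
toy = pd.DataFrame({
    "Site name": ["a", "b", "c", "d"],
    "Zip": [60653.0, np.nan, 60601.0, np.nan],
    "Director": [np.nan] * 4,
})

missing_share = toy.isnull().mean()      # fraction of missing values per column
trimmed = toy.dropna(axis=1, how="all")  # drop columns without a single value
print(missing_share)
print(trimmed.columns.tolist())
```

Dropping empty columns is optional here; it only makes the later inspection of the DataFrame easier to read.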
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Check duplicates"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0 False\n",
|
||
"1 False\n",
|
||
"2 False\n",
|
||
"3 False\n",
|
||
"4 False\n",
|
||
" ... \n",
|
||
"3332 False\n",
|
||
"3333 False\n",
|
||
"3334 False\n",
|
||
"3335 False\n",
|
||
"3336 False\n",
|
||
"Length: 3337, dtype: bool"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.duplicated()"
|
||
]
|
||
},
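`df.duplicated()` compares whole rows, so a unique `Id` column is enough to make every row look different. A minimal sketch, using made-up rows and an assumed choice of content columns, of counting exact duplicates and of flagging rows that repeat on a subset of columns:

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 differ only in Id
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Site name": ["Salvation Army", "Salvation Army", "Hull House"],
    "Address": ["1 N Ogden Ave", "1 N Ogden Ave", "1030 W Van Buren St"],
})

exact = toy.duplicated().sum()  # whole-row comparison, Id included

# Compare only the content columns; keep=False flags every member of a group
content_cols = ["Site name", "Address"]
by_content = toy.duplicated(subset=content_cols, keep=False)
print(exact, by_content.tolist())
```

The same two calls can be tried on `df`; which columns belong in `subset` is a modelling decision, not something the data dictates.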
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Remove duplicates"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Id</th>\n",
|
||
" <th>Source</th>\n",
|
||
" <th>Site name</th>\n",
|
||
" <th>Address</th>\n",
|
||
" <th>Zip</th>\n",
|
||
" <th>Phone</th>\n",
|
||
" <th>Fax</th>\n",
|
||
" <th>Program Name</th>\n",
|
||
" <th>Length of Day</th>\n",
|
||
" <th>IDHS Provider ID</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>Executive Director</th>\n",
|
||
" <th>Center Director</th>\n",
|
||
" <th>ECE Available Programs</th>\n",
|
||
" <th>NAEYC Valid Until</th>\n",
|
||
" <th>NAEYC Program Id</th>\n",
|
||
" <th>Email Address</th>\n",
|
||
" <th>Ounce of Prevention Description</th>\n",
|
||
" <th>Purple binder service type</th>\n",
|
||
" <th>Column</th>\n",
|
||
" <th>Column2</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Salvation Army - Temple / Salvation Army</td>\n",
|
||
" <td>1 N Ogden Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2262649.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Salvation Army - Temple / Salvation Army</td>\n",
|
||
" <td>1 N Ogden Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2262649.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
|
||
" <td>10 S Kedzie Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>5339011.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>3</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
|
||
" <td>10 S Kedzie Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>5339011.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
|
||
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
|
||
" <td>10001 S Woodlawn Ave</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2916100.0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Child Care</td>\n",
|
||
" <td>EXTENDED DAY</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 32 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Id Source \\\n",
|
||
"0 0 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"1 1 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"2 2 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"3 3 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"4 4 CPS_Early_Childhood_Portal_scrape.csv \n",
|
||
"\n",
|
||
" Site name Address \\\n",
|
||
"0 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
|
||
"1 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
|
||
"2 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
|
||
"3 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
|
||
"4 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
|
||
"\n",
|
||
" Zip Phone Fax Program Name Length of Day IDHS Provider ID ... \\\n",
|
||
"0 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"1 NaN 2262649.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"2 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"3 NaN 5339011.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"4 NaN 2916100.0 NaN Child Care EXTENDED DAY NaN ... \n",
|
||
"\n",
|
||
" Executive Director Center Director ECE Available Programs NAEYC Valid Until \\\n",
|
||
"0 NaN NaN NaN NaN \n",
|
||
"1 NaN NaN NaN NaN \n",
|
||
"2 NaN NaN NaN NaN \n",
|
||
"3 NaN NaN NaN NaN \n",
|
||
"4 NaN NaN NaN NaN \n",
|
||
"\n",
|
||
" NAEYC Program Id Email Address Ounce of Prevention Description \\\n",
|
||
"0 NaN NaN NaN \n",
|
||
"1 NaN NaN NaN \n",
|
||
"2 NaN NaN NaN \n",
|
||
"3 NaN NaN NaN \n",
|
||
"4 NaN NaN NaN \n",
|
||
"\n",
|
||
" Purple binder service type Column Column2 \n",
|
||
"0 NaN NaN NaN \n",
|
||
"1 NaN NaN NaN \n",
|
||
"2 NaN NaN NaN \n",
|
||
"3 NaN NaN NaN \n",
|
||
"4 NaN NaN NaN \n",
|
||
"\n",
|
||
"[5 rows x 32 columns]"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.drop_duplicates(inplace=True)\n",
|
||
"df[0:5]"
|
||
]
|
||
},
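In the rows shown above, pairs such as rows 0 and 1 still appear after `drop_duplicates`, because they differ in `Id`. Rows that differ only in letter case or spacing would survive as well. A hedged sketch, on made-up rows, that builds a normalized copy of a few key columns and keeps the first row for each key (the key columns are an assumption for illustration):

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 differ in case and spacing
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Site name": ["YMCA West Side", "ymca west side ", "Woodlawn Organization"],
    "Address": ["5080 W Harrison Street", "5080 w harrison street", "6040 S Harper Avenue"],
})

key_cols = ["Site name", "Address"]
# Normalized copy of the key columns: lower-case, surrounding whitespace removed
keys = toy[key_cols].apply(lambda col: col.str.lower().str.strip())

# Keep the first row of each normalized key; the original values are untouched
deduped = toy[~keys.duplicated()]
print(deduped)
```

This only catches differences that the normalization removes. Abbreviations or typos need approximate matching, which is what the next section addresses.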
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Remove 'real duplicates'\n",
|
||
"\n",
|
||
"The problem is that records describing the same site are not exactly identical, so `drop_duplicates` does not detect them. \n",
|
||
"\n",
|
||
"The data is messy: it comes from several sources with different formatting. \n",
|
||
"\n",
|
||
"We will use **dedupe**, through pandas-dedupe, to group records that refer to the same site."
|
||
]
|
||
},
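To see why exact comparison is not enough, it helps to look at approximate string similarity. The sketch below uses `difflib` from the Python standard library on made-up addresses. It is only an illustration of fuzzy comparison and is not how dedupe scores records.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


# Hypothetical address variants: the first pair refers to the same place
same_site = similarity("624 E 47th Street", "624 e 47th st")
other_site = similarity("624 E 47th Street", "360 N Michigan Avenue")
print(round(same_site, 2), round(other_site, 2))
```

A fixed threshold on such a ratio is hard to choose by hand, which is one reason to use a library that learns from labelled example pairs.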
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Importing data ...\n",
|
||
"Reading from dedupe_dataframe_learned_settings\n",
|
||
"Clustering...\n",
|
||
"# duplicate sets 871\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# canonicalize=True adds canonical_* columns with a standardized value for each cluster\n",
|
||
"df_dedupe = pandas_dedupe.dedupe_dataframe(df, ['Source', 'Site name', 'Address', 'Zip', 'Phone', 'Email Address'], canonicalize=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"pandas-dedupe saves the learned settings and training data to disk and reuses them on later runs. If you want to retrain, delete those files first (the dedupe* and link_dataframes* files).\n",
|
||
"\n",
|
||
"\n",
|
||
"Now, if you inspect the DataFrame, you will see that records identified as duplicates share the same `cluster id`, together with a `confidence` score."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',\n",
|
||
" 'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',\n",
|
||
" 'Neighborhood', 'Funded Enrollment', 'Program Option',\n",
|
||
" 'Number per Site EHS', 'Number per Site HS', 'Director',\n",
|
||
" 'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',\n",
|
||
" 'Website', 'Executive Director', 'Center Director',\n",
|
||
" 'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',\n",
|
||
" 'Email Address', 'Ounce of Prevention Description',\n",
|
||
" 'Purple binder service type', 'Column', 'Column2', 'cluster id',\n",
|
||
" 'confidence', 'canonical_Id', 'canonical_Source', 'canonical_Site name',\n",
|
||
" 'canonical_Address', 'canonical_Zip', 'canonical_Phone',\n",
|
||
" 'canonical_Fax', 'canonical_Program Name', 'canonical_Length of Day',\n",
|
||
" 'canonical_IDHS Provider ID', 'canonical_Agency',\n",
|
||
" 'canonical_Neighborhood', 'canonical_Funded Enrollment',\n",
|
||
" 'canonical_Program Option', 'canonical_Number per Site EHS',\n",
|
||
" 'canonical_Number per Site HS', 'canonical_Director',\n",
|
||
" 'canonical_Head Start Fund', 'canonical_Eearly Head Start Fund',\n",
|
||
" 'canonical_CC fund', 'canonical_Progmod', 'canonical_Website',\n",
|
||
" 'canonical_Executive Director', 'canonical_Center Director',\n",
|
||
" 'canonical_ECE Available Programs', 'canonical_NAEYC Valid Until',\n",
|
||
" 'canonical_NAEYC Program Id', 'canonical_Email Address',\n",
|
||
" 'canonical_Ounce of Prevention Description',\n",
|
||
" 'canonical_Purple binder service type', 'canonical_Column',\n",
|
||
" 'canonical_Column2'],\n",
|
||
" dtype='object')"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df_dedupe.columns"
|
||
]
|
||
},
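The result keeps every original row and adds the `cluster id` and `confidence` columns listed above. One possible way to obtain a single row per cluster, sketched on made-up values, is to keep the row with the highest confidence in each cluster. Whether that row is the best representative is a judgment call.

```python
import pandas as pd

# Toy data (hypothetical values) using the column names from the output above
clustered = pd.DataFrame({
    "Site name": ["ymca west side", "YMCA West Side", "woodlawn organization"],
    "cluster id": [0, 0, 1],
    "confidence": [0.91, 0.97, 1.0],
})

# Sort so the most confident row of each cluster comes first, keep that row,
# then restore the original row order
best = (
    clustered.sort_values("confidence", ascending=False)
    .drop_duplicates(subset="cluster id")
    .sort_index()
)
print(best)
```

An alternative is to keep all rows and use the `canonical_*` columns when a standardized value is needed.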
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Id</th>\n",
|
||
" <th>Source</th>\n",
|
||
" <th>Site name</th>\n",
|
||
" <th>Address</th>\n",
|
||
" <th>Zip</th>\n",
|
||
" <th>Phone</th>\n",
|
||
" <th>Fax</th>\n",
|
||
" <th>Program Name</th>\n",
|
||
" <th>Length of Day</th>\n",
|
||
" <th>IDHS Provider ID</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>canonical_Executive Director</th>\n",
|
||
" <th>canonical_Center Director</th>\n",
|
||
" <th>canonical_ECE Available Programs</th>\n",
|
||
" <th>canonical_NAEYC Valid Until</th>\n",
|
||
" <th>canonical_NAEYC Program Id</th>\n",
|
||
" <th>canonical_Email Address</th>\n",
|
||
" <th>canonical_Ounce of Prevention Description</th>\n",
|
||
" <th>canonical_Purple binder service type</th>\n",
|
||
" <th>canonical_Column</th>\n",
|
||
" <th>canonical_Column2</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3327</th>\n",
|
||
" <td>3327</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>precious infants &amp; tots learning center</td>\n",
|
||
" <td>624 e 47th street</td>\n",
|
||
" <td>60653.0</td>\n",
|
||
" <td>2682685.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td>early head start</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3300</th>\n",
|
||
" <td>3300</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>ywca metropolitan chicago</td>\n",
|
||
" <td>360 n michigan avenue</td>\n",
|
||
" <td>60601.0</td>\n",
|
||
" <td>3726600.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3299</th>\n",
|
||
" <td>3299</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>ymca west side</td>\n",
|
||
" <td>5080 w harrison street</td>\n",
|
||
" <td>60644.0</td>\n",
|
||
" <td>9553100.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3285</th>\n",
|
||
" <td>3285</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>woodlawn organization</td>\n",
|
||
" <td>6040 s harper avenue</td>\n",
|
||
" <td>60637.0</td>\n",
|
||
" <td>2885840.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3281</th>\n",
|
||
" <td>3281</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>urban family and community centers</td>\n",
|
||
" <td>4241 w washington boulevard</td>\n",
|
||
" <td>60624.0</td>\n",
|
||
" <td>7228333.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1837</th>\n",
|
||
" <td>1837</td>\n",
|
||
" <td>chapin_dfss_providers_2011_070212.csv</td>\n",
|
||
" <td>north avenue day nursery fcch-carolyn price</td>\n",
|
||
" <td>2020 w jackson</td>\n",
|
||
" <td>60612.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
|
||
" <td>www.crcl.net</td>\n",
|
||
" <td>cp</td>\n",
|
||
" <td>betty lee</td>\n",
|
||
" <td></td>\n",
|
||
" <td>05/31/14</td>\n",
|
||
" <td>723127</td>\n",
|
||
" <td>youngt@crcl.net</td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2643</th>\n",
|
||
" <td>2643</td>\n",
|
||
" <td>ece chicago find a school scrape.csv</td>\n",
|
||
" <td>mary crane north 0-3</td>\n",
|
||
" <td>2905 n. leavitt</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>3485528.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" <td>www.marycrane.org</td>\n",
|
||
" <td>lavetter terry</td>\n",
|
||
" <td>martuice williams</td>\n",
|
||
" <td></td>\n",
|
||
" <td>08/01/16</td>\n",
|
||
" <td>722999</td>\n",
|
||
" <td>info@marycrane.org</td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3228</th>\n",
|
||
" <td>3228</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>mary crane center north</td>\n",
|
||
" <td>2905 n leavitt street</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>9753322.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" <td>www.marycrane.org</td>\n",
|
||
" <td>lavetter terry</td>\n",
|
||
" <td>martuice williams</td>\n",
|
||
" <td></td>\n",
|
||
" <td>08/01/16</td>\n",
|
||
" <td>722999</td>\n",
|
||
" <td>info@marycrane.org</td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3229</th>\n",
|
||
" <td>3229</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>mary crane family and day care center</td>\n",
|
||
" <td>2905 n clybourn avenue</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>3485528.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" <td>www.marycrane.org</td>\n",
|
||
" <td>lavetter terry</td>\n",
|
||
" <td>martuice williams</td>\n",
|
||
" <td></td>\n",
|
||
" <td>08/01/16</td>\n",
|
||
" <td>722999</td>\n",
|
||
" <td>info@marycrane.org</td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3241</th>\n",
|
||
" <td>3241</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>our lady of guadalupe early childhood center</td>\n",
|
||
" <td>9129 s burley avenue</td>\n",
|
||
" <td>60617.0</td>\n",
|
||
" <td>9785320.0</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>None</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
|
||
" <td>www.catholiccharities.net</td>\n",
|
||
" <td>laura rios</td>\n",
|
||
" <td>deborah o'brien</td>\n",
|
||
" <td></td>\n",
|
||
" <td>01/31/13</td>\n",
|
||
" <td>486949</td>\n",
|
||
" <td>pgutierr@catholiccharities.net</td>\n",
|
||
" <td></td>\n",
|
||
" <td>child care</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>3337 rows × 66 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Id Source \\\n",
|
||
"3327 3327 purple_binder_early_childhood.csv \n",
|
||
"3300 3300 purple_binder_early_childhood.csv \n",
|
||
"3299 3299 purple_binder_early_childhood.csv \n",
|
||
"3285 3285 purple_binder_early_childhood.csv \n",
|
||
"3281 3281 purple_binder_early_childhood.csv \n",
|
||
"... ... ... \n",
|
||
"1837 1837 chapin_dfss_providers_2011_070212.csv \n",
|
||
"2643 2643 ece chicago find a school scrape.csv \n",
|
||
"3228 3228 purple_binder_early_childhood.csv \n",
|
||
"3229 3229 purple_binder_early_childhood.csv \n",
|
||
"3241 3241 purple_binder_early_childhood.csv \n",
|
||
"\n",
|
||
" Site name \\\n",
|
||
"3327 precious infants & tots learning center \n",
|
||
"3300 ywca metropolitan chicago \n",
|
||
"3299 ymca west side \n",
|
||
"3285 woodlawn organization \n",
|
||
"3281 urban family and community centers \n",
|
||
"... ... \n",
|
||
"1837 north avenue day nursery fcch-carolyn price \n",
|
||
"2643 mary crane north 0-3 \n",
|
||
"3228 mary crane center north \n",
|
||
"3229 mary crane family and day care center \n",
|
||
"3241 our lady of guadalupe early childhood center \n",
|
||
"\n",
|
||
" Address Zip Phone Fax Program Name \\\n",
|
||
"3327 624 e 47th street 60653.0 2682685.0 None None \n",
|
||
"3300 360 n michigan avenue 60601.0 3726600.0 None None \n",
|
||
"3299 5080 w harrison street 60644.0 9553100.0 None None \n",
|
||
"3285 6040 s harper avenue 60637.0 2885840.0 None None \n",
|
||
"3281 4241 w washington boulevard 60624.0 7228333.0 None None \n",
|
||
"... ... ... ... ... ... \n",
|
||
"1837 2020 w jackson 60612.0 None None None \n",
|
||
"2643 2905 n. leavitt 60618.0 3485528.0 None None \n",
|
||
"3228 2905 n leavitt street 60618.0 9753322.0 None None \n",
|
||
"3229 2905 n clybourn avenue 60618.0 3485528.0 None None \n",
|
||
"3241 9129 s burley avenue 60617.0 9785320.0 None None \n",
|
||
"\n",
|
||
" Length of Day IDHS Provider ID ... \\\n",
|
||
"3327 None None ... \n",
|
||
"3300 None None ... \n",
|
||
"3299 None None ... \n",
|
||
"3285 None None ... \n",
|
||
"3281 None None ... \n",
|
||
"... ... ... ... \n",
|
||
"1837 None None ... \n",
|
||
"2643 None None ... \n",
|
||
"3228 None None ... \n",
|
||
"3229 None None ... \n",
|
||
"3241 None None ... \n",
|
||
"\n",
|
||
" canonical_Executive Director \\\n",
|
||
"3327 \n",
|
||
"3300 \n",
|
||
"3299 \n",
|
||
"3285 \n",
|
||
"3281 \n",
|
||
"... ... \n",
|
||
"1837 5833ip(ehs collaboration enhanced home ip), 58... \n",
|
||
"2643 4374it(child care it center), 4374ps(child car... \n",
|
||
"3228 4374it(child care it center), 4374ps(child car... \n",
|
||
"3229 4374it(child care it center), 4374ps(child car... \n",
|
||
"3241 2488it(child care it center), 7030ps(hs collab... \n",
|
||
"\n",
|
||
" canonical_Center Director canonical_ECE Available Programs \\\n",
|
||
"3327 \n",
|
||
"3300 \n",
|
||
"3299 \n",
|
||
"3285 \n",
|
||
"3281 \n",
|
||
"... ... ... \n",
|
||
"1837 www.crcl.net cp \n",
|
||
"2643 www.marycrane.org lavetter terry \n",
|
||
"3228 www.marycrane.org lavetter terry \n",
|
||
"3229 www.marycrane.org lavetter terry \n",
|
||
"3241 www.catholiccharities.net laura rios \n",
|
||
"\n",
|
||
" canonical_NAEYC Valid Until canonical_NAEYC Program Id \\\n",
|
||
"3327 \n",
|
||
"3300 \n",
|
||
"3299 \n",
|
||
"3285 \n",
|
||
"3281 \n",
|
||
"... ... ... \n",
|
||
"1837 betty lee \n",
|
||
"2643 martuice williams \n",
|
||
"3228 martuice williams \n",
|
||
"3229 martuice williams \n",
|
||
"3241 deborah o'brien \n",
|
||
"\n",
|
||
" canonical_Email Address canonical_Ounce of Prevention Description \\\n",
|
||
"3327 \n",
|
||
"3300 \n",
|
||
"3299 \n",
|
||
"3285 \n",
|
||
"3281 \n",
|
||
"... ... ... \n",
|
||
"1837 05/31/14 723127 \n",
|
||
"2643 08/01/16 722999 \n",
|
||
"3228 08/01/16 722999 \n",
|
||
"3229 08/01/16 722999 \n",
|
||
"3241 01/31/13 486949 \n",
|
||
"\n",
|
||
" canonical_Purple binder service type canonical_Column canonical_Column2 \n",
|
||
"3327 early head start \n",
|
||
"3300 child care \n",
|
||
"3299 child care \n",
|
||
"3285 child care \n",
|
||
"3281 child care \n",
|
||
"... ... ... ... \n",
|
||
"1837 youngt@crcl.net child care \n",
|
||
"2643 info@marycrane.org child care \n",
|
||
"3228 info@marycrane.org child care \n",
|
||
"3229 info@marycrane.org child care \n",
|
||
"3241 pgutierr@catholiccharities.net child care \n",
|
||
"\n",
|
||
"[3337 rows x 66 columns]"
|
||
]
|
||
},
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df_sorted = df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)\n",
|
||
"df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Id</th>\n",
|
||
" <th>cluster id</th>\n",
|
||
" <th>confidence</th>\n",
|
||
" <th>Source</th>\n",
|
||
" <th>Zip</th>\n",
|
||
" <th>Address</th>\n",
|
||
" <th>canonical_Executive Director</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3327</th>\n",
|
||
" <td>3327</td>\n",
|
||
" <td>870</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60653.0</td>\n",
|
||
" <td>624 e 47th street</td>\n",
|
||
" <td></td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3300</th>\n",
|
||
" <td>3300</td>\n",
|
||
" <td>869</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60601.0</td>\n",
|
||
" <td>360 n michigan avenue</td>\n",
|
||
" <td></td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3299</th>\n",
|
||
" <td>3299</td>\n",
|
||
" <td>868</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60644.0</td>\n",
|
||
" <td>5080 w harrison street</td>\n",
|
||
" <td></td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3285</th>\n",
|
||
" <td>3285</td>\n",
|
||
" <td>867</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60637.0</td>\n",
|
||
" <td>6040 s harper avenue</td>\n",
|
||
" <td></td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3281</th>\n",
|
||
" <td>3281</td>\n",
|
||
" <td>866</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60624.0</td>\n",
|
||
" <td>4241 w washington boulevard</td>\n",
|
||
" <td></td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1837</th>\n",
|
||
" <td>1837</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>0.133246</td>\n",
|
||
" <td>chapin_dfss_providers_2011_070212.csv</td>\n",
|
||
" <td>60612.0</td>\n",
|
||
" <td>2020 w jackson</td>\n",
|
||
" <td>5833ip(ehs collaboration enhanced home ip), 58...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2643</th>\n",
|
||
" <td>2643</td>\n",
|
||
" <td>31</td>\n",
|
||
" <td>0.130475</td>\n",
|
||
" <td>ece chicago find a school scrape.csv</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>2905 n. leavitt</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3228</th>\n",
|
||
" <td>3228</td>\n",
|
||
" <td>31</td>\n",
|
||
" <td>0.130475</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>2905 n leavitt street</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3229</th>\n",
|
||
" <td>3229</td>\n",
|
||
" <td>31</td>\n",
|
||
" <td>0.130474</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60618.0</td>\n",
|
||
" <td>2905 n clybourn avenue</td>\n",
|
||
" <td>4374it(child care it center), 4374ps(child car...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3241</th>\n",
|
||
" <td>3241</td>\n",
|
||
" <td>27</td>\n",
|
||
" <td>0.058419</td>\n",
|
||
" <td>purple_binder_early_childhood.csv</td>\n",
|
||
" <td>60617.0</td>\n",
|
||
" <td>9129 s burley avenue</td>\n",
|
||
" <td>2488it(child care it center), 7030ps(hs collab...</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>3337 rows × 7 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Id cluster id confidence Source \\\n",
|
||
"3327 3327 870 1.000000 purple_binder_early_childhood.csv \n",
|
||
"3300 3300 869 1.000000 purple_binder_early_childhood.csv \n",
|
||
"3299 3299 868 1.000000 purple_binder_early_childhood.csv \n",
|
||
"3285 3285 867 1.000000 purple_binder_early_childhood.csv \n",
|
||
"3281 3281 866 1.000000 purple_binder_early_childhood.csv \n",
|
||
"... ... ... ... ... \n",
|
||
"1837 1837 40 0.133246 chapin_dfss_providers_2011_070212.csv \n",
|
||
"2643 2643 31 0.130475 ece chicago find a school scrape.csv \n",
|
||
"3228 3228 31 0.130475 purple_binder_early_childhood.csv \n",
|
||
"3229 3229 31 0.130474 purple_binder_early_childhood.csv \n",
|
||
"3241 3241 27 0.058419 purple_binder_early_childhood.csv \n",
|
||
"\n",
|
||
" Zip Address \\\n",
|
||
"3327 60653.0 624 e 47th street \n",
|
||
"3300 60601.0 360 n michigan avenue \n",
|
||
"3299 60644.0 5080 w harrison street \n",
|
||
"3285 60637.0 6040 s harper avenue \n",
|
||
"3281 60624.0 4241 w washington boulevard \n",
|
||
"... ... ... \n",
|
||
"1837 60612.0 2020 w jackson \n",
|
||
"2643 60618.0 2905 n. leavitt \n",
|
||
"3228 60618.0 2905 n leavitt street \n",
|
||
"3229 60618.0 2905 n clybourn avenue \n",
|
||
"3241 60617.0 9129 s burley avenue \n",
|
||
"\n",
|
||
" canonical_Executive Director \n",
|
||
"3327 \n",
|
||
"3300 \n",
|
||
"3299 \n",
|
||
"3285 \n",
|
||
"3281 \n",
|
||
"... ... \n",
|
||
"1837 5833ip(ehs collaboration enhanced home ip), 58... \n",
|
||
"2643 4374it(child care it center), 4374ps(child car... \n",
|
||
"3228 4374it(child care it center), 4374ps(child car... \n",
|
||
"3229 4374it(child care it center), 4374ps(child car... \n",
|
||
"3241 2488it(child care it center), 7030ps(hs collab... \n",
|
||
"\n",
|
||
"[3337 rows x 7 columns]"
|
||
]
|
||
},
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df_sorted[['Id', 'cluster id', 'confidence', 'Source', 'Zip', 'Address', 'canonical_Executive Director']]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Matching / Linking records\n",
|
||
"Another problem is matching / linking records from different sources."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Let's load two datasets from FEBRl (Freely extensible biomedical record linkage)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#load dataframes\n",
|
||
"dfa = pd.read_csv('data/dataset1-febrl.csv')\n",
|
||
"dfb = pd.read_csv('data/dataset2-febrl.csv')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"We cannot have missing values for applying record matching with this library, so we fill them.\n",
|
||
"\n",
|
||
"The problem is that many values are ' ' (not NaN). So, we first convert to NaN, and then we drop them."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"dfa.replace(['', ' '], np.nan, inplace=True)\n",
|
||
"dfb.replace(['', ' '], np.nan, inplace=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"dfa.dropna(inplace=True)\n",
|
||
"dfb.dropna(inplace=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>rec_id</th>\n",
|
||
" <th>given_name</th>\n",
|
||
" <th>surname</th>\n",
|
||
" <th>street_number</th>\n",
|
||
" <th>address_1</th>\n",
|
||
" <th>address_2</th>\n",
|
||
" <th>suburb</th>\n",
|
||
" <th>postcode</th>\n",
|
||
" <th>state</th>\n",
|
||
" <th>date_of_birth</th>\n",
|
||
" <th>soc_sec_id</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>rec-122-org</td>\n",
|
||
" <td>lachlan</td>\n",
|
||
" <td>berry</td>\n",
|
||
" <td>69</td>\n",
|
||
" <td>giblin street</td>\n",
|
||
" <td>killarney</td>\n",
|
||
" <td>bittern</td>\n",
|
||
" <td>4814</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19990219</td>\n",
|
||
" <td>7364009</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>rec-373-org</td>\n",
|
||
" <td>deakin</td>\n",
|
||
" <td>sondergeld</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>goldfinch circuit</td>\n",
|
||
" <td>kooltuo</td>\n",
|
||
" <td>canterbury</td>\n",
|
||
" <td>2776</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19600210</td>\n",
|
||
" <td>2635962</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>rec-227-org</td>\n",
|
||
" <td>luke</td>\n",
|
||
" <td>purdon</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>ramsay place</td>\n",
|
||
" <td>mirani</td>\n",
|
||
" <td>garbutt</td>\n",
|
||
" <td>2260</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19831024</td>\n",
|
||
" <td>8099933</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>rec-294-org</td>\n",
|
||
" <td>william</td>\n",
|
||
" <td>bishop</td>\n",
|
||
" <td>21</td>\n",
|
||
" <td>neworra place</td>\n",
|
||
" <td>apmnt 65</td>\n",
|
||
" <td>worongary</td>\n",
|
||
" <td>6225</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19490130</td>\n",
|
||
" <td>9773843</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10</th>\n",
|
||
" <td>rec-81-dup-0</td>\n",
|
||
" <td>abbey</td>\n",
|
||
" <td>fit</td>\n",
|
||
" <td>13</td>\n",
|
||
" <td>kosciusko avenue</td>\n",
|
||
" <td>the wharf complex</td>\n",
|
||
" <td>yass</td>\n",
|
||
" <td>2594</td>\n",
|
||
" <td>nsw</td>\n",
|
||
" <td>19870510</td>\n",
|
||
" <td>7661096</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>11</th>\n",
|
||
" <td>rec-34-org</td>\n",
|
||
" <td>isabella</td>\n",
|
||
" <td>lodder</td>\n",
|
||
" <td>156</td>\n",
|
||
" <td>messenger street</td>\n",
|
||
" <td>tongbong sanctuary</td>\n",
|
||
" <td>bayswater</td>\n",
|
||
" <td>4870</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19650714</td>\n",
|
||
" <td>2790666</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>rec-478-org</td>\n",
|
||
" <td>anthony</td>\n",
|
||
" <td>beazley</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>birubi place</td>\n",
|
||
" <td>currandina</td>\n",
|
||
" <td>flemington</td>\n",
|
||
" <td>2477</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19730924</td>\n",
|
||
" <td>6558077</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>13</th>\n",
|
||
" <td>rec-225-org</td>\n",
|
||
" <td>alia</td>\n",
|
||
" <td>streich</td>\n",
|
||
" <td>74</td>\n",
|
||
" <td>maranoa street</td>\n",
|
||
" <td>rocky bend</td>\n",
|
||
" <td>rowville</td>\n",
|
||
" <td>6152</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19790418</td>\n",
|
||
" <td>1975340</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>15</th>\n",
|
||
" <td>rec-452-org</td>\n",
|
||
" <td>alissa</td>\n",
|
||
" <td>kilmartin</td>\n",
|
||
" <td>37</td>\n",
|
||
" <td>reveley crescent</td>\n",
|
||
" <td>crown allot</td>\n",
|
||
" <td>wolumla</td>\n",
|
||
" <td>6210</td>\n",
|
||
" <td>nsw</td>\n",
|
||
" <td>19041118</td>\n",
|
||
" <td>7994055</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>rec-67-org</td>\n",
|
||
" <td>jacob</td>\n",
|
||
" <td>lyden</td>\n",
|
||
" <td>25</td>\n",
|
||
" <td>haddon street</td>\n",
|
||
" <td>glenview</td>\n",
|
||
" <td>woodville north</td>\n",
|
||
" <td>2226</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19910424</td>\n",
|
||
" <td>6426415</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" rec_id given_name surname street_number address_1 \\\n",
|
||
"1 rec-122-org lachlan berry 69 giblin street \n",
|
||
"2 rec-373-org deakin sondergeld 48 goldfinch circuit \n",
|
||
"4 rec-227-org luke purdon 23 ramsay place \n",
|
||
"7 rec-294-org william bishop 21 neworra place \n",
|
||
"10 rec-81-dup-0 abbey fit 13 kosciusko avenue \n",
|
||
"11 rec-34-org isabella lodder 156 messenger street \n",
|
||
"12 rec-478-org anthony beazley 12 birubi place \n",
|
||
"13 rec-225-org alia streich 74 maranoa street \n",
|
||
"15 rec-452-org alissa kilmartin 37 reveley crescent \n",
|
||
"16 rec-67-org jacob lyden 25 haddon street \n",
|
||
"\n",
|
||
" address_2 suburb postcode state date_of_birth \\\n",
|
||
"1 killarney bittern 4814 qld 19990219 \n",
|
||
"2 kooltuo canterbury 2776 vic 19600210 \n",
|
||
"4 mirani garbutt 2260 vic 19831024 \n",
|
||
"7 apmnt 65 worongary 6225 qld 19490130 \n",
|
||
"10 the wharf complex yass 2594 nsw 19870510 \n",
|
||
"11 tongbong sanctuary bayswater 4870 vic 19650714 \n",
|
||
"12 currandina flemington 2477 qld 19730924 \n",
|
||
"13 rocky bend rowville 6152 vic 19790418 \n",
|
||
"15 crown allot wolumla 6210 nsw 19041118 \n",
|
||
"16 glenview woodville north 2226 qld 19910424 \n",
|
||
"\n",
|
||
" soc_sec_id \n",
|
||
"1 7364009 \n",
|
||
"2 2635962 \n",
|
||
"4 8099933 \n",
|
||
"7 9773843 \n",
|
||
"10 7661096 \n",
|
||
"11 2790666 \n",
|
||
"12 6558077 \n",
|
||
"13 1975340 \n",
|
||
"15 7994055 \n",
|
||
"16 6426415 "
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"dfa.head(10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>rec_id</th>\n",
|
||
" <th>given_name</th>\n",
|
||
" <th>surname</th>\n",
|
||
" <th>street_number</th>\n",
|
||
" <th>address_1</th>\n",
|
||
" <th>address_2</th>\n",
|
||
" <th>suburb</th>\n",
|
||
" <th>postcode</th>\n",
|
||
" <th>state</th>\n",
|
||
" <th>date_of_birth</th>\n",
|
||
" <th>soc_sec_id</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>rec-2778-org</td>\n",
|
||
" <td>sarah</td>\n",
|
||
" <td>bruhn</td>\n",
|
||
" <td>44</td>\n",
|
||
" <td>forbes street</td>\n",
|
||
" <td>wintersloe</td>\n",
|
||
" <td>kellerberrin</td>\n",
|
||
" <td>4510</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19300213</td>\n",
|
||
" <td>7535316</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>rec-712-dup-0</td>\n",
|
||
" <td>jacob</td>\n",
|
||
" <td>lanyon</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>milne cove</td>\n",
|
||
" <td>wellwod</td>\n",
|
||
" <td>beaconsfield upper</td>\n",
|
||
" <td>2602</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19080712</td>\n",
|
||
" <td>9497788</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>rec-1321-org</td>\n",
|
||
" <td>brinley</td>\n",
|
||
" <td>efthimiou</td>\n",
|
||
" <td>35</td>\n",
|
||
" <td>sturdee crescent</td>\n",
|
||
" <td>tremearne</td>\n",
|
||
" <td>scarborough</td>\n",
|
||
" <td>5211</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19940319</td>\n",
|
||
" <td>6814956</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>rec-3004-org</td>\n",
|
||
" <td>aleisha</td>\n",
|
||
" <td>hobson</td>\n",
|
||
" <td>54</td>\n",
|
||
" <td>oliver street</td>\n",
|
||
" <td>inglewood</td>\n",
|
||
" <td>toowoomba</td>\n",
|
||
" <td>3175</td>\n",
|
||
" <td>qld</td>\n",
|
||
" <td>19290427</td>\n",
|
||
" <td>5967384</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>rec-1384-org</td>\n",
|
||
" <td>ethan</td>\n",
|
||
" <td>gazzola</td>\n",
|
||
" <td>49</td>\n",
|
||
" <td>sheaffe street</td>\n",
|
||
" <td>bimby vale</td>\n",
|
||
" <td>port pirie</td>\n",
|
||
" <td>3088</td>\n",
|
||
" <td>sa</td>\n",
|
||
" <td>19631225</td>\n",
|
||
" <td>3832742</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>rec-3981-org</td>\n",
|
||
" <td>alicia</td>\n",
|
||
" <td>hope</td>\n",
|
||
" <td>100</td>\n",
|
||
" <td>mansfield place</td>\n",
|
||
" <td>sunset</td>\n",
|
||
" <td>byford</td>\n",
|
||
" <td>6061</td>\n",
|
||
" <td>sa</td>\n",
|
||
" <td>19421201</td>\n",
|
||
" <td>7934773</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>rec-916-org</td>\n",
|
||
" <td>benjamin</td>\n",
|
||
" <td>kolosche</td>\n",
|
||
" <td>78</td>\n",
|
||
" <td>keenan street</td>\n",
|
||
" <td>wingara</td>\n",
|
||
" <td>raymond terrace</td>\n",
|
||
" <td>3212</td>\n",
|
||
" <td>sa</td>\n",
|
||
" <td>19450918</td>\n",
|
||
" <td>5698873</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>rec-63-dup-0</td>\n",
|
||
" <td>olivia</td>\n",
|
||
" <td>white</td>\n",
|
||
" <td>55</td>\n",
|
||
" <td>duffy street</td>\n",
|
||
" <td>shopping village</td>\n",
|
||
" <td>mirrabooka</td>\n",
|
||
" <td>2260</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19000106</td>\n",
|
||
" <td>4996142</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10</th>\n",
|
||
" <td>rec-112-org</td>\n",
|
||
" <td>joshua</td>\n",
|
||
" <td>rudd</td>\n",
|
||
" <td>78</td>\n",
|
||
" <td>max henry crescent</td>\n",
|
||
" <td>brentwood vlge</td>\n",
|
||
" <td>port douglas</td>\n",
|
||
" <td>2315</td>\n",
|
||
" <td>vic</td>\n",
|
||
" <td>19951125</td>\n",
|
||
" <td>1697892</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>11</th>\n",
|
||
" <td>rec-3297-org</td>\n",
|
||
" <td>rachael</td>\n",
|
||
" <td>lomman</td>\n",
|
||
" <td>37</td>\n",
|
||
" <td>carlile street</td>\n",
|
||
" <td>clonturkle</td>\n",
|
||
" <td>bronte</td>\n",
|
||
" <td>2177</td>\n",
|
||
" <td>nsw</td>\n",
|
||
" <td>19910228</td>\n",
|
||
" <td>9462397</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" rec_id given_name surname street_number address_1 \\\n",
|
||
"0 rec-2778-org sarah bruhn 44 forbes street \n",
|
||
"1 rec-712-dup-0 jacob lanyon 5 milne cove \n",
|
||
"2 rec-1321-org brinley efthimiou 35 sturdee crescent \n",
|
||
"3 rec-3004-org aleisha hobson 54 oliver street \n",
|
||
"4 rec-1384-org ethan gazzola 49 sheaffe street \n",
|
||
"5 rec-3981-org alicia hope 100 mansfield place \n",
|
||
"6 rec-916-org benjamin kolosche 78 keenan street \n",
|
||
"8 rec-63-dup-0 olivia white 55 duffy street \n",
|
||
"10 rec-112-org joshua rudd 78 max henry crescent \n",
|
||
"11 rec-3297-org rachael lomman 37 carlile street \n",
|
||
"\n",
|
||
" address_2 suburb postcode state date_of_birth \\\n",
|
||
"0 wintersloe kellerberrin 4510 vic 19300213 \n",
|
||
"1 wellwod beaconsfield upper 2602 vic 19080712 \n",
|
||
"2 tremearne scarborough 5211 qld 19940319 \n",
|
||
"3 inglewood toowoomba 3175 qld 19290427 \n",
|
||
"4 bimby vale port pirie 3088 sa 19631225 \n",
|
||
"5 sunset byford 6061 sa 19421201 \n",
|
||
"6 wingara raymond terrace 3212 sa 19450918 \n",
|
||
"8 shopping village mirrabooka 2260 vic 19000106 \n",
|
||
"10 brentwood vlge port douglas 2315 vic 19951125 \n",
|
||
"11 clonturkle bronte 2177 nsw 19910228 \n",
|
||
"\n",
|
||
" soc_sec_id \n",
|
||
"0 7535316 \n",
|
||
"1 9497788 \n",
|
||
"2 6814956 \n",
|
||
"3 5967384 \n",
|
||
"4 3832742 \n",
|
||
"5 7934773 \n",
|
||
"6 5698873 \n",
|
||
"8 4996142 \n",
|
||
"10 1697892 \n",
|
||
"11 9462397 "
|
||
]
|
||
},
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"dfb.head(10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
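"Check that the two datasets have the same columns. Note that every column name except `rec_id` starts with a space, so the names must be written exactly as printed."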
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
|
||
" ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
|
||
" ' soc_sec_id'],\n",
|
||
" dtype='object')\n",
|
||
"Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',\n",
|
||
" ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',\n",
|
||
" ' soc_sec_id'],\n",
|
||
" dtype='object')\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(dfa.columns)\n",
|
||
"print(dfb.columns)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
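"Let's link the two datasets. The call opens an interactive labeling session: for each candidate pair, answer (y)es, (n)o or (u)nsure, and enter (f)inished once enough pairs are labeled. If the labeled pairs do not give the model enough reliable matches, the linking step can fail with a `BlockingError` that asks for more training data."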
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Importing data ...\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : reeve\n",
|
||
" surname : quilliam\n",
|
||
" street_number : 2\n",
|
||
" address_1 : renwick street\n",
|
||
" address_2 : yarrabee\n",
|
||
" suburb : barwon heads\n",
|
||
" postcode : 2340\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19810406\n",
|
||
" soc_sec_id : 1066923\n",
|
||
"\n",
|
||
" given_name : jessica\n",
|
||
" surname : reid\n",
|
||
" street_number : 280\n",
|
||
" address_1 : medley street\n",
|
||
" address_2 : warra creek\n",
|
||
" suburb : ballarat\n",
|
||
" postcode : 3149\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19830907\n",
|
||
" soc_sec_id : 1067529\n",
|
||
"\n",
|
||
"0/10 positive, 0/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Starting active labeling...\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : daniel\n",
|
||
" surname : couzens\n",
|
||
" street_number : 37\n",
|
||
" address_1 : coventry close\n",
|
||
" address_2 : cressbrook\n",
|
||
" suburb : mount eliza\n",
|
||
" postcode : 5073\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19881127\n",
|
||
" soc_sec_id : 6934299\n",
|
||
"\n",
|
||
" given_name : dante\n",
|
||
" surname : dakin\n",
|
||
" street_number : 3\n",
|
||
" address_1 : chuculba crescent\n",
|
||
" address_2 : greenpatch\n",
|
||
" suburb : forbes\n",
|
||
" postcode : 5072\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19481028\n",
|
||
" soc_sec_id : 7288639\n",
|
||
"\n",
|
||
"0/10 positive, 1/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : lachlan\n",
|
||
" surname : jukic\n",
|
||
" street_number : 2\n",
|
||
" address_1 : morgan crescent\n",
|
||
" address_2 : parklea\n",
|
||
" suburb : raymond terrace\n",
|
||
" postcode : 2250\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19780702\n",
|
||
" soc_sec_id : 4027998\n",
|
||
"\n",
|
||
" given_name : meg\n",
|
||
" surname : feil\n",
|
||
" street_number : 17\n",
|
||
" address_1 : biraban place\n",
|
||
" address_2 : hughloch lincoln red stud\n",
|
||
" suburb : hawthorne\n",
|
||
" postcode : 3429\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19060812\n",
|
||
" soc_sec_id : 4027997\n",
|
||
"\n",
|
||
"0/10 positive, 2/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : jacob\n",
|
||
" surname : lyen\n",
|
||
" street_number : 25\n",
|
||
" address_1 : haddon srteet\n",
|
||
" address_2 : glenvie w\n",
|
||
" suburb : woodville north\n",
|
||
" postcode : 2226\n",
|
||
" state : qld\n",
|
||
" date_of_birth : 19910424\n",
|
||
" soc_sec_id : 6426415\n",
|
||
"\n",
|
||
" given_name : zac\n",
|
||
" surname : white\n",
|
||
" street_number : 26\n",
|
||
" address_1 : companion crescent\n",
|
||
" address_2 : glenview\n",
|
||
" suburb : toronto\n",
|
||
" postcode : 2226\n",
|
||
" state : sa\n",
|
||
" date_of_birth : 19431117\n",
|
||
" soc_sec_id : 3437945\n",
|
||
"\n",
|
||
"0/10 positive, 3/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : kydan\n",
|
||
" surname : mccarthy\n",
|
||
" street_number : 67\n",
|
||
" address_1 : clemenger street\n",
|
||
" address_2 : the points holsteins\n",
|
||
" suburb : fairlawn\n",
|
||
" postcode : 6415\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19720518\n",
|
||
" soc_sec_id : 6527653\n",
|
||
"\n",
|
||
" given_name : daniel\n",
|
||
" surname : mccarthy\n",
|
||
" street_number : 6\n",
|
||
" address_1 : brunton street\n",
|
||
" address_2 : tall pines\n",
|
||
" suburb : fairlight\n",
|
||
" postcode : 3155\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19760107\n",
|
||
" soc_sec_id : 8093038\n",
|
||
"\n",
|
||
"0/10 positive, 4/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : brooklyn\n",
|
||
" surname : manson\n",
|
||
" street_number : 27\n",
|
||
" address_1 : clive steele avenue\n",
|
||
" address_2 : port fairy road\n",
|
||
" suburb : mount low\n",
|
||
" postcode : 3450\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19710727\n",
|
||
" soc_sec_id : 4493900\n",
|
||
"\n",
|
||
" given_name : ruby\n",
|
||
" surname : mason\n",
|
||
" street_number : 7\n",
|
||
" address_1 : clive steele avenue\n",
|
||
" address_2 : kooyong\n",
|
||
" suburb : botany\n",
|
||
" postcode : 3636\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19730913\n",
|
||
" soc_sec_id : 4397223\n",
|
||
"\n",
|
||
"0/10 positive, 5/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : emiily\n",
|
||
" surname : coleman\n",
|
||
" street_number : 108\n",
|
||
" address_1 : chewings street\n",
|
||
" address_2 : berkeley vlge\n",
|
||
" suburb : wellington point\n",
|
||
" postcode : 2550\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19421221\n",
|
||
" soc_sec_id : 7206933\n",
|
||
"\n",
|
||
" given_name : emiily\n",
|
||
" surname : went\n",
|
||
" street_number : 18\n",
|
||
" address_1 : glenmaggie street\n",
|
||
" address_2 : berkeley vlge\n",
|
||
" suburb : blue haven\n",
|
||
" postcode : 6051\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19521205\n",
|
||
" soc_sec_id : 8530937\n",
|
||
"\n",
|
||
"0/10 positive, 6/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : jacob\n",
|
||
" surname : white\n",
|
||
" street_number : 5\n",
|
||
" address_1 : findlay street\n",
|
||
" address_2 : booroopki park rmb 596\n",
|
||
" suburb : robina\n",
|
||
" postcode : 2197\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19170205\n",
|
||
" soc_sec_id : 4702928\n",
|
||
"\n",
|
||
" given_name : talia\n",
|
||
" surname : reid\n",
|
||
" street_number : 147\n",
|
||
" address_1 : sid barnes crescent\n",
|
||
" address_2 : tathra\n",
|
||
" suburb : berowra heights\n",
|
||
" postcode : 2170\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19230203\n",
|
||
" soc_sec_id : 4712927\n",
|
||
"\n",
|
||
"0/10 positive, 7/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : carla\n",
|
||
" surname : amiet\n",
|
||
" street_number : 50\n",
|
||
" address_1 : carstensz street\n",
|
||
" address_2 : oaklodge\n",
|
||
" suburb : blackmans bay\n",
|
||
" postcode : 3180\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19790801\n",
|
||
" soc_sec_id : 9646483\n",
|
||
"\n",
|
||
" given_name : cameron\n",
|
||
" surname : coleman\n",
|
||
" street_number : 10\n",
|
||
" address_1 : edwards street\n",
|
||
" address_2 : broadbridge manor\n",
|
||
" suburb : blackmans bay\n",
|
||
" postcode : 3630\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19871030\n",
|
||
" soc_sec_id : 5502408\n",
|
||
"\n",
|
||
"0/10 positive, 8/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : lara\n",
|
||
" surname : matthews\n",
|
||
" street_number : 6\n",
|
||
" address_1 : beaumaris street\n",
|
||
" address_2 : sunforest therapy centre\n",
|
||
" suburb : the basin\n",
|
||
" postcode : 4179\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19911006\n",
|
||
" soc_sec_id : 2164704\n",
|
||
"\n",
|
||
" given_name : dominic\n",
|
||
" surname : matthews\n",
|
||
" street_number : 67\n",
|
||
" address_1 : campbell street\n",
|
||
" address_2 : narraburra lodge\n",
|
||
" suburb : coonabarabran\n",
|
||
" postcode : 3174\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19470226\n",
|
||
" soc_sec_id : 3115384\n",
|
||
"\n",
|
||
"0/10 positive, 9/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : jack\n",
|
||
" surname : matthews\n",
|
||
" street_number : 17\n",
|
||
" address_1 : herron crescent\n",
|
||
" address_2 : broadmere\n",
|
||
" suburb : highton\n",
|
||
" postcode : 2035\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19081119\n",
|
||
" soc_sec_id : 5613395\n",
|
||
"\n",
|
||
" given_name : alexandra\n",
|
||
" surname : matthews\n",
|
||
" street_number : 174\n",
|
||
" address_1 : port jackson circuit\n",
|
||
" address_2 : old timers south\n",
|
||
" suburb : whaleback\n",
|
||
" postcode : 2830\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19261017\n",
|
||
" soc_sec_id : 2919332\n",
|
||
"\n",
|
||
"0/10 positive, 10/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : emiily\n",
|
||
" surname : kowald\n",
|
||
" street_number : 18\n",
|
||
" address_1 : daglish street\n",
|
||
" address_2 : oakdale\n",
|
||
" suburb : avalon\n",
|
||
" postcode : 3030\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19250313\n",
|
||
" soc_sec_id : 1590627\n",
|
||
"\n",
|
||
" given_name : emiily\n",
|
||
" surname : went\n",
|
||
" street_number : 18\n",
|
||
" address_1 : glenmaggie street\n",
|
||
" address_2 : berkeley vlge\n",
|
||
" suburb : blue haven\n",
|
||
" postcode : 6051\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19521205\n",
|
||
" soc_sec_id : 8530937\n",
|
||
"\n",
|
||
"0/10 positive, 11/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : jaden\n",
|
||
" surname : humphreys\n",
|
||
" street_number : 14\n",
|
||
" address_1 : euree street\n",
|
||
" address_2 : moorillyah\n",
|
||
" suburb : thornlie\n",
|
||
" postcode : 3116\n",
|
||
" state : wa\n",
|
||
" date_of_birth : 19700116\n",
|
||
" soc_sec_id : 9382782\n",
|
||
"\n",
|
||
" given_name : isabelle\n",
|
||
" surname : jolly\n",
|
||
" street_number : 166\n",
|
||
" address_1 : hodges street\n",
|
||
" address_2 : bosmit\n",
|
||
" suburb : thornlie\n",
|
||
" postcode : 3163\n",
|
||
" state : wa\n",
|
||
" date_of_birth : 19050126\n",
|
||
" soc_sec_id : 2719590\n",
|
||
"\n",
|
||
"0/10 positive, 12/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : brooklyn\n",
|
||
" surname : manson\n",
|
||
" street_number : 27\n",
|
||
" address_1 : clive steele avenue\n",
|
||
" address_2 : port fairy road\n",
|
||
" suburb : yagoona\n",
|
||
" postcode : 3450\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19710727\n",
|
||
" soc_sec_id : 4493900\n",
|
||
"\n",
|
||
" given_name : laura\n",
|
||
" surname : campbell\n",
|
||
" street_number : 152\n",
|
||
" address_1 : clive steele avenue\n",
|
||
" address_2 : irrigation farm\n",
|
||
" suburb : yagoona\n",
|
||
" postcode : 3350\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19160610\n",
|
||
" soc_sec_id : 6214635\n",
|
||
"\n",
|
||
"0/10 positive, 13/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : jacob\n",
|
||
" surname : white\n",
|
||
" street_number : 5\n",
|
||
" address_1 : findlay street\n",
|
||
" address_2 : booroopki park rmb 596\n",
|
||
" suburb : robina\n",
|
||
" postcode : 2197\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19170205\n",
|
||
" soc_sec_id : 4702928\n",
|
||
"\n",
|
||
" given_name : jakob\n",
|
||
" surname : menzies\n",
|
||
" street_number : 33\n",
|
||
" address_1 : coverdale street\n",
|
||
" address_2 : bundong\n",
|
||
" suburb : worongary\n",
|
||
" postcode : 2190\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19140610\n",
|
||
" soc_sec_id : 4557295\n",
|
||
"\n",
|
||
"0/10 positive, 14/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : benjamin\n",
|
||
" surname : kirchener\n",
|
||
" street_number : 12\n",
|
||
" address_1 : stuart street\n",
|
||
" address_2 : wurrami\n",
|
||
" suburb : theodore\n",
|
||
" postcode : 2620\n",
|
||
" state : wa\n",
|
||
" date_of_birth : 19751110\n",
|
||
" soc_sec_id : 1766048\n",
|
||
"\n",
|
||
" given_name : max\n",
|
||
" surname : rees\n",
|
||
" street_number : 10\n",
|
||
" address_1 : waite street\n",
|
||
" address_2 : rosedown\n",
|
||
" suburb : nambour\n",
|
||
" postcode : 2620\n",
|
||
" state : wa\n",
|
||
" date_of_birth : 19751230\n",
|
||
" soc_sec_id : 1361900\n",
|
||
"\n",
|
||
"0/10 positive, 15/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" n\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : brydon\n",
|
||
" surname : webb\n",
|
||
" street_number : 11\n",
|
||
" address_1 : walker crescent\n",
|
||
" address_2 : bugoren\n",
|
||
" suburb : charlestown\n",
|
||
" postcode : 6258\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19190528\n",
|
||
" soc_sec_id : 4191569\n",
|
||
"\n",
|
||
" given_name : bradley\n",
|
||
" surname : haberfield\n",
|
||
" street_number : 11\n",
|
||
" address_1 : carumbo place\n",
|
||
" address_2 : bungarra\n",
|
||
" suburb : canley heights\n",
|
||
" postcode : 2758\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19190528\n",
|
||
" soc_sec_id : 5039500\n",
|
||
"\n",
|
||
"0/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : brydon\n",
|
||
" surname : webb\n",
|
||
" street_number : 11\n",
|
||
" address_1 : walker crescent\n",
|
||
" address_2 : bugoren\n",
|
||
" suburb : charlestown\n",
|
||
" postcode : 6258\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19190528\n",
|
||
" soc_sec_id : 4191569\n",
|
||
"\n",
|
||
" given_name : bradley\n",
|
||
" surname : haberfield\n",
|
||
" street_number : 11\n",
|
||
" address_1 : carumbi place\n",
|
||
" address_2 : bungarra\n",
|
||
" suburb : canley heights\n",
|
||
" postcode : 2758\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19190528\n",
|
||
" soc_sec_id : 5039500\n",
|
||
"\n",
|
||
"1/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : lochlan\n",
|
||
" surname : savidge\n",
|
||
" street_number : 29\n",
|
||
" address_1 : pohlman street\n",
|
||
" address_2 : moline village\n",
|
||
" suburb : jingili\n",
|
||
" postcode : 3071\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19140228\n",
|
||
" soc_sec_id : 1498207\n",
|
||
"\n",
|
||
" given_name : jayme\n",
|
||
" surname : parr\n",
|
||
" street_number : 2\n",
|
||
" address_1 : clive steele avenue\n",
|
||
" address_2 : henry kendall hostel\n",
|
||
" suburb : hoskinstown\n",
|
||
" postcode : 2770\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19140228\n",
|
||
" soc_sec_id : 5840194\n",
|
||
"\n",
|
||
"2/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : charlie\n",
|
||
" surname : headon\n",
|
||
" street_number : 17\n",
|
||
" address_1 : sheehy street\n",
|
||
" address_2 : hawkins masonic vlge\n",
|
||
" suburb : warrandyte\n",
|
||
" postcode : 3073\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19880814\n",
|
||
" soc_sec_id : 7871445\n",
|
||
"\n",
|
||
" given_name : william\n",
|
||
" surname : hislop\n",
|
||
" street_number : 17\n",
|
||
" address_1 : deane street\n",
|
||
" address_2 : sunbury\n",
|
||
" suburb : cedar creek\n",
|
||
" postcode : 3073\n",
|
||
" state : qld\n",
|
||
" date_of_birth : 19830819\n",
|
||
" soc_sec_id : 6153593\n",
|
||
"\n",
|
||
"3/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : mitchell\n",
|
||
" surname : scrbak\n",
|
||
" street_number : 950\n",
|
||
" address_1 : holyman street\n",
|
||
" address_2 : berkeley vlge\n",
|
||
" suburb : safety bay\n",
|
||
" postcode : 4300\n",
|
||
" state : qld\n",
|
||
" date_of_birth : 19060811\n",
|
||
" soc_sec_id : 3592109\n",
|
||
"\n",
|
||
" given_name : hannah\n",
|
||
" surname : beams\n",
|
||
" street_number : 9\n",
|
||
" address_1 : light street\n",
|
||
" address_2 : castle hill farm\n",
|
||
" suburb : sale\n",
|
||
" postcode : 3221\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19560811\n",
|
||
" soc_sec_id : 7444484\n",
|
||
"\n",
|
||
"4/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : finley\n",
|
||
" surname : haeusler\n",
|
||
" street_number : 27\n",
|
||
" address_1 : noarlunga crescent\n",
|
||
" address_2 : spring ridge\n",
|
||
" suburb : nambour\n",
|
||
" postcode : 3180\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19660711\n",
|
||
" soc_sec_id : 3025217\n",
|
||
"\n",
|
||
" given_name : elki\n",
|
||
" surname : trent\n",
|
||
" street_number : 27\n",
|
||
" address_1 : wray place\n",
|
||
" address_2 : the mews royal hotel bldg\n",
|
||
" suburb : gawler east\n",
|
||
" postcode : 3152\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19600211\n",
|
||
" soc_sec_id : 5679502\n",
|
||
"\n",
|
||
"5/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : india\n",
|
||
" surname : negrean\n",
|
||
" street_number : 1\n",
|
||
" address_1 : barringer street\n",
|
||
" address_2 : sherwood\n",
|
||
" suburb : parkinson\n",
|
||
" postcode : 3168\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19860923\n",
|
||
" soc_sec_id : 2097928\n",
|
||
"\n",
|
||
" given_name : logan\n",
|
||
" surname : selth\n",
|
||
" street_number : 147\n",
|
||
" address_1 : goyder street\n",
|
||
" address_2 : rivonia\n",
|
||
" suburb : queenscliff\n",
|
||
" postcode : 2120\n",
|
||
" state : tas\n",
|
||
" date_of_birth : 19860921\n",
|
||
" soc_sec_id : 4161322\n",
|
||
"\n",
|
||
"6/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : annablle\n",
|
||
" surname : kounis\n",
|
||
" street_number : 121\n",
|
||
" address_1 : calder place\n",
|
||
" address_2 : brambletye vinyard\n",
|
||
" suburb : joondkalup\n",
|
||
" postcode : 3120\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19640907\n",
|
||
" soc_sec_id : 1612956\n",
|
||
"\n",
|
||
" given_name : claurdia\n",
|
||
" surname : clelland\n",
|
||
" street_number : 12\n",
|
||
" address_1 : box hill a venue\n",
|
||
" address_2 : st francis vlge\n",
|
||
" suburb : old beach\n",
|
||
" postcode : 3127\n",
|
||
" state : wa\n",
|
||
" date_of_birth : 19640902\n",
|
||
" soc_sec_id : 9508954\n",
|
||
"\n",
|
||
"7/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : troy\n",
|
||
" surname : reid\n",
|
||
" street_number : 1\n",
|
||
" address_1 : allan street\n",
|
||
" address_2 : townview\n",
|
||
" suburb : page\n",
|
||
" postcode : 2774\n",
|
||
" state : qld\n",
|
||
" date_of_birth : 19250727\n",
|
||
" soc_sec_id : 3580821\n",
|
||
"\n",
|
||
" given_name : william\n",
|
||
" surname : tossell\n",
|
||
" street_number : 1\n",
|
||
" address_1 : lutana street\n",
|
||
" address_2 : nara cnsa\n",
|
||
" suburb : craigmore\n",
|
||
" postcode : 2509\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19250116\n",
|
||
" soc_sec_id : 5322906\n",
|
||
"\n",
|
||
"8/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : isaac\n",
|
||
" surname : quilliam\n",
|
||
" street_number : 11\n",
|
||
" address_1 : namatjira drive\n",
|
||
" address_2 : delaware\n",
|
||
" suburb : geelong west\n",
|
||
" postcode : 3072\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19930926\n",
|
||
" soc_sec_id : 1556150\n",
|
||
"\n",
|
||
" given_name : bailey\n",
|
||
" surname : clarke\n",
|
||
" street_number : 1\n",
|
||
" address_1 : hetherington circuit\n",
|
||
" address_2 : gundaline\n",
|
||
" suburb : harden\n",
|
||
" postcode : 2077\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19930416\n",
|
||
" soc_sec_id : 6134615\n",
|
||
"\n",
|
||
"9/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : james\n",
|
||
" surname : blake\n",
|
||
" street_number : 19\n",
|
||
" address_1 : sturt avenue\n",
|
||
" address_2 : laloki\n",
|
||
" suburb : carnegie\n",
|
||
" postcode : 2218\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19050716\n",
|
||
" soc_sec_id : 7830672\n",
|
||
"\n",
|
||
" given_name : finn\n",
|
||
" surname : kapoor\n",
|
||
" street_number : 1994\n",
|
||
" address_1 : sturt avenue\n",
|
||
" address_2 : john flynn medical centre\n",
|
||
" suburb : mullumbimby\n",
|
||
" postcode : 2262\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19880816\n",
|
||
" soc_sec_id : 8680815\n",
|
||
"\n",
|
||
"10/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" y\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" given_name : harry\n",
|
||
" surname : ryn\n",
|
||
" street_number : 5\n",
|
||
" address_1 : kellway street\n",
|
||
" address_2 : rowethorpe\n",
|
||
" suburb : toowong\n",
|
||
" postcode : 3931\n",
|
||
" state : nsw\n",
|
||
" date_of_birth : 19220503\n",
|
||
" soc_sec_id : 7228670\n",
|
||
"\n",
|
||
" given_name : samantha\n",
|
||
" surname : grierson\n",
|
||
" street_number : 5\n",
|
||
" address_1 : kennedy street\n",
|
||
" address_2 : tantallon\n",
|
||
" suburb : oakleigh\n",
|
||
" postcode : 3034\n",
|
||
" state : vic\n",
|
||
" date_of_birth : 19210114\n",
|
||
" soc_sec_id : 4683164\n",
|
||
"\n",
|
||
"11/10 positive, 16/10 negative\n",
|
||
"Do these records refer to the same thing?\n",
|
||
"(y)es / (n)o / (u)nsure / (f)inished / (p)revious\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdin",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" f\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Finished labeling\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Clustering...\n"
|
||
]
|
||
},
|
||
{
|
||
"ename": "BlockingError",
|
||
"evalue": "No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data.",
|
||
"output_type": "error",
|
||
"traceback": [
|
||
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||
"\u001b[0;31mBlockingError\u001b[0m Traceback (most recent call last)",
|
||
"Cell \u001b[0;32mIn[31], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m#initiate matching\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m df_final \u001b[38;5;241m=\u001b[39m pandas_dedupe\u001b[38;5;241m.\u001b[39mlink_dataframes(dfa, dfb, [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m given_name\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m surname\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m street_number\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_1\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m address_2\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m suburb\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m postcode\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m state\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m date_of_birth\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m soc_sec_id\u001b[39m\u001b[38;5;124m'\u001b[39m])\n",
|
||
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/pandas_dedupe/link_dataframes.py:112\u001b[0m, in \u001b[0;36mlink_dataframes\u001b[0;34m(dfa, dfb, field_properties, config_name, n_cores)\u001b[0m\n\u001b[1;32m 100\u001b[0m \u001b[38;5;66;03m# ## Blocking\u001b[39;00m\n\u001b[1;32m 101\u001b[0m \n\u001b[1;32m 102\u001b[0m \u001b[38;5;66;03m# ## Clustering\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 108\u001b[0m \u001b[38;5;66;03m# If we had more data, we would not pass in all the blocked data into\u001b[39;00m\n\u001b[1;32m 109\u001b[0m \u001b[38;5;66;03m# this function but a representative sample.\u001b[39;00m\n\u001b[1;32m 111\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mClustering...\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m--> 112\u001b[0m linked_records \u001b[38;5;241m=\u001b[39m linker\u001b[38;5;241m.\u001b[39mjoin(data_1, data_2, \u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 114\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m# duplicate sets\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;28mlen\u001b[39m(linked_records))\n\u001b[1;32m 117\u001b[0m \u001b[38;5;66;03m#Convert linked records into dataframe\u001b[39;00m\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:549\u001b[0m, in \u001b[0;36mRecordLinkMatching.join\u001b[0;34m(self, data_1, data_2, threshold, constraint)\u001b[0m\n\u001b[1;32m 543\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m constraint \u001b[38;5;129;01min\u001b[39;00m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmany-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m}, (\n\u001b[1;32m 544\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m is an invalid constraint option. Valid options include \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 545\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one, many-to-one, or many-to-many\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m constraint\n\u001b[1;32m 546\u001b[0m )\n\u001b[1;32m 548\u001b[0m pairs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpairs(data_1, data_2)\n\u001b[0;32m--> 549\u001b[0m pair_scores \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mscore(pairs)\n\u001b[1;32m 551\u001b[0m links: Links\n\u001b[1;32m 552\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m constraint \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone-to-one\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/api.py:125\u001b[0m, in \u001b[0;36mIntegralMatching.score\u001b[0;34m(self, pairs)\u001b[0m\n\u001b[1;32m 116\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 117\u001b[0m \u001b[38;5;124;03mScores pairs of records. Returns pairs of tuples of records id and\u001b[39;00m\n\u001b[1;32m 118\u001b[0m \u001b[38;5;124;03massociated probabilities that the pair of records are match\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 122\u001b[0m \n\u001b[1;32m 123\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 124\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 125\u001b[0m matches \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mscoreDuplicates(\n\u001b[1;32m 126\u001b[0m pairs, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdata_model\u001b[38;5;241m.\u001b[39mdistances, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclassifier, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnum_cores\n\u001b[1;32m 127\u001b[0m )\n\u001b[1;32m 128\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m:\n\u001b[1;32m 129\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[1;32m 130\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 131\u001b[0m \u001b[38;5;124;03m You need to either turn off multiprocessing or protect\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[38;5;124;03m https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods\"\"\"\u001b[39;00m\n\u001b[1;32m 135\u001b[0m )\n",
"File \u001b[0;32m~/anaconda3/lib/python3.11/site-packages/dedupe/core.py:126\u001b[0m, in \u001b[0;36mscoreDuplicates\u001b[0;34m(record_pairs, featurizer, classifier, num_cores)\u001b[0m\n\u001b[1;32m 124\u001b[0m first, record_pairs \u001b[38;5;241m=\u001b[39m peek(record_pairs)\n\u001b[1;32m 125\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m first \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 126\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m BlockingError(\n\u001b[1;32m 127\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo records have been blocked together. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 128\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIs the data you are trying to match like \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 129\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mthe data you trained on? If so, try adding \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 130\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmore training data.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 131\u001b[0m )\n\u001b[1;32m 133\u001b[0m record_pairs_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue(\u001b[38;5;241m2\u001b[39m)\n\u001b[1;32m 134\u001b[0m exception_queue: _Queue \u001b[38;5;241m=\u001b[39m Queue()\n",
"\u001b[0;31mBlockingError\u001b[0m: No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data."
]
}
],
"source": [
"#initiate matching\n",
"df_final = pandas_dedupe.link_dataframes(dfa, dfb, [' given_name', ' surname', ' street_number', ' address_1', ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',' soc_sec_id'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Exercise\n",
"Try to deduplicate the White House visitor records.\n",
"You can find the data [here](https://obamawhitehouse.archives.gov/goodgovernment/tools/visitor-records)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019\n",
"* [Data Preprocessing for Machine Learning in Python](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), GeeksForGeeks\n",
"* [Dedupe](https://dedupe.io/) package\n",
"* [pandas-dedupe](https://pypi.org/project/pandas-dedupe/) package"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"height": "158.667px",
"left": "400px",
"right": "20px",
"top": "50px",
"width": "700px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}