![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)

# Duplicated values

Real-world datasets are often messy: the same entity may appear several times, sometimes with small differences between the copies.

We will use the package [dedupe](https://dedupe.io/) to eliminate duplicates. 


Some alternatives are the packages [recordlinkage](https://pypi.org/project/recordlinkage/) and [thefuzz](https://github.com/seatgeek/thefuzz).

Instead of using the dedupe package directly, we are going to use **pandas-dedupe**, which applies it to pandas dataframes:


**pip install pandas-dedupe**



Let's start by loading messy data.

In [4]:
import warnings
warnings.filterwarnings('ignore') # Avoid warnings

import pandas as pd
import numpy as np
import pandas_dedupe

In [12]:
df = pd.read_csv('https://raw.githubusercontent.com/dedupeio/dedupe-examples/master/csv_example/csv_example_messy_input.csv')

Let's run some initial checks on the data.

In [3]:
df.shape

(3337, 32)

In [4]:
df.head(10)

Unnamed: 0,Id,Source,Site name,Address,Zip,Phone,Fax,Program Name,Length of Day,IDHS Provider ID,...,Executive Director,Center Director,ECE Available Programs,NAEYC Valid Until,NAEYC Program Id,Email Address,Ounce of Prevention Description,Purple binder service type,Column,Column2
0,0,CPS_Early_Childhood_Portal_scrape.csv,Salvation Army - Temple / Salvation Army,1 N Ogden Ave,,2262649.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
1,1,CPS_Early_Childhood_Portal_scrape.csv,Salvation Army - Temple / Salvation Army,1 N Ogden Ave,,2262649.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
2,2,CPS_Early_Childhood_Portal_scrape.csv,National Louis University - Dr. Effie O. Elli...,10 S Kedzie Ave,,5339011.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
3,3,CPS_Early_Childhood_Portal_scrape.csv,National Louis University - Dr. Effie O. Elli...,10 S Kedzie Ave,,5339011.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
4,4,CPS_Early_Childhood_Portal_scrape.csv,Board Trustees-City Colleges of Chicago - Oli...,10001 S Woodlawn Ave,,2916100.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
5,5,CPS_Early_Childhood_Portal_scrape.csv,Board Trustees-City Colleges of Chicago - Oli...,10001 S Woodlawn Ave,,2916100.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
6,6,CPS_Early_Childhood_Portal_scrape.csv,Easter Seals Society of Metropolitan Chicago ...,1001 W Roosevelt Rd,,9395115.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
7,7,CPS_Early_Childhood_Portal_scrape.csv,Easter Seals Society of Metropolitan Chicago ...,1001 W Roosevelt Rd,,9395115.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
8,8,CPS_Early_Childhood_Portal_scrape.csv,Hull House Association - Uptown Head Start / ...,1020 W Bryn Mawr Ave,,7695753.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
9,9,CPS_Early_Childhood_Portal_scrape.csv,Hull House Association - Child Dev. Central O...,1030 W Van Buren St,,9068600.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,


In [6]:
print(df.columns)

Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',
       'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',
       'Neighborhood', 'Funded Enrollment', 'Program Option',
       'Number per Site EHS', 'Number per Site HS', 'Director',
       'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',
       'Website', 'Executive Director', 'Center Director',
       'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',
       'Email Address', 'Ounce of Prevention Description',
       'Purple binder service type', 'Column', 'Column2'],
      dtype='object')


In [6]:
print(df.dtypes)

Id                                   int64
Source                              object
Site name                           object
Address                             object
Zip                                float64
Phone                              float64
Fax                                 object
Program Name                        object
Length of Day                       object
IDHS Provider ID                    object
Agency                              object
Neighborhood                        object
Funded Enrollment                   object
Program Option                      object
Number per Site EHS                 object
Number per Site HS                  object
Director                           float64
Head Start Fund                    float64
Eearly Head Start Fund              object
CC fund                             object
Progmod                             object
Website                             object
Executive Director                  object
Center Dire

In [7]:
# Missing values
df.isnull().sum()

Id                                    0
Source                                0
Site name                             0
Address                               0
Zip                                1333
Phone                               146
Fax                                3299
Program Name                       2009
Length of Day                      2009
IDHS Provider ID                   3298
Agency                             3325
Neighborhood                       2754
Funded Enrollment                  2424
Program Option                     2800
Number per Site EHS                3319
Number per Site HS                 3319
Director                           3337
Head Start Fund                    3337
Eearly Head Start Fund             2881
CC fund                            2818
Progmod                            2818
Website                            2815
Executive Director                 3114
Center Director                    2874
ECE Available Programs             2379


## Check duplicates

In [8]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
3332    False
3333    False
3334    False
3335    False
3336    False
Length: 3337, dtype: bool
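
The boolean Series above is hard to read for 3337 rows. A minimal sketch of how the check could be summarised follows, using a small toy dataframe (made up for illustration, loosely modelled on the first rows shown earlier). It also illustrates a plausible reason why `duplicated()` reports only `False` here: the method compares whole rows, and a unique column such as `Id` makes every row different. Restricting the comparison with `subset` is one way around this.

```python
import pandas as pd

# Toy data (illustrative only): rows 0 and 1 differ just in the Id column.
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Site name": ["Salvation Army", "Salvation Army", "Hull House"],
    "Address": ["1 N Ogden Ave", "1 N Ogden Ave", "1020 W Bryn Mawr Ave"],
})

# Whole-row comparison: the unique Id hides the repetition.
n_full = int(toy.duplicated().sum())

# Comparison restricted to the descriptive columns.
n_subset = int(toy.duplicated(subset=["Site name", "Address"]).sum())

print("Exact duplicates (all columns):", n_full)
print("Exact duplicates (Site name + Address):", n_subset)
```

On the real dataframe, the same idea would be `df.duplicated(subset=[...]).sum()` with whichever columns are considered identifying.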

## Remove duplicates

In [7]:
df.drop_duplicates(inplace=True)
df[0:5]

Unnamed: 0,Id,Source,Site name,Address,Zip,Phone,Fax,Program Name,Length of Day,IDHS Provider ID,...,Executive Director,Center Director,ECE Available Programs,NAEYC Valid Until,NAEYC Program Id,Email Address,Ounce of Prevention Description,Purple binder service type,Column,Column2
0,0,CPS_Early_Childhood_Portal_scrape.csv,Salvation Army - Temple / Salvation Army,1 N Ogden Ave,,2262649.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
1,1,CPS_Early_Childhood_Portal_scrape.csv,Salvation Army - Temple / Salvation Army,1 N Ogden Ave,,2262649.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
2,2,CPS_Early_Childhood_Portal_scrape.csv,National Louis University - Dr. Effie O. Elli...,10 S Kedzie Ave,,5339011.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
3,3,CPS_Early_Childhood_Portal_scrape.csv,National Louis University - Dr. Effie O. Elli...,10 S Kedzie Ave,,5339011.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,
4,4,CPS_Early_Childhood_Portal_scrape.csv,Board Trustees-City Colleges of Chicago - Oli...,10001 S Woodlawn Ave,,2916100.0,,Child Care,EXTENDED DAY,,...,,,,,,,,,,


## Remove 'real duplicates'

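The problem is that duplicated records are rarely identical, so `drop_duplicates` does not detect them.

The data is messy: the same site can appear with a slightly different name, address or phone number depending on the source.

We will use **dedupe** to find these approximate duplicates.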
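
To make the idea of approximate duplicates concrete, here is a small sketch that uses `difflib` from the standard library as a stand-in for a fuzzy string matcher. This is not what dedupe does internally (dedupe learns how to compare and block records from labelled examples); it only shows that two records can be very similar without being equal. The two toy rows are modelled on records that appear in this dataset.

```python
from difflib import SequenceMatcher

import pandas as pd

# Two records that look like the same centre, written differently.
toy = pd.DataFrame({
    "Site name": ["mary crane north 0-3", "mary crane center north"],
    "Address": ["2905 n. leavitt", "2905 n leavitt street"],
})

# Exact comparison does not flag them.
n_exact = int(toy.duplicated().sum())


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


name_sim = similarity(toy.loc[0, "Site name"], toy.loc[1, "Site name"])
addr_sim = similarity(toy.loc[0, "Address"], toy.loc[1, "Address"])

print("Exact duplicates:", n_exact)
print("Name similarity:", round(name_sim, 2))
print("Address similarity:", round(addr_sim, 2))
```

Choosing a fixed similarity threshold by hand is fragile, which is one motivation for a tool that learns the matching rule from examples.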

In [13]:
# canonicalize=True adds canonical_* columns with standardized values for each cluster
df_dedupe = pandas_dedupe.dedupe_dataframe(df, ['Source', 'Site name', 'Address', 'Zip', 'Phone', 'Email Address'], canonicalize=True)

Importing data ...
Reading from dedupe_dataframe_learned_settings
Clustering...
# duplicate sets 871


If you want to retrain, you should delete the settings and training files (the dedupe* and link_dataframes* files).
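
A possible way to remove those files from Python is sketched below. The helper name `remove_dedupe_artifacts` is hypothetical, and the glob patterns simply mirror the `dedupe*` and `link_dataframes*` wording above. The demo runs in a temporary directory with empty placeholder files; `link_dataframes_training.json` is an assumed example name, not one taken from the output.

```python
import tempfile
from pathlib import Path


def remove_dedupe_artifacts(directory="."):
    """Delete files matching dedupe* or link_dataframes* in `directory`.

    Hypothetical helper. Beware: 'dedupe*' also matches unrelated files
    whose names start with 'dedupe', so check the directory first.
    """
    removed = []
    for pattern in ("dedupe*", "link_dataframes*"):
        for path in Path(directory).glob(pattern):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return sorted(removed)


# Demo in a throwaway directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dedupe_dataframe_learned_settings",
                 "link_dataframes_training.json",  # assumed file name
                 "keep_me.csv"):
        (Path(tmp) / name).touch()
    removed = remove_dedupe_artifacts(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print("Removed:", removed)
print("Remaining:", remaining)
```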


Now, if you inspect the dataframe, you will see the duplicated records that have been clustered.

In [15]:
df_dedupe.columns

Index(['Id', 'Source', 'Site name', 'Address', 'Zip', 'Phone', 'Fax',
       'Program Name', 'Length of Day', 'IDHS Provider ID', 'Agency',
       'Neighborhood', 'Funded Enrollment', 'Program Option',
       'Number per Site EHS', 'Number per Site HS', 'Director',
       'Head Start Fund', 'Eearly Head Start Fund', 'CC fund', 'Progmod',
       'Website', 'Executive Director', 'Center Director',
       'ECE Available Programs', 'NAEYC Valid Until', 'NAEYC Program Id',
       'Email Address', 'Ounce of Prevention Description',
       'Purple binder service type', 'Column', 'Column2', 'cluster id',
       'confidence', 'canonical_Id', 'canonical_Source', 'canonical_Site name',
       'canonical_Address', 'canonical_Zip', 'canonical_Phone',
       'canonical_Fax', 'canonical_Program Name', 'canonical_Length of Day',
       'canonical_IDHS Provider ID', 'canonical_Agency',
       'canonical_Neighborhood', 'canonical_Funded Enrollment',
       'canonical_Program Option', 'canonical_Number p

In [22]:
df_sorted = df_dedupe.sort_values(['confidence', 'cluster id'], ascending=False)
df_sorted

Unnamed: 0,Id,Source,Site name,Address,Zip,Phone,Fax,Program Name,Length of Day,IDHS Provider ID,...,canonical_Executive Director,canonical_Center Director,canonical_ECE Available Programs,canonical_NAEYC Valid Until,canonical_NAEYC Program Id,canonical_Email Address,canonical_Ounce of Prevention Description,canonical_Purple binder service type,canonical_Column,canonical_Column2
3327,3327,purple_binder_early_childhood.csv,precious infants & tots learning center,624 e 47th street,60653.0,2682685.0,,,,,...,,,,,,,,,,early head start
3300,3300,purple_binder_early_childhood.csv,ywca metropolitan chicago,360 n michigan avenue,60601.0,3726600.0,,,,,...,,,,,,,,,,child care
3299,3299,purple_binder_early_childhood.csv,ymca west side,5080 w harrison street,60644.0,9553100.0,,,,,...,,,,,,,,,,child care
3285,3285,purple_binder_early_childhood.csv,woodlawn organization,6040 s harper avenue,60637.0,2885840.0,,,,,...,,,,,,,,,,child care
3281,3281,purple_binder_early_childhood.csv,urban family and community centers,4241 w washington boulevard,60624.0,7228333.0,,,,,...,,,,,,,,,,child care
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1837,1837,chapin_dfss_providers_2011_070212.csv,north avenue day nursery fcch-carolyn price,2020 w jackson,60612.0,,,,,,...,"5833ip(ehs collaboration enhanced home ip), 58...",www.crcl.net,cp,betty lee,,05/31/14,723127,youngt@crcl.net,,child care
2643,2643,ece chicago find a school scrape.csv,mary crane north 0-3,2905 n. leavitt,60618.0,3485528.0,,,,,...,"4374it(child care it center), 4374ps(child car...",www.marycrane.org,lavetter terry,martuice williams,,08/01/16,722999,info@marycrane.org,,child care
3228,3228,purple_binder_early_childhood.csv,mary crane center north,2905 n leavitt street,60618.0,9753322.0,,,,,...,"4374it(child care it center), 4374ps(child car...",www.marycrane.org,lavetter terry,martuice williams,,08/01/16,722999,info@marycrane.org,,child care
3229,3229,purple_binder_early_childhood.csv,mary crane family and day care center,2905 n clybourn avenue,60618.0,3485528.0,,,,,...,"4374it(child care it center), 4374ps(child car...",www.marycrane.org,lavetter terry,martuice williams,,08/01/16,722999,info@marycrane.org,,child care


In [24]:
df_sorted[['Id', 'cluster id', 'confidence', 'Source', 'Zip', 'Address', 'canonical_Executive Director']]

Unnamed: 0,Id,cluster id,confidence,Source,Zip,Address,canonical_Executive Director
3327,3327,870,1.000000,purple_binder_early_childhood.csv,60653.0,624 e 47th street,
3300,3300,869,1.000000,purple_binder_early_childhood.csv,60601.0,360 n michigan avenue,
3299,3299,868,1.000000,purple_binder_early_childhood.csv,60644.0,5080 w harrison street,
3285,3285,867,1.000000,purple_binder_early_childhood.csv,60637.0,6040 s harper avenue,
3281,3281,866,1.000000,purple_binder_early_childhood.csv,60624.0,4241 w washington boulevard,
...,...,...,...,...,...,...,...
1837,1837,40,0.133246,chapin_dfss_providers_2011_070212.csv,60612.0,2020 w jackson,"5833ip(ehs collaboration enhanced home ip), 58..."
2643,2643,31,0.130475,ece chicago find a school scrape.csv,60618.0,2905 n. leavitt,"4374it(child care it center), 4374ps(child car..."
3228,3228,31,0.130475,purple_binder_early_childhood.csv,60618.0,2905 n leavitt street,"4374it(child care it center), 4374ps(child car..."
3229,3229,31,0.130474,purple_binder_early_childhood.csv,60618.0,2905 n clybourn avenue,"4374it(child care it center), 4374ps(child car..."
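
Once records carry a `cluster id`, a common next step is to see how large each cluster is and to keep a single record per cluster. The sketch below does this on a toy dataframe whose four rows copy the `Id`, `cluster id` and `confidence` values from the output above. Keeping the highest-confidence row of each cluster is just one possible policy and is an assumption of this example.

```python
import pandas as pd

# Values copied from the output above (other columns omitted).
clustered = pd.DataFrame({
    "Id": [3327, 2643, 3228, 3229],
    "cluster id": [870, 31, 31, 31],
    "confidence": [1.000000, 0.130475, 0.130475, 0.130474],
})

# Number of records in each cluster.
cluster_sizes = clustered.groupby("cluster id").size()

# One representative per cluster: the row with the highest confidence.
representatives = (
    clustered.sort_values("confidence", ascending=False)
    .drop_duplicates(subset="cluster id")
    .sort_values("cluster id")
)

print(cluster_sizes)
print(representatives)
```

Low confidence values, such as those of cluster 31, suggest reviewing the cluster manually before dropping any record.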


## Matching / Linking records
Another problem is matching / linking records from different sources.

Let's load two datasets from FEBRL (Freely Extensible Biomedical Record Linkage).

In [25]:
# Load the two dataframes
dfa = pd.read_csv('data/dataset1-febrl.csv')
dfb = pd.read_csv('data/dataset2-febrl.csv')

This library cannot perform record matching when there are missing values, so we remove the incomplete records.

The problem is that many missing values are stored as ' ' (not NaN). So we first convert them to NaN, and then drop the rows that contain them.

In [26]:
dfa.replace(['', ' '], np.nan, inplace=True)
dfb.replace(['', ' '], np.nan, inplace=True)

In [27]:
dfa.dropna(inplace=True)
dfb.dropna(inplace=True)
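
The effect of these two steps can be checked on a toy dataframe (made up for illustration). The sketch uses the non-inplace form, which returns a new dataframe and makes it easy to compare the counts before and after.

```python
import numpy as np
import pandas as pd

# Toy data: one blank-space value and one empty string play the role of missing data.
toy = pd.DataFrame({
    "given_name": ["lachlan", " ", "luke"],
    "surname": ["berry", "sondergeld", ""],
})

# Before the conversion pandas sees no missing values at all.
n_missing_before = int(toy.isna().sum().sum())

# Convert the placeholders to NaN, then drop incomplete rows.
converted = toy.replace(["", " "], np.nan)
n_missing_after = int(converted.isna().sum().sum())
cleaned = converted.dropna()

print("Missing before:", n_missing_before)
print("Missing after conversion:", n_missing_after)
print("Rows kept:", len(cleaned))
```

Note that `dropna()` removes a row as soon as any column is missing, so it can discard many records. Checking `len()` before and after is a cheap safeguard.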

In [28]:
dfa.head(10)

Unnamed: 0,rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
1,rec-122-org,lachlan,berry,69,giblin street,killarney,bittern,4814,qld,19990219,7364009
2,rec-373-org,deakin,sondergeld,48,goldfinch circuit,kooltuo,canterbury,2776,vic,19600210,2635962
4,rec-227-org,luke,purdon,23,ramsay place,mirani,garbutt,2260,vic,19831024,8099933
7,rec-294-org,william,bishop,21,neworra place,apmnt 65,worongary,6225,qld,19490130,9773843
10,rec-81-dup-0,abbey,fit,13,kosciusko avenue,the wharf complex,yass,2594,nsw,19870510,7661096
11,rec-34-org,isabella,lodder,156,messenger street,tongbong sanctuary,bayswater,4870,vic,19650714,2790666
12,rec-478-org,anthony,beazley,12,birubi place,currandina,flemington,2477,qld,19730924,6558077
13,rec-225-org,alia,streich,74,maranoa street,rocky bend,rowville,6152,vic,19790418,1975340
15,rec-452-org,alissa,kilmartin,37,reveley crescent,crown allot,wolumla,6210,nsw,19041118,7994055
16,rec-67-org,jacob,lyden,25,haddon street,glenview,woodville north,2226,qld,19910424,6426415


In [29]:
dfb.head(10)

Unnamed: 0,rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
0,rec-2778-org,sarah,bruhn,44,forbes street,wintersloe,kellerberrin,4510,vic,19300213,7535316
1,rec-712-dup-0,jacob,lanyon,5,milne cove,wellwod,beaconsfield upper,2602,vic,19080712,9497788
2,rec-1321-org,brinley,efthimiou,35,sturdee crescent,tremearne,scarborough,5211,qld,19940319,6814956
3,rec-3004-org,aleisha,hobson,54,oliver street,inglewood,toowoomba,3175,qld,19290427,5967384
4,rec-1384-org,ethan,gazzola,49,sheaffe street,bimby vale,port pirie,3088,sa,19631225,3832742
5,rec-3981-org,alicia,hope,100,mansfield place,sunset,byford,6061,sa,19421201,7934773
6,rec-916-org,benjamin,kolosche,78,keenan street,wingara,raymond terrace,3212,sa,19450918,5698873
8,rec-63-dup-0,olivia,white,55,duffy street,shopping village,mirrabooka,2260,vic,19000106,4996142
10,rec-112-org,joshua,rudd,78,max henry crescent,brentwood vlge,port douglas,2315,vic,19951125,1697892
11,rec-3297-org,rachael,lomman,37,carlile street,clonturkle,bronte,2177,nsw,19910228,9462397


Check that the two datasets have the same columns. Note in the output that every column name except `rec_id` starts with a space.

In [30]:
print(dfa.columns)
print(dfb.columns)

Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',
       ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',
       ' soc_sec_id'],
      dtype='object')
Index(['rec_id', ' given_name', ' surname', ' street_number', ' address_1',
       ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',
       ' soc_sec_id'],
      dtype='object')
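
The leading spaces in the column names come from the `, ` separator used in the CSV files. The notebook keeps them and passes names such as `' given_name'` to the library. As an optional alternative, the sketch below shows how `skipinitialspace=True` in `pd.read_csv` avoids the problem at load time. The CSV text is a two-line toy sample written for this example, not the real file.

```python
import io

import pandas as pd

# Toy CSV with a space after each comma, like the FEBRL files appear to have.
csv_text = "rec_id, given_name, surname\nrec-122-org, lachlan, berry\n"

# Default parsing keeps the space in column names and in string values.
raw = pd.read_csv(io.StringIO(csv_text))

# skipinitialspace=True drops spaces that follow the delimiter.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(raw.columns))
print(list(clean.columns))
```

If the files are loaded this way, the field list passed to the matching function would need the names without the leading space.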


Let's match...

In [31]:
# Start matching (this launches an interactive labelling session)
df_final = pandas_dedupe.link_dataframes(dfa, dfb, [' given_name', ' surname',  ' street_number', ' address_1', ' address_2', ' suburb', ' postcode', ' state', ' date_of_birth',' soc_sec_id'])


Importing data ...


 given_name : reeve
 surname : quilliam
 street_number : 2
 address_1 : renwick street
 address_2 : yarrabee
 suburb : barwon heads
 postcode : 2340
 state : nsw
 date_of_birth : 19810406
 soc_sec_id : 1066923

 given_name : jessica
 surname : reid
 street_number : 280
 address_1 : medley street
 address_2 : warra creek
 suburb : ballarat
 postcode : 3149
 state : nsw
 date_of_birth : 19830907
 soc_sec_id : 1067529

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Starting active labeling...


 n


 given_name : daniel
 surname : couzens
 street_number : 37
 address_1 : coventry close
 address_2 : cressbrook
 suburb : mount eliza
 postcode : 5073
 state : nsw
 date_of_birth : 19881127
 soc_sec_id : 6934299

 given_name : dante
 surname : dakin
 street_number : 3
 address_1 : chuculba crescent
 address_2 : greenpatch
 suburb : forbes
 postcode : 5072
 state : nsw
 date_of_birth : 19481028
 soc_sec_id : 7288639

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : lachlan
 surname : jukic
 street_number : 2
 address_1 : morgan crescent
 address_2 : parklea
 suburb : raymond terrace
 postcode : 2250
 state : nsw
 date_of_birth : 19780702
 soc_sec_id : 4027998

 given_name : meg
 surname : feil
 street_number : 17
 address_1 : biraban place
 address_2 : hughloch lincoln red stud
 suburb : hawthorne
 postcode : 3429
 state : vic
 date_of_birth : 19060812
 soc_sec_id : 4027997

0/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : jacob
 surname : lyen
 street_number : 25
 address_1 : haddon srteet
 address_2 : glenvie w
 suburb : woodville north
 postcode : 2226
 state : qld
 date_of_birth : 19910424
 soc_sec_id : 6426415

 given_name : zac
 surname : white
 street_number : 26
 address_1 : companion crescent
 address_2 : glenview
 suburb : toronto
 postcode : 2226
 state : sa
 date_of_birth : 19431117
 soc_sec_id : 3437945

0/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : kydan
 surname : mccarthy
 street_number : 67
 address_1 : clemenger street
 address_2 : the points holsteins
 suburb : fairlawn
 postcode : 6415
 state : nsw
 date_of_birth : 19720518
 soc_sec_id : 6527653

 given_name : daniel
 surname : mccarthy
 street_number : 6
 address_1 : brunton street
 address_2 : tall pines
 suburb : fairlight
 postcode : 3155
 state : nsw
 date_of_birth : 19760107
 soc_sec_id : 8093038

0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : brooklyn
 surname : manson
 street_number : 27
 address_1 : clive steele avenue
 address_2 : port fairy road
 suburb : mount low
 postcode : 3450
 state : vic
 date_of_birth : 19710727
 soc_sec_id : 4493900

 given_name : ruby
 surname : mason
 street_number : 7
 address_1 : clive steele avenue
 address_2 : kooyong
 suburb : botany
 postcode : 3636
 state : vic
 date_of_birth : 19730913
 soc_sec_id : 4397223

0/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : emiily
 surname : coleman
 street_number : 108
 address_1 : chewings street
 address_2 : berkeley vlge
 suburb : wellington point
 postcode : 2550
 state : nsw
 date_of_birth : 19421221
 soc_sec_id : 7206933

 given_name : emiily
 surname : went
 street_number : 18
 address_1 : glenmaggie street
 address_2 : berkeley vlge
 suburb : blue haven
 postcode : 6051
 state : vic
 date_of_birth : 19521205
 soc_sec_id : 8530937

0/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : jacob
 surname : white
 street_number : 5
 address_1 : findlay street
 address_2 : booroopki park rmb 596
 suburb : robina
 postcode : 2197
 state : vic
 date_of_birth : 19170205
 soc_sec_id : 4702928

 given_name : talia
 surname : reid
 street_number : 147
 address_1 : sid barnes crescent
 address_2 : tathra
 suburb : berowra heights
 postcode : 2170
 state : vic
 date_of_birth : 19230203
 soc_sec_id : 4712927

0/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : carla
 surname : amiet
 street_number : 50
 address_1 : carstensz street
 address_2 : oaklodge
 suburb : blackmans bay
 postcode : 3180
 state : nsw
 date_of_birth : 19790801
 soc_sec_id : 9646483

 given_name : cameron
 surname : coleman
 street_number : 10
 address_1 : edwards street
 address_2 : broadbridge manor
 suburb : blackmans bay
 postcode : 3630
 state : nsw
 date_of_birth : 19871030
 soc_sec_id : 5502408

0/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : lara
 surname : matthews
 street_number : 6
 address_1 : beaumaris street
 address_2 : sunforest therapy centre
 suburb : the basin
 postcode : 4179
 state : nsw
 date_of_birth : 19911006
 soc_sec_id : 2164704

 given_name : dominic
 surname : matthews
 street_number : 67
 address_1 : campbell street
 address_2 : narraburra lodge
 suburb : coonabarabran
 postcode : 3174
 state : nsw
 date_of_birth : 19470226
 soc_sec_id : 3115384

0/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : jack
 surname : matthews
 street_number : 17
 address_1 : herron crescent
 address_2 : broadmere
 suburb : highton
 postcode : 2035
 state : vic
 date_of_birth : 19081119
 soc_sec_id : 5613395

 given_name : alexandra
 surname : matthews
 street_number : 174
 address_1 : port jackson circuit
 address_2 : old timers south
 suburb : whaleback
 postcode : 2830
 state : vic
 date_of_birth : 19261017
 soc_sec_id : 2919332

0/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : emiily
 surname : kowald
 street_number : 18
 address_1 : daglish street
 address_2 : oakdale
 suburb : avalon
 postcode : 3030
 state : vic
 date_of_birth : 19250313
 soc_sec_id : 1590627

 given_name : emiily
 surname : went
 street_number : 18
 address_1 : glenmaggie street
 address_2 : berkeley vlge
 suburb : blue haven
 postcode : 6051
 state : vic
 date_of_birth : 19521205
 soc_sec_id : 8530937

0/10 positive, 11/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : jaden
 surname : humphreys
 street_number : 14
 address_1 : euree street
 address_2 : moorillyah
 suburb : thornlie
 postcode : 3116
 state : wa
 date_of_birth : 19700116
 soc_sec_id : 9382782

 given_name : isabelle
 surname : jolly
 street_number : 166
 address_1 : hodges street
 address_2 : bosmit
 suburb : thornlie
 postcode : 3163
 state : wa
 date_of_birth : 19050126
 soc_sec_id : 2719590

0/10 positive, 12/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : brooklyn
 surname : manson
 street_number : 27
 address_1 : clive steele avenue
 address_2 : port fairy road
 suburb : yagoona
 postcode : 3450
 state : vic
 date_of_birth : 19710727
 soc_sec_id : 4493900

 given_name : laura
 surname : campbell
 street_number : 152
 address_1 : clive steele avenue
 address_2 : irrigation farm
 suburb : yagoona
 postcode : 3350
 state : vic
 date_of_birth : 19160610
 soc_sec_id : 6214635

0/10 positive, 13/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : jacob
 surname : white
 street_number : 5
 address_1 : findlay street
 address_2 : booroopki park rmb 596
 suburb : robina
 postcode : 2197
 state : vic
 date_of_birth : 19170205
 soc_sec_id : 4702928

 given_name : jakob
 surname : menzies
 street_number : 33
 address_1 : coverdale street
 address_2 : bundong
 suburb : worongary
 postcode : 2190
 state : vic
 date_of_birth : 19140610
 soc_sec_id : 4557295

0/10 positive, 14/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : benjamin
 surname : kirchener
 street_number : 12
 address_1 : stuart street
 address_2 : wurrami
 suburb : theodore
 postcode : 2620
 state : wa
 date_of_birth : 19751110
 soc_sec_id : 1766048

 given_name : max
 surname : rees
 street_number : 10
 address_1 : waite street
 address_2 : rosedown
 suburb : nambour
 postcode : 2620
 state : wa
 date_of_birth : 19751230
 soc_sec_id : 1361900

0/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


 given_name : brydon
 surname : webb
 street_number : 11
 address_1 : walker crescent
 address_2 : bugoren
 suburb : charlestown
 postcode : 6258
 state : nsw
 date_of_birth : 19190528
 soc_sec_id : 4191569

 given_name : bradley
 surname : haberfield
 street_number : 11
 address_1 : carumbo place
 address_2 : bungarra
 suburb : canley heights
 postcode : 2758
 state : vic
 date_of_birth : 19190528
 soc_sec_id : 5039500

0/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : brydon
 surname : webb
 street_number : 11
 address_1 : walker crescent
 address_2 : bugoren
 suburb : charlestown
 postcode : 6258
 state : nsw
 date_of_birth : 19190528
 soc_sec_id : 4191569

 given_name : bradley
 surname : haberfield
 street_number : 11
 address_1 : carumbi place
 address_2 : bungarra
 suburb : canley heights
 postcode : 2758
 state : vic
 date_of_birth : 19190528
 soc_sec_id : 5039500

1/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : lochlan
 surname : savidge
 street_number : 29
 address_1 : pohlman street
 address_2 : moline village
 suburb : jingili
 postcode : 3071
 state : vic
 date_of_birth : 19140228
 soc_sec_id : 1498207

 given_name : jayme
 surname : parr
 street_number : 2
 address_1 : clive steele avenue
 address_2 : henry kendall hostel
 suburb : hoskinstown
 postcode : 2770
 state : nsw
 date_of_birth : 19140228
 soc_sec_id : 5840194

2/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : charlie
 surname : headon
 street_number : 17
 address_1 : sheehy street
 address_2 : hawkins masonic vlge
 suburb : warrandyte
 postcode : 3073
 state : vic
 date_of_birth : 19880814
 soc_sec_id : 7871445

 given_name : william
 surname : hislop
 street_number : 17
 address_1 : deane street
 address_2 : sunbury
 suburb : cedar creek
 postcode : 3073
 state : qld
 date_of_birth : 19830819
 soc_sec_id : 6153593

3/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : mitchell
 surname : scrbak
 street_number : 950
 address_1 : holyman street
 address_2 : berkeley vlge
 suburb : safety bay
 postcode : 4300
 state : qld
 date_of_birth : 19060811
 soc_sec_id : 3592109

 given_name : hannah
 surname : beams
 street_number : 9
 address_1 : light street
 address_2 : castle hill farm
 suburb : sale
 postcode : 3221
 state : vic
 date_of_birth : 19560811
 soc_sec_id : 7444484

4/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : finley
 surname : haeusler
 street_number : 27
 address_1 : noarlunga crescent
 address_2 : spring ridge
 suburb : nambour
 postcode : 3180
 state : vic
 date_of_birth : 19660711
 soc_sec_id : 3025217

 given_name : elki
 surname : trent
 street_number : 27
 address_1 : wray place
 address_2 : the mews royal hotel bldg
 suburb : gawler east
 postcode : 3152
 state : nsw
 date_of_birth : 19600211
 soc_sec_id : 5679502

5/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : india
 surname : negrean
 street_number : 1
 address_1 : barringer street
 address_2 : sherwood
 suburb : parkinson
 postcode : 3168
 state : nsw
 date_of_birth : 19860923
 soc_sec_id : 2097928

 given_name : logan
 surname : selth
 street_number : 147
 address_1 : goyder street
 address_2 : rivonia
 suburb : queenscliff
 postcode : 2120
 state : tas
 date_of_birth : 19860921
 soc_sec_id : 4161322

6/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : annablle
 surname : kounis
 street_number : 121
 address_1 : calder place
 address_2 : brambletye vinyard
 suburb : joondkalup
 postcode : 3120
 state : nsw
 date_of_birth : 19640907
 soc_sec_id : 1612956

 given_name : claurdia
 surname : clelland
 street_number : 12
 address_1 : box hill a venue
 address_2 : st francis vlge
 suburb : old beach
 postcode : 3127
 state : wa
 date_of_birth : 19640902
 soc_sec_id : 9508954

7/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : troy
 surname : reid
 street_number : 1
 address_1 : allan street
 address_2 : townview
 suburb : page
 postcode : 2774
 state : qld
 date_of_birth : 19250727
 soc_sec_id : 3580821

 given_name : william
 surname : tossell
 street_number : 1
 address_1 : lutana street
 address_2 : nara cnsa
 suburb : craigmore
 postcode : 2509
 state : nsw
 date_of_birth : 19250116
 soc_sec_id : 5322906

8/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : isaac
 surname : quilliam
 street_number : 11
 address_1 : namatjira drive
 address_2 : delaware
 suburb : geelong west
 postcode : 3072
 state : nsw
 date_of_birth : 19930926
 soc_sec_id : 1556150

 given_name : bailey
 surname : clarke
 street_number : 1
 address_1 : hetherington circuit
 address_2 : gundaline
 suburb : harden
 postcode : 2077
 state : vic
 date_of_birth : 19930416
 soc_sec_id : 6134615

9/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : james
 surname : blake
 street_number : 19
 address_1 : sturt avenue
 address_2 : laloki
 suburb : carnegie
 postcode : 2218
 state : nsw
 date_of_birth : 19050716
 soc_sec_id : 7830672

 given_name : finn
 surname : kapoor
 street_number : 1994
 address_1 : sturt avenue
 address_2 : john flynn medical centre
 suburb : mullumbimby
 postcode : 2262
 state : vic
 date_of_birth : 19880816
 soc_sec_id : 8680815

10/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


 given_name : harry
 surname : ryn
 street_number : 5
 address_1 : kellway street
 address_2 : rowethorpe
 suburb : toowong
 postcode : 3931
 state : nsw
 date_of_birth : 19220503
 soc_sec_id : 7228670

 given_name : samantha
 surname : grierson
 street_number : 5
 address_1 : kennedy street
 address_2 : tantallon
 suburb : oakleigh
 postcode : 3034
 state : vic
 date_of_birth : 19210114
 soc_sec_id : 4683164

11/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 f


Finished labeling


Clustering...


BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data.
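
The `BlockingError` means that no pair of records ended up in a common block. One possible contributing factor is the labelling itself: several pairs answered with `y` in the session above (for example brydon webb and bradley haberfield) do not look like the same person, so the positive examples may have misled the training. As a sanity check that does not depend on training, the sketch below links two toy dataframes with an exact join on `soc_sec_id`. The rows are modelled on records shown earlier, and treating `soc_sec_id` as a reliable key is an assumption of this example.

```python
import pandas as pd

# Toy extracts: the first row of each frame describes the same person,
# with a typo in the surname of the second source.
a = pd.DataFrame({
    "given_name": ["jacob", "lachlan"],
    "surname": ["lyden", "berry"],
    "soc_sec_id": [6426415, 7364009],
})
b = pd.DataFrame({
    "given_name": ["jacob", "sarah"],
    "surname": ["lyen", "bruhn"],
    "soc_sec_id": [6426415, 7535316],
})

# Inner join on the identifier; suffixes keep both versions of each field.
exact = a.merge(b, on="soc_sec_id", suffixes=("_a", "_b"))

print(exact)
```

An exact join misses records whose identifier is itself mistyped, which is the case a trained matcher is meant to cover. It still gives a quick lower bound on how many links to expect.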

## Exercise
Try to deduplicate the White House visitor records.
You can find the data [here](https://obamawhitehouse.archives.gov/goodgovernment/tools/visitor-records).

# References
* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, 
* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)
* [Dedupe](https://dedupe.io/) package
* [pandas-dedupe](https://pypi.org/project/pandas-dedupe/) package

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© Carlos A. Iglesias, Universidad Politécnica de Madrid.