1 00:00:00,290 --> 00:00:02,210 Privacy Technologies. 2 00:00:02,210 --> 00:00:04,500 In this lesson, we're going to talk about some different 3 00:00:04,500 --> 00:00:06,100 technologies that we use to help ensure 4 00:00:06,100 --> 00:00:07,880 the privacy of our customers. 5 00:00:07,880 --> 00:00:10,230 The first one is de-identification. 6 00:00:10,230 --> 00:00:12,180 When I'm talking about de-identification, 7 00:00:12,180 --> 00:00:13,850 this is the methods and technologies 8 00:00:13,850 --> 00:00:16,690 that remove identifying information from data 9 00:00:16,690 --> 00:00:18,810 before we distribute that data. 10 00:00:18,810 --> 00:00:21,120 Now, the real benefit of de-identification here 11 00:00:21,120 --> 00:00:24,370 is to be able to take data that may be protected by privacy. 12 00:00:24,370 --> 00:00:26,460 And once we do the de-identification, 13 00:00:26,460 --> 00:00:30,400 that data now becomes usable by us again for other purposes. 14 00:00:30,400 --> 00:00:32,300 Now, this doesn't violate anybody's privacy 15 00:00:32,300 --> 00:00:34,930 because we are de-identifying the data. 16 00:00:34,930 --> 00:00:37,900 Oftentimes, your de-identification is going to be implemented 17 00:00:37,900 --> 00:00:40,120 as part of your database design. 18 00:00:40,120 --> 00:00:41,500 Now, there are lots of different things 19 00:00:41,500 --> 00:00:42,380 that we have to talk about 20 00:00:42,380 --> 00:00:44,460 when we talk about de-identification. 21 00:00:44,460 --> 00:00:47,610 This includes things like data masking, tokenization, 22 00:00:47,610 --> 00:00:50,940 aggregation and banding, and re-identification. 
23 00:00:50,940 --> 00:00:52,990 Now, when we talk about data masking, 24 00:00:52,990 --> 00:00:55,530 this is where a de-identification method is used 25 00:00:55,530 --> 00:00:58,410 where a generic or placeholder label is substituted in 26 00:00:58,410 --> 00:01:00,860 for real data while preserving the structure 27 00:01:00,860 --> 00:01:03,040 or format of the original data. 28 00:01:03,040 --> 00:01:06,790 So let's say you're going to give me all your credit cards. 29 00:01:06,790 --> 00:01:08,000 I take all your credit cards 30 00:01:08,000 --> 00:01:12,110 and I take away all of the information from your 16 digits 31 00:01:12,110 --> 00:01:15,660 and I put XXXX in place of all those 16 digits. 32 00:01:15,660 --> 00:01:16,950 That would mask the data. 33 00:01:16,950 --> 00:01:19,340 Nobody would be able to identify that credit card anymore 34 00:01:19,340 --> 00:01:21,840 as yours because we don't have the credit card. 35 00:01:21,840 --> 00:01:23,780 We just have XXXXX. 36 00:01:23,780 --> 00:01:25,300 That's a form of data masking. 37 00:01:25,300 --> 00:01:27,170 So really when we talk about data masking, 38 00:01:27,170 --> 00:01:28,770 we are covering up the data, 39 00:01:28,770 --> 00:01:30,820 or maybe I have a database of all my customers 40 00:01:30,820 --> 00:01:31,653 and for some reason, 41 00:01:31,653 --> 00:01:33,110 we collect their social security numbers. 42 00:01:33,110 --> 00:01:35,120 We would never do that, but let's say we did. 43 00:01:35,120 --> 00:01:36,620 Well, that's a nine-digit number. 44 00:01:36,620 --> 00:01:38,970 Instead of having your unique social security number, 45 00:01:38,970 --> 00:01:40,360 I might go back through the database 46 00:01:40,360 --> 00:01:44,900 and change all your social security numbers to 111-11-1111. 47 00:01:45,910 --> 00:01:48,570 And by doing that, I have now genericized it 48 00:01:48,570 --> 00:01:51,010 across all my students to have the same number. 
49 00:01:51,010 --> 00:01:52,250 It keeps the same format. 50 00:01:52,250 --> 00:01:53,720 It keeps the same structure, 51 00:01:53,720 --> 00:01:56,280 but it doesn't actually take any personal information 52 00:01:56,280 --> 00:01:59,160 from you because I've erased that social security number. 53 00:01:59,160 --> 00:02:01,800 The next one we have is what's known as tokenization. 54 00:02:01,800 --> 00:02:04,880 Now, this is a de-identification method where a unique token 55 00:02:04,880 --> 00:02:07,480 is substituted in for real data. 56 00:02:07,480 --> 00:02:09,040 Now, when you do tokenization, 57 00:02:09,040 --> 00:02:10,740 one of the things you have to worry about 58 00:02:10,740 --> 00:02:14,080 is if you have the ability to go back and be reversible 59 00:02:14,080 --> 00:02:16,250 and usually with tokenization, it is. 60 00:02:16,250 --> 00:02:19,200 So again, let's say I had your social security numbers. 61 00:02:19,200 --> 00:02:21,470 Instead of changing them all to one, 62 00:02:21,470 --> 00:02:24,320 I assign a random number to each of my students. 63 00:02:24,320 --> 00:02:25,970 That's now their student ID. 64 00:02:25,970 --> 00:02:27,870 That student ID is now substituted in 65 00:02:27,870 --> 00:02:29,600 for that social security number field. 66 00:02:29,600 --> 00:02:32,050 But I might have a master list in my safe 67 00:02:32,050 --> 00:02:34,040 that says this student ID 68 00:02:34,040 --> 00:02:36,110 matches this social security number. 69 00:02:36,110 --> 00:02:37,930 That's what we're talking about with tokenization. 70 00:02:37,930 --> 00:02:40,650 We're using another number to represent the information. 71 00:02:40,650 --> 00:02:42,280 So if any of my staff go into the database 72 00:02:42,280 --> 00:02:44,060 and look at your social security number, 73 00:02:44,060 --> 00:02:45,940 they would just see the made-up student number. 
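The masking and tokenization examples described up to this point can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular product implements it: the function names `mask_ssn` and `mask_card` and the `TokenVault` class are hypothetical, and the in-memory dictionaries stand in for the "master list in my safe", which in practice would be a separately secured store.

```python
import secrets


def mask_ssn(ssn):
    """Replace every digit with '1' and keep the dashes, so the format survives."""
    return "".join("1" if ch.isdigit() else ch for ch in ssn)


def mask_card(card_number):
    """Replace every digit of a card number with 'X' and keep any separators."""
    return "".join("X" if ch.isdigit() else ch for ch in card_number)


class TokenVault:
    """Hypothetical stand-in for the master list that maps tokens to real values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a random token for the value, reusing it if one already exists."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)
        while token in self._token_to_value:  # extremely unlikely collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Reverse lookup; only someone with access to the vault can do this."""
        return self._token_to_value[token]
```

Masking is one-way because the original digits are thrown away, whereas the token can be mapped back to the real value by anyone who can read the vault, which is why the vault has to be protected separately from the database.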
74 00:02:45,940 --> 00:02:47,550 They wouldn't get your real social security number 75 00:02:47,550 --> 00:02:49,120 because that's stored in my vault. 76 00:02:49,120 --> 00:02:51,600 But if I had some real business case where I needed it, 77 00:02:51,600 --> 00:02:52,920 I could then do the matching 78 00:02:52,920 --> 00:02:54,560 and then re-identify you that way. 79 00:02:54,560 --> 00:02:57,330 So it's a little bit more dangerous to do tokenization. 80 00:02:57,330 --> 00:03:00,750 The next one we want to talk about is aggregation and banding. 81 00:03:00,750 --> 00:03:04,370 Now, aggregation and banding is where you de-identify people 82 00:03:04,370 --> 00:03:06,820 by gathering the data and generalizing it 83 00:03:06,820 --> 00:03:08,730 to protect the individuals involved. 84 00:03:08,730 --> 00:03:11,000 So if we were using aggregation and banding, 85 00:03:11,000 --> 00:03:13,600 we might take all of our subjects in a medical trial 86 00:03:13,600 --> 00:03:16,370 and instead of identifying them as the person 87 00:03:16,370 --> 00:03:17,670 or the subject number, 88 00:03:17,670 --> 00:03:19,980 we would say out of the 100 people 89 00:03:19,980 --> 00:03:21,210 who participated in this trial, 90 00:03:21,210 --> 00:03:23,660 90% of them didn't have side effects. 91 00:03:23,660 --> 00:03:25,150 Now that doesn't mean any of those 90 92 00:03:25,150 --> 00:03:26,430 quickly identifies as you. 93 00:03:26,430 --> 00:03:28,510 It just means somebody didn't have a side effect. 94 00:03:28,510 --> 00:03:29,540 It's one of those 90. 95 00:03:29,540 --> 00:03:31,300 And if we knew that you didn't have side effects, 96 00:03:31,300 --> 00:03:32,650 well, you're just one of 90. 97 00:03:32,650 --> 00:03:34,290 We don't know you individually. 98 00:03:34,290 --> 00:03:36,630 And that's where we're able to protect your privacy. 
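As a rough illustration of aggregation and banding, the sketch below reports only a summary percentage and counts per age band instead of individual trial records. The record layout (`age`, `side_effects`) and the function names are assumptions made for this example, and the bands are written as 30-39 rather than "30 to 40" so that they do not overlap.

```python
from collections import Counter


def age_band(age, width=10):
    """Map an exact age to a band label, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def side_effect_summary(records):
    """Aggregate individual records into one group-level statistic."""
    total = len(records)
    without = sum(1 for r in records if not r["side_effects"])
    return {
        "participants": total,
        "percent_without_side_effects": 100 * without / total,
    }


def band_counts(records):
    """Count participants per age band instead of listing exact ages."""
    return Counter(age_band(r["age"]) for r in records)
```

With 100 records of which 90 have `side_effects` set to `False`, `side_effect_summary` would report 90 percent without side effects, and nothing in the output points at a single participant.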
99 00:03:36,630 --> 00:03:38,520 Now, let me give you another example of the dangers 100 00:03:38,520 --> 00:03:39,353 of some of these things 101 00:03:39,353 --> 00:03:41,680 and when you have to think about de-identification 102 00:03:41,680 --> 00:03:44,840 in terms of when somebody tries to re-identify people. 103 00:03:44,840 --> 00:03:46,720 So let's say that I went 104 00:03:46,720 --> 00:03:49,330 and did a corporate survey of my company. 105 00:03:49,330 --> 00:03:51,630 We went ahead and we sent out a survey to everybody, 106 00:03:51,630 --> 00:03:52,960 and we said don't tell us your name 107 00:03:52,960 --> 00:03:54,640 because we don't want to identify you. 108 00:03:54,640 --> 00:03:55,700 We want you to feel comfortable 109 00:03:55,700 --> 00:03:57,080 giving us your honest feedback. 110 00:03:57,080 --> 00:03:58,490 And we asked them a whole bunch of questions 111 00:03:58,490 --> 00:03:59,323 about the company. 112 00:03:59,323 --> 00:04:00,156 How do you like it here? 113 00:04:00,156 --> 00:04:01,160 Is the pay competitive? 114 00:04:01,160 --> 00:04:02,110 Do you enjoy your job? 115 00:04:02,110 --> 00:04:03,230 Do you like helping the students? 116 00:04:03,230 --> 00:04:04,380 All that kind of stuff. 117 00:04:04,380 --> 00:04:07,480 But then on the final question we asked something like, 118 00:04:07,480 --> 00:04:08,900 what is your age? 119 00:04:08,900 --> 00:04:10,240 What is your sex? 120 00:04:10,240 --> 00:04:11,590 Are you married or not? 121 00:04:11,590 --> 00:04:13,200 And we get that kind of information. 122 00:04:13,200 --> 00:04:16,440 So okay, that seems innocuous enough because we didn't ask 123 00:04:16,440 --> 00:04:18,240 for things like your social security number 124 00:04:18,240 --> 00:04:20,140 or your employee ID or your name 125 00:04:20,140 --> 00:04:21,810 so we still shouldn't be able to identify you. 126 00:04:21,810 --> 00:04:23,750 So we take all the results of the survey. 
127 00:04:23,750 --> 00:04:24,750 We shuffle them all together 128 00:04:24,750 --> 00:04:25,840 and we start reading through them. 129 00:04:25,840 --> 00:04:27,630 This one's a five star, this one's a five star, 130 00:04:27,630 --> 00:04:28,950 this is a four and a half star. 131 00:04:28,950 --> 00:04:30,030 This one's a one. 132 00:04:30,030 --> 00:04:31,340 Hmm, well, now I'm upset. 133 00:04:31,340 --> 00:04:32,960 I want to know who this one is, right? 134 00:04:32,960 --> 00:04:34,530 Can I re-identify them? 135 00:04:34,530 --> 00:04:35,590 Well, let's say I look at them 136 00:04:35,590 --> 00:04:36,670 and I read through their comments 137 00:04:36,670 --> 00:04:39,280 and I get to the last page and it says this is a woman. 138 00:04:39,280 --> 00:04:41,940 This is somebody who's between the ages of 30 and 40. 139 00:04:41,940 --> 00:04:43,790 This is somebody who is married. 140 00:04:43,790 --> 00:04:46,800 Huh, well, based on that and my small staff, 141 00:04:46,800 --> 00:04:49,060 I know that's only one person in my company 142 00:04:49,060 --> 00:04:50,250 and so I know the person 143 00:04:50,250 --> 00:04:52,300 who thinks Jason's the worst boss ever. 144 00:04:52,300 --> 00:04:54,580 And lo and behold, it's my wife. 145 00:04:54,580 --> 00:04:55,950 Tamara went and filled out the survey 146 00:04:55,950 --> 00:04:57,410 and leaves me a one star review. 147 00:04:57,410 --> 00:04:58,320 Thanks, honey. 148 00:04:58,320 --> 00:04:59,830 You know, this is the kind of stuff that happens. 149 00:04:59,830 --> 00:05:01,170 But again, if you have this 150 00:05:01,170 --> 00:05:02,880 where you can re-identify somebody, 151 00:05:02,880 --> 00:05:05,860 then all that anonymization doesn't really help. 152 00:05:05,860 --> 00:05:07,220 Now, why does this happen? 153 00:05:07,220 --> 00:05:08,640 Well, because we're a small company. 154 00:05:08,640 --> 00:05:10,050 We only have 10 people. 
155 00:05:10,050 --> 00:05:12,290 And so if we ask a question like that on the last page, 156 00:05:12,290 --> 00:05:14,590 and we don't, to be honest, but if we did, 157 00:05:14,590 --> 00:05:16,230 it would be very easy for me to identify it 158 00:05:16,230 --> 00:05:18,510 because we only have a handful of employees. 159 00:05:18,510 --> 00:05:19,570 We have 10 people. 160 00:05:19,570 --> 00:05:21,550 And so if I ask things like 10-year age bands 161 00:05:21,550 --> 00:05:24,700 like are you between 20 and 30, 30 and 40, 40 and 50 162 00:05:24,700 --> 00:05:26,010 and if you're a male or female 163 00:05:26,010 --> 00:05:27,370 and if you're married or not, 164 00:05:27,370 --> 00:05:30,550 that tells me pretty much I can identify everybody down 165 00:05:30,550 --> 00:05:32,120 based on that result. 166 00:05:32,120 --> 00:05:33,720 And so that would take away the ability 167 00:05:33,720 --> 00:05:35,710 of having that de-identification. 168 00:05:35,710 --> 00:05:38,150 So this is the concept of re-identification, right? 169 00:05:38,150 --> 00:05:39,710 Re-identification is an attack 170 00:05:39,710 --> 00:05:42,721 that combines de-identified data sets 171 00:05:42,721 --> 00:05:44,790 with other data sources, things that you know, 172 00:05:44,790 --> 00:05:47,600 to discover how secure the de-identification method is. 173 00:05:47,600 --> 00:05:49,690 And so if we use that system in our company, 174 00:05:49,690 --> 00:05:51,000 that would not be secure. 175 00:05:51,000 --> 00:05:53,580 Now, if I use that same system in my last job 176 00:05:53,580 --> 00:05:55,590 where I worked with 400 other people, 177 00:05:55,590 --> 00:05:56,680 it would have been very secure 178 00:05:56,680 --> 00:05:58,450 because there were a lot more people 179 00:05:58,450 --> 00:05:59,750 who might have been a woman 180 00:05:59,750 --> 00:06:02,000 who is married between 30 and 40 years old. 
181 00:06:02,000 --> 00:06:04,770 And so it'd be very easy for them to hide in the bulk. 182 00:06:04,770 --> 00:06:06,720 Out of those 400 people at that company, 183 00:06:06,720 --> 00:06:09,320 that probably signifies about 50 or 60 people. 184 00:06:09,320 --> 00:06:11,750 And so I wouldn't be able to identify you individually 185 00:06:11,750 --> 00:06:12,700 asking those questions. 186 00:06:12,700 --> 00:06:14,450 So when you're building out surveys 187 00:06:14,450 --> 00:06:15,670 and when you're building out systems 188 00:06:15,670 --> 00:06:18,010 to have a de-identification in place, 189 00:06:18,010 --> 00:06:19,520 you need to think these things through 190 00:06:19,520 --> 00:06:22,150 because sometimes something that seems like it would work 191 00:06:22,150 --> 00:06:23,780 because it works at a large company 192 00:06:23,780 --> 00:06:26,443 won't work at a small company or vice versa.
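One way to think this through before sending a survey is to count how many respondents share each combination of demographic answers. The sketch below does that; the idea is similar in spirit to k-anonymity, a term the lesson itself does not use. The field names (`sex`, `age_band`, `married`) and the function names are assumptions for illustration only.

```python
from collections import Counter


def group_sizes(responses, quasi_identifiers):
    """Count how many responses share each combination of demographic answers."""
    return Counter(
        tuple(r[field] for field in quasi_identifiers) for r in responses
    )


def smallest_group(responses, quasi_identifiers):
    """Size of the smallest group; 1 means at least one person stands alone."""
    return min(group_sizes(responses, quasi_identifiers).values())


def unique_combinations(responses, quasi_identifiers):
    """Answer combinations held by exactly one respondent (re-identifiable)."""
    sizes = group_sizes(responses, quasi_identifiers)
    return [combo for combo, count in sizes.items() if count == 1]
```

In a 10-person company, a combination such as a married woman aged 30 to 40 could easily fall in a group of size 1, so that response is effectively identified. At a 400-person company the same combination might cover dozens of people, and the same three questions would be far less revealing.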