Usable Human Authentication: A Quantitative Treatment



Jeremiah Blocki successfully defended his PhD thesis today at Carnegie Mellon University. Jeremiah’s thesis is titled “Usable Human Authentication: A Quantitative Treatment”. Jeremiah is co-advised by Manuel Blum and me. His thesis committee also included Luis von Ahn (CMU) and Ron Rivest (MIT).

Here is a slightly expanded version of the abstract of this exciting thesis. I am including pointers to the research publications that report on these results as well as popular press articles written for a broader audience.


A typical computer user today manages passwords for many different online accounts. Users struggle with this task — often forgetting their passwords or adopting insecure practices, such as using the same passwords for multiple accounts and selecting weak passwords. While there are many books, articles, papers and even comics about selecting strong individual passwords, there is very little work on password management schemes — systematic strategies to help users create and remember multiple passwords.

Before we can design good password management schemes, it is necessary to address a fundamental question: How can we quantify the usability and security of a password management scheme? One way to quantify the usability of a password management scheme would be to conduct user studies evaluating each user’s success at remembering multiple passwords over an extended period of time. However, such user studies would necessarily be slow and expensive, and they would need to be repeated for each new password management scheme.

Our thesis is that:

User models and security models can guide the development of password management schemes with analyzable usability and security properties.

We present several results in support of this thesis. Our password management schemes are precisely specified and publishable: the security proofs hold even if the adversary knows the scheme and has extensive background knowledge about the user (hobbies, birthdate, etc.).

First, we introduce Naturally Rehearsing Passwords schemes. Notably, our user model, which is based on research on spaced rehearsal in human memory, allows us to analyze the usability of this family of schemes while experimentally validating only the common user model underlying all of them. The constructions and security analysis are based on new techniques for combinatorial design (see also Beideman and Blocki and, for comparison, Nisan and Wigderson).
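To make the usability analysis concrete, here is a minimal sketch of the expanding-rehearsal assumption that this kind of user model builds on. The parameter values (base interval of one day, doubling growth) are illustrative assumptions, not the thesis’s calibrated constants: a cue-association pair stays memorized as long as it is rehearsed before each deadline, and the deadlines spread out geometrically.

```python
def rehearsal_deadlines(base_days=1.0, growth=2.0, horizon_days=365.0):
    """Days (from first memorization) by which the i-th rehearsal is due,
    under an expanding-rehearsal assumption with geometric intervals.
    base_days and growth are illustrative parameters, not calibrated values."""
    deadlines, t, step = [], 0.0, base_days
    while t + step <= horizon_days:
        t += step            # next rehearsal deadline
        deadlines.append(t)
        step *= growth       # intervals grow geometrically
    return deadlines

# With base 1 day and doubling: rehearsals due at days 1, 3, 7, 15, 31, ...
schedule = rehearsal_deadlines()
```

The point of "naturally rehearsing" is visible in this model: a password that the user already types during routine logins gets its rehearsals for free, so a scheme that shares cues across accounts drives the number of *extra* rehearsals toward zero.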

Second, we introduce Human Computable Passwords schemes, which leverage human capabilities for simple arithmetic operations. We provide constructions that make modest demands on users, and we prove that these constructions provide strong security: an adversary who has seen 100 of a user’s 10-digit passwords cannot compute any of the user’s other passwords, except with very low probability. The security analysis is based on new statistical dimension lower bounds (building on recent results of Feldman et al.).
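To illustrate the flavor of such schemes (this is a toy sketch, not the thesis’s actual construction, and would not enjoy its security guarantees), imagine the user memorizes a secret mapping from letters to digits once, then derives each password digit by mod-10 additions that are easy to do mentally:

```python
import random

# Toy illustration only: the real constructions are carefully designed
# and proven secure, whereas this simple pairwise sum is not.
rng = random.Random(7)  # fixed seed purely so the example is reproducible
secret = {c: rng.randrange(10) for c in "abcdefghijklmnopqrstuvwxyz"}

def derive_password(challenge):
    """Map each letter of a public challenge (e.g., a site name) to its
    memorized secret digit, then add adjacent digit pairs mod 10 --
    one mental addition per output digit."""
    digits = [secret[c] for c in challenge]
    return "".join(str((a + b) % 10) for a, b in zip(digits[0::2], digits[1::2]))

pwd = derive_password("amazon")  # recomputable on demand, never stored
```

The appeal is that the user stores one small secret in memory and recomputes every password; the hard part, which the thesis addresses, is choosing a human-computable function whose outputs reveal essentially nothing about the secret even after many observed passwords.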

Third, we show that user models and security models can also be used to develop server-side defenses that a company could adopt to protect the passwords of its users against online attacks (see the Optimizing Password Composition Policies paper) and offline attacks (see the GOTCHA paper).

In the news:

Naturally Rehearsing Passwords:

  1. Carnegie Mellon Scheme Uses Shared Visual Cues To Help People Remember Multiple Passwords, CMU News Release, December 2013.
  2. Memory Trick Increases Password Security, Scientific American, December 2013.
  3. Story time: Researchers picture way better password memory scheme, Network World, December 2013.
  4. “Naturally rehearsing passwords” touted as better system for secure access, Slashdot, December 2013.
  5. Bill Gates swallowing a bicycle is the key to a novel password system, ZDnet, December 2013.

GOTCHA Password Hackers:

  1. First there were CAPTCHAs, now there are GOTCHAs, Ars Technica, February 2014.
  2. Carnegie Mellon Researchers Use Inkblots To Improve Security of Online Passwords, CMU News Release, November 2013.
  3. Researchers dare AI experts to crack new GOTCHA password scheme, Network World, November 2013.
  4. Researchers dare AI experts to crack new GOTCHA password scheme, Slashdot, November 2013.
  5. Inkblot, the new fool-proof password system, The Times of India, November 2013.
  6. Now, ‘inkblot’ passwords for unbreakable security, The Economic Times, November 2013.
  7. Researchers use inkblots to make safer passwords, The Tartan, November 2013.
  8. Will GOTCHAs Replace CAPTCHAs?, MIT Technology Review, October 2013.







Privacy Compliance in Big Data Systems



Privacy has become a significant concern in modern society as personal information about individuals is increasingly collected, used, and shared, often using digital technologies, by a wide range of organizations. To allay privacy concerns, Web services companies such as Facebook, Google, and Microsoft all make promises about how they will use the personal information they gather. But ensuring that the millions of lines of code in their systems respect these privacy promises is a challenging problem.

In a recent paper, jointly written with my PhD student Shayak Sen at Carnegie Mellon University and a team at Microsoft Research led by Saikat Guha, we present a workflow and tool chain to automate privacy policy compliance checking in big data systems. The tool chain has been applied to check compliance of over a million lines of code in the data analytics pipeline for Bing, Microsoft’s Web search engine. This is the first time automated privacy compliance analysis has been applied to the production code of an Internet-scale system. The paper was presented at the 2014 IEEE Symposium on Security and Privacy and recognized with a Best Student Paper Award.

Central to the design of the workflow and tool chain are (a) Legalease, a language that allows specification of privacy policies that impose restrictions on how user data is handled; and (b) Grok, a data inventory for Map-Reduce-like big data systems that tracks how user data flows among programs. Grok maps code-level schema elements to datatypes in Legalease, in essence annotating existing programs with information flow types with minimal human input. Compliance checking is thus reduced to information flow analysis of big data systems. The system, bootstrapped by a small team, checks compliance daily of millions of lines of ever-changing source code written by several thousand developers.
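The reduction to information flow analysis can be sketched in a few lines. The names below (the edge list, the `IPAddress` and `Advertising` labels, the seed-annotation dictionary) are illustrative assumptions, not Grok’s actual API: policy datatype labels are propagated along the dataflow graph from a handful of manually annotated datasets, and a job is flagged when a DENY clause forbids its purpose for the labels it receives.

```python
from collections import defaultdict

# Illustrative dataflow graph: upstream dataset -> downstream consumers.
edges = {
    "raw_logs": ["geo_join", "ads_model"],
    "geo_join": ["ads_model"],
}
seed_labels = {"raw_logs": {"IPAddress"}}   # the few manual annotations
purpose = {"ads_model": "Advertising", "geo_join": "AdsAbuseDetection"}
deny = [({"IPAddress"}, "Advertising")]     # DENY IPAddress for Advertising

def propagate(edges, seed):
    """Flow labels downstream until a fixed point is reached."""
    labels = defaultdict(set, {k: set(v) for k, v in seed.items()})
    changed = True
    while changed:
        changed = False
        for src, dsts in edges.items():
            for d in dsts:
                before = len(labels[d])
                labels[d] |= labels[src]
                changed |= len(labels[d]) != before
    return labels

labels = propagate(edges, seed_labels)
violations = [node for node, p in purpose.items()
              for types, banned in deny
              if p == banned and types <= labels[node]]
# ads_model receives IPAddress and is used for Advertising, so it is flagged.
```

Even in this toy form, the example shows why a few well-placed upstream annotations can cover a large pipeline: every dataset reachable from `raw_logs` inherits its labels automatically.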

While I refer the interested reader to the paper for technical details, let me highlight a few important lessons that we learned during the process of conducting this work.

1. Designing privacy policy languages for lawyers and software developers.

Privacy policies are often crafted by legal teams, while the software that has to respect these policies is written by developers. An important challenge is thus to design privacy policy languages that are usable by legal privacy teams, yet have a precise operational meaning (semantics) that software developers can use to restrict how their code operates on the personal information of users. Recognizing this challenge leads us to two questions:

  • How do we define what it means for a software system to respect a privacy policy?
  • What makes a privacy policy language usable by legal privacy teams?

The design of Legalease provides some (partial) answers to these questions. We formalize privacy policies as imposing restrictions on how personal information flows through software systems (see also Nissenbaum 2004, Hayati and Abadi 2005, Barth et al. 2006). An insight from prior work from my research group was that privacy policies often involve nested allow-deny information flow rules with exceptions (see DeYoung et al. 2010, Garg et al. 2011 for the first complete logical specification and audit of the HIPAA Privacy Rule for healthcare privacy in the US). Our hypothesis was that such rules match the mental model of privacy policy authors. Legalease builds on this insight, but instead of using first-order logic as the specification language, it has a simpler and more restricted syntax that we expected would be accessible to legal privacy teams. For example, consider the privacy policy: “IP address will not be used for advertising. IP address may be used for detecting abuse. In such cases it will not be combined with account information.” The translation of this policy to Legalease maintains a correspondence with the natural language policy:

DENY DataType IPAddress
     UseForPurpose Advertising
EXCEPT
  ALLOW DataType IPAddress
        UseForPurpose AdsAbuseDetection
  EXCEPT
    DENY DataType IPAddress, AccountInfo

Note, in particular, the use of nested exceptions in this example. Legalease policies also have the pleasing property that exceptions only refine the immediately enclosing clause. This feature helps incrementally author and reason about policies: adding an exception changes allowed and denied information flows in a locally contained manner.
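This locality property can be sketched operationally. The encoding below is a hedged illustration, not the paper’s formal semantics: clause attributes are plain sets, the purpose hierarchy (AdsAbuseDetection as a sub-purpose of Advertising) is an assumption made for the example, and the verdict comes from the deepest applicable clause, since an exception only refines its immediately enclosing clause.

```python
# Illustrative encoding of the example policy: (verdict, attributes, exceptions).
policy = ("DENY", {"DataType": {"IPAddress"}, "UseForPurpose": {"Advertising"}}, [
    ("ALLOW", {"DataType": {"IPAddress"}, "UseForPurpose": {"AdsAbuseDetection"}}, [
        ("DENY", {"DataType": {"IPAddress", "AccountInfo"}}, []),
    ]),
])

# Assumed purpose hierarchy for this example only.
ANCESTORS = {"AdsAbuseDetection": {"AdsAbuseDetection", "Advertising"},
             "Advertising": {"Advertising"}}

def applies(attrs, flow):
    """A clause applies when the flow's attributes fall under it."""
    for attr, required in attrs.items():
        values = flow.get(attr, set())
        if attr == "UseForPurpose":
            # the flow's purpose must fall under a purpose the clause names
            if not any(required & ANCESTORS.get(v, {v}) for v in values):
                return False
        else:
            # a DataType clause lists types that must all appear together
            if not required <= values:
                return False
    return True

def decide(clause, flow, default="ALLOW"):
    verdict, attrs, exceptions = clause
    if not applies(attrs, flow):
        return default
    for ex in exceptions:  # an exception only refines *this* clause
        inner = decide(ex, flow, default=verdict)
        if inner != verdict:
            return inner
    return verdict
```

Under this reading, using an IP address for advertising is denied, using it alone for ads-abuse detection is allowed, and combining it with account information, even for abuse detection, is denied again, exactly mirroring the nesting of the policy text.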

The results of our user study, involving participants from the legal privacy team and privacy champions at Microsoft (who sit between the legal privacy team and software developers), provide evidence in support of our hypothesis: after a short tutorial, the participants were able to encode the entire Bing policy pertaining to how users’ personal information will be used on the Bing servers (9 policy clauses) in about 15 minutes with high accuracy (see Section VI of the paper for details).

2. Bootstrapping existing software for privacy policy compliance.

Software systems that perform data analytics over personal information of users are often written without a technical connection to the privacy policies that they are meant to respect. Tens of millions of lines of such code are already in place in companies like Facebook, Google, and Microsoft. An important challenge is thus to bootstrap existing software for privacy compliance.

In this work, we develop Grok to annotate software written in programming languages that support the Map-Reduce programming model (e.g., Dremel, Hive, Scope) with Legalease’s policy datatypes. We focus on this class of languages because they are the languages of choice in industry for writing data analytics code. A simple way to conduct the bootstrapping process would be to ask developers to manually annotate all code with policy datatypes (e.g., labeling variables as IPAddress, programs as being for the purpose of Advertising, etc.). However, this process is too labor-intensive to scale. Instead, we develop a set of techniques to automate the bootstrapping process (see Section IV of the paper for details). Perhaps the two most notable findings are the following:

  • First, we show that a small number of manual annotations (~200) combined with an automated data flow analysis can correctly bootstrap about 60% of the 1.1 million LoC codebase. This finding was a surprise for me. It suggests that the data flow graph of the big data pipeline that we worked with has a useful narrow waist property: a few nodes (~200) towards the beginning of the pipeline, once correctly annotated, are sufficient to annotate a big chunk (~60%) of the entire pipeline. I am curious whether this property generalizes to other big data pipelines for which privacy compliance is a concern and, if so, why. It also suggests that a narrow waist is a useful design principle for these pipelines in order to bootstrap privacy compliance in a scalable manner.
  • Second, languages like Dremel, Hive, and Scope that support the Map-Reduce programming model are particularly simple from the standpoint of information flow analysis. Programs operate on columns of input tables and output their results as tables that other programs then operate on. There is no global state that is shared by multiple programs and very limited forms of information flow because of loops and conditionals (a major source of complexity in information flow analysis of programs written in C or Java). This form of simplicity is particularly useful to avoid false positives in our information flow analysis (i.e., concluding that a program violates a privacy policy when in fact it does not).

I view this result as a significant step forward in ensuring privacy compliance at scale. With the advent of technology that can help companies keep their privacy promises, users can reasonably expect stronger privacy promises from companies that operate over their personal information. Of course, much work remains to be done in expanding this kind of compliance technology to cover a broader set of privacy policies (including statistical privacy properties) and in its adoption in a wide range of organizations.


In the news:

  1. Carnegie Mellon, Microsoft Research Automate Privacy Compliance for Big Data Systems, CMU News Release, May 2014.
  2. How Difficult Is It For a Business To Comply With Its Own Privacy Policies, New York Business Lawyer Blog, June 2014.
  3. Impact and Implications of IEEE Award Winning Paper, CyLab Chronicles, June 2014.


