2014-05-21

On page 10 of the PCI DSS v3 under the heading of ‘Scope of PCI DSS Requirements’, second paragraph, is the following sentence.

 “At least annually and prior to the annual assessment, the assessed entity should confirm the accuracy of their PCI DSS scope by identifying all locations and flows of cardholder data and ensuring they are included in the PCI DSS scope.”

The first bullet after that paragraph states the following.

 “The assessed entity identifies and documents the existence of all cardholder data in their environment, to verify that no cardholder data exists outside of the currently defined CDE.”

In the past, organizations would rely on their database and file schemas along with their data flow diagrams, and the project was done.  However, the Council has come back and clarified its expectations for the search for cardholder data (CHD), primarily the primary account number (PAN).  The Council has stated that this search needs to be more extensive, to prove that PANs have not ended up on systems where they are not expected.

Data Loss Prevention

To deal with requirement 4.2, a lot of organizations invested in data loss prevention (DLP) solutions.  Many of those organizations have since turned their DLP solutions loose on their servers to find PANs and to confirm that no PANs exist outside of their cardholder data environment (CDE).

Organizations that do this quickly find out three things: (1) the scope of their search is too small, (2) their DLP solution is not capable of looking into databases, and (3) their DLP tools are not as good at finding PANs at rest as they are at finding PANs in motion, such as in an email message.

On the scope side of the equation, it is not just servers that are in scope for this PAN search; it is every system on the network, including infrastructure.  However, for most infrastructure systems such as firewalls, routers and switches, it is a simple task to rule them out as storing PANs.  Where things can go awry is with load balancers, proxies and Web application firewalls (WAF), which can end up with PANs inadvertently stored in memory and/or on disk due to how they operate.

Then there is the scanning of every server and PC on the network.  For large organizations, the thought of scanning every server and PC for PANs can seem daunting.  However, the Council does not specify that the identification of CHD needs to be done all at once, so such scanning can be spread out.  The only time constraint is that this scanning must be completed before the organization’s PCI assessment starts.

The second issue that organizations encounter is that their DLP solution has no ability to look into their databases.  Most DLP solutions are fine when it comes to flat files such as text, Word, PDF and Excel files, but the majority have no ability to look into databases and their associated tables.

Some DLP solutions have add-on modules for database scanning, but those typically require a license for each database instance to be scanned and can therefore quickly become cost prohibitive for some organizations.  DLPs that scan databases typically cover the more common platforms such as Oracle, SQL Server and MySQL.  But legacy enterprise databases such as DB2, Informix, Sybase and even Oracle in a mainframe environment are supported by only a limited number of DLP solutions.
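To give a sense of what database scanning actually involves, here is a minimal sketch in Python that walks every table in a SQLite file and checks each text column for PAN-like digit runs.  The database path is a placeholder, and the pattern is deliberately naive; a production scanner would need drivers for each database platform and far better pattern matching.

```python
import re
import sqlite3

# Naive candidate pattern: 13 to 16 consecutive digits.  As discussed
# later in this post, a pattern like this generates plenty of false positives.
PAN_CANDIDATE = re.compile(r"\b\d{13,16}\b")

def scan_sqlite(db_path):
    """Scan every column of every table in a SQLite file for PAN-like values."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    tables = [row[0] for row in
              cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    hits = []
    for table in tables:
        # PRAGMA table_info returns one row per column; the name is field 1.
        columns = [col[1] for col in cur.execute(f"PRAGMA table_info({table})")]
        for row in cur.execute(f"SELECT rowid, * FROM {table}"):
            rowid, values = row[0], row[1:]
            for col, value in zip(columns, values):
                if isinstance(value, str) and PAN_CANDIDATE.search(value):
                    hits.append((table, col, rowid))
    conn.close()
    return hits

# 'example.db' is a placeholder path for illustration only.
for table, col, rowid in scan_sqlite("example.db"):
    print(f"Possible PAN in {table}.{col} (rowid {rowid})")
```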

Another area where DLP solutions can have issues is with images.  Most DLP solutions have no optical character recognition (OCR) capability to seek out PANs in images such as images of documents from scanners and facsimile machines.  For those DLP solutions that can perform OCR, the OCR process slows the scanning process down considerably and the false positive rate can be huge particularly when it comes to facsimile documents or images of poor quality.
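For those curious what OCR-assisted scanning looks like, here is a rough sketch using the open source Tesseract engine via the pytesseract wrapper.  The filename is a placeholder, and as noted above, OCR of poor quality images, facsimiles in particular, will misread digits and drive the false positive rate up.

```python
import re
import pytesseract          # wrapper around the open source Tesseract OCR engine
from PIL import Image       # Pillow, for loading the image file

PAN_CANDIDATE = re.compile(r"\b\d{13,16}\b")

def scan_image(path):
    """OCR an image and return any PAN-like digit runs found in the text."""
    text = pytesseract.image_to_string(Image.open(path))
    # OCR output is noisy, especially on faxed documents, so every hit
    # is a candidate for manual review, not a confirmed PAN.
    return PAN_CANDIDATE.findall(text)

# 'scanned_fax.png' is a placeholder filename for illustration.
print(scan_image("scanned_fax.png"))
```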

Finally, there is the overall issue of identifying PANs at rest.  It has been my experience that using DLP solutions for identifying PANs at rest is haphazard at best.  I believe the reason is that most DLP solutions rely on the same openly available regular expressions (RegEx) to find PANs.  As a result, they all suffer from the same shortcomings of RegEx, and their false positive rates end up being very similar.
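To make the false positive problem concrete, here is a small sketch showing how a naive 16 digit pattern happily matches strings that are clearly not card numbers.  The sample strings are invented for illustration.

```python
import re

naive = re.compile(r"\b\d{16}\b")

# Invented sample strings; only the last contains an actual (test) PAN.
samples = [
    "invoice 2014052100000001",       # a date-stamped order number
    "tracking 9400100000000000",      # a shipping-style identifier
    "card on file 4111111111111111",  # a well-known Visa test number
]
for s in samples:
    print(naive.findall(s))  # the pattern matches all three
```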

The biggest reason for the false positive rate is that most of these RegEx-based solutions do not conduct a Luhn check to confirm that a number found is likely to be a PAN.  That said, I have added a Luhn check to some of the open source solutions, and it has amazed me how many 15 and 16 digit combinations can pass the Luhn check and yet, on further investigation, turn out not to be PANs.  As a result, having a Luhn check to confirm a number as a potential PAN reduces false positives, but not as significantly as one might expect.
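For readers who want to add the same filter, here is a minimal Luhn (mod 10) check in Python that can be applied to the candidates a RegEx scan turns up.  The function name is my own; as noted above, passing the check only makes a number a candidate PAN, not a confirmed one.

```python
def luhn_check(number: str) -> bool:
    """Return True if the digit string passes the Luhn (mod 10) check."""
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 when the
    # doubled value exceeds 9, then test the sum against mod 10.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_check("4111111111111111"))  # True  - a well-known Visa test number
print(luhn_check("4111111111111112"))  # False - fails the check
```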

The next biggest reason RegEx has a high false positive rate is that RegEx tools look at data at both the binary level and the character level.  As a result, I have seen PDFs flagged as containing PANs.  I have also seen images that supposedly contained PANs when I knew that the tool being used had no OCR capability.

I have tried numerous approaches to reduce the level of false positives, but I have not seen significant reductions from varying the expressions.  That said, I have found that the best results are obtained using separate expressions for each card brand's account range rather than a single, all-encompassing expression.
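As an illustration of that per-brand approach, here is a sketch using separate patterns for a few of the major brands, based on their commonly published prefixes and lengths (which are illustrative, not exhaustive, and change over time), combined with the Luhn filter from above.

```python
import re

def luhn_check(number: str) -> bool:
    # The same Luhn (mod 10) filter shown in the earlier sketch.
    digits = [int(d) for d in number]
    for i in range(len(digits) - 2, -1, -2):
        digits[i] = digits[i] * 2 - 9 if digits[i] > 4 else digits[i] * 2
    return sum(digits) % 10 == 0

# Commonly published prefixes and lengths; real brand account ranges
# change over time, so treat these as illustrative, not authoritative.
BRAND_PATTERNS = {
    "Visa":       re.compile(r"\b4\d{12}(?:\d{3})?\b"),      # 13 or 16 digits
    "MasterCard": re.compile(r"\b5[1-5]\d{14}\b"),           # 16 digits
    "Amex":       re.compile(r"\b3[47]\d{13}\b"),            # 15 digits
    "Discover":   re.compile(r"\b6(?:011|5\d{2})\d{12}\b"),  # 16 digits
}

def find_pans(text):
    """Return (brand, number) pairs that match a brand pattern and pass Luhn."""
    hits = []
    for brand, pattern in BRAND_PATTERNS.items():
        for match in pattern.findall(text):
            if luhn_check(match):
                hits.append((brand, match))
    return hits

print(find_pans("notes: 4111111111111111 and 1234567890123456"))
# [('Visa', '4111111111111111')] - the second number matches no brand pattern
```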

Simple Solutions

I wrote a post a while back regarding this scoping issue when it was introduced in v2.  It documents the open source solutions available, such as ccsrch, Find SSNs, SENF and Spider.  All of these solutions work best when run locally on the system in question.  For small environments, this is not an issue.  However, for large organizations, having each user run the solution and report the results is not an option.

In addition, the false positive rates from these solutions can also be high.  Then there is the issue of finding PANs in local databases such as SQLite, Access or MySQL.  None of these simple solutions is equipped to find PANs in a database.  As a result, PANs could be on these systems and you would not know it using these tools.

The bottom line is that while these techniques are better than doing nothing, they are not that much better.  PANs could be on systems and go unidentified, depending on the tool or tools used.  And that is the reason for this post: so that everyone understands the limitations of these tools and the fact that they are not going to give definitive results.

Specialized Tools

There are a number of vendors that have developed tools specifically to find PANs.  While these tools are typically cheaper than a full DLP solution, and some of them provide for the scanning of databases, it has been my experience that they are no better or worse than OpenDLP, the open source DLP solution.

Then there are the very specialized tools that were developed to convert data from flat files and older databases to newer databases or other formats.  Many of these vendors have added modules that use proprietary methods to identify all sorts of sensitive data, including PANs.  While this proprietary approach significantly reduces false positives, it unfortunately makes these tools very expensive, starting at $500K and going higher based on the size of the environment in which they will run.  As a result, organizations looking at these tools will need more than a PAN search capability to justify their cost.

The bottom line is that searching for PANs is not as easy as the solution vendors portray.  And even with extensive tuning, the false positive rate is likely to make investigating your search results very time consuming.  If you want to significantly reduce your false positive rate, then you should expect to spend a significant amount of money to achieve that goal.

Happy hunting.

This was cross-posted from the PCI Guru blog.