Dark data and risks for business and managers

The term “dark data” refers to operational information that goes unused because the data carrying it is hidden or forgotten in obscure corners of IT systems. This data poses interesting challenges for IT directors and for compliance and security managers, especially in the context of the new EU Regulation 2016/679 (General Data Protection Regulation, GDPR).

Gartner defines dark data as information assets that are created in the course of normal business activity and then remain unused. Once collected, this type of data is used in passing and then lost track of, i.e. no one knows where it is, it is difficult to find, and no one is looking for it, because no one knows that it exists.

The new GDPR is not particularly interested in where the data centers are located. Its requirements must be met wherever business operations are carried out in the EU or personal data of individuals in the EU is processed. The envisaged sanctions are frightening.

Other strict regulations have come into force in recent months or are on the horizon. New cyber rules for the financial sector are being implemented in New York from March 1, which could lead to criminal liability for non-compliance.

In short, 2017 is the year in which organizations are forced to meet compliance requirements without sacrificing security. This will mean a boom in automation to increase the visibility of systems that process personal data, along with change tracking, risk assessment, and audit reporting.

Dark data contains valuable assets, but also risks that cannot be ignored. Everyone who runs IT systems needs to pay attention, because the problem of growing dark data becomes more pressing at an accelerating rate and has important implications for organizations that neglect the growing databerg. The databerg, in turn, is composed of three elements:

Business-critical data – vital, operational, useful for business success, protected, and proactively managed.

Redundant, Obsolete, and Trivial (ROT) data. Redundant data is usually repeated or duplicated, obsolete data no longer has value for the business, and trivial data has no meaning for the business. ROT data needs to be proactively minimized, with guaranteed removal and deletion at regular intervals.

Dark data – data that has no identified value for the business. It may include vital, business-critical data or useless ROT data. In all cases, it consumes resources. This data needs to be discovered and classified into the BCD (Business Critical Data) or ROT groups.

Daily operations at the strategic and operational levels cause an increase in ROT and dark data. Strategic decisions are about budgeting disk space, not business value, and organizational decisions involve renting cloud resources rather than assigning data management responsibilities. At the same time, on an individual level, employees believe that corporate resources are endless and free to use for both work and personal needs.

Now is the time for organizations to take control of dark data, because its volume will one day exceed their ability to monitor and control it, and the time is approaching when the GDPR will come into force.

The necessary steps are:

providing visibility, i.e. detection and mapping of dark data;

processing the data and placing it under ongoing control by including it in the corporate data lifecycle management scheme.
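The first step, detection and mapping, can be approximated by a simple inventory of files that have not been touched for a long time. The following is only a minimal sketch under stated assumptions: the function name `find_stale_files`, the `FileRecord` type, and the one-year threshold are hypothetical choices for illustration, and the last modification time is used as a rough proxy for "forgotten". A real discovery tool would also look at content, ownership, and access logs.

```python
import os
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRecord:
    """One entry in the inventory of potentially dark files."""
    path: Path
    size_bytes: int
    days_since_modified: float


def find_stale_files(root, max_age_days=365, now=None):
    """Return files under `root` not modified for more than `max_age_days`.

    The threshold is an assumption made for this sketch, not a rule
    taken from the GDPR or any other regulation.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file; skipped in this sketch
            age_days = (now - info.st_mtime) / 86400
            if age_days > max_age_days:
                stale.append(FileRecord(path, info.st_size, age_days))
    # Largest files first, since they consume the most storage.
    stale.sort(key=lambda record: record.size_bytes, reverse=True)
    return stale
```

Calling `find_stale_files("/srv/share")` (a hypothetical path) would yield a list that can be reviewed and then fed into the lifecycle management scheme from the second step.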

Illuminating dark data reduces risk exposure and unlocks hidden value for the business. It is best to start by eliminating ROT data, as it subsequently turns dark, and the next step should be to teach users good hygiene when working with data.
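Eliminating redundant data presupposes finding it. One common technique, shown here purely as an illustrative sketch, is to group files by a cryptographic hash of their content, so that byte-for-byte copies end up in the same group regardless of their names. The function names `sha256_of_file` and `find_duplicate_files` are hypothetical, and the sketch only reports candidates; deciding which copy to keep remains a business decision.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return groups of paths under `root` whose contents are identical."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                by_hash[sha256_of_file(path)].append(path)
            except OSError:
                continue  # unreadable file; skipped in this sketch
    # Only hashes shared by two or more files indicate redundancy.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Each returned group is a set of redundant copies. Obsolete and trivial data cannot be detected this way and need other criteria, such as age or file type.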

Where is the dark data?

Various authors classify dark data, both structured and unstructured, into the following summary list:

• Emails, documents, video, audio in file servers, social media, cloud resources.

• Old files, data that is kept just in case, content in portable devices, content in cloud resources beyond the control of the organization’s IT system policies.

• Intentionally or accidentally hidden data in file systems – inside hidden files, fake bad clusters, deliberately concealed files, etc.

There are other sources of dark data for which I did not find literature sources, so in this article I introduce the terms explicit and hidden dark data. Their definitions are based on how visible the data is from the point of view of its owner.

Practical experience from many real projects allows me to add the following sources to the above list:

Hidden data in files positioned in conventional file systems – old documents, graphics, scanned documents, completed pdf forms, notes in Word documents, handwritten notes on scanned documents.

Data generated by operating systems: unemptied recycle bins or trash folders in Windows, Linux, or Unix, service caches in RAM, disk caches and caches in database servers, caches in proxy servers, web servers, and specialized Internet caching servers, temporary files.

Software development processes are also accompanied by data, such as test-case data. Often, these are real data sets that are provided to programmers for testing the software, are used, and then pose unforeseen risks once the code is frozen and everyone forgets about them.

Traces of running applications, such as web browser caches, bash history, encryption keys (such as VPN or SSH), syslog, or event log entries that no one cares about.

Data in forgotten virtual machines that are merely installed or still active in local hypervisors or cloud infrastructures, or data that is not deleted after the cloud service lease ends.

Data generated by various devices operating in IoT-type installations, such as devices implanted in or worn on the human body and communicating via a body area network (BAN).

Forgotten structured data that was created some time ago by various manual operations, such as ad hoc queries or maintenance changes to database schemas, and no one knows whether it is still used.

Official data located on employees’ mobile devices and private computers can also be defined as dark: on the one hand, it is no longer under the control of the organization that owns it, and on the other, it carries risks, because from then on its life cycle is unknown.

Forgotten structured data that was created long ago in various database servers by old accounting, warehousing, and other programs, for which it is no longer clear whether anyone uses or maintains it.
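Some of the application traces listed above, such as shell history files and private keys, can be located with very simple heuristics. The sketch below is an illustration only: the function name `find_sensitive_traces` and the set of file names are hypothetical choices, and the check for the text `PRIVATE KEY-----` relies on the header line used by PEM-encoded private keys. It will miss keys stored in other formats and is no substitute for a dedicated discovery tool.

```python
import os
from pathlib import Path

# Hypothetical shortlist of file names that often hold forgotten traces.
SUSPICIOUS_NAMES = {".bash_history", "id_rsa", "id_ed25519"}

# PEM-encoded private keys start with a line such as
# "-----BEGIN RSA PRIVATE KEY-----" or "-----BEGIN PRIVATE KEY-----".
PRIVATE_KEY_MARKER = b"PRIVATE KEY-----"


def find_sensitive_traces(root, max_bytes=65536):
    """Return (path, reason) pairs for files that look like sensitive traces."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if name in SUSPICIOUS_NAMES:
                findings.append((path, "suspicious file name"))
                continue
            try:
                with open(path, "rb") as handle:
                    head = handle.read(max_bytes)  # inspect only the start
            except OSError:
                continue  # unreadable file; skipped in this sketch
            if PRIVATE_KEY_MARKER in head:
                findings.append((path, "private key marker"))
    return findings
```

The result is a review list, not a verdict: each finding still has to be checked by someone who knows whether the file is in use.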

Security risks can become real incidents if dark data falls into the hands of malicious people or becomes visible to certain parties, such as business competitors, because its owner lacks control over it.

Hidden risks or monsters in the ocean of data

Professionals concerned with standards have reason to worry about dark data in their organizations, as unmanaged data often contains outdated, inaccurate information that can be misinterpreted if found by auditors or law enforcement officials.

All forms of electronically stored information (ESI) can be the subject of litigation, even if the data is out of date or incomplete. The presence of unmanaged, uncategorized dark data can increase the volume and cost of discovering and analyzing it. The risk increases if the dark data contains drafts or copies of documents that should have been destroyed.

Failure to examine dark data, whether unstructured or forgotten, can lead to financial losses due to non-compliance with legal regulations, in addition to damaging the company’s business reputation.

Data that regulations require to be kept, but that is not stored properly, can lead to sanctions for the organization. When this data is requested by a court but cannot be located, the company can pay a serious price.

Poorly categorized data can lead to violations of access rights. If it is not known what the relevant data sets contain, employees may end up with access they should not have. If in such cases they access sensitive information, this can lead to business risks of data breaches.

Maintaining all data in archiving systems can create the illusion of security, but if the organization does not know what the data is or where it is located, the cost of storing and managing it can easily exceed acceptable limits. Huge volumes of data lead to long backup times, and restoring them can become problematic because of the time needed for recovery and analysis, which compromises the whole backup scheme.

At first glance, the line between PII (Personally Identifiable Information) and non-PII data is blurred. For example, it has long been known that certain data items may seem anonymous in isolation but, taken together, can effectively identify a specific person.

The best-known example of such so-called quasi-PII is the trio of full date of birth, zip code, and gender. If a company publishes a data set that has been “anonymized or depersonalized” by removing all standard identifiers but leaves these three items, a smart attacker is very likely to find the person’s name and address behind that data.
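Before publishing a supposedly anonymized data set, it is possible to measure how many records are singled out by this trio. The following minimal sketch counts how often each combination of the three fields occurs. The field names `birth_date`, `zip_code`, and `gender`, the function names, and the sample records are all assumptions made for illustration and do not come from any real data set.

```python
from collections import Counter

# Hypothetical field names for the three quasi-identifiers.
QUASI_IDENTIFIERS = ("birth_date", "zip_code", "gender")


def combination_counts(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def unique_record_fraction(records, keys=QUASI_IDENTIFIERS):
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    counts = combination_counts(records, keys)
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / len(records)


def smallest_group_size(records, keys=QUASI_IDENTIFIERS):
    """Size of the smallest group sharing a combination (the k in k-anonymity)."""
    counts = combination_counts(records, keys)
    return min(counts.values()) if counts else 0


# Invented sample data: names removed, quasi-identifiers left in place.
sample = [
    {"birth_date": "1980-01-01", "zip_code": "10001", "gender": "F"},
    {"birth_date": "1980-01-01", "zip_code": "10001", "gender": "F"},
    {"birth_date": "1975-06-15", "zip_code": "10001", "gender": "M"},
    {"birth_date": "1990-12-31", "zip_code": "94105", "gender": "F"},
]

print(unique_record_fraction(sample))  # 0.5: two of the four records are unique
print(smallest_group_size(sample))     # 1: at least one person stands alone
```

A smallest group size of 1 means that at least one record can be tied to a single individual by anyone who knows these three facts about that person, which is exactly the re-identification risk described above.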

The consequences of poor data management can be seen in an interesting experiment in which contractors purchased 200 used hard drives and SSDs from eBay and Craigslist in early 2016. Of these media, 67% contained PII, 11% sensitive corporate information, 9% emails, 5% spreadsheets with sales information, and 1% CRM records. On 36%, there was deleted data that could easily be recovered from the Recycle Bin or with simple file system commands.

Control over dark data to reduce risk

Organizations can choose to control their dark data with a plan, the right tools, and a methodology designed to shed light on the unknown. The benefits of taking these actions need to be seen through the prism of business efficiency. Avoiding dark data with modern information management approaches reduces the challenges and headaches that their presence can create.

The risks depend on the type and quality of the data that a determined investigator can extract from the collection of dark data that he or she manages to obtain. Reference: “Risk Management”, http://agilemanagement.wikidot.com/risk-management

Depending on this, the risks include:

Violation of regulations – if the data falls within the scope of regulatory measures, such as confidential data, financial information (credit cards or bank accounts), or medical data (patients), but is found in collections of dark data. Such exposure may result in violations of the law and financial liability.

Opportunities for business intelligence. If the dark data contains sensitive information that pertains to business operations, practices, competitive advantages, intellectual property, business partnerships, etc., disclosure of these facts may compromise important business ventures and relationships.

Reputation risk. Any data breach is bad for the organization. This also applies to dark data, especially when combined with other types of security risk. Reference: “What is risk management”, https://www.policymatters.net/what-is-project-risk-management/

Missed business opportunities to create know-how and enter new markets. If the organization does not invest in the detection, analysis, and research of its dark data, other companies may do so. The value hidden in it can then serve its competitors. Exploiting dark data can deepen knowledge about employees and customers, reduce costs, increase productivity and profits, and help avoid legal obligations and liabilities.

Unmeasured exposure. By definition, dark data contains information that is difficult or costly to obtain, or that comes from unknown and therefore unassessed sources, and it carries unclear potential for loss or damage to the organization. Secrets stored in dark data can be harmful, but there is no way to know for sure. This should dispel the complacency or indifference of those who do not take these risks seriously.

The main differences between the risks exposed by overt and covert dark data are presented in the table. Reference: “Risk management practices in project management”.


The described risks should be of interest to IT managers, as they affect technical and systemic issues as well as business continuity, privacy, and confidentiality. Reference: “The Qualitative Approach to Project risk assessment”.

Data that has reached the end of its life cycle resides on various devices – servers, SSDs, portable peripherals. Secure data deletion is required before these devices are reused. Otherwise, it is best to physically destroy the media using the organization’s own resources or, after secure deletion, to use a service from a verified ITAD (IT Asset Disposition) provider. ITAD is a business built around disposing of obsolete or unwanted equipment in a safe and environmentally responsible way.
