Data Mining

What's wrong with data mining today?

What is Data Mining? With the recent increase in media focus on cryptocurrencies such as Bitcoin and Ethereum, and on how the mining of these digital coins and tokens influences their value, one would be forgiven for assuming the term Data Mining is associated exclusively with cryptocurrency. That is not the case.

For perspective, let us break the term down into its concepts. Data mining is essentially the extraction of knowledge from large sets of information. That knowledge is then applied to particular use cases in the real world, such as business decisions or public services.

A block of data can be mined in five key stages: collection, storage, organisation, manipulation and visualisation. Data must first be collected by a party, through web scraping, surveys or other methods. This data is then loaded into a data warehouse. The party will then need to decide whether the data will be stored in the cloud or on physical servers. Cloud storage has proved a vital tool for most businesses today, as increasingly large data sets can be stored virtually at relatively low cost. For data mining it is, in most cases, the optimal form of storage.
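The five stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the survey records are made up, an in-memory SQLite database stands in for a data warehouse, and the function names are hypothetical rather than part of any real tool.

```python
import sqlite3

# 1. Collection: hypothetical survey responses (region, satisfaction score 1-5).
def collect():
    return [
        ("north", 4), ("north", 5), ("north", 3),
        ("south", 2), ("south", 3),
        ("east", 5), ("east", 4), ("east", 4),
    ]

# 2. Storage: load the raw records into an in-memory SQLite database,
#    used here as a stand-in for a data warehouse.
def store(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE responses (region TEXT, score INTEGER)")
    conn.executemany("INSERT INTO responses VALUES (?, ?)", records)
    conn.commit()
    return conn

# 3. Organisation: split the stored rows into sections, one per region.
def organise(conn):
    grouped = {}
    query = "SELECT region, score FROM responses ORDER BY region"
    for region, score in conn.execute(query):
        grouped.setdefault(region, []).append(score)
    return grouped

# 4. Manipulation: reduce each section to its mean score.
def manipulate(grouped):
    return {region: sum(scores) / len(scores) for region, scores in grouped.items()}

# 5. Visualisation: render a plain-text bar chart, one line per region.
def visualise(summary):
    lines = []
    for region, mean in sorted(summary.items()):
        lines.append(f"{region:<6} {'#' * round(mean * 2)} {mean:.2f}")
    return "\n".join(lines)

summary = manipulate(organise(store(collect())))
chart = visualise(summary)
print(chart)
```

Each function hands its result to the next, so a poor choice at any stage (what is collected, how it is grouped, which statistic is computed) carries through to the final chart.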

Data warehousing then puts the block of data through an organisation process, in which specific sections are set aside to be analysed separately. Alternatively, a data warehouse can be constructed after organisation, so that the warehouse is specific to a particular interest. This part of the process inevitably introduces bias and can damage how ethical the manipulation of that data is. As data is separated, variables that influence particular values in a data set or population can become distorted, or be cut off from the structural relationships that must remain intact if the quality and reliability of the data are not to be compromised.
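One well-known way this kind of distortion can show up is Simpson's paradox, where a comparison reverses once a variable is split off from the data. The sketch below uses hypothetical counts for two programmes, A and B, and a made-up "difficulty" variable. It is meant only to illustrate the point, not to describe any real data set.

```python
# Hypothetical counts: (programme, case difficulty) -> (successes, total cases).
records = {
    ("A", "easy"): (90, 100),
    ("A", "hard"): (150, 250),
    ("B", "easy"): (255, 300),
    ("B", "hard"): (25, 50),
}

def rate(successes, total):
    return successes / total

# Full view: success rate for each programme within each difficulty level.
by_group = {key: rate(*counts) for key, counts in records.items()}

# Reduced view: the difficulty variable has been split off, so the counts
# for each programme are pooled across both difficulty levels.
def pooled_rate(programme):
    successes = sum(s for (p, _), (s, _t) in records.items() if p == programme)
    total = sum(t for (p, _), (_s, t) in records.items() if p == programme)
    return rate(successes, total)

pooled = {p: pooled_rate(p) for p in ("A", "B")}

for difficulty in ("easy", "hard"):
    a = by_group[("A", difficulty)]
    b = by_group[("B", difficulty)]
    print(f"{difficulty}: A={a:.2f} B={b:.2f}")
print(f"pooled: A={pooled['A']:.2f} B={pooled['B']:.2f}")
```

With these counts, programme A does better within both the easy and the hard cases, yet programme B looks better once difficulty is dropped, because B handled mostly easy cases. An analyst who only ever sees the pooled section would draw the opposite conclusion from one who sees the full table.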

The quality of what is delivered at the visualisation and knowledge stage depends on the specific analysis of those sections of data. This is where algorithmic bias and discrimination can enter the equation. The blame for existing or developing algorithmic bias cannot be pinned on the machines. A more likely source of the problem is inaccuracies and poor professional judgements regarding forecasting methods.

In cases such as predictive mapping software in policing, questions about the integrity of the data sets used to train the models are more than valid. The scenario exposes an ethical flaw in the method: a judgement based on observing a truly random individual will lack the precision of one based on a recorded offender. A model can only train and perform as well as the quality of the data it is given allows. Yet although the integrity of such datasets and trust in the process of collection are so crucial, we still find ourselves at the beginning of a colossal ethical investigation into these very factors.

In the majority of cases, when data is finally visualised, statements that are statistically context-specific can be misconstrued, particularly in media headlines. This was evident throughout media coverage of COVID. Wildly inaccurate and bold statements seemed to be painted across headlines every other day, only for news corporations to blurt out admissions of falsified data or poorly handled datasets on behalf of organisations whose poor data mining practices narrowly escaped becoming breaking news during a pandemic.

In conclusion, I suppose one can take away at least an acknowledgement that data mining is a more complex and variable world than it seemed in the earlier stages of the internet. Data simply being generated and existing makes up an enormous part of our everyday lives. However, the ways in which it is used exert ever greater influence as we make further developments in understanding and manipulating this core component shared between humans and machines.
