Top Three Data Privacy Issues Facing AI Today
AI has taken the world by storm, but there are some critical privacy issues that need to be considered
This post was written by me and originally published on DailyHodl.com
AI (artificial intelligence) has caused frenzied excitement among consumers and businesses alike – driven by a passionate belief that LLMs (large language models) and tools like ChatGPT will transform the way we study, work and live.
But just like in the internet’s early days, users are jumping in without considering how their personal data is used – and the impact this could have on their privacy.
There have already been countless examples of data breaches within the AI space. In March 2023, OpenAI temporarily took ChatGPT offline after a ‘significant’ error let some users see titles from strangers’ conversation histories.
That same bug meant the payment information of subscribers – including names, email addresses and partial credit card numbers – was also exposed.
In September 2023, a staggering 38 terabytes of Microsoft data was inadvertently leaked by an employee, with cybersecurity experts warning this could have allowed attackers to infiltrate AI models with malicious code.
Researchers have also been able to manipulate AI systems into disclosing confidential records.
In just a few hours, a group called Robust Intelligence was able to extract personally identifiable information from Nvidia software and bypass safeguards designed to stop the system from discussing certain topics.
Lessons were learned in all of these scenarios, but each breach powerfully illustrates the challenges that need to be overcome for AI to become a reliable and trusted force in our lives.
Gemini, Google’s chatbot, even discloses that conversations may be processed by human reviewers – underscoring the privacy risks built into these systems.
“Don’t enter anything that you wouldn’t want to be reviewed or used,” an alert warns users.
AI is rapidly moving beyond a tool that students use for their homework or tourists rely on for recommendations during a trip to Rome.
It’s increasingly being depended on for sensitive discussions – and fed everything from medical questions to our work schedules.
Because of this, it’s important to take a step back and reflect on the top three data privacy issues facing AI today, and why they matter to all of us.
1. Prompts aren’t private
Tools like ChatGPT memorize past conversations in order to refer back to them later. While this can improve the user experience and help train LLMs, it comes with risk.
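To see why, it helps to know how these tools typically work under the hood. In the common chat-API pattern, the client resends the entire conversation with every request, so each past prompt travels to – and can be retained on – the provider’s servers. Here’s a minimal Python sketch of that pattern against the OpenAI API (the model name and prompts are purely illustrative):

```python
# Minimal sketch of the common chat-API pattern: the client resends the
# full conversation history with every turn. Requires the `openai`
# package and an OPENAI_API_KEY environment variable; the model name
# below is illustrative.
from openai import OpenAI

client = OpenAI()
history = []  # every prompt and reply accumulates here

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    # The entire history - not just the latest prompt - leaves your machine.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Summarize these medical symptoms: ..."))
print(ask("Now draft an email to my doctor."))  # the symptoms travel with this request too
```

The convenience and the exposure come from the same mechanism – the second request quietly carries the medical details from the first.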
If a system is successfully hacked, there’s a real danger of prompts being exposed in a public forum.
Potentially embarrassing details from a user’s history could be leaked, as well as commercially sensitive information when AI is being deployed for work purposes.
As we’ve seen from Google, all submissions can also end up being scrutinized by its development team.
Samsung took action on this in May 2023 when it banned employees from using generative AI tools altogether. That came after an employee uploaded confidential source code to ChatGPT.
The tech giant was concerned that this information would be difficult to retrieve and delete, meaning IP (intellectual property) could end up being distributed to the public at large.
Apple, Verizon and JPMorgan have taken similar action, with reports suggesting Amazon launched a crackdown after responses from ChatGPT bore similarities to its own internal data.
As you can see, the concerns extend beyond what would happen in a data breach to the prospect that information entered into AI systems could be repurposed and distributed to a wider audience.
Companies like OpenAI are already facing multiple lawsuits amid allegations that their chatbots were trained using copyrighted material.
2. Custom AI models trained by organizations aren’t private
This brings us neatly to our next point – while individuals and corporations can build custom LLMs on their own data sources, those models won’t be fully private if they exist within the confines of a platform like ChatGPT.
There’s ultimately no way of knowing whether inputs are being used to train these massive systems – or whether personal information could end up being used in future models.
Like a jigsaw, data points from multiple sources can be brought together to form a comprehensive and worryingly detailed insight into someone’s identity and background.
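To make that concrete, here’s a toy Python sketch – with entirely fabricated records – of how two innocuous-looking datasets, linked on something as mundane as a ZIP code, can reveal more together than either does alone:

```python
# Toy illustration of the "jigsaw" effect, using entirely fabricated
# records: two innocuous datasets, linked on a shared attribute,
# reveal more together than either does alone.
chatbot_logs = [
    {"user_id": "u123", "zip": "90210", "topic": "diabetes medication"},
]
public_profiles = [
    {"name": "Jane Doe", "zip": "90210", "employer": "Acme Corp"},
]

for log in chatbot_logs:
    for profile in public_profiles:
        if log["zip"] == profile["zip"]:  # a quasi-identifier links the records
            print(f"{profile['name']} ({profile['employer']}) "
                  f"may have asked about: {log['topic']}")
```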
Major platforms may also fail to offer detailed explanations of how this data is stored and processed – and may give users no way to opt out of features they’re uncomfortable with.
Beyond responding to a user’s prompts, AI systems increasingly have the ability to read between the lines and deduce everything from a person’s location to their personality.
In the event of a data breach, dire consequences are possible. Incredibly sophisticated phishing attacks could be orchestrated – and users targeted with information they had confidentially fed into an AI system.
Other potential scenarios include this data being used to assume someone’s identity, whether that’s through applications to open bank accounts or deepfake videos.
Consumers need to remain vigilant even if they don’t use AI themselves. AI is increasingly being used to power surveillance systems and enhance facial recognition technology in public places.
If such infrastructure isn’t established in a truly private environment, the civil liberties and privacy of countless citizens could be infringed without their knowledge.
3. Private data is used to train AI systems
There are concerns that major AI systems have gleaned their intelligence by poring over countless web pages.
Estimates suggest 300 billion words were used to train ChatGPT – that’s 570 gigabytes of data – with books and Wikipedia entries among the datasets.
Algorithms have also been known to depend on social media pages and online comments.
With some of these sources, you could argue that the owners of this information would have had a reasonable expectation of privacy.
But here’s the thing – many of the tools and apps we interact with every day are already heavily influenced by AI – and react to our behaviors.
The Face ID on your iPhone uses AI to track subtle changes in your appearance.
TikTok and Facebook’s AI-powered algorithms make content recommendations based on the clips and posts you’ve viewed in the past.
Voice assistants like Alexa and Siri depend heavily on machine learning, too.
A dizzying constellation of AI startups is out there, and each has a specific purpose. However, some are more transparent than others about how user data is gathered, stored and applied.
This is especially important as AI makes an impact in the field of healthcare – from medical imaging and diagnoses to record-keeping and pharmaceuticals.
Lessons need to be learned from the internet businesses caught up in privacy scandals over recent years.
Flo, a women’s health app, was accused by regulators of sharing intimate details about its users with the likes of Facebook and Google in the 2010s.
Where do we go from here?
AI is going to have an indelible impact on all of our lives in the years to come. LLMs are getting better with every passing day, and new use cases continue to emerge.
However, there’s a real risk that regulators will struggle to keep up as the industry moves at breakneck speed.
And that means consumers need to start securing their own data and monitoring how it is used.
Decentralization can play a vital role here and prevent large volumes of data from falling into the hands of major platforms.
DePINs (decentralized physical infrastructure networks) have the potential to ensure everyday users experience the full benefits of AI without their privacy being compromised.
Not only could encrypted prompts deliver far more personalized outcomes, but privacy-preserving LLMs would ensure users have full control of their data at all times – and protection against it being misused.
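To close with a picture of what that could mean in practice – and this is only a toy sketch under generous assumptions, not a real privacy-preserving LLM – here’s the ‘user holds the key’ half of the idea in Python, using the cryptography library’s Fernet scheme. Genuine private inference would also need the model to process prompts without ever seeing the plaintext, which remains a much harder problem:

```python
# Toy sketch of the "user holds the key" idea: a prompt encrypted
# client-side, so anyone storing or relaying it sees only ciphertext.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

user_key = Fernet.generate_key()  # generated and kept on the user's device
cipher = Fernet(user_key)

prompt = b"My blood test results came back with elevated markers..."
ciphertext = cipher.encrypt(prompt)

print(ciphertext[:40])             # all a platform would ever see
print(cipher.decrypt(ciphertext))  # only the key holder can recover this
```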