Building Personalized Datasets: The First Step to Conceiving a Digital Twin

For individuals to build a personal and private AI Assistant, they first need robust, unique datasets to train their Personalized Learning Model.

Existing forms of Artificial Intelligence use foundation models trained on large amounts of data. Businesses focused on building AI models are using data they have stored for years. Most individuals, however, either do not store their data at all or lack the knowledge to store it in a format that can be used to train AI models. The first step to creating personal AI co-pilots, free from corporate influence, is to provide end users with a means to construct personal datasets in their own environments through a simple, private, and secure Data Collection Tool.

Overview

Everyone can start their journey towards a privately controlled artificial intelligence today. The first step is accumulating data through observation of digital behavior. To accommodate individuals of varying degrees of technical knowledge and trust in the digital realm, a simple, secure Data Collection Tool must be provided to begin birthing a digital twin. Individuals must establish a rich backend dataset before taking any further steps in the process. This will take time. Once enough behavioral data is gathered, the models need to be trained. This takes even more time and depends on the amount of data accumulated and the hardware resources available to each individual. It is vital that we provide end users the ability to capture data securely and locally as soon as possible. (1)
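
As a rough illustration of what capturing data securely and locally could look like, here is a minimal sketch in Python using only the standard library. The storage path, event names, and fields are hypothetical placeholders, not a finished schema.

```python
# A minimal sketch of local-first behavioral capture, using only the
# Python standard library. The event fields below are illustrative
# placeholders, not a finished schema.
import json
import time
from pathlib import Path

# Data stays on the individual's own disk -- no cloud endpoint involved.
DATA_DIR = Path.home() / ".digital-twin" / "observations"
DATA_DIR.mkdir(parents=True, exist_ok=True)
LOG_FILE = DATA_DIR / "events.jsonl"

def record_event(source: str, kind: str, payload: dict) -> None:
    """Append one observed behavioral event as a JSON Lines record."""
    event = {
        "timestamp": time.time(),  # when the behavior was observed
        "source": source,          # e.g. "browser", "editor", "shell"
        "kind": kind,              # e.g. "page_visit", "file_saved"
        "payload": payload,        # event-specific details
    }
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example usage: log a single observation.
record_event("browser", "page_visit", {"url": "https://example.org"})
```

Appending one JSON record per line keeps the format human-readable and easy to migrate once a real schema is settled.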

The Missing Backend Data

Frontend capabilities for a personal digital co-pilot already exist and can be used locally with personalized datasets on any computer with enough resources. (2, 3) The problem is the backend. The data is the missing piece of this puzzle. Individuals need the ability to create a local dataset that will eventually train a model to serve as a backend for the frontend models. We do not need to reinvent the wheel. We need to customize it.
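
As one illustration of that existing frontend, the sketch below runs an open model locally through the Hugging Face Transformers pipeline API (2). The model named here is a small public stand-in; a trained PLM would eventually serve as the personalized backend behind such a frontend.

```python
# A minimal sketch of the existing "frontend": running an open model
# locally with Hugging Face Transformers. The model below is a small
# public stand-in, not a personalized model.
from transformers import pipeline

# Downloads once, then runs entirely on local hardware.
generator = pipeline("text-generation", model="distilgpt2")

prompt = "Summarize what I worked on today:"
result = generator(prompt, max_new_tokens=50)
print(result[0]["generated_text"])
```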

Popular large language models and machine learning datasets are built by large corporations like OpenAI, Google, and Meta, who have fine-tuned their models with years of data analysis (4). Even large businesses building their own knowledge-based models have years of data to work from. The masses need to start collecting their own personal datasets as a basis for their own Personalized Learning Model (PLM).

This proposal defines a Personalized Learning Model as:

Personalized Learning Model: A proprietary machine learning model with a unique neural network trained from one human’s private behavioral datasets.

Open source engineers can provide a software solution that simplifies the data collection process required to construct the core of the Personalized Learning Model.

The Data Collection Tool

The proposed Data Collection Tool will provide individuals with an interface to choose 1) what data is stored, 2) where the data is stored, and 3) how the data is collected. The tool should operate outside corporate cloud solutions, preferably on personal LANs. The tool’s priority is individual dataset ownership through localized storage, which establishes trust in the tool’s integrity.
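
As a sketch of how those three choices might be represented inside the tool, consider the following local configuration object; every field name and default here is hypothetical.

```python
# A sketch of the tool's three user choices expressed as a local
# configuration object. All field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CollectionConfig:
    # 1) What data is stored: categories the user has opted into.
    enabled_sources: list[str] = field(
        default_factory=lambda: ["browser_history", "documents"]
    )
    # 2) Where the data is stored: a path the user owns, never a cloud URL.
    storage_path: str = "~/.digital-twin/observations"
    # 3) How data is collected: continuous background capture or manual.
    capture_mode: str = "manual"  # or "background"

config = CollectionConfig()
print(config)
```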

Building Trust through Education

There is also an opportunity to build trust and rapport through education during onboarding. Educating individuals with an overview of AI is most impactful at this phase of their twin’s conception. Data collection is the easiest part of the PLM to understand from a technical or mathematical perspective. The next step, training the PLM through semi-supervised learning (5), may go beyond some individuals’ understanding and cause a loss of interest. Proving that a digital twin is accessible to all, despite varying educational backgrounds, should take precedence at the data collection stage. Doing so builds confidence in the technology stack and in the engineers building the tool.

Considerations for the Architecture

How do we build an application that collects private behavioral data for end users with varying degrees of trust and knowledge without instilling fear and doubt?

1. The app must operate in a local, secure environment, on a user’s local network or local device (see the sketch after this list).

2. The GUI should follow existing standard UX practices to encourage trust through a familiar medium.

3. App users must be educated in the underlying architecture during onboarding (e.g., the downsides of web2 architecture, the importance of network security at the edge, or how datasets are used to train models).

  • Semantic education helps with trust too.
    • For instance, the word collection can cause anxiety in the context of particular government or corporate entities, but the verb collect is also defined as “to gain or regain control of.” Since the goal of the digital twin is to empower individuals to regain control over their data, small semantic observations like this may help reiterate the tool’s goal.
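
To illustrate the first consideration above, the sketch below binds the tool’s interface to the loopback address so it is unreachable from outside the device. It uses only the Python standard library, and the port number is an arbitrary placeholder.

```python
# A sketch of consideration 1: serving the tool's interface only on the
# loopback address, so it cannot be reached from outside the device.
# The handler and port number are placeholders.
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Binding to 127.0.0.1 (instead of 0.0.0.0) keeps the GUI local-only.
server = HTTPServer(("127.0.0.1", 8421), SimpleHTTPRequestHandler)
print("Data Collection Tool UI at http://127.0.0.1:8421 (local only)")
server.serve_forever()
```

Binding to 127.0.0.1 rather than 0.0.0.0 is a small choice, but it is exactly the kind of detail onboarding education can make visible.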

Problems to Overcome

There are limitations to be addressed, such as a person’s hardware resources and the complexity of collecting behavioral data from multiple personal devices into a single, securely owned central server.

Priorities

Regardless of the problems posed, this open source tool is an opportunity to help individuals ‘mix and pour’ a solid foundation at the onset.

Therefore, there are two values this tool must prioritize: 1) personal ownership and 2) trust in the technology. The individual must own the storage solution (ideally with physical access to it) and have a high-level understanding of what the tool empowers them to create.

Conclusion

Without personalized datasets to train a unique neural network matched to a sole individual, one cannot birth a genuine digital twin. Skepticism towards digital autonomy will always lurk if the opportunity to simplify data collection and educate individuals is not seized at this fertile stage. Personalized Learning Models cannot be trained, or trusted, without the proper datasets.

Below are questions to spark more thorough dialogue on building the Data Collection Tool.

Further Questions

– How can individuals with old devices still benefit from the tool?

– How does the app/tool monitor behavior? What can be monitored?

– Does the tool work constantly in the background on a device?

– Can the tool collect from all the individual’s personal devices without using a central server on a corporate cloud?

– Will every end user need an education in network security and server administration?

– Can this tool be the first opportunity to establish and implement standardized rules in democratized AI?

– How do we deal with semantics for establishing trust?

“Data management is more than merely building the models you’ll use for your business. You’ll need a place to store your data and mechanisms for cleaning it and controlling for bias before you can start building anything.” (6)
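
In the spirit of that quote, a minimal cleaning pass over the locally collected log (continuing the earlier capture sketch) might look like the following; genuine cleaning and bias control would require far more care.

```python
# A sketch of a basic cleaning pass over the JSON Lines log from the
# earlier capture sketch: skip malformed lines, drop events with empty
# payloads, and remove exact duplicates.
import json
from pathlib import Path

LOG_FILE = Path.home() / ".digital-twin" / "observations" / "events.jsonl"

seen = set()
cleaned = []
for line in LOG_FILE.read_text(encoding="utf-8").splitlines():
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip records that failed to parse
    if not event.get("payload"):
        continue  # drop events that carry no usable content
    key = json.dumps(event, sort_keys=True)
    if key not in seen:  # remove exact duplicate records
        seen.add(key)
        cleaned.append(event)

print(f"Retained {len(cleaned)} clean events")
```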

Interested in joining a community focused on democratizing AI? Check out Kwaai AI Lab.

Sources

1. https://huggingface.co/docs/transformers/perf_train_gpu_one

2. https://huggingface.co/docs/transformers/index

3. https://www.youtube.com/watch?v=wrD-fZvT6UI

4. https://beebom.com/best-large-language-models-llms/

5. https://www.ibm.com/blog/supervised-vs-unsupervised-learning/

6. https://www.ibm.com/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks/

Further Reading

Existing Personalized LLMs

https://www.personal.ai/ Personal AI is strictly for language bots: it creates a language model based on your communications. Not localized; available only in the cloud.

https://link.springer.com/chapter/10.1007/978-3-031-24337-0_22 This paper covers strictly language-based models.

Decentralized Data Store

https://solidproject.org/

Best Large Language Models in Use Today

https://beebom.com/best-large-language-models-llms/

Artificial “Narrow” Intelligence vs. Artificial “General” & “Super” Intelligence

https://www.ibm.com/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks/ Claims AGI and ASI don’t exist yet. Regardless, individuals with personalized datasets can harness and influence these powers when they do arise.

Hugging Face Transformers – an open source solution to the frontend of a digital twin

https://huggingface.co/docs/transformers/index

https://huggingface.co/docs/transformers/transformers_agents

Supervised vs Unsupervised Learning

https://www.ibm.com/blog/supervised-vs-unsupervised-learning/

Fine Tuning vs. Knowledge-Based Models

https://www.promptengineering.org/master-prompt-engineering-llm-embedding-and-fine-tuning/

Edge Computing

https://www.etsi.org/technologies/multi-access-edge-computing

Semantics

Sourced from Merriam-Webster Dictionary.

Persona (noun) 2b: the personality that a person (such as an actor or politician) projects in public

Personality (noun) 3a: the complex of characteristics that distinguishes an individual or a nation or group. Especially the totality of an individual’s behavioral and emotional characteristics

Personalize (verb) 2: to make personal or individual

Twin (noun) 2: one of two persons or things closely related to or resembling each other

Unique (adj) 1: being the only one : sole

Collect (verb) 3: to gain or regain control of

Digital (adj) 1: of, relating to, or utilizing devices constructed or working by the methods or principles of electronics

Virtual (adj) 2: being on or simulated on a computer or computer network

Autonomy (noun) 1: the quality or state of being self-governing