Getting started in synthetic data generation with AWS Clean Rooms - Analytics

Source: Dev.to

## What is hot off the press ✨ from AWS re:Invent 2025 for Analytics?

I am still enthralled and excited about the data and AI innovation for data scientists, researchers, data engineers and data analysts announced in the keynotes from Amazon Web Services CEO Matt Garman on Day 2 and VP of Agentic AI Dr Swami Sivasubramanian on Day 3.

Maintaining data privacy is important for enterprises and data science teams involved in training machine learning models. Let me introduce you to a new feature for AWS Clean Rooms that enhances privacy in synthetic data generation for machine learning model training.

## Lesson Objectives

In this lesson you will learn the following:

- What are AWS Clean Rooms?
- What are the benefits?
- What are the new features?
- What are the use cases?
- How do I get started?

## What are AWS Clean Rooms?

AWS Clean Rooms is an AWS service that allows organizations to analyze their data in a secure environment and collaborate with others without sharing their underlying proprietary data.

## What are the benefits?

AWS Clean Rooms allows organizations to generate data insights from multiple companies without having to physically move raw data. You may use APIs to include AWS Clean Rooms in your company's workflow.

For example, if you work on a multi-agency government transformation project, you may create permissions to your raw data in an AWS or Snowflake environment and allow contractors or another team's data analysts to collaborate with system analysts from the transformation department using zero-ETL.

You do not need a data engineer to build or maintain a complex data pipeline. Once you add permissions in AWS Clean Rooms, other participants may collaborate and access data from another company or department.
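As a minimal sketch of calling the API from a workflow, the helper below lists the names of the collaborations visible to an account through the AWS SDK for Python (boto3). It assumes the `cleanrooms` client's `list_collaborations` operation returns a `collaborationList` of items with a `name` field plus an optional `nextToken`, which is my reading of the AWS Clean Rooms API Reference and should be checked there. The helper name `collaboration_names` is hypothetical.

```python
from typing import Any, Dict, List


def collaboration_names(client: Any) -> List[str]:
    """Return the names of all collaborations visible to the caller.

    `client` is expected to behave like boto3.client("cleanrooms").
    The response keys used here (collaborationList, name, nextToken)
    are assumptions based on the AWS Clean Rooms API Reference.
    """
    names: List[str] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = client.list_collaborations(**kwargs)
        names.extend(item["name"] for item in page.get("collaborationList", []))
        token = page.get("nextToken")
        if not token:
            # No further pages: return everything collected so far.
            return names
        kwargs["nextToken"] = token


# Example usage (requires AWS credentials and the boto3 package):
#   import boto3
#   client = boto3.client("cleanrooms", region_name="ap-southeast-2")
#   print(collaboration_names(client))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object before pointing it at a real account.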
You may also integrate other AWS services with AWS Clean Rooms, such as Amazon Athena, AWS Glue, Amazon S3, AWS Secrets Manager, AWS CloudTrail and AWS CloudFormation.

## What are the new features?

At AWS re:Invent 2025, AWS Clean Rooms launched a new feature that enhances privacy when generating synthetic data for a custom ML model. It became generally available on November 30, 2025. This feature allows teams to create a synthetic version of sensitive data that may be used in a secure environment to train machine learning models. With synthetic data generation, people and entities from the original dataset are de-identified.

## What are the use cases?

With privacy-enhancing synthetic data generation there are more machine learning use cases in industries such as healthcare, government, defence, marketing and more. It allows data scientists and analysts to access granular data for machine learning model training.

## How do I get started?

There are considerations to review before you get started with synthetic data generation, and there are prerequisites for establishing an AWS Clean Rooms collaboration.

## Tutorial: Synthetic dataset generation for ML model training

Step 0: Log in to your AWS account as an IAM admin user. (Note: If you do not have an AWS account you can create one here). Navigate to AWS Clean Rooms. Today I am working from the AWS Sydney region (ap-southeast-2). This is the list of other AWS supported regions for AWS Clean Rooms.

Step 1: A collaboration member creates an analysis template that includes:

a) SQL that defines the dataset.
b) Privacy-related configurations that ensure the synthetic data meets data providers' compliance requirements.

Define the collaboration by entering a name and a description, followed by member details.

Specify member abilities:

a) Identify which user will run SQL queries
b) Identify who will receive the results of the analysis
c) Identify which user will train the machine learning models
d) Identify who will receive the output from inference

For analysis queries and for purpose-built machine learning model workflows, you may choose which member is responsible for the payment.
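The collaboration settings described so far in Step 1 can also be expressed as an API request. The sketch below only builds the keyword arguments for the `create_collaboration` operation of the boto3 `cleanrooms` client; it does not call AWS. The field names and ability values follow my reading of the AWS Clean Rooms API Reference and should be verified there. The helper name `build_collaboration_request`, the two-member layout and the choice of payer are hypothetical, and the ML-specific abilities (model training and inference output) are left out.

```python
from typing import Any, Dict


def build_collaboration_request(
    name: str,
    description: str,
    creator_display_name: str,
    partner_account_id: str,
    partner_display_name: str,
) -> Dict[str, Any]:
    """Build keyword arguments for cleanrooms.create_collaboration.

    Assumption: the creator runs queries and pays for query compute,
    while one partner account receives the results of the analysis.
    """
    return {
        "name": name,
        "description": description,
        "creatorDisplayName": creator_display_name,
        # The creator is the member who runs SQL queries.
        "creatorMemberAbilities": ["CAN_QUERY"],
        # The creator agrees to pay for query compute.
        "creatorPaymentConfiguration": {"queryCompute": {"isResponsible": True}},
        "members": [
            {
                "accountId": partner_account_id,
                "displayName": partner_display_name,
                # The partner receives the results of the analysis.
                "memberAbilities": ["CAN_RECEIVE_RESULTS"],
            }
        ],
        "queryLogStatus": "ENABLED",
    }


# Example usage (requires AWS credentials and the boto3 package;
# the account ID below is a placeholder):
#   import boto3
#   request = build_collaboration_request(
#       "demo-collab", "Synthetic data demo", "Team A", "111122223333", "Team B"
#   )
#   boto3.client("cleanrooms").create_collaboration(**request)
```

Keeping the request as plain data makes it easy to review who can query, who receives results and who pays before anything is created in the account.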
You may specify which Amazon S3 bucket will hold your machine learning model output and the result format of the analysis, e.g. CSV or Parquet. Before creating the membership, you will also need to check the box to agree to pay for the compute costs of the collaboration.

Step 2: Once data providers approve the required analysis template, the collaboration query creates a machine learning input channel. You may classify the columns in your data output schema as categorical or numerical.

Step 3: Clean Rooms ML generates a synthetic dataset and verifies that it meets the privacy thresholds in the analysis template. You may adjust the privacy threshold, as a percentage between 50 and 100, so that individuals in the raw data cannot be identified.

Step 4: Once the thresholds are satisfied, the ML input channel is populated with the synthetic dataset. You may adjust the privacy level (epsilon) to match your organization's privacy requirements.

Step 5: Customers use the ML input channel to train the custom ML model linked to the collaboration.

## Next Month

Being selected as an AWS Community Builder in your area of expertise enables you to win AWS swag, participate in webinars, join a global community of builders, access AWS certification vouchers and receive mentorship from AWS experts to level up your knowledge.

## 10,000 AI Ideas Competition

If you are looking for a challenge and would like to win cash prizes, you may participate now in an AWS AI project over the summer holidays: the 10,000 AI Ideas Competition. You may be creative and build solutions using Kiro and other AI services such as Amazon Bedrock.

Until the next update, happy learning! 😀
A further benefit of AWS Clean Rooms:

- Link and match customer records from multiple companies to train and deploy machine learning models. You may even bring in your own machine learning model and deploy it to access data insights from other companies without sharing a custom ML model or raw data.

Considerations before you get started with synthetic data generation:

- You need to configure details of who will pay for the synthetic dataset.
- Synthetic data generation does not remove or redact sensitive values (i.e. PII) from the original dataset.
- It does not support data generation from text data.

Prerequisites for establishing an AWS Clean Rooms collaboration:

- Sign up for an AWS account
- Set up service roles for AWS Clean Rooms such as administrator or collaboration member
- Set up service roles for AWS Clean Rooms ML

To start the tutorial in the AWS Clean Rooms console:

- Select Create Collaboration

## More Learning Resources

- AWS Clean Rooms User Guide
- AWS On Air: Collaborate across multiple data sources and clouds in AWS Clean Rooms
- Creating a configured table with Snowflake data source
- AWS Clean Rooms SQL Reference
- AWS Clean Rooms API Reference
- AWS Clean Rooms ML API Reference
- AWS Clean Rooms launches privacy-enhancing synthetic dataset generation for ML model training

Opening next month:

- If you would like to start your new year in a vibrant community, the AWS Community Builders Program will open in early January 2026 for the annual application intake. You may join the waitlist and be notified by email as soon as applications open.