Confidential Compute

Install Docker images

  • Install a Weavechain node locally if not done already; the easiest way is to start it as a Docker container
  • Install a local Jupyter server to connect to the node (if not done already)
  • Allow running local Docker images by exposing the Docker socket over TCP:
  docker run -d -v /var/run/docker.sock:/var/run/docker.sock -p 0.0.0.0:2375:2375 bobrik/socat TCP-LISTEN:2375,fork UNIX-CONNECT:/var/run/docker.sock
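  • optionally, confirm the Docker API is now reachable through the proxy (a standard Docker Engine endpoint that returns version info as JSON)
  curl http://127.0.0.1:2375/version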

Prepare the data

  • go to the folder where the node was installed, download sample.csv and save it under storage/files/private_files (create the private_files folder if missing), or do it from the command line once in the Weavechain node folder:
  mkdir -p storage/files/private_files
  cd storage/files/private_files
  curl -O https://public.weavechain.com/file/sample.csv  

Run the notebook

  • make sure both the node and the Jupyter server containers are running
  • connect to the local Jupyter server at http://127.0.0.1:18888/notebooks/sample-compute.ipynb
  • use the token taken from the weave_jupyter_public docker logs
  • OR, if you're not using the provided docker server, download the notebook from here and run it in your locally configured Jupyter server
  • run the cells one by one; in case of errors, check that the docker images are running without errors in their logs and that the required ports are open
  • if you get stuck, contact us on Telegram or via email at support@weavechain.com
  • see below a non-interactive rendering of how the notebook should look

Sample of expected notebook:


In this demo notebook we showcase a compute-to-data flow:

  • we create a table in a private data collection that cannot be accessed remotely
  • we populate it with data
  • we run a compute-to-data task that trains a model without having access to the data
  • we verify the task lineage
  • we generate a ZK proof for the data

The default Weavechain node installation is preconfigured to support this scenario: it connects to a public weave, has a private data collection defined and mapped to an in-process SQLite instance, and read rights for that collection are already granted.

1. Create an API session

import pandas as pd
import json

from weaveapi.records import *
from weaveapi.options import *
from weaveapi.filter import *
from weaveapi.weaveh import *

WEAVE_CONFIG = "config/demo_client_local.config"
nodeApi, session = connect_weave_api(WEAVE_CONFIG)
{"res":"ok","data":"pong 1674717913804"}

2. Read data from the prepared file

  • go to the folder where the local node was installed
  • download sample.csv and place it under the storage/files/private_files folder
  • in theory we could have used the file from the Jupyter server; we use this step to show how to connect local storage to the node
  • the private_files storage is already configured in the node at install time, with the following config section marking it as non-replicated and as storing raw files (many formats are supported, from CSV to Feather or ORC, with each such file treated as a table)
  'private_files': {
    'connectionAdapterType': 'file',
    'replication': {
      'type': 'none'
    },
    'fileConfig': {
      'rootFolder': 'weavestorage/files',
      'format': 'file'
    }
  }
file_storage = "private_files"
file = "sample.csv"

import csv, base64
from io import StringIO
reply = nodeApi.downloadTable(session, file_storage, file, None, "file", READ_DEFAULT_NO_CHAIN).get()
data = base64.b64decode(reply["data"]).decode("utf-8-sig")

df = pd.read_csv(StringIO(data), sep=",")
display(df.head())
id name age gender air_pollution alcohol_use dust_allergy occupational_hazards genetic_risk chronic_lung_disease ... fatigue weight_loss shortness_of_breath wheezing swallowing_difficulty clubbing_of_fingernails frequent_cold dry_cough snoring level
0 1 Lorenzo Rasmussen 33 1 2 4 5 4 3 2 ... 3 4 2 2 3 1 2 3 4 1
1 2 Zechariah Gallegos 17 1 3 1 5 3 4 2 ... 1 3 7 8 6 2 1 7 2 2
2 3 Lukas Jenkins 35 1 4 5 6 5 5 4 ... 8 7 9 2 1 4 6 7 2 3
3 4 Trey Holden 37 1 7 7 7 7 6 7 ... 4 2 3 1 4 5 6 7 5 3
4 5 Branson Rivera 46 1 6 8 7 7 7 6 ... 3 2 4 1 4 2 4 2 3 3

5 rows × 26 columns
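
Before creating the table it can help to sanity-check the decoded frame with plain pandas (no Weavechain calls; the expectations in the comments are assumptions based on the sample file):

# quick integrity checks on the downloaded sample
print(df.shape)                     # expecting 26 columns, per the display above
print(df["id"].is_unique)           # id becomes a unique index in the next step
print(df.isna().sum().sum() == 0)   # the sample is assumed to have no missing values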

3. Create a local private table starting from the file

  • we drop the table if it already exists and re-create it from scratch
  • a Weavechain node can also connect to existing tables and read their structure, but in this case we create the table via the API
  • we create the table in a pre-configured data collection that is not replicated and is stored in a local SQLite database
  'private': {
    'connectionAdapterType': 'sqlite',
    'replication': {
      'type': 'none'
    },
    'jdbcConfig': {
      'database': 'weavestorage/storage_private.db'
    }
  }
data_collection = "private"
table = "oncology_data"

columns = {}
for c in df.columns:
    if c == "id":
        # primary key column: indexed, unique, non-null
        coldef = { "type": "LONG", "isIndexed": True, "isUnique": True, "isNullable": False }
    elif c == "name":
        # PII column: erased when read transformations are applied
        coldef = { "type": "STRING", "readTransform": "ERASURE" }
    else:
        coldef = { "type": "DOUBLE" }
    columns[c] = coldef

layout = {
    "columns": columns,
    "idColumnIndex": 0,                 # position of the primary key column
    "isLocal": False,                   # remote reads still allowed; flipped to True in step 4
    "applyReadTransformations": True    # enforce readTransform rules (e.g. ERASURE on name)
}
#print(layout)

nodeApi.dropTable(session, data_collection, table).get()
reply = nodeApi.createTable(session, data_collection, table, CreateOptions(False, False, layout)).get()
print(reply)

records = Records(table, df.to_numpy().tolist())
reply = nodeApi.write(session, data_collection, records, WRITE_DEFAULT).get()
print(reply)
{'res': 'ok', 'target': {'operationType': 'CREATE_TABLE', 'organization': 'weavedemo', 'account': 'weaveyh5R1ytoUCZnr3JjqMDfhUrXwqWC2EWnZX3q7krKLPcg', 'scope': 'private', 'table': 'oncology_data'}}
{'res': 'ok', 'target': {'operationType': 'WRITE', 'organization': 'weavedemo', 'account': 'weaveyh5R1ytoUCZnr3JjqMDfhUrXwqWC2EWnZX3q7krKLPcg', 'scope': 'private', 'table': 'oncology_data'}, 'data': 'weaveyh5R1ytoUCZnr3JjqMDfhUrXwqWC2EWnZX3q7krKLPcg,4RKj6WTnS2AQrLXz04Sr2UnBxcS7dn0am5ymb2KiHDs=,3pHcKrG2afiuoMQ6x6w8GaArkn5TjjfwCcwiCqWotW1kUKMPh4kAv32yBmU8Lr85dcYcv1g68TexDb4riPMZAyQB'}
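
The data field of the WRITE acknowledgment appears to pack three comma-separated values: the writer account, a hash of the records batch and a signature (the field meanings here are an assumption inferred from the visible format):

# ASSUMPTION: reply["data"] has the form "<writer account>,<records hash>,<signature>"
writer, records_hash, write_sig = reply["data"].split(",")
print("Writer:", writer)
print("Records hash:", records_hash)
print("Signature:", write_sig)
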
reply = nodeApi.read(session, data_collection, table, None, READ_DEFAULT_NO_CHAIN).get()
#print(reply)
df = pd.DataFrame(reply["data"])

df.head()
id name age gender air_pollution alcohol_use dust_allergy occupational_hazards genetic_risk chronic_lung_disease ... fatigue weight_loss shortness_of_breath wheezing swallowing_difficulty clubbing_of_fingernails frequent_cold dry_cough snoring level
0 1 Lorenzo Rasmussen 33 1 2 4 5 4 3 2 ... 3 4 2 2 3 1 2 3 4 1
1 2 Zechariah Gallegos 17 1 3 1 5 3 4 2 ... 1 3 7 8 6 2 1 7 2 2
2 3 Lukas Jenkins 35 1 4 5 6 5 5 4 ... 8 7 9 2 1 4 6 7 2 3
3 4 Trey Holden 37 1 7 7 7 7 6 7 ... 4 2 3 1 4 5 6 7 5 3
4 5 Branson Rivera 46 1 6 8 7 7 7 6 ... 3 2 4 1 4 2 4 2 3 3

5 rows × 26 columns

4. Mark the table private

layout["isLocal"] = True
nodeApi.updateLayout(session, data_collection, table, json.dumps({ "layout": layout})).get()
{'res': 'ok',
 'target': {'operationType': 'UPDATE_LAYOUT',
  'organization': 'weavedemo',
  'account': 'weaveyh5R1ytoUCZnr3JjqMDfhUrXwqWC2EWnZX3q7krKLPcg',
  'scope': 'private',
  'table': 'oncology_data'}}

Data can no longer be read except from the local node (we expect a Not authorized reply here)

reply = nodeApi.read(session, data_collection, table, None, READ_DEFAULT_NO_CHAIN).get()
print(reply)
{'res': 'err', 'target': {'operationType': 'READ', 'organization': 'weavedemo', 'account': 'weaveyh5R1ytoUCZnr3JjqMDfhUrXwqWC2EWnZX3q7krKLPcg', 'scope': 'private', 'table': 'oncology_data'}, 'message': 'Not authorized'}
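
When a read is expected to fail, checking the res field before touching the payload keeps the notebook robust; a minimal defensive pattern over the reply shape shown above:

# the reply above has res == "err" and a message explaining why
if reply.get("res") != "ok":
    print("Access denied as expected:", reply.get("message"))
else:
    raise AssertionError("the table should no longer be readable remotely")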

5. Train an ML model on the private data

  • run on the node machine
  docker pull gcr.io/weavechain/oncology_xgboost:latest
  • use the latest-arm64 tag if your machine is ARM-based
  • the data owner needs to explicitly enable running a given image
  • the node needs to be able to connect to the local docker instance
  • in the default configuration file installed with the node, the sample script is pre-authorized with the following line
  'allowedImages': [ 'gcr.io/weavechain/oncology_xgboost' ]
  • in case of error, uncomment the #print(reply) below to see details
  • (compute-to-data is just one of the confidential computing patterns supported; MPC and Homomorphic Encryption could also be used)
reply = nodeApi.compute(session, "gcr.io/weavechain/oncology_xgboost", COMPUTE_DEFAULT).get()
#print(reply)
output = reply["data"]["output"]
print(output[:1200] + "...")
output = json.loads(output)
{"model": "YmluZgAAAD8XAAAAAwAAAAEAAAAAAAAAAQAAAAcAAAABAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOAAAAAAAAAG11bHRpOnNvZnRwcm9iBgAAAAAAAABnYnRyZWUsAQAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAAAsAAAAAAAAAAAAAABcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP////8BAAAAAgAAAAIAAIAAAJBAAAAAgAMAAAAEAAAACgAAgAAAYEAAAAAABQAAAAYAAAAJAACAAADwQAEAAIAHAAAACAAAABYAAIAAAKBAAQAAAAkAAAAKAAAACQAAgAAAIEACAACA//////////8AAAAA1mVlvgIAAAD//////////wAAAADmFLw+AwAAgP//////////AAAAAEGE5D4DAAAA//////////8AAAAA5RlPvgQAAID//////////wAAAADkGc8+BAAAAP//////////AAAAAG0+Y779nEdD4zjeQ1p2i70AAAAASfgoQ3EcU0P6hyI/AAAAAGRhoUFUVWlDjUE0vwAAAADk9SNC4zgCQycHqz8AAAAAz+gWQhzHoUJlQ/6+AAAAAAAAAACN42RDMio/vwAAAAAAAAAA4ziOQBW8nD8AAAAAAAAAAKqq8kI2br4/AAAAAAAAAADjOA5BlJUsvwAAAAAAAAAA4zgOQZOVrD8AAAAAAAAAAP//j0KwXj2/AAAAAAEAAAAVAAAAAAAAA...

6. Check the variable importance in the trained model

  • we can now use the model that was trained on data never seen by the researcher
  • we need to install xgboost first; run the cell below only once (ARM machines might encounter xgboost version mismatches)
!pip install scikit-learn
!pip install xgboost
!pip install matplotlib
import base64
from xgboost import XGBClassifier

# persist the model bytes returned by the compute task
with open("model.serialization", "wb") as f:
    f.write(base64.b64decode(output["model"]))

model = XGBClassifier()
model.load_model('model.serialization')
if output.get("features") is not None:
    model.get_booster().feature_names = output["features"]
#print(model)

vimp = model.get_booster().get_score(importance_type='weight')
#print(vimp)
keys = list(vimp.keys())
values = list(vimp.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(30, columns="score").plot(kind='barh', figsize = (20, 10)).invert_yaxis()
/usr/local/lib/python3.9/dist-packages/xgboost/sklearn.py:782: UserWarning: Loading a native XGBoost model with Scikit-Learn interface.
  warnings.warn("Loading a native XGBoost model with Scikit-Learn interface.")

[plot: horizontal bar chart of the top 30 features ranked by importance score]
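
The plot ranks features by split count (importance_type='weight'); xgboost also exposes other metrics, e.g. average gain, which can rank features differently (plain xgboost, no Weavechain calls):

# alternative ranking: average gain of the splits that use each feature
vimp_gain = model.get_booster().get_score(importance_type='gain')
for name, score in sorted(vimp_gain.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {score:.2f}")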

7. Verify the signature for the output model

  • the signature covers the hash of all input data, the hash of the Docker image and the hash of the output
  • these hashes could be put on a blockchain as a proof of the ML model lineage
  • somebody with access to data can verify the input hashes
  • if multiple people have access to the data (and the training is deterministic), the same hashes are expected to be signed by different nodes
import base58
data = reply["data"]

signature = data["outputSignature"]
check = nodeApi.verifyLineageSignature(signature, data.get("inputHash"), data.get("computeHash"), data.get("paramsHash"), data["output"])

print("Signature:", data["outputSignature"])
print("Valid:", check)
print("\nInput Hash:", data["inputHash"])
print("Compute Hash:", data["computeHash"])
print("Output Hash:", data["outputHash"])
print("Output:", data["output"][:400] + "..." + data["output"][-400:])
Signature: 2psqu7mwhn3KLy1Vzb9uSBh1QSNuka4aynk7m7VwXvdvDe5cAzXd3HP4NGtgWNV2pzToEsadGJNTpPwtxmNgcE3z
Valid: True

Input Hash: 6LfKoQQMqb8fYgA1PwBPKMhFaJ59Fn3DWw6qTRri4zjN
Compute Hash: 3ia9p7Ayg6PBmFg9zUXVyrmUYbt4jz4YCUY3Cx7nXE1Q
Output Hash: YWuSGd184YZhfDWw758gPJaEmsCNqe6rfGBkznbWuF5
Output: {"model": "YmluZgAAAD8XAAAAAwAAAAEAAAAAAAAAAQAAAAcAAAABAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOAAAAAAAAAG11bHRpOnNvZnRwcm9iBgAAAAAAAABnYnRyZWUsAQAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...6IjMifX0=", "features": ["age", "air_pollution", "alcohol_use", "balanced_diet", "chest_pain", "chronic_lung_disease", "clubbing_of_fingernails", "coughing_of_blood", "dry_cough", "dust_allergy", "fatigue", "frequent_cold", "gender", "genetic_risk", "obesity", "occupational_hazards", "passive_smoker", "shortness_of_breath", "smoking", "snoring", "swallowing_difficulty", "weight_loss", "wheezing"]}
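
To make the lineage idea concrete, the output hash can in principle be recomputed client-side. A minimal sketch, assuming the node publishes a base58-encoded SHA-256 digest of the raw output string (the node's actual hashing scheme may differ, so a mismatch here would not by itself invalidate the signature):

import hashlib

# ASSUMPTION: outputHash = base58(SHA-256(raw output bytes)); the real scheme may differ
digest = hashlib.sha256(data["output"].encode("utf-8")).digest()
candidate = base58.b58encode(digest).decode("utf-8")
print("Recomputed:", candidate)
print("Matches outputHash:", candidate == data["outputHash"])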

8. Generate ZK proofs for the data

  • Bulletproofs are used and multiple gadgets are supported (see the API)
  • generating proofs can become very time-consuming as the volume of data grows
options = ZKOptions(False, 300, ["*"], 2048, DEFAULT_COMMITMENT)
reply = nodeApi.zkProof(session, data_collection, table, "numbers_are_non_zero", json.dumps({}), [ "age" ], None, options).get()
proof = reply["data"]
print(proof[:400] + "..." + proof[-400:])
A1BoqCGaWqcPbn46SdNoWUgC3T3i3D9TXHLdauL55MDxpCosaRwtFoMHDj8MFchaStvrMUk7EkGDGEptdyGk5vh2RjYSAFZBio1XuDRTkau2AYED4eA1zZV26GEX6rytpvH8SJ4DJevS2WKeDsL2fHPQ4zwuLqkqSNUhWCJPeoZmsQ56VPNJoqKewC6sDLPE1so33wA5bJpot8oFUpiETCEiMFF4J72ZDHbR46ix9kcVdZugiJJhFaVanyq8U7SATHLZyNxH4abjQ2aXEuVWtfnU3guuqggUFLg8R5BwmkMtvwZXXrxJAR3gHnxs3EFv887csDyTyxiPsj2U51gd4E32AXbDADH9b5DUFLWVWsqfxV7QDmG9yT2k2nuhoByTgL9jhMjwErgAKw24...UErC2vvTcKwUbpZXcoaKuyKj4S3gzUaWVt3ZamD7yEmu8nb1EFR9wPNF8pDj5ErfQT4JbGLK86d3DfunZ5DxvF9zaXneALTedwdjviyQNykK7uMR2kyqvbwZ75apJhMrQNemigN5ndYX5QymEStKrTKgm1QVaLLxGAtjJHdqAeXtRDnd2bGQ2uFMCSJJQqFxFrKNWK2CK6dxg52jB6xRKzbN3GH4jXFDrjQUgXtvgBThd7VYBtZGK9STmcpsGPKujzu7Ls9NEuXHgTFJaXQdZ1CaUCvcGxrqE3EDooiE13gkqDkpbqBg2Mo9SzmreYugnWY9PiHrLfRzgthssH4hh63MAfmPwSa7oFp8c1RarNUeScXQw4YPGtAZJvC23m2SLhySyDxedHckZs3c

Verify the proof

reply = nodeApi.verifyZkProof(session, proof, "numbers_are_non_zero", json.dumps({ "count": 1000 }), None, 2048).get()
print(reply)
{'res': 'ok', 'data': 'true'}
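
As a sanity check on the verifier itself, a corrupted proof should not verify; reusing the exact call from above with one character flipped (we expect anything but 'true' back, though the exact reply shape for a failed verification may vary):

# flip the last character of the proof; verification should now fail
tampered = proof[:-1] + ("A" if proof[-1] != "A" else "B")
reply = nodeApi.verifyZkProof(session, tampered, "numbers_are_non_zero", json.dumps({ "count": 1000 }), None, 2048).get()
print(reply)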