Slice & stream arrays¶
We saw how LaminDB lets you query & search across artifacts using registries: Query & search registries.
Let us now query the datasets in storage themselves. Here, we show how to subset AnnData objects as well as generic HDF5 and zarr collections accessed in the cloud.
# replace with your username and S3 bucket
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
Show code cell output
✓ logged in with email testuser1@lamin.ai
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
→ initialized lamindb: testuser1/test-arrays
Import lamindb and track this notebook.
import lamindb as ln
import numpy as np
import zarr
ln.track()
Show code cell output
→ connected lamindb: testuser1/test-arrays
→ created Transform('Cr3DVbVdgXsA0000', key='arrays.ipynb'), started new Run('3xLkFiSkFYO6kXSZ') at 2025-10-16 13:17:37 UTC
→ notebook imports: lamindb==1.13.1 numpy==2.3.4 zarr==3.1.3
• recommendation: to identify the notebook across renames, pass the uid: ln.track("Cr3DVbVdgXsA")
We’ll need some test data:
ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Show code cell output
Artifact(uid='vxO6eAu9ys1ew4Fm0000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-16 13:17:38 UTC, is_locked=False)
AnnData¶
An h5ad artifact stored on S3:
artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
access = artifact.open()
The returned object is an AnnDataAccessor, an AnnData-like object backed by cloud storage:
access
Show code cell output
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references the underlying lazy HDF5 or zarr arrays:
access.X
Show code cell output
<HDF5 dataset "X": shape (70, 765), type "<f4">
You can subset it like a normal AnnData object:
obs_idx = access.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    access.obs.percent_mito <= 0.05
)
access_subset = access[obs_idx]
access_subset
Show code cell output
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Subsets load arrays into memory upon direct access:
access_subset.X
Show code cell output
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
shape=(35, 765), dtype=float32)
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset = access_subset.to_memory()
adata_subset
Show code cell output
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
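The result is a regular in-memory AnnData, so standard operations apply, e.g.:
# count cells per cell type in the subset
adata_subset.obs["cell_type"].value_counts()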
Add a column to a cloud AnnData object¶
It is also possible to add columns to .obs and .var of cloud AnnData objects without downloading them.
Create a new AnnData zarr artifact.
adata_subset.write_zarr("adata_subset.zarr")
artifact = ln.Artifact(
    "adata_subset.zarr", description="test add column to adata"
).save()
artifact
Artifact(uid='ESzUKFvr2wjTMt2g0000', is_latest=True, description='test add column to adata', suffix='.zarr', otype='AnnData', size=215211, hash='aSHN77yMrOMiMzo6jh1xEA', n_files=120, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-16 13:17:40 UTC, is_locked=False)
with artifact.open(mode="r+") as access:
    access.add_column(where="obs", col_name="ones", col=np.ones(access.shape[0]))
    display(access)
Show code cell output
AnnDataAccessor object with n_obs × n_vars = 35 × 765
constructed for the AnnData object ESzUKFvr2wjTMt2g.zarr
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito', 'ones']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
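A .var column can be added in the same way; a sketch mirroring the .obs call above (note that this creates yet another version of the artifact):
with artifact.open(mode="r+") as access:
    # where="var" targets the variable axis; the column name is illustrative
    access.add_column(where="var", col_name="zeros", col=np.zeros(access.shape[1]))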
The artifact's version is updated after each such modification.
artifact
Artifact(uid='ESzUKFvr2wjTMt2g0001', is_latest=True, description='test add column to adata', suffix='.zarr', size=215962, hash='3Gf4tPzfnj06zeqiigcFOg', n_files=123, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-16 13:17:49 UTC, is_locked=False)
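To inspect the whole version family, a sketch using the .versions accessor:
# list all versions of this artifact as a dataframe
artifact.versions.df()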
artifact.delete(permanent=True)
→ deleting all versions of this artifact because they all share the same store
SpatialData¶
It is also possible to access AnnData objects inside SpatialData tables:
artifact = ln.Artifact.using("laminlabs/lamindata").get(
    key="visium_aligned_guide_min.zarr"
)
access = artifact.open()
→ transferred: Artifact(uid='bjH534dxVi1drmLZ0001'), Storage(uid='D9BilDV2')
access
Show code cell output
SpatialDataAccessor object
constructed for the SpatialData object bjH534dxVi1drmLZ.zarr
with tables: ['table']
access.tables
Show code cell output
Accessor for the SpatialData attribute tables
with keys: ['table']
This gives you the same AnnDataAccessor object as for a normal AnnData.
table = access.tables["table"]
table
Show code cell output
AnnDataAccessor object with n_obs × n_vars = 37 × 18085
constructed for the AnnData object table
obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
obsm: ['spatial']
uns: ['spatial', 'spatialdata_attrs']
var: ['feature_types', 'gene_ids', 'genome', 'symbols']
You can subset it and read it into memory as an actual AnnData:
table_subset = table[table.obs["clone"] == "diploid"]
table_subset
Show code cell output
AnnDataAccessorSubset object with n_obs × n_vars = 31 × 18085
obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
obsm: ['spatial']
uns: ['spatial', 'spatialdata_attrs']
var: ['feature_types', 'gene_ids', 'genome', 'symbols']
adata = table_subset.to_memory()
Generic HDF5¶
Let us query a generic HDF5 artifact:
artifact = ln.Artifact.get(key="testfile.hdf5")
And get a backed accessor:
backed = artifact.open()
The returned object exposes the connection in .connection and the h5py.File or zarr.Group in .storage:
backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5>" (mode r)>
Parquet¶
A dataframe stored as sharded parquet.
Note that it is also possible to register and access Hugging Face paths; this requires the huggingface_hub package to be installed.
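For illustration, registering such a path looks like registering any other path; a sketch (not executed here) using the Hugging Face dataset accessed below:
# requires huggingface_hub; the key is illustrative
ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet", key="sharded_parquet").save()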
artifact = ln.Artifact.using("laminlabs/lamindata").get(key="sharded_parquet")
artifact.path.view_tree()
Show code cell output
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
└── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
Show code cell output
→ transferred: Artifact(uid='78XWb8yD09SCgVfl0000'), Storage(uid='5EYyeftHljIs')
This returns a pyarrow dataset.
backed
<pyarrow._dataset.FileSystemDataset at 0x7f039ca16680>
backed.head(5).to_pandas()
Show code cell output
| index | cell_type | n_genes | percent_mito |
|---|---|---|---|
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
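Because this is a regular pyarrow dataset, you can also push a filter down to the scan instead of loading everything; a minimal sketch (the filter value is illustrative):
import pyarrow.dataset as ds

# only rows matching the predicate are read
backed.to_table(filter=ds.field("cell_type") == "Dendritic cells").to_pandas()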
It is also possible to open a collection of cloud artifacts.
collection = ln.Collection.using("laminlabs/lamindata").get(
    key="sharded_parquet_collection"
)
backed = collection.open()
Show code cell output
→ transferred: Artifact(uid='yBp5v9RRptoIrIMQ0000')
→ transferred: Artifact(uid='fB33zDQDFb0i3Yxw0000')
→ transferred: Collection(uid='6aWTZ7J2ej1Rj22q0000')
backed
<pyarrow._dataset.FileSystemDataset at 0x7f03a55355a0>
backed.to_table().to_pandas()
Show code cell output
| index | cell_type | n_genes | percent_mito |
|---|---|---|---|
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
| AATCTCACTCAGTG-3 | CD4+/CD45RO+ Memory | 1183 | 0.016056 |
| CTAGTTTGGCTTAG-4 | CD4+/CD45RO+ Memory | 1002 | 0.018922 |
| ACGCCGGAAGCCTA-6 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 |
| CTGACCACCATGGT-4 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 |
| AGTTAAACAAACAG-1 | CD19+ B | 1005 | 0.019806 |
| CTACGCACAGGGTG-3 | CD4+/CD45RO+ Memory | 1053 | 0.012073 |
| CAGACAACAAAACG-7 | CD4+/CD25 T Reg | 1109 | 0.012702 |
| GAGGGTGACCTATT-1 | CD4+/CD25 T Reg | 1003 | 0.012971 |
| TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 |
| ACGACCCTGTCTGA-3 | Dendritic cells | 1074 | 0.017466 |
| GTTATGCTACCTCC-3 | CD14+ Monocytes | 1201 | 0.016839 |
| GTGTCAGATCTACT-6 | CD14+ Monocytes | 1014 | 0.025417 |
| AAGAACGAACTCTT-6 | CD14+ Monocytes | 1067 | 0.019530 |
| TACTCTGACGTAGT-1 | Dendritic cells | 1118 | 0.012069 |
| TAAGCTCTTCTGGA-4 | CD14+ Monocytes | 1059 | 0.021497 |
By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a polars LazyFrame.
with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
Show code cell output
| | cell_type | n_genes | percent_mito | index |
|---|---|---|---|---|
| 0 | Dendritic cells | 1277 | 0.012961 | TGACTGGAACCATG-7 |
| 1 | Dendritic cells | 1074 | 0.017466 | ACGACCCTGTCTGA-3 |
| 2 | CD14+ Monocytes | 1201 | 0.016839 | GTTATGCTACCTCC-3 |
| 3 | CD14+ Monocytes | 1014 | 0.025417 | GTGTCAGATCTACT-6 |
| 4 | CD14+ Monocytes | 1067 | 0.019530 | AAGAACGAACTCTT-6 |
| 5 | Dendritic cells | 1118 | 0.012069 | TACTCTGACGTAGT-1 |
| 6 | CD14+ Monocytes | 1059 | 0.021497 | TAAGCTCTTCTGGA-4 |
| 7 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | CGTTATACAGTACC-8 |
| 8 | CD4+/CD45RO+ Memory | 1078 | 0.012831 | AGATATTGACCACA-1 |
| 9 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 | GCAGGGCTGTATGC-8 |
| 10 | CD4+/CD25 T Reg | 1236 | 0.023963 | TTATGGCTGGCAAG-2 |
| 11 | CD4+/CD25 T Reg | 1010 | 0.016620 | CACGACCTGGGAGT-7 |
| 12 | CD4+/CD45RO+ Memory | 1183 | 0.016056 | AATCTCACTCAGTG-3 |
| 13 | CD4+/CD45RO+ Memory | 1002 | 0.018922 | CTAGTTTGGCTTAG-4 |
| 14 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 | ACGCCGGAAGCCTA-6 |
| 15 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 | CTGACCACCATGGT-4 |
| 16 | CD19+ B | 1005 | 0.019806 | AGTTAAACAAACAG-1 |
| 17 | CD4+/CD45RO+ Memory | 1053 | 0.012073 | CTACGCACAGGGTG-3 |
| 18 | CD4+/CD25 T Reg | 1109 | 0.012702 | CAGACAACAAAACG-7 |
| 19 | CD4+/CD25 T Reg | 1003 | 0.012971 | GAGGGTGACCTATT-1 |
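The LazyFrame can also be filtered before collecting, so only matching rows are materialized; a minimal sketch (the threshold is illustrative):
import polars as pl

with collection.open(engine="polars") as lazy_df:
    # lazy filter: evaluated only on .collect()
    display(lazy_df.filter(pl.col("n_genes") > 1200).collect().to_pandas())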
Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.
backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7f03e83692a0>
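Following the hint in the warning above, a deterministic variant could order the query set first; a sketch assuming the Django-style order_by():
# open the artifacts in a stable order
backed = ln.Artifact.filter(suffix=".parquet").order_by("created_at").open()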
Stream arrays into the cloud¶
It is also possible to write directly to the default cloud (or local) storage of the current instance and then save the result as an Artifact. This can be done with from_lazy(), which returns a LazyArtifact; on .save(), this object creates a real artifact with the provided arguments.
lazy = ln.Artifact.from_lazy(suffix=".zarr", overwrite_versions=True, key="mydata.zarr")
lazy
Show code cell output
LazyArtifact object with
path: s3://lamindb-ci/test-arrays/.lamindb/5YxjVbn6vXMrAqaW.zarr
arguments: {'key': 'mydata.zarr', 'overwrite_versions': True}
Stream an array into lazy.path in the default instance storage using zarr.
store = zarr.storage.FsspecStore.from_url(lazy.path.as_posix())
group = zarr.open(store, mode="w")
group["ones"] = np.ones(3)
Save and get the artifact.
artifact = lazy.save()
artifact
Show code cell output
Artifact(uid='5YxjVbn6vXMrAqaW0000', is_latest=True, key='mydata.zarr', suffix='.zarr', size=740, hash='dA8cBCSSPfA7OGsMXYcMcw', n_files=3, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-16 13:18:04 UTC, is_locked=False)
artifact.delete(permanent=True)
Show code cell output
→ deleting all versions of this artifact because they all share the same store
# clean up test instance
ln.setup.delete("test-arrays", force=True)
Show code cell output
→ deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5 | s3://lamindb-ci/test-arrays
→ deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f