lamindb.Transform¶
- class lamindb.Transform(key: str | None = None, type: TransformType | None = None, version: str | None = None, description: str | None = None, reference: str | None = None, reference_type: str | None = None, source_code: str | None = None, revises: Transform | None = None)¶
Bases: SQLRecord, IsVersioned

Data transformations such as scripts, notebooks, functions, or pipelines.
A “transform” can refer to a Python function, a script, a notebook, or a pipeline. If you execute a transform, you generate a run (Run). A run has inputs and outputs.

A pipeline is typically created with a workflow tool (Nextflow, Snakemake, Prefect, Flyte, MetaFlow, redun, Airflow, …) and stored in a versioned repository.
Transforms are versioned so that a given transform version maps onto a given source code version.
Can I sync transforms to git?
If you switch on sync_git_repo, a script-like transform is synced to its hashed state in a git repository upon calling ln.track():

ln.settings.sync_git_repo = "https://github.com/laminlabs/lamindb"
ln.track()
Alternatively, you create transforms that map pipelines via Transform.from_git().

The definition of transforms and runs is consistent with the OpenLineage specification, where a Transform record would be called a “job” and a Run record a “run”.

- Parameters:
key – str | None = None. A short name or path-like semantic key.
type – TransformType | None = "pipeline". See TransformType.
version – str | None = None. A version string.
description – str | None = None. A description.
reference – str | None = None. A reference, e.g., a URL.
reference_type – str | None = None. A reference type, e.g., ‘url’.
source_code – str | None = None. Source code of the transform.
revises – Transform | None = None. An old version of the transform.
Notes
Examples
Create a transform for a pipeline:
>>> transform = ln.Transform(key="Cell Ranger", version="7.2.0", type="pipeline").save()
Create a transform from a notebook:
>>> ln.track()
View predecessors of a transform:
>>> transform.view_lineage()
Attributes¶
- property name¶
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained via concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"              # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
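The scheme above can be sketched in plain Python. Note that `random_base62` and `encode_base62` below are illustrative reimplementations, not public lamindb API, and the exact base62 alphabet ordering is an assumption:

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62 characters (ordering is an assumption)


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def encode_base62(n: int, width: int = 4) -> str:
    """Encode an auto-incrementing integer as a zero-padded base62 number."""
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(BASE62[r])
    return "".join(reversed(chars))


stem_uid = random_base62(12)     # 12 chars for a transform
version_uid = encode_base62(0)   # first version -> "0000"
uid = f"{stem_uid}{version_uid}" # 16-char full uid
```

All versions of a record share the `stem_uid` prefix, so prefix queries on `uid` retrieve the whole version family.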
Simple fields¶
- uid: str¶
Universal id.
- key: str | None¶
A name or “/”-separated path-like string.
All transforms with the same key are part of the same version family.
- description: str | None¶
A description.
- type: TransformType¶
TransformType (default "pipeline").
- source_code: str | None¶
Source code of the transform.
- hash: str | None¶
Hash of the source code.
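A hash like this lets the registry detect whether stored source code has changed. A minimal sketch of the idea using a standard hash function (the actual hashing algorithm and encoding lamindb uses may differ):

```python
import hashlib


def hash_source_code(source_code: str) -> str:
    """Return a hex digest of a source code string.

    Illustrative only: lamindb's actual hash algorithm may differ.
    """
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()


# identical source code hashes identically; any edit changes the hash
h1 = hash_source_code("print('hello')")
h2 = hash_source_code("print('hello')")
h3 = hash_source_code("print('world')")
```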
- reference: str | None¶
Reference for the transform, e.g., a URL.
- reference_type: str | None¶
Reference type of the transform, e.g., ‘url’.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
- version: str | None¶
Version (default None).
Defines the version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
- is_latest: bool¶
Boolean flag that indicates whether a record is the latest in its version family.
- is_locked: bool¶
Whether the record is locked for edits.
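The is_latest bookkeeping within a version family (records sharing the same key and stem_uid) can be sketched with plain data structures; TransformStub and mark_latest below are illustrative stand-ins, not lamindb API:

```python
from dataclasses import dataclass


@dataclass
class TransformStub:
    """Minimal stand-in for a Transform record (illustrative only)."""
    key: str
    version: str
    is_latest: bool = False


def mark_latest(records: list[TransformStub]) -> None:
    """Flag the most recently created record per version family as latest.

    Records are assumed to be ordered by creation time; later records
    in the list override earlier ones with the same key.
    """
    latest_by_key: dict[str, TransformStub] = {}
    for r in records:
        r.is_latest = False
        latest_by_key[r.key] = r
    for r in latest_by_key.values():
        r.is_latest = True


family = [
    TransformStub(key="Cell Ranger", version="7.1.0"),
    TransformStub(key="Cell Ranger", version="7.2.0"),
]
mark_latest(family)
```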
Relational fields¶
- branch: Branch¶
Whether the record is on a branch or in another “special state”.
- predecessors: Transform¶
Preceding transforms.
Allows manually defining predecessors. This is typically not necessary, as data lineage is automatically tracked via runs whenever an artifact or collection serves as an input for a run.
- successors: Transform¶
Subsequent transforms.
See predecessors.
- blocks: TransformBlock¶
Blocks that annotate this transform.
Class methods¶
- classmethod from_git(url, path, key=None, version=None, entrypoint=None, branch=None)¶
Create a transform from a path in a git repository.
- Parameters:
url (str) – URL of the git repository.
path (str) – Path to the file within the repository.
key (str | None, default: None) – Optional key for the transform.
version (str | None, default: None) – Optional version tag to checkout in the repository.
entrypoint (str | None, default: None) – Optional entrypoint for the transform.
branch (str | None, default: None) – Optional branch to checkout.
- Return type:
Transform
Examples
Create from a Nextflow repo and auto-infer the commit hash from its latest version:
transform = ln.Transform.from_git(
    url="https://github.com/openproblems-bio/task_batch_integration",
    path="main.nf",
).save()
Create from a Nextflow repo and checkout a specific version:
transform = ln.Transform.from_git(
    url="https://github.com/openproblems-bio/task_batch_integration",
    path="main.nf",
    version="v2.0.0",
).save()
assert transform.version == "v2.0.0"
Create a sliding transform from a Nextflow repo’s dev branch. Unlike a regular transform, a sliding transform doesn’t pin a specific source code state, but adapts to whatever the referenced state on the branch is:

transform = ln.Transform.from_git(
    url="https://github.com/openproblems-bio/task_batch_integration",
    path="main.nf",
    branch="dev",
    version="dev",
).save()
Notes
A regular transform pins a specific source code state through its commit hash:
transform.source_code
#> repo: https://github.com/openproblems-bio/task_batch_integration
#> path: main.nf
#> commit: 68eb2ecc52990617dbb6d1bb5c7158d9893796bb
A sliding transform infers the source code state from a branch:
transform.source_code
#> repo: https://github.com/openproblems-bio/task_batch_integration
#> path: main.nf
#> branch: dev
If an entrypoint is provided, it is added to the source code below the path, e.g.:
transform.source_code
#> repo: https://github.com/openproblems-bio/task_batch_integration
#> path: main.nf
#> entrypoint: myentrypoint
#> commit: 68eb2ecc52990617dbb6d1bb5c7158d9893796bb
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").to_dataframe()
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
ulabel = ln.ULabel.get("FvtpPJLJ")
ulabel = ln.ULabel.get(name="my-label")
- classmethod to_dataframe(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc., or a list of such strings.
features (bool | list[str], default: False) – If a list of feature names, filters Feature down to these features. If True, prints all features with dtypes in the core schema module. If "queryset", infers the features used within the set of artifacts or records. Only available for Artifact and Record.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.to_dataframe(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.to_dataframe(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. A QuerySet if return_queryset is True.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
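The scoring behavior can be illustrated with a self-contained sketch. Note this uses difflib only to demonstrate ranked, score-annotated results; lamindb's actual search is database-backed and its scoring differs:

```python
from difflib import SequenceMatcher


def search(records: list[str], query: str, limit: int = 20,
           case_sensitive: bool = False) -> list[tuple[str, float]]:
    """Rank records by similarity to the query, best first.

    Illustrative only: lamindb's real search does not use difflib.
    """
    def score(value: str) -> float:
        a, b = (value, query) if case_sensitive else (value.lower(), query.lower())
        return round(100 * SequenceMatcher(None, a, b).ratio(), 1)

    ranked = sorted(((r, score(r)) for r in records), key=lambda pair: -pair[1])
    return ranked[:limit]


# an exact match scores 100; near-matches rank below it
results = search(["ULabel1", "ULabel2", "ULabel3"], "ULabel2")
```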
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to the first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records. "first": return the first record. "last": return the last record. False: return all records.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
See also
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
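The core trick behind such an auto-complete object is converting field values into valid attribute names on a NamedTuple. A minimal, self-contained sketch of that idea (the helper names below are hypothetical, not lamindb API):

```python
import keyword
import re
from collections import namedtuple


def to_identifier(value: str) -> str:
    """Convert an arbitrary field value into a valid Python attribute name."""
    name = re.sub(r"\W|^(?=\d)", "_", value).lower()
    return name + "_" if keyword.iskeyword(name) else name


def lookup(values: list[str]):
    """Build an auto-complete object whose attributes map back to raw values.

    Sketch only; lamindb's Lookup object additionally offers a .dict() converter.
    """
    fields = [to_identifier(v) for v in values]
    Lookup = namedtuple("Lookup", fields)
    return Lookup(*values)


# "ADGB-DT" becomes the attribute .adgb_dt, mirroring the example above
genes = lookup(["ADGB-DT", "TP53"])
```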
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
Methods¶
- view_lineage(with_successors=False, distance=5)¶
View lineage of transforms.
Note that this only accounts for manually defined predecessors and successors.
Auto-generated lineage through the inputs and outputs of runs is not included.
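The distance cap on this traversal can be sketched as a breadth-first walk over manually defined predecessor links. This is an illustrative model, not the real implementation, which renders a graph rather than returning a set:

```python
from collections import deque


def collect_lineage(start: str, predecessors: dict[str, list[str]],
                    distance: int = 5) -> set[str]:
    """Collect all transforms reachable via predecessor links
    within `distance` hops, breadth-first."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == distance:
            continue  # stop expanding beyond the distance cap
        for pred in predecessors.get(node, []):
            if pred not in seen:
                seen.add(pred)
                queue.append((pred, hops + 1))
    return seen


# hypothetical three-step chain: download -> qc -> align
graph = {"align": ["qc"], "qc": ["download"]}
near = collect_lineage("align", graph, distance=1)
full = collect_lineage("align", graph, distance=5)
```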
- restore()¶
Restore from trash onto the main branch.
- Return type:
None
- delete(permanent=None, **kwargs)¶
Delete record.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs a soft delete if the record is not already in the trash.
- Return type:
None
Examples
For any SQLRecord object record, call:
>>> record.delete()
- save(*args, **kwargs)¶
Save.
Always saves to the default database.
- Return type:
TypeVar(T, bound=SQLRecord)
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- refresh_from_db(using=None, fields=None, from_queryset=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- async arefresh_from_db(using=None, fields=None, from_queryset=None)¶
- serializable_value(field_name)¶
Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.
Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.
- async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)¶
- save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)¶
Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.
The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.
- async adelete(using=None, keep_parents=False)¶
- prepare_database_save(field)¶
- clean()¶
Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.
- validate_unique(exclude=None)¶
Check unique constraints on the model and raise ValidationError if any failed.
- date_error_message(lookup_type, field_name, unique_for)¶
- unique_error_message(model_class, unique_check)¶
- get_constraints()¶
- validate_constraints(exclude=None)¶
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.