# Getting Started with dataPi

dataPi is a Python package that lets you implement a distributed datalakehouse head made of data pods. This guide will help you get started with dataPi.
## Installation

To install dataPi from source:

1. Clone the repository.
2. Run the following command from the repository root:

```shell
pip install .
```
## Initializing a New Project

To create a new dataPi project, run:

```shell
datapi init [PROJECT_NAME]
```

If you don't specify a project name, it defaults to `datapi_project`.

This creates a project structure like this:

```text
datapi_project/
- config.yml
- resources/
  - sample_resources.yml
- deployments/
- docs/
```
## Configuration

The `config.yml` file contains your general dataPi configuration. Here's an example:

```yaml
# datapi configuration file
metastore_type: POLARIS
metastore_uri: 'METASTORE_URI/api/catalog'
metastore_credentials: 'CLIENT_ID:CLIENT_SECRET'
metastore_catalog: 'METASTORE_CATALOG_NAME'

# datapi datapods - Deployment settings
deployment:
  deployment_target: GCP_CLOUD_RUN
  build_service: GCP_CLOUD_BUILD
  project_id: GCP_PROJECT_ID
  registry_url: REGISTRY_URL
  region: GCP_REGION
```
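Before running any commands, it can help to sanity-check that the placeholders above have been filled in. The sketch below is illustrative, not part of dataPi: `check_config` is a hypothetical helper using a minimal line-based reader suited to this flat layout (a real check would use a YAML parser).

```python
# Illustrative sketch: verify that a dataPi config.yml defines the top-level
# keys used in the example above. The parsing here is a minimal line-based
# reader for this flat layout, not a full YAML parser.
REQUIRED_TOP_LEVEL = [
    "metastore_type",
    "metastore_uri",
    "metastore_credentials",
    "metastore_catalog",
]

def check_config(text: str) -> list[str]:
    """Return the required top-level keys missing from the config text."""
    present = set()
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and comments
        present.add(stripped.split(":", 1)[0].strip())
    return [k for k in REQUIRED_TOP_LEVEL if k not in present]

sample = """\
metastore_type: POLARIS
metastore_uri: 'METASTORE_URI/api/catalog'
metastore_credentials: 'CLIENT_ID:CLIENT_SECRET'
metastore_catalog: 'METASTORE_CATALOG_NAME'
"""
print(check_config(sample))  # -> []
```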
## Defining Resources

You can define your dataPod specs under the `resources` folder. Here are examples of reduction and projection resources.

Reduction resource:

```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: REDUCTION
aggregate: sales.sum()
group_by: quarter
filters: region = 'EMEA'
deploy: True
```
Projection resource:

```yaml
resource_name: RESOURCE_NAME
type: REST
depends_on:
  - namespace: METASTORE_NAMESPACE_NAME
    table: METASTORE_ICEBERG_TABLE_NAME
local_engine: duckdb
short_description: This is a sample query
long_description: long-desc.md
operation_type: PROJECTION
select: sales quarter region
filters: region = 'EMEA'
deploy: True
```
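To make the two operation types concrete, here is a sketch of their semantics run over in-memory rows instead of an Iceberg table. The `reduction` and `projection` functions are illustrative stand-ins, not dataPi APIs: a REDUCTION applies the filter, groups, and aggregates; a PROJECTION applies the filter and keeps only the selected columns.

```python
# Illustrative semantics of the REDUCTION and PROJECTION specs above,
# applied to plain Python rows rather than an Iceberg table via DuckDB.
from collections import defaultdict

rows = [
    {"sales": 100, "quarter": "Q1", "region": "EMEA"},
    {"sales": 250, "quarter": "Q1", "region": "EMEA"},
    {"sales": 300, "quarter": "Q2", "region": "EMEA"},
    {"sales": 999, "quarter": "Q1", "region": "APAC"},
]

def reduction(rows, aggregate_col, group_by, region):
    """aggregate: sales.sum(), group_by: quarter, filters: region = 'EMEA'."""
    totals = defaultdict(int)
    for r in rows:
        if r["region"] == region:          # apply the filter
            totals[r[group_by]] += r[aggregate_col]  # group and sum
    return dict(totals)

def projection(rows, select, region):
    """select: the listed columns, filters: region = 'EMEA'."""
    return [{c: r[c] for c in select} for r in rows if r["region"] == region]

print(reduction(rows, "sales", "quarter", "EMEA"))   # {'Q1': 350, 'Q2': 300}
print(projection(rows, ["sales", "quarter"], "EMEA"))
```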
Resources can depend on Iceberg tables registered in the catalog, or on another existing (and deployed) data resource:

```yaml
depends_on:
  - resource: RESOURCE_NAME
```
## Basic Commands

Here are some basic commands to get you started.

Deploy all resources:

```shell
datapi run --all
```

Deploy a single resource:

```shell
datapi run --resource [RESOURCE_NAME]
```

List all resources:

```shell
datapi show --all
```

List one resource:

```shell
datapi show --resource [RESOURCE_NAME]
```

Generate documentation for all resources:

```shell
datapi docs generate --all
```

Serve the documentation:

```shell
datapi docs serve
```
## Using the Python Client SDK

You can use the included Python client SDK to access data from your application:

```python
from datapi import Client  # import path shown for illustration; check the SDK docs

client = Client(project_id=project_id, region=region, resource_name=resource_name)

# List the deployed resources and their service URLs
services = client.list_services()
print("Available services:")
for resource, url in services.items():
    print(f"- {resource}: {url}")

# Fetch data from the resource this client was created for
data = client.get_data()
print(f"Data from {resource_name}:", data)
```

For more detailed information, please refer to the full documentation.