Snowflake Production Guide
Comprehensive guide covering essential Snowflake concepts, features, and common interview topics for data engineer, data analyst, and architect positions.
1. Core Architecture Components
Key Concepts to Master:
- Multi-cluster Shared Data Architecture: Understand separation of storage and compute
- Virtual Warehouses:
- Sizes (XS to 4XL) and scaling patterns
- Auto-suspend and auto-resume functionality
- Multi-cluster warehouses for concurrency
- Scaling policies (Standard vs Economy; see the warehouse sketch after this list)
- Micro-partitions:
- 50-500MB of uncompressed data
- Automatic clustering and pruning
- Columnar storage format
- Immutable nature
- Cloud Services Layer: Authentication, query optimization, metadata management
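A minimal warehouse sketch tying the settings above together; the warehouse name and thresholds are assumptions, and the multi-cluster settings require Enterprise Edition or higher:

```sql
-- Hypothetical warehouse illustrating sizing, auto-suspend/resume, and multi-cluster scaling
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'     -- start small and resize only when queries need it
  AUTO_SUSPEND = 300            -- suspend after 5 minutes of inactivity (seconds)
  AUTO_RESUME = TRUE            -- resume automatically on the next query
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3         -- scale out for concurrency (Enterprise Edition feature)
  SCALING_POLICY = 'STANDARD';  -- 'ECONOMY' favors cost over queuing latency
```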
2. Performance Optimization Techniques
Query Performance:
- Query Pruning: How Snowflake eliminates unnecessary micro-partitions
- Result Caching:
- 24-hour cache retention
- Deterministic query requirements
- Cache invalidation scenarios
- Metadata Cache: Cloud services layer caching
- Data Cache: Virtual warehouse local disk caching
- Clustering Keys:
- When to use clustering
- Recommended maximum of 3-4 columns per key
- Automatic reclustering
- Clustering depth monitoring (see the clustering sketch after this list)
- Search Optimization Service: Accelerates selective point-lookup queries on large tables
- Query Acceleration Service: Offloads portions of eligible (outlier) queries to serverless compute
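A short clustering sketch for the items above, assuming a large, hypothetical sales_fact table that is usually filtered on date and region:

```sql
-- Define a clustering key (only worthwhile on large, frequently pruned tables)
ALTER TABLE sales_fact CLUSTER BY (sale_date, region_id);

-- Monitor clustering health; Snowflake reclusters automatically in the background
SELECT SYSTEM$CLUSTERING_DEPTH('sales_fact', '(sale_date, region_id)');
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region_id)');
```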
Common Performance Interview Questions:
- How would you optimize a slow-running query?
- When would you use clustering keys vs partitioning?
- Explain the difference between scaling up vs scaling out
- How does result caching work and when does it apply?
3. Data Loading and Unloading
Loading Methods:
- COPY Command:
- Bulk loading best practice
- File format specifications
- Error handling (ON_ERROR options)
- VALIDATION_MODE for dry-run testing (see the load sketch after the stages list)
- Snowpipe:
- Continuous data ingestion
- REST API vs Auto-ingest
- Event notifications (AWS SQS, Azure Event Grid)
- Serverless billing (per-second compute plus a per-file overhead charge; see the pipe sketch after this list)
- Snowpipe Streaming:
- Real-time streaming ingestion
- Kafka connector and client SDK; Kinesis via Amazon Data Firehose
- Exactly-once semantics
- External Tables: Query data without loading
- Data Sharing: Access data across accounts without copying or moving it
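A minimal auto-ingest pipe sketch; the stage, table, and pipe names are assumptions, and the cloud-side event notification (for example S3 to SQS) has to be configured separately:

```sql
-- Continuous ingestion: new files landing in the external stage trigger the pipe
CREATE PIPE IF NOT EXISTS orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @ext_orders_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Check ingestion state and any backlog
SELECT SYSTEM$PIPE_STATUS('orders_pipe');
```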
File Formats and Stages:
- File Formats: CSV, JSON, Avro, ORC, Parquet, XML
- Stage Types:
- User stages (@~)
- Table stages (@%tablename)
- Named internal stages
- External stages (S3, Azure, GCS)
- PUT/GET Commands: Local file operations
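A load sketch combining a named internal stage, PUT, and COPY with error handling; all object names and file paths are illustrative:

```sql
-- Named internal stage with a default file format
CREATE STAGE IF NOT EXISTS my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Run from SnowSQL (or another client): upload local files to the stage
PUT file:///tmp/orders_*.csv @my_stage AUTO_COMPRESS = TRUE;

-- Dry run first: report file errors without loading any rows
COPY INTO raw_orders FROM @my_stage VALIDATION_MODE = RETURN_ERRORS;

-- Bulk load; ON_ERROR alternatives include 'CONTINUE' and 'ABORT_STATEMENT'
COPY INTO raw_orders FROM @my_stage ON_ERROR = 'SKIP_FILE';
```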
4. Security Features
Access Control:
- RBAC (Role-Based Access Control):
- System-defined roles (ACCOUNTADMIN, SYSADMIN, SECURITYADMIN)
- Role hierarchy and inheritance
- Principle of least privilege
- Discretionary Access Control (DAC): Object ownership model
- Row Access Policies: Row-level security implementation
- Column-level Security:
- Dynamic Data Masking
- External tokenization
- Column-level encryption
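A column-masking and row-level-security sketch (both Enterprise Edition features); the table, column, role names, and region values are all assumptions:

```sql
-- Dynamic Data Masking: unprivileged roles see a redacted email
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row Access Policy: restrict rows by business region per role
CREATE ROW ACCESS POLICY region_rap AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GLOBAL_ANALYST'
  OR (CURRENT_ROLE() = 'EMEA_ANALYST' AND region = 'EMEA');

ALTER TABLE orders ADD ROW ACCESS POLICY region_rap ON (region);
```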
Data Protection:
- Encryption:
- At-rest: AES-256 encryption
- In-transit: TLS 1.2+
- Tri-Secret Secure with customer-managed keys
- Network Policies: IP allowlisting/blocking
- Private Link: AWS PrivateLink, Azure Private Link
- OAuth Integration: External authentication
- MFA Support: Multi-factor authentication
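An illustrative network policy restricting logins to a corporate range; the CIDR blocks are placeholders:

```sql
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('203.0.113.0/24')
  BLOCKED_IP_LIST = ('203.0.113.99');

-- Apply account-wide (requires SECURITYADMIN or higher)
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```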
5. Time Travel and Data Protection
Time Travel:
- Retention Periods:
- Standard Edition: 1 day
- Enterprise Edition: Up to 90 days
- Transient/Temporary tables: 0-1 day
- Operations:
- Query historical data (AT/BEFORE)
- Clone historical objects
- Undrop tables/schemas/databases
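A Time Travel sketch covering the three operations above; the table name, offset, and timestamp are illustrative:

```sql
-- Query historical data with AT/BEFORE
SELECT * FROM orders AT (OFFSET => -3600);   -- table contents one hour ago
SELECT * FROM orders AT (TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);

-- Clone the table as it existed an hour ago, then recover a dropped table
CREATE TABLE orders_snapshot CLONE orders AT (OFFSET => -3600);
UNDROP TABLE orders;
```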
Fail-safe:
- 7-day period after Time Travel
- Only accessible by Snowflake Support
- Not available for transient/temporary tables
6. Cost Optimization Strategies
Compute Cost Management:
- Warehouse Sizing: Start small, monitor, and scale as needed
- Auto-suspend: Set appropriate timeouts (5-10 minutes typical)
- Resource Monitors: Set credit quotas and alerts
- Query Optimization: Use EXPLAIN plans and Query Profile
- Warehouse Segregation: Separate workloads by warehouse
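A resource monitor sketch for the credit-quota item above; the quota, thresholds, and warehouse name are assumptions:

```sql
CREATE RESOURCE MONITOR monthly_quota WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY      -- alert before the quota is hit
           ON 100 PERCENT DO SUSPEND;   -- suspend assigned warehouses at the quota

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;
```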
Storage Cost Management:
- Data Retention: Implement appropriate Time Travel settings
- Compression: Automatic, but monitor compression ratios
- Cloning: Use zero-copy clones instead of copying data
- Table Types: Use transient for non-critical data
- Fail-safe Costs: Understand the 7-day additional storage
7. Advanced Features
Data Sharing:
- Secure Data Sharing: No data movement or copying
- Snowflake Marketplace: Monetize and discover datasets
- Data Exchange: Private data sharing hub
- Reader Accounts: Share with non-Snowflake users
Semi-structured Data:
- VARIANT Data Type: Store JSON, Avro, ORC, Parquet, XML
- Querying: Dot notation and bracket notation
- FLATTEN Function: Explode nested arrays
- Automatic Schema Detection: INFER_SCHEMA function
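A semi-structured query sketch; the raw_events table and JSON paths are hypothetical:

```sql
SELECT
  payload:customer.id::NUMBER       AS customer_id,    -- dot notation
  payload:customer['name']::STRING  AS customer_name,  -- bracket notation
  item.value:sku::STRING            AS sku
FROM raw_events,
     LATERAL FLATTEN(INPUT => payload:items) AS item;  -- explode the nested array
```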
Programmability:
- Stored Procedures: JavaScript, SQL, Python, Java, Scala
- User-Defined Functions (UDFs): Scalar and tabular functions
- Snowpark: DataFrame API for Python, Java, Scala
- Tasks: Scheduled SQL execution
- Streams: Change Data Capture (CDC)
- External Functions: AWS Lambda, Azure Functions integration
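A CDC sketch wiring a stream to a scheduled task with a MERGE; every object name here is an assumption:

```sql
-- Capture changes on the landing table
CREATE STREAM orders_stream ON TABLE raw_orders;

-- Merge captured changes every 5 minutes, but only when the stream has data
CREATE TASK merge_orders
  WAREHOUSE = analytics_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO orders_curated t
  USING orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.status = s.status
  WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

ALTER TASK merge_orders RESUME;   -- tasks are created suspended
```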
8. Common Interview Scenarios
Scenario-based Questions:
- High Concurrency Issue:
- Solution: Multi-cluster warehouses
- Consider scaling policy (Standard vs Economy)
- Large Data Migration:
- Use COPY with parallel processing
- Stage files in cloud storage
- Consider file sizing (100-250MB compressed)
- Real-time Analytics:
- Implement Snowpipe or Snowpipe Streaming
- Use Streams and Tasks for CDC
- Consider Dynamic Tables
- Cost Spike Investigation:
- Check warehouse utilization
- Review query history and duration
- Analyze storage growth
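For the cost spike scenario, a starting-point query against ACCOUNT_USAGE (which lags real time by up to a few hours):

```sql
-- Which warehouses drove credit consumption over the last 7 days?
SELECT warehouse_name,
       SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```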
9. SQL and Query Patterns
Important SQL Features:
- CTEs (Common Table Expressions): WITH clause usage
- Window Functions: ROW_NUMBER, RANK, LAG, LEAD
- QUALIFY Clause: Filter window function results
- PIVOT/UNPIVOT: Data transformation
- MERGE Statement: Upsert operations
- Recursive CTEs: Hierarchical queries
- MATCH_RECOGNIZE: Pattern matching in sequences
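Two patterns from the list above in one sketch, using hypothetical tables: QUALIFY to keep the latest row per key, and MERGE for an upsert:

```sql
-- Latest order per customer without a subquery
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) = 1;

-- Upsert from a staging table into a dimension table
MERGE INTO dim_customer t
USING stg_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```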
10. Monitoring and Troubleshooting
Key Views and Functions:
- Information Schema: Database metadata queries
- Account Usage Schema:
- QUERY_HISTORY view
- WAREHOUSE_METERING_HISTORY
- STORAGE_USAGE
- LOGIN_HISTORY
- Query Profile: Visual execution plan analysis
- SYSTEM Functions:
- SYSTEM$CLUSTERING_DEPTH
- SYSTEM$CLUSTERING_INFORMATION
- SYSTEM$PIPE_STATUS
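A quick monitoring sketch; the clustered table and key in the second query are assumptions:

```sql
-- Recent query history via the Information Schema table function
SELECT query_id, query_text, warehouse_name, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
ORDER BY start_time DESC;

-- Clustering health for a hypothetical clustered table
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region_id)');
```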
11. Best Practices to Mention
Development Best Practices:
- Use appropriate warehouse sizes - Don't over-provision
- Implement proper role hierarchy - Security first approach
- Design efficient data models - Consider denormalization
- Use clustering keys wisely - Only for large tables with specific access patterns
- Monitor and set resource monitors - Prevent runaway costs
- Leverage caching - Understand three-layer caching
- Use COPY instead of INSERT - For bulk operations
- Partition large files - 100-250MB compressed ideal
- Clean up temporary objects - Reduce storage costs
- Document data lineage - Use tags and comments
12. Recent Features to Know
Latest Additions (2024-2025):
- Dynamic Tables: Declarative data pipelines
- Hybrid Tables (Unistore): OLTP + OLAP workloads
- Iceberg Tables: Open table format support
- Snowpark ML: Model development and inference APIs inside Snowflake
- Snowpark Container Services: Run containerized workloads
- Document AI: Extract data from documents
- Alerts and Notifications: Condition-based, event-driven actions
- Git Integration: Version control for SQL scripts
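A Dynamic Table sketch for the declarative-pipeline item above; the lag, warehouse, and source table are assumptions:

```sql
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '30 minutes'   -- how stale the result is allowed to get
  WAREHOUSE = analytics_wh
AS
  SELECT sale_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY sale_date;
```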
13. Sample Technical Questions
Architecture:
- Explain how Snowflake's architecture differs from traditional data warehouses
- What happens when you submit a query in Snowflake?
- How does Snowflake handle concurrency?
Performance:
- How would you improve the performance of a slow query?
- When would you use a larger warehouse vs. a multi-cluster warehouse?
- Explain the different caching layers in Snowflake
Data Loading:
- Compare COPY command vs Snowpipe for data ingestion
- How would you handle semi-structured data in Snowflake?
- Design a solution for real-time data ingestion
Security:
- How would you implement row-level security?
- Explain the role hierarchy in Snowflake
- How does Snowflake encrypt data?
14. Hands-on Skills to Demonstrate
- Write efficient SQL queries using CTEs and window functions
- Design a data loading pipeline using stages and COPY
- Implement security with roles and access policies
- Optimize query performance using Query Profile
- Set up monitoring with resource monitors and alerts
- Handle JSON data using VARIANT and FLATTEN
- Create clones for development environments
- Implement CDC using Streams and Tasks
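For the development-environment item, a zero-copy clone sketch with illustrative database names:

```sql
-- Instant, metadata-only copy of production for development and testing;
-- storage is consumed only for micro-partitions that later diverge
CREATE DATABASE dev_db CLONE prod_db;
```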