Snowflake Production Guide
Comprehensive guide covering essential Snowflake concepts, features, and common interview topics for data engineer, data analyst, and architect positions.
1. Core Architecture Components
Key Concepts to Master:
- Multi-cluster Shared Data Architecture: Understand separation of storage and compute
- Virtual Warehouses:
- Sizes (XS to 4XL) and scaling patterns
- Auto-suspend and auto-resume functionality
- Multi-cluster warehouses for concurrency
- Scaling policies (Standard vs Economy; see the warehouse sketch after this list)
- Micro-partitions:
- 50-500MB of uncompressed data
- Automatic clustering and pruning
- Columnar storage format
- Immutable nature
- Cloud Services Layer: Authentication, query optimization, metadata management
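A minimal warehouse sketch tying the settings above together; the warehouse name and thresholds are assumptions, and the multi-cluster settings require Enterprise Edition or higher:

```sql
-- Hypothetical warehouse illustrating sizing, auto-suspend/resume, and multi-cluster scaling
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'     -- start small and resize only when queries need it
  AUTO_SUSPEND = 300            -- suspend after 5 minutes of inactivity (seconds)
  AUTO_RESUME = TRUE            -- resume automatically on the next query
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3         -- scale out for concurrency (Enterprise Edition feature)
  SCALING_POLICY = 'STANDARD';  -- 'ECONOMY' favors cost over queuing latency
```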
2. Performance Optimization Techniques
Query Performance:
- Query Pruning: How Snowflake eliminates unnecessary micro-partitions
- Result Caching:
- 24-hour cache retention
- Deterministic query requirements
- Cache invalidation scenarios
- Metadata Cache: Cloud services layer caching
- Data Cache: Virtual warehouse local disk caching
- Clustering Keys:
- When to use clustering
- Recommended maximum of 3-4 columns per key
- Automatic reclustering
- Clustering depth monitoring (see the clustering sketch after this list)
- Search Optimization Service: Accelerates selective point-lookup queries on large tables
- Query Acceleration Service: Offloads portions of eligible (outlier) queries to serverless compute
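A short clustering sketch for the items above, assuming a large, hypothetical sales_fact table that is usually filtered on date and region:

```sql
-- Define a clustering key (only worthwhile on large, frequently pruned tables)
ALTER TABLE sales_fact CLUSTER BY (sale_date, region_id);

-- Monitor clustering health; Snowflake reclusters automatically in the background
SELECT SYSTEM$CLUSTERING_DEPTH('sales_fact', '(sale_date, region_id)');
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region_id)');
```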
Common Performance Interview Questions:
- How would you optimize a slow-running query?
- When would you use clustering keys vs partitioning?
- Explain the difference between scaling up vs scaling out
- How does result caching work and when does it apply?
3. Data Loading and Unloading
Loading Methods:
- COPY Command:
- Bulk loading best practice
- File format specifications
- Error handling (ON_ERROR options)
- VALIDATION_MODE for dry-run testing (see the load sketch after the stages list)
- Snowpipe:
- Continuous data ingestion
- REST API vs Auto-ingest
- Event notifications (AWS SQS, Azure Event Grid)
- Serverless billing (per-second compute plus a per-file overhead charge; see the pipe sketch after this list)
- Snowpipe Streaming:
- Real-time streaming ingestion
- Kafka connector and client SDK; Kinesis via Amazon Data Firehose
- Exactly-once semantics
- External Tables: Query data without loading
- Data Sharing: Access data across accounts without copying or moving it
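A minimal auto-ingest pipe sketch; the stage, table, and pipe names are assumptions, and the cloud-side event notification (for example S3 to SQS) has to be configured separately:

```sql
-- Continuous ingestion: new files landing in the external stage trigger the pipe
CREATE PIPE IF NOT EXISTS orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @ext_orders_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Check ingestion state and any backlog
SELECT SYSTEM$PIPE_STATUS('orders_pipe');
```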
File Formats and Stages:
- File Formats: CSV, JSON, Avro, ORC, Parquet, XML
- Stage Types:
- User stages (@~)
- Table stages (@%tablename)
- Named internal stages
- External stages (S3, Azure, GCS)
- PUT/GET Commands: Local file operations
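A load sketch combining a named internal stage, PUT, and COPY with error handling; all object names and file paths are illustrative:

```sql
-- Named internal stage with a default file format
CREATE STAGE IF NOT EXISTS my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Run from SnowSQL (or another client): upload local files to the stage
PUT file:///tmp/orders_*.csv @my_stage AUTO_COMPRESS = TRUE;

-- Dry run first: report file errors without loading any rows
COPY INTO raw_orders FROM @my_stage VALIDATION_MODE = RETURN_ERRORS;

-- Bulk load; ON_ERROR alternatives include 'CONTINUE' and 'ABORT_STATEMENT'
COPY INTO raw_orders FROM @my_stage ON_ERROR = 'SKIP_FILE';
```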
4. Security Features
Access Control:
- RBAC (Role-Based Access Control):
- System-defined roles (ACCOUNTADMIN, SYSADMIN, SECURITYADMIN)
- Role hierarchy and inheritance
- Principle of least privilege
- Discretionary Access Control (DAC): Object ownership model
- Row Access Policies: Row-level security implementation
- Column-level Security:
- Dynamic Data Masking
- External tokenization
- Column-level encryption
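A column-masking and row-level-security sketch (both Enterprise Edition features); the table, column, role names, and region values are all assumptions:

```sql
-- Dynamic Data Masking: unprivileged roles see a redacted email
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row Access Policy: restrict rows by business region per role
CREATE ROW ACCESS POLICY region_rap AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GLOBAL_ANALYST'
  OR (CURRENT_ROLE() = 'EMEA_ANALYST' AND region = 'EMEA');

ALTER TABLE orders ADD ROW ACCESS POLICY region_rap ON (region);
```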
Data Protection:
- Encryption:
- At-rest: AES-256 encryption
- In-transit: TLS 1.2+
- Tri-Secret Secure with customer-managed keys
- Network Policies: IP allowlisting/blocking
- Private Link: AWS PrivateLink, Azure Private Link
- OAuth Integration: External authentication
- MFA Support: Multi-factor authentication
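An illustrative network policy restricting logins to a corporate range; the CIDR blocks are placeholders:

```sql
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('203.0.113.0/24')
  BLOCKED_IP_LIST = ('203.0.113.99');

-- Apply account-wide (requires SECURITYADMIN or higher)
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```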
5. Time Travel and Data Protection
Time Travel:
- Retention Periods:
- Standard Edition: 1 day
- Enterprise Edition: Up to 90 days
- Transient/Temporary tables: 0-1 day
- Operations:
- Query historical data (AT/BEFORE)
- Clone historical objects
- Undrop tables/schemas/databases
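A Time Travel sketch covering the three operations above; the table name, offset, and timestamp are illustrative:

```sql
-- Query historical data with AT/BEFORE
SELECT * FROM orders AT (OFFSET => -3600);   -- table contents one hour ago
SELECT * FROM orders AT (TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);

-- Clone the table as it existed an hour ago, then recover a dropped table
CREATE TABLE orders_snapshot CLONE orders AT (OFFSET => -3600);
UNDROP TABLE orders;
```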
Fail-safe:
- 7-day period after Time Travel
- Only accessible by Snowflake Support
- Not available for transient/temporary tables
6. Cost Optimization Strategies
Compute Cost Management:
- Warehouse Sizing: Start small, monitor, and scale as needed
- Auto-suspend: Set appropriate timeouts (5-10 minutes typical)
- Resource Monitors: Set credit quotas and alerts
- Query Optimization: Use EXPLAIN plans and Query Profile
- Warehouse Segregation: Separate workloads by warehouse
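A resource monitor sketch for the credit-quota item above; the quota, thresholds, and warehouse name are assumptions:

```sql
CREATE RESOURCE MONITOR monthly_quota WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY      -- alert before the quota is hit
           ON 100 PERCENT DO SUSPEND;   -- suspend assigned warehouses at the quota

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;
```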
Storage Cost Management:
- Data Retention: Implement appropriate Time Travel settings
- Compression: Automatic, but monitor compression ratios
- Cloning: Use zero-copy clones instead of copying data
- Table Types: Use transient for non-critical data
- Fail-safe Costs: Understand the 7-day additional storage
7. Advanced Features
Data Sharing:
- Secure Data Sharing: No data movement or copying
- Snowflake Marketplace: Monetize and discover datasets
- Data Exchange: Private data sharing hub
- Reader Accounts: Share with non-Snowflake users
Semi-structured Data:
- VARIANT Data Type: Store JSON, Avro, ORC, Parquet, XML
- Querying: Dot notation and bracket notation
- FLATTEN Function: Explode nested arrays
- Automatic Schema Detection: INFER_SCHEMA function
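A semi-structured query sketch; the raw_events table and JSON paths are hypothetical:

```sql
SELECT
  payload:customer.id::NUMBER       AS customer_id,    -- dot notation
  payload:customer['name']::STRING  AS customer_name,  -- bracket notation
  item.value:sku::STRING            AS sku
FROM raw_events,
     LATERAL FLATTEN(INPUT => payload:items) AS item;  -- explode the nested array
```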
Programmability:
- Stored Procedures: JavaScript, SQL, Python, Java, Scala
- User-Defined Functions (UDFs): Scalar and tabular functions
- Snowpark: DataFrame API for Python, Java, Scala
- Tasks: Scheduled SQL execution
- Streams: Change Data Capture (CDC)
- External Functions: AWS Lambda, Azure Functions integration
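A CDC sketch wiring a stream to a scheduled task with a MERGE; every object name here is an assumption:

```sql
-- Capture changes on the landing table
CREATE STREAM orders_stream ON TABLE raw_orders;

-- Merge captured changes every 5 minutes, but only when the stream has data
CREATE TASK merge_orders
  WAREHOUSE = analytics_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO orders_curated t
  USING orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.status = s.status
  WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

ALTER TASK merge_orders RESUME;   -- tasks are created suspended
```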
8. Common Interview Scenarios
Scenario-based Questions:
- High Concurrency Issue:
- Solution: Multi-cluster warehouses
- Consider scaling policy (Standard vs Economy)
- Large Data Migration:
- Use COPY with parallel processing
- Stage files in cloud storage
- Consider file sizing (100-250MB compressed)
- Real-time Analytics:
- Implement Snowpipe or Snowpipe Streaming
- Use Streams and Tasks for CDC
- Consider Dynamic Tables
- Cost Spike Investigation:
- Check warehouse utilization
- Review query history and duration
- Analyze storage growth
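For the cost spike scenario, a starting-point query against ACCOUNT_USAGE (which lags real time by up to a few hours):

```sql
-- Which warehouses drove credit consumption over the last 7 days?
SELECT warehouse_name,
       SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```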
9. SQL and Query Patterns
Important SQL Features:
- CTEs (Common Table Expressions): WITH clause usage
- Window Functions: ROW_NUMBER, RANK, LAG, LEAD
- QUALIFY Clause: Filter window function results
- PIVOT/UNPIVOT: Data transformation
- MERGE Statement: Upsert operations
- Recursive CTEs: Hierarchical queries
- MATCH_RECOGNIZE: Pattern matching in sequences
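Two patterns from the list above in one sketch, using hypothetical tables: QUALIFY to keep the latest row per key, and MERGE for an upsert:

```sql
-- Latest order per customer without a subquery
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) = 1;

-- Upsert from a staging table into a dimension table
MERGE INTO dim_customer t
USING stg_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```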
10. Monitoring and Troubleshooting
Key Views and Functions:
- Information Schema: Database metadata queries
- Account Usage Schema:
- QUERY_HISTORY view
- WAREHOUSE_METERING_HISTORY
- STORAGE_USAGE
- LOGIN_HISTORY
- Query Profile: Visual execution plan analysis
- SYSTEM Functions:
- SYSTEM$CLUSTERING_DEPTH
- SYSTEM$CLUSTERING_INFORMATION
- SYSTEM$PIPE_STATUS
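A quick monitoring sketch; the clustered table and key in the second query are assumptions:

```sql
-- Recent query history via the Information Schema table function
SELECT query_id, query_text, warehouse_name, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
ORDER BY start_time DESC;

-- Clustering health for a hypothetical clustered table
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region_id)');
```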
11. Best Practices to Mention
Development Best Practices:
- Use appropriate warehouse sizes - Don't over-provision
- Implement proper role hierarchy - Security first approach
- Design efficient data models - Consider denormalization
- Use clustering keys wisely - Only for large tables with specific access patterns
- Monitor and set resource monitors - Prevent runaway costs
- Leverage caching - Understand three-layer caching
- Use COPY instead of INSERT - For bulk operations
- Partition large files - 100-250MB compressed ideal
- Clean up temporary objects - Reduce storage costs
- Document data lineage - Use tags and comments
12. Recent Features to Know
Latest Additions (2024-2025):
- Dynamic Tables: Declarative data pipelines
- Hybrid Tables (Unistore): OLTP + OLAP workloads
- Iceberg Tables: Open table format support
- Snowpark ML: Model development and inference APIs inside Snowflake
- Snowpark Container Services: Run containerized workloads
- Document AI: Extract data from documents
- Alerts and Notifications: Condition-based, event-driven actions
- Git Integration: Version control for SQL scripts
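A Dynamic Table sketch for the declarative-pipeline item above; the lag, warehouse, and source table are assumptions:

```sql
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '30 minutes'   -- how stale the result is allowed to get
  WAREHOUSE = analytics_wh
AS
  SELECT sale_date, SUM(amount) AS revenue
  FROM raw_orders
  GROUP BY sale_date;
```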
13. Sample Technical Questions
Architecture:
- Explain how Snowflake's architecture differs from traditional data warehouses
- What happens when you submit a query in Snowflake?
- How does Snowflake handle concurrency?
Performance:
- How would you improve the performance of a slow query?
- When would you use a larger warehouse vs. a multi-cluster warehouse?
- Explain the different caching layers in Snowflake
Data Loading:
- Compare COPY command vs Snowpipe for data ingestion
- How would you handle semi-structured data in Snowflake?
- Design a solution for real-time data ingestion
Security:
- How would you implement row-level security?
- Explain the role hierarchy in Snowflake
- How does Snowflake encrypt data?
14. Hands-on Skills to Demonstrate
- Write efficient SQL queries using CTEs and window functions
- Design a data loading pipeline using stages and COPY
- Implement security with roles and access policies
- Optimize query performance using Query Profile
- Set up monitoring with resource monitors and alerts
- Handle JSON data using VARIANT and FLATTEN
- Create clones for development environments
- Implement CDC using Streams and Tasks
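For the development-environment item, a zero-copy clone sketch with illustrative database names:

```sql
-- Instant, metadata-only copy of production for development and testing;
-- storage is consumed only for micro-partitions that later diverge
CREATE DATABASE dev_db CLONE prod_db;
```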