SQL for Data Science
Using SQL to query data is a common practice in data science today. Many table-based data formats, such as DataFrame and Excel, support SQL queries. Moreover, tools like Jupyter Notebook’s Magic Statements and IPython_SQL provide seamless integration between SQL and Python.
SQL Overview
SQL, or Structured Query Language, is based on the relational model theory introduced by Edgar F. Codd in 1970. The model employs tables (data structures in tabular form) to represent data and relationships between them. Initially, SQL was designed for operating and querying relational databases.
SQL was adopted as the standard language for relational databases by ANSI (American National Standards Institute) in 1986 and ISO (International Organization for Standardization) in 1987. These two organizations currently maintain SQL standards.
SQL defines two core components:
- Standard query syntax.
- Standard data types.
In addition to its core subsets, SQL has advanced extensions like support for unstructured data, enabling SQL to work with NoSQL, NewSQL, and big data systems.
Query Statements
SQL queries are based on subsets such as DDL, DML, DCL, and TCL. Common commands like SELECT
and CREATE
belong to these subsets. Although core subsets are universally supported by most database systems (DBMS), there may be some variations in extended features. For portability, use standard SQL syntax whenever possible.
DDL (Data Definition Language)
DDL, or Data Definition Language, defines and modifies database structures and objects (e.g., tables, views, and indexes). Common commands include CREATE
, ALTER
, and DROP
.
DML (Data Manipulation Language)
DML, or Data Manipulation Language, operates on data in the database, allowing users to query, insert, update, and delete data. Common commands include:
SELECT
: Query data.INSERT
: Insert data.UPDATE
: Update or modify data.DELETE
: Delete data.
DQL (Data Query Language) is a subset of DML, specifically for querying data.
DCL (Data Control Language)
DCL, or Data Control Language, manages user permissions and access to database objects. Common commands include GRANT
and REVOKE
.
TCL (Transaction Control Language)
TCL, or Transaction Control Language, manages transactions to ensure database operation integrity. Common commands include COMMIT
, ROLLBACK
, and SAVEPOINT
.
Data Types
In addition to defining query statements, SQL specifies data types such as numbers, strings, dates, and booleans. Date types, in particular, can be chosen based on specific requirements:
DATE
: An 8-digit date format (YYYYMMDD).TIME
: A 6-digit time format (HHMMSS).TIMESTAMP
: A 20-digit format (YYYYMMDDHHMMSSZZZZZZ), whereZZZZZZ
represents microseconds.
Notes and Observations
While SQL syntax is generally straightforward, query performance and implementation can vary significantly, and there may be multiple solutions to a problem.
Primary Key
Primary keys in databases serve several purposes:
- Uniqueness: Prevent duplicate data entries.
- Consistency and Integrity: Ensure each row has a unique identifier.
- Relationships: Establish table relationships via foreign keys.
- Efficiency: Enable faster query performance via indexing.
ER Model and ERD Diagram
Entity-Relationship Model (ER Model)
The ER model is a foundational theory for designing databases, focusing on entities, their attributes, and relationships. It allows for structured database design, including field relationships, data types, primary keys, and cardinality.
Entity Relationship Diagram (ERD Diagram)
An ERD diagram visualizes the ER model, making relationships and structures more intuitive.
Advantages
ER and ERD diagrams are database-specific design tools, offering functionalities like automatic SQL script generation, unlike generic diagram tools.
String and Range Queries
SQL supports string-based queries using wildcards such as %
for pattern matching:
SELECT * FROM tablename WHERE name LIKE 'R%';
Range queries can use BETWEEN
or IN
for flexibility:
SELECT * FROM tablename WHERE ReleaseYear BETWEEN 2000 AND 2010;
Or:
SELECT * FROM tablename WHERE Country IN ('UK', 'US');
Sorting
SQL allows sorting with the ORDER BY
clause:
SELECT * FROM tablename ORDER BY column_name;
Sorting can also use column indices:
SELECT * FROM tablename ORDER BY 2;
Grouping
Grouping is achieved with the GROUP BY
clause, often combined with aggregate functions like MAX
, MIN
, COUNT
, and SUM
. The HAVING
clause filters grouped results.
Example: Grouping and Filtering
SELECT category, SUM(quantity) AS total_quantity
FROM sales
GROUP BY category
HAVING total_quantity > 20;
Subqueries
Subqueries, or nested queries, are queries within another query. They can:
- Return values as columns.
- Act as conditions for filtering.
- Serve as virtual tables.
Example: Subquery in WHERE
SELECT * FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE loc_id = '10001');
Aggregate Functions
SQL defines standard aggregate functions such as:
COUNT
: Count rows.SUM
: Compute column totals.AVG
: Calculate averages.MAX
: Find maximum values.MIN
: Find minimum values.
Multi-Table Queries
Multi-table queries can use subqueries, implicit joins, or JOIN ON
syntax.
Example: Implicit Join
SELECT * FROM table1, table2 WHERE table1.name = table2.name;