Table of Contents

Lucas Wu

SQL for Data Science

Using SQL to query data is a common practice in data science today. Many table-based data formats, such as DataFrame and Excel, support SQL queries. Moreover, tools like Jupyter Notebook’s Magic Statements and IPython_SQL provide seamless integration between SQL and Python.

SQL Overview

SQL, or Structured Query Language, is based on the relational model theory introduced by Edgar F. Codd in 1970. The model employs tables (data structures in tabular form) to represent data and relationships between them. Initially, SQL was designed for operating and querying relational databases.

SQL was adopted as the standard language for relational databases by ANSI (American National Standards Institute) in 1986 and ISO (International Organization for Standardization) in 1987. These two organizations currently maintain SQL standards.

SQL defines two core components:

  1. Standard query syntax.
  2. Standard data types.

In addition to its core subsets, SQL has advanced extensions like support for unstructured data, enabling SQL to work with NoSQL, NewSQL, and big data systems.

Query Statements

SQL queries are based on subsets such as DDL, DML, DCL, and TCL. Common commands like SELECT and CREATE belong to these subsets. Although core subsets are universally supported by most database systems (DBMS), there may be some variations in extended features. For portability, use standard SQL syntax whenever possible.

DDL (Data Definition Language)

DDL, or Data Definition Language, defines and modifies database structures and objects (e.g., tables, views, and indexes). Common commands include CREATE, ALTER, and DROP.

DML (Data Manipulation Language)

DML, or Data Manipulation Language, operates on data in the database, allowing users to query, insert, update, and delete data. Common commands include:

DQL (Data Query Language) is a subset of DML, specifically for querying data.

DCL (Data Control Language)

DCL, or Data Control Language, manages user permissions and access to database objects. Common commands include GRANT and REVOKE.

TCL (Transaction Control Language)

TCL, or Transaction Control Language, manages transactions to ensure database operation integrity. Common commands include COMMIT, ROLLBACK, and SAVEPOINT.

Data Types

In addition to defining query statements, SQL specifies data types such as numbers, strings, dates, and booleans. Date types, in particular, can be chosen based on specific requirements:

  1. DATE: An 8-digit date format (YYYYMMDD).
  2. TIME: A 6-digit time format (HHMMSS).
  3. TIMESTAMP: A 20-digit format (YYYYMMDDHHMMSSZZZZZZ), where ZZZZZZ represents microseconds.

Notes and Observations

While SQL syntax is generally straightforward, query performance and implementation can vary significantly, and there may be multiple solutions to a problem.

Primary Key

Primary keys in databases serve several purposes:

  1. Uniqueness: Prevent duplicate data entries.
  2. Consistency and Integrity: Ensure each row has a unique identifier.
  3. Relationships: Establish table relationships via foreign keys.
  4. Efficiency: Enable faster query performance via indexing.

ER Model and ERD Diagram

Entity-Relationship Model (ER Model)

The ER model is a foundational theory for designing databases, focusing on entities, their attributes, and relationships. It allows for structured database design, including field relationships, data types, primary keys, and cardinality.

Entity Relationship Diagram (ERD Diagram)

An ERD diagram visualizes the ER model, making relationships and structures more intuitive.

Advantages

ER and ERD diagrams are database-specific design tools, offering functionalities like automatic SQL script generation, unlike generic diagram tools.

String and Range Queries

SQL supports string-based queries using wildcards such as % for pattern matching:

SELECT * FROM tablename WHERE name LIKE 'R%';

Range queries can use BETWEEN or IN for flexibility:

SELECT * FROM tablename WHERE ReleaseYear BETWEEN 2000 AND 2010;

Or:

SELECT * FROM tablename WHERE Country IN ('UK', 'US');

Sorting

SQL allows sorting with the ORDER BY clause:

SELECT * FROM tablename ORDER BY column_name;

Sorting can also use column indices:

SELECT * FROM tablename ORDER BY 2;

Grouping

Grouping is achieved with the GROUP BY clause, often combined with aggregate functions like MAX, MIN, COUNT, and SUM. The HAVING clause filters grouped results.

Example: Grouping and Filtering

SELECT category, SUM(quantity) AS total_quantity
FROM sales
GROUP BY category
HAVING total_quantity > 20;

Subqueries

Subqueries, or nested queries, are queries within another query. They can:

  1. Return values as columns.
  2. Act as conditions for filtering.
  3. Serve as virtual tables.

Example: Subquery in WHERE

SELECT * FROM employees 
WHERE dept_id IN (SELECT dept_id FROM departments WHERE loc_id = '10001');

Aggregate Functions

SQL defines standard aggregate functions such as:

  1. COUNT: Count rows.
  2. SUM: Compute column totals.
  3. AVG: Calculate averages.
  4. MAX: Find maximum values.
  5. MIN: Find minimum values.

Multi-Table Queries

Multi-table queries can use subqueries, implicit joins, or JOIN ON syntax.

Example: Implicit Join

SELECT * FROM table1, table2 WHERE table1.name = table2.name;
(完)