Strings

String is immutable iterable (sequence) consists of Unicode characters. For Python it's almost like any other sequence (more like tuple which is immutable version of list).

String literals are written in a variety of ways:

  • Single quotes: 'allows embedded "double" quotes'

  • Double quotes: "allows embedded 'single' quotes".

  • Triple quoted: '''Three single quotes''', """Three double quotes"""

🪄 Code:

s1 = "Hello, I'm nice little string"
s2 = 'Hello, I\'m nice little string'   # escaping '
print(s1)
print(s2)

📟 Output:

Hello, I'm nice little string
Hello, I'm nice little string

Multiline string (matter of syntax, for Python they are all the same):

🪄 Code:

big_string = """She’s fifteen, sells flowers at the train station.
Sun and berries sweeten the oxygen beyond the mines.
Trains stop for a moment, move further on.
Soldiers go to the East, soldiers go to the West.
                               (c) Serhiy Zhadan
"""
print(big_string)

📟 Output:

She’s fifteen, sells flowers at the train station.
Sun and berries sweeten the oxygen beyond the mines.
Trains stop for a moment, move further on.
Soldiers go to the East, soldiers go to the West.
                               (c) Serhiy Zhadan

Main methods of strings

🪄 Code:

print(dir("some_string")) #Emm... actually all methods...

📟 Output:

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Method(s)Description

Cosmetic methods:

lower(), upper()

Return new string - in lowercase, uppercase

title(), capitalize()

Return new string - all words starts with uppercase, with first word in uppercase

center(w), ljust(w), rjust(w)

Return new string centered or justified to left/right in a string of length w

strip(), rstrip(), lstrip()

Return new string with removed whitespaces

replace(s, r[, count])

Return new string with all sub-strings s replaced by string r

Method(s)Description

Checks:

s in some_string

Return True/False - if sub-string s is part of some_string

islower(), isupper()

Return True/False - if all character are in lower/upper case

startswith(s), endswith(s)

Return True/False - if string starts/ends with a sub-string s

isalpha(), isalnum()

Return True/False - are all characters: alphabetical, alpha-numerical?

isdecimal(), isdigit(), isnumeric()

Return True/False - are all characters: regular digits, digits with super/subscripts or any numeric Unicode character?

isspace()

Return True/False - are all characters whitespaces (" ", "\n", "\t" etc.) ?

isprintable()

Return True/False - if all characters are printable

Method(s)Description

Searching:

count(s)

Return number of sub-string s is part of string

index(s)

Return index of first sub-string s that found in a string or ValueError

find(s)

Return index of first sub-string s that found in a string or -1

len(some_string)

Return int - length of string

Method(s)Description

Split, join, obtaining parts of string:

split(s)

Return list of string parts splitted by delimiter s (whitespace by default)

splitlines(s)

Return list of strings splitted by line ending

s.join(str_iterable)

Return new string - result of merging all strings from iterable with strings using delimiter s

some_string[i]

Return new string - one character by index i

some_string[n1:n2:step]

Return new string - sub-string from n1 till n2 (non-inclusive) with step step

Some examples

some_string = "Some funny string!"

Adding, multiplying(!) strings

🪄 Code:

some_string + " and another string"

📟 Output:

'Some funny string! and another string'

🪄 Code:

some_string * 3

📟 Output:

'Some funny string!Some funny string!Some funny string!'

Get length

🪄 Code:

len(some_string)

📟 Output:

18

Cosmetic/styling methods:

  • lower, upper, title, capitalize

🪄 Code:

some_string.lower(), some_string.upper(), some_string.title(), some_string.capitalize()

📟 Output:

('some funny string!',
 'SOME FUNNY STRING!',
 'Some Funny String!',
 'Some funny string!')

Various checking for lower/upper, all digits, all letters. Returns True/False.

🪄 Code:

print("abcde".islower())
print("ABCDE".isupper())

📟 Output:

True
True

🪄 Code:

"12345".isdigit(), "abc".isalpha(), "abc123".isalnum()

📟 Output:

(True, True, True)

Nice examples regarding checks:

🪄 Code:

low_str = "asdasdfhjksdhfjh"
low_str.islower(), low_str.isupper()

📟 Output:

(True, False)

🪄 Code:

almost_digit_str = "12123123 123123"
true_digit_str = "12312387877987987"
almost_digit_str.isdigit(), true_digit_str.isdigit()

📟 Output:

(False, True)

🪄 Code:

phone_num = "066-749-99-99"
name = "Johnny"  
name_with_spaces = name + " Walker" # Note that whitespaces are not alpha!
phone_num.isalpha(), name.isalpha(), name_with_spaces.isalpha()

📟 Output:

(False, True, False)

Checks of numeric/digits

It can be difficult to comprehend from the start, so the following table will show difference between three related methods (isdecimal(), isdigit(), isnumeric()):

  • isdecimal()

    • Only decimal digits: 0123456789

TestResult

isdecimal('123')

True

isdecimal('1₂34⁵')

False

isdecimal('½¾⅚')

False

isdecimal('一二三四五')

False

  • isdigit()

    • decimal digits: 0123456789

    • super- and subscripts

TestResult

isdigit('123')

True

isdigit('1₂34⁵')

True

isdigit('½¾⅚')

False

isdigit('一二三四五')

False

  • isnumeric()

    • decimal digits: 0123456789

    • super- and subscripts

    • vulgar fractions

    • numeric Unicode characters from other languages

TestResult

isnumeric('123')

True

isnumeric('1₂34⁵')

True

isnumeric('½¾⅚')

True

isnumeric('一二三四五')

True

The playground code to test these methods on those strings:

🪄 Code:

METHODS = "isdecimal", "isdigit", "isnumeric"
TEST_STRINGS = '123', '1₂34⁵', '½¾⅚', '一二三'

for str_ in TEST_STRINGS:
    for method in METHODS:
        method_str = f'{repr(str_)}.{method}()'
        print(f'{method_str:23} 🡒 {eval(method_str)}')

📟 Output:

'123'.isdecimal()       🡒 True
'123'.isdigit()         🡒 True
'123'.isnumeric()       🡒 True
'1₂34⁵'.isdecimal()     🡒 False
'1₂34⁵'.isdigit()       🡒 True
'1₂34⁵'.isnumeric()     🡒 True
'½¾⅚'.isdecimal()       🡒 False
'½¾⅚'.isdigit()         🡒 False
'½¾⅚'.isnumeric()       🡒 True
'一二三'.isdecimal()       🡒 False
'一二三'.isdigit()         🡒 False
'一二三'.isnumeric()       🡒 True

Checking for space-containing strings

  • To check for all-spaces string - use .isspace()

🪄 Code:

"    ".isspace(), "  \t \n  \r\n".isspace()

📟 Output:

(True, True)
  • A bit hackish way to check for alphabeticals with spaces - via using .replace()

🪄 Code:

"Hello World".replace(" ", "").isalpha()

📟 Output:

True

Stripping - removing whitespaces

  • .strip() - remove from the beginning and from the end both

  • .rstrip() - remove only from the end

  • .lstrip() - remove only from the beginning

🪄 Code:

whitespaces_str = "  Some text goes and goes...  "
print("Initial string:   >>>" + whitespaces_str + "<<<")
print("Stripped string: >>>" + whitespaces_str.strip() + "<<<")
print("R-stripped string: >>>" + whitespaces_str.rstrip() + "<<<")
print("L-stripped string: >>>" + whitespaces_str.lstrip() + "<<<")

📟 Output:

Initial string:   >>>  Some text goes and goes...  <<<
Stripped string: >>>Some text goes and goes...<<<
R-stripped string: >>>  Some text goes and goes...<<<
L-stripped string: >>>Some text goes and goes...  <<<

Get character by index (it's possible because string is sequence)

Indexing starts from 0. Negative indexing means counting from the end so -1 is the last item.

🪄 Code:

some_string, some_string[0], some_string[3], some_string[-1]

📟 Output:

('Some funny string!', 'S', 'e', '!')

Slicing

some_str[start:stop[:step]] (again sequence-like syntax) - getting part of sequence.

  • Returns items with indexes starting with first argument (start) till second (stop) non-included.

  • If argument omitted - by default it is either start or end contextually.

  • Optional third argument step - step.

🪄 Code:

some_string[0:4], some_string[5:10], some_string[0:-1], some_string[:]

📟 Output:

('Some', 'funny', 'Some funny string', 'Some funny string!')

🪄 Code:

some_string[:10:2], some_string[::3], some_string[::-1]

📟 Output:

('Sm un', 'Seuytn', '!gnirts ynnuf emoS')

Splitting/joining

🪄 Code:

some_string = "sdfsdfsdf !     sdfsdff \t \n wdfwefwefwef wefwef"
print(some_string.split(" "))
print(some_string.split())
print(some_string.split("n"))
print(some_string.split("!"))
print(some_string.split("ZZZZ"))

📟 Output:

['sdfsdfsdf', '!', '', '', '', '', 'sdfsdff', '\t', '\n', 'wdfwefwefwef', 'wefwef']
['sdfsdfsdf', '!', 'sdfsdff', 'wdfwefwefwef', 'wefwef']
['sdfsdfsdf !     sdfsdff \t \n wdfwefwefwef wefwef']
['sdfsdfsdf ', '     sdfsdff \t \n wdfwefwefwef wefwef']
['sdfsdfsdf !     sdfsdff \t \n wdfwefwefwef wefwef']

🪄 Code:

print("---".join(["First", "Second", "Third"]))
print(",".join(some_string.split()))

📟 Output:

First---Second---Third
sdfsdfsdf,!,sdfsdff,wdfwefwefwef,wefwef

Concatenation

Rare case where string can be merged if they are separated by any number of spaces. Strings must be presented by string object themselves not by variables or function call results

🪄 Code:

"Hello" " World"

📟 Output:

'Hello World'

🪄 Code:

some_big_string = "Everything started with music, " \
                  "with scars left by songs " \
                  "heard at fall weddings with other kids my age."
print(some_big_string)

📟 Output:

Everything started with music, with scars left by songs heard at fall weddings with other kids my age.

🪄 Code:

some_big_string = ("You will reply today, touching warm letters, "
                   "leafing through them in the dark, confusing vowels with consonants, "
                   "like a typewriter in an old Warsaw office. ")
print(some_big_string)

📟 Output:

You will reply today, touching warm letters, leafing through them in the dark, confusing vowels with consonants, like a typewriter in an old Warsaw office.

Unicode

ASCII

Previously characters used in text data were limited by encoding standard called ASCII (American Standard Code for Information Interchange). The key point was "American" so all non-latin characters were missing in that table.

ASCII basics were simple - single byte of data (8 bits) were used. The first 7 bits were used to code the identifier of the character so totally ASCII had 128 characters (2^7). This was done because at that time they though that 128 characters is enough and the last bit was used either for error checking or enabling italics or was set to plain 0.

Anyway, 128 ASCII characters were: 26 uppercase letters, 26 lowercase letters, 10 digits, punctuation symbols, some spacing characters, and some nonprintable control codes like (line feed), (carriage return), \a (bell), \b (backspace) etc:

🪄 Code:

import string
print(len(string.ascii_letters) + len(string.digits) + len(string.punctuation) + len(string.whitespace))

📟 Output:

100

In Python we can get ASCII "index" of the character with builtin function ord and get the character by that index with function chr.

🪄 Code:

print(chr(65))
print(ord("A"))

📟 Output:

A
65

Please note that in fact these functions work with Unicode table (that we will cover in a minute) but Unicode table begins with ASCII and extends it.

We can get all 128 characters of ASCII:

🪄 Code:

print("ASCII:\n", ''.join(chr(x) for x in range(128)))

📟 Output:

ASCII:
 �	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

So, ASCII is great we can't write neither cyrillic texts like ґуґл з'їв яйко-сподівайко nor swedish like Surströmming with it. The reason is simple - this table just doesn't have needed codepoints for non-latin characters.

That's why at some point other encodings (tables of codepoints) used all 8 bits were created:

  • latin-1, windows-1252

    • These cover all main European languages

  • latin-2

    • Central or Eastern European

  • windows-1251, koi8

    • These cover most cyrillic languages

  • big5

    • Traditional Chinese

To encode Python's string into some endocing the string method encode(coding) is used:

🪄 Code:

print("Surströmming".encode("latin_1"))
# print("Surströmming".encode("windows-1251")) # WON"T WORK

print("Piękna jest taka pewność, ale niepewność jest piękniejsza.".encode("latin2"))

print("Ґуґл з'їв яйко-сподівайко".encode("windows-1251"))
# print("Ґуґл з'їв яйко-сподівайко".encode("latin1")) # WON'T WORK

📟 Output:

b'Surstr\xf6mming'
b'Pi\xeakna jest taka pewno\xb6\xe6, ale niepewno\xb6\xe6 jest pi\xeakniejsza.'
b"\xa5\xf3\xb4\xeb \xe7'\xbf\xe2 \xff\xe9\xea\xee-\xf1\xef\xee\xe4\xb3\xe2\xe0\xe9\xea\xee"



---------------------------------------------------------------------------

UnicodeEncodeError                        Traceback (most recent call last)

Input In [6], in <cell line: 7>()
      4 print("Piękna jest taka pewność, ale niepewność jest piękniejsza.".encode("latin2"))
      6 print("Ґуґл з'їв яйко-сподівайко".encode("windows-1251"))
----> 7 print("Ґуґл з'їв яйко-сподівайко".encode("latin1"))


UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

Last updated